{ "cells": [ { "cell_type": "markdown", "id": "6674fa96", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/duoan/TorchCode/blob/master/templates/39_ppo_loss.ipynb)\n", "\n", "# ๐Ÿ”ด Hard: PPO Clipped Loss\n", "\n", "Implement the **PPO (Proximal Policy Optimization)** **clipped surrogate loss**.\n", "\n", "Given:\n", "- `new_logps`: current policy log-probs $(B,)$\n", "- `old_logps`: old policy log-probs $(B,)$\n", "- `advantages`: advantage estimates $(B,)$\n", "\n", "Define the ratio\n", "\n", "$$ r_i = \\exp(\\text{new\\_logps}_i - \\text{old\\_logps}_i). $$\n", "\n", "Then compute\n", "- $L^{\\text{unclipped}}_i = r_i A_i$\n", "- $L^{\\text{clipped}}_i = \\operatorname{clip}(r_i, 1-\\epsilon, 1+\\epsilon) A_i$\n", "\n", "The loss is the negative batch mean of the elementwise minimum:\n", "\n", "$$\n", "\\mathcal{L}_\\text{PPO} = -\\mathbb{E}_i\\big[\\min(L^{\\text{unclipped}}_i, L^{\\text{clipped}}_i)\\big].\n", "$$\n", "\n", "Implementation notes: detach `old_logps` and `advantages` so gradients only flow through `new_logps`.\n", "\n", "### Signature\n", "```python\n", "from torch import Tensor\n", "\n", "def ppo_loss(new_logps: Tensor, old_logps: Tensor, advantages: Tensor,\n", " clip_ratio: float = 0.2) -> Tensor:\n", " \"\"\"PPO clipped surrogate loss over a batch.\"\"\"\n", "```\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Install torch-judge in Colab (no-op in JupyterLab/Docker)\n", "try:\n", " import google.colab\n", " get_ipython().run_line_magic('pip', 'install -q torch-judge')\n", "except ImportError:\n", " pass\n" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.nn.functional as F\n", "from torch import Tensor\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# โœ๏ธ YOUR IMPLEMENTATION HERE\n", "\n", "def ppo_loss(new_logps: Tensor, old_logps: Tensor, advantages: Tensor,\n", " clip_ratio: float = 0.2) -> Tensor:\n", " pass # -mean(min(r * adv, clamp(r, 1-clip, 1+clip) * adv)) with gradients only through new_logps\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ๐Ÿงช Debug\n", "new_logps = torch.tensor([0.0, -0.2, -0.4, -0.6])\n", "old_logps = torch.tensor([0.0, -0.1, -0.5, -0.5])\n", "advantages = torch.tensor([1.0, -1.0, 0.5, -0.5])\n", "print('Loss:', ppo_loss(new_logps, old_logps, advantages, clip_ratio=0.2))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# โœ… SUBMIT\n", "from torch_judge import check\n", "check('ppo_loss')\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }