{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Google Colab\uc5d0\uc11c \ub178\ud2b8\ubd81\uc744 \uc2e4\ud589\ud558\uc2e4 \ub54c\uc5d0\ub294 \n# https://tutorials.pytorch.kr/beginner/colab \ub97c \ucc38\uace0\ud558\uc138\uc694.\n%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n[Introduction](introyt1_tutorial.html) ||\n[Tensors](tensors_deeper_tutorial.html) ||\n**Autograd** ||\n[Building Models](modelsyt_tutorial.html) ||\n[TensorBoard Support](tensorboardyt_tutorial.html) ||\n[Training Models](trainingyt.html) ||\n[Model Understanding](captumyt.html)\n\n# The Fundamentals of Autograd\n\nFollow along with the video below or on [youtube](https://www.youtube.com/watch?v=M0fX15_-xrY)_.\n\n.. raw:: html\n\n   <div style=\"margin-top:10px; margin-bottom:10px;\">\n     <iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/M0fX15_-xrY\" frameborder=\"0\" allow=\"accelerometer; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>\n   </div>\n\nPyTorch\u2019s *Autograd* feature is part of what make PyTorch flexible and\nfast for building machine learning projects. It allows for the rapid and\neasy computation of multiple partial derivatives (also referred to as\n*gradients)* over a complex computation. This operation is central to\nbackpropagation-based neural network learning.\n\nThe power of autograd comes from the fact that it traces your\ncomputation dynamically *at runtime,* meaning that if your model has\ndecision branches, or loops whose lengths are not known until runtime,\nthe computation will still be traced correctly, and you\u2019ll get correct\ngradients to drive learning. This, combined with the fact that your\nmodels are built in Python, offers far more flexibility than frameworks\nthat rely on static analysis of a more rigidly-structured model for\ncomputing gradients.\n\n## What Do We Need Autograd For?\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A machine learning model is a *function*, with inputs and outputs. For\nthis discussion, we\u2019ll treat the inputs as an *i*-dimensional vector\n$\\vec{x}$, with elements $x_{i}$. We can then express the\nmodel, *M*, as a vector-valued function of the input: $\\vec{y} =\n\\vec{M}(\\vec{x})$. (We treat the value of M\u2019s output as\na vector because in general, a model may have any number of outputs.)\n\nSince we\u2019ll mostly be discussing autograd in the context of training,\nour output of interest will be the model\u2019s loss. The *loss function*\nL($\\vec{y}$) = L($\\vec{M}$\\ ($\\vec{x}$)) is a\nsingle-valued scalar function of the model\u2019s output. This function\nexpresses how far off our model\u2019s prediction was from a particular\ninput\u2019s *ideal* output. *Note: After this point, we will often omit the\nvector sign where it should be contextually clear - e.g.,* $y$\ninstead of $\\vec y$.\n\nIn training a model, we want to minimize the loss. In the idealized case\nof a perfect model, that means adjusting its learning weights - that is,\nthe adjustable parameters of the function - such that loss is zero for\nall inputs. In the real world, it means an iterative process of nudging\nthe learning weights until we see that we get a tolerable loss for a\nwide variety of inputs.\n\nHow do we decide how far and in which direction to nudge the weights? We\nwant to *minimize* the loss, which means making its first derivative\nwith respect to the input equal to 0:\n$\\frac{\\partial L}{\\partial x} = 0$.\n\nRecall, though, that the loss is not *directly* derived from the input,\nbut a function of the model\u2019s output (which is a function of the input\ndirectly), $\\frac{\\partial L}{\\partial x}$ =\n$\\frac{\\partial {L({\\vec y})}}{\\partial x}$. By the chain rule of\ndifferential calculus, we have\n$\\frac{\\partial {L({\\vec y})}}{\\partial x}$ =\n$\\frac{\\partial L}{\\partial y}\\frac{\\partial y}{\\partial x}$ =\n$\\frac{\\partial L}{\\partial y}\\frac{\\partial M(x)}{\\partial x}$.\n\n$\\frac{\\partial M(x)}{\\partial x}$ is where things get complex.\nThe partial derivatives of the model\u2019s outputs with respect to its\ninputs, if we were to expand the expression using the chain rule again,\nwould involve many local partial derivatives over every multiplied\nlearning weight, every activation function, and every other mathematical\ntransformation in the model. The full expression for each such partial\nderivative is the sum of the products of the local gradient of *every\npossible path* through the computation graph that ends with the variable\nwhose gradient we are trying to measure.\n\nIn particular, the gradients over the learning weights are of interest\nto us - they tell us *what direction to change each weight* to get the\nloss function closer to zero.\n\nSince the number of such local derivatives (each corresponding to a\nseparate path through the model\u2019s computation graph) will tend to go up\nexponentially with the depth of a neural network, so does the complexity\nin computing them. This is where autograd comes in: It tracks the\nhistory of every computation. Every computed tensor in your PyTorch\nmodel carries a history of its input tensors and the function used to\ncreate it. Combined with the fact that PyTorch functions meant to act on\ntensors each have a built-in implementation for computing their own\nderivatives, this greatly speeds the computation of the local\nderivatives needed for learning.\n\n## A Simple Example\n\nThat was a lot of theory - but what does it look like to use autograd in\npractice?\n\nLet\u2019s start with a straightforward example. First, we\u2019ll do some imports\nto let us graph our results:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# %matplotlib inline\n\nimport torch\n\nimport matplotlib.pyplot as plt\nimport matplotlib.ticker as ticker\nimport math"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we\u2019ll create an input tensor full of evenly spaced values on the\ninterval $[0, 2{\\pi}]$, and specify ``requires_grad=True``. (Like\nmost functions that create tensors, ``torch.linspace()`` accepts an\noptional ``requires_grad`` option.) Setting this flag means that in\nevery computation that follows, autograd will be accumulating the\nhistory of the computation in the output tensors of that computation.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\nprint(a)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we\u2019ll perform a computation, and plot its output in terms of its\ninputs:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "b = torch.sin(a)\nplt.plot(a.detach(), b.detach())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let\u2019s have a closer look at the tensor ``b``. When we print it, we see\nan indicator that it is tracking its computation history:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(b)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This ``grad_fn`` gives us a hint that when we execute the\nbackpropagation step and compute gradients, we\u2019ll need to compute the\nderivative of $\\sin(x)$ for all this tensor\u2019s inputs.\n\nLet\u2019s perform some more computations:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "c = 2 * b\nprint(c)\n\nd = c + 1\nprint(d)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, let\u2019s compute a single-element output. When you call\n``.backward()`` on a tensor with no arguments, it expects the calling\ntensor to contain only a single element, as is the case when computing a\nloss function.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "out = d.sum()\nprint(out)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Each ``grad_fn`` stored with our tensors allows you to walk the\ncomputation all the way back to its inputs with its ``next_functions``\nproperty. We can see below that drilling down on this property on ``d``\nshows us the gradient functions for all the prior tensors. Note that\n``a.grad_fn`` is reported as ``None``, indicating that this was an input\nto the function with no history of its own.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print('d:')\nprint(d.grad_fn)\nprint(d.grad_fn.next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)\nprint('\\nc:')\nprint(c.grad_fn)\nprint('\\nb:')\nprint(b.grad_fn)\nprint('\\na:')\nprint(a.grad_fn)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With all this machinery in place, how do we get derivatives out? You\ncall the ``backward()`` method on the output, and check the input\u2019s\n``grad`` property to inspect the gradients:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "out.backward()\nprint(a.grad)\nplt.plot(a.detach(), a.grad.detach())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Recall the computation steps we took to get here:\n\n```python\na = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\nb = torch.sin(a)\nc = 2 * b\nd = c + 1\nout = d.sum()\n```\nAdding a constant, as we did to compute ``d``, does not change the\nderivative. That leaves $c = 2 * b = 2 * \\sin(a)$, the derivative\nof which should be $2 * \\cos(a)$. Looking at the graph above,\nthat\u2019s just what we see.\n\nBe aware that only *leaf nodes* of the computation have their gradients\ncomputed. If you tried, for example, ``print(c.grad)`` you\u2019d get back\n``None``. In this simple example, only the input is a leaf node, so only\nit has gradients computed.\n\n## Autograd in Training\n\nWe\u2019ve had a brief look at how autograd works, but how does it look when\nit\u2019s used for its intended purpose? Let\u2019s define a small model and\nexamine how it changes after a single training batch. First, define a\nfew constants, our model, and some stand-ins for inputs and outputs:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "BATCH_SIZE = 16\nDIM_IN = 1000\nHIDDEN_SIZE = 100\nDIM_OUT = 10\n\nclass TinyModel(torch.nn.Module):\n\n    def __init__(self):\n        super(TinyModel, self).__init__()\n        \n        self.layer1 = torch.nn.Linear(DIM_IN, HIDDEN_SIZE)\n        self.relu = torch.nn.ReLU()\n        self.layer2 = torch.nn.Linear(HIDDEN_SIZE, DIM_OUT)\n    \n    def forward(self, x):\n        x = self.layer1(x)\n        x = self.relu(x)\n        x = self.layer2(x)\n        return x\n    \nsome_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)\nideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)\n\nmodel = TinyModel()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One thing you might notice is that we never specify\n``requires_grad=True`` for the model\u2019s layers. Within a subclass of\n``torch.nn.Module``, it\u2019s assumed that we want to track gradients on the\nlayers\u2019 weights for learning.\n\nIf we look at the layers of the model, we can examine the values of the\nweights, and verify that no gradients have been computed yet:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(model.layer2.weight[0][0:10]) # just a small slice\nprint(model.layer2.weight.grad)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let\u2019s see how this changes when we run through one training batch. For a\nloss function, we\u2019ll just use the square of the Euclidean distance\nbetween our ``prediction`` and the ``ideal_output``, and we\u2019ll use a\nbasic stochastic gradient descent optimizer.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "optimizer = torch.optim.SGD(model.parameters(), lr=0.001)\n\nprediction = model(some_input)\n\nloss = (ideal_output - prediction).pow(2).sum()\nprint(loss)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let\u2019s call ``loss.backward()`` and see what happens:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "loss.backward()\nprint(model.layer2.weight[0][0:10])\nprint(model.layer2.weight.grad[0][0:10])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that the gradients have been computed for each learning\nweight, but the weights remain unchanged, because we haven\u2019t run the\noptimizer yet. The optimizer is responsible for updating model weights\nbased on the computed gradients.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "optimizer.step()\nprint(model.layer2.weight[0][0:10])\nprint(model.layer2.weight.grad[0][0:10])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You should see that ``layer2``\\ \u2019s weights have changed.\n\nOne important thing about the process: After calling\n``optimizer.step()``, you need to call ``optimizer.zero_grad()``, or\nelse every time you run ``loss.backward()``, the gradients on the\nlearning weights will accumulate:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(model.layer2.weight.grad[0][0:10])\n\nfor i in range(0, 5):\n    prediction = model(some_input)\n    loss = (ideal_output - prediction).pow(2).sum()\n    loss.backward()\n    \nprint(model.layer2.weight.grad[0][0:10])\n\noptimizer.zero_grad(set_to_none=False)\n\nprint(model.layer2.weight.grad[0][0:10])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "After running the cell above, you should see that after running\n``loss.backward()`` multiple times, the magnitudes of most of the\ngradients will be much larger. Failing to zero the gradients before\nrunning your next training batch will cause the gradients to blow up in\nthis manner, causing incorrect and unpredictable learning results.\n\n## Turning Autograd Off and On\n\nThere are situations where you will need fine-grained control over\nwhether autograd is enabled. There are multiple ways to do this,\ndepending on the situation.\n\nThe simplest is to change the ``requires_grad`` flag on a tensor\ndirectly:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "a = torch.ones(2, 3, requires_grad=True)\nprint(a)\n\nb1 = 2 * a\nprint(b1)\n\na.requires_grad = False\nb2 = 2 * a\nprint(b2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the cell above, we see that ``b1`` has a ``grad_fn`` (i.e., a traced\ncomputation history), which is what we expect, since it was derived from\na tensor, ``a``, that had autograd turned on. When we turn off autograd\nexplicitly with ``a.requires_grad = False``, computation history is no\nlonger tracked, as we see when we compute ``b2``.\n\nIf you only need autograd turned off temporarily, a better way is to use\nthe ``torch.no_grad()``:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "a = torch.ones(2, 3, requires_grad=True) * 2\nb = torch.ones(2, 3, requires_grad=True) * 3\n\nc1 = a + b\nprint(c1)\n\nwith torch.no_grad():\n    c2 = a + b\n\nprint(c2)\n\nc3 = a * b\nprint(c3)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "``torch.no_grad()`` can also be used as a function or method decorator:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def add_tensors1(x, y):\n    return x + y\n\n@torch.no_grad()\ndef add_tensors2(x, y):\n    return x + y\n\n\na = torch.ones(2, 3, requires_grad=True) * 2\nb = torch.ones(2, 3, requires_grad=True) * 3\n\nc1 = add_tensors1(a, b)\nprint(c1)\n\nc2 = add_tensors2(a, b)\nprint(c2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There\u2019s a corresponding context manager, ``torch.enable_grad()``, for\nturning autograd on when it isn\u2019t already. It may also be used as a\ndecorator.\n\nFinally, you may have a tensor that requires gradient tracking, but you\nwant a copy that does not. For this we have the ``Tensor`` object\u2019s\n``detach()`` method - it creates a copy of the tensor that is *detached*\nfrom the computation history:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "x = torch.rand(5, requires_grad=True)\ny = x.detach()\n\nprint(x)\nprint(y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We did this above when we wanted to graph some of our tensors. This is\nbecause ``matplotlib`` expects a NumPy array as input, and the implicit\nconversion from a PyTorch tensor to a NumPy array is not enabled for\ntensors with requires_grad=True. Making a detached copy lets us move\nforward.\n\n### Autograd and In-place Operations\n\nIn every example in this notebook so far, we\u2019ve used variables to\ncapture the intermediate values of a computation. Autograd needs these\nintermediate values to perform gradient computations. *For this reason,\nyou must be careful about using in-place operations when using\nautograd.* Doing so can destroy information you need to compute\nderivatives in the ``backward()`` call. PyTorch will even stop you if\nyou attempt an in-place operation on leaf variable that requires\nautograd, as shown below.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>The following code cell throws a runtime error. This is expected.\n\n```python\na = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\ntorch.sin_(a)</p></div>\n```\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Autograd Profiler\n\nAutograd tracks every step of your computation in detail. Such a\ncomputation history, combined with timing information, would make a\nhandy profiler - and autograd has that feature baked in. Here\u2019s a quick\nexample usage:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "device = torch.device('cpu')\nrun_on_gpu = False\nif torch.cuda.is_available():\n    device = torch.device('cuda')\n    run_on_gpu = True\n    \nx = torch.randn(2, 3, requires_grad=True)\ny = torch.rand(2, 3, requires_grad=True)\nz = torch.ones(2, 3, requires_grad=True)\n\nwith torch.autograd.profiler.profile(use_cuda=run_on_gpu) as prf:\n    for _ in range(1000):\n        z = (z / x) * y\n        \nprint(prf.key_averages().table(sort_by='self_cpu_time_total'))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The profiler can also label individual sub-blocks of code, break out the\ndata by input tensor shape, and export data as a Chrome tracing tools\nfile. For full details of the API, see the\n[documentation](https://pytorch.org/docs/stable/autograd.html#profiler)_.\n\n## Advanced Topic: More Autograd Detail and the High-Level API\n\nIf you have a function with an n-dimensional input and m-dimensional\noutput, $\\vec{y}=f(\\vec{x})$, the complete gradient is a matrix of\nthe derivative of every output with respect to every input, called the\n*Jacobian:*\n\n\\begin{align}J\n     =\n     \\left(\\begin{array}{ccc}\n     \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{1}}{\\partial x_{n}}\\\\\n     \\vdots & \\ddots & \\vdots\\\\\n     \\frac{\\partial y_{m}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n     \\end{array}\\right)\\end{align}\n\nIf you have a second function, $l=g\\left(\\vec{y}\\right)$ that\ntakes m-dimensional input (that is, the same dimensionality as the\noutput above), and returns a scalar output, you can express its\ngradients with respect to $\\vec{y}$ as a column vector,\n$v=\\left(\\begin{array}{ccc}\\frac{\\partial l}{\\partial y_{1}} & \\cdots & \\frac{\\partial l}{\\partial y_{m}}\\end{array}\\right)^{T}$\n- which is really just a one-column Jacobian.\n\nMore concretely, imagine the first function as your PyTorch model (with\npotentially many inputs and many outputs) and the second function as a\nloss function (with the model\u2019s output as input, and the loss value as\nthe scalar output).\n\nIf we multiply the first function\u2019s Jacobian by the gradient of the\nsecond function, and apply the chain rule, we get:\n\n\\begin{align}J^{T}\\cdot v=\\left(\\begin{array}{ccc}\n   \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{1}}\\\\\n   \\vdots & \\ddots & \\vdots\\\\\n   \\frac{\\partial y_{1}}{\\partial x_{n}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n   \\end{array}\\right)\\left(\\begin{array}{c}\n   \\frac{\\partial l}{\\partial y_{1}}\\\\\n   \\vdots\\\\\n   \\frac{\\partial l}{\\partial y_{m}}\n   \\end{array}\\right)=\\left(\\begin{array}{c}\n   \\frac{\\partial l}{\\partial x_{1}}\\\\\n   \\vdots\\\\\n   \\frac{\\partial l}{\\partial x_{n}}\n   \\end{array}\\right)\\end{align}\n\nNote: You could also use the equivalent operation $v^{T}\\cdot J$,\nand get back a row vector.\n\nThe resulting column vector is the *gradient of the second function with\nrespect to the inputs of the first* - or in the case of our model and\nloss function, the gradient of the loss with respect to the model\ninputs.\n\n**``torch.autograd`` is an engine for computing these products.** This\nis how we accumulate the gradients over the learning weights during the\nbackward pass.\n\nFor this reason, the ``backward()`` call can *also* take an optional\nvector input. This vector represents a set of gradients over the tensor,\nwhich are multiplied by the Jacobian of the autograd-traced tensor that\nprecedes it. Let\u2019s try a specific example with a small vector:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "x = torch.randn(3, requires_grad=True)\n\ny = x * 2\nwhile y.data.norm() < 1000:\n    y = y * 2\n\nprint(y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If we tried to call ``y.backward()`` now, we\u2019d get a runtime error and a\nmessage that gradients can only be *implicitly* computed for scalar\noutputs. For a multi-dimensional output, autograd expects us to provide\ngradients for those three outputs that it can multiply into the\nJacobian:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float) # stand-in for gradients\ny.backward(v)\n\nprint(x.grad)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "(Note that the output gradients are all related to powers of two - which\nwe\u2019d expect from a repeated doubling operation.)\n\n### The High-Level API\n\nThere is an API on autograd that gives you direct access to important\ndifferential matrix and vector operations. In particular, it allows you\nto calculate the Jacobian and the *Hessian* matrices of a particular\nfunction for particular inputs. (The Hessian is like the Jacobian, but\nexpresses all partial *second* derivatives.) It also provides methods\nfor taking vector products with these matrices.\n\nLet\u2019s take the Jacobian of a simple function, evaluated for a 2\nsingle-element inputs:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def exp_adder(x, y):\n    return 2 * x.exp() + 3 * y\n\ninputs = (torch.rand(1), torch.rand(1)) # arguments for the function\nprint(inputs)\ntorch.autograd.functional.jacobian(exp_adder, inputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you look closely, the first output should equal $2e^x$ (since\nthe derivative of $e^x$ is $e^x$), and the second value\nshould be 3.\n\nYou can, of course, do this with higher-order tensors:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "inputs = (torch.rand(3), torch.rand(3)) # arguments for the function\nprint(inputs)\ntorch.autograd.functional.jacobian(exp_adder, inputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The ``torch.autograd.functional.hessian()`` method works identically\n(assuming your function is twice differentiable), but returns a matrix\nof all second derivatives.\n\nThere is also a function to directly compute the vector-Jacobian\nproduct, if you provide the vector:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def do_some_doubling(x):\n    y = x * 2\n    while y.data.norm() < 1000:\n        y = y * 2\n    return y\n\ninputs = torch.randn(3)\nmy_gradients = torch.tensor([0.1, 1.0, 0.0001])\ntorch.autograd.functional.vjp(do_some_doubling, inputs, v=my_gradients)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The ``torch.autograd.functional.jvp()`` method performs the same matrix\nmultiplication as ``vjp()`` with the operands reversed. The ``vhp()``\nand ``hvp()`` methods do the same for a vector-Hessian product.\n\nFor more information, including performance notes on the [docs for the\nfunctional\nAPI](https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api)_\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.14"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}