\n \n

\n\nPyTorch\u2019s *Autograd* feature is part of what make PyTorch flexible and\nfast for building machine learning projects. It allows for the rapid and\neasy computation of multiple partial derivatives (also referred to as\n*gradients)* over a complex computation. This operation is central to\nbackpropagation-based neural network learning.\n\nThe power of autograd comes from the fact that it traces your\ncomputation dynamically *at runtime,* meaning that if your model has\ndecision branches, or loops whose lengths are not known until runtime,\nthe computation will still be traced correctly, and you\u2019ll get correct\ngradients to drive learning. This, combined with the fact that your\nmodels are built in Python, offers far more flexibility than frameworks\nthat rely on static analysis of a more rigidly-structured model for\ncomputing gradients.\n\n## What Do We Need Autograd For?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A machine learning model is a *function*, with inputs and outputs. For\nthis discussion, we\u2019ll treat the inputs as an *i*-dimensional vector\n$\\vec{x}$, with elements $x_{i}$. We can then express the\nmodel, *M*, as a vector-valued function of the input: $\\vec{y} =\n\\vec{M}(\\vec{x})$. (We treat the value of M\u2019s output as\na vector because in general, a model may have any number of outputs.)\n\nSince we\u2019ll mostly be discussing autograd in the context of training,\nour output of interest will be the model\u2019s loss. The *loss function*\nL($\\vec{y}$) = L($\\vec{M}$\\ ($\\vec{x}$)) is a\nsingle-valued scalar function of the model\u2019s output. This function\nexpresses how far off our model\u2019s prediction was from a particular\ninput\u2019s *ideal* output. *Note: After this point, we will often omit the\nvector sign where it should be contextually clear - e.g.,* $y$\ninstead of $\\vec y$.\n\nIn training a model, we want to minimize the loss. In the idealized case\nof a perfect model, that means adjusting its learning weights - that is,\nthe adjustable parameters of the function - such that loss is zero for\nall inputs. In the real world, it means an iterative process of nudging\nthe learning weights until we see that we get a tolerable loss for a\nwide variety of inputs.\n\nHow do we decide how far and in which direction to nudge the weights? We\nwant to *minimize* the loss, which means making its first derivative\nwith respect to the input equal to 0:\n$\\frac{\\partial L}{\\partial x} = 0$.\n\nRecall, though, that the loss is not *directly* derived from the input,\nbut a function of the model\u2019s output (which is a function of the input\ndirectly), $\\frac{\\partial L}{\\partial x}$ =\n$\\frac{\\partial {L({\\vec y})}}{\\partial x}$. By the chain rule of\ndifferential calculus, we have\n$\\frac{\\partial {L({\\vec y})}}{\\partial x}$ =\n$\\frac{\\partial L}{\\partial y}\\frac{\\partial y}{\\partial x}$ =\n$\\frac{\\partial L}{\\partial y}\\frac{\\partial M(x)}{\\partial x}$.\n\n$\\frac{\\partial M(x)}{\\partial x}$ is where things get complex.\nThe partial derivatives of the model\u2019s outputs with respect to its\ninputs, if we were to expand the expression using the chain rule again,\nwould involve many local partial derivatives over every multiplied\nlearning weight, every activation function, and every other mathematical\ntransformation in the model. The full expression for each such partial\nderivative is the sum of the products of the local gradient of *every\npossible path* through the computation graph that ends with the variable\nwhose gradient we are trying to measure.\n\nIn particular, the gradients over the learning weights are of interest\nto us - they tell us *what direction to change each weight* to get the\nloss function closer to zero.\n\nSince the number of such local derivatives (each corresponding to a\nseparate path through the model\u2019s computation graph) will tend to go up\nexponentially with the depth of a neural network, so does the complexity\nin computing them. This is where autograd comes in: It tracks the\nhistory of every computation. Every computed tensor in your PyTorch\nmodel carries a history of its input tensors and the function used to\ncreate it. Combined with the fact that PyTorch functions meant to act on\ntensors each have a built-in implementation for computing their own\nderivatives, this greatly speeds the computation of the local\nderivatives needed for learning.\n\n## A Simple Example\n\nThat was a lot of theory - but what does it look like to use autograd in\npractice?\n\nLet\u2019s start with a straightforward example. First, we\u2019ll do some imports\nto let us graph our results:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# %matplotlib inline\n\nimport torch\n\nimport matplotlib.pyplot as plt\nimport matplotlib.ticker as ticker\nimport math"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we\u2019ll create an input tensor full of evenly spaced values on the\ninterval $[0, 2{\\pi}]$, and specify ``requires_grad=True``. (Like\nmost functions that create tensors, ``torch.linspace()`` accepts an\noptional ``requires_grad`` option.) Setting this flag means that in\nevery computation that follows, autograd will be accumulating the\nhistory of the computation in the output tensors of that computation.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\nprint(a)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we\u2019ll perform a computation, and plot its output in terms of its\ninputs:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"b = torch.sin(a)\nplt.plot(a.detach(), b.detach())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let\u2019s have a closer look at the tensor ``b``. When we print it, we see\nan indicator that it is tracking its computation history:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This ``grad_fn`` gives us a hint that when we execute the\nbackpropagation step and compute gradients, we\u2019ll need to compute the\nderivative of $\\sin(x)$ for all this tensor\u2019s inputs.\n\nLet\u2019s perform some more computations:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"c = 2 * b\nprint(c)\n\nd = c + 1\nprint(d)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let\u2019s compute a single-element output. When you call\n``.backward()`` on a tensor with no arguments, it expects the calling\ntensor to contain only a single element, as is the case when computing a\nloss function.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"out = d.sum()\nprint(out)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each ``grad_fn`` stored with our tensors allows you to walk the\ncomputation all the way back to its inputs with its ``next_functions``\nproperty. We can see below that drilling down on this property on ``d``\nshows us the gradient functions for all the prior tensors. Note that\n``a.grad_fn`` is reported as ``None``, indicating that this was an input\nto the function with no history of its own.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print('d:')\nprint(d.grad_fn)\nprint(d.grad_fn.next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)\nprint(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)\nprint('\\nc:')\nprint(c.grad_fn)\nprint('\\nb:')\nprint(b.grad_fn)\nprint('\\na:')\nprint(a.grad_fn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With all this machinery in place, how do we get derivatives out? You\ncall the ``backward()`` method on the output, and check the input\u2019s\n``grad`` property to inspect the gradients:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"out.backward()\nprint(a.grad)\nplt.plot(a.detach(), a.grad.detach())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall the computation steps we took to get here:\n\n```python\na = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\nb = torch.sin(a)\nc = 2 * b\nd = c + 1\nout = d.sum()\n```\nAdding a constant, as we did to compute ``d``, does not change the\nderivative. That leaves $c = 2 * b = 2 * \\sin(a)$, the derivative\nof which should be $2 * \\cos(a)$. Looking at the graph above,\nthat\u2019s just what we see.\n\nBe aware that only *leaf nodes* of the computation have their gradients\ncomputed. If you tried, for example, ``print(c.grad)`` you\u2019d get back\n``None``. In this simple example, only the input is a leaf node, so only\nit has gradients computed.\n\n## Autograd in Training\n\nWe\u2019ve had a brief look at how autograd works, but how does it look when\nit\u2019s used for its intended purpose? Let\u2019s define a small model and\nexamine how it changes after a single training batch. First, define a\nfew constants, our model, and some stand-ins for inputs and outputs:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"BATCH_SIZE = 16\nDIM_IN = 1000\nHIDDEN_SIZE = 100\nDIM_OUT = 10\n\nclass TinyModel(torch.nn.Module):\n\n def __init__(self):\n super(TinyModel, self).__init__()\n \n self.layer1 = torch.nn.Linear(DIM_IN, HIDDEN_SIZE)\n self.relu = torch.nn.ReLU()\n self.layer2 = torch.nn.Linear(HIDDEN_SIZE, DIM_OUT)\n \n def forward(self, x):\n x = self.layer1(x)\n x = self.relu(x)\n x = self.layer2(x)\n return x\n \nsome_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)\nideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)\n\nmodel = TinyModel()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing you might notice is that we never specify\n``requires_grad=True`` for the model\u2019s layers. Within a subclass of\n``torch.nn.Module``, it\u2019s assumed that we want to track gradients on the\nlayers\u2019 weights for learning.\n\nIf we look at the layers of the model, we can examine the values of the\nweights, and verify that no gradients have been computed yet:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.layer2.weight[0][0:10]) # just a small slice\nprint(model.layer2.weight.grad)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let\u2019s see how this changes when we run through one training batch. For a\nloss function, we\u2019ll just use the square of the Euclidean distance\nbetween our ``prediction`` and the ``ideal_output``, and we\u2019ll use a\nbasic stochastic gradient descent optimizer.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"optimizer = torch.optim.SGD(model.parameters(), lr=0.001)\n\nprediction = model(some_input)\n\nloss = (ideal_output - prediction).pow(2).sum()\nprint(loss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let\u2019s call ``loss.backward()`` and see what happens:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"loss.backward()\nprint(model.layer2.weight[0][0:10])\nprint(model.layer2.weight.grad[0][0:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the gradients have been computed for each learning\nweight, but the weights remain unchanged, because we haven\u2019t run the\noptimizer yet. The optimizer is responsible for updating model weights\nbased on the computed gradients.\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"optimizer.step()\nprint(model.layer2.weight[0][0:10])\nprint(model.layer2.weight.grad[0][0:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see that ``layer2``\\ \u2019s weights have changed.\n\nOne important thing about the process: After calling\n``optimizer.step()``, you need to call ``optimizer.zero_grad()``, or\nelse every time you run ``loss.backward()``, the gradients on the\nlearning weights will accumulate:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(model.layer2.weight.grad[0][0:10])\n\nfor i in range(0, 5):\n prediction = model(some_input)\n loss = (ideal_output - prediction).pow(2).sum()\n loss.backward()\n \nprint(model.layer2.weight.grad[0][0:10])\n\noptimizer.zero_grad(set_to_none=False)\n\nprint(model.layer2.weight.grad[0][0:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After running the cell above, you should see that after running\n``loss.backward()`` multiple times, the magnitudes of most of the\ngradients will be much larger. Failing to zero the gradients before\nrunning your next training batch will cause the gradients to blow up in\nthis manner, causing incorrect and unpredictable learning results.\n\n## Turning Autograd Off and On\n\nThere are situations where you will need fine-grained control over\nwhether autograd is enabled. There are multiple ways to do this,\ndepending on the situation.\n\nThe simplest is to change the ``requires_grad`` flag on a tensor\ndirectly:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"a = torch.ones(2, 3, requires_grad=True)\nprint(a)\n\nb1 = 2 * a\nprint(b1)\n\na.requires_grad = False\nb2 = 2 * a\nprint(b2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell above, we see that ``b1`` has a ``grad_fn`` (i.e., a traced\ncomputation history), which is what we expect, since it was derived from\na tensor, ``a``, that had autograd turned on. When we turn off autograd\nexplicitly with ``a.requires_grad = False``, computation history is no\nlonger tracked, as we see when we compute ``b2``.\n\nIf you only need autograd turned off temporarily, a better way is to use\nthe ``torch.no_grad()``:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"a = torch.ones(2, 3, requires_grad=True) * 2\nb = torch.ones(2, 3, requires_grad=True) * 3\n\nc1 = a + b\nprint(c1)\n\nwith torch.no_grad():\n c2 = a + b\n\nprint(c2)\n\nc3 = a * b\nprint(c3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``torch.no_grad()`` can also be used as a function or method decorator:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def add_tensors1(x, y):\n return x + y\n\n@torch.no_grad()\ndef add_tensors2(x, y):\n return x + y\n\n\na = torch.ones(2, 3, requires_grad=True) * 2\nb = torch.ones(2, 3, requires_grad=True) * 3\n\nc1 = add_tensors1(a, b)\nprint(c1)\n\nc2 = add_tensors2(a, b)\nprint(c2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There\u2019s a corresponding context manager, ``torch.enable_grad()``, for\nturning autograd on when it isn\u2019t already. It may also be used as a\ndecorator.\n\nFinally, you may have a tensor that requires gradient tracking, but you\nwant a copy that does not. For this we have the ``Tensor`` object\u2019s\n``detach()`` method - it creates a copy of the tensor that is *detached*\nfrom the computation history:\n\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x = torch.rand(5, requires_grad=True)\ny = x.detach()\n\nprint(x)\nprint(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We did this above when we wanted to graph some of our tensors. This is\nbecause ``matplotlib`` expects a NumPy array as input, and the implicit\nconversion from a PyTorch tensor to a NumPy array is not enabled for\ntensors with requires_grad=True. Making a detached copy lets us move\nforward.\n\n### Autograd and In-place Operations\n\nIn every example in this notebook so far, we\u2019ve used variables to\ncapture the intermediate values of a computation. Autograd needs these\nintermediate values to perform gradient computations. *For this reason,\nyou must be careful about using in-place operations when using\nautograd.* Doing so can destroy information you need to compute\nderivatives in the ``backward()`` call. PyTorch will even stop you if\nyou attempt an in-place operation on leaf variable that requires\nautograd, as shown below.\n\nThe following code cell throws a runtime error. This is expected.\n\n```python\na = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)\ntorch.sin_(a)