Machine learning derivatives

Published on 06/03/2021


Derivatives are frequently used in machine learning because they allow us to train neural networks efficiently. An analogy would be finding the direction to take to reach the top of a mountain while only being able to see one meter away. In this article, we will first recall the rules of derivatives and partial derivatives. Then we will work through the derivatives of a few functions that are commonly used in machine learning.

The derivatives rules

The basics

Let’s start with the derivatives of common functions. In the following:

  • $a$ is a constant

  • $x$ is the variable with respect to which we differentiate

  • $f$ and $g$ are functions

  • Also note that $f'(x)$ is equivalent to $\frac{\partial f}{\partial x}$

Function $f(x)$              Derivative $f'(x)$

$c$                          $0$
$x$                          $1$
$\frac{x^2}{2}$              $x$
$x^2$                        $2x$
$x^n$                        $n \cdot x^{n-1}$
$\sqrt{x}$                   $\frac{1}{2\sqrt{x}}$
$e^x$                        $e^x$
$a^x$                        $a^x \cdot \ln(a)$
$\ln(x)$                     $\frac{1}{x}$
$\frac{1}{x}$                $-\frac{1}{x^2}$
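
These table entries are easy to check numerically with a central finite difference. Here is a minimal Python sketch (the `numerical_derivative` helper is mine, not something from this article):

```python
import math

def numerical_derivative(fn, x, h=1e-6):
    # Central finite difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

# d(x^2)/dx = 2x, so at x = 3 we expect ~6
print(numerical_derivative(lambda x: x ** 2, 3.0))
# d(e^x)/dx = e^x, so at x = 1 we expect ~e ~ 2.71828
print(numerical_derivative(math.exp, 1.0))
# d(ln x)/dx = 1/x, so at x = 2 we expect ~0.5
print(numerical_derivative(math.log, 2.0))
```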

Here are a few derivative rules that we will use in the following sections.

Rule                           Function         Derivative

Multiplication by a constant   $c \cdot f$      $c \cdot f'$
Sum                            $f + g$          $f' + g'$
Difference                     $f - g$          $f' - g'$
Product                        $f \cdot g$      $f' \cdot g + f \cdot g'$
Quotient                       $\frac{f}{g}$    $\frac{f' \cdot g - f \cdot g'}{g^2}$
Reciprocal (from quotient)     $\frac{1}{f}$    $-\frac{f'}{f^2}$
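
As a quick sanity check, here is a small Python sketch that verifies the product and quotient rules numerically; the choice of $\sin$ and $e^x$ is an arbitrary example of mine:

```python
import math

def numerical_derivative(fn, x, h=1e-6):
    # Central finite difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

f, f_prime = math.sin, math.cos
g, g_prime = math.exp, math.exp
x = 0.7

# Product rule: (f * g)' = f' * g + f * g'
product_numeric = numerical_derivative(lambda t: f(t) * g(t), x)
product_rule = f_prime(x) * g(x) + f(x) * g_prime(x)
print(abs(product_numeric - product_rule) < 1e-6)  # True

# Quotient rule: (f / g)' = (f' * g - f * g') / g^2
quotient_numeric = numerical_derivative(lambda t: f(t) / g(t), x)
quotient_rule = (f_prime(x) * g(x) - f(x) * g_prime(x)) / g(x) ** 2
print(abs(quotient_numeric - quotient_rule) < 1e-6)  # True
```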

The chain rule

$$(f \circ g)' = (f' \circ g) \cdot g'$$

$(f \circ g)(x)$ means $f(g(x))$. We can also rewrite it using Leibniz’s notation:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$$

Which literally says: the derivative of $f$ with respect to $x$ is the derivative of $f$ with respect to (the output of) $g$, multiplied by the derivative of $g$ with respect to $x$.
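
To make this concrete, here is a hedged example with $g(x) = 1 + e^{-x}$ and $f(u) = u^2$ (my own choice of functions, anticipating the sigmoid section below); the chain rule predicts $(f \circ g)'(x) = 2 (1 + e^{-x}) \cdot (-e^{-x})$:

```python
import math

def numerical_derivative(fn, x, h=1e-6):
    # Central finite difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

def g(x):
    return 1 + math.exp(-x)  # inner function

def f(u):
    return u ** 2            # outer function

x = 0.3
analytic = 2 * g(x) * (-math.exp(-x))  # (f' o g)(x) * g'(x)
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)  # both ~ -2.579
```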

Partial derivatives and gradient

Let’s define a function:

$$f(x, y) = x^2 - 3y$$

How can we calculate the derivative of this function?

When a function takes more than one variable as an argument, we can calculate the derivative of the function with respect to each variable. When we do that, the other variables are treated as constants:

$$\frac{\partial f}{\partial x} =
\frac{\partial (x^2 - 3 \cdot a)}{\partial x} =
2x - 0 =
2x$$
$$\frac{\partial f}{\partial y} =
\frac{\partial (a^2 - 3 y)}{\partial y} =
0 - 3 =
-3$$

We define the gradient of a function as the vector of all its partial derivatives. In this case, the gradient $\nabla f$ of the function $f$ is:

$$\nabla f = (2x, -3)$$

More generally, for a function $f$ that takes a vector $\mathbf{x}$ as an argument, its gradient is defined by $\nabla f = \left(\frac{\partial f}{\partial \mathbf{x}_{[1]}}, \frac{\partial f}{\partial \mathbf{x}_{[2]}}, \ldots, \frac{\partial f}{\partial \mathbf{x}_{[n]}}\right)$.
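
Here is a short sketch that checks the gradient $\nabla f = (2x, -3)$ with finite differences; the `numerical_gradient` helper is mine:

```python
def f(x, y):
    return x ** 2 - 3 * y

def numerical_gradient(fn, x, y, h=1e-6):
    # Approximate (df/dx, df/dy) with central finite differences
    df_dx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    df_dy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return df_dx, df_dy

print(numerical_gradient(f, 2.0, 5.0))  # ~(4.0, -3.0), i.e. (2x, -3) at x = 2
```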

The sigmoid

A sigmoid is an "S"-shaped curve. The one we use is called the logistic function, and it is defined like this:

$$sigmoid(x) = f(x) = \frac{1}{1 + e^{-x}}$$

Here is a graph of what it looks like, along with its derivative:

Figure 1. The sigmoid function and its derivative
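
If you want to reproduce a plot like Figure 1, here is a minimal sketch, assuming numpy and matplotlib are installed (it uses the derivative formula we are about to derive):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-8, 8, 200)
sigmoid = 1 / (1 + np.exp(-x))
derivative = sigmoid * (1 - sigmoid)  # formula derived just below

plt.plot(x, sigmoid, label="sigmoid(x)")
plt.plot(x, derivative, label="sigmoid'(x)")
plt.legend()
plt.show()
```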

Let’s compute its derivative:

$$f(x) = \frac{1}{1 + e^{-x}}$$
$$f'(x) = -\frac{1}{(1 + e^{-x})^2} \cdot \frac{\partial (1 + e^{-x})}{\partial x}$$

We obtained this by applying the reciprocal rule. Now let’s calculate the rightmost factor, using the reciprocal rule again (note that the constant $1$ disappears when differentiating):

$$\frac{\partial (1 + e^{-x})}{\partial x} =
\frac{\partial (\frac{1}{e^{x}})}{\partial x} =
-\frac{e^x}{(e^x)^2} =
-\frac{1}{e^x} =
-e^{-x}$$

So we have:

$$f'(x) =
-\frac{1}{(1 + e^{-x})^2} \cdot (-e^{-x}) =
\frac{1}{(1 + e^{-x})^2} \cdot e^{-x} =
\frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}$$

Now we use a little trick on the right part, adding $1 - 1$ to the numerator:

$$f'(x) =
\frac{1}{1 + e^{-x}} \cdot \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right) =
\frac{1}{1 + e^{-x}} \cdot \left(\frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}}\right) =
\frac{1}{1 + e^{-x}} \cdot \left(1 - \frac{1}{1 + e^{-x}}\right)$$

Remember that $f(x) = \frac{1}{1 + e^{-x}}$ is our sigmoid? So we can write:

$$f' = f \cdot (1 - f)$$
$$\frac{\partial (sigmoid)}{\partial x} = sigmoid(x) \cdot (1 - sigmoid(x))$$
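
A quick numerical check of $f' = f \cdot (1 - f)$, as a sketch of mine rather than part of the derivation:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def numerical_derivative(fn, x, h=1e-6):
    # Central finite difference approximation of fn'(x)
    return (fn(x + h) - fn(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    closed_form = sigmoid(x) * (1 - sigmoid(x))
    numeric = numerical_derivative(sigmoid, x)
    print(abs(closed_form - numeric) < 1e-6)  # True for every x
```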

The squared error function

The squared error function is defined by:

$$squared\_error(predicted\_output) = \frac{1}{2} \cdot (true\_output - predicted\_output)^2$$

We usually use this function for a batch of examples, which gives us the following formula, where the variables are vectors:

$$squared\_error(\mathbf{predicted\_output}) = \frac{1}{2} \cdot \sum_{i=1}^{n\_examples} (\mathbf{true\_output}_{[i]} - \mathbf{predicted\_output}_{[i]})^2$$

And sometimes the output of an example is itself a vector (think of an artificial neural network with multiple output units); we then have the following, where the variables are matrices:

$$squared\_error(Predicted\_output) = \frac{1}{2} \cdot \sum_{i=1}^{n\_examples} \sum_{j=1}^{n\_outputs} (True\_output_{[i][j]} - Predicted\_output_{[i][j]})^2$$

The square of a difference is never negative, so the larger the difference between the true output and the predicted output, the bigger the result. The $\frac{1}{2}$ factor is only there because it makes the derivative more convenient, as we will see.
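
Here is a hedged numpy sketch of the squared error; thanks to `np.sum`, a single helper (of my own naming) covers the scalar, vector, and matrix variants:

```python
import numpy as np

def squared_error(true_output, predicted_output):
    # Summing over all elements makes the same helper work for the
    # scalar, vector (batch) and matrix (multi-output batch) variants.
    diff = np.asarray(true_output) - np.asarray(predicted_output)
    return 0.5 * np.sum(diff ** 2)

print(squared_error(3.0, 2.5))                 # scalar variant: 0.125
print(squared_error([1.0, 0.0], [0.8, 0.1]))   # vector variant: 0.025
print(squared_error([[1, 0], [0, 1]],
                    [[0.9, 0.2], [0.1, 0.7]])) # matrix variant: 0.075
```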

Now what we want is the partial derivative with respect to:

  • $predicted\_output$ for the first variant of the function;

  • $\mathbf{predicted\_output}_{[specific\_example]}$ for the second variant;

  • $Predicted\_output_{[specific\_example][specific\_output]}$ for the third variant.

When we calculate a partial derivative, all the other variables are treated as constants. This means that, in the case of the third variant, each $Predicted\_output_{[i][j]}$ whose indexes are not $[specific\_example][specific\_output]$ has a derivative equal to $0$.

Therefore we can get rid of the sums when calculating the partial derivatives of $squared\_error$, which means we only have to consider the first variant.

We will name the specific variable we differentiate the function with respect to $x$, and the corresponding true output $a$.

Now let’s start:

$$f(x) = \frac{1}{2} \cdot (a - x)^2$$
$$f'(x) = \frac{1}{2} \cdot 2 \cdot (a - x) \cdot (-1)$$

We obtained this by applying the chain rule. Now let’s simplify the result:

$$f'(x) = x - a$$
$$\frac{\partial (squared\_error)}{\partial predicted\_output} = predicted\_output - true\_output$$

Or for the second variant:

$$\frac{\partial (squared\_error)}{\partial \mathbf{predicted\_output}_{[specific\_example]}} = \mathbf{predicted\_output}_{[specific\_example]} - \mathbf{true\_output}_{[specific\_example]}$$

And for the third variant:

$$\frac{\partial (squared\_error)}{\partial Predicted\_output_{[specific\_example][specific\_output]}} =$$
$$Predicted\_output_{[specific\_example][specific\_output]} - True\_output_{[specific\_example][specific\_output]}$$
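
To close, here is a sketch of mine confirming numerically that the gradient of the squared error is $Predicted\_output - True\_output$ element-wise (assuming numpy):

```python
import numpy as np

def squared_error(true_output, predicted_output):
    return 0.5 * np.sum((true_output - predicted_output) ** 2)

true_output = np.array([[1.0, 0.0], [0.0, 1.0]])
predicted_output = np.array([[0.9, 0.2], [0.1, 0.7]])

# Bump each entry of the prediction and measure the error's response
h = 1e-6
numeric_grad = np.zeros_like(predicted_output)
for i in range(predicted_output.shape[0]):
    for j in range(predicted_output.shape[1]):
        up = predicted_output.copy()
        down = predicted_output.copy()
        up[i, j] += h
        down[i, j] -= h
        numeric_grad[i, j] = (squared_error(true_output, up)
                              - squared_error(true_output, down)) / (2 * h)

print(numeric_grad)                    # ~[[-0.1  0.2] [ 0.1 -0.3]]
print(predicted_output - true_output)  # the closed-form result: identical
```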
