Skip to content

Automatic Differentiation

Motivation

  • It supports a lot of modern machine learning by allowing quick differentiation of complex mathematical functions. The 1st order optimization methods are ubiquitous in finding parameters of functions (not only in deep learning).

  • AD is interesting to study from both mathermatical and implementation perspective, since different approaches comes with different trade-offs. Julia offers many implementations (some of them are not maintained anymore), as it showed to implement (simple) AD is relatively simple.

  • We (authors of this course) believe that it is good to understand (at least roughly), how the methods work in order to use them effectively in your work.

  • Julia is unique in the effort separating definitions of AD rules from AD engines that use those rules to perform the AD and the backend which executes the rules This allows authors of generic libraries to add new rules that would be compatible with many frameworks. See juliadiff.org for a list.

Theory

The differentiation is routine process, as most of the time we break complicated functions down into small pieces that we know, how to differentiate and from that to assemble the gradient of the complex function back. Thus, the essential piece is the differentiation of the composed function f:RnRm

f(x)=f1(f2(f3(fn(x))))=(f1f2fn)(x)

which is computed by chainrule. Before we dive into the details, let's define the notation, which for the sake of clarity needs to be precise. The gradient of function f(x) with respect to x at point x0 is denoted as fx|x0

For a composed function f(x) the gradient with respect to x at point x0 is equal to

fx|x0=f1y1|y10×f2y2|y20××fnyn|yn0,

where yi denotes the input of function fi and

yi0= (fi+1fn)(x0)yn0= x0y00= f(x0)

How fiyi|yi0 looks like?

  • If fi:RR, then fiyiR is a real number R and we live in a high-school world, where it was sufficient to multiply real numbers.

  • If fi:RmiRni, then Ji=fiyi|yi0Rni,mi is a matrix with mi rows and ni columns.

The computation of gradient fx theoretically boils down to

  1. computing Jacobians {Ji}i=1n

  2. multiplication of Jacobians as it holds that fx|y0=J1×J2××Jn.

The complexity of the computation (at least one part of it) is therefore determined by the Matrix multiplication, which is generally expensive, as theoretically it has complexity at least O(n2.3728596), but in practice a little bit more as the lower bound hides the devil in the O notation. The order in which the Jacobians are multiplied has therefore a profound effect on the complexity of the AD engine. While determining the optimal order of multiplication of sequence of matrices is costly, in practice, we recognize two important cases.

  1. Jacobians are multiplied from right to left as J1×(J2×(×(Jn1×Jn))) which has the advantage when the input dimension of f:RnRm is smaller than the output dimension, n<m. - referred to as the FORWARD MODE

  2. Jacobians are multiplied from left to right as (((J1×J2)×J3)×)×Jn which has the advantage when the input dimension of f:RnRm is larger than the output dimension, n>m. - referred to as the BACKWARD MODE

The ubiquitous in machine learning to minimization of a scalar (loss) function of a large number of parameters. Also notice that for f of certain structures, it pays-off to do a mixed-mode AD, where some parts are done using forward diff and some parts using reverse diff.

Example

Let's workout an example

z=xy+sin(x)

How it maps to the notation we have used above? Particularly, what are f1,f2,,fn and the corresponding {yi}i=1n, such that (f1f2fn)(x,y)=xy+sin(x) ?

f1:R2Rf1(y1)=y1,1+y1,2y0=(xy+sin(x))f2:R3R2f2(y2)=(y2,1y2,2,y2,3)y1=(xy,sin(x))f3:R2R3f3(y3)=(y3,1,y3,2,sin(y3,1))y2=(x,y,sin(x))

The corresponding jacobians are

f1(y1)=y1,1+y1,2J1=[11]f2(y2)=(y2,1y2,2,y2,3)J2=[y2,20y2,1001]f3(y3)=(y3,1,y3,2,sin(y3,1))J3=[10cos(y3,1)010]

and for the gradient it holds that

[f(x,y)xf(x,y)y]=J3×J2×J1=[10cos(x)010]×[y0x001]×[11]=[ycos(x)x0]×[11]=[y+cos(x)x]

Note that from theoretical point of view this decomposition of a function is not unique, however as we will see later it usually given by the computational graph in a particular language/environment.

Calculation of the Forward mode

In theory, we can calculate the gradient using forward mode as follows Initialize the Jacobian of yn with respect to x to an identity matrix, because as we have stated above yn0=x, i.e. ynx=I. Iterate i from n down to 1 as

  • calculate the next intermediate output as yi10=fi(yi0)

  • calculate Jacobian Ji=fiyi|yi0

  • push forward the gradient as yi1x|x=Ji×yix|x

Notice that

  • on the very end, we are left with y=y00 and with y0x, which is the gradient we wanted to calculate;

  • if y is a scalar, then y0x is a matrix with single row

  • the Jacobian and the output of the function is calculated in one sweep.

The above is an idealized computation. The real implementation is a bit different, as we will see later.

Implementation of the forward mode using Dual numbers

Forward modes need to keep track of the output of the function and of the derivative at each computation step in the computation of the complicated function f. This can be elegantly realized with a dual number, which are conceptually similar to complex numbers, but instead of the imaginary number i dual numbers use ϵ in its second component:

x=v+v˙ϵ,

where (v,v˙)R and by definition ϵ2=0 (instead of i2=1 in complex numbers). What are the properties of these Dual numbers?

(v+v˙ϵ)+(u+u˙ϵ)=(v+u)+(v˙+u˙)ϵ(v+v˙ϵ)(u+u˙ϵ)=vu+(uv˙+u˙v)ϵ+v˙u˙ϵ2=vu+(uv˙+u˙v)ϵv+v˙ϵu+u˙ϵ=v+v˙ϵu+u˙ϵuu˙ϵuu˙ϵ=vu(u˙vuv˙)ϵu2

Let's evaluate the above equations at (v,v˙)=(v,1) and (u,u˙)=(u,0) we obtain

(v+v˙ϵ)+(u+u˙ϵ)=(v+u)+1ϵ(v+v˙ϵ)(u+u˙ϵ)=vu+uϵv+v˙ϵu+u˙ϵ=vu+1uϵ

and notice that terms (1,u,1u) corresponds to gradient of functions (u+v,uv,vu) with respect to v. We can repeat it with changed values of ϵ as (v,v˙)=(v,0) and (u,u˙)=(u,1) and we obtain

(v+v˙ϵ)+(u+u˙ϵ)=(v+u)+1ϵ(v+v˙ϵ)(u+u˙ϵ)=vu+vϵv+v˙ϵu+u˙ϵ=vuvu2ϵ

meaning that at this moment we have obtained gradients with respect to u.

All above functions (u+v,uv,uv) are of R2R, therefore we had to repeat the calculations twice to get gradients with respect to both inputs. This is inline with the above theory, where we have said that if input dimension is larger then output dimension, the backward mode is better. But consider a case, where we have a function

f(v)=(v+5,5v,5/v)

which is RR3. In this case, we obtain the Jacobian [1,5,5v2] in a single forward pass (whereas the reverse would require three passes over the backward calculation, as will be seen later).

Does dual numbers work universally?

Let's first work out polynomial. Let's assume the polynomial

p(v)=i=1npivi

and compute its value at v+v˙ϵ (note that we know how to do addition and multiplication)

p(v)=i=0npi(v+v˙ϵ)i=i=0n[pij=0n(ij)vij(v˙ϵ)i]=p0+i=1n[pij=01(ij)vij(v˙ϵ)j]==p0+i=1npi(vi+ivi1v˙ϵ)=p(v)+(i=1nipivi1)v˙ϵ

where in the multiplier of v˙ϵ: i=1nipivi1, we recognize the derivative of p(v) with respect to v. This proves that Dual numbers can be used to calculate the gradient of polynomials.

Let's now consider a general function f:RR. Its value at point v+v˙ϵ can be approximated using Taylor expansion at function at point v as

f(v+v˙ϵ)=i=0fi(v)v˙iϵii!=f(v)+f(v)v˙ϵ,

where all higher order terms can be dropped because ϵi=0 for i>1. This shows that we can calculate the gradient of f at point v by calculating its value at f(v+ϵ) and taking the multiplier of ϵ.

Implementing Dual number with Julia

To demonstrate the simplicity of Dual numbers, consider following definition of Dual numbers, where we define a new number type and overload functions +, -, *, and /. In Julia, this reads:

julia
struct Dual{T<:Number} <: Number
    x::T
    d::T
end

Base.:+(a::Dual, b::Dual)   = Dual(a.x+b.x, a.d+b.d)
Base.:-(a::Dual, b::Dual)   = Dual(a.x-b.x, a.d-b.d)
Base.:/(a::Dual, b::Dual)   = Dual(a.x/b.x, (a.d*b.x - a.x*b.d)/b.x^2) # recall  (a/b) =  a/b + (a'b - ab')/b^2 ϵ
Base.:*(a::Dual, b::Dual)   = Dual(a.x*b.x, a.d*b.x + a.x*b.d)

# Let's define some promotion rules
Dual(x::S, d::T) where {S<:Number, T<:Number} = Dual{promote_type(S, T)}(x, d)
Dual(x::Number) = Dual(x, zero(typeof(x)))
Dual{T}(x::Number) where {T} = Dual(T(x), zero(T))
Base.promote_rule(::Type{Dual{T}}, ::Type{S}) where {T<:Number,S<:Number} = Dual{promote_type(T,S)}
Base.promote_rule(::Type{Dual{T}}, ::Type{Dual{S}}) where {T<:Number,S<:Number} = Dual{promote_type(T,S)}

# and define api for forward differentionation
forward_diff(f::Function, x::Real) = _dual(f(Dual(x,1.0)))
_dual(x::Dual) = x.d
_dual(x::Vector) = _dual.(x)
_dual (generic function with 2 methods)

And let's test the Babylonian Square Root (an algorithm to compute x):

julia
julia> babysqrt(x, t=(1+x)/2, n=10) = n==0 ? t : babysqrt(x, (t+x/t)/2, n-1)
babysqrt (generic function with 3 methods)

julia> forward_diff(babysqrt, 2)
0.35355339059327373

julia> forward_diff(babysqrt, 2)  1/(2sqrt(2))
true

julia> forward_diff(x -> [1 + x, 5x, 5/x], 2)
3-element Vector{Float64}:
  1.0
  5.0
 -1.25

We now compare the analytic solution to values computed by the forward_diff and byt he finite differencing

f(x)=xf(x)=12x
julia
julia> using FiniteDifferences
ERROR: ArgumentError: Package FiniteDifferences not found in current path.
- Run `import Pkg; Pkg.add("FiniteDifferences")` to install the FiniteDifferences package.

julia> forward_dsqrt(x) = forward_diff(babysqrt,x)
forward_dsqrt (generic function with 1 method)

julia> analytc_dsqrt(x) = 1/(2babysqrt(x))
analytc_dsqrt (generic function with 1 method)

julia> forward_dsqrt(2.0)
0.35355339059327373

julia> analytc_dsqrt(2.0)
0.3535533905932738

julia> central_fdm(5, 1)(babysqrt, 2.0)
ERROR: UndefVarError: `central_fdm` not defined in `Main`
Suggestion: check for spelling errors or missing imports.
julia
plot(0.0:0.01:2, babysqrt, label="f(x) = babysqrt(x)", lw=3)
plot!(0.1:0.01:2, analytc_dsqrt, label="Analytic f'", ls=:dot, lw=3)
plot!(0.1:0.01:2, forward_dsqrt, label="Dual Forward Mode f'", lw=3, ls=:dash)


Takeaways

  1. Forward mode f is obtained simply by pushing a Dual through babysqrt

  2. To make the forward diff work in Julia, we only need to overload a few operators for forward mode AD to work on any function. Therefore the name of the approach is called operator overloading.

  3. For vector valued function we can use Hyperduals

  4. Forward diff can differentiation through the setindex! (called each time an element is assigned to a place in array, e.g. x = [1,2,3]; x[2] = 1)

  5. ForwardDiff is implemented in ForwardDiff.jl, which might appear to be neglected, but the truth is that it is very stable and general implementation.

  6. ForwardDiff does not have to be implemented through Dual numbers. It can be implemented similarly to ReverseDiff through multiplication of Jacobians, which is what is the community work on now (in Mooncake, Zygote with rules defined in ChainRules).


Reverse mode

In reverse mode, the computation of the gradient follow the opposite order. We initialize the computation by setting J0=yy0, which is again an identity matrix. Then we compute Jacobians and multiplications in the opposite order. The problem is that to calculate Ji we need to know the value of yi0, which cannot be calculated in the reverse pass. The backward pass therefore needs to be preceded by the forward pass, where {yi0}i=1n are calculated.

The complete reverse mode algorithm therefore proceeds as

  1. Forward pass: iterate i from n down to 1 as
  • calculate the next intermediate output as yi10=fi(yi0)
  1. Backward pass: iterate i from 1 down to n as
  • calculate Jacobian Ji=fiyi|yi0 at point yi0

  • pull back the gradient as f(x)yi|yi0=y0yi1|yi10×Ji

The need to store intermediate outs has a huge impact on memory requirements, which particularly on GPU is a big deal. Recall few lectures ago we have been discussing how excessive memory allocations can be damaging for performance, here we are given an algorithm where the excessive allocation is by design.

Tricks to decrease memory consumptions

  • Define custom rules over large functional blocks. For example while we can auto-grad (in theory) matrix product, it is much more efficient to define make a matrix multiplication as one large function, for which we define Jacobians (note that by doing so, we can dispatch on Blas). e.g
C=ABCA=BCB=AT
  • When differentiating Invertible functions, calculate intermediate outputs from the output. This can lead to huge performance gain, as all data needed for computations are in caches.

  • Checkpointing does not store intermediate ouputs after larger sequence of operations. When they are needed for forward pass, they are recalculated on demand.

Most reverse mode AD engines does not support mutating values of arrays (setindex! in julia). This is related to the memory consumption, where after every setindex! you need in theory save the full matrix. Enzyme differentiating directly LLVM code supports this, since in LLVM every variable is assigned just once. Mooncake supports this by saving values needed to reconstruct the arrays. ForwardDiff methods does not suffer this problem, as the gradient is computed at the time of the values.

Info

Reverse mode AD was first published in 1976 by Seppo Linnainmaa[1], a finnish computer scientist. It was popularized in the end of 80s when applied to training multi-layer perceptrons, which gave rise to the famous backpropagation algorithm[2], which is a special case of reverse mode AD.

Info

The terminology in automatic differentiation is everything but fixed. The community around ChainRules.jl went a great length to use something reasonable. They use pullback for a function realizing vector-Jacobian product in the reverse-diff reminding that the gradient is pulled back to the origin of the computation. The use pushforward to denote the same operation in the ForwardDiff, as the gradient is push forward through the computation.

Implementation details of reverse AD

Reverse-mode AD needs to record operations over variables when computing the value of a differentiated function, such that it can walk back when computing the gradient. This record is called tape, but it is effectively a directed acyclic graph. The construction of the tape can be either explicit or implicit. The code computing the gradient can be produced by operator-overloading or code-rewriting techniques. This give rise of four different takes on AD, and Julia has libraries for alll four.

Graph-based AD

In Graph-based approach, we start with a complete knowledge of the computation graph (which is known in many cases like classical neural networks) and augment it with nodes representing the computation of the computation of the gradient (backward path). We need to be careful to add all edges representing the flow of information needed to calculate the gradient. Once the computation graph is augmented, we can find the subgraph needed to compute the desired node(s).

Recall the example from the beginning of the lecture f(x,y)=sin(x)+xy, let's observe, how the extension of the computational graph will look like. The computation graph of function f looks like

where arrows denote the flow of operations and we have denoted the output of function f as z and outputs of intermediate nodes as hi standing for hidden.

We start from the top and add a node calculating zh3 which is an identity, needed to jump-start the differentiation.

We connect it with the output of h3, even though technically in this case it is not needed, as the z=h3. We then add a node calculating h3h2 for which we only need information about h2 and mark it in the graph (again, this edge can be theoretically dropped due to being equal to one regardless the inputs). Following the chain rule, we need to combine h3h2 with zh3 to compute zh2 which we note in the graph.

We continue with the same process with h3h1, which we again combine with zh1 to obtain zh1. Continuing the reverse diff we obtain the final graph

containing the desired nodes zx and zy. This computational graph can be passed to the compiler to compute desired values.

This approach to AD has been taken for example by Theano, TensorFlow, and JAX. In Tensorflow when you use functions like tf.mul( a, b ) or tf.add(a,b), you are not performing the computation in Python, but you are building the computational graph shown as above. You can then compute the values using tf.run with a desired inputs, but you are in fact computing the values in a different interpreter / compiler then in python. PyTorch now does this in compiled mode.

Advantages:

  • Knowing the computational graph in advance is great, as you can do expensive optimization steps to simplify the graph.

  • The computational graph have a simple semantics (limited support for loops, branches, no objects), and the compiler is therefore simpler than the compiler of full languages.

  • Since the computation of gradient augments the graph, you can run the process again to obtain higher order gradients.

  • TensorFlow allows you to specialize on sizes of Tensors, which means that it knows precisely how much memory you will need and where, which decreases the number of allocations. This is quite important in GPU.

Disadvantages:

  • You are restricted to fixed computation graph. It is generally difficult to implement if or while, and hence to change the computation according to values computed during the forward pass.

  • Development and debugging can be difficult, since you are not developing the computation graph in the host language.

  • Exploiting within computation graph parallelism might be difficult.

Comments:

  • DaggerFLux.jl use this approach to perform model-based paralelism, where parts of the computation graph (and especially parameters) can reside on different machines.

  • Umlaut.jl allows to easily obtain the tape through tracing of the execution of a function, which can be then used to implement the AD as described above (see Yota's documentation for complete example).

julia
julia> using Umlaut

julia> g(x, y) = x * y
g (generic function with 1 method)

julia> f(x, y) = g(x, y)+sin(x)
f (generic function with 1 method)

julia> tape = trace(f, 1.0, 2.0)[2]
ERROR: FieldError: type Compiler.IRCode has no field `linetable`, available fields: `stmts`, `argtypes`, `sptypes`, `debuginfo`, `cfg`, `new_nodes`, `meta`, `valid_worlds`

Yota.jl use the tape to generate the gradient as

julia
julia> tape = Yota.gradtape(f, 1.0, 2.0; seed=1.0)
ERROR: UndefVarError: `Yota` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

julia> Umlaut.to_expr(tape)
ERROR: UndefVarError: `tape` not defined in `Main`
Suggestion: add an appropriate import or assignment. This global was declared but not assigned.

Tracking-based AD

Alternative to static-graph based methods are methods, which builds the graph during invocation of functions and then use this dynamically built graph to know, how to compute the gradient. The dynamically built graph is frequently called tape. This approach is used by popular libraries like PyTorch, AutoGrad, and Chainer in Python ecosystem, or by Tracker.jl (Flux.jl's former AD backend), ReverseDiff.jl, and AutoGrad.jl (Knet.jl's AD backend) in Julia. This type of AD systems is also called operator overloading, since in order to record the operations performed on the arguments we need to replace/wrap the original implementation.

How do we build the tracing? Let's take a look what ReverseDiff.jl is doing. It defines TrackedArray (it also defines TrackedReal, but TrackedArray is more interesting) as

julia
struct TrackedArray{T,N,V<:AbstractArray{T,N}} <: AbstractArray{T,N}
    value::V
    deriv::Union{Nothing,V}
    tape::Vector{Any}
    string_tape::String
end

where in

  • value it stores the value of the array

  • deriv will hold the gradient of the tracked array

  • tape of will log operations performed with the tracked array, such that we can calculate the gradient as a sum of operations performed over the tape.

What do we need to store on the tape? Let's denote as a the current TrackedArray. The gradient with respect to some output z is equal to za=gizgi×gia where gi is the output of any function (in the computational graph) where a was a direct input. The InstructionTape will therefore contain a reference to gi (which has to be of TrackedArray and where we know zgi will be stored in deriv field) and we also need to a method calculating gia, which can be stored as an anonymous function will accepting the grad as an argument.

julia
TrackedArray(a::AbstractArray, string_tape::String = "") = TrackedArray(a, similar(a) .= 0, [], string_tape)
TrackedMatrix{T,V} = TrackedArray{T,2,V} where {T,V<:AbstractMatrix{T}}
TrackedVector{T,V} = TrackedArray{T,1,V} where {T,V<:AbstractVector{T}}
Base.show(io::IO, ::MIME"text/plain", a::TrackedArray) = show(io, a)
Base.show(io::IO, a::TrackedArray) = print(io, "TrackedArray($(size(a.value)))")
value(A::TrackedArray) = A.value
value(A) = A
track(A, string_tape = "") = TrackedArray(A, string_tape)
track(a::Number, string_tape) = TrackedArray(reshape([a], 1, 1), string_tape)

import Base: +, *
function *(A::TrackedMatrix, B::TrackedMatrix)
    a, b = value.((A, B))
    C = TrackedArray(a * b, "($(A.string_tape) * $(B.string_tape))")
    push!(A.tape, (C, ∂C -> ∂C * b'))
    push!(B.tape, (C, ∂C -> a' * ∂C))
    C
end

function *(A::TrackedMatrix, B::AbstractMatrix)
    a, b = value.((A, B))
    C = TrackedArray(a * b, "($(A.string_tape) * B)")
    push!(A.tape, (C, ∂C -> ∂C * b'))
    C
end

function *(A::Matrix, B::TrackedMatrix)
    a, b = value.((A, B))
    C = TrackedArray(a * b, "A * $(B.string_tape)")
    push!(A.tape, (C, ∂C -> ∂C * b'))
    C
end

function +(A::TrackedMatrix, B::TrackedMatrix)
    C = TrackedArray(value(A) + value(B), "($(A.string_tape) + $(B.string_tape))")
    push!(A.tape, (C, ∂C -> ∂C))
    push!(B.tape, (C, ∂C -> ∂C))
    C
end

function msin(A::TrackedMatrix)
    a = value(A)
    C = TrackedArray(sin.(a), "sin($(A.string_tape))")
    push!(A.tape, (C, ∂C -> cos.(a) .* ∂C))
    C
end

Let's observe that the operations are recorded on the tape as they should

julia
a = rand()
b = rand()
A = track(a, "A")
B = track(b, "B")
# R = A * B + msin(A)
C = A * B 
A.tape
B.tape
C.string_tape
R = C + msin(A)
A.tape
B.tape
R.string_tape

Let's now implement a function that will recursively calculate the gradient of a term of interest. It goes over its childs, if they not have calculated the gradients, calculate it, otherwise it adds it to its own after if not, ask them to calculate the gradient and otherwise

julia
function accum!(A::TrackedArray)
    isempty(A.tape) && return(A.deriv)
    A.deriv .= sum(g(accum!(r)) for (r, g) in A.tape)
    empty!(A.tape)
    A.deriv
end

We can calculate the gradient by initializing the gradient of the result to vector of ones simulating the sum function

julia
using FiniteDifferences
R.deriv .= 1
accum!(A)[1]
∇a = grad(central_fdm(5,1), a -> a*b + sin(a), a)[1]
A.deriv[1]  ∇a
accum!(B)[1]
∇b = grad(central_fdm(5,1), b -> a*b + sin(a), b)[1]
B.deriv[1]  ∇b

The api function for computing the grad might look like

julia
function trackedgrad(f, args...)
    args = track.(args)
    o = f(args...)
    fill!(o.deriv, 1)
    map(accum!, args)
end

where we should assert that the output dimension is 1. In our implementation we dirtily expect the output of f to be summed to a scalar.

Let's compare the results to those computed by FiniteDifferences

julia
A = rand(4,4)
B = rand(4,4)
trackedgrad(A -> A * B + msin(A), A)[1]
grad(central_fdm(5,1), A -> sum(A * B + sin.(A)), A)[1]
trackedgrad(A -> A * B + msin(A), B)[1]
grad(central_fdm(5,1), A -> sum(A * B + sin.(A)), B)[1]

To make the above AD system really useful, we would need to

  1. Add support for TrackedReal, which is straightforward (we might skip the anonymous function, as the derivative of a scalar function is always a number).

  2. We would need to add a lot of rules, how to work with basic values. This is why the the approach is called operator overloading since you need to overload a lot of functions (or methods or operators).

For example to add all combinations for +, we would need to add following rules.

julia
function +(A::TrackedMatrix, B::TrackedMatrix)
   C = TrackedArray(value(A) + value(B), "($(A.string_tape) + $(B.string_tape))")
   push!(A.tape, (C, ∂C -> ∂C ))
   push!(B.tape, (C, ∂C -> ∂C))
   C
end

function +(A::AbstractMatrix, B::TrackedMatrix)
   C = TrackedArray(A * value(B), "(A + $(B.string_tape))")
   push!(B.tape, (C, ∂C -> ∂C))
   C
end

Advantages:

  • Debugging and development is nicer, as AD is implemented in the same language.

  • The computation graph, tape, is dynamic, which makes it simpler to take the gradient in the presence of if and while.

Disadvantages:

  • The computation graph is created and differentiated during every computation, which might be costly. In most deep learning applications, this overhead is negligible in comparison to time of needed to perform the operations itself (ReverseDiff.jl allows to compile the tape).

  • The compiler has limited options for optimization, since the tape is created during the execution.

  • Since computation graph is dynamic, it cannot be optimized as the static graph, the same holds for the memory allocations.

A more complete example which allow to train feed-forward neural network on GPU can be found here.

Info

The difference between tracking and graph-based AD systems is conceptually similar to interpreted and compiled programming languages. Tracking AD systems interpret the time while computing the gradient, while graph-based AD systems compile the computation of the gradient.

ChainRules

From our discussions about AD systems so far we see that while the basic, engine, part is relatively straightforward, the devil is in writing the rules prescribing the computation of gradients. These rules are needed for every system whether it is graph based, tracking, or Wengert list based. ForwardDiff also needs a rule system, but rules are a bit different (as they are pushing the gradient forward rather than pulling it back). It is obviously a waste of effort for each AD system to have its own set of rules. Therefore the community (initiated by Catherine Frames White backed by Invenia) have started to work on a unified system to express differentiation rules, such that they can be shared between systems. So far, they are supported by Zygote.jl, Nabla.jl, ReverseDiff.jl and Diffractor.jl, suggesting that the unification approach is working (but not by Enzyme.jl).

The definition of reverse diff rules follows the idea we have nailed above (we refer readers interested in forward diff rules to official documentation).

ChainRules defines the reverse rules for function foo in a function rrule with the following signature

julia
function rrule(::typeof(foo), args...; kwargs...)
    ...
    return y, pullback
end

where

  • the first argument ::typeof(foo) allows to dispatch on the function for which the rules is written

  • the output of function foo(args...) is returned as the first argument

  • pullback(Δy) takes the gradient of upstream functions with respect to the output of foo(args) and returns it multiplied by the jacobian of the output of foo(args) with respect to parameters of the function itself (recall the function can have parameters, as it can be a closure or a functor), and with respect to the arguments.

julia
function pullback(Δy)
    ...
    return ∂self, ∂args...
end

Notice that key-word arguments are not differentiated. This is a design decision with the explanation that parametrize the function, but most of the time, they are not differentiable.

ChainRules.jl provides support for lazy (delayed) computation using Thunk. Its argument is a function, which is not evaluated until unthunk is called. There is also a support to signal that gradient is zero using ZeroTangent (which can save valuable memory) or to signal that the gradient does not exist using NoTangent.

How can we use ChainRules to define rules for our AD system? Let's first observe the output

julia
using ChainRulesCore, ChainRules
r, g = rrule(*, rand(2,2), rand(2,2))
g(r)

With that, we can extend our AD system as follows

julia
import Base: *, +, -
for f in [:*, :+, :-]
    @eval function $(f)(A::TrackedMatrix, B::AbstractMatrix)
       C, pullback = rrule($(f), value(A), B)
       C = track(C)
       push!(A.tape, (C, Δ -> pullback(Δ)[2]))
       C
    end

    @eval function $(f)(A::AbstractMatrix, B::TrackedMatrix)
       C, pullback = rrule($(f), A, value(B))
       C = track(C)
       push!(B.tape, (C, Δ -> pullback(Δ)[3]))
       C
    end

    @eval function $(f)(A::TrackedMatrix, B::TrackedMatrix)
       C, pullback = rrule($(f), value(A), value(B))
       C = track(C)
       push!(A.tape, (C, Δ -> pullback(Δ)[2]))
       push!(B.tape, (C, Δ -> pullback(Δ)[3]))
       C
    end
end

and we need to modify our accum! code to unthunk if needed

julia
function accum!(A::TrackedArray)
    isempty(A.tape) && return(A.deriv)
    A.deriv .= sum(unthunk(g(accum!(r))) for (r, g) in A.tape)
end
julia
A = rand(4,4)
B = rand(4,4)
grad(A -> (A * B + msin(A))*B, A)[1]
gradient(A -> sum(A * B + sin.(A)), A)[1]
grad(A -> A * B + msin(A), B)[1]
gradient(A -> sum(A * B + sin.(A)), B)[1]

Source-to-source AD using Wengert

Recall the compile stages of julia and look, how the lowered code for

julia
f(x,y) = x*y + sin(x)

looks like

julia
julia> @code_lowered f(1.0, 1.0)
CodeInfo(
1%1 = x * y
%2 = Main.sin(x)
%3 = %1 + %2
└──      return %3
)

This form is particularly nice for automatic differentiation, as we have on the left hand side always a single variable, which means the compiler has provided us with a form, on which we know, how to apply AD rules.

What if we somehow be able to talk to the compiler and get this form from him?

Sources for this lecture


  1. Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146-160. ↩︎

  2. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. ↩︎