TensorFlow Quant Finance - Adjoint Algorithmic Differentiation

Introduction

In the previous blog I introduced TF Quant Finance (TFF). Here, I want to talk briefly about Adjoint Algorithmic Differentiation (or AAD as it is commonly known).

I will introduce the theory using a simple equation and use TFF to get the gradients at a given point. Time permitting, I will attempt a full MVA use case soon, to show the effect when used at scale.

Margin Value Adjustment (MVA) uses Monte Carlo simulation to calculate initial margins. The significant contributions from Vega and basis risk require these to be addressed per simulation path. Traditional bump-and-revalue in such cases can prove computationally intensive. AAD helps reduce both the complexity and the intensity of the compute.

Background Theory

I am just regurgitating what has been explained in this presentation and in this paper (hopefully, more succinctly). Please refer to them for greater detail.

A function, \( z \), is defined as

\[ z = x*y + sin(x)\]

Let's break this down as follows

$$\begin{eqnarray} x &=& ? \\ y &=& ? \\ a &=& x * y \\ b &=& sin(x) \\ z &=& a + b \\ \end{eqnarray}$$

Forward (Tangent Linear) mode

Differentiate each of the above equations with respect to a yet-to-be-defined variable \( t \)

$$\begin{eqnarray} \frac {\partial x} {\partial t} &=& ? \\ \\ \frac {\partial y} {\partial t} &=& ? \\ \\ \frac {\partial a} {\partial t} &=& \frac {\partial a} {\partial x} \frac {\partial x} {\partial t} + \frac {\partial a} {\partial y} \frac {\partial y} {\partial t}\\ \\ &=& y \frac {\partial x} {\partial t} + x \frac {\partial y} {\partial t} \\ \\ \frac {\partial b} {\partial t} &=& \frac {\partial b} {\partial x} \frac {\partial x} {\partial t} \\ \\ &=& cos(x) \frac {\partial x} {\partial t} \\ \\ \frac {\partial z} {\partial t} &=& \frac {\partial z} {\partial a} \frac {\partial a} {\partial t} + \frac {\partial z} {\partial b} \frac {\partial b} {\partial t} \\ \\ &=& \frac {\partial a} {\partial t} + \frac {\partial b} {\partial t} \\ \\ \end{eqnarray}$$

In programming terms we would write \( \frac {\partial x} {\partial t} \) as \( dx \). Writing the above equations in those terms

$$\begin{eqnarray} dx &=& ? \\ \\ dy &=& ? \\ \\ da &=& y dx + x dy \\ \\ db &=& cos(x) dx \\ \\ dz &=& da + db \\ \\ \end{eqnarray}$$

If we now set \( t \) as \( x \) (in other words, \( dx=1 \) and \( dy=0 \)), we get \( dz \) (which is in effect \( \frac {\partial z} {\partial x} \) ) as

$$\begin{eqnarray} dz = y + cos(x) \end{eqnarray}$$

Setting \( t \) as \( y \), \( dx=0 \) and \( dy=1 \), we get \( dz \) (which is in effect \( \frac {\partial z} {\partial y} \) ) as

$$\begin{eqnarray} dz = x \end{eqnarray}$$

If \( z \) depended on \( n \) input variables, then we would have to repeat this \( n \) times in the forward mode to get the gradient of \( z \) at any point. Therefore, in the forward mode, the cost of evaluating the gradient varies as \( O(n) \).
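To make the forward sweep concrete, here is a minimal hand-rolled sketch (plain Python, my own illustration rather than code from the presentation) that follows the equations above, evaluated at the same point used in the TFF example further below:

import math

# Hand-rolled forward (tangent-linear) sweep for z = x*y + sin(x).
# The seeds dx, dy select which derivative we get:
# (dx, dy) = (1, 0) -> dz/dx, (dx, dy) = (0, 1) -> dz/dy.
def forward_sweep(x, y, dx, dy):
   a, da = x * y, y * dx + x * dy
   b, db = math.sin(x), math.cos(x) * dx
   z, dz = a + b, da + db
   return z, dz

print(forward_sweep(0.5, 4.2, 1.0, 0.0))  # dz/dx = y + cos(x) ~ 5.0776
print(forward_sweep(0.5, 4.2, 0.0, 1.0))  # dz/dy = x = 0.5

Note that one sweep is needed per input variable, which is exactly the \( O(n) \) cost described above.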

The computation graph proceeds from the inputs \( x \) and \( y \), through the intermediates \( a \) and \( b \), to the output \( z \).

Reverse (Adjoint) mode

The forward mode introduced some intermediate variables to which we then applied the chain rule.

Let us again introduce a yet to be defined variable \( s \), such that

$$\begin{eqnarray} \frac {\partial s} {\partial z} &=& ? \\ \\ \frac {\partial s} {\partial b} &=& \frac {\partial s} {\partial z} \frac {\partial z} {\partial b}\\ \\ &=& \frac {\partial s} {\partial z} \\ \\ \frac {\partial s} {\partial a} &=& \frac {\partial s} {\partial z} \frac {\partial z} {\partial a}\\ \\ &=& \frac {\partial s} {\partial z} \\ \\ \frac {\partial s} {\partial x} &=& \frac {\partial s} {\partial a} \frac {\partial a} {\partial x} + \frac {\partial s} {\partial b} \frac {\partial b} {\partial x}\\ \\ &=& y \frac {\partial s} {\partial a} + cos(x) \frac {\partial s} {\partial b} \\ \\ \frac {\partial s} {\partial y} &=& \frac {\partial s} {\partial a} \frac {\partial a} {\partial y} + \frac {\partial s} {\partial b} \frac {\partial b} {\partial y}\\ \\ &=& x \frac {\partial s} {\partial a}\\ \\ \end{eqnarray}$$

As before, the following variables represent the partial differentials in programming terms.

\( \frac {\partial s} {\partial z} \) as \( gz \)
\( \frac {\partial s} {\partial b} \) as \( gb \)
\( \frac {\partial s} {\partial a} \) as \( ga \)
\( \frac {\partial s} {\partial x} \) as \( gx \)
\( \frac {\partial s} {\partial y} \) as \( gy \)

The above equations become

$$\begin{eqnarray} gz &=& ? \\ gb &=& gz \\ ga &=& gz \\ gy &=& x * ga \\ gx &=& y * ga + cos(x) * gb \end{eqnarray}$$

Now if we set \( s = z \), in other words \( gz = 1 \), we get both partial derivatives of \( z \) (with respect to \( x \) and with respect to \( y \)) from the last two equations. We do not have to run the sweep twice as we did in the forward mode.
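Again as a concrete illustration (a hand-rolled sketch in plain Python, my own addition rather than code from the presentation), the adjoint sweep recovers both partial derivatives from a single backward pass:

import math

# Hand-rolled reverse (adjoint) sweep for z = x*y + sin(x).
def reverse_sweep(x, y):
   # Forward pass: evaluate and record the intermediates.
   a = x * y
   b = math.sin(x)
   z = a + b
   # Reverse pass: seed gz = 1 and propagate the adjoints backwards.
   gz = 1.0
   gb = gz                          # dz/db = 1
   ga = gz                          # dz/da = 1
   gy = x * ga                      # da/dy = x
   gx = y * ga + math.cos(x) * gb   # da/dx = y, db/dx = cos(x)
   return z, gx, gy

print(reverse_sweep(0.5, 4.2))  # z, dz/dx and dz/dy from one pass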

In the adjoint mode, if you have many inputs (a function depending on many variables), the full gradient is obtained in a single backward pass and is therefore cheaper to evaluate.

If the number of outputs (or functions) is \( m \), then the cost varies as \( O(m) \).

The computation graph is traversed in reverse: from the output \( z \), back through the intermediates \( a \) and \( b \), to the inputs \( x \) and \( y \).

Conversely, if you have many outputs (or many functions) that depend on a small set of inputs, the forward mode could prove to be cheaper.

TFF Implementation


Single function

Here, we implement the original equation

\[ z = x*y + sin(x)\]

The presentation provides a simple Python program to calculate the gradients. At the point \( x = 0.5 \), \( y = 4.2 \) used below, they work out to \( \frac {\partial z} {\partial x} = y + cos(x) \approx 5.0776 \) and \( \frac {\partial z} {\partial y} = x = 0.5 \).

The TFF implementation of reverse-mode (adjoint) differentiation is as follows (follow the TFF instructions for the required pip installs):

import tensorflow as tf
import functools
import tensorflow_probability as tfp

# Decorator that turns a value function into one returning (value, gradient),
# using TensorFlow's reverse-mode automatic differentiation.
def make_val_and_grad_fn(value_fn):
   @functools.wraps(value_fn)
   def val_and_grad(x):
      return tfp.math.value_and_gradient(value_fn, x)
   return val_and_grad

@make_val_and_grad_fn
def func(x):
   # z = x*y + sin(x), with x = x[0] and y = x[1].
   return x[0] * x[1] + tf.sin(x[0])

start = tf.constant([0.5, 4.2], dtype=tf.float64)

# Element [0] is the value of z, element [1] is the gradient [dz/dx, dz/dy].
func(start)[1]

The output from TFF matches the hand-calculated gradient: approximately \( [5.0776, 0.5] \).
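As a cross-check (my own addition, using plain TensorFlow rather than the tfp helper above), the same gradient can be obtained directly from a tf.GradientTape:

import tensorflow as tf

x = tf.constant([0.5, 4.2], dtype=tf.float64)
with tf.GradientTape() as tape:
   tape.watch(x)                    # constants are not watched by default
   z = x[0] * x[1] + tf.sin(x[0])
print(tape.gradient(z, x))          # ~[5.0776, 0.5]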

Multiple functions

Consider a simple function of two variables \(x\) and \(y\), but with three outputs (one along each axis): [\(x^2\), \(y^2\), \(x y\)]

$$\begin{eqnarray} f &=& [f_1, f_2, f_3] \end{eqnarray}$$ where $$\begin{eqnarray} f_1 &=& x^2 \\ f_2 &=& y^2 \\ f_3 &=& x * y \\ \end{eqnarray}$$

Let's define the function and its gradient as follows.

def func(x):
   # f = [f1, f2, f3] = [x^2, y^2, x*y]
   return tf.stack([x[0]**2, x[1]**2, x[0] * x[1]])

def func_grad(x):
   # Hand-coded gradient, summed over the three outputs:
   # [df/dx, df/dy] = [2x + y, 2y + x]
   grad = tf.stack([2.0 * x[0] + x[1], 2.0 * x[1] + x[0]])
   return func(x), grad

start = tf.constant([1, 1], dtype=tf.float64)

Reverse mode

$$\begin{eqnarray} \frac {\partial f}{\partial x} &=& u_1 \frac{\partial f_1}{\partial x} + u_2 \frac{\partial f_2}{\partial x} + u_3 \frac{\partial f_3}{\partial x} \\ \\ &=& u_1 2 x + u_3 y\\ \\ \frac {\partial f}{\partial y} &=& u_1 \frac{\partial f_1}{\partial y} + u_2 \frac{\partial f_2}{\partial y} + u_3 \frac{\partial f_3}{\partial y} \\ \\ &=& u_2 2 y + u_3 x \end{eqnarray}$$

In TensorFlow, [\(u_1\), \(u_2\), \(u_3\)] is normally set to [1, 1, 1]. Setting [\(x\), \(y\)] to [1, 1], reverse mode returns the gradients summed over the output components: [\(2x + y\), \(2y + x\)] = [3, 3].
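A quick sketch of this in TensorFlow (my own illustration; tf.GradientTape is the standard reverse-mode API):

import tensorflow as tf

x = tf.constant([1.0, 1.0], dtype=tf.float64)
with tf.GradientTape() as tape:
   tape.watch(x)
   f = tf.stack([x[0]**2, x[1]**2, x[0] * x[1]])
# The output weights [u1, u2, u3] default to ones, so the result is the
# component-summed gradient: [2x + y, 2y + x] = [3., 3.] at x = y = 1.
print(tape.gradient(f, x))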

Forward mode

$$\begin{eqnarray} {\partial f_1} &=& w_1 \frac{\partial f_1}{\partial x} + w_2 \frac{\partial f_1}{\partial y} \\ \\ &=& w_1 2 x \\ \\ {\partial f_2} &=& w_1 \frac{\partial f_2}{\partial x} + w_2 \frac{\partial f_2}{\partial y} \\ \\ &=& w_2 2 y \\ \\ {\partial f_3} &=& w_1 \frac{\partial f_3}{\partial x} + w_2 \frac{\partial f_3}{\partial y} \\ \\ &=& w_1 y + w_2 x \\ \\ \end{eqnarray}$$

In TensorFlow, [\(w_1\), \(w_2\)] is normally set to [1, 1]. Setting [\(x\), \(y\)] to [1, 1], forward mode returns the gradients component by component: [\(2x\), \(2y\), \(y + x\)] = [2, 2, 2].
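The forward-mode equivalent in TensorFlow is tf.autodiff.ForwardAccumulator (again my own sketch, not the presentation's code):

import tensorflow as tf

x = tf.constant([1.0, 1.0], dtype=tf.float64)
w = tf.constant([1.0, 1.0], dtype=tf.float64)   # the tangent weights [w1, w2]
with tf.autodiff.ForwardAccumulator(primals=x, tangents=w) as acc:
   f = tf.stack([x[0]**2, x[1]**2, x[0] * x[1]])
# One tangent per output component: [2x*w1, 2y*w2, y*w1 + x*w2] = [2., 2., 2.]
print(acc.jvp(f))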

Conclusions

(Thanks to Ashish Saxena for the explanations around forward and backward differences in multi-function TensorFlow use cases.)

Remember, TensorFlow is primarily a machine-learning tool. In machine learning the aim is to minimise a scalar loss function, and the gradient of that loss is taken with respect to the model parameters. This lowest loss (or cost, if you follow Andrew Ng) is the loss summed over all training examples, so a summed, reverse-mode gradient is exactly what is wanted there.

However, let's take the use case where we are valuing a set of options, say ten, against a single spot price \( S_0 \). We now have ten price functions and we need their gradients against spot \( S_0 \) (ten deltas).

Using the forward-mode gradients with respect to \( S_0 \) would give us the ten deltas in a single pass.

Using the backward (reverse-mode) gradients would instead give the sum of the ten deltas, which may not be that useful. The sketch below illustrates both.
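Here is a minimal sketch of that use case (a hand-written Black-Scholes call pricer in plain TensorFlow, purely illustrative and not TFF's pricing API):

import tensorflow as tf

# Toy Black-Scholes call pricer: ten strikes, one spot S0, flat vol and rate.
strikes = tf.constant([80., 85., 90., 95., 100., 105., 110., 115., 120., 125.],
                      dtype=tf.float64)
vol, rate, expiry = 0.2, 0.01, 1.0
discount = tf.exp(tf.constant(-rate * expiry, dtype=tf.float64))

def norm_cdf(d):
   return 0.5 * (1.0 + tf.math.erf(d / tf.sqrt(tf.constant(2.0, tf.float64))))

def call_prices(s0):
   d1 = (tf.math.log(s0 / strikes) + (rate + 0.5 * vol**2) * expiry) / (vol * expiry**0.5)
   d2 = d1 - vol * expiry**0.5
   return s0 * norm_cdf(d1) - strikes * discount * norm_cdf(d2)

s0 = tf.constant(100.0, dtype=tf.float64)

# Forward mode: one sweep w.r.t. S0 gives all ten deltas individually.
with tf.autodiff.ForwardAccumulator(primals=s0, tangents=tf.constant(1.0, tf.float64)) as acc:
   prices = call_prices(s0)
print(acc.jvp(prices))            # shape [10]: one delta per option

# Reverse mode: one sweep gives only the sum of the ten deltas.
with tf.GradientTape() as tape:
   tape.watch(s0)
   prices = call_prices(s0)
print(tape.gradient(prices, s0))  # scalar: the deltas summed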

It is useful to note that varying the weights would also give you the individual components of the gradients (in other words, using [1, 0] and [0, 1] as values of [\(w_1\), \(w_2\)] instead of the default [1, 1]; similarly for the backward mode). This is, of course, at the expense of more compute.