Accelerating financial calculation grids with FPGAs

Introduction

Traditional financial workloads involve large data sets that need to be processed by analytics libraries. Once the analytics complete, a series of aggregation functions then post-processes the results. These tasks are both compute and I/O intensive due to the sheer volume of data.

CPU vs FPGA

Traditionally, software has been written to run on CPUs, which execute generalised instruction sets. FPGAs can perform much faster using specialised kernels, because the hardware is best described as "programmable fabric" that can be configured for a specific task. An FPGA like the Intel Arria 10 offers processing rates of up to around 1.5 TFLOPS (tera floating-point operations per second). To give a *very* rough comparison, a modern Core i9 CPU delivers on the order of 100 GFLOPS, so the FPGA has roughly an order of magnitude more raw floating-point throughput.

The data challenge

With all of the raw speed of the FPGA, the biggest challenge in harnessing this capability is feeding it enough data. To achieve this, we incorporated Apache Beam as a client API to define our processing workflow and used Apache Flink as our Beam runner implementation. In addition, we had a great deal of help from InAccel, whose FPGA orchestration framework simplifies our programming by letting us issue multiple overlapping process calls rather than managing multiple threads ourselves.
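As a minimal sketch of the client side, the snippet below wires a Beam pipeline to the Flink runner. Only the Beam and Flink runner APIs here are real; the class name is illustrative and the transforms are elided (the pricing workflow appears later):

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GridClient {
  public static void main(String[] args) {
    // Parse standard Beam options from the command line and force the Flink runner.
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // Transforms (read -> price on FPGA -> write) would be applied here.
    pipeline.run().waitUntilFinish();
  }
}
```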

Grid architecture

We feel this is a good representation of the modern grid. Note that this architecture can be used for on-premises grids or with any cloud hosting provider (GCP, AWS, Azure - subject to FPGA availability, of course!):

The option pricing test

A simple Beam workflow was created to read input from a data source (the filesystem) and pass it to a function that processes the file contents. The analytics operate on the largest data structure the FPGA can handle, given its on-board memory constraints, and the results are written back to file; a sketch of the workflow follows.
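Assuming hypothetical file paths and a placeholder pricing function (in the real workflow the function dispatches trade batches to the FPGA kernel via InAccel), the Beam workflow might look like this:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class OptionPricingWorkflow {

  // Hypothetical DoFn: the real implementation marshals each batch of
  // trades into buffers and dispatches them to the FPGA via InAccel.
  static class PriceTradesFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String tradeBatch, OutputReceiver<String> out) {
      // Placeholder for the OpenCL kernel call; passes data through here.
      out.output(tradeBatch);
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadTrades", TextIO.read().from("/data/trades/*.csv"))      // hypothetical input path
     .apply("PriceTrades", ParDo.of(new PriceTradesFn()))
     .apply("WriteResults", TextIO.write().to("/data/results/priced")); // hypothetical output prefix
    p.run().waitUntilFinish();
  }
}
```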

The golden source was a total of 34 million European option trades, which were valued using a Monte Carlo pricer implemented as an OpenCL FPGA kernel. As part of the test, the Flink parallelism was varied to observe the effect of overlapping process calls to the FPGA, which were then scheduled by the InAccel orchestrator.
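The production kernel is OpenCL, but the underlying maths is straightforward. The plain-Java sketch below is a reference version of a Monte Carlo European call pricer under geometric Brownian motion; the parameter values are purely illustrative, and the FPGA kernel evaluates the same loop across many trades in parallel:

```java
import java.util.Random;

public class MonteCarloEuropeanCall {

  static double price(double spot, double strike, double rate,
                      double vol, double maturity, int paths, long seed) {
    Random rng = new Random(seed);
    double drift = (rate - 0.5 * vol * vol) * maturity; // risk-neutral drift term
    double diffusion = vol * Math.sqrt(maturity);
    double payoffSum = 0.0;
    for (int i = 0; i < paths; i++) {
      double z = rng.nextGaussian();                    // standard normal draw
      double terminal = spot * Math.exp(drift + diffusion * z);
      payoffSum += Math.max(terminal - strike, 0.0);    // call payoff at expiry
    }
    // Discount the average payoff back to today.
    return Math.exp(-rate * maturity) * payoffSum / paths;
  }

  public static void main(String[] args) {
    // Illustrative parameters: spot 100, strike 105, 2% rate, 20% vol, 1y maturity.
    System.out.println(price(100.0, 105.0, 0.02, 0.2, 1.0, 1_000_000, 42L));
  }
}
```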

Each parallelism setting from 1 to 8 was run three times. The more tasks that were queued, the higher the utilisation of the FPGA and the lower the overall client time:
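For reference, with the Beam Flink runner the parallelism can be set from the command line or programmatically; the snippet below (class name illustrative) shows one point of the 1-to-8 sweep:

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ParallelismConfig {
  public static void main(String[] args) {
    // Equivalent to passing --runner=FlinkRunner --parallelism=4 on the command line.
    FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    options.setParallelism(4); // the benchmark swept this value from 1 to 8
  }
}
```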

Conclusions

Please read the solution paper here.

Thanks to Chris Kachris and his team at InAccel for all their help and support in bringing these benchmarks to life. Also, thanks to Graham Mckenzie and Natalia Poliakova from Intel.