Traditional financial workloads involve large data sets that must be processed by analytics libraries; once that processing completes, a series of aggregation functions post-processes the results. These tasks are both compute- and IO-intensive due to the sheer volume of data.
Traditionally, software has been written to run on CPUs, which execute generalised instruction sets. FPGAs, whose hardware can be described as "programmable fabric", can run much faster using specialised kernels. An FPGA such as the Arria 10 has processing rates of up to 10 TFLOPS (tera floating-point operations per second). To give a *very* rough comparison, Seti (a Core i9 machine) comes in at around 100 GFLOPS.
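Taking the two peak figures quoted above at face value (they are headline numbers, not sustained throughput), the gap works out to roughly two orders of magnitude:

```python
# Back-of-the-envelope comparison using the peak figures quoted above.
fpga_flops = 10e12   # Arria 10: ~10 TFLOPS
cpu_flops = 100e9    # Core i9:  ~100 GFLOPS

speedup = fpga_flops / cpu_flops
print(f"Peak FPGA/CPU ratio: {speedup:.0f}x")  # -> Peak FPGA/CPU ratio: 100x
```

Of course, that ratio only materialises if the FPGA is kept busy, which is the point of the next section.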
With all of the raw speed of the FPGA, the biggest challenge in harnessing this capability is feeding it enough data. To achieve this, we incorporated Apache Beam as a client API to define our processing workflow and used Apache Flink as our Beam runner. In addition, InAccel gave us a big helping hand with their FPGA orchestration framework, which simplifies our programming by letting us issue multiple overlapping process calls rather than managing multiple threads ourselves.
We feel this is a good representation of the modern grid. Note that this architecture can be used for on-premises grids or with any cloud hosting provider (GCP, AWS, Azure - subject to FPGA availability, of course!):
A simple Beam workflow was created to read input from a data source (the filesystem) and pass it to a function that processes the file contents. The analytics operate on the largest data structure the FPGA can handle, given its on-board memory constraints, and the results are written back to file.
The golden source comprised a total of 34 million European option trades, which were valued using a Monte Carlo pricer implemented as an OpenCL FPGA kernel. As part of the test, the Flink parallelism was varied to observe the effect of overlapping process calls to the FPGA, which were then scheduled by the InAccel orchestrator.
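The OpenCL kernel itself is not reproduced here, but the maths it implements is standard: simulate terminal spot prices under risk-neutral geometric Brownian motion and discount the average payoff. A pure-Python sketch of a Monte Carlo European call pricer (illustrative only; parameter names are ours) looks like this:

```python
import math
import random

def mc_european_call(spot, strike, rate, vol, maturity, n_paths, seed=42):
    """Price a European call by Monte Carlo under GBM:
    S_T = S0 * exp((r - 0.5*sigma^2)*T + sigma*sqrt(T)*Z), Z ~ N(0,1)."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(s_t - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths

price = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200_000)
print(f"MC price: {price:.2f}")  # Black-Scholes reference is ~10.45
```

Each trade is an independent simulation of this form, which is exactly the kind of embarrassingly parallel arithmetic an FPGA kernel excels at.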
Each parallelism setting from 1 to 8 was run three times. The more tasks queued, the higher the utilisation of the FPGA and the lower the overall client time: