September 24, 2024

What Makes Plonky So Fast on the VPU?

Last week, we shared the news that Fabric Cryptography and Polygon Labs are collaborating to accelerate Polygon’s foundational ZK proof systems, Plonky2 and Plonky3, on our upcoming custom chip for cryptography, the Verifiable Processing Unit (VPU). In this blog, we explain what makes the Plonky family so efficient on a VPU compared to other computer architectures.

We had inspiring conversations with people like Daniel Lubarov, Bobbin Threadbare, and Jordi Baylina about their experience accelerating Plonky2/3 on existing computer architectures; those conversations deeply influenced how we approached building the VPU.

TL;DR: 

  • Software-hardware co-design = solving the global co-optimization problem across compute, memory, network, input/output, and software.
  • Fabric has been collaborating with Polygon Labs since 2022. The Polygon Labs team was incredibly helpful in discussing Plonky workloads and existing acceleration efforts, which ensured the VPU architecture would address the bottlenecks of existing hardware.
  • We solve the PCIe bottleneck between CPUs and co-processors by adding a RISC-V processor (an open-source CPU architecture) directly onto the chip.
  • Number-theoretic transforms (NTTs) are memory-hard computations found in ZKPs that are difficult to optimize on existing architectures. We solve the memory bottleneck with a memory architecture that enables end-to-end workloads on the chip, without intermediate memory fetches.
  • We support recursive proving, a ZKP technique that turns a large proof into a set of smaller ones, to maximize the utilization of VPU chips working in parallel.
  • We co-designed new instructions with Daniel Lubarov that provide an additional performance boost for Plonky and general ZK workloads.

The Balancing Act of Computer Architecture 

In 2022, Fabric’s co-founders Michael and Tina had only just begun to conceptualize and explore the VPU’s architecture. We believed making a custom silicon processor with a cryptography-focused instruction set architecture (ISA) would be an important foundation for the new kinds of trust infrastructure being built.

Deliberate choices to accelerate finite-field arithmetic, the core mathematical operations of advanced cryptography, would give the VPU the processing power needed for cheap, fast, and accessible ZKPs. Raw computational power inside our chips was an obvious target, but good computer architecture is always a co-optimization problem across compute, memory, network, I/O, and software – and knowing the right tradeoff points would require talking with the experts running production ZK workloads!
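
To make “finite-field arithmetic” concrete, here is a toy sketch of the two operations that dominate ZK workloads, written over the Goldilocks prime that Plonky2 uses. It is purely illustrative – a real prover never calls the slow `%` operator – but it shows exactly what the VPU is built to accelerate:

```rust
// Toy finite-field arithmetic over the Goldilocks prime used by Plonky2.
// Illustrative only: production code replaces `%` with fast reductions.

const P: u128 = 0xFFFF_FFFF_0000_0001; // p = 2^64 - 2^32 + 1

fn fadd(a: u128, b: u128) -> u128 {
    (a + b) % P // field addition wraps around at p
}

fn fmul(a: u128, b: u128) -> u128 {
    a * b % P // field multiplication: the hot loop of any prover
}

fn main() {
    assert_eq!(fadd(P - 1, 2), 1); // p - 1 acts like -1, so -1 + 2 = 1
    assert_eq!(fmul(P - 1, P - 1), 1); // (-1) * (-1) = 1
    println!("field ops ok");
}
```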

We’ve been collaborating with the Polygon team for nearly two years. At SBC in 2022, Bobbin Threadbare, one of Polygon’s co-founders and the creator of Miden, worked with us to map out the full impact of the lesser-known hardware bottlenecks that made Plonky optimizations difficult on existing computing systems.

Hardware Bottlenecks in ZK

One of the biggest lessons to learn from AI hardware is that it’s not all about arithmetic. Data movement often limits performance more than raw arithmetic speed does. When we looked at ZKPs, we were unsurprised to find many similar challenges. But after talking to the ZK experts at Polygon, we were excited to find that together, we could surmount these challenges through in-depth hardware/software co-design.

Any time two chips need to trade data, for example a host CPU and a co-processor (GPU, FPGA, ASIC), or a co-processor and an external memory, that data has to travel over a chip-to-chip interconnect. If a processor can consume and produce data faster than the interconnect can move it, some of the available compute sits idle, and the computing system is bottlenecked by bandwidth.

In this blog post, we’ll explore two critical components of ZKP computation, witness generation and the number-theoretic transform (NTT), highlighting different aspects of these challenges.

Resolving the PCIe Bottleneck Between CPUs and Co-Processors

We discussed the PCIe bottleneck at length with Jordi Baylina, the leader of Polygon’s Hermez team. He shared that this was a primary bottleneck he faced when exploring Hermez acceleration on GPUs and FPGAs. PCIe bandwidth is a common limiter for ZK acceleration on these architectures, particularly during witness generation.

Witness generation, a small (2%) but crucial portion of end-to-end proving, is well suited to processors that execute sequential code efficiently and are easy to program. In this respect, CPUs excel over parallel architectures like the GPU: modern CPUs offer high clock frequencies, sophisticated out-of-order execution, and compatibility with programming languages like Rust, making them ideal for this part of the workload.
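
As a minimal sketch of why this part of the workload is so CPU-friendly, consider a toy trace-filling loop (our own illustration, not Plonky2’s actual API). Each row of the witness depends on earlier rows, so the loop is a serial dependency chain that rewards high clock speeds rather than massive parallelism:

```rust
// Toy witness generation for a Fibonacci-style circuit. Illustrative
// only; real Plonky2 witness generation is far richer, but the shape
// is the same: each trace cell depends on cells computed before it.

const P: u128 = 0xFFFF_FFFF_0000_0001; // Goldilocks prime

fn generate_witness(steps: usize) -> Vec<u64> {
    let mut trace = vec![1u64, 1u64];
    for i in 2..steps {
        // A serial dependency chain: row i needs rows i-1 and i-2,
        // so there is no easy way to farm this loop out to a GPU.
        let next = ((trace[i - 1] as u128 + trace[i - 2] as u128) % P) as u64;
        trace.push(next);
    }
    trace
}

fn main() {
    let trace = generate_witness(1 << 10);
    println!("last trace cell: {}", trace.last().unwrap());
}
```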

By default, co-processors like GPUs or FPGAs are less suited for witness generation due to the complexities of porting existing Rust code to a new processor architecture and the native advantage of sequential processing on CPUs. Yet, if we naively split the workload, using the CPU for witness generation and the co-processors for subsequent steps, we introduce a significant bottleneck when transferring data between the CPU and the co-processor over interfaces like PCIe. 

The limited bandwidth of PCIe means that even if the processor can produce results rapidly, the data transfer rate becomes the limiting factor, leading to underutilization of computing resources. We made the key decision to embed a RISC-V CPU in the VPU chip itself, connected by a high-bandwidth network-on-chip, allowing us to run most of witness generation on the VPU and move the data over the (faster) on-chip interconnect instead of the (slower) chip-to-chip interconnect.
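
A back-of-envelope calculation shows the scale of the gap. The PCIe figure below (~32 GB/s usable on a Gen4 x16 link) is our rough assumption; the 320 GB/s GDDR bandwidth and the 500 GB Plonky2 memory footprint are the numbers discussed later in this post:

```rust
// Rough interconnect math. ASSUMPTION: ~32 GB/s usable on PCIe 4.0 x16.
// The 320 GB/s (GDDR) and 500 GB (Plonky2 working set, circa 2022)
// figures come from this post.

fn transfer_seconds(bytes: f64, bandwidth_bytes_per_sec: f64) -> f64 {
    bytes / bandwidth_bytes_per_sec
}

fn main() {
    let working_set = 500e9; // 500 GB of intermediate data
    println!(
        "PCIe 4.0 x16 (~32 GB/s): {:.0} s per one-way transfer",
        transfer_seconds(working_set, 32e9)
    );
    println!(
        "on-package GDDR (320 GB/s): {:.1} s",
        transfer_seconds(working_set, 320e9)
    );
    // Data that stays on the network-on-chip never pays either cost.
}
```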

Resolving Memory Bottlenecks by Enabling End-to-End Proving on Chip

The NTT introduces another issue for existing architectures. NTTs are notoriously difficult to optimize on standard CPUs, GPUs, and FPGAs because they’re incredibly demanding on memory systems. These algorithms need to access large chunks of data frequently and in complex patterns, which often leads to a lot of waiting around for data to be fetched from external main memory. It's like trying to cook a complex recipe when your ingredients are scattered all over town instead of neatly arranged in your kitchen.
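
To see where those complex patterns come from, here is a minimal radix-2 NTT over the Goldilocks field – our own illustrative version, not a production kernel. Note how the butterfly stride doubles each round, so later rounds jump across the whole array and defeat ordinary caches:

```rust
// A minimal radix-2 NTT over the Goldilocks field, written to expose the
// strided "butterfly" accesses that make NTTs hard on cache-based hardware.

const P: u128 = 0xFFFF_FFFF_0000_0001; // p = 2^64 - 2^32 + 1

fn pow_mod(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1u128;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// In-place decimation-in-time NTT; `a.len()` must be a power of two.
fn ntt(a: &mut [u128]) {
    let n = a.len();
    // Bit-reversal permutation: already a cache-unfriendly access pattern.
    let mut j = 0usize;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 {
            j ^= bit;
            bit >>= 1;
        }
        j |= bit;
        if i < j {
            a.swap(i, j);
        }
    }
    let mut len = 2usize;
    while len <= n {
        // omega = primitive len-th root of unity; 7 is used as the
        // multiplicative group generator, matching Plonky2's parameters.
        let omega = pow_mod(7, (P - 1) / len as u128);
        for start in (0..n).step_by(len) {
            let mut w = 1u128;
            for k in 0..len / 2 {
                // Each butterfly touches elements len/2 apart: the stride
                // doubles every round until it spans the whole array.
                let u = a[start + k];
                let v = a[start + k + len / 2] * w % P;
                a[start + k] = (u + v) % P;
                a[start + k + len / 2] = (u + P - v) % P;
                w = w * omega % P;
            }
        }
        len <<= 1;
    }
}

fn main() {
    let mut a: Vec<u128> = (0u128..8).collect();
    ntt(&mut a);
    println!("{:?}", a);
}
```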

Bobbin is no stranger to these kinds of memory bottlenecks. In fact, there are many tricks a programmer could theoretically use to overcome them. But commercial processors give the programmer no direct control over cache memory, so the system prevents hand-tuning of data placement. Bobbin described the process as “arguing with the built-in features.”

It was exciting to hear that many of our architectural decisions would already address the problem. We already knew we needed a blazing-fast interconnect, which meant adding RISC-V CPU cores onto the VPU chip itself. We were also adding a 320 GB/s GDDR memory interface to each VPU chip, with enough memory capacity to keep our arithmetic logic units (ALUs) running at their limit. But this conversation is when we realized something important.

If we could manage to put enough scratchpad memory on the chip, design an ultra-fast way to shuttle data around, and add a general-purpose processor to handle the control logic of a ZK system, then we would be able to compute an entire ZK proof end-to-end, without needing to send intermediate results over the interconnect. All the data required for the workload could be uploaded to the VPU in a single batch, used in computation, and read back out at enormous speed.

Planning Ahead for Recursive Proofs

When we first spoke with Bobbin at SBC in 2022, he shared that Plonky2 required a staggering 500 GB of memory at the time. You can’t fit that much memory onto any PCIe card, let alone a single chip! This meant that, even with on-chip memory optimization, large ZKPs would still have intermediate results that must be stored in an external memory.

Luckily, Bobbin suggested that we factor in the use of recursive proving, which allows a proof system to generate ZKPs of ZKPs. By nesting ZKPs inside each other, it’s possible to verify an enormous amount of data within a single, aggregated proof. While recursive proving was originally designed to reduce on-chain verification costs and make proving more parallelizable over transactions, we discovered together that we can use recursive proving to make efficient use of memory resources inside the VPU!

The normal way to compute a large ZKP would be to process it directly, generating lots of intermediate results along the way that have to be stored off-chip. But with recursive proving, the VPU can split a large proof into pieces small enough to fit into the hardware’s available memory, then reconstruct an “aggregated proof” at the end of the process. This technique gives a VPU programmer the flexibility to use all the memory and computational resources on chip, while relying minimally on the slower interconnect between the VPU chip and external memory.
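
In code, the pattern looks something like the sketch below. The types and functions (`Proof`, `prove_chunk`, `prove_recursive_merge`) are hypothetical stand-ins, not any real prover’s API; the point is the shape of the computation – every piece fits in on-chip memory, and only small proofs ever cross the slow interconnect:

```rust
// Sketch of split-then-aggregate recursive proving. All names here are
// hypothetical placeholders, not a real proof system's API.

struct Proof; // placeholder for a succinct proof object

fn prove_chunk(_chunk: &[u8]) -> Proof {
    // Hypothetical: prove one memory-sized slice of the workload on-chip.
    Proof
}

fn prove_recursive_merge(_left: Proof, _right: Proof) -> Proof {
    // Hypothetical: a recursive circuit that verifies two child proofs.
    Proof
}

/// Split a large trace into chip-sized chunks, prove each one, then fold
/// the proofs pairwise into a single aggregated proof.
fn prove_large_workload(trace: &[u8], chunk_bytes: usize) -> Proof {
    let mut layer: Vec<Proof> = trace.chunks(chunk_bytes).map(prove_chunk).collect();
    // Fold proofs up a binary tree until a single root proof remains.
    while layer.len() > 1 {
        let mut next = Vec::with_capacity((layer.len() + 1) / 2);
        let mut iter = layer.into_iter();
        while let Some(left) = iter.next() {
            match iter.next() {
                Some(right) => next.push(prove_recursive_merge(left, right)),
                None => next.push(left), // an odd proof carries up a level
            }
        }
        layer = next;
    }
    layer.pop().expect("non-empty trace")
}

fn main() {
    let trace = vec![0u8; 1 << 20]; // pretend 1 MiB of execution trace
    let _aggregated = prove_large_workload(&trace, 1 << 16);
    println!("folded {} chunks into one proof", (1 << 20) / (1 << 16));
}
```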

In 2022, recursive proving was not yet in widespread deployment. To us, this was such a compelling idea that we decided to plan ahead and design the VPU with built-in support for recursive proving across chips. With these two insights, the Fabric team was able to design around the memory bottleneck that would, by default, plague a naively designed accelerator. It was the next series of conversations, early this year, that got both our teams excited enough for a full collaboration at the instruction and software level.

Optimizing Performance with New VPU Instructions

With these major bottlenecks resolved, we started to look at new ways to improve the computational speed of Plonky2/3 on the VPU using custom instructions.

The one person most excited by the possibility of inventing their own VPU instructions was Daniel Lubarov, the creator of the Plonky proof systems. We had many deep dives into the VPU architecture, and it did not take long for us to start dreaming up new ideas for the VPU ISA.

Fabric’s cryptography, software, and chip teams worked closely with Daniel to craft additional instructions that let the VPU process Plonky2, and the upcoming Plonky3, in a fraction of the steps a conventional architecture would take. Shrinking operations composed of many steps, used many times in Plonky2/3, into single instructions led Daniel to compare SW/HW co-design to “playing God mode on silicon”.
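
To give a feel for what gets fused, here is what a single Goldilocks multiplication costs in ordinary 64-bit code – our own illustrative reduction, not Fabric’s or Plonky’s exact sequence. One widening multiply is followed by a chain of shifts, adds, and conditional corrections, all of which a single modular-multiply instruction could replace:

```rust
// One Goldilocks multiplication on a conventional 64-bit CPU. The
// reduction uses the identities 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p).
// Illustrative only; not Fabric's or Plonky2's exact code.

const P: u64 = 0xFFFF_FFFF_0000_0001; // p = 2^64 - 2^32 + 1

fn goldilocks_mul(a: u64, b: u64) -> u64 {
    let x = a as u128 * b as u128; // step 1: 64x64 -> 128-bit multiply
    let lo = x as u64; // bits 0..64
    let mid = ((x >> 64) as u64) & 0xFFFF_FFFF; // bits 64..96
    let hi = (x >> 96) as u64; // bits 96..128

    // x ≡ lo + mid*(2^32 - 1) - hi (mod p)
    let mid_term = (mid << 32) - mid; // step 2: mid * (2^32 - 1)
    let (mut t, carry) = lo.overflowing_add(mid_term); // step 3
    if carry {
        t = t.wrapping_add(0xFFFF_FFFF); // fold carry: 2^64 ≡ 2^32 - 1
    }
    let (mut r, borrow) = t.overflowing_sub(hi); // step 4
    if borrow {
        r = r.wrapping_sub(0xFFFF_FFFF); // fold borrow: -2^64 ≡ -(2^32 - 1)
    }
    if r >= P {
        r -= P; // step 5: final conditional subtraction
    }
    r
}

fn main() {
    assert_eq!(goldilocks_mul(P - 1, P - 1), 1); // (-1) * (-1) = 1 mod p
    println!("ok");
}
```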

This is exactly how we want to support cryptographers, researchers, and developers. 

Co-Design for the Community

It has been a rewarding journey to work with Bobbin, Jordi, and Daniel over the last two years to co-design the VPU.

Together, we are always discovering more exciting ideas to accelerate ZK on the VPU, which will deliver revolutionary performance not just for Polygon, but for the wider ZK community in blockchain and beyond.

Our hope is that we continue to meet the most ambitious cryptography teams, who inspire us to show what the VPU can do for their new research and ideas. 

If that’s you, let’s co-design together!

If you’re interested in operating or testing our first machines, we invite you to submit a pre-order or testboard request!

Author:

Fabric Team
