October 31, 2024

Fabric and RISC Zero Partner to Accelerate Boundless on the VPU

We are excited to announce a strategic partnership between Fabric and RISC Zero to accelerate Boundless on the VPU. Last week marked the one-year anniversary of Fabric and RISC Zero's co-design efforts, which have focused particularly on resolving data-movement bottlenecks, keeping constraint evaluation compatible as the circuit evolves, and adding RISC Zero-native prime field instructions.

We present an order-of-magnitude acceleration over GPUs for eval_check, a compute-intensive and heterogeneous portion of the RISC Zero zkVM workload. This acceleration generalizes across the end-to-end workflow, demonstrating that real-time zero knowledge – for any use case – is just around the corner.

RISC Zero x Fabric: Real-Time RISC-V ZK 

While computers have become exponentially faster over time, a curious pattern emerges: instead of making existing tasks much faster, we often use the extra speed to make software development easier and more powerful.

Consider opening a window on an Apple computer: despite CPUs being thousands of times faster in 2020 than in 1980, this basic action still takes roughly the same fraction of a second. Why? Because we've traded raw speed for flexibility and developer productivity.  

Modern applications use layers of abstraction – like high-level programming languages, frameworks, and APIs – that make development faster and more reliable, but consume more computational resources. 

Likewise, by implementing the complete RISC-V architecture as a ZK circuit, RISC Zero makes verifiable computation accessible to developers (not just cryptographers) working in standard languages like Rust and C++. But while RISC-V is a strong design choice for handling any application, it remains optimized for traditional silicon execution – and introduces significant performance overhead when translated into a ZK circuit.

The Fabric and RISC Zero teams have worked together to eliminate this overhead with specialized hardware for RISC-V zkVM operations, narrowing the gap between today’s software and verifiable software.

Boundless: The Universal ZK Protocol

Today's blockchain ecosystem faces a scaling dilemma where solutions typically create new networks, leading to ecosystem fragmentation and divided liquidity. Boundless takes a different approach, using zero-knowledge proofs to achieve true horizontal scalability—improving existing networks through parallel computation rather than creating new isolated environments.

For developers, Boundless delivers:

  • Familiar language support for ZK application development
  • O(1) gas consumption for any computation through off-chain execution
  • Seamless proof aggregation and verification across different systems
  • A decentralized proof marketplace

Understanding RISC Zero’s zkVM Workload Profile

The RISC Zero zkVM proving pipeline, provided by the RISC Zero team. 

The RISC Zero zkVM transforms arbitrary RISC-V code into verifiable computations through an elegant multi-stage proof system. Starting from STARK proofs of individual segments, it recursively combines these until producing a final proof of the entire program’s execution. Below, we show the performance profile for a segment proof, which constitutes the bulk of the proving costs in applications like rollups. 

Nvidia Nsight trace of RISC Zero’s zkVM 1.1.3 proof run on an Nvidia RTX 4090, provided by the RISC Zero team. 

This pipeline contains two distinct types of operations:

  • Homogeneous: regular, highly parallelizable operations like hash functions and NTTs, where identical computations are performed across large datasets.
  • Heterogeneous: complex, data-dependent operations like WitGen and eval_check, which require varied arithmetic operations with irregular access patterns.
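
To make the contrast concrete, here is a minimal Python sketch (our illustration, not RISC Zero's actual kernels) of the two operation classes over the BabyBear field:

```python
# Sketch: contrasting operation classes over BabyBear, p = 15 * 2^27 + 1.
P = 2013265921

def ntt_butterfly_pass(a, b, twiddles):
    """Homogeneous: the same mul/add/sub is applied to every lane,
    so the whole batch maps cleanly onto wide parallel hardware."""
    out_lo, out_hi = [], []
    for x, y, w in zip(a, b, twiddles):
        t = (y * w) % P
        out_lo.append((x + t) % P)
        out_hi.append((x - t) % P)
    return out_lo, out_hi

def constraint_eval(vals):
    """Heterogeneous (toy example): each step mixes different ops and
    depends on earlier results, so lanes diverge and parallelism is
    irregular."""
    acc = 0
    for i, v in enumerate(vals):
        if i % 3 == 0:
            acc = (acc + v * v) % P   # square-and-accumulate
        elif acc % 2 == 0:            # data-dependent branch
            acc = (acc * v) % P
        else:
            acc = (acc + v) % P
    return acc
```

The butterfly pass vectorizes trivially; the constraint evaluation does not, which is why the two classes favor different hardware.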

While GPUs can handle homogeneous workloads on their own, heterogeneous workloads are most efficiently processed on CPUs, necessitating hybrid CPU-GPU systems for end-to-end proving. Proving each segment requires both WitGen and eval_check. These heterogeneous steps alone take ~45% of proof-generation time on an RTX 3090/4090, and adding data movement between the scratchpad and external memory (and between the CPU and GPU) brings that to ~50% of the time. This represents a significant performance bottleneck in existing hardware solutions.

Ground-Up Approach to Hardware Acceleration 

These contrasting computational patterns and data movement challenges demand that we rethink the hardware architecture for proof generation. Rather than adapting existing GPU and ASIC architectures, we designed the VPU from first principles for cryptographic workloads. The VPU design addresses zkVM architecture with three key innovations: 

  1. Native RISC-V Integration: We integrated a full RISC-V processor core with an ultra high-bandwidth interconnect (more performant than any PCIe interface on the market) to 40 ALU tiles, enabling tight coupling between control logic and compute resources. This opens an intriguing possibility: running RISC-V zkVM witness generation natively on RISC-V silicon could eliminate current x86 emulation overhead – a natural architectural alignment. 

  2. End-to-End Data Movement: We built a unified memory architecture with direct paths between compute and memory resources. A high-speed network-on-chip connects 40 ALU tiles to internal GDDR6 and PCIe units, in addition to a direct connection with the RISC-V processor – eliminating the multi-step data transfers of CPU-GPU systems. 

  3. Specialized Finite-Field Compute: We designed an arithmetic logic unit (ALU) for BabyBear prime field arithmetic – used extensively throughout the RISC Zero proving system – alongside other well-adopted fields. Contrast this with a GPU, which dedicates ~80% of cores to floating-point arithmetic. With nearly all of the VPU’s arithmetic resources devoted to finite-field operations, the VPU achieves trillions of operations per second (TOP/s) at close to 100% utilization.
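
For reference, the field those ALUs target can be sketched in a few lines of Python (our illustration; the VPU implements these operations in hardware):

```python
# BabyBear: p = 15 * 2^27 + 1 fits in 31 bits, and its high 2-adicity (27)
# is what makes it convenient for NTT-based proof systems.
P = 15 * (1 << 27) + 1   # 2013265921

def fadd(a, b): return (a + b) % P
def fsub(a, b): return (a - b) % P
def fmul(a, b): return (a * b) % P

def fpow(a, e):
    """Square-and-multiply exponentiation."""
    r = 1
    while e:
        if e & 1:
            r = fmul(r, a)
        a = fmul(a, a)
        e >>= 1
    return r

def finv(a):
    """Inverse via Fermat's little theorem: a^(p-2) = a^-1 mod p."""
    return fpow(a, P - 2)
```

Every proving stage ultimately bottoms out in long sequences of these add/sub/mul operations, which is why dedicating nearly all silicon to them pays off.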

Why Co-Design Is Important

To illustrate the necessity of co-design for a comprehensive performance improvement, let’s consider eval_check – a uniquely difficult and heterogeneous workload that changes with every version of the RISC Zero circuit. Hard-coded approaches like ASICs and FPGAs focus on regular patterns, with circuit changes and heterogeneity being offloaded to CPUs.

  1. Fixed-function chips are optimized for regular patterns. ASICs are designed to be extremely efficient for specific tasks with a well-defined, predictable execution pattern, but the code in eval_check involves a variety of arithmetic operations on thousands of variables, with dependencies on previous results in an arbitrary pattern. This kind of irregular pattern cannot easily be turned into a fixed-function ASIC – doing so would require billions of gates for the hardcoded eval_check module alone.

  2. Fixed-function chips can’t adapt to frequently changing workloads. Although ASICs can easily be made for stable algorithms like the Poseidon hash function, eval_check changes frequently – it was last revised on Aug 3, 2024, in a substantial update.
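
A toy Python sketch (ours, not RISC Zero's) of what programmability buys here: field operations driven by a program that can be swapped whenever the circuit changes – exactly what a gate-level hardcoding cannot do.

```python
# Sketch: eval_check-style work as a small *program* of field ops over
# BabyBear rather than a fixed circuit. A programmable processor runs
# whichever program the current circuit version needs; a fixed-function
# chip would have to bake one operation DAG into gates.
P = 2013265921

def run(program, regs):
    """Interpret (op, dst, src1, src2) tuples over a register file."""
    for op, d, a, b in program:
        x, y = regs[a], regs[b]
        regs[d] = {"add": (x + y) % P,
                   "sub": (x - y) % P,
                   "mul": (x * y) % P}[op]
    return regs

# Toy "constraint": r3 = (r0 + r1) * r2. When the circuit changes,
# you swap the program, not the chip.
program = [("add", 3, 0, 1), ("mul", 3, 3, 2)]
regs = run(program, {0: 5, 1: 7, 2: 11, 3: 0})
```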

This means that only a programmable processor like the VPU could efficiently accelerate eval_check to reduce proving time and cost. And because eval_check and other heterogeneous parts like witness generation are a larger portion of the workload than regular parts like NTT and Poseidon, fixed-function CPU-ASIC or CPU-FPGA systems would struggle to meaningfully accelerate RISC Zero’s workload.

In fact, RISC Zero’s workload represents a pattern that we believe generalizes to other zkVMs once they are accelerated by hardware. As zkVMs are accelerated on GPUs, heterogeneous operations typically come to occupy more runtime than relatively homogeneous ones. We expect this pattern to hold regardless of the proof system.

This means that to meaningfully improve upon the GPU, we need to make a domain-specific processor like the VPU with three important design choices – a cryptographically optimized instruction set, a reservoir of finite-field ALUs, and an ultra-fast network-on-chip to carry lots of data quickly across the chip. 

Further Improvements with Baby Bear Instructions

At Fabric, we are committed to co-designing our instruction set with the greater community. Working with partners like RISC Zero provides an opportunity to reach the highest levels of performance and usability. Large hardware manufacturers, like Intel and NVIDIA, cater to many kinds of customers and use cases. As a result, they keep their instruction sets closely guarded and are very careful about any modifications.

We believe in collaborative improvement. Over the past two years, we've openly shared our instruction set with partners and actively sought their input. Their valuable insights and suggestions have helped shape and strengthen these instructions over time. This open dialogue has made our ISA more robust and practical.

Since beginning our co-design effort with RISC Zero, we have added specialized instructions for the Baby Bear prime field. While a GPU implementation of a Baby Bear multiplication requires 3x 64-bit (uint64) instructions and 5x 32-bit instructions, the VPU can accomplish the same in a single instruction.
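
To illustrate where instruction counts of that shape come from, here is a Python sketch (our illustration; actual GPU kernels differ in detail) of a Montgomery-style Baby Bear multiply, which decomposes into several wide multiplies, a shift, and a conditional subtract – all of which a native field-multiply instruction collapses into one operation:

```python
# Montgomery multiplication over BabyBear with R = 2^32. On a
# general-purpose chip each line below is one or more machine
# instructions; a native field-mul instruction does the whole thing.
P = 2013265921
R = 1 << 32
NP = (-pow(P, -1, R)) % R          # -P^-1 mod 2^32, precomputed

def mont_mul(a, b):
    """Inputs/outputs in Montgomery form (x * R mod P)."""
    t = a * b                      # widening 32x32 -> 64-bit multiply
    m = (t * NP) & (R - 1)         # low-32-bit multiply
    u = (t + m * P) >> 32          # widening multiply, add, shift
    return u - P if u >= P else u  # compare + conditional subtract

def to_mont(x):   return (x * R) % P
def from_mont(x): return mont_mul(x, 1)
```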

Putting It All Together: Order of Magnitude Acceleration over GPU

Our VPU architecture demonstrates strong performance gains in the eval_check stage – one of the most computationally intensive components of the proving pipeline. A single VPU card, comprising three chips, achieves an 11.1x speedup over the RTX 3090Ti and 5.5x over the RTX 4090. This is despite the VPU being on an older 12nm process node versus 8nm and 5nm nodes respectively. 

This architectural advantage is particularly notable given that the VPU contains 1/10th of the RTX 3090’s logic transistors. Through our combination of high-speed memory architecture, finite-field ALU design, and native Baby Bear modular arithmetic instructions, we've achieved these significant eval_check acceleration gains. Since eval_check's demanding requirements for randomized modular arithmetic mirror other proving stages, these improvements likely extend across the entire proving pipeline.

Looking ahead, transitioning to a 5nm process node alone would yield another order of magnitude improvement through increased transistor density (5x) and frequency (2x). Combined with ongoing architectural refinements and proof system co-design, we project two to three orders of magnitude performance gains in future generations.

The Future of RISC Zero x Fabric 

We are excited to work with RISC Zero to make production-grade ZKPs available for any use case on Boundless. As we work together with the RISC Zero team, we expect to be able to find further optimizations to the RISC Zero zkVM – optimizations that could result in even better performance on the VPU. Because the VPU is a programmable chip, its proving performance will get better over time!

Our partnership continues the trend of democratization through silicon. When we dramatically improve hardware efficiency, we lower development barriers and exponentially expand the set of people who can build these applications. 

It's a pattern we've seen before: just as silicon innovation brought supercomputers to our pockets, we envision our collaboration with RISC Zero bringing verifiable computation to every digital interaction.

Author:

Fabric Team
