Ventus: A High-performance Open-source GPGPU Based on RISC-V and Its Vector Extension
Agenda
Time: Afternoon, Saturday October 18
| Time | Topic | Materials |
| --- | --- | --- |
| 13:00 - 13:40 | Overview of the Ventus project | |
| 13:40 - 14:20 | Microarchitecture and Design Philosophy | |
| 14:20 - 15:00 | Software Stack | |
| 15:00 - 15:30 | Coffee Break | |
| 15:30 - 16:40 | Hands-on Demo | |
| 16:40 - 17:00 | Discussion and Summary | |
Abstract
Motivation: over the past decade, open-hardware initiatives have surged in academia and industry, establishing themselves as a mainstream paradigm for accelerator design. The explosive demand for compute power—driven by generative AI and large-scale language models—has rendered high-performance GPGPUs one of today’s most critical and scarce resources. Yet, commercial GPUs keep their ISAs and microarchitectures proprietary, forcing researchers to rely on cycle-level simulators such as GPGPU-Sim, whose results can diverge significantly from real silicon. While pioneering open-source GPUs (e.g., Vortex) have laid important groundwork, they still trail commercial offerings in both features and raw performance.
Ventus GPGPU addresses this gap by leveraging the RISC-V Vector Extension (RVV) to deliver a scalable SIMT microarchitecture, fully implemented in Chisel HDL. The project ships with a parameterizable RTL design, a cycle-accurate simulator, an instruction-level simulator, and a complete software stack—all open-sourced on GitHub. The technical results, published at IEEE ICCD 2024, demonstrate that Ventus achieves the highest raw performance among publicly available RISC-V GPGPUs to date.
Ventus also spearheaded the public definition of an Open GPGPU ISA. Within the 32-bit RISC-V instruction format, it standardizes extensions for warp branching, synchronization, and private/shared address spaces; repurposes the 16-bit instruction-encoding space originally reserved for the RISC-V compressed (C) extension, which GPU pipelines do not need, for custom GPGPU operations; and defines a register model comprising 64 scalar GPRs and 256 vector GPRs, underpinned by a multi-level memory hierarchy.
The project’s rapid momentum has been showcased at several industry events: at the Khronos Open Processor Innovation Forum (March 2025), we unveiled the ISA roadmap; at the 2024 RISC-V Summit China and 2023 RISC-V Summit Europe, we presented our unified toolchain and multi-backend framework—underscoring Ventus’s growing influence in the open-GPU ecosystem.
This tutorial focuses on deploying and running Ventus GPGPUs of various scales on FPGA (using the Xilinx VCU128 as the example platform) and in simulation, guiding attendees through an end-to-end workflow from a single SM to a multi-cluster configuration.
[Figure: Ventus GPGPU Microarchitecture]
[Figure: Ventus Software Stack]
To be covered
• Overview of the Ventus Project
We will introduce the motivation behind Ventus and its role in bridging the gap between academic GPU simulation and real hardware. By reinterpreting RVV with targeted custom extensions, Ventus delivers an open GPGPU capable of running unmodified OpenCL 2.0 workloads. Compared with Vortex, Ventus reduces dynamic instruction count by 83.9% and lowers CPI by 87.4% on GPU-Rodinia benchmarks, while passing the OpenCL conformance tests. We will also outline future directions for expanding benchmark coverage and deepening microarchitectural optimizations to further elevate throughput.
• Microarchitecture and Design Philosophy
This section delves into Ventus’s hardware building blocks: the CTA scheduler, Streaming Multiprocessors (SMs), and cache subsystem. Each SM features a dual-issue pipeline, a fine-grained warp scheduler, and a multi-bank SRAM register file. Control-flow divergence is mitigated via a hardware reconvergence stack and custom setrpc/vbranch/join instructions, enabling early reconvergence in nested branches. A hierarchical cache system with sharded L2 slices supports multiple SMs, and a Release-Consistency-Directed Coherence (RCC) protocol replaces MESI by issuing selective invalidations and flushes, cutting coherence traffic. An integrated tensor core accelerates 8×4×8 FP16 GEMM, reducing instruction count by 31% and cycle count by 32%.
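To make the divergence handling concrete, below is a minimal OpenCL C sketch (not taken from the Ventus repository) of the nested control flow that the reconvergence stack manages. The kernel and its name are illustrative, and the placement of setrpc/vbranch/join noted in the comments is an assumption about where a compiler would emit them, not a description of Ventus's actual code generation.

```c
// Illustrative kernel with nested divergence (hypothetical example).
// Threads in a warp that take different paths are masked and later rejoined
// by the hardware reconvergence stack; the setrpc/vbranch/join placement
// noted below is an assumed compiler mapping.
__kernel void classify(__global const float *in, __global int *out) {
    int gid = get_global_id(0);
    float v = in[gid];

    if (v > 0.0f) {            /* outer divergence: assumed vbranch          */
        if (v > 1.0f) {        /* inner divergence: assumed nested vbranch   */
            out[gid] = 2;
        } else {
            out[gid] = 1;
        }
        /* inner paths rejoin here first (assumed join): early reconvergence */
    } else {
        out[gid] = 0;
    }
    /* outer paths rejoin here (assumed join); the full warp is active again */
}
```

The point of the early (inner) reconvergence is that the warp regains full occupancy on the taken path before ever reaching the outer reconvergence point, which is what the hardware stack and the custom instructions are designed to enable in nested branches.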
• Software Stack
We will survey the end-to-end software ecosystem that powers Ventus: an LLVM-based compiler backend, a bespoke linking framework, a PoCL-based OpenCL runtime, and a kernel-level driver aligned with our ISA specification. Attendees will see how the stack supports multiple backends—including FPGA deployment on Xilinx VCU128, Verilator-backed RTL simulation, a cycle-accurate C++ model, and Spike-based ISA simulation.
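As a concrete reference point, the sketch below shows a standard OpenCL host program of the kind the PoCL-based runtime is meant to accept unmodified, regardless of which backend it is built against. The kernel name vadd and the assumption that the Ventus device is the first one enumerated are illustrative only; error handling and resource cleanup are omitted for brevity.

```c
// Minimal OpenCL host-side sketch: the same portable calls are expected to
// reach whichever backend (FPGA, RTL simulation, cycle-accurate model, Spike)
// the PoCL runtime is built against. Error handling is trimmed for brevity.
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b,\n"
    "                   __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);                        /* PoCL platform */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL); /* assumed: first
                                                                device is Ventus */
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);   /* goes through the
                                                          LLVM-based backend */
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    size_t n = 1024, bytes = n * sizeof(float);
    float a[1024], b[1024], c[1024];
    for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

    printf("c[0] = %f\n", c[0]);                             /* expect 3.0 */
    return 0;
}
```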
• Hands-on Practice on Ventus GPGPU
In the final segment, participants will gain practical experience deploying and executing OpenCL programs on Ventus—both in simulation and on actual FPGA hardware. We will guide attendees through an end-to-end workflow, from configuring a single SM to scaling across multiple clusters. Each step will highlight critical considerations and showcase how Ventus’s parameterizable design enables flexible trade-offs among functionality, performance, and area.
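As one quick sanity check during the hands-on session, a short host-side query like the sketch below can be used to see what the runtime reports after a configuration change. Whether Ventus exposes each SM (or cluster) as an OpenCL compute unit is an assumption here and may differ between builds; the query itself uses only standard OpenCL calls.

```c
// Query what the OpenCL runtime reports for the device after reconfiguring
// the hardware (e.g., single SM vs. multi-cluster). Interpreting compute
// units as SMs is an assumption about how the Ventus driver reports them.
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    char name[128];
    cl_uint cus;
    size_t wg;
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wg), &wg, NULL);

    printf("device: %s, compute units: %u, max work-group size: %zu\n",
           name, cus, wg);
    return 0;
}
```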
Call for Contributions
To foster a collaborative ecosystem around Ventus, we are opening a call for community talks. If you are conducting research, building extensions, or have a unique application involving the Ventus GPGPU, we would be delighted to feature your work. Please send a proposal including a title and a brief abstract to mmy23@mails.tsinghua.edu.cn. To ensure your submission is properly routed, please use the subject line: [Ventus Talk Proposal].
Citation
Contact us
For any further questions please contact mmy23@mails.tsinghua.edu.cn or hehu@tsinghua.edu.cn.