Ventus: A High-performance Open-source GPGPU Based on RISC-V and Its Vector Extension
Agenda
Time: Afternoon, Saturday October 18
| Time | Topic | Materials |
| --- | --- | --- |
| 13:00 - 13:40 | Overview of the Ventus project | |
| 13:40 - 14:20 | Microarchitecture and Design Philosophy | |
| 14:20 - 15:00 | Software Stack | |
| 15:00 - 15:30 | Coffee Break | |
| 15:30 - 16:40 | Hands-on Demo | |
| 16:40 - 17:00 | Discussion and Summary | |
Abstract
Motivation: over the past decade, open-hardware initiatives have surged in academia and industry, establishing themselves as a mainstream paradigm for accelerator design. The explosive demand for compute power—driven by generative AI and large-scale language models—has rendered high-performance GPGPUs one of today’s most critical and scarce resources. Yet, commercial GPUs keep their ISAs and microarchitectures proprietary, forcing researchers to rely on cycle-level simulators such as GPGPU-Sim, whose results can diverge significantly from real silicon. While pioneering open-source GPUs (e.g., Vortex) have laid important groundwork, they still trail commercial offerings in both features and raw performance.
Ventus GPGPU addresses this gap by leveraging the RISC-V Vector Extension (RVV) to deliver a scalable SIMT microarchitecture, fully implemented in Chisel HDL. The project ships with a parameterizable RTL design, a cycle-accurate simulator, an instruction-level simulator, and a complete software stack—all open-sourced on GitHub. The technical results, published at IEEE ICCD 2024, demonstrate that Ventus achieves the highest raw performance among publicly available RISC-V GPGPUs to date.
Ventus also spearheaded the public definition of an Open GPGPU ISA. Within the 32-bit RISC-V instruction format, it standardizes extensions for warp branching, synchronization, and private/shared address spaces; repurposes the 16-bit instruction-encoding space originally reserved for the RISC-V compressed (C) extension, which GPU pipelines do not need, for custom GPGPU operations; and defines a register model comprising 64 scalar GPRs and 256 vector GPRs, underpinned by a multi-level memory hierarchy.
The project’s rapid momentum has been showcased at several industry events: at the Khronos Open Processor Innovation Forum (March 2025), we unveiled the ISA roadmap; at the 2024 RISC-V Summit China and 2023 RISC-V Summit Europe, we presented our unified toolchain and multi-backend framework—underscoring Ventus’s growing influence in the open-GPU ecosystem.
This tutorial focuses on deploying and running Ventus GPGPUs of various scales on FPGA (using the Xilinx VCU128 as the example platform) and in simulation, guiding attendees through an end-to-end workflow from a single SM to a multi-cluster configuration.
[Figure: Ventus GPGPU Microarchitecture]
[Figure: Ventus Software Stack]
To be covered
• Overview of the Ventus Project
We will introduce the motivation behind Ventus and its role in bridging the gap between academic GPU simulation and real hardware. By reinterpreting RVV with targeted custom extensions, Ventus delivers an open GPGPU capable of running unmodified OpenCL 2.0 workloads. Compared with Vortex, Ventus reduces dynamic instruction count by 83.9% and lowers CPI by 87.4% on GPU-Rodinia benchmarks, while passing the OpenCL conformance tests. We will also outline future directions for expanding benchmark coverage and deepening microarchitectural optimizations to further elevate throughput.
• Microarchitecture and Design Philosophy
This section delves into Ventus’s hardware building blocks: the CTA scheduler, Streaming Multiprocessors (SMs), and cache subsystem. Each SM features a dual-issue pipeline, a fine-grained warp scheduler, and a multi-bank SRAM register file. Control-flow divergence is mitigated via a hardware reconvergence stack and custom setrpc/vbranch/join instructions, enabling early reconvergence in nested branches. A hierarchical cache system with sharded L2 slices supports multiple SMs, and a Release-Consistency-Directed Coherence (RCC) protocol replaces MESI by issuing selective invalidations and flushes, cutting coherence traffic. An integrated tensor core accelerates 8×4×8 FP16 GEMM, reducing instruction count by 31% and cycle count by 32%.
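To make the divergence handling concrete, below is a minimal OpenCL C sketch (not taken from the Ventus repository) of the nested control flow that the reconvergence stack manages. The kernel and its name are illustrative, and the placement of setrpc/vbranch/join noted in the comments is an assumption about where a compiler would emit them, not a description of Ventus's actual code generation.

```c
// Illustrative kernel with nested divergence (hypothetical example).
// Threads in a warp that take different paths are masked and later rejoined
// by the hardware reconvergence stack; the setrpc/vbranch/join placement
// noted below is an assumed compiler mapping.
__kernel void classify(__global const float *in, __global int *out) {
    int gid = get_global_id(0);
    float v = in[gid];

    if (v > 0.0f) {            /* outer divergence: assumed vbranch          */
        if (v > 1.0f) {        /* inner divergence: assumed nested vbranch   */
            out[gid] = 2;
        } else {
            out[gid] = 1;
        }
        /* inner paths rejoin here first (assumed join): early reconvergence */
    } else {
        out[gid] = 0;
    }
    /* outer paths rejoin here (assumed join); the full warp is active again */
}
```

The point of the early (inner) reconvergence is that the warp regains full occupancy on the taken path before ever reaching the outer reconvergence point, which is what the hardware stack and the custom instructions are designed to enable in nested branches.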
• Software Stack
We will survey the end-to-end software ecosystem that powers Ventus: an LLVM-based compiler backend, a bespoke linking framework, a PoCL-based OpenCL runtime, and a kernel-level driver aligned with our ISA specification. Attendees will see how the stack supports multiple backends—including FPGA deployment on Xilinx VCU128, Verilator-backed RTL simulation, a cycle-accurate C++ model, and Spike-based ISA simulation.
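As a concrete reference point, the sketch below shows a standard OpenCL host program of the kind the PoCL-based runtime is meant to accept unmodified, regardless of which backend it is built against. The kernel name vadd and the assumption that the Ventus device is the first one enumerated are illustrative only; error handling and resource cleanup are omitted for brevity.

```c
// Minimal OpenCL host-side sketch: the same portable calls are expected to
// reach whichever backend (FPGA, RTL simulation, cycle-accurate model, Spike)
// the PoCL runtime is built against. Error handling is trimmed for brevity.
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b,\n"
    "                   __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);                        /* PoCL platform */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL); /* assumed: first
                                                                device is Ventus */
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);   /* goes through the
                                                          LLVM-based backend */
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    size_t n = 1024, bytes = n * sizeof(float);
    float a[1024], b[1024], c[1024];
    for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

    printf("c[0] = %f\n", c[0]);                             /* expect 3.0 */
    return 0;
}
```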
• Hands-on Practice on Ventus GPGPU
In the final segment, participants will gain practical experience deploying and executing OpenCL programs on Ventus—both in simulation and on actual FPGA hardware. We will guide attendees through an end-to-end workflow, from configuring a single SM to scaling across multiple clusters. Each step will highlight critical considerations and showcase how Ventus’s parameterizable design enables flexible trade-offs among functionality, performance, and area.
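As one quick sanity check during the hands-on session, a short host-side query like the sketch below can be used to see what the runtime reports after a configuration change. Whether Ventus exposes each SM (or cluster) as an OpenCL compute unit is an assumption here and may differ between builds; the query itself uses only standard OpenCL calls.

```c
// Query what the OpenCL runtime reports for the device after reconfiguring
// the hardware (e.g., single SM vs. multi-cluster). Interpreting compute
// units as SMs is an assumption about how the Ventus driver reports them.
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);

    char name[128];
    cl_uint cus;
    size_t wg;
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wg), &wg, NULL);

    printf("device: %s, compute units: %u, max work-group size: %zu\n",
           name, cus, wg);
    return 0;
}
```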
Call for Contributions
To foster a collaborative ecosystem around Ventus, we are opening a call for community talks. If you are conducting research, building extensions, or have a unique application involving the Ventus GPGPU, we would be delighted to feature your work. Please send a proposal including a title and a brief abstract to mmy23@mails.tsinghua.edu.cn. To ensure your submission is properly routed, please use the subject line: [Ventus Talk Proposal].
Citation
Contact us
For any further questions please contact mmy23@mails.tsinghua.edu.cn or hehu@tsinghua.edu.cn.