Compute Core Modules
RTL source on GitHub
SystemVerilog sources documented on this page:
- hw/rtl/MAT_CORE/GEMM_systolic_top.sv
- hw/rtl/VEC_CORE/GEMV_top.sv
- hw/rtl/CVO_CORE/CVO_top.sv
- hw/rtl/MAT_CORE/GEMM_dsp_unit.sv
1. Matrix Core — Systolic Top
GEMM_systolic_top.sv wraps the 32 × 32 systolic array, whose cascade
is split at row 16 into two 32 × 16 sub-chains. It receives weight tiles
from HP0/HP1 and activation rows from the L2 cache, and streams
accumulated results to the post-processor.
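As a rough functional illustration only, the sketch below models one activation row passing through a 32 × 32 weight tile, with the column accumulation split at row 16 into two partial sums that are then combined. The function name, the weight-stationary framing, and the reading of "cascade split" as two 16-row partial accumulations are assumptions made for this example; they are not taken from the RTL, and no timing or pipelining is modeled.

```python
ROWS, COLS, SPLIT = 32, 32, 16  # array size and split row, as stated in the text


def systolic_tile(activation_row, weight_tile):
    """Hypothetical behavioral model: one 32-element activation row times a
    32 x 32 weight tile, accumulated in two 16-row halves per column."""
    assert len(activation_row) == ROWS
    assert len(weight_tile) == ROWS and all(len(r) == COLS for r in weight_tile)
    out = []
    for col in range(COLS):
        # Assumed upper sub-chain: rows 0..15 accumulate one partial sum.
        upper = sum(activation_row[r] * weight_tile[r][col] for r in range(0, SPLIT))
        # Assumed lower sub-chain: rows 16..31 accumulate the other partial sum.
        lower = sum(activation_row[r] * weight_tile[r][col] for r in range(SPLIT, ROWS))
        out.append(upper + lower)  # the two partial sums are combined per column
    return out
```

Splitting the accumulation does not change the arithmetic result; in this model it only shortens each accumulation chain from 32 rows to 16.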
2. Vector Core — GEMV Top
GEMV_top.sv instantiates 4 parallel GEMV cores. Each core has a
32-wide LUT-based MAC and a 5-stage reduction tree (Stage 1 uses 16
DSP48E2 slices; Stages 2–5 are LUT adders). Weights stream from HP2/HP3.
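The stage count follows from the lane width: 32 products reduce pairwise as 32 → 16 → 8 → 4 → 2 → 1, so Stage 1 needs 16 adders (matching the 16 DSP48E2 slices) and five stages are needed in total. A minimal behavioral sketch, assuming plain integer arithmetic and hypothetical function names that do not appear in the RTL:

```python
LANES = 32  # MAC width per GEMV core, as stated in the text


def reduction_tree(products):
    """Pairwise adder tree; returns the sum and the adder count per stage."""
    assert len(products) == LANES
    level = list(products)
    adders_per_stage = []
    while len(level) > 1:
        # Each stage adds neighbouring pairs, halving the number of values.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        adders_per_stage.append(len(level))
    return level[0], adders_per_stage


def gemv_core_dot(weights, activations):
    """Hypothetical model of one core: 32 lane products, then the tree."""
    products = [w * a for w, a in zip(weights, activations)]
    total, _ = reduction_tree(products)
    return total
```

Calling `reduction_tree` on any 32 values returns `[16, 8, 4, 2, 1]` as the per-stage adder counts.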
3. CVO / SFU Core
CVO_top.sv orchestrates the CORDIC + LUT hybrid units for
non-linear operations: exp, sqrt, gelu, sin, cos,
reduce_sum, scale, recip. Precision is promoted to BF16/FP32
for all computations.
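For readers unfamiliar with CORDIC, the sketch below shows the textbook rotation-mode algorithm for sin and cos. It is a generic illustration, not the CVO unit's implementation: the iteration count, the use of floating point in place of fixed-point shifts, and the function name are all assumptions, and the LUT half of the hybrid is not modeled.

```python
import math

ITERATIONS = 16  # assumed iteration count, chosen only for this example

# Elementary rotation angles atan(2^-i) and the accumulated CORDIC gain.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta):
    """Rotation-mode CORDIC; converges for |theta| up to roughly 1.74 rad."""
    # Pre-scale x by 1/GAIN so the final vector has unit length.
    x, y, z = 1.0 / GAIN, 0.0, theta
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0          # rotate toward z == 0
        scale = 2.0 ** -i                     # a right shift in fixed-point hardware
        x, y = x - d * y * scale, y + d * x * scale
        z -= d * ANGLES[i]
    return y, x  # (sin, cos)
```

Inputs outside the convergence range would need a range-reduction step first, which is omitted here.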
4. DSP48E2 MAC Unit
GEMM_dsp_unit.sv implements the dual-channel W4A8 MAC using a single
DSP48E2 slice. See DSP48E2 W4A8 Bit Packing and Sign Recovery for the
bit-packing derivation.
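The general idea behind sharing one multiplier between two channels can be sketched as follows: two signed 4-bit weights are packed into one wide operand, multiplied once by a signed 8-bit activation, and the two products are then separated with a sign correction. This is a minimal sketch of the common packing technique, not the derivation on the referenced page; the lane offset `SHIFT = 16` and all function names are assumptions chosen for illustration.

```python
SHIFT = 16  # hypothetical lane offset; must leave room for a signed 12-bit product


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def packed_dual_mac(w_hi, w_lo, a):
    """One multiply yields both w_hi * a and w_lo * a (W4A8 operands)."""
    assert -8 <= w_hi <= 7 and -8 <= w_lo <= 7 and -128 <= a <= 127
    packed = (w_hi << SHIFT) + w_lo       # two weights in one operand
    product = packed * a                  # the single shared multiplication
    lo = sign_extend(product, SHIFT)      # low lane is w_lo * a
    # A negative low lane borrows from the high lane; subtracting it first
    # undoes that borrow before the shift (the sign-recovery step).
    hi = (product - lo) >> SHIFT
    return hi, lo
```

For example, `packed_dual_mac(-3, 5, -7)` returns `(21, -35)`, the same as computing the two products separately.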
Last verified against
Commit 773bd82 @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-21).