ASPLOS'26 Tutorial: RAIC

FEATHER

Reconfigurable AI Accelerator

FEATHER is a state-of-the-art reconfigurable accelerator architecture that enables low-cost switching between dataflows and layouts, efficiently supporting diverse workload patterns.

MINISA Interactive Visualizer

Use this interactive tool to explore FEATHER's MINISA instruction set. Configure hardware parameters, build ISA traces, and visualize the cycle-accurate NEST PE array pipeline, BIRRD reduction network, and VN buffer layouts. The tool resets on page refresh.


Microarchitecture

Details about FEATHER and FEATHER+ microarchitecture...

MINISA ISA 2.0 Specification

MINISA ISA 2.0 defines 8 variable-width instructions for the FEATHER+ reconfigurable accelerator. Each instruction begins with a 3-bit opcode. Layout and execute instructions scale as O(log AH + log AW) with array size, while DMA instructions are fixed at 33 bits. This compact encoding achieves 24x–39,681x instruction reduction over direct per-cycle micro-configuration across 9 hardware configurations.

Design Principles:

  • Parametric encoding: Six parameters θ = (r_0, c_0, G_r, G_c, s_r, s_c) generate the entire AH×AW PE-to-WVN mapping algebraically.
  • Buffer-aware sizing: Field widths are derived from on-chip buffer depths, not fixed constants.
  • Per-VN variable size: The vn_size field in ExecuteStreaming supports K dimensions not divisible by AH.
  • Dual dataflow: A single-bit dataflow field selects WO-S or IO-S. Under IO-S the compiler transposes (M,K,N) to (N,K,M).

Instruction Set Summary

| Opcode | Instruction      | Purpose                                      | Width     |
|--------|------------------|----------------------------------------------|-----------|
| 000    | SetWVNLayout     | Configure stationary buffer (weight) layout  | Variable  |
| 001    | SetIVNLayout     | Configure streaming buffer (input) layout    | Variable  |
| 010    | SetOVNLayout     | Configure output buffer layout               | Variable  |
| 011    | ExecuteStreaming | Configure operand streaming parameters       | Variable  |
| 100    | Store            | DMA store to off-chip memory                 | Fixed 33b |
| 101    | Load             | DMA load from off-chip memory                | Fixed 33b |
| 110    | Activation       | Activation function (reserved)               | Fixed 11b |
| 111    | ExecuteMapping   | Configure PE-to-WVN mapping                  | Variable  |

Buffer Architecture

MINISA operates on three on-chip SRAM buffers, each banked by AW (one bank per PE column):

| Buffer           | Controlled By | Stores                        | Default Alloc |
|------------------|---------------|-------------------------------|---------------|
| Streaming (str)  | SetIVNLayout  | Input activations (IVNs)      | 40%           |
| Stationary (sta) | SetWVNLayout  | Weights (WVNs)                | 40%           |
| Output (ob)      | SetOVNLayout  | Partial sums / outputs (OVNs) | 20%           |

Per-bank scalar depth: D_str = stream_bytes / (AW × in_bytes), and analogously for the stationary and output buffers. VN rows per bank = D / AH. Total VN capacity = vn_rows × AW.
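As a concrete illustration of the sizing arithmetic above (function and variable names are ours, not MINISA identifiers):

```python
def buffer_geometry(buffer_bytes, aw, ah, elem_bytes=1):
    """Per-bank scalar depth, VN rows per bank, and total VN capacity.

    Sketch of the formulas in the text: D = bytes / (AW * elem_bytes),
    vn_rows = D / AH, capacity = vn_rows * AW.
    """
    depth = buffer_bytes // (aw * elem_bytes)  # scalars per bank
    vn_rows = depth // ah                      # one VN spans AH scalars per bank
    total_vns = vn_rows * aw                   # VN capacity across all AW banks
    return depth, vn_rows, total_vns

# Example: 4x4 array, 1.6 MB streaming buffer, 1-byte inputs
depth, vn_rows, total_vns = buffer_geometry(1_600_000, aw=4, ah=4)
```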

Instruction Formats

SetWVNLayout (opcode 000) — Configures stationary buffer layout for weight VNs. Defines the 3-level address mapping WVN(k, n) → bank address.

| Field  | Width      | Description                               |
|--------|------------|-------------------------------------------|
| opcode | 3          | 000                                       |
| order  | 3          | Permutation order (0–5)                   |
| N_L0   | b_aw       | Inner N-dimension factor (number of banks)|
| N_L1   | b_sta_rows | Middle N-dimension factor                 |
| K_L1   | b_sta_rows | Outer K-dimension factor                  |

SetIVNLayout (opcode 001) — Configures streaming buffer layout for input VNs. Fields: opcode(3), order(3), M_L0(b_aw), M_L1(b_str_rows), J_L1(b_str_rows).

SetOVNLayout (opcode 010) — Configures output buffer layout for output VNs. Fields: opcode(3), order(3), P_L0(b_aw), P_L1(b_str_rows), Q_L1(b_str_rows).

ExecuteMapping (opcode 111) — Configures the PE-to-WVN mapping. The mapping equations:

r(ah, aw) = r_0 + floor(aw / G_r)
c(ah, aw) = c_0 + s_r · ah + s_c · (aw mod G_c)

| Field  | Width       | Description                    |
|--------|-------------|--------------------------------|
| opcode | 3           | 111                            |
| G_r    | b_aw        | Row-sharing group size         |
| G_c    | b_aw        | Replication period             |
| r_0    | b_sta_total | Base WVN row index             |
| c_0    | b_sta_total | Base WVN column index          |
| s_r    | b_sta_total | Temporal stride per PE row     |
| s_c    | b_sta_rows  | Spatial stride within one period |
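The mapping equations above can be made concrete with a short sketch (our own code, not the MINISA reference implementation) that expands θ = (r_0, c_0, G_r, G_c, s_r, s_c) into the full PE-to-WVN mapping:

```python
def pe_to_wvn(theta, ah_dim, aw_dim):
    """Expand the six ExecuteMapping parameters into an AHxAW mapping.

    Implements r(ah, aw) = r0 + floor(aw / Gr) and
    c(ah, aw) = c0 + sr*ah + sc*(aw mod Gc) from the text.
    """
    r0, c0, gr, gc, sr, sc = theta
    mapping = {}
    for ah in range(ah_dim):
        for aw in range(aw_dim):
            r = r0 + aw // gr                  # WVN row shared by a group of Gr columns
            c = c0 + sr * ah + sc * (aw % gc)  # temporal stride per row + spatial stride
            mapping[(ah, aw)] = (r, c)
    return mapping

# Example: 2x4 array, one WVN row per pair of PE columns
m = pe_to_wvn((0, 0, 2, 2, 1, 1), ah_dim=2, aw_dim=4)
```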

ExecuteStreaming (opcode 011) — Configures operand streaming parameters. Paired with ExecuteMapping.

| Field    | Width      | Description                         |
|----------|------------|-------------------------------------|
| opcode   | 3          | 011                                 |
| dataflow | 1          | 0 = IO-S, 1 = WO-S                  |
| m_0      | b_str_rows | Base streaming row index            |
| s_m      | b_str_rows | Streaming row stride                |
| T        | b_str_rows | Number of streaming steps per column|
| vn_size  | b_vn_size  | Active VN height − 1                |

Load (opcode 101) / Store (opcode 100) — DMA load/store between off-chip HBM and on-chip buffer. Fixed 33 bits: opcode(3) + target(1) + hbm_addr(29).
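A minimal sketch of packing and unpacking the fixed 33-bit Load word, opcode(3) + target(1) + hbm_addr(29). The bit ordering (opcode in the high bits) is our assumption, not a statement of the actual MINISA encoding:

```python
LOAD_OPCODE = 0b101  # opcode for Load, per the instruction summary

def pack_load(target, hbm_addr):
    """Pack a 33-bit Load word; field order (opcode in MSBs) is assumed."""
    assert 0 <= target < 2 and 0 <= hbm_addr < (1 << 29)
    return (LOAD_OPCODE << 30) | (target << 29) | hbm_addr

def unpack_load(word):
    """Inverse of pack_load: returns (opcode, target, hbm_addr)."""
    return (word >> 30) & 0x7, (word >> 29) & 0x1, word & ((1 << 29) - 1)

word = pack_load(target=1, hbm_addr=0x1000)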

Activation (opcode 110) — Reserved. Fixed 11 bits: opcode(3) + tbd(8).

On-Chip SRAM Capacity

| AH | Total SRAM | Streaming (40%) | Stationary (40%) | Output (20%) |
|----|------------|-----------------|------------------|--------------|
| 4  | 4 MB       | 1.6 MB          | 1.6 MB           | 0.8 MB       |
| 8  | 16 MB      | 6.4 MB          | 6.4 MB           | 3.2 MB       |
| 16 | 64 MB      | 25.6 MB         | 25.6 MB          | 12.8 MB      |

Instruction Width Examples (bits)

| Instruction      | 4×4 | 4×16 | 4×64 | 8×8 | 8×32 | 8×128 | 16×16 | 16×64 | 16×256 |
|------------------|-----|------|------|-----|------|-------|-------|-------|--------|
| SetWVNLayout     | 42  | 40   | 38   | 43  | 41   | 39    | 44    | 42    | 40     |
| SetIVNLayout     | 42  | 40   | 38   | 43  | 41   | 39    | 44    | 42    | 40     |
| SetOVNLayout     | 42  | 40   | 38   | 43  | 41   | 39    | 44    | 42    | 40     |
| ExecuteMapping   | 81  | 83   | 85   | 86  | 88   | 90    | 91    | 93    | 95     |
| ExecuteStreaming | 57  | 51   | 45   | 58  | 52   | 46    | 59    | 53    | 47     |
| Load / Store     | 33  | 33   | 33   | 33  | 33   | 33    | 33    | 33    | 33     |
| Activation       | 11  | 11   | 11   | 11  | 11   | 11    | 11    | 11    | 11     |

6-Stage Compilation Pipeline

The MINISA compiler translates a GEMM workload C[M,N] = A[M,K] × B[K,N] into MINISA instructions through six stages:

  • Pre-stage — Dataflow Selection: Choose WO-S (weights stationary, search over M,K,N) or IO-S (inputs stationary, transposed to (N,K,M)). Auto mode runs both and picks the lower-latency result.
  • Stage 1 — Tile: Enumerate legal tiling choices (Mt, Kt, Nt) that fit on-chip buffers. Power-of-2 values, up to 512 candidates sorted by tile volume.
  • Stage 2 — Lower: Deterministic lowering into VN structure. Computes Kg = ceil(Kt/AH) K-groups, per-group vn_sizes, and n_col_types = ceil(Nt/AH).
  • Stage 3 — Group: Form VN groups VG(mt, kg, nt). Knob: WVN column stride ∈ {block, strided}.
  • Stage 4 — Combine: Combine VN groups sharing the same WVN set. Knob: duplication factor d ∈ [1, dmax]. Controls trade-off between M-parallelism and K-packing.
  • Stage 5 — Map: Derive (ExecuteMapping, ExecuteStreaming) parameter pairs. Knob: IVN distribution ∈ {interleaved, consecutive}.
  • Stage 6 — Layout: Search for bank-conflict-free buffer address permutation orders. Exhaustive (216 combos) or sequential mode. Falls back to Stage 3 with alternative design choices on failure.
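Stage 1's tile enumeration can be sketched as follows. The capacity check here is a simplified stand-in for the real buffer-fit test, and the function names are illustrative:

```python
def enumerate_tiles(M, K, N, fits, cap=512):
    """Power-of-2 (Mt, Kt, Nt) candidates that pass a buffer-capacity check,
    sorted largest-volume first and capped (Stage 1 caps at 512)."""
    def pow2_upto(x):
        return [1 << i for i in range(x.bit_length()) if (1 << i) <= x]
    cands = [(mt, kt, nt)
             for mt in pow2_upto(M)
             for kt in pow2_upto(K)
             for nt in pow2_upto(N)
             if fits(mt, kt, nt)]
    cands.sort(key=lambda t: t[0] * t[1] * t[2], reverse=True)
    return cands[:cap]

# Toy capacity check: combined operand footprint under a hypothetical budget
fits = lambda mt, kt, nt: mt * kt + kt * nt + mt * nt <= 4096
tiles = enumerate_tiles(64, 64, 64, fits)
```

Sorting largest-first matters because the search terminates at the first valid tile (see Search Optimizations).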

Execution Model

Per EM/ES pair timing (with vn_size):

  1. WVN load to PE registers: vn_size² cycles
  2. IVN streaming: T × vn_size cycles
  3. Pipeline fill: vn_size cycles
  4. BIRRD drain: 2⌈log2(AW)⌉ cycles

Inter-EM Pipelining: WVN load for the next EM overlaps with the current EM's IVN streaming. The effective period is em_period = max(nest_time, vn_size² − vn_size). For Kg consecutive EMs: C = vn_size_0² + Σ_i em_period_i + nest_time_last + birrd_drain.
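The timing model above can be transcribed into a short cycle estimator. This is our reading of the formulas (in particular, we pair each EM's streaming with the next EM's weight load), not the reference cycle model:

```python
import math

def em_sequence_cycles(vn_sizes, T, aw):
    """C = vn_size_0^2 + sum(em_period_i) + nest_time_last + birrd_drain,
    with em_period = max(nest_time, vn_size^2 - vn_size).

    Assumption: nest_time = T*vn_size (streaming) + vn_size (pipeline fill).
    """
    nest = lambda v: T * v + v                  # IVN streaming + pipeline fill
    birrd_drain = 2 * math.ceil(math.log2(aw))  # BIRRD network drain
    cycles = vn_sizes[0] ** 2                   # first WVN load is fully exposed
    for cur, nxt in zip(vn_sizes, vn_sizes[1:]):
        # next EM's WVN load overlaps the current EM's streaming
        cycles += max(nest(cur), nxt * nxt - nxt)
    cycles += nest(vn_sizes[-1]) + birrd_drain  # last EM drains unoverlapped
    return cycles

# Example: three EMs of VN height 4, T = 8 streaming steps, AW = 16
c = em_sequence_cycles([4, 4, 4], T=8, aw=16)
```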

Typical Instruction Sequence

SetOVNLayout  order, P_L0, P_L1, Q_L1     ; configure output buffer

for each K-step:
    SetIVNLayout  order, M_L0, M_L1, J_L1  ; configure streaming buffer
    Load          target=1, hbm_addr        ; DMA load inputs
    SetWVNLayout  order, N_L0, N_L1, K_L1  ; configure stationary buffer
    Load          target=0, hbm_addr        ; DMA load weights

    for each EM in tile:
        ExecuteMapping   G_r, G_c, r_0, c_0, s_r, s_c
        ExecuteStreaming  dataflow, m_0, s_m, T, vn_size

Store  target=0, hbm_addr                   ; DMA store output

Evaluation Results (450 workload-config pairs)

| Config | Avg Compression | Inst Reduction | Avg Utilization | Avg Latency |
|--------|-----------------|----------------|-----------------|-------------|
| 4×4    | 7,684x          | 24x            | 92.1%           | 40,045,201  |
| 4×16   | 7,608x          | 54x            | 89.2%           | 10,117,115  |
| 4×64   | 11,782x         | 184x           | 91.6%           | 2,510,426   |
| 8×8    | 12,407x         | 196x           | 80.1%           | 10,470,460  |
| 8×32   | 14,216x         | 600x           | 82.0%           | 2,617,074   |
| 8×128  | 30,250x         | 2,465x         | 82.4%           | 710,018     |
| 16×16  | 21,363x         | 2,176x         | 69.3%           | 2,833,058   |
| 16×64  | 34,634x         | 10,275x       | 69.3%           | 841,181     |
| 16×256 | 32,443x         | 39,681x       | 69.0%           | 215,097     |

Search Optimizations

The MINISA compilation search uses 11 optimizations to scale to large workloads (up to M=65536) across all 9 array configurations:

  • Analytical Fast-Path: Derives EM parameters directly from tile dimensions in O(n_EMs) time and O(1) memory, replacing O(Mt×Kg×n_col_types) Python object materialization. Reduces ~13 GB of peak memory to a negligible amount.
  • WVN Bank Conflict Deduplication: When Gr ≥ AW, changing r0 only shifts all addresses by a constant offset. Groups EMs by (Gr,Gc,sr,sc,c0) and checks one representative per group. 1024 EMs → 1 check.
  • IVN Single-Address Skip: When actual_kg × effective_reps ≤ 1, each PE column group reads at most one IVN address per step — no conflict possible regardless of layout.
  • OB Conflict Deduplication: OB conflict depends only on (effective_reps, n_col_types), not K-group index. Tracks checked tuples, skipping duplicates.
  • Factored Layout × Conflict Search: Pre-filters valid ordero values (6 → 2–4), then valid (orderw, orderi) pairs (36 → 4–8), then cross-products survivors. 216 full checks → ~42 total.
  • EM Parameter Signature Deduplication: Different dup_factor values often produce identical EM parameter sets. Hashes signatures into a seen set. AW=256: 256 evaluations → 8–12.
  • Tile Enumeration Bounds: Power-of-2 tile sizes with buffer capacity filtering, capped at 512. 4096³ → ~200 candidates.
  • Design Choice Fallback: When layout search fails, retries Stages 3–6 with (block/strided) × (interleaved/consecutive) combinations. Critical for rectangular configs (AW >> AH).
  • Early Termination: Stops after first valid tile (sorted largest-first by volume). 100–500 tiles → 1–5.
  • Dual-Dataflow with Transposition: Runs both WO-S and IO-S (transposed M,K,N → N,K,M), picks lower latency.
  • Batch Info Pre-computation: Computes per-batch metadata once, reused across WVN/IVN/OB conflict checks.
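The EM-parameter signature deduplication above is essentially a seen-set over canonical signatures. A minimal sketch, with illustrative names and data shapes of our own choosing:

```python
def dedup_em_candidates(candidates):
    """Keep one candidate per distinct EM parameter set.

    candidates: iterable of (dup_factor, em_params), where em_params is a
    tuple of (Gr, Gc, r0, c0, sr, sc) tuples. Different dup_factor values
    often produce identical EM sets, so each distinct set is evaluated once.
    """
    seen, unique = set(), []
    for dup, params in candidates:
        sig = tuple(sorted(params))  # canonical, order-independent signature
        if sig not in seen:
            seen.add(sig)
            unique.append((dup, params))
    return unique

cands = [
    (1, ((4, 2, 0, 0, 1, 1),)),
    (2, ((4, 2, 0, 0, 1, 1),)),  # same EM set as dup_factor=1: skipped
    (4, ((8, 2, 0, 0, 1, 1),)),
]
kept = dedup_em_candidates(cands)
```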

Together, these optimizations reduced 53 previously-failing benchmark points (K=2048–4096, OOM/timeout) to complete successfully while keeping all 450/450 points passing both ISA-level and Config-level verification.

Mapper

FEATHER's mapper for mapping and layout search...

LayoutLoop

FEATHER's analytical modeling with LayoutLoop...

RTL Simulation

RTL simulation flow for FEATHER(+)...