FEATHER
Reconfigurable AI Accelerator
FEATHER is a state-of-the-art reconfigurable accelerator architecture that enables low-cost switching between dataflows and layouts, efficiently supporting diverse workload patterns.
MINISA Interactive Visualizer
Use this interactive tool to explore FEATHER's MINISA instruction set. Configure hardware parameters, build ISA traces, and visualize the cycle-accurate NEST PE array pipeline, BIRRD reduction network, and VN buffer layouts. The tool resets on page refresh.
Microarchitecture
Details about FEATHER and FEATHER+ microarchitecture...
MINISA ISA 2.0 Specification
MINISA ISA 2.0 defines 8 variable-width instructions for the FEATHER+ reconfigurable accelerator. Each instruction begins with a 3-bit opcode. Layout and execute instructions scale as O(log AH + log AW) with array size, while DMA instructions are fixed at 33 bits. This compact encoding achieves 24x–39,681x instruction reduction over direct per-cycle micro-configuration across 9 hardware configurations.
Design Principles:
- Parametric encoding: Six parameters θ = (r0, c0, Gr, Gc, sr, sc) generate the entire AH×AW PE-to-WVN mapping algebraically.
- Buffer-aware sizing: Field widths are derived from on-chip buffer depths, not fixed constants.
- Per-VN variable size: The vn_size field in ExecuteStreaming supports K dimensions not divisible by AH.
- Dual dataflow: A single-bit dataflow field selects WO-S or IO-S. Under IO-S the compiler transposes (M,K,N) to (N,K,M).
Instruction Set Summary
| Opcode | Instruction | Purpose | Width |
|---|---|---|---|
| 000 | SetWVNLayout | Configure stationary buffer (weight) layout | Variable |
| 001 | SetIVNLayout | Configure streaming buffer (input) layout | Variable |
| 010 | SetOVNLayout | Configure output buffer layout | Variable |
| 011 | ExecuteStreaming | Configure operand streaming parameters | Variable |
| 100 | Store | DMA store to off-chip memory | Fixed 33b |
| 101 | Load | DMA load from off-chip memory | Fixed 33b |
| 110 | Activation | Activation function (reserved) | Fixed 11b |
| 111 | ExecuteMapping | Configure PE-to-WVN mapping | Variable |
Buffer Architecture
MINISA operates on three on-chip SRAM buffers, each banked by AW (one bank per PE column):
| Buffer | Controlled By | Stores | Default Alloc |
|---|---|---|---|
| Streaming (str) | SetIVNLayout | Input activations (IVNs) | 40% |
| Stationary (sta) | SetWVNLayout | Weights (WVNs) | 40% |
| Output (ob) | SetOVNLayout | Partial sums / outputs (OVNs) | 20% |
Per-bank scalar depth: D_str = stream_bytes / (AW × in_bytes), and analogously for the stationary and output buffers. VN row count per bank = D / AH. Total VN capacity = vn_rows × AW.
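The sizing arithmetic above can be sketched directly. This is a minimal illustration assuming 1-byte operands (actual element widths are configuration-dependent):

```python
def buffer_geometry(buf_bytes, AW, AH, elem_bytes=1):
    """Derive per-bank depth and VN capacity for one on-chip buffer.
    elem_bytes=1 is an assumed operand width, not fixed by the spec."""
    D = buf_bytes // (AW * elem_bytes)   # per-bank scalar depth
    vn_rows = D // AH                    # VN rows per bank
    capacity = vn_rows * AW              # total VN capacity across all banks
    return D, vn_rows, capacity

# Example: 4x4 array with a 1.6 MB streaming buffer (40% of 4 MB)
D, vn_rows, cap = buffer_geometry(1_600_000, AW=4, AH=4)
```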
Instruction Formats
SetWVNLayout (opcode 000) — Configures stationary buffer layout for weight VNs. Defines the 3-level address mapping WVN(k, n) → bank address.
| Field | Width | Description |
|---|---|---|
| opcode | 3 | 000 |
| order | 3 | Permutation order (0–5) |
| N_L0 | baw | Inner N-dimension factor (number of banks) |
| N_L1 | bsta_rows | Middle N-dimension factor |
| K_L1 | bsta_rows | Outer K-dimension factor |
SetIVNLayout (opcode 001) — Configures streaming buffer layout for input VNs. Fields: opcode(3), order(3), M_L0(baw), M_L1(bstr_rows), J_L1(bstr_rows).
SetOVNLayout (opcode 010) — Configures output buffer layout for output VNs. Fields: opcode(3), order(3), P_L0(baw), P_L1(bstr_rows), Q_L1(bstr_rows).
ExecuteMapping (opcode 111) — Configures the PE-to-WVN mapping. The mapping equations:
r(ah, aw) = r0 + floor(aw / Gr)
c(ah, aw) = c0 + sr · ah + sc · (aw mod Gc)
| Field | Width | Description |
|---|---|---|
| opcode | 3 | 111 |
| G_r | baw | Row-sharing group size |
| G_c | baw | Replication period |
| r_0 | bsta_total | Base WVN row index |
| c_0 | bsta_total | Base WVN column index |
| s_r | bsta_total | Temporal stride per PE row |
| s_c | bsta_rows | Spatial stride within one period |
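The six ExecuteMapping parameters generate the entire PE-to-WVN mapping algebraically. A direct transcription of the two equations (parameter names follow the field table; this is a sketch, not the RTL decoder):

```python
def pe_to_wvn(ah, aw, r0, c0, Gr, Gc, sr, sc):
    """MINISA ExecuteMapping: PE coordinate (ah, aw) -> WVN (row, col)."""
    r = r0 + aw // Gr                  # row-sharing groups of size Gr
    c = c0 + sr * ah + sc * (aw % Gc)  # temporal stride per PE row + spatial stride
    return r, c

def full_mapping(AH, AW, theta):
    """Materialize the AH x AW mapping from theta = (r0, c0, Gr, Gc, sr, sc)."""
    r0, c0, Gr, Gc, sr, sc = theta
    return [[pe_to_wvn(ah, aw, r0, c0, Gr, Gc, sr, sc) for aw in range(AW)]
            for ah in range(AH)]
```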
ExecuteStreaming (opcode 011) — Configures operand streaming parameters. Paired with ExecuteMapping.
| Field | Width | Description |
|---|---|---|
| opcode | 3 | 011 |
| dataflow | 1 | 0 = IO-S, 1 = WO-S |
| m_0 | bstr_rows | Base streaming row index |
| s_m | bstr_rows | Streaming row stride |
| T | bstr_rows | Number of streaming steps per column |
| vn_size | bvn_size | Active VN height − 1 |
Load (opcode 101) / Store (opcode 100) — DMA load/store between off-chip HBM and on-chip buffer. Fixed 33 bits: opcode(3) + target(1) + hbm_addr(29).
Activation (opcode 110) — Reserved. Fixed 11 bits: opcode(3) + tbd(8).
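The fixed-width DMA encoding can be packed with plain bit arithmetic. A sketch assuming MSB-first field order (opcode in the top 3 bits), which the spec text does not pin down:

```python
def encode_load(target, hbm_addr):
    """Pack a MINISA Load (opcode 101) into its fixed 33-bit word:
    opcode(3) | target(1) | hbm_addr(29). MSB-first order is an assumption."""
    assert 0 <= target < (1 << 1)
    assert 0 <= hbm_addr < (1 << 29)
    return (0b101 << 30) | (target << 29) | hbm_addr
```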
On-Chip SRAM Capacity
| AH | Total SRAM | Streaming (40%) | Stationary (40%) | Output (20%) |
|---|---|---|---|---|
| 4 | 4 MB | 1.6 MB | 1.6 MB | 0.8 MB |
| 8 | 16 MB | 6.4 MB | 6.4 MB | 3.2 MB |
| 16 | 64 MB | 25.6 MB | 25.6 MB | 12.8 MB |
Instruction Width Examples (bits)
| Instruction | 4×4 | 4×16 | 4×64 | 8×8 | 8×32 | 8×128 | 16×16 | 16×64 | 16×256 |
|---|---|---|---|---|---|---|---|---|---|
| SetWVNLayout | 42 | 40 | 38 | 43 | 41 | 39 | 44 | 42 | 40 |
| SetIVNLayout | 42 | 40 | 38 | 43 | 41 | 39 | 44 | 42 | 40 |
| SetOVNLayout | 42 | 40 | 38 | 43 | 41 | 39 | 44 | 42 | 40 |
| ExecuteMapping | 81 | 83 | 85 | 86 | 88 | 90 | 91 | 93 | 95 |
| ExecuteStreaming | 57 | 51 | 45 | 58 | 52 | 46 | 59 | 53 | 47 |
| Load / Store | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 |
| Activation | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 | 11 |
6-Stage Compilation Pipeline
The MINISA compiler translates a GEMM workload C[M,N] = A[M,K] × B[K,N] into MINISA instructions through six stages:
- Pre-stage — Dataflow Selection: Choose WO-S (weights stationary, search over M,K,N) or IO-S (inputs stationary, transposes to N,K,M). Auto mode runs both and picks lower latency.
- Stage 1 — Tile: Enumerate legal tiling choices (Mt, Kt, Nt) that fit on-chip buffers. Power-of-2 values, up to 512 candidates sorted by tile volume.
- Stage 2 — Lower: Deterministic lowering into VN structure. Computes Kg = ceil(Kt/AH) K-groups, per-group vn_sizes, and n_col_types = ceil(Nt/AH).
- Stage 3 — Group: Form VN groups VG(mt, kg, nt). Knob: WVN column stride ∈ {block, strided}.
- Stage 4 — Combine: Combine VN groups sharing the same WVN set. Knob: duplication factor d ∈ [1, dmax]. Controls trade-off between M-parallelism and K-packing.
- Stage 5 — Map: Derive (ExecuteMapping, ExecuteStreaming) parameter pairs. Knob: IVN distribution ∈ {interleaved, consecutive}.
- Stage 6 — Layout: Search for bank-conflict-free buffer address permutation orders. Exhaustive (216 combos) or sequential mode. Falls back to Stage 3 with alternative design choices on failure.
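Stage 1's enumeration can be sketched as follows: generate power-of-2 candidates, keep those whose operand tiles fit the three buffers, and sort largest-first by volume. Buffer capacities here are in element counts (assumed 1 byte/element); the real compiler's legality checks are richer:

```python
from itertools import product

def enumerate_tiles(M, K, N, str_cap, sta_cap, ob_cap, limit=512):
    """Stage 1 sketch: legal power-of-2 tiles (Mt, Kt, Nt), sorted by volume."""
    pow2 = lambda x: [1 << i for i in range(x.bit_length())]  # 1, 2, ..., <= x
    cands = []
    for Mt, Kt, Nt in product(pow2(M), pow2(K), pow2(N)):
        # Inputs, weights, and outputs must each fit their buffer
        if Mt * Kt <= str_cap and Kt * Nt <= sta_cap and Mt * Nt <= ob_cap:
            cands.append((Mt, Kt, Nt))
    cands.sort(key=lambda t: t[0] * t[1] * t[2], reverse=True)
    return cands[:limit]
```

Sorting largest-first pairs naturally with the early-termination optimization: the first tile that survives the later stages is also the one with the best reuse.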
Execution Model
Per EM/ES pair timing (with vn_size):
- WVN load to PE registers: vn_size² cycles
- IVN streaming: T × vn_size cycles
- Pipeline fill: vn_size cycles
- BIRRD drain: 2⌈log2(AW)⌉ cycles
Inter-EM Pipelining: WVN load for the next EM overlaps with the current EM's IVN streaming. The effective period: em_period = max(nest_time, vn_size² − vn_size). For Kg consecutive EMs: C = vn_size_0² + Σ_i em_period_i + nest_time_last + birrd_drain.
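For the uniform-vn_size case, the timing model above reduces to a few lines. A sketch assuming nest_time = T × vn_size + vn_size (streaming plus pipeline fill) and Kg − 1 overlapped periods before the last EM drains:

```python
import math

def pipeline_cycles(vn_size, T, Kg, AW):
    """Latency sketch for Kg pipelined EM/ES pairs with uniform vn_size."""
    nest = T * vn_size + vn_size                    # streaming + pipeline fill
    em_period = max(nest, vn_size**2 - vn_size)     # next WVN load hides here
    return (vn_size**2                              # first WVN load is exposed
            + (Kg - 1) * em_period                  # overlapped middle EMs
            + nest                                  # last EM streams to completion
            + 2 * math.ceil(math.log2(AW)))         # BIRRD drain
```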
Typical Instruction Sequence
SetOVNLayout order, P_L0, P_L1, Q_L1 ; configure output buffer
for each K-step:
SetIVNLayout order, M_L0, M_L1, J_L1 ; configure streaming buffer
Load target=1, hbm_addr ; DMA load inputs
SetWVNLayout order, N_L0, N_L1, K_L1 ; configure stationary buffer
Load target=0, hbm_addr ; DMA load weights
for each EM in tile:
ExecuteMapping G_r, G_c, r_0, c_0, s_r, s_c
ExecuteStreaming dataflow, m_0, s_m, T, vn_size
Store target=0, hbm_addr ; DMA store output
Evaluation Results (450 workload-config pairs)
| Config | Avg Compression | Inst Reduction | Avg Utilization | Avg Latency |
|---|---|---|---|---|
| 4×4 | 7,684x | 24x | 92.1% | 40,045,201 |
| 4×16 | 7,608x | 54x | 89.2% | 10,117,115 |
| 4×64 | 11,782x | 184x | 91.6% | 2,510,426 |
| 8×8 | 12,407x | 196x | 80.1% | 10,470,460 |
| 8×32 | 14,216x | 600x | 82.0% | 2,617,074 |
| 8×128 | 30,250x | 2,465x | 82.4% | 710,018 |
| 16×16 | 21,363x | 2,176x | 69.3% | 2,833,058 |
| 16×64 | 34,634x | 10,275x | 69.3% | 841,181 |
| 16×256 | 32,443x | 39,681x | 69.0% | 215,097 |
Search Optimizations
The MINISA compilation search uses 11 optimizations to scale to large workloads (up to M=65536) across all 9 array configurations:
- Analytical Fast-Path: Derives EM parameters directly from tile dimensions in O(n_EMs) time and O(1) memory, replacing O(Mt×Kg×n_col_types) Python object materialization. Reduces ~13 GB to negligible.
- WVN Bank Conflict Deduplication: When Gr ≥ AW, changing r0 only shifts all addresses by a constant offset. Groups EMs by (Gr,Gc,sr,sc,c0) and checks one representative per group. 1024 EMs → 1 check.
- IVN Single-Address Skip: When actual_kg × effective_reps ≤ 1, each PE column group reads at most one IVN address per step — no conflict possible regardless of layout.
- OB Conflict Deduplication: OB conflict depends only on (effective_reps, n_col_types), not K-group index. Tracks checked tuples, skipping duplicates.
- Factored Layout × Conflict Search: Pre-filters valid ordero values (6 → 2–4), then valid (orderw, orderi) pairs (36 → 4–8), then cross-products survivors. 216 full checks → ~42 total.
- EM Parameter Signature Deduplication: Different dup_factor values often produce identical EM parameter sets. Hashes signatures into a seen set. AW=256: 256 evaluations → 8–12.
- Tile Enumeration Bounds: Power-of-2 tile sizes with buffer capacity filtering, capped at 512. 4096³ → ~200 candidates.
- Design Choice Fallback: When layout search fails, retries Stages 3–6 with (block/strided) × (interleaved/consecutive) combinations. Critical for rectangular configs (AW >> AH).
- Early Termination: Stops after first valid tile (sorted largest-first by volume). 100–500 tiles → 1–5.
- Dual-Dataflow with Transposition: Runs both WO-S and IO-S (transposed M,K,N → N,K,M), picks lower latency.
- Batch Info Pre-computation: Computes per-batch metadata once, reused across WVN/IVN/OB conflict checks.
Together, these optimizations reduced 53 previously-failing benchmark points (K=2048–4096, OOM/timeout) to complete successfully while keeping all 450/450 points passing both ISA-level and Config-level verification.
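The signature-deduplication pattern used by several of these optimizations (EM parameters, OB tuples) can be sketched generically. `derive_em_params` is a hypothetical stand-in for the compiler's parameter derivation:

```python
def dedup_by_signature(candidates, derive_em_params):
    """Evaluate only one candidate per distinct derived-parameter signature.
    Different knob values (e.g. dup_factor) often yield identical EM sets."""
    seen, unique = set(), []
    for cand in candidates:
        sig = tuple(sorted(derive_em_params(cand)))  # hashable canonical form
        if sig not in seen:
            seen.add(sig)
            unique.append(cand)
    return unique
```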
Mapper
FEATHER's mapper for mapping and layout search...
LayoutLoop
FEATHER's analytical modeling with LayoutLoop...
RTL Simulation
RTL simulation flow for FEATHER(+)...