ASPLOS'26 Tutorial: RAIC

FEATHER

Reconfigurable AI Accelerator

FEATHER is a state-of-the-art reconfigurable accelerator architecture that enables low-cost switching between dataflows and layouts, efficiently supporting diverse workload patterns.

MINISA Interactive Visualizer

Use this interactive tool to explore FEATHER's MINISA instruction set. Configure hardware parameters, build ISA traces, and visualize the cycle-accurate NEST PE array pipeline, BIRRD reduction network, and VN buffer layouts. The tool resets on page refresh.


Microarchitecture

Details about FEATHER and FEATHER+ microarchitecture...

MINISA ISA 2.0 Specification

MINISA ISA 2.0 defines 8 variable-width instructions for the FEATHER+ reconfigurable accelerator. Each instruction begins with a 3-bit opcode. Layout and execute instructions scale as O(log AH + log AW) with array size, while DMA instructions are fixed at 33 bits. This compact encoding achieves 24x–39,681x instruction reduction over direct per-cycle micro-configuration across 9 hardware configurations.

Design Principles:

  • Parametric encoding: Six parameters θ = (r_0, c_0, G_r, G_c, s_r, s_c) generate the entire AH×AW PE-to-WVN mapping algebraically.
  • Buffer-aware sizing: Field widths are derived from on-chip buffer depths, not fixed constants.
  • Per-VN variable size: The vn_size field in ExecuteStreaming supports K dimensions not divisible by AH.
  • Dual dataflow: A single-bit dataflow field selects WO-S or IO-S. Under IO-S the compiler transposes (M,K,N) to (N,K,M).

Instruction Set Summary

| Opcode | Instruction      | Purpose                                      | Width     |
|--------|------------------|----------------------------------------------|-----------|
| 000    | SetWVNLayout     | Configure stationary buffer (weight) layout  | Variable  |
| 001    | SetIVNLayout     | Configure streaming buffer (input) layout    | Variable  |
| 010    | SetOVNLayout     | Configure output buffer layout               | Variable  |
| 011    | ExecuteStreaming | Configure operand streaming parameters       | Variable  |
| 100    | Store            | DMA store to off-chip memory                 | Fixed 33b |
| 101    | Load             | DMA load from off-chip memory                | Fixed 33b |
| 110    | Activation       | Activation function (reserved)               | Fixed 11b |
| 111    | ExecuteMapping   | Configure PE-to-WVN mapping                  | Variable  |

Buffer Architecture

MINISA operates on three on-chip SRAM buffers, each banked by AW (one bank per PE column):

| Buffer           | Controlled By | Stores                        | Default Alloc |
|------------------|---------------|-------------------------------|---------------|
| Streaming (str)  | SetIVNLayout  | Input activations (IVNs)      | 40%           |
| Stationary (sta) | SetWVNLayout  | Weights (WVNs)                | 40%           |
| Output (ob)      | SetOVNLayout  | Partial sums / outputs (OVNs) | 20%           |

Per-bank scalar depth: D_str = stream_bytes / (AW × in_bytes), and analogously for the stationary and output buffers. VN rows per bank = D / AH. Total VN capacity = vn_rows × AW.
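As a concrete illustration of the sizing arithmetic above (function and variable names are ours, not MINISA identifiers):

```python
def buffer_geometry(buffer_bytes, aw, ah, elem_bytes=1):
    """Per-bank scalar depth, VN rows per bank, and total VN capacity.

    Sketch of the formulas in the text: D = bytes / (AW * elem_bytes),
    vn_rows = D / AH, capacity = vn_rows * AW.
    """
    depth = buffer_bytes // (aw * elem_bytes)  # scalars per bank
    vn_rows = depth // ah                      # one VN spans AH scalars per bank
    total_vns = vn_rows * aw                   # VN capacity across all AW banks
    return depth, vn_rows, total_vns

# Example: 4x4 array, 1.6 MB streaming buffer, 1-byte inputs
depth, vn_rows, total_vns = buffer_geometry(1_600_000, aw=4, ah=4)
```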

Instruction Formats

SetWVNLayout (opcode 000) — Configures stationary buffer layout for weight VNs. Defines the 3-level address mapping WVN(k, n) → bank address.

| Field  | Width      | Description                               |
|--------|------------|-------------------------------------------|
| opcode | 3          | 000                                       |
| order  | 3          | Permutation order (0–5)                   |
| N_L0   | b_aw       | Inner N-dimension factor (number of banks)|
| N_L1   | b_sta_rows | Middle N-dimension factor                 |
| K_L1   | b_sta_rows | Outer K-dimension factor                  |

SetIVNLayout (opcode 001) — Configures streaming buffer layout for input VNs. Fields: opcode(3), order(3), M_L0(b_aw), M_L1(b_str_rows), J_L1(b_str_rows).

SetOVNLayout (opcode 010) — Configures output buffer layout for output VNs. Fields: opcode(3), order(3), P_L0(b_aw), P_L1(b_str_rows), Q_L1(b_str_rows).

ExecuteMapping (opcode 111) — Configures the PE-to-WVN mapping. The mapping equations:

r(ah, aw) = r_0 + floor(aw / G_r)
c(ah, aw) = c_0 + s_r · ah + s_c · (aw mod G_c)

| Field  | Width       | Description                    |
|--------|-------------|--------------------------------|
| opcode | 3           | 111                            |
| G_r    | b_aw        | Row-sharing group size         |
| G_c    | b_aw        | Replication period             |
| r_0    | b_sta_total | Base WVN row index             |
| c_0    | b_sta_total | Base WVN column index          |
| s_r    | b_sta_total | Temporal stride per PE row     |
| s_c    | b_sta_rows  | Spatial stride within one period |
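The mapping equations above can be made concrete with a short sketch (our own code, not the MINISA reference implementation) that expands θ = (r_0, c_0, G_r, G_c, s_r, s_c) into the full PE-to-WVN mapping:

```python
def pe_to_wvn(theta, ah_dim, aw_dim):
    """Expand the six ExecuteMapping parameters into an AHxAW mapping.

    Implements r(ah, aw) = r0 + floor(aw / Gr) and
    c(ah, aw) = c0 + sr*ah + sc*(aw mod Gc) from the text.
    """
    r0, c0, gr, gc, sr, sc = theta
    mapping = {}
    for ah in range(ah_dim):
        for aw in range(aw_dim):
            r = r0 + aw // gr                  # WVN row shared by a group of Gr columns
            c = c0 + sr * ah + sc * (aw % gc)  # temporal stride per row + spatial stride
            mapping[(ah, aw)] = (r, c)
    return mapping

# Example: 2x4 array, one WVN row per pair of PE columns
m = pe_to_wvn((0, 0, 2, 2, 1, 1), ah_dim=2, aw_dim=4)
```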

ExecuteStreaming (opcode 011) — Configures operand streaming parameters. Paired with ExecuteMapping.

| Field    | Width      | Description                         |
|----------|------------|-------------------------------------|
| opcode   | 3          | 011                                 |
| dataflow | 1          | 0 = IO-S, 1 = WO-S                  |
| m_0      | b_str_rows | Base streaming row index            |
| s_m      | b_str_rows | Streaming row stride                |
| T        | b_str_rows | Number of streaming steps per column|
| vn_size  | b_vn_size  | Active VN height − 1                |

Load (opcode 101) / Store (opcode 100) — DMA load/store between off-chip HBM and on-chip buffer. Fixed 33 bits: opcode(3) + target(1) + hbm_addr(29).
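A minimal sketch of packing and unpacking the fixed 33-bit Load word, opcode(3) + target(1) + hbm_addr(29). The bit ordering (opcode in the high bits) is our assumption, not a statement of the actual MINISA encoding:

```python
LOAD_OPCODE = 0b101  # opcode for Load, per the instruction summary

def pack_load(target, hbm_addr):
    """Pack a 33-bit Load word; field order (opcode in MSBs) is assumed."""
    assert 0 <= target < 2 and 0 <= hbm_addr < (1 << 29)
    return (LOAD_OPCODE << 30) | (target << 29) | hbm_addr

def unpack_load(word):
    """Inverse of pack_load: returns (opcode, target, hbm_addr)."""
    return (word >> 30) & 0x7, (word >> 29) & 0x1, word & ((1 << 29) - 1)

word = pack_load(target=1, hbm_addr=0x1000)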

Activation (opcode 110) — Reserved. Fixed 11 bits: opcode(3) + tbd(8).

On-Chip SRAM Capacity

| AH | Total SRAM | Streaming (40%) | Stationary (40%) | Output (20%) |
|----|------------|-----------------|------------------|--------------|
| 4  | 4 MB       | 1.6 MB          | 1.6 MB           | 0.8 MB       |
| 8  | 16 MB      | 6.4 MB          | 6.4 MB           | 3.2 MB       |
| 16 | 64 MB      | 25.6 MB         | 25.6 MB          | 12.8 MB      |

Instruction Width Examples (bits)

| Instruction      | 4×4 | 4×16 | 4×64 | 8×8 | 8×32 | 8×128 | 16×16 | 16×64 | 16×256 |
|------------------|-----|------|------|-----|------|-------|-------|-------|--------|
| SetWVNLayout     | 42  | 40   | 38   | 43  | 41   | 39    | 44    | 42    | 40     |
| SetIVNLayout     | 42  | 40   | 38   | 43  | 41   | 39    | 44    | 42    | 40     |
| SetOVNLayout     | 42  | 40   | 38   | 43  | 41   | 39    | 44    | 42    | 40     |
| ExecuteMapping   | 81  | 83   | 85   | 86  | 88   | 90    | 91    | 93    | 95     |
| ExecuteStreaming | 57  | 51   | 45   | 58  | 52   | 46    | 59    | 53    | 47     |
| Load / Store     | 33  | 33   | 33   | 33  | 33   | 33    | 33    | 33    | 33     |
| Activation       | 11  | 11   | 11   | 11  | 11   | 11    | 11    | 11    | 11     |

6-Stage Compilation Pipeline

The MINISA compiler translates a GEMM workload C[M,N] = A[M,K] × B[K,N] into MINISA instructions through six stages:

  • Pre-stage — Dataflow Selection: Choose WO-S (weights stationary, search over M,K,N) or IO-S (inputs stationary, transposed to (N,K,M)). Auto mode runs both and picks the lower-latency result.
  • Stage 1 — Tile: Enumerate legal tiling choices (Mt, Kt, Nt) that fit on-chip buffers. Power-of-2 values, up to 512 candidates sorted by tile volume.
  • Stage 2 — Lower: Deterministic lowering into VN structure. Computes Kg = ceil(Kt/AH) K-groups, per-group vn_sizes, and n_col_types = ceil(Nt/AH).
  • Stage 3 — Group: Form VN groups VG(mt, kg, nt). Knob: WVN column stride ∈ {block, strided}.
  • Stage 4 — Combine: Combine VN groups sharing the same WVN set. Knob: duplication factor d ∈ [1, dmax]. Controls trade-off between M-parallelism and K-packing.
  • Stage 5 — Map: Derive (ExecuteMapping, ExecuteStreaming) parameter pairs. Knob: IVN distribution ∈ {interleaved, consecutive}.
  • Stage 6 — Layout: Search for bank-conflict-free buffer address permutation orders. Exhaustive (216 combos) or sequential mode. Falls back to Stage 3 with alternative design choices on failure.
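Stage 1's tile enumeration can be sketched as follows. The capacity check here is a simplified stand-in for the real buffer-fit test, and the function names are illustrative:

```python
def enumerate_tiles(M, K, N, fits, cap=512):
    """Power-of-2 (Mt, Kt, Nt) candidates that pass a buffer-capacity check,
    sorted largest-volume first and capped (Stage 1 caps at 512)."""
    def pow2_upto(x):
        return [1 << i for i in range(x.bit_length()) if (1 << i) <= x]
    cands = [(mt, kt, nt)
             for mt in pow2_upto(M)
             for kt in pow2_upto(K)
             for nt in pow2_upto(N)
             if fits(mt, kt, nt)]
    cands.sort(key=lambda t: t[0] * t[1] * t[2], reverse=True)
    return cands[:cap]

# Toy capacity check: combined operand footprint under a hypothetical budget
fits = lambda mt, kt, nt: mt * kt + kt * nt + mt * nt <= 4096
tiles = enumerate_tiles(64, 64, 64, fits)
```

Sorting largest-first matters because the search terminates at the first valid tile (see Search Optimizations).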

Execution Model

Per EM/ES pair timing (with vn_size):

  1. WVN load to PE registers: vn_size² cycles
  2. IVN streaming: T × vn_size cycles
  3. Pipeline fill: vn_size cycles
  4. BIRRD drain: 2⌈log2(AW)⌉ cycles

Inter-EM Pipelining: WVN load for the next EM overlaps with the current EM's IVN streaming. The effective period is em_period = max(nest_time, vn_size² − vn_size). For Kg consecutive EMs: C = vn_size_0² + Σ_i em_period_i + nest_time_last + birrd_drain.
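The timing model above can be transcribed into a short cycle estimator. This is our reading of the formulas (in particular, we pair each EM's streaming with the next EM's weight load), not the reference cycle model:

```python
import math

def em_sequence_cycles(vn_sizes, T, aw):
    """C = vn_size_0^2 + sum(em_period_i) + nest_time_last + birrd_drain,
    with em_period = max(nest_time, vn_size^2 - vn_size).

    Assumption: nest_time = T*vn_size (streaming) + vn_size (pipeline fill).
    """
    nest = lambda v: T * v + v                  # IVN streaming + pipeline fill
    birrd_drain = 2 * math.ceil(math.log2(aw))  # BIRRD network drain
    cycles = vn_sizes[0] ** 2                   # first WVN load is fully exposed
    for cur, nxt in zip(vn_sizes, vn_sizes[1:]):
        # next EM's WVN load overlaps the current EM's streaming
        cycles += max(nest(cur), nxt * nxt - nxt)
    cycles += nest(vn_sizes[-1]) + birrd_drain  # last EM drains unoverlapped
    return cycles

# Example: three EMs of VN height 4, T = 8 streaming steps, AW = 16
c = em_sequence_cycles([4, 4, 4], T=8, aw=16)
```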

Typical Instruction Sequence

SetOVNLayout  order, P_L0, P_L1, Q_L1     ; configure output buffer

for each K-step:
    SetIVNLayout  order, M_L0, M_L1, J_L1  ; configure streaming buffer
    Load          target=1, hbm_addr        ; DMA load inputs
    SetWVNLayout  order, N_L0, N_L1, K_L1  ; configure stationary buffer
    Load          target=0, hbm_addr        ; DMA load weights

    for each EM in tile:
        ExecuteMapping   G_r, G_c, r_0, c_0, s_r, s_c
        ExecuteStreaming  dataflow, m_0, s_m, T, vn_size

Store  target=0, hbm_addr                   ; DMA store output

Evaluation Results (450 workload-config pairs)

| Config | Avg Compression | Inst Reduction | Avg Utilization | Avg Latency |
|--------|-----------------|----------------|-----------------|-------------|
| 4×4    | 7,684x          | 24x            | 92.1%           | 40,045,201  |
| 4×16   | 7,608x          | 54x            | 89.2%           | 10,117,115  |
| 4×64   | 11,782x         | 184x           | 91.6%           | 2,510,426   |
| 8×8    | 12,407x         | 196x           | 80.1%           | 10,470,460  |
| 8×32   | 14,216x         | 600x           | 82.0%           | 2,617,074   |
| 8×128  | 30,250x         | 2,465x         | 82.4%           | 710,018     |
| 16×16  | 21,363x         | 2,176x         | 69.3%           | 2,833,058   |
| 16×64  | 34,634x         | 10,275x       | 69.3%           | 841,181     |
| 16×256 | 32,443x         | 39,681x       | 69.0%           | 215,097     |

Search Optimizations

The MINISA compilation search uses 11 optimizations to scale to large workloads (up to M=65536) across all 9 array configurations:

  • Analytical Fast-Path: Derives EM parameters directly from tile dimensions in O(n_EMs) time and O(1) memory, replacing O(Mt×Kg×n_col_types) Python object materialization. Reduces ~13 GB of peak memory to a negligible amount.
  • WVN Bank Conflict Deduplication: When Gr ≥ AW, changing r0 only shifts all addresses by a constant offset. Groups EMs by (Gr,Gc,sr,sc,c0) and checks one representative per group. 1024 EMs → 1 check.
  • IVN Single-Address Skip: When actual_kg × effective_reps ≤ 1, each PE column group reads at most one IVN address per step — no conflict possible regardless of layout.
  • OB Conflict Deduplication: OB conflict depends only on (effective_reps, n_col_types), not K-group index. Tracks checked tuples, skipping duplicates.
  • Factored Layout × Conflict Search: Pre-filters valid ordero values (6 → 2–4), then valid (orderw, orderi) pairs (36 → 4–8), then cross-products survivors. 216 full checks → ~42 total.
  • EM Parameter Signature Deduplication: Different dup_factor values often produce identical EM parameter sets. Hashes signatures into a seen set. AW=256: 256 evaluations → 8–12.
  • Tile Enumeration Bounds: Power-of-2 tile sizes with buffer capacity filtering, capped at 512. 4096³ → ~200 candidates.
  • Design Choice Fallback: When layout search fails, retries Stages 3–6 with (block/strided) × (interleaved/consecutive) combinations. Critical for rectangular configs (AW >> AH).
  • Early Termination: Stops after first valid tile (sorted largest-first by volume). 100–500 tiles → 1–5.
  • Dual-Dataflow with Transposition: Runs both WO-S and IO-S (transposed M,K,N → N,K,M), picks lower latency.
  • Batch Info Pre-computation: Computes per-batch metadata once, reused across WVN/IVN/OB conflict checks.
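The EM-parameter signature deduplication above is essentially a seen-set over canonical signatures. A minimal sketch, with illustrative names and data shapes of our own choosing:

```python
def dedup_em_candidates(candidates):
    """Keep one candidate per distinct EM parameter set.

    candidates: iterable of (dup_factor, em_params), where em_params is a
    tuple of (Gr, Gc, r0, c0, sr, sc) tuples. Different dup_factor values
    often produce identical EM sets, so each distinct set is evaluated once.
    """
    seen, unique = set(), []
    for dup, params in candidates:
        sig = tuple(sorted(params))  # canonical, order-independent signature
        if sig not in seen:
            seen.add(sig)
            unique.append((dup, params))
    return unique

cands = [
    (1, ((4, 2, 0, 0, 1, 1),)),
    (2, ((4, 2, 0, 0, 1, 1),)),  # same EM set as dup_factor=1: skipped
    (4, ((8, 2, 0, 0, 1, 1),)),
]
kept = dedup_em_candidates(cands)
```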

Together, these optimizations reduced 53 previously-failing benchmark points (K=2048–4096, OOM/timeout) to complete successfully while keeping all 450/450 points passing both ISA-level and Config-level verification.

Mapper

FEATHER's mapper for mapping and layout search...

LayoutLoop

FEATHER's analytical modeling with LayoutLoop...

RTL Simulation

RTL simulation flow for FEATHER(+)...