



# MAERI-FPGA: Enabling HW Design Space Exploration on Real FPGA Hardware Platform

# Tushar Krishna

Associate Professor, School of ECE, Georgia Institute of Technology

ICS 2022 Tutorial

June 27, 2022

Email: tushar@ece.gatech.edu

### Presenters



**Tushar Krishna**, Associate Professor, Georgia Tech

**Jianming Tong**, PhD Student, Georgia Tech

#### **Other Contributors**

- Yangyu Chen
- Yue Pan
- Abhimanyu Bambhaniya
- Taekyung Heo
- Hyoukjun Kwon

Acknowledgment: Some of this work was done as part of the ARIAA Co-Design Center (Georgia Tech, PNNL, Sandia National Labs).

## Schedule (EST)

| Time slot      | Topic                               | Presenter |
|----------------|-------------------------------------|-----------|
| 14:00 to 14:30 | Introduction to DNN Accelerators    | Tushar    |
| 14:30 to 14:40 | Break                               |           |
| 14:40 to 15:10 | MAERI2.0 Architecture and Tool Flow | Jianming  |
| 15:10 to 15:30 | Demo on FPGA                        | Jianming  |

Brief Q/A at the end of each talk.

Please feel free to interrupt and ask questions, or use the chat.

Attention: Tutorial is being recorded!

https://maeri-project.github.io/tutorials/ics-2022

### **Deep Learning Applications**

### "AI is the new electricity" – Andrew Ng

#### **Object Detection**



#### Image Segmentation



#### **Medical Imaging**



#### Speech Recognition



#### **Text to Speech**

#### Recommendations

#### Games



## **Computation Platforms in Deep Learning**



## **Challenges in Design and Deployment**



### Outline

- Background on DNNs
- DNN Accelerators
- Dataflow and Mapping
- Flexibility

### What is a Deep Neural Network?

![](_page_8_Figure_1.jpeg)

### Modern Deep Learning Landscape

![](_page_9_Figure_1.jpeg)

### Computations in a DNN $\rightarrow$ Linear Algebra

![](_page_10_Figure_1.jpeg)

![](_page_10_Figure_2.jpeg)

Neuron => Vector x Vector

### Computations in a DNN $\rightarrow$ Linear Algebra

![](_page_11_Figure_1.jpeg)

### Computations in a DNN $\rightarrow$ Linear Algebra

![](_page_12_Figure_1.jpeg)
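A minimal NumPy sketch of the "Neuron => Vector x Vector" view, and of how it extends to a whole layer (matrix x vector) and to a batch of inputs (matrix x matrix). The sizes and variable names are illustrative assumptions, not taken from the slides, and the nonlinear activation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the slides).
num_inputs, num_neurons, batch = 4, 3, 2

x = rng.standard_normal(num_inputs)                 # one input vector
W = rng.standard_normal((num_neurons, num_inputs))  # one weight row per neuron
X = rng.standard_normal((num_inputs, batch))        # a batch of input columns

# Neuron: vector x vector (dot product of one weight row with the input).
neuron_out = W[0] @ x

# Layer: matrix x vector (every neuron applied to the same input).
layer_out = W @ x

# Batch of inputs: matrix x matrix (GEMM).
batch_out = W @ X
```

The first entry of `layer_out` equals `neuron_out`, and each column of `batch_out` is the layer output for one input in the batch.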

### **Convolutional Neural Networks**

![](_page_13_Figure_1.jpeg)

**Shared Weights:** All neurons use the *same* filter weights

![](_page_14_Figure_1.jpeg)

![](_page_15_Figure_1.jpeg)

![](_page_16_Figure_1.jpeg)

![](_page_17_Figure_1.jpeg)

MAERI-FPGA @ ICS 2022

## Loop Nest Representation

7<sup>th</sup> (outermost) loop used during training
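The loop nest itself is a figure in the original slides. As a hedged sketch, a stride-1, unpadded convolution can be written as seven nested loops. The dimension names (N, K, C, P, Q, R, S) follow a common convention and are an assumption here, as is the reading that the seventh (outermost) loop is the batch dimension N.

```python
import numpy as np

def conv2d_loop_nest(inputs, weights):
    """Direct convolution as a 7-deep loop nest (stride 1, no padding).

    inputs  : (N, C, H, W)  batch, input channels, height, width
    weights : (K, C, R, S)  output channels, input channels, filter height, filter width
    returns : (N, K, P, Q)  with P = H - R + 1 and Q = W - S + 1
    """
    N, C, H, W = inputs.shape
    K, C_w, R, S = weights.shape
    assert C == C_w, "input and filter channel counts must match"
    P, Q = H - R + 1, W - S + 1
    outputs = np.zeros((N, K, P, Q), dtype=inputs.dtype)
    for n in range(N):                          # batch (assumed 7th, outermost loop)
        for k in range(K):                      # output channels
            for c in range(C):                  # input channels
                for p in range(P):              # output rows
                    for q in range(Q):          # output columns
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter columns
                                outputs[n, k, p, q] += (
                                    inputs[n, c, p + r, q + s] * weights[k, c, r, s]
                                )
    return outputs
```

Any permutation of these loops computes the same result; the order only changes which operands stay in place between consecutive multiply-accumulates.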

## **Challenges with DNN Computations**

- Millions of Parameters (i.e., weights)
- Billions of computations

| DNN Topology     | Number of Weights |
|------------------|-------------------|
| AlexNet (2012)   | 3.98M             |
| VGGNet-16 (2014) | 28.25M            |
| GoogLeNet (2015) | 6.77M             |
| ResNet-50 (2016) | 23M               |
| DLRM (2019)      | 540M              |
| Megatron (2019)  | 8.3B              |

![](_page_19_Figure_4.jpeg)

![](_page_19_Figure_5.jpeg)

**Need lots of parallel compute.** This makes CPUs inefficient.

### Outline

- Background on DNNs
- DNN Accelerators
- Dataflow and Mapping
- Flexibility

### The DL Inference Accelerator Zoo

![](_page_21_Figure_1.jpeg)

## Spatial (or Dataflow) Accelerators

- Millions of Parameters (i.e., weights)
  - Billions of computations
  - Spread computations across hundreds of ALUs
- Heavy data movement
  - Reuse data within the ALU array via local memories and direct communication

Figure\* labels: Memory Hierarchy, Control, ALU array, Register/FIFO/SRAM, Processing Element (PE)

Tushar Krishna | School of ECE | Georgia Institute of Technology

\*Y. Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA, 2016.

## Types of Algorithmic Data Reuse in DNNs

![](_page_23_Figure_1.jpeg)

### Hardware structures to exploit reuse

![](_page_24_Figure_1.jpeg)


## Mapping and Dataflow

#### 7-dimensional network layer

![](_page_25_Figure_2.jpeg)

- Goal of Mapping: translate algorithmic data reuse to HW data reuse
- Precise Definition of Mapping: Fine-grained schedule of computations within DNN accelerators
  - **Computation Order** (slowest tensor dimension often called "stationary")
  - Parallelization Strategy (which loops to unroll spatially)
  - Tiling Strategy (number of levels of memory hierarchy)
  - Tile Sizes
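To make the four mapping components concrete, here is a rough sketch on a GEMM rather than the full 7-dimensional layer. The function name `tiled_gemm` and its parameters are hypothetical. It models a single level of tiling: `order` is the computation order of the tile loops, `tile` holds the tile sizes, and the work inside one tile (a single NumPy matmul) stands in for the part that would be unrolled spatially across PEs.

```python
import itertools
import numpy as np

def tiled_gemm(A, B, tile=(2, 2, 2), order="mnk"):
    """Compute A @ B one tile at a time.

    order : permutation of "mnk" giving the tile-loop nesting, outermost first.
    tile  : tile sizes (Tm, Tn, Tk) for the M, N and K dimensions.
    """
    M, K = A.shape
    K_b, N = B.shape
    assert K == K_b, "inner dimensions must match"
    assert sorted(order) == ["k", "m", "n"], "order must be a permutation of 'mnk'"
    sizes = {"m": M, "n": N, "k": K}
    tile_of = dict(zip("mnk", tile))
    Tm, Tn, Tk = tile
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    # itertools.product varies its last range fastest, so order[0] is outermost.
    ranges = [range(0, sizes[d], tile_of[d]) for d in order]
    for starts in itertools.product(*ranges):
        pos = dict(zip(order, starts))
        m, n, k = pos["m"], pos["n"], pos["k"]
        # One tile of work; NumPy clips slices at the array edge, so tile
        # sizes that do not divide the dimensions still work.
        C[m:m + Tm, n:n + Tn] += A[m:m + Tm, k:k + Tk] @ B[k:k + Tk, n:n + Tn]
    return C
```

Every choice of `order` and `tile` yields the same result; what changes is which tile stays resident between consecutive steps, which is where hardware data reuse comes from.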

## Architectural Components of a DNN Accelerator

![](_page_26_Figure_1.jpeg)


### Architectural Components of a DNN Accelerator

![](_page_27_Figure_1.jpeg)

**HW Design-Space** 

### Architectural Components of a DNN Accelerator

![](_page_28_Figure_1.jpeg)


## **GEMM vs CONV2D Accelerators**

#### **GEMM Operation**

![](_page_29_Figure_2.jpeg)

Result matrix C

![](_page_29_Picture_4.jpeg)

#### **CONV2D** Operation

![](_page_29_Figure_6.jpeg)

#### GEMM: 3 Loops

- Fewer Opportunities for Reuse
- More general: any DNN layer (including convolutions) can be lowered to GEMM (e.g., im2col)
- E.g., NVIDIA Tensor Core, Google TPU

#### CONV2D: 7 Loops

- More Opportunities for Reuse
- Only applicable to convolution layers
- E.g., NVDLA, MAERI (this work)
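The im2col lowering mentioned above can be sketched as follows, for a single image with stride 1 and no padding. The helper names `im2col` and `conv2d_via_gemm` are hypothetical; this is an illustration of the idea, not the code path of any particular accelerator or library.

```python
import numpy as np

def im2col(x, R, S):
    """Unroll every R x S window of x (shape C, H, W) into one column.

    Returns a (C*R*S, P*Q) matrix and the output size (P, Q).
    """
    C, H, W = x.shape
    P, Q = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    col = 0
    for p in range(P):
        for q in range(Q):
            # Flatten the (C, R, S) window in the same order as the filters.
            cols[:, col] = x[:, p:p + R, q:q + S].reshape(-1)
            col += 1
    return cols, (P, Q)

def conv2d_via_gemm(x, w):
    """Convolution lowered to one GEMM: (K x C*R*S) times (C*R*S x P*Q)."""
    K, C, R, S = w.shape
    cols, (P, Q) = im2col(x, R, S)
    out = w.reshape(K, C * R * S) @ cols
    return out.reshape(K, P, Q)
```

Overlapping windows are copied into separate columns, so the lowered input matrix is larger than the original input. That duplication is one way to see why the 3-loop GEMM form exposes fewer reuse opportunities than the 7-loop convolution.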

### Outline

- Background on DNNs
- DNN Accelerators
- Dataflow and Mapping
- Flexibility

## **Dataflow and Mapping**

#### 7-dimensional network layer

![](_page_31_Figure_2.jpeg)


![](_page_32_Figure_0.jpeg)

![](_page_33_Figure_0.jpeg)

### Takeaways: Data Reuse + Hardware Support

- Dataflow exposes data reuse opportunities
- Hardware support is needed to leverage reuse opportunities

| Hardware Structure                | Per Data Type       | Weight Stationary Dataflow Implication | Output Stationary Dataflow Implication |
|-----------------------------------|---------------------|----------------------------------------|----------------------------------------|
| Bandwidth to MAC                  | Weight Fetch Rate   | Every S Cycles                         | Every Cycle                            |
|                                   | Input Fetch Rate    | Every Cycle                            | Every Cycle                            |
|                                   | Output Fetch Rate   | Every Cycle                            | Every S Cycles                         |
| Local Buffer Sizes for Reuse      | Weight Buffer Size  | 1                                      | 3                                      |
|                                   | Input Buffer Size   | 3                                      | 3                                      |
|                                   | Output Buffer Size  | 3                                      | 1                                      |
| Network-on-Chip for Spatial Reuse | Weight Distribution | Unicast                                | Spatial Multicast                      |
|                                   | Input Distribution  | Spatial Multicast                      | Unicast                                |
|                                   | Output Collection   | Spatial Reduction                      | Temporal Reduction                     |

Note: for full 6… conv, trillions of valid dataflow choices $\rightarrow$ Huge Design Space
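As a rough illustration of why the stationary operand needs less bandwidth, the sketch below runs a 1D convolution on a single MAC unit under two loop orders and counts how often the weight and the output partial sum held at the MAC have to be swapped. The function name and the counting model are assumptions made for illustration. This is a generic model, not the example behind the table, so the exact rates depend on loop bounds that the slides do not reproduce here.

```python
import numpy as np

def conv1d_with_fetch_counts(inputs, weights, stationary):
    """1D convolution on one MAC unit, counting operand swaps.

    stationary = "weight": weight index s is the outer loop (weight stationary)
    stationary = "output": output index q is the outer loop (output stationary)
    """
    S = len(weights)
    Q = len(inputs) - S + 1
    if stationary == "weight":
        schedule = [(s, q) for s in range(S) for q in range(Q)]
    elif stationary == "output":
        schedule = [(s, q) for q in range(Q) for s in range(S)]
    else:
        raise ValueError("stationary must be 'weight' or 'output'")

    outputs = np.zeros(Q)
    fetches = {"weight": 0, "output": 0}
    held_s = held_q = None  # operands currently held at the MAC
    for s, q in schedule:
        if s != held_s:     # a different weight is needed
            fetches["weight"] += 1
            held_s = s
        if q != held_q:     # a different partial sum is needed
            fetches["output"] += 1
            held_q = q
        outputs[q] += inputs[q + s] * weights[s]
    return outputs, fetches
```

With 3 weights and 4 outputs, the weight-stationary order swaps the weight 3 times and the partial sum 12 times, while the output-stationary order does the reverse (12 and 4). Both orders produce the same outputs.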

## **Dataflow and Mapping**

#### 7-dimensional network layer

![](_page_35_Figure_2.jpeg)


### Impact of Parallelization

(i.e., Simplified Fully-connected layer)

![](_page_36_Figure_2.jpeg)

### Impact of Parallelization

Example Model B: Matrix-Vector Multiplication (i.e., Simplified Fully-connected layer)

![](_page_37_Figure_2.jpeg)

![](_page_37_Figure_3.jpeg)

Can we map it in a better way?

### Impact of Parallelization

![](_page_38_Figure_1.jpeg)
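The figures for this example are not reproduced here, so the following is only a back-of-the-envelope sketch of how the choice of parallelized dimension can change PE utilization for a matrix-vector multiplication with an M x K weight matrix. The function name, the sizes, and the cost model (one MAC per PE per step, no bandwidth or reduction limits) are all assumptions.

```python
import math

def matvec_utilization(M, K, num_pes, parallel_dim):
    """Steps and PE utilization for y = W @ x with W of shape (M, K).

    parallel_dim = "rows": each PE owns one output row at a time and walks over K.
    parallel_dim = "cols": each PE owns one input column at a time; partial sums
                           must then be reduced across PEs.
    """
    if parallel_dim == "rows":
        steps = math.ceil(M / num_pes) * K
    elif parallel_dim == "cols":
        steps = math.ceil(K / num_pes) * M
    else:
        raise ValueError("parallel_dim must be 'rows' or 'cols'")
    utilization = (M * K) / (steps * num_pes)
    return steps, utilization
```

For a hypothetical 2 x 8 layer on 4 PEs, parallelizing over rows keeps only 2 of the 4 PEs busy (8 steps, 50% utilization), while parallelizing over columns uses all 4 (4 steps, 100% utilization), at the cost of needing a spatial reduction.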

### Outline

- Background on DNNs
- DNN Accelerators
- Dataflow and Mapping
- Flexibility

### Trend 1: Diversity in DNN Models

- Layer Sizes
- Layer Shapes
- Layer Types

Figure: Number of new ML papers on arXiv

![](_page_40_Figure_6.jpeg)

![](_page_40_Figure_7.jpeg)

**Evolution of DNN Models** 


![](_page_41_Figure_5.jpeg)

### Trend 2: Diversity in Implementations

- Depth-wise/Point-wise Convolutions
- Pruning  $\rightarrow$  Sparsity

### Example: Depth-wise Separable CONV

![](_page_41_Figure_10.jpeg)
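A small worked count shows why depth-wise separable convolutions change layer shapes so much. The sketch compares the number of weights in a standard convolution with a depth-wise convolution followed by a point-wise (1x1) convolution. The layer sizes are hypothetical and biases are ignored.

```python
def standard_conv_weights(K, C, R, S):
    """K filters, each spanning all C input channels with an R x S window."""
    return K * C * R * S

def depthwise_separable_weights(K, C, R, S):
    """Depth-wise (one R x S filter per input channel) plus point-wise (1x1) conv."""
    depthwise = C * R * S  # each input channel is filtered independently
    pointwise = K * C      # 1x1 conv mixes the C channels into K outputs
    return depthwise + pointwise

# Hypothetical layer: 32 input channels, 64 output channels, 3x3 filters.
standard = standard_conv_weights(K=64, C=32, R=3, S=3)         # 18432
separable = depthwise_separable_weights(K=64, C=32, R=3, S=3)  # 2336
```

For this layer the separable form needs roughly 8x fewer weights, and it replaces one dense 7-loop layer with two layers of very different shapes.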

### Trend 3: Diversity in Mapping/Dataflow

- Loop Transformations ("Dataflow")
  - Order, Parallelization, Tiling
  - "Weight Stationary", "Row Stationary"
- Partitioning Strategies: Per Layer, Cross Layer, ...

![](_page_42_Figure_13.jpeg)

Myriad "irregular" shapes, sizes, accesses

### **Challenge:**

Getting high utilization from the accelerator in all cases.

Why? Aren't DNNs essentially Matrix-Matrix multiplications?

## Example of GEMM Operation

![](_page_44_Figure_1.jpeg)

![](_page_45_Figure_0.jpeg)

![](_page_45_Figure_1.jpeg)

Communication:

- **Distribute:** Row multicast
- **Collect:** Column reduce

![](_page_46_Figure_0.jpeg)


#### Mapping Efficiency needs Mapping Flexibility

**Physical Array: 4x4**

| Workload  | Logical                        | Map Effic. |
|-----------|--------------------------------|------------|
| Regular   |                                | 100%       |
| Irregular | 2x8                            | 100%       |
| Irregular | 5x3                            | 94%        |
| Sparse    | {3x1, 2x1, 4x1, 1x1, 4x1, 2x1} | 100%       |

How to support Mapping Flexibility?

- **Distribute:** Row multicast; Spatial Multicast; Multicast to non-neighbors; Only send non-zeros
- **Collect:** Column Reduce; Multiple Parallel; Variable Length; Variable Non-Uniform Length

Flexible data distribution and reduction
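The mapping-efficiency figures on this slide can be reproduced with simple arithmetic, for example 94% for a 5x3 logical array on the 4x4 physical array. The helper below is a hypothetical sketch that assumes the hardware is flexible enough to place every logical group somewhere on the array, so efficiency is just the fraction of PEs doing useful work.

```python
def mapping_efficiency(logical_shapes, physical_rows=4, physical_cols=4):
    """Fraction of physical PEs occupied by the given logical groups.

    logical_shapes : list of (rows, cols) logical groups mapped at the same time.
    """
    total_pes = physical_rows * physical_cols
    used_pes = sum(rows * cols for rows, cols in logical_shapes)
    if used_pes > total_pes:
        raise ValueError("logical groups do not fit on the physical array")
    return used_pes / total_pes

irregular_wide = mapping_efficiency([(2, 8)])  # 16 of 16 PEs
irregular_tall = mapping_efficiency([(5, 3)])  # 15 of 16 PEs
sparse = mapping_efficiency([(3, 1), (2, 1), (4, 1), (1, 1), (4, 1), (2, 1)])  # 16 of 16 PEs
```

On an array that only supports fixed row multicast and column reduction, shapes such as 2x8 or 5x3 could not be placed this way, which is the motivation for flexible distribution and reduction.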


## Levels of Flexibility

![](_page_48_Figure_1.jpeg)

## Introducing MAERI2.0 – A Flexible DNN Accelerator

![](_page_49_Figure_1.jpeg)

ASPLOS 2018, IEEE Micro Top Picks 2019 Honorable Mention


### Focus of Today's Tutorial

- Supported Neural Network Model
- Quantization Flow
- Memory Layout
- Heterogeneous Scheduling
- MAERI 2.0 Microarchitecture
- FPGA DEMO

#### **Future Work:**

- Support for Sparsity
- Support for Multi-layer Mapping
- Compiler support
