

# **MAERI 2.0**

### End-to-end framework to explore architecture design space on FPGA

Jianming Tong, Yangyu Chen, Yue Pan, Abhimanyu Bambhaniya, Taekyung Heo, Tushar Krishna

Georgia Institute of Technology

jianming.tong@gatech.edu











### Outlines

- Supported Neural Network Model
- Quantization Flow
- Memory Layout
- Heterogeneous Scheduling
- MAERI 2.0 Microarchitecture
- DEMO







#### Supported Models from **PyTorch** Framework

[Figure: flow diagram with labels NN Model, Quantization, Data Layout Reorder, Heterogeneous Scheduling, CPU Kernels (FC, BN, Pooling), FPGA, Vectorized Operation]

| Layer/Feature               | Attribute      | Range                                  |
|-----------------------------|----------------|----------------------------------------|
| Convolution                 | Kernel Sizes   | w, h: 1, 3                             |
|                             | Strides        | w, h: 1, 2                             |
|                             | Padding        | w: [0, kernel_w-1]; h: [0, kernel_h-1] |
|                             | Input Size     | Arbitrary                              |
|                             | Input Channel  | Arbitrary                              |
|                             | Output Channel | Arbitrary                              |
|                             | Activation     | ReLU, ReLU6, and LeakyReLU             |
|                             | Dilation       | Future Work                            |
| Max Pooling/Average Pooling | Kernel Sizes   | Arbitrary                              |
|                             | Strides        | Arbitrary                              |
|                             | Padding        | Arbitrary                              |
| Fully Connected             | Input Channel  | Arbitrary                              |
|                             | Output Channel | Arbitrary                              |
| Skip Add                    | Distance       | Arbitrary                              |
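The convolution rows of the table can be read as a small set of constraints. The sketch below checks a layer against them; the function name `check_conv`, its argument order, and the choice to accept a layer with no activation are assumptions made for illustration and are not part of the MAERI 2.0 tooling.

```python
# Hypothetical helper: encodes only the convolution ranges listed in the table.
SUPPORTED_KERNELS = {1, 3}
SUPPORTED_STRIDES = {1, 2}
# Assumption: a convolution without a fused activation (None) is also accepted.
SUPPORTED_ACTIVATIONS = {"ReLU", "ReLU6", "LeakyReLU", None}


def check_conv(kernel, stride, padding, dilation=(1, 1), activation=None):
    """Return the reasons a conv layer falls outside the table (empty if it fits).

    kernel, stride, padding and dilation are (w, h) pairs.
    """
    problems = []
    for axis, k, s, p, d in zip("wh", kernel, stride, padding, dilation):
        if k not in SUPPORTED_KERNELS:
            problems.append(f"kernel_{axis}={k} is not 1 or 3")
        if s not in SUPPORTED_STRIDES:
            problems.append(f"stride_{axis}={s} is not 1 or 2")
        if not 0 <= p <= k - 1:
            problems.append(f"padding_{axis}={p} is outside [0, {k - 1}]")
        if d != 1:
            problems.append(f"dilation_{axis}={d}: dilation is listed as future work")
    if activation not in SUPPORTED_ACTIVATIONS:
        problems.append(f"activation {activation!r} is not listed")
    return problems


if __name__ == "__main__":
    print(check_conv((3, 3), (1, 1), (1, 1), activation="ReLU"))
    print(check_conv((7, 7), (2, 2), (3, 3)))
```

An empty list means every attribute is inside the listed range; any other result names the attribute that would need a workaround.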




[1] C. Torres-Huitzil and B. Girau, "Fault and Error Tolerance in Neural Networks: A Review," IEEE Access, vol. 5, pp. 17322-17341, 2017, doi: 10.1109/ACCESS.2017.2742698.

[2] B. Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.




- Neural networks are error-tolerant [1].
  - Using lower precision saves compute with little sacrifice of accuracy [2].
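As a rough illustration of what "lower precision" means here, the sketch below applies an affine (scale and zero-point) mapping of the kind described in [2] to a list of floats. It is a minimal, framework-free approximation; the function names and the per-tensor min/max calibration are assumptions, and the actual flow in this deck relies on PyTorch's quantization backend instead.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers with one scale and zero point per tensor."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(qmin - lo / scale))
    q = [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [scale * (x - zero_point) for x in q]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
    q, scale, zp = quantize(weights)
    restored = dequantize(q, scale, zp)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, zp, worst)
```

The round-trip error stays within about one quantization step, which is the sense in which accuracy is only slightly sacrificed while every value fits in 8 bits.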







### MAERI 2.0 Model Terminology










#### MAERI 2.0 Data Layout

- DRAM is one-dimensional: each address refers to a single data element.

[Figure: iAct layout in DRAM]

[Figure: weights layout in DRAM]
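Because DRAM is one-dimensional, every multi-dimensional iAct or weight index has to be flattened to a single address. The sketch below shows one way to do that; the channel-last ordering, the function name `linear_address`, and the `base` parameter are assumptions chosen for illustration, since the layout figures did not survive and the actual MAERI 2.0 ordering may differ.

```python
def linear_address(c, y, x, channels, height, width, base=0):
    """Address of element (c, y, x) of a C x H x W tensor stored channel-last.

    Assumed order: rows first, then columns within a row, then channels
    within a pixel. One address holds exactly one data element.
    """
    if not (0 <= c < channels and 0 <= y < height and 0 <= x < width):
        raise IndexError("index outside the tensor")
    return base + (y * width + x) * channels + c


if __name__ == "__main__":
    # Walk a 3-channel 2x2 tensor in address order.
    for y in range(2):
        for x in range(2):
            for c in range(3):
                print((c, y, x), "->", linear_address(c, y, x, 3, 2, 2))
```

Changing the multiplication order in the return expression is all it takes to model a different layout, which is why the reorder step can be treated as a pure index permutation.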








[Figure: legend with FPGA and CPU; layers shown: Convolution, Batch Normalization, MaxPooling, Average Pooling, Linear, ReLU]

- Heterogeneous Scheduling
  - Heterogeneous parallelism optimization is in progress.
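One way to picture heterogeneous scheduling is as a partition of the layer sequence into runs that execute on the same device. The sketch below assumes that only convolution runs on the FPGA and every other layer type runs on the CPU; that split, the `FPGA_LAYERS` set, and the `schedule` function are assumptions for illustration, not the scheduler used in MAERI 2.0.

```python
# Assumption: convolution is offloaded to the FPGA, everything else stays on the CPU.
FPGA_LAYERS = {"Convolution"}


def schedule(layers):
    """Group consecutive layers that share a device into (device, [layers]) segments."""
    segments = []
    for layer in layers:
        device = "FPGA" if layer in FPGA_LAYERS else "CPU"
        if segments and segments[-1][0] == device:
            segments[-1][1].append(layer)
        else:
            segments.append((device, [layer]))
    return segments


if __name__ == "__main__":
    model = ["Convolution", "Batch Normalization", "ReLU",
             "Convolution", "Average Pooling", "Linear"]
    for device, group in schedule(model):
        print(device, group)
```

Each boundary between segments corresponds to a hand-off between the two devices, so fewer segments means fewer transfers.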



## Outlines

- Supported Neural Network Model
- Quantization Flow
- Memory Layout
- Heterogeneous Scheduling
- MAERI 2.0 Microarchitecture
  - Data Processing Order
  - Microarchitecture
- DEMO











[Figure panels: iAct Reuse Multi-Kernel; Multi-iAct Tiling]

Insight 1: iActs and weights are reused.

Insight 2: iAct accesses are not continuous, whereas weight accesses are continuous.
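The second insight can be checked by listing the addresses that one sliding window touches. The sketch below assumes a single-channel iAct stored row by row and a kernel stored as one flat run of weights; both layouts and the helper names are assumptions for illustration.

```python
def window_addresses(y, x, kernel, width):
    """Row-major addresses read by a kernel x kernel window with top-left corner (y, x)."""
    return [(y + dy) * width + (x + dx) for dy in range(kernel) for dx in range(kernel)]


def is_continuous(addresses):
    """True when every address is exactly one past the previous one."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


if __name__ == "__main__":
    iact = window_addresses(0, 0, kernel=3, width=8)
    weights = list(range(3 * 3))  # a kernel stored as one flat run
    print("iAct addresses:   ", iact, is_continuous(iact))
    print("weight addresses: ", weights, is_continuous(weights))
```

The iAct window jumps by the row width after every three elements, while the weights form one unbroken run.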

## Outlines

- Supported Neural Network Model
- Quantization Flow
- Memory Layout
- Heterogeneous Scheduling
- MAERI 2.0 Microarchitecture
  - Data Processing Order
  - Challenges and Proposed Microarchitecture
- DEMO













Insight 1: an on-chip buffer is needed to hold data so that reuse can be exploited.













#### MAERI 2.0 Micro-architecture - Computation











- A large kernel can be decomposed into a series of 3x3 kernels.
  - This saves computation and storage.
  - A small kernel is mapped directly onto the design.
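The saving can be estimated with a short calculation. The sketch below assumes that "decomposed" means stacking stride-1 3x3 layers until their receptive field matches the large kernel, and it counts weights for a single input/output channel pair; that interpretation and the function names are assumptions, and the deck may use a different decomposition.

```python
def stacked_3x3_layers(kernel):
    """Number of stride-1 3x3 layers whose receptive field is kernel x kernel."""
    if kernel < 3 or kernel % 2 == 0:
        raise ValueError("expects an odd kernel size of at least 3")
    # Each extra 3x3 layer widens the receptive field by 2.
    return (kernel - 1) // 2


def weight_counts(kernel):
    """(direct, stacked) weight counts per input/output channel pair."""
    direct = kernel * kernel
    stacked = 9 * stacked_3x3_layers(kernel)
    return direct, stacked


if __name__ == "__main__":
    for k in (3, 5, 7):
        direct, stacked = weight_counts(k)
        print(f"{k}x{k}: direct={direct}, stacked 3x3={stacked}")
```

Under this assumption a 5x5 kernel needs 25 weights directly and 18 as two 3x3 layers, and a 7x7 kernel needs 49 directly and 27 as three 3x3 layers; the multiply count per output position follows the same ratio.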











- Sliding Windows Parallelism.
- Kernel Parallelism.









- Process $Y_P = 2$ sliding windows in parallel.
- Process $K_P = 3$ kernels in parallel.

[Figure label: Weights]
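The two kinds of parallelism can be pictured as a blocked loop nest in which each step covers $Y_P$ windows and $K_P$ kernels. The sketch below uses a 1D convolution to keep it short and runs the block sequentially; the function name, the 1D simplification, and the step counter are assumptions for illustration, and real hardware would evaluate each block at once.

```python
Y_P = 2  # sliding windows handled per step
K_P = 3  # kernels handled per step


def dot(window, kernel):
    return sum(a * b for a, b in zip(window, kernel))


def conv1d_blocked(iact, kernels, y_p=Y_P, k_p=K_P):
    """1D convolution without padding, iterated in blocks of y_p windows x k_p kernels.

    Returns (outputs, steps), where outputs[k][y] is kernel k applied at window y
    and steps counts the blocks a y_p x k_p array would need.
    """
    size = len(kernels[0])
    n_windows = len(iact) - size + 1
    out = [[0] * n_windows for _ in kernels]
    steps = 0
    for y0 in range(0, n_windows, y_p):
        for k0 in range(0, len(kernels), k_p):
            steps += 1
            # Everything inside this block would run concurrently in hardware.
            for y in range(y0, min(y0 + y_p, n_windows)):
                window = iact[y:y + size]
                for k in range(k0, min(k0 + k_p, len(kernels))):
                    out[k][y] = dot(window, kernels[k])
    return out, steps


if __name__ == "__main__":
    outputs, steps = conv1d_blocked([1, 2, 3, 4, 5, 6],
                                    [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    print(outputs, steps)
```

With four windows and three kernels, the block shape of 2 x 3 covers the whole layer in two steps instead of twelve individual dot products issued one at a time.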







#### Continuous Reading





[Figure: DRAM access diagram and ILA waveform captures of the AXI read interface `off_chip_access_test_1_m_axi_input_r` and the AXI write interface `off_chip_access_test_1_m_axi_output_r`]

| Operation                | Latency (Continuous) | Latency (Jump Mode) |
|--------------------------|----------------------|---------------------|
| Read 256 data (128 bit)  | 256                  | 1182                |
| Write 128 data (128 bit) | 128                  | 576                 |

Insight 2: multi-level tiling is needed to deliver continuous DRAM access.
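The cost of non-continuous access can be read off the latency table as a ratio. The short sketch below only restates the four measured values from the table; the dictionary layout and the `slowdown` helper are illustrative names.

```python
# Latencies copied from the table (continuous vs. jump mode).
LATENCY = {
    "read":  {"continuous": 256, "jump": 1182},
    "write": {"continuous": 128, "jump": 576},
}


def slowdown(operation):
    """How many times slower jump mode is than continuous access."""
    entry = LATENCY[operation]
    return entry["jump"] / entry["continuous"]


if __name__ == "__main__":
    for op in LATENCY:
        print(f"{op}: {slowdown(op):.2f}x")
```

Both reads and writes come out roughly 4.5 to 4.6 times slower in jump mode, which is the gap that tiling for continuous access is meant to close.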





- L3 tile: the data transferred from DRAM in a way that achieves continuous data access.
- L2 tile: the data that the entire DPE array requires every cycle.
- L1 tile: the data that each single PE requires.
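The three levels nest inside one another, which can be sketched as repeated splitting of a flat run of data. The tile sizes, the flat 1D view, and the helper names below are assumptions for illustration; the real tile shapes depend on the layer and on the DPE array dimensions.

```python
def split(seq, size):
    """Cut a sequence into consecutive chunks of at most `size` elements."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def tile_hierarchy(data, l3_size, l2_size, l1_size):
    """Nest tiles as data -> L3 (per DRAM transfer) -> L2 (per cycle) -> L1 (per PE)."""
    return [
        [split(l2_tile, l1_size) for l2_tile in split(l3_tile, l2_size)]
        for l3_tile in split(data, l3_size)
    ]


if __name__ == "__main__":
    tiles = tile_hierarchy(list(range(16)), l3_size=8, l2_size=4, l1_size=2)
    for i, l3_tile in enumerate(tiles):
        print("L3 tile", i, l3_tile)
```

Indexing `tiles[a][b][c]` selects L3 tile `a`, then the L2 tile `b` consumed in one cycle, then the L1 tile `c` that goes to one PE.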











- Streaming Buffer → reuses L2 iAct tiles for multiple kernels.
- Line Buffer → reuses overlapping iActs across sliding windows.
- iAct reuse in different DPE rows → iActs are reused by multiple kernels.
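The line buffer idea can be shown with a small software analogue: each incoming iAct is read once and then kept long enough to appear in every window that overlaps it. The sketch below is a 1D simplification using a fixed-length queue; the function name and the 1D stream are assumptions, not the hardware design.

```python
from collections import deque


def sliding_windows(stream, kernel):
    """Yield kernel-wide windows while reading each stream element exactly once."""
    line = deque(maxlen=kernel)  # the oldest element drops out automatically
    for value in stream:
        line.append(value)
        if len(line) == kernel:
            yield tuple(line)


if __name__ == "__main__":
    for window in sliding_windows([1, 2, 3, 4, 5], kernel=3):
        print(window)
```

Without the buffer, three windows of width three would need nine reads; with it, the five input elements are each fetched once.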




#### MAERI 2.0 Micro-architecture - Buffers for weights

- Stationary Buffer → enables weight reuse across different L2 weight tiles.
- Weights Buffer → double buffer for the L2 weight tile.
- Weights are broadcast across different DPE columns.
- Buffer write and data forwarding happen in parallel when the first weights are fetched.
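Double buffering means the next L2 weight tile is loaded into one half of the buffer while the array consumes the other half. The sketch below simulates that hand-off sequentially; the `DoubleBuffer` class and `run` function are illustrative names, and the overlap that hardware achieves in time is only indicated by comments.

```python
class DoubleBuffer:
    """Two slots: `front` feeds the compute array, `back` receives the next tile."""

    def __init__(self):
        self.front = None
        self.back = None

    def load(self, tile):
        self.back = tile

    def swap(self):
        self.front, self.back = self.back, None


def run(tiles):
    """Consume tiles in order, prefetching tile i+1 while tile i is in use."""
    buf = DoubleBuffer()
    consumed = []
    tiles = iter(tiles)
    first = next(tiles, None)
    if first is None:
        return consumed
    buf.load(first)
    buf.swap()
    while buf.front is not None:
        upcoming = next(tiles, None)
        if upcoming is not None:
            buf.load(upcoming)       # in hardware this overlaps with the compute below
        consumed.append(buf.front)   # stand-in for computing on the front tile
        buf.swap()
    return consumed


if __name__ == "__main__":
    print(run(["w_tile_0", "w_tile_1", "w_tile_2"]))
```

Only the very first tile has to be waited for; every later load hides behind the computation on the previous tile.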































































































































































































## Outlines

- Supported Neural Network Model
- Quantization Flow
- Memory Layout
- Heterogeneous Scheduling
- MAERI 2.0 Microarchitecture
- DEMO
  - Entire Flow Demonstration
  - Walk Through Example of deploying ResNet50

Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow

#### **Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow**

#### • Workload: ResNet 50

| Layer Type       | Convolution 2d | BatchNorm 2d | ReLU | Skip Add | Average Pooling | Fully Connected |
|------------------|----------------|--------------|------|----------|-----------------|-----------------|
| Number of Layers | 53             | 53           | 49   | 16       | 1               | 1               |

#### **Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow**

### • Workload: ResNet 50

| Layer     | Type      | Convolution  | n 2d E | BatchNorm | 2d ReLU      | Skip Ad | d Average   | Pooling  | Fully Connected  |
|-----------|-----------|--------------|--------|-----------|--------------|---------|-------------|----------|------------------|
| Number of | of Layers | 53           |        | 53        | 49           | 16      | ]           | L        | 1                |
| •         | Platforn  | n: zcu 104   | 4 🔍    |           |              |         |             |          |                  |
| Name      |           | OS           | python | PyTorch   | Quantization | Scheme  | Dataset     |          | CPU              |
| Version   | pynqlinux | v2.6 (18.04) | 3.6.5  | 1.8.1     | qnnpa        | ck      | imagenet 1k | Dual-cor | e Arm Cortex-R5F |

### Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow

### • Workload: ResNet 50

| Layer Type       | Convolution 2d | BatchNorm 2d | ReLU | Skip Add | Average Pooling | Fully Connected |
|------------------|----------------|--------------|------|----------|-----------------|-----------------|
| Number of Layers | 53             | 53           | 49   | 16       | 1               | 1               |
| Platforn         | n: zcu 104 🔌   |              |      |          |                 |                 |

| Name    | OS                        | python | PyTorch | Quantization Scheme | Dataset     | $\operatorname{CPU}$     |
|---------|---------------------------|--------|---------|---------------------|-------------|--------------------------|
| Version | pynqlinux v $2.6$ (18.04) | 3.6.5  | 1.8.1   | qnnpack             | imagenet 1k | Dual-core Arm Cortex-R5F |

- Setup Environment
- Model Quantization
- Data Layout Reorder (PyTorch Default order)
- Custom Inference

### Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow

### • Workload: ResNet 50

| Layer Type       | Convolution 2d | BatchNorm 2d | ReLU | Skip Add | Average Pooling | Fully Connected |
|------------------|----------------|--------------|------|----------|-----------------|-----------------|
| Number of Layers | 53             | 53           | 49   | 16       | 1               | 1               |
| Platforn         | n: zcu 104 🔌   |              |      |          |                 |                 |

|         |                               | ,<br>, |         |                     |             |                          |
|---------|-------------------------------|--------|---------|---------------------|-------------|--------------------------|
| Name    | OS                            | python | PyTorch | Quantization Scheme | Dataset     | $\operatorname{CPU}$     |
| Version | pynqlinux v $2.6$ (1 $8.04$ ) | 3.6.5  | 1.8.1   | qnnpack             | imagenet 1k | Dual-core Arm Cortex-R5F |

- Setup Environment
- Model Quantization
- Data Layout Reorder (PyTorch Default order)
- Custom Inference

### Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow

### • Workload: ResNet 50

| Layer Type       | Convolution 2d | BatchNorm 2d | ReLU | Skip Add | Average Pooling | Fully Connected |
|------------------|----------------|--------------|------|----------|-----------------|-----------------|
| Number of Layers | 53             | 53           | 49   | 16       | 1               | 1               |
| Platforn         | n: zcu 104 🔌   |              |      |          |                 |                 |

|         |                               | •      |         |                     |             |                          |
|---------|-------------------------------|--------|---------|---------------------|-------------|--------------------------|
| Name    | OS                            | python | PyTorch | Quantization Scheme | Dataset     | $\operatorname{CPU}$     |
| Version | pynqlinux v $2.6$ (1 $8.04$ ) | 3.6.5  | 1.8.1   | qnnpack             | imagenet 1k | Dual-core Arm Cortex-R5F |

- Setup Environment
- Model Quantization
- Data Layout Reorder (PyTorch Default order)
- Custom Inference

### Goal of DEMO 1: Demonstrate the entire MAERI 2.0 Flow

### • Workload: ResNet 50

| Layer Type       | Convolution 2d | BatchNorm 2d | ReLU | Skip Add | Average Pooling | Fully Connected |
|------------------|----------------|--------------|------|----------|-----------------|-----------------|
| Number of Layers | 53             | 53           | 49   | 16       | 1               | 1               |
| Platforn         | n: zcu 104 🔌   |              |      |          |                 |                 |

| Name    | OS                        | python | PyTorch | Quantization Scheme | Dataset     | $\operatorname{CPU}$     |
|---------|---------------------------|--------|---------|---------------------|-------------|--------------------------|
| Version | pynqlinux v $2.6$ (18.04) | 3.6.5  | 1.8.1   | qnnpack             | imagenet 1k | Dual-core Arm Cortex-R5F |


### DEMO 1 (Pre-recorded)

- Setup Environment
- Model Quantization
- Data Layout Reorder (PyTorch Default order)
- Custom Inference
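As a rough illustration of what the Model Quantization step does to activations, the sketch below applies generic affine (scale and zero-point) quantization to unsigned 8-bit values. It is not qnnpack's implementation; the helper names `choose_qparams`, `quantize`, and `dequantize` and the sample values are assumptions made for this example.

```python
from typing import List, Tuple


def choose_qparams(values: List[float], qmin: int = 0, qmax: int = 255) -> Tuple[float, int]:
    """Pick a scale and zero point so the value range (extended to include 0.0)
    maps onto the integer range [qmin, qmax]."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, qmin  # degenerate range: every value is zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> List[int]:
    """Map real values to clamped unsigned 8-bit integers."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values: List[int], scale: float, zero_point: int) -> List[float]:
    """Map quantized integers back to approximate real values."""
    return [(q - zero_point) * scale for q in q_values]


# Made-up activation values, for illustration only
acts = [0.0, 0.5, 1.0, 2.55]
scale, zero_point = choose_qparams(acts)
q_acts = quantize(acts, scale, zero_point)
restored = dequantize(q_acts, scale, zero_point)
print(q_acts, restored)
```

The round trip through `dequantize` loses at most half a quantization step per value, which is the error the quantized accuracy numbers have to absorb.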

- DEMO 1 (Pre-run offline)
  - Setup Environment
  - Model Quantization
  - Data Layout Reorder (PyTorch Default order)


#### DEMO 1 (Pre-run offline)

- Setup Environment

`Validate: 100% 250/250 [24:13<00:00, 5.78s/it, loss=0.909, top1=76.7, top5=93, img_size=224]`

`Results: loss=0.90904, top1=76.7, top5=93.0`
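The validation log above reports top-1 and top-5 accuracy. A minimal sketch of how such a top-k metric is commonly computed follows; the function name `topk_accuracy` and the toy scores and labels are assumptions for illustration, not the script that produced the log.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 4 samples, 3 classes (made-up numbers)
scores = [
    [0.1, 0.7, 0.2],  # best guess: class 1
    [0.5, 0.3, 0.2],  # best guess: class 0
    [0.2, 0.3, 0.5],  # best guess: class 2
    [0.6, 0.3, 0.1],  # best guess: class 0
]
labels = [1, 1, 2, 2]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(f"top1={top1}, top2={top2}")
```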




## Outlines

- Supported Neural Network Model
- Quantization Flow
- Memory Layout
- Heterogeneous Scheduling
- MAERI 2.0 Microarchitecture
- DEMO
  - Entire Flow Demonstration
  - Performance Evaluation of Conv Accel.




### Goal of DEMO: Evaluate convolution performance of the MAERI 2.0 accelerator

- Workload: Once-for-all ResNet50 [1]
  - 36 convolution layers
- Platform: ZCU104
  - DRAM-PL bandwidth (High Performance Interface): uses 512 of the 768 available bits
    - Bandwidth-restricted parallelism: $K_P = 3$ (3 kernels)
    - Workload-preferred parallelism: $Y_P = 4$ (4 sliding windows)
  - L2 tile size (based on parallelism):
  - L3 tile size (based on DSE tool): $(T_K, T_C, R, S, T_X, T_Y) = (12, 32, 3, 3, 104, 226)$

- Streaming Buffer: 85696 x 64-bit
- Stationary Buffer: 8192 x 80-bit
- Line Buffer: 88 x 64-bit
- No Weights Buffer (Weights Stationary)
- Output Buffer: 1326 x 512-bit
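To put the buffer dimensions above in more familiar units, the following sketch multiplies each buffer's depth by its word width. The depths and widths are copied from the list; the dictionary layout and the helper `capacity_bits` are assumptions for this example, and the total ignores any overhead from mapping onto physical memory blocks.

```python
# (depth in entries, word width in bits), copied from the list above
BUFFERS = {
    "Streaming Buffer": (85696, 64),
    "Stationary Buffer": (8192, 80),
    "Line Buffer": (88, 64),
    "Output Buffer": (1326, 512),
}


def capacity_bits(depth: int, width_bits: int) -> int:
    """Raw storage of a buffer: number of entries times bits per entry."""
    return depth * width_bits


capacities = {name: capacity_bits(d, w) for name, (d, w) in BUFFERS.items()}
total_bits = sum(capacities.values())

for name, bits in capacities.items():
    print(f"{name}: {bits} bits = {bits / 8 / 1024:.2f} KiB")
print(f"Total: {total_bits} bits = {total_bits / 8 / 1024:.2f} KiB")
```

The Streaming Buffer accounts for most of the raw on-chip storage in this configuration.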






- Run the DSE tool to find the best hardware configuration
  - 30 minutes on a 10th-generation Intel Core i7-10750H
- Generate the tiling strategy for the specific workload
- Run on the Xilinx ZCU104 (DEMO 2 today)
  - Load the pre-generated tiling strategy
  - Configure the FPGA with the MAERI 2.0 bitstream
  - Run the network inference
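The last step of the flow above (load the tiling strategy, then run each convolution layer and time the whole pass) might be structured as in the sketch below. Everything here is hypothetical: the JSON format of the tiling strategy, the function `run_conv_layers`, and the stand-in `fake_accelerator` callback are assumptions for illustration. A real run would first configure the FPGA with the bitstream and pass a callback that drives the hardware.

```python
import json
import time
from typing import Callable, Dict, List


def run_conv_layers(tiling_strategy: List[Dict[str, int]],
                    run_layer: Callable[[int, Dict[str, int]], None]) -> float:
    """Run every convolution layer in order and return the total wall-clock time."""
    start = time.time()
    for index, tiling in enumerate(tiling_strategy):
        print(f"Conv Layer Index: {index}")
        run_layer(index, tiling)
    elapsed = time.time() - start
    print(f"overall latency of running all layers = {elapsed} seconds")
    return elapsed


# Hypothetical pre-generated strategy: one tiling entry per convolution layer.
# The same placeholder tile is reused for all 36 layers purely for illustration.
strategy_json = json.dumps(
    [{"T_K": 12, "T_C": 32, "R": 3, "S": 3, "T_X": 104, "T_Y": 226}] * 36
)
tiling_strategy = json.loads(strategy_json)

executed: List[int] = []


def fake_accelerator(index: int, tiling: Dict[str, int]) -> None:
    """Stand-in for the accelerator call; it only records which layer ran."""
    executed.append(index)


latency = run_conv_layers(tiling_strategy, fake_accelerator)
```

Passing the accelerator call in as a callback keeps the timing loop testable on a machine without the board attached.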


Console output: `Conv Layer Index: 0` through `Conv Layer Index: 35` (one line per convolution layer), followed by `overall latency of running all layers = 103.56596040725708 seconds`.
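For a sense of scale, the reported total can be divided by the 36 convolution layers. This is only an average; the per-layer times are not given here and individual layers will differ.

$$
\bar{t}_{\text{layer}} = \frac{103.566\ \text{s}}{36} \approx 2.877\ \text{s}
$$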








## Summary








- Takes a PyTorch NN model
- Quint8 input activations (iAct), Qint weights, Quint8 output activations (oAct)
- PyTorch default data layout
- Accelerates convolution on MAERI 2.0 (FPGA)
- Multi-tiling memory hierarchy
- Design optimization in progress



#### Thank you! Questions are welcome. https://maeri-project.github.io/

#### Join us to build a better framework for researchers!


