

# Deep Neural Network and Accelerator Co-Design: Present and Future

Cong (Callie) Hao

Assistant Professor

Georgia Institute of Technology

School of Electrical and Computer Engineering



Sharc-lab @ Georgia Tech https://sharclab.ece.gatech.edu/








#### Deep Neural Network (DNN) Design

- An automatic neural architecture search (NAS) methodology – a.k.a. AutoML
- Boosts the quality and accuracy of machine learning algorithms



Zoph, Barret, et al. "Learning transferable architectures for scalable image recognition." CVPR 2018

#### **Accelerator Design**

- On GPU/TPU/NPU: optimized (tuned) neural network implementations
- On FPGA: customized DNN accelerators



Chen Zhang, et al. "Optimizing FPGA-based accelerator design for deep convolutional neural networks", FPGA 2015




### Three Levels of Co-Design in Cake Factory...








- **Level 0:** Build a plate... bake a cake... oops, the cake is too big
- **Level 1:** The plate is different from expected...
- **Level 2:** Bake and build in pairs!












True DNN and Accelerator Co-Design

#### Level 2 Co-Design for DNN/Accelerator










### NAIS: Simultaneous NAS + Implementation



Simultaneous NAS and Implementation search








Automated AI algorithm development and deployment



Bridge the gap between SW/HW for higher quality solutions



**Software:** 

**Neural Architecture Search Space** 

**Hardware:** 

**Implementation Search Space** 




Method 1: find something in the middle and connect to both SW and HW








Method 2: merge the two spaces – formulate both in one equation






Method 1: find something in the middle and connect to both SW and HW



Hao, Cong, Xiaofan Zhang, Yuhong Li, Sitao Huang, Jinjun Xiong, Kyle Rupnow, Wen-mei Hwu, and Deming Chen. "FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge." ACM/IEEE DAC, 2019. (seems to be most cited in DAC 2019)








**DNNs** are usually built from repeated or similar basic blocks.

[Figure: a regular block computing $f(\mathbf{x})$ and a residual block computing $f(\mathbf{x}) - \mathbf{x}$, each made of weight layers and activation functions]

**Accelerators (FPGA)** are usually built from Processing Elements (PEs), e.g., Convolution 1×1, Convolution 3×3, Depth-wise Convolution 3×3, Pooling, ReLU.

https://d2l.ai/chapter\_convolutional-modern/resnet.html















































# **DNN/FPGA Co-Design Flow**



















- **Design Automation Conference System Design Contest (DAC-SDC)**
  - Object detection on FPGA/GPU
- **Our Achievements**
  - **2018:** Third place @ FPGA (3 out of 51)
    - Independently designed DNN and FPGA accelerator – a lot of iterations!
  - **2019:** Double championship @ FPGA and GPU (1 out of 58, 1 out of 56)
    - NAIS co-design leads to victory!







Media coverage and open-source code

https://www.ibm.com/blogs/research/2019/06/winning-ai-algorithms-drones/



2019: https://github.com/TomGoo8/SkyNet

2020: https://github.com/jgoeders/dac\_sdc\_2020\_designs











# **Key Methodologies for Co-Design**




Li, Yuhong, Cong Hao, Xiaofan Zhang, Xinheng Liu, Yao Chen, Jinjun Xiong, Wen-mei Hwu, and Deming Chen. "EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions." DAC 2020

Method 2: merge the two spaces – formulate both in one equation



A True NAIS work EDD [ICCAD'19, DAC'20]

**Software:** Neural Architecture Search Space **{A}**

**Hardware:** Implementation Search Space **{I}**

- Put {A, I} into one formulation, preferably differentiable
- Solve {A, I} using continuous optimization, e.g., gradient descent

### **NAIS** Formulation

With a fixed implementation:

$$\min: \ \mathcal{L} = Acc_{loss}(\mathbf{A}) \cdot Perf_{loss}(\mathbf{I}_0)$$

- $\mathbf{A}$ is differentiable with respect to $\mathcal{L}$
- Implementation $\mathbf{I}_0$ is fixed (not in the search space)

**NAIS Formulation**: true co-design

$$\min: \ \mathcal{L} = Acc_{loss}(\mathbf{A}, \mathbf{I}) \cdot Perf_{loss}(\mathbf{I}) + \beta \cdot C^{RES(\mathbf{I}) - RES_{ub}}$$

- $\mathbf{A}$ is differentiable with respect to $\mathcal{L}$
- Implementation $\mathbf{I}$ is also variable
- The last term considers resource constraints
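To make the role of each term in the NAIS objective concrete, it can be evaluated for a single design point. The sketch below is illustrative only: the function name `nais_loss`, the values chosen for β and C, and the normalized resource numbers are assumptions, not values from the talk.

```python
def nais_loss(acc_loss, perf_loss, res, res_ub, beta=0.01, c=10.0):
    """L = Acc_loss(A, I) * Perf_loss(I) + beta * C ** (RES(I) - RES_ub).

    beta and c are hypothetical constants; res and res_ub use the same unit.
    """
    penalty = beta * c ** (res - res_ub)
    return acc_loss * perf_loss + penalty


# Resources are normalized so that the upper bound RES_ub is 1.0 (assumption).
within_budget = nais_loss(acc_loss=0.30, perf_loss=0.8, res=0.5, res_ub=1.0)
over_budget = nais_loss(acc_loss=0.30, perf_loss=0.8, res=3.0, res_ub=1.0)

print(f"within budget: {within_budget:.4f}")
print(f"over budget:   {over_budget:.4f}")
```

The exponential term stays small while RES(I) is below the bound and grows quickly once the bound is exceeded, so it behaves like a soft resource constraint that remains differentiable.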

### Challenge

How to formulate $\mathbf{I}$ as differentiable with respect to $\mathcal{L}$?

# Differentiable DNN Architecture Search



### **Neural Architecture Search (NAS)**






From discrete to continuous, to make the search differentiable:

**Gumbel-Softmax** 

- Sampling parameter  $\theta_{i,m}$
- Operations sampled following Gumbel-Softmax distribution
- $\theta_{i,m}$  is differentiable with respect to  $\mathcal L$
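The Gumbel-Softmax relaxation listed above can be sketched in a few lines. This is a minimal standard-library version, not the code used in EDD: the function name, the temperature value, and the example values of $\theta_{i,m}$ are assumptions for illustration.

```python
import math
import random


def gumbel_softmax(theta, temperature=1.0, rng=random):
    """Return relaxed one-hot weights over the candidate operations.

    theta: sampling parameters (logits), one per candidate operation.
    """
    noisy = []
    for t in theta:
        u = rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((t + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
theta_i = [0.5, 1.5, -0.3]  # hypothetical theta_{i,m} for three candidate ops
weights = gumbel_softmax(theta_i, temperature=0.5, rng=rng)
print(weights)
```

The weights sum to one and move toward a one-hot choice as the temperature is lowered, while staying a smooth function of the sampling parameters.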


### Differentiable Implementation Search








# Now Since Everything is Differentiable...



#### **NAS**

#### **Implementation Search**





min:  $\mathcal{L} = Acc_{loss}(\mathbf{A}, \mathbf{I}) \cdot Perf_{loss}(\mathbf{I}) + \beta \cdot C^{RES(\mathbf{I}) - RES_{ub}}$ 
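Because every term of this objective is differentiable, the architecture and implementation variables can be updated together by gradient descent. The toy sketch below assumes two candidate operations and two bit-widths with made-up costs, and it uses finite-difference gradients to stay dependency-free; all constants and function names are hypothetical, and a real flow would rely on an autodiff framework.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-choice costs (illustrative only).
ACC_LOSS_PER_OP = [0.30, 0.25]    # two candidate operations (A)
ACC_PENALTY_PER_Q = [0.05, 0.00]  # extra accuracy loss at 4-bit vs 8-bit (I)
LATENCY_PER_Q = [1.0, 2.0]        # Perf_loss contribution per bit-width
RES_PER_Q = [0.4, 1.2]            # resource use, normalized so RES_ub = 1.0
BETA, C, RES_UB = 0.1, 10.0, 1.0


def loss(params):
    """params = [a0, a1, q0, q1]: logits for the op and bit-width choices."""
    p_op = softmax(params[:2])
    p_q = softmax(params[2:])
    acc = sum(p * v for p, v in zip(p_op, ACC_LOSS_PER_OP))
    acc += sum(p * v for p, v in zip(p_q, ACC_PENALTY_PER_Q))
    perf = sum(p * v for p, v in zip(p_q, LATENCY_PER_Q))
    res = sum(p * v for p, v in zip(p_q, RES_PER_Q))
    return acc * perf + BETA * C ** (res - RES_UB)


def numerical_grad(f, params, eps=1e-6):
    """Central finite differences; an autodiff framework would replace this."""
    grad = []
    for k in range(len(params)):
        up = params[:k] + [params[k] + eps] + params[k + 1:]
        down = params[:k] + [params[k] - eps] + params[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_descent(params, lr=0.5, steps=200):
    for _ in range(steps):
        grad = numerical_grad(loss, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


start = [0.0, 0.0, 0.0, 0.0]
final = gradient_descent(start)
print("loss before:", loss(start), "after:", loss(final))
print("op probabilities:", softmax(final[:2]))
print("bit-width probabilities:", softmax(final[2:]))
```

With these made-up costs, the search drifts toward the second operation and the lower bit-width, since that combination lowers both the accuracy-performance product and the resource penalty.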

### Continuous Optimization: gradient descent

[Figure: candidate operations (dw-conv k×k, Conv 1×1), each with $q_1$-bit, $q_2$-bit, ... quantization options]

$\boldsymbol{I_i^m}$: other implementation variables

$$Perf^q(op_i^m) = f(I_i^m)$$

$$Res^q(op_i^m) = g(I_i^m)$$
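One way to keep the bit-width choice differentiable is to weight the per-quantization performance and resource values by a softmax over the choices. The sketch below assumes this relaxation and uses hypothetical latency and DSP numbers as stand-ins for $Perf^q(op_i^m)$ and $Res^q(op_i^m)$; it is not the exact formulation from the EDD paper.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_cost(q_logits, cost_per_q):
    """Softmax-weighted average of a per-bit-width cost (latency or resource)."""
    probs = softmax(q_logits)
    return sum(p * c for p, c in zip(probs, cost_per_q))


# Hypothetical numbers for one operation at 4-, 8-, and 16-bit precision.
latency_ms = [1.0, 1.8, 3.5]    # stands in for Perf^q(op_i^m)
dsp_usage = [20.0, 45.0, 90.0]  # stands in for Res^q(op_i^m)

uniform = [0.0, 0.0, 0.0]       # no preference among bit-widths
prefers_4bit = [4.0, 0.0, 0.0]  # logits favoring the 4-bit option

perf_uniform = expected_cost(uniform, latency_ms)
perf_4bit = expected_cost(prefers_4bit, latency_ms)
res_4bit = expected_cost(prefers_4bit, dsp_usage)
print(perf_uniform, perf_4bit, res_4bit)
```

As the logits favor one bit-width, the relaxed cost approaches that option's discrete cost, so gradients with respect to the logits indicate which quantization the search should prefer.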



# Comparisons with hardware-aware NAS

| Model | Top-1 Test Error (%) | Top-5 Test Error (%) | GPU Latency (Titan RTX) | FPGA Latency (ZCU102 [22]) | Note |
|---|---|---|---|---|---|
| **Baseline Models** | | | | | |
| GoogleNet | 30.22 | 10.47 | 27.75 ms | 13.25 ms | |
| MobileNet-V2 | 28.1 | 9.7 | 17.87 ms | 10.85 ms | |
| ShuffleNet-V2 | 30.6 | 11.7 | 21.91 ms | NA | |
| ResNet18 | 30.2 | 10.9 | 9.71 ms | 10.15 ms | |
| **Hardware-aware NAS Models** | | | | | |
| MNasNet-A1 | 24.8 | 7.5 | 17.94 ms | 8.78 ms | |
| FBNet-C | 24.9 | 7.6 | 22.54 ms | 12.21 ms | |
| Proxyless-cpu | 24.7 | 7.6 | 21.34 ms | 10.81 ms | |
| Proxyless-Mobile | 25.4 | 7.8 | 21.23 ms | 10.78 ms | |
| Proxyless-gpu | 24.9 | 7.5 | 15.72 ms | 10.79 ms | |
| EDD-Net-1 | 25.3 | 7.7 | 11.17 ms | 11.15 ms | GPU-oriented DNN |
| EDD-Net-2 | 25.4 | 7.9 | 13.00 ms | 7.96 ms | FPGA-oriented DNN |






EDD-Net-1: targets GPU

EDD-Net-2: targets recursive FPGA accelerator

EDD-Net-3: targets pipelined FPGA accelerator

# Follow-up Works Using Differentiable Approach

DNA: Differentiable network-accelerator co-search

Y Zhang, Y Fu, W Jiang, C Li, H You, M Li... - arXiv preprint arXiv ..., 2020 - arxiv.org

ConCoDE: Hard-constrained Differentiable Co-Exploration Method for Neural Architectures and Hardware Accelerators

D Hong, K Choi, HY Lee, J Yu, Y Kim, N Park, J Lee - 2021 - openreview.net

DANCE: Differentiable accelerator/network co-exploration

K Choi, D Hong, H Yoon, J Yu, Y Kim... - 2021 58th ACM/IEEE ..., 2021 - ieeexplore.ieee.org

DIAN: Differentiable accelerator-network co-search towards maximal DNN efficiency

Y Zhang, Y Fu, W Jiang, C Li, H You... - 2021 IEEE/ACM ..., 2021 - ieeexplore.ieee.org

Triple-Search: Differentiable Joint-Search of Networks, Precision, and Accelerators

Y Fu, Y Zhang, H You, Y Lin - 2020 - openreview.net



Software: Neural Architecture Search (NAS)

Hardware: Implementation Search

NAIS

Multi-modal Multi-task Models

**Heterogeneous Platform Mapping-aware NAIS**

### Multi-modal Multi-Task Models (MMMT)






- Multi-modal: process and relate information from multiple modalities
  - Text, visual, vocal, motion, etc.
- Multi-task: to learn multiple related tasks jointly
  - Knowledge transfer
  - Improve the generalization performance
  - Mitigate training (labeled) data sparsity





Ruder, Sebastian. "An overview of multi-task learning in deep neural networks." arXiv preprint arXiv:1706.05098 (2017).


Largely increased complexity in model structure

# **Heterogeneous Platforms**





https://www.xilinx.com/support/documentation/white\_papers/wp505-versal-acap.pdf



Talpes, Emil, et al. "Compute solution for Tesla's full self-driving computer." IEEE Micro 40, no. 2 (2020): 25-35.






### When MMMT Meets Heterogeneity



Largely increased complexity in model structure



Largely increased complexity in heterogeneous platforms



Mapping starts to matter...



Scheduling starts to matter...



**Optimization on each device** also matters...






**Mapping Formulation** 

**Scheduling Formulation** 
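To illustrate why mapping matters on a heterogeneous platform, the sketch below exhaustively assigns the layers of a tiny two-branch model to two devices and picks the assignment with the lowest compute-plus-communication cost. It is a toy under stated assumptions, not the formulation from the cited papers: the layer names, latencies, transfer cost, and the purely additive cost model (which ignores scheduling and parallel execution) are all hypothetical.

```python
from itertools import product

# Hypothetical per-layer latency (ms) on each device type.
COMPUTE_MS = {
    "conv_backbone": {"fpga": 2.0, "gpu": 1.0},
    "lstm_branch": {"fpga": 1.5, "gpu": 3.0},
    "fusion_head": {"fpga": 1.0, "gpu": 0.8},
}
# Data dependencies between layers: (producer, consumer).
EDGES = [("conv_backbone", "fusion_head"), ("lstm_branch", "fusion_head")]
TRANSFER_MS = 0.7  # assumed cost whenever an edge crosses devices


def mapping_cost(mapping):
    """Sum of per-layer compute latency plus cross-device transfer cost."""
    compute = sum(COMPUTE_MS[layer][dev] for layer, dev in mapping.items())
    comm = sum(TRANSFER_MS for src, dst in EDGES if mapping[src] != mapping[dst])
    return compute + comm


def best_mapping():
    """Brute-force search over all layer-to-device assignments."""
    layers = list(COMPUTE_MS)
    devices = ["fpga", "gpu"]
    candidates = (
        dict(zip(layers, combo)) for combo in product(devices, repeat=len(layers))
    )
    return min(candidates, key=mapping_cost)


best = best_mapping()
print(best, mapping_cost(best))
```

Even in this small case the cheapest assignment splits the model across devices, and the communication term decides where the fusion layer should sit. Brute force only works for a handful of layers, which is why larger models call for a dedicated mapping formulation.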

# An Example of NAIS + Mapping Formulation







**Mapping Formulation** 



Hao, Cong, and Deming Chen. "Software/Hardware Co-design for Multi-modal Multi-task Learning in Autonomous Systems." In 2021 IEEE 3rd AICAS, 2021.

### NAIS + Scheduling + Mapping







Mapping

Scheduling





Zhang, Xinyi, Cong Hao, et al. "H2H: Heterogeneous Model to Heterogeneous System Mapping with Computation and Communication Awareness." To appear at DAC'22.

### NAIS + MMMT + Heterogeneity









#### Mapping

Hao, Cong, and Deming Chen. "Software/Hardware Co-design for Multi-modal Multi-task Learning in Autonomous Systems." IEEE 3rd AICAS, 2021.

#### Scheduling

Zhang, Xinyi, Cong Hao, et al. "H2H: Heterogeneous Model to Heterogeneous System Mapping with Computation and Communication Awareness." To appear at DAC'22.

#### Implementation Optimization

Li, Yuhong, Cong Hao, et al. "EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions." ACM/IEEE DAC, 2020.

### **Summary & Thanks!**



- **Basic:** DNN and accelerator co-design – three levels
- **NAIS:** simultaneous neural architecture and implementation co-search
- **Future:** when multi-modal multi-task (MMMT) models meet heterogeneous platforms

