#### Memristor Empowered Ultra-fast Baseband Processing

Kaibin Huang

Dept. of Electrical & Electronic Engineering, The University of Hong Kong, Hong Kong



Acknowledgement: Parts of this presentation were created by Zhongrui Wang and Qunsong Zeng.

## 6G — Fusion of Communication and Computing





## Revolution in Computing — "Living on the Edge"



About 150 trillion gigabytes of data will need analysis by 2025 (Forbes)



Machine Learning

Artificial Intelligence

#### From Shannon 1.0 to Edge AI

#### Shannon 1.0 — Rate Maximization

"Given a constraint on distortion,

transmit as much data as possible"



#### <u>Shannon 2.0 – Fast Edge Intelligence</u>

"Given a constraint on learning/decision accuracy,

distill or use intelligence as fast as possible"



G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Towards an Intelligent Edge: Wireless Communication Meets Machine Learning", IEEE Commun. Magazine, 2020.

#### 6G — Shannon Meets Turing





Alan Turing (Father of AI)



#### 6G Sub-millisecond Latency





Martin Cooper with 1G Phone

#### Question A: Is analog communication dead?

#### Over-the-Air Computing



G. Zhu, Y. Wang, and K. Huang, "Broadband Analog Aggregation for Low-Latency Federated Edge Learning," IEEE TWC, 2020.
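The idea behind over-the-air computing can be illustrated with a small numerical sketch. It assumes an idealised multiple-access channel (perfect synchronisation, channel-inversion power control, additive Gaussian noise); the device count, vector length, and noise level are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_devices = 8   # hypothetical number of edge devices
dim = 4           # hypothetical length of each local update

# Local updates (e.g. gradients) held by the devices.
local_updates = rng.normal(size=(num_devices, dim))

# Idealised multiple-access channel: simultaneously transmitted analog
# waveforms superpose, so the receiver observes their sum plus noise.
noise_std = 0.01  # assumed receiver noise level
received = local_updates.sum(axis=0) + noise_std * rng.normal(size=dim)

# One channel use per entry yields the average, whatever the device count.
air_average = received / num_devices
true_average = local_updates.mean(axis=0)
max_error = np.max(np.abs(air_average - true_average))
print(max_error)
```

The aggregation error comes only from the channel noise, scaled down by the number of devices, which is why the latency of analog aggregation does not grow with the number of participants in this idealised model.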

#### Turning Channel Noise into Accelerator



Z. Zhang, G. Zhu, R. Wang, V. K. N. Lau, and K. Huang, "Turning Channel Noise into an Accelerator for Over-the-Air Principal Component Analysis," IEEE TWC, 2022.

#### Noise Tolerance of Edge Inference

[Figure: left panel, Class 1 and Class 2 samples in the $(\mathbf{x})_1$ / $(\mathbf{x})_2$ plane with the classifier margin marked; right panel, accuracy (%) versus additive noise variance]

#### Margin of Classifier
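A rough way to see why a margin gives noise tolerance is to perturb well-separated samples and watch the accuracy of a fixed linear classifier. The sketch below is not taken from the slides: the classifier `w`, the margin of 0.3, and the sample layout are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(5)

w = np.array([1.0, 1.0]) / np.sqrt(2)         # hypothetical unit-norm classifier
tangent = np.array([1.0, -1.0]) / np.sqrt(2)  # direction along the boundary
margin = 0.3                                  # assumed minimum distance to boundary
n = 2000

# Samples sit at least `margin` away from the boundary w.x = 0.
labels = rng.choice([-1, 1], size=n)
offsets = margin + rng.uniform(0.0, 0.5, size=n)
spread = rng.uniform(-0.5, 0.5, size=n)
points = (labels * offsets)[:, None] * w + spread[:, None] * tangent

def accuracy(noise_var):
    """Classification accuracy after adding Gaussian noise of given variance."""
    noisy = points + np.sqrt(noise_var) * rng.normal(size=points.shape)
    return np.mean(np.sign(noisy @ w) == labels)

acc_clean = accuracy(0.0)
acc_small = accuracy(0.001)   # noise std well below the margin
acc_large = accuracy(1.0)     # noise std well above the margin
print(acc_clean, acc_small, acc_large)
```

Accuracy stays essentially unchanged while the noise standard deviation is small relative to the margin and degrades once the noise is comparable to or larger than it.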

#### Question B - Is analog computing dead?

| Analog Computer              | Digital Computer                                 |
|------------------------------|--------------------------------------------------|
| Specialized for one problem  | Flexible due to Boolean algebra                  |
| Small errors can accumulate  | Resilient to noise                               |
| Cannot get same answer twice | Reproducible results                             |
| ?                            | Advent of solid-state electronics<br>allows VLSI |





Alan Turing used an analog computer to crack the Nazi Enigma code in WWII



Cerebras wafer-scale chip: 1.2 trillion transistors and 400,000 AI cores.

## Outline

#### I. Analog Neuromorphic Computing

#### II. Memristor Empowered Ultra-fast Baseband

## The digital "brain"



#### Are digital chips approaching the performance of the brain?



[1] NVIDIA

[2] S. Furber, J. Neural Eng. 13, 051001 (2016)

[3] Horowitz, ISSCC 2014

## Why Neuromorphic Computing



- $10^{10}$ transistors vs. $10^{15}$ synapses
- $10^{-12}$ vs. $10^{-15}$ J per operation

#### Von Neumann Bottleneck





#### Semiconductor device fabrication



MOSFET scaling (process nodes):

| Process node  | Year  |
|---------------|-------|
| 10 µm         | 1971  |
| 6 µm          | 1974  |
| 3 µm          | 1977  |
| 1.5 µm        | 1981  |
| 1 µm          | 1984  |
| 800 nm        | 1987  |
| 600 nm        | 1990  |
| 350 nm        | 1993  |
| 250 nm        | 1996  |
| 180 nm        | 1999  |
| 130 nm        | 2001  |
| 90 nm         | 2003  |
| 65 nm         | 2005  |
| 45 nm         | 2007  |
| 32 nm         | 2009  |
| 22 nm         | 2012  |
| 14 nm         | 2014  |
| 10 nm         | 2016  |
| 7 nm          | 2018  |
| 5 nm          | 2020  |
| 3 nm          | 2022  |
| 2 nm (future) | ~2024 |

#### Approaching Transistor Scaling Limit

## Possible Solution — In-Memory Computing



Z. Wang et al., Nat. Rev. Mater. doi:10.1038/s41578-019-0159-3

## Why memristors? Stack-ability and Scalability



P. Lin et al., Nat. Nanotechnol. 14, 35-39



P. Lin et al., Nat. Electron., in press

#### Outline

#### I. Analog Neuromorphic Computing







- Zhongrui Wang
- Qunsong Zeng
- Jiawei Liu



Q. Zeng, J. Liu, J. Lan, Y. Gong, Z. Wang, Y. Li, and K. Huang, "Realizing Ultra-Fast and Energy-Efficient Baseband Processing Using Analogue Switching Memory", [Online] http://arxiv.org/abs/2205.03561.

#### In-Memory Empowered Ultra-Fast 6G Communication



#### Fabrication: Resistive Random-Access Memory



#### **MIMO-OFDM** Transceiver



- Key modules of MIMO-OFDM transceiver:
  - OFDM: Orthogonal frequency-division multiplexing
    - IDFT: inverse discrete Fourier transform (Tx)
    - DFT: discrete Fourier transform (Rx)
  - MIMO: Multiple-input multiple-output
    - MIMO detection: recover signal by channel inversion (Rx)
    - Channel estimation: obtain channel state information (Rx)
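The OFDM half of this chain can be sketched in a few lines: an IDFT at the transmitter, a cyclic prefix, a multipath channel, and a DFT with one-tap equalisation at the receiver. This is a noiseless single-antenna illustration only; the subcarrier count, prefix length, and channel taps are hypothetical, and the channel is assumed to be known at the receiver.

```python
import numpy as np

rng = np.random.default_rng(1)
num_subcarriers = 16   # hypothetical N_c
cp_len = 4             # hypothetical cyclic-prefix length

# QPSK symbols, one per subcarrier.
bits = rng.integers(0, 2, size=(num_subcarriers, 2))
symbols = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)

# Tx: IDFT, then prepend the cyclic prefix.
time_signal = np.fft.ifft(symbols)
tx = np.concatenate([time_signal[-cp_len:], time_signal])

# Hypothetical 3-tap multipath channel, shorter than the cyclic prefix.
taps = np.array([1.0, 0.5, 0.25j])
rx = np.convolve(tx, taps)[: len(tx)]

# Rx: drop the cyclic prefix, DFT, then divide by the channel response
# on each subcarrier (one-tap equalisation).
freq_rx = np.fft.fft(rx[cp_len:])
channel_freq = np.fft.fft(taps, num_subcarriers)
recovered = freq_rx / channel_freq

print(np.allclose(recovered, symbols))
```

In a MIMO-OFDM receiver the per-subcarrier division would be replaced by MIMO detection, which is where the matrix inversion listed above comes in.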

## **Design:** DFT Module



#### **Highlight:**

- DFT operation in **one step** (i.e., $O(1)$ time complexity).
- Traditional FFT algorithms have complexity $O(N_c \log N_c)$.
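The one-step property relies on the DFT being a single matrix-vector product, which a crossbar array evaluates in one analog read. The sketch below only checks the mathematical equivalence with an FFT in NumPy; it does not model the hardware, and on a digital machine the matrix-vector product costs $O(N_c^2)$ operations. The helper name `dft_matrix` and the size `n_c = 8` are illustrative.

```python
import numpy as np

def dft_matrix(n):
    """Return the n-by-n DFT matrix with entries exp(-2*pi*j*k*m/n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

n_c = 8  # hypothetical number of subcarriers
x = np.random.default_rng(2).normal(size=n_c) + 0j

W = dft_matrix(n_c)
X_matvec = W @ x        # what the crossbar would compute in one step
X_fft = np.fft.fft(x)   # digital reference

print(np.allclose(X_matvec, X_fft))
```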



## Validation: OFDM System

- Hardware implementation of the OFDM system:



## DFT matrix written into differential RRAM arrays



Real mapped DFT matrix (µS)
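Because conductances cannot be negative, a signed matrix is commonly stored as the difference of two non-negative arrays. The sketch below shows one such differential mapping for the real part of the DFT matrix. The exact scheme used in the hardware is not given here, so the helper `to_differential`, the zero baseline, and the 100 µS maximum conductance are assumptions.

```python
import numpy as np

def to_differential(matrix, g_max=100.0):
    """Map a real matrix onto two non-negative conductance arrays (in µS).

    g_max is an assumed maximum device conductance, not a measured value.
    """
    scale = g_max / np.max(np.abs(matrix))
    g_pos = np.where(matrix > 0, matrix, 0.0) * scale   # positive entries
    g_neg = np.where(matrix < 0, -matrix, 0.0) * scale  # magnitudes of negatives
    return g_pos, g_neg, scale

n_c = 8
k = np.arange(n_c)
W = np.exp(-2j * np.pi * np.outer(k, k) / n_c)
g_pos, g_neg, scale = to_differential(W.real)

# Input voltages in arbitrary units; output is the difference of the two
# array currents, rescaled back to the original matrix units.
v = np.random.default_rng(3).normal(size=n_c)
i_out = (g_pos @ v - g_neg @ v) / scale

print(np.allclose(i_out, W.real @ v))
```

The imaginary part of the matrix can be handled the same way with a second pair of arrays.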



#### Matrix Inversion Using RRAM Crossbar



Sun, Zhong, et al., "One-step Regression and Classification with Cross-Point Resistive Memory Arrays", Science Advances, 2020.

#### **Design:** MIMO Detection Module



#### **Highlight:**

- MIMO detection in **one step** (i.e., $O(1)$ time complexity).
- Conventional computational complexity is $O(N^3)$.



#### **L-MMSE detection:**

$$\hat{\mathbf{x}} = \left(\mathbf{H}^{\mathsf{H}}\mathbf{H} + \frac{1}{\mathsf{SNR}}\mathbf{I}\right)^{-1}\mathbf{H}^{\mathsf{H}}\mathbf{y}$$

- $\mathsf{SNR} \propto (g_1 g_2)^{-1}$
- L-MMSE ⇒ ZF by turning off the transistors
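For reference, the L-MMSE formula and its zero-forcing limit can be written digitally as follows. This is a NumPy sketch of the mathematics only, not of the analog circuit; the antenna counts, the random channel, and the function names `lmmse_detect` and `zf_detect` are illustrative assumptions.

```python
import numpy as np

def lmmse_detect(H, y, snr):
    """L-MMSE estimate: (H^H H + I/SNR)^{-1} H^H y."""
    n_tx = H.shape[1]
    gram = H.conj().T @ H + np.eye(n_tx) / snr
    return np.linalg.solve(gram, H.conj().T @ y)

def zf_detect(H, y):
    """Zero-forcing estimate: the L-MMSE formula without the 1/SNR term."""
    return np.linalg.solve(H.conj().T @ H, H.conj().T @ y)

rng = np.random.default_rng(4)
n_rx, n_tx = 6, 4  # hypothetical antenna counts
H = (rng.normal(size=(n_rx, n_tx)) + 1j * rng.normal(size=(n_rx, n_tx))) / np.sqrt(2)
x = (rng.choice([-1, 1], size=n_tx) + 1j * rng.choice([-1, 1], size=n_tx)) / np.sqrt(2)
y = H @ x  # noiseless observation, so ZF should recover x exactly

x_zf = zf_detect(H, y)
x_lmmse_high = lmmse_detect(H, y, snr=1e12)  # very high SNR approaches ZF

print(np.allclose(x_zf, x), np.allclose(x_lmmse_high, x_zf, atol=1e-6))
```

Dropping the $\frac{1}{\mathsf{SNR}}\mathbf{I}$ regularisation term is what turns the L-MMSE detector into the ZF detector.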

## Validation: MIMO System

- Hardware implementation of the MIMO system:



#### **Performance Evaluation:** Complete System

[Figure: complete-system comparison between the digital processor (benchmark) and RRAM-based baseband processing; channel SNR = 30 dB]

**Verification is necessary!**

#### System Performance Improvements

<u>100 Times Faster and More Energy Efficient</u>

| Processor                     | Latency (ms) | Energy (mJ) |
|-------------------------------|--------------|-------------|
| Qualcomm Snapdragon X65       | <10          | N/A         |
| Domain Adaptive Processor [1] | 28.56        | 27.71       |
| Combined FFT [2] + MIMO [3]   | 23.16        | 22.98       |
| Our RRAM-based processor      | 0.2322       | 0.01015     |

[1] K.-Y. Chen, *et al.*, "A 507 GMACs/J 256-Core Domain Adaptive Systolic-Array-Processor for Wireless Communication and Linear-Algebra Kernels in 12nm FINFET", *Proc. VLSI Techn. Circuits*, 2022.

[2] S. Liu, *et al.*, "A high-flexible low-latency memory-based FFT processor for 4G, WLAN, and future 5G", *IEEE Trans. VLSI Syst.*, vol. 27, no. 3, pp. 511-523, 2018.

[3] W. Tang, *et al.*, "A 2.4-mm<sup>2</sup> 130-mW MMSE-Nonbinary LDPC Iterative Detector Decoder for 4×4 256-QAM MIMO in 65-nm CMOS", *IEEE J. Solid-State Circuits*, vol. 54, no. 7, pp. 2070-2080, 2019.
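The improvement factors implied by the table can be checked with simple arithmetic. The snippet below uses only the latency and energy figures listed above; the Snapdragon X65 row is left out because it gives only an upper bound on latency and no energy value.

```python
# Latency (ms) and energy (mJ) as listed in the table above.
baselines = {
    "Domain Adaptive Processor [1]": (28.56, 27.71),
    "Combined FFT [2] + MIMO [3]": (23.16, 22.98),
}
rram_latency_ms, rram_energy_mj = 0.2322, 0.01015

# Ratio of each baseline to the RRAM-based processor.
ratios = {
    name: (latency / rram_latency_ms, energy / rram_energy_mj)
    for name, (latency, energy) in baselines.items()
}

for name, (speedup, energy_gain) in ratios.items():
    print(f"{name}: {speedup:.0f}x lower latency, {energy_gain:.0f}x lower energy")
```

Against both digital baselines the latency ratio is on the order of 100, while the energy ratio exceeds 2000.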

#### System Performance Improvements

- Latency: several microseconds (µs)
- Energy: several microjoules (µJ)
- Performance: approaches that of digital baseband

#### Memristor Models:

- Ferroelectric field-effect transistor (FeFET)
- Ferroelectric tunnel junction (FTJ)







## Thank You

