Neural Networks on Silicon
Fengbin Tu is currently an Adjunct Assistant Professor in the Department of Electronic and Computer Engineering at The Hong Kong University of Science and Technology. He is also a Postdoctoral Fellow at the AI Chip Center for Emerging Smart Systems (ACCESS), working with Prof. Tim Cheng and Prof. Chi-Ying Tsui. He received his Ph.D. degree from the Institute of Microelectronics, Tsinghua University, under the supervision of Prof. Shaojun Wei and Prof. Shouyi Yin. From 2019 to 2022, he worked with Prof. Yuan Xie and Prof. Yufei Ding as a Postdoctoral Scholar at the Scalable Energy-efficient Architecture Lab (SEAL), University of California, Santa Barbara. For more information about Dr. Tu, please refer to his homepage. Dr. Tu's main research interest is chip and architecture design for AI. This is an exciting field where fresh ideas come out every day, so he's collecting works on related topics. Welcome to join!
Table of Contents

My Contributions

Conference Papers
 2014: ASPLOS, MICRO
 2015: ISCA, ASPLOS, FPGA, DAC
 2016: ISSCC, ISCA, MICRO, HPCA, DAC, FPGA, ICCAD, DATE, ASP-DAC, VLSI, FPL
 2017: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, FPGA, ICCAD, DATE, VLSI, FCCM, HotChips
 2018: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, FPGA, ICCAD, DATE, ASP-DAC, VLSI, HotChips
 2019: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, FPGA, ICCAD, ASP-DAC, VLSI, HotChips, A-SSCC
 2020: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, FPGA, ICCAD, VLSI, HotChips
 2021: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, ICCAD, VLSI, HotChips
 2022: ISSCC, ISCA, MICRO, HPCA, ASPLOS, HotChips
 2023: ISSCC, HPCA
My Contributions
My main research interest is chip and architecture design for AI. For more information about me and my research, please visit my homepage.
[Feb. 2023] Two Computing-In-Memory AI chips will appear at ISSCC'23.
[Feb. 2022] Reconfigurable Digital Computing-In-Memory AI Chip.
 I designed an innovative AI chip architecture, Reconfigurable Digital Computing-In-Memory. The architecture fuses the philosophy of reconfigurable computing with digital computing-in-memory, balancing efficiency, accuracy, and flexibility for emerging AI chips. I designed two 28nm chips based on the new architecture, Reconfigurable Digital CIM (ReDCIM) and Transformer CIM (TranCIM).
 ReDCIM (pronounced "red-CIM") is the first CIM chip for cloud AI with flexible FP/INT support, and was covered by Synced. TranCIM is the first CIM chip for Transformer models, tackling the memory and computation challenges raised by Transformer's attention mechanism.
 ReDCIM: A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration (ISSCC'22, extended to JSSC'23)
 ReDCIM is designed for cloud AI, with flexible FP/INT support and three features from top to bottom.
 ReDCIM is built around an in-memory alignment-free FP MAC pipeline that interleaves exponent alignment with INT mantissa MAC. Both inputs and weights are pre-aligned to their local maximum exponents, so the CIM macro handles only MAC acceleration, without complex alignment logic.
 A Bitwise in-Memory Booth Multiplication (BM^2) architecture is designed with bitwise input Booth encoding in the BM^2 controller and partial product recoding in the SRAM-CIM macro, which reduces cycle count and bitwise multiplications by nearly 50%.
 ReDCIM implements hierarchical and reconfigurable in-memory accumulators to enable flexible support of BF16 (BFloat16)/FP32 and INT8/16 in the same CIM macro.
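The pre-alignment idea can be sketched in a few lines of Python — a behavioral model only (the function names, mantissa width, and use of Python floats are my illustrative assumptions, not the chip's datapath):

```python
import math

def prealign(values, mant_bits=8):
    """Pre-align a group of numbers to the group's maximum exponent,
    giving integer mantissas plus one shared exponent."""
    if not any(values):
        return [0] * len(values), 0
    max_exp = max(math.frexp(v)[1] for v in values if v != 0.0)
    # Shift everything right to the shared exponent; small values lose low bits.
    mants = [int(round(v * 2 ** (mant_bits - max_exp))) for v in values]
    return mants, max_exp

def fp_mac_via_int(inputs, weights, mant_bits=8):
    """FP dot product as a pure-INT MAC: the CIM array only ever sees
    integers, and a single rescale happens at the very end."""
    im, ie = prealign(inputs, mant_bits)
    wm, we = prealign(weights, mant_bits)
    acc = sum(a * b for a, b in zip(im, wm))       # INT-only MAC
    return acc * 2.0 ** (ie + we - 2 * mant_bits)  # one exponent fix-up
```

Because both operand groups share their local maximum exponents, no per-element alignment happens inside the MAC loop — which is the point of the alignment-free pipeline.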
 TranCIM: A 28nm 15.59$\mu$J/Token Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes (ISSCC'22, extended to JSSC'23)
 TranCIM has three features targeting the challenges raised by the attention mechanism of Transformer models.
 TranCIM connects its CIM engines through a reconfigurable streaming network (RSN) with dedicated modes for different layers in Transformer: pipeline mode for attention layers and parallel mode for fully-connected layers.
 TranCIM's SRAM-CIM macro is designed with a bitline-transpose structure to align the directions of input feeding and weight writing. Thus, in the QK^T pipeline mode, transposing K is realized without additional storage or buffer access.
 TranCIM implements a sparse attention scheduler (SAS) to dynamically configure the CIM workload for different sparse attention patterns.
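A toy software analogy of the sparse attention scheduling (the per-row column sets stand in for SAS-configured CIM workloads; all names and the granularity are illustrative, not the chip's):

```python
import math

def sparse_attention(Q, K, V, active_cols):
    """Toy sparse attention: for each query row i, a scheduler supplies the
    key columns to visit (active_cols[i]); pruned columns are never
    multiplied, loosely mirroring how a sparse attention scheduler
    configures the workload per pattern."""
    out = []
    for i, q in enumerate(Q):
        cols = active_cols[i]
        # Q.K^T restricted to the scheduled columns only
        scores = [sum(a * b for a, b in zip(q, K[j])) for j in cols]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        # weighted sum of the corresponding V rows
        out.append([sum(p * V[j][d] for p, j in zip(probs, cols))
                    for d in range(len(V[0]))])
    return out
```

In hardware, K is read along bitlines, so the transpose needed for Q.K^T costs no extra buffer; in this sketch that detail simply disappears into indexing.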
[Aug. 2020] Evolver, Evolvable AI Chip.

Evolver: A Deep Learning Processor with On-Device Quantization-Voltage-Frequency Tuning (JSSC'21)
 I designed a 28nm evolvable AI chip (Evolver) with DNN training and reinforcement learning capabilities, to enable intelligence evolution over the chip's long lifetime. This work demonstrates a lifelong learning example of on-device quantization-voltage-frequency (QVF) tuning. Compared with conventional QVF tuning that determines policies offline, Evolver makes optimal customizations for varying local user scenarios. To improve the performance and energy efficiency of both DNN training and inference, we introduce three techniques at the architecture level.
 Evolver contains a reinforcement learning unit (RLU) that searches QVF policies based on direct feedback. An outlier-skipping scheme is proposed to avoid unnecessary training for invalid policies under the profiled latency and energy constraints.
 We exploit the inherent sparsity of feature/error maps in DNN training's feed-forward and back-propagation passes, and design a bidirectional speculation unit (BSU) to capture runtime sparsity and discard zero-output computation, thus reducing training cost. The feed-forward speculation also benefits inference.
 Since runtime sparsity causes time-varying workload parallelism that harms performance and efficiency, we design a reconfigurable computing engine (RCE) with an online configuration compiler (OCC) for Evolver, to dynamically reconfigure dataflow parallelism to match workload parallelism.
 Evolver was nominated for the 2021 Top-10 Research in China's Semiconductors.
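The zero-skipping idea behind the BSU can be sketched as a scalar model (hypothetical names; the chip speculates per PE at runtime):

```python
def sparse_mac(vec, weights):
    """Skip zero entries of the input vector before multiplying: the core of
    zero-output elimination. Returns the result and how many MACs actually ran."""
    nonzero = [(i, v) for i, v in enumerate(vec) if v != 0.0]  # speculation step
    return sum(v * weights[i] for i, v in nonzero), len(nonzero)

# ReLU makes feed-forward activations sparse; error maps are sparse in
# back-propagation too, so the same skip helps both passes.
acts = [0.0, 1.5, 0.0, 0.0, 2.0]
w = [0.1, 0.2, 0.3, 0.4, 0.5]
out, macs = sparse_mac(acts, w)  # only 2 of the 5 MACs execute
```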
[Jun. 2018] RANA, Software-Hardware Co-design for AI Chip Memory Optimization.

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM (ISCA'18)
 I designed a retention-aware neural acceleration (RANA) framework, which strengthens DNN accelerators with refresh-optimized eDRAM to save total system energy. RANA includes three techniques, from the training, scheduling, and architecture levels respectively.

Training Level: A retention-aware training method is proposed to improve eDRAM's tolerable retention time with no accuracy loss. Bit-level retention errors are injected during training, so the network's tolerance to retention failures is improved. A higher tolerable failure rate leads to a longer tolerable retention time, so more refresh operations can be removed.
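The bit-level error injection can be sketched as follows (a simplified stand-in: a uniform per-bit flip probability on int8 weights, whereas RANA injects errors according to the profiled retention-failure distribution):

```python
import random

def inject_retention_errors(weights, p_fail, bits=8):
    """Flip each stored bit of a quantized weight with probability p_fail,
    mimicking eDRAM retention failures during training."""
    out = []
    for w in weights:                    # w: signed int, e.g. in [-128, 127]
        q = w & ((1 << bits) - 1)        # two's-complement bit pattern
        for b in range(bits):
            if random.random() < p_fail:
                q ^= 1 << b              # a retention failure flips this bit
        if q >= 1 << (bits - 1):         # reinterpret as signed
            q -= 1 << bits
        out.append(q)
    return out
```

Training against such injected faults is what pushes up the tolerable failure rate, and with it the tolerable retention time.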

Scheduling Level: A system energy consumption model is built that accounts for computing energy, on-chip buffer access energy, refresh energy, and off-chip memory access energy. RANA schedules networks in a hybrid computation pattern based on this model: each layer is assigned the computation pattern that costs the least energy.
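The hybrid scheduling reduces to a per-layer argmin over modeled energies; here is a sketch with made-up cost models (the coefficients and term breakdown are stand-ins, not RANA's calibrated model):

```python
def schedule(layers, patterns):
    """Per layer, pick the computation pattern with the lowest modeled total
    energy (compute + buffer access + refresh + off-chip DRAM access)."""
    return [min(patterns, key=lambda p: sum(p(layer).values())).__name__
            for layer in layers]

# Toy cost models with invented coefficients (stand-ins for calibrated ones).
def input_reuse(layer):
    return {"compute": layer["macs"], "buffer": 2 * layer["ofmap"],
            "refresh": layer["ifmap"], "dram": layer["weights"]}

def weight_reuse(layer):
    return {"compute": layer["macs"], "buffer": 2 * layer["ifmap"],
            "refresh": layer["weights"], "dram": layer["ofmap"]}
```

The point is structural: the schedule is hybrid because the argmin is taken independently per layer, so different layers can land on different patterns.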

Architecture Level: RANA independently disables refresh for individual eDRAM banks based on the lifetime of the data they store, saving further refresh energy. A programmable eDRAM controller is proposed to enable these fine-grained refresh controls.
 RANA was the only work first-authored by a Chinese research team at ISCA'18, and was covered by Tsinghua University News and AI Tech Talk.
[Apr. 2017] DNA and Thinker, Reconfigurable AI Chips.
 DNA: Deep Convolutional Neural Network Architecture with Reconfigurable Computation Patterns (TVLSI popular paper: ranked No. 5/2/6/8/8 among the most-downloaded manuscripts in 2017-2021; monthly No. 1 six times since Sep. 2017.)
 I designed a deep convolutional neural network accelerator (DNA) targeting flexible and efficient CNN acceleration. This is the first work to assign Input/Output/Weight Reuse to different layers of a CNN, optimizing system-level energy consumption based on each layer's CONV parameters. DNA has two main features, at the architecture and scheduling levels respectively.
 A 4-level CONV engine is designed to support different tiling parameters for higher resource utilization and performance.
 A layer-based scheduling framework is proposed to optimize both system-level energy efficiency and performance.
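The layer-based reuse selection can be sketched with a first-order DRAM-traffic model (illustrative only — DNA's real model also counts on-chip buffer energy and uses measured access costs):

```python
def best_reuse(layer):
    """Choose Input/Output/Weight Reuse for one CONV layer by comparing
    modeled off-chip traffic: whichever datum stays on chip is fetched once,
    while the other two are reloaded (the factor 2 is an arbitrary stand-in)."""
    H, W, C, K, R = (layer[k] for k in ("H", "W", "C", "K", "R"))
    ifmap, ofmap, wts = H * W * C, H * W * K, R * R * C * K
    traffic = {
        "input":  ifmap + 2 * ofmap + 2 * wts,
        "output": 2 * ifmap + ofmap + 2 * wts,
        "weight": 2 * ifmap + 2 * ofmap + wts,
    }
    return min(traffic, key=traffic.get)

conv = {"H": 56, "W": 56, "C": 64, "K": 128, "R": 3}  # big feature maps
fc = {"H": 1, "W": 1, "C": 4096, "K": 4096, "R": 1}   # big weights
```

Even this crude model reproduces the intuition: map-heavy early CONV layers favor keeping feature maps on chip, while weight-heavy FC-like layers favor Weight Reuse.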
 Thinker: A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications (JSSC'18)
Conference Papers
This is a collection of conference papers that interest me. They are focused on, but not limited to, neural networks on silicon.
2014 ASPLOS

DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. (CAS, Inria)
2014 MICRO

DaDianNao: A Machine-Learning Supercomputer. (CAS, Inria, Inner Mongolia University)
2015 ISCA

ShiDianNao: Shifting Vision Processing Closer to the Sensor. (CAS, EPFL, Inria)
2015 ASPLOS

PuDianNao: A Polyvalent Machine Learning Accelerator. (CAS, USTC, Inria)
2015 FPGA

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. (Peking University, UCLA)
2015 DAC
 Reno: A Highly-Efficient Reconfigurable Neuromorphic Computing Accelerator Design. (University of Pittsburgh, Tsinghua University, San Francisco State University, Air Force Research Laboratory, University of Massachusetts)
 Scalable Effort Classifiers for Energy Efficient Machine Learning. (Purdue University, Microsoft Research)
 Design Methodology for Operating in Near-Threshold Computing (NTC) Region. (AMD)
 Opportunistic Turbo Execution in NTC: Exploiting the Paradigm Shift in Performance Bottlenecks. (Utah State University)
2016 DAC

DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. (Chinese Academy of Sciences)

Hardware generator: Basic building blocks for neural networks, and an address generation unit (RTL).

Compiler: Dynamic control flow (configurations for different models), and data layout in memory.

Simply reports their framework and describes some of its stages.

CBrain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-Level Parallelization. (Chinese Academy of Sciences)

Simplifying Deep Neural Networks for Neuromorphic Architectures. (Incheon National University)

Dynamic Energy-Accuracy Trade-off Using Stochastic Computing in Deep Neural Networks. (Samsung, Seoul National University, Ulsan National Institute of Science and Technology)

Optimal Design of JPEG Hardware under the Approximate Computing Paradigm. (University of Minnesota, TAMU)
 PerformML: Performance Optimized Machine Learning by Platform and Content Aware Customization. (Rice University, UCSD)
 Low-Power Approximate Convolution Computing Unit with Domain-Wall Motion Based "Spin-Memristor" for Image Processing Applications. (Purdue University)
 Cross-Layer Approximations for Neuromorphic Computing: From Devices to Circuits and Systems. (Purdue University)
 Switched by Input: Power Efficient Structure for RRAM-based Convolutional Neural Network. (Tsinghua University)
 A 2.2 GHz SRAM with High Temperature Variation Immunity for Deep Learning Application under 28nm. (UCLA, Bell Labs)
2016 ISSCC

A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems. (KAIST)

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. (MIT, NVIDIA)
 A 126.1mW Real-Time Natural UI/UX Processor with Embedded Deep Learning Core for Low-Power Smart Glasses Systems. (KAIST)
 A 502GOPS and 0.984mW Dual-Mode ADAS SoC with RNN-FIS Engine for Intention Prediction in Automotive Black-Box System. (KAIST)
 A 0.55V 1.1mW Artificial-Intelligence Processor with PVT Compensation for Micro Robots. (KAIST)
 A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications. (Waseda University)
2016 ISCA

Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing. (University of Toronto, University of British Columbia)

EIE: Efficient Inference Engine on Compressed Deep Neural Network. (Stanford University, Tsinghua University)

Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators. (Harvard University)

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. (MIT, NVIDIA)

Present an energy analysis framework.

Propose an energy-efficient dataflow called Row Stationary, which considers three levels of reuse.

Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. (Georgia Institute of Technology, SRI International)

Propose an architecture integrated in 3D DRAM, with a mesh-like NoC in the logic layer.

Describe the data movements in the NoC in detail.
 ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. (University of Utah, HP Labs)

An advance over ISAAC has been published in "Newton: Gravitating Towards the Physical Limits of Crossbar Acceleration" (IEEE Micro).
 A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory. (UCSB, HP Labs, NVIDIA, Tsinghua University)
 RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. (Rice University)
 Cambricon: An Instruction Set Architecture for Neural Networks. (Chinese Academy of Sciences, UCSB)
2016 DATE

The Neuro Vector Engine: Flexibility to Improve Convolutional Network Efficiency for Wearable Vision. (Eindhoven University of Technology, Soochow University, TU Berlin)

Propose a SIMD accelerator for CNNs.

Efficient FPGA Acceleration of Convolutional Neural Networks Using Logical-3D Compute Array. (UNIST, Seoul National University)

The compute tile is organized along 3 dimensions: Tm, Tr, Tc.
 NEURODSP: A Multi-Purpose Energy-Optimized Accelerator for Neural Networks. (CEA LIST)
 MNSIM: Simulation Platform for Memristor-Based Neuromorphic Computing System. (Tsinghua University, UCSB, Arizona State University)
 Accelerated Artificial Neural Networks on FPGA for Fault Detection in Automotive Systems. (Nanyang Technological University, University of Warwick)
 Significance Driven Hybrid 8T-6T SRAM for Energy-Efficient Synaptic Storage in Artificial Neural Networks. (Purdue University)
2016 FPGA

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. [Slides][Demo] (Tsinghua University, MSRA)

The first work I have seen that runs the entire CNN flow, including both CONV and FC layers.

Point out that CONV layers are computation-centric, while FC layers are memory-centric.

The FPGA runs VGG16-SVD without reconfiguring its resources, but the convolver only supports k=3.

Dynamic-precision data quantization is creative, but not implemented in hardware.
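The dynamic-precision idea — a per-layer search for the best fixed-point fractional width — can be sketched as below (my simplified reading; the paper applies it per layer to weights and activations):

```python
def best_frac_bits(values, total_bits=8):
    """Per-layer dynamic fixed point: search the fractional bit-width that
    minimizes squared quantization error for this layer's value distribution."""
    best_frac, best_err = 0, float("inf")
    for frac in range(total_bits):
        scale = 1 << frac
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
        # Quantize with clipping, then measure the reconstruction error.
        q = [max(lo, min(hi, round(v * scale))) / scale for v in values]
        err = sum((a - b) ** 2 for a, b in zip(values, q))
        if err < best_err:
            best_frac, best_err = frac, err
    return best_frac
```

Layers with small-magnitude values get more fractional bits; layers with large dynamic range get fewer — which is why a single static format loses accuracy.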

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. [Slides] (Arizona State Univ, ARM)

Spatially allocate FPGA's resources to CONV/POOL/NORM/FC layers.
2016 ASP-DAC

Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks. (UC Davis)

LRADNN: High-Throughput and Energy-Efficient Deep Neural Network Accelerator using Low Rank Approximation. (Hong Kong University of Science and Technology, Shanghai Jiao Tong University)

Efficient Embedded Learning for IoT Devices. (Purdue University)
 ACR: Enabling Computation Reuse for Approximate Computing. (Chinese Academy of Sciences)
2016 VLSI

A 0.3-2.6 TOPS/W Precision-Scalable Processor for Real-Time Large-Scale ConvNets. (KU Leuven)

Use dynamic precision for different CONV layers, and scale down the MAC array's supply voltage at lower precision.

Prevent memory fetches and MAC operations based on ReLU sparsity.

A 1.40mm2 141mW 898GOPS Sparse Neuromorphic Processor in 40nm CMOS. (University of Michigan)
 A 58.6mW Real-Time Programmable Object Detector with Multi-Scale Multi-Object Support Using Deformable Parts Model on 1920x1080 Video at 30fps. (MIT)
 A Machine-learning Classifier Implemented in a Standard 6T SRAM Array. (Princeton)
2016 ICCAD

Efficient Memory Compression in Deep Neural Networks Using Coarse-Grain Sparsification for Speech Applications. (Arizona State University)

Memsqueezer: Re-architecting the On-chip Memory Subsystem of Deep Learning Accelerator for Embedded Devices. (Chinese Academy of Sciences)

Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. (Peking University, UCLA, Falcon)

Propose a uniformed convolutional matrix-multiplication representation for accelerating CONV and FC layers on FPGA.

Propose a weight-major convolutional mapping method for FC layers, which achieves good data reuse, DRAM access burst length, and effective bandwidth.

BoostNoC: Power Efficient Network-on-Chip Architecture for Near Threshold Computing. (Utah State University)
 Design of Power-Efficient Approximate Multipliers for Approximate Artificial Neural Networks. (Brno University of Technology)
 Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter Optimization. (McGill University)
2016 MICRO

From High-Level Deep Neural Models to FPGAs. (Georgia Institute of Technology, Intel)

Develop a macro dataflow ISA for DNN accelerators.

Develop hand-optimized template designs that are scalable and highly customizable.

Provide a Template Resource Optimization search algorithm to co-optimize the accelerator architecture and scheduling.

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. (NVIDIA)

Stripes: Bit-Serial Deep Neural Network Computing. (University of Toronto, University of British Columbia)

Introduce serial computation and reduced-precision computation to neural network accelerator designs, enabling accuracy vs. performance trade-offs.

Design a bit-serial computing unit that scales performance linearly with precision reduction.
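A behavioral sketch of that bit-serial unit (not the actual SIP datapath): the cycle count grows linearly with the activation precision used.

```python
def bitserial_dot(acts, weights, precision):
    """Bit-serial dot product: one activation bit is fed per cycle, so
    fewer precision bits directly means fewer cycles."""
    acc, cycles = 0, 0
    for b in range(precision):
        # AND each weight with the b-th activation bit, then shift-accumulate.
        acc += sum(((a >> b) & 1) * w for a, w in zip(acts, weights)) << b
        cycles += 1
    return acc, cycles
```

Dropping from 16-bit to 8-bit activations halves the cycles while the datapath stays unchanged — the accuracy/performance knob Stripes exposes.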

Cambricon-X: An Accelerator for Sparse Neural Networks. (Chinese Academy of Sciences)

NEUTRAMS: Neural Network Transformation and Co-design under Neuromorphic Hardware Constraints. (Tsinghua University, UCSB)

Fused-Layer CNN Accelerators. (Stony Brook University)

Fuse multiple CNN layers (CONV+POOL) to reduce DRAM access for input/output data.
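The fusion idea, reduced to 1-D for clarity (stride-1 CONV feeding non-overlapping max pooling; the real work fuses 2-D tile pyramids):

```python
def fused_conv_pool(row, kernel, pool=2):
    """Each CONV output is consumed by the pooling stage as soon as its
    window completes, so the intermediate feature map never hits DRAM."""
    k = len(kernel)
    conv_buf, out = [], []
    for i in range(len(row) - k + 1):
        conv_buf.append(sum(row[i + j] * kernel[j] for j in range(k)))
        if len(conv_buf) == pool:      # pool window full: emit and reuse buffer
            out.append(max(conv_buf))
            conv_buf.clear()
    return out
```

Only a pool-sized buffer lives between the two layers, which is exactly the DRAM-traffic saving the paper targets.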

Bridging the I/O Performance Gap for Big Data Workloads: A New NVDIMM-based Approach. (The Hong Kong Polytechnic University, NSF/University of Florida)

A Patch Memory System For Image Processing and Computer Vision. (NVIDIA)

An Ultra LowPower Hardware Accelerator for Automatic Speech Recognition. (Universitat Politecnica de Catalunya)
 Perceptron Learning for Reuse Prediction. (TAMU, Intel Labs)

Train neural networks to predict reuse of cache blocks.
 A Cloud-Scale Acceleration Architecture. (Microsoft Research)
 Reducing Data Movement Energy via Online Data Clustering and Encoding. (University of Rochester)
 The Microarchitecture of a Real-time Robot Motion Planning Accelerator. (Duke University)
 Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems. (UIUC, Seoul National University)
2016 FPL

A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Network. (Fudan University)

Overcoming Resource Underutilization in Spatial CNN Accelerators. (Stony Brook University)

Build multiple accelerators, each specialized for specific CNN layers, instead of a single accelerator with uniform tiling parameters.

Accelerating Recurrent Neural Networks in Analytics Servers: Comparison of FPGA, CPU, GPU, and ASIC. (Intel)
2016 HPCA

A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs. (Nanyang Technological University, HKUST, Cornell University)

TABLA: A Unified Template-based Architecture for Accelerating Statistical Machine Learning. (Georgia Institute of Technology)
 Memristive Boltzmann Machine: A Hardware Accelerator for Combinatorial Optimization and Deep Learning. (University of Rochester)
2017 FPGA

An OpenCL Deep Learning Accelerator on Arria 10. (Intel)

Minimum bandwidth requirement: All the intermediate data in AlexNet's CONV layers are cached in the on-chip buffer, so their architecture is compute-bound.

Reduced operations: Winograd transformation.

High usage of the available DSPs + reduced computation -> higher performance on FPGA -> competitive efficiency vs. Titan X.

ESE: Efficient Speech Recognition Engine for Compressed LSTM on FPGA. (Stanford University, DeepPhi, Tsinghua University, NVIDIA)

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. (Xilinx, Norwegian University of Science and Technology, University of Sydney)

Can FPGA Beat GPUs in Accelerating NextGeneration Deep Neural Networks? (Intel)

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. (Cornell University, UCLA, UCSD)

Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. (UW-Madison)

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System. (USC)

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. (Arizona State University)
2017 ISSCC

A 2.9TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems. (ST)

DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General Purpose Deep Neural Networks. (KAIST)

ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Computational Accuracy-Voltage-Frequency-Scalable Convolutional Neural Network Processor in 28nm FD-SOI. (KU Leuven)

A 288µW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence. (University of Michigan, CubeWorks)
 A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications. (Harvard)
 A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating. (MIT)
 A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector. (KAIST)
2017 HPCA

FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. (Chinese Academy of Sciences)

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. (University of Pittsburgh, University of Southern California)
 Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. (University of Florida)

Satisfaction of CNN (SoC) is the combination of SoC-time, SoC-accuracy, and energy consumption.

The P-CNN framework is composed of offline compilation and runtime management.

Offline compilation: Generally optimizes runtime, and generates scheduling configurations for the runtime stage.

Runtime management: Generates tuning tables through accuracy tuning, and calibrates accuracy and runtime (selecting the best tuning table) during long-term execution.
 Supporting Address Translation for Accelerator-Centric Architectures. (UCLA)
2017 ASPLOS

Tetris: Scalable and Efficient Neural Network Acceleration with 3D Memory. (Stanford University)

Move accumulation operations close to the DRAM banks.

Develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators.
 SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing. (Syracuse University, USC, The City College of New York)
2017 ISCA

Maximizing CNN Accelerator Efficiency Through Resource Partitioning. (Stony Brook University)

An extension of their FPL'16 paper.

In-Datacenter Performance Analysis of a Tensor Processing Unit. (Google)

SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. (Purdue University, Intel)

Propose a full-system (server node) architecture, focusing on the challenge of DNN training (intra- and inter-layer heterogeneity).

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. (NVIDIA, MIT, UC Berkeley, Stanford University)

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. (University of Michigan, ARM)
 Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. (Stanford)
 LogCA: A High-Level Performance Model for Hardware Accelerators. (AMD, University of Wisconsin-Madison)
 APPROX-NoC: A Data Approximation Framework for Network-On-Chip Architectures. (TAMU)
2017 FCCM

Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer. (Stony Brook University)

Customizing Neural Networks for Efficient FPGA Implementation.

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs.

FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. (Peking University, HKUST, MSRA, UCLA)

Compute-intensive part: RTL-based generalized matrix multiplication kernel.

Layer-specific part: HLS-based control logic.

Memory-intensive part: Several techniques to lower DRAM bandwidth requirements.
 FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off.
 A Configurable FPGA Implementation of the Tanh Function using DCT Interpolation.
2017 DAC

Deep^3: Leveraging Three Levels of Parallelism for Efficient Deep Learning. (UCSD, Rice)

Real-Time meets Approximate Computing: An Elastic Deep Learning Accelerator Design with Adaptive Trade-off between QoS and QoR. (CAS)

I'm not sure whether the proposed tuning scenario and direction are reasonable enough to find feasible solutions.

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs. (PKU, CUHK, SenseTime)

Hardware-Software Co-design of Highly Accurate, Multiplier-free Deep Neural Networks. (Brown University)

A Kernel Decomposition Architecture for Binary-weight Convolutional Neural Networks. (KAIST)

Design of an Energy-Efficient Accelerator for Training of Convolutional Neural Networks using Frequency-Domain Computation. (Georgia Tech)

New Stochastic Computing Multiplier and Its Application to Deep Neural Networks. (UNIST)

TIME: A Training-in-memory Architecture for Memristor-based Deep Neural Networks. (THU, UCSB)

Fault-Tolerant Training with On-Line Fault Detection for RRAM-Based Neural Computing Systems. (THU, Duke)

Automating the Systolic Array Generation and Optimizations for High Throughput Convolution Neural Network. (PKU, UCLA, Falcon)

Towards Full-System Energy-Accuracy Tradeoffs: A Case Study of an Approximate Smart Camera System. (Purdue)

Synergistically tunes component-level approximation knobs to achieve system-level energy-accuracy trade-offs.

Error Propagation Aware Timing Relaxation For Approximate Near Threshold Computing. (KIT)
 RESPARC: A Reconfigurable and Energy-Efficient Architecture with Memristive Crossbars for Deep Spiking Neural Networks. (Purdue)
 Rescuing Memristor-based Neuromorphic Design with High Defects. (University of Pittsburgh, HP Lab, Duke)
 Group Scissor: Scaling Neuromorphic Computing Design to Big Neural Networks. (University of Pittsburgh, Duke)
 Towards Aging-induced Approximations. (KIT, UT Austin)
 SABER: Selection of Approximate Bits for the Design of Error Tolerant Circuits. (University of Minnesota, TAMU)
 On Quality Trade-off Control for Approximate Computing using Iterative Training. (SJTU, CUHK)
2017 DATE

DVAFS: Trading Computational Accuracy for Energy Through Dynamic-Voltage-Accuracy-Frequency-Scaling. (KU Leuven)

Accelerator-friendly Neural-network Training: Learning Variations and Defects in RRAM Crossbar. (Shanghai Jiao Tong University, University of Pittsburgh, Lynmax Research)

A Novel Zero Weight/Activation-Aware Hardware Architecture of Convolutional Neural Network. (Seoul National University)

Solve the zero-induced load imbalance problem.

Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks. (Brown University)

Design Space Exploration of FPGA Accelerators for Convolutional Neural Networks. (Samsung, UNIST, Seoul National University)

MoDNN: Local Distributed Mobile Computing System for Deep Neural Network. (University of Pittsburgh, George Mason University, University of Maryland)

Chain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks. (Waseda University)

LookNN: Neural Network with No Multiplication. (UCSD)

Cluster weights and use LUT to avoid multiplication.
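That note can be sketched as follows (hypothetical shapes: weights are stored as centroid indices; LookNN handles activations similarly):

```python
def lut_multiply(acts, weight_idx, centroids):
    """Multiplication-free layer: weights are stored as centroid indices and
    every (activation value, centroid) product comes from a lookup table."""
    # Build the per-layer table once; inference then only looks up and adds.
    table = {(a, c): a * c for a in set(acts) for c in centroids}
    return sum(table[(a, centroids[i])] for a, i in zip(acts, weight_idx))
```

Because the table size is |distinct activations| x |centroids|, aggressive clustering is what makes it small enough to be cheaper than multipliers.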
 Energy-Efficient Approximate Multiplier Design using Bit Significance-Driven Logic Compression. (Newcastle University)
 Revamping Timing Error Resilience to Tackle Choke Points at NTC Systems. (Utah State University)
2017 VLSI

A 3.43TOPS/W 48.9pJ/Pixel 50.1nJ/Classification 512 Analog Neuron Sparse Coding Neural Network with On-Chip Learning and Classification in 40nm CMOS. (University of Michigan, Intel)

BRein Memory: A 13-Layer 4.2K Neuron/0.8M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65nm CMOS. (Hokkaido University, Tokyo Institute of Technology, Keio University)

A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications. (Tsinghua University)

A 127mW 1.63TOPS Sparse Spatio-temporal Cognitive SoC for Action Classification and Motion Tracking in Videos. (University of Michigan)
2017 ICCAD

AEP: An Error-bearing Neural Network Accelerator for Energy Efficiency and Model Protection. (University of Pittsburgh)
 VoCaM: Visualization-oriented Convolutional Neural Network Acceleration on Mobile System. (George Mason University, Duke)
 AdaLearner: An Adaptive Distributed Mobile Learning System for Neural Networks. (Duke)
 MeDNN: A Distributed Mobile System with Enhanced Partition and Deployment for Large-Scale DNNs. (Duke)
 TraNNsformer: Neural Network Transformation for Memristive Crossbar based Neuromorphic System Design. (Purdue)
 A Closed-loop Design to Enhance Weight Stability of Memristor Based Neural Network Chips. (Duke)
 Fault Injection Attack on Deep Neural Network. (CUHK)
 ORCHARD: Visual Object Recognition Accelerator Based on Approximate In-Memory Processing. (UCSD)
2017 HotChips

A Dataflow Processing Chip for Training Deep Neural Networks. (Wave Computing)

Brainwave: Accelerating Persistent Neural Networks at Datacenter Scale. (Microsoft)

DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses. (Harvard, ARM)

DNPU: An Energy-Efficient Deep Neural Network Processor with On-Chip Stereo Matching. (KAIST)

Evaluation of the Tensor Processing Unit (TPU): A Deep Neural Network Accelerator for the Datacenter. (Google)
 NVIDIA’s Volta GPU: Programmability and Performance for GPU Computing. (NVIDIA)
 Knights Mill: Intel Xeon Phi Processor for Machine Learning. (Intel)
 XPU: A programmable FPGA Accelerator for diverse workloads. (Baidu)
2017 MICRO

Bit-Pragmatic Deep Neural Network Computing. (NVIDIA, University of Toronto)

CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices. (Syracuse University, City University of New York, USC, California State University, Northeastern University)

DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. (UCSB, Samsung)

Scale-Out Acceleration for Machine Learning. (Georgia Tech, UCSD)
 Propose CoSMIC, a full computing stack constituting language, compiler, system software, template architecture, and circuit generators, which enables programmable acceleration of learning at scale.
 DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission. (Univ. of Michigan, Univ. of Nevada)
 Data Movement Aware Computation Partitioning. (PSU, TOBB University of Economics and Technology)

Partition computation on a many-core system for near-data processing.
2018 ASP-DAC

ReGAN: A Pipelined ReRAM-Based Accelerator for Generative Adversarial Networks. (University of Pittsburgh, Duke)

Accelerator-centric Deep Learning Systems for Enhanced Scalability, Energy-efficiency, and Programmability. (POSTECH)

Architectures and Algorithms for User Customization of CNNs. (Seoul National University, Samsung)

Optimizing FPGA-based Convolutional Neural Networks Accelerator for Image Super-Resolution. (Sogang University)

Running Sparse and Low-precision Neural Networks: When Algorithm Meets Hardware. (Duke)
2018 ISSCC

A 55nm TimeDomain MixedSignal Neuromorphic Accelerator with Stochastic Synapses and Embedded Reinforcement Learning for Autonomous MicroRobots. (Georgia Tech)

A Shift Towards Edge MachineLearning Processing. (Google)

QUEST: A 7.49TOPS MultiPurpose LogQuantized DNN Inference Engine Stacked on 96MB 3D SRAM Using InductiveCoupling Technology in 40nm CMOS. (Hokkaido University, Ultra Memory, Keio University)

UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision. (KAIST)

A 9.02mW CNNStereoBased RealTime 3D HandGesture Recognition Processor for Smart Mobile Devices. (KAIST)

An AlwaysOn 3.8μJ/86% CIFAR10 MixedSignal Binary CNN Processor with All Memory on Chip in 28nm CMOS. (Stanford, KU Leuven)

ConvRAM: An EnergyEfficient SRAM with Embedded Convolution Computation for LowPower CNNBased Machine Learning Applications. (MIT)

A 42pJ/Decision 3.12TOPS/W Robust InMemory Machine Learning Classifier with OnChip Training. (UIUC)

BrainInspired Computing Exploiting Carbon Nanotube FETs and Resistive RAM: Hyperdimensional Computing Case Study. (Stanford, UC Berkeley, MIT)

A 65nm 1Mb Nonvolatile ComputinginMemory ReRAM Macro with Sub16ns MultiplyandAccumulate for Binary DNN AI Edge Processors. (NTHU)

A 65nm 4Kb AlgorithmDependent ComputinginMemory SRAM Unit Macro with 2.3ns and 55.8TOPS/W Fully Parallel ProductSum Operation for Binary DNN Edge Processors. (NTHU, TSMC, UESTC, ASU)

A 1μW Voice Activity Detector Using Analog Feature Extraction and Digital Deep Neural Network. (Columbia University)
2018 HPCA

Making Memristive Neural Network Accelerators Reliable. (University of Rochester)

Towards Efficient Microarchitectural Design for Accelerating Unsupervised GANbased Deep Learning. (University of Florida)

Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. (POSTECH, NVIDIA, UTAustin)

Insitu AI: Towards Autonomous and Incremental Deep Learning for IoT Systems. (University of Florida, Chongqing University, Capital Normal University)
 RCNVM: Enabling Symmetric Row and Column Memory Accesses for InMemory Databases. (PKU, NUDT, Duke, UCLA, PSU)
 GraphR: Accelerating Graph Processing Using ReRAM. (Duke, USC, Binghamton University SUNY)
 GraphP: Reducing Communication of PIMbased Graph Processing with Efficient Data Partition. (THU, USC, Stanford)
 PM3: Power Modeling and Power Management for ProcessinginMemory. (PKU)
2018 ASPLOS

Bridging the Gap Between Neural Networks and Neuromorphic Hardware with A Neural Network Compiler. (Tsinghua, UCSB)

MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. (Georgia Tech)

Higher PE utilization: uses an augmented reduction tree (reconfigurable interconnects) to construct arbitrarily sized virtual neurons.
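The virtual-neuron idea can be sketched in a few lines (our illustration, not MAERI's actual hardware: a flat list of multiplier outputs stands in for the PE array, and `group_sizes` stands in for the reduction-tree configuration):

```python
# Minimal sketch: a reconfigurable reduction tree groups a fixed pool of
# multipliers into "virtual neurons" of arbitrary size, so PEs stay
# utilized regardless of filter size.

def map_virtual_neurons(products, group_sizes):
    """Reduce a flat list of multiplier outputs into per-neuron sums;
    group_sizes configures how the adder tree is partitioned."""
    assert sum(group_sizes) <= len(products), "not enough multipliers"
    sums, i = [], 0
    for g in group_sizes:
        sums.append(sum(products[i:i + g]))  # one virtual neuron's reduction
        i += g
    return sums

# 8 multipliers serving two 3-tap neurons and one 2-tap neuron in one pass
prods = [1, 2, 3, 4, 5, 6, 7, 8]
print(map_virtual_neurons(prods, [3, 3, 2]))  # [6, 15, 15]
```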

VIBNN: Hardware Acceleration of Bayesian Neural Networks. (Syracuse University, USC)
 Exploiting Dynamical Thermal Energy Harvesting for Reusing in Smartphone with Mobile Applications. (Guizhou University, University of Florida)
 Potluck: Crossapplication Approximate Deduplication for ComputationIntensive Mobile Applications. (Yale)
2018 VLSI

STICKER: A 0.41‐62.1 TOPS/W 8bit Neural Network Processor with Multi‐Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers. (THU)

2.9TOPS/W Reconfigurable Dense/Sparse Matrix‐Multiply Accelerator with Unified INT8/INT16/FP16 Datapath in 14nm Tri‐gate CMOS. (Intel)

A Scalable Multi‐TeraOPS Deep Learning Processor Core for AI Training and Inference. (IBM)

An Ultra‐high Energy‐efficient reconfigurable Processor for Deep Neural Networks with Binary/Ternary Weights in 28nm CMOS. (THU)

B‐Face: 0.2 mW CNN‐Based Face Recognition Processor with Face Alignment for Mobile User Identification. (KAIST)

A 141 uW, 2.46 pJ/Neuron Binarized Convolutional Neural Network based Selflearning Speech Recognition Processor in 28nm CMOS. (THU)

A Mixed‐Signal Binarized Convolutional‐NeuralNetwork Accelerator Integrating Dense Weight Storage and Multiplication for Reduced Data Movement. (Princeton)

PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In‐Sensor‐Computed Deep Learning Accelerators. (Toshiba)
2018 FPGA

CLSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. (Peking Univ, Syracuse Univ, CUNY)

DeltaRNN: A Powerefficient Recurrent Neural Network Accelerator. (ETHZ, BenevolentAI)

Towards a Uniform Templatebased Architecture for Accelerating 2D and 3D CNNs on FPGA. (National Univ of Defense Tech)

A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform  A Deep Learning Case Study. (The Univ of Sydney, Intel)

A Framework for Generating High Throughput CNN Implementations on FPGAs. (USC)
 Liquid Silicon: A DataCentric Reconfigurable Architecture enabled by RRAM Technology. (UW Madison)
2018 ISCA

RANA: Towards Efficient Neural Acceleration with RefreshOptimized Embedded DRAM. (THU)

Brainwave: A Configurable CloudScale DNN Processor for RealTime AI. (Microsoft)

PROMISE: An EndtoEnd Design of a Programmable MixedSignal Accelerator for Machine Learning Algorithms. (UIUC)

Computation Reuse in DNNs by Exploiting Input Similarity. (UPC)

GANAX: A Unified SIMD-MIMD Acceleration for Generative Adversarial Network. (Georgia Tech, IPM, Qualcomm, UCSD, UIUC)

SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. (UCSD, Georgia Tech, Qualcomm)

UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition. (UIUC, NVIDIA)

An EnergyEfficient Neural Network Accelerator based on OutlierAware Low Precision Computation. (Seoul National)

Prediction based Execution on Deep Neural Networks. (Florida)

Bit Fusion: BitLevel Dynamically Composable Architecture for Accelerating Deep Neural Networks. (Georgia Tech, ARM, UCSD)

Gist: Efficient Data Encoding for Deep Neural Network Training. (Michigan, Microsoft, Toronto)

The Dark Side of DNN Pruning. (UPC)

Neural Cache: BitSerial InCache Acceleration of Deep Neural Networks. (Michigan)
 EVA^2: Exploiting Temporal Redundancy in Live Computer Vision. (Cornell)
 Euphrates: AlgorithmSoC CoDesign for LowPower Mobile Continuous Vision. (Rochester, Georgia Tech, ARM)
 FeatureDriven and Spatially Folded Digital Neurons for Efficient Spiking Neural Network Simulations. (POSTECH/Berkeley, Seoul National)
 SpaceTime Algebra: A Model for Neocortical Computation. (Wisconsin)
 Scaling Datacenter Accelerators With ComputeReuse Architectures. (Princeton)

Adds an NVM-based storage layer to the accelerator for computation reuse.
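The compute-reuse idea above boils down to memoization; a tiny sketch (a plain dict stands in for the paper's NVM-backed result store, and `expensive_kernel` is a hypothetical stand-in for an accelerator invocation):

```python
# Minimal compute-reuse sketch: identical requests hit the result store
# instead of being recomputed.

calls = {"count": 0}

def expensive_kernel(x):
    calls["count"] += 1   # count real computations
    return x * x          # stand-in for a costly accelerator kernel

cache = {}
def compute_with_reuse(x):
    if x not in cache:            # miss: compute and persist the result
        cache[x] = expensive_kernel(x)
    return cache[x]               # hit: reuse, no recomputation

results = [compute_with_reuse(v) for v in [3, 7, 3, 3, 7]]
print(results, calls["count"])  # [9, 49, 9, 9, 49] 2
```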
 Enabling Scientific Computing on Memristive Accelerators. (Rochester)
2018 DATE

MATIC: Learning Around Errors for Efficient LowVoltage Neural Network Accelerators. (University of Washington)

Learns around errors resulting from SRAM voltage scaling, demonstrated on a fabricated 65nm test chip.

Maximizing System Performance by Balancing Computation Loads in LSTM Accelerators. (POSTECH)

A sparse matrix format that load-balances computation, demonstrated for LSTMs.
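One way to get such load balance (our illustrative sketch, not the paper's exact format) is to prune so every row keeps the same number of nonzeros, giving each parallel PE equal work:

```python
# Balanced-sparsity sketch: keep the k largest-|w| weights in each row,
# zeroing the rest, so PEs processing different rows finish together.

def balance_rows(matrix, k):
    """Keep the k largest-magnitude entries per row, zero the rest."""
    out = []
    for row in matrix:
        keep = sorted(range(len(row)), key=lambda j: -abs(row[j]))[:k]
        out.append([w if j in keep else 0.0 for j, w in enumerate(row)])
    return out

m = [[0.9, -0.1, 0.5, 0.0],
     [0.2, 0.8, -0.7, 0.6]]
print(balance_rows(m, 2))  # every row holds exactly 2 nonzeros
```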

CCR: A Concise Convolution Rule for Sparse Neural Network Accelerators. (CAS)

Decomposes convolution into multiple dense and zero kernels to exploit sparsity.
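A toy 1-D version of the idea (our simplification, not the paper's exact rule): separate a sparse filter into its nonzero taps, multiply only the dense part, and count how many zero multiplies are skipped.

```python
# Sparse-convolution sketch: split a kernel into nonzero taps and skipped
# zeros, then convolve using only the dense taps.

def split_kernel(kernel):
    """Return (taps, skipped): (index, weight) pairs for nonzero taps,
    plus the number of zero multiplies avoided per output."""
    taps = [(i, w) for i, w in enumerate(kernel) if w != 0]
    return taps, len(kernel) - len(taps)

def conv1d_sparse(x, kernel):
    taps, _ = split_kernel(kernel)
    n = len(x) - len(kernel) + 1
    return [sum(w * x[o + i] for i, w in taps) for o in range(n)]

k = [2, 0, 0, -1]            # 50% sparse kernel
x = [1, 2, 3, 4, 5]
print(conv1d_sparse(x, k), split_kernel(k)[1])  # [-2, -1] 2
```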

Block Convolution: Towards MemoryEfficient Inference of LargeScale CNNs on FPGA. (CAS)

moDNN: Memory Optimal DNN Training on GPUs. (University of Notre Dame, CAS)
 HyperPower: Power and MemoryConstrained HyperParameter Optimization for Neural Networks. (CMU, Google)
2018 DAC

CompensatedDNN: Energy Efficient LowPrecision Deep Neural Networks by Compensating Quantization Errors. (Best Paper, Purdue, IBM)

Introduces a new fixed-point representation, Fixed Point with Error Compensation (FPEC): computation bits plus compensation bits that represent the quantization error.

Proposes a low-overhead sparse compensation scheme to estimate the error in the MAC design.
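The compensation idea can be sketched numerically (bit widths here are illustrative, not the paper's configuration): quantize coarsely for computation, then store a separately quantized copy of the residual error.

```python
# FPEC-style sketch: a low-precision computation value plus a compensation
# term that captures (most of) the quantization error.

def quantize(x, frac_bits):
    """Round x to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def fpec(x, frac_bits=2, comp_frac_bits=6):
    q = quantize(x, frac_bits)              # coarse computation value
    err = quantize(x - q, comp_frac_bits)   # compensation for the error
    return q, err

x = 0.30078125
q, err = fpec(x)
print(abs(x - q), abs(x - (q + err)))  # compensation shrinks the error
```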

Calibrating Process Variation at System Level with InSitu LowPrecision Transfer Learning for Analog Neural Network Processors. (THU)

DPS: Dynamic Precision Scaling for Stochastic ComputingBased Deep Neural Networks. (UNIST)

DyHardDNN: Even More DNN Acceleration With Dynamic Hardware Reconfiguration. (Univ. of Virginia)

Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization. (Univ. of Washington)

LCP: Layer Clusters Paralleling Mapping Mechanism for Accelerating Inception and Residual Networks on FPGA. (THU)

A Kernel Decomposition Architecture for Binaryweight Convolutional Neural Networks. (THU)

Ares: A Framework for Quantifying the Resilience of Deep Neural Networks. (Harvard)

ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators. (New York Univ., IIT Kanpur)

Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks. (Univ. of Toronto)

Parallelizing SRAM Arrays with Customized BitCell for Binary Neural Networks. (Arizona)

ThermalAware Optimizations of ReRAMBased Neuromorphic Computing Systems. (Northwestern Univ.)

SNrram: An Efficient Sparse Neural Network Computation Architecture Based on Resistive RandomAccess Memory. (THU, UCSB)

Long Live TIME: Improving Lifetime for TrainingInMemory Engines by Structured Gradient Sparsification. (THU, CAS, MIT)

BandwidthEfficient Deep Learning. (MIT, Stanford)

CoDesign of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications. (Berkeley)

SignMagnitude SC: Getting 10X Accuracy for Free in Stochastic Computing for Deep Neural Networks. (UNIST)

DrAcc: A DRAM Based Accelerator for Accurate CNN Inference. (National Univ. of Defense Technology, Indiana Univ., Univ. of Pittsburgh)

OnChip Deep Neural Network Storage With MultiLevel eNVM. (Harvard)
 VRLDRAM: Improving DRAM Performance via Variable Refresh Latency. (Drexel Univ., ETHZ)
2018 HotChips

ARM's First Generation ML Processor. (ARM)

The NVIDIA Deep Learning Accelerator. (NVIDIA)

Xilinx Tensor Processor: An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs. (Xilinx)
 Tachyum Cloud Chip for Hyperscale workloads, deep ML, general, symbolic and bio AI. (Tachyum)
 SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices. (ARM)
 NVIDIA's Xavier SystemonChip. (NVIDIA)
 Xilinx Project Everest: HW/SW Programmable Engine. (Xilinx)
2018 ICCAD

Tetris: Rearchitecting Convolutional Neural Network Computation for Machine Learning Accelerators. (CAS)

3DICT: A Reliable and QoS Capable Mobile Process-In-Memory Architecture for Lookup-based CNNs in 3D XPoint ReRAMs. (Indiana University Bloomington, Florida International Univ.)

TGPA: TileGrained Pipeline Architecture for Low Latency CNN Inference. (PKU, UCLA, Falcon)

NID: Processing Binary Convolutional Neural Network in Commodity DRAM. (KAIST)

AdaptivePrecision Framework for SGD using Deep QLearning. (PKU)

Efficient Hardware Acceleration of CNNs using Logarithmic Data Representation with Arbitrary logbase. (Robert Bosch GmbH)

CGOOD: Ccode Generation Framework for Optimized Ondevice Deep Learning. (SNU)

Mixed Size Crossbar based RRAM CNN Accelerator with Overlapped Mapping Method. (THU)

FCNEngine: Accelerating Deconvolutional Layers in Classic CNN Processors. (HUT, CAS, NUS)

DNNBuilder: an Automated Tool for Building HighPerformance DNN Hardware Accelerators for FPGAs. (UIUC)

DIMA: A Depthwise CNN InMemory Accelerator. (Univ. of Central Florida)

EMAT: An Efficient MultiTask Architecture for Transfer Learning using ReRAM. (Duke)

FATE: Fast and Accurate Timing Error Prediction Framework for Low Power DNN Accelerator Design. (NYU)

Designing Adaptive Neural Networks for EnergyConstrained Image Classification. (CMU)
 Watermarking Deep Neural Networks for Embedded Systems. (UCLA)
 Defensive Dropout for Hardening Deep Neural Networks under Adversarial Attacks. (Northeastern Univ., Boston Univ., Florida International Univ.)
 A CrossLayer Methodology for Design and Optimization of Networks in 2.5D Systems. (Boston Univ., UCSD)
2018 MICRO

Addressing Irregularity in Sparse Neural Networks: A Cooperative Software/Hardware Approach. (USTC, CAS)

Diffy: a Deja vuFree Differential Deep Neural Network Accelerator. (University of Toronto)

Beyond the Memory Wall: A Case for Memorycentric HPC System for Deep Learning. (KAIST)

Towards Memory Friendly LongShort Term Memory Networks (LSTMs) on Mobile GPUs. (University of Houston, Capital Normal University)

A NetworkCentric Hardware/Algorithm CoDesign to Accelerate Distributed Training of Deep Neural Networks. (UIUC, THU, SJTU, Intel, UCSD)

PermDNN: Efficient Compressed Deep Neural Network Architecture with Permuted Diagonal Matrices. (City University of New York, University of Minnesota, USC)

GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware. (Georgia Tech)

ProcessinginMemory for Energyefficient Neural Network Training: A Heterogeneous Approach. (UCM, UCSD, UCSC)
 Schedules computing resources provided by the CPU and heterogeneous PIMs (fixed-function logic + programmable ARM cores) to optimize energy efficiency and hardware utilization.
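A toy version of such a scheduler (the energy numbers and engine names are hypothetical, not the paper's cost model): greedily route each operator to whichever engine reports the lowest energy for that op type.

```python
# Greedy heterogeneous-PIM scheduling sketch: pick the cheapest engine
# (CPU, fixed-function PIM, programmable PIM) per operator type.

ENERGY = {  # nJ per op, illustrative values only
    "gemm":    {"cpu": 10.0, "pim_ff": 3.0,          "pim_arm": 5.0},
    "relu":    {"cpu": 1.0,  "pim_ff": 0.5,          "pim_arm": 0.8},
    "softmax": {"cpu": 2.0,  "pim_ff": float("inf"), "pim_arm": 1.5},
}

def schedule(ops):
    """Map each op to its lowest-energy engine; return (plan, total nJ)."""
    plan = {op: min(ENERGY[op], key=ENERGY[op].get) for op in ops}
    total = sum(ENERGY[op][eng] for op, eng in plan.items())
    return plan, total

plan, total = schedule(["gemm", "relu", "softmax"])
print(plan, total)  # gemm/relu -> pim_ff, softmax -> pim_arm, total 5.0
```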

LerGAN: A Zerofree, Low Data Movement and PIMbased GAN Architecture. (THU, University of Florida)

Multidimensional Parallel Training of Winograd Layer through Distributed NearData Processing. (KAIST)
 Winograd is applied to training to extend traditional data parallelism with a new dimension, intra-tile parallelism. With intra-tile parallelism, nodes are divided into several groups, and weight-update communication occurs independently within each group. The method shows better scalability for training clusters, as total communication does not grow with node count.
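The scalability claim can be illustrated with a simple ring-all-reduce cost model (our assumption, not the paper's measurements): each node sends 2·(p−1) messages in a p-node ring all-reduce, so confining weight-update exchange to a group keeps per-node communication independent of cluster size.

```python
# Grouped-communication sketch: per-node message count in a ring all-reduce
# depends only on the ring size, so grouping bounds it regardless of scale.

def per_node_messages(ring_size):
    """Messages each node sends in a ring all-reduce over ring_size nodes."""
    return 2 * (ring_size - 1)

flat = per_node_messages(16)    # one flat 16-node all-reduce
grouped = per_node_messages(4)  # 4-way groups: each node talks only to
                                # its group, whatever the total node count
print(flat, grouped)  # 30 6
```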

SCOPE: A Stochastic Computing Engine for DRAMbased Insitu Accelerator. (UCSB, Samsung)

Morph: Flexible Acceleration for 3D CNNbased Video Understanding. (UIUC)
 Interthread Communication in Multithreaded, Reconfigurable Coarsegrain Arrays. (Technion)
 An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware. (Cornell)
2019 ASPDAC

An Nway group association architecture and sparse data group association load balancing algorithm for sparse CNN accelerators. (THU)

TNPU: An Efficient Accelerator Architecture for Training Convolutional Neural Networks. (ICT)

NeuralHMC: An Efficient HMCBased Accelerator for Deep Neural Networks. (University of Pittsburgh, Duke)

P3M: A PIMbased Neural Network Model Protection Scheme for Deep Learning Accelerator. (ICT)
 GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs. (Tsinghua, MIT, Berkeley)
2019 ISSCC

An 11.5TOPS/W 1024MAC Butterfly Structure DualCore SparsityAware Neural Processing Unit in 8nm Flagship Mobile SoC. (Samsung)

A 20.5TOPS and 217.3GOPS/mm2 Multicore SoC with DNN Accelerator and Image Signal Processor Complying with ISO26262 for Automotive Applications. (Toshiba)

An 879GOPS 243mW 80fps VGA Fully Visual CNNSLAM Processor for WideRange Autonomous Exploration. (Michigan)

A 2.1TFLOPS/W Mobile Deep RL Accelerator with Transposable PE Array and Experience Compression. (KAIST)

A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural-Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm2 and 6T HBST-TRAM-Based 2D Data-Reuse Architecture. (THU, National Tsing Hua University, Northeastern University)

A 65nm 236.5nJ/Classification Neuromorphic Processor with 7.5% Energy Overhead OnChip Learning Using Direct SpikeOnly Feedback. (SNU)

LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16. (KAIST)
 A 1Mb Multibit ReRAM ComputingInMemory Macro with 14.6ns Parallel MAC Computing Time for CNNBased AI Edge Processors. (National Tsing Hua University)
 SandwichRAM: An EnergyEfficient InMemory BWN Architecture with PulseWidth Modulation. (Southeast University, Boxing Electronics, THU)
 A Twin8T SRAM ComputationInMemory Macro for MultipleBit CNN Based Machine Learning. (National Tsing Hua University, University of Electronic Science and Technology of China, ASU, Georgia Tech)
 A Reconfigurable RRAM Physically Unclonable Function Utilizing Post-Process Randomness Source with <6×10^-6 Native Bit Error Rate. (THU, National Tsing Hua University, Georgia Tech)
 A 65nm 1.1to9.1TOPS/W HybridDigitalMixedSignal Computing Platform for Accelerating ModelBased and ModelFree Swarm Robotics. (Georgia Tech)
 A Compute SRAM with BitSerial Integer/FloatingPoint Operations for Programmable InMemory Vector Acceleration. (Michigan)
 AllDigital TimeDomain CNN Engine Using Bidirectional Memory Delay Lines for EnergyEfficient Edge Computing. (UT Austin)
2019 HPCA

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array. (Duke, USC)

ERNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs. (Syracuse University, Northeastern University, Florida International University, USC, University at Buffalo)

Bit Prudent InCache Acceleration of Deep Convolutional Neural Networks. (Michigan, Intel)

Shortcut Mining: Exploiting Crosslayer Shortcut Reuse in DCNN Accelerators. (OSU)

NANDNet: Minimizing Computational Complexity of InMemory Processing for Binary Neural Networks. (KAIST)

Kelp: QoS for Accelerators in Machine Learning Platforms. (Microsoft, Google, UT Austin)

Machine Learning at Facebook: Understanding Inference at the Edge. (Facebook)
 The Accelerator Wall: Limits of Chip Specialization. (Princeton)
2019 ASPLOS

FA3C: FPGAAccelerated Deep Reinforcement Learning. (Hongik University, SNU)

PUMA: A Programmable Ultraefficient Memristorbased Accelerator for Machine Learning Inference. (Purdue, UIUC, HP)

FPSA: A Full System Stack Solution for Reconfigurable ReRAMbased NN Accelerator Architecture. (THU, UCSB)

BitTactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks. (Toronto, NVIDIA)

TANGRAM: Optimized CoarseGrained Dataflow for Scalable NN Accelerators. (Stanford)

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization. (Harvard)

SplitCNN: Splitting Windowbased Operations in Convolutional Neural Networks for Memory System Optimization. (IBM, Kyungpook National University)

HOP: HeterogeneityAware Decentralized Training. (USC, THU)

Astra: Exploiting Predictability to Optimize Deep Learning. (Microsoft)

ADMMNN: An AlgorithmHardware CoDesign Framework of DNNs Using Alternating Direction Methods of Multipliers. (Northeastern, Syracuse, SUNY, Buffalo, USC)

DeepSigns: An EndtoEnd Watermarking Framework for Protecting the Ownership of Deep Neural Networks. (UCSD)
2019 ISCA

Sparse ReRAM Engine: Joint Exploration of Activation and Weight Sparsity on Compressed Neural Network. (NTU, Academia Sinica, Macronix)

MnnFast: A Fast and Scalable System Architecture for MemoryAugmented Neural Networks. (POSTECH, SNU)

TIE: Energyefficient Tensor Trainbased Inference Engine for Deep Neural Network. (Rutgers University, Nanjing University, USC)

Accelerating Distributed Reinforcement Learning with InSwitch Computing. (UIUC)

Eager Pruning: Algorithm and Architecture Support for Fast Training of Deep Neural Networks. (University of Florida)

Laconic Deep Learning Inference Acceleration. (Toronto)

DeepAttest: An EndtoEnd Attestation Framework for Deep Neural Networks. (UCSD)

A StochasticComputing based Deep Learning Framework using Adiabatic QuantumFluxParametron Superconducting Technology. (Northeastern, Yokohama National University, USC, University of Alberta)

Fractal Machine Learning Computers. (ICT)

FloatPIM: InMemory Acceleration of Deep Neural Network Training with High Precision. (UCSD)
 EnergyEfficient Video Processing for Virtual Reality. (UIUC, University of Rochester)
 Scalable Interconnects for Reconfigurable Spatial Architectures. (Stanford)
 CoNDA: Enabling Efficient NearData Accelerator Communication by Optimizing Data Movement. (CMU, ETHZ)
2019 DAC

Accuracy vs. Efficiency: Achieving Both through FPGAImplementation Aware Neural Architecture Search. (East China Normal University, Pittsburgh, Chongqing University, UCI, Notre Dame)

FPGA/DNN CoDesign: An Efficient Design Methodology for IoT Intelligence on the Edge. (UIUC, IBM, Inspirit IoT)

An Optimized Design Technique of LowBit Neural Network Training for Personalization on IoT Devices. (KAIST)

ReForm: Static and Dynamic ResourceAware DNN Reconfiguration Framework for Mobile Devices. (George Mason, Clarkson)

DRIS3: Deep Neural Network Reliability Improvement Scheme in 3D DieStacked Memory based on Fault Analysis. (Sungkyunkwan University)

ZARA: A Novel Zerofree Dataflow Accelerator for Generative Adversarial Networks in 3D ReRAM. (Duke)

BitBlade: Area and EnergyEfficient PrecisionScalable Neural Network Accelerator with Bitwise Summation. (POSTECH)
 XMANN: A Crossbar based Architecture for Memory Augmented Neural Networks. (Purdue, Intel)
 ThermalAware Design and Management for Searchbased InMemory Acceleration. (UCSD)
 An EnergyEfficient NetworkonChip Design using Reinforcement Learning. (George Washington)
 Designing Vertical Processors in Monolithic 3D. (UIUC)
2019 MICRO

WireAware Architecture and Dataflow for CNN Accelerators. (Utah)

ShapeShifter: Enabling FineGrain Data Width Adaptation in Deep Learning. (Toronto)

Simba: Scaling DeepLearning Inference with MultiChipModuleBased Architecture. (NVIDIA)

ZCOMP: Reducing DNN CrossLayer Memory Footprint Using Vector Extensions. (Google, Intel)

Boosting the Performance of CNN Accelerators with Dynamic FineGrained Channel Gating. (Cornell)

SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks. (Purdue)

EDEN: Enabling Approximate DRAM for DNN Inference using ErrorResilient Neural Networks. (ETHZ, CMU)

eCNN: a BlockBased and HighlyParallel CNN Accelerator for Edge Inference. (NTHU)

TensorDIMM: A Practical NearMemory Processing Architecture for Embeddings and Tensor Operations in Deep Learning. (KAIST)

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A DataCentric Approach. (Georgia Tech, NVIDIA)

MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation. (Harvard, Facebook)

NeuronLevel Fuzzy Memoization in RNNs. (UPC)

Manna: An Accelerator for MemoryAugmented Neural Networks. (Purdue, Intel)
 eAP: A Scalable and Efficient InMemory Accelerator for Automata Processing. (Virginia)
 ComputeDRAM: InMemory Compute Using OfftheShelf DRAMs. (Princeton)
 ExTensor: An Accelerator for Sparse Tensor Algebra. (UIUC, NVIDIA)
 Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable MultiWay Merge Parallelization. (CMU)
 Sparse Tensor Core: Algorithm and Hardware CoDesign for Vectorwise Sparse Neural Networks on Modern GPUs. (UCSB, Alibaba)
 DynaSprint: Microarchitectural Sprints with Dynamic Utility and Thermal Management. (Waterloo, ARM, Duke)
 MEDAL: Scalable DIMM based Near Data Processing Accelerator for DNA Seeding Algorithm. (UCSB, ICT)
 Tigris: Architecture and Algorithms for 3D Perception in Point Clouds. (Rochester)
 ASV: Accelerated Stereo Vision System. (Rochester)
 Alleviating Irregularity in Graph Analytics Acceleration: a Hardware/Software CoDesign Approach. (UCSB, ICT)
2019 ICCAD

Zac: Towards Automatic Optimization and Deployment of Quantized Deep Neural Networks on Embedded Devices. (PKU)

NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving. (UIUC)

MAGNet: A Modular Accelerator Generator for Neural Networks. (NVIDIA)

ReDRAM: A Reconfigurable ProcessinginDRAM Platform for Accelerating Bulk BitWise Operations. (ASU)

Accelergy: An ArchitectureLevel Energy Estimation Methodology for Accelerator Designs. (MIT)
2019 ASSCC

A 47.4µJ/epoch Trainable Deep Convolutional Neural Network Accelerator for InSitu Personalization on Smart Devices. (KAIST)

A 2.25 TOPS/W FullyIntegrated Deep CNN Learning Processor with OnChip Training. (NTU)

A Sparse-Adaptive CNN Processor with Area/Performance-Balanced N-Way Set-Associate PE Arrays Assisted by a Collision-Aware Scheduler. (THU, Northeastern)
 A 24 Kb SingleWell Mixed 3T GainCell eDRAM with BodyBias in 28 nm FDSOI for RefreshFree DSP Applications. (EPFL)
2019 VLSI

AreaEfficient and VariationTolerant InMemory BNN Computing Using 6T SRAM Array. (POSTECH)

A 5.1pJ/Neuron 127.3us/Inference RNNBased Speech Recognition Processor Using 16 ComputinginMemory SRAM Macros in 65nm CMOS. (THU, NTU, TsingMicro)

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable, Multi-Chip-Module-Based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm. (NVIDIA)

SNAP: A 1.67-21.55TOPS/W Sparse Neural Acceleration Processor for Unstructured Sparse Deep Neural Network Inference in 16nm CMOS. (UMich, NVIDIA)

A Full HD 60 fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices. (KAIST)

A 1.32 TOPS/W Energy Efficient Deep Neural Network Learning Processor with Direct Feedback Alignment based Heterogeneous Core Architecture. (KAIST)
 Considerations of Integrating Computing-In-Memory and Processing-In-Sensor into Convolutional Neural Network Accelerators for Low-Power Edge Devices. (NTU, NCHU)
 Computational MemoryBased Inference and Training of Deep Neural Networks. (IBM, EPFL, ETHZ, et al)
 A Ternary-Based Bit-Scalable, 8.80 TOPS/W CNN Accelerator with Many-Core Processing-in-Memory Architecture with 896K Synapses/mm2. (Renesas)
 InMemory Reinforcement Learning with ModeratelyStochastic Conductance Switching of Ferroelectric Tunnel Junctions. (Toshiba)
2019 HotChips

MLPerf: A Benchmark Suite for Machine Learning from an AcademicIndustry Cooperative. (MLPerf)

Zion: Facebook NextGeneration Largememory Unified Training Platform. (Facebook)

A Scalable Unified Architecture for Neural Network Computing from NanoLevel to High Performance Computing. (Huawei)

Deep Learning Training at Scale – Spring Crest Deep Learning Accelerator. (Intel)

Spring Hill – Intel’s Data Center Inference Chip. (Intel)

Wafer Scale Deep Learning. (Cerebras)

Habana Labs Approach to Scaling AI Training. (Habana)

Ouroboros: A WaveNet Inference Engine for TTS Applications on Embedded Devices. (Alibaba)

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology. (NVIDIA)

Xilinx Versal/AI Engine. (Xilinx)
 A Programmable Embedded Microprocessor for Bitscalable Inmemory Computing. (Princeton)
2019 FPGA

Synetgy: Algorithmhardware Codesign for ConvNet Accelerators on Embedded FPGAs. (THU, Berkeley, Politecnico di Torino, Xilinx)

REQ-YOLO: A Resource-Aware, Efficient Quantization Framework for Object Detection on FPGAs. (PKU, Northeastern)

Reconfigurable Convolutional Kernels for Neural Networks on FPGAs. (University of Kassel)

Efficient and Effective Sparse LSTM on FPGA with BankBalanced Sparsity. (Harbin Institute of Technology, Microsoft, THU, Beihang)

CloudDNN: An Open Framework for Mapping DNN Models to Cloud FPGAs. (Advanced Digital Sciences Center, UIUC)
 F5HD: Fast Flexible FPGAbased Framework for Refreshing Hyperdimensional Computing. (UCSD)
 Xilinx Adaptive Compute Acceleration Platform: Versal Architecture. (Xilinx)
2020 ISSCC

A 3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-Learning Accelerator for Versatile AI Applications in 7nm 5G Smartphone SoC. (MediaTek)

A 12nm Programmable ConvolutionEfficient NeuralProcessingUnit Chip Achieving 825TOPS. (Alibaba)

STATICA: A 512Spin 0.25MWeight FullDigital Annealing Processor with a NearMemory AllSpinUpdatesatOnce Architecture for Combinatorial Optimization with Complete SpinSpin Interactions. (Tokyo Institute of Technology, Hokkaido Univ., Univ. of Tokyo)

GANPU: A 135TFLOPS/W MultiDNN Training Processor for GANs with Speculative DualSparsity Exploitation. (KAIST)

A 510nW 0.41V LowMemory LowComputation KeywordSpotting Chip Using Serial FFTBased MFCC and Binarized Depthwise Separable Convolutional Neural Network in 28nm CMOS. (Southeast, EPFL, Columbia)

A 65nm 24.7μJ/Frame 12.3mW ActivationSimilarity Aware Convolutional Neural Network Video Processor Using Hybrid Precision, InterFrame Data Reuse and MixedBitWidth DifferenceFrame Data Codec. (THU)

A 65nm ComputinginMemoryBased CNN Processor with 2.9to35.8TOPS/W System Energy Efficiency Using DynamicSparsity PerformanceScaling Architecture and EnergyEfficient Inter/IntraMacro Data Reuse. (THU, NTHU)
 A 28nm 64Kb InferenceTraining TwoWay Transpose Multibit 6T SRAM ComputeinMemory Macro for AI Edge Chips. (NTU)
 A 351TOPS/W and 372.4GOPS ComputeinMemory SRAM Macro in 7nm FinFET CMOS for MachineLearning Applications. (TSMC)
 A 22nm 2Mb ReRAM Compute-in-Memory Macro with 121-28TOPS/W for Multibit MAC Computing for Tiny AI Edge Devices. (NTHU)
 A 28nm 64Kb 6T SRAM ComputinginMemory Macro with 8b MAC Operation for AI Edge Chips. (NTHU)
 A 1.5μJ/Task PathPlanning Processor for 2D/3D Autonomous Navigation of Micro Robots. (NTHU)
 A 65nm 8.79TOPS/W 23.82mW MixedSignal OscillatorBased NeuroSLAM Accelerator for Applications in Edge Robotics. (Georgia Tech)
 CIM-Spin: A 0.5-to-1.2V Scalable Annealing Processor Using Digital Compute-In-Memory Spin Operators and Register-Based Spins for Combinatorial Optimization Problems. (NTU)
 A ComputeAdaptive Elastic ClockChain Technique with Dynamic Timing Enhancement for 2D PEArrayBased Accelerators. (Northwestern)
 A 74 TMACS/W CMOSRRAM Neurosynaptic Core with Dynamically Reconfigurable Dataflow and Insitu Transposable Weights for Probabilistic Graphical Models. (Stanford, UCSD, THU, Notre Dame)
 A Fully Integrated Analog ReRAM Based 78.4TOPS/W ComputeInMemory Chip with Fully Parallel MAC Computing. (THU, NTHU)
2020 HPCA

Deep Learning Acceleration with NeurontoMemory Transformation. (UCSD)

HyGCN: A GCN Accelerator with Hybrid Architecture. (ICT, UCSB)

SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. (Georgia Tech)

PREMA: A Predictive Multitask Scheduling Algorithm For Preemptible NPUs. (KAIST)

ALRESCHA: A Lightweight Reconfigurable SparseComputation Accelerator. (Georgia Tech)

SpArch: Efficient Architecture for Sparse Matrix Multiplication. (MIT, NVIDIA)

A3: Accelerating Attention Mechanisms in Neural Networks with Approximation. (SNU)

AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerator Arrays. (Duke, USC)

PIXEL: Photonic Neural Network Accelerator. (Ohio, George Washington)

The Architectural Implications of Facebook’s DNNbased Personalized Recommendation. (Facebook)

Enabling Highly Efficient Capsule Networks Processing Through A PIMBased Architecture Design. (Houston)

Missing the Forest for the Trees: EndtoEnd AI Application Performance in Edge Data. (UT Austin, Intel)

Communication Lower Bound in Convolution Accelerators. (ICT, THU)

Fulcrum: a Simplified Control and Access Mechanism toward Flexible and Practical insitu Accelerators. (Virginia, UCSB, Micron)

EFLOPS: Algorithm and System Codesign for a High Performance Distributed Training Platform. (Alibaba)

Experiences with MLDriven Design: A NoC Case Study. (AMD)

Tensaurus: A Versatile Accelerator for Mixed SparseDense Tensor Computations. (Cornell, Intel)

A Hybrid SystolicDataflow Architecture for Inductive Matrix Algorithms. (UCLA)
 A Deep Reinforcement Learning Framework for Architectural Exploration: A Routerless NoC Case Study. (USC, OSU)
 QuickNN: Memory and Performance Optimization of kd Tree Based Nearest Neighbor Search for 3D Point Clouds. (Umich, General Motors)
 Orbital Edge Computing: Machine Inference in Space. (CMU)
 A Scalable and Efficient inMemory Interconnect Architecture for Automata Processing. (Virginia)
 Techniques for Reducing the ConnectedStandby Energy Consumption of Mobile Devices. (ETHZ, Cyprus, CMU)
2020 ASPLOS

Shredder: Learning Noise Distributions to Protect Inference Privacy. (UCSD)

DNNGuard: An Elastic Heterogeneous DNN Accelerator Architecture against Adversarial Attacks. (CAS, USC)

Interstellar: Using Halide’s Scheduling Language to Analyze DNN Accelerators. (Stanford, THU)

DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints. (UCSB)

Prague: HighPerformance HeterogeneityAware Asynchronous Decentralized Training. (USC)

PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning. (College of William and Mary, Northeastern, USC)

Capuchin: Tensor-based GPU Memory Management for Deep Learning. (HUST, MSRA, USC)

NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units. (KAIST)

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. (PKU)
2020 DAC

A Pragmatic Approach to On-device Incremental Learning System with Selective Weight Updates.

A Two-way SRAM Array based Accelerator for Deep Neural Network On-chip Training.

Algorithm-Hardware Co-Design for In-Memory Neural Network Computing with Minimal Peripheral Circuit Overhead.

Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference.

Hardware Acceleration of Graph Neural Networks.

Low-Power Acceleration of Deep Neural Network Training Using Computational Storage Devices.

Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training.

SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training.

SCA: A Secure CNN Accelerator for both Training and Inference.

STC: Significance-aware Transform-based Codec Framework for External Memory Access Reduction.
2020 FPGA

AutoDNNchip: An Automated DNN Chip Generator through Compilation, Optimization, and Exploration. (Rice, UIUC)

Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms. (USC)
 Massively Simulating Adiabatic Bifurcations with FPGA to Solve Combinatorial Optimization. (Central Florida)
2020 ISCA

Data Compression Accelerator on IBM POWER9 and z15 Processors. (IBM)

High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs. (Centaur)

Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads. (Groq)

MLPerf Inference: A Benchmarking Methodology for Machine Learning Inference Systems.

A Multi-Neural Network Acceleration Architecture. (SNU)

SmartExchange: Trading Higher-Cost Memory Storage/Access for Lower-Cost Computation. (Rice, TAMU, UCSB)

Centaur: A Chiplet-Based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations. (KAIST)

DeepRecSys: A System for Optimizing End-to-End At-Scale Neural Recommendation Inference. (Facebook, Harvard)

An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives. (NVIDIA)

DRQ: Dynamic Region-Based Quantization for Deep Neural Network Acceleration. (SJTU)
 The IBM z15 High Frequency Mainframe Branch Predictor. (ETHZ)
 Déjà View: Spatio-Temporal Compute Reuse for Energy-Efficient 360° VR Video Streaming. (Penn State)
 uGEMM: Unary Computing Architecture for GEMM Applications. (Wisconsin)
 Gorgon: Accelerating Machine Learning from Relational Data. (Stanford)
 RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing. (Facebook)
 JPEG-ACT: Accelerating Deep Learning via Transform-Based Lossy Compression. (UBC)
 Commutative Data Reordering: A New Technique to Reduce Data Movement Energy on Sparse Inference Workloads. (Sandia, Rochester)
 Echo: Compiler-Based GPU Memory Footprint Reduction for LSTM RNN Training. (Toronto, Intel)
2020 HotChips

Google’s Training Chips Revealed: TPUv2 and TPUv3. (Google)

Software Co-design for the First Wafer-Scale Processor (and Beyond). (Cerebras)

Manticore: A 4096-core RISC-V Chiplet Architecture for Ultra-efficient Floating-point Computing. (ETHZ)

Baidu Kunlun – An AI Processor for Diversified Workloads. (Baidu)

Hanguang 800 NPU – The Ultimate AI Inference Solution for Data Centers. (Alibaba)

Silicon Photonics for Artificial Intelligence Acceleration. (Lightmatter)
 Xuantie-910: Innovating Cloud and Edge Computing by RISC-V. (Alibaba)
 A Technical Overview of the ARM Cortex-M55 and Ethos-U55: ARM’s Most Capable Processors for Endpoint AI. (ARM)
 PGMA: A Scalable Bayesian Inference Accelerator for Unsupervised Learning. (Harvard)
2020 VLSI

PNPU: A 146.52TOPS/W Deep-Neural-Network Learning Processor with Stochastic Coarse-Fine Pruning and Adaptive Input/Output/Weight Skipping. (KAIST)

A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference. (IBM)

A 617 TOPS/W All Digital Binary Neural Network Accelerator in 10nm FinFET CMOS. (Intel)

An Ultra-Low Latency 7.8-13.6 pJ/b Reconfigurable Neural Network-Assisted Polar Decoder with Multi-Code Length Support. (NTU)

A 4.45ms Low-Latency 3D Point-Cloud-Based Neural Network Processor for Hand Pose Estimation in Immersive Wearable Devices. (KAIST)

A 3mm2 Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception Using Parallel Gibbs Sampling in 16nm. (Harvard)
 1.03pW/b Ultra-Low Leakage Voltage-Stacked SRAM for Intelligent Edge Processors. (Umich)
 Z-PIM: An Energy-Efficient Sparsity-Aware Processing-In-Memory Architecture with Fully-Variable Weight Precision. (KAIST)
2020 MICRO

SuperNPU: An Extremely Fast Neural Processing Unit Using Superconducting Logic Devices. (Kyushu University)

Printed Machine Learning Classifiers. (UIUC, KIT)

Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration. (PSU, Intel)

FReaC Cache: Folded-Logic Reconfigurable Computing in the Last Level Cache. (UIUC, IBM)

Newton: A DRAM-Maker's Accelerator-in-Memory (AiM) Architecture for Machine Learning. (Purdue, SK Hynix)

VR-DANN: Real-Time Video Recognition via Decoder-Assisted Neural Network Acceleration. (SJTU)

Procrustes: A Dataflow and Accelerator for Sparse Deep Neural Network Training. (University of British Columbia, Microsoft)

Duplo: Lifting Redundant Memory Accesses of Deep Neural Networks for GPU Tensor Cores. (Yonsei University, EcoCloud, EPFL)

DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture. (UCSB, Alibaba)

ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning. (GaTech)

Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks. (UCSD, Bigstream, Kansas, NVIDIA, Google)

TFE: Energy-Efficient Transferred Filter-Based Engine to Compress and Accelerate Convolutional Neural Networks. (THU, Alibaba)

MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product. (Cornell)

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training. (Toronto)

SAVE: SparsityAware Vector Engine for Accelerating DNN Training and Inference on CPUs. (UIUC)

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference. (Toronto)

TrainBox: An Extreme-Scale Neural Network Training Server Architecture by Systematically Balancing Operations. (SNU)

AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing. (Boston et al.)

Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation. (Rochester, ARM)

NCPU: An Embedded Neural CPU Architecture on Resource-Constrained Low Power Devices for Real-Time End-to-End Performance. (Northwestern University)
 FlexWatts: A Power- and Workload-Aware Hybrid Power Delivery Network for Energy-Efficient Microprocessors. (ETHZ, Intel, Technion, NTU)
 AutoScale: Energy Efficiency Optimization for Stochastic Edge Inference Using Reinforcement Learning. (Facebook)
 CATCAM: Constant-time Alteration Ternary CAM with Scalable In-Memory Architecture. (THU, Southeast University)
 DUAL: Acceleration of Clustering Algorithms using Digital-Based Processing In-Memory. (UCSD)
 Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics. (ETHZ)
2020 ICCAD
 ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration. (Duke)
 Energy-efficient XNOR-free In-Memory BNN Accelerator with Input Distribution Regularization. (POSTECH)
 HyperTune: Dynamic Hyperparameter Tuning for Efficient Distribution of DNN Training Over Heterogeneous Systems. (UCI, NGD)
 SynergicLearning: Neural Network-Based Feature Extraction for Highly-Accurate Hyperdimensional Learning. (USC)
 Optimizing Stochastic Computing for Low Latency Inference of Convolutional Neural Networks. (Nanjing University)
 HAPI: Hardware-Aware Progressive Inference. (Samsung)
 MobiLattice: A Depthwise DCNN Accelerator with Hybrid Digital/Analog Nonvolatile Processing-In-Memory Block. (PKU, Duke)
 A Many-Core Accelerator Design for On-Chip Deep Reinforcement Learning. (ICT)
 DRAMA: An Approximate DRAM Architecture for High-performance and Energy-efficient Deep Training System. (Kyung Hee Univ., NUS)
 FPGA-based Low-Batch Training Accelerator for Modern CNNs Featuring High Bandwidth Memory. (ASU, Intel)
2021 ISSCC

The A100 Datacenter GPU and Ampere Architecture. (NVIDIA)

Kunlun: A 14nm High-Performance AI Processor for Diversified Workloads. (Baidu)

A 12nm Autonomous-Driving Processor with 60.4TOPS, 13.8TOPS/W CNN Executed by Task-Separated ASIL D Control. (Renesas)

BioAIP: A Reconfigurable Biomedical AI Processor with Adaptive Learning for Versatile Intelligent Health Monitoring. (UESTC)

A 0.2-to-3.6TOPS/W Programmable Convolutional Imager SoC with In-Sensor Current-Domain Ternary-Weighted MAC Operations for Feature Extraction and Region-of-Interest Detection. (Leuven)

A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling. (IBM)

A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using Effective-Weight-Based Convolution and Error-Compensation-Based Prediction. (THU)

A 40nm 4.81TFLOPS/W 8b Floating-Point Training Processor for Non-Sparse Neural Networks Using Shared Exponent Bias and 24-Way Fused Multiply-Add Tree. (SNU)

PIU: A 248GOPS/W Stream-Based Processor for Irregular Probabilistic Inference Networks Using Precision-Scalable Posit Arithmetic in 28nm. (Leuven)

A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC. (Samsung)

A 1/2.3-inch 12.3Mpixel with On-Chip 4.97TOPS/W CNN Processor Back-Illuminated Stacked CMOS Image Sensor. (Sony)

A 184μW Real-Time Hand-Gesture Recognition System with Hybrid Tiny Classifiers for Smart Wearable Devices. (Nanyang)

A 25mm2 SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET. (Harvard, Tufts, ARM, Cornell)

A Background-Noise and Process-Variation-Tolerant 109nW Acoustic Feature Extractor Based on Spike-Domain Divisive-Energy Normalization for an Always-On Keyword Spotting Device. (Columbia)
 A 148nW General-Purpose Event-Driven Intelligent Wake-Up Chip for AIoT Devices Using Asynchronous Spike-Based Feature Extractor and Convolutional Neural Network. (PKU)
 A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing. (Princeton)
 A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating. (THU)
 A 65nm 3T Dynamic Analog RAM-Based Computing-in-Memory Macro and CNN Accelerator with Retention Enhancement, Adaptive Analog Sparsity and 44TOPS/W System Energy Efficiency. (Northwestern)
 A 5.99-to-691.1TOPS/W Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity-Based Optimization and Variable-Precision Quantization. (THU, UESTC, NTHU)
 A 22nm 4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91-to-195.7TOPS/W for Tiny AI Edge Devices. (NTHU, TSMC)
 eDRAM-CIM: Compute-In-Memory Design with Reconfigurable Embedded-Dynamic-Memory Array Realizing Adaptive Data Converters and Charge-Domain Computing. (UT Austin, Intel)
 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b of Precision for AI Edge Chips. (NTHU, Industrial Technology Research Institute, TSMC)
 An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In-Memory Macro in 22nm for Machine-Learning Edge Applications. (TSMC)
 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications. (Samsung)
 A 21×21 Dynamic-Precision Bit-Serial Computing Graph Accelerator for Solving Partial Differential Equations Using Finite Difference Method. (Nanyang)
2021 ASPLOS

Exploiting Gustavson's Algorithm to Accelerate Sparse Matrix Multiplication. (MIT, NVIDIA)

SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM. (ETHZ, CMU)

RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference. (Harvard, Facebook, ASU)
 DiAG: A Dataflow-inspired Architecture for General-purpose Processors. (UIUC)
 Field-Configurable Multi-resolution Inference: Rethinking Quantization. (Harvard, Franklin & Marshall College)
 Defensive Approximation: Securing CNNs using Approximate Computing. (University of Sfax et al.)
2021 HPCA

A Computational Stack for Cross-Domain Acceleration. (UCSD et al.)

Heterogeneous Dataflow Accelerators for Multi-DNN Workloads. (GaTech, Facebook, NVIDIA)

SPAGHETTI: Streaming Accelerators for Highly Sparse GEMM on FPGAs. (SFU et al.)

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. (MIT)

Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework. (Northeastern et al.)

Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training. (KAIST)

GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent. (SNU, Yonsei)

SpaceA: Sparse Matrix Vector Multiplication on Processing-in-Memory Accelerator. (UCSB, PKU)

Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling. (Sungkyunkwan, SNU)

Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning. (UCM, Microsoft)

CSCNN: Algorithm-hardware Co-design for CNN Accelerators using Centrosymmetric Filters. (GWU, Ohio)

AdaptNoC: A Flexible Network-on-Chip Design for Heterogeneous Manycore Architectures. (GWU)

GCNAX: A Flexible and Energy-efficient Accelerator for Graph Convolutional Neural Networks. (GWU, Ohio)

Ascend: A Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing. (Huawei)

Understanding Training Efficiency of Deep Learning Recommendation Models at Scale. (Facebook)

Eudoxus: Characterizing and Accelerating Localization in Autonomous Machines. (Rochester et al.)

NeuroMeter: An Integrated Power, Area, and Timing Modeling Framework for Machine Learning Accelerators. (UCSB, Google)

Chasing Carbon: The Elusive Environmental Footprint of Computing. (Harvard, Facebook)

FuseKNA: Fused Kernel Convolution based Accelerator for Deep Neural Networks. (THU)

FAFNIR: Accelerating Sparse Gathering by Using Efficient Near-Memory Intelligent Reduction. (GaTech)

VIA: A Smart Scratchpad for Vector Units With Application to Sparse Matrix Computations. (Barcelona Supercomputing Center et al.)
 Cheetah: Optimizing and Accelerating Homomorphic Encryption for Private Inference. (NYU, SNU, Harvard, Facebook)
 CAPE: A Content-Addressable Processing Engine. (Cornell, PSU)
 Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design. (Umich et al.)
 BRIM: Bistable Resistively-Coupled Ising Machine. (Rochester)
 An Analog Preconditioner for Solving Linear Systems. (Sandia et al.)
2021 ISCA
 Ten Lessons From Three Generations Shaped Google's TPUv4i (Google)
 Sparsity-Aware and Re-configurable NPU Architecture for Samsung Flagship Mobile SoC (Samsung)
 Energy Efficiency Boost in the AI-Infused POWER10 Processor (IBM)
 Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology (Samsung)
 Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families (AMD)
 RaPiD: AI Accelerator for Ultra-Low Precision Training and Inference (IBM)
 REDUCT: Keep It Close, Keep It Cool! Scaling DNN Inference on Multi-Core CPUs with Near-Cache Compute (Intel)
 Communication Algorithm-Architecture Co-Design for Distributed Deep Learning (UCSB, TAMU)
 ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-Based Near-Memory Processing with Inter-DIMM Broadcast (THU)
 Sieve: Scalable In-Situ DRAM-Based Accelerator Designs for Massively Parallel k-mer Matching (Virginia)
 FORMS: Fine-Grained Polarized ReRAM-Based In-Situ Computation for Mixed-Signal DNN Accelerator (Northeastern et al.)
 BOSS: Bandwidth-Optimized Search Accelerator for Storage-Class Memory (SNU)
 Accelerated Seeding for Genome Sequence Alignment with Enumerated Radix Trees (Umich)
 Aurochs: An Architecture for Dataflow Threads (Stanford)
 PipeZK: Accelerating Zero-Knowledge Proof with a Pipelined Architecture (PKU et al.)
 CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations (ETHZ)
 Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms (GaTech)
 CoSA: Scheduling by Constrained Optimization for Spatial Accelerators (Berkeley)
 ηLSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities (Washington et al.)
 FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining (MIT)
 PolyGraph: Exposing the Value of Flexibility for Graph Processing Accelerators (UCLA)
 Large-Scale Graph Processing on FPGAs with Caches for Thousands of Simultaneous Misses (EPFL)
 SPACE: Locality-Aware Processing in Heterogeneous Memory for Personalized Recommendations (Yonsei)
 ELSA: Hardware-Software Co-Design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks (SNU)
 Cambricon-Q: A Hybrid Architecture for Efficient Training (CAS)
 TENET: A Framework for Modeling Tensor Dataflow Based on Relation-Centric Notation (PKU et al.)
 NASGuard: A Novel Accelerator Architecture for Robust Neural Architecture Search (NAS) Networks (CAS)
 NASA: Accelerating Neural Network Design with a NAS Processor (CAS)
 Albireo: Energy-Efficient Acceleration of Convolutional Neural Networks via Silicon Photonics (Ohio et al.)
 QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips (ETHZ)
 NN-Baton: DNN Workload Orchestration and Chiplet Granularity Exploration for Multichip Accelerators (THU)
 SNAFU: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture (CMU)
 SARA: Scaling a Reconfigurable Dataflow Accelerator (Stanford)
 HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation (PKU et al.)
 SpZip: Architectural Support for Effective Data Compression In Irregular Applications (MIT)
 Dual-Side Sparse Tensor Core (Microsoft)
 RingCNN: Exploiting Algebraically-Sparse Ring Tensors for Energy-Efficient CNN-Based Computational Imaging (NTHU)
 GoSPA: An Energy-Efficient High-Performance Globally Optimized SParse Convolutional Neural Network Accelerator (Rutgers)
2021 VLSI
 MN-Core – A Highly Efficient and Scalable Approach to Deep Learning (Preferred Networks)
 CHIMERA: A 0.92 TOPS, 2.2 TOPS/W Edge AI Accelerator with 2 MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference (Stanford, TSMC)
 OmniDRL: A 29.3 TFLOPS/W Deep Reinforcement Learning Processor with Dual-Mode Weight Compression and On-Chip Sparse Weight Transposer (KAIST)
 DepFiN: A 12nm, 3.8TOPs Depth-First CNN Processor for High Res. Image Processing (Leuven)
 PNNPU: A 11.9 TOPS/W High-Speed 3D Point Cloud-Based Neural Network Processor with Block-Based Point Processing for Regular DRAM Access (KAIST)
 A 28nm 276.55TFLOPS/W Sparse Deep-Neural-Network Training Processor with Implicit Redundancy Speculation and Batch Normalization Reformulation (THU)
 A 13.7 TFLOPS/W Floating-point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory (KAIST)
 PIMCA: A 3.4Mb Programmable In-Memory Computing Accelerator in 28nm for On-Chip DNN Inference (ASU)
 A 6.54-to-26.03 TOPS/W Computing-In-Memory RNN Processor Using Input Similarity Optimization and Attention-Based Context-Breaking with Output Speculation (THU, NTHU)
 Fully Row/Column-Parallel In-Memory Computing SRAM Macro Employing Capacitor-Based Mixed-Signal Computation with 5b Inputs (Princeton)
 HERMES Core – A 14nm CMOS and PCM-Based In-Memory Compute Core Using an Array of 300ps/LSB Linearized CCO-Based ADCs and Local Digital Processing (IBM)
 A 20x28 Spins Hybrid In-Memory Annealing Computer Featuring Voltage-Mode Analog Spin Operator for Solving Combinatorial Optimization Problems (NTU, UCSB)
 Analog In-Memory Computing in FeFET-Based 1T1R Array for Edge AI Applications (Sony)
 Energy-Efficient Reliable HZO FeFET Computation-in-Memory with Local Multiply & Global Accumulate Array for Source-Follower & Charge-Sharing Voltage Sensing (Tokyo)
2021 ICCAD
 BitTransformer: Transforming Bit-level Sparsity into Higher Performance in ReRAM-based Accelerator (SJTU)
 Crossbar based Processing in Memory Accelerator Architecture for Graph Convolutional Networks (PSU, IBM)
 REREC: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation (Duke, THU)
 A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators (KAIST)
 A Convergence Monitoring Method for DNN Training of On-Device Task Adaptation (KAIST)
2021 HotChips
 Accelerating ML Recommendation with over a Thousand RISC-V/Tensor Processors on Esperanto’s ET-SoC-1 Chip (Esperanto Technologies)
 AI Compute Chip from Enflame (Enflame Technology)
 Qualcomm Cloud AI 100: 12 TOPs/W Scalable, High Performance and Low Latency Deep Learning Inference Accelerator (Qualcomm)
 Graphcore Colossus Mk2 IPU (Graphcore)
 The Multi-Million Core, Multi-Wafer AI Cluster (Cerebras)
 SambaNova SN10 RDU: Accelerating Software 2.0 with Dataflow (SambaNova)
2021 MICRO
 RACER: Bit-Pipelined Processing Using Resistive Memory (CMU, UIUC)
 AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning (Soongsil, ASU)
 DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware (USC)
 2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency (Rice)
 F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption (MIT, Umich)
 Equinox: Training (for Free) on a Custom Inference Accelerator (EPFL)
 PointAcc: Efficient Point Cloud Accelerator (MIT)
 Noema: Hardware-Efficient Template Matching for Neural Population Pattern Detection (Toronto, NeuroTek)
 SquiggleFilter: An Accelerator for Portable Virus Detection (Umich)
 EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference (Harvard et al.)
 HiMA: A Fast and Scalable History-Based Memory Access Engine for Differentiable Neural Computer (Umich)
 FPRaker: A Processing Element for Accelerating Neural Network Training (Toronto)
 RecPipe: Co-Designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance (Harvard, Facebook)
 ShiftBNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving (Houston et al.)
 Distilling Bit-Level Sparsity Parallelism for General Purpose Deep Learning Acceleration (ICT, UESTC)
 Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture (PKU)
 ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition (Duke, USC)
 SparseAdapt: Runtime Control for Sparse Linear Algebra on a Reconfigurable Accelerator (Umich et al.)
 Capstan: A Vector RDA for Sparsity (Stanford, SambaNova)
 I-GCN: A Graph Convolutional Network Accelerator with Runtime Locality Enhancement Through Islandization (PNNL et al.)
2021 DAC
 MAT: Processing In-Memory Acceleration for Long-Sequence Attention
 PIM-Quantifier: A Processing-in-Memory Platform for mRNA Quantification
 Network-on-Interposer Design for Agile Neural-Network Processor Chip Customization
 GCiM: A Near-Data Processing Accelerator for Graph Construction
 An Intelligent Video Processing Architecture for Edge-cloud Video Streaming
 Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration
 PixelSieve: Towards Efficient Activity Analysis From Compressed Video Streams
 TensorLib: A Spatial Accelerator Generation Framework for Tensor Algebra
 Scaling Deep-Learning Inference with Chiplet-based Architecture and Photonic Interconnects
 Dancing along Battery: Enabling Transformer with Runtime Reconfigurability on Mobile Devices
 Designing a 2048-Chiplet, 14336-Core Wafer-scale Processor
 Accelerating Fully Homomorphic Encryption with Processing in Memory
2022 ISSCC
 A 512Gb In-Memory-Computing 3D NAND Flash Supporting Similar Vector Matching Operations on AI Edge Devices
 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-Based Accelerator-In-Memory Supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep Learning Applications
 A 22nm 4Mb STT-MRAM Data-Encrypted Near-Memory-Computation Macro with 192GB/s Read-and-Decryption Bandwidth and 25.1-55.1 TOPS/W at 8b MAC for AI-Oriented Operations
 A 40nm 2M-cell 8b-Precision Hybrid SLC-MLC PCM Computing-in-Memory Macro with 20.5-65.0 TOPS/W for Tiny AI Edge Devices
 An 8Mb DC-Current-Free Binary-to-8b Precision ReRAM Nonvolatile Computing-in-Memory Macro using Time-Space-Readout with 1286.4-21.6 TOPS/W for AI Edge Devices
 Single-Mode 6T CMOS SRAM Macros with Keeper-Loading-Free Peripherals and Row-Separate Dynamic Body Bias Achieving 2.53fW/bit Leakage for AIoT Sensing Platforms
 A 5nm 254 TOPS/W and 221 TOPS/mm2 Fully Digital Computing-in-Memory Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations
 A 1.041Mb/mm2 27.38TOPS/W Signed-INT8 Dynamic-Logic-Based ADC-Less SRAM Compute-In-Memory Macro in 28nm with Reconfigurable Bitwise Operation for AI and Embedded Applications
 A 28nm 1Mb Time-Domain 6T SRAM Computing-in-Memory Macro with 6.6ns Latency, 1241 GOPS and 37.01 TOPS/W for 8b-MAC Operations for AI Edge Devices
 A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC
 A 65nm Systolic Neural CPU Processor for Combined Deep Learning and General-Purpose Computing with 95% PE Utilization, High Data Locality and Enhanced End-to-End Performance
 COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning
 Hiddenite: 4K-PE Hidden Network Inference 4D-Tensor Engine Exploiting On-Chip Model Construction Achieving 34.8-to-16.0TOPS/W for CIFAR-100 and ImageNet
 A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration
 DIANA: An End-to-End Energy-Efficient DIgital and ANAlog Hybrid Neural Network SoC
 ARCHON: A 332.7TOPS/W 5b Variation-Tolerant Analog CNN Processor Featuring Analog Neuronal Computation Unit and Analog Memory
 Analog Matrix Processor for Edge AI Real-Time Video Analytics
 A 0.8V Intelligent Vision Sensor with Tiny Convolutional Neural Network and Programmable Weights Using Mixed-Mode Processing-in-Sensor Technique for Image Classification
 184QPS/W 64Mb/mm2 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System
 A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing
 A 28nm 15.59μJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes
 ReckOn: A 28nm Sub-mm2 Task-Agnostic Spiking Recurrent Neural Network Processor Enabling On-Chip Learning over Second-Long Timescales
2022 HPCA
 LISA: Graph Neural Network based Portable Mapping on Spatial Accelerators
 Upward Packet Popup for Deadlock Freedom in Modular Chiplet-Based Systems
 FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding
 TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer
 An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores
 ScalaGraph: A Scalable Accelerator for Massively Parallel Graph Processing
 PIMCloud: QoS-Aware Resource Management of Latency-Critical Applications in Clouds with Processing-in-Memory
 ANNA: Specialized Architecture for Approximate Nearest Neighbor Search
 Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems
 NeuroSync: A Scalable and Accurate Brain Simulation System using Safe and Efficient Speculation
 Enabling High-Quality Uncertainty Quantification in a PIM Designed for Bayesian Neural Network
 Griffin: Rethinking Sparse Optimization for Deep Learning Architectures
 CANDLES: Channel-Aware Novel Dataflow-Microarchitecture Co-Design for Low Energy Sparse Neural Network Acceleration
 SPACX: Silicon Photonics-based Scalable Chiplet Accelerator for DNN Inference
 RM-SSD: In-Storage Computing for Large-Scale Recommendation Inference
 CAMA: Energy and Memory Efficient Automata Processing in Content-Addressable Memories
 TNPU: Supporting Trusted Execution with Tree-less Integrity Protection for Neural Processing Unit
 S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration
 Accelerating Graph Convolutional Networks Using Crossbar-based Processing-In-Memory Architectures
 Atomic Dataflow based Graph-Level Workload Orchestration for Scalable DNN Accelerators
 SecNDP: Secure Near-Data Processing with Untrusted Memory
 Direct Spatial Implementation of Sparse Matrix Multipliers for Reservoir Computing
 Hercules: Heterogeneity-aware Inference Serving for At-scale Personalized Recommendation
 ReGNN: A Redundancy-Eliminated Graph Neural Networks Accelerator
 Parallel Time Batching: Systolic-Array Acceleration of Sparse Spiking Neural Computation
 GCoD: Graph Convolutional Network Acceleration via Dedicated Algorithm and Accelerator Co-Design
 CoopMC: Algorithm-Architecture Co-Optimization for Markov Chain Monte Carlo Accelerators
 Application Defined On-chip Networks for Heterogeneous Chiplets: An Implementation Perspective
 The Specialized High-Performance Network on Anton 3
 DarkGates: A Hybrid Power-gating Architecture to Mitigate Dark Sides of Dark-Silicon in High Performance Processors
2022 ASPLOS
 DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration
 A Full-stack Search Technique for Domain Optimized Deep Learning Accelerators
 FINGERS: Exploiting Fine-Grained Parallelism in Graph Mining Accelerators
 BiSone: A Lightweight and High-Performance Accelerator for Narrow Integer Linear Algebra Computing on the Edge
 RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation
 AStitch: Enabling A New Multi-Dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures
 NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism
 VELTAIR: Towards High-Performance Multi-Tenant Deep Learning Services via Adaptive Compilation and Scheduling
 Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
 GenStore: An In-Storage Processing System for Genome Sequence Analysis
 ProSE: The Architecture and Design of a Protein Discovery Engine
 REVAMP: A Systematic Framework for Heterogeneous CGRA Realization
 Invisible Bits: Hiding Secret Messages in SRAM’s Analog Domain
2022 ISCA
 TDGraph: A Topology-Driven Accelerator for High-Performance Streaming Graph Processing
 DIMMining: Pruning-Efficient and Parallel Graph Mining on DIMM-based Near-Memory-Computing
 NDMiner: Accelerating Graph Pattern Mining Using Near Data Processing
 SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures
 Hyperscale FPGA-As-A-Service Architecture for Large-Scale Distributed Graph Neural Network
 Crescent: Taming Memory Irregularities for Accelerating Deep Point Cloud Analytics
 The Mozart Reuse Exposed Dataflow Processor for AI and Beyond
 Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
 AI Accelerator on IBM Telum Processor
 Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model
 Cascading Structured Pruning: Enabling High Data Reuse for Sparse DNN Accelerators
 Anticipating and Eliminating Redundant Computations in Accelerated Sparse Training
 SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM
 A Software-defined Tensor Streaming Multiprocessor for Large-Scale Machine Learning
 A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models
 Increasing Ising Machine Capacity with Multi-Chip Architectures
 Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
 AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction
 Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models
 Accelerating Attention through Gradient-Based Learned Runtime Pruning
2022 HotChips
 Groq Software-Defined Scale-out Tensor Streaming Multi-Processor
 Boqueria: A 2 PetaFLOPs, 30 TeraFLOPs/W At-Memory Inference Acceleration Device with 1,456 RISC-V Cores
 DOJO: The Microarchitecture of Tesla's Exa-Scale Computer
 DOJO: Super-Compute System Scaling for ML Training
 Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning
2022 MICRO
 Cambricon-P: A Bitflow Architecture for Arbitrary Precision Computing
 OverGen: Improving FPGA Usability Through Domain-specific Overlay Generation
 big.VLITTLE: On-Demand Data-Parallel Acceleration for Mobile Systems on Chip
 Pushing Point Cloud Compression to Edge
 ROG: A High Performance and Robust Distributed Training System for Robotic IoT
 Automatic Domain-Specific SoC Design for Autonomous Unmanned Aerial Vehicles
 GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs
 Skipper: Enabling Efficient SNN Training Through Activation-Checkpointing and Time-Skipping
 Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tiles
 HARMONY: Heterogeneity-Aware Hierarchical Management for Federated Learning System
 Adaptable Butterfly Accelerator for Attention-Based NNs via Hardware and Algorithm Co-Design
 DFX: A Low-Latency Multi-FPGA Appliance for Accelerating Transformer-Based Text Generation
 GenPIP: In-Memory Acceleration of Genome Analysis by Tight Integration of Basecalling and Read Mapping
 BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support
 ICE: An Intelligent Cognition Engine with 3D NAND-based In-Memory Computing for Vector Similarity Search Acceleration
 Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation
 Frac-DRAM: Fractional Values in Off-the-Shelf DRAM
 pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables
 Multi-Layer In-Memory Processing
 Flash-Cosmos: In-Flash Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory
 Scaling Superconducting Quantum Computers with Chiplet Architectures
 Towards Developing High Performance RISC-V Processors Using Agile Methodology
 A Data-Centric Accelerator for High-Performance Hypergraph Processing
 DPU-v2: Energy-Efficient Execution of Irregular Directed Acyclic Graphs
 3D-FPIM: An Extreme Energy-Efficient DNN Acceleration System Using 3D NAND Flash-Based In-Situ PIM Unit
 DeepBurning-SEG: Generating DNN Accelerators of Segment-Grained Pipeline Architecture
 ANT: Exploiting Adaptive Numerical Data Type for Low-Bit Deep Neural Network Quantization
 Sparseloop: An Analytical Approach to Sparse Tensor Accelerator Modeling
 Ristretto: An Atomized Processing Architecture for Sparsity-Condensed Stream Flow in CNN
2023 ISSCC
 MetaVRain: A 133mW Real-Time Hyper-Realistic 3D-NeRF Processor with 1D2D Hybrid-Neural Engines for Metaverse on Mobile Devices
 A 22nm 832kb Hybrid-Domain Floating-Point SRAM In-Memory-Compute Macro with 16.2-70.2TFLOPS/W for High-Accuracy AI-Edge Devices
 A 28nm 64kb 31.6TFLOPS/W Digital-domain Floating-Point-Computing-Unit and Double-bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs
 A 28nm 38-to-102TOPS/W 8b Multiply-Less Approximate Digital SRAM Compute-In-Memory Macro for Neural-Network Inference
 A 4nm 6163TOPS/W/b 4790TOPS/mm2/b SRAM-based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update
 A 28nm Horizontal-weight-shift and Vertical-feature-shift based Separate-wordline 6T-SRAM Computation-in-Memory Unit-Macro for Edge Depthwise Neural-Networks
 A 70.85-86.27TOPS/W PVT-Insensitive 8b Word-Wise ACIM with Post Processing Relaxation
 CV-CIM: A 28nm XOR-derived Similarity-aware Computation-In-Memory For Cost Volume Construction
 A 22nm Delta-Sigma Computing-In-Memory (ΔΣCIM) SRAM Macro with Near-Zero-Mean Outputs and LSB-First ADCs Achieving 21.38TOPS/W for 8b-MAC Edge AI Processing
 CTLE-Ising: A 1440-Spin Continuous-Time Latch-based Ising Machine with One-Shot Fully-Parallel Spin Updates Featuring Equalization of Spin States
 A 7nm ML Training Processor with Wave Clock Distribution
 A 1mW Always-on Computer Vision Deep Learning Neural Decision Processor
 MulTCIM: A 28nm 2.24μJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers
 A 28nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine
 A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture
 TensorCIM: A 28nm 3.7nJ/Gather and 8.3TFLOPS/W FP32 Digital-CIM Tensor Processor for MCM-CIM-Based Beyond-NN Acceleration
 DynaPlasia: An eDRAM In-Memory-Computing-Based Reconfigurable Spatial Accelerator with Triple-Mode Cell for Dynamic Resource Switching
 A Nonvolatile AI-Edge Processor with 4MB SLC-MLC Hybrid-Mode ReRAM Compute-in-Memory Macro and 51.4-251TOPS/W
 A 40-310TOPS/W SRAM-Based All-Digital Up to 4b In-Memory Computing Multi-Tiled NN Accelerator in FD-SOI 18nm for Deep-Learning Edge Applications
 A 12.4TOPS/W @ 136GOPS AI-IoT System-on-Chip with 16 RISC-V, 2-to-8b Precision-Scalable DNN Acceleration and 30%-Boost Adaptive Body Biasing
 A 28nm 2D/3D Unified Sparse Convolution Accelerator with Block-Wise Neighbor Searcher for Large-Scaled Voxel-Based Point Cloud Network
 A 127.8TOPS/W Arbitrarily Quantized 1-to-8b Scalable-Precision Accelerator for General-Purpose Deep Learning with Reduction of Storage, Logic and Latency Waste
 A 28nm 11.2TOPS/W Hardware-Utilization-Aware Neural-Network Accelerator with Dynamic Dataflow
 C-DNN: A 24.5-85.8TOPS/W Complementary-Deep-Neural-Network Processor with Heterogeneous CNN/SNN Core Architecture and Forward-Gradient-Based Sparsity Generation
 ANP-I: A 28nm 1.5pJ/SOP Asynchronous Spiking Neural Network Processor Enabling Sub-0.1μJ/Sample On-Chip Learning for Edge-AI Applications
 DL-VOPU: An Energy-Efficient Domain-Specific Deep-Learning-based Visual Object Processing Unit Supporting Multi-Scale Semantic Feature Extraction for Mobile Object Detection/Tracking Applications
 A 0.81mm2 740μW Real-Time Speech Enhancement Processor Using Multiplier-Less PE Arrays for Hearing Aids in 28nm CMOS
 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management
2023 HPCA
 SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators
 PhotoFourier: A Photonic Joint Transform Correlator-Based Neural Network Accelerator
 INCA: Input-stationary Dataflow at Outside-the-box Thinking about Deep Learning Accelerators
 GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks
 Logical/Physical Topology-Aware Collective Communication in Deep Learning Training
 Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation
 Baryon: Efficient Hybrid Memory Management with Compression and Sub-Blocking
 iCACHE: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training
 HIRAC: A Hierarchical Accelerator with Sorting-based Packing for SpGEMMs in DNN Applications
 VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs
 ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design
 Leveraging Domain Information for the Efficient Automated Design of Deep Learning Accelerators
 DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing
 Post0-VR: Enabling Universal Realistic Rendering for Modern VR via Exploiting Architectural Similarity and Data Sharing
 ParallelNN: A Parallel Octree-based Nearest Neighbor Search Accelerator for 3D Point Clouds
 ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with Linear Taylor Attention
 CTA: Hardware-Software Co-design for Compressed Token Attention Mechanism
 HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers
 GraNDe: Near-Data Processing Architecture With Adaptive Matrix Mapping for Graph Convolutional Networks
 DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling
 CEGMA: Coordinated Elastic Graph Matching Acceleration for Graph Matching Networks
 ISOSceles: Accelerating Sparse CNNs through Inter-Layer Pipelining
 OptimStore: In-Storage Optimization of Large Scale DNNs with On-Die Processing
 MERCURY: Accelerating DNN Training By Exploiting Input Similarity
 Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications
 eNODE: Energy-Efficient and Low-Latency Edge Inference and Training of Neural ODEs
 MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks
 MixGEMM: An Efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices
 FlowGNN: A Dataflow Architecture for Real-Time Workload-Agnostic Graph Neural Network Inference
 Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion
 Securator: A Fast and Secure Neural Processing Unit