Today's electronics rely on a trio of electronic memory technologies: SRAM for speed, DRAM for density, and Flash for non-volatility. Unfortunately, all three memory technologies face unprecedented challenges in scaling below the 40-nm node. As a result, industry watchers expect memory manufacturers to hit a wall sometime in early 2017. So what comes next? Researchers in both academia and industry have been searching for a "universal memory": a memory technology that can not only extend the scaling curve, but also incorporate all the benefits of SRAM, DRAM, and Flash, namely high speed, low power, and non-volatility, with CMOS BEOL-compatible materials, low cost, and unlimited endurance. Such a device would shake the foundations of the integrated circuit industry and turn the current paradigm of computer architecture on its head.

Among all of the candidates, STT-RAM is the current front runner in the "universal memory" race. STT-RAM uses the tunnel magnetoresistance (TMR) of a magnetic tunnel junction (MTJ) device to store digital information. The MTJ cell has a pair of nanomagnets, one with fixed polarity and one that can be electronically switched. Switching the polarity of the free magnetic layer writes a "0" or a "1", which is read out by sensing the resulting difference in resistance. STT-RAM has been demonstrated to have lower power than DRAM, smaller cell sizes than SRAM, the non-volatility of Flash, and virtually unlimited endurance. Therefore, STT-RAM is widely expected to replace SRAM-based caches in the next few years, revolutionizing the next generation of microprocessors with non-volatility, zero standby power, and instant-on capabilities.
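The read-out principle described above can be sketched in a few lines of Python. This is only an illustration: the resistance values, the midpoint reference, and the choice to map the low-resistance state to "0" are assumptions made for the example, not parameters of any device discussed here.

```python
# Illustrative values only; not taken from any measured device.
R_PARALLEL_OHMS = 2000.0      # low-resistance state (free and fixed layers aligned)
R_ANTIPARALLEL_OHMS = 4000.0  # high-resistance state (layers opposed)


def tmr_ratio(r_p: float, r_ap: float) -> float:
    """Tunnel magnetoresistance ratio, defined as (R_AP - R_P) / R_P."""
    return (r_ap - r_p) / r_p


def read_bit(cell_resistance: float, reference_resistance: float) -> int:
    """Return 1 if the cell looks high-resistance, else 0 (assumed convention)."""
    return 1 if cell_resistance > reference_resistance else 0


# Assumed reference: midway between the two nominal states.
reference = (R_PARALLEL_OHMS + R_ANTIPARALLEL_OHMS) / 2

print(tmr_ratio(R_PARALLEL_OHMS, R_ANTIPARALLEL_OHMS))
print(read_bit(R_PARALLEL_OHMS, reference))
print(read_bit(R_ANTIPARALLEL_OHMS, reference))
```

A larger TMR ratio widens the gap between the two states, which makes the comparison against the reference more tolerant of noise and device variation.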

However, a couple of challenges still need to be overcome before STT-RAM can transform the computing industry. One of the key challenges is the poor read performance and reliability of the MTJ bit-cell, caused by the large variation of the two resistance states. As such, our research efforts have focused on developing a comprehensive circuit-device co-design methodology. Extensive device modeling has allowed us to map the design space, explore the feasibility of deploying multiple MTJs per access transistor for improved memory density, and conduct several benchmarking studies to inform the future direction of device development.

We have also developed and patented a novel reading circuit that can effectively and reliably enable the ultra-high-speed reading of STT-RAM at GHz frequencies using a short pulse of sensing current. Coupled with a systematic calibration method for optimizing the reference voltage of local sensing circuits in the presence of device variations, the Body-Voltage Sensing Circuit (BVSC) guarantees low-power operation and minimal bit error rate to maximize yield.
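To illustrate why the reference voltage matters when the two states vary, the sketch below models the sensed voltage of each state as a Gaussian distribution and scans for the reference that minimizes the read error rate. This is a generic textbook-style model, not the BVSC or the calibration method mentioned above; the Gaussian assumption, the function names, and the example numbers are all hypothetical.

```python
import math


def gaussian_tail(x: float, mean: float, sigma: float) -> float:
    """P(X > x) for X ~ Normal(mean, sigma)."""
    return 0.5 * math.erfc((x - mean) / (sigma * math.sqrt(2)))


def bit_error_rate(v_ref, mean0, sigma0, mean1, sigma1):
    """Read error probability, assuming stored 0s and 1s are equally likely."""
    p_zero_read_as_one = gaussian_tail(v_ref, mean0, sigma0)
    p_one_read_as_zero = 1.0 - gaussian_tail(v_ref, mean1, sigma1)
    return 0.5 * (p_zero_read_as_one + p_one_read_as_zero)


def calibrate_reference(mean0, sigma0, mean1, sigma1, steps=1000):
    """Scan candidate references between the two means and keep the best one."""
    best_v, best_ber = mean0, float("inf")
    for i in range(steps + 1):
        v = mean0 + (mean1 - mean0) * i / steps
        ber = bit_error_rate(v, mean0, sigma0, mean1, sigma1)
        if ber < best_ber:
            best_v, best_ber = v, ber
    return best_v, best_ber


# Hypothetical sensed voltages (volts): the "1" state varies more than the "0" state.
v_ref, ber = calibrate_reference(mean0=0.40, sigma0=0.02, mean1=0.60, sigma1=0.04)
print(v_ref, ber)
```

With unequal spreads, the best reference in this model sits closer to the tighter distribution rather than at the midpoint, which is one reason a fixed midpoint reference can cost yield.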

**PI:** Prof. Dejan Marković, Prof. Chih-Kong Ken Yang, Prof. Kang L. Wang.

**Collaborators:** This project is a collaboration with the Device Research Laboratory and the Western Institute of Nanoelectronics at UCLA, imec in Leuven, Belgium, and GlobalFoundries. The related work is supported by the Defense Advanced Research Projects Agency (DARPA) STT-RAM program (HR0011-09-C-0114).

The continued scaling of CMOS for high-performance and low-power applications has become exponentially more difficult and expensive. Unsustainable power densities, poor device matching, lithography limitations, and production costs will almost guarantee that the 14-nm node will be the last generation of CMOS. Future device technologies will need to address the problem of power with non-volatility and the fundamental issue of scaling by moving to three dimensions.

We have developed several candidate technologies, based on spintronic logic devices, to replace CMOS. Using magnetic tunnel junction (MTJ) devices with voltage-controlled magnetic anisotropy (VCMA), we have developed and patented several novel memory circuits, including a read-disturbance-free non-volatile content addressable memory (CAM) and a high-speed magneto-electric random access memory (MeRAM). In addition, magnetic devices based on the giant spin-Hall effect (SHE) can operate with power supplies as low as 100 mV. These devices are 3D stackable and show excellent promise toward the development of a single-electron transistor.

Spin-wave logic devices encode digital information into the phase of a resting spin-wave. The basic building blocks—voltage-to-spin wave and wave-to-voltage converters, spin waveguides, spin modulators, and the magnetoelectric cell—enable reconfigurable and 3D stackable digital logic with power supplies as low as 10 mV.
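As a toy illustration of phase encoding, the sketch below represents each wave as a unit complex amplitude whose phase is 0 for a "0" and π for a "1", and lets three waves interfere. Reading the phase of the sum yields the majority of the inputs. The majority gate is an assumption chosen for the example, since it is a commonly described interference-based gate; the text does not state which gates these devices implement, and the function names are hypothetical.

```python
import cmath
import math


def encode(bit: int) -> complex:
    """Map a bit to a unit-amplitude wave: phase 0 for 0, phase pi for 1."""
    return cmath.exp(1j * math.pi * bit)


def decode(wave: complex) -> int:
    """Recover a bit from the phase of a wave (negative real part means phase pi)."""
    return 1 if wave.real < 0 else 0


def majority(a: int, b: int, c: int) -> int:
    """Superpose three phase-encoded waves and read the phase of the result."""
    return decode(encode(a) + encode(b) + encode(c))


for bits in [(0, 0, 1), (1, 0, 1), (1, 1, 1)]:
    print(bits, majority(*bits))
```

Because the two minority and majority phases are opposite, the minority wave cancels one of the majority waves and the surviving amplitude carries the majority phase.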

**PI:** Prof. Dejan Marković, Prof. Pedram Khalili Amiri, Prof. Kang L. Wang.

**Collaborators:** This project is a collaboration with the Device Research Laboratory, the Western Institute of Nanoelectronics, and the NSF Nanosystems Engineering Research Center for Translational Applications of Nanoscale Multiferroic Systems (TANMS) at UCLA. The related work is supported by the Defense Advanced Research Projects Agency (DARPA) Non-Volatile Logic program (HR0011-10-C-0153).

Sparse linear algebra arises in a wide variety of computational disciplines, including medical imaging, 3D graphics, compressive sensing, neural networks, applied mathematics, and various optimization problems. Since the 1970s, the basic linear algebra subroutines (BLAS) have been the de facto programming standard for performing linear algebra in high performance computing (HPC) environments. However, the computational throughput of sparse-BLAS libraries has always lagged significantly behind that of their dense counterparts because of a fundamental mismatch between the compression formats required to efficiently store sparse matrices and traditional von Neumann computing architectures. Nowhere is this more apparent than in the field of bioinformatics, where exponential growth in data has overwhelmed researchers. Given that petabytes of new biological and health data are being generated every single year, software running on multi-core microprocessors (CPUs) and graphics processing units (GPUs) simply cannot keep up with demand. If we are to close the gap between sparse and dense BLAS, and effectively leverage big data to address major social challenges, radical new computing architectures will be needed.
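The mismatch between compressed storage and conventional memory systems can be seen in a small sparse matrix-vector product. The sketch below uses the compressed sparse row (CSR) format as an assumed example of such a compression format; it is plain Python for illustration and is not the architecture or kernel described in this project.

```python
def dense_to_csr(dense):
    """Compress a dense matrix (list of rows) into CSR arrays."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(col)
        # row_ptr[i + 1] marks where row i ends in the values array.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, x):
    """Compute y = A @ x for a CSR matrix A."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read at a data-dependent index, so its accesses are irregular.
            y[row] += values[k] * x[col_indices[k]]
    return y


dense = [[4, 0, 0],
         [0, 0, 2],
         [1, 0, 3]]
values, col_indices, row_ptr = dense_to_csr(dense)
print(csr_matvec(values, col_indices, row_ptr, [1, 2, 3]))  # [4.0, 6.0, 10.0]
```

The indirect read `x[col_indices[k]]` is the crux: each multiply needs an index lookup before the operand can be fetched, which defeats the regular, predictable access patterns that caches and wide memory buses are built for.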

To that end, we have developed a parameterized and scalable VLSI architecture (soft IP core) that can improve the computational and energy efficiency of sparse linear algebra by several orders of magnitude. With the help of CAD tools, this soft IP core can be easily implemented in programmable logic devices such as field programmable gate arrays (FPGAs) or as part of a system-on-chip (SoC) to accelerate sparse algorithms.

We have prototyped the sparse-BLAS kernel on several FPGA systems (Kintex-7, ProcStar IV, and “ROACH” FPGA boards) and demonstrated big data graph clustering for bioinformatics, iterative 3D tomographic reconstructions, and compressive sensing reconstructions. A test chip targeting bioinformatics and big data applications will be taped out at the end of 2014.

**PI:** Prof. Dejan Marković.

**Collaborators:** This project has been supported by generous donations from Xilinx, Inc., Altera Corp., and GiDEL, Inc.