

Design of an Always-On Deep Neural Network-Based 1-μW Voice Activity Detector Aided With a Customized Software Model for Analog Feature Extraction
Minhao Yang , Member, IEEE, Chung-Heng Yeh, Yiyin Zhou , Joao P. Cerqueira , Student Member, IEEE,
Aurel A. Lazar , Life Fellow, IEEE, and Mingoo Seok , Senior Member, IEEE

Abstract— This paper presents an ultra-low-power voice activity detector (VAD). It uses analog signal processing for acoustic feature extraction (AFE) directly on the microphone output, approximate event-driven analog-to-digital conversion (ED-ADC), and a digital deep neural network (DNN) for speech/non-speech classification. New circuits, including the low-noise amplifier, bandpass filter, and full-wave rectifier, contribute to the more than 9× normalized power/channel reduction in the feature extraction front-end compared to the best prior art. The digital DNN is a three-hidden-layer binarized multilayer perceptron (MLP) with a 2-neuron output layer and a 48-neuron input layer that receives parallel event streams from the ED-ADCs. To obtain the DNN weights via off-line training, a customized front-end model written in Python is constructed to accelerate feature generation in software emulation, and the model parameters are extracted from Spectre simulations. The chip, fabricated in 0.18-μm CMOS, has a core area of 1.66 × 1.52 mm² and consumes 1 μW. The classification measurements using the 1-hour 10-dB signal-to-noise ratio audio with restaurant background noise show a mean speech/non-speech hit rate of 84.4%/85.4% with a 1.88%/4.65% 1-σ variation across ten dies that are all loaded with the same weights.

Index Terms— Analog signal processing, approximate quantization, bandpass filter (BPF), binarized neural network (BNN), classification, computer-aided design, event driven, feature extraction, full-wave rectifier (FWR), hardware/software co-design, integrate and fire (IAF), Internet of Things (IoT), low-noise amplifier (LNA), machine learning, multilayer perceptron (MLP), ultra-low power (ULP), voice activity detection (VAD), wearable electronics.

Manuscript received August 22, 2018; revised November 14, 2018; accepted January 9, 2019. Date of publication April 18, 2019; date of current version May 24, 2019. This paper was approved by Associate Editor Dennis Sylvester. This work was supported in part by the Swiss National Science Foundation (SNSF) Early Postdoc Mobility Fellowship, in part by the Columbia University Research Initiatives in Science and Engineering (RISE), and in part by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-16-1-0410. (Corresponding author: Minhao Yang.)
M. Yang is with the Institute of Microengineering, EPFL, 2000 Neuchatel, Switzerland (e-mail: yangmh.ic@gmail.com).
C.-H. Yeh, Y. Zhou, J. P. Cerqueira, A. A. Lazar, and M. Seok are with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2019.2894360

I. INTRODUCTION

A voice activity detector (VAD) distinguishes speech from the background noise. In the era of the Internet of Things (IoT), VADs have become increasingly important in audio systems: they are often used for power gating of functional blocks with more advanced inference capabilities, such as speaker and speech recognition, for significant power saving [1], [2]. Therefore, as an always-on sensing unit, a VAD should provide sufficient classification accuracy with minimum power consumption. A conventional digital-intensive VAD, as shown in Fig. 1(a), is composed of a low-noise amplifier (LNA), an analog-to-digital converter (ADC), a digital acoustic feature extraction (AFE) block, and a digital classifier. The state-of-the-art VAD including the AFE and classifier consumes 22 μW in 65-nm LP CMOS [2]. Further power reduction can resort to analog signal processing (ASP), leveraging its superior power efficiency under the condition of low-to-medium signal-to-noise ratio (SNR) [3], which is, indeed, the case with inference sensing like a VAD.

For AFE, the research in ASP has been ongoing for over three decades. Early works focused on using switched-capacitor (SC) circuits to implement spectrum analysis in parallel architectures [4]–[7]. The neuromorphic approach of silicon cochlea design, started in Carver Mead's lab in the late 1980s, serves the purpose of signal magnitude estimation of each frequency channel in continuous time while taking other bio-mimicking factors into account [8]–[18]. In particular, some of the designs encode extracted features into multi-channel parallel asynchronous event streams [12], [16], simulating the spike trains that are encoded in the cochlea and propagate on the auditory nerve fibers down to the primary auditory cortex. A possible analog-intensive VAD architecture taking biomimetic inspiration is depicted in Fig. 1(b). Here the features are extracted in the analog domain, and then encoded by event-driven ADCs (ED-ADCs) into parallel events. The necessity and advantages of using approximate ED-ADCs, in contrast to conventional clocked ADCs with high signal-to-noise-and-distortion ratio (SNDR), as will become clear later, are the combined functions of integration and quantization, the flexibility and convenience of frame shifting, and the high area and power efficiency. Recent exemplary works emphasize the ultra-low-power (ULP) design for IoT, e.g., the 128-channel AFE plus ED-ADCs in [19] consumes only 55 μW.

Fig. 1. (a) Conventional digital-intensive and (b) analog-intensive signal processing chain in a VAD.

The problem of any analog AFE design is the plague of performance variation caused by device mismatch, which is aggravated in the subthreshold region for ULP [20]. One common coping method is the equalization technique widely adopted in digitally assisted ADCs [21], [22]. The idea of using polynomials to compensate the analog non-idealities like mismatch and nonlinearity is feasible in ADC design because the learning of the polynomial coefficients requires only a small pool of training samples and takes merely tens of milliseconds [22].

In inference applications, a similar concept was conceived by exploiting the powerful learning capacity of a classifier model, e.g., a support vector machine (SVM) [23] or a decision tree (DT) [24]. It was demonstrated that, even with extreme non-idealities in feature extraction, the classification accuracy can be restored after chip-specific training to be close to that of software emulation. However, this chip-wise training scheme is not feasible because the amount of training samples is usually large for machine learning models, implying time-consuming per-chip feature measurements and training procedures, and consequently a high production cost; e.g., the computational VAD work in [25] shows that, in noise-independent training, a 35-hour-long noisy speech is used to train the boosted deep neural network (DNN) model. To alleviate the costly post-fabrication effort, we believe it is advisable to control the classification degradation at design time. To this end, we build a Python model for the front-end, including the LNA, the analog AFE, and the ED-ADC, with parameters extracted from post-layout Spectre simulations. Acoustic features can be extracted much faster with the model running on a GPU compared to transistor-level simulations. The model-generated features are then used as the input of a DNN model for supervised training. If the classification accuracy in model testing, using features extracted from the model with added parameter variation, degrades too much, the variation is reduced until the accuracy becomes acceptable. The front-end circuits are then redesigned to conform to the variation constraints in order to meet the accuracy requirement. The design flow described above is illustrated in Fig. 2. This design flow can help avoid excessively reducing the AFE variation, which comes at the cost of silicon area [20].

Fig. 2. VAD front-end design flow with a customized software model.

The normalized power per channel of the front-end in this paper is more than 9× smaller than the latest silicon cochlea [19]. This power reduction is partly attributed to: first, no opamp is used in the AFE and ED-ADC of the new design; second, the input-referred noise (IRN) requirement of the bandpass filter (BPF) in the AFE is relaxed thanks to the LNA in front of the BPF bank. A multilayer perceptron (MLP), a type of fully connected DNN, is employed as the classifier for VAD. To significantly reduce the computational load of conventional floating-point DNNs, the recently published binarized neural network (BNN) [26] with binary weights and activations is employed for ULP. In several image datasets, BNNs have shown less than 1% accuracy degradation [26], and our simulation shows similar results for VAD, while the energy per MAC operation in a DNN is inversely proportional to the bit width [27]. A digital implementation was chosen for exact software-to-hardware mapping, allowing us to focus on the challenge of the approximate computing nature of the front-end feature extraction.

II. SYSTEM DESIGN AND MODELING

The system architecture of the VAD is shown in Fig. 3. Input audio signals are amplified by the LNA, and then analyzed by 16 parallel channels. Each channel contains a BPF, a full-wave rectifier (FWR), and an integrate-and-fire (IAF) encoder as the ED-ADC. The central frequencies of the BPFs across channels are geometrically scaled from about 100 Hz to 5 kHz. The features generated after the FWR are encoded into asynchronous events by the IAF, and then sent to the digital BNN. The event sequence ideally can be described by the following equation:

\frac{1}{C_{\mathrm{int}}}\int_{t_j}^{t_{j+1}}\left|f_{v\to i}\big(v_{\mathrm{oBPF\_}k}(t)\big)\right|\,dt = V_{\mathrm{refdn}} \qquad (1)

where t_j is the time stamp of the j-th event, v_oBPF_k is the BPF output voltage in channel k, f_{v→i} is the voltage-to-current conversion function, and C_int and V_refdn are the integration capacitance and the threshold voltage of the IAF, respectively. The time interval between two adjacent events is a function of the integrated v_oBPF_k.
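A minimal behavioral sketch of one channel of the event-encoding chain in (1) is given below. It is our own illustration of how such a front-end model can be written in Python, not the authors' fitted model: the bandpass filter is a generic 2nd-order Butterworth stand-in, and the values of gm, i_max, c_int, v_refdn, and i_leak are illustrative (the reset-to-zero behavior and the C_int·V_refdn threshold follow the description in the text; the audio-rate time step is a simplification).

```python
import numpy as np
from scipy.signal import butter, lfilter

def channel_event_times(audio, fs, f_lo, f_hi,
                        gm=100e-9, i_max=5e-9,
                        c_int=0.9e-12, v_refdn=0.15, i_leak=10e-12):
    """One AFE channel of a behavioral model: BPF -> FWR -> IAF, per eq. (1)."""
    # 2nd-order bandpass as a stand-in for the fitted SSF-based BPF response
    b, a = butter(1, [f_lo, f_hi], btype="bandpass", fs=fs)
    v_bpf = lfilter(b, a, audio)

    # full-wave rectified output current; gm and i_max are placeholder values
    i_rec = np.clip(np.abs(gm * v_bpf), 0.0, i_max) + i_leak

    # integrate-and-fire encoder: emit an event and reset v_int to 0 whenever
    # the integrated charge reaches c_int * v_refdn (the threshold in eq. (1))
    dt, v_int, events = 1.0 / fs, 0.0, []
    for n, i in enumerate(i_rec):
        v_int += i * dt / c_int
        if v_int >= v_refdn:
            events.append(n * dt)   # event timestamp t_j
            v_int = 0.0
    return np.asarray(events)

# example: a 1-kHz channel driven by white noise sampled at 16 kHz
events = channel_event_times(np.random.randn(16000) * 1e-3, 16000, 900, 1100)
```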


Fig. 3. VAD system architecture using analog acoustic feature extraction and digital classification, with event-driven analog-to-digital conversion.

In audio signal processing, a frame length of 25 ms is typically used, with a 10-ms frame shift. Abiding by this convention, one input neuron of the BNN has the value of the event number (EN) accumulated in a 25-ms window, which can be seen as the quantized area under the waveform of the rectified v_oBPF_k within 25 ms. Incorporating multiple contextual neighboring frames can improve the classification accuracy [25]. As a tradeoff between accuracy and power, only two neighboring frames are used, i.e., to classify frame n, frames (n − 3), n, and (n + 3) are used together as the input, and hence, the number of input neurons is 3× the front-end channel number. Because this scheme utilizes frames in the future, it results in latency, which is 30 ms in the case here. Latency is an important metric in applications like VAD-assisted hearing aids [28], and requires a careful tradeoff with accuracy and power.

Assuming a passive microphone that gives a 30-μVrms output signal at a 65-dB sound pressure level (SPL) for a normal-loudness conversation at 1-m distance [24], and a minimum 10-dB SNR at the input of the VAD system, an IRN of 10 μVrms needs to be achieved. A maximum LNA gain of 42 dB and a maximum BPF gain of 18 dB ensure a more than 50-dB dynamic range (DR) of the FWR output current with a 10-pA leakage current. The event number within a 25-ms frame is designed to be nominally less than 255 in response to speech, i.e., 8 bit, even though the theoretical maximum is about 900 given the 5-nA maximum value of f_{v→i}, the 0.9-pF C_int, and the 0.15-V V_refdn in (1).

The audio dataset we use is built on the Aurora4 corpus [29]. Three hundred clean utterances with a 16-kHz sampling rate and a 16-bit resolution are randomly selected from Aurora4 as the basis for the training dataset, and another 30 and 300 clean utterances for the development and test datasets [25]. The utterances in each dataset are concatenated. The total lengths of the training and development audio are about 37 and 3 minutes, respectively. An additional non-speech section is added to the test audio to balance the speech and non-speech periods, so the test audio has a total length of about 1 hour. The noise corpus DEMAND contains 18 different noise scenarios including metro and restaurant [30]. Noise audio of different scenarios is mixed with the clean speech at different SNR levels using the noise-mixing software program downloaded from aurora.hsnr.de.

The acoustic features for off-line training of the BNN are generated by the customized front-end Python model. The LNA is modeled by a high-pass and a low-pass transfer function. An extra low-pass transfer function is added to model the LNA's larger-than-20-dB/decade roll-off. The BPFs are modeled by 2nd-order bandpass transfer functions with geometrically scaled central frequencies. The FWRs are modeled by the transfer functions of the output current in relation to the input voltage. The IAF models the integration of the input current on a capacitor, and whenever the capacitor potential exceeds a predefined threshold, the capacitor voltage is reset to 0 and an event is generated. We fit the models of all the building blocks with parameters from Spectre simulations with non-idealities considered, such as the frequency dependency of the FWR transfer functions, the finite bandwidth of the comparator in the IAF, etc. To demonstrate the efficacy of our AFE model, we show in Fig. 4(a) the comparison of the extracted features, i.e., EN along the frame sequence, using both the Python model and the transient Spectre simulation, with an utterance from Aurora4 shown in Fig. 4(c) as the input. The two sets of features almost overlap in all 16 channels. Fig. 4(b) shows the small differences between the two sets of features.

The off-line training of the BNN classifier uses the features generated by the Python model. The BNN has three hidden layers, which have 60, 24, and 11 neurons, respectively. The number of hidden layers and neurons is heuristically selected for the best possible hit-rate performance within the power budget. The activations of the two output neurons are compared without binarization to classify a frame as either a voice or a noise frame. The optimization algorithm is the modified stochastic gradient descent for BNNs [26]. For training, the feature frames are randomly shuffled, but the contextual window including frames n − 3, n, and n + 3 is maintained. The important training parameters are: batch size 200, learning rate 0.0003, number of epochs 2000, and dropout rate of hidden neurons 0.2. The BNN models are separately trained for each noisy speech with a specific noise type and SNR value. This is called noise-dependent training [25], in contrast to noise-independent training, where the goal is that a VAD can be used in any noise scenario and in a wide range of SNR conditions without changing the classifier parameters. Arguably, the latter may require a much larger network size and training datasets.
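The sketch below illustrates how the per-channel event times can be turned into the 48-dimensional BNN input (EN per 25-ms frame with a 10-ms shift, stacking frames n − 3, n, and n + 3), together with the training hyperparameters quoted above. The function names and the channel ordering of the concatenation are our own assumptions, not taken from the authors' code.

```python
import numpy as np

FRAME_LEN, FRAME_SHIFT = 0.025, 0.010   # seconds, as stated in the text

def event_numbers(event_times, n_frames):
    """EN per 25-ms frame with 10-ms shift for one channel."""
    en = np.zeros(n_frames, dtype=np.int32)
    for k in range(n_frames):
        t0 = k * FRAME_SHIFT
        en[k] = np.count_nonzero((event_times >= t0) &
                                 (event_times < t0 + FRAME_LEN))
    return en

def bnn_input(en_all, n):
    """48-neuron input for frame n: 16 channels x frames (n-3, n, n+3)."""
    return np.concatenate([en_all[:, n - 3], en_all[:, n], en_all[:, n + 3]])

# training hyperparameters quoted in Section II (modified SGD for BNNs [26])
TRAIN_CFG = dict(hidden_layers=(60, 24, 11), batch_size=200,
                 learning_rate=3e-4, epochs=2000, dropout=0.2)
```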


Fig. 4. (a) Comparison of the 16-channel acoustic features in event number (EN) over a frame sequence generated by the customized Python model (red hollow circles) and by the post-layout transient Spectre simulation (black line). (b) EN difference (ΔEN). (c) Audio waveform of the utterance that is selected from the Aurora4 corpus and used to generate the features (maximum amplitude normalized to 1).

III. CIRCUIT IMPLEMENTATION

A. Biasing Circuits

A subthreshold proportional-to-absolute-temperature circuit [31] with a 33-MΩ off-chip resistor generates the main bias current of 1.62 nA at room temperature, and this current is distributed to all the analog building blocks. The geometrically scaled bias currents for the BPFs are generated by a pFET-based current divider similar to the one used in [19].

B. Low-Noise Amplifier

Fig. 5. LNA block diagram.

The block diagram of the ac-coupled LNA is shown in Fig. 5. It consists of the main amplification path and the dc-servo loop (DSL) path. The closed-loop gain is determined by the capacitance ratio Cin/Cfb. Cin is made programmable to accommodate a large range of SPL, and accordingly the gain can be adjusted from 24 dB to 42 dB with a 6-dB step. For a 42-dB maximum gain, a two-stage amplifier is needed in the main path to have a sufficient dc open-loop gain, and its stability requires compensation. Recently, some ULP designs place the dominant pole at the output of the 2nd stage [32], [33] to avoid the power waste of pushing the non-dominant pole of the 2nd stage out for enough phase margin (PM) in compensation schemes like Miller compensation, but analysis shows that this method cannot achieve minimal power consumption in our design with the specified gain-bandwidth product (GBW). As a result, the dominant pole is still placed at the output of the 1st stage.

Fig. 6. Removal of one tail current of the inverter-based input stage.

An inverter-based input stage has long been used to improve the noise efficiency factor (NEF) of amplifiers. Two tail current sources are normally required, one for the bias current, and the other for the common-mode feedback (CMFB) [32]–[36], as shown on the left side of Fig. 6. It is desirable to remove one tail current for extra voltage headroom, which is important for robust operation over process, voltage, and temperature (PVT) variations under a low supply voltage.
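As a quick consistency check of the capacitive-feedback gain programming described in this subsection (assuming the ideal closed-loop gain Cin/Cfb), the capacitor ratios implied by the 24–42-dB range in 6-dB steps are:

```python
for gain_db in (24, 30, 36, 42):
    print(gain_db, "dB  ->  Cin/Cfb ≈", round(10 ** (gain_db / 20), 1))
# 24 dB -> 15.8, 30 dB -> 31.6, 36 dB -> 63.1, 42 dB -> 125.9
# (the ratio doubles for every 6-dB gain step)
```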


Fig. 7. Schematic of the main amplifier in the LNA.

One possibility is to remove Mtail, bias the gate of Mp at a different dc from Mn, and ac-couple the input signal like in [37]. However, in the audio frequency range, this requires the use of pseudo-resistors whose resistance is highly susceptible to PVT variations, and consequently the associated high-pass corner frequency and the recovery time of the amplifier from any common-mode disturbance are poorly defined [33].

Another possibility is to remove Ibias, as shown on the right side of Fig. 6. Due to the lack of a current source for Mn, the input dc voltage, which is the output dc voltage of the DSL amplifier, needs to be set to the trip point of the inverter. This trip point can be provided by a replica inverter, as shown in Fig. 7. The replica inverter is composed of M0 and M1, which are compound transistors and topologically match the input transistors M2–M5 but with scaled sizes to reduce the bias current. The trip point voltage Vb is tied to the body of M2–M5 to avoid the large width that can lower the non-dominant pole frequency [38] and elevate the IRN of the LNA [39] due to parasitic capacitance. The use of compound transistors is for the pseudo-cascode compensation [40], [41] using Cc and Rc, instead of the Miller compensation, for higher PM and unity-gain frequency. To reduce the bias current of the 2nd stage, we use the positive feedback via Cf and Rf, adapted from the bandwidth enhancement techniques [42], [43], to have sufficient PM. The simulated frequency response of the main amplifier and the LNA at a 42-dB gain in Fig. 8 shows the efficacy of the positive feedback used in conjunction with the pseudo-cascode compensation. The 3-dB bandwidth of the main amplifier increases by 3.3× and the PM is improved from 20° to 56°. The bandwidth of the LNA is improved from 2.5 kHz to 6.1 kHz with a steeper roll-off near the low-pass corner frequency. The output dc is set to Vmid, i.e., VDD/2, via the CMFB amplifier composed of M9–M13, Rcm, and Ccm. The pseudo-resistor Rcm with symmetric resistance [19] is implemented with two pFETs connected as M14 and M15.

Fig. 8. Simulated frequency response of the main amplifier in the LNA with (top) and without (bottom) the positive feedback networks.

Fig. 9. Schematic of the DSL amplifier in the LNA.

The DSL amplifier is shown in Fig. 9. Its input and output are the output and input of the main amplifier, respectively. The output dc voltage is set to Vb via its CMFB amplifier composed of M4–M9. The amplifier is biased at the picoampere level to give a well-defined high-pass corner frequency of the LNA of less than 100 Hz. The stability of the closed loop formed by the main and DSL amplifiers is guaranteed by the very low frequency of a dominant pole at the node vx.

C. Bandpass Filter

In off-line MLP training using the features generated by the front-end Python model, we found that the classification results are satisfying with low quality factors and low BPF orders; hence, we did not choose to use the source-follower-based 4th-order BPF in [19], which can synthesize a high quality factor but requires voltage summation using a power-consuming opamp to transform a low-pass transfer function into a bandpass one.


Fig. 10. Schematic of the SSF-based BPF with output buffer and input dc bias. The fabricated circuit is a differential version.

Fig. 11. Simulated differential SSF-based BPF output noise spectral density of channel 15 at Vout.

The new 2nd-order BPF circuit shown in Fig. 10 is based on the super-source-follower (SSF) topology, which, together with its mutants, including the self-coupled source follower and the flipped source follower, has been used to construct power-efficient low-pass filters [44]–[46]. Gigahertz BPFs that are topologically similar to the BPF shown in Fig. 10 but utilize a transistor's source as the input terminal can be found in [47] and [48]. The transfer function H_BPF(s), central frequency f_0, quality factor Q, and peak gain A_0 are derived as

H_{\mathrm{BPF}}(s) = \frac{v_{\mathrm{out}}}{v_{\mathrm{in}}} = -\frac{s\,C_2/g_{m2}}{s^2\frac{C_1C_2}{g_{m1}g_{m2}} + s\frac{C_1}{g_{m2}} + 1} \qquad (2)

f_0 = \frac{1}{2\pi}\sqrt{\frac{g_{m1}g_{m2}}{C_1C_2}},\qquad Q = \sqrt{\frac{g_{m2}C_2}{g_{m1}C_1}},\qquad A_0 = \frac{C_2}{C_1} \qquad (3)

where g_m1 and g_m2 are the transconductances of M0 and M1, respectively. In subthreshold, g_m1 and g_m2 are proportional to I_0 and I_1 − I_0, respectively. The Q and A_0 are thus determined by the current and capacitance ratios.

The dc voltage of v_out is susceptible to the V_th variation of M1, and so is the dc voltage of v_x to the V_th variation of M0 with a fixed input dc. This can cause serious headroom problems for a large signal swing. A simple compensation scheme uses the diode-connected M3 with its source connected to a reference voltage Vrefup. The input dc of M0 is the gate voltage of M3, and the input to the BPF is ac-coupled. The pseudo-resistor Racin has the same implementation as the one shown in Fig. 7. To drive the input capacitance of the FWR, a source-follower buffer is used. Its input is ac-coupled, and the input dc is set to ground for sufficient output swing over PVT. The pseudo-resistor Racout is implemented with two pFETs configured as M4 and M5. Because the signal transient at its terminal connected to Cacout can go below ground, the topology of Racin is not used.

The output noise v_n,out of the SSF-based BPF core at the node v_out can be derived as

v_{n,\mathrm{out}}^2 = \frac{1}{g_{m1}^2}\left[i_{n,M0}^2\,|H_{\mathrm{BPF}}(s)|^2 + \frac{g_{m1}^2}{g_{m2}^2}\big(i_{n,M1}^2 + i_{n1}^2\big)|H_{\mathrm{LPF}}(s)|^2 + i_{n0}^2\left|\frac{g_{m1}}{g_{m2}}H_{\mathrm{LPF}}(s) + H_{\mathrm{BPF}}(s)\right|^2\right] \qquad (4)

where i_n,M0, i_n,M1, i_n0, and i_n1 are the noise current sources of M0, M1, I_0, and I_1, respectively, and H_LPF(s) in (4) is

H_{\mathrm{LPF}}(s) = \frac{1}{s^2\frac{C_1C_2}{g_{m1}g_{m2}} + s\frac{C_1}{g_{m2}} + 1}. \qquad (5)

Fig. 12. (a) Single-ended 2-T current rectifier. (b) Direct cross-coupled current rectifier. (c) AC-coupled cross-coupled current rectifier.

The simulated output noise spectral density of channel 15, with an f_0 around 5 kHz, shown in Fig. 11, verifies the transfer function of each noise source. Overall, v_n,out has a low-pass characteristic, but the most significant in-band noise is contributed by M0 and I_0. The output-referred noise (ORN) at v_obuf integrated from 10 Hz to 500 kHz is simulated to be about 111 μVrms. The periodic steady-state simulations show that the output signal swing is about 89 mVrms for a 5% total harmonic distortion (THD), and the DR is computed to be 58 dB. The DR of the BPFs across all channels is larger than 55 dB in simulation.

D. Full-Wave Rectifier and Integrate-and-Fire Event Encoder

In contrast to rectifiers widely used in energy harvesting, precision rectifiers aim to precisely realize the mathematical function y = |x| (x, y ∈ ℝ). For inference sensing applications, the exact mathematical rule of the signal transformation is unimportant provided that the relevant information is not lost, because a machine learning model like the MLP can learn and adapt to any deterministic mapping. But in our model-based design, for the convenience of model construction and AFE variation control, we chose to use a precision rectifier.

The MOSFET-only precision rectifiers can be categorized into the voltage mode and the current mode depending on whether the signal being rectified is a voltage or a current. The voltage mode uses simple comparators and switches [49], [50], but is not suitable here because the input signal can be much smaller than the comparator's input offset, which can be as large as tens of millivolts without offset calibration. Besides, a voltage has to be converted to a current for easy implementation of the integration in (1), motivating direct rectification in the current domain.
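A quick numeric illustration of (3), as reconstructed above, is shown below. The bias currents, slope factor, and capacitances are illustrative placeholders rather than design values; they are chosen only to show how a moderate Q and the 18-dB maximum BPF gain can arise from current and capacitance ratios, using the subthreshold approximation g_m ≈ κI/U_T.

```python
import numpy as np

UT, kappa = 0.0259, 0.7          # thermal voltage, slope factor (illustrative)
I0, I1 = 1.0e-9, 1.5e-9          # bias currents (placeholders, not design values)
C1, C2 = 1e-12, 8e-12            # capacitances (placeholders)

gm1 = kappa * I0 / UT            # gm of M0, proportional to I0 in subthreshold
gm2 = kappa * (I1 - I0) / UT     # gm of M1, proportional to I1 - I0

f0 = np.sqrt(gm1 * gm2 / (C1 * C2)) / (2 * np.pi)
Q  = np.sqrt(gm2 * C2 / (gm1 * C1))
A0 = C2 / C1
print(f"f0 = {f0:.0f} Hz, Q = {Q:.1f}, A0 = {A0:.1f} ({20*np.log10(A0):.1f} dB)")
# -> roughly f0 = 1075 Hz, Q = 2.0, A0 = 8.0 (18.1 dB)
```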


Fig. 13. Schematic of (a) FWR and IAF and (b) pseudo-resistor Rfb that is composed of eight diode-connected pFETs connected in series.

One simple 2-T current-mode topology is shown in Fig. 12(a) [51], [52]. It has several issues in robust low-voltage design.

First, the input current i_in is normally realized by an operational transconductance amplifier (OTA). For a 1000-fold change of i_in, the source voltages of Mn and Mp in subthreshold, i.e., the OTA output, have a voltage swing of 178 mVpk at room temperature, pushing the transistors in the OTA output stage to the verge of the linear region with a 0.6-V supply. This problem can be circumvented by the cross-coupling topology shown in Fig. 12(b). The swing is reduced to 99 mV with a subthreshold slope factor κ = 0.8 [53]. Second, the parasitic capacitance at node v causes delay and hence distortion in the current transmission because it takes time to charge v and turn on Mn/Mp [54]. This dead-zone problem can be mitigated by using a feedback amplifier to reduce the swing of v [55]. However, the feedback requires an amplifier with high GBW, leading to high power consumption. A low-power solution is to weakly bias Mn and Mp in dc. As shown in Fig. 12(c), the gates of Mn and Mp are dc-biased and the differential transient signals are ac-coupled. Third, as a voltage-controlled current source, the OTA is normally in open loop, because any feedback can compromise the output impedance, but the output dc offset current can significantly reduce the DR of the rectifier. Dedicated circuits for automatic offset current calibration come at the cost of extra power and area [54]. As shown in the complete schematic of the FWR and IAF in Fig. 13(a), here we employ simple dc feedback through the pseudo-resistor Rfb, which has a much larger impedance than the OTA output impedance. The dc offset current is thus eliminated.

The input dc of the OTA is set to half Vmid to have sufficient voltage headroom for its input stage. Rfb is composed of eight diode-connected pFETs in series, as shown in Fig. 13(b). Racrec is the same as Rcm in Fig. 7. The dc bias voltages Vbiasn and Vbiasp are generated by the scaled replica of the cross-coupled rectifier using transistors M4 and M5, whose source voltages are set to Vmid, the same as the OTA output common-mode voltage. The simulation results in Fig. 14 show the transient waveform of i_rec with a 20-mVpk sinusoidal input to the OTA at the frequencies of 100 Hz and 5 kHz.

Fig. 14. Simulated i_rec using the direct and ac-coupled cross-coupled rectifier with a sinusoidal input at 100 Hz (black) and 5 kHz (red).

It is obvious that the dead-zone distortion is alleviated using the ac-coupled topology compared to the direct cross-coupled one, especially in the 5-kHz case. Small-signal ac simulation shows that the transconductance of the OTA is about 103 nS, which is in agreement with the 2-nA peak current in the ac-coupled case. The cause of the much decreased peak current in the direct cross-coupled case is the much smaller quiescent leakage current. Because the ratio of the maximum to the minimum i_rec becomes larger, the output swing of the OTA increases and gets clipped under a 0.6-V supply.

The dead-zone problem is further mitigated by reducing the parasitic capacitance at v_out, i.e., minimizing the size of M0–M3 in Fig. 13 and the width of the transistors at the output of the OTA [54]. However, a small size is in conflict with the requirement of a small output conductance and small output voltage and current offsets caused by mismatch. A cascode is used to solve this dilemma. In particular, the common-source transistors M8, M9, M14, and M15 in Fig. 15 have a large size to guarantee small current mismatch, and the cascode transistors M10–M13 have a small size to minimize the parasitic capacitance. M14 and M15 are guaranteed to be at the verge of saturation over PVT because of the biasing scheme of M12 and M13.

The IAF shown in Fig. 13 integrates i_rec on the capacitor Cint. Whenever v_int crosses the threshold Vrefdn, M6 is turned on and v_int is reset to ground. After the delay of the continuous-time comparator, M6 is turned off, and the integration starts again. This IAF functions as an area-efficient and power-efficient ED-ADC for approximate quantization. Non-idealities like the finite bandwidth of the comparator and the input-slope-dependent delay are not detrimental; they are included in the front-end model, and the MLP can adapt accordingly through training.
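The 178-mV and 99-mV swing figures quoted above are consistent with a simple subthreshold estimate, assuming U_T ≈ 25.9 mV, a source-voltage sensitivity of U_T per e-fold of current in the 2-T rectifier, and a (1 + κ)-fold reduction when the cross-coupled topology drives the gate and source differentially:

```python
import numpy as np

UT, kappa = 0.0259, 0.8
swing_2t = UT * np.log(1000)        # single-ended 2-T rectifier: ≈ 179 mV
swing_xc = swing_2t / (1 + kappa)   # cross-coupled drive:        ≈  99 mV
print(round(swing_2t * 1e3, 1), "mV,", round(swing_xc * 1e3, 1), "mV")
```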


Fig. 15. Schematic of OTA in FWR.

E. Binarized Multilayer Perceptron

Fig. 16. Block diagram of the binarized MLP.

The block diagram of the digital binarized MLP is illustrated in Fig. 16. Ripple counters are used as the interface between the asynchronous feature extraction front-end and the synchronous inference computing. Nine-bit counters, one more bit than the specified 8 bits, are used in case of overflow. Three counters are used for each channel. Each counter counts events in a 25-ms window, and the three counters work in a cyclic way with a 10-ms shift. The features of one frame are stored in a 16×9 block of the data memory DMEM. Because three frames are needed for one classification, as discussed in Section II, the features, e.g., in block 0, block 3, and block 6, are used to classify the frame with its features stored in block 3. Old features are overwritten by the new ones, e.g., the features of the 8th frame would be written into block 0, and so on. The 1-b weights stored in the weight memory WMEM are obtained from off-line training, and they are loaded through a scan chain when the chip is powered up, before normal operation. Both DMEM and WMEM use a latch-based design to work robustly under a low supply voltage. The addresses for the memories and the control signals for the counters and the computing engine are generated by the controller block.

The 100-classification/s throughput requires that all add and accumulate operations finish within 10 ms, so a clock of 500 kHz is sufficient for using one single accumulator. Given the fact that only the 1st hidden layer has 9-bit inputs, and the other layers have 1-bit inputs, part of the register is clock-gated for power saving. The full register and a 15-bit adder are used for the 1st hidden layer computation, indicated by l1, and 7 bits of the full register and a 7-bit adder are used for the subsequent layers. The input operand of the accumulator is selected via a mux according to the weight value w. For the 1st hidden layer, the input feature d9 is selected if w = 1, and otherwise its complement is selected. For the rest of the hidden layers, 0, +1, or −1 is selected depending on both w and the activation of the previous layer. The activation function hard sigmoid HS(•) is simply the inversion of the sign bit of the register after completing the accumulation of each neuron, and the 1-bit activation values are stored in an 84-bit register file. HS(•) is not applied to the output layer. Instead, the difference of the accumulated values of the two output neurons is computed, controlled by lo2, and the sign bit is the class, 1 for speech and 0 for non-speech.

IV. EXPERIMENTAL RESULTS

The chip was fabricated in 1P6M 0.18-μm CMOS. The microphotograph of the die is shown in Fig. 17 with the building blocks labeled. The core area is 1.66×1.52 mm². The simulated power breakdown of the front-end at 0.6 V is shown in Fig. 18.
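A behavioral sketch of the inference data path described above is given below: integer accumulation with {−1, +1} weights, the hard sigmoid taken as the inverted sign bit, and the class obtained from the sign of the difference of the two output accumulations. It is a software equivalent written by us, not the RTL; layer shapes follow the 48/60/24/11/2 topology from Section II, and the on-chip complement/mux selection is replaced by plain signed arithmetic.

```python
import numpy as np

def hs(acc):
    # hard sigmoid as realized on chip: the inverted sign bit of the accumulator
    return (np.asarray(acc) >= 0).astype(np.int32)

def bnn_infer(d9, weights):
    """d9: 48 event numbers (9-bit counter values); weights: list of {-1,+1}
    matrices of shapes (60, 48), (24, 60), (11, 24), (2, 11)."""
    a = hs(weights[0] @ d9)          # 1st hidden layer: multi-bit inputs
    for w in weights[1:-1]:          # remaining hidden layers
        a = hs(w @ (2 * a - 1))      # map 1-bit activations to {-1, +1}
    o = weights[-1] @ (2 * a - 1)    # two output accumulations (no HS applied)
    return int(o[0] - o[1] >= 0)     # sign of the difference: 1 = speech

# example with random weights, for illustration only
shapes = [(60, 48), (24, 60), (11, 24), (2, 11)]
weights = [np.random.choice([-1, 1], size=s) for s in shapes]
print(bnn_infer(np.random.randint(0, 256, size=48), weights))
```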


Fig. 17. Die microphotograph with building blocks and dimensions labeled.

Fig. 18. Simulated power breakdown of the feature extraction front-end.

A significant amount of current is consumed by the OTA in the FWR for sufficient current DR and for alleviating the dead-zone phenomenon. More power saving could be achieved by scaling the OTA current in low-frequency channels, as is done in the BPF bank. We did not choose to pursue this because of the extra complexity in the sizing of the OTA transistors and the scaling of Cint in the IAF. Using a Keithley 2401, the measured front-end power is about 380 nW, close to the simulated 373 nW. The measured MLP power at 0.55 V is about 630 nW. The power consumption of the entire VAD system is 1.01 μW.

A. Feature Extraction Front-End

The measured power consumption of the LNA is about 87 nW under a 0.6-V supply. Using the dynamic signal analyzer SR780, the measured transfer functions and IRN spectral density of the LNA at four different gain settings are shown in Fig. 19. The increased noise floor as the gain decreases can be attributed to the decrease of the input capacitance Cin shown in Fig. 5 [39]. The harmonic distortion measured with a 700-Hz fundamental tone at the highest LNA gain is shown in Fig. 20 for a 1% THD. The calculated IRN within the frequency range of 50 Hz–12.8 kHz, the NEF [56], the power efficiency factor (PEF = NEF² × VDD), and the DR at 1% THD are listed in Table I. The measured common-mode rejection ratio (CMRR) and power supply rejection ratio (PSRR) at the highest LNA gain are shown in Fig. 21.

Fig. 19. Measured transfer functions (top) and input-referred noise spectral density (bottom) of the LNA in four different gain settings.

Fig. 20. Harmonic distortion of the LNA at gain code 11, measured with a 700-Hz fundamental tone.

Fig. 21. Measured CMRR and PSRR at the highest gain of the LNA.

TABLE I. NOISE AND HARMONIC DISTORTION PERFORMANCE OF LNA.

The frequency response of the whole feature extraction front-end, from the input of the LNA to the output of the IAF, is shown in Fig. 22. The differential input amplitude is 0.4 mVpk, and the gain code of the LNA is set to 01. The feature is the EN per frame, and as mentioned in Section II, each frame has a duration of 25 ms. It is clear that the features across the 16 channels exhibit the desired bandpass characteristic. The rate of leakage events caused by the leakage dc component of i_rec shown in Fig. 14(a) can be derived from the saturated EN per frame at high frequencies beyond 10 kHz. The highest leakage event rate goes up to about 148 event/s, and the lowest leakage event rate is about 70 event/s. Given a nominal 0.9-pF Cint in the IAF, the corresponding leakage currents span from 9.5 to 20 pA, consistent with the designed values. The highest event rate can go up above 10 kevent/s, giving a DR at the output of the IAF of around 40 dB.
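The conversion between the measured leakage event rates and leakage currents follows directly from (1), I_leak = rate · Cint · Vrefdn, with the nominal Cint and Vrefdn quoted in the text; the ~40-dB figure follows from the ratio of the >10-kevent/s peak rate to the ~100-event/s leakage floor:

```python
import numpy as np

c_int, v_refdn = 0.9e-12, 0.15
for rate in (70.0, 148.0):                       # measured leakage extremes, event/s
    print(f"{rate:5.0f} event/s -> {rate * c_int * v_refdn * 1e12:4.1f} pA")
# ~9.5 pA and ~20.0 pA; with a >10-kevent/s peak event rate,
# the IAF output DR is roughly 20*log10(10e3 / 100) ≈ 40 dB
print(round(20 * np.log10(10e3 / 100), 1), "dB")
```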


Fig. 22. Measured frequency response of the feature extraction front-end.

B. Voice Activity Detection System

A 1-hour-long test audio, the same noisy speech used in the neural network inference in software simulation, is supplied to the LNA input through a PCIe-6323, generated using LabVIEW. The audio rms amplitude is about 200 μV, and the LNA gain code is set to 10. Two cases are tested: the 10-dB SNR noisy speech in a restaurant noise scenario, and the 5-dB SNR noisy speech in a metro noise scenario. In each case, the same weights are used for the on-chip MLP for all ten test chips, eliminating the costly chip-wise feature measurement and classifier training, in contrast to [23], [24].

Fig. 23. Measured speech/non-speech hit rate using the 10-dB noisy speech in a restaurant noise scenario, and 5-dB noisy speech in a metro noise scenario. The same weights trained offline are used for the MLP in each case.

The measured speech and non-speech hit rates of the ten test chips and the simulated hit rate are shown in Fig. 23. The method of calculating the speech and non-speech hit rates follows the definitions of true positive and true negative in two-class classification [57]. The mean speech and non-speech hit rates, μS and μNS, and the corresponding 1-σ variations, σS and σNS, are given in the plot. It is interesting to see that the measurement points show a tradeoff between the speech and non-speech hit rates somewhat similar to the points on a receiver operating characteristic (ROC) curve [57]. Overall, the software simulation result using the customized feature extraction front-end software model is a good prediction of the measured system classification performance.

Fig. 24. (a) Clean speech sample and label sequence. (b) Corresponding metro 5-dB noisy speech sample and measured classification sequence.

To visualize the real-time classification process, fractions of the temporal classification sequence are shown in Fig. 24 together with the metro 5-dB noisy speech waveform. The clean speech and label sequence are also shown. Compared to the label, the measured classification sequence exhibits sporadic misclassifications. In retrospect, instead of directly comparing the activations of the output neurons, adding a posterior handling block with sliding-window smoothing of the activations before the decision threshold may reduce this type of sporadic error and hence improve the overall classification performance [58].
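The hit rates reported above are the true-positive and true-negative rates of the two-class problem [57]. A small sketch of computing them from per-frame label and decision sequences is shown below, together with the kind of sliding-window posterior smoothing suggested in the text; the array and function names are ours, and the window length is an arbitrary placeholder.

```python
import numpy as np

def hit_rates(labels, decisions):
    """labels, decisions: per-frame arrays of 1 (speech) / 0 (non-speech)."""
    labels, decisions = np.asarray(labels), np.asarray(decisions)
    speech = labels == 1
    hr_speech     = np.mean(decisions[speech] == 1)    # true-positive rate
    hr_non_speech = np.mean(decisions[~speech] == 0)   # true-negative rate
    return hr_speech, hr_non_speech

def smooth(activations, win=5):
    """Sliding-window average of output activations before thresholding."""
    return np.convolve(activations, np.ones(win) / win, mode="same")
```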


TABLE II. COMPARISON OF AFE.

TABLE III. COMPARISON OF AUDIO CLASSIFICATION SYSTEMS.

V. CONCLUSION AND DISCUSSION

The design of a 1-μW voice activity detector using analog feature extraction and digital classification is elaborated in this paper. The low power consumption is achieved by leveraging the power/energy efficiency of ASP and the BNN. As shown in Table II, compared to the best prior work on analog feature extraction front-ends, this work is more than 9× more power efficient. Compared with the most power-efficient digital implementation of a BPF based on the charge-recovery design [59], this work, including other building blocks besides the BPF, shows 5× higher power efficiency when the frequency range is equalized.

Table III shows the comparison of this VAD with other audio classification systems. While our design consumes the lowest power among VADs, it is fair to also consider the classification rate. The 32-nm VAD [1] has a slightly better energy efficiency in terms of class/W/s. However, it may not be the best choice for minimized power consumption in low-throughput real-time intelligent sensing applications, because the energy efficiency decreases quickly with the power supply voltage below 0.65 V due to leakage, as indicated in the measurement, even though the design already uses ULP transistors with extremely low leakage and ultra-high Vth [1]. Techniques like transistor sizing and power gating can further suppress the leakage current, but the energy efficiency can hardly get better than the optimized peak value at 0.65 V. The system in [60] for acoustic object detection consumes only 12 nW, but the system bandwidth is less than 500 Hz, and the classification rate is too low for VAD. The systems in [61] and [62] based on spiking NNs are not very energy efficient; the reason is that IBM's TrueNorth is more of a general-purpose platform that can support various NN algorithms. The digital VAD in [2] is used to power-gate an automatic speech recognizer. It actually implemented three different algorithms: energy-based (EB, 8.5 μW), harmonicity (HM, 24.4 μW), and modulated frequency with NN (MF + NN, 22.3 μW). It is reported that MF + NN gives the best performance. With no fast Fourier transform (FFT), the EB algorithm consumes 3× less power than the other two FFT-based ones, and HM consumes more power than MF + NN because it takes two short-term FFTs per frame even though no NN computation is required. This indicates the power-hungry nature of the conventional digital FFT approach for frequency analysis.

With the help of the customized feature extraction front-end model during the design time, we are able to use the same classifier parameters for all test chips without costly chip-wise feature measurement and classifier training, and a relatively small spread in the speech and non-speech hit rates across dies is achieved, despite the fact that, for high power efficiency, the front-end is designed in deep-subthreshold analog, which is much more prone to PVT variations compared to the digital counterpart. One open question is: is it possible to further reduce the classification spread across dies by leveraging the generalization capability of deep learning models without increasing the area of the analog circuits for feature extraction? As an initial attempt, we performed software emulation of the VAD system with consideration of AFE parameter variation during BNN training. In other words, we slightly modified the flow shown in Fig. 2 so that the features for classifier training come from both "model with fit parameter" and "model with parameter variation." Features generated from different sets of "model with parameter variation" are used for inference. Despite the 20× augmentation of the training datasets, the relative reduction of the hit-rate variation is marginal: less than 10% across several runs. Further studies need to be done to explore the limit of this scheme for variation reduction.


ACKNOWLEDGMENT

The authors would like to thank N. Mesgarani, Y. Tsividis, M. Verhelst, and X.-L. Zhang for the valuable discussion and help.

REFERENCES

[1] A. Raychowdhury, C. Tokunaga, W. Beltman, M. Deisher, J. W. Tschanz, and V. De, "A 2.3 nJ/frame voice activity detector-based audio front-end for context-aware system-on-chip applications in 32-nm CMOS," IEEE J. Solid-State Circuits, vol. 48, no. 8, pp. 1963–1969, Aug. 2013.
[2] M. Price, J. Glass, and A. P. Chandrakasan, "A scalable speech recognizer with deep-neural-network acoustic models and voice-activated power gating," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 244–245.
[3] R. Sarpeshkar, "Analog versus digital: Extrapolating from electronics to neurobiology," Neural Comput., vol. 10, no. 7, pp. 1601–1638, 1998.
[4] L. T. Lin, H.-F. Tseng, D. B. Cox, S. S. Viglione, D. P. Conrad, and R. G. Runge, "A monolithic audio spectrum analyzer," IEEE J. Solid-State Circuits, vol. JSSC-18, no. 1, pp. 40–45, Feb. 1983.
[5] N. C. Bui, J. J. Monbaron, and J. G. Michel, "An integrated voice recognition system," IEEE J. Solid-State Circuits, vol. JSSC-18, no. 1, pp. 75–81, Feb. 1983.
[6] Y. Kuraishi, K. Nakayama, K. Miyadera, and T. Okamura, "A single-chip 20-channel speech spectrum analyzer using a multiplexed switched-capacitor filter bank," IEEE J. Solid-State Circuits, vol. JSSC-19, no. 6, pp. 964–970, Dec. 1984.
[7] J. S. Chang and Y. C. Tong, "A micropower-compatible time-multiplexed SC speech spectrum analyzer design," IEEE J. Solid-State Circuits, vol. 28, no. 1, pp. 40–48, Jan. 1993.
[8] R. F. Lyon and C. Mead, "An analog electronic cochlea," IEEE Trans. Acoust., Speech Signal Process., vol. ASSP-36, no. 7, pp. 1119–1134, Jul. 1988.
[9] L. Watts, D. A. Kerns, R. F. Lyon, and C. A. Mead, "Improved implementation of the silicon cochlea," IEEE J. Solid-State Circuits, vol. 27, no. 5, pp. 692–700, May 1992.
[10] R. Sarpeshkar, M. W. Baker, C. D. Salthouse, J. J. Sit, L. Turicchia, and S. M. Zhak, "An analog bionic ear processor with zero-crossing detection," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2005, pp. 78–79.
[11] E. Fragniere, "A 100-channel analog CMOS auditory filter bank for speech recognition," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2005, pp. 140–141.
[12] V. Chan, S.-C. Liu, and A. van Schaik, "AER EAR: A matched silicon cochlea pair with address event representation interface," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 1, pp. 48–59, Jan. 2007.
[13] T. J. Hamilton, C. Jin, A. van Schaik, and J. Tapson, "An active 2-D silicon cochlea," IEEE Trans. Biomed. Circuits Syst., vol. 2, no. 1, pp. 30–43, Mar. 2008.
[14] A. G. Katsiamis, E. M. Drakakis, and R. F. Lyon, "A biomimetic, 4.5 μW, 120+ dB, log-domain cochlea channel with AGC," IEEE J. Solid-State Circuits, vol. 44, no. 3, pp. 1006–1022, Mar. 2009.
[15] B. Wen and K. Boahen, "A silicon cochlea with active coupling," IEEE Trans. Biomed. Circuits Syst., vol. 3, no. 6, pp. 444–455, Dec. 2009.
[16] S.-C. Liu, A. van Schaik, B. A. Minch, and T. Delbruck, "Asynchronous binaural spatial audition sensor with 2×64×4 channel output," IEEE Trans. Biomed. Circuits Syst., vol. 8, no. 4, pp. 453–464, Aug. 2014.
[17] G. Yang, R. F. Lyon, and E. M. Drakakis, "A 6 μW per channel analog biomimetic cochlear implant processor filterbank architecture with across channels AGC," IEEE Trans. Biomed. Circuits Syst., vol. 9, no. 1, pp. 72–86, Feb. 2015.
[18] S. Wang, T. J. Koickal, A. Hamilton, R. Cheung, and L. S. Smith, "A bio-realistic analog CMOS cochlea filter with high tunability and ultra-steep roll-off," IEEE Trans. Biomed. Circuits Syst., vol. 9, no. 3, pp. 297–311, Jun. 2015.
[19] M. Yang, C. H. Chien, T. Delbruck, and S. C. Liu, "A 0.5 V 55 μW 64×2 channel binaural silicon cochlea for event-driven stereo-audio sensing," IEEE J. Solid-State Circuits, vol. 51, no. 11, pp. 2554–2569, Nov. 2016.
[20] P. R. Kinget, "Device mismatch and tradeoffs in the design of analog circuits," IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212–1224, Jun. 2005.
[21] B. Murmann, "Digitally assisted analog circuits," IEEE Micro, vol. 26, no. 2, pp. 38–47, Mar. 2006.
[22] Y. Chiu, "Equalization techniques for nonlinear analog circuits," IEEE Commun. Mag., vol. 49, no. 4, pp. 132–139, Apr. 2011.
[23] J. Zhang, L. Huang, Z. Wang, and N. Verma, "A seizure-detection IC employing machine learning to overcome data-conversion and analog-processing non-idealities," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Sep. 2015, pp. 1–4.
[24] K. M. H. Badami, S. Lauwereins, W. Meert, and M. Verhelst, "A 90 nm CMOS, 6 μW power-proportional acoustic sensing frontend for voice activity detection," IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 291–302, Jan. 2016.
[25] X.-L. Zhang and D. Wang, "Boosting contextual information for deep neural network based voice activity detection," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 2, pp. 252–264, Feb. 2016.
[26] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 29, 2016, pp. 4107–4115.
[27] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1 b-to-16 b fully-variable weight bit-precision," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2018, pp. 218–220.
[28] Y.-J. Chen, C.-W. Wei, Y. FanChiang, Y.-L. Meng, Y.-C. Huang, and S.-J. Jou, "Neuromorphic pitch based noise reduction for monosyllable hearing aid system application," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 2, pp. 463–475, Feb. 2014.
[29] N. Parihar, J. Picone, D. Pearce, and H. G. Hirsch, "Performance analysis of the Aurora large vocabulary baseline system," in Proc. Eur. Signal Process. Conf., 2004, pp. 553–556.
[30] J. Thiemann, N. Ito, and E. Vincent, "The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings," J. Acoust. Soc. Amer., vol. 133, no. 5, p. 3591, May 2013.
[31] E. Vittoz and J. Fellrath, "CMOS analog integrated circuits based on weak inversion operations," IEEE J. Solid-State Circuits, vol. 12, no. 3, pp. 224–231, Jun. 1977.
[32] Y.-P. Chen et al., "An injectable 64 nW ECG mixed-signal SoC in 65 nm for arrhythmia monitoring," IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 375–390, Jan. 2015.
[33] P. Harpe, H. Gao, R. V. Dommele, E. Cantatore, and A. H. M. van Roermund, "A 0.20 mm² 3 nW signal acquisition IC for miniature sensor nodes in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 240–248, Jan. 2016.
[34] S. Rai, J. Holleman, J. N. Pandey, F. Zhang, and B. Otis, "A 500 μW neural tag with 2 μVrms AFE and frequency-multiplying MICS/ISM FSK transmitter," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2009, pp. 212–213.
[35] X. Zou, W. S. Liew, L. Yao, and Y. Lian, "A 1 V 22 μW 32-channel implantable EEG recording IC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2010, pp. 126–127.


[36] D. Han, Y. Zheng, R. Rajkumar, G. Dawe, and M. Je, "A 0.45 V 100-channel neural-recording IC with sub-μW/channel consumption in 0.18 μm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2013, pp. 290–291.
[37] B. Sporrer et al., "A fully integrated dual-channel on-coil CMOS receiver for array coils in 1.5–10.5 T MRI," IEEE Trans. Biomed. Circuits Syst., vol. 11, no. 6, pp. 1245–1255, Dec. 2017.
[38] B. Razavi, Design of Analog CMOS Integrated Circuits, 1st ed. Boston, MA, USA: McGraw-Hill, 2000.
[39] R. R. Harrison, "The design of integrated circuits to observe brain activity," Proc. IEEE, vol. 96, no. 7, pp. 1203–1216, Jul. 2008.
[40] V. Saxena and R. J. Baker, "Compensation of CMOS op-amps using split-length transistors," in Proc. IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS), 2008, pp. 109–112.
[41] M. Taherzadeh-Sani and A. A. Hamoui, "A 1-V process-insensitive current-scalable two-stage opamp with enhanced DC gain and settling behavior in 65-nm digital CMOS," IEEE J. Solid-State Circuits, vol. 46, no. 3, pp. 660–668, Mar. 2011.
[42] A. Vasilopoulos, G. Vitzilaios, G. Theodoratos, and Y. Papananos, "A low-power wideband reconfigurable integrated active-RC filter with 73 dB SFDR," IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 1997–2008, Sep. 2006.
[43] M. Abdulaziz, M. Törmänen, and H. Sjöland, "A compensation technique for two-stage differential OTAs," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 8, pp. 594–598, Aug. 2014.
[44] M. D. Matteis, A. Pezzotta, S. D'Amico, and A. Baschirotto, "A 33 MHz 70 dB-SNR super-source-follower-based low-pass analog filter," IEEE J. Solid-State Circuits, vol. 50, no. 7, pp. 1516–1524, Jul. 2015.
[45] Y. Xu, S. Leuenberger, P. K. Venkatachala, and U.-K. Moon, "A 0.6 mW 31 MHz 4th-order low-pass filter with +29 dBm IIP3 using self-coupled source follower based biquads in 0.18 μm CMOS," in Proc. IEEE Symp. VLSI Circuits, Jun. 2016, pp. 132–133.
[46] M. De Matteis and A. Baschirotto, "A biquadratic cell based on the flipped-source-follower circuit," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 8, pp. 867–871, Aug. 2017.
[47] A. Thanachayanont, "Low-voltage low-power high-Q CMOS RF bandpass filter," Electron. Lett., vol. 38, no. 13, pp. 615–616, Jun. 2002.
[48] Z. Gao, J. Ma, M. Yu, and Y. Ye, "A fully integrated CMOS active bandpass filter for multiband RF front-ends," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 8, pp. 718–722, Aug. 2008.
[49] M. S. Hosny and J. Hanson, "A wide-band, high-precision CMOS rectifier," Analog Integr. Circuits Signal Process., vol. 5, no. 2, pp. 183–190, Mar. 1994.
[50] E. Rodriguez-Villegas, P. Corbishley, C. Lujan-Martinez, and T. Sanchez-Rodriguez, "An ultra-low-power precision rectifier for biomedical sensors interfacing," Sens. Actuators A, Phys., vol. 153, no. 2, pp. 222–229, Aug. 2009.
[51] Z. Wang, "Novel pseudo RMS current converter for sinusoidal signals using a CMOS precision current rectifier," IEEE Trans. Instrum. Meas., vol. 39, no. 4, pp. 670–671, Aug. 1990.
[52] Z. Wang, "Full-wave precision rectification that is performed in current domain and very suitable for CMOS implementation," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 39, no. 6, pp. 456–462, Jun. 1992.
[53] S.-C. Liu, J. Kramer, G. Indiveri, T. Delbruck, and R. Douglas, Analog VLSI: Circuits and Principles, 1st ed. Cambridge, MA, USA: A Bradford Book, 2002.
[54] M. S. J. Steyaert, W. Dehaene, J. Craninckx, M. Walsh, and P. Real, "A CMOS rectifier-integrator for amplitude detection in hard disk servo loops," IEEE J. Solid-State Circuits, vol. 30, no. 7, pp. 743–751, Jul. 1995.
[55] S. M. Zhak, M. W. Baker, and R. Sarpeshkar, "A low-power wide dynamic range envelope detector," IEEE J. Solid-State Circuits, vol. 38, no. 10, pp. 1750–1753, Oct. 2003.
[56] M. S. J. Steyaert and W. M. C. Sansen, "A micropower low-noise monolithic instrumentation amplifier for medical purposes," IEEE J. Solid-State Circuits, vol. JSSC-22, no. 6, pp. 1163–1168, Dec. 1987.
[57] T. Fawcett, "An introduction to ROC analysis," Pattern Recognit. Lett., vol. 27, no. 8, pp. 861–874, Jun. 2006.
[58] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., May 2014, pp. 4087–4091.
[59] H. S. Wu, Z. Zhang, and M. C. Papaefthymiou, "A 13.8 μW binaural dual-microphone digital ANSI S1.11 filter bank for hearing aids with zero-short-circuit-current logic in 65 nm CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 348–349.
[60] S. Jeong et al., "A 12 nW always-on acoustic sensing and object recognition microsystem using frequency-domain feature extraction and SVM classification," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 362–363.
[61] S. K. Esser et al., "Convolutional networks for fast, energy-efficient neuromorphic computing," Proc. Nat. Acad. Sci. USA, vol. 113, pp. 11441–11446, Aug. 2016.
[62] W.-Y. Tsai et al., "Always-on speech recognition using TrueNorth, a reconfigurable, neurosynaptic processor," IEEE Trans. Comput., vol. 66, no. 6, pp. 996–1007, Jun. 2017.

Minhao Yang (S'11–M'16) received the Ph.D. degree in physics from ETH Zurich, Zürich, Switzerland, in 2015.
He was a Post-Doctoral Researcher with Columbia University, New York, NY, USA. He is currently a Collaborateur Scientifique with EPFL, Lausanne, Switzerland. His post-doctoral research was partly supported by the Early Postdoc Mobility Fellowship from the Swiss National Science Foundation. His research interests include ultra-low-power (ULP) inference sensing systems, event-driven sensors like the spiking silicon retina and cochlea, and spike coding and processing.

Chung-Heng Yeh received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2010, and the M.S. degree in electrical engineering from Columbia University, New York, NY, USA, in 2013, where he is currently pursuing the Ph.D. degree in electrical engineering.
His current research interests include information processing in the spike domain, large-scale neural system emulation, and neural-inspired algorithms.

Yiyin Zhou received the B.E. degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2007, and the M.S. and Ph.D. degrees in electrical engineering from Columbia University, New York, NY, USA, in 2009 and 2015, respectively.
He is currently a Post-Doctoral Research Scientist with the Department of Electrical Engineering, Columbia University. He is an active team member of the Fruit Fly Brain Observatory Team, Columbia University, and has co-organized three Fruit Fly Brain Hackathons. He is also interested in massively parallel neural computation on high-performance computing devices. His research interests include formal methods for spike-time-based representation of sensory information and system identification, and the logic of information processing of the fruit fly brain on both the neuroinformation processing and neural circuit levels.
Dr. Zhou received the Jury Award from the Department of Electrical Engineering, Columbia University, in 2016, for outstanding achievement by a graduate student in the area of systems, communications, and signal processing.


Joao P. Cerqueira (S'17) received the B.S. degree (Hons.) in electrical engineering from the University of Brasília, Brasília, Brazil, in 2014, and the M.S. degree in electrical engineering from Columbia University, New York, NY, USA, in 2016, where he is currently pursuing the Ph.D. degree in electrical engineering.
His current research interests include energy-efficient integrated circuits and computer architecture.
Mr. Cerqueira received the Science Without Borders Fellowship from CAPES, the Lemann Foundation Fellowship, and the Qualcomm Innovation Fellowship.

Aurel A. Lazar (S'77–M'80–SM'90–F'93–LF'16) was a Principal Investigator leading a number of computer networking research groups in the Department of Electrical Engineering, Columbia University, New York, NY, USA, for 20 years. He covered a broad set of research topics/fields, including building major switching hardware, architecting broadband kernels for programmable networks, and creating game-theory models for resource allocation. He also ran a networking start-up as a CEO. Some 15 years ago, predicting that Moore's law would soon reach its limits, he switched his field of research to computational neuroscience in search of principles for building cognitive machines, with capabilities well beyond those powered by von Neumann architectures. He currently leads research projects in computing with fruit fly brain circuits, in building interactive computing tools for the Fruit Fly Brain Observatory, and on creating neuroinformation processing machines. His research has drawn support from a number of funding agencies including the AFOSR, NIH, and NSF.
Dr. Lazar received the 2003 IFIP/IEEE Dan Stokesberry Memorial Award in recognition of "the most distinguished technical contributions to the growth and understanding of the field of network management."

Mingoo Seok (S'05–M'11–SM'18) received the B.S. degree (summa cum laude) in electrical engineering from Seoul National University, Seoul, South Korea, in 2005, and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2007 and 2011, respectively.
In 2011, he joined Texas Instruments Inc., Dallas, TX, USA, as a Technical Staff Member. Since 2012, he has been with the Department of Electrical Engineering, Columbia University, New York, NY, USA, where he is currently an Associate Professor. His current research interests include variation-, voltage-, aging-, and thermal-adaptive circuits and architecture, ultra-low-power (ULP) SoC design for emerging embedded systems, machine-learning VLSI architecture and circuits, and nonconventional hardware design.
Dr. Seok received the 1999 Distinguished Undergraduate Scholarship from the Korea Foundation for Advanced Studies, the 2005 Doctoral Fellowship from the Korea Foundation for Advanced Studies, the 2008 Rackham Pre-Doctoral Fellowship from the University of Michigan, the 2009 AMD/CICC Scholarship Award for picowatt voltage reference work, the 2009 DAC/ISSCC Design Contest Award for the 35-pW sensor platform design, and the 2015 NSF CAREER Award. He has served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I from 2013 to 2015, for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS since 2015, and for the IEEE SOLID-STATE CIRCUITS LETTERS since 2017.
