# A 48.6-to-105.2 µW Machine Learning Assisted Cardiac Sensor SoC for Mobile Healthcare Applications

Shu-Yu Hsu, Student Member, IEEE, Yingchieh Ho, Member, IEEE, Po-Yao Chang, Chauchin Su, Member, IEEE, and Chen-Yi Lee, Member, IEEE

Abstract—A machine-learning (ML) assisted cardiac sensor SoC (CS-SoC) is designed for mobile healthcare applications. The heterogeneous architecture realizes the cardiac signal acquisition, filtering with versatile feature extractions and classifications, and enables the higher order analysis over traditional DSPs. Besides, the asynchronous architecture with dynamic standby controller further suppresses the system active duty and the leakage power dissipation. The proposed chip is fabricated in a 90-nm standard CMOS technology and operates at 0.5 V–1.0 V (0.7 V–1.0 V for SRAM and I/O interface). Examined with healthcare monitoring applications, the CS-SoC dissipates 48.6/105.2 μW for real-time syndrome detections of ECG-based arrhythmia/VCG-based myocardial infarction with 95.8/99% detection accuracy, respectively.

Index Terms—Arrhythmia, biomedical signal processor, classification, ECG, feature extraction, machine learning, myocardial infarction, VCG.

#### I. INTRODUCTION

MOBILE devices integrated with miniaturized sensors open opportunities for versatile healthcare applications. For instance, continuous cardiac signal monitoring of the electrocardiogram (ECG), vectorcardiogram (VCG) and phonocardiogram (PCG) supports the early detection of both chronic and emergent heart events [1]–[3]. Therefore, early treatments can be applied to the users. Long-term monitoring is desired for such applications to trace the abnormal events. However, the wireless transmission dominates the power dissipation of mobile devices and limits the monitoring duration [4]. In particular, the transmission energy grows as the resolution and the number of monitored channels increase. Alternatively, on-sensor analysis that extracts the key information of the physiological signal not only reduces the transmission data for extended monitoring time, but also provides local indications for reduced communication latency.

Manuscript received August 19, 2013; revised November 03, 2013 and December 12, 2013; accepted December 13, 2013. Date of publication January 14, 2014; date of current version March 24, 2014. This paper was approved by Guest Editor Hideyuki Kabuo. This work was supported in part by the NSC of Taiwan, R.O.C., under Grant 100-2220-E-009-068, and by a MediaTek Fellowship.

- S. Y. Hsu, P. Y. Chang, and C. Y. Lee are with the Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: fishya@si2lab.org; cylee@si2lab.org).
- Y. Ho is with the Department of Electrical Engineering, Dong Hwa University, Hualien 97401, Taiwan (e-mail: ycho@mail.ndhu.edu.tw).
- C. Su is with the Department of Electrical Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2013.2297406

There have been several biomedical signal processors designed for on-sensor analysis. The general purpose processor (GPP) approach provides high flexibility, but requires more computation cycles [5], [6]. Besides, only basic features are extracted. A dedicated processor enables atrial fibrillation detection with low power consumption; however, only a single syndrome is detected [7]. A RISC-based classifier enhances the flexibility for multiple cardiac syndrome detections, but only arrhythmias are detected [8]. These solutions, based on conventional DSP algorithms, provide limited feature extractions or detectable syndromes, which confines their use for in-depth monitoring. Machine learning (ML) techniques provide advanced data analysis ability [3]. However, the computation-intensive algorithms are hard to perform on the traditional DSP processors [5]–[8]. Though an SoC integrates a dedicated ML classifier, the feature types are limited [9]. Moreover, pre-processing is not included to enhance the analysis accuracy.

On the other hand, the power budget of mobile devices is limited. The complicated analysis algorithms and large data storage lead to increased system power. Although lowering the supply voltage significantly reduces the power dissipation, the leakage power during computation and sleep modes becomes dominant. Generally, the power gating technique is used to save the leakage power in sleep mode. Nevertheless, the naive approach requires a large power switch to prevent voltage drop, resulting in large parasitic capacitance and long wakeup time [24].

Accordingly, this work proposes a cardiac sensor SoC (CS-SoC) with data quality enhancements and accelerated ML functions. Moreover, low power dissipation is achieved to fulfill the requirements for mobile healthcare applications. The design features include a data management processor (DMP) that improves the signal quality while compressing the required data storage. A machine learning processor (MLP) supports the versatile feature extractions and classifications. Together with the low-power techniques of voltage scaling, duty-cycling, and memory compression, the system power is reduced. Besides, the proposed dynamic standby controller (DSC) further diminishes the leakage current in sleep mode, enabling minimized average power dissipation and fast wakeup time. Considering signal acquisition and intelligent analysis based on the proposed features, the CS-SoC enables >95% accurate arrhythmia and myocardial infarction (MI) syndrome detection with $\mu$W-level power dissipation for mobile healthcare applications.

Fig. 1. System overview of the proposed CS-SoC with the ML-assisted framework.

This paper is organized as follows. Section II gives the description of the overall algorithm and architecture of the proposed CS-SoC. Section III describes the data management processor for noise cancellation and data compression. The ML processor for feature extraction and syndrome classification is shown in Section IV. Section V discusses the low-power chip implementations. Section VI provides the experimental results and comparisons, followed by the conclusion in Section VII.

#### II. OVERALL SYSTEM DESCRIPTION AND ARCHITECTURE

#### A. System Overview and Algorithm Description

As a demonstration case, Fig. 1 describes the ML-assisted framework with the processing flow and corresponding hardware for 3-channel VCG-based MI detection. The processing flow mainly comprises the operations of signal acquisition, noise cancellation, feature extraction and classification. The cardiac signal is first amplified, filtered and digitized by the sensor interfaces, including the analog front-end (AFE) and the ADC, modified from [8], [10]. In order to enhance the signal quality for better detection accuracy, the residual noise is removed by the data management processor. Meanwhile, the data is compressed for reduced data storage. Furthermore, the CS-SoC includes an ML processor and a general purpose processor to perform the on-sensor analysis. Besides compression, the on-sensor data analysis includes feature extraction and classification to reduce the transmission data for system energy saving. Though data compression is intuitive, the reduction is limited to $2\times$ to $10\times$ [8]. The extracted features, such as heart rate, further reduce the transmission data to 1%. However, the original waveform cannot be reconstructed from the features. Hence, to meet the requirements from medical doctors, the CS-SoC performs classification to select the abnormal waveforms for transmission. Real-time detection is achieved with the classification model, which is learned offline to save the computation power. Once the abnormal signal is classified, the crypto processor can be activated to encrypt the data before sending it to the wireless module.

Fig. 2. The ECG lead II signal and VCG $V_{\rm Z}$ signal of two MI patients: (a) patient A; (b) patient B.

The syndrome detection suffers from the noise induced in mobile environments and the signal variability across individuals. Although the noise can be filtered out, the variability results in poor detection performance with the conventional DSP algorithms. For instance, Fig. 2 shows the ECG lead II and VCG $V_Z$ waveforms from two different MI patients. Although the syndrome of both patients is the same, the waveforms vary between individuals. Accordingly, the ML algorithms [3] are adopted to enhance the analysis accuracy. Considering in-time event detection with sufficient computing capability, hardware-friendly noise reduction and ML techniques are applied.

Fig. 3. Performance comparison of the baseline estimation methods.

1) Noise Cancellation: Generally, the cardiac signal in mobile conditions is distorted by high-frequency noise and baseline drift. The high-frequency noise is directly filtered with an FIR filter. However, the bandwidth of the low-frequency baseline drift overlaps with that of the ECG signal, which makes baseline drift cancellation difficult. Direct high-pass filtering (HPF) [11] distorts the original signal. The dyadic wavelet transform (DYWT) [12] separates the baseline drift and the signal, but the higher scale wavelet decomposition and combination raise the hardware complexity. In contrast, the median filter and morphological filter (MF) estimate the signal baseline [12], [13] with lower complexity. Fig. 3 compares the simulation results of these filters, where the mean baseline deviation (MBD) measures the baseline estimation performance and is expressed as

$$MBD = \frac{1}{T} \sum_{n} |b_e(n) - b_o(n)| \tag{1}$$

where T,  $b_e(n)$  and  $b_o(n)$  are the time interval, estimated and original baseline, respectively. Considering the least MBD with moderate hardware requirement, the MF is adopted for baseline drift cancellation.
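As an illustration of (1), the following minimal Python sketch computes the MBD of two sampled baselines. It assumes that T is taken as the number of samples in the evaluation interval; the function name and the example values are hypothetical and are not taken from the reported simulations.

```python
def mean_baseline_deviation(estimated, original):
    """Mean baseline deviation per Eq. (1), with T taken as the sample count (assumption)."""
    if len(estimated) != len(original):
        raise ValueError("baselines must have the same length")
    if len(estimated) == 0:
        raise ValueError("baselines must not be empty")
    total = sum(abs(e - o) for e, o in zip(estimated, original))
    return total / len(estimated)


# Hypothetical baselines in arbitrary units, for illustration only.
b_o = [0.0, 0.1, 0.2, 0.3]   # original baseline
b_e = [0.0, 0.2, 0.1, 0.3]   # estimated baseline
mbd = mean_baseline_deviation(b_e, b_o)
```

A lower value indicates that the estimated baseline tracks the original drift more closely.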

2) Feature Extraction: The ML assisted MI detection involves two-stage feature extractions and a classification stage. Fig. 4 illustrates the feature extraction flow. The first feature extraction stage segments the cardiac cycle with P, Q, R, S, and T boundaries by the wavelet-based delineator proposed in [8]. The performance is further fine-tuned, where the detection sensitivity and specificity of the QRS complex are 99.97% and 100.0%, respectively.

Based on the segmented cardiac cycles, the second feature extraction stage performs higher order time-series and shape analysis. A common time-series feature extraction is by autoregressive (AR) modeling [14], translating the time series signal into several AR model coefficients. However, the inter-channel correlation is not utilized. Hence a multivariate AR (MAR) estimator [15], considering multi-channel signal, is applied in this work. The equation is shown below:

$$X(n) = \sum_{i=1}^{p} A(i) \cdot X(n-i) + e(n) \tag{2}$$

where  $X(\cdot)$  is the multi-channel signal, p is the model order, and  $e(\cdot)$  is the prediction error. For MI detection with 3-channel VCG and 4th order MAR, 36 MAR coefficients  $(A(\cdot))$  are estimated.
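The structure of (2) can be sketched in a few lines of Python. The sketch below only counts the scalar coefficients and evaluates the one-step prediction term of the model; the estimation of A(i) is not shown. The helper names and the 2-channel example values are hypothetical and serve only to make the indexing explicit.

```python
def mar_coefficient_count(channels, order):
    """Number of scalar entries in A(1)..A(p), each a channels x channels matrix."""
    return channels * channels * order


def mar_predict(coeffs, history):
    """Prediction term sum_i A(i) * X(n-i) of Eq. (2), without the error e(n).

    coeffs[i-1] holds A(i) as a list of rows; history[i-1] holds X(n-i) as a list.
    """
    channels = len(history[0])
    prediction = [0.0] * channels
    for a_i, x_past in zip(coeffs, history):
        for row in range(channels):
            prediction[row] += sum(a_i[row][col] * x_past[col]
                                   for col in range(channels))
    return prediction


# 3-channel VCG with a 4th-order model, as stated in the text.
n_coeffs = mar_coefficient_count(channels=3, order=4)

# Hypothetical 2-channel, 2nd-order values, for illustration only.
A = [[[0.5, 0.0], [0.0, 0.5]],      # A(1)
     [[0.0, 0.25], [0.25, 0.0]]]    # A(2)
past = [[1.0, 2.0],                 # X(n-1)
        [4.0, 8.0]]                 # X(n-2)
x_hat = mar_predict(A, past)
```

The off-diagonal entries of each A(i) are what carry the inter-channel correlation that a per-channel AR model cannot express.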

In addition to modeling the cardiac cycles, the shape analysis (SA) is performed on the segmented P, Q, R, S, and T waves. Since the cardiac signal morphology often changes according to heart status, the 3rd and 4th order principal moments (skewness and kurtosis) are estimated to indicate the skew direction and the sharpness, respectively. Besides modeling and the shapes, additional features including the vector angles, magnitudes, ratios and other time domain statistics are extracted [3].

*3) Classification:* To evaluate the MI detection performance, the maximum likelihood classification (MLC), support vector machine (SVM), and *k*-nearest neighbor (k-NN) classifiers are applied. The feature selection is done offline in order to reduce the required computation and storage.

Based on the maximum *a posteriori* (MAP) [3], the MLC decision is made by finding the class i such that the following criterion is minimized:

$$\mathrm{MAP} = \arg \min_{i \in \{1,2,\ldots,M\}} \left\{ -2 \ln P(i) + \frac{1}{2} \ln \left| \Sigma_i \right| + \frac{1}{2} (\mathrm{FV} - \mu_i)^T \Sigma_i^{-1} (\mathrm{FV} - \mu_i) \right\} \tag{3}$$

where FV represents the extracted feature vector, M is the class number, and P(i),  $\Sigma_i$ , and  $\mu_i$  are the offline learned parameters. The SVM performs binary classification by finding the optimal decision boundary that separates the two classes in  $N_{\rm FV}$ -dimensional feature space, where  $N_{\rm FV}$  is the feature number. The class decision is expressed as

$$\mathrm{Class}(\mathrm{FV}) = \mathrm{sign}\left(\sum_{i=1}^{N_{\rm SV}} \alpha_i \cdot K(\mathrm{FV}, \mathrm{SV}_i) \cdot y_i - b\right) \tag{4}$$

where $N_{\rm SV}$ is the number of support vectors (SV), $\alpha_i$, $y_i$, and $b$ are the trained parameters, and $K(\cdot)$ is the kernel function to find the decision boundary, either linear or nonlinear. If the Class(FV) is greater than zero, the FV belongs to the MI class. Otherwise, the FV is in the normal class. For k-NN classification, the classifier finds the k training vectors nearest to the test FV and performs a majority vote for the class decision.
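The decision rule in (4) and the k-NN majority vote can be illustrated with a short Python sketch. It assumes a linear kernel for the SVM and a squared Euclidean distance for k-NN; the distance metric is not specified in the text, and all function names and toy parameters below are hypothetical rather than trained values.

```python
from collections import Counter


def linear_kernel(u, v):
    """Linear kernel K(u, v) = u . v."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(fv, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Argument of sign(.) in Eq. (4)."""
    total = sum(a * kernel(fv, sv) * y
                for a, sv, y in zip(alphas, support_vectors, labels))
    return total - bias


def svm_class(fv, support_vectors, alphas, labels, bias, kernel=linear_kernel):
    """Return 'MI' when the decision value is greater than zero, else 'normal'."""
    value = svm_decision(fv, support_vectors, alphas, labels, bias, kernel)
    return "MI" if value > 0 else "normal"


def knn_class(fv, training_vectors, training_labels, k):
    """Majority vote among the k training vectors closest to fv.

    Squared Euclidean distance is an assumption of this sketch.
    """
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(fv, tv)), label)
        for tv, label in zip(training_vectors, training_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-feature toy model, for illustration only.
svs = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.5, 0.5]
ys = [+1, -1]
svm_a = svm_class([2.0, 1.0], svs, alphas, ys, bias=0.0)
svm_b = svm_class([-1.0, -2.0], svs, alphas, ys, bias=0.0)

train = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]]
train_labels = ["MI", "MI", "normal", "normal"]
knn_a = knn_class([1.5, 1.5], train, train_labels, k=3)
```

With a linear kernel the sum over support vectors can be folded offline into a single weight vector, which is one reason the linear case maps well onto a multiply-accumulate datapath.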

#### B. CS-SoC Architecture

Fig. 5 shows the CS-SoC architecture. The 0.5 V sensor interfaces include the cardiac acquisition circuits adapted from [8], [10] and the multi-phase ring oscillator based ADCs and TDCs. The AFE is configurable with 26–46 dB gain and 0.5–150 Hz bandwidth for cardiac signal acquisition. The target sampling rates of the ADC and TDC are 250–10 kS/s with 8/12-bit resolution for various applications. The required sampling clock is divided down from the 8–32 kHz system frequency, which is generated on-chip by a crystal-less clock generator (CLCG).

The ML algorithms are computation intensive and hard to perform by a general-purpose processor with low power consumption. Therefore, the configurable and heterogeneous architecture comprises the data management processor, ML processor and crypto processor to accelerate the required processing time while reducing the power dissipation. For enhanced SoC flexibility in mobile conditions, a 32-bit RISC general-purpose processor is integrated. As illustrated in Fig. 5, the CS-SoC operates with the data collection mode and the burst computation mode. Due to the slow operation scenario for biomedical system, the leakage usually dominates the system power dissipation [8]. Therefore, during the data collection period, the processors are disabled and only the required sensor interfaces are turned on. To further minimize the system active duty, the critical-path replica oscillators (CPROs) generate the MHz-scale operation frequencies to enable burst computation. This duty-cycled approach minimizes the system active time and prevents the large leakage power dissipation. Moreover, the low-power techniques including voltage scaling and the dynamic standby controller are applied to minimize both the active and sleep power.



Fig. 4. The cardiac feature extraction with cycle segmentations, MAR estimation, shape analysis, time-domain statistics and other vector analysis.



Fig. 5. The CS-SoC architecture with the power domain partitions and behavior timeline.



Fig. 6. The data management processor architecture with clock-less pre-processing.

#### III. DATA MANAGEMENT PROCESSOR

Fig. 6 shows the data management processor architecture, including the clock-less pre-processing, FIFO storage and the multi-rate MF. This not only enhances the signal quality for better ML analysis accuracy, but also compresses the required data storage. Additionally, the data management processor behaves as the interface to connect the sampled data with burst computation.

#### A. Clock-Less Filtering and Compression

During the data collection period, the 32-tap FIR filter and the adaptive compressor are applied to remove the high-frequency noise and to reduce the data storage, respectively. However, the pre-processing operation at the sampling frequency (sub-kHz) or the system frequency (sub-100 kHz) suffers from large leakage current, which decreases the energy efficiency. Although a higher operation frequency can be supported, the settling overhead with frequent computation leads to large power dissipation due to the small computation block size of the compressor and filter. Therefore, the flip-flops of the filter and compressor are replaced by clock-less latches and handshake (HS) circuits [16]. As shown in Fig. 7, the internal power switches controlled by 4-phase HS protocols enable run-time power gating with at least 82.6% energy reduction. To guarantee the functional correctness, the delay time within the HS protocol $(T_{Delay})$ should be larger than the logic computation time $(T_{Logic})$, implying that the corresponding delay difference $(T_{Delay} - T_{Logic})$ should be greater than zero. Accordingly, the Monte Carlo simulation is performed under three design corners for evaluation, where the process variation is 3 standard deviations.

Fig. 7. The clock-less pipeline principle, performance evaluation, and the Monte Carlo simulation. The Monte Carlo simulation is performed under design corners of $(0.45\text{ V}, -40^{\circ}\text{C})$, $(0.5\text{ V}, 25^{\circ}\text{C})$, and $(0.55\text{ V}, 120^{\circ}\text{C})$.

Since the multi-channel signal requires large storage and the associated power dissipation, the adaptive-sampling-based compressor proposed in [8] is applied for data reduction. Using the min-max difference as the information measure, the sensed data are stored with four different sampling rates. Hence, signal segments with more information are represented at a higher sampling rate and vice versa. With further lossless encoding, the adaptive compressor reduces the storage size by $10\times$. Additionally, 40% always-on storage power is saved with the register/memory-hybrid FIFO.
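The adaptive sampling idea can be sketched as follows. This is a simplified illustration and not the compressor of [8]: the block size, the three thresholds, and the four decimation factors are hypothetical choices, and the subsequent lossless encoding stage is omitted.

```python
def choose_decimation(block, thresholds=(8, 32, 128), factors=(8, 4, 2, 1)):
    """Pick one of four decimation factors from the block's min-max difference.

    Thresholds and factors are hypothetical; a small min-max difference
    (little information) maps to a large decimation factor.
    """
    activity = max(block) - min(block)
    for threshold, factor in zip(thresholds, factors):
        if activity < threshold:
            return factor
    return factors[-1]


def adaptive_compress(samples, block_size=8):
    """Return one (decimation_factor, kept_samples) pair per block."""
    compressed = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        factor = choose_decimation(block)
        compressed.append((factor, block[::factor]))
    return compressed


# Hypothetical samples: a nearly flat segment followed by a sharp peak.
signal = [100, 101, 100, 101, 100, 101, 100, 101,
          100, 120, 200, 250, 180, 110, 100, 100]
blocks = adaptive_compress(signal)
kept = sum(len(data) for _, data in blocks)
```

In this toy input the flat block collapses to a single stored sample while the block containing the peak keeps full resolution, which mirrors the intent of spending storage where the min-max difference is large.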

#### B. Multi-Rate Morphological Filter

Since the baseline drift is low-frequency noise, the filtering takes a large window size, up to 2–3 seconds. This results in large storage requirements and power dissipation. Accordingly, a multi-rate MF is proposed for storage and computation efficient implementation. Fig. 8 shows the proposed data management flow and the results for both compression and noise reduction. Based on the adaptive compression, the cardiac signal is expressed as the multi-rate signal with different sampling rates. Filtered from the compressed data with opening and closing morphological operators, the positive and negative peaks of the signal are removed and the baseline is estimated.



Fig. 8. (a) The processing flow for data compression and baseline noise cancellation with (b) the corresponding waveforms.

The noise removal is then done by subtracting the estimated baseline. The implementation of MF comprises a sequence of registers and comparators to extract the minimum and maximum values within a computation window [25]. Through the multi-rate approach, the window size of MF is reduced. Hence the required registers and computation time are decreased by 64%, resulting in 42% power reduction.
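A minimal single-rate sketch of morphological baseline estimation is given below. It assumes a flat structuring element, windows truncated at the signal edges, and opening followed by closing; these choices and all function names are assumptions of the sketch, and the multi-rate handling of compressed data is not modeled.

```python
def _sliding(samples, half_width, reducer):
    """Apply min or max over a centered window, truncated at the edges."""
    n = len(samples)
    return [reducer(samples[max(0, i - half_width):min(n, i + half_width + 1)])
            for i in range(n)]


def erode(samples, half_width):
    """Sliding-window minimum."""
    return _sliding(samples, half_width, min)


def dilate(samples, half_width):
    """Sliding-window maximum."""
    return _sliding(samples, half_width, max)


def opening(samples, half_width):
    """Erosion then dilation: removes positive peaks narrower than the window."""
    return dilate(erode(samples, half_width), half_width)


def closing(samples, half_width):
    """Dilation then erosion: removes negative peaks narrower than the window."""
    return erode(dilate(samples, half_width), half_width)


def remove_baseline(samples, half_width):
    """Estimate the baseline and subtract it from the input."""
    baseline = closing(opening(samples, half_width), half_width)
    corrected = [s - b for s, b in zip(samples, baseline)]
    return corrected, baseline


# Hypothetical signal: constant offset of 5 with one positive and one negative peak.
x = [5, 5, 5, 9, 5, 5, 5, 2, 5, 5, 5]
corrected, baseline = remove_baseline(x, half_width=1)
```

Each output sample needs only the minimum or maximum over its window, which is why a chain of registers and comparators suffices in hardware and why a shorter window directly reduces the register count.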

#### IV. MACHINE LEARNING PROCESSOR

The ML processor comprises the two-stage feature extraction engine (FEE) and classification engine (CE). Accompanied with the general purpose processor, versatile feature extractions and classifications are performed.

#### A. Two-Stage Feature Extraction Engine

Fig. 9 shows the 2-stage FEE architecture to extract the critical signal characteristics. In the first stage, the hardware efficient cardiac signal delineator performs wavelet decomposition with the updatable search rules. Hence, the signal boundaries of the P, Q, R, S, and T waves are identified for cardiac cycle segmentation.

Fig. 9. The architecture of two-stage feature extraction engine.

After segmentation, the characteristics of each cardiac cycle are analyzed with MAR estimation, shape and other vector analysis. Although the MAR estimator utilizes the inter-channel correlation and removes the signal redundancy, the computation complexity is relatively high. Conventionally, the one-shot computation requires a higher dimensional matrix inversion (channel number $\times p$). Besides, a $p^2$ times larger covariance computation window is required. This implies longer computation time and higher power dissipation. Accordingly, the proposed MAR estimator is designed iteratively using a Burg-type algorithm [17]. To minimize the prediction error, the MAR estimator updates the coefficients and the intermediate values during iterations. This iterative approach not only provides flexibility for 1st-4th order MAR coefficient generations, but also saves 43.8% average power compared with the one-shot approach [17]. Besides, the processing time is reduced to 96 k cycles, while more than 21 M cycles are required on the general purpose processor. As the cardiac cycle changes dynamically, the variable window size and compressed data further reduce the processing time.
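To put the quoted cycle counts in perspective, the short calculation below converts them to processing time. It assumes, for illustration only, that both the dedicated MAR estimator and the general purpose processor are clocked at 25 MHz, the lower of the two critical-path replica oscillator frequencies; the actual clock assignment is not stated here.

```python
def processing_time_s(cycles, clock_hz):
    """Processing time in seconds for a given cycle count and clock frequency."""
    return cycles / clock_hz


CPRO_HZ = 25e6  # assumed common clock for both cases (hypothetical choice)

mar_engine_s = processing_time_s(96e3, CPRO_HZ)  # dedicated MAR estimator
mar_gpp_s = processing_time_s(21e6, CPRO_HZ)     # lower bound, "more than 21 M cycles"
speedup = mar_gpp_s / mar_engine_s               # ratio of the two cycle counts
```

Under this assumption the dedicated engine occupies only a few milliseconds per estimate, whereas the general purpose processor would be active for most of a typical cardiac cycle, which is what makes duty-cycled burst computation effective.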

To distinguish the normal and abnormal wave morphologies, the configurable SA computes the 3rd and 4th order principal moments. With different principal moment order (M), the SA datapath is shared as the corresponding equation:

$$SA = \frac{\frac{1}{L} \sum_{i=1}^{L} (x(i) - \bar{x})^{M}}{\left(\frac{1}{L} \sum_{i=1}^{L} (x(i) - \bar{x})^{2}\right)^{M/2}} \tag{5}$$

where L is the window length. Considering different wave segments, the SA is operated using the time-multiplexing approach with a variable window length, which leads to different computation cycles. For a 128-sample window size, the computation takes at most 300 cycles. Besides, the multiplier-less divider and square root functions further save 87.4% active power with latency <30 cycles. Other features, such as the VCG vector angles, magnitudes, and beat intervals, are computed by the general purpose processor with the CORDIC accelerator.
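A direct floating-point evaluation of (5) is sketched below to show how one routine serves both moment orders. The function name and the example windows are hypothetical, and the sketch does not model the multiplier-less divider and square root of the hardware datapath.

```python
def shape_moment(window, order):
    """Standardized central moment of Eq. (5): order 3 gives skewness, 4 gives kurtosis."""
    length = len(window)
    mean = sum(window) / length
    variance = sum((x - mean) ** 2 for x in window) / length
    if variance == 0:
        raise ValueError("window has zero variance")
    central = sum((x - mean) ** order for x in window) / length
    return central / variance ** (order / 2)


# Hypothetical windows, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [0, 0, 0, 0, 10]

skew_sym = shape_moment(symmetric, 3)
kurt_sym = shape_moment(symmetric, 4)
skew_right = shape_moment(right_tailed, 3)
```

Only the exponent M changes between the two features, so the accumulation over the window and the normalization by the second moment can be shared.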

The extracted features of one cardiac cycle are aggregated as an FV for classification. To enhance the flexibility, the FV length can be varied and stored in an FV buffer. Since the features are extracted at uneven time, only one register or one bank of the feature storage is clocked when new features are extracted.



Fig. 10. The switchable CE example for (a) MLC, and (b) linear SVM computation.

Therefore, the dynamic power is saved if the new feature values are not computed. Furthermore, the high- $V_t$  latches are applied for static power reduction.

#### B. Classification Engine

The CE is able to perform classification after the feature extraction of each cardiac cycle. Fig. 10 shows the CE architecture that performs classifications based on the feature vectors and the offline learned model, which are stored in the FV buffer and the partitioned SRAM banks, respectively. Normally, different classifiers are chosen for different applications. Hence the CE datapath is designed to be switchable for different classifiers, such as MLC and SVM. For instance, Fig. 10(a) shows the datapath for the MLC classifier, which makes the classification decision by finding the MAP. The probabilities of the normal and MI classes are computed, where a higher posterior probability implies a higher chance that the test FV belongs to the corresponding class. Furthermore, the MLC is computed in the log domain for reduced complexity. The required feature number is decided offline and sets the trade-off between the processing time and accuracy (e.g., MLC processing time is proportional to $N_{\rm FV}^2$). To deal with the varied $N_{\rm FV}$, the CE accumulates the computation results generated from each feature.

Fig. 11. The DSC schematics to control the power switch.

The linear SVM classifier is performed using the switchable datapath shared with MLC, as shown in Fig. 10(b). The multiplication-accumulation (MAC) operators parallelize the SVM computation with 50% reduced processing time. By finding the decision boundaries with the aid of support vectors, the binary classification is performed. The polynomial SVM can be performed with the same datapath, but requires more multiplications than the linear SVM and leads to more computation time. If other nonlinear kernels or the k-NN classifier are required, the computation should be assisted with general-purpose processor and CORDIC.

#### V. LOW-POWER DESIGNS

#### A. Low-Power Digital Implementation

The proposed algorithms and architectures reduce the hardware complexity while maintaining the flexibility. However, the system power should be further minimized by reducing both the computation and sleep power. As shown in the architecture of Fig. 5, the power and clock domains are partitioned for voltage scaling, clock gating and power gating to achieve extremely low power consumption. To minimize both the static and dynamic power, the SoC supply voltage is scaled from 1.0 V to 0.5 V, except for the SRAM and the interface to the I/O pads. Besides, level shifters are inserted into all the paths that cross different voltage domains. A 0.5 V standard cell library, based on regular V<sub>t</sub> devices, is characterized for digital processor implementations, including the data management processor, ML processor, general purpose processor, and other control circuits. By removing the cells with functional failures at 0.5 V, the standard-cell-based design flow is applied to ensure the circuit is reliable. To evaluate the power saving performance, the synthesis results show that the active power of the ML processor is reduced to 24% compared to the design operated at 1.0 V.

#### B. Crystal-less Clocking Circuits

The CS-SoC integrates two major clock sources, including the kHz-scale crystal-less clock generator (CLCG) and the MHz-scale critical-path replica oscillators (CPROs). Typically, a kHz-scale quartz crystal oscillator is applied for system control, but the off-chip device occupies large area. For sensor miniaturization, the quartz crystal is eliminated by the crystal-less clock generator, which generates an 8–32 kHz system frequency within $\pm 0.15\%$ stability over 0–45°C and is further enhanced to $\pm 30$ ppm with wireless [18]. The sampling clock for sensor interfaces is then divided down from this system frequency.

Considering burst computations of the processors, raising the operation frequency from kHz-scale to MHz-scale further reduces the active computation duty and maximizes the system sleep time. The critical-path replica oscillators consist of programmable delay lines with a delay larger than the critical path delay of the processors. Hence, the critical-path replica oscillators are turned on during computation and provide the 25/40 MHz clocks without using a frequency reference.

#### C. Dynamic Standby Controller (DSC)

In order to reduce the leakage current in sleep mode, several power gating techniques have been reported. In the naive approach, the power switches serve as header or footer to reduce the standby leakage current [24]. The dual $V_t$ approach applies a high-$V_t$ power switch to further reduce the leakage current [19]. A charge pump circuit generates a boosting signal to reduce the current on the power switch [20]. A power switch comprises two serial PMOS transistors [21], where the serial topology and body effect further suppress the leakage current. However, the effective turned-on capacitance of the power switch limits the wake-up time in active mode. In fact, a trade-off occurs between leakage suppression and the size of the power switch.

Fig. 12. The simulation result of the standby current and the turned-on time with the proposed DSC and the conventional power gating technique.

Fig. 11 shows the proposed DSC schematics, developed for leakage power reduction with faster wakeup time. A PMOS-based power switch is controlled by the DSC. The proposed DSC comprises a duty cycle adjuster (DCA), a negative overdrive generator (NOG), and a wake-up enhancement (WE) circuit. In order to eliminate the leakage current aggressively, the power switch is turned off by the NOG, which provides an average negative overdrive signal with a boosting voltage. As shown in Fig. 11, $C_{\rm BP}$ is the bootstrap capacitor and stores a voltage potential of $V_{\rm DD}$ in active mode. When the system is switched to sleep mode, $V_{\rm DSC}$ is boosted above $V_{\rm DD}$. Although the sub-threshold leakage degrades the negative overdrive signal at $V_{\rm DSC}$, the average $V_{\rm DSC}$ can be expressed as

$$V_{\rm DSC,avg} \approx \frac{1}{C_{\rm BP} + C_{\rm DSC}} \left( 2C_{\rm BP} \cdot V_{\rm DD} - \frac{1}{2} (1 - \eta) I_{\rm Leak} \cdot T \right) \tag{6}$$

where T is the system clock period and $\eta$ is the fraction of the period spent recharging. $C_{\rm DSC}$ is the parasitic capacitance at the $V_{\rm DSC}$ node. $I_{\rm Leak}$ is the cumulated leakage current at the $V_{\rm DSC}$ node and is assumed constant. Since the $C_{\rm BP}$ charges leak away through $I_{\rm Leak}$, the potential of $V_{\rm DSC}$ and the power gating performance are lowered. To guarantee that $V_{\rm DSC}$ keeps its potential above $V_{\rm DD}$ under a 10 kHz operation, $C_{\rm BP}$ is designed with a 1 pF MIM capacitor according to the worst case condition. Besides, the DCA generates a 0.5–1.0 $\mu$s periodic pulse for recharging $C_{\rm BP}$. Because $I_{\rm Leak}$ varies in different corners, the system clock is tunable to find the appropriate recharge rate.

The WE accelerates the turn-on time with a boosted positive overdrive voltage when the system is switched for computation. $C_{\rm BN}$ is also designed with a 1 pF MIM capacitor such that $V_{\rm DSC}$ has a minimum value of $-0.41$ V at the typical design corner. Since the WE generates a one-shot boost of $V_{\rm DSC}$, $V_{\rm DSC}$ gradually rises due to the charge leakage until it equals 0 V. Similarly to (6), the minimum $V_{\rm DSC}$ in active mode can be expressed as

$$V_{\rm DSC,min} \approx \frac{-C_{\rm BN} \cdot V_{\rm DD}}{C_{\rm BN} + C_{\rm DSC}}. \tag{7}$$
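Equations (6) and (7) can be explored numerically with the sketch below. It assumes $V_{\rm DD} = 0.5$ V for the gated domain and that the same $C_{\rm DSC}$ applies in both modes; neither is stated explicitly in the text. Under these assumptions, (7) is solved for $C_{\rm DSC}$ from the quoted $-0.41$ V minimum, and (6) is rearranged to give the largest $I_{\rm Leak}$ for which the average $V_{\rm DSC}$ stays at or above $V_{\rm DD}$. All function names are hypothetical.

```python
def v_dsc_avg(c_bp, c_dsc, v_dd, i_leak, period, eta):
    """Average sleep-mode V_DSC from Eq. (6)."""
    return (2 * c_bp * v_dd - 0.5 * (1 - eta) * i_leak * period) / (c_bp + c_dsc)


def v_dsc_min(c_bn, c_dsc, v_dd):
    """Minimum active-mode V_DSC from Eq. (7)."""
    return -c_bn * v_dd / (c_bn + c_dsc)


def c_dsc_from_v_min(c_bn, v_dd, v_min):
    """Eq. (7) solved for C_DSC, given a negative target v_min."""
    return c_bn * v_dd / (-v_min) - c_bn


def max_leak_for_boost(c_bp, c_dsc, v_dd, period, eta):
    """Largest I_Leak for which Eq. (6) still yields V_DSC,avg >= V_DD."""
    return 2 * (c_bp - c_dsc) * v_dd / ((1 - eta) * period)


C_BP = 1e-12       # 1 pF bootstrap capacitor (from the text)
C_BN = 1e-12       # 1 pF capacitor of the WE circuit (from the text)
V_DD = 0.5         # assumed supply of the gated domain, in volts
PERIOD = 1 / 10e3  # 10 kHz operation (from the text)
ETA = 1.0e-6 / PERIOD  # 1.0 us recharge pulse per period (upper end of the DCA range)

c_dsc = c_dsc_from_v_min(C_BN, V_DD, -0.41)
i_max = max_leak_for_boost(C_BP, c_dsc, V_DD, PERIOD, ETA)
```

The rearrangement shows that the leakage budget scales with $C_{\rm BP} - C_{\rm DSC}$ and inversely with the clock period, which is consistent with sizing $C_{\rm BP}$ for the worst case and keeping the system clock tunable.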

Fig. 12 shows the simulation result of the proposed DSC and the conventional power gating approach [24]. For a fair comparison, both approaches are designed with identical sizes of power switch, where all the devices are regular $V_{\rm t}$. In the conventional approach, a tapered buffer is applied to enhance the switch driving capability. The aspect ratio of the tapered buffer is designed according to the gradual fanout ratio of 1:4. $V_{\rm OUT}$ is one of the processor outputs, used to observe the wakeup enhancement. Table I lists the DSC performance summary. Compared to the naive power gating technique [24], the DSC further reduces the standby current by 89.8% with $3.45\times$ faster wakeup time of the power switch.

TABLE I
COMPARISONS OF CONVENTIONAL POWER GATING TECHNIQUE
WITH THE PROPOSED DSC

| Leakage suppress | P<sub>drive</sub> | P<sub>STB</sub><br>(nW) | Turn-on time (ns) |
|------------------|-------------------|--------------------------|-------------------|
| conv.            | 26.7              | 87.4                     | 89.7              |
| DSC              | 19.5              | 8.9                      | 26.1              |
| improve          | 27%               | 89.8%                    | 71%               |

P<sub>drive</sub>: power of tapered buffer or DSC; P<sub>STB</sub>: standby power of power switch

#### VI. EXPERIMENTAL RESULTS

The proposed CS-SoC is fabricated in a 90 nm standard CMOS technology [22]. The measurement instruments include a LeCroy 4000A oscilloscope and Agilent 16902A logic analyzer. The chip current is measured by Keithley 2401 source meter.

#### A. Chip Measurement

Fig. 13 shows the chip photo including the power and clock domain partitions. In order to evaluate the applied techniques for both active and average power reduction, Fig. 14 shows the measured power of general purpose processor with different supply voltages and operation frequencies. Scaling the supply voltage from 1.0 V to 0.5 V, the active computation power is reduced by more than 79%. Furthermore, the computation with lower operation frequency is observed with worse energy efficiency due to the large static current. By raising the operation frequency with critical-path replica oscillators, the energy efficiency is enhanced to sub-5 pJ/cycle. Moreover, the raised operation frequency also results in the lower system active duty.

Since the system active duty is lowered, the leakage current during the data collection period is further suppressed with the DSC for system power saving. Fig. 15 shows the measured DSC power reduction ratio compared to the conventional power gating approach. The measured power includes the power dissipation of the predriver and the standby power of the power switch. When the system frequency from the crystal-less clock generator approaches zero, the power saving is limited because the DSC is insufficiently charged. As the system frequency rises, the DSC provides a higher average overdrive signal to the power switch and achieves a 69.8% power reduction ratio at 16 kHz. Although the power reduction ratio increases with the system frequency, the DSC power also increases. The optimal frequency for overall power saving lies between 8 kHz and 32 kHz.

Fig. 13. Chip microphotograph of CS-SoC.

Fig. 14. The measured voltage scaling result and energy efficiency with different operation frequencies.

Fig. 15. The measured DSC power reduction ratio compared to the conventional power gating approach.

TABLE II
THE CHIP DESIGN SUMMARY OF THE PROPOSED CS-SOC

| Technology                       | UMC 90nm Standard CMOS                                                                          |
|----------------------------------|-------------------------------------------------------------------------------------------------|
| Chip Area                        | 2565µm×1945µm                                                                                   |
| Supply Voltage                   | 0.5-1.0V (0.7-1.0V for SRAM)                                                                    |
| CLCG/CPRO Frequency              | 8-32kHz (System)<br>25/40MHz (Signal Processing)                                                |
| Sensor Interface                 | 8/12-bit, 250-10kS/s<br>BW: 0.5-160Hz (tunable)<br>AFE gain: 26-46dB (tunable)                  |
| Input Signal                     | 3-ch ECG/VCG, 1-ch PCG,<br>Other support signal                                                 |
| SoC Power                        | 19.4μW (crystal-less CLK gen.)<br>7-32.8μW (digital processors)<br>9.1-53μW (sensor interfaces) |
| ML-Assisted Statistical Analysis | MAR coefficient estimator<br>SA (skewness/kurtosis)<br>MLC/SVM classifications                  |
| Detection Rate                   | >95.8% (Arrhythmia\*)<br>>99% (MI<sup>†</sup>)                                                  |

\*Verified using in-house recorded data (707 ECG records)

†Verified using MIT-PTB database (448 VCG records)

To compare with the state-of-the-art on-sensor processors [5]–[8], Fig. 16 summarizes the measured processor power, including the memories, logic, and the critical-path replica oscillators. The digital processors perform versatile feature extractions and syndrome classifications, where the system active duty is minimized to 0.01%–0.11% with the clock-less preprocessing and the 25/40 MHz critical-path replica oscillators. With 0.5 V voltage scaling, memory compression, and the DSC, the processor shows 11% to 46% lower power dissipation than the state-of-the-art [5]–[8]. Moreover, the CS-SoC enables the ML analysis for the most detectable syndromes and can be configured for general use.
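A simple duty-cycle model shows why an active duty of 0.01%–0.11% makes the standby power decisive. In the sketch below only the duty value is taken from the text; the active and standby power levels are hypothetical placeholders, and the function names are illustrative.

```python
def average_power_uw(duty: float, p_active_uw: float, p_standby_uw: float) -> float:
    """Time-averaged power of a duty-cycled block."""
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty must be within [0, 1]")
    return duty * p_active_uw + (1.0 - duty) * p_standby_uw


def standby_share(duty: float, p_active_uw: float, p_standby_uw: float) -> float:
    """Fraction of the average power that is spent in standby."""
    total = average_power_uw(duty, p_active_uw, p_standby_uw)
    return (1.0 - duty) * p_standby_uw / total


# Duty of 0.11% is the upper value quoted in the text.
# The 1000 uW active power and 5 uW standby power are assumptions.
duty = 0.0011
print(f"average power: {average_power_uw(duty, 1000.0, 5.0):.3f} uW")
print(f"standby share: {100.0 * standby_share(duty, 1000.0, 5.0):.1f}%")
```

Under these assumed numbers most of the average power is dissipated while the processor is idle, which is consistent with the motivation for suppressing leakage with the DSC during the data collection period.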

# B. Applications to Cardiac Syndrome Detections

The CS-SoC is verified with the pre-recorded in-house patient database and the MIT-PTB database (PTBDB) [23]. The in-house patient database is constructed from single-channel ECG recorded at a 250 S/s sampling rate with 12-bit ADC resolution. Both static and mobile recording conditions are included. Each record is annotated by at least two medical doctors to identify the normal status and abnormal arrhythmia syndromes. In addition, the PTBDB includes multi-channel ECG and VCG signals with noise and syndromes such as MI. In this work, the CS-SoC is evaluated with 3-channel VCG-based MI detection.

The atrial and ventricular arrhythmia detections are evaluated with the single-channel ECG based on the P, Q, R, S, and T fiducial points segmented in the first feature extraction stage. Following medical advice, a rule-based classification is applied to save power. Among the 707 records from the in-house database, a detection rate of at least 95.8% is achieved.
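The rules used on chip are not listed in this section, so the following is only a minimal sketch of what a rule-based rhythm check on fiducial points can look like. It uses R-peak times alone. The 60 bpm and 100 bpm limits are the conventional resting heart-rate bounds, used here as placeholders, and the `irregular_ratio` threshold and all function names are assumptions.

```python
from statistics import mean


def rr_intervals(r_peaks_s):
    """Successive R-R intervals in seconds from a list of R-peak times."""
    return [b - a for a, b in zip(r_peaks_s, r_peaks_s[1:])]


def classify_rhythm(r_peaks_s, low_bpm=60.0, high_bpm=100.0, irregular_ratio=0.2):
    """Label a short ECG segment with simple threshold rules (illustrative only)."""
    rr = rr_intervals(r_peaks_s)
    if not rr:
        raise ValueError("need at least two R peaks")
    mean_rr = mean(rr)
    # Rule 1: any interval far from the mean marks the segment as irregular.
    if any(abs(x - mean_rr) > irregular_ratio * mean_rr for x in rr):
        return "irregular"
    # Rule 2: compare the mean heart rate against the rate limits.
    bpm = 60.0 / mean_rr
    if bpm < low_bpm:
        return "bradycardia"
    if bpm > high_bpm:
        return "tachycardia"
    return "normal"


print(classify_rhythm([0.0, 0.8, 1.6, 2.4]))
print(classify_rhythm([0.0, 0.8, 1.2, 2.4]))
```

Rules of this kind need only comparisons and a running mean, which is why a rule-based stage is attractive when power is the main constraint.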

As the MI syndrome is hard to detect with simple rules, the ML-assisted algorithm is applied. Examined with 1–75 features and 3 classifiers under PTBDB [23], the MI detection rate is shown in Fig. 17. The MLC with more than 44 features achieves over 99% on-chip MI detection accuracy. Fig. 18 shows example waveforms of the normal and MI detection results, where each heartbeat is classified. If an abnormal syndrome is detected, an alarm signal is raised for successive processing.

Fig. 16. (a) The measured processor power dissipation versus different cardiac signal analysis applications. (b) The detectable syndrome types.

Fig. 17. The MI detection performance with different features and classifiers.

Fig. 18. The detection results with (a) the normal and (b) MI waveforms.
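The per-heartbeat classification described above can be sketched in software. The example assumes that MLC denotes a Gaussian maximum-likelihood classifier and, for brevity, uses a diagonal covariance; neither detail is confirmed by this section. The two-feature training data are toy values, not PTBDB records, and all names are illustrative.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_mlc(samples_by_class, var_floor=1e-6):
    """Estimate per-class feature means and variances (diagonal covariance)."""
    model = {}
    for label, rows in samples_by_class.items():
        columns = list(zip(*rows))  # one tuple per feature
        means = [mean(col) for col in columns]
        variances = [max(pvariance(col), var_floor) for col in columns]
        model[label] = (means, variances)
    return model


def log_likelihood(x, means, variances):
    """Log of the Gaussian density with independent features."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def classify(model, x):
    """Return the class whose Gaussian gives the feature vector the highest likelihood."""
    return max(model, key=lambda label: log_likelihood(x, *model[label]))


# Toy feature vectors (assumed values for illustration only).
training = {
    "normal": [[0.0, 1.0], [0.2, 1.1], [-0.1, 0.9]],
    "mi": [[2.0, 3.0], [2.2, 3.1], [1.9, 2.8]],
}
model = fit_gaussian_mlc(training)
for beat in ([0.1, 1.0], [2.1, 3.0]):
    label = classify(model, beat)
    print(beat, "->", label, "(alarm)" if label == "mi" else "")
```

Each heartbeat's feature vector is scored against every class model and the most likely class wins; a beat labeled as abnormal would then trigger the alarm signal mentioned in the text.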

Table II shows the chip design summary. To reduce the device size, the quartz crystal is eliminated by the crystal-less clock generator and the sensor interfaces are integrated. Including the power dissipation of the crystal-less clock generator and the sensor interfaces, the CS-SoC achieves over 95.8/99% detection accuracy while consuming 48.6 μW and 105.2 μW for arrhythmia and MI detection, respectively. To the best of the authors' knowledge, this is the first chip that enables the detection of MI occurrence together with versatile cardiac syndromes, allowing both long-term recording and early warning for mobile healthcare applications.
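The system-level power figures can be cross-checked against the per-block ranges listed in Table II. The sketch below only adds the tabulated numbers; the constant and function names are illustrative.

```python
# Per-block power from Table II, in microwatts.
CLOCK_GEN_UW = 19.4          # crystal-less clock generator
DIGITAL_UW = (7.0, 32.8)     # digital processors (min, max)
SENSOR_UW = (9.1, 53.0)      # sensor interfaces (min, max)


def total_power_range_uw():
    """Sum of the block powers at the low and high ends of each range."""
    low = CLOCK_GEN_UW + DIGITAL_UW[0] + SENSOR_UW[0]
    high = CLOCK_GEN_UW + DIGITAL_UW[1] + SENSOR_UW[1]
    return low, high


low, high = total_power_range_uw()
print(f"sum of minimum block powers: {low:.1f} uW")
print(f"sum of maximum block powers: {high:.1f} uW")
```

The sum of the maximum block powers reproduces the 105.2 μW reported for MI detection. The 48.6 μW reported for arrhythmia detection lies inside the computed range; how that figure splits between the digital processors and the sensor interfaces is not stated in this section.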

#### VII. CONCLUSION

Low-power sensor nodes with extended monitoring duration are desired for mobile healthcare applications. A key for such applications is precise on-sensor analysis for information extraction, which enables transmission data reduction and immediate alarm indication. In this work, an ML-assisted CS-SoC is proposed. The CS-SoC not only enhances the signal quality, but also enables versatile feature extractions and classifications. With low-power techniques including voltage scaling, duty-cycling, and the dynamic standby controller, the SoC consumes μW-scale power for long-term mobile healthcare applications.

#### ACKNOWLEDGMENT

The authors appreciate the AndesCore license from Andes Technology. P. Y. Hsu, C. Y. Yu, Y. Tseng, T. Y. Lin, T. Z. Yang, S. Indevuyst, C. S. Huang, and S. W. Lu are acknowledged for their technical support. T. F. Yang, M.D., and R. J. Chen, M.D., are also acknowledged for the medical diagnosis.

# REFERENCES

- [1] Y. H. Lin, I. C. Jan, P. C. I. Ko, Y. Y. Chen, J. M. Wong, and G. J. Jan, "A wireless PDA-based physiological monitoring system for patient transport," *IEEE Trans. Inf. Technol. Biomed.*, vol. 8, no. 4, pp. 439–447, Dec. 2004.
- [2] S. Arnon, D. Bhastekar, D. Kedar, and A. Tauber, "A comparative study of wireless communication network configurations for medical applications," *IEEE Wireless Commun.*, vol. 10, no. 1, pp. 56–61, Feb. 2003.
- [3] C. S. Huang, L. W. Ko, S. W. Lu, S. A. Chen, and C. T. Lin, "A vectorcardiogram-based classification system for the detection of myocardial infarction," in *IEEE Conf. Eng. Med. Biol. Soc. (EMBC)*, 2011, pp. 973–976.
- [4] N. Verma, A. Shoeb, J. Bohorquez, J. Dawson, J. Guttag, and A. Chandrakasan, "A micro-power EEG acquisition SoC with integrated feature extraction processor for a chronic seizure detection system," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 804–816, Apr. 2010.
- [5] M. Ashouei, J. Hulzink, M. Konijnenburg, J. Zhou, F. Duarte, A. Breeschoten, J. Huisken, J. Stuyt, H. de Groot, F. Barat, J. David, and J. Van Ginderdeuren, "A voltage-scalable biomedical signal processor running ECG using 13 pJ/cycle at 1 MHz and 0.4 V," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2011, pp. 332–333.
- [6] J. Kwong and A. P. Chandrakasan, "An energy-efficient biomedical signal processing platform," *IEEE J. Solid-State Circuits*, vol. 46, no. 7, pp. 1742–1753, Jul. 2011.

- [7] F. Zhang, Y. Zhang, J. Silver, Y. Shakhsheer, M. Nagaraju, A. Klinefelter, J. Pandey, J. Boley, E. Carlson, A. Shrivastava, B. Otis, and B. Calhoun, "A batteryless 19 μW MICS/ISM-band energy harvesting body area sensor node SoC," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2012, pp. 298–299.
- [8] S. Y. Hsu, Y. Ho, Y. Tseng, T. Y. Lin, P. Y. Chang, J. W. Lee, J. H. Hsiao, S. M. Chuang, T. Z. Yang, P. C. Liu, T. F. Yang, R. J. Chen, C. Su, and C. Y. Lee, "A sub-100 μW multi-functional cardiac signal processor for mobile healthcare applications," in *Symp. VLSI Circuits Dig.*, 2012, pp. 156–157.
- [9] J. Yoo, L. Yan, D. El-Damak, M. Bin Altaf, A. H. Shoeb, H. J. Yoo, and A. Chandrakasan, "An 8-channel scalable EEG acquisition SoC with fully integrated patient-specific seizure classification and recording processor," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2012, pp. 292–294.
- [10] Y. Tseng, Y. Ho, S. Kao, and C. Su, "A 0.09 μW low-power front-end biopotential amplifier for biosignal recording," *IEEE Trans. Biomed. Circuits Syst.*, vol. 6, no. 5, pp. 508–516, Oct. 2012.
- [11] M. L. Ahlstrom and W. J. Tompkins, "Digital filters for real-time ECG signal processing using microprocessors," *IEEE Trans. Biomed. Eng.*, vol. 32, no. 9, pp. 708–713, Sep. 1985.
- [12] W. H. Lin, M. Wong, L. N. Pu, and Y. T. Zhang, "Comparison of median filter and discrete dyadic wavelet transform for noise cancellation in electrocardiogram," in *Proc. IEEE Conf. Eng. Med. Biol. Soc. (EMBC)*, 2010, pp. 2395–2398.
- [13] Y. Sun, K. L. Chan, and S. M. Krishnan, "ECG signal conditioning by morphological filtering," *Comput. Biol. Med.*, vol. 32, pp. 465–479, 2002.
- [14] L. Chisci, A. Mavino, G. Perferi, M. Sciandrone, C. Anile, G. Colicchio, and F. Fuggetta, "Real-time epileptic seizure prediction using AR models and support vector machines," *IEEE Trans. Biomed. Eng.*, vol. 57, no. 5, May 2010.
- [15] C. W. Anderson, E. A. Stolz, and S. Shamsunder, "Multivariate autoregressive models for classification of spontaneous electroencephalographic signals during mental tasks," *IEEE Trans. Biomed. Eng.*, vol. 45, no. 3, Mar. 1998.
- [16] I. J. Chang, S. P. Park, and K. Roy, "Exploring asynchronous design techniques for process-tolerant and energy-efficient subthreshold operation," *IEEE J. Solid-State Circuits*, vol. 45, no. 2, pp. 401–410, Feb. 2010.
- [17] W. A. Woodward, H. L. Gray, and A. C. Elliot, Applied Time Series Analysis. London, U.K.: Chapman and Hall/CRC, 2011.
- [18] W. H. Sung, S. Y. Hsu, J. Y. Yu, C. Y. Yu, and C. Y. Lee, "A frequency accuracy enhanced sub-10 μW on-chip clock generator for energy efficient crystal-less wireless biotelemetry applications," in *Symp. VLSI Circuits Dig.*, 2010, pp. 115–116.
- [19] J. T. Kao and A. P. Chandrakasan, "Dual-threshold voltage techniques for low-power digital circuits," *IEEE J. Solid-State Circuits*, vol. 35, no. 7, pp. 1009–1018, Jul. 2000.
- [20] A. Valentian and E. Beigné, "Automatic gate biasing of an SCCMOS power switch achieving maximum leakage reduction and lowering leakage current variability," *IEEE J. Solid-State Circuits*, vol. 43, no. 7, pp. 1688–1698, Jul. 2008.
- [21] J. S. Chen, C. W. Yeh, and J. S. Wang, "Self-super-cutoff power gating with state retention on a 0.3 V 0.29 fJ/cycle/gate 32b RISC core in 0.13 μm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, 2013, pp. 426–427.
- [22] S. Y. Hsu, Y. Ho, P. Y. Chang, P. Y. Hsu, C. Y. Yu, Y. Tseng, T. Z. Yang, T. F. Yang, R. J. Chen, C. Su, and C. Y. Lee, "A 48.6-to-105.2 μW machine-learning assisted cardiac sensor SoC for mobile healthcare monitoring," in *Symp. VLSI Circuits Dig.*, 2013, pp. 252–253.
- [23] The PTB Diagnostic ECG Database, 1995 [Online]. Available: http://www.physionet.org/physiobank/database/ptbdb/
- [24] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," *IEEE J. Solid-State Circuits*, vol. 30, no. 8, pp. 847–854, Aug. 1995.
- [25] J. Bartovský, M. Holík, V. Kraus, A. Krutina, R. Šalom, and V. Georgiev, "Overview of recent advances in hardware implementation of mathematical morphology," *Telecommunications Forum*, Nov. 2012.



**Shu-Yu Hsu** (S'11) received the B.S. and Ph.D. degrees in electronics engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2007 and 2013, respectively.

He is currently working for MediaTek Inc., Hsinchu, Taiwan. In 2013, he was a postdoctoral researcher with the Department of Electronics Engineering, NCTU, as well as a visiting scholar with Wong Lab, Stanford University, Stanford, CA, USA. His research interests include algorithms, architectures, and low-power SoC designs for wireless, biomedical, and big data applications.

Dr. Hsu was a recipient of the MediaTek Fellowship from 2012 to 2013.



**Yingchieh Ho** (S'09–M'13) received the B.S. and M.S. degrees in electrical engineering from National Central University, Chungli, Taiwan, in 1999 and 2001, respectively. He received the Ph.D. degree in the Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 2012.

Since February 2013, he has been on the faculty at the Department of Electrical Engineering, National Dong Hwa University, Taiwan, where he is an Assistant Professor. His research interests are circuit designs for low-voltage and biomedical systems.



**Po-Yao Chang** received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2009 and 2012, respectively.

He is currently working for MediaTek Inc., Hsinchu, Taiwan. His research interests include biomedical signal processing and low-power SoC designs.



**Chauchin Su** (M'90) received the B.S. and M.S. degrees in electrical engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1979 and 1981, respectively. He received the Ph.D. degree in electrical and computer engineering from University of Wisconsin, Madison, WI, USA, in 1990.

He is now a Professor in the Department of Electrical Engineering, National Chiao Tung University, Taiwan. His research interests are in the area of mixed analog and digital circuit design and testing, especially in low-power biomedical circuits and systems.



**Chen-Yi Lee** (M'01) received the B.S. degree from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 1982, and the M.S. and Ph.D. degrees from Katholieke Universiteit Leuven, Leuven, Belgium, in 1986 and 1990, respectively, all in electrical engineering.

From 1986 to 1990, he was with IMEC/VSDM, working in the area of architecture synthesis for DSP. In February 1991, he joined the faculty of the Department of Electronics Engineering, NCTU, Hsinchu, Taiwan, where he is currently a Professor. His research interests mainly include VLSI algorithms and architectures for high-throughput DSP applications. He is also active in various aspects of high-speed networking, SoC design technology, very low-power designs, and multimedia signal processing. In these areas, he has published more than 200 papers and holds dozens of patents.

Dr. Lee served as the Director of Chip Implementation Center (CIC), an organization for IC design promotion in Taiwan (2000/8–2003/12), and the microelectronics program coordinator of Engineering Division under National Science Council of Taiwan (2003/1–2005/12). He was the former IEEE CAS Taipei Chapter Chair.