Learning Shapelets for Improving Single-Molecule Nanopore Sensing

Article Cite This: Anal. Chem. XXXX, XXX, XXX−XXX

pubs.acs.org/ac

Zi-Xuan Wei,† Yi-Lun Ying,*,‡,§ Meng-Yin Li,‡ Jie Yang,‡ Jia-Le Zhou,† Hui-Feng Wang,*,† Bing-Yong Yan,† and Yi-Tao Long‡,§




† School of Information and Engineering, East China University of Science and Technology, Shanghai 200237, People’s Republic of China
‡ School of Chemistry and Molecule Engineering, East China University of Science and Technology, Shanghai 200237, People’s Republic of China
§ State Key Laboratory of Analytical Chemistry for Life Science, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, People’s Republic of China

ABSTRACT: The nanopore technique employs a nanoscale cavity to electrochemically confine individual molecules, achieving ultrasensitive single-molecule analysis based on evaluating the amplitude and duration of the ionic current. However, each nanopore sensing interface has its own intrinsic sensing ability, which does not always efficiently generate distinctive blockade currents for multiple analytes. Therefore, analytes that differ at only a single site often exhibit similar blockade currents or durations in nanopore experiments, which often produces serious overlap in the resulting statistical graphs. To improve the sensing ability of nanopores, herein we propose a novel shapelet-based machine learning approach to discriminate mixed analytes that exhibit nearly identical blockade current amplitudes and durations. DNA oligomers with a single-nucleotide difference, 5′-AAAA-3′ and 5′-GAAA-3′, are employed as model analytes that are difficult to identify in aerolysin nanopores at 100 mV. First, a set of the most informative and discriminative segments are learned from the time-series data set of blockade current signals using the learning time-series shapelets (LTS) algorithm. Then, the shapelet-transformed representation of the signals is obtained by calculating the minimum distance between the shapelets and the original signals. A simple logistic classifier is used to identify the two types of DNA oligomers in accordance with the corresponding shapelet-transformed representation. Finally, an evaluation is performed on the validation data set to show that our approach can achieve a high F1 score of 0.933. In comparison with the conventional statistical methods for the analysis of duration and residual current, the shapelet-transformed representation provides clearly discriminated distributions for multiple analytes. Taking advantage of the robust LTS algorithm, one could anticipate the real-time analysis of nanopore events for the direct identification and quantification of multiple biomolecules in a complex real sample (e.g., serum) without labels and time-consuming mutagenesis.

Nanopore technology has been developed as a powerful platform for single-molecule analysis1−11 and has been widely applied in single-molecule analysis,9 DNA sequencing,5,7 clinical diagnosis,2,4 environmental monitoring,3 etc. A typical biological nanopore device consists of a nanoscale-confined pore embedded in a membrane and two electrolyte solutions on either side of the pore. Driven by a specific amplitude voltage, ions in the solution enter and traverse through the nanopore, which produces a weak open current at the picoampere level.6,8 When a single molecule, such as a protein or oligonucleotide, enters the nanopore, the molecule regulates the ionic flow through the confined space, which in turn causes a weak blockade current.8,11 The duration and residual current of each blockade can then be further analyzed to extract the characteristics of molecules, including size, conformation, structure, and intermolecular interactions. The sensing ability of nanopore techniques usually depends on the measurable differences between the duration and residual current of each molecule. Therefore, advanced algorithms have been developed to precisely evaluate the durations and residual current of each blockade from the noisy current trace, including CUSUM and adaptive time-series analysis (ADEPT) in MOSAIC software,12,13 the second-order differential-based calibration method,14 the modified hidden Markov model,8 etc. By incorporation of these high-performance algorithms, various types of biological nanopores (e.g., α-hemolysin,15 MspA,16 the phi29 DNA-packaging nanomotor,17

Received: April 18, 2019 Accepted: May 14, 2019 Published: May 14, 2019

DOI: 10.1021/acs.analchem.9b01896 Anal. Chem. XXXX, XXX, XXX−XXX

Article

Analytical Chemistry

Figure 1. Shapelet learning on nanopore data. The blockade current signals are extracted from a limited range of data point numbers, and the most similar segments extracted from the open current are subtracted from them. Then, the LTS algorithm is used to mine shapelets that transform the raw blockade current signal into a new representation of distance or similarity. (a) Illustration of an aerolysin nanopore in analyzing the mixture of AA3 and GA3. (b) The raw current traces for AA3 (left) and GA3 (right). The current trace in each signal is numbered as Tm. The representative shapelets S1 and S2 overlap with the closest match among the subsegments in each blockade. (c) Two typical patterns of shapelets S1 and S2 learned from the blockade current signals. The LTS algorithm can determine a certain number of shapelets. For a clear illustration, this figure shows only two shapelets, S1 and S2, from the blockade current signals. (d) Illustration of the closest matches of two shapelets, S1 and S2, to T2 of AA3 (blue) and T4 of GA3 (green), respectively. (e) Scatter plots of duration versus residual current of AA3 and GA3. Insert: shapelet-transformed representation in two-dimensional space. The traditional scatter plots of duration versus residual current exhibit a serious overlap. Given shapelets S1 and S2, two analytes show a clear distribution in the scatter plots of minimum Euclidean distances (||T* − Sn||2) between the optimal subsegment in each blockade and S1 and S2. All data were acquired at 22 ± 2 °C with an applied voltage of +100 mV in 1 M KCl, 10 mM Tris, and 1 mM EDTA buffer at pH 8.0 in the presence of 5 μM oligonucleotide. Current traces were acquired at a sampling frequency of 250 kHz and filtered by a 5 kHz low-pass filter.

ClyA,18 FhuA,19 lysenin,20 and CsgG21,22) have been used to study the charge and length of peptides,23,24 the entry and transport of unfolded proteins,25 enzymatic reactions,26,27 and the size of PEGs.28 Moreover, deep learning has been employed to translate measurable time-series current variations into DNA sequences.29 As previously noted, the sensing ability of nanopore techniques usually depends on the measurable differences in the whole current trace for each analyte. Therefore, challenges remain in the nanopore analysis of multiple analytes in a mixed sample: (1) the nanopore must offer high current resolution and temporal resolution to produce a distinguishable blockade current that can be clearly assigned to each analyte, and (2) the analysis must be highly resistant to interfering signals in the mixture. It is especially challenging to discriminate mixed heterosequence species, since the analytes show a clear overlap in the statistical scatter plots. For example, 5′-AAAA-3′ (AA3) and 5′-GAAA-3′ (GA3) produce nearly identical residual currents and durations with an aerolysin nanopore in 1 M KCl. This drawback could greatly decrease the sensitivity of nanopores, limiting their application in real sample analysis. Possible approaches to overcome these challenges include site-directed mutagenesis, modulation of the experimental conditions, and the incorporation of labels, chemical modifications, and probe molecules. However, these strategies increase the cost and complexity of experiments. Moreover, it is difficult to completely eliminate overlap in the event distribution of various analytes. For example, the K238G mutant aerolysin nanopore, which has the strongest sensing ability for aerolysin-based single-nucleotide discrimination,30 still shows clear overlap in the scatter plots of cytosine DNA and methylcytosine DNA. To accurately assign each blockade event to the species in the mixture, it is important to improve the selectivity and sensitivity for direct reading of the time-series current trace.

A primitive machine learning concept, called shapelets, has been proposed to explore the maximally discriminative features that represent a short local segment of time-series data.31−33 It utilizes short time-series segments to predict the target categories of global series. In this paper, we propose a novel shapelet-based machine learning approach to the accurate identification of two species in a mixture that are difficult to discriminate with traditional data processing methods. Here, two heterosequences of oligonucleotides (AA3 and GA3) with a single-nucleobase difference were used as a model system. Notably, we utilize the learning time-series shapelets (LTS) method to mine shapelets from blockade current signals. The shapelet-transformed representation of the signals is further used to identify the corresponding analytes. An evaluation has been performed to show that our approach can achieve efficient discrimination of a mixture of oligonucleotides, which demonstrates its practicability.



METHODS

Complete experimental methods are provided in the Supporting Information.

Data Analysis Procedure. As illustrated in Figure 1, the processing procedure consists of three main steps: (i) data preprocessing, which extracts blockade current signals from a limited range of data point numbers using a preset threshold and removes the trend by subtracting the baseline component, (ii) use of the LTS algorithm to mine a set of shapelets from the preprocessed signals, and (iii) use of the shapelets produced by the LTS algorithm to transform the signals and input of the shapelet-transformed representation into a classifier to obtain the target category. In the LTS algorithm, a stack structure is constructed by using the shapelet-transforming layer and the classifier with the most basic logistic function. The algorithm consists of three procedures: learning, transformation, and prediction. In nanopore data analysis, the learning process extracts common segments (Sn) as shapelets that provide discriminative information to identify the analyte from the current traces of each blockade event (Tm) (Figure 1b,c). Since the nanopore signals contain considerable noise, the learned shapelets may not be identical with the original series.34 However, they can be regarded as approximations to the latent variables that are usually hidden in the noise of the nanopore. Then, each shapelet is compared to the subsegments from Tm with the same length (Figure 1d). To represent the similarity of shapelets to the original series, we calculated the Euclidean distances (||Tm − Sn||2) between the shapelets and the compared subsegments. Screening all the subsegments in one Tm provided the minimum distances (||T* − Sn||2) corresponding to the closest match between the shapelets Sn and the optimal subsegments T*. Note that the Euclidean distances are robust to noise, which ensures that our methods are suitable for application to complex mixtures with low signal-to-noise ratios. Subsequently, ||T* − Sn||2 is treated as a feature to map the time-series blockade events into n-dimensional space. For example, we assume that two shapelets have been learned from the original series (Figure 1c).
The minimum distances of the segments to the shapelets can optimally transform the raw signals into a two-dimensional space. As shown in Figure 1e, the nanopore events in the shapelet-transformed scatter plots clearly reveal two populations without any overlap, while the classic scatter plots of the ionic current and durations from each event show serious overlap. To achieve the precise identification of each event, the classification algorithms are further used to classify the event in n-dimensional space.35 Here, we employed the logistic function, which is a typical linear classifier. Our results demonstrate that the transformed values of ||T* − Sn||2 are linearly distinguishable in n-dimensional space. Data Preprocessing. Before the algorithm is applied, data preprocessing is required to make a data set for the rest of the steps. In the proposed LTS algorithm, the similarity of shapelets to raw signals is calculated point to point. Therefore, using time units to describe the signals as in conventional statistical analysis is not suitable for our method. In the following analysis, we use the number of data points. A threshold is employed to identify the blockade current signals. To eliminate the falling and rising edges, we exclude 10 points from both ends of the signals. The histogram of the number of points obtained by using this preprocessing procedure shows a distribution consistent with the durations extracted by the MOSAIC software (Figure 2a,b). For the discrimination of AA3 and GA3, we choose blockade current signals with a number of points ranging from 1100 to 1300. All of the signals are padded at the end with a blockade current of zeros to reach the maximum length of 1300 points.
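The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extract_events` is hypothetical, a blockade is assumed to be any run of points below the threshold, the length filter is assumed to apply after the edges are trimmed, and the baseline detrending step is omitted.

```python
def extract_events(trace, threshold, trim=10, min_len=1100, max_len=1300):
    """Return zero-padded blockade segments from a current trace.

    Assumptions (not from the paper): a blockade is a run of points whose
    current is below `threshold`, and the length filter is applied after
    trimming `trim` points from each end.
    """
    events = []
    start = None
    # A sentinel equal to the threshold closes an event that reaches the end.
    for idx, value in enumerate(list(trace) + [threshold]):
        in_blockade = value < threshold
        if in_blockade and start is None:
            start = idx
        elif not in_blockade and start is not None:
            begin, end = start + trim, idx - trim
            start = None
            # Drop the falling and rising edges; very short runs become empty.
            segment = list(trace[begin:end]) if end > begin else []
            if min_len <= len(segment) <= max_len:
                # Pad with zeros at the end up to the maximum length.
                events.append(segment + [0.0] * (max_len - len(segment)))
    return events
```

With the defaults, every returned signal has exactly 1300 points, so the signals can be stacked into a data set of fixed length Q.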

Figure 2. Statistical analysis of AA3 and GA3 traversing the aerolysin pores. (a) Duration histograms of AA3 and GA3. The duration of each raw signal is acquired by MOSAIC software. (b) Histogram of the number of points for AA3 and GA3. The Gaussian distribution was used to fit the histogram. The data was obtained at a sampling frequency of 250 kHz.

Time-Series Data Set. In this paper, a time-series data set is a set of blockade current signals extracted by the data preprocessing step. It contains N signal samples, each of which has a length of Q. Therefore, the data set is denoted as $T^{N \times Q}$. The number of molecular species corresponding to the data set T is C. We can assign a natural number from 1 to C to each category, thus defining the categorical variable Y ∈ {1, ..., C}.

Shapelets. As a segment of a time series, the shapelet itself is a time series. Each shapelet (S) is a sequence of L data points, where L is smaller than Q. A time-series data set contains many segments from which a certain number (K) of shapelets can be selected or learned. Therefore, the K most discriminative shapelets are represented by $S \in \mathbb{R}^{K \times L}$, where $\mathbb{R}$ represents the real number domain.

Sliding Window Segment. The shapelet is considered as a segment of the time series. To compare the shapelets with the original sequence, we need to define a sliding window with the same length as the shapelet with a sliding step set to 1. Therefore, each time series in the data set contains J segments, where J = Q − L + 1. In the ith sample of the data set, the segment whose start time is j can be expressed as (Ti,j, ..., Ti,j+L−1). To capture information from segments of different lengths, shapelets with variable lengths can be adopted. We can start from a minimum length Lmin and gradually increase to R·Lmin, i.e., use the lengths {Lmin, 2Lmin, ..., R·Lmin}, written as r·Lmin with r = 1, 2, ..., R. Therefore, in the case of a fixed r value, a time series of blockade events contains J = Q − r·Lmin + 1 segments, each expressed as (Ti,j, ..., Ti,j+r·Lmin−1). Then, selecting K shapelets for each length r·Lmin leads to a total of R·K shapelets. Therefore, the minimum length Lmin, the number of the most discriminative shapelets K, and the scale R are the crucial hyperparameters that need to be manually set before applying the LTS algorithm to the nanopore data set. To simplify the setting of hyperparameters, Lmin is assigned a percentage value to represent its ratio to the length Q. Then, K can be derived using eq 1 (ref 32).

$$K = \log_{10}(J) \times (C - 1) \tag{1}$$
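As a worked illustration of these hyperparameters, the sketch below derives the shapelet lengths, the segment counts J, and K from eq 1. The function name `shapelet_hyperparameters` is hypothetical, and two choices are assumptions rather than statements from the text: K is computed once from the segment count at the minimum length, and the result of eq 1 is truncated to an integer.

```python
import math


def shapelet_hyperparameters(q, n_classes, l_min_fraction, r_max):
    """Derive shapelet lengths, segment counts, and K for a signal length q.

    Assumptions (not from the paper): K is computed once, from the number
    of segments at the minimum length, and truncated to an integer.
    """
    l_min = max(1, round(l_min_fraction * q))         # Lmin as a number of points
    lengths = [r * l_min for r in range(1, r_max + 1)]
    segments = [q - length + 1 for length in lengths]  # J = Q - r*Lmin + 1
    k = max(1, int(math.log10(segments[0]) * (n_classes - 1)))  # eq 1
    return {
        "K": k,
        "lengths": lengths,
        "segments": segments,
        "total_shapelets": r_max * k,                  # R*K features per signal
    }


params = shapelet_hyperparameters(q=1300, n_classes=2, l_min_fraction=0.1, r_max=4)
```

For Q = 1300, C = 2, Lmin = 0.1, and R = 4, this gives lengths of 130, 260, 390, and 520 points and, under the assumptions above, K = 3 shapelets per length.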

Shapelet-Transformed Representation. Since the minimum Euclidean distance between a shapelet and the optimal subsegments represents the similarity, the distance can be used as a representative feature in the shapelet feature space. By comparison of the sliding window segments of the ith sample in the data set with a shapelet, J distance values can be obtained, and the minimum value is selected as the representation shown in eq 2.

$$D_{i,r,k} = \lVert T^{*} - S_{r,k} \rVert^{2} = \min_{j=1,\ldots,J} \sum_{l=1}^{r \cdot L_{\min}} \left(T_{i,j+l-1} - S_{r,k,l}\right)^{2} \tag{2}$$

By shapelet transformation, the data set $T^{N \times Q}$ is converted to $D^{N \times R \cdot K}$.

LTS Algorithm. The LTS model is constructed using a stack structure with a logistic function as the top classifier layer. For the classification problem, cross entropy is generally chosen as the loss function, as shown in eq 3

$$\mathcal{L}(Y, \hat{Y}) = -Y \ln \sigma(\hat{Y}) - (1 - Y) \ln\left(1 - \sigma(\hat{Y})\right) \tag{3}$$

where Y is the real target category, while Ŷ is the prediction calculated by a linear function with weights $W \in \mathbb{R}^{R \cdot K + 1}$. The prediction Ŷi for the ith sample is derived from the shapelet-transformed representation, as shown in eq 4.

$$\hat{Y}_i = W_0 + \sum_{k=1}^{R \cdot K} D_{i,k} W_k, \quad \forall i \in \{1, \ldots, N\} \tag{4}$$

A logistic sigmoid function σ(·) is defined as $\sigma(Y) = (1 + e^{-Y})^{-1}$. The function maps the above prediction to the range (0, 1), representing the probability that $D_{i,*}$ belongs to a certain class. Therefore, the parameters that need to be optimized in the LTS model are the shapelets S and the linear weights W. In the field of machine learning, an effective classifier is usually obtained by optimizing a loss function in the learning process. In this algorithm, the shapelets affect the performance of the transformation over the data set, which in turn influences that of the classifier. To reduce overfitting risk and improve the robustness of the algorithm, we add a regularization term for the weights to the cross entropy function, which gives the ultimate loss function denoted by $\mathcal{F}$ in eq 5

$$\operatorname*{argmin}_{S,W} \mathcal{F}(S, W) = \operatorname*{argmin}_{S,W} \sum_{i=1}^{N} \mathcal{L}(Y_i, \hat{Y}_i) + \lambda_W \lVert W \rVert^{2} \tag{5}$$

where λW is the coefficient of the regularization term utilized for reducing overfitting. It should be set no higher than 1 because a large value may make it difficult to update the linear weights. The LTS algorithm jointly updates the classifier parameters and shapelets until the classification loss function in eq 5 is optimized. Therefore, optimizing eq 5 requires deriving the gradients of the loss function $\mathcal{F}$ with respect to S and W. Another hyperparameter called the learning rate, denoted by η, controls the optimization strength for adjusting the weights of the algorithm with respect to the gradient. η is fixed at a small value of η = 0.01 (ref 32). The optimization details and some tricks for calculating gradients are introduced in ref 32.

Metrics. To validate the performance of the proposed method, we adopt the F1 score as a score evaluation system, as discussed below. Here, the confusion matrix summarizes the results for further validation of our algorithm. As given in Table 1, there are two types of actual classes, positive and negative. The predicted class is the predicted result obtained by the LTS algorithm. The symbols FP and FN are used for the cases when the classifier fails, while TP and TN represent the correct outputs. Ideally, the nondiagonal entries of the confusion matrix for a validated classifier should be zero or close to zero.

Table 1. Confusion Matrix

actual class    predicted positive      predicted negative
positive        TP (true positive)      FN (false negative)
negative        FP (false positive)     TN (true negative)

Here, the F1 score combines precision and recall into a single metric. The precision refers to the percentage of the positive cases predicted by a classifier that are correctly identified, while recall refers to the percentage of the actual positive cases that are correctly identified. These measurements are defined as

$$\text{precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{7}$$

According to eqs 6 and 7, the precision and recall differ from each other. Usually, a high precision score accompanies a low recall score. To combine the effects of precision and recall, we adopt the F1 score as follows:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{8}$$

Validation. Generally, the metrics should be evaluated on a data set called a validation set, to further optimize the performance of a classifier. Since almost every machine learning algorithm suffers from overfitting in the training process, suspect results are usually obtained in a real application if some training set signals are included in the validation set. Therefore, we should cautiously separate the data set into a training set and a validation set without any duplicate data. Here, we use 5-fold cross validation (k-fold cross validation, where k = 5 is one of the most commonly used cases) to process the data set (Figure 3). First, the data set is equally split into five subsets. Every subset is kept as consistent as possible in data distribution; that is, the signals are drawn by stratified sampling from the data set. Then, we train the classifier on the union set of four subsets, while the remaining set is used for validating the classifier. Clearly, there are five pairs of training and validation sets, which generate five scores. Further comparison and validation are based on the average of these results.

Figure 3. Illustration of 5-fold cross validation. The term Model denotes the algorithm of a classifier, while the terms train and valid denote the training and validation sets, respectively. The data set is first split into five equal subsets. Then, five models of the algorithm are trained on the five groups, resulting in a set of metrics of the same number. The average value of the five metrics is used to measure the performance of the algorithm.

Figure 4. Comparison of efficiency and performance under different hyperparameter settings. (a) Time cost in the training process versus the minimum shapelet length Lmin with respect to minimum length and scale factor. (b) Performance measurement evaluated by the F1 score. (c)−(e) F1 score calculated under different hyperparameter settings with five levels of noise from 0.0 to 0.5 pA.
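The quantities defined in this section (the shapelet transform of eq 2, the regularized loss of eqs 3−5, the F1 score of eqs 6−8, and the stratified 5-fold split) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code or the tslearn implementation. All function names are hypothetical, labels are assumed to be coded as 0 and 1 with 1 as the positive class, the loss is only evaluated (no gradient updates of S and W are performed), and the fold assignment is deterministic rather than shuffled.

```python
import math


def min_sq_distance(series, shapelet):
    """Eq 2: smallest squared Euclidean distance over all sliding windows."""
    length = len(shapelet)
    return min(
        sum((series[j + l] - shapelet[l]) ** 2 for l in range(length))
        for j in range(len(series) - length + 1)
    )


def shapelet_transform(dataset, shapelets):
    """Map each series to one minimum distance per shapelet."""
    return [[min_sq_distance(s, sh) for sh in shapelets] for s in dataset]


def regularized_loss(features, labels, weights, lambda_w):
    """Eq 5: summed cross entropy (eq 3) of linear scores (eq 4) plus L2 term."""
    total = 0.0
    for row, y in zip(features, labels):  # labels assumed to be 0 or 1
        score = weights[0] + sum(d * w for d, w in zip(row, weights[1:]))
        p = 1.0 / (1.0 + math.exp(-score))  # logistic sigmoid
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total + lambda_w * sum(w * w for w in weights)


def f1_score(actual, predicted):
    """Eqs 6-8, with label 1 taken as the positive class (assumption)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def stratified_folds(labels, k=5):
    """Split sample indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds
```

Each fold in turn serves as the validation set while the other four form the training set, and the five resulting F1 scores are averaged.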

RESULTS AND DISCUSSION

Application to Labeled Nanopore Data. We ran the evaluation in a computing environment with an Intel Xeon E5-2630 v4 CPU @ 2.20 GHz, 64 GB of DDR4 RAM at 2133 MHz, and 64-bit Windows 10 Enterprise. The algorithm is implemented with tslearn,36 which is a Python package. The performance of the LTS algorithm relies on its hyperparameter setting. To validate the algorithm, it is first applied to the labeled nanopore data from AA3 and GA3 raw current traces. To prevent the algorithm from learning to identify the current amplitude, each labeled trace in the data set is centered by subtracting its mean amplitude using eq 9

$$T_{i,j} = \bar{T}_{i,j} - \frac{1}{Q} \sum_{m=1}^{Q} \bar{T}_{i,m} \tag{9}$$

where T̅ represents the raw current traces. First, the shapelet length is related to the number of segments in the data set. A smaller Lmin provides more segments, which leads to more candidates for the algorithm to learn. This setting could slow the convergence of the algorithm. To accelerate the whole process, fewer parameters could be optimized to complete an update epoch. To confirm the efficiency of the LTS algorithm with respect to the shapelet length, sets with minimum lengths of Lmin ∈ {0.01, 0.02, 0.05, 0.10, 0.20, 0.30} were used to implement the 5-fold cross validation. The number of training epochs was fixed at 500, while the coefficient of the regularization term was set to λW = 0.01. With increasing shapelet length, the time cost of the algorithm learning from the training set gradually rises from 80 to 3000 s (Figure 4a). In addition, we also evaluate the algorithm with the scale factor R selected from R ∈ {1, 2, 3, 4}. This scale factor R controls the length range of the shapelets. A larger value of R means that the algorithm has a larger number of parameters to be optimized, which requires more time for each optimization epoch. A comparison of different R values reveals that the training time positively depends on the R value (Figure 4a). Moreover, the training time exhibits a logarithmic increase with the minimum shapelet length Lmin. The number of segments of a signal for calculating the distance to a shapelet decreases with increasing shapelet length when the length of the blockade signals is fixed. Thus, less time is required to determine the minimum distance in the case of a larger minimum length. However, the process of optimizing the shapelet parameters dominates the overall training process.

To evaluate the performance of the algorithm, we also calculate the F1 score using cross validation. Figure 4b shows the F1 scores for our hyperparameter settings, which illustrates that the performance improves as the minimum shapelet length Lmin and scale factor R increase. With a fixed value of Lmin, a larger value of R represents more combinations of individual scaled-length shapelets, providing more characteristics for a better classification with a high F1 score. For example, when Lmin = 0.05, the F1 score shows an increase from 0.765 to 0.897 as the value of R increases. However, when Lmin is further increased above 0.15, it hardly affects the performance of the LTS algorithm, which shows a comparable F1 score. For instance, setting R = 4 and Lmin = 0.2 generates segments with a length 80% of that of the signals. These segments show a performance similar to that for R = 4, Lmin = 0.1. These results further demonstrate that the shapelet from a local segment of the raw current traces preserves the discriminative information for the classification of nanopore data. Therefore, the LTS algorithm achieves the best performance in the case of Lmin = 0.1 and R = 4 with an F1 score of 0.933. Moreover, the LTS algorithm efficiently processes the raw current traces directly without further filtering.

Robustness to Noise. Usually, the raw current traces contain a considerable density of noise. For example, the RMS value is 0.61 in the aerolysin experiments on AA3 and GA3. Therefore, shapelets should be insensitive to noise to yield a minimum distance at the right position. To prove the robustness of the algorithm to noise, we compared the performance of the LTS algorithm by searching hyperparameter values from Lmin ∈ {0.05, 0.1, 0.2} and R ∈ {2, 3, 4}. We trained the algorithm on the data set mixed with zero-mean Gaussian white noise with standard deviations (sd) of 0.0, 0.1, 0.2, 0.3, 0.4, and 0.5 pA. Increasing the noise level makes it more difficult to calculate the distances while ignoring the noise data, and the minimum distance may be disturbed by some instantaneous large noises (Figure 4c−e). Furthermore, we also find that the F1 scores are greater than 0.85 with Lmin ∈ {0.1, 0.2}, which is better than those obtained with shorter shapelets. The curves become flatter and the range of the F1 score decreases from 0.135 to 0.050 with increasing Lmin, which indicates that the algorithm performance is more stable and robust with longer shapelets (Figure 4c−e). From a comparison of the F1 score on the noisy data, we may conclude that our method is more tolerant to noise with longer shapelets, i.e., Lmin ∈ {0.1, 0.2}, and is thus suitable for processing unfiltered data with a lower signal-to-noise ratio. Considering the efficiency of the algorithm, shapelets with shorter lengths are more efficient but result in a serious loss of accuracy. Comparing the tendencies of the time cost and F1 score shows that the gain in the F1 score becomes small for Lmin ∈ {0.1, 0.2}. Therefore, Lmin = 0.1 and R = 4 are used to set the hyperparameters, since they achieve the best F1 score and effective performance.

Real Sample Application to Nanopore Data from AA3 and GA3 Mixture. In this section, we applied the LTS algorithm to an aerolysin nanopore experiment to identify AA3 and GA3 in the mixture. As illustrated in Figure 5, the components of AA3 and GA3 overlap completely regardless of the sample mixture. As described in Data Preprocessing, we selected signals with numbers of points ranging from 500 to 700. We statistically analyzed the proportion of AA3 and GA3 predicted by the LTS algorithm (Figure 5). The proportion of AA3 in the 1:1 mixture was 0.593 (Figure 5c), which is consistent with the results showing that AA3 exhibits nearly 1.59-fold higher capture by aerolysin nanopores than GA3 (Figure S1). To further confirm the accuracy of the prediction of the LTS algorithm, the events in a mixture of AA3 and GA3 in a 4:3 ratio were analyzed. Interestingly, the proportion of AA3 predicted with the same hyperparameter setting increased to 0.695, which indeed reflects the increase in the AA3 concentration ratio (Figure 5d). Then, the prediction of the LTS algorithm was also verified by a 3:4 mixture. As expected, the LTS results show that the proportion of AA3 further decreased as the concentration of AA3 in the mixture decreased (Figure S2). All of these results demonstrate that the LTS algorithm can discriminate analytes that are severely overlapped in conventional statistics.
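Three small steps used in this evaluation can be sketched in Python: the centering of eq 9, the injection of zero-mean Gaussian white noise for the robustness test, and the proportion of events assigned to one analyte in a mixture. The sketch is illustrative only; the function names are hypothetical, the random generator is passed in explicitly so that runs are reproducible, and the class labels shown are placeholders for whatever the classifier returns.

```python
import random


def center_trace(trace):
    """Eq 9: subtract the mean current of the trace from every point."""
    mean = sum(trace) / len(trace)
    return [value - mean for value in trace]


def add_gaussian_noise(trace, sd, rng):
    """Add zero-mean Gaussian white noise with standard deviation `sd`.

    `sd` is in the same unit as the trace (pA in the robustness test).
    """
    return [value + rng.gauss(0.0, sd) for value in trace]


def class_proportion(predicted_labels, target):
    """Fraction of events that the classifier assigned to `target`."""
    return sum(label == target for label in predicted_labels) / len(predicted_labels)


# Example: build noisy copies of one centered trace at the six noise levels.
rng = random.Random(0)
centered = center_trace([52.0, 50.5, 49.0, 51.5, 47.0])
noisy_copies = {sd: add_gaussian_noise(centered, sd, rng)
                for sd in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)}
```

Because centering removes the mean of each trace, a classifier trained on the centered signals has to rely on the shape of the blockade rather than on its average amplitude.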



CONCLUSION In this paper, we developed an LTS algorithm to process seriously overlapped blockade current signals in order to improve the sensing ability of nanopore techniques. The LTS algorithm successfully identifies the blockade current of the single target molecule. In contrast to the conventional statistical methods relying on duration and residual current level, the shapelet-transformed representation provides a clearly discriminated distribution for multiple analytes. Further employing a simple linear classifier, the presented methods distinguish the AA3 and GA3 in the mixture with high precision. Our results also confirm that the shapelets are suitable for processing noisy nanopore data. Taking advantage of the robust LTS algorithm, one could anticipate the real-time analysis of nanopore events for the direct identification and quantification of multiple biomolecules in a complex real sample (e.g., serum). Since the shapelets contain typical features of time-series current traces, this algorithm would help to deepen the comprehensive understanding of the dynamic interactions between nanopores and analytes. A previous study showed that the wavelet algorithm increases the signal-to-noise ratio of nanopore data.37 When the wavelets are employed as a preprocessing algorithm, the presented LTS algorithm may exhibit a stronger tolerance to noise, leading to more efficient signal discrimination. The implementation of the algorithm in the Python package of tslearn would benefit from a well-known open-source machine learning framework, such as TensorFlow38 and keras.39 Thus, we can also effectively train the algorithm on a GPU. Recently, researchers have proposed more effective algorithms for time-series subsequence discovery, such as Matrix Profile. 
Therefore, the incorporation of these algorithms could be further used to promote the classification effectiveness of shapelets.40,41 Although the risk of overfitting is a universal concern in machine learning algorithms, the performance of the presented algorithm could be further improved by completely resolving this issue. Note that the shapelet-based method focuses on time-series similarity instead of the sequentially changing pattern in the current trace of nanopore sequencing. Therefore, the advantage of the LTS algorithm is the ability to assign each whole current trace to a corresponding target with similar features. Therefore, we expect that more applications of the LTS algorithm could be realized in microRNA detection, posttranscriptional nucleotide identification, DNA damage sensing, peptide discrimination, posttranslational modification analysis, small-molecule identification, etc.

Figure 5. Scatter plots of duration versus residual current of a 1:1 mixture (a) and a 4:3 mixture (b) of AA3 and GA3. LTS-identified populations of AA3 (blue) and GA3 (green) in the 1:1 mixture (c) and the 4:3 mixture (d). After the LTS algorithm was implemented, each event could be assigned to either the AA3 population or the GA3 population. Insets in (c) and (d): the predicted ratios of AA3 and GA3 in the mixture. A potential of 100 mV was applied to perform the experiments in 1 M KCl, 10 mM Tris, and 1 mM EDTA buffer at pH 8.0.
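Once every event carries a predicted label, the ratios shown in the insets reduce to simple counting. The sketch below assumes the classifier output is available as a list of label strings; the function name and the example labels are hypothetical and do not reproduce the measured data.

```python
from collections import Counter

def prediction_ratios(labels):
    """Return the fraction of events assigned to each analyte."""
    if not labels:
        raise ValueError("at least one classified event is required")
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: counts[name] / total for name in sorted(counts)}

# Hypothetical per-event predictions, chosen only to illustrate the arithmetic.
example_labels = ["AA3"] * 4 + ["GA3"] * 3
ratios = prediction_ratios(example_labels)
print(ratios)
```

Comparing these fractions with the known mixing ratio gives a quick check on whether the classifier is biased toward one analyte; any difference in capture rate between the analytes would need to be accounted for separately.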

DOI: 10.1021/acs.analchem.9b01896 Anal. Chem. XXXX, XXX, XXX−XXX




(16) Butler, T. Z.; Pavlenok, M.; Derrington, I. M.; Niederweis, M.; Gundlach, J. H. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 20647−20652.
(17) Wendell, D.; Jing, P.; Geng, J.; Subramaniam, V.; Lee, T. J.; Montemagno, C.; Guo, P. Nat. Nanotechnol. 2009, 4, 765−772.
(18) Soskine, M.; Biesemans, A.; De Maeyer, M.; Maglia, G. J. Am. Chem. Soc. 2013, 135, 13456−13463.
(19) Mohammad, M. M.; Iyer, R.; Howard, K. R.; McPike, M. P.; Borer, P. N.; Movileanu, L. J. Am. Chem. Soc. 2012, 134, 9521−9531.
(20) Aoki, T.; Hirano, M.; Takeuchi, Y.; Kobayashi, T.; Yanagida, T.; Ide, T. Proc. Jpn. Acad., Ser. B 2010, 86, 920−925.
(21) Goyal, P.; Krasteva, P. V.; Van Gerven, N.; Gubellini, F.; Van Den Broeck, I.; Troupiotis-Tsaïlaki, A.; Jonckheere, W.; Péhau-Arnaudet, G.; Pinkner, J. S.; Chapman, M. R.; et al. Nature 2014, 516, 250−253.
(22) Brown, C. G.; Clarke, J. Nat. Biotechnol. 2016, 34, 810−811.
(23) Stefureac, R.; Long, Y. T.; Kraatz, H. B.; Howard, P.; Lee, J. S. Biochemistry 2006, 45, 9172−9179.
(24) Li, S.; Cao, C.; Yang, J.; Long, Y.-T. ChemElectroChem 2019, 6, 126−129.
(25) Pastoriza-Gallego, M.; Rabah, L.; Gibrat, G.; Thiebot, B.; Van Der Goot, F. G.; Auvray, L.; Betton, J. M.; Pelta, J. J. Am. Chem. Soc. 2011, 133, 2923−2931.
(26) Wang, Y.; Montana, V.; Grubišić, V.; Stout, R. F.; Parpura, V.; Gu, L. Q. ACS Appl. Mater. Interfaces 2015, 7, 184−192.
(27) Fennouri, A.; Daniel, R.; Pastoriza-Gallego, M.; Auvray, L.; Pelta, J.; Bacri, L. Anal. Chem. 2013, 85, 8488−8492.
(28) Baaken, G.; Halimeh, I.; Bacri, L.; Pelta, J.; Oukhaled, A.; Behrends, J. C. ACS Nano 2015, 9, 6443−6449.
(29) Teng, H.; Cao, M. D.; Hall, M. B.; Duarte, T.; Wang, S.; Coin, L. J. M. Gigascience 2018, 7, 1−9.
(30) Wang, Y. Q.; Cao, C.; Ying, Y. L.; Li, S.; Wang, M. B.; Huang, J.; Long, Y. T. ACS Sensors 2018, 3, 779−783.
(31) Ye, L.; Keogh, E. Time Series Shapelets: A New Primitive for Data Mining. In KDD '09 Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining; 2009; pp 947−956.
(32) Grabocka, J.; Schilling, N.; Wistuba, M.; Schmidt-Thieme, L. Learning Time-Series Shapelets. In KDD '14 Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining; 2014; pp 392−401.
(33) Ji, C.; Liu, S.; Yang, C.; Pan, L.; Wu, L.; Meng, X. Procedia Comput. Sci. 2018, 129, 461−467.
(34) Zhang, Q.; Wu, J.; Yang, H.; Tian, Y.; Zhang, C. Unsupervised Feature Learning from Time Series. In Proceedings of the 25th International Joint Conference on Artificial Intelligence; 2016; pp 2322−2328.
(35) Lines, J.; Davis, L. M.; Hills, J.; Bagnall, A. A Shapelet Transform for Time Series Classification. In KDD '12 Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining; 2012; pp 289−297.
(36) Tavenard, R. Tslearn: A Machine Learning Toolkit Dedicated to Time-Series Data; 2017.
(37) Drndic, M.; Shepard, K. L.; Chien, C.-C.; Marks, A.; Ong, P.; Shekar, S.; Clarke, O. B.; Hartel, A. Nano Lett. 2019, 19, 1090−1097.
(38) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16); 2016; pp 265−283.
(39) Chollet, F. Keras: Deep Learning Library for Theano and TensorFlow; 2015. https://keras.io.
(40) Yeh, C. C. M.; Zhu, Y.; Ulanova, L.; Begum, N. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets. In 2016 IEEE 16th International Conference on Data Mining; IEEE: 2016; pp 1317−1322.
(41) Yeh, C. C. M.; Zhu, Y.; Ulanova, L.; Begum, N.; Ding, Y.; Dau, H. A.; Zimmerman, Z.; Silva, D. F.; Mueen, A.; Keogh, E. Data Min. Knowl. Discovery 2018, 32, 83−123.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.9b01896.



Detailed materials and methods, the capture rate of AA3 and GA3 by the aerolysin nanopores, and the scatter plots of duration versus residual current of a 3:4 mixture of AA3 and GA3 (PDF)

AUTHOR INFORMATION

Corresponding Authors

*E-mail for Y.-L.Y.: [email protected].
*E-mail for H.-F.W.: [email protected].

ORCID

Yi-Lun Ying: 0000-0001-6217-256X
Yi-Tao Long: 0000-0003-2571-7457

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This research was supported by the National Natural Science Foundation of China (61871183 and 21834001) and the Innovation Program of the Shanghai Municipal Education Commission (2017-01-07-00-02-E00023). Y.-L.Y. is sponsored by the National Ten Thousand Talent Program for young top-notch talent and the Shanghai Rising-Star Program (19QA1402300).



REFERENCES

(1) Kasianowicz, J. J.; Brandin, E.; Branton, D.; Deamer, D. W. Proc. Natl. Acad. Sci. U. S. A. 1996, 93, 13770−13773.
(2) Meller, A.; Nivon, L.; Brandin, E.; Golovchenko, J.; Branton, D. Proc. Natl. Acad. Sci. U. S. A. 2000, 97, 1079−1084.
(3) Thompson, J. F.; Oliver, J. S. Electrophoresis 2012, 33, 3429−3436.
(4) Tian, K.; Gu, L. Q. Nanopore Single-Molecule Dielectrophoretic Detection of Cancer-Derived MicroRNA Biomarkers. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society; EMBS: 2013; pp 6821−6824.
(5) Bohmann, K.; Evans, A.; Gilbert, M. T. P.; Carvalho, G. R.; Creer, S.; Knapp, M.; Yu, D. W.; de Bruyn, M. Trends Ecol. Evol. 2014, 29, 358−367.
(6) Ying, Y. L.; Cao, C.; Long, Y. T. Analyst 2014, 139, 3826−3835.
(7) Castro-Wallace, S. L.; Chiu, C. Y.; John, K. K.; Stahl, S. E.; Rubins, K. H.; McIntyre, A. B. R.; Dworkin, J. P.; Lupisella, M. L.; Smith, D. J.; Botkin, D. J.; et al. Sci. Rep. 2017, 7, 18022.
(8) Zhang, J.; Liu, X.; Ying, Y. L.; Gu, Z.; Meng, F. N.; Long, Y. T. Nanoscale 2017, 9, 3458−3465.
(9) Ying, Y. L.; Long, Y. T. Sci. China: Chem. 2017, 60, 1187−1190.
(10) Cao, C.; Long, Y. T. Acc. Chem. Res. 2018, 51, 331−341.
(11) Liu, S. C.; Li, M. X.; Li, M. Y.; Wang, Y. Q.; Ying, Y. L.; Wan, Y. J.; Long, Y. T. Faraday Discuss. 2018, 210, 87−99.
(12) Forstater, J. H.; Briggs, K.; Robertson, J. W.; Ettedgui, J.; Marie-Rose, O.; Vaz, C.; Kasianowicz, J. J.; Tabard-Cossa, V.; Balijepalli, A. Anal. Chem. 2016, 88, 11900−11907.
(13) Balijepalli, A.; Ettedgui, J.; Cornio, A. T.; Robertson, J. W. F.; Cheung, K. P.; Kasianowicz, J. J.; Vaz, C. ACS Nano 2014, 8, 1547.
(14) Gu, Z.; Ying, Y. L.; Cao, C.; He, P.; Long, Y. T. Anal. Chem. 2015, 87, 907−913.
(15) Kawano, R.; Schibel, A. E. P.; Cauley, C.; White, H. S. Langmuir 2009, 25, 1233−1237.
