Learning shapelets for improving the single-molecule nanopore sensing


Article

Learning shapelets for improving the single-molecule nanopore sensing Zi-Xuan Wei, Yi-Lun Ying, Meng-Yin Li, Jie Yang, JiaLe Zhou, Hui-Feng Wang, bingyong yan, and Yi-Tao Long Anal. Chem., Just Accepted Manuscript • Publication Date (Web): 14 May 2019 Downloaded from http://pubs.acs.org on May 14, 2019


Analytical Chemistry

Learning shapelets for improving the single-molecule nanopore sensing

Zi-Xuan Wei,1 Yi-Lun Ying,*2 Meng-Yin Li,2 Jie Yang,2 Jia-Le Zhou,1 Hui-Feng Wang,*1 Bing-Yong Yan1 and Yi-Tao Long2,3

1School of Information and Engineering, East China University of Science and Technology, Shanghai 200237, P. R. China.
2School of Chemistry and Molecule Engineering, East China University of Science and Technology, Shanghai 200237, P. R. China.
3School of Chemistry and Chemical Engineering, Nanjing University, Nanjing 210023, P. R. China.

KEYWORDS: nanopore technology, single-molecule analysis, aerolysin nanopore, shapelets, machine learning

Abstract: The nanopore technique employs a nanoscale cavity to electrochemically confine individual molecules, achieving ultrasensitive single-molecule analysis based on evaluating the amplitude and duration of the ionic current. However, each nanopore sensing interface has its own intrinsic sensing ability, which does not always efficiently generate distinctive blockade currents for multiple analytes. Therefore, analytes that differ at only a single site often exhibit similar blockade currents or durations in nanopore experiments, which often produces serious overlap in the resulting statistical graphs. To improve the sensing ability of nanopores, herein, we propose a novel shapelet-based machine learning approach to discriminate mixed analytes that exhibit nearly identical blockade current amplitudes and durations. DNA oligomers with a single-nucleotide difference, 5’-AAAA-3’ and 5’-GAAA-3’, are employed as model analytes that are difficult to identify in aerolysin nanopores at 100 mV. First, a set of the most informative and discriminative segments is learned from the time-series dataset of blockade current signals using the learning time-series shapelets (LTS) algorithm. Then, the shapelet-transformed representation of the signals is obtained by calculating the minimum distance between the shapelets and the original signals. A simple logistic classifier is used to identify the two types of DNA oligomers in accordance with the corresponding shapelet-transformed representation. Finally, an evaluation is performed on the validation dataset to show that our approach can achieve a high F1 score of 0.933. Compared with the conventional statistical methods for the analysis of duration and residual current, the shapelet-transformed representation provides clearly discriminated distributions for multiple analytes. Taking advantage of the robust LTS algorithm, one could anticipate the real-time analysis of nanopore events for the direct identification and quantification of multiple biomolecules in a complex real sample (e.g., serum) without labels and time-consuming mutagenesis.

Nanopore technology has been developed as a powerful platform for single-molecule analysis1–11 and has been widely applied in single-molecule studies,9 DNA sequencing,5,7 clinical diagnosis,2,4 environmental monitoring,3 etc. A typical biological nanopore device consists of a nanoscale-confined pore embedded in a membrane and two electrolyte solutions on either side of the pore. Driven by a voltage of a specific amplitude, ions in the solution enter and traverse the nanopore, which produces a weak open current at the picoampere level.6,8 When a single molecule, such as a protein or oligonucleotide, enters the nanopore, the molecule regulates the ionic flow through the confined space, which in turn causes a weak blockade current.8,11 The duration and residual current of each blockade can then be further analyzed to extract the characteristics of molecules, including size, conformations, structure, and intermolecular interactions. The sensing ability of nanopore techniques usually depends on the measurable differences between the duration and residual current of each molecule. Therefore, advanced algorithms have been developed to precisely evaluate the durations and residual current of each blockade from the noisy current trace, including CUSUM and Adaptive Time-Series Analysis (ADEPT) in MOSAIC software,12,13 the second-order differential-based calibration method,14 the modified hidden Markov model,8 etc. By incorporating these high-performance algorithms, various types of biological nanopore (e.g., α-hemolysin,15 MspA,16 the phi29 DNA-packaging nanomotor,17 ClyA,18 FhuA,19 lysenin,20 and CsgG21,22) have been used to study the charge and length of peptides,23,24 the entry and transport of unfolded proteins,25 enzymatic reactions,26,27 and the size of PEGs.28 Moreover, the deep learning method has been employed to translate measurable time-series current variations into DNA sequences.29

As previously noted, the sensing ability of nanopore techniques usually depends on the measurable differences in the whole current trace for each analyte. Therefore, challenges still remain in the nanopore analysis of multiple analytes in a mixed sample: (1) improving the current resolution and temporal resolution of a nanopore to produce a distinguishable blockade current that can be clearly assigned to each analyte and (2) resisting the interferent signals in the mixture. For example, it is challenging to discriminate the mixed species in heterosequences since the multiple analytes show a clear overlap in the statistical scatter plots. For instance, 5’-AAAA-3’ (AA3) and 5’-GAAA-3’ (GA3) produce nearly identical residual currents and durations with an aerolysin nanopore in 1 M KCl. This drawback could greatly decrease the sensitivity of nanopores, limiting the application of nanopores in real sample analysis.

Figure 1. Shapelet learning on nanopore data. The blockade current signals are extracted from a limited range of data point numbers, and the most similar segments extracted from the open current are subtracted from them. Then, the LTS algorithm is used to mine shapelets that transform the raw blockade current signal into a new representation of distance or similarity. (a) Illustration of aerolysin nanopore in analyzing the mixture of AA3 and GA3. (b) The raw current traces for AA3 (left) and GA3 (right). The current trace in each signal is numbered as Tm. The representative shapelets S1 and S2 overlap with the closest match among the subsegments in each blockade. (c) Two typical patterns of shapelets S1 and S2 learned from the blockade current signals. The LTS algorithm can determine a certain number of shapelets. For a clear illustration, this figure shows only two shapelets, S1 and S2, from the blockade current signals. (d) Illustration of the closest matches of two shapelets, S1 and S2, to T2 of AA3 (blue) and T4 of GA3 (green), respectively. (e) The scatter plots of duration versus residual current of AA3 and GA3. Insert: shapelet-transformed representation in 2-dimensional space. The traditional scatter plots of duration versus residual current exhibit a serious overlap. Given shapelets S1 and S2, two analytes show a clear distribution in the scatter plots of minimum Euclidean distances (||T* - Sn||2) between the optimal subsegment in each blockade and S1 and S2. All data were acquired at 22 ± 2 ℃ with an applied voltage of +100 mV in 1 M KCl, 10 mM Tris and 1 mM EDTA buffer at pH 8.0 in the presence of 5 μM oligonucleotide. Current traces were acquired at a sampling frequency of 250 kHz and filtered by a 5 kHz low-pass filter.

Possible approaches to overcome these challenges include site-directed mutagenesis, modulation of the experimental conditions, and the incorporation of labels, chemical modifications, and probe molecules. However, these strategies increase the cost and complexity of experiments. Moreover, it is difficult to completely eliminate overlap in the event distribution of various analytes. For example, the K238G mutant aerolysin nanopore, which has the strongest sensing ability for aerolysin-based single-nucleotide discrimination,30 still shows clear overlap in the scatter plots of cytosine DNA and methylcytosine DNA. To accurately assign each blockade event to the species in the mixture, it is important to improve the selectivity and sensitivity for direct reading of the time-series current trace. A primitive machine learning concept, called shapelets, has been proposed to explore the maximally discriminative features that represent a short local segment of time-series data.31–33 It utilizes short time-series segments to predict the target categories of global series. In this paper, we propose a novel shapelet-based machine learning approach for the accurate identification of two species in a mixture that are difficult to discriminate with traditional data processing methods. Here, two heterosequences of oligonucleotides (AA3 and GA3) with a single-nucleobase difference were used as a model system. Notably, we utilize the learning time-series shapelets (LTS) method to mine shapelets from blockade current signals. The shapelet-transformed representation of the signals is further used to identify the corresponding analytes. An evaluation has been performed to show that our approach can achieve efficient discrimination of a mixture of oligonucleotides, which demonstrates its practicability.

Methods

Data analysis procedure. As illustrated in Figure 1, the processing procedure consists of three main steps: (i) data preprocessing, which uses a preset threshold to extract blockade current signals whose lengths fall within a limited range of data points and removes the trend by subtracting the baseline component; (ii) use of the LTS algorithm to mine a set of shapelets from the preprocessed signals; and (iii) use of the shapelets produced by the LTS algorithm to transform the signals, followed by input of the shapelet-transformed representation into a classifier to obtain the target category. In the LTS algorithm, a stack structure is constructed from the shapelet-transforming layer and a classifier with the most basic logistic function. The algorithm consists of three procedures: learning, transformation, and prediction.

In nanopore data analysis, the learning process extracts common segments (Sn) as shapelets that provide discriminative information to identify the analyte from the current traces of each blockade event (Tm) (Figure 1b-c). Since the nanopore signals contain considerable noise, the learned shapelets may not be identical to the original series.34 However, they can be regarded as approximations to the latent variables that are usually hidden in the noise of the nanopore. Then, each shapelet is compared to the subsegments of the same length from Tm (Figure 1d). To represent the similarity of shapelets to the original series, we calculated the Euclidean distances (||Tm - Sn||2) between the shapelets and the compared subsegments. Screening all the subsegments in one Tm provided the minimum distances (||T* - Sn||2) corresponding to the closest match between the shapelets Sn and the optimal subsegments T*. Note that the Euclidean distances are robust to noise, which ensures that our methods are suitable for application to complex mixtures with low signal-to-noise ratios. Subsequently, ||T* - Sn||2 is treated as a feature that maps the time-series blockade events into n-dimensional space. For example, we assume that two shapelets have been learned from the original series (Figure 1c). The minimum distances of the segments to the shapelets then transform the raw signals into a 2-dimensional space. As shown in Figure 1e, the nanopore events in the shapelet-transformed scatter plots clearly reveal two populations without any overlap, while the classic scatter plots of the ionic current and durations from each event show serious overlap. To achieve the precise identification of each event, classification algorithms are further used to classify the event in n-dimensional space.35 Here, we employed the logistic function, which is a typical linear classifier. Our results demonstrate that the transformed values of ||T* - Sn||2 are linearly distinguishable in n-dimensional space.

Data preprocessing. Before applying the algorithm, data preprocessing is required to build a dataset for the remaining steps. In the proposed LTS algorithm, the similarity of shapelets to raw signals is calculated point-to-point. Therefore, describing the signals in time units, as in conventional statistical analysis, is not suitable for our method. In the following analysis, we instead use the number of data points. A threshold is employed to identify the blockade current signals. To eliminate the falling and rising edges, we exclude 10 points from both ends of the signals. The histogram of the number of points obtained by using this preprocessing procedure shows a distribution consistent with the durations extracted by the MOSAIC software (Figure 2a-b).
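The thresholding and edge-trimming steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extract_blockades`, its parameters, and the convention that a blockade is any run of samples below the threshold are hypothetical, and the baseline subtraction mentioned in step (i) is not covered.

```python
def extract_blockades(trace, threshold, min_points, max_points, edge=10):
    """Return trimmed blockade segments found in a 1-D current trace.

    Assumptions (not from the paper): `trace` is a list of current samples,
    a blockade is a run of consecutive samples below `threshold`, and the
    length filter is applied after trimming `edge` points from both ends.
    """
    events = []
    start = None
    for idx, value in enumerate(trace):
        if value < threshold and start is None:
            start = idx                      # falling edge: blockade begins
        elif value >= threshold and start is not None:
            events.append((start, idx))      # rising edge: blockade ends
            start = None
    # A blockade still open at the end of the trace is incomplete and dropped.

    segments = []
    for begin, end in events:
        trimmed = trace[begin + edge:end - edge]   # drop falling/rising edges
        if min_points <= len(trimmed) <= max_points:
            segments.append(trimmed)
    return segments


# Toy trace: open current of 100, one long blockade and one very short one.
toy_trace = [100.0] * 20 + [40.0] * 50 + [100.0] * 20 + [40.0] * 5 + [100.0] * 5
toy_segments = extract_blockades(toy_trace, threshold=70.0, min_points=20, max_points=40)
```

Only the long blockade survives in this toy example; the short one is shorter than the two trimmed edges and is discarded by the length filter.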

Figure 2. Statistical analysis of AA3 and GA3 traversing the aerolysin pores. (a) The residual current histograms of AA3 and GA3. (b) The duration histograms of AA3 and GA3. The duration of each raw signal is fitted by MOSAIC software. (c) The histogram of the number of points for AA3 and GA3. A Gaussian distribution was used to fit the histograms. The data were obtained at a sampling frequency of 250 kHz.

For the discrimination of AA3 and GA3, we choose blockade current signals with a number of points ranging from 1100 to 1300. All the signals are padded at the end with zeros to reach the maximum length of 1300 points.

Time-series dataset. In this paper, a time-series dataset is a set of blockade current signals extracted by the data preprocessing step. It contains N signal samples, each of which has a length of Q. Therefore, the dataset is denoted as $T^{N \times Q}$. The number of molecule species corresponding to dataset T is C. We can assign a natural number from 1 to C to each category, thus defining the categorical variable $Y \in \{1,\cdots,C\}$.

Shapelets. As a segment of a time series, the shapelet itself is a time series. Each shapelet (S) contains sequential data of length L. Note that L is smaller than Q. A time-series dataset contains many segments from which a certain number (K) of shapelets can be selected or learned. Therefore, the K most discriminative shapelets are represented by $S \in \mathbb{R}^{K \times L}$, where $\mathbb{R}$ represents the real number domain.

Sliding window segment. The shapelet is considered as a segment of the time series. To compare the shapelets with the original sequence, we define a sliding window with the same length as the shapelet and a sliding step of 1. Therefore, each time series in the dataset contains $J$ segments, where $J = Q - L + 1$. In the i-th sample of the dataset, the segment whose start time is j can be expressed as $(T_{i,j},\cdots,T_{i,j+L-1})$. To capture information from segments of different lengths, shapelets with variable lengths can be adopted. We can start from a minimum length $L_{\min}$ and gradually increase to $R \cdot L_{\min}$, i.e., the lengths are $r \cdot L_{\min}$ with $r = 1,2,\cdots,R$. Therefore, for a fixed value of r, a time series of a blockade event contains $J = Q - r \cdot L_{\min} + 1$ segments, each of which can be expressed as $(T_{i,j},\cdots,T_{i,j+r \cdot L_{\min}-1})$. Selecting K shapelets at each of the R lengths then leads to a total of $R \cdot K$ shapelets. Therefore, the minimum length $L_{\min}$, the number of the most discriminative shapelets K, and the scale R are the crucial hyperparameters that need to be manually set before applying the LTS algorithm to the nanopore dataset. To simplify the setting of hyperparameters, $L_{\min}$ is specified as a fraction of the length Q. Then, K can be derived using Equation 1 (ref 32).

$$K = \lfloor \log_{10}(J) \cdot (C - 1) \rfloor \tag{1}$$
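The zero padding, the segment count J, and Equation 1 can be illustrated with a short sketch. The helper names `pad_to_length` and `shapelet_settings` are hypothetical, and evaluating Equation 1 separately for each shapelet length is an assumption made here for illustration; the text does not state which J is used when several lengths are combined.

```python
import math


def pad_to_length(signal, target_len):
    """Append zeros so that every signal reaches the common length Q."""
    return list(signal) + [0.0] * (target_len - len(signal))


def shapelet_settings(q, l_min_fraction, r_max, n_classes):
    """Map each shapelet length r * L_min to (J, K).

    J = Q - r * L_min + 1 is the number of sliding-window segments and
    K = floor(log10(J) * (C - 1)) follows Equation 1 as written in the text.
    Lengths that exceed the signal length are skipped (an assumption).
    """
    base_len = max(1, round(l_min_fraction * q))    # L_min in data points
    settings = {}
    for r in range(1, r_max + 1):
        length = r * base_len
        if length > q:
            break                                   # no window fits in the signal
        n_segments = q - length + 1
        k = math.floor(math.log10(n_segments) * (n_classes - 1))
        settings[length] = (n_segments, k)
    return settings


# Values taken from the text: Q = 1300 points, L_min = 0.1, R = 4, C = 2.
settings = shapelet_settings(q=1300, l_min_fraction=0.1, r_max=4, n_classes=2)
padded = pad_to_length([1.0, 2.0], 4)
```

With these inputs the shortest shapelets span 130 points, giving J = 1171 windows per signal and K = 3 under this reading of Equation 1.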

Shapelet-transformed representation. Since the minimum Euclidean distance between a shapelet and the optimal subsegments represents the similarity, the distance can be used as a representative feature in the shapelet feature space. By comparing the sliding window segments of the i-th sample in the dataset with a shapelet, J distance values can be obtained, and the minimum value is selected as the representation shown in Equation 2.
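The minimum-distance feature described in this paragraph can be sketched in a few lines. The function names are hypothetical, and the distance is taken here as the sum of squared point-wise differences over a window, which is an assumption about the exact form of Equation 2 rather than a confirmed detail of the authors' code.

```python
def min_shapelet_distance(series, shapelet):
    """Smallest sum of squared differences between `shapelet` and any
    same-length sliding window of `series` (window step of 1)."""
    length = len(shapelet)
    best = float("inf")
    for j in range(len(series) - length + 1):
        dist = sum((series[j + l] - shapelet[l]) ** 2 for l in range(length))
        best = min(best, dist)
    return best


def shapelet_transform(dataset, shapelets):
    """Turn each series into a feature vector with one minimum distance
    per shapelet, i.e. a row of the transformed dataset D."""
    return [[min_shapelet_distance(series, s) for s in shapelets]
            for series in dataset]


toy_dataset = [[0, 0, 1, 2, 1, 0], [0, 0, 0, 0]]
toy_shapelets = [[1, 2, 1], [0, 0]]
features = shapelet_transform(toy_dataset, toy_shapelets)
```

In the toy data the first series contains the peak-shaped shapelet exactly, so its first feature is zero, while the flat second series is far from that shapelet; a linear classifier can separate the two on that single coordinate.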

$$D_{i,r,k} = \lVert T^{*} - S_{r,k} \rVert_2 = \min_{j=1,\cdots,J} \sum_{l=1}^{r \cdot L_{\min}} \left(T_{i,j+l-1} - S_{r,k,l}\right)^2 \tag{2}$$

By shapelet transformation, the dataset $T^{N \times Q}$ is converted to $D^{N \times R \cdot K}$.

LTS algorithm. The LTS model is constructed using a stack structure with a logistic function as the top classifier layer. For the classification problem, cross entropy is generally chosen as the loss function, as shown in Equation 3:

$$\mathcal{L}(Y,\hat{Y}) = -Y \ln \sigma(\hat{Y}) - (1 - Y)\ln\left(1 - \sigma(\hat{Y})\right) \tag{3}$$

where $Y$ is the real target category, while $\hat{Y}$ is the prediction calculated by a linear function with weights $W \in \mathbb{R}^{R \cdot K + 1}$. The prediction $\hat{Y}_i$ for the i-th sample is derived from the shapelet-transformed representation, as shown in Equation 4,

$$\hat{Y}_i = W_0 + \sum_{k=1}^{R \cdot K} D_{i,k} W_k, \quad \forall i \in \{1,\cdots,N\} \tag{4}$$

A logistic sigmoid function $\sigma(\cdot)$ is defined as $\sigma(\hat{Y}) = (1 + e^{-\hat{Y}})^{-1}$. The function maps the above prediction to the range [0, 1], representing the probability that $D_{i,*}$ belongs to a certain class. Therefore, the parameters that need to be optimized in the LTS model are the shapelets S and the linear weights W. In the field of machine learning, an effective classifier is usually obtained by optimizing a loss function in the learning process. In this algorithm, the shapelets affect the performance of the transformation over the dataset, which in turn influences that of the classifier. To reduce the risk of overfitting and improve the robustness of the algorithm, we add a regularization term for the weights to the cross entropy function, which gives the final loss function denoted by $\mathcal{F}$ in Equation 5,

$$\operatorname*{argmin}_{S,W} \mathcal{F}(S,W) = \operatorname*{argmin}_{S,W} \sum_{i=1}^{N} \mathcal{L}(Y_i,\hat{Y}_i) + \lambda_W \lVert W \rVert^2 \tag{5}$$

where $\lambda_W$ is the coefficient of the regularization term utilized for reducing overfitting. It should be set no higher than 1 because a large value may make it difficult to update the linear weights. The LTS algorithm jointly updates the classifier parameters and shapelets until the loss function in Equation 5 is minimized. Therefore, optimizing Equation 5 requires deriving the gradients of the loss function $\mathcal{F}$ with respect to S and W. Another hyperparameter, the learning rate, denoted by $\eta$, controls the step size with which the parameters of the algorithm are adjusted along the gradient. $\eta$ is fixed at a small value of $\eta = 0.01$ (ref 32). The optimization details and some tricks for calculating gradients are described in ref 32.

Metrics. To validate the performance of the proposed method, we adopt the F1 score as the evaluation metric, as discussed below.

Table 1. Confusion matrix

                      Predicted class
Actual class          Positive               Negative
Positive              TP (True positive)     FN (False negative)
Negative              FP (False positive)    TN (True negative)

Here, the confusion matrix summarizes the results for further validation of our algorithm. As listed in Table 1, there are two types of actual classes, positive and negative. The predicted class is the result obtained by the LTS algorithm. The symbols FP and FN are used for the cases when the classifier fails, while TP and TN represent the correct outputs. Ideally, the nondiagonal entries of the confusion matrix for a validated classifier should be zero or close to zero. Here, the F1 score combines precision and recall into a single metric. Precision is the fraction of the cases predicted as positive by the classifier that are truly positive, while recall is the fraction of the actual positive cases that are correctly identified. These measurements are defined as

$$\text{precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{7}$$

According to Equations 6-7, the precision and recall differ from each other. Often, a high precision score is accompanied by a lower recall score. To combine the effects of precision and recall, we adopt the F1 score as follows:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{8}$$

Validation. Generally, the metrics should be computed on a held-out dataset, called a validation set, which is also used to further optimize the performance of a classifier. Since almost every machine learning algorithm can overfit during training, metrics that look good in validation become unreliable in a real application if some training set signals are also included in the validation set. Therefore, we should cautiously separate the dataset into a training set and a validation set without any duplicate data.
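As a small worked example of the metrics in Equations 6-8, the following sketch counts the confusion-matrix entries from label lists and derives precision, recall, and the F1 score. The function names and the toy labels are illustrative only; returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the text.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN for one class treated as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Equations 6-8; a zero denominator yields 0.0 by convention."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels: 1 stands for the positive class, 0 for the negative class.
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted, positive=1)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Because the F1 score is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which is why it is preferred here over reporting only one of them.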

Figure 3. Illustration of 5-fold cross validation. The term Model denotes the algorithm of a classifier, while the terms train and valid denote the training and validation sets, respectively. The dataset is first split into five equal subsets. Then, five models of the algorithm are trained on the five groups, resulting in a set of metrics of the same number. The average value of the five metrics is used to measure the performance of the algorithm.

Here, we use 5-fold cross validation (k-fold cross validation, where k = 5 is one of the most commonly used cases) to process the dataset (Figure 3). First, the dataset is split equally into 5 subsets. Every subset is kept as consistent as possible in data distribution; that is, the signals in each subset are obtained by stratified sampling from the dataset. Then, we train the classifier on the union of four subsets, while the remaining subset is used for validating the classifier. This yields five pairs of training and validation sets, which generate 5 scores. Further comparison and validation are based on the average of these results.
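The stratified 5-fold splitting described above can be sketched with the standard library alone. This is an illustrative routine under stated assumptions, not the splitting code used in the study: the function name `stratified_k_fold`, the fixed random seed, and the round-robin assignment of each class to folds are choices made here.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_indices, valid_indices) pairs.

    Indices of each class are shuffled and dealt round-robin into k folds,
    so every fold keeps roughly the class proportions of the full dataset.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    splits = []
    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, valid))
    return splits


# Toy labels: 10 events per analyte.
toy_labels = ["AA3"] * 10 + ["GA3"] * 10
splits = stratified_k_fold(toy_labels, k=5)
```

Each validation fold is disjoint from its training set, which addresses the concern about duplicated signals raised in the Validation paragraph; the five resulting scores would then be averaged.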

Results and Discussion

Application to labeled nanopore data. We ran the evaluation in a computing environment with an Intel® Xeon® CPU E5-2630 v4 @ 2.20 GHz, 64 GB of DDR4 2133 MHz RAM, and 64-bit Windows 10 Enterprise. The algorithm is implemented with tslearn,36 a Python package. The performance of the LTS algorithm relies on its hyperparameter settings. To validate the algorithm, it is first applied to the labeled nanopore data from AA3 and GA3 raw current traces. To prevent the algorithm from learning to identify the current amplitude, each labeled trace in the dataset is mean-centered by using Equation 9,

$$T_{i,j} = T^{\mathrm{raw}}_{i,j} - \frac{1}{Q}\sum_{m=1}^{Q} T^{\mathrm{raw}}_{i,m} \tag{9}$$

where $T^{\mathrm{raw}}$ represents the raw current traces.

First, the shapelet length is related to the number of segments in the dataset. A smaller Lmin provides more segments, which leads to more candidates for the algorithm to learn. This setting could slow the convergence of the algorithm. To accelerate the whole process, fewer parameters could be optimized to complete an update epoch. To confirm the efficiency of the LTS algorithm with respect to the shapelet length, minimum lengths of $L_{\min} \in \{0.01, 0.02, 0.05, 0.10, 0.20, 0.30\}$ are used to implement the 5-fold cross validation. The number of training epochs is fixed at 500, while the coefficient of the regularization term is set to $\lambda_W = 0.01$. With increasing shapelet length, the time cost of the algorithm learning from the training set gradually rises from 80 to 3000 seconds (Figure 4a). In addition, we also evaluate the algorithm with the scale factor R selected from $R \in \{1,2,3,4\}$. This scale factor R controls the length range of the shapelets. A larger value of R means that the algorithm has more parameters to optimize, which requires more time for each optimization epoch. A comparison of different R values reveals that the training time increases with the R value (Figure 4a). Moreover, the training time exhibits a logarithmic increase with the minimum shapelet length Lmin. The number of segments of a signal for calculating the distance to a shapelet decreases with increasing shapelet length when the length of the blockade signals is fixed. Thus, less time is required to determine the minimum distance, which reduces the total time cost in the case of a larger minimum length. The process of optimizing the shapelet parameters dominates the overall training process.

To evaluate the performance of the algorithm, we also calculate the F1 score using cross validation. Figure 4b shows the F1 scores for our hyperparameter settings, which illustrates that the performance improves as the minimum shapelet length Lmin and scale factor R increase. With a fixed value of Lmin, a larger value of R represents more combinations of individual scaled-length shapelets, providing more characteristics for a better classification with a high F1 score. For example, when Lmin = 0.05, the F1 score increases from 0.765 to 0.897 as the value of R increases. However, increasing Lmin further above 0.15 hardly affects the performance of the LTS algorithm, which shows a comparable F1 score.
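The mean-centering step of Equation 9, introduced at the start of this subsection, is simple enough to check by hand. The sketch below uses a hypothetical helper `center_trace` and made-up current values; it shows that two traces differing only by a constant amplitude offset become identical after centering, which is the reason the step prevents the classifier from relying on the blockade amplitude.

```python
def center_trace(raw_trace):
    """Subtract the per-trace mean from every sample (Equation 9)."""
    mean = sum(raw_trace) / len(raw_trace)
    return [value - mean for value in raw_trace]


# Two made-up traces with the same shape but a 10-unit amplitude offset.
trace_a = [40.0, 42.0, 41.0]
trace_b = [50.0, 52.0, 51.0]
centered_a = center_trace(trace_a)
centered_b = center_trace(trace_b)
```

After centering, only the shape of each blockade remains available to the shapelets.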

Figure 4. Comparison of efficiency and performance under different hyperparameter settings. (a) The time cost in the training process versus the minimum shapelet length Lmin with respect to minimum length and scale factor. (b) Performance measurement evaluated by the F1 score. (c)-(e) F1 score calculated under different hyperparameter settings with five levels of noise from 0.0 to 0.5 pA.

For instance, setting R = 4 and Lmin = 0.2 generates segments with a length 80% of that of the signals. These segments show a similar performance to R = 4, Lmin = 0.1. These results further demonstrate that a shapelet from a local segment of the raw current traces preserves the discriminative information for the classification of nanopore data. Therefore, the LTS algorithm achieves the best performance in the case of Lmin = 0.1, R = 4 with an F1 score of 0.933. Moreover, the LTS algorithm efficiently processes the raw current traces directly without further filtering.

Robustness to noise. Usually, the raw current traces contain a considerable amount of noise. For example, the RMS value is 0.61 in the aerolysin experiments on AA3 and GA3. Therefore, shapelets should be insensitive to noise to yield a minimum distance at the right position. To assess the robustness of the algorithm to noise, we compared the performance of the LTS algorithm by searching hyperparameter values from $L_{\min} \in \{0.05, 0.1, 0.2\}$ and $R \in \{2,3,4\}$. We trained the algorithm on the dataset mixed with zero-mean Gaussian white noise with standard deviations (s.d.) of 0.0, 0.1, 0.2, 0.3, 0.4 and 0.5 pA. Increasing the noise level makes it more difficult to calculate the distances while ignoring the noise data, and the minimum distance may be disturbed by some instantaneous large noises (Figure 4c-e). Furthermore, we also find that the F1 scores are greater than 0.85 with $L_{\min} \in \{0.1, 0.2\}$, which is better than is obtained with shorter shapelets. The curves become flatter, and the range of the F1 score decreases from 0.135 to 0.050 with increasing Lmin, which indicates that the algorithm performance is more stable and robust with longer shapelets (Figure 4c-e). From the comparison of the F1 score on the noisy data, we may conclude that our method is very tolerant of noise with longer shapelets, i.e., $L_{\min} \in \{0.1, 0.2\}$, and is thus suitable for processing unfiltered data with a lower signal-to-noise ratio. Considering the efficiency of the algorithm, shorter shapelets are more efficient but result in a serious loss of accuracy.
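The noise-injection part of this robustness test can be sketched as follows. The helper names, the fixed seed, and the flat toy signal are assumptions for illustration; only the noise model (zero-mean Gaussian white noise) and the six standard deviations come from the text.

```python
import random

# Standard deviations of the added noise, in pA, as listed in the text.
NOISE_LEVELS_PA = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]


def add_gaussian_noise(signal, sd, rng):
    """Return a copy of `signal` with zero-mean Gaussian noise of s.d. `sd`."""
    if sd == 0:
        return list(signal)
    return [value + rng.gauss(0.0, sd) for value in signal]


def noisy_copies(dataset, seed=0):
    """Build one noisy version of the dataset for each noise level."""
    rng = random.Random(seed)
    return {sd: [add_gaussian_noise(signal, sd, rng) for signal in dataset]
            for sd in NOISE_LEVELS_PA}


# Toy dataset: a single flat signal of 1000 points.
copies = noisy_copies([[1.0] * 1000])
```

Each noisy copy would then be passed through the same training and cross-validation procedure, so that the F1 score can be compared across noise levels for every (Lmin, R) setting.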
Comparing the trends of the time cost and F1 score shows that the gain in F1 score becomes small for $L_{\min} \in \{0.1, 0.2\}$. Therefore, $L_{\min} = 0.1$, $R = 4$ is used to set the hyperparameters since it achieves the best F1 score with acceptable efficiency.

Real sample application to nanopore data from AA3 and GA3 mixture. In this section, we applied the LTS algorithm to an aerolysin nanopore experiment to identify AA3 and GA3 in the mixture. As illustrated in Figure 5, the populations of AA3 and GA3 overlap completely regardless of the mixing ratio. As described in the data preprocessing section, we selected signals with numbers of points ranging from 500 to 700.


Figure 5. The scatter plots of duration versus residual current of a 1:1 mixture (a) and a 4:3 mixture (b) of AA3 and GA3. The LTS identified populations of AA3 (blue) and GA3 (green) in the 1:1 mixture (c) and 4:3 mixture (d). After implementing the LTS algorithm, each event could be assigned to either the AA3 population or the GA3 population. Insert: the prediction ratios of AA3 and GA3 in the mixture. A potential of 100 mV was applied to perform the experiments in 1 M KCl, 10 mM Tris and 1 mM EDTA buffer at pH 8.0.

We statistically analyzed the proportions of AA3 and GA3 predicted by the LTS algorithm (Figure 5). The proportion of AA3 in the 1:1 mixture was 0.593 (Figure 5c), which is consistent with the finding that AA3 is captured by aerolysin nanopores nearly 1.59-fold more frequently than GA3 (Figure S1). To further confirm the accuracy of the LTS predictions, the events in a 4:3 mixture of AA3 and GA3 were analyzed. With the same hyperparameter setting, the predicted proportion of AA3 increased to 0.695, which reflects the increase in the AA3 concentration ratio (Figure 5d). The prediction of the LTS algorithm was then verified with a 3:4 mixture. As expected, the proportion of AA3 decreased as the concentration of AA3 in the mixture decreased (Figure S2). All these results demonstrate that the LTS algorithm can discriminate analytes that severely overlap in conventional statistics.
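The consistency check above can be made explicit with a short calculation. The sketch below assumes that the event frequency of each analyte is proportional to its concentration multiplied by its relative capture rate; this simple model and the function name `expected_event_fraction` are assumptions introduced for illustration. Only the 1.59-fold capture ratio and the predicted proportions 0.593 and 0.695 come from the text.

```python
def expected_event_fraction(conc_a, conc_b, capture_ratio):
    """Expected share of events from analyte A in an A:B mixture.

    Assumption: event frequency is proportional to concentration times
    capture rate, with capture_ratio = rate_A / rate_B at equal
    concentration.
    """
    weighted_a = conc_a * capture_ratio
    return weighted_a / (weighted_a + conc_b)


CAPTURE_RATIO = 1.59  # AA3 relative to GA3, as stated in the text

# (mixture label, AA3 parts, GA3 parts, proportion predicted by LTS)
MIXTURES = (("1:1", 1, 1, 0.593), ("4:3", 4, 3, 0.695))

for label, aa3, ga3, predicted in MIXTURES:
    expected = expected_event_fraction(aa3, ga3, CAPTURE_RATIO)
    print(f"{label}: expected {expected:.3f}, LTS prediction {predicted:.3f}")
```

Under this model the expected AA3 shares are about 0.614 for the 1:1 mixture and 0.679 for the 4:3 mixture, so both LTS predictions lie within roughly 0.02 of the capture-rate-weighted expectation.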

Conclusion In this paper, we developed the LTS algorithm to process severely overlapped blockade current signals in order to improve the sensing ability of nanopore techniques. The LTS algorithm successfully identifies the blockade current of a single target molecule. In contrast to conventional statistical methods that rely on duration and residual current level, the shapelet-transformed representation provides a clearly discriminated distribution for multiple analytes. Combined with a simple linear classifier, the presented method distinguishes AA3 and GA3 in the mixture with high precision. Our results also confirm that the shapelets are suitable for processing noisy nanopore data. Taking advantage of the robust LTS algorithm, one could anticipate the real-time analysis of nanopore events for the direct identification and quantification of multiple biomolecules in a complex real sample (e.g., serum). Since the shapelets contain typical features of time-series current traces, this algorithm should help to deepen the understanding of the dynamic interactions between nanopores and analytes. A previous study showed that a wavelet algorithm increases the signal-to-noise ratio of nanopore data.37 With wavelets as a preprocessing step, the presented LTS algorithm may exhibit a stronger tolerance to noise, leading to more efficient signal discrimination. The implementation of the algorithm with the Python package tslearn benefits from well-known open-source machine learning frameworks such as TensorFlow38 and Keras.39 Thus, the algorithm can also be trained efficiently on a GPU. Recently, researchers have proposed more effective algorithms for time-series subsequence discovery, such as Matrix Profile. Incorporating these algorithms could further improve the classification effectiveness of shapelets.40,41 Overfitting remains a universal concern in machine learning algorithms, and mitigating it could further improve the performance of the presented algorithm. Note that the shapelet-based method focuses on time-series similarity rather than on the sequentially changing pattern in the current trace of nanopore sequencing. The advantage of the LTS algorithm is therefore its ability to assign each whole current trace to a corresponding target with similar features. We thus expect that more applications of the LTS algorithm could be realized in microRNA detection, posttranscriptional nucleotide identification, DNA damage sensing, peptide discrimination, posttranslational modification analysis, small molecule identification, etc.
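The pipeline summarized in the conclusion, a shapelet transform followed by a simple linear classifier, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions and not the authors' implementation: the two shapelets, the toy traces and the classifier weights are hypothetical and set by hand, whereas the LTS algorithm learns shapelets and weights from data. The distance is taken as the mean squared difference minimized over all window positions, following the learning-shapelets formulation (ref 32), and the noise s.d. of 0.5 mirrors the highest noise level tested in the text (units here are arbitrary).

```python
import random


def min_distance(trace, shapelet):
    """Return (distance, start) of the window of `trace` closest to `shapelet`.

    The distance is the mean squared difference over the shapelet length.
    """
    n = len(shapelet)
    best, best_start = float("inf"), -1
    for start in range(len(trace) - n + 1):
        d = sum((trace[start + i] - shapelet[i]) ** 2 for i in range(n)) / n
        if d < best:
            best, best_start = d, start
    return best, best_start


def shapelet_transform(trace, shapelets):
    """Map a trace to one minimum-distance feature per shapelet."""
    return [min_distance(trace, s)[0] for s in shapelets]


def linear_predict(features, weights, bias=0.0):
    """Linear decision rule on the shapelet-transformed features."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return "A" if score > 0 else "B"


def noisy_trace(pattern, start, length, sd, rng):
    """Flat baseline with `pattern` embedded at `start`, plus Gaussian noise."""
    clean = [0.0] * length
    clean[start:start + len(pattern)] = pattern
    return [x + rng.gauss(0.0, sd) for x in clean]


# Hypothetical shapelets (arbitrary units): a plateau and a double spike.
SHAPELET_A = [0.0, -5.0, -5.0, -5.0, -5.0, 0.0]
SHAPELET_B = [0.0, -5.0, 0.0, 0.0, -5.0, 0.0]
# Hand-set weights on [d_A, d_B]: the score is positive when the trace
# is closer to shapelet A than to shapelet B.
WEIGHTS = [-1.0, 1.0]

rng = random.Random(0)
trace_a = noisy_trace(SHAPELET_A, 15, 40, 0.5, rng)
trace_b = noisy_trace(SHAPELET_B, 20, 40, 0.5, rng)

for name, trace in (("trace_a", trace_a), ("trace_b", trace_b)):
    feats = shapelet_transform(trace, [SHAPELET_A, SHAPELET_B])
    print(name, [round(f, 2) for f in feats], linear_predict(feats, WEIGHTS))
```

Because each feature depends only on the best-matching local window, the noise added to the rest of the trace has little influence on the decision, which is the intuition behind the noise tolerance reported for longer shapelets.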

EXPERIMENTAL METHODS Complete experimental methods are provided in the Supporting Information.

ASSOCIATED CONTENT Supporting Information. Detailed materials and methods, the capture rates of AA3 and GA3 by the aerolysin nanopore, and the scatter plots of duration versus residual current of the 3:4 mixture of AA3 and GA3 are provided in the Supporting Information.

This material is available free of charge via the Internet at http://pubs.acs.org. The source code is publicly available at https://github.com/zixuanweeei/LSNano.

AUTHOR INFORMATION Corresponding Author * [email protected]; [email protected]

Present Addresses ORCID Yi-Lun Ying: 0000-0001-6217-256X Yi-Tao Long: 0000-0003-2571-7457

ACKNOWLEDGMENTS This research was supported by the National Natural Science Foundation of China (61871183 and 21834001) and the Innovation Program of the Shanghai Municipal Education Commission (2017-01-07-00-02-E00023). Yi-Lun Ying is sponsored by the National Ten Thousand Talent Program for young top-notch talent and the Shanghai Rising-Star Program (19QA1402300).

References
(1) Kasianowicz, J. J.; Brandin, E.; Branton, D.; Deamer, D. W. Characterization of Individual Polynucleotide Molecules Using a Membrane Channel. Proc. Natl. Acad. Sci. U. S. A. 1996, 93, 13770–13773.
(2) Meller, A.; Nivon, L.; Brandin, E.; Golovchenko, J.; Branton, D. Rapid Nanopore Discrimination between Single Polynucleotide Molecules. Proc. Natl. Acad. Sci. U. S. A. 2000, 97, 1079–1084.
(3) Thompson, J. F.; Oliver, J. S. Mapping and Sequencing DNA Using Nanopores and Nanodetectors. Electrophoresis 2012, 33, 3429–3436.
(4) Tian, K.; Gu, L. Q. Nanopore Single-Molecule Dielectrophoretic Detection of Cancer-Derived MicroRNA Biomarkers. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS; 2013; pp 6821–6824.
(5) Bohmann, K.; Evans, A.; Gilbert, M. T. P.; Carvalho, G. R.; Creer, S.; Knapp, M.; Yu, D. W.; de Bruyn, M. Environmental DNA for Wildlife Biology and Biodiversity Monitoring. Trends Ecol. Evol. 2014, 29, 358–367.
(6) Ying, Y. L.; Cao, C.; Long, Y. T. Single Molecule Analysis by Biological Nanopore Sensors. R. Soc. Chem. 2014, 139, 3826–3835.
(7) Castro-Wallace, S. L.; Chiu, C. Y.; John, K. K.; Stahl, S. E.; Rubins, K. H.; McIntyre, A. B. R.; Dworkin, J. P.; Lupisella, M. L.; Smith, D. J.; Botkin, D. J.; et al. Nanopore DNA Sequencing and Genome Assembly on the International Space Station. Sci. Rep. 2017, 7, 18022.
(8) Zhang, J.; Liu, X.; Ying, Y. L.; Gu, Z.; Meng, F. N.; Long, Y. T. High-Bandwidth Nanopore Data Analysis by Using a Modified Hidden Markov Model. Nanoscale 2017, 9, 3458–3465.
(9) Ying, Y. L.; Long, Y. T. Single-Molecule Analysis in an Electrochemical Confined Space. Sci. China Chem. 2017, 60, 1187–1190.
(10) Cao, C.; Long, Y. T. Biological Nanopores: Confined Spaces for Electrochemical Single-Molecule Analysis. Acc. Chem. Res. 2018, 51, 331–341.
(11) Liu, S. C.; Li, M. X.; Li, M. Y.; Wang, Y. Q.; Ying, Y. L.; Wan, Y. J.; Long, Y. T. Measuring a Frequency Spectrum for the Single-Molecule Interactions with a Confined Nanopore. Faraday Discuss. 2018, 210, 87–99.
(12) Forstater, J. H.; Briggs, K.; Robertson, J. W.; Ettedgui, J.; Marie-Rose, O.; Vaz, C.; Kasianowicz, J. J.; Tabard-Cossa, V.; Balijepalli, A. MOSAIC: A Modular Single-Molecule Analysis Interface for Decoding Multistate Nanopore Data. Anal. Chem. 2016, 88, 11900–11907.
(13) Balijepalli, A.; Ettedgui, J.; Cornio, A. T.; Robertson, J. W. F.; Cheung, K. P.; Kasianowicz, J. J.; Vaz, C. Quantifying Short-Lived Events in Multistate Ionic Current Measurements. ACS Nano 2014, 8, 1547.
(14) Gu, Z.; Ying, Y. L.; Cao, C.; He, P.; Long, Y. T. Accurate Data Process for Nanopore Analysis. Anal. Chem. 2015, 87, 907–913.
(15) Kawano, R.; Schibel, A. E. P.; Cauley, C.; White, H. S. Controlling the Translocation of Single-Stranded DNA through α-Hemolysin Ion Channels Using Viscosity. Langmuir 2009, 25, 1233–1237.
(16) Butler, T. Z.; Pavlenok, M.; Derrington, I. M.; Niederweis, M.; Gundlach, J. H. Single-Molecule DNA Detection with an Engineered MspA Protein Nanopore. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 20647–20652.
(17) Wendell, D.; Jing, P.; Geng, J.; Subramaniam, V.; Lee, T. J.; Montemagno, C.; Guo, P. Translocation of Double-Stranded DNA through Membrane-Adapted Phi29 Motor Protein Nanopores. Nat. Nanotechnol. 2009, 4, 765–772.
(18) Soskine, M.; Biesemans, A.; De Maeyer, M.; Maglia, G. Tuning the Size and Properties of ClyA Nanopores Assisted by Directed Evolution. J. Am. Chem. Soc. 2013, 135, 13456–13463.
(19) Mohammad, M. M.; Iyer, R.; Howard, K. R.; McPike, M. P.; Borer, P. N.; Movileanu, L. Engineering a Rigid Protein Tunnel for Biomolecular Detection. J. Am. Chem. Soc. 2012, 134, 9521–9531.
(20) Aoki, T.; Hirano, M.; Takeuchi, Y.; Kobayashi, T.; Yanagida, T.; Ide, T. Single Channel Properties of Lysenin Measured in Artificial Lipid Bilayers and Their Applications to Biomolecule Detection. Proc. Jpn. Acad. Ser. B. Phys. Biol. Sci. 2010, 86, 920–925.
(21) Goyal, P.; Krasteva, P. V.; Van Gerven, N.; Gubellini, F.; Van Den Broeck, I.; Troupiotis-Tsaïlaki, A.; Jonckheere, W.; Péhau-Arnaudet, G.; Pinkner, J. S.; Chapman, M. R.; et al. Structural and Mechanistic Insights into the Bacterial Amyloid Secretion Channel CsgG. Nature 2014, 516, 250–253.
(22) Brown, C. G.; Clarke, J. Nanopore Development at Oxford Nanopore. Nat. Biotechnol. 2016, 34, 810–811.
(23) Stefureac, R.; Long, Y. T.; Kraatz, H. B.; Howard, P.; Lee, J. S. Transport of α-Helical Peptides through α-Hemolysin and Aerolysin Pores. Biochemistry 2006, 45, 9172–9179.
(24) Li, S.; Cao, C.; Yang, J.; Long, Y.-T. Detection of Peptides with Different Charges and Lengths by Using the Aerolysin Nanopore. ChemElectroChem 2018, 6, 126–129.
(25) Pastoriza-Gallego, M.; Rabah, L.; Gibrat, G.; Thiebot, B.; Van Der Goot, F. G.; Auvray, L.; Betton, J. M.; Pelta, J. Dynamics of Unfolded Protein Transport through an Aerolysin Pore. J. Am. Chem. Soc. 2011, 133, 2923–2931.
(26) Wang, Y.; Montana, V.; Grubišić, V.; Stout, R. F.; Parpura, V.; Gu, L. Q. Nanopore Sensing of Botulinum Toxin Type B by Discriminating an Enzymatically Cleaved Peptide from a Synaptic Protein Synaptobrevin 2 Derivative. ACS Appl. Mater. Interfaces 2015, 7, 184–192.
(27) Fennouri, A.; Daniel, R.; Pastoriza-Gallego, M.; Auvray, L.; Pelta, J.; Bacri, L. Kinetics of Enzymatic Degradation of High Molecular Weight Polysaccharides through a Nanopore: Experiments and Data-Modeling. Anal. Chem. 2013, 85, 8488–8492.
(28) Baaken, G.; Halimeh, I.; Bacri, L.; Pelta, J.; Oukhaled, A.; Behrends, J. C. High-Resolution Size-Discrimination of Single Nonionic Synthetic Polymers with a Highly Charged Biological Nanopore. ACS Nano 2015, 9, 6443–6449.
(29) Coin, L. J. M.; Teng, H.; Cao, M. D.; Duarte, T.; Hall, M. B.; Wang, S. Chiron: Translating Nanopore Raw Signal Directly into Nucleotide Sequence Using Deep Learning. Gigascience 2018, 7, 1–9.
(30) Wang, Y. Q.; Cao, C.; Ying, Y. L.; Li, S.; Wang, M. B.; Huang, J.; Long, Y. T. Rationally Designed Sensing Selectivity and Sensitivity of an Aerolysin Nanopore via Site-Directed Mutagenesis. ACS Sensors 2018, 3, 779–783.
(31) Ye, L.; Keogh, E. Time Series Shapelets: A New Primitive for Data Mining. In KDD ’09 Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining; 2009; pp 947–956.
(32) Grabocka, J.; Schilling, N.; Wistuba, M.; Schmidt-Thieme, L. Learning Time-Series Shapelets. In KDD ’14 Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining; 2014; pp 392–401.
(33) Ji, C.; Liu, S.; Yang, C.; Pan, L.; Wu, L.; Meng, X. A Shapelet Selection Algorithm for Time Series Classification: New Directions. Procedia Comput. Sci. 2018, 129, 461–467.
(34) Zhang, Q.; Wu, J.; Yang, H.; Tian, Y.; Zhang, C. Unsupervised Feature Learning from Time Series. In Proceedings of the 25th International Joint Conference on Artificial Intelligence; 2016; pp 2322–2328.
(35) Lines, J.; Davis, L. M.; Hills, J.; Bagnall, A. A Shapelet Transform for Time Series Classification. In KDD ’12 Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining; 2012; pp 289–297.
(36) Tavenard, R. Tslearn: A Machine Learning Toolkit Dedicated to Time-Series Data. 2017.
(37) Drndic, M.; Shepard, K. L.; Chien, C.-C.; Marks, A.; Ong, P.; Shekar, S.; Clarke, O. B.; Hartel, A. Wavelet Denoising of High-Bandwidth Nanopore and Ion-Channel Signals. Nano Lett. 2019, 19, 1090–1097.
(38) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16); 2016; pp 265–283.
(39) Chollet, F.; others. Keras. 2015.
(40) Yeh, C. C. M.; Zhu, Y.; Ulanova, L.; Begum, N. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets. In 2016 IEEE 16th International Conference on Data Mining; IEEE, 2016; pp 1317–1322.
(41) Yeh, C. C. M.; Zhu, Y.; Ulanova, L.; Begum, N.; Ding, Y.; Dau, H. A.; Zimmerman, Z.; Silva, D. F.; Mueen, A.; Keogh, E. Time Series Joins, Motifs, Discords and Shapelets: A Unifying View That Exploits the Matrix Profile. Data Min. Knowl. Discov. 2018, 32, 83–123.
