Parallelization of Change Point Detection - The Journal of Physical

Jun 15, 2017 - Nancy Song and Haw Yang. Department of Chemistry, Princeton University,, Princeton, New Jersey 08544, United States. J. Phys. Chem...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/JPCA

Parallelization of Change Point Detection Nancy Song and Haw Yang* Department of Chemistry, Princeton University,, Princeton, New Jersey 08544, United States S Supporting Information *

ABSTRACT: The change point detection method (Watkins, L. P.; Yang, H. J. Phys. Chem. B 2005, 109, 617) allows the objective identification and isolation of abrupt changes along a data series. Because this method is grounded in statistical tests, it is particularly powerful for probing complex and noisy signals without artificially imposing a kinetics model. The original algorithm, however, has a time complexity of 6(N 2), where N is the size of the data and is, therefore, limited in its scalability. This paper puts forth a parallelization of change point detection to address these time and memory constraints. This parallelization method was evaluated by applying it to changes in the mean of Gaussian-distributed data and found that time decreases superlinearly with respect to the number of processes (i.e., parallelization with two processes takes less than half of the time of one process). Moreover, there was minimal reduction in detection power. These results suggest that our parallelization algorithm is a viable scheme that can be implemented for other change point detection methods.



algorithm has a runtime complexity of 6(N 2) over data size N (see Figure 7) such that it may become extremely time and memory consuming, and in some cases impractical, when applied to large data sets. For instance, in fluorescence spectroscopy, if one assumes that the average detected photon flux is 100 kHz and the average length of a trajectory is 20−30 min long, this leads to about 180 million data points. For recursive change point detection,5 the program would need about 2 GB of memory. Considering that the typical RAM on a computer is 2−8 GB, it is unfeasible for a regular desktop computer to use such a program without running into memory constraints. This paper presents a new parallelization algorithm for change point detection that implements unique segmentation of the data to different processors. A previously proposed parallelization algorithm has achieved linear scaling in runtime reduction with respect to the number of processes, runtime ∝ (number of processes)−0.98; however, it requires two passes of the data.18 The approach detailed here requires only one pass, which leads to better scaling with increasingly large data sets. As an example, change point detection using one process takes about 281 s, while using 10 parallel processes takes under 7 s, representing an over 40-fold reduction runtime. Although this parallelization algorithm is applied to the specific example of Gaussian mean change point detection, which itself has not been explicitly characterized in the literature, it is applicable to

INTRODUCTION Time series data are sets of observations taken at specific times of a studied system. Though time series are usually associated with time-resolved or time-dependent techniques, they can be generated simply from sampling measurements at different times and, therefore, arise naturally in methods used to investigate chemical reaction kinetics and dynamics. When analyzing a time series, detecting change pointsdiscrete jumps in measurement without the imposition of underlying modelsis a first step in data analysis because they are key to unraveling the underlying dynamics of a system such as a single molecule or nanoparticle1,2 and should be conducted with scrutiny. Many statistically robust methods have been proposed for single change point detection, including Bayesian statistics3 and maximum likelihood estimates,4 to name a few. However, it is unlikely that time series will only contain one change point; algorithms for detecting multiple change points have been developed including recursive binary segmentation5,6 and linear runtime algorithms.7 In spectroscopic or imaging applications, for example, the binary segmentation change point detection method has been used to evaluate the number of states of single quantum dots,8 blinking statistics for quantum dots9,10 and organic dyes,11 and electron transfer kinetics.12 It has also been used to investigate biological phenomenon, including single-enzyme turnover,13 the number of subunits in a protein,14,15 the number of proteins in bacteria,16 and how a virus-like particle approaches a live cell.17 As the application scope of change point detection expands, however, current algorithms may be limiting in their scalability in terms of runtime and computational memory. For example, the aforementioned recursive binary segmentation © XXXX American Chemical Society

Received: May 8, 2017 Revised: June 14, 2017 Published: June 15, 2017 A

DOI: 10.1021/acs.jpca.7b04378 J. Phys. Chem. A XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry A

izations relevant to chemical applications; these characterizations will also be considered in evaluation of the proposed parallelization scheme. Gaussian Mean Change Point Detection. For a time series of N observations, all observations xi, i = 1...N, are assumed to be Gaussian-distributed with equal variance σ2. Two hypotheses, HA and H0, are considered; HA states that there exists a change point at k such that

other change point detection methods and perhaps other analyses that prove parallelizable in a similar manner to change point detection as well. The remainder of this report is organized as follows: the detection of a single change point is treated as a hypothesis test problem with a generalized likelihood ratio test and is characterized by considering time series with a single change point. This method is then generalized to time series containing multiple change points through a novel segmentation scheme amenable to parallelization, wherein recursive binary segmentation is used to detect and isolate change points within each segment. The performance of the parallelization scheme is evaluated in terms of the number of processes and the size of data sets.



xi ∼ f (μ1 , σ )

i = 1 ... k

xi ∼ f (μ2 , σ )

i = k + 1 ... N

where “ ∼ ” denotes a drawing action from a probability density

THEORETICAL BACKGROUND To provide a concrete illustration for these ideas, this work considers detecting changes in the mean of a Gaussiandistributed time series using a generalized likelihood ratio test.4 The generalized likelihood ratio test was chosen because it provides an objective and statistically rigorous way of finding change points in noisy time series data that can be implemented independent of any assumptions of the underlying kinetic model of the system being studied. Figure 1 shows

2 2 1 e−(x − μ) /2σ 2π σ

function where f (μ , σ ) ≡

is a Gaussian

density. The quantities of μ1 and μ2 are estimated using k

1 k

μ1̂ =

∑ xi

μ2̂ =

i=1

1 N−k

N



xi

i=k+1

The likelihood LA of hypothesis HA is thus k

LA =

n

∏ f (i ; μ1̂ , σ ) i=1

=



f (i ; μ2̂ , σ )

i=k+1

k N 2 2 2 2 1 e∑i=1[−(xi − μ1̂ ) /2σ ] +∑i=k +1[−(xi − μ2̂ ) /2σ ] N (σ 2π )

The alternative hypothesis, HA, is compared to the null hypothesis, H0, which states that there is no change point xi ∝ f (μ0 , σ )

i = 1 ... N

where μ0 is the average observed measurement of the entire trajectory and can be estimated using μ̂0 =

1 N

N

∑i = 1 xi . The

likelihood L0 of H0 is n

Figure 1. Change point detection using a log-likelihood ratio test on the sample time series. Time series (blue) were generated from Gaussian-distributed random points of length 500. Black lines are the “true” states of the trajectory, and red dashed lines are states determined from log-likelihood change point detection. (a) The average intensity values were μ1 = 36.5 and μ2 = 16.5. Both intensity states had variance σ1 = σ2 = 5.1. (b) The average intensity values were μ1 = 30.5 and μ2 = 40.5. Both intensity states had variance σ1 = σ2 = 8.5.

L0 =

∏ f (i ; μ0̂ , σ ) i=1

=

N 2 2 1 e∑i=1[−(xi − μ0̂ ) /2σ ] N (σ 2π )

To compare the likelihood of HA to H0, one considers the loglikelihood ratio of these two hypotheses k

two examples of applying the Gaussian mean change point analysis to simulated trajectories. The statistical test performs well even when the trajectories contain multiple change points in quick succession (Figure 1a) or with a low signal-to-noise ratio (Figure 1b). Time series data that exhibit similar characteristics include changes in the recorded emission intensities of single molecules (or molecular clusters) and particles by camera-based detectors whose noise characteristics can be approximated by Gaussian distributions, changes in the velocity of molecular motors or cargo trafficked in a cell, and kymographs in chemical analyses. First, the theory for this detection method is outlined, followed by detailed character-

n

∏i = 1 f (i ; μ1̂ , σ ) ∏i = k + 1 f (i ; μ2̂ , σ )

3k = ln

n

∏i = 1 f (i ; μ0̂ , σ ) k

=−

1 [∑ (xi − μ1̂ )2 + 2σ 2 i = 1 N

N



(xi − μ2̂ )2

i=k+1

−∑ (xi − μ0̂ ) ] 2

i=1

This can be further simplified given the values of μ̂1, μ̂ 2, and μ̂0 = B

1 N

N

∑i = 1 xi . Defining the following DOI: 10.1021/acs.jpca.7b04378 J. Phys. Chem. A XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry A k

Sk =

1 (∑ xi)2 k i=1 N

SN − k =

1 ( ∑ xi)2 N − k i=k+1 N

S0 =

1 (∑ xi)2 N i=1

the log-likelihood ratio can be written as 3k =

1 [Sk + SN − k − S0] 2σ 2

(1)

The ratio above describes the likelihood that a change point occurs at point k; therefore, the most likely location of a change point for N observations is the location at which the loglikelihood ratio is at a maximum AN = max {3k} 1≤k 1, then the scaling is superlinear. In Figure 9, the same analysis in Figure 7 is considered, but now the x-axis is the number of processes. Figure 9a shows the results of the actual runtime, while Figure 9b shows the scaled runtime, where the actual runtime is multiplied by the number of processes. Different time series lengths are indicated by the different colors. In Table 2, b2 increases significantly as the series length increases, suggesting that the runtime decreases superlinearly

Table 3. Error Rates in the Overlap Regions in Parallelized Gaussian Mean Change Point Detection series length 5 5 5 5 1 1 1 1

Table 2. Fitted Parameters for Equation 6 in Time Complexity Analysis with Respect to the Number of Processes series length

b1

5 × 105

2.81 (2.79, 2.83) 6.15 (6.11, 6.19) 15.1 (15.0, 15.2) 45.3 (44.5, 46.2) 59.5 (58.9, 60.1) 80.6 (79.7, 81.6) 127 (125, 130)

1 × 106 2 × 106 4 × 106 5 × 106 6 × 106 8 × 106 10 × 10

6

222 (219, 224)

b2 1.11 (1.10, 1.13) 1.14 (1.12, 1.16) 1.26 (1.23, 1.28) 1.46 (1.39, 1.53) 1.45 (1.41, 1.49) 1.49 (1.44, 1.54) 1.49 (1.41, 1.56) 1.70 (1.64, 1.75)

b3

R

1.0000

0.87 (0.30, 1.44)

0.9995

0.92 (0.51, 1.33)

0.9999

1.17 (0.52, 1.81)

0.9999

1.02 (−0.59, 2.63) 2.63 (1.09, 4.16)

0.9997

106 106 106 106 107 107 107 107

1 2 5 10 1 2 5 10

overall error rate 5.217 5.219 5.222 5.230 5.218 5.219 5.220 5.222

× × × × × × × ×

10−4 10−4 10−4 10−4 10−4 10−4 10−4 10−4

error rate in overlap region 5.217 5.465 5.510 5.426 5.218 5.485 5.466 5.451

× × × × × × × ×

10−4 10−3 10−3 10−3 10−4 10−3 10−3 10−3

error rate in other regions 5.217 5.217 5.214 5.210 5.218 5.218 5.216 5.213

× × × × × × × ×

10−4 10−4 10−4 10−4 10−4 10−4 10−4 10−4

specific to the aforementioned simulations where 10 000 traces of different data sizes were analyzed using the parallelized Gaussian mean change point detection method. The preset type-I error rate was 0.05. The magnitude of the difference between the error rates in the overlap regions and the other regions is similar to the ratio observed between data points near the ends of time series in Figure 2, around 5−10 times higher at the ends of segments. To address the increased error rate in overlap regions, one can adjust the original change point detection method to have a more uniform error rate across the entire segment, as proposed by Henderson.20 Another consideration of the overlap region is its length. For analysis of the simulated data conducted here, the overlap region has been preset to be 200. However, depending on the nature of the data, the overlap length should vary: If the data is noisier, then the overlap region should be longer; likewise, if there are fewer change points, the overlap region should be longer to address the possibility that the time series is segmented at a change point. As such, one may imagine that the program can be modified to first do a preliminary pass of the data to determine the overlap region. The specific implementation of determining this overlap region is not carried out here; rather, this is left to future users of this parallel scheme as they would be more familiar with the particular data set that they are analyzing.

2

0.051 (0.035, 0.068) 0.081 (0.044, 0.12) 0.25 (0.17, 0.34)

× × × × × × × ×

processes

1.0000 0.9999

0.9999

with the number of processes. Indeed, in Figure 9b, if the runtime decreases linearly, then one would expect a horizontal straight line as the number of processes increased, but that is not observed. This superlinear increase in time can be attributed to the time complexity of the multiple change point detection method employed here, which itself is not linear. Overlap Region. Because the parallelization scheme has overlapping segments, it is expected that there is a higher type-I error rate in the overlap region because it is near the ends of the processed segments and the error rates are higher for the loglikelihood ratio tests (cf. Figure 2). Indeed, when comparing the specific error rates from the previous simulations, the overlap region has a type-I error rate 10 times higher than that of the nonoverlapping regions (Table 3). These error rates are



CONCLUDING REMARKS Change point detection is a crucial step in understanding the characteristics of a system described by a time series. However, the change point algorithm originally put forth for analysis of time series data more than a decade ago may become resourcelimited in today’s rapidly increasing amounts of data. In this work, we present a one-pass parallelization scheme that H

DOI: 10.1021/acs.jpca.7b04378 J. Phys. Chem. A XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry A

(2) Yang, H. Change-Point Localization and Wavelet Spectral Analysis of Single-Molecule Time Series. Adv. Chem. Phys. 2011, 146, 217−243. (3) Barry, D.; Hartigan, J. A. A Bayesian Analysis for Change Point Problems. J. Am. Stat. Assoc. 1993, 88, 309−319. (4) Worsley, K. J. The Power of Likelihood Ratio and Cumulative Sum Tests for a Change in a Binomial Probability. Biometrika 1983, 70, 455−64. (5) Watkins, L. P.; Yang, H. Detection of Intensity Change Points in Time-Resolved Single-Molecule Measurements. J. Phys. Chem. B 2005, 109, 617−28. (6) Montiel, D.; Cang, H.; Yang, H. Quantitative Characterization of Changes in Dynamical Behavior for Single-Particle Tracking Studies. J. Phys. Chem. B 2006, 110, 19763−70. (7) Killick, R.; Fearnhead, P.; Eckley, I. A. Optimal Detection of Changepoints With a Linear Computational Cost. J. Am. Stat. Assoc. 2012, 107, 1590−8. (8) Zhang, K.; Chang, H.; Fu, A.; Alivisatos, A. P.; Yang, H. Continuous Distribution of Emission States from Single CdSe/ZnS Quantum Dots. Nano Lett. 2006, 6, 843−847. (9) Cordones, A. A.; Bixby, T. J.; Leone, S. R. Evidence for Multiple Trapping Mechanisms in Single CdSe/ZnS Quantum Dots from Fluorescence Intermittency Measurements over a Wide Range of Excitation Intensities. J. Phys. Chem. C 2011, 115, 6341−6349. (10) Cordones, A. A.; Bixby, T. J.; Leone, S. R. Direct Measurement of Off-State Trapping Rate Fluctuations in Single Quantum Dot Fluorescence. Nano Lett. 2011, 11, 3366−3369. (11) Wustholz, K. L.; Bott, E. D.; Kahr, B.; Reid, P. J. Memory and Spectral Diffusion in Single-Molecule Emission. J. Phys. Chem. C 2008, 112, 7877−7885. (12) Wong, N. Z.; Ogata, A. F.; Wustholz, K. L. Dispersive ElectronTransfer Kinetics from Single Molecules on TiO2 Nanoparticle Films. J. Phys. Chem. C 2013, 117, 21075−21085. (13) Terentyeva, T. G.; Engelkamp, H.; Rowan, A. E.; Komatsuzaki, T.; Hofkens, J.; Li, C. B.; Blank, K. Dynamic Disorder in SingleEnzyme Experiments: Facts and Artifacts. ACS Nano 2012, 6, 346−54. (14) Goldsmith, R. H.; Moerner, W. E. Watching Conformationaland Photodynamics of Single Fluorescent Proteins in Solution. Nat. Chem. 2010, 2, 179−186. (15) Wang, Q.; Moerner, W. E. Lifetime and Spectrally Resolved Characterization of the Photodynamics of Single Fluorophores in Solution Using the Anti-Brownian Electrokinetic Trap. J. Phys. Chem. B 2013, 117, 4641−4648. (16) Tuson, H. H.; Biteen, J. S. Unveiling the Inner Workings of Live Bacteria Using Super-Resolution Microscopy. Anal. Chem. 2015, 87, 42−63. (17) Welsher, K.; Yang, H. Multi-Resolution 3D visualization of the Early Stages of Cellular Uptake of Peptide-Coated Nanoparticles. Nat. Nanotechnol. 2014, 9, 198−203. (18) Nikol’skii, I. M.; Furmanov, K. K. Parallel Algorithm to Detect Structural Changes in Time Series. Comput. Math. Model. 2016, 27, 247. (19) Noé, M. The Calculation of Distributions of Two-Sided Kolmogorov-Smirnov Type Statistics. Ann. Math. Stat. 1972, 43, 58− 64. (20) Henderson, R. A Problem with the Likelihood Ratio Test for a Change-Point Hazard Rate Model. Biometrika 1990, 77, 835−43. (21) Worsley, K. J. Confidence Regions and Tests for a Change-Point in a Sequence of Exponential Family Random Variables. Biometrika 1986, 73, 91−104. (22) Gabriel, E.; Fagg, G. E.; Bosilca, G.; Angskun, T.; Dongarra, J. J.; Squyres, J. M.; Sahay, V.; Kambadur, P.; Barrett, B.; Lumsdaine, A.; et al. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. Proceedings of the 11th European PVM/MPI Users’ Group Meeting, Budapset, Hungary; 2004. (23) Knuth, D. E. The Art of Computer Programming, 2nd ed.; Addison-Wesley: New York, 1998; Vol. 3.

addresses this limitation. For 10 million data points, using Gaussian mean change point detection as an example, a 40-fold decrease in runtime has been achieved. As such, this presents a viable method for addressing the computation-resource bottleneck. Because each process only deals with a small amount of data compared to the entire time series, a typical amount of computer physical memory is sufficient to carry out all of the necessary computation without having to invoke virtual memorya process that can become even more resource taxing. That is, one expects to see improved performance even for a standalone workstation with multicore CPU and moderate memory. It is emphasized that the parallelization algorithm presented here is generalizable: the parallelization scheme proposed in this paper provides a framework for parallel change point detection, regardless of the specific detection method. Though Gaussian mean change point detection is specifically parallelized, the parallelization scheme can easily be implemented on other change point detection methods, such as Gaussian variance change point detection and multivariate change point detection. Moreover, this scheme is not just restricted to change point detection; it could also be applied to other analyses that are amenable to parallelization in a similar manner to change point detection, such as wavelet analysis of dynamical heterogeneity.27 It is hoped that with this new method, the bottleneck in the statistical analysis of data can be reduced, allowing for the study of more phenomena.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jpca.7b04378. Justification of using weights in the overlap region by way of characterizing the distribution of confidence intervals, time complexity of change point detection, detection accuracy in parallelization, and system specifications (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. ORCID

Nancy Song: 0000-0002-9605-4382 Notes

The authors declare no competing financial interest. Source codes associated with the publication are made available through the Yang lab or through the Princeton github repository at https://github.com/PrincetonUniversity/ ParallelChangePoint. The source codes are under the Creative Commons license CC BY-NC-SA (Attribution-NonCommercial-ShareAlike).



ACKNOWLEDGMENTS N.S. acknowledges summer support from the Fred Fox Fund for undergraduate research, Princeton Environmental Institute, and from the Peter N. Curtin Memorial Fund.



REFERENCES

(1) Yang, H. Model-Free Statistical Reduction of Single-Molecule Time Series. In Theory and Evaluation of Single-Molecule Signals; Barkai, E., Brown, F. L. H., Orrit, M., Yang, H., Eds.; World Scientific: Singapore, 2008; pp 1−29. I

DOI: 10.1021/acs.jpca.7b04378 J. Phys. Chem. A XXXX, XXX, XXX−XXX

Article

The Journal of Physical Chemistry A (24) Puaut, I. Real-Time Performance of Dynamic Memory Allocation Algorithms. Proceedings of the 14th Euromicro Conference on Real-Time Systems, Washington, DC; 2002; p 41. (25) Masmano, M.; Ripoll, I.; Crespo, A.; Real, J. TLSF: a New Dynamic Memory Allocator for Real-Time Systems. Proceedings of the 16th Euromicro Conference on Real-Time Systems, Washington, DC; 2004; pp 79−86. (26) Wilson, P. R.; Johnstone, M. S.; Neely, M.; Boles, D. Dynamic Storage Allocation: A Survey and Critical Review. In Memory Management; Baler, H., Ed.; Springer-Verlag: Berlin, Heidelberg, Germany, 1995; pp 1−116. (27) Yang, H. Detection and Characterization of Dynamical Heterogeneity in a Event Series Using Wavelet Correlation. J. Chem. Phys. 2008, 129, 074701.

J

DOI: 10.1021/acs.jpca.7b04378 J. Phys. Chem. A XXXX, XXX, XXX−XXX