2017 10th International Symposium on Computational Intelligence and Design
Ensemble Tracking based on CNN

Xiancai Zhang1, Zhuang Miao1, Yang Li1, Jiabao Wang1, Bo Zhou1, and Zhijie Zhao2
1. College of Command Information Systems, PLA Army Engineering University, Nanjing, China
2. Unit No. 31700 of PLA, Liaoning, China
Email:
[email protected],
[email protected]
Abstract—With the rapid development of deep learning, many tracking methods based on the convolutional neural network (CNN) have been proposed. They utilize hierarchical convolutional features extracted from a CNN to achieve state-of-the-art results. However, most of these trackers are trained using the convolutional features from only one layer. To address this issue, we propose a novel ensemble tracker based on CNN. First, we construct six trackers and combine them into a stronger tracker for predicting the position of the target. To improve robustness, we extract image patches at multiple scales and employ a scale correlation filter to estimate the suitable scale of the target. Experimental results over 51 benchmark sequences demonstrate that the proposed approach performs favorably against state-of-the-art tracking algorithms.

Keywords—convolutional features; correlation filter; scale estimation; object tracking
I. INTRODUCTION
Visual object tracking is a topic of growing interest in computer vision, with a wide range of applications such as human-computer interaction, robotics, video processing, and intelligent video surveillance. However, tracking remains a challenging problem due to complicated interfering factors such as fast motion, illumination change, background clutter, shape deformation, large scale variations, and so on [1]-[3].

In recent years, correlation filter-based trackers have been shown to provide excellent tracking performance and high computational efficiency. Initially, Henriques et al. [4] proposed the circulant structure of tracking-by-detection with kernels (CSK) method, which solves a ridge regression problem over gray-level features for visual tracking. Features play a vital role in a tracker, so different features have been employed in the correlation filter framework. Building on [4], the kernelized correlation filter (KCF) [5] method adopted the histogram of oriented gradients (HOG) [6] feature instead of the gray feature to improve both the robustness and the accuracy of the CSK tracker. In [7], Danelljan et al. used color attributes [8] to describe the target, which provided excellent performance. Li et al. [9] fused the HOG and color attributes features to further enhance performance.

Recently, convolutional features extracted from a CNN have demonstrated state-of-the-art results on visual tracking tasks in the correlation filter framework. Danelljan et al. [10] proposed a tracker trained with convolutional features from only one CNN layer, and the experimental results showed that convolutional features provide outstanding performance for visual tracking. Ma et al. [11] employed convolutional features from three CNN layers to obtain three correlation filters whose multi-level response maps collaboratively predict the target position. Li et al. [12] inferred the target location using five correlation filters with hierarchical convolutional features. Qi et al. [13] proposed a method that uses features from six CNN layers to build six weak trackers in the correlation filter framework and employs an adaptive hedge method to combine the six weak trackers into a stronger one. Although this method achieved outstanding results, it uses a template of fixed size and therefore cannot deal with scale variations of the target.

In this paper, we propose an ensemble tracker based on CNN which decomposes the tracking task into ensemble position estimation and scale estimation. First, the proposed approach extracts convolutional features from six layers of a CNN, trains six trackers in the correlation filter framework, and obtains six candidate positions of the target in the current frame. Specifically, we treat the ensemble position estimation as an adaptive decision task: the final position of the target is determined by the weighted decisions of all candidate positions. Unlike [13], which assigns initial weights to all positions and employs an adaptive hedge algorithm to update them, we compute the similarity between the template in the previous frame and each tracking result in the current frame, and derive the weights of all positions from these similarities. We then estimate the scale of the target using the scale correlation filter. We carry out experiments on the benchmark dataset [2] with 51 challenging sequences. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art trackers.
II. THE KCF TRACKER
The KCF [5] is a typical tracker that achieves impressive tracking performance on [2]. The KCF models the appearance of the target of interest using a filter $\mathbf{w}$ trained on an image patch $\mathbf{x}$ of size $M \times N$. Each $\mathbf{x}_{i,j}$, $(i,j) \in \{0,1,\dots,M-1\} \times \{0,1,\dots,N-1\}$, is treated as a training example obtained as a cyclic shift of the base sample $\mathbf{x}$, and is labeled with a Gaussian function value $y_{i,j}$.
The objective function is obtained by minimizing the sum of squared errors over all samples $\mathbf{x}_{i,j}$ and their labels $y_{i,j}$:

$$\mathbf{w} = \arg\min_{\mathbf{w}} \sum_{i,j} \left| \langle \varphi(\mathbf{x}_{i,j}), \mathbf{w} \rangle - y_{i,j} \right|^2 + \lambda \|\mathbf{w}\|^2, \qquad (1)$$

where $\lambda \geq 0$ is the regularization parameter that prevents overfitting, $\varphi$ represents the mapping to the Hilbert space induced by the kernel $\kappa$, and $\langle \cdot, \cdot \rangle$ denotes the inner product. In the high-dimensional feature space, the objective $\mathbf{w}$ in (1) can be expressed as $\mathbf{w} = \sum_{i,j} \alpha_{i,j} \varphi(\mathbf{x}_{i,j})$, and the parameter $\boldsymbol{\alpha}$ is given by

$$\boldsymbol{\alpha} = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(\mathbf{y})}{\mathcal{F}(\langle \varphi(\mathbf{x}), \varphi(\mathbf{x}) \rangle) + \lambda} \right), \qquad (2)$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, and $\mathbf{y} = \{ y_{i,j} \mid (i,j) \in \{0,1,\dots,M-1\} \times \{0,1,\dots,N-1\} \}$.

In a new frame, a patch $\mathbf{z}$ with the same size and position as $\mathbf{x}$ is cropped out. The response map $\hat{\mathbf{y}}$ is obtained by

$$\hat{\mathbf{y}} = \mathcal{F}^{-1}\left( \mathcal{F}(\boldsymbol{\alpha}) \odot \mathcal{F}(\langle \varphi(\mathbf{z}), \varphi(\mathbf{x}) \rangle) \right), \qquad (3)$$

where $\odot$ denotes the element-wise product, $\boldsymbol{\alpha}$ is the parameter updated in the previous frame, and $\mathbf{x}$ is the target appearance updated in the previous frame. The KCF tracker takes the target position $(a, b)$ to be the location of the largest element of the response matrix:

$$(a, b) = \arg\max_{a', b'} \hat{\mathbf{y}}(a', b'), \qquad (4)$$

where $\hat{\mathbf{y}}(a', b')$ denotes the element at position $(a', b')$ of the matrix $\hat{\mathbf{y}}$.
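To make the training and detection steps concrete, the following is a minimal single-channel NumPy sketch of (2)-(4) with a Gaussian kernel. It is an illustration rather than the KCF implementation: the function names, the 64 × 64 patch size, and the kernel and label bandwidths are assumptions.

```python
import numpy as np

def gaussian_kernel_correlation(x, z, sigma=0.5):
    """Kernel correlation of z against x, computed in the Fourier domain
    so that all cyclic shifts are evaluated at once."""
    xf, zf = np.fft.fft2(x), np.fft.fft2(z)
    cross = np.real(np.fft.ifft2(zf * np.conj(xf)))          # cross-correlation
    d2 = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(d2, 0) / (sigma ** 2))

def train(x, y, lam=1e-4):
    """Eq. (2): alpha = F^-1( F(y) / (F(k_xx) + lambda) )."""
    k = gaussian_kernel_correlation(x, x)
    return np.fft.ifft2(np.fft.fft2(y) / (np.fft.fft2(k) + lam))

def detect(alpha, x, z):
    """Eqs. (3)-(4): response map and the location of its maximum."""
    k = gaussian_kernel_correlation(x, z)
    resp = np.real(np.fft.ifft2(np.fft.fft2(alpha) * np.fft.fft2(k)))
    a, b = np.unravel_index(np.argmax(resp), resp.shape)
    return resp, (a, b)

# Usage: x is an M x N template patch, y a Gaussian label map whose peak is
# rolled to the origin, z the search patch cropped from the next frame.
M, N = 64, 64
gy, gx = np.mgrid[0:M, 0:N] - np.array([M // 2, N // 2])[:, None, None]
y = np.roll(np.exp(-(gx**2 + gy**2) / (2 * 2.0**2)), (-M // 2, -N // 2), (0, 1))
x = np.random.rand(M, N)
z = np.roll(x, (3, 5), axis=(0, 1))   # target shifted by (3, 5) pixels
alpha = train(x, y)
_, pos = detect(alpha, x, z)          # pos recovers the (3, 5) shift (mod M, N)
```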
III. THE PROPOSED ALGORITHM

A. The Convolutional Features
The VGG-Net [14] is one of the most popular CNN models; trained on 1.3 million images from the ImageNet dataset, it achieves state-of-the-art results on the classification task. The proposed method uses the pre-trained VGG-Net to extract hierarchical convolutional features, which describe the target from multiple aspects. On the one hand, the features extracted from the earlier layers have higher spatial resolution, which allows precise target localization, but they are not robust to changes in target appearance. On the other hand, the outputs of the later layers of the CNN model provide semantic information about the target and are robust to significant appearance variations, but they cannot localize the target precisely. The visual tracking task aims to estimate the precise position of the target, which requires powerful features. Because the features from the later layers contain more semantic information while the features from the earlier layers retain higher spatial resolution, features from different layers are complementary: by using features from multiple layers, the tracker is robust to significant appearance variations and retains precise localization ability.
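As an illustration of this multi-layer feature extraction, the sketch below collects activations from a pre-trained VGG-16 using PyTorch/torchvision. The use of torchvision (rather than the paper's original VGG-Net deployment) and the layer-to-index mapping for the six layers named in Section III-B are assumptions.

```python
import torch
from torchvision import models, transforms

# Assumed mapping from the six layer names used in Sec. III-B to the
# module indices of torchvision's vgg16().features.
LAYER_IDX = {"pool3": 16, "relu4_3": 22, "pool4": 23,
             "relu5_2": 27, "relu5_3": 29, "pool5": 30}

vgg = models.vgg16(pretrained=True).features.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),            # VGG-Net input size (Sec. III-B)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def hierarchical_features(patch_pil):
    """Return the six feature maps for one image patch (PIL image)."""
    x = preprocess(patch_pil).unsqueeze(0)     # 1 x 3 x 224 x 224
    feats, wanted = {}, set(LAYER_IDX.values())
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in wanted:
            name = [k for k, v in LAYER_IDX.items() if v == idx][0]
            feats[name] = x.squeeze(0)         # C x H x W; resolution drops with depth
    return feats
```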
B. The Ensemble Position Estimation
We decompose the tracking task into the ensemble position estimation and the scale estimation. For the ensemble position estimation, we employ the VGG-Net to extract features of the image patch. Given an image patch, we first rescale it to $224 \times 224$, the input size required by the VGG-Net. Based on extensive experiments, the outputs of the pool3, relu4_3, pool4, relu5_2, relu5_3, and pool5 layers are used as the features. Assume the current frame is the $t$-th frame. For the feature map $\mathbf{x}^k$ extracted from the $k$-th selected layer, we train a correlation filter, obtain its response by (3), and estimate a target position $(a_t^k, b_t^k)$ using (4). We treat the tracking process as an adaptive decision task: the final position of the target in the $t$-th frame is estimated as the ensemble of all weighted positions,

$$(a_t, b_t) = \sum_k w_t^k (a_t^k, b_t^k), \qquad (5)$$

where $w_t^k$ is the weight of the $k$-th tracker and $\sum_k w_t^k = 1$.

Our goal is then to compute the weights $w_t^k$. Unlike [13], which assigns initial weights to all positions and employs an adaptive hedge algorithm to update them, we exploit the fact that the target changes little between two adjacent frames. Let $d^k$ denote the dissimilarity between the target appearance in the $(t-1)$-th frame and the prediction of the $k$-th tracker in the $t$-th frame. We compute it by

$$d^k = \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{|x_{ij}^t - x_{ij}^{t-1}|}{M \times N}, \qquad (6)$$

where $M \times N$ is the size of the appearance, $x_{ij}^{t-1}$ is a pixel of the target appearance in the $(t-1)$-th frame, and $x_{ij}^t$ is the corresponding pixel of the image patch centered at the position $(a_t^k, b_t^k)$ in the $t$-th frame ($(i,j) \in \{1,2,\dots,M\} \times \{1,2,\dots,N\}$). A smaller $d^k$ indicates that the $k$-th tracker localizes the target more precisely, so it should receive a larger weight $w_t^k$. To avoid division by zero, we define

$$d^k = \begin{cases} 1, & d^k = 0 \\ d^k, & d^k > 0 \end{cases} \qquad (7)$$

and compute the weight of each position by

$$w_t^k = \frac{1/d^k}{\sum_k (1/d^k)}. \qquad (8)$$
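A compact NumPy sketch of the weighting scheme in (5)-(8), assuming the per-layer positions and the patches cropped around them are already available; `ensemble_position` and its inputs are illustrative names, not the paper's implementation.

```python
import numpy as np

def ensemble_position(positions, patches, template):
    """Combine per-layer position estimates via eqs. (5)-(8).

    positions: list of K (a, b) tuples, one per tracker/layer.
    patches:   list of K M x N patches cropped at those positions (frame t).
    template:  the M x N target appearance from frame t-1.
    """
    # Eq. (6): mean absolute pixel difference between template and each patch.
    d = np.array([np.mean(np.abs(p.astype(float) - template.astype(float)))
                  for p in patches])
    d[d == 0] = 1.0                      # eq. (7): guard against division by zero
    w = (1.0 / d) / np.sum(1.0 / d)      # eq. (8): weights sum to one
    pos = np.sum(w[:, None] * np.asarray(positions, dtype=float), axis=0)
    return tuple(pos), w                 # eq. (5): weighted position

# Hypothetical usage with three trackers:
tpl = np.random.rand(32, 32)
pats = [tpl + 0.01 * np.random.rand(32, 32) for _ in range(3)]
(a_t, b_t), weights = ensemble_position([(10, 12), (11, 12), (10, 13)], pats, tpl)
```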
C. Adaptive Scale Estimation
We use the scale correlation filter [15] to estimate the scale of the target in each frame.
Let $w_{t-1} \times h_{t-1}$ denote the target size in the $(t-1)$-th frame. After estimating the position $(a_t, b_t)$ in the $t$-th frame as described in Section III-B, we extract $S$ image patches at variable scales centered at this position:

$$q^s w_{t-1} \times q^s h_{t-1}, \quad s \in \left\{ -\frac{S-1}{2}, \dots, \frac{S-1}{2} \right\}, \qquad (9)$$

where $q$ is the scale coefficient. We resize all $S$ patches to the same size and apply the scale correlation filter, which is learned from the HOG features of the image patches, to obtain a response for each patch. The scale $s'$ with the maximum response gives the scale of the target in the $t$-th frame, and the target size is computed as

$$w_t \times h_t = q^{s'} w_{t-1} \times q^{s'} h_{t-1}. \qquad (10)$$

After estimating the target scale in the $t$-th frame, we crop the patch $\mathbf{z}_t$ centered at the estimated position $(a_t, b_t)$ with the estimated size and extract the feature map $\mathbf{z}_t^k$ from the $k$-th convolutional layer to retrain the correlation filters. The appearance $\mathbf{x}_t^k$ and the classifier coefficients $\boldsymbol{\alpha}_t^k$ are updated by

$$\begin{cases} \mathbf{x}_t^k = (1-\eta)\,\mathbf{x}_{t-1}^k + \eta\,\mathbf{z}_t^k \\ \boldsymbol{\alpha}_t^k = (1-\eta)\,\boldsymbol{\alpha}_{t-1}^k + \eta\,\boldsymbol{\alpha}_t^k \end{cases} \qquad (11)$$

where $\eta$ is the learning rate and the $\boldsymbol{\alpha}_t^k$ on the right-hand side denotes the coefficients newly learned in the $t$-th frame.
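The scale search of (9)-(10) and the update of (11) can be sketched as follows; `scale_filter_response` is a hypothetical hook standing in for the HOG-based scale correlation filter of [15], not the actual implementation.

```python
import numpy as np

def estimate_scale(w_prev, h_prev, scale_filter_response, S=29, q=1.03):
    """Eqs. (9)-(10): try S candidate sizes q^s * (w, h) and keep the best."""
    exps = np.arange(S) - (S - 1) / 2.0          # s in {-(S-1)/2, ..., (S-1)/2}
    sizes = [(q ** s * w_prev, q ** s * h_prev) for s in exps]
    # Each candidate size would be cropped, resized, and scored with the
    # HOG-based scale correlation filter; here we only query the hook.
    responses = np.array([scale_filter_response(sz) for sz in sizes])
    s_best = exps[np.argmax(responses)]          # s' with maximal response
    return q ** s_best * w_prev, q ** s_best * h_prev

def update_model(x_prev, z_t, alpha_prev, alpha_new, eta=0.01):
    """Eq. (11): linear interpolation of appearance and filter coefficients."""
    x_t = (1 - eta) * x_prev + eta * z_t
    alpha_t = (1 - eta) * alpha_prev + eta * alpha_new
    return x_t, alpha_t
```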
IV. EXPERIMENTS
A. Parameters Setup
The proposed approach is implemented in Matlab 2015a. We perform the experiments on an Intel i5-4210 2.6 GHz CPU with 8 GB RAM. The regularization parameter in (1) is set to $\lambda = 10^{-4}$; the scale coefficient in (9) is set to $q = 1.03$; the number of scales in (9) is set to $S = 29$; and the learning rate in (11) is set to $\eta = 0.01$.

We report results using the center location error (CLE), distance precision (DP), and overlap precision (OP), defined as follows. CLE is the Euclidean distance between the target position obtained by the tracker and the manually labeled target position in each frame. DP is the percentage of frames in which the CLE is less than a predefined threshold. OP is the percentage of frames in which the overlap rate (OR) is greater than a predefined threshold. The OR in each frame is computed as $\mathrm{OR} = \frac{\mathrm{area}(B_t \cap G_t)}{\mathrm{area}(B_t \cup G_t)}$, where $B_t$ is the tracked bounding box, $G_t$ is the ground-truth bounding box, $\cap$ and $\cup$ represent the intersection and union of the two boxes, and $\mathrm{area}(\cdot)$ represents the area of a region.
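For reference, a short NumPy sketch of these metrics; bounding boxes are assumed to be given as (x, y, w, h) rows, one per frame.

```python
import numpy as np

def cle(centers_pred, centers_gt):
    """Center location error per frame (Euclidean distance in pixels)."""
    return np.linalg.norm(centers_pred - centers_gt, axis=1)

def overlap_rate(B, G):
    """OR = area(B & G) / area(B | G) for (x, y, w, h) boxes, per frame."""
    x1 = np.maximum(B[:, 0], G[:, 0]); y1 = np.maximum(B[:, 1], G[:, 1])
    x2 = np.minimum(B[:, 0] + B[:, 2], G[:, 0] + G[:, 2])
    y2 = np.minimum(B[:, 1] + B[:, 3], G[:, 1] + G[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = B[:, 2] * B[:, 3] + G[:, 2] * G[:, 3] - inter
    return inter / union

def dp(errors, threshold=20.0):
    """Distance precision: fraction of frames with CLE below the threshold."""
    return np.mean(errors < threshold)

def op(ors, threshold=0.5):
    """Overlap precision: fraction of frames with OR above the threshold."""
    return np.mean(ors > threshold)
```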
B. Comparison with the State-of-the-art Trackers
To evaluate the effectiveness of the proposed approach, we compare our method with 10 state-of-the-art trackers: KCF [5], SAMF [9], HCFT [11], HDT [13], DSST [15], RPT [16], Struck [17], SRDCF [18], TLD [19], and MEEM [20]. Among these methods, Struck and TLD are classical trackers; KCF, SAMF, DSST, RPT, and SRDCF are correlation filter-based trackers; and HCFT and HDT are CNN-based trackers.

We present quantitative comparisons of the average CLE, the average DP at 20 pixels, and the average OP at 0.5 in Table I, where the first and second best results are highlighted in bold and underline. Table I shows that our approach performs favorably against the state-of-the-art trackers. Among them, the HCFT method attains the best results with an average CLE of 15.7 pixels and an average DP of 89.1%; our approach improves on this with an average CLE of 14.4 pixels and an average DP of 89.8%. Meanwhile, the SRDCF tracker achieves the highest average OP of 78.4% among the compared trackers, and our approach provides a higher OP of 85.3%.

Figure 1 shows the precision plot, which reports the average distance precision over all 51 sequences, and the success plot, which reports the overlap precision over all 51 sequences. In both plots, our method achieves the best results and outperforms the other methods. In summary, the precision plot demonstrates that our approach has very precise localization ability, and the success plot shows that our method can effectively handle target scale variations in the sequences.

Figure 2 shows a visualization of the tracking results of our approach and the state-of-the-art trackers SRDCF, HCFT, RPT, MEEM and HDT on 8 challenging sequences: Car4, CarScale, Coke, David, Freeman4, MotorRolling, Shaking, and Walking2. These sequences involve many challenging factors, such as scale variations (Fig. 2 (a), (b), (d), (f) and (h)), occlusion (Fig. 2 (b), (c), (e), (g) and (h)), illumination variations (Fig. 2 (a), (d), (f) and (g)), and rotation (Fig. 2 (d), (e), (f) and (g)). The results show that our approach not only estimates the location and the scale of the target precisely, but also is robust to the other challenging factors.
Figure 1. Precision and success plots. The legend of the precision plot represents the average DP at 20 pixels for each method. The legend of the success plot contains the area-under-the-curve (AUC) score for each tracker.
TABLE I. COMPARISONS OF OUR APPROACH WITH STATE-OF-THE-ART TRACKERS

Evaluation             Ours   HCFT   HDT    SRDCF  MEEM   SAMF   KCF    Struck  TLD    RPT    DSST
Average CLE (pixels)   14.4   15.7   16     35.1   19     28.4   35.4   54.3    51.5   36.5   40.9
Average DP (%)         89.8   89.1   88.8   83.8   84.8   79     74.3   64.1    56.3   81.4   74.3
Average OP (%)         85.3   74     73.6   78.4   72.2   73.3   62.4   54.3    49.5   71.2   67.4
Figure 2. A visualization of the tracking results of our approach and the state-of-the-art trackers: SRDCF, HCFT, RPT, MEEM and HDT on 8 sequences.
V. CONCLUSION
In this paper, we propose a novel ensemble tracking method based on CNN, which decomposes the tracking task into target position estimation and scale estimation. The target position is estimated by an ensemble of six trackers, each trained with features from one layer of a CNN in the kernelized correlation filter framework. We then estimate the target scale using the discriminative scale correlation filter. Experimental results on 51 challenging sequences demonstrate that our approach outperforms state-of-the-art trackers and precisely estimates both the position and the scale of the target.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, Dec. 2006, pp. 81–93, doi: 10.1145/1177352.1177355.
[2] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," Proc. Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, Jun. 2013, pp. 2411–2418, doi: 10.1109/CVPR.2013.312.
[3] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, "Visual tracking: an experimental survey," Transactions on Pattern Analysis & Machine Intelligence, vol. 36, Jul. 2014, pp. 1442–1468, doi: 10.1109/TPAMI.2013.230.
[4] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," Proc. European Conference on Computer Vision (ECCV), Springer Press, Oct. 2012, pp. 702–715, doi: 10.1007/978-3-642-33765-9_50.
[5] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," Transactions on Pattern Analysis & Machine Intelligence, vol. 37, Mar. 2015, pp. 583–596, doi: 10.1109/TPAMI.2014.2345390.
[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Proc. Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, Jun. 2005, pp. 886–893, doi: 10.1109/CVPR.2005.177.
[7] M. Danelljan, F. S. Khan, M. Felsberg, and J. V. D. Weijer, "Adaptive color attributes for real-time visual tracking," Proc. Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, Jun. 2014, pp. 1090–1097, doi: 10.1109/CVPR.2014.143.
[8] J. V. D. Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," Transactions on Image Processing, vol. 18, Jul. 2009, pp. 1512–1523, doi: 10.1109/TIP.2009.2019809.
[9] Y. Li and J. Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," Proc. Computer Vision - ECCV 2014 Workshops, Springer Press, Sep. 2014, pp. 254–265, doi: 10.1007/978-3-319-16181-5_18.
[10] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Convolutional features for correlation filter based visual tracking," Proc. International Conference on Computer Vision Workshop (ICCVW), IEEE Press, Dec. 2015, pp. 621–629, doi: 10.1109/ICCVW.2015.84.
[11] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, "Hierarchical convolutional features for visual tracking," Proc. International Conference on Computer Vision (ICCV), IEEE Press, Dec. 2015, pp. 3074–3082, doi: 10.1109/ICCV.2015.352.
[12] Y. Li, Y. Zhang, Y. Xu, J. Wang, and Z. Miao, "Robust scale adaptive kernel correlation filter tracker with hierarchical convolutional features," IEEE Signal Processing Letters, vol. 23, Aug. 2016, pp. 1136–1140, doi: 10.1109/LSP.2016.2582783.
[13] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang, "Hedged deep tracking," Proc. Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, Jun. 2016, pp. 4303–4311.
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Proc. International Conference on Learning Representations (ICLR), May 2015, pp. 1–14.
[15] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," Proc. British Machine Vision Conference (BMVC), BMVA Press, Sep. 2014, pp. 1–11, doi: 10.5244/C.28.65.
[16] Y. Li, J. Zhu, and S. C. H. Hoi, "Reliable patch trackers: Robust visual tracking by exploiting reliable patches," Proc. Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Press, Jun. 2015, pp. 353–361, doi: 10.1109/CVPR.2015.7298632.
[17] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," Proc. International Conference on Computer Vision (ICCV), IEEE Press, Nov. 2011, pp. 263–270, doi: 10.1109/ICCV.2011.6126251.
[18] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Learning spatially regularized correlation filters for visual tracking," Proc. International Conference on Computer Vision (ICCV), IEEE Press, Dec. 2015, pp. 4310–4318, doi: 10.1109/ICCV.2015.490.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," Transactions on Pattern Analysis & Machine Intelligence, vol. 34, Jul. 2012, pp. 1409–1422, doi: 10.1109/TPAMI.2011.239.
[20] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," Proc. European Conference on Computer Vision (ECCV), Springer Press, Sep. 2014, pp. 188–203, doi: 10.1007/978-3-319-10599-4_13.