Efficient Segmentation-Based PatchMatch for Large Displacement Optical Flow Estimation Jun Chen, Zemin Cai, Jianhuang Lai, Senior Member, IEEE, and Xiaohua Xie, Member, IEEE

Abstract—Efficient optical flow estimation with high accuracy is a challenging problem in computer vision. In this paper, we present a simple but efficient segmentation-based PatchMatch framework to address this issue. Specifically, it first generates sparse seeds by oversegmentation without losing important motion information, and then yields sparse matches by applying a coarse-to-fine PatchMatch to these sparse seeds. Such a scheme enhances the robustness of global regularization and yields better matching results than existing NNF techniques, while leading to a significant speed-up due to the sparsity of the seeds. Simultaneously, we introduce an extended nonlocal propagation and an adaptive random search to address the basic limitation of the traditional coarse-to-fine framework in handling motion details that often vanish at coarser levels. Finally, we obtain dense matches at the finest level through an efficient sparse-to-dense matching guided by the cues of oversegmentation. With an efficient approximation of oversegmentation, the proposed algorithm runs significantly fast and is robust to large displacements while preserving important motion details. It also achieves good performance on the challenging MPI-Sintel and Kitti flow 2015 datasets.

Index Terms—Optical flow, oversegmentation, PatchMatch, extended nonlocal propagation, adaptive random search.

I. INTRODUCTION

Optical flow estimation plays an important role in computer vision. In recent years, although remarkable progress [28, 32, 34–43, 46–48] has been achieved on challenging problems in this area, designing an efficient algorithm for large displacement optical flow estimation with high accuracy remains an unsolved issue. In this paper, our method mainly focuses on handling naturalistic video sequences that include large displacements, significant occlusions, and non-rigid motions; we exclude fluid-like images and dynamic texture videos [52], [56], [57], [58] that contain rivers or ocean waves [53], sea ice [54], atmospheric motions [55], groups of pedestrians, and traffic flows over time. A major drawback of traditional optical flow methods [23], [2], [3], [17], [18], [30] is that they find the matching correspondences for every pixel within a coarse-to-fine

J. Chen is with Guangdong Academy of Research on VR Industry, Foshan University, Foshan, Guangdong, 528000, China, and also with Guangdong Key Laboratory of Information Security Technology, Guangzhou, Guangdong 510006, China. E-mail: [email protected]. Z. Cai is with the Department of Electronic Engineering, Shantou University, Shantou, Guangdong 515063, China, and also with Guangdong Provincial Key Laboratory of Digital Signal and Image Processing Techniques, Shantou, Guangdong 515063, China. E-mail: [email protected]. J. Lai and X. Xie are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510006, China, and also with Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education, China. E-mail: [email protected], [email protected].

Fig. 1. Oversegmentation with the SLIC algorithm [5]. From top left to bottom right: Image sequence, Ground-truth flow, Image after oversegmentation, and Ground truth with oversegmentation curves.

framework applied to full-resolution images or dense seeds directly, which limits their application because of the high computational demand. In addition, the traditional coarse-to-fine framework has a fundamental limitation in handling tiny structures with large displacements, which often vanish at coarser levels. Fig. 1 depicts an oversegmented image and the corresponding color-coded ground-truth flow with oversegmentation curves, which shows that a strong similarity of the motion in a large neighborhood is indeed very common in optical flow fields. Our key motivation draws from the observation that the pixels in each superpixel generally tend to have similar motion, and that the motions on the whole image plane are also sparse. Based on this, if we want an initial prediction of the flow for every pixel, we only need to find the match of one pixel, known as a seed, in each superpixel, which avoids finding the correspondence for every pixel in the image. In this paper, by making use of this motion similarity and sparsity, we develop a simple but efficient segmentation-based PatchMatch framework to address these challenging problems in optical flow estimation. The main contributions are summarized as follows:

(1) Different from the coarse-to-fine PatchMatch schemes in prior work [14], [17], [18] that use full-resolution images or dense seeds directly, we adopt a coarse-to-fine PatchMatch with sparse seeds guided by oversegmentation. By representing each superpixel with a seed as the basic unit, it reduces the number of processed pixels and effectively extends the propagation range, which enhances the robustness of global regularization and avoids getting trapped in local minima. Furthermore, by adjusting the superpixel size, these sparse


seeds obtained from oversegmentation provide an approximately complete representation of the different motions on the whole image plane; this avoids losing important motion features while reducing information redundancy, which can lead to a remarkable acceleration without a significant loss in accuracy.

(2) It also addresses the fundamental limitation of the traditional coarse-to-fine scheme in handling important motion details that often vanish at coarser levels. At the propagation step, we introduce an extended nonlocal propagation that adds the nearest m seeds under the edge-aware (geodesic) distance as neighbors, which can effectively recover the motions of small-scale structures that often vanish at coarser levels by propagating good matches from nonlocal similar superpixels along image edges or geodesic lines to the current seed. At the search step, we use an adaptive random search that handles this issue by detecting the seeds that vanish between adjacent levels and recovering their motions by expanding the corresponding search radius to avoid getting trapped in local minima.

(3) We obtain denser matches at the finest level of the image pyramids by an efficient sparse-to-dense matching according to the cues from oversegmentation, and further improve the efficiency of our method by performing an efficient approximation of oversegmentation, which assigns each pixel to the nearest seed under the geodesic distance.

Benefiting from the advantages of each step in the proposed framework, our method achieves a significant speed-up with high accuracy while preserving important motion details. It also achieves a high ranking on the challenging MPI-Sintel and Kitti flow 2015 datasets.

II. RELATED WORK

In this section, we briefly review related work that focuses on the challenging problems of optical flow estimation, such as high efficiency, large displacements, and the preservation of motion details. In recent years, the classic variational framework [2, 8–11] has shown outstanding performance for accurate optical flow estimation. Variational methods perform very well in recovering small displacements and local motion details. However, they are often integrated into the traditional coarse-to-fine scheme to deal with large displacements and compute the flow for every pixel at each level directly. Therefore, they have a fundamental limitation in handling small-scale structures with relatively large displacements, which always vanish at coarser levels, while also suffering from high computational complexity. To handle this issue, descriptor matching was introduced into the original variational framework. Brox et al. [2] introduced a descriptor matching term that penalizes the difference between the flow and HOG matches. Furthermore, Xu et al. [3] proposed an extended coarse-to-fine framework that performs an expensive fusion of sparse matches and the estimated flow at each level. However, these methods are still computationally complex and, due to the sparsity of the matching, fail to handle tiny structures whose displacements are larger than their own size in real-world video. To handle the sparsity of descriptor matching, Revaud et al. [21] proposed a

descriptor matching algorithm named DeepMatching to obtain relatively dense matches, and performed an edge-preserving interpolation [22] with these dense matches to obtain an accurate flow field. However, its efficiency is limited by DeepMatching [21]. Chen et al. [11] later used an extended coarse-to-fine framework with nonlocal split-Bregman regularization, whose efficiency is still limited by the DeepMatching initialization. To achieve real-time performance, hardware implementations of optical flow estimation have recently been proposed. Diaz et al. [59] proposed a real-time optical flow system on a field-programmable gate array (FPGA) device. Pauwels et al. [60] exploited a massive degree of parallelism to achieve real-time performance and compared FPGA and GPU implementations of real-time phase-based optical flow, stereo, and local image features. Botella et al. [61] proposed a robust bio-inspired architecture for optical flow computation, and later introduced a quantization analysis and enhancement of a VLSI (very large scale integration) gradient-based motion estimation architecture [62]. Although these methods achieve significant improvements in efficiency, they heavily rely on powerful hardware such as FPGAs and GPUs.

Our method follows PatchMatch [14] and is related to the nearest neighbor field (NNF). Although the efficiency of computing the NNF has improved significantly in recent work [14], [19], the resulting field usually contains many outliers due to the lack of global regularization. Chen et al. [15] used the computed NNF to recover the dominant motion patterns. Bao et al. [16] used an edge-preserving PatchMatch for optical flow estimation. However, these methods are still time-consuming and fail to handle very large displacements and significant occlusions. Recently, PatchMatch Filter [24] was proposed to address the discrete multi-labeling problem, using a superpixel-based search strategy to improve the original PatchMatch method. Li et al. [32] later improved it with two-layer graph structures. However, their efficiency is greatly limited by the cost aggregation over the pixels in each superpixel. To obtain dense matches, Bailer et al. [17] proposed an efficient hierarchical search strategy that repels outliers while recovering the motions of tiny objects and other details, but its speed is limited by the use of a relatively small grid interval. Hu et al. [18] later proposed an efficient method that uses CPM instead of DeepMatching, which achieves high efficiency but is still constrained by applying coarse-to-fine PatchMatch to dense seeds directly. These methods still easily lose fine-scale motion structures that vanish at coarser levels because they consider only spatially adjacent seeds as neighbors. In contrast, we introduce a coarse-to-fine PatchMatch with sparse seeds guided by oversegmentation. By taking superpixels as the basic units and considering nonlocal similar superpixels as neighbors, it achieves a significant acceleration while preserving important motion details.

III. SEGMENTATION-BASED MATCHING FRAMEWORK

In this section, we introduce the segmentation-based PatchMatch framework and discuss its main features. Our matching framework consists of five steps: (1) Superpixel representation. (2) Coarse-to-fine PatchMatch with sparse seeds.


(3) Sparse-to-dense matching. (4) Occlusion and outlier handling. (5) Interpolation and refinement.

A. Superpixel Representation

Superpixel decomposition of a given image is a key part of our method. In this paper, we obtain superpixels by performing oversegmentation at the finest level of the image pyramids with the SLIC algorithm [5], which segments an input color image I into K non-overlapping superpixels, i.e., R = {R_k | ∪_{k=1}^{K} R_k = I and ∀k ≠ m, R_k ∩ R_m = ∅}. The SLIC algorithm is locally aware of image edges and has a fast runtime, linear in the number of pixels. We represent these superpixels with a set of seeds S_0 = {s_1, s_2, ..., s_K}, and use the cluster centers of the superpixels as the seeds. There is only one seed in every non-overlapping superpixel, so the number of processed pixels is significantly reduced, which results in a significant acceleration. In each superpixel, we assign each pixel located at p = (x_p, y_p) the label L of the corresponding seed, L ∈ {1, 2, ..., K}, and pixels from the same superpixel share the same label. To avoid losing important motion information, we make these superpixels contain nearly all the different motions on the whole image plane by adjusting the superpixel size σ; the pixels in each superpixel generally tend to have similar motion. Therefore, although these seeds are sparse, they provide an approximately complete representation of the different motions over the whole image plane without losing important motion information. Considering the motion similarity of the pixels in each superpixel, we only compute the matching of the seeds of these superpixels rather than of every pixel in the image, for high efficiency.

Given two consecutive images I_1, I_2 ⊂ R² and a collection of seeds S_0 = {s_1, s_2, ..., s_K} obtained from oversegmentation, our goal is to find the displacement of each seed, F(s_a) = M(p(s_a)) − p(s_a) ∈ R², where s_a is a seed at position p(s_a) in the set S_0, and M(p(s_a)) is the corresponding matching position in I_2 for the seed s_a in I_1. We first construct an undirected graph G = (V, E) for the oversegmented image, where V is the set of vertices representing the sparse seeds, and E is the set of edges connecting spatially adjacent seeds. The length of an edge denotes the cost between its endpoints. As in EpicFlow [22], the result of the "structured edge detector" (SED) is used as the cost map. We define the approximate geodesic distance as the shortest distance between two positions with respect to the cost map, and compute it between any two seeds using Dijkstra's algorithm on the graph G.

As shown in Fig. 1, the motions on the whole image plane are sparse, so most spatially adjacent superpixels, and some nonlocal superpixels with similar image appearance, also have similar motion. Therefore, our basic matching can follow PatchMatch [14] and make use of the motion similarity of these superpixels. However, different from PatchMatch [14], in addition to using spatially adjacent seeds as neighbors, the proposed segmentation-based PatchMatch adds the nearest m seeds under the edge-aware (geodesic) distance as neighbors of the current seed, which accounts for the motion similarity of nonlocal similar superpixels.
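To make the construction above concrete, the following sketch builds the seeds and the m geodesically nearest neighbors. It is a minimal illustration, not the authors' implementation: it assumes scikit-image's SLIC and SciPy's Dijkstra as stand-ins, a gradient-style edge-cost map in place of the SED detector, and helper names (superpixel_seeds, geodesic_neighbors) of our own.

```python
# Minimal sketch of the superpixel/seed setup, under the assumptions above.
import numpy as np
from skimage.segmentation import slic
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def superpixel_seeds(image, n_segments=1000):
    """Oversegment and represent each superpixel by its cluster center."""
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    seeds = []
    for k in range(labels.max() + 1):
        ys, xs = np.nonzero(labels == k)
        seeds.append((int(xs.mean()), int(ys.mean())))   # seed = centroid (x, y)
    return labels, np.array(seeds)

def geodesic_neighbors(seeds, labels, cost, m=8):
    """Approximate geodesic distances between seeds on the graph G = (V, E):
    vertices are seeds, edges connect spatially adjacent superpixels, and
    edge weights come from the edge-cost map (here crudely sampled at the
    two seed positions)."""
    K = len(seeds)
    W = lil_matrix((K, K))
    h, w = labels.shape
    for y in range(h - 1):                       # 4-neighborhood label changes
        for x in range(w - 1):
            for a, b in ((labels[y, x], labels[y, x + 1]),
                         (labels[y, x], labels[y + 1, x])):
                if a != b and W[a, b] == 0:
                    ca = cost[seeds[a][1], seeds[a][0]]
                    cb = cost[seeds[b][1], seeds[b][0]]
                    W[a, b] = W[b, a] = 0.5 * (ca + cb) + 1e-6
    D = dijkstra(W.tocsr(), directed=False)      # all-pairs shortest paths
    return np.argsort(D, axis=1)[:, 1:m + 1]     # the m nearest seeds (no self)
```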

Fig. 2. An example of extended nonlocal propagation handling fine-scale structures with large displacements that often vanish at coarse levels. From top left to bottom right: Image sequence with the marked region A, ground-truth flow with the marked region B, marked region A with oversegmentation curves, and marked region B with oversegmentation curves.

Similar to PatchMatch [14], after the flow initialization of each seed we compute the matching by iteratively performing propagation followed by random search. At the propagation step, we find a good correspondence for the current seed s_a by propagating good matches from the neighboring seeds to it. These neighboring seeds have already been examined in the current iteration. The propagation step is described as

F(s_a) = arg min_{F(s_i)} D(p(s_a), p(s_a) + F(s_i)),  s_i ∈ {s_a} ∪ N_a,  N_a = N_s ∪ N_g,  (1)

where D(p(s_a), p(s_b)) denotes the match cost between a source patch centered at p(s_a) in I_1 and a target patch centered at p(s_b) in I_2. The match cost D(p(s_a), p(s_b)) is computed as the sum of absolute differences over the entire 128 dimensions of the SIFT features [4] at the matching positions. As in [18], the SIFT feature is extracted for every pixel on each level of the image pyramids with the same patch size of 8 × 8. N_a denotes the set of neighboring seeds of the current seed s_a, which includes the set N_s of spatially adjacent seeds and the set N_g of the nearest m (m = 8) seeds under the geodesic distance. We call this "extended nonlocal propagation"; it considers nonlocal motion information and can propagate good matches from nonlocal superpixels that have similar image appearance. Furthermore, it can effectively recover the motion of small-scale structures that vanish at coarser levels by propagating good matches from nonlocal similar superpixels along geodesic lines (or image edges) to the current seed, which results in a correct flow initialization for these seeds when they appear again at lower levels.

Fig. 2 shows an example of the proposed extended nonlocal propagation handling small-scale objects with large displacements. The marked region A in Fig. 2 contains a long, fine-scale object with a relatively large displacement, which often vanishes at coarser levels of the image pyramid; the corresponding flow field is visualized in the marked region B in Fig. 2. The current seed s_a lies on this fine-scale object, and


the set {s_1, s_2, ..., s_8} in Fig. 2 is the set of its nearest m (m = 8) seeds under the edge-aware distance. At the propagation step, Fig. 2 shows that when this fine-scale object appears again at lower levels, we can give a better initialization for the seed s_a by propagating good matches from these nonlocal similar superpixels along the geodesic line (or image edges), which explains why adding the nearest m (m = 8) seeds under the geodesic distance as neighbors enables our method to recover important motion details that often vanish at coarser levels.

After the propagation step, to avoid getting trapped in local minima, a random search is performed to find a better correspondence for the current seed s_a by trying several random guesses near the current match, that is,

F(s_a) = arg min_d D(p(s_a), p(s_a) + d),  d ∈ {F(s_a)} ∪ {F_i | F_i = F(s_a) + w α^i R_i},  (2)

where w is the maximum search radius, R_i is a uniform random sample in [−1, 1] × [−1, 1], and α = 0.5 is a fixed ratio between two consecutive search scopes. We examine displacements for i = 0, 1, 2, ... until the current search radius w α^i falls below one pixel. We perform propagation and random search iteratively in an interleaved manner. Moreover, the displacements of the seeds are examined in scan order on even iterations and in reverse scan order on odd iterations. In practice, this works well with only a fixed number of iterations n.

Based on the above analysis, we now discuss the advantages of the proposed segmentation-based PatchMatch: (1) Compared with previous pixel-based PatchMatch methods [14, 17, 18], our method takes superpixels as the basic units, which effectively extends the propagation range and avoids getting trapped in local minima, enhancing the robustness of the global regularization. (2) We represent each superpixel with a seed as the basic unit, which avoids unnecessary motion information propagation between the pixels in each superpixel and reduces computational redundancy, resulting in a significant acceleration due to the sparsity of the seeds. Although PatchMatch Filter [24] and SPM-BP [32] also use superpixel-based PatchMatch, they depend on expensive cost aggregation, which is still time-consuming. (3) In addition to using the spatially adjacent seeds as neighbors, our method adds the nearest m seeds under the edge-aware (geodesic) distance as neighbors, which can propagate better matches from nonlocal similar superpixels and give the current seed a better flow initialization. Compared with prior superpixel-based PatchMatch [24, 32], such a low-complexity scheme makes our method converge faster and handle small objects with large displacements better.

B. Coarse-to-fine PatchMatch with Sparse Seeds

After oversegmentation at the finest level of the image pyramids, we obtain a set of sparse seeds for the non-overlapping superpixels. Our goal is to find the best correspondences for these sparse seeds. However, like PatchMatch [14], our basic matching can yield many outliers caused by the matching ambiguity of small patches. In this paper, we propose a

novel coarse-to-fine PatchMatch with sparse seeds guided by oversegmentation, which incorporates an efficient hierarchical structure with the extended nonlocal propagation from top to bottom to handle this issue. As observed in [18], [21], the matching responses on the target image are more discriminative with relatively large reference image patches. As in [18], we compute the SIFT features with the same patch size of 8 × 8 on all levels of the image pyramids, so the patch size is relatively large at coarser levels, which guarantees robust matching at high levels.

In this section, we use a downsampling factor of η = 0.5 to construct image pyramids with k levels for both images I_1 and I_2. I_i^l denotes the l-th level of the image pyramid of I_i, i ∈ {1, 2}, l ∈ {0, 1, ..., k − 1}. The raw images I_1 and I_2 form the bottom levels of the pyramids, I_1^0 and I_2^0. After constructing the image pyramids, we use the robust matching results obtained at high levels to guide the matching process at lower levels, to handle the matching errors caused by the local matching ambiguity of small patches. Sparse seeds are constructed on each level of the image pyramids. To avoid losing important motion information, the number of seeds is kept the same on each level, preserving the same neighbor relationships as at the finest level. {s^l} is defined as the set of seeds at positions {p(s^l)} on the l-th level, where the position p(s^l) is obtained by taking the nearest integer after downsampling the raw seed position in I_1^0, that is,

p(s^l) = ⌊η^l · p(s^0) + 0.5⌋,  l ≥ 1,  (3)

where ⌊Z⌋ denotes the largest integer that does not exceed Z.

However, the traditional coarse-to-fine framework has a fundamental limitation in handling motion details, especially when the moving objects are small-scale structures with relatively large displacements. The main reason is that fine-scale structures do not always exist at coarser levels and only the motion of the large background is estimated, which results in an erroneous flow initialization for these tiny structures when they appear again at finer levels. Therefore, recovering these small-scale motion features is a challenging issue for the conventional coarse-to-fine framework. Although the proposed extended nonlocal propagation can give a good initialization for the current seed by propagating good motions from nonlocal similar superpixels, it is still easy to get trapped in local minima.

In our method, we obtain non-overlapping superpixels by oversegmentation at the finest level; some small-scale superpixels at the finest level, which often contain important small-scale motion features, may vanish at coarser levels. We use a seed as the representation of each superpixel, and the positions of the seeds are truncated to the nearest integers at coarser levels after downsampling. Therefore, some seeds at high levels may share the same position. Although these duplicated seeds make the propagation and random search more extensive at coarser levels and guarantee robust matching results at high levels, fine-scale motion features are lost because small-scale superpixels disappear at coarser levels. When the position of such a seed on a coarser level is mapped back to the original finest level, the mapped position is


not always the same as the location of the original seed. Furthermore, if the new mapped position and the original position of the seed at the finest level have different labels, they belong to different superpixels. In this case, it signals that some superpixel at the finest level has vanished at the coarser level. To detect this, we consider the seeds on two consecutive levels. Let s_a^l and s_a^{l−1} denote the seeds on the l-th and (l−1)-th levels obtained by downsampling the same seed s_a at the finest level; their positions are mapped back to the original finest level as

p′(s_a^l) = (1/η^l) · p(s_a^l),  l ≥ 1,  (4)

p′(s_a^{l−1}) = (1/η^{l−1}) · p(s_a^{l−1}),  l ≥ 2,  (5)

where p′(s_a^l) and p′(s_a^{l−1}) denote the mapped positions at the finest level of the seeds s_a^l and s_a^{l−1}, respectively. If the labels of the mapped positions p′(s_a^l) and p′(s_a^{l−1}) differ, that is, L(p′(s_a^l)) ≠ L(p′(s_a^{l−1})), the mapped positions belong to different superpixels and the superpixel represented by the seed s_a^{l−1} on the (l−1)-th level has vanished on the l-th level. In this case, we avoid losing important motion features by increasing the corresponding search radius of the seed s_a^{l−1} on the (l−1)-th level in the subsequent search step, which we call "adaptive random search". Because this situation mostly arises at higher levels due to the sparsity of the seeds, we only detect these sparse seeds at coarser levels, for high efficiency.

After constructing the pyramids and generating the seeds on each level, we perform the flow initialization with a random correspondence field on the coarsest level, and then improve it by iteratively performing the extended nonlocal propagation followed by a random search whose radius equals the maximum image dimension. Because the matching patches are relatively large on the coarsest level, matching is more robust there. Simultaneously, the extended nonlocal propagation expands the scope of propagation and obtains more robust matching results by propagating better matches from nonlocal similar superpixels. Furthermore, a search radius equal to the maximum image dimension ensures that we search for the best correspondence of every matching position over the whole image plane, which helps find globally optimal solutions and avoids getting trapped in local minima. Therefore, although we use a random flow initialization at the coarsest level, robust matching results are still guaranteed. After obtaining robust matches at the high levels, we use them to guide the matching process at lower levels. The obtained matches are used as the initialization of the seeds on the next lower level, that is,

F(s^{l−1}) = (1/η) · F(s^l),  1 ≤ l < k.  (6)

After the flow initialization, propagation followed by random search is performed iteratively to improve the matches

of the seeds on each level. The search radius with the maximum image dimension on the coarsest level can help us to find global optimal solutions. However, on lower levels, a small search radius around the propagated matching is selected to avoid finding a poor local optimum far from it. As in [6], we use adaptive search radius for the seeds on the low levels, which is set as the radius of the minimum circle containing all the initial matches of spatially adjacent seeds. When the seeds is beside the motion discontinuities, the neighboring seeds have different matches, and they generate a large search radius, which results in more reasonable matches can be found. If the neighboring seeds have consistent matches, they yield a relatively small search radius, which can effectively reduce computational cost. Compared with previous pixel-based PatchMatch methods [14, 18], we use coarse-to-fine PatchMatch with sparse seeds guided by oversegmentation instead of dense seeds. Because these sparse seeds have an approximately complete representation for different motions on the whole image plane, unlike dense seeds in previous methods, it reduces unnecessary motion information propagation while avoids losing important motion information, which can lead to a remarkable acceleration without a significant loss in accuracy. Although PatchMatch Filter [24] addresses the fundamental limitation of coarse-to-fine framework by adopting adaptive cross-scale consistency constraint to estimate a high-quality correspondence field at the full image scale, it compute flow field by handling discrete multi-labeling problem and its efficiency is greatly limited by the cost aggregation. Different from PatchMatch Filter [24], we introduce extend nonlocal propagation and adaptive random search to address this issue, and the subsequent experiments also show that our method significantly outperforms PatchMatch Filter [24] on both efficiency and accuracy.

C. Sparse-to-dense Matching

After performing coarse-to-fine PatchMatch with sparse seeds, we obtain a set of sparse matches. However, directly performing an edge-preserving interpolation [22] with sparse seeds loses important motion details, especially for small-scale structures with relatively large displacements, due to the lack of sufficient matches in each superpixel. In this paper, we handle this issue by introducing new seeds. Specifically, we generate dense seeds at the cross points of a regular image grid with a relatively dense interval of d_1 pixels in both the horizontal and vertical directions at the finest level, and adopt spatially adjacent seeds as neighbors. Based on the observation that the pixels in the same superpixel generally tend to have similar motions, we initialize the flow of a new seed with the match of an old seed that belongs to the same superpixel. After the flow initialization, we improve these matches by iteratively performing propagation followed by random search. Because the initial flow is already a good approximation of the true matches, only a few iterations are required to achieve convergence.
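A possible sketch of this sparse-to-dense initialization, under the assumption that labels is the per-pixel superpixel (or nearest-seed) label map and sparse_flow[k] is the match of the seed representing superpixel k; the function name is ours:

```python
import numpy as np

def init_dense_seeds(labels, sparse_flow, d1=3):
    """Place new seeds on a regular grid with spacing d1 and initialize each
    with the match of the old seed of the superpixel it falls into."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h:d1, 0:w:d1]                         # grid cross points
    dense_seeds = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (x, y) pairs
    dense_flow = sparse_flow[labels[dense_seeds[:, 1], dense_seeds[:, 0]]]
    return dense_seeds, dense_flow
```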


Algorithm 1: Segmentation-Based PatchMatch

Input: A pair of images I_1 and I_2, a set of sparse seeds S_0 with a large grid spacing d_0, and a set of dense seeds S_1 with a small grid spacing d_1.
Output: Dense matching correspondences M and the optical flow field F between I_1 and I_2.

• Construct the graph G for the sparse seeds S_0, and assign each pixel to the nearest seed using the geodesic distance.
• Construct image pyramids for both I_1 and I_2, and generate sparse seeds s^l for each level.
• Detect and mark the sparse seeds that vanish between adjacent levels at coarser levels of the image pyramids.
• for seeds s^l from s^{k−1} to s^0 do
    if l = k − 1 then
      • Random flow initialization;
      • Extended propagation followed by random search with the maximum image dimension.
    else
      • Flow initialization according to Equation (6);
      • Extended propagation and adaptive random search with adaptive radius based on Sec. III-B.
    end
  end
• Obtain robust matches M_0 for the sparse seeds S_0 by occlusion and outlier handling, and recover invalid matches with the matches of the nearest seeds (under the geodesic distance) that have valid matches.
• Obtain dense matches by sparse-to-dense matching with the sparse matches M_0 according to Sec. III-C.
• Obtain robust matches M_1 of S_1 by outlier handling for these dense matches.
• Obtain the flow F by an edge-preserving interpolation of the matches M = M_0 ∪ M_1 and variational refinement.
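The cross-level seed bookkeeping used in the algorithm (Eqs. (3)-(6)) can be sketched as follows; this is our illustrative reading, with seeds0 the (x, y) seed positions and labels0 the label map at the finest level, and positions clipped to the image bounds after mapping back:

```python
import numpy as np

def seed_positions(seeds0, eta, level):
    """Eq. (3): seed positions at a given pyramid level (nearest integers)."""
    return np.floor(eta ** level * seeds0 + 0.5).astype(int)

def vanished_seeds(seeds0, labels0, eta, level):
    """Eqs. (4)-(5): map the downsampled positions at levels l and l-1 back
    to the finest level and compare labels; True marks seeds whose superpixel
    vanishes at level l, so their search radius is enlarged at level l-1."""
    h, w = labels0.shape
    back = lambda p, l: np.clip((p / eta ** l).astype(int), 0, [w - 1, h - 1])
    p_l = back(seed_positions(seeds0, eta, level), level)
    p_lm1 = back(seed_positions(seeds0, eta, level - 1), level - 1)
    lab = lambda p: labels0[p[:, 1], p[:, 0]]
    return lab(p_l) != lab(p_lm1)

def upsample_flow(flow_l, eta=0.5):
    """Eq. (6): matches at level l initialize the seeds at level l-1."""
    return flow_l / eta
```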

D. Occlusion and Outlier Handling

In this paper, we use fixed positions on the image plane for both the sparse seeds and the dense seeds. However, some seeds may lie in occluded regions, where no corresponding match can be found. Furthermore, some seeds may be strongly influenced by noise in real-world scenes, which can produce erroneous matching results. To handle these problems, we adopt a forward-backward consistency check [17, 18, 24, 28] to detect the matching errors caused by occlusions and outliers. Specifically, we compute the forward matches F_f(·) from I_1 to I_2 and the backward matches F_b(·) from I_2 to I_1. For each seed s_a in I_1 and its forward match F_f(s_a), we obtain the corresponding matching position p(s_a) + F_f(s_a) in I_2. From this position, we obtain the corresponding seed label l = L(p(s_a) + F_f(s_a)) in I_2; the corresponding seed in I_2 is denoted s_l, with backward match F_b(s_l). For each F_f(s_a), we define consistent matches as

‖F_f(s_a) + F_b(s_l)‖² < τ,  (7)

where τ is a fixed threshold. After the check, we discard inconsistent matches.
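A hedged sketch of this forward-backward check (Eq. (7)); labels2 is assumed to map positions in I_2 to the indices of the seeds of I_2, and Ff and Fb are the forward and backward match fields:

```python
import numpy as np

def consistent_matches(seed_pos, Ff, labels2, Fb, tau=1.0):
    """Keep a forward match only if the backward match of the corresponding
    seed in I2 roughly cancels it (Eq. (7))."""
    h, w = labels2.shape
    ok = np.zeros(len(seed_pos), dtype=bool)
    for a, p in enumerate(seed_pos):
        q = np.round(p + Ff[a]).astype(int)          # matching position in I2
        if 0 <= q[0] < w and 0 <= q[1] < h:
            sl = labels2[q[1], q[0]]                 # seed label l in I2
            ok[a] = np.sum((Ff[a] + Fb[sl]) ** 2) < tau
    return ok                                        # False -> discard match
```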

After obtaining sparse matches from the coarse-to-fine PatchMatch, we perform a forward-backward consistency check to detect invalid matches (caused by occlusions and outliers) for the sparse seeds, which prevents error propagation at the sparse-to-dense matching stage. For the seeds with invalid matches, we recover the matches from the nearest seeds, under the edge-aware geodesic distance, that have valid matches. This is based on the observation that nonlocal similar superpixels tend to have similar motion, and it guarantees the robustness of the matching of the sparse seeds. After obtaining dense matches, a forward-backward consistency check is performed to detect occlusions and remove outliers for the dense seeds.

E. Interpolation and Refinement

After the occlusion and outlier handling of the dense seeds, some invalid matches are removed. To fill the gaps created by outlier handling and obtain a subpixel-resolution flow field, the edge-preserving interpolation and variational refinement proposed in [22] are employed to obtain the final flow.

IV. EFFICIENT IMPLEMENTATION BY APPROXIMATION

For an efficient implementation of the proposed algorithm, at the coarse-to-fine PatchMatch stage we generate sparse seeds at the cross points of a regular image grid with a relatively large interval of d_0 pixels. We oversegment the image with the SLIC algorithm, initializing the cluster centers with the positions of these sparse seeds, which guarantees at least one seed in each superpixel after oversegmentation. The density of the seeds can be controlled by adjusting d_0 and the parameters of the SLIC algorithm. Because we generate seeds at the cross points of the regular grid with a fixed interval rather than at the cluster centers of the superpixels, there is generally more than one seed in some superpixels after oversegmentation; this moderate redundancy makes the matching results more robust while having little effect on the speed. We can exploit this sparsity and redundancy by adjusting the parameter d_0, which results in a significant speed-up while yielding more robust matching results.

However, oversegmentation with the SLIC algorithm is still time-consuming and only locally aware of image edges. In our method, we replace oversegmentation by assigning each pixel to the nearest seed according to the shortest geodesic distance, which captures regions at the image scale with high efficiency; each pixel is thus assigned the label of its nearest seed. As with the labels obtained from oversegmentation, these labels are used as the cues to conduct the coarse-to-fine PatchMatch with sparse seeds and the sparse-to-dense matching. The main steps of the segmentation-based PatchMatch framework are summarized in Algorithm 1.

V. EXPERIMENTS

In this section, we evaluate our method on three public datasets: MPI-Sintel [12], Kitti [13], and Middlebury [1]. Each of them contains a training dataset with public


Fig. 3. Impact of the parameters d_0 and d_1 on our method with some image sequences from the MPI-Sintel training dataset. Panels (a)-(b): AEE (average endpoint error) and running time versus the grid spacing d_0 (with d_1 = 3); panels (c)-(d): AEE and running time versus the grid spacing d_1 (with d_0 = 11).

Fig. 4. Visual comparison of optical flow fields for the proposed method (SegFlow) using different values of the parameter d_0 on an image sequence from the MPI-Sintel training dataset [12]. Panels: (a) image sequence, (b) ground truth, (c)-(h) SegFlow with d_0 = 3, 5, 7, 9, 11, and 13.

ground-truth flow and a test dataset with hidden ground-truth flow for evaluation purposes.

• MPI-Sintel [12]: The MPI-Sintel dataset is obtained from an animated movie and contains very large displacements. It includes a clean version and a final version. The clean version contains significant occlusions and shading effects. The final version adds rendering effects, such as specular reflections, atmospheric effects, motion blur, and defocus blur.
• Kitti [13]: The Kitti flow 2015 dataset is created from a platform on a driving car, and some dynamic scenes come from the Kitti raw dataset [43]; it contains real-world image sequences with large displacements, different lighting conditions, a variety of materials, and non-Lambertian surfaces.
• Middlebury [1]: The Middlebury dataset is a classic benchmark for optical flow estimation, used for accurate optical flow estimation with relatively small displacements.

We use the constant parameter set {d_1, k, n} = {3, 5, 6} to evaluate our method on all the datasets. For different image sequences, it is advisable to choose an appropriate d_0 ∈ {3, 5, 7, 9, 11, 13, 15} to trade off time against accuracy. After obtaining robust matches, the edge-preserving interpolation and variational refinement proposed in [22] are employed to obtain the final flow. We run our method with an Intel Core [email protected] CPU on these datasets.

A. Parameter Sensitivity Analysis

In this section, we investigate the impact of different parameters, different clustering methods, different seed neighbors, and schemes with and without seed detection on the proposed algorithm.

Impact of different parameters. To investigate the influence of the parameters d_0 and d_1 on the proposed method, we run it on several image sequences from the MPI-Sintel training dataset. Fig. 3 shows that the parameters d_0 and d_1 have a great effect on the efficiency and accuracy of the proposed algorithm. The edge computation time and interpolation time are not included in the reported time. For the grid spacing d_0, a relatively small d_0 yields dense seeds. Although dense seeds can guarantee more robust matching results, they cause information redundancy, which does not improve the accuracy noticeably while significantly increasing the computational complexity. A relatively large d_0 produces overly sparse seeds; although these improve the efficiency, they often lose important motion information and reduce the matching precision. We can exploit this sparsity and redundancy by adjusting the parameter d_0. For these image sequences from the MPI-Sintel training dataset, Fig. 3(a) and Fig. 3(b) show that choosing an appropriate d_0 (d_0 = 11) results in a significant speed-up while yielding relatively robust matching results. For the grid spacing d_1, Fig. 3(c) and Fig. 3(d) show that a small grid spacing results in high computational complexity, especially when there is no spacing (d_1 = 1). We find that choosing d_1 = 3


Fig. 5. An example of convergence curves for each level of coarse-to-fine PatchMatch with sparse seeds on the temple image sequence (d_0 = 11). Each panel (sparse seeds, levels 4 down to 0) plots the update ratio against the number of iterations (1 to 6).

Fig. 6. An example of convergence curves and visual convergence maps for sparse-to-dense matching at the finest level on the temple image sequence (dense seeds, level 0: update ratio versus the number of iterations).

can lead to a significant speed-up while hardly losing accuracy. Because the selection of the parameter d_0 plays an important role in the accuracy and efficiency of our method, we also provide a visual comparison of the optical flow fields produced with different values of d_0 in Fig. 4, which shows that when d_0 increases within a certain range, d_0 ∈ {3, 5, 7, 9}, there is almost no loss in accuracy. A relatively large d_0 (d_0 = 11) may cause the loss of some important motion details. However, a relatively small d_0 does not always perform better than a larger one: for this image sequence, our method with d_0 = 7 and d_0 = 9 performs better than with d_0 = 3 and d_0 = 5 in the marked region E, and d_0 = 13 also performs better than d_0 = 11. The main reason may be that a relatively large d_0 enhances the robustness of the global regularization. Therefore, for different image sequences, it is advisable to choose an appropriate d_0 to trade off time against accuracy.

To evaluate the convergence speed of our method (SegFlow, d_0 = 11, d_1 = 3), we investigate the relationship between the update ratio and the number of iterations for each level, covering both the coarse-to-fine PatchMatch with sparse seeds and the sparse-to-dense matching stages. The update ratio is defined as the percentage of seeds whose values change in an iteration; when it tends to a very small constant, it indicates convergence. We give an example of the convergence speed of our method on two consecutive images from the MPI-Sintel dataset. Fig. 5 shows that our method converges within only a few (about 4) iterations for each level of the coarse-to-fine PatchMatch with sparse seeds, and Fig. 6 shows that only a few (about 3 to 4) iterations are required for the sparse-to-dense matching at the finest level. Therefore, as the convergence criterion, we set a maximum number of iterations n = 6 and declare convergence once the update ratio falls below 0.05.
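This convergence criterion can be sketched as follows, where step is assumed to run one propagation-plus-search pass over all seeds and return the updated matches:

```python
import numpy as np

def run_until_converged(flow, step, n=6, min_ratio=0.05):
    """Iterate propagation+search passes until the update ratio (fraction of
    seeds whose match changed) drops below min_ratio, capped at n passes."""
    for it in range(n):
        new_flow = step(flow.copy(), it)        # one propagation + search pass
        ratio = np.mean(np.any(new_flow != flow, axis=1))
        flow = new_flow
        if ratio < min_ratio:
            break
    return flow
```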

TABLE I
COMPARISON OF DIFFERENT CLUSTERING METHODS

                      Pixel Clustering
Datasets              SLIC      Geodesic
MPI-Sintel [12]       3.543     3.514
Kitti (2015) [13]     7.835     7.782
Middlebury [1]        0.368     0.367
Time (MPI-Sintel)     ~5.6s     ~4.1s

Impact of different clustering methods. In this paper, we further improve the efficiency and accuracy of our method by an efficient approximation of oversegmentation, which adopts a pixel clustering that assigns each pixel to the closest seed under the geodesic distance and uses the corresponding labels instead of the labels obtained from oversegmentation. For the pixel clustering, we assign each pixel to a seed with the corresponding label; these labels are used as the cues to conduct the coarse-to-fine PatchMatch with sparse seeds and the sparse-to-dense matching. We investigate two schemes for the pixel clustering in the proposed method: (1) the SLIC method, which generates superpixels by oversegmentation with the SLIC algorithm [5]; and (2) the geodesic method, which assigns each pixel to the seed with the shortest geodesic


Fig. 7. Visual comparison of optical flow fields for different methods on an image sequence from the MPI-Sintel training dataset [12] (d_0 = 11). Panels: (a) image sequence, (b) ground truth, (c) Classic+NL [23], (d) MDP-Flow2 [3], (e) PMF [24], (f) DeepFlow [21], (g) EpicFlow [22], (h) SegFlow.

TABLE II
IMPACT OF DIFFERENT SEED NEIGHBORS

                      Seed Neighbors
Datasets              S         S + G
MPI-Sintel [12]       3.587     3.514
Kitti (2015) [13]     7.893     7.782
Middlebury [1]        0.369     0.367

TABLE III
IMPACT OF SEED DETECTION

                      Seed Detection
Datasets              Without   With
MPI-Sintel [12]       3.532     3.514
Kitti (2015) [13]     7.813     7.782
Middlebury [1]        0.368     0.367

distance from it. We evaluate their impact on our method with image sequences from the three public optical flow datasets, respectively. The evaluation results, shown in Table I, indicate that pixel clustering using geodesic distances performs slightly better than oversegmentation with the SLIC algorithm [5] while achieving higher efficiency. The reason is that oversegmentation with the SLIC algorithm is only locally aware of image edges, whereas geodesic distances capture regions at the image scale.

Impact of seed neighbors and seed detection. In this paper, we introduce an extended nonlocal propagation and adaptive random search to address the basic limitation of the traditional coarse-to-fine framework in handling motion details that often vanish at coarser levels. We investigate the impact of different seed neighbors and of seed detection on the proposed method. S denotes propagation using only spatially adjacent seeds; S + G denotes propagation using both the spatially adjacent seeds and the geodesic-distance seeds. For the seed neighbors, the extended nonlocal propagation that adds the closest m (m = 8) seeds under the geodesic distance as neighbors considers nonlocal motion information, which can effectively recover the motion of small-scale structures that vanish at coarser levels by propagating good matches from nonlocal similar superpixels along image edges to the current seed. Table II shows that the extended propagation that considers both the spatially adjacent seeds and the closest m (geodesic-distance) seeds as neighbors outperforms propagation using spatially adjacent seeds alone.

For the seed detection, we introduce the adaptive random search, which avoids losing important motion features by detecting the seeds that vanish between adjacent levels and increasing the corresponding search radius in the search step. Table III shows that our method with seed detection performs slightly better than without it, which demonstrates the validity of the proposed scheme. Since our method uses an adaptive search radius for the seeds on each level (except the top level), we ignore seed detection in the subsequent experiments to further improve efficiency. The visual comparison in Fig. 7 shows that our method, with the extended nonlocal propagation and adaptive random search, preserves important motion details better in the regions F, which contain fine-scale structures with relatively large displacements, than some state-of-the-art methods, such as Classic+NL [23], MDP-Flow2 [3], PMF [24], DeepFlow [21], and EpicFlow [22].

vanish at coarser levels by propagating good matches from nonlocal nonlocal similar superpixels along image edges for the current seed. Table II shows that the extended propagation that considers both spatially adjacent seeds and the closest m (geodesic distance) seeds as seed neighbors outperforms that using only spatially adjacent seeds alone. For the seed detection, we introduce adaptive random search, which avoids losing important motion features by detecting seeds that vanish between adjacent levels and increasing the corresponding search radius in the search step. Table III shows our method with seed detection performs slightly better than it without seed detection, which demonstrates the validity of the proposed scheme. In our method we use adaptive search radius for the seeds on each level (except for the top level), therefore we can ignore seed detection during the subsequent experiments to further improve the efficiency of the proposed method. Visual comparison in Fig. 7 shows that our method with an extended nonlocal propagation and adaptive random search achieves better performance in preserving important motion details in regions F that have fine-scale structures with relatively large displacements than some state-of-the-art methods, such as Classic+NL [23], MDP-Flow2 [3], PMF [24], DeepFlow [21], and EpicFlow [22]. B. Evaluation Results on Datasets We firstly evaluate our method (SegFlow) on the MPI-Sintel test datasets and submit our evaluation results to the website of this dataset: http://sintel.is.tue.mpg.de/results. The evaluation results in Table IV show that SegFlow (d0 = 3) achieves better performance on the clean pass than some state-of-the-art methods, such as MDP-Flow2 [3], EpicFlow [22], CPM-Flow [18], and FlowFields [17]. Although SegFlow (d0 = 11) has a drop in accuracy, it still achieves significant improvements on both efficiency and accuracy over pixel based PatchMatch methods, such as EPPM [16] and NNFLocal [15], and suberpixel based PatchMatch methods, such as PMF [24] and SPM-BP [32]. It runs about 4.2s for an MPISintel image pair with the resolution 1024 × 436 pixels on an Intel Core [email protected] CPU, which significantly faster than


Fig. 8. Visual comparison of optical flow fields for different methods on the image sequence from the MPI-Sintel test dataset [12] (d_0 = 11); the marked regions A and B highlight detail differences. Panels: (a) image sequence, (b) ground truth, (c) MDP-Flow2 [3], (d) EPPM [16], (e) PMF [24], (f) EpicFlow [22], (g) CPM-Flow [18], (h) SegFlow.

TABLE IV
EVALUATION RESULTS OF DIFFERENT METHODS ON THE MPI-SINTEL TEST DATASET

                        Clean Pass                    Final Pass
Methods                 EPE all EPE noc EPE occ       EPE all EPE noc EPE occ   Time
SegFlow (d_0 = 3)       3.356   1.033   22.322        6.726   3.200   35.458    ~6.2s
DCFlow [26]             3.537   1.103   23.394        5.119   2.283   28.228    —
RicFlow [27]            3.550   1.264   22.220        5.620   2.765   28.907    5s
CPM-Flow [18]           3.557   1.189   22.889        5.960   2.990   30.117    4.3s
DiscreteFlow [20]       3.567   1.108   23.626        6.007   2.937   31.685    —
FullFlow [28]           3.601   1.296   22.424        5.895   2.838   30.793    —
FlowFields [17]         3.748   1.056   25.700        5.810   2.621   31.799    18s
FlowFieldsCNN [33]      3.778   0.996   26.469        5.363   2.303   30.313    —
SegFlow (d_0 = 11)      3.862   1.132   26.141        6.423   2.995   34.361    ~4.1s
FlowNet2 [31]           3.959   1.468   24.294        6.016   2.977   30.807    0.12s (GPU)
EpicFlow [22]           4.115   1.360   26.595        6.285   3.060   32.564    16.4s
SegFlow (d_0 = 15)      4.151   1.246   27.855        6.191   2.940   32.682    ~4s
SegFlow (d_0 = 19)      4.893   1.570   31.973        7.071   3.380   37.136    ~3.95s
SPM-BP [32]             5.202   1.815   32.839        7.325   3.493   38.561    —
DeepFlow [21]           5.377   1.771   34.751        7.212   3.336   38.781    19s
PMF [24]                5.378   1.858   34.102        7.630   3.607   40.435    39s
NNF-Local [15]          5.386   1.397   37.896        7.249   2.973   42.008    —
MDP-Flow2 [3]           5.837   1.869   38.158        8.445   4.150   43.430    754s
EPPM [16]               6.494   2.675   37.632        8.377   4.286   41.695    0.25s (GPU)
LDOF [2]                7.563   3.432   41.170        9.116   5.037   42.344    30s
Classic+NL [23]         7.961   3.770   42.079        9.153   4.814   44.509    —

EPE all: endpoint error over the complete frames. EPE noc: endpoint error over regions that remain visible in adjacent frames. EPE occ: endpoint error over regions that are visible only in one of two adjacent frames.

most of the state-of-the-art methods. Although FlowNet2 [31] runs in about 0.12 s, it depends on powerful GPU hardware. Visual comparisons of different methods are shown in Fig. 8, which illustrates the detail improvements of our method (SegFlow, d_0 = 11) over some state-of-the-art methods, such as MDP-Flow2 [3], which uses an extended coarse-to-fine framework with sparse matches; EPPM [16], which uses an edge-preserving PatchMatch for optical flow estimation; PMF [24], which uses superpixel-based PatchMatch; EpicFlow [22], which uses relatively dense matches obtained from DeepMatching [21]; and

CPM [18], which applies coarse-to-fine PatchMatch to dense seeds directly. Benefiting from the advantages of each step in the proposed segmentation-based PatchMatch framework, our method performs better in preserving sharp flow edges and important motion details, especially for small-scale structures with relatively large displacements.

The evaluation results in Table IV also show that our method (SegFlow) does not perform very well on the final pass, which indicates that it is less suited to the challenging image sequences containing rendering effects,


TABLE V
RESULTS ON THE KITTI (2015) TEST DATASET

Methods              Fl-all   Fl-bg   Fl-fg   Time
DCFlow [26]          14.86    13.10   23.70   8.6s
FlowFieldsCNN [33]   18.68    18.33   20.42   23s
FlowFields [17]      19.80    19.51   21.26   28s
DiscreteFlow [20]    21.57    21.53   21.76   3min
SegFlow (d0 = 3)     22.46    22.21   23.72   ∼6.4s
FullFlow [28]        23.37    23.09   24.79   4min
SPM-BP [32]          24.21    24.06   24.97   10s
EpicFlow [22]        26.29    25.81   28.69   15s
SegFlow (d0 = 11)    27.91    28.97   22.64   ∼4.2s
DeepFlow [21]        28.48    27.96   31.06   17s
SGM+C+NL [23]        35.61    34.24   42.46   4.5min
SGM+LDOF [2]         39.33    40.81   31.92   86s
HS [7]               41.81    39.90   51.39   2.6min
DB-TV-L1 [25]        47.64    47.52   58.27   16s

Fl-all : percentage of outliers over all the pixels.
Fl-bg  : percentage of outliers over the background.
Fl-fg  : percentage of outliers over the foreground.
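The Fl scores follow the standard KITTI 2015 protocol: a pixel counts as an outlier when its endpoint error exceeds both 3 px and 5% of the ground-truth flow magnitude. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def fl_outlier_rate(flow_est, flow_gt, region_mask=None):
    """Percentage of flow outliers in the KITTI 2015 sense.

    A pixel is an outlier if its endpoint error is > 3 px AND > 5% of the
    ground-truth flow magnitude. Pass a boolean region_mask to restrict the
    statistic to the background (Fl-bg) or foreground (Fl-fg) pixels.
    """
    epe = np.linalg.norm(flow_est - flow_gt, axis=2)
    mag = np.linalg.norm(flow_gt, axis=2)
    outlier = (epe > 3.0) & (epe > 0.05 * mag)
    if region_mask is not None:
        return 100.0 * outlier[region_mask].sum() / region_mask.sum()
    return 100.0 * outlier.sum() / outlier.size
```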

The evaluation results in Table IV also show that our method (SegFlow) does not perform very well on the final pass, which indicates that it is not well suited to challenging image sequences that contain rendering effects such as specular reflections, atmospheric effects, motion blur, and defocus blur. However, our method with a relatively large d0 (d0 = 11) performs better on the final pass than with a relatively small d0 (d0 = 3), which indicates that its accuracy on these challenging sequences can be improved by choosing a relatively large d0, which enhances the robustness of the global regularization. Even with a large d0 = 15, our method still performs better on the final pass than with d0 = 3 or d0 = 11. Of course, a very large d0 (d0 = 19) leads to a significant drop in accuracy, because overly sparse seeds lose important motion information. For a given image sequence, it is therefore expedient to choose an appropriate d0 that trades off running time against accuracy. We also evaluate our method on the Kitti flow 2015 test dataset. Table V compares the methods published on the website of this dataset (http://www.cvlibs.net/datasets/kitti/eval_flow.php). As we can see, our method (SegFlow, d0 = 3) achieves better performance than some state-of-the-art methods, such as SGM+LDOF [2], SGM+C+NL [23], DeepFlow [21], EpicFlow [22], SPM-BP [32], and FullFlow [28]. Although the accuracy of SegFlow (d0 = 11) is somewhat reduced, it runs significantly faster than most of the optical flow methods published on this website. Although the proposed method handles large displacements very well, it often fails on the motions of some non-rigid moving objects, such as fluid-like images. We therefore also evaluate our method on the Crowd Segmentation dataset [49], which contains videos of crowds and other high-density moving objects. Fig. 9 shows that the proposed method does not perform very well on these challenging sequences when only interpolation is used, without variational refinement. The reason is that our method uses a SIFT-based feature descriptor to compute the matching cost, which is usually unsuitable for complex videos with fluid phenomena such as rivers, oceans, smoke, dense crowds, heavy traffic jams, flocks of birds, schools of fish, etc. Fig. 9 also shows that our method achieves good performance when variational refinement is applied after interpolation. Therefore, variational refinement is required as a post-processing step for our method to handle such sequences.
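Schematically, the two variants compared in Fig. 9 differ only in this final stage. The sketch below lays out the pipeline as we read it from the paper; all stage functions are caller-supplied placeholders (e.g., the sparse_seeds sketch above for seed_fn), not a reference implementation of the authors' code.

```python
def segflow_pipeline(img1, img2, d0, seed_fn, match_fn, interp_fn, refine_fn=None):
    """Schematic SegFlow pipeline.

    seed_fn   : oversegmentation-guided sparse seed generation,
    match_fn  : coarse-to-fine PatchMatch on the sparse seeds,
    interp_fn : sparse-to-dense matching guided by the oversegmentation,
    refine_fn : optional variational refinement, needed for the fluid-like
                sequences of Fig. 9.
    """
    seeds, labels = seed_fn(img1, d0)      # sparse seeds from oversegmentation
    matches = match_fn(img1, img2, seeds)  # sparse matches
    flow = interp_fn(matches, labels)      # dense flow at the finest level
    if refine_fn is not None:
        flow = refine_fn(img1, img2, flow)  # post-processing refinement
    return flow
```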

VI. CONCLUSION

In this paper, we have presented a simple but efficient segmentation-based PatchMatch framework for large displacement optical flow estimation. It yields sparse seeds guided by oversegmentation, and uses the cues of the oversegmentation to conduct coarse-to-fine PatchMatch and sparse-to-dense matching. By performing an efficient approximation for oversegmentation, the proposed algorithm runs significantly faster than most state-of-the-art methods while preserving important motion details. However, our method performs worse than the current state-of-the-art methods that use Convolutional Neural Network (CNN)-based feature descriptors [44], [45], [33], which outperform most traditional matching descriptors based on hand-crafted features, such as the SIFT flow descriptor [4] used in our method. In follow-up work, we will therefore improve our method with a CNN-based descriptor and obtain a well-trained model by selecting an appropriate network structure and training samples.

ACKNOWLEDGMENTS

This project was supported by the National Natural Science Foundation of China (U1611461, 61573387, 61203249, 61672544), the Guangdong Natural Science Foundation (2015A030313439, 2015A030311047), the Shenzhen Innovation Program (No. JCYJ 20150401145529008), and the 2014 Microsoft Research Asia Collaborative Research Program (FY15-RES-THEME-037).

REFERENCES

[1] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," Int. J. Comput. Vis., vol. 92, no. 1, pp. 1-31, 2011.
[2] T. Brox and J. Malik, "Large displacement optical flow: Descriptor matching in variational motion estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 500-513, Mar. 2011.
[3] L. Xu, J. Jia, and Y. Matsushita, "Motion detail preserving optical flow estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 9, pp. 1744-1757, 2012.
[4] C. Liu, J. Yuen, and A. Torralba, "SIFT flow: Dense correspondence across scenes and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978-994, 2010.
[5] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274-2282, 2012.
[6] R. Song, Y. Hu, and Y. Li, "Coarse-to-fine PatchMatch for dense correspondence," IEEE Trans. Circuits Syst. Video Technol., doi: 10.1109/TCSVT.2017.2720175.
[7] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, no. 1-3, pp. 185-203, Aug. 1981.
[8] A. Bruhn, J. Weickert, and C. Schnörr, "Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods," Int. J. Comput. Vis., vol. 61, no. 3, pp. 211-231, 2005.
[9] A. Bruhn and J. Weickert, "A multigrid platform for real-time motion computation with discontinuity-preserving variational methods," Int. J. Comput. Vis., vol. 70, no. 3, pp. 257-277, 2006.
[10] H. Zimmer, A. Bruhn, and J. Weickert, "Optic flow in harmony," Int. J. Comput. Vis., vol. 93, no. 3, pp. 368-388, 2011.



Fig. 9. Optical flow fields for our method on the challenging image sequences [49]. From top to bottom: Image sequences, SegPM + Interpolation flow, and SegPM + Interpolation + Refinement flow. The evaluation results are visualized with both vectors and color-coded flow fields.
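Color-coded flow fields of the kind shown in Fig. 9 can be produced with a standard HSV encoding, in which hue encodes flow direction and value encodes magnitude; the sketch below is a common alternative to the Middlebury color wheel, not the exact coding used in the figure.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_color(flow, max_mag=None):
    """Encode a (H, W, 2) flow field as an RGB image:
    hue = flow direction, value = normalized flow magnitude."""
    u, v = flow[..., 0], flow[..., 1]
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = (np.arctan2(v, u) + np.pi) / (2 * np.pi)   # direction mapped to [0, 1]
    if max_mag is None:
        max_mag = mag.max() + 1e-8                   # avoid division by zero
    hsv = np.stack([ang, np.ones_like(ang), np.clip(mag / max_mag, 0, 1)], axis=-1)
    return hsv_to_rgb(hsv)
```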

[11] J. Chen, Z. Cai, J. Lai, and X. Xie, "Fast optical flow estimation based on the split Bregman method," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 3, pp. 664-678, Mar. 2018.
[12] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 611-625.
[13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231-1237, 2013.
[14] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, "PatchMatch: A randomized correspondence algorithm for structural image editing," in Proc. ACM SIGGRAPH, 2009.
[15] Z. Chen, H. Jin, Z. Lin, S. Cohen, and Y. Wu, "Large displacement optical flow from nearest neighbor fields," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013.
[16] L. Bao, Q. Yang, and H. Jin, "Fast edge-preserving PatchMatch for large displacement optical flow," IEEE Trans. Image Process., vol. 23, no. 12, pp. 4996-5006, 2014.
[17] C. Bailer, B. Taetz, and D. Stricker, "Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 4015-4023.
[18] Y. Hu, R. Song, and Y. Li, "Efficient coarse-to-fine PatchMatch for large displacement optical flow," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[19] K. He and J. Sun, "Computing nearest-neighbor fields via propagation-assisted kd-tree," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 111-118.
[20] M. Menze, C. Heipke, and A. Geiger, "Discrete optimization for optical flow," in Proc. German Conf. Pattern Recognit., 2015, pp. 16-28.
[21] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, "DeepFlow: Large displacement optical flow with deep matching," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1385-1392.
[22] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, "EpicFlow: Edge-preserving interpolation of correspondences for optical flow," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1164-1172.
[23] D. Sun, S. Roth, and M. J. Black, "Secrets of optical flow estimation and their principles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2432-2439.

[24] J. Lu, Y. Li, H. Yang, D. Min, W. Eng, and M. N. Do, "PatchMatch filter: Edge-aware filtering meets randomized search for visual correspondence," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1866-1879, 2017.
[25] C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime TV-L1 optical flow," in Proc. 29th DAGM Symp. Pattern Recognition, 2007, pp. 214-223.
[26] J. Xu, R. Ranftl, and V. Koltun, "Accurate optical flow via direct cost volume processing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5807-5815.
[27] Y. Hu, Y. Li, and R. Song, "Robust interpolation of correspondences for large displacement optical flow," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[28] Q. Chen and V. Koltun, "Full flow: Optical flow estimation by global optimization over regular grids," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4706-4714.
[29] A. S. Wannenwetsch, M. Keuper, and S. Roth, "ProbFlow: Joint optical flow and uncertainty estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1182-1191.
[30] R. Ranftl, K. Bredies, and T. Pock, "Non-local total generalized variation for optical flow estimation," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 439-454.
[31] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1647-1655.
[32] Y. Li, D. Min, M. S. Brown, M. N. Do, and J. Lu, "SPM-BP: Sped-up PatchMatch belief propagation for continuous MRFs," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4006-4014.
[33] C. Bailer, K. Varanasi, and D. Stricker, "CNN-based patch matching for optical flow with thresholded hinge embedding loss," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2710-2719.
[34] J. Wulff, L. Sevilla-Lara, and M. J. Black, "Optical flow in mostly rigid scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6911-6920.
[35] Y. Yang and S. Soatto, "S2F: Slow-to-fast interpolator flow," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3767-3776.
[36] J. Hur and S. Roth, "MirrorFlow: Exploiting symmetries in joint optical flow and occlusion estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 312-321.



[37] S. N. Ravi, Y. Xiong, L. Mukherjee, and V. Singh, "Filter flow made practical: Massively parallel and lock-free," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5009-5018.
[38] T. Schuster, L. Wolf, and D. Gadot, "Optical flow requires multiple strategies (but only one network)," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6921-6930.
[39] S. Wang, S. R. Fanello, C. Rhemann, S. Izadi, and P. Kohli, "The global patch collider," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 127-135.
[40] D. Gadot and L. Wolf, "PatchBatch: A batch augmented loss for optical flow," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[41] J. Yang and H. Li, "Dense, accurate optical flow estimation with piecewise parametric model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1019-1027.
[42] J. Wulff and M. J. Black, "Efficient sparse-to-dense optical flow estimation using a learned basis and layers," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 120-130.
[43] J. M. Perez-Rua, T. Crivelli, P. Bouthemy, and P. Perez, "Determining occlusions from space and time image reconstructions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1382-1391.
[44] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, "Discriminative learning of deep convolutional feature point descriptors," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 118-126.
[45] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4353-4361.
[46] D. Sun, E. B. Sudderth, and M. J. Black, "Layered image motion with explicit occlusions, temporal consistency, and depth ordering," in Proc. Advances in Neural Information Processing Systems, 2010, pp. 2226-2234.
[47] D. Sun, C. Liu, and H. Pfister, "Local layering for joint motion estimation and occlusion detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1098-1105.
[48] D. Fortun, P. Bouthemy, and C. Kervrann, "Aggregation of local parametric candidates with exemplar-based occlusion handling for optical flow," Comput. Vis. Image Understand., vol. 145, pp. 81-94, Apr. 2016.
[49] S. Ali and M. Shah, "A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.
[50] L. Bao, Q. Yang, and H. Jin, "Fast edge-preserving PatchMatch for large displacement optical flow," IEEE Trans. Image Process., vol. 23, no. 12, pp. 4996-5006, 2014.
[51] J. Lu, Y. Li, H. Yang, D. Min, W. Eng, and M. N. Do, "PatchMatch filter: Edge-aware filtering meets randomized search for visual correspondence," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1866-1879, 2017.
[52] H. Sakaino, "Motion estimation for dynamic texture videos based on locally and globally varying models," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3609-3623, Nov. 2015.
[53] H. Sakaino, "Fluid motion estimation method based on physical properties of waves," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1-8.
[54] M. Thomas, C. Kambhamettu, and C. A. Geiger, "Motion tracking of discontinuous sea ice," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 12, pp. 5064-5079, Dec. 2011.
[55] C. Cassisa, S. Simoens, V. Prinet, and L. Shao, "Sub-grid physical optical flow for remote sensing of sandstorm," in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2010, pp. 2230-2233.
[56] T. Corpetti, É. Mémin, and P. Pérez, "Dense estimation of fluid flows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 365-380, Mar. 2002.
[57] A. Doshi and A. G. Bors, "Robust processing of optical flow of fluids," IEEE Trans. Image Process., vol. 19, no. 9, pp. 2332-2344, Sep. 2010.
[58] P. Héas, C. Herzet, É. Mémin, D. Heitz, and P. D. Mininni, "Bayesian estimation of turbulent motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 6, pp. 1343-1356, Jun. 2013.
[59] J. Diaz, E. Ros, F. Pelayo, E. M. Ortigosa, and S. Mota, "FPGA-based real-time optical-flow system," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 2, pp. 274-279, 2006.
[60] K. Pauwels, M. Tomasi, J. D. Alonso, E. Ros, and M. M. V. Hulle, "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features," IEEE Trans. Comput., vol. 61, no. 7, pp. 999-1012, 2012.
[61] G. Botella, A. Garcia, M. Rodriguez-Alvarez, E. Ros, and U. Meyer-Baese, "Robust bioinspired architecture for optical-flow computation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 4, pp. 616-629, 2010.
[62] G. Botella, U. Meyer-Baese, A. García, and M. Rodríguez, "Quantization analysis and enhancement of a VLSI gradient-based motion estimation architecture," Digit. Signal Process., vol. 22, no. 6, pp. 1174-1187, 2012.

Jun Chen received the M.S. degree in communication and information system from Shantou University in 2014, and the Ph.D. degree in information and communication engineering from Sun Yat-sen University, Guangdong, China, in 2018. He is currently a distinguished research fellow with the Guangdong Academy of Research on VR Industry, Foshan University. His research interests are in computer vision, image processing, pattern recognition, multiple target tracking, and optical flow.

Zemin Cai received the M.S. degree in information and computing science in 2006, and the Ph.D. degree in applied mathematics from Sun Yat-sen University, China, in 2009. He is currently an associate professor with the School of Engineering, Shantou University. His current research interests are in computer vision, digital image processing, and image-based measurement techniques.

Jianhuang Lai received the Ph.D. degree in mathematics from Sun Yat-sen University, China, in 1999. He joined Sun Yat-sen University in 1989 as an assistant professor, where he is currently a Professor with the School of Data and Computer Science. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, multiple target tracking, and wavelets and their applications. He has published over 100 scientific papers in international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TKDE, IEEE TIP, IEEE TSMC (Part B), IEEE TCSVT, Pattern Recognition, ICCV, CVPR, ECCV, and ICDM. He serves as a standing member of the Image and Graphics Association of China and as a standing director of the Image and Graphics Association of Guangdong. He is a senior member of the IEEE.



Xiaohua Xie received the B.S. degree in mathematics and applied mathematics from Shantou University in 2005, and the M.S. degree in information and computing science in 2007 and the Ph.D. degree in applied mathematics in 2010, both from Sun Yat-sen University, China (jointly supervised with Concordia University, Canada). He is currently a Research Professor at Sun Yat-sen University (SYSU). Prior to joining SYSU, he was an Associate Professor at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences. His current research fields cover image processing, computer vision, pattern recognition, and computer graphics. He has published more than a dozen papers in prestigious international journals and conferences. He is recognized as Overseas High-Caliber Personnel (Level B) in Shenzhen, China.
