IEEE International Conference on Multimedia and Expo (ICME)

UNSUPERVISED LEARNING OF DEPTH AND EGO-MOTION WITH SPATIAL-TEMPORAL GEOMETRIC CONSTRAINTS

Anjie Wang1#, Yongbin Gao1#, Zhijun Fang1, Xiaoyan Jiang1, Shanshe Wang2, Siwei Ma2, Jenq-Neng Hwang3

1 College of Electronic and Electrical Engineering, Shanghai University of Engineering Science
2 School of Electrical Engineering and Computer Science, Peking University
3 Department of Electrical Engineering, University of Washington

# These authors contributed equally to this work. Corresponding authors: Zhijun Fang ([email protected]) and Xiaoyan Jiang ([email protected]). This work was supported by the National Natural Science Foundation of China (61831018, 61802253, 61772328).

ABSTRACT

In this paper, we propose an unsupervised joint deep learning pipeline for depth and ego-motion estimation that explicitly incorporates traditional spatial-temporal geometric constraints. The stereo reconstruction error provides the spatial geometric constraint needed to estimate depth at absolute scale. Meanwhile, the depth map with absolute scale and a pre-trained pose network serve as a good starting point for direct visual odometry (DVO), resulting in fine-grained ego-motion estimation and additional back-propagation signals for the depth estimation network. The proposed joint training pipeline enables an iterative coupling optimization process for accurate depth and precise ego-motion estimation. Experimental results show state-of-the-art performance for monocular depth and ego-motion estimation on the KITTI dataset, as well as strong generalization ability of the proposed approach.

Index Terms— Depth, ego-motion, geometric constraint, visual odometry

1. INTRODUCTION

In recent years, supervised deep learning methods have shown great power in dense depth estimation [1][2][3][4]. However, supervised methods need a vast amount of training images with manually annotated ground-truth depth, and such data are expensive and impracticable to acquire. To alleviate this problem, many unsupervised learning frameworks have been proposed for depth estimation. Zhou et al. [5] trained a depth network and a pose network from monocular sequences and obtained promising performance in depth and ego-motion estimation by using the photometric error in videos as the supervisory signal, but the scale of the output depth is ambiguous, as shown in Fig. 1: the lack of absolute scale of the scene results in different estimated depths of the voxel $P$ for varying displacements $d$. Thus, the projected pixels in the image plane may have equal coordinates under different pose scales in structure-from-motion methods for monocular cameras. Some work [6][7][8] has been proposed to improve the results, but the absolute scale is still not available. Stereo images are typically used to recover the absolute scale: Garg et al. [9] proposed an unsupervised deep learning method using stereo image pairs captured by two cameras with a known baseline, and [10][11][12] also adopt stereo image pairs to recover the absolute scale and solve the scale ambiguity issue.

Fig. 1. Scale ambiguity in monocular vision: different displacement values $d$ result in distinct depth estimates of the voxel $P$.

It is worth noting that deep-learning-based estimation of depth and ego-motion is well suited for dense reconstruction thanks to its semantic learning ability. However, training with massive data intrinsically learns the latent transformation between frames rather than direct geometry. In past decades, model-based and geometry-based VO approaches have been studied extensively. Steinbruecker et al. [13] propose a DVO method that models the relationship between input dense depth maps and output pose predictions; the target is to minimize the pixel-level photometric warp error over sequential RGB-D images. Compared with VO approaches using sparse features, DVO achieves better accuracy when the motion between frames is small, since it uses all the dense information.

Inspired by traditional geometry and deep learning methods, we propose a joint unsupervised learning pipeline for depth and ego-motion estimation that integrates deep learning with spatial-temporal geometric constraints. Fig. 2 shows the framework of the proposed method.


Fig. 2. Proposed joint deep learning training pipeline with spatial and temporal geometric constraints.

The depth network uses the spatial geometric constraint between stereo pairs to restore the absolute scale of the depth map, while two consecutive monocular images provide the temporal geometric constraint for estimating the camera pose; DVO estimates fine-grained ego-motion, even for large pose changes, by using the inverse compositional algorithm, and provides additional back-propagated error signals to optimize the depth map. Finally, the spatial-temporal geometric constraints are used to reconstruct the temporal image pairs, achieving joint optimization of the depth network and the pose network.

2. THE PROPOSED APPROACH

2.1. Spatial Geometric Constraint for Depth Estimation

To recover the absolute scale of the depth map, stereo images are used to train the depth network. The known baseline of the stereo system is regarded as the spatial geometric constraint, and the depth estimation network converts the depth map into disparity data. Specifically, within the overlapping regions of a stereo image pair, the corresponding pixels of a left-right pair can be matched by their horizontal distance $d$, calculated as:

$d = b f \times D_{inv}$,   (1)

where $b$ is the baseline of the stereo camera, $f$ is the focal length, and $D_{inv}$ is the inverse depth value of the corresponding pixel.
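For concreteness, Eq. (1) amounts to the following one-line conversion (a minimal NumPy sketch, not part of the paper's released code; the baseline and focal-length values below are merely indicative of a KITTI-like rig):

```python
import numpy as np

def inverse_depth_to_disparity(d_inv, baseline_m, focal_px):
    """Eq. (1): d = b * f * D_inv.

    d_inv      -- (H, W) predicted inverse depth (1/m)
    baseline_m -- stereo baseline b in meters
    focal_px   -- focal length f in pixels
    Returns the horizontal disparity d in pixels.
    """
    return baseline_m * focal_px * d_inv

# Illustrative values only (roughly KITTI-like rig).
d_inv = np.full((256, 512), 1.0 / 20.0)  # a flat scene 20 m away
disparity = inverse_depth_to_disparity(d_inv, baseline_m=0.54, focal_px=720.0)
```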

Given a stereo image pair $I_{l-t_2}$ and $I_{r-t_2}$, where $I_{l-t_2}$ is the reference view and $I_{r-t_2}$ is the source view, we can synthesize the reference view $I_{l-t_2}$ as $I'_{r-t_2}$ from the source view $I_{r-t_2}$ by using the horizontal distance $d$; similarly, we can reconstruct the right image from the given left image. A hybrid loss function $\mathcal{L}_{s-l1}$, which combines SSIM [14] and the L1 norm, is used as the left and right photometric consistency loss between the synthesized view and the reference view:

$\mathcal{L}_l = \sum_{x_i} \mathcal{L}_{s-l1}\left[ I_{l-t_2}(x_i),\, I'_{r-t_2}(x_i) \right]$,   (2)

$\mathcal{L}_r = \sum_{x_i} \mathcal{L}_{s-l1}\left[ I_{r-t_2}(x_i),\, I'_{l-t_2}(x_i) \right]$,   (3)

$\mathcal{L}_s = \mathcal{L}_l + \mathcal{L}_r$.   (4)

The sum of the left and right photometric consistency losses constitutes the spatial geometric constraint; no pose estimation is involved. The absolute scale of the monocular video sequence can therefore be recovered, resulting in an accurate depth map.
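A possible implementation of the hybrid loss $\mathcal{L}_{s-l1}$ used in Eqs. (2)-(4) is sketched below, assuming the 0.85:0.15 SSIM-to-L1 weighting stated in Sec. 3.2. The box-filter SSIM here is a simplification of the Gaussian-weighted formulation in [14], and the function names are illustrative:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim_map(x, y, size=3, c1=0.01 ** 2, c2=0.03 ** 2):
    # Box-filter SSIM on single-channel images in [0, 1] (simplified from [14]).
    mu_x, mu_y = uniform_filter(x, size), uniform_filter(y, size)
    var_x = uniform_filter(x * x, size) - mu_x ** 2
    var_y = uniform_filter(y * y, size) - mu_y ** 2
    cov = uniform_filter(x * y, size) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def hybrid_loss(ref, synth, alpha=0.85):
    # L_{s-l1}: weighted SSIM dissimilarity plus L1 term (Eqs. (2)-(3)).
    ssim_term = np.mean((1.0 - ssim_map(ref, synth)) / 2.0)
    l1_term = np.mean(np.abs(ref - synth))
    return alpha * ssim_term + (1.0 - alpha) * l1_term

# Spatial constraint, Eq. (4):
# loss_s = hybrid_loss(I_l, I_l_synth) + hybrid_loss(I_r, I_r_synth)
```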

2.2. Temporal Geometric Constraint for Ego-Motion Estimation

The temporal geometric constraint, in turn, is beneficial for ego-motion estimation. Given a temporal image pair $I_{l-t_1}$ and $I_{l-t_2}$, the adjacent frames are reconstructed by image warping during the training of the pose network. For each pixel $p_{l-t_2}$ in the reference view $I_{l-t_2}$, we first project it into the adjacent source view $I_{l-t_1}$ according to the predicted depth and camera pose; the pixel value of the reference view $I_{l-t_2}$ is then reconstructed at position $x_i$ using bilinear interpolation. The projected coordinates are obtained by:

$I'_{l-t_1} = f\left(K, T_{t_2 \to t_1}, D\right) \cdot I_{l-t_1}$,   (5)

where $I'_{l-t_1}$ is the reconstructed reference view of $I_{l-t_2}$, $K$ is the camera intrinsic matrix, $D$ is the depth of the pixels in the reference view $I_{l-t_2}$, and $T_{t_2 \to t_1}$ is the camera coordinate transformation from the reference view to the source view. We can thus synthesize the reference view $I_{l-t_2}$ from the source view $I_{l-t_1}$ using the estimated pose and a spatial transformer [15]. The photometric consistency loss between the monocular image sequences is therefore:

$\mathcal{L}_t = \sum_{x_i} \mathcal{L}_{s-l1}\left[ I_{l-t_2}(x_i),\, I'_{l-t_1}(x_i) \right]$.   (6)
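The view synthesis behind Eqs. (5)-(6) can be sketched as follows: back-project each reference pixel with its depth, transform it by $T_{t_2 \to t_1}$, re-project it into the source view, and sample the source image. The sketch below uses nearest-neighbour lookup for brevity where the actual pipeline uses bilinear sampling [15]; all names are illustrative:

```python
import numpy as np

def warp_source_to_reference(I_src, D_ref, K, T):
    """Reconstruct the reference view from the source view (cf. Eq. (5)).

    I_src -- (H, W) source image I_{l-t1}
    D_ref -- (H, W) depth of the reference view I_{l-t2}
    K     -- (3, 3) camera intrinsic matrix
    T     -- (4, 4) transform T_{t2->t1}, reference to source camera
    """
    H, W = D_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)

    # Back-project reference pixels to 3-D, map them into the source camera.
    cam = np.linalg.inv(K) @ pix * D_ref.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src = K @ (T @ cam_h)[:3]
    us, vs = src[0] / src[2], src[1] / src[2]

    # Nearest-neighbour lookup (bilinear sampling in the actual pipeline).
    us = np.clip(np.round(us).astype(int), 0, W - 1)
    vs = np.clip(np.round(vs).astype(int), 0, H - 1)
    return I_src[vs, us].reshape(H, W)
```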

Eq. (6) shows that the reconstruction accuracy relies heavily on the depth map and the pose between the two views. The depth map obtained in this paper is accurate and has absolute scale; the obtained pose, however, may not be accurate, since the unsupervised learning process learns only an approximation of the transformation with no geometry involved. Inspired by recent advances in DVO, we use DVO as an explicit temporal geometric constraint for ego-motion estimation. DVO takes the reference view $I_{l-t_2}$, the corresponding depth map $D$ and the adjacent source view $I_{l-t_1}$ as inputs, and iteratively searches for an optimal camera pose that minimizes the photometric error between the warped source image and the reference image. The objective of DVO is:

$\mathcal{L}_{t\text{-}vo} = \min_p \sum_{x_i} \left[ I_{l-t_1}\!\left(W(x_i, p, D)\right) - I_{l-t_2}(x_i) \right]^2$,   (7)

where $W$ is the warping transformation and $p$ is the parameterization of the pose $T$. This nonlinear least-squares problem can be solved by the Gauss-Newton method [16]. To improve computational efficiency, we instead use the inverse compositional algorithm, briefly described in Algorithm 1:

Algorithm 1: Inverse compositional DVO.
Pre-process:
1. Compute the image gradient $\nabla I(x_i)$;
2. Compute the Jacobian matrix $J = \nabla I(x_i)\, \partial \omega(x_i; p, d_i) / \partial p$ at $(x_i; 0, d_i)$;
3. Compute the Hessian matrix $H = J^\top J$.
Iterative optimization:
1. Warp $I$ with $\omega(x_i; p, d_i)$ to compute $I'\left(\omega(x_i; p, d_i)\right)$;
2. Compute the error image $r$;
3. Compute $\Delta p = H^{-1} J^\top \sum_{x_i} r$, where $H^{-1} J^\top$ acts as the pseudo-inverse of the Jacobian;
4. Update $p_{s+1} \leftarrow p_s + \Delta p$.
Repeat until convergence.

Since the Jacobian and Hessian matrices in the inverse compositional algorithm do not need to be recalculated at each iteration, it converges faster than gradient descent. It is also worth noting that DVO converges only when the initial error $r$ is close to zero, so a good initial pose is a prerequisite for an optimal solution. We therefore use the pre-trained pose network to provide the pose initialization for DVO, and DVO in turn provides the temporal geometric constraint to refine the pose network. For initial errors caused by excessive motion, we use an image pyramid to down-sample the input image and depth map, which ensures the convergence of the algorithm.
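The structure of Algorithm 1 is illustrated by the following schematic Gauss-Newton loop (not the authors' implementation; `residual_fn` and `jacobian_fn` are placeholders for the warp-and-subtract and Jacobian computations described above):

```python
import numpy as np

def inverse_compositional(residual_fn, jacobian_fn, p0, max_iters=50, tol=1e-8):
    """Schematic Gauss-Newton loop of Algorithm 1.

    residual_fn(p) -> (N,) photometric residuals r for pose parameters p
                      (warp the image, subtract the reference: steps 1-2).
    jacobian_fn()  -> (N, 6) Jacobian J computed once at p = 0; keeping J and
                      H fixed is what makes the method inverse compositional.
    """
    J = jacobian_fn()                    # pre-process steps 1-2
    H = J.T @ J                          # pre-process step 3: Hessian
    H_inv_Jt = np.linalg.solve(H, J.T)   # H^{-1} J^T, reused at every iteration

    p = np.asarray(p0, dtype=float)
    for _ in range(max_iters):
        r = residual_fn(p)               # iteration steps 1-2: error image
        dp = H_inv_Jt @ r                # step 3: delta p
        p = p + dp                       # step 4: p_{s+1} <- p_s + delta p
        if np.linalg.norm(dp) < tol:     # repeat until convergence
            break
    return p
```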

2.3. Joint Training and Global Optimization

This paper proposes a joint training pipeline in which the depth and ego-motion estimates are optimized under the spatial and temporal geometric constraints. The depth and pose networks provide excellent initial depth maps and ego-motion, respectively, and the two estimates are then optimized iteratively: the depth map with absolute scale improves the DVO-based ego-motion estimation, while the DVO errors are back-propagated to the depth network to improve its accuracy. The iterative optimization converges when both the depth map and the ego-motion are optimal. For the photometric consistency loss of the monocular image sequence, the reconstruction error is related only to the depth $D$ and the pose $T$, so the end-to-end training objective can be written as:

$\mathcal{L}_{t\text{-}vo} = \arg\min_\tau \min_p \mathcal{L}(D)\,\mathcal{L}(T)$,   (8)

where $\tau$ denotes the parameters of the depth network. Given a depth $D$, our solution for the pose predictor $f_T$ can be expressed as the best pose minimizing the photometric consistency loss:

$f_T\left(D, I_{l,t_1}, I_{l,t_2}\right) \triangleq \arg\min_p \mathcal{L}(T)$.   (9)

Substituting Eq. (9) into Eq. (8), we get:

$\mathcal{L}_{t\text{-}vo} = \min_\tau \mathcal{L}\left\{ f_D(\tau),\, f_T\left[ f_D(\tau) \right] \right\}$.   (10)

Therefore, in the DVO process, the effect of depth on the photometric consistency loss comes from two terms, the partial derivatives of the loss with respect to depth and with respect to pose:

$\dfrac{d \mathcal{L}_{t\text{-}vo}}{d f_D} = \dfrac{\partial \mathcal{L}_{t\text{-}vo}}{\partial f_D} + \dfrac{\partial \mathcal{L}_{t\text{-}vo}}{\partial f_T} \cdot \dfrac{\partial f_T}{\partial f_D}$.   (11)

Eq. (11) shows that our depth network obtains additional back-propagation signals from the pose prediction. In this way, the temporal geometric constraint establishes a direct relationship between depth and pose in the DVO process and provides additional error signals to the depth network, resulting in a more accurate depth map and fine-grained ego-motion. Moreover, to obtain smooth depth predictions, our approach encourages local depth smoothness by introducing an edge-aware smoothness term into the joint optimization:

$\mathcal{L}_{sm} = \sum_{x_i} \left| \partial_x D_{x_i} \right| e^{-\left| \partial_x I_{x_i} \right|} + \left| \partial_y D_{x_i} \right| e^{-\left| \partial_y I_{x_i} \right|}$.   (12)

Our total loss is:

$\mathcal{L}_{all} = \alpha \mathcal{L}_{l-r} + \beta \left( \mathcal{L}_{t\text{-}vo} + \mathcal{L}_t \right) + \gamma \mathcal{L}_{sm}$,   (13)

where $\mathcal{L}_{l-r}$ denotes the left-right spatial loss of Eq. (4).
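Assembling Eqs. (4), (6), (7), (12) and (13), the total objective can be sketched as follows (illustrative NumPy pseudocode; the per-term losses are assumed to be computed as above, and the weights follow Sec. 3.2):

```python
import numpy as np

def smoothness_loss(D, I):
    # Eq. (12): depth gradients, down-weighted at image edges.
    dDx, dDy = np.abs(np.diff(D, axis=1)), np.abs(np.diff(D, axis=0))
    dIx, dIy = np.abs(np.diff(I, axis=1)), np.abs(np.diff(I, axis=0))
    return np.sum(dDx * np.exp(-dIx)) + np.sum(dDy * np.exp(-dIy))

def total_loss(loss_lr, loss_t_vo, loss_t, loss_sm,
               alpha=1.0, beta=1.0, gamma=0.1):
    # Eq. (13) with [alpha, beta, gamma] = [1, 1, 0.1] as in Sec. 3.2.
    return alpha * loss_lr + beta * (loss_t_vo + loss_t) + gamma * loss_sm
```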

Since the loss is built on spatial and temporal geometric constraints rather than labelled data, the proposed network is unsupervised. Although the system is trained with binocular video sequences, only a monocular camera is used at deployment; the proposed network is thus a monocular system for depth and ego-motion estimation on monocular video sequences.



3. EXPERIMENTS

In this section, we present qualitative and quantitative evaluation results on the benchmark KITTI dataset [17] and compare with state-of-the-art depth estimators and VO methods. In addition, we test the generalization ability of the proposed method on the Make3D dataset.

3.1. Network Architecture

Our depth network is composed of an encoder and a decoder, similar to the network structure in [10]. Skip connections fuse features from different lower layers of the encoder. To obtain a reasonable prediction, the rectified linear unit (ReLU) is used as the activation function after the last prediction layer of the depth network. The network outputs the inverse depth $D_{inv}$ rather than the depth itself; since ReLU may produce a zero prediction for an infinite depth, we use the formula $D = 1 / (D_{inv} + 10^{-4})$ to convert the inverse depth to depth. Our pose network has a structure similar to the one described in [12]: its input is two consecutive monocular frames, and its regression output is the six-degrees-of-freedom (DOF) pose, which is converted into a 4x4 transformation matrix. The pose network contains seven convolutional layers and three fully connected layers.
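In code form, the inverse-depth conversion described above is a trivial sketch:

```python
def to_depth(d_inv):
    # D = 1 / (D_inv + 1e-4); the epsilon guards against the zero inverse
    # depth that ReLU can output for infinitely distant pixels.
    return 1.0 / (d_inv + 1e-4)
```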

3.2. Training hyper-parameters

We train our CNNs in the framework of TensorFlow [18]. The Adam optimizer [19] is used for parameter optimization, with $[\beta_1, \beta_2, \epsilon] = [0.9, 0.999, 10^{-8}]$. The initial learning rate of all networks is set to 0.001. The weights in the loss function are set to $[\alpha, \beta, \gamma] = [1, 1, 0.1]$, which gives high stability in our training process. To accommodate the DVO module and computational efficiency, the batch size is set to 1. Note that in the hybrid loss function $\mathcal{L}_{s-l1}$, the ratio of SSIM to L1 is 0.85:0.15.
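In TensorFlow, this configuration corresponds to roughly the following (a sketch; the exact optimizer API depends on the TensorFlow version):

```python
import tensorflow as tf

# Adam with [beta1, beta2, eps] = [0.9, 0.999, 1e-8] and initial lr 0.001.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
```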

3.3. Depth Estimation Results on KITTI

The KITTI dataset contains 61 video sequences. The Eigen split selects 697 images from 28 sequences as the test set for monocular depth estimation; the remaining 33 scenes contain 23,488 stereo pairs used for training. The original images are down-sampled from 1242x375 to 512x256 for computational efficiency.

Table 1. Quantitative results of our method and of methods reported in the literature on the KITTI raw test set (Eigen split), with an 80 m cap on ground-truth depth. Error metrics: lower is better; accuracy metrics: higher is better.

Method                     Dataset  Supervision | Abs Rel  Sq Rel  RMSE   RMSE log | d<1.25  d<1.25^2  d<1.25^3
Eigen et al. [1] (Fine)    K        Depth       | 0.203    1.548   6.307  0.282    | 0.702   0.890     0.958
Liu et al. [2]             K        Depth       | 0.201    1.584   6.471  0.273    | 0.680   0.898     0.967
Zhou et al. [5]            K        Mono        | 0.208    1.768   6.856  0.283    | 0.678   0.885     0.957
Garg et al. [9]            K        Stereo      | 0.152    1.226   5.849  0.246    | 0.784   0.921     0.967
Godard et al. [10]         K        Stereo      | 0.148    1.344   5.927  0.247    | 0.803   0.922     0.964
Zhan et al. [11] (Full)    K        Stereo      | 0.135    1.132   5.585  0.229    | 0.820   0.933     0.971
Ours                       K        Stereo      | 0.128    1.079   5.453  0.220    | 0.838   0.943     0.972

Fig. 3. Qualitative comparison with the state-of-the-art methods [5][10] (input image, ground truth, Zhou et al. [5], ours, Godard et al. [10]); ground-truth depth maps are interpolated for visualization purposes.

For a fair comparison, 80 m is used as the maximum depth threshold for metric evaluation. Table 1 and Fig. 3 present the error measurements of the different methods and visualizations of the estimated depth maps, respectively. Our approach performs considerably better than state-of-the-art depth estimators [1][2][5][9][10], with lower estimation errors and higher accuracies. As can be seen in Fig. 3, we obtain finer-grained depth estimation: compared to [5][10], our depth map predictions preserve more details and reconstruct objects such as the van, person, guidepost and tree more precisely.
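For reference, the error and accuracy metrics of Table 1 can be computed as in the following sketch (standard definitions; `gt` and `pred` are hypothetical per-pixel depth arrays):

```python
import numpy as np

def depth_metrics(gt, pred, cap=80.0):
    # Evaluate only valid ground-truth pixels, capped at `cap` meters.
    mask = (gt > 0) & (gt <= cap)
    gt, pred = gt[mask], np.clip(pred[mask], 1e-3, cap)

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    ratio = np.maximum(gt / pred, pred / gt)  # delta accuracy thresholds
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```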

3.4. Depth Estimation Results on Make3D

To assess the generalization ability of our monocular depth estimation model, we directly test the model trained on KITTI on the Make3D [20] test set. Quantitative results are presented in Table 2: our model obtains comparable and even better performance than other state-of-the-art methods without being trained on Make3D, indicating that the global scene structure is well captured by our model. As shown in Fig. 4, object details in the image are estimated precisely.

Table 2. Results on the Make3D dataset.

Method              Supervision | Abs Rel  Sq Rel  RMSE   RMSE (log)
Train-set mean      depth       | 0.876    13.98   12.27  0.377
Liu et al. [2]      depth       | 0.475    6.562   10.05  0.165
Laina et al. [3]    depth       | 0.204    1.840   5.683  0.084
Godard et al. [10]  pose        | 0.544    10.94   11.76  0.193
Zhou et al. [5]     none        | 0.383    5.321   10.47  0.478
Ours                pose        | 0.312    4.320   7.921  0.124

Fig. 4. Qualitative results on the Make3D dataset (input image, ground truth, ours).

3.5. Pose Estimation Results

The KITTI odometry split provides ground-truth camera poses for only 11 videos, indexed 00 to 10. We used videos 00 to 08 for training and the rest for testing, and we use monocular images to test all the methods under comparison. Since SfM-Learner [5] and ORB-SLAM [21] suffer from the scale ambiguity problem, we post-process their results to align them with the ground truth. The pose estimation between frames in [5] works only for short snippets of five frames, so each short snippet is aligned to the ground truth independently by tuning a scale factor. Since our approach jointly optimizes DVO and the absolute scale from stereo images, it needs no post-processing for scale alignment in the evaluation.
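The per-snippet scale alignment applied to the scale-ambiguous baselines can be computed with a closed-form least-squares scale factor, as in this sketch (our method requires no such step):

```python
import numpy as np

def align_scale(pred_xyz, gt_xyz):
    # Closed-form least-squares scale s minimizing ||s * pred - gt||^2
    # over an (N, 3) snippet of predicted and ground-truth positions.
    s = np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz * pred_xyz)
    return s * pred_xyz
```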



Table 3. Visual odometry results.

                              Seq. 09                       Seq. 10
Method                        t_err (%)  r_err (deg/100m)   t_err (%)  r_err (deg/100m)
Mur-Artal et al. [21] (LC)    16.23      1.36               /          /
Mur-Artal et al. [21]         15.30      0.26               3.68       0.48
Zhou et al. [5]               17.84      6.78               37.91      17.78
Zhan et al. [11] (Temporal)   11.93      3.91               12.45      3.46
Zhan et al. [11] (Full)       11.92      3.60               12.62      3.43
Ours                          9.12       2.16               8.08       1.95

Following the evaluation standard of the KITTI VO dataset, we use all possible sub-sequences of lengths (100, 200, ..., 800) m. Table 3 shows the average translation and rotation errors on the test sequences 09 and 10, where LC denotes loop closure, t_err is the average translational RMSE drift (%), and r_err is the average rotational RMSE drift (deg/100 m). Our stereo-vision-based VO learning approach obtains superior results to the monocular learning method [5], and compared with the other stereo-vision-based methods [11], our performance is the best.

4. CONCLUSION

We integrate an absolute-scale recovery module and DVO into an end-to-end unsupervised deep learning framework for depth and ego-motion estimation, which avoids the scale ambiguity problem and produces more accurate ego-motion estimates. In addition, the stereo spatial reconstruction constraint and the temporal constraint from videos are jointly integrated into the deep learning framework, resulting in more reliable depth and ego-motion estimation.

5. REFERENCES

[1] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[2] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. TPAMI, 38(10):2024-2039, 2016.
[3] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
[4] P. Wang, X. Shen, B. Russell, S. Cohen, B. L. Price, and A. L. Yuille. SURGE: surface regularized geometry estimation from a single image. In NIPS, 2016.
[5] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
[6] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In CVPR, 2018.
[7] A. Wang, Z. Fang, Y. Gao, X. Jiang, and S. Ma. Depth estimation of video sequences with perceptual losses. IEEE Access, 6:30536-30546, 2018.
[8] C. Wang, J. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In CVPR, 2018.
[9] R. Garg, V. K. B. G, G. Carneiro, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
[10] C. Godard, O. M. Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
[11] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
[12] R. Li, S. Wang, Z. Long, and D. Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In ICRA, 2018.
[13] F. Steinbruecker, J. Sturm, and D. Cremers. Real-time visual odometry from dense RGB-D images. In ICCV, 2011.
[14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
[15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[16] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. IJCV, 56(3):221-255, 2004.
[17] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[18] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[20] A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3D scene structure from a single still image. TPAMI, 31(5):824-840, 2009.
[21] R. Mur-Artal, J. Montiel, and J. D. Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147-1163, 2015.