Learning Probabilistic Representation of Shape Recognition from Volumetric Grid

Kun Liu

Weiwei Shang

Department of Automation, University of Science and Technology of China, Hefei, 230027, PR China Email: [email protected]

Department of Automation, University of Science and Technology of China, Hefei, 230027, PR China Email: [email protected]

Abstract—Since a camera only obtains an observation of part of an object's surface, recognition of the whole object from this single-view observation is incomplete. To address this problem, we establish a probabilistic representation of surface shape recognition based on 3D data. For a number of objects, multiple Convolutional Neural Network (CNN) structures are used to extract local shape features and to fit the probabilistic representation. In experiments, the performance of different network structures is compared, and the feasibility of fitting the probabilistic representation with convolutional neural networks is verified. In most cases, the predicted recognition probability is very accurate.

I. INTRODUCTION

3D geometric shape is always a crucial cue for object recognition. For two types of objects in particular, people usually rely on surface shape rather than surface color to distinguish them. The first type lacks obvious color characteristics and texture, such as bricks and mechanical parts. The second type has obvious surface color characteristics, but sometimes we only care about shape. In daily life, many objects share the same shape but differ in color, and their shapes are what provide their function: tables, chairs, cups, bowls and so on of the same or similar shape. For these two types, it is difficult or useless to distinguish objects from color images, and surface shape is the key to identifying them. Therefore, we argue that target objects should be recognized through surface shape in 3D data. 3D data types include depth images, point clouds, 3D CAD models, etc. Here, the volumetric grid is used because it is easy to handle.

In general, a camera can only capture part of an object's surface from any single viewpoint, so it obtains an observation of only that surface part. Because at most this surface part can be recognized in the observation, the recognition result for the object is not entirely certain, and the result should reflect this uncertainty. We use a recognition probability to describe the degree of certainty. Here, the probability of shape recognition carries practical significance, and it is desirable that robots are capable of knowing the reliability of a recognition result.

In this paper, we study shape recognition in the sense of probability. Although there are already some ways to approach this problem, most of them output the probability of a known view rather than of the current view. We use a data-driven way


to train convolutional neural networks (CNNs) to learn shape distributions; shape recognition tasks are then performed. In order to train these CNNs, a suitable dataset is built from the ModelNet10 model dataset [1]. We show the effect of this method on the dataset, and our model can effectively output recognition probabilities.

II. RELATED WORK

There have been many shape recognition studies using 3D CAD models. In traditional methods, many kinds of global features have been designed. In [2], 3DNet, a large-scale CAD model dataset, was constructed for object recognition based on CAD models, and several types of point cloud features were tested on it. 3D views were generated from object models, and the rate of captured surface was stored for each view. These rates thus belong to known views and do not generalize to unknown views. Many new types of global feature have been proposed for instance recognition and pose estimation [3–6], but these features are designed by hand and are not used to predict recognition probability. Further, [7] used a CNN to map training views to global feature descriptors for instance recognition and pose estimation; however, recognition probability is not considered. In addition, many types of local feature have been used for shape recognition and 3D registration [8, 9]. Although methods based on local features can recognize occluded objects, it is more difficult for them to obtain a probability representation.

In recent years, deep learning has rapidly been applied to 3D shape recognition. One idea is to classify objects based on volumetric grids of the complete object surface. [1] proposed a probability distribution model over binary voxels, 3D ShapeNets, using a Convolutional Deep Belief Network (CDBN), and applied it to RGB-D object recognition. [10] then presented a new CNN architecture, VoxNet, which can perform real-time shape classification. In [11], orientation-boosted voxel nets for 3D shape recognition were proposed, in which the network is forced to predict the coarse pose of the object in addition to the class label. PointNet was constructed in [12] as an improvement over existing methods by using density occupancy grid representations for the input data. In contrast to previous models, [13] proposed a data representation that enables 3D convolutional networks which are

both deep and high resolution. The other idea is object shape classification based on view datasets of CAD models. [14] considered learning to recognize 3D shapes from a collection of views rendered as 2D images and presented a novel CNN architecture that combines information from multiple views into a single and compact shape descriptor; this method offers better recognition performance than previous volumetric CNNs. [15] combined both representations (volumetric representation and pixel image representation) and exploited them to learn new features, which yield a significantly better classifier than either representation in isolation. In [16], the aim is to improve both volumetric CNNs and multi-view CNNs through extensive analysis of existing approaches, and the results outperform the state-of-the-art methods. All of these methods have made great progress in shape classification, but they address neither the recognition of individual objects nor the actual meaning of recognition probability.


III. PROBLEM MODEL

Fig. 1. Volumetric grids of some objects. From left to right: table, chair, toilet.

Surface shape is a very important object property, so we hope that robots are able to distinguish and identify objects from this point of view. Specifically, when a robot obtains a 3D data sample, it should obtain the probability that the sample represents each known object:
• If the complete surface of the object is recognized, the recognition probability = 1.
• If partial surface of the object is recognized, the recognition probability < 1.
• If surface which does not belong to the object is recognized, the recognition probability = 0.

A. Data Representation

Although there are many types of 3D data, we need to choose a data type according to the actual situation. If a depth image is used directly, the geometric information in this data type is not very direct. A point cloud can be generated directly from a depth image and contains the original, rich surface geometry, but it is difficult to efficiently handle a large number of point clouds. A 3D CAD model is a digital model used in the manufacturing process; it is a type of analytical 3D data, and most artificial objects have corresponding 3D CAD models. 3D CAD model files are easy to obtain from the network, but they differ greatly from the actual 3D data captured by cameras. A volumetric grid is a 3D grid of voxels representing a range of 3D space. Each voxel occupies a certain volume and indicates the state of that volume, so a volumetric grid discretizes 3D space and contains direct 3D geometry. To sum up, the volumetric grid is an appropriate data type, since it effectively reduces the amount of data while maintaining certain surface details and spatial resolution. Not only can actual data (depth images, point clouds) collected by cameras be converted to voxels, but analytical data (3D CAD models) can be converted as well. In this paper, volumetric grids are obtained from a large number of CAD models, and some are shown in Fig. 1. In this case, the voxels are divided into two classes: each red voxel has state 1 and indicates that the voxel belongs to the surface; each transparent voxel has state 0 and indicates that the object surface does not pass through this voxel. In the visualization results, object surfaces become rough, but object shapes are still distinguishable. Each volumetric grid has a specific spatial resolution. To improve the spatial resolution, the size of the volumetric grid covering a fixed space can be increased, but the data size also increases. Therefore, we use 24 as the side length of the volumetric grid.
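To make the data pipeline concrete, the following is a minimal sketch (not taken from the paper) of converting a point cloud into such a binary volumetric grid; the side length of 24 follows the text, while the helper name voxelize and the bounding-box normalization are our own assumptions.

```python
import numpy as np

def voxelize(points, grid_size=24):
    """Map an (N, 3) point cloud into a grid_size^3 binary occupancy grid."""
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)
    extent = np.ptp(points, axis=0).max()       # one scale keeps aspect ratio
    # Scale into [0, grid_size) and clip boundary points into range.
    idx = ((points - mins) / extent * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1   # surface voxels get state 1
    return grid
```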

B. Mathematical Model

Given a volumetric grid of an object model, $G_0$, the voxels are divided into two sets: the voxels with value 1 are called surface voxels and constitute the surface voxel set $S_o$; the voxels with value 0 are called blank voxels and constitute the blank voxel set $S_u$. $N$ is the total number of voxels in $S_o$. For a new input volumetric grid $G_1$, let $i$ be the index of a voxel and $v(i)$ its value. That the object is recognized in $G_1$ is considered a random event; denote the probability of this event as $p_1$. It can be expressed as:

$$p_1 = \frac{\sum_{i \in S_o} v(i)}{N} \qquad (1)$$

If $G_1$ contains any voxel which does not belong to the object surface, $p_1$ is expected to be 0. An example of the distributions of the two voxel sets in $G_1$ is shown in Fig. 2. Each such stray voxel is punished as −1 in the formula:

$$p_1 = \max\Big(0,\; \frac{\sum_{i \in S_o} v(i)}{N} - \sum_{i \in S_u} v(i)\Big) \qquad (2)$$
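As a sanity check, Eqs. (1) and (2) can be evaluated directly on two grids; the following sketch assumes $G_0$ and $G_1$ are binary numpy arrays of the same shape (the array-based phrasing is ours).

```python
import numpy as np

def recognition_probability(g0, g1):
    """Evaluate Eq. (2) for an input grid g1 against a model grid g0."""
    surface = g0.astype(bool)          # S_o: voxels with value 1 in G0
    blank = ~surface                   # S_u: voxels with value 0 in G0
    n = surface.sum()                  # N: number of surface voxels
    p1 = g1[surface].sum() / n         # Eq. (1)
    penalty = g1[blank].sum()          # each stray voxel is punished as -1
    return max(0.0, p1 - penalty)      # Eq. (2)
```

A complete surface (g1 equal to g0) gives 1, half of the surface gives 0.5, and a grid with enough voxels falling in $S_u$ is clamped to 0.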


Fig. 2. Two voxel sets: surface voxel set and blank voxel set.

IV. MODEL FITTING

For a single object, the two simple formulas (1) and (2) can be calculated directly to obtain the probability. For the volumetric grid of one object, 24³ = 13824 parameters are required, because it is necessary to record which of the two voxel sets each voxel belongs to. If there are M objects, the number of parameters grows linearly as 13824M. When we want recognition probabilities for a large number of objects from one input volumetric grid, this number of parameters becomes a problem during calculation. So local feature extraction is applied to obtain useful shape features from the original voxel data while reducing the number of required parameters. For example, if only 16 local features of size 6 × 6 × 6 are shared by all objects, only 3456 local feature parameters are required.

In feature extraction based on machine learning, CNNs are very effective at automatically learning local features from data. A CNN can contain multiple convolutional layers, and a convolutional layer applies multiple convolutional kernels. Each kernel represents a type of local feature and produces an output, a feature response, in each convolutional region of the input data. This feature response can be understood as the strength of that type of feature, and all feature responses extracted by the same convolutional kernel belong to the same type. The parameters of the convolutional kernels are obtained during network training. After feature extraction, the probability representation for shape recognition has to be computed from the feature responses. Since the relationship between the probability representation and the input volumetric grid is piecewise linear, the mutual coupling between the feature responses needs to be canceled. Therefore, after the convolutional layers, some fully connected layers are used to fit this calculation.

V. EXPERIMENT

The experimental process includes constructing a dataset and designing a network structure. Firstly, an appropriate dataset is constructed, and then concrete CNN structures are used to fit the mathematical model of object recognition. Finally, the experimental results are analyzed.

A. Dataset Construction

The ModelNet dataset was introduced in [1] to evaluate 3D shape classifiers, and ModelNet10 is a subset that contains 10 categories with 4899 models in total. Since we are concerned with the probability representation of shape recognition rather than shape classification, category information is not needed. We therefore take 10 object models from each category in ModelNet10 and obtain 100 object models for our dataset. Each of the 100 models is discretized into a 24 × 24 × 24 volumetric grid, and 3 voxels are added in each direction to reduce boundary effects during convolution. In order to generate different probabilities, a rate is taken every 0.01 from 0.9 to 1. For the grid of an object model and each rate, that rate of voxels is randomly extracted from the surface voxels of the grid to form the surface voxels of a new grid; the recognition probability is calculated as Eq. (2); this random extraction is repeated 20 times, and 22,000 samples form the total sample set. The total sample set is divided into a training set, a validation set and a test set by 60%, 20% and 20%.
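A minimal sketch of this sample generation, under the stated rates and repetitions; the padding of 3 voxels per direction follows the text, while the function name and RNG seed are illustrative. Since the extracted grids contain no voxels outside $S_o$, Eq. (2) reduces to the kept fraction of surface voxels.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_samples(model_grid, repetitions=20):
    padded = np.pad(model_grid, 3)                # 3 voxels in each direction
    surf = np.argwhere(padded == 1)               # surface voxel coordinates
    samples = []
    for rate in np.arange(0.90, 1.005, 0.01):     # a rate every 0.01 from 0.9 to 1
        for _ in range(repetitions):
            keep = rng.choice(len(surf), int(rate * len(surf)), replace=False)
            g1 = np.zeros_like(padded)
            g1[tuple(surf[keep].T)] = 1           # keep a random surface subset
            # No blank voxels are set, so Eq. (2) is just the kept fraction.
            samples.append((g1, len(keep) / len(surf)))
    return samples
```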

B. Network Structure

A CNN is considered to be composed of the following types of layers, each denoted by an abbreviated symbol. The input layer accepts a fixed-size volumetric grid of size 32 × 32 × 32. A convolutional layer is represented as C(f, d, s, a). It accepts 4-dimensional input, where the first three dimensions are spatial and the last is the feature response dimension. It convolves d × d × d × f′ volumes of the input to obtain f feature responses, where d is the spatial dimension and f′ is the feature response dimension of the input. Convolution can be applied with a stride s. A nonlinear activation function a maps convolutional results to outputs; here a is the sigmoid function s or the rectified linear unit (ReLU) r. A fully connected layer is represented by FC(n, a). It has n output units; a linear combination of the input is computed and passed through an activation function a, here the sigmoid function s, the rectified linear unit (ReLU) r, or the linear function l.

If we used the original formulas directly, 1,382,400 parameters would be required. Taking into account the complexity of the problem, three convolutional neural network structures are designed to solve the problem. They differ in detail but share a similar architecture, as shown in Fig. 3. The three specific CNN structures and the corresponding ratios of parameters are shown in Table I. The implementation of these structures is based on modification and refinement of the open source code of [1].
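As an illustration, the following is a PyTorch sketch of network 1 from Table I, C(16, 6, 6, s)–FC(100, l); the framework choice is ours (the paper builds on the code of [1]), but the layer sizes reproduce the parameter count in Table I.

```python
import torch.nn as nn

net1 = nn.Sequential(
    # C(16, 6, 6, s): 16 kernels of size 6x6x6, stride 6, sigmoid activation;
    # a 32x32x32 input yields 5x5x5 convolutional regions.
    nn.Conv3d(in_channels=1, out_channels=16, kernel_size=6, stride=6),
    nn.Sigmoid(),
    nn.Flatten(),                       # 16 * 5^3 = 2000 feature responses
    # FC(100, l): one linear output unit per object for the 100 objects.
    nn.Linear(16 * 5 * 5 * 5, 100),
)

# 6^3*16 + 16 (conv) + 2000*100 + 100 (fc) = 203572, as in Table I.
assert sum(p.numel() for p in net1.parameters()) == 203572
```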

Fig. 3. Rough network layers: input layer, convolutional layers, fully connected layers.

C. Training Process

These networks are trained by Stochastic Gradient Descent (SGD) with momentum. The objective is one half of the Mean Square Error (MSE) plus a regularization term, where the regularization term is a regularization parameter times the L2 weight norm. Networks 1 and 2 are trained with four regularization parameters [5e-5,

TABLE I
SPECIFIED NETWORK STRUCTURES.

No. | Structure representation             | No. of parameters | Ratio of parameters to the original number
1   | C(16, 6, 6, s)–FC(100, l)            | 203572            | 0.147
2   | C(16, 6, 6, s)–FC(100, s)            | 203572            | 0.147
3   | C(32, 6, 6, r)–FC(128, r)–FC(100, r) | 647172            | 0.468

5e-4, 5e-3, 5e-2], and that of network 3 is 0. Batch size is 32. SGD is initialized with a learning rate of 0.01 for networks 1 and 2 and 0.6 for network 3, and the learning rate is decreased by a factor of 10 every 5000 batches. Since this is a regression problem, the indicator used to evaluate the training results is the root of the MSE. The error statistics of the training results are shown in Table II. The fitting results of network 1 under the four regularization parameters are shown in Fig. 4, where the horizontal axis is the value of the regularization parameter and the vertical axis is the evaluation error on the training set and the validation set. When the regularization parameter is 5e-4, network 1 obtains its best fitting ability, and the corresponding training process is shown in Fig. 5. The fitting results of network 2 under the four regularization parameters are shown in Fig. 6; when the regularization parameter is 5e-5, it obtains its best fitting ability, and the corresponding training process is shown in Fig. 7. In the training process of network 3 there is no regularization, so the resulting network might overfit the training sample set. However, the fitting error of network 3 on the validation set differs little from that on the training set in Table II, so there is no obvious overfitting.
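A sketch of this training setup in the same PyTorch framing; the learning rate, decay schedule, and weight decay follow the text, while the momentum value of 0.9 is our assumption (the paper does not state it).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(                  # network 1 from the earlier sketch
    nn.Conv3d(1, 16, 6, stride=6), nn.Sigmoid(),
    nn.Flatten(), nn.Linear(2000, 100))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9,       # assumed, not given in text
                            weight_decay=5e-4)  # L2 term applied via the gradient
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=5000, gamma=0.1)       # /10 every 5000 batches

def train_step(grids, targets):
    """One SGD step on a batch of (batch, 1, 32, 32, 32) grids."""
    optimizer.zero_grad()
    loss = 0.5 * F.mse_loss(model(grids), targets)  # one half of MSE
    loss.backward()
    optimizer.step()
    scheduler.step()                    # scheduler stepped once per batch
    return loss.item()
```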


Fig. 6. The fitting results of network 2: root of mean square error on the training and cross-validation sets versus the weight decay parameter.


Fig. 7. The best training process of network 2: root of mean square error on the training and cross-validation sets versus training number.




Fig. 4. The fitting results of network 1: root of mean square error on the training and cross-validation sets versus the weight decay parameter.


Fig. 5. The best training process of network 1: root of mean square error on the training and cross-validation sets versus training number.

D. Result of Probability Representation

In order to further evaluate the generalization ability of the networks, the errors on the test set are compared. The error between the output of the network and the true value is expected to be below 5%. According to Table II, the test-set errors of networks 1, 2 and 3 are 0.0131, 0.0031 and 0.0021 respectively. Networks 2 and 3 differ little in effect; the average error of the output of network 3 is about 0.2%, and it works best. The fitting result of network 1 is the worst, and its fitting ability may not be sufficient for this problem. In order to observe which samples cause large errors, the absolute error of each sample is counted in a sample analysis. The sample analysis of network 1 on the test set is shown in Fig. 8a: the error of the network is concentrated within 6%, but the error of a few samples is too large. The average absolute error is 0.0088 and the maximum absolute error is 0.2354. Although the average absolute error is very small and satisfies the expectation, the maximum absolute error limits the application of network 1. Similarly, the sample analysis on the training set is shown in Fig. 8b; there, the average absolute error is 0.0088 and the maximum absolute error is 0.1790. The situation of network 2 is similar to that of network 1.

TABLE II
THE TRAINING RESULTS OF THE NETWORKS.

No. | Best regularization parameter | Training error | Validation error | Test error
1   | 5e-4                          | 0.0130         | 0.0130           | 0.0131
2   | 5e-5                          | 0.0030         | 0.0032           | 0.0031
3   | 0                             | 0.0015         | 0.0020           | 0.0021


In the test set, the sample analysis of network 2 is given in Fig. 9a; the average absolute error is 0.0004 and the maximum absolute error is 0.0978. In the training set, the sample analysis is shown in Fig. 9b; the average absolute error is 0.0004 and the maximum absolute error is 0.1266. Network 3 behaves similarly: in the test set (Fig. 10a), the average absolute error is 0.0002 and the maximum absolute error is 0.0897; in the training set (Fig. 10b), the average absolute error is 0.0002 and the maximum absolute error is 0.1790.
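The statistics above can be reproduced with a short per-sample error computation; a minimal sketch, assuming outputs and targets are arrays of predicted and true probabilities.

```python
import numpy as np

def sample_analysis(outputs, targets):
    """Return (mean, max) of per-sample absolute errors, as reported above."""
    abs_err = np.abs(np.asarray(outputs) - np.asarray(targets))
    return abs_err.mean(), abs_err.max()
```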


Fig. 8. Sample analysis of network 1: (a) on the test set, (b) on the training set.


Fig. 9. Sample analysis of network 2: (a) on the test set, (b) on the training set.


Fig. 10. Sample analysis of network 3: (a) on the test set, (b) on the training set.

Based on the sample analyses, on the one hand, network 3 has the best performance on the test set. On the other hand, the maximum absolute errors of the three networks on the


training set and the test set are much larger than the average absolute errors, which shows that the networks do not fit the outputs of all samples stably.

E. Recognition Accuracy and Speed

For a test sample, the output with the highest probability is taken as the recognition result. In the experiments we use a GeForce GTX TITAN X GPU. The recognition accuracy of network 1 is 0.983, with an average computation time of 0.111 ms per test sample; network 2 achieves 0.977 with an average computation time of 0.062 ms; network 3 achieves 0.978 with an average computation time of 0.106 ms. Although the three networks show large error fluctuations, their recognition accuracy is still relatively high, and forward computation is very fast.

VI. CONCLUSION

For incomplete 3D data, we want to use probability to represent the result of shape recognition. Three CNN structures are designed and trained to fit the problem. The test results show that these networks can derive recognition probabilities from input volumetric grids, and using them achieves the goal of reducing the number of original parameters. Although these networks can fit shape recognition within a certain range of errors, their fitting results all show some fluctuation. The possible reasons include:
1) The coupling between the feature responses of the convolutional layers may make the nonlinearity between the feature responses and the probability representation too strong. To reduce this nonlinearity, propagating only one feature response from each convolutional region to later layers may be worth trying: if the convolutional regions do not overlap with each other and each region provides only its strongest feature, that feature can better represent the region linearly, although more feature types may be needed to capture local shapes.
2) The adopted network structures may not be the most proper. When we consider eliminating overlaps of convolutional regions and using one feature per region, the network structures need to be adjusted and tested.

Besides, since the results of shape recognition in a single pose are not stable enough, shape recognition under different poses has not been discussed.

ACKNOWLEDGEMENT

This work was supported in part by the National Natural Science Foundation of China under Grant 51675501 and Grant 51275550, and in part by the Youth Innovation Promotion Association CAS under Grant 2012321. The authors would like to thank the Information Science Laboratory Center of USTC for the hardware and software services.

REFERENCES

[1] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912–1920.
[2] W. Wohlkinger, A. Aldoma, R. B. Rusu, and M. Vincze, "3DNet: Large-scale object class recognition from CAD models," in IEEE International Conference on Robotics and Automation, 2012, pp. 5384–5391.
[3] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, "Fast 3D recognition and pose using the viewpoint feature histogram," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010, pp. 2155–2162.
[4] A. Aldoma, M. Vincze, N. Blodow, D. Gossow, S. Gedikli, R. B. Rusu, and G. Bradski, "CAD-model recognition and 6DOF pose estimation using 3D cues," in IEEE International Conference on Computer Vision Workshops, 2011, pp. 585–592.
[5] W. Wohlkinger and M. Vincze, "Ensemble of shape functions for 3D object classification," in IEEE International Conference on Robotics and Biomimetics, 2011, pp. 2987–2992.
[6] A. Aldoma, F. Tombari, R. B. Rusu, and M. Vincze, "OUR-CVFH – oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation," Pattern Recognition, pp. 113–122, 2012.
[7] P. Wohlhart and V. Lepetit, "Learning descriptors for object recognition and 3D pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3109–3118.
[8] Y. Zhong, "Intrinsic shape signatures: A shape descriptor for 3D object recognition," in International Conference on Computer Vision Workshops, 2009, pp. 689–696.
[9] R. B. Rusu, N. Blodow, and M. Beetz, "Fast point feature histograms (FPFH) for 3D registration," in IEEE International Conference on Robotics and Automation, 2009, pp. 3212–3217.
[10] D. Maturana and S. Scherer, "VoxNet: A 3D convolutional neural network for real-time object recognition," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2015, pp. 922–928.
[11] N. Sedaghat, M. Zolfaghari, and T. Brox, "Orientation-boosted voxel nets for 3D object recognition," arXiv preprint arXiv:1604.03351, 2016.
[12] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, "PointNet: A 3D convolutional neural network for real-time object class recognition," in International Joint Conference on Neural Networks, 2016, pp. 1578–1584.
[13] G. Riegler, A. O. Ulusoy, and A. Geiger, "OctNet: Learning deep 3D representations at high resolutions," arXiv preprint arXiv:1611.05009, 2016.
[14] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in IEEE International Conference on Computer Vision, 2015, pp. 945–953.
[15] V. Hegde and R. Zadeh, "FusionNet: 3D object classification using multiple data representations," arXiv preprint arXiv:1607.05695, 2016.
[16] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, "Volumetric and multi-view CNNs for object classification on 3D data," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5648–5656.