Human Action Recognition Based on Self-learned Key Frames and Features Extraction

Qi Fu
1. School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
Email: [email protected]

Lina Liu
1. School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
2. School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo, China
Email: [email protected]

Shiwei Ma
1. School of Mechatronic Engineering and Automation, Shanghai University, Shanghai, China
Email: [email protected]

Corresponding author: Shiwei Ma, School of Mechatronic Engineering and Automation, Shanghai University, No. 149, Yanchang Rd., 200072, Shanghai, People's Republic of China. This work was supported by the National Natural Science Foundation of China (grant number 61671285) and the Natural Science Foundation of Shandong Province (grant number ZR2016FP04).

Abstract—Human action recognition is one of the most active research areas in computer vision. With the rapid development of deep learning, using neural networks for action recognition has become a popular research topic. This paper proposes a self-learned action recognition method based on neural networks. The proposed method trains dictionaries with a sparse autoencoder (SAE) and extracts key frames with a sparse long short-term memory (LSTM) autoencoder. The testing samples in the dataset can then be represented as vectors over the dictionaries for classification. This method reduces the data dimension effectively while remaining well suited to action recognition. Experimental results on the DHA dataset demonstrate the effectiveness and practicality of the proposed method.

Keywords—human action recognition, sparse autoencoder (SAE), sparse LSTM autoencoder

I. INTRODUCTION

Human action recognition based on computer vision has been widely used in intelligent monitoring, virtual reality, human-computer interaction, etc. With the development of deep learning, more and more methods based on neural networks have been proposed to achieve self-learning and recognition [1]. General action recognition methods can be separated into four main parts: key frame extraction, feature extraction, action representation, and classification. To eliminate the redundancy of action sequences, extracting key frames is a critical step. Liu et al. [2] use several criteria (orientation, intensity, and contour information) to describe poses and select key frames with the AdaBoost algorithm. Wen et al. [3] employ a piecewise linear compression algorithm to shorten the whole input sequence to a constant length. Furthermore, a self-adaptive weighted affinity propagation clustering method was introduced for key frame extraction in [4]. In most previous research, key frame extraction is based on hand-crafted, pre-designed features, which makes it hard to learn the inner relationships between action video frames. Recently, a self-learned key frame extraction method was applied [5] through a sparse long short-term memory (LSTM) autoencoder network, which selects the critical time steps by the memory cell of the LSTM instead of using traditional features and learns the inner information of action sequences independently through the training process.

However, most existing feature description methods are also based on hand-crafted features and usually require prior knowledge. These methods inevitably depend on specific applications and neglect the inner structure of the visual information. They can be divided into four categories: silhouette features, geometric features, motion information, and spatial-temporal interest points [6]. The Hu moments proposed in [7] were used to describe silhouette features. Zhang et al. [8] utilized the Radon transform, known as the R transform, to extract geometric features. Yu et al. [9] used curves to extract human contours as a skeleton vector, although fitting the limbs and torso still suffers some instability in certain respects. Recently, more and more researchers have begun to focus on feature self-learning techniques, which can automatically find a high-level expression and discover inner information. The sparse autoencoder (SAE) [10] is one of the simplest models for feature self-learning, and it has been applied to the recognition of handwritten digits [11] and the detection of moving bodies [12]. This model can be generalized to action recognition, where it is able to extract features from posture images.

In this paper, a novel method of human action recognition is proposed. The loss function of the LSTM autoencoder and the sparsification method are simplified and improved. The sparse LSTM autoencoder, combined with the SAE, achieves the extraction of self-learned key frames and features. The extracted features can be seen as the dictionaries of all images in the dataset, and the lasso function is applied to solve the problem of action feature representation. Human action recognition experiments were performed on the challenging DHA dataset.

II. PROPOSED METHOD

A. The Framework of Proposed Method

The framework of the proposed method is shown in Fig.1. It consists of four main parts. Firstly, all images of the action sequences are normalized to a fixed size by motion detection.





Then these images are sent to train the dictionaries. Since the SAE can learn sparse features of the input data, it is widely used for unsupervised feature extraction and dictionary training. The SAE is utilized in this part, yielding the extracted dictionaries as self-learned features. Another major step is key frame extraction. To simplify the following operations, similar frames are eliminated first. All action sequences are then sent to the sparse LSTM autoencoder to extract key frames. This method benefits from the LSTM autoencoder's ability to handle temporal context information, and it generalizes effectively to the processing of action sequences. Moreover, extracting key frames with neural networks captures more internal information than hand-crafted methods.

Afterwards, the key frames are represented by the dictionaries, which can be seen as an encoding process. Using the dictionaries to minimize a loss function and obtain the coefficients of the samples is an optimization problem, specifically a lasso problem. Through the lasso function of sparse representation, the coefficients of the action sequences can be solved accurately and efficiently. Finally, each action sequence is represented as a vector and classified by a support vector machine (SVM), which classifies small-sample data well.

Fig.1 The framework of the proposed human action recognition method



B. Dictionary Learning with SAE

The SAE is a simple neural network structure designed to reconstruct its input with an encoder-decoder mechanism. It contains only one hidden layer, and the activation of the hidden layer is made sparse through the loss function. This feedforward structure can be trained by the back-propagation algorithm to learn features automatically in an unsupervised manner [13]. The structure diagram of the SAE is shown in Fig.2. The encoder in the input layer transforms the input data x into a corresponding activation h. The decoder in the output layer is trained to reconstruct an approximation x̂ of the input from the activation h. Through this mechanism, the network tries to find correlations in the high-dimensional input. The activation h can be seen as a sparse representation of the input, and the weights from the input layer to the hidden layer can be seen as dictionaries of the sample space. The activation function of both the encoder and the decoder is the sigmoid function expressed in (1):

S(z) = (1 + \exp(-z))^{-1}    (1)

Fig.2 The structure diagram of SAE

The training process of the SAE can be separated into three steps. The first step is encoding through (2), where W_1 and b_1 denote the weights and biases between the input layer and the hidden layer. The next step is decoding, which mirrors the encoding step, as (3) shows; W_2 and b_2 denote the weights and biases between the hidden layer and the output layer. The last step is optimizing the loss function of the SAE, which is expressed in (4).

h = S(W_1 x + b_1)    (2)

\hat{x} = S(W_2 h + b_2)    (3)

J_{SAE} = \frac{1}{m} \sum_{i=1}^{m} \|\hat{x}_i - x_i\|_2^2 + \lambda \|W\|_2^2 + \beta \sum_{j=1}^{d_h} \left[ \rho \log\frac{\rho}{\rho_j} + (1-\rho) \log\frac{1-\rho}{1-\rho_j} \right]    (4)

The first term of the loss function is the mean square error between x_i and x̂_i, which denote the i-th input and output in (4); m denotes the number of inputs. The second term is a regularization term with weight decay coefficient λ, where W = W_1 and W_2 = W_1^T. This term decreases the magnitude of the weight matrix to prevent over-fitting. The last term achieves the sparsity of h by controlling the Kullback-Leibler (K-L) divergence between the average activation ρ_j of hidden unit j and the desired activation ρ; d_h and β represent the number of hidden units and the sparse coefficient, respectively.
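As a concrete reading of (1)-(4), the following NumPy sketch computes the SAE forward pass and loss for a batch of flattened images, assuming the tied weights W_2 = W_1^T stated above; the backpropagation update is omitted, and the default parameter values follow Table 1 later in the paper.

    import numpy as np

    def sigmoid(z):                                   # Eq. (1)
        return 1.0 / (1.0 + np.exp(-z))

    def sae_forward_and_loss(X, W1, b1, b2, rho=0.01, beta=3.0, lam=1e-4):
        # X: (m, d) batch of flattened images; W1: (d, d_h); W2 = W1.T (tied).
        H = sigmoid(X @ W1 + b1)                      # Eq. (2), activations h
        X_hat = sigmoid(H @ W1.T + b2)                # Eq. (3), reconstruction
        mse = np.mean(np.sum((X_hat - X) ** 2, axis=1))   # first term of (4)
        decay = lam * np.sum(W1 ** 2)                 # weight decay term
        rho_hat = H.mean(axis=0)                      # average activation rho_j
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        return mse + decay + beta * kl                # Eq. (4)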

C. Key Frames Extraction with Sparse LSTM Autoencoder

The LSTM neural network is a variant of the recurrent neural network (RNN) [14]. Thanks to its self-connected memory cell, the LSTM can exploit long-range dependencies in temporal data. Figure 3 shows the architecture of an LSTM block; a standard architecture has three gates.

Fig.3 The architecture of a LSTM block

The Input Gate controls whether the current input flows into the memory cell. The Forget Gate decides how the last cell state influences the current state. The Output Gate determines when the unit should output the values in its memory [14]. In a data flow process, an LSTM cell preserves the last cell state c_{t-1} and the last cell output h_{t-1}; x_t denotes the current input and c_t the current cell state of the LSTM. Through this memory mechanism, the cell outputs h_t into the network. The output value and the values of the three gates are calculated by the following functions:

i_t = S(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (5)
f_t = S(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (6)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (7)
o_t = S(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)    (8)
h_t = o_t \tanh(c_t)    (9)

The S function is the same as in (1), W denotes the weight matrices composed of several blocks, and b denotes the biases.
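A direct NumPy transcription of one time step of (5)-(9) may make the gate interplay clearer. It reuses the sigmoid function from the SAE sketch above; the peephole weights W_ci, W_cf, and W_co are treated here as element-wise (diagonal) connections, an assumption the text does not spell out.

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W: dict of the weight blocks in Eqs. (5)-(8); b: dict of biases.
        i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])  # (5)
        f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])  # (6)
        c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])    # (7)
        o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c + b['o'])       # (8)
        h = o * np.tanh(c)                                                         # (9)
        return h, c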



Inspired by the memory mechanism of the LSTM and the attention mechanism, the LSTM autoencoder can use the "sequence to sequence" idea to reconstruct a sequence or predict its next content [15]. Since an action video sequence is a temporal signal, the LSTM autoencoder can be applied to it efficiently. Furthermore, several key frames of an action sequence can be extracted by adding a sparsity constraint to the encoder output. The core idea is that a temporal sequence can be encoded by one LSTM cell and the encoder output decoded by another LSTM to reconstruct the input sequence; if the encoder output is sparse, i.e., only a few time steps are important to the reconstructed sequence, then the inputs at these time steps can be seen as key frames. This paper utilizes such a sparse LSTM autoencoder to extract the key frames of action sequences; its framework is shown in Fig.4. In Fig.4, X_t denotes the current frame in an action sequence, and each frame contains m pixels. The data is sent to the first LSTM cell, whose output h_{1t} and state c_{1t} flow into the next LSTM cell as input. The second LSTM reconstructs the input sequence in reverse. The loss function of this network consists of two terms: the first is the squared error between the input and output sequences, and the second is an L1 regularization for the sparsity constraint. The loss is calculated by (10):

J_{S-L-A} = \sum_{t=1}^{T} \|Y_t - X_t\|_2^2 + \lambda \sum_{t=1}^{T} \|h_{1t}\|_1    (10)

where λ denotes the regularization parameter and T denotes the number of time steps of the action sequence. The L1 term not only achieves the sparsity of h_{1t} but also prevents over-fitting during training. After training, h_1 is sparse. Suppose n frames need to be extracted as key frames: the n largest h_{1t} are found, the corresponding time steps t are recorded, and the input frames at these time steps are picked up as key frames.



Fig.4 The framework of the sparse LSTM autoencoder
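The loss in (10) and the subsequent key frame selection can be sketched as follows. How the "maximal" h_{1t} are ranked is not fully specified in the text; this sketch scores each time step by the L1 magnitude of its encoder output, which is one plausible reading.

    def sparse_lstm_ae_loss(X, Y, H1, lam=1e-4):
        # X, Y: (T, m) input and reconstructed sequences; H1: (T, d_h)
        # encoder outputs h_1t. Implements Eq. (10).
        return np.sum((Y - X) ** 2) + lam * np.sum(np.abs(H1))

    def select_key_frames(X, H1, n=6):
        # Score each time step by the magnitude of its sparse encoder output,
        # keep the n largest, and return those frames in temporal order.
        scores = np.abs(H1).sum(axis=1)
        steps = np.sort(np.argsort(scores)[-n:])
        return X[steps], steps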

D. Encoding and Classification

Combining the dictionaries trained by the SAE with the key frames extracted by the sparse LSTM autoencoder, each action sequence can be represented as a matrix, which can then be flattened into a vector. This means that every sample in the dataset can be converted into a vector, and the classification of action video frames becomes an operation on vectors. The lasso function is utilized to represent the feature vectors with the dictionaries and key frames. Because the number of samples in the dataset is small, an SVM classifier is employed, which can recognize most actions reliably.
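A minimal sketch of this encoding step, assuming the dictionaries are arranged as columns of a matrix D. scikit-learn's coordinate-descent Lasso is used here as a stand-in for the ADMM lasso solver described in the experiments; the alpha value mirrors the sparsity parameter given later.

    import numpy as np
    from sklearn.linear_model import Lasso

    def lasso_encode(key_frames, D, lam=0.01):
        # key_frames: (n, 900) flattened key frames; D: (900, 25) matrix whose
        # columns are the SAE dictionaries. Each frame is regressed onto the
        # dictionary atoms with an L1 penalty; the coefficients are its code.
        solver = Lasso(alpha=lam, max_iter=10000)
        codes = np.array([solver.fit(D, frame).coef_ for frame in key_frames])
        return codes        # flatten with codes.ravel() -> e.g. 6 x 25 = 150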

III. EXPERIMENTS

To verify the validity of the proposed method, experiments were implemented in Python. The neural networks were built with TensorFlow, Google's deep learning framework. Experimental results and performance analysis were obtained on the DHA dataset [16]. The DHA dataset includes 23 categories of human behavior (bend, jack, jump, taichi, etc.), and every category is performed by 12 male and 9 female performers. The lengths of the action videos are not consistent. The binary action silhouette images were chosen for the recognition experiments.

A. Dictionary Learning

The dictionaries were extracted through the SAE training process. To achieve dictionary learning, all images of the action sequences were sent to train the SAE network. The number of input layer units is set to 900 because each image has 30 × 30 pixels. The hidden layer has 25 units, so 25 dictionaries are learned and each dictionary has size 30 × 30. Following the algorithm discussed above, the parameter settings of the SAE are shown in Table 1.

TABLE I. SAE PARAMETER SETTINGS

parameter                        value
Input layer units                900
Hidden layer units               25
Output layer units               900
Iterations                       10000
Desired activation ρ             0.01
Sparse coefficient β             3
Weight decay coefficient λ       0.0001
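As a usage note, the settings in Table 1 plug into the earlier loss sketch as follows; random data stands in here for the real silhouette images.

    # Usage sketch with the Table 1 settings and stand-in data.
    rng = np.random.default_rng(0)
    X = rng.random((64, 900))              # 64 flattened 30x30 silhouettes
    W1 = rng.normal(0, 0.01, (900, 25))    # 25 hidden units -> 25 dictionaries
    b1, b2 = np.zeros(25), np.zeros(900)
    loss = sae_forward_and_loss(X, W1, b1, b2, rho=0.01, beta=3.0, lam=1e-4)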

All images were sent to the training process, and the SAE finally converged. The weight matrix between the input layer and the hidden layer was extracted as the self-learned features, which serve as the dictionaries for recognition. The 25 extracted dictionaries are shown in Fig.5.


Fig.5 Dictionaries extracted from SAE

As shown in Fig.5, the brighter areas of the dictionaries reflect the moving parts of the actions, which indicates that the SAE network learned the inner information of the posture images. Every posture image can be represented as a vector by these 25 dictionaries.
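For reference, a figure like Fig.5 can be reproduced with a few lines of matplotlib, assuming the trained input-to-hidden weights are held in a 900 × 25 matrix W1 whose columns reshape to 30 × 30 atoms (the orientation is an assumption of this sketch).

    import matplotlib.pyplot as plt

    def show_dictionaries(W1):
        # W1: (900, 25); each column is one dictionary atom (a 30x30 image).
        fig, axes = plt.subplots(5, 5, figsize=(6, 6))
        for k, ax in enumerate(axes.ravel()):
            ax.imshow(W1[:, k].reshape(30, 30), cmap='gray')
            ax.axis('off')
        plt.show()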

B. Key Frames Extraction

The sparse LSTM autoencoder is used to select crucial time steps for key frame extraction. In this paper, each input action sequence contains 20 frames, so the number of time steps is 20. The network was analyzed above, and the parameters used in this experiment are set as in Table 2.

TABLE II. SPARSE LSTM AUTOENCODER PARAMETER SETTINGS

parameter                        value
Input layer units                900
Hidden layer units               200
Output layer units               900
Time steps                       20
Iterations                       10000
L1 regularization coefficient λ  0.0001

After 10000 iterations, the value of the loss function drops below 0.02, and the encoder output h_1 becomes a sparse tensor: only at some time steps t is h_{1t} significantly larger than at the others. The network records these important time steps to select 6 key frames for each action sequence. A set of results from training the sparse LSTM autoencoder is shown in Fig.6.

Fig.6 A set of results for sparse LSTM autoencoder training

This sample is an action sequence of bend. The first line shows the input sequence of the network, and the second line shows the output sequence reconstructed by the LSTM autoencoder. With the L1 regularization of h_{1t} added, the output can still reconstruct the input sequence properly; the output of the sparse LSTM autoencoder is shown in the third line. Moreover, the time steps of the 6 largest h_{1t} were recorded, and the input frames at these time steps are extracted as key frames. The key frames extracted by this mechanism are shown in the last line of Fig.6.

From this simple action sample, it can be seen that the sparse LSTM autoencoder is efficient at key frame extraction. The key frames extracted for actions like kick and taichi also describe the action features appropriately; these two samples are shown in Fig.7.



Fig.7 The key frames of the actions kick and taichi

The first and second lines of Fig.7 are the input sequences of kick and taichi. In the third line, the 6 frames on the left are the key frames of kick, and the 6 frames on the right are the key frames of taichi. Figures 6 and 7 both demonstrate the validity of this key frame extraction method.

C. Action Recognition

In this paper, the lasso function is used to solve for the coefficients of the dictionary representation. The lasso function has three main parameters: ρ, the augmented Lagrangian parameter; α, the over-relaxation parameter (typical values for α are between 1.0 and 1.8); and λ, the sparsity parameter. In this experiment, ρ, α, and λ are set to 1, 1, and 0.01, respectively. Each action sequence is represented as a vector of length 150 (6 key frames × 25 dictionary coefficients). Since the number of samples is too small to train a neural network classifier, an SVM is utilized to verify the method's validity. The SVC of the Python sklearn module is chosen as the classifier, with penalty parameter 100, an RBF kernel, and kernel parameter 0.001. Five action classes performed by 21 performers are chosen for model verification. The leave-one-out method is adopted, and each sample was tested 10 times to obtain the average value, for a total of 210 experiments. The evaluation protocol is sketched below, and the results are shown in Table 3.
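The setup just described maps directly onto scikit-learn; the following is a minimal sketch of the protocol (C=100, RBF kernel, gamma=0.001, leave-one-out), with the 10-repetition averaging omitted for brevity.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score, LeaveOneOut

    def evaluate(features, labels):
        # features: (N, 150) lasso codes of the action sequences; labels: (N,).
        clf = SVC(C=100, kernel='rbf', gamma=0.001)
        scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
        return scores.mean()    # leave-one-out recognition rate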

TABLE III. RESULTS OF THE MODEL VERIFICATION (CONFUSION MATRIX)

action         bend  side  walk  onehand wave  twohand wave
bend           20    0     1     0             0
side           0     18    1     0             2
walk           1     0     20    0             0
onehand wave   0     0     0     19            2
twohand wave   0     0     0     0             21

Table 3 shows the confusion matrix of the proposed recognition system, built from SAE dictionary learning and sparse LSTM autoencoder key frame extraction. The correct recognition rate of this model reaches 92.4%. The method is also compared with several recognition methods based on hand-crafted features proposed in [3]. Six actions (bend, jack, side, walk, onehand wave, and twohand wave) are chosen for this experiment, and the results are shown in Table 4.

TABLE IV. ACCURACY OF DIFFERENT METHODS

Recognition method                                      Correct recognition rate
SAE + sparse LSTM autoencoder + SVM                     89.8%
Hu moment + piecewise linear compression + HMM          80.8%
Radon transform + piecewise linear compression + HMM    89.7%
Star-skeleton + piecewise linear compression + HMM      89.5%


From Table 4, the proposed method performs better than the other three methods, which demonstrates that self-learned features and key frames capture more information than hand-crafted ones to a certain extent. A possible reason for the misclassifications is the similarity between the representations of some actions, together with the small number of training samples. In this classification, the number of dictionaries is 25, so each key frame is represented by only 25 coefficients, and the small sample size prevents neural networks from fully learning all the information inside the data; the similarity between samples is therefore relatively high. Enhancing the classifier is another direction for improving performance.

IV. CONCLUDING REMARKS


A self-learned action recognition method is proposed in this paper, and experimental results show its effectiveness. The self-learned features are extracted efficiently by an SAE, and these features serve as dictionaries of the sample space. A sparse LSTM autoencoder is applied to select the key frames of the action sequences; the extracted key frames describe the actions well and reduce the redundancy of the data. Each action sequence is represented over the dictionaries by means of the lasso function, and the represented data can be recognized properly by an SVM classifier. The method omits the manual preprocessing used in traditional methods and has a remarkable effect on data dimension reduction as well as effective action recognition.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China (grant number 61671285) and the Natural Science Foundation of Shandong Province of China (grant number ZR2016FP04).




REFERENCES

[1] L. Zhang, X. Wu, and D. Luo, "Recognizing human activities from raw accelerometer data using deep neural networks," Proc. 2015 IEEE 14th International Conference on Machine Learning and Applications, pp. 865-870, 2015.
[2] L. Liu, L. Shao, and P. Rockett, "Boosted key-frame selection for human action recognition," Pattern Recognition, vol. 46, no. 7, pp. 1810-1818, 2013.
[3] J. Wen, L. Liu, L. Rui, and S. Ma, "Human action recognition based on self-learning feature and HMM," Journal of System Simulation, vol. 27, no. 8, pp. 1782-1795, 2015.
[4] Y. Wang, S. Sun, and X. Ding, "A self-adaptive weighted affinity propagation clustering for key frames extraction on human action recognition," Journal of Visual Communication and Image Representation, pp. 193-202, 2015.
[5] X. Mao, H. Zhang, M. Sun, and D. Yuan, "Recurrent temporal sparse autoencoder for attention-based action recognition," 2016 International Joint Conference on Neural Networks, pp. 456-463, 2016.
[6] H. Y. Zhao, Z. J. Liu, and H. Zhang, "Human action recognition using the image contour," Journal of Optoelectronics & Laser, vol. 21, no. 10, pp. 1547-1551, 2010.
[7] M. A. R. Ahad, T. Ogata, J. K. Tan, et al., "Moment-based human motion recognition from the representation of DMHI templates," SICE Conference, IEEE, pp. 578-583, 2008.
[8] X. Zhang et al., "Human activity recognition using multi-layered motion history images with Time-Of-Flight (TOF) camera," Journal of Electronics and Information Technology, vol. 36, no. 5, pp. 1139-1144, 2014.
[9] H. Yu and L. Guo, "A posture description based on Star-skeleton and HMM," Communications Technology, vol. 45, no. 12, pp. 91-94, 2012.
[10] A. Ng, "Sparse autoencoder," CS294A lecture notes, p. 72, 2011.
[11] Y. LeCun, L. Bottou, Y. Bengio, et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[12] S. Xu, "Research on human detection based on feature learning in depth image," Xiamen University, China, 2014.
[13] B. Huang and Z. Ying, "Sparse autoencoder for facial expression recognition," 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing, Autonomic and Trusted Computing, and Scalable Computing and Communications, pp. 1529-1532, 2015.
[14] A. Pulver and S. Lyu, "LSTM with working memory," 2017 International Joint Conference on Neural Networks, pp. 845-851, 2017.
[15] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised learning of video representations using LSTMs," Proc. 32nd International Conference on Machine Learning (ICML 2015), vol. 1, pp. 843-852, 2015.
[16] Y. C. Lin, M. C. Hu, W. H. Cheng, Y. H. Hsieh, and H. M. Chen, "Human action recognition and retrieval using sole depth information," ACM International Conference on Multimedia, pp. 1053-1056, 2012.