Video-based action recognition has broad application prospects in areas such as information retrieval and security monitoring, and it is one of the most actively studied topics in computer vision and pattern recognition. In recent years, considerable progress has been made in action recognition, and methods based on local action features have been particularly successful. Traditional action feature methods extract keypoints from videos or images and then design fixed-dimensional vectors, called descriptors, such as HOG and HOF, to represent the local motion. Although these features are simple and effective, they cannot adequately represent and describe videos that contain multiple targets and complex backgrounds. How to obtain discriminative and representative motion features has therefore always been the core issue in action recognition research. Following the success of feature learning algorithms in image recognition, introducing feature learning into action recognition has also achieved state-of-the-art results. Such algorithms learn the representation of action features directly from data through unsupervised or supervised learning, and are better able to represent complex videos. This paper focuses on feature design and feature learning in action recognition, and draws on unsupervised learning, semi-supervised learning, and manifold learning to study how to describe and analyze videos in order to improve the accuracy of action recognition. The main work of this paper is summarized as follows:
(1) An action recognition algorithm based on fast Haar3D features is proposed. The algorithm designs a family of three-dimensional wavelet features to detect and represent motion regions quickly, then obtains a rich global description through spatio-temporal over-complete pooling, and finally selects the optimal feature dimensions for the classification model via an online feature selection algorithm. Compared with classical action recognition methods, the algorithm improves both computational efficiency and recognition accuracy. Local features are an important class of methods in action recognition. They represent actions through feature detection, feature descriptor design, and spatio-temporal pyramid modeling. To improve the computational efficiency of motion features, this paper generalizes the Haar wavelet to three-dimensional space to form Haar3D motion features, and extracts the components of the object’s moving parts in all directions as local motion representations. Because the traditional pyramid model has insufficient descriptive ability, spatio-temporal over-complete pooling is used to increase the completeness of the action description and the spatio-temporal correlation of the local features. Finally, an online feature selection method is used to select the best features for classification, improving recognition efficiency while maintaining recognition accuracy.
(2) An action recognition algorithm based on motion saliency feature learning is proposed. The algorithm applies a visual attention mechanism: it extracts motion boundaries through motion analysis and then uses the original video and the motion boundaries to jointly learn motion features by constraining them to share the same action representation. This feature learning algorithm reduces the correlation between motion features and image texture and color, and improves the representational and discriminative ability of the motion features. For motion features, representing the motion in the video effectively is the key to improving recognition ability. Traditional feature learning algorithms learn directly from the pixels of an image or video, so the learned features contain a large amount of information about the texture and color of the image itself. To make the features focus on motion, the algorithm first extracts the motion boundaries, which preserve the motion while removing interference from other information in the image, and then jointly learns, from the original video and the motion boundaries, features that share the same motion representation; these are used as the motion features. Compared with traditional learning algorithms that depend only on video pixels, this motion feature learning algorithm effectively captures the moving parts of the video while avoiding interference from the appearance of the video itself, and further improves recognition accuracy.
(3) An action feature learning algorithm based on an optical flow constrained auto-encoder is proposed. To improve the ability of action feature learning to extract the dynamic characteristics of video, the algorithm uses the optical flow field to constrain the auto-encoder, which effectively improves action recognition results. The auto-encoder is a widely used learning algorithm, for example as a building block in deep learning. In this paper, a new regularized auto-encoder learning network is designed, consisting of a main network and an auxiliary network. The main network learns the statistical properties of the local video itself through the auto-encoder, and the auxiliary network constrains the output of the main network by incorporating the optical flow field. The two networks collaboratively learn motion features that combine video pixel information with dynamic optical flow information. This learning algorithm improves on traditional action feature learning methods, which rely only on video pixels and cannot distinguish changes in the temporal dimension from changes in the spatial dimension. The optical flow constrained auto-encoder not only improves the action features’ representation of the dynamic characteristics of the video, but also enhances their discriminative ability.
(4) A semi-supervised manifold constrained auto-encoder is proposed. The algorithm applies a semi-supervised manifold learning method to the auto-encoder and addresses the semi-supervised feature learning problem with few labeled samples. Semi-supervised learning is one of the most valuable problems in pattern recognition: in massive data sets, the labeled samples are often only a very small part of the total. To obtain a model that generalizes from a small number of labeled samples, this paper applies a semi-supervised manifold learning method to feature learning and designs an auto-encoder network based on maximum entropy constraints. The network consists of an auto-encoder network and a regularization network: the auto-encoder network performs an unsupervised nonlinear feature mapping of the unlabeled data, and the semi-supervised manifold network uses the labels to adjust the nonlinear mapping of the features. Experiments achieve state-of-the-art results in both image recognition and action recognition tasks.
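To make the Haar3D idea in contribution (1) concrete, the following is a minimal sketch of how a three-dimensional Haar-like response can be evaluated in constant time from an integral video. It is an illustration under stated assumptions, not the filter family used in this paper: the function names (`integral_video`, `box_sum`, `haar3d_temporal`) and the single two-box temporal filter are hypothetical choices made for the example.

```python
import numpy as np

def integral_video(video):
    """Cumulative sum over (t, y, x), zero-padded at the front of each axis
    so that box sums need no special cases at the borders."""
    iv = video.astype(np.float64).cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, t0, y0, x0, t1, y1, x1):
    """Sum of video[t0:t1, y0:y1, x0:x1] from 8 lookups (inclusion-exclusion)."""
    return (iv[t1, y1, x1]
            - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0]
            - iv[t0, y0, x0])

def haar3d_temporal(iv, t, y, x, size):
    """Assumed two-box Haar-like filter along time:
    sum over the later half of the cuboid minus sum over the earlier half."""
    half = size // 2
    early = box_sum(iv, t, y, x, t + half, y + size, x + size)
    late = box_sum(iv, t + half, y, x, t + 2 * half, y + size, x + size)
    return late - early
```

Analogous two-box filters along the y and x axes would give responses in the other directions. Because every box sum costs eight lookups regardless of its size, the cost of a response does not depend on the filter scale, which is the usual reason Haar-like features are fast to compute.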
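For contribution (2), one common way to obtain motion boundaries is to take spatial gradients of a dense optical flow field, so that uniform motion (including a uniformly moving background) is suppressed and only places where the motion changes remain. The sketch below assumes this definition and assumes the flow field has already been computed by some external method; the paper's own motion analysis step may differ, and the function name `motion_boundaries` is hypothetical.

```python
import numpy as np

def motion_boundaries(flow):
    """Boundary strength from a dense flow field.

    flow: array of shape (H, W, 2) holding horizontal (u) and vertical (v)
          displacement per pixel (assumed to be precomputed elsewhere).
    Returns an (H, W) map that is zero wherever the flow is locally constant.
    """
    # np.gradient on a 2-D array returns derivatives along axis 0 (y), then axis 1 (x).
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    return np.sqrt(du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2)
```

A map of this kind carries no texture or color information, which is consistent with the stated goal of keeping the motion while removing other image content before joint feature learning.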
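The main/auxiliary structure in contribution (3) can be read as a regularized objective: a reconstruction term from the auto-encoder plus a term that ties the learned code to the optical flow. The sketch below is one possible reading, not the network defined in this paper. It assumes a single sigmoid hidden layer shared by both networks, a linear auxiliary head that predicts the flow of the same patch, squared-error losses, and a weight `lam`; all names (`init_params`, `flow_constrained_loss`, `W_aux`, `lam`) are hypothetical.

```python
import numpy as np

def init_params(d, k, f, seed=0):
    """d: pixels per video patch, k: hidden units, f: flow values per patch."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0, 0.1, (d, k)), "b_enc": np.zeros(k),
        "W_dec": rng.normal(0, 0.1, (k, d)), "b_dec": np.zeros(d),
        "W_aux": rng.normal(0, 0.1, (k, f)), "b_aux": np.zeros(f),
    }

def flow_constrained_loss(x, flow, p, lam=0.5):
    """x: (n, d) video patches; flow: (n, f) optical flow of the same patches.

    Returns (total, reconstruction_error, flow_error).
    """
    # Shared hidden code computed from pixels only.
    h = 1.0 / (1.0 + np.exp(-(x @ p["W_enc"] + p["b_enc"])))
    # Main network: reconstruct the input patch from the code.
    x_hat = h @ p["W_dec"] + p["b_dec"]
    # Auxiliary network (assumed form): predict the flow from the same code.
    flow_hat = h @ p["W_aux"] + p["b_aux"]
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    flow_err = np.mean(np.sum((flow_hat - flow) ** 2, axis=1))
    return recon + lam * flow_err, recon, flow_err
```

Under this reading, setting `lam` to zero recovers a plain auto-encoder, and increasing it pushes the hidden code to retain information that is predictive of temporal change rather than of appearance alone.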