As HD and 3D technologies become popular, streaming media services, represented by Internet TV, have come to account for a major share of Internet traffic. As an important streaming media application in the "Internet +" era, Internet TV is favored by users for its rich content and convenience. However, limited network transmission resources struggle to meet the demand of massive Internet TV viewing, which causes network congestion, packet loss, and delay, and seriously degrades the user experience. To relieve the pressure on Internet data transmission and to improve the service quality and viewing experience of Internet TV, this paper analyzes the behavior of Internet TV users, studies how the popularity of Internet TV content evolves, and builds high-precision content popularity prediction models. It also proposes a cache scheduling algorithm that adapts to multiple popularity evolution trends, and implements an Internet TV user behavior analysis platform with interactive processing capabilities. The main work and innovations are as follows:
(1) Existing prediction models suffer from long training times, large sample requirements, and poor accuracy on programs that become popular suddenly. To address these problems, an Internet TV program popularity prediction model based on the dynamic features of viewing behavior is proposed. Six billion viewing records from 2.8 million subscribers of an Internet TV platform are analyzed, and the evolution of program popularity is divided into four types: endogenous, endogenous subcritical, exogenous, and exogenous subcritical. A prediction model is constructed for each type using a least squares support vector machine with double-population particle swarm optimization. Experimental results show that, compared with existing prediction models, prediction accuracy improves by more than 17% and the forecast period is effectively shortened.
(2) Program popularity evolves very differently across Internet TV platforms, so existing prediction models cannot be reused from one platform to another. To address this problem, a program popularity forecasting approach based on trend detection is proposed. First, a K-Medoids algorithm based on dynamic time warping distance groups the popularity evolution of programs into four trends. Then, four trend-specific prediction models are built separately using random forest regression. Using features extracted from the electronic program guide and early viewing records, a gradient boosting decision tree classifies newly published programs into the four trends. Finally, the approach combines the forecasts of the trend-specific models with the classification probabilities to obtain the final prediction. Experimental results show that, compared with a traditional single forecasting model, this approach adjusts dynamically to users’ viewing behavior in different regions and improves prediction accuracy by 20%.
(3) Existing Internet TV platforms consume excessive storage space to raise the hit rate of hot programs. To address this problem, PPRA, a cache scheduling algorithm based on popularity prediction, is proposed. First, a random forest algorithm is used to construct the program popularity forecasting model, and principal component analysis is applied to the selected features to counter the curse of dimensionality. Then, the programs in the cache are scheduled according to the predicted program popularity. Finally, PPRA is evaluated on 120 days of viewing data from 1.3 million users. Experimental results show that, at the same cache hit rate, PPRA needs only 30% of the storage space required by the LRU and LFU algorithms, which can significantly reduce platform construction costs.
(4) Based on the above methods, an Internet TV user behavior analysis system is designed and implemented. Its behavior event capture subsystem consists of agents embedded in set-top boxes and a capturing cluster. In addition, a SQL-on-Hadoop engine with distributed transaction management is developed, which is compatible with unstructured data sources. Experimental results show that, on a commercial Internet TV platform with 16 million users, a 32-node cluster can process 10.2 TB of user behavior data per day, nearly 40 times faster than comparable systems based on Hadoop/Hive.
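
To make the trend-grouping step of contribution (2) concrete, the following is a minimal sketch of the textbook dynamic time warping distance and of assigning a popularity curve to its nearest medoid, which is the assignment step of K-Medoids. The function names, the absolute-difference local cost, and the sample curves are illustrative assumptions, not details taken from this work.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook DTW distance with |x - y| as the local cost (an assumption)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_medoid(series: Sequence[float], medoids: Sequence[Sequence[float]]) -> int:
    """Return the index of the medoid closest to `series` under DTW."""
    distances = [dtw_distance(series, medoid) for medoid in medoids]
    return distances.index(min(distances))
```

Because DTW tolerates shifts and stretches along the time axis, two programs whose popularity peaks have the same shape but occur on different days can still fall into the same trend group.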
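
The final step of contribution (2) combines the trend-specific forecasts with the classification probabilities. The abstract does not state the exact combination rule, so the sketch below assumes a probability-weighted average, which is one common choice; `combine_forecasts` and the sample numbers are hypothetical.

```python
from typing import Sequence


def combine_forecasts(trend_probabilities: Sequence[float],
                      trend_forecasts: Sequence[float]) -> float:
    """Probability-weighted average of per-trend forecasts (assumed rule)."""
    if len(trend_probabilities) != len(trend_forecasts):
        raise ValueError("need one forecast per trend probability")
    total = sum(trend_probabilities)
    if total <= 0:
        raise ValueError("trend probabilities must sum to a positive value")
    weighted = sum(p * f for p, f in zip(trend_probabilities, trend_forecasts))
    # Dividing by the total keeps the result valid for unnormalized scores.
    return weighted / total


# Hypothetical example: a classifier that is 70% sure of the first trend.
forecast = combine_forecasts([0.7, 0.1, 0.1, 0.1], [1000.0, 200.0, 50.0, 10.0])
```

A soft combination of this kind lets an uncertain classification degrade gracefully, since no single trend-specific model is trusted outright.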
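
The idea behind contribution (3), scheduling cached programs by predicted popularity instead of by recency or frequency, can be illustrated with a toy cache. This is a simplified sketch and not the PPRA algorithm itself: the class name, the admission rule, and the eviction rule are assumptions, and the popularity scores are supplied by the caller instead of by a trained model.

```python
from typing import Dict


class PopularityCache:
    """Toy cache that evicts the program with the lowest predicted popularity."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Maps each cached program id to its latest predicted popularity.
        self.scores: Dict[str, float] = {}

    def request(self, program_id: str, predicted_popularity: float) -> bool:
        """Return True on a hit; on a miss, admit the program if it is worth caching."""
        if program_id in self.scores:
            self.scores[program_id] = predicted_popularity  # refresh the prediction
            return True
        if len(self.scores) < self.capacity:
            self.scores[program_id] = predicted_popularity
            return False
        weakest = min(self.scores, key=lambda pid: self.scores[pid])
        # Replace the weakest entry only when the newcomer is predicted to be hotter.
        if predicted_popularity > self.scores[weakest]:
            del self.scores[weakest]
            self.scores[program_id] = predicted_popularity
        return False
```

Unlike LRU, a single request for a program that is predicted to stay unpopular does not displace a cached program that is predicted to stay hot.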