With the rapid growth of the Internet and the emergence of social media, long texts (such as news, blogs, and reviews) and short texts (such as tweets and instant messages) accumulate daily. Long texts are often well edited and rich in information. Short texts each contain limited information, but they are updated very frequently and cover general-domain topics in huge volume. These texts contain much valuable information, but mining knowledge from them automatically is a non-trivial task that requires the capability of text understanding.
Mining the latent semantic structure of these texts is a difficult problem in natural language processing and information retrieval. Topic models are a popular and effective method that analyzes the latent semantic structure of texts by mining high-order word co-occurrences. Recently, topic models have been applied to many research problems with good results. For topic modeling of long texts, however, existing work does not consider external knowledge about the words and entities in the texts. Moreover, there is no existing work on sentiment and topic modeling of short texts.
With the increasing need for text analysis, this paper focuses on topic modeling and joint sentiment/topic modeling of three kinds of texts: news texts, review texts, and tweet texts. Long texts such as news texts normally discuss objective topics, while review texts and tweet texts contain rich subjective topics. Hence, we model only topics on news texts, and model both sentiments and topics on review texts and tweet texts. We study three problems in topic and sentiment modeling: 1) incorporating semantic knowledge into topic modeling for long texts; 2) incorporating lexical knowledge into sentiment/topic modeling for review texts; 3) sentiment/topic modeling on tweet texts. The detailed research topics are as follows:
First, we analyze the shortcoming of existing work: these models depend only on high-order word co-occurrences, so they do not work well on low-frequency words. To overcome this problem, we propose a new topic model based on Wikipedia knowledge, which leverages external knowledge, i.e., concepts and categories in the Wikipedia knowledge base, to improve topic modeling performance. Our proposed model, WCM-LDA, models not only the words and entities in texts but also the concepts and categories from the external knowledge base. WCM-LDA can handle low-frequency words and can visualize topics with words, concepts, and categories.
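As a rough illustration of the input side of such a model, the sketch below attaches concepts and categories to the entities found in a document. It is not WCM-LDA itself. The knowledge base `TOY_KB`, its entries, and the function `augment` are hypothetical names and made-up data introduced only for this example; a real system would look entities up in Wikipedia.

```python
from collections import Counter

# Hypothetical toy knowledge base (made-up entries, not real Wikipedia data):
# entity -> (concepts, categories)
TOY_KB = {
    "entity_a": (["concept_x"], ["category_1"]),
    "entity_b": (["concept_x", "concept_y"], ["category_2"]),
}


def augment(tokens, entities, kb):
    """Build four count tables for one document: words, entities,
    and the concepts and categories linked to those entities."""
    views = {
        "word": Counter(tokens),
        "entity": Counter(entities),
        "concept": Counter(),
        "category": Counter(),
    }
    for entity in entities:
        # Entities missing from the knowledge base contribute nothing extra.
        concepts, categories = kb.get(entity, ([], []))
        views["concept"].update(concepts)
        views["category"].update(categories)
    return views


doc_views = augment(
    tokens=["rare", "word", "word"],
    entities=["entity_a", "entity_b", "entity_unknown"],
    kb=TOY_KB,
)
```

Even when a word is rare, the concepts and categories linked to co-occurring entities give a topic model additional, denser evidence about what the document discusses.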
Second, existing sentiment and topic models for review texts depend only on high-order word co-occurrences and do not work well for low-frequency words. Semantic associations between words are important for modeling sentiments and topics in review texts. Hence, we propose a model that incorporates lexical knowledge from word embeddings into sentiment topic models, introducing semantic associations between words and thereby addressing the problem of low-frequency words. In our proposed HST-SCW model, words that are close in the word embedding space are assigned to the same semantic cluster, so that semantically similar words can be assigned to the same sentiments and topics.
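The idea of grouping nearby words into semantic clusters can be sketched as follows. The text above does not say how HST-SCW forms its clusters, so this example swaps in a simple greedy cosine-similarity rule as a stand-in. The function names, the `threshold` value, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_clusters(embeddings, threshold=0.8):
    """Greedy clustering: a word joins the first cluster whose seed vector
    is at least `threshold` similar; otherwise it starts a new cluster."""
    seeds = []        # one seed vector per cluster, in creation order
    assignment = {}   # word -> cluster id
    for word, vec in embeddings.items():
        for cluster_id, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment[word] = cluster_id
                break
        else:
            seeds.append(vec)
            assignment[word] = len(seeds) - 1
    return assignment


# Made-up 2-d vectors standing in for real word embeddings.
TOY_EMBEDDINGS = {
    "good": [0.9, 0.1],
    "great": [0.85, 0.2],
    "battery": [0.1, 0.9],
    "charger": [0.2, 0.95],
}
clusters = semantic_clusters(TOY_EMBEDDINGS)
```

With these toy vectors, "good" and "great" fall into one cluster and "battery" and "charger" into another, so a rare word can share statistical strength with its frequent neighbors.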
Finally, we address the problem of context sparsity in short tweet texts. Existing models mainly depend on high-order word co-occurrences and therefore do not work well on short texts, which suffer from context sparsity, i.e., a lack of word co-occurrences. We observe that the content of tweets is strongly related to time and users in social media: short texts related to users are mostly about the users’ personal interests, while short texts related to time are often about current events or topics. Based on this observation, we propose TUS-LDA, a new model for topic and sentiment modeling of short texts that uses time and users to address context sparsity. In TUS-LDA, we model each short text at the timeslice level or the user level, where each short text discusses only one topic but can express multiple sentiments.
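To see why time and users help with sparsity, the sketch below pools tweets into larger pseudo-documents keyed by user or by timeslice. It only illustrates the grouping step, not TUS-LDA's inference or its choice of level for each short text. The function `pool_tweets`, the dictionary keys, and the sample tweets are hypothetical.

```python
from collections import defaultdict


def pool_tweets(tweets, level):
    """Group tweet tokens into pseudo-documents keyed by user or timeslice."""
    if level not in ("user", "timeslice"):
        raise ValueError("level must be 'user' or 'timeslice'")
    pooled = defaultdict(list)
    for tweet in tweets:
        pooled[tweet[level]].extend(tweet["tokens"])
    return dict(pooled)


# Made-up tweets; each has an author, a timeslice label, and its tokens.
TOY_TWEETS = [
    {"user": "u1", "timeslice": "day1", "tokens": ["love", "hiking"]},
    {"user": "u2", "timeslice": "day1", "tokens": ["election", "results"]},
    {"user": "u1", "timeslice": "day2", "tokens": ["trail", "photos"]},
    {"user": "u3", "timeslice": "day1", "tokens": ["election", "debate"]},
]

by_user = pool_tweets(TOY_TWEETS, "user")
by_time = pool_tweets(TOY_TWEETS, "timeslice")
```

Pooling by user brings together the two hiking-related tweets from `u1`, while pooling by timeslice brings together the election tweets from `day1`; in both cases the pooled text offers more word co-occurrences than any single tweet.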
In summary, we propose three new methods for topic modeling or sentiment/topic modeling of news texts, review texts, and tweet texts. This research is of practical value. Although many researchers have contributed to the modeling of long and short texts, modeling sentiments and topics in text remains difficult, and our work aims to make a modest contribution to it.