Type: Basic Research    Pre-defense Date: 2017-10-19
Start (Proposal) Date: 2016-04-28    Thesis Completion Date: 2017-06-07
Venue: Conference Room, 4th Floor, Computer Science Building, Jiulonghu Campus, Southeast University    Topic Source: Other Projects    Thesis Length: 67,000 Chinese characters
Title: Research on Topic Model-Based Sentiment and Topic Modeling of Text
Keywords: topic model, sentiment analysis, text, knowledge base, word embedding
Abstract: With the rapid development of the Internet and the rise of social media, long texts such as news articles, blogs, and reviews, and short texts such as microblog posts and instant messages, accumulate day by day. Long texts are well edited and information-rich; short texts, though limited in length, are updated rapidly, arrive in huge volumes, and cover a wide range of domains. These massive text collections contain a great deal of practically valuable information, and mining the meaningful information in them automatically reflects how effectively computers can understand and process the semantics of text. Mining the latent semantic structure hidden in these texts is a major open problem in natural language processing and text retrieval. Topic models are a popular and effective approach that mines and analyzes the semantic structure of text through the co-occurrence of words within documents. Topic-model techniques have been applied in many research areas with good results. However, existing topic models for long texts do not take into account the prior knowledge carried by the words or entities in the text, and no effective model has yet been proposed for joint sentiment and topic modeling of short texts.

With the growing demand for text analysis, this thesis takes three kinds of text as its application background — long news texts, review data, and microblogs — and studies topic modeling and sentiment-aware topic modeling. Because long texts such as news generally convey objective topical information, whereas reviews and microblogs contain a large amount of subjective information, this thesis models only topics for news texts, and models both topics and sentiments for reviews and microblogs. The thesis focuses on three important problems in topic modeling and sentiment-topic modeling: incorporating semantic knowledge into topic modeling of long texts; incorporating semantic relations between words into sentiment-topic modeling of review data; and the content-sparsity problem in sentiment-topic modeling of short texts. The detailed research contents are as follows:

First, we analyze a shortcoming of existing topic-modeling methods: relying solely on high-order word co-occurrence in the corpus, they cannot adequately capture the semantics of low-frequency words. To overcome this, the thesis proposes a new topic model based on Wikipedia knowledge, which uses the concepts and categories in an external knowledge base to compensate for the insufficient co-occurrence of low-frequency words. The proposed model, WCM-LDA, models both the words and the entities in the text and, through the entities, introduces concept and category knowledge from the external knowledge base, alleviating the co-occurrence sparsity of some words; moreover, the model outputs the words, concepts, and categories of each topic, allowing topics to be presented more intuitively.

Second, semantic relations between words are especially important in review data. Existing sentiment topic models can capture such relations only through co-occurrence, which again works poorly for low-frequency words on small datasets. To overcome this difficulty, the thesis introduces external word embeddings to build semantic associations between words, so that low-frequency words can be modeled effectively for sentiment and topic. The proposed HST-SCW model assigns words that are close in the embedding space to the same semantic cluster, so that semantically similar words can be assigned the same topic and sentiment.

Finally, for the content-sparsity problem of short texts, we analyze traditional sentiment topic models, which rely solely on word co-occurrence; because the context of a short text is sparse, co-occurrence information is insufficient. Studying the characteristics of social media such as microblogs, we observe that the content of a post is strongly related to its posting time and its author: user-related posts mostly reflect personal interests, while time-related posts are usually about current events or topics. Based on this observation, the thesis proposes a new sentiment- and topic-modeling method that aggregates microblog posts by user and by time to compensate for sparse context: the time-user sentiment topic model TUS-LDA, in which each post is assigned to either its user or its timeslice, all words in a post share the same topic, but a single post may express multiple sentiments.

Through the above research, this thesis proposes new methods for topic modeling or sentiment-aware topic modeling on news texts, review data, and short texts, providing new ideas for topic modeling of text. All of this work is driven by real applications and has practical value. Although many researchers have already contributed substantially to the study of both long and short texts, sentiment and topic modeling of such texts still faces many difficulties, and this thesis aims to make a contribution toward advancing it.
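The word-clustering idea behind HST-SCW described above — assigning words that are close in embedding space to the same semantic cluster, so that rare words inherit the topic/sentiment behavior of their frequent neighbors — can be sketched as follows. This is a minimal, hypothetical illustration with toy 2-D "embeddings" and a plain k-means routine; it is not the thesis's actual implementation.

```python
import numpy as np

def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    # pick the first vector, then repeatedly the vector farthest from all centers
    centers = [vectors[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(vectors - c, axis=1) for c in centers], axis=0)
        centers.append(vectors[dists.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        # assign each vector to its nearest center (Euclidean distance)
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster is empty
        for c in range(k):
            if (labels == c).any():
                centers[c] = vectors[labels == c].mean(axis=0)
    return labels

# Toy 2-D "embeddings" with two obvious semantic groups
words = ["good", "great", "excellent", "bad", "awful", "terrible"]
vecs = np.array([[1.0, 0.9], [0.9, 1.0], [1.1, 1.0],
                 [-1.0, -0.9], [-0.9, -1.1], [-1.1, -1.0]])
clusters = {w: int(l) for w, l in zip(words, kmeans(vecs, k=2))}
# Semantically similar words fall into the same cluster, so a rarer word
# such as "excellent" can share the sentiment/topic of frequent neighbors.
assert clusters["good"] == clusters["great"] == clusters["excellent"]
assert clusters["bad"] == clusters["awful"] == clusters["terrible"]
```

In the actual model, cluster membership would feed into the sampling of sentiment and topic assignments; here the sketch only shows the clustering step itself.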
English Title: Research on Topic Model-Based Approaches to Sentiment and Topic Modeling of Text
English Keywords: topic model, sentiment analysis, text, knowledge base, word embedding
English Abstract: With the rapid growth of the Internet and the emergence of social media, long texts, such as news, blogs, and reviews, and short texts, such as tweets and instant messages, accumulate day by day. Long texts are often well edited and contain rich information. Although short texts individually carry limited information, they are updated very frequently and arrive in huge volumes covering a wide range of domains. These texts contain much valuable information, but mining knowledge from them automatically is a non-trivial task that requires the capability of text understanding. Mining the latent semantic structure from these texts is a difficult problem in natural language processing and information retrieval. Topic models are a popular and effective method that analyzes the latent semantic structure of texts by mining high-order word co-occurrences. Recently, topic models have been applied to many research topics and have achieved good results. For topic modeling of long texts, existing work does not consider external knowledge about the words and entities in the texts. Moreover, there is no effective existing work on sentiment and topic modeling of short texts. With the increasing need for text analysis, this thesis focuses on topic modeling and sentiment/topic modeling on three kinds of texts: news texts, review texts, and tweets. Long texts such as news normally discuss objective topics, while review texts and tweets contain rich subjective information. Hence, we model only topics on news texts, and model both sentiments and topics on review texts and tweets. In this thesis, we study three problems in topic and sentiment modeling: 1) incorporating semantic knowledge into topic modeling for long texts; 2) incorporating lexical knowledge into sentiment/topic modeling for review texts; 3) sentiment/topic modeling on tweets.
The detailed research topics are as follows. First, we analyze a shortcoming of existing work: these models depend only on high-order word co-occurrences, so topic models cannot work well on low-frequency words. To overcome this problem, we propose a new topic model based on Wikipedia knowledge, which leverages external knowledge — the concepts and categories in the Wikipedia knowledge base — to improve the performance of topic modeling. Our proposed model, WCM-LDA, models not only the words and entities in the texts but also the concepts and categories from the external knowledge base. WCM-LDA alleviates the problem of low-frequency words and visualizes topics with words, concepts, and categories. Second, existing sentiment and topic models on review texts likewise depend only on high-order word co-occurrences and do not work well for low-frequency words, yet semantic associations between words are important for modeling sentiments and topics on review texts. Hence, we propose a model that incorporates lexical knowledge from word embeddings, introducing semantic associations between words into sentiment topic models and thereby alleviating the low-frequency-word problem. In our proposed HST-SCW model, words that are close in the embedding space are assigned to the same semantic clusters, so that semantically similar words can be assigned the same sentiments and topics. Finally, to address the problem of context sparsity in short texts such as tweets, we analyze the existing models, which depend mainly on high-order word co-occurrences and therefore cannot work well on short texts because of context sparsity, i.e., the lack of word co-occurrence. We observe that on social media the content of tweets is strongly related to time and users: tweets related to users are mostly about the users' personal interests, while tweets related to time are often about current events and topics.
Based on this observation, we propose a new model for topic and sentiment modeling on short texts, TUS-LDA, which utilizes time and users to alleviate context sparsity. In TUS-LDA, each short text is modeled at either the timeslice level or the user level; each short text is assigned a single topic but may express multiple sentiments. Through the above research, we propose three new methods for topic modeling or sentiment/topic modeling on news texts, review texts, and tweets, providing novel approaches to topic modeling on text. Moreover, our research is application-oriented and has practical value. Although many researchers have contributed to the study of long and short texts, modeling sentiments and topics on texts remains difficult, and our work aims to make a modest contribution to this area.
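The aggregation idea behind TUS-LDA — pooling short texts by user or by timeslice into pseudo-documents so that word co-occurrence statistics become less sparse — can be sketched as follows. This is a minimal illustration on a hypothetical toy corpus with made-up helper names (`pool`, `cooccurrences`); it shows the pooling step only, not the model's actual inference procedure.

```python
from collections import defaultdict
from itertools import combinations

# Toy corpus: (user, timeslice, tokens). Each tweet alone is too short
# to yield reliable word co-occurrence statistics.
tweets = [
    ("alice", 1, ["soccer", "goal"]),
    ("alice", 2, ["soccer", "league"]),
    ("bob",   1, ["election", "vote"]),
    ("bob",   2, ["election", "debate"]),
]

def pool(tweets, key):
    """Aggregate tweets into pseudo-documents by 'user' or 'time'."""
    docs = defaultdict(list)
    for user, time, tokens in tweets:
        docs[user if key == "user" else time].extend(tokens)
    return dict(docs)

def cooccurrences(doc):
    """Unordered word pairs co-occurring within one (pseudo-)document."""
    return {tuple(sorted(p)) for p in combinations(set(doc), 2)}

user_docs = pool(tweets, "user")
# Pooling by user recovers co-occurrences that no single tweet contains:
# "goal" and "league" never appear together in any one tweet above.
assert ("goal", "league") in cooccurrences(user_docs["alice"])
```

In the full model, a topic model would then be trained over these pooled pseudo-documents (with each original post keeping one topic but possibly several sentiments); the sketch above only demonstrates why pooling mitigates sparsity.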
Academic Seminars
Host Institution    Date    Venue    Speaker    Topic
Southeast University    2015.10.20    School of Computer Science    张磊    X-LiSA: Cross-lingual Semantic Annotation
Southeast University    2014.5.11    School of Computer Science    王克文    Forgetting for Answer Set Programs
Southeast University    2014.9.18    School of Computer Science    杜剑锋    ABox Abduction over Description Logic Ontologies
Southeast University    2013.11.22    School of Computer Science    Frank van Harmelen    Linked Open Data
Southeast University    2016.10.22    School of Computer Science    徐康    Principles and Applications of Lifelong Topic Models
Southeast University    2017.3.22    School of Computer Science    徐康    Public Opinion Analysis Based on Sentiment Analysis
Southeast University    2013.9.20    School of Computer Science    徐康    Principles and Applications of Decision Trees
Southeast University    2016.12.19    School of Computer Science    徐康    Topic Model-Based Sentiment Analysis of Short Microblog Texts
     
Academic Conferences
Conference    Date    Venue    Presentation Title (if presented)
ECAI 2016    2016.8    The Hague, Netherlands    A Joint Model for Sentiment-Aware Topic Detection on Social Media
Linked Data workshop in UPM    2016.6    Madrid, Spain    Incorporate Wikipedia Knowledge into Topic Model
NLPCC 2014    2014.11    Shenzhen
     
Representative Publications
Paper Title
Incorporating Wikipedia concepts and categories as prior knowledge into topic models
A Joint Model for Sentiment-Aware Topic Detection on Social Media
 
Defense Committee
Name    Title    Supervisor Type    Institution    Chair    Notes
曲维光    Professor (senior rank)    Doctoral Supervisor    Nanjing Normal University
章成志    Professor (senior rank)    Doctoral Supervisor    Nanjing University of Science and Technology
夏睿    Professor (senior rank)    Doctoral Supervisor    Nanjing University of Science and Technology
翟玉庆    Professor (senior rank)    Master's Supervisor    Southeast University
周德宇    Professor (senior rank)    Doctoral Supervisor    Southeast University
      
Defense Secretary
Name    Title    Institution    Notes
张祥    Lecturer (other)    Southeast University