Abstract: In recent years, the rapid growth of microblogging has provided a new source of text for named entity recognition research. Chinese microblog posts are short, loosely expressed, and heavy with Internet slang, so this paper proposes a named entity recognition method for Chinese microblogs that combines rules with statistics. The method first uses the theme tags (hashtags) of Chinese microblogs to filter the preprocessed data, then selects suitable feature templates and performs entity recognition with a conditional random field (CRF) model. To obtain the data needed for the experiments, the paper collects microblog posts by combining a conventional web crawler with API-based collection. Experimental results show that the method effectively improves named entity recognition on Chinese microblogs.
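The two stages named in the abstract, theme-tag filtering followed by feature extraction for a CRF, can be illustrated with a minimal sketch. The abstract does not give the paper's actual filtering rules or feature templates, so everything here is an assumption: the `#...#` tag pattern, the rule "keep a post if it carries at least one theme tag", the character-window features, and all function names are hypothetical. The resulting feature dictionaries are the kind of per-character input a CRF toolkit could consume; no CRF is trained here.

```python
import re

# Assumption: theme tags are written between paired hash marks, e.g. "#世界杯#".
HASHTAG = re.compile(r"#([^#\s]+)#")


def extract_topics(post):
    """Return the theme tags found in one post (without the hash marks)."""
    return HASHTAG.findall(post)


def filter_by_topic(posts):
    """Keep only posts with at least one theme tag (assumed filtering rule)."""
    return [p for p in posts if extract_topics(p)]


def char_features(chars, i):
    """Character-window features for position i (hypothetical template:
    current, previous, and next character plus the two adjacent bigrams)."""
    prev_c = chars[i - 1] if i > 0 else "<BOS>"
    next_c = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return {
        "c0": chars[i],
        "c-1": prev_c,
        "c+1": next_c,
        "c-1c0": prev_c + chars[i],
        "c0c+1": chars[i] + next_c,
    }


def featurize(post):
    """Strip the hash marks (keeping the tag text) and build one feature
    dictionary per character."""
    text = HASHTAG.sub(r"\1", post)
    chars = list(text)
    return [char_features(chars, i) for i in range(len(chars))]
```

For example, `filter_by_topic(["#世界杯#梅西进球了", "今天天气不错"])` keeps only the first post, and `featurize` then yields one dictionary per character of that post, ready to be paired with entity labels for CRF training.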
ZHU Haodong*, YANG Lizhi, DING Wenxue, FENG Jiamei. Named entity recognition of Chinese microblog based on theme tag and CRF [J]. Journal of Central China Normal University (Natural Sciences), 2018, 52(3): 316-321.