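Chen Xin, Zhang Jing, Li Xiaoguang, Zhuo Li. A Text Classification Method for Chinese Pornographic Web Recognition [J]. Measurement & Control Technology (测控技术), 2011, 30(5): 27-31 |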
一种面向中文敏感网页识别的文本分类方法 |
A Text Classification Method for Chinese Pornographic Web Recognition |
|
DOI: |
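Chinese keywords: Chinese sensitive web page recognition; new word identification; stop-word list construction; CHI statistic; Naive Bayes classifier |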
English keywords: Chinese pornographic web recognition; new word identification; stop-word list; CHI-square; Naive Bayes classifier |
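Funding: National Natural Science Foundation of China (60772069, 61003289); 863 Program (2009AA12Z111); Beijing Natural Science Foundation (4102008); Excellent-category funding for science and technology activities of returned overseas scholars, Ministry of Human Resources and Social Security; Scientific Research Foundation for Returned Overseas Scholars, Ministry of Education |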
|
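Chinese abstract (translated): |
A text classification method for recognizing sensitive Chinese web pages is proposed. It comprises four parts: Chinese word segmentation, stop-word list construction, feature selection, and the classifier. To enrich the segmentation lexicon, a new-word identification algorithm is proposed that relies mainly on word-frequency statistics, is supplemented by manual judgment, and tags each new word with its part of speech. An algorithm for building a stop-word list is also proposed, and a list of 300 stop words is built with it. The chi-square goodness-of-fit test statistic (CHI) is used for feature selection, and a 400-dimensional feature lexicon is determined. Based on the characteristics of CHI feature selection and of the Naive Bayes classifier, the classifier is improved by adding two influencing factors: the ratio of the number of feature terms contained in the page text to the dimension of the feature set, and the ratio of the number of feature terms to the number of words in the text. Because different groups of people differ considerably in their subjective understanding of what is sensitive, the sensitivity value of the page is used as the classifier output. Experimental results show that the proposed method achieves better recognition performance than existing text classification methods. |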
English abstract: |
A text classification method for Chinese pornographic web page recognition is proposed. It consists of four key components: automatic Chinese word segmentation, stop-word list construction, feature selection, and text classification. To enrich the dictionary of the Chinese word segmentation system, a new-word identification algorithm is proposed that is based mainly on word-frequency statistics and supplemented by manual judgment and part-of-speech tagging. Using the proposed stop-word selection method, a stop-word list containing 300 stop words is established. The CHI-square method is then used to determine a 400-dimensional feature set. In addition, based on an analysis of the Naive Bayes classifier and the CHI-square method, two influencing factors are added to the classifier: the ratio of the number of features contained in the page to the number of selected features, and the ratio of the number of features contained in the page to the number of unique words in the page. Because people's subjective understanding of what is pornographic differs considerably, the pornographic value of a web page is used as the output of the classifier. The experimental results show that the proposed method achieves better classification performance than existing text classification methods. |
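The abstract names CHI-square feature selection and two ratio-based influencing factors but gives no formulas. The Python sketch below is one plausible reading, not the authors' implementation. It assumes the standard 2x2 document-count form of the CHI statistic, and it reads "words in the text" as unique words, following the English abstract. The function names `chi_square`, `select_features`, and `influence_ratios` are hypothetical. How the paper combines the two ratios with the Naive Bayes score is not stated in the abstract, so that step is left out.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard CHI statistic for one term and one class (assumed form).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, target, k):
    """Rank terms by CHI score against class `target` and keep the top k."""
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Document frequency: count each term once per document.
        (pos_df if y == target else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def influence_ratios(tokens, features):
    """Compute the two ratios named in the abstract for one segmented page.

    Returns (feature terms in page / feature-set dimension,
             feature terms in page / unique words in page).
    """
    feature_set = set(features)
    unique_words = set(tokens)
    hits = unique_words & feature_set
    coverage = len(hits) / len(feature_set) if feature_set else 0.0
    density = len(hits) / len(unique_words) if unique_words else 0.0
    return coverage, density


if __name__ == "__main__":
    # Toy, already-segmented documents; 1 marks the sensitive class.
    docs = [["a", "b"], ["a", "c"], ["d", "e"], ["d", "c"]]
    labels = [1, 1, 0, 0]
    features = select_features(docs, labels, target=1, k=2)
    print(features, influence_ratios(["a", "x", "a", "d"], features))
```

With the paper's settings, `k` would be 400 and the input tokens would come from the segmenter after the 300 stop words are removed. Note that the CHI statistic is two-sided, so terms strongly associated with the non-sensitive class also rank highly.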