Fuzzy based semantic clustering of news articles

Date
2018-10
Publisher
G.B. Pant University of Agriculture and Technology, Pantnagar - 263145 (Uttarakhand)
Abstract
Text mining is a process that uses data mining approaches to extract valuable information hidden in textual data. In this paper, a framework for fuzzy clustering of news articles is proposed. These news articles originate from different news portals on the web. The data sets are fetched from two Indian news portals, The Hindu archive and the Times of India archive. Six data sets are used for implementation and evaluation: 4, 150, and 1000 news articles from Times of India, and 4, 150, and 1000 news articles from The Hindu. The fetched data is stored in a central database and then preprocessed to reduce noise. Tokenization splits the text content into separate words. Stop words are removed because they have no significance for cluster discrimination, and lemmatization is then applied. Tf-idf is calculated for each data set and saved in a word frequency vector. On these vectors, a distance or similarity measure is used to find the similarity between articles. Tf-idf with the cosine similarity measure gives the semantic similarity between articles. Because one article may belong to more than one cluster, fuzzy membership values must be generated. The articles are clustered using two algorithms, k-means clustering and fuzzy c-means clustering. Similar documents are grouped into the same cluster and dissimilar documents are put into different clusters. The proposed framework shows that fuzzy clustering does not restrict each news article to exactly one cluster. Therefore, when this framework is applied to information retrieval systems or other applications, the system gives better performance and more relevant results to users.
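
The pipeline outlined in the abstract (tokenization, stop-word removal, tf-idf, cosine similarity, fuzzy c-means) can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the four sample headlines, the stop-word list, the smoothed idf formula, the fuzzifier m = 2, and the choice of initial cluster centers are all assumptions made for illustration, and lemmatization is omitted. The vectors are L2-normalized, so the Euclidean distance used by fuzzy c-means is a monotone function of cosine similarity.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the work).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Split text into lowercase words, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one L2-normalized tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    # Smoothed idf (assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fcm_memberships(vectors, centers, m=2.0):
    """Fuzzy c-means membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    rows = []
    for x in vectors:
        # Clamp distances so a document sitting exactly on a center does not divide by zero.
        dists = [max(math.dist(x, c), 1e-12) for c in centers]
        rows.append([1.0 / sum((d_i / d_k) ** (2.0 / (m - 1.0)) for d_k in dists)
                     for d_i in dists])
    return rows

def update_centers(vectors, memberships, m=2.0):
    """Each center is the mean of all documents weighted by membership^m."""
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append([sum(w * x[k] for w, x in zip(weights, vectors)) / total
                        for k in range(len(vectors[0]))])
    return centers

def fuzzy_c_means(vectors, init_centers, m=2.0, iterations=20):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = init_centers
    for _ in range(iterations):
        memberships = fcm_memberships(vectors, centers, m)
        centers = update_centers(vectors, memberships, m)
    return fcm_memberships(vectors, centers, m), centers

# Invented sample headlines (not taken from the data sets described above).
articles = [
    "India wins the cricket match",
    "Cricket team celebrates match victory",
    "Election results announced in the state",
    "State election commission announces results",
]
vectors = tfidf_vectors(articles)
# Deterministic start: use the first and third documents as initial centers.
memberships, centers = fuzzy_c_means(vectors, [vectors[0], vectors[2]])
for text, row in zip(articles, memberships):
    print([round(u, 2) for u in row], text)
```

Each printed row holds one article's membership in the two clusters, and the values in a row sum to 1. A hard assignment, as k-means produces, would keep only the largest value in each row, whereas the fuzzy memberships let an article belong partly to several clusters.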