Document Categorization using Text Mining in Agricultural Domain

Date
2019
Publisher
ICAR-Indian Agricultural Statistics Research Institute, ICAR-Indian Agricultural Research Institute, New Delhi
Abstract
The World Wide Web (WWW) is a vast source of information, and many researchers today depend on it to a large extent to carry out their research. Most scientific journals across domains are available online. Agriculture is one research sector in which interest among researchers is growing rapidly. Because agriculture contributes as much as 17-18 per cent of GDP to the Indian economy and provides more than 60 per cent of the country's employment, this growing interest is understandable. ICAR-IARI is an institute that publishes a large number of research articles in various journals every year. Some of these articles are inter-disciplinary in nature, i.e., they combine the principles of two or more disciplines. Examples include social science research papers related to crop science and agricultural engineering research papers related to crop science. For a social science paper related to crop science, should a reader look in a crop science journal or in a social science journal? The same question arises for agricultural engineering papers related to crop science. These knowledge bases, in the form of journals, are therefore unstructured, and our aim is to categorize research documents in the agricultural domain. Text categorization using machine learning offers a way out. We collected data for this research from Prof. M. S. Swaminathan Library, ICAR-IARI, New Delhi. The data consist of the titles and abstracts of different articles in plain text (.txt) format. The collected data were unstructured and were converted into a suitable machine-readable format through pre-processing, as described by the knowledge discovery in databases (KDD) process, to smooth the path for knowledge discovery. For feature selection, we used the classifier subset evaluator and wrapper subset evaluator approaches.
To identify the best feature selection method, the experiment was repeated 100 times using 10X10 cross-validation. After feature selection, we applied several recognized text categorization algorithms to develop categorization models. The algorithms used were J48, KNN, Random Forest, Naïve Bayes, SVM, MLP, ZeroR and OneR, with ZeroR and OneR serving as baseline algorithms. To customise these algorithms for the work in this thesis, the Java and R languages were used along with the standard text mining algorithms from WEKA 3.8, together with NetBeans IDE 8.0.2 and R 3.3.2. Text categorization was attempted in three scenarios on the text data collected for the purpose. In the first scenario, it was attempted on the titles of the research documents. The hypothesis was that titles represent a document in its most relevant words, so they should be able to categorise documents with acceptable accuracy, which should be higher than the probability of that category. The Naïve Bayes algorithm achieved the highest accuracy, 78.77%, using titles only. In the second scenario, the experiment used the abstracts of all documents in the corpus. The hypothesis was that categorisation results would improve because more relevant text is included. The results showed that accuracy improved with abstracts: the highest accuracy in this scenario was 96.69%, obtained with Naïve Bayes. In the third scenario, the experiment used titles together with abstracts so as to add more relevant knowledge to the model. The highest accuracy in this scenario was 93.41%, obtained with Random Forest. To estimate average accuracy, 10X10 cross-validation was used in all scenarios with all algorithms. The performance of the models was compared statistically and the best model was selected.
The MLP algorithm with CSNB (classifier subset evaluation with Naïve Bayes as the parameter algorithm) for feature selection performed best on agricultural text documents (abstracts including titles), categorising them with 90 per cent accuracy. In future, this accuracy may be improved with more advanced techniques such as deep learning. Future work will also attempt categorisation using other combinations of text, such as abstracts with conclusions and abstracts with results.
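The evaluation protocol described in the abstract (10X10 cross-validation, with ZeroR as a baseline against a text classifier such as Naïve Bayes) can be illustrated with a minimal sketch. This is not the thesis implementation, which used WEKA, Java and R; it is a pure-Python approximation under stated assumptions. The toy corpus, the category names and every function name below are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def make_toy_corpus(seed=1, per_class=20):
    """Build a synthetic, hypothetical corpus of title-like strings."""
    rng = random.Random(seed)
    topics = {
        "crop_science": ["wheat", "rice", "yield", "genotype", "seed", "hybrid"],
        "social_science": ["farmer", "income", "survey", "extension", "adoption", "market"],
    }
    shared = ["study", "analysis", "india"]
    docs, labels = [], []
    for label, words in topics.items():
        for _ in range(per_class):
            docs.append(" ".join(rng.sample(words, 3) + [rng.choice(shared)]))
            labels.append(label)
    return docs, labels


def tokenize(text):
    return text.lower().split()


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (word counts per class)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokenize(doc):
            if tok in vocab:  # words unseen in training are ignored
                count = word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def train_zeror(docs, labels):
    """ZeroR baseline: remember only the majority class."""
    return Counter(labels).most_common(1)[0][0]


def predict_zeror(model, doc):
    return model


def stratified_folds(labels, k, rng):
    """Split indices into k folds, keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def repeated_cv(docs, labels, train, predict, k=10, repeats=10, seed=0):
    """Mean accuracy over `repeats` runs of stratified k-fold cross-validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        for test_idx in stratified_folds(labels, k, rng):
            held_out = set(test_idx)
            train_docs = [d for i, d in enumerate(docs) if i not in held_out]
            train_labels = [l for i, l in enumerate(labels) if i not in held_out]
            model = train(train_docs, train_labels)
            correct = sum(predict(model, docs[i]) == labels[i] for i in test_idx)
            accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


docs, labels = make_toy_corpus()
print("ZeroR mean accuracy:", repeated_cv(docs, labels, train_zeror, predict_zeror))
print("Naive Bayes mean accuracy:", repeated_cv(docs, labels, train_nb, predict_nb))
```

With k=10 and repeats=10 the sketch averages 100 train/test splits, which matches the "repeated 100 times" reading of 10X10 cross-validation. Comparing against ZeroR reflects the abstract's requirement that accuracy should exceed the probability of the category itself. The numbers printed for this synthetic corpus say nothing about the accuracies reported in the thesis.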
Description
T-10276