Document Clustering Based On Ontology and Fuzzy Approach
Subject Areas: Information and Knowledge Technology
Maryam Amiri 1, hasan khatan Lo 2
Keywords:
Abstract:
Data mining, also known as knowledge discovery in databases, is the process of discovering previously unknown knowledge in large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised grouping of similar documents, is one of the important techniques of text mining. The two most important steps in document clustering are deciding how documents are represented and how the similarity between them is measured. This research aims to improve text clustering performance by introducing a new ontological representation and a matching similarity measure. The proposed method has three components: an ontological representation of documents, a document similarity measure, and a fuzzy inference system that computes the final similarities. The clustering itself is then carried out by a bottom-up hierarchical algorithm. In the first step, each document is represented as an ontological graph built from domain knowledge. In contrast to keyword-based methods, this representation is based on domain concepts and models a document as a subgraph of the domain ontology. The concepts extracted from the document are the graph nodes, and each node is weighted by its concept frequency. The relations between the document's concepts define the graph edges, and the scope of each relation determines the edge's weight. In the second step, a new similarity measure suited to the ontological representation is presented. For each document, the main concepts, detailed concepts, and main edges are determined, and the similarity of each pair of documents is computed as three values, one for each of these factors. In the third step, a fuzzy inference system with three inputs and one output is designed. The inputs are the similarities of the main concepts, detailed concepts, and main edges of two documents, and the output is the final similarity of the two documents.
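The first three steps can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, since the abstract does not give these details: the toy ontology in `RELATIONS`, the rule that the top `k` weighted nodes and edges count as "main", the weighted Jaccard overlap used for each of the three similarities, and the zero-order Sugeno-style rule base with weights `RULE_WEIGHTS` that stands in for the fuzzy inference system.

```python
from collections import Counter
from itertools import product

# Hypothetical toy domain ontology: related concept pairs -> relation strength.
RELATIONS = {
    frozenset(("virus", "vaccine")): 0.9,
    frozenset(("virus", "infection")): 0.8,
    frozenset(("bacteria", "antibiotic")): 0.9,
    frozenset(("bacteria", "infection")): 0.8,
    frozenset(("vaccine", "immunity")): 0.7,
}
CONCEPTS = set().union(*RELATIONS)
RULE_WEIGHTS = (0.5, 0.3, 0.2)  # assumed importance of main, detailed, edge inputs


def build_graph(tokens):
    """Represent a document as a weighted subgraph of the toy ontology."""
    # Node weight = concept frequency; non-concept tokens are ignored.
    nodes = dict(Counter(t for t in tokens if t in CONCEPTS))
    # Keep an ontology relation only if both of its concepts occur in the document.
    edges = {p: w for p, w in RELATIONS.items() if all(c in nodes for c in p)}
    return nodes, edges


def split_main(weights, k=2):
    """Split weighted items into the k heaviest ("main") and the rest ("detailed")."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    main = {item: weights[item] for item in ranked[:k]}
    rest = {item: weights[item] for item in ranked[k:]}
    return main, rest


def weighted_jaccard(a, b):
    """Overlap of two weighted sets in [0, 1]; two empty sets score 0.0 here."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    shared = sum(min(a.get(key, 0), b.get(key, 0)) for key in keys)
    total = sum(max(a.get(key, 0), b.get(key, 0)) for key in keys)
    return shared / total


def three_similarities(graph_a, graph_b):
    """Return (main-concept, detailed-concept, main-edge) similarities."""
    (nodes_a, edges_a), (nodes_b, edges_b) = graph_a, graph_b
    main_a, detail_a = split_main(nodes_a)
    main_b, detail_b = split_main(nodes_b)
    main_edges_a, _ = split_main(edges_a)
    main_edges_b, _ = split_main(edges_b)
    return (
        weighted_jaccard(main_a, main_b),
        weighted_jaccard(detail_a, detail_b),
        weighted_jaccard(main_edges_a, main_edges_b),
    )


def fuzzy_similarity(main_sim, detail_sim, edge_sim):
    """Combine three similarities with eight low/high rules (assumed rule base)."""
    inputs = (main_sim, detail_sim, edge_sim)
    numerator = denominator = 0.0
    for levels in product((0, 1), repeat=3):  # 0 = "low", 1 = "high"
        # Rule firing strength: min t-norm over the three membership degrees.
        strength = min(x if lvl else 1.0 - x for x, lvl in zip(inputs, levels))
        # Crisp rule output: total weight of the inputs the rule treats as "high".
        consequent = sum(w * lvl for w, lvl in zip(RULE_WEIGHTS, levels))
        numerator += strength * consequent
        denominator += strength
    return numerator / denominator


doc_a = build_graph("virus virus vaccine vaccine infection immunity".split())
doc_b = build_graph("bacteria bacteria antibiotic infection".split())
print(fuzzy_similarity(*three_similarities(doc_a, doc_b)))
```

Computing `fuzzy_similarity` for every pair of documents would fill the final similarity matrix. A full Mamdani system with linguistic rules and defuzzification could replace the simplified rule base without changing the surrounding structure.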
In the final step, a bottom-up hierarchical clustering algorithm clusters the documents according to the final similarity matrix. For evaluation, the proposed method is compared with the Naïve Bayes method and with ontology-based algorithms. The results indicate that the proposed method improves precision, recall, F-measure, and accuracy, and produces more meaningful results.
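The bottom-up clustering step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name a linkage criterion, so average linkage is assumed, and both the function name `agglomerative` and the 4x4 similarity matrix are hypothetical.

```python
def agglomerative(sim, n_clusters):
    """Bottom-up clustering on a symmetric similarity matrix (average linkage assumed)."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Average pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the most similar pair of clusters and merge it.
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        x, y = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical final similarity matrix for four documents.
final_similarity = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.80],
    [0.20, 0.10, 0.80, 1.00],
]
print(agglomerative(final_similarity, 2))  # prints [[0, 1], [2, 3]]
```

Stopping at a fixed number of clusters is one possible cut of the hierarchy; a similarity threshold on the best merge would be an equally reasonable stopping rule.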