Classifying structured documents using data mining algorithms

Publication: 
Sault Ste. Marie, Ont.:
Standard No: 
OSTMA-COSC-Masahiro-Morodomi-20100423
Date Acquired: 
Fri, 04/23/2010 - 11:18
Date Reported: 
Fri, 04/23/2010 - 11:18
Creator: 

Masahiro, Morodomi

Historical Context: 

This paper introduces a new weighting scheme for structured document classification. There are several ways to classify semantic datasets. The majority of existing text classifiers use machine learning techniques and represent text as a Bag of Words (BoW); that is, the features in document vectors represent weighted occurrence frequencies of individual words (Sahlgren, et al., 2004). Term Frequency and Inverse Document Frequency (tf-idf) (Salton, et al., 1987) is a common weighting method for BoW. However, tf-idf considers only the number of occurrences of a term across a set of documents and ignores the important information carried by semantic relationships within a document. To solve this problem, several methods have been proposed that improve text representation with external resources such as WordNet (Miller, 1995; Hotho, et al., 2003). These approaches have limitations: they need external files, and they cannot cover all synonyms and acronyms. To improve the efficiency and correctness of text categorization for well-structured documents such as Wikipedia articles, this paper proposes a new weighting method for structure-based documents called tfs. It creates a mining model using one data mining algorithm and techniques of knowledge discovery in databases to understand the relatedness of structured documents, and compares the effect of weighting schemes. Finally, the results of text categorization with the tfs weighting scheme show that the performance of the mining model changed.
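
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of one common tf-idf variant (relative term frequency multiplied by the natural log of inverse document frequency). It is an illustration only and is not taken from the thesis; the function name `tf_idf` and the sample documents are hypothetical. The tfs scheme is not reproduced here because this record does not define it.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample corpus, already split into word tokens.
docs = [
    "data mining finds patterns in data".split(),
    "text mining classifies documents".split(),
    "structured documents have sections".split(),
]
weights = tf_idf(docs)
```

In this sketch a term that appears in every document receives weight zero, and word order and document structure play no role in the weights, which is the limitation the abstract points to.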

Responsibility: 
Morodomi Masahiro
Start Date: 
2010
Description Level: 
End Date: 
2010
Date Range: 
2010 April 23
Physical Description: 

924.84 KB of textual records (PDF)

Notes: 

Audience: Undergraduate. -- Dissertation: Thesis (B. A.). -- Algoma University, 2010. -- Submitted in fulfillment of the requirements of Computer Science 4235. -- Includes figures, tables, abbreviations, appendices, and a bibliography. -- Contents: Thesis.

rec_shelfloc: 
2013-064-001
Repository: 
Algoma University Archive
Container Number: 
001