Classifying structured documents using data mining algorithms

Submitted by reedelderc on Tue, 05/14/2013 - 11:27

Description

Creator:

Masahiro, Morodomi

Responsibility:

Morodomi Masahiro

Start Date:

2010

End Date:

2010

Date Range:

2010 April 23

Physical Description:

924.84 KB of textual records (PDF)

Notes:

Audience: Undergraduate. -- Dissertation: Thesis (B. A.). -- Algoma University, 2010. -- Submitted in fulfillment of the requirements of Computer Science 4235. -- Includes figures, tables, abbreviations, appendices, and a bibliography. -- Contents: Thesis.

Bibliographic Information

Publication:

Sault Ste. Marie, Ont.:

Standard No:

OSTMA-COSC-Masahiro-Morodomi-20100423

Physical Location

Shelf Location:

2013-064-001

Repository:

Algoma University Archive

Accession No:

Algoma University Theses collection

Container Number:

001

Conservation

Date Acquired:

Fri, 04/23/2010 - 11:18

Date Reported:

Fri, 04/23/2010 - 11:18

Historical Context:

This paper introduces a new weighting scheme for structured document classification. There are several ways to classify semantic datasets. The majority of existing text classifiers use machine learning techniques and represent text as a Bag of Words (BoW); that is, the features in document vectors represent weighted occurrence frequencies of individual words (Sahlgren, et al., 2004). Term Frequency and Inverse Document Frequency (tf-idf) (Salton, et al., 1987) is a common weighting method for BoW. However, tf-idf considers only the number of occurrences of a term across a set of documents and ignores the important information carried by semantic relationships within a document. To solve this problem, several methods have been proposed to improve text representation with external resources such as WordNet (Miller, 1995; Hotho, et al., 2003). These approaches have limitations: they need external files and cannot cover all synonyms and acronyms. To improve the efficiency and correctness of text categorization for well-structured documents such as Wikipedia articles, this paper proposes a new weighting method for structure-based documents called tfs. The method is used to create a mining model with one data mining algorithm and techniques of knowledge discovery in databases, in order to understand the relatedness of structured documents and to compare the effect of weighting schemes. Finally, the results of text categorization with the tfs weighting scheme show that the performance of the mining model changed.
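
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of Bag of Words weighting with tf-idf. It is not taken from the thesis. The function name `tf_idf`, the whitespace tokenizer, and the particular variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency) are assumptions made for illustration. The thesis's own tfs scheme is not defined in this record and is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Hypothetical helper for illustration only: a plain tf-idf variant,
    not the tfs scheme proposed in the thesis.
    """
    # Bag of Words: each document becomes an unordered list of lowercase tokens.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up sample documents, used only to exercise the function.
sample_docs = ["wiki article wiki", "article text"]
sample_weights = tf_idf(sample_docs)
```

In this variant a term that occurs in every document receives a weight of zero, and word order and document structure play no part in the weights, which is the limitation the abstract describes.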

Language:

All material is in English.

GMD:

electronic

textual record

Description Level:

File
