Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework

Lim Wern Han and Saadat M. Alhashmi

Monash University, Kuala Lumpur, Malaysia

Copyright © 2010 Lim Wern Han and Saadat M. Alhashmi. This is an open access article distributed under the Creative Commons Attribution License unported 3.0, which permits unrestricted use, distribution, and reproduction in any medium, provided that the original work is properly cited.

Abstract

With the growing number of web pages on the internet, obtaining information accurately, at a reasonable cost and with acceptable performance, has become a major concern. One potential solution is to classify web pages into meaningful categories. Effective web page classification benefits applications such as web mining and search engines. Unlike plain text documents, however, the nature of web pages limits the performance of traditional pure-text classification methods that are otherwise successful. Noise exists in the form of HTML tags, multimedia content, dynamic content and the network structure of web pages, which calls for a closer look at effective feature selection for web pages. Often, these features are filtered out and classification relies only on the displayed text of the web page. This paper proposes a framework in which web page features are taken into account during classification, because each feature may hold valuable information. Specifically, the paper explores the potential of the Uniform Resource Locator (URL), the web page title and the metadata as sources of information for classification into categories defined by users. The framework then explores suitable machine learning algorithms for classifying each web feature individually. The individual results are then combined by weighted voting to obtain the classification of the web page. This approach showed improvements over pure-text as well as virtual-webpage classification approaches.

Keywords: web page classification, feature selection, machine learning
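
The weighted voting step described in the abstract can be illustrated with a minimal sketch. It assumes one common form of weighted voting, in which each feature's classifier casts a vote for a category and the vote counts with that feature's weight. The abstract does not state how the weights are obtained or how ties are resolved, so the feature names, weights, categories and the function weighted_vote below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-feature category predictions by weighted voting.

    predictions: maps a feature name (e.g. "url") to its predicted category.
    weights: maps a feature name to its voting weight (assumed values).
    """
    scores = defaultdict(float)
    for feature, category in predictions.items():
        # A feature without a weight contributes nothing to the vote.
        scores[category] += weights.get(feature, 0.0)
    # Ties are broken alphabetically; this rule is an assumption.
    return max(sorted(scores), key=lambda category: scores[category])


# Hypothetical per-feature predictions for a single web page.
predictions = {"url": "sports", "title": "news", "metadata": "sports"}
# Hypothetical weights, e.g. reflecting how reliable each feature is.
weights = {"url": 0.3, "title": 0.5, "metadata": 0.4}

result = weighted_vote(predictions, weights)
```

In this example the URL and metadata classifiers together outweigh the title classifier, so the page is assigned the category they agree on even though the title carries the largest single weight.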