Automatic Categorization of Web Sites


General Information

Download Automatic Categorization of Web Sites as PDF (787 KB).

Download Automatic Categorization of Web Sites Poster as PDF.

Download Automatic Categorization of Web Sites Presentation as PDF.


Title: Automatic Categorization of Web Sites.
Author(s): Lida Zhu.
Supervisor(s): Morten Goodwin, Agata Sawicka and Mikael Snaprud.
Published date: June 2008.
Published at: ICT, University of Agder, Grimstad, Norway, 2008.

Abstract


This thesis presents a solution for classifying web sites by geographical
attribute codes (NUTS) and economic activity attribute codes (NACE).

We propose a solution for web site classification with high accuracy. We use
keyword-based document classification methods, which have shown good
performance. After classification, each document is assigned a class label from
a set of predefined categories, based on a pool of pre-classified sample
documents.
Our solution consists of three steps. First, stop words are removed and HTML
tags are skipped, so that informative terms are kept and non-informative or
redundant terms are dropped, which improves classification accuracy. Second,
mutual information is used for feature selection, to reduce the dimensionality
of the feature space and produce vectors for classification. Finally, the Naïve
Bayes and Decision Tree algorithms perform the classification, and their
performance is compared.
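
The steps described above can be illustrated with a minimal sketch. It is not
the thesis implementation: the stop-word list, the toy pages, the category
labels ("fishing", "finance") and all function names are assumptions made for
illustration, and only the Naïve Bayes classifier is shown (the Decision Tree
is omitted). The sketch strips HTML tags and stop words, ranks terms by the
mutual information between term presence and class label, and trains a
multinomial Naïve Bayes model with add-one smoothing on the selected terms.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for", "we", "our"}

def tokenize(html):
    """Strip HTML tags, lowercase the text, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def mutual_information(docs, labels):
    """Score each term by the mutual information (in bits) between
    term presence/absence in a document and the document's class."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    class_counts = Counter(labels)
    scores = {}
    for term in vocab:
        score = 0.0
        for present in (True, False):
            n_t = sum(1 for s in doc_sets if (term in s) == present)
            for c, n_c in class_counts.items():
                n_tc = sum(1 for s, y in zip(doc_sets, labels)
                           if (term in s) == present and y == c)
                if n_tc:  # empty cells contribute nothing to the sum
                    score += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
        scores[term] = score
    return scores

def select_features(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = mutual_information(docs, labels)
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:k])

def train_naive_bayes(docs, labels, features):
    """Multinomial Naive Bayes with add-one smoothing over the selected terms."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        term_counts[y].update(t for t in doc if t in features)
    model = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + len(features)
        log_prior = math.log(class_docs[c] / len(docs))
        log_likelihood = {t: math.log((term_counts[c][t] + 1) / total)
                          for t in features}
        model[c] = (log_prior, log_likelihood)
    return model

def classify(model, doc):
    """Return the class with the highest log posterior for a tokenized document."""
    def score(c):
        log_prior, log_likelihood = model[c]
        return log_prior + sum(log_likelihood[t] for t in doc if t in log_likelihood)
    return max(model, key=score)

# Hypothetical pre-classified sample pages; the labels are not real NACE codes.
pages = [
    "<html><body><h1>Fresh fish</h1><p>We sell salmon and cod from our boat.</p></body></html>",
    "<p>Salmon farming and fish export</p>",
    "<p>Bank loans and savings accounts</p>",
    "<div>Our bank offers loans for the home</div>",
]
labels = ["fishing", "fishing", "finance", "finance"]

docs = [tokenize(p) for p in pages]
features = select_features(docs, labels, k=4)
model = train_naive_bayes(docs, labels, features)
print(classify(model, tokenize("<p>Cod and salmon fish market</p>")))
```

In this sketch the feature selection step keeps only the terms that separate
the two toy categories, so the classifier ignores words such as "market" that
carry no class information in the sample pool.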

The system performed well in the experiments. It classifies web sites into NACE
categories with a maximum accuracy of 97%, measured on 46 web pages, while NUTS
classification reaches a best accuracy of 93%, measured on 223 web pages.

The author of this document is:
Morten Goodwin
E-mail address is:
morten.goodwin [at] tingtun.no
Phone is:
+47 95 24 86 79
