Report on strategies for collecting URLs

General Information


Title: Report on strategies for collecting URLs.
Deliverable: D6.6.1.1.2-4.
Author(s): Morten Goodwin, Nils Ulltveit-Moe and Mikael Snaprud.
Published date: May 2008.
Published at: European Internet Accessibility Observatory 2008

The EIAO project is co-funded by the European Commission, under the IST contract 2003-004526-STREP.

Abstract


This document outlines strategies for NACE and NUTS categorisation. We present both manual and automatic approaches, including suggestions based on machine learning.
Our experiments indicate that the most reliable approach for NACE categorisation is classification based on term frequencies. A small proof-of-concept implementation achieves an accuracy of 100% for NACE categorisation. A disadvantage of the current implementation is that the algorithm needs to be trained separately for each language.
In contrast, for NUTS categorisation, our tests indicate that manual classification is still the most efficient approach.
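
To make the idea of classification based on term frequencies more concrete, the following is a minimal sketch in Python. It is not the EIAO proof-of-concept implementation, whose details are not given here. It assumes a simple scheme in which each category is represented by the summed term counts of its training texts and a new text is assigned to the category with the highest cosine similarity. The function names, the category labels ("finance", "education") and the training texts are all made up for illustration and are not real NACE codes or project data.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lower-case the text, split it into word tokens and count each term."""
    # Assumption: ASCII-only tokenisation, which only suits English text.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labelled_texts):
    """Sum term frequencies per category to form one profile per label."""
    profiles = {}
    for label, text in labelled_texts:
        profiles.setdefault(label, Counter()).update(term_frequencies(text))
    return profiles


def classify(profiles, text):
    """Return the label whose profile is most similar to the text."""
    tf = term_frequencies(text)
    return max(profiles, key=lambda label: cosine_similarity(tf, profiles[label]))


# Toy, made-up training data; the labels are placeholders, not NACE codes.
TRAINING = [
    ("finance", "bank loans savings accounts interest rates bank"),
    ("education", "school students teachers courses university school"),
]

profiles = train(TRAINING)
label = classify(profiles, "Our bank offers savings accounts and loans")
```

Because the category profiles consist of the words seen in the training texts, a classifier of this kind has to be trained again with texts in each new language, which matches the disadvantage noted above.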

The author of this document is:
Morten Goodwin
E-mail address is:
morten.goodwin [at] uia.no
Phone is:
+47 95 24 86 79