Towards Automated eGovernment Monitoring

Recent Updates:
New Scientific Paper:
Towards a score function for WCAG 2.0 benchmarking 2012
New Blog Post:
United Nations Global E-government survey 2012 2012-03-01
Ph.D. Thesis:
Towards Automated eGovernment Monitoring

Paper J - An automatic approach for finding eGovernment services.

Author: Morten Goodwin

This chapter was originally published as scientific conference or journal paper. Please see the original source for the complete paper.1

  

Original paper author: Morten Goodwin

Morten Goodwin is with Tingtun AS, Kirkekleiva 1, 4790 Lillesand, Norway, morten.goodwin@tingtun.no.

Abstract

This paper considers the task of locating services and information in government web sites. In large scale eGovernment surveys this is a manual and resource intensive task.

To introduce automation in eGovernment assessment, this paper presents the lost sheep, a classification based algorithm to automatically locate targeted government material online. It is designed to work with minimum human interaction and use available resources efficiently. It centers around classifying link texts to determine if a web page should be downloaded for further analysis.

To show the functionality in practice, a reference implementation of the lost sheep is used to assess online government transparency. The algorithm provides information on whether transparency material exists online, and gives an indication of whether the material could be found by real users.

The lost sheep is shown to outperform all related algorithm both by achieving higher accuracy and downloaded fewer web pages.

Keywords: eGovernment Surveys; Web Page Classification; Web Crawling.

Introduction

The majority of surveys covering online government (eGovernment) are carried out manually by assessing web sites from a citizen perspective. Much focus in these surveys is on determining whether explicit services or information, useful for the citizens, are available at national or local government web sites. Some surveys go further and include tests carried out by experts on the quality of the services, while others aim at measuring the user perspective by investigating to what extent the online material is findable and usable by representative users. A few surveys are accompanied by results from automatic validity testing on for example link validity, web site performance or automatically detectable accessibility barriers. However, little work exists on supporting surveys with algorithms to automatically find information on web site level. Such algorithms could provide results either to be used directly as survey results, as results to be verified by experts or as input on how to prioritize manual testing.

A common approach in eGovernment surveys is to use experts to check if a specific service or information is available at national and local government web sites. Manual expert testing rely on human judgement which makes it resource intensive and costly. Further since the human judgement can be influenced by many factors it is often times biased. Thus, tests are often for verification purposes repeated several times. In practice such surveys depend upon significant resources which constraints eGovernment surveys to focus on a small set of sites. [1],[2],[3] Similarly, it limits who can carry out large scale eGovernment measurements to big organisations, such as United Nations and Capgemini. [4],[5],[6],[7]

To support eGovernment surveys, algorithms could be used to determine whether sought after information exists online and provide the location of this information. However, automatic information assessment in large dynamic governmental web sites is particularly challenging as traditional crawler techniques fall short. Since it is not known in advance where on the web site the potential material is located, a huge amount web pages need to be downloaded and indexed. Doing this for many government web site is not feasible as it would require a significant amount of capacity from the survey initiators.[8] Further, government web sites are in practice a compound of services bought from different vendors, often made available under different domain names. This absence of a formal web site scope makes crawling of the web site challenging, as there are no existing methods to target or limit the crawler.

In addition to assessing whether material exists online, many surveys focus on to what extent the information can found by actual users.[3] This requires controlled experiments with representative users who are for example asked to carry out a task within a time frame. Experts can not act as representative users because they typically have more knowledge on the topic than what can be considered ideal. Further, the representativeness of any selected users can be questioned as users quickly become familiar with the topic they are assessing and can no longer be representative. This means that even users that are specifically chosen the be representative will with some experience no longer provide accurate results, and could for example find information too quickly compared to real users.

Automatic algorithms can assist and sometimes substitute user testing. This is done in web usage mining, but requires access either to a server side web log or data collected by 3rd part applications installed on the web site.[9] This means that web usage mining techniques cannot be applied without administrative access to the server. However, some approaches exists that focus on the usability of web sites by only observing the web site. For example, when it comes to accessibility testing about 20% of the tests can be carried out by automatically analysing the web sites.[10]

The above scenarios have the following common characteristics. They could potentially be assisted by learning algorithms to assess web sites. Such algorithms provide results either as it is, as input to be verified by experts or as input on which tests should be prioritized in manual testing. This should allow for more organisations to test, more accountability of the tasks and more surveys carried out in larger scale. However, existing algorithms are neither explored in the area of survey assessment nor developed with consideration that the survey initiations have limited survey and download capacity compared to organisations that specialise in web crawling such as Google.

Shortcomings of existing methods.

Crawling and classification have been extensively studied. However, applying automatic crawling and/or classification algorithms to assist benchmarking of eGovernment is also relatively open. To the best of our knowledge, there is no work explicitly trying to detect material in online governmental web site to be used as input to surveys. Further, since benchmarking of eGovernment often comes down to finding the web page where the sought after material potentially exists, the approach can be defined as finding single pages in large web sites when there is not enough capacity to crawl and index the entire web site. This is also an area which is relatively unexplored in real web sites.

In practice, a straight forward approach when carrying out a surveys assessing web sites is to use a third party search engine such as Google. Such search engines have often crawled and indexed large part of the web sites to be analysed, which means the testers can get access to the web site without having resources available to download the entire web site. However, using a third party search engine is not an automatic approach, and the testers still need to spend extensive amount of time specifying queries and examining the search results. Further, this approach relies upon using a tool which is out of control of the testers, which is problematic for many organisation. For example, to minimize unintended dissemination of sensitive information, large search engines are often restricted to crawl and index part of governmental web sites.[11] This means that much material will not be found using a third party search engine.

A more automatic approach could be to use a focused crawler. Focused crawlers are resource aware and center around downloading as much as possible of web pages in relevant topics and still minimizing the number of irrelevant pages downloaded. However, the focused crawlers are not developed to find one specific web page and therefore do not work well for finding targeted eGovernment material. Moreover, the crawlers depend on formally scoped web sites, and do not take into account the partly disjoint web sites which exists in public governmental web sites.

Another automatic approach is using web page classification algorithms. However, web page classification is often focused on classification of web pages into categories, not locating targeted online information and is therefore not directly applicable either. Further, only a few web page classification approaches consider the capacity constraint needed for crawling large web sites.[12],[13] Similarly, web structure mining techniques which examines the web site graph structure depend on close to perfect knowledge of the web site to be examined, which requires a large downloaded section of the web site.[14]

Additionally, web usage mining techniques can be used to analyse how web sites are used in practice, but requires access to web server log which is not available for outsiders. This means that the standard web usage mining techniques cannot be directly applied to be used to support eGovernment surveys.[9]

Closest to our work are recent published solution to find single web pages in large web sites. This includes search algorithms applied for large network environments and the exact match on rare item searches.[13],[12] This is work on finding an unobservable web page in large web sites, equivalent to checking if a web site has a web page with specific relevant material. Unfortunately, neither of these approaches have been sufficiently applied in real web site settings. The [13] only used simulated web site environment, while the [12] only showed a simple example from a real environment, but mostly focused on simulated environments. In fact the authors applied the same algorithm in different simulated environments which yielded significantly different results. This suggest that simulated environment is not sufficient to test the algorithms.

Objective.

The goal of this research is to facilitate surveys assessing eGovernment web sites by introducing automation. To accomplish this an algorithm is developed to automatically locate a single unobservable web page with specification information or a service in a large governmental web site. An unobservable web page is a page not available to the algorithm prior to the search, and therefore the algorithm cannot without supervision know if it has found the correct page.

The objective is to have let the testers use an algorithm in their assessment of web sites. First, the tester wants assess many web sites to know which sites has the information he/she seeks, and location of the web page if it exists. The tester first manually classifies some training data and organize into the different material intended to be targeted. Additionally, the tester provides starting points to eGovernment sites, typically web site URLs.

At the testing phase, the algorithm crawls the web sites and, based on the training data, automatically detects whether material is available online and provides the URL. These results can either be used as is, or used as input for a manually verification.

Subsequently, using the automatically detected web pages, the number of clicks needed reach the information can be calculated. The results gives an indication whether the information is findable for real users.

The algorithm must satisfy:

Our contributions are as following. We present an algorithm to automatically detect web pages in governmental web sites which :

Organization of the paper

This paper is organised as following. Section detectservices:sec:related presents related work on eGovernment surveys, web-mining, classification and crawling. Section detectservices:sec:problem formally defines the problem. Section detectservices:sec:proposedsolution presents the proposed algorithm to assist survey assessment called the lost sheep, and presents application areas where the algorithm is applicable. Section detectservices:sec:experimentalevaluation presents an experimental evaluation of the algorithm in the application areas. Section detectservices:sec:conclusion presents the conclusion and further work. Finally, detectservices:app:metrics presents an analogy of metrics to evaluate the algorithm.

Related work

 

This section presents related work on web-mining, with special focus on surveys and automatic methods that can potentially by used for eGovernment surveys such as crawling and classification.

eGovernment Surveys

Many surveys exists on various aspects of public government web sites. Only a few are large scale international surveys, of which the most well known regularly repeated surveys are carried by Accenture, Brown University, the United Nations and Capgemini. The complete methodologies for these surveys are not publicly available, and it is not known whether automatic or semi-automatic algorithms have been presented to support any of these studies. However, there is no indication in the literature that automation is part of the assessment.

Even though Accenture takes the survey a step further and presents qualitative measurements based interviews[6],[5], these surveys center around quantitative information based on observing if some predefined information exists online. Other eGovernment studies focus on if the information is findable by real users.[3]

Some smaller scales one-off studies on eGovernment also exists. These either focus on one specific area and/or analyse very few web sites.[2] Local governments are only rarely part of such studies.

Accenture, Brown and the United Nations are international studies assessing national web sites, while Capgemini acknowledges that national eGovernment measurements have limited impact in Europe due to the advanced state of European web sites.[5] Capgemini claims that the future of eGovernment benchmarking should be focused on areas that have significant improvement potential, which in Europe means local governments specially in non-urban areas. [5]

If more surveys will be extended to assess local government web sites, it introduces capacity problems as there are significantly more local governmental web sites compared to national. Manual assessment simply does not scale to test all local government web sites even in large organisations such as the United Nations. As a mitigation, a representative selection of local government web sites could be assessed, which was done in [2]. Another approach could be to introduce automatic testing.

Note that several approaches exists on automatizing the processing of survey from a government perspective, i.e. what people input and submit to governments. [15] These focus on the important area of making governments more efficient, not on measuring government, and are therefore out of scope of this paper.

Using link based information retrieval in web mining

Many approaches exists that use link information to improve the web mining tasks. This is not surprising as it was shown in [16] that, except for a few deviations, the link anchor texts are topically related the pages they links to. This strongly implies that anchor texts could be useful for classification and focused crawling.

Others show that not online the links but pages that link together are topically related. Of the features available at parent pages (pages that link to the page to be classified), most conclude that the classification result of the parent pages itself is the most discriminatory feature.[17],[18]

However, using anchor texts as a pre-classifier in order to minimize the number of downloaded pages when finding a single page is a nearly untouched research area. However closely related approach have been shown to be useful in focused crawling (see section detectservices:sec:focusedcrawling) and in web page classification (see section detectservices:sec:webpageclassification).

Web Page Classification

 

Web page classification has been extensively studied. This section presents the most relevant approaches for eGovernment web site assessment.

Exact match on rare item searches (EMRIS)
 

The exact match on rare item searches (EMRIS) is a problem introduced in [13] on finding a single node in large directed graphs, such as finding a web pages in a large web site. The problem was proposed both because the objectivity of existing classification approaches on large sets of web page were questioned and because the problem is particularly challenging. The problem is further defined in [12] as finding a target page $p_t$ in a directed graph $S$ and named EMRIS.

Typically, classifiers are tested using a known corpus, such as a manually classified data set. By selecting some items to train the algorithms and other for verification, the accuracy and precision of algorithms could automatically be calculated. However, to verify a classifier working with large sets of web pages, it is impractical to first manually classify all pages to perform controlled experiments. The commonly used approach is therefore to first let the classifier run with a relatively small set of training data and then manually verify the classification results. Unfortunately, certifying the results is not a deterministic approach. People tend to, sometimes subconsciously, give an unfair advantage to their own algorithms.

With EMRIS, there is only one relevant page and this can easily be defined prior to conducting experiments. This eliminates the need for manual verification, and the objectivity of the defined algorithms can longer be questioned, even when working with large sets of web pages.

Three algorithms to solve the EMRIS are introduced in [13]: similarity based search (SIM), degree based search (DEG) and Similarity*Degree (SIMDEG).

Similarity based search (SIM) 

The Similarity Based Search (SIM) is introduced in [13] for the EMRIS. The SIM is a greedy search algorithm to find a target page $p_t$ from a starting page $p_s$ in a directed graph $S$. It compares a provided input query $q$ to the content of neighbouring pages, and follows the page which is most similar to $q$. When visiting a page $p_i$, the SIM downloads and calculates a similarity score between $q$ every page linked from $p_i$, and follows the page with the largest similarity score.

For SIM to work, the algorithm needs to be pulled towards the target pages $p_t$. This requires not only that $p_t$ has the highest similarity score, but pages between the starting page $p_s$ and target page $p_t$ have a higher similarity scores than the remaining notes.

Literature shows that parent pages (pages linking to $p_t$) are similar to $p_t$ and will therefore have a higher similarity score. However, literature also shows that grandparent pages (pages linking to parent pages of $p_t$) are not similar to $p_t$ and will therefore not have a high similarity score. [17],[18]

Following this reasoning, it could be questions if the SIM works for large graphs. Until the SIM reaches one of the parent pages, the pages will not be similar to $p_t$ and the similarity score will not help SIM towards $p_t$. Therefore, when SIM examines a page far away from $p_t$, which is expected in the beginning of a crawl, it will walk at random. This is a problem not addressed in [13].

Degree based search (DEG)

The Degree Based Search (DEG) is introduced in [13] for the EMRIS. The DEG approach is similar to the SIM, but instead of using the similarity score between $q$ and the pages, the DEG pursues the neighbouring page with the largest number of edges. Thus, the algorithm relies upon that $p_t$ is more likely to be linked from a page with many edges compared to a page with few.

Similarity*Degree Search (SIMDEG)

The Similarity*Degree Search \\(SIMDEG) combines SIM and DEG, and is introduced in [13] to solve EMRIS.

To utilize the advantages of both the SIM and the DEG they are combined as the SIMDEG. For each neighbouring page, the SIMDEG calculates the similarity score as well as a frequency of edges. The algorithm pursues the page with the highest similarity score times frequency of edges.

Thus when SIMDEG is far away from $p_t$ the edge frequency will have a larger impact on guiding the algorithm than the similarity score. However, when the algorithm detects a page $p_i$ with an edge towards $p_t$, the similarity score of $p_i$ will be larger than the frequency of edges. In this situation the SIMDEG will go to $p_i$ which is close to $p_t$.

It is empirically shown in [13] that SIMDEG works better than both SIM and DEG. However, [12] disproves this by showing that SIMDEG in most cases does not perform better than SIM. This indicates that the degree based search, which heavily influence the algorithm at the beginning of the crawl, does not guide the algorithm well towards the target page $p_t$. The discrepancies in the results may be because in [13] the algorithms are applied in different network structures, such as clustered, random and small world, which may not represent the real web, while in [12] uses an empirically verified web site model from [19].

Lost Sheep (LS)

A colony inspired classification algorithm to solve the EMRIS called the lost sheep (LS) is introduced in [12]2.

The algorithm works with a central control called herder and multiple agents called sheep. In contrast to colony optimization, it is not the common trend, such as the most commonly chosen path, which is the chosen solution. It is rather one sheep, the LS, that finds the target page.

Each sheep is a fully functional classifier run as a separate agent using a specific hidden Markov model based classifier called Learning Automata (LA). To fuel the convergence, the state transitions in the LAs are non-linear.

Initially, the herder releases the sheep at a start web page $p_s$, and the sheep sequentially classifies anchor texts and the corresponding pages. Each sheep works so that it uses the classification result from the anchor text to determine if the corresponding page should be downloaded and classified. The sheep return to the herder with the classification result and precision. Based on these result, the herder releases the sheep at the page with best classification result. This continues until one sheep has found what is believed to be the target page $p_t$.

This way algorithm has a built in priority mechanism so that it prioritizes links to pages that is believed to lead to valuable results. This means that it has two targets: Maximize the accuracy of the classification and at the same time minimize the number of downloaded pages.

The LS does not take into account that there exists many different classifiers that have already been proven to work well for web page classification.[20] Which classifier to use should depend on the data at hand. This is specifically important for the LS at it is not itself a classifier but method for improving classification. However, the LS only relies upon a specific HMM classifier.

In addition, the LS has a bias that the web page size influences the classification results in an unwanted way. Since the state movements are influenced by the number of words detected, the LAs will reach a higher state if a web page has many words, which directly influences the confidence.

Further, the LS is shown to improve classifications in a synthetic environment with only a simple example for real live web classification. It has not been shown whether it work for live web sites.

Metrics for EMRIS

[12] shows that due the special properties in the EMRIS, commonly used classification metrics are not applicable. Since the EMRIS problem is designed so it is only one target page, precision and recall are identical and can be expressed only by using the number of pages in the web site graph and if a true positive classification result exists. The metrics that describe the performance of an algorithm applied in EMRIS for one web site are:

To show evaluation metrics for multiple web sites simple arithmetic averages are sufficient. This analysis is cleaned up and presented in detectservices:app:metrics.

Focused crawling

 

A focused crawler aims at downloading web pages of a specific predefined topic. The objective is to harvest as much as possible of relevant information avoiding download of irrelevant pages, maximizing the harvest rate and minimizing the total recall. There are some clear similarities between the objective of focused crawlers and EMRIS, namely to find a subset of pages in a web site based on relevance. Many focused crawlers exists, this section presents those that are most relevant.

Fish and Shark search
 

The fish search is an early heuristic search on web sites utilizing that web pages often link to related web pages.[21] The fish search is extended as the more aggressive version shark search.[22]

The fish search uses a fish food representation to express how well a page matches a query. The fish food is calculated using regular expressions to categorize pages into three discrete categories: 1, 0.5, and 0. It also uses pollution that kill the fish as a representation for low bandwidth and non-existing pages. The crawler automatically prioritises relevant web pages, but the discrete approach means that many pages receive the same priority in short time which results in arbitrary pruning.

The shark search extends the fish search by introducing a similarity between a query $q$ and page $p_i$, $sim(p_i,q)$. The $sim(p_i,q)$ denotes a value between 0 and 1, where 1 means perfect conceptual match. It further introduces a fuzzy inherited relevance score, so that the score of a page influences the score of pages it links to, thus preferring children of pages with relevant score. It also improves the fish search by introducing priority of a web pages to visit based on available anchor texts and the text close to the link.

Even though it clearly can be used for live search of a web sites, the reference implementation of the shark search is an application for gathering relevant information on a web site for a user to browse the site more easily. Furthermore, the fish and shark search separates itself significantly from the work in [13] and [12] as the shark search is not designed to stop when the ``most relevant information'' has been found.

Ant focused crawling

The ant focused crawler is based on colony optimization and works without central control.[23] The algorithm releases ants on a web site and where ant follows the edges based on a balance of potential relevance and pheromone trails. The ants release pheromone trails on edges it visits, which means the ants will, as a colony, converge towards the pages with most pheromone trails.

The potential relevance is calculated as a combination of the inherited relevance of a page, relevance of the content and relevance of the anchor text(s) pointing to the page. This means there is no prioritising of which pages should be downloaded based on the anchor text.

In contrast to most other crawlers, the ant based focused crawler takes repetitions into account. This means that it becomes iteratively better when run towards the same site multiple times, even without supervision. Another innovative part is that it introduces indirect communication between ants, which enables a crawling process by many ants without central control. The authors claim that this saves memory and reduces hardware requirement.

It is shown in [23] that the ant based crawler is able to download larger percentage of relevant pages than both shark search and the best fit focused crawler.

Problem setting

 

This section presents a formal definition of the problem setting and the problem to be solved. It further presents straight forward solutions and why they are not applicable.

Problem definition

This section first presents the scenario and then define the problem.

Scenario

In eGovernment surveys, web sites are most often assessed by observation, so that a tester in a systematic manner observes the web site to check whether sought after material exists. This paper aims at introducing automation so that a tester instead runs an algorithm towards the web sites scheduled for evaluation. The algorithm informs both if the sought after information exists and the location, thus significantly reducing the manual resources needed to perform manual web site assessment.

The web site to be assess is defined as following. Let S is a directed labelled hyper-graph S( extbf{P, E, L)}, where the vertices $p \in P$, edges $e \in E$ and labels $l \in L$, corresponding to the web pages, links and link/anchor texts in web site S. An edge is a directed connection between two pages, from page $p_i$ to $p_j$ noted as $e_{i,j}$. Labels are associated with edges so that a label $l_{i,j}$ on edge $e_{i,j}$ is observable from $p_i$ without visiting $p_j$. In contrast, neither $e_{i,j}$ nor $l_{i,j}$ are observable from $p_j$. This corresponds to actual web sites.

Table detectservices:table:comparison summarizes the notations.

[ht]

List of notations

 

Notation Real web sitesWeb site graph model
$S$Web Site $S$ Directed graph $S$
$p_i$ Web page $i$Node $i$
$p_t$ Web page with target material Target node
$e_{i,j}$ Link page $p_i$ to $p_j$Edge from page $p_i$ to $p_j$
$l_{i,j}$ Link/anchor text for link $e_{i,j}$Label connected to edge $e_{i,j}$
Definition

The problem is finding the target page $p_t$ in S under the following constraints. Let $p_t$ be an unobservable fixed target page in a directed graph S. Let $tp(p_i,p_t)$ denote $1$ if $p_i=p_t$ and $0$ otherwise, and let $visit(p_i)$ denote $1$ if $p_i$ has been visited and $0$ otherwise. Further, let $Best(S)$ denote the page among the observed pages that best fit the target page so that $Best(S) = {\arg\max}_{\forall p_i \in P\textrm{ }if\textrm{ } visit(p_i)}\{E(tp(p_i,p_t))\}$, where $E(tp(p_i,p_t))$ is the expected result of $tp(p_i,p_t)$.

The problem becomes:

\begin{eqnarray}

Maximize : tp(Best(S),p_t)\nonumber

Minimize & : & \sum_{\forall p_i \in P} {visit(p_i)}\nonumber

\end{eqnarray}

I.e. select a page so that it is expected to be the target page while minimizing the number of downloaded pages.

Since $p_t$ is unobservable it means that it is not known if a page $p_i$ is $p_t$, even if $p_i$ is downloaded. This logically denotes that $tp(p_i,p_t)$ cannot be known, and any algorithm to be applied need to rely on finding a function that estimates whether $p_i$ is $p_t$. Thus, for the problem to be solvable, there needs to be a function to estimate the expected value of $tp(p_i,p_t)$: $E(tp(p_i,p_t))$. Further, since $p_t$ is not observable, the training data $q$ needs to accurately describe $p_t$ so that $tp'(p_i,q)$ is a comparison between $p_i$ and $q$, estimating $tp(p_i,p_t)$. $tp'(p_i,q)$, which only includes observable data, can be used for the classification. This is similar to classification problems where the algorithm itself cannot know if it has done a correct classification without supervision. Further in line with classification, the algorithm need to rely on accurate and representative training data $q$.

Keep in mind that $S$ is a directed graph with labelled edges. This means that an edge from a page to another, say from $p_i$ to $p_j$, is observable from $p_i$ as $e_{i,j}$ with the label $l_{i,j}$. By assuming that the labels describe the pages they link to, it is possible to extract information of a page $p_j$ only by visiting $p_i$ and collecting $l_{i,j}$. This gives the basis for our proposed algorithm.

Straightforward solutions

This section describes three straightforward solutions and why they are not sufficient.

Random walk
 

The Random Walk (RW) randomly crawls a directed graph. This is done by initially visiting a page $p_i$. Further, the RW selects a random edge $e_{i,j}$ available from $p_i$ and follows $e_{i,j}$ to page $p_j$. From $p_j$ it continues selecting random edges and repeats the step. Despite its simplicity it is an often used crawling strategy with valuable properties.[24] However, the RW is without learning and is not designed to prioritize pages which will lead to the target page. Thus, since it is unlikely that the RW reaches the target page $p_t$ the performance can be seen as a lower bound.

Optimal

The optimal algorithm finds the shortest path between the starting page $p_s$ and the target page $p_t$. In a controlled experiment there is perfect knowledge of the entire directed graph, and the Optimal algorithm can observe the otherwise unobservable target page $p_t$. Using a brute force approach, it determines which pages to visit and walks only the shortest path from $p_s$ from $p_t$. Since it will always reach the best possible result with the fewest number of downloaded pages and should be seen as an upper bound. Since the optimal algorithm relies upon knowledge of the unobservable node $p_t$, it cannot be applied in actual scenarios.

A*

Many path finding algorithms exits with the aim to find the best, often the shortest, path between the starting page $p_s$ and target page $p_t$. This is done by using $p_s$ and $p_t$ as input. I.e., in contrast to the EMRIS and classification algorithms, the algorithms always know when $p_t$ has been reached. There are however similarities between path finding and the EMRIS. Path finding algorithms use path and edge information in order to decide which link(s) to follow to get the most probable path to the $p_t$.

A common example of path finding algorithms is A*. To find the most cost effective path from $p_s$ to $p_t$ a distance cost function is calculated consisting of: (1) A path-cost function per path and (2) A heuristic estimate to the goal (e.g. air distance).

Since A* can observe the otherwise unobservable $p_t$, it should, similar to Optimal, be seen as upper bound. However, in contrast to Optimal, A* typically visits more pages than what is part of the path it determines as most cost effective. Thus, there is a more fair chance a proposed algorithms reach accuracy level as A* compared to Optimal.

Proposed solution

 

This section presents the proposed solution based on the lost sheep.[12] It extends the algorithm to make it applicable for eGovernment web site survey assessment. It further generalizes the lost sheep to make it applicable to use with any classifier.

Finally, this section presents application areas where the presented algorithm can be applied. The application areas are chosen to cover important and often assessed areas of eGovernment surveys, and to give a broad view of the possibilities of the algorithm.

The lost sheep (LS)

The Lost Sheep (LS) was first introduces in [12] as an algorithm for solving the EMRIS (see section detectservices:sec:webpageclassification).

The algorithm is based on the lost sheep metaphor: A herder has control over a herd of sheep. At the start of the day, the herder releases all sheep to walk around freely. Each sheep walks independently around for a while for then to return back to the herder. The sheep are very afraid of the unknown and prefer to stay as close to the herder as possible. However, the sheep have one weakness, if a sheep finds green grass, it forgets all about its fears and wanders to whatever location the green grass leads it. The sheep that do not find any green grass, will move quickly back to the herder, while the sheep that finds a lot of green grass will wander further.

The herder is also interested in finding areas with green grass for his herd to graze. To acquire this knowledge, he uses his special ability to communicate with the sheep. He is not interested in the sheep that quickly return because he knows they have found little or no green grass. He is rather interested in the sheep that wandered the furthest, the lost sheep, because he knows that this sheep has found the most green grass.

The lost sheep communicates back to the herder about all the green grass it has found and where it is located. Based on this information, the herder has three options. (1) If the sheep discovered a lot of grass, the herder then becomes so happy that he calls it a day and writes in his journal about the newly discovered area. (2) The herder can also choose to take the entire herd to the new area, release all sheep there and start the process over again. (3) At the end of day, if no sheep has found any satisfactory areas of green grass, the herder returns home sad and writes in his journal that no green areas were found that day.

The herder represents a central control, each sheep represents individual classifiers and the green grass represents positive classification results. The amount of green grass communicated from the sheep to the herder is a confidence level of a positive classification results. The sheep visit anchor texts and follow edges to web pages. The anchor texts are pre-classified for the sheep to determine if a page should be downloaded and classified

The no free lunch theorem states that a classification algorithm cannot learn what it has not seen in the training data. Thus, there is no over all ``best algorithm'' for classification, but a ``best algorithm'' for the data to be classified. Because of this, the algorithms to be used should depend on the data at hand and what should be tested.[20] Algorithm detectservices:algorithm:lostsheep generalises the lost sheep presented in [12] so that it can be applied with any classifier as long as the classifier is able to work with web pages and web sites. This is in contrast to previous presentations of the LS.

Furthermore, in contrast to [12] the confidence in the LS is normalised both on web page level in algorithm detectservices:algorithm:simplela and detectservices:algorithm:nlla. This means that the number of words in web page no longer has an undesired influence on the classification results.

size f the web page does not influence the result.

\begin{algorithm}

Lost Sheep
 

\begin{algorithmic}[1]

\STATE $depth \leftarrow 0$

\STATE $q \leftarrow $ training data

\STATE $x'.confidence \leftarrow 0$

\STATE $p_i \leftarrow p_s$

\STATE $Visited \leftarrow $ empty set.

\WHILE{$x'.confidence<threshold$ and $depth<maxdepth$}

\STATE Add $p_i$ to $Visited$.

\STATE Let $E = \{e_{i,0},...,e_{i,n}\}$ be the $n$ edges from $p_i$.

\STATE Let $L = \{l_{i,0},...,l_{i,n}\}$ be the labels connected to each edge in $E$.

\STATE The herder releases $n$ sheep $X = \{x_0, ..., x_n\}$ where each sheep $x_j \in X$ is connected to edge $e_{i,j}$ if $p_i \notin Visited$.

\FORALL{$x_j \in X$}

\STATE Let $l_{i,j}$ be the label connected to edge $e_{i,j}$.

\STATE Let $p_j$ be the not yet visited page available from $e_{i,j}$.

\STATE $x_j.confidence \leftarrow 0$

\STATE $x_j.page \leftarrow p_j$

\STATE $tp'(x,o,q) \leftarrow $ a classifier, e.g algorithm detectservices:algorithm:cosine, detectservices:algorithm:simplela or detectservices:algorithm:nlla

\STATE $x_j \leftarrow tp'(sheep x=x_j,object o=l_{i,j},training data q=q)$

\IF{$x_j.shouldcontinue$}

\STATE Download $p_j$

\STATE $x_j \leftarrow tp'(sheep x=x_j,object o=p_j,training data q=q)$

\STATE Add $p_j$ to $Visited$.

\ENDIF

\ENDFOR

\STATE $depth \leftarrow depth +1$

\STATE Find $x'' \in X$ where $x''.confidence \geq x_j.confidence$ for all $x_j \in X$

\IF{$x''.confidence>x'.normconfidence$}

\STATE $x' \leftarrow x''$

\STATE $p_i \leftarrow x'.page$

\ENDIF

\ENDWHILE

\end{algorithmic}

\end{algorithm}

It should be clearly noted that many web page classification algorithm exists, and that the purpose of this paper is not to introduce another classification algorithm. The purpose is rather to shows show that by using the LS, classification can be improved independent of algorithm used. Because of this, different classification algorithms are presented intended to show different properties of the LS: Algorithm detectservices:algorithm:cosine, detectservices:algorithm:simplela and detectservices:algorithm:nlla.

Cosine measurement

Algorithm detectservices:algorithm:cosine presents a simple cosine measurement between the training data $q$ and the object to be classified $o$.[14],[25] This is a simple but popular measure in information retrieval calculating the cosine between the vector of query terms and vector of document terms. A high cosine measurement indicates that there are similarities between $q$ and $o$ giving a positive classification result.

The LS is designed to classify the anchor text and web page sequentially, but the cosine similarity does not consider the sequence of the words. Therefore a manually specified threshold is added to indicate whether a page should be downloaded.

The cosine measurement is used in the LS to show that LS is independent of algorithms, and works even with very simple classification approaches. Thus postulating that if the LS works well with a simple cosine measurement, it is likely to also work for more progressive classification algorithms.

\begin{algorithm}

Cosine between vector of query terms $q$ and document $o$ ($tp'(o,q)$)
 

\begin{algorithmic}[1]

\REQUIRE sheep $x$, object to classify $o$, training data $q$

\STATE Let $w_{q,i} \in W_q$ be a vector of the training data terms.

\STATE Let $w_{o,i} \in W_q$ be a vector of the terms in the object to be classified.

\STATE $x.confidence = \frac{ \sum{w_{q,i}*w_{o,i}} }{(\sum{w_{q,i}^2})^\frac{1}{2}*(\sum{w_{o,i}^2})^\frac{1}{2}}$

\STATE $x.shouldcontinue \leftarrow (x.confidence <threshold)$

\RETURN $x$

\end{algorithmic}

\end{algorithm}

Note that the cosine measurement is the same approach used in the similarity search presented in section detectservices:sec:EMRIS.

Simple Learning Automata

Algorithm detectservices:algorithm:simplela is a simple Learning Automata (LA) classifier which, in contrast to algorithm detectservices:algorithm:cosine, can work without a manually specified threshold.[26] This was used as a classification algorithm in [12] for the LS. This simple LA has been shown to work well in interactive unknown environments and will in an attempt to find an unknown value $\lambda$ come arbitrarily close to this value as long as there is an environment which guides it.[27] $\lambda$ is mapped as confidence in algorithm detectservices:algorithm:simplela. LAs have also been shown to work well in teams when limited capacity available.[28],[29]

\begin{algorithm}

Simple LA ($tp'(o,q)$)
 

\begin{algorithmic}[1]

\REQUIRE sheep $x$, object to classify $o$, training data $q$

\STATE Let $W$ be a split of the content of $o$ into words.

\FORALL{$w \in W$}

\IF{$x.state

\STATE $w \leftarrow ApplyStemming(w)$.

\IF{$w \notin StopWords$}

\STATE $x.numwords \leftarrow x.numwords+1$

\IF{$w \in q$}

\STATE $x.state \leftarrow x.state - 1$

\ELSE

\STATE $x.state \leftarrow x.state + 1$

\ENDIF

\ENDIF

\ENDIF

\ENDFOR

\STATE $x.confidence = \frac{x.state}{x.numwords}$

\STATE $x.shouldcontinue \leftarrow (x.state < Q)$

\RETURN $x$

\end{algorithmic}

\end{algorithm}

Non-linear Learning Automata

Finally, algorithm detectservices:algorithm:nlla presents an extension of algorithm detectservices:algorithm:simplela. In contrast to algorithm detectservices:algorithm:cosine and detectservices:algorithm:simplela, algorithm detectservices:algorithm:nlla is not an attempt to only show the proof of concept of the LS, but also efficiently utilize the web page properties.

One improvement is done by adjusting the state movements in the LA. This is done by letting the states in the LA move in a non-linear space, with an aim that to increase the convergence speed without loss of accuracy. The first word, part of the anchor text, can make the LA move $Q/2$ states, where $Q$ is the number of states in the LA. The second word, which could either be in the anchor text or the web page, could make the LA move $Q/4$ states, and so on. Because each sheep classifies sequentially and the anchor text is discovered before the web page text, the anchor text will have more impact on the state movement than the web page text. This utilizes that anchor text describes the corresponding web page not only for choosing which pages to download as done in algorithm detectservices:algorithm:lostsheep, but also for classification[16]. Thus algorithm detectservices:algorithm:nlla prioritises the anchor text over the web page text even more than algorithm detectservices:algorithm:simplela.

Furthermore, algorithm detectservices:algorithm:nlla is improved by taking the frequency of words into account. Algorithm detectservices:algorithm:simplela only determines whether a discovered word in the data to be classified also exists in the training data. This is sub-optimal as word which occurs only once in the training data has an equal impact on the classification as words which occur frequently. As a mitigation, algorithm detectservices:algorithm:nlla is adjusted so that the frequency of words also contribute.[26] This is done by letting the state movement be multiplied by the normalised by frequency of the occurrences of words in the training data. This makes the frequent words in the training data contribute more than the infrequent words.

To avoid letting the state movements become so small that it does not contribute to the classification, the movements are adjusted to let the smallest possible state movement be 1.

\begin{algorithm}

Non-linear LA ($tp'(o,q)$)
 

\begin{algorithmic}[1]

\REQUIRE sheep $x$, object to classify $o$, training data $q$

\STATE Let $W$ be a split of the content of $o$ into words.

\IF{not $x.movement$}

\STATE $x.movement \leftarrow Q/2$

\ENDIF

\FORALL{$w \in W$}

\IF{$x.state

\STATE $w \leftarrow ApplyStemming(w)$.

\IF{$w \notin StopWords$}

\STATE $x.numwords \leftarrow x.numwords+1$

\IF{$w \in q$}

\STATE $x.state \leftarrow x.state -

(int(x.movement*NormFreq(w \in q)) or 1)$

\ELSE

\STATE $x.state \leftarrow x.state + x.movement$

\ENDIF

\STATE $x.movement \leftarrow int(x.movement/2) or 1$

\ENDIF

\ENDIF

\ENDFOR

\STATE $x.confidence = \frac{x.state}{x.numwords}$

\STATE $x.shouldcontinue \leftarrow (x.state < Q)$

\RETURN $x$

\end{algorithmic}

\end{algorithm}

Using labelled information

The LS uses labelled information as a pre-classifier to determine if web pages should be downloaded. The labels consists of anchor and title text in a link: <a href=" ... " title="LABEL">LABEL</a>&>. This works because anchor texts in most cases accurately describes the web page it links to.[16]

Others have shown that classification accuracy can be increased not only by using the anchor text, but also by using the anchor text context.[22] A similar approach could potentially increase the accuracy and reduce the number of downloaded pages of the LS. The algorithm opens up for such extensions, but it is not directly addressed. In fact, since the LS is not dependant on one specific classifier, the applicable features are only limited by the classifier used.

Minimizing resources

Exhaustively crawling web sites is a resource intensive task. This is because there exists many large public web sites. For example, the national government portal of Norway, http://regjeringen.no, has 7 250 000 web pages. With a good download capacity and a fast speed of ten pages per second, the exhaustive scan would take more than 8 days. Additionally, to automatically asses the web pages, they need to be parsed and the results stored. This is resource intensive and in practice limits who can do exhaustive crawls to a few organisations. It is therefore important that any algorithm to be used for eGovernment surveys use the available resources wisely and try to limit the number of downloaded pages, so that it can be applied also in organisations that do not have massive capacity available.

The LS is design to use as little as possible of the available resources when crawling and classifying web pages. This is done my letting each sheep evaluate the anchor text, and, only if the anchor texts seems relevant, the sheep downloads the pages. This is under the assumption that the used resources are directly related to the number of downloaded and parsed web pages. I.e. an algorithm will use less resources if it downloads fewer pages.

Formal web site scoping

What is within or not within a web site is from a user perspective not clearly defined. Sometimes a web site equivalent to the domain name. However, in connected web sites, this is often not the case. A web site may link to any web page, independent of domain name, in an attempt to provide useful material to the user. In fact, in a connected government web site it is often so that services are bought from third party vendors and are made available online on separate domains. The citizens can find these services either by a search engine or through links from the government web site.

A practical example of this is the mail record module in Norwegian local government. The modules are bought from limited set of vendors and often available on the domain of the vendor. Since many buy from the same vendor, local governments will often have the same domain for their mail record module. For users this is not a problem as they will use the module independent of whether it has the same domain as long as it is part of the local government. However, for a crawler this causes challenges as that is within or outside of the scope of the web site is not formally defined.

The common approach in existing crawlers and classifiers for automatic evaluation has been to formally define the scope of the web site.[30] This is a tedious manual process. In contrast, the LS does not require a formally scoped web site, only a web site entry point. The LS follows links without consideration of domain names. Thus, it does not matter if targeted information is linked from but not present at the same domain, or if a governmental service is part of a different domain than the governmental web site itself.

This approach works because web sites web sites tend only to link to content relevant for the user groups. For example, a municipal web site will link to the public records from their own municipality, not others.

Classification using only positive examples

A common challenge in classification is collection of enough and good training data. Often times collecting training data is a manual time consuming work, requiring both positive and negative examples. In general, the more training data available without over fitting, the better the classifier performs.

Therefore, it is a very desired property to reduce the amount of training data needed in classification.[31] The LS achieved this by only requiring positive training examples. This uncommon property significantly reduces the amount of needed training data.

The LS uses only positive training examples because the presented algorithms detectservices:algorithm:cosine, detectservices:algorithm:simplela and detectservices:algorithm:nlla rely only on positive examples. However other classifiers may rely on both positive and negative examples such as the Naïve Bayes classifier. For such an approach the negative training data could be directly collected from the data outside of the target category, but the performance and properties of using both negative and positive training data has not been examined in this paper.

Feature selection

There exists many potential features which can be used for web page classification. What are useful features depend on the training data.

The general LS (algorithm detectservices:algorithm:lostsheep) does not specifically define any features. The only requirement is that the features are available from the anchor text and the web page. However, each of the presented sheep algorithms (algorithm detectservices:algorithm:cosine, detectservices:algorithm:simplela and detectservices:algorithm:nlla) split the anchor text and the content into words. This means that the presented algorithms are based on the bag of words for features.

Even though the current references only uses the available words as features, due to the interchangeability of classification algorithms in the LS, several other potential features could be applicable. This includes, but are not limited to: text in the vicinity of the anchor text, the URL of the page to be evaluated, the heading information close to the anchor text, web page title, tag structure, meta tags, HTTP header information, and so on.[32],[33],[34] Such features may yield higher accuracy. However, the optimal set of features is not within scope of this paper.

Bag of words approach

As is common in text classification and naïve linguistic models, algorithm detectservices:algorithm:cosine, detectservices:algorithm:simplela and detectservices:algorithm:nlla use a bag of words approach. It treats each instance of the word independently. This is a common approach in classification as it yield good accuracy without being complex. [35] A disadvantages is that opposite phrases such as ``Mail records are online, not only offline anymore'' and ``Mail records are only offline, not online anymore'' are treated identical.

Again, due to the interchangeability of the classifiers in the LS, any classifier could be used, even classifiers that use takes sequence of words into account, using for example an n-gram classification approach.[36] With such a classifier the sequence of the words ``not online ... but offline'' would be treated differently than ``online ... but not offline''.

In fact, algorithm detectservices:algorithm:nlla utilizes the sequence to fuel convergence. In contrast to common bag of words approach, the first words detected, part of the anchor texts, have more influence on the state movements and are more important than the words detected later. However, relationships between words that occur close to each other, are not utilized in any of the presented algorithms.

Another disadvantage with the bag of words approach is that it is language dependant.

Language dependency

Language dependency is a common problem when applying web mining in across countries and languages. Some approaches exists that use automatic translation or multiple corpuses with different languages.[37]

However, it is generally considered that as long as words are used as part of the algorithm it is language dependant.

Language independence are not considered in this paper. However, a mitigation could be to either have training data from multiple languages available, or use features which are not language dependant such as number of links, types of document, dates, and so on.[38]

Stemming

A common challenge when dealing with text classification is the different conjunctions are not directly comparable. An example is having training data with the word plan. For accurate classification it is important that relevant words to be classified can be compared with plan, such as planning, planned, plans all with the stem plan.

Stemming makes such comparisons possible, which is the process is extracting the core parts, or the stems, of words. It is commonly applied to classification problems when conjugations of words are irrelevant and introduce noise. [39] The LS uses stemming both for the anchor texts and web page content.

Similarly, in HTML certain characters can be written both with HTML notation or as unicode characters. For example å can be written as

å
. To avoid noise, all HTML notations were translated to unicode characters.

Removal of stop words

Another common challenge in text classification are overly common language dependant words which are independent of the content and typically introduce noise to the classification. Example of stop words are in English are: and, to, for, in .... 3

The LS removes all stop words are removed from training and classification data.[39]

Organising of training data

The training data is organised into categories based on the material that is sought after. For each category, the training data includes examples of target web pages and anchors linking to the target pages. Both web pages and anchors have been manually classified.

The anchor texts and web page are treated separately by the algorithm. This means that the training data the anchor text is used only for the classification of the anchor text, and the training data for the web pages are used only for the web page classification. However, since there is a strong correlation between the anchor text and web page content, if few training data examples are available, the training data from the anchor text and web page could be merged.

Complexity

To analyse and evaluate the performance of the LS this section presents a complexity analysis of the algorithm

Considering a web site as a directed graph S, the number of pages available can be determined by the depth of the graph $d$ and a branching factor $b$. This means that a simple complete breath first crawler, that downloads every page once, will have the space and time complexity $O(b^d)$.[40] In a crawler that deliberately limits the depth to maxdepth, the space and time complexity will be $b^maxdepth$.

In the LS, maxdepth is used to control the maximum depth the algorithm reaches in the web site graph. The herder, that controls all downloaded pages in the set $Visited$, will not release a sheep towards a page which has already been visited. However, the herder could, and often does, release the herd at a page which a sheep has communicated have good classification results.

This means that by letting the sheep work sequentially instead of parallel and by letting each sheep always download the page it encounters, the sheep will be a breadth first crawler with complexity $b^maxdepth$. Additionally, the herder will downloaded one page per depth. Thus, the time and space complexity when downloading all pages must be: $b^maxdepth + maxdepth$.

Since the sheep only download web pages that potentially yield good classification results, the worst case scenario when it comes to time and space complexity must be the sheep always downloads the page it encounters. Thus, the complexity of the LS must be $O(b^maxdepth + maxdepth)$

This means that the LS has a higher complexity than a standard breadth first crawler with a manually defined maxdepth. However, this absolute worst case does not communicate well how the algorithm works in practice. The performance can better be communicated by introducing a relevance factor $r \in [0,1]$ representing the probability that a sheep pursues and downloads a page. This gives the time and storage complexity: $r*b^maxdepth + maxdepth$ where $r$ and $b$ defines the web site, and maxdepth is set in the LS. By letting $r=1$, it gives the worst case complexity (Big-O notation).

Since the objective of the LS is to detect the target page $p_t$ (see section detectservices:sec:problem), it needs to be downloaded. Thus, $p_t$ need to be within the potentially $r*b^maxdepth + maxdepth$ downloaded pages. Therefore maxdepth need to be set balancing the complexity and the data at hand: Considering both the available resources and that $p_t$ should be within reach.

The complexity analysis in this section is based on graph traversal. It does not consider the web page size nor the download capacity, which adds to the complexity. This analysis further relies on that all URLs uniquely identifies their corresponding web pages. This way, the crawler can know by only comparing the URLs prior to downloading a page whether it has been downloaded earlier. In practice there are situations where several URLs represent the same web page. Since there is then a chance that a page is downloaded more then once, it adds to the complexity. [41] A solution for this could be to use the HTTP header hash tag which could be retrieved without downloading the web page. These limitation and solutions are not addresses in this paper as they are not only applicable for the complexity analysis of the LS, but applying any graph search analysis applied for crawling for web sites. Thus, the exact same limitation and solutions applies to for example a complete breadth first crawler.

See section detectservices:sec:experimentalevaluation and [12] for empirical performance evidence.

Implementation

The algorithm has been implemented in Python both in a synthetic environment and an environment that crawler working with real web sites. The application is controlled from a command line interface, and all results are stored in a postgreSQL database.

The results are available in a simple web interface, and exportable to CSV so that the results can be used for statistical analysis or for example in a spread sheet application.

All HTTP requests are done using urllib2, and all HTML parsing implemented using Beautiful Soup. The stemming algorithm and stops words are a Python port of the Java Snowball.

The implementation is deliberately not following the robots exclusion protocol (robots.txt).[42] The reason for this is that many governmental web sites use the protocol to avoid unwanted dissemination.[43] Thus, by following the robots.txt much material would not be found. However, following robots.txt could be switched on by a parameter in the command line interface.

The synthetic environment includes additional algorithms than is presented in this paper. The reason for this is that the synthetic environment is easily controllable and therefore much easier implement and use to get a first overview how each algorithm performs. How the LS acts in the synthetic environment is extensively studied in [12].

All source code is available at http://svn.egovmon.no/svn/eGovMon/ under the GPL license.

Application I: Existence: Locating public transparency services and information in government web sites

 

The first case where the LS is applied is automatically assessing eGovernment transparency. Transparency is area which are often part of eGoverment surveys and is therefore a fitting application area for the LS.[44] The eGovernment survey which includes the most detailed description of the methodology is presented in [2]. In short, the approach in [2] rewards the presence of online information related to transparency. For example, a government web site with contact information is more transparent then a web site lacking this information. The LS could have been used to carry out this survey.

[2] includes five local government web sites (the web site of the capital and the four subsequent largest cities) from 15 European countries. Smaller municipalities are not included in the survey. Thus, [2] focus on only a small set of web sites, and much more capacity is required if more web sites such as web sites from local governments should have been part of the survey.

[2] examines as available information, while others have tried to examine transparency from a citizen perspective. An example is [44] which saw transparency as presence of services, information and data which gives insights to governmental processes on the web, such as for example local government budgets, video of governmental meetings and zoning information.

Based on these distinct approaches, online transparency could be divided into two unique categories with very different properties: transparency information and online transparency services. The LS is applicable for both categories.

Transparency information is available governmental material which could, to enhance transparency, be published online, such as an online budget or contact information. It is typically uploaded as a spread sheet or PDF on the local government CMS, and no special requirements or services are needed to make it available. Furthermore, it is almost exclusively located in the main domain local government web site.

Online transparency services are part of the web site providing functionality that could not be achieved at the same level without eGovernment. An example of this is searchable and subtitled videos of local government meetings organized by cases.4 The services are typically provided by a third party vendor more or less integrated into the local government web sites, but not necessarily within the same domain. Thus, links to government services at a different domains may be just as valuable for a citizen as links to services within the same domain, whether a service is within or outside of a web site is not clearly defined. When applying a crawling algorithm, formally defining the web site scope is important to avoid excessive downloads of web pages. However, a clear definition of what is within or outside the scope of a web site is not important to the users as long as it does not affect the browsing experience.

Based on this we let the LS be applied to the web site to automatically locate both information and services.

Application II: Findability: Determining whether government material can be found users

 

The second case where the LS is applied is suggesting whether online government material could be found by representative users.

Even if useful information is exists in public web sites, it is not necessarily so that real users can locate it. Assessing if material exists or if it can be found online are substantially different. Tests on existence (existence tests) only assess whether information is present, typically carried out by experts. On the other hand, tests on whether users can find information online (findability tests) assess the usability of a web site, and are typically run in controlled environments with representative users. Findability tests are therefore carried out with a higher cost, but are nevertheless often assessed in eGovernment surveys.[3]

Findability tests are related to web usage mining which requires either access to web logs or installing a third party mining software such as Google analytics. In either case, it requires some administrative access not publicly available.

However, many have showed that navigable web sites are more user friendly and therefore related to web structure mining.[45] A usability optimization approach is presented in [46] fusing both weighted page ranks and the web log mining the user behaviour. Among the presented findings, [46] show that users could reach popular pages more easily than unpopular pages. This rather obvious finding shows an important point: The reachability of a service and information is important for the usability, even in modern complex dynamic web sites.

It is therefore arguable that findability implies reachability, and that fewer clicks needed for a user to reach a target page indicates both increased reachability and findability. From this we can define an approach which gives an indication of the findability without access to the web log by measuring the number of required clicks from the start page $p_s$ to the target page $p_t$ as $Findability(p_s,p_t)$ as outlined in algorithm 4.6. It further means that testing for existence versus findability can be set up as described in algorithm detectservices:algorithm:existanceversusfindability.

\begin{algorithm}

Findability
 

\begin{algorithmic}[1]

Findable($p_i$,$p_t$)

\IF{$p_i$=$p_t$}

\RETURN 0

\ELSE

\RETURN $\min_{\forall p_j\mbox{ linked from }p_i\mbox{ and } visit(p_j)}\{\mbox{Findable}(p_j,p_t)+1\}$

\ENDIF

\end{algorithmic}

\end{algorithm}

\begin{algorithm}

Existence versus Findability
 

\begin{algorithmic}[1]

\STATE $p_t = LostSheep(q,S)$ \/\/ Existence test.

\IF{$p_t \neq \emptyset$}

\STATE $t = Findable(p_t,S)$

\IF{$t > threshold$}

\STATE $p_t$ exists and is findable.

\ELSE

\STATE $p_t$ exists, but is not findable.

\ENDIF

\ELSE

\STATE $p_t$ does not exists.

\ENDIF

\end{algorithmic}

\end{algorithm}

Since findability uses number of clicks as a metric, it implies that many clicks yield bad findability, and visa versa. Thus, there exists a threshold that separates if a material is findable or not, as indicated in algorithm detectservices:algorithm:existanceversusfindability. The threshold should optimally be set to how many links users click to reach a task before giving up. Some studies show that users tend to stop surfing he/she has clicked links at least thee times. Such evidence is provided in [47], but stresses that the number of clicks itself is not the important factor, but rather whether users are able to find information. The number of clicks before quitting a task depend significantly on the user. Some users even kept on clicking after 25 pages.

This suggests that the threshold should be set to three clicks. However, it should be verified with user experiments.

Application III: Locating services and information live and in web sites in the deep web

 

The third application is automatically finding government material in a live environment.

Searching the web is a common and mature task and many users use Internet almost exclusively through large search engines. Searching the huge web is possible because the web sites have been crawled and indexed prior to the actual search. In practice this means that users search an offline copy of the web.

However, many web sites cannot be indexed by major search engines and fall into the deep web.[48] Examples of this includes web sites which update more frequently than the visiting interval of the search engine crawlers, web sites which are password protected, behind forms or restrict crawlers to visit by using robots.txt. Further, government web sites are starting to fall into the deep web as web sites become password protected and often with personalised information.

To enable search in some of the deep web sites, algorithms can perform live searches in an interactive environment with the tester. With an input query the algorithms can crawl and with a high precision find the web page that best matches the query. Additionally, the algorithm could prompt for passwords, form content, and so on. For this to be possible, the algorithm need to provide results quickly. This application requires that the algorithm is able to find information in a short time period. Therefore, an algorithm that finds the information quicker will score better.

Further, live search makes it possible to evaluate eGovernment web sites on the fly. This could work similar to how W3C Markup Validation Service works which gives validation results on a web page[49]. An eGoverment checker would rather work on web site level to detect whether material exists and if it is findable. However, for such a service to work in practice it needs to be able to respond quickly.

Experimental evaluation

 

This section first presents the experimental setting including the data set. Then the section presents empirical data and analyse of the algorithm performance.

Experimental setting

This section presents the set up of the experiments. This includes an overview of the algorithms evaluated, the data set used, how the training data is collected, which tasks the classifiers carry out and metrics used for evaluating the performance.

Selected transparency tasks
 

The experiments are designed to be as realistic as possible, but still show the performance of the algorithms.

To comply with this, the algorithms are assigned with tasks in line with section detectservices:sec:existence and detectservices:sec:findability on transparency services and information online and to meet the following criteria.

Table detectservices:table:transparencytasks presents a list of tasks the algorithms locate. These tasks are similar to classification classes as a web page $p_i$ could be classified to be part of each class. However, the tasks differer from classification classes as the intention is to determine both if the material exists and the location. Therefore the wording task is used rather then classes, in line with survey assessment not classification.

[h!]

List of transparency tasks.
 

\begin{tabular*}{0.99\textwidth}{l l p{10.0cm} }

IDTypeDescription 1 Information Contact information 2 Information Recent information section 3 Information Budget 4 Information Local government calendar 5 Information Local government plan 6 Information Zoning information and plans 7 Service Mail record 8 Service Search functionality (on a single page) 9 Service Online local government board meetings 10 Service Online local government executive council 11 Service Chat with administrative or political officials 12 Service Video of city or municipality board meetings 13 Service Online city or municipal plan meeting

\end{tabular*}

Summary of algorithms

Table detectservices:table:algorithms presents an overview of algorithms used in the experiments. These algorithms are chosen because they are shown to both solve similar problems yield good results in synthetic environments.

[h!]

Summary of algorithms
 

\begin{tabular*}{0.99\textwidth}{l l l}

Abbr.AlgorithmPresented in RW Random Walk Section detectservices:sec:randomwalk SS Shark Search Section detectservices:sec:sharksearch SIM Similarity Search Section detectservices:sec:SIM LS SIM Lost Sheep with cosine similarity Algorithm detectservices:algorithm:lostsheep,detectservices:algorithm:cosine LS LA Lost Sheep with simple LA Algorithm detectservices:algorithm:lostsheep,detectservices:algorithm:simplela LS NLLA Lost Sheep with non-linear LA Algorithm detectservices:algorithm:lostsheep,detectservices:algorithm:nlla

\end{tabular*}

The setups of the LS are in line with the findings in [12]. The threshold in LS SIM that controls when a page should be downloaded is set to 0.1. The confidence threshold in the LS algorithm, to decide when the LS should stop crawling, is set to 0.75. I.e. the LS need to have a normalised confidence of 0.75 before stopping prior to the reaching maxdepth. The number of states in LS LA and LS NLLA, $Q$, is set to 100. Additionally, maxdepth is set to 5.

Note that since neither the RW nor the SS has a stopping mechanism, they are deliberately stopped after downloading 200 pages. This is a fair stop criteria since the other algorithms in average downloaded 55 pages, and only 212 times out of 22170 exceeded downloading 200 pages.

Further, since the RW does not give information on what is the expected target page $p_t$, the page with the largest similarity score among the downloaded page is chosen (${\arg\max}_{\forall p_i \in P\textrm{ }if\textrm{ } visit(p_i)}\{sim(p_i,q\}$). Note that the RW is without learning and should be seen as a lower bound. The RW shows if it is possible to find the target page by simply randomly selecting 200 web pages. High accuracy for the RW indicates that the problem is relatively easy, while low accuracy indicates that it is difficult.

Data set

This paper focuses on real web sites. The data set used to test the algorithms are web sites from all Norwegian local governments. The set consists of 430 web sites of which 3 where unavailable during the experiments.

Even though these are complex web sites often consisting of several domains, none of the web sites were formally scoped. The only input for the algorithms are the home page of the web site.

The size of the web sites are estimated by randomly selecting 89 (20.84% of the entire data set) web sites and checking the number of crawled pages by Google, postulating that Google has crawled and indexed the entire web sites. The web sites have a median size of 26 000 pages. Note that the size of the sites are not Gaussian distributed and have arithmetic mean of 87 000. Some sites are huge, typically from large municipalities, while others are small.

To avoid discrepancies in the data due to updates of the web sites, the data set is organised into tuples consisting of a web site and a task. When running the experiments, a random tuple is selected for evaluation. The web site and task is extracted from the tuple and all algorithms run towards the site and task before the experiments continued to the next tuple. This minimizes the chance of a web site being updated between running two algorithms. Since there are 13 tasks and 427 web sites, this gives a data set of 5551 tuples, and in total 33306 experiments.

Synthetic environment

The advantages of applying the algorithms is a synthetic environment is that it allows many experiments without manual verification and therefore with minimum human interaction. The main disadvantage is that a synthetic environment may not be representative for real web sites.

This paper focuses on applying the algorithms in real web site environments. However, experiments and analysis in synthetic environment is available in [12]. This includes how all evaluated algorithms work in different web site sizes and how the number of states influence the accuracy and convergence.

[12] also includes comparisons between the LS and several other algorithm available in the literature, including algorithms which require perfect knowledge of the web site graph, such as Optimal and A*. Because of the time consuming manual verification process which is required when applying the algorithms to real live web sites, and because the comparativeness has already been extensively shown in [12] it is omitted from this paper.

However, the findings in [12] provide the basis for what has been implemented in a real environment part of this paper.

Training data

The training data where selected from the 427 municipal web sites. Since the experiments are organised into different tasks, the training data is grouped the same way.

For each task, the web sites were ordered randomly. The training data is collected by locating the target page if it exists, and corresponding anchor text in the first site. Subsequently, training data is collected from the next site on the list. The collection is stopped when a simple diversity analysis of the words in the data yields good results. The number of web pages collected for the training data ranges from 13 down to 5 pages. Note that all algorithms have the exact same training data.

To avoid classifying and training towards the same data, the web sites part of the training data are removed in the analysis, unless it is explicitly mentioned. The training is only kept in counting the number of locale government web sites where a material is located. The reason for this is that this data does not show the algorithms performance, but rather examples of interesting eGovernment results on existence and findability.

Metrics

Based on the analogy in detectservices:app:metrics, the algorithms are analysed according to the following performance metrics.

Additionally, the following data is collected and presented.

Note that the latter two are not evaluations of the algorithm, but rather show examples of results on if targeted eGovernment material exists and an indication of whether it can be found by real users.

Server

All experiments are run on Linux Debian 2.6.24-6 etchnhalf.8etch1 on a Fujitsu PRIMERGY RX200 Intel D5000 machine with two Xeon DP E5405 2x Quadcore CPUs and 8GB RAM.

Experimental results

This section presents empirical evidence and analysis on the performance of the algorithms as well as data on existence and findability.

All experiments were carried between February and April 2011.

Verification

The performance of the algorithms are verified partly automatically and manually. In most cases the results are calculated automatically based on the execution of the algorithm, such as the number of downloaded pages and execution duration.

In contrast, since it is not possible to automatically calculate whether an algorithm has reached a true positive, this is manually verified. Two tasks are fully verified: (1) Whether contact information is available at the web site and (7) whether mail record services are available online. For the remaining tests, a randomly selected sample of 43 web sites (10%) are analysed. The same random sample is used for all algorithms and tasks.

Overview of experimental results

Table detectservices:table:results and detectservices:table:results2 present an overview of the results. The table shows data on the performance of the algorithms on accuracy (average TP), number of pages downloaded and duration in seconds. For these data the best result is presented in bold. The performance results are further analysed in section detectservices:sec:accuracy, detectservices:sec:numberofdownloads and detectservices:sec:resultsduration.

Additionally, the table presents results on existence testing and findability testing.

[h!]\footnotesize

(Task 1-7) Detected transparency material with the different applied algorithms: (A) Manually verified accuracy (Average TP), (B) Average number of pages downloaded, (C) Average duration in seconds, (D) Exstance testing: Number of municipalities where the service of information exists and (E) Findability Testing: Percentage of target pages within reach of 3 clicks, for the algorithms Lost Sheep with Cosine Similarity (LS SIM), Lost Sheep with LA and Q=100 (LS LA), Lost Sheep With Non Linear LA and Q=100 (LS NLLA), Random Walk (RW), Shark Search (SS), Similarity Based Search (SIM).
 
Task ID 1 2 3 4 5 6 7
A LS SIM 0.82 0.60 0.80 0.92 0.85 0.82 0.88
A LS LA 0.84 0.58 0.80 0.93 0.86 0.78 0.87
A LS NLLA 0.85 0.57 0.85 0.86 0.91 0.80 0.87
A RW 0.36 0.15 0.66 0.63 0.65 0.64 0.32
A SS 0.46 0.28 0.65 0.77 0.64 0.67 0.56
A SIM 0.30 0.15 0.66 0.68 0.64 0.66 0.34
B LS SIM 5.89 18.64 34.5 27.98 32.4 35.46 9.81
B LS LA 5.99 17.53 33.42 28.34 30.59 36.36 9.87
B LS NLLA 5.93 18.81 34.1 28.82 30.66 36.6 9.96
B RW 200 200 200 200 200 200 200
B SS 200 200 200 200 200 200 200
B SIM 138.0 133.28 135.18 136.31 137.26 136.15 136.55
C LS SIM 23.9 50.43 72.17 61.91 58.87 75.67 20.93
C LS LA 24.29 46.01 70.58 68.04 66.9 81.81 22.55
C LS NLLA 10.67 31.34 54.01 49.03 50.22 63.95 21.26
C RW 50.55 61.2 72.6 62.59 99.62 77.09 83.09
C SS 58.29 102.94 70.5 149.52 101.29 67.2 71.44
C SIM 139.96 126.69 107.41 103.43 128.69 114.27 111.27
D LS SIM 414 340 186 239 209 166 381
D LS LA 416 350 199 238 216 155 379
D LS NLLA 416 338 190 230 218 154 378
D RW 351 156 86 42 57 60 118
D SS 405 212 106 121 97 78 307
D SIM 394 226 107 90 77 77 169
E LS SIM 0.83 0.48 0.30 0.44 0.3 0.21 0.83
E LS LA 0.83 0.47 0.28 0.44 0.34 0.21 0.83
E LS NLLA 0.83 0.47 0.28 0.44 0.33 0.2 0.83
E RW 0.32 0.28 0.33 0.24 0.3 0.25 0.25
E SS 0.29 0.36 0.27 0.25 0.32 0.28 0.32
E SIM 0.0 0.0 0.0 0.0 0.0 0.0 0.01

[h!]\footnotesize

(Task 8-13) Detected transparency material with the different applied algorithms: (A) Manually verified accuracy (Average TP), (B) Average number of pages downloaded, (C) Average duration in seconds, (D) Exstance testing: Number of municipalities where the service of information exists and (E) Findability Testing: Percentage of target pages within reach of 3 clicks, for the algorithms Lost Sheep with Cosine Similarity (LS SIM), Lost Sheep with LA and Q=100 (LS LA), Lost Sheep With Non Linear LA and Q=100 (LS NLLA), Random Walk (RW), Shark Search (SS), Similarity Based Search (SIM).
 
Task ID 8 9 10 11 12 13
A LS SIM 0.69 0.78 0.75 0.98 0.98 0.96
A LS LA 0.69 0.77 0.64 0.98 0.97 0.96
A LS NLLA 0.68 0.79 0.79 0.98 0.97 0.98
A RW 0.47 0.25 0.24 0.93 0.94 0.94
A SS 0.65 0.38 0.29 0.93 0.95 0.92
A SIM 0.53 0.30 0.27 0.94 0.94 0.93
B LS SIM 15.66 18.41 23.04 49.16 48.98 47.18
B LS LA 15.84 18.79 23.04 49.11 48.69 46.96
B LS NLLA 15.6 19.18 22.97 49.46 48.53 47.33
B RW 200 200 200 200 200 200
B SS 200 200 200 200 200 200
B SIM 135.51 135.96 136.28 136.97 138.81 137.38
C LS SIM 33.77 44.74 51.18 109.59 103.06 92.59
C LS LA 50.84 44.82 63.13 102.72 96.52 94.45
C LS NLLA 37.84 33.49 38.68 146.59 84.51 69.25
C RW 153.45 72.3 58.98 64.18 90.75 59.49
C SS 75.37 97.18 109.79 88.01 60.38 77.0
C SIM 164.17 120.35 118.17 132.68 109.49 117.03
D LS SIM 362 339 303 25 22 38
D LS LA 364 332 292 20 27 39
D LS NLLA 368 336 299 23 30 41
D RW 196 179 107 23 21 24
D SS 266 242 156 28 24 28
D SIM 299 222 153 19 21 32
E LS SIM 0.49 0.43 0.35 0.44 0.5 0.32
E LS LA 0.49 0.46 0.36 0.55 0.44 0.28
E LS NLLA 0.48 0.43 0.36 0.3 0.4 0.24
E RW 0.26 0.3 0.3 0.3 0.33 0.54
E SS 0.32 0.35 0.38 0.25 0.29 0.5
E SIM 0.0 0.0 0.0 0.0 0.0 0.0

Existence testing

Table detectservices:table:results(D) and detectservices:table:results2(D) shows the results for existence testing on the 427 local government web sites. Since task 1 (contact information) and task 7 (mail records) have been fully manually verified, it is possible to compare the existence testing between the correct results and the reported results by the algorithms. This comparison is available in table detectservices:table:comparisonexistence. Note that this comparison does not take the accuracy of the results into account, which is shown in detectservices:table:results(A) and detectservices:table:results2(A) and section detectservices:sec:accuracy.

Table detectservices:table:comparisonexistence clearly indicates that there are differences between the algorithms. All LS algorithms are able to reach a number close to the true value, with a difference with only 12-10 for task 1 and 27-30 on task 7. In contrast the similarity search differs with 32 and 239.

[ht]

Comparison on the existence testing between the algorithms. The data show the number of web sites where the target reported that a target page has been found. The difference from the correct result is shown in parenthesis.

 

Algorithm Task 1Task 7
Correct results 426408
LS SIM 414 (12) 381 (27)
LS LS 416 (10) 379 (29)
LS NLLA 416 (10) 378 (30)
RW 315 (111) 118 (290)
SS 405 (21) 307 (101)
SIM 394 (32) 169 (239)

Findability testing

Table detectservices:table:results(E) and detectservices:table:results2(E) shows that results for findability testing on the 427 local government web sites. As with existence testing, there are differences between the algorithms.

Unfortunately, the correct result is not known in this case. To give an indication of how the different algorithm perform, a comparison between the result for each algorithm to the algorithm which reached the highest percentage of target pages reachable within 3 clicks is done in table detectservices:table:comparisonfindability.

[ht]

(Task 1-7) Comparison on the findability testing between the algorithms. The data is compared with the results retrieved from the algorithm that reached the highest percentage of target pages reachable within 3.

 

Algorithm 1 2 3 4 5 6 7
Best result0.830.480.330.440.340.280.83
LS SIM 0.00 0.00 -0.03 0.00 -0.04 -0.07 0.00
LS LA 0.00 -0.00 -0.02 0.00 0.04 0.00 0.00
LS NLLA 0.00 0.00 0.00 0.00 -0.01 -0.00 0.00
RW -0.51 -0.19 0.05 -0.20 -0.03 0.05 -0.58
SS -0.03 0.08 -0.06 0.01 0.02 0.03 0.07
SIM -0.29 -0.36 -0.27 -0.25 -0.32 -0.28 -0.31

[ht]

(Task 8-13) Comparison on the findability testing between the algorithms. The data is compared with the results retrieved from the algorithm that reached the highest percentage of target pages reachable within 3.

 

Algorithm 8 9 10 11 12 13
Best result0.490.460.380.550.500.54
LS SIM 0.00 -0.03-0.03 -0.11 0.00 -0.22
LS LA 0.00 0.03 0.00 0.11 -0.06 -0.04
LS NLLA -0.01 -0.03 0.00 -0.25 -0.04 -0.04
RW -0.22 -0.13 -0.06 0.00 -0.07 0.30
SS 0.06 0.05 0.08 -0.05 -0.04 -0.04
SIM -0.32 -0.35 -0.38 -0.25 -0.29 -0.50

Again there are significant difference on how the algorithms perform. It is clear the LS algorithm are able to locate much more web pages within the three clicks than the remaining algorithms.

Perhaps most noticeably, the SIM in almost all cases is not able to locate the page within three clicks. This suggests that the SIM takes unnecessary long paths to reach the target page.

Accuracy of the algorithms

 

To use classifiers in practice it is important that they can be trusted to yield accurate results.

Figure detectservices:figure:algorithmandaccuracy presents box plot of the accuracy of the evaluated algorithms summarising the 13 tasks presented in table detectservices:table:transparencytasks. The black line depicts the median value, the box is drawn between the quartiles (from the 75th percentile to the 25th percentile). The dashed line extend to the minimum and maximum values except outliers.

The plot shows that there is a significant difference between the achieved accuracy of the algorithms. It further shows that all LS algorithms have a much better performance than RW, SIM or SS. This is despite the fact that RW, SIM and SS download more web pages per web site as shown in section detectservices:sec:numberofdownloads.

LS LA, LS NLLA and LS SIM reached a median accuracy of 0.84, 0.85 and 0.82. In contrast SIM and SS reached a median accuracy of 0.64 and 0.65. RW, which should be seen as a lower bound, yields an average accuracy of 0.63.

The accuracy of the algorithms (13 tasks per algorithm).

 

A conclusion based on this empirical data is that the LS in most cases reach a higher accuracy than comparable algorithms. Further, since there are only small differences between LS LA, LS NLLA and LS SIM which classification algorithm is used with the LS have little influence.

Even thought LS SIM and SIM both use the same cosine similarity algorithm, LS SIM reaches a significant better accuracy than SIM. These empirical results strongly indicate that using LS when to locate and classify web pages increases the accuracy.

Number of downloaded pages per algorithm

 

Since crawling and downloading web sites is a resource intensive task, it is important that the algorithms utilize the available resources as best possible and therefore minimise the number of pages.

Figure detectservices:figure:algorithmanddownloads presents box plot of the number of downloaded pages for all 427 local government web sites. Note that RW and SS have been omitted since they have a fixed downloaded rate.

The number of downloaded page per algorithm (427 web sites per algorithm).

 

The median number of downloaded pages for LS LA is $\sim27$, LS SIM and LS NLLA have a median download of $\sim28$. In contrast, SIM has a median download of $\sim135$.

It is evident from figure detectservices:figure:algorithmanddownloads and the median numbers that all LS algorithms downloaded less web pages than SIM. Thus, it is arguably so that pre-classifying anchor texts in order to determine if a page should be downloaded considerably reduces the number of pages to downloaded.

Duration of algorithms

 

To apply automatic algorithms for live eGovernment measurements, it is important that the duration to carry out the tasks is so low that it is practical to be applied by the testers.

Intuition dictates that there should be a correlation between the duration of the algorithms and number of downloaded pages. However, the results show that this is not strong as the Pearson correlation is only 0.67. Figure detectservices:figure:pagessversusduration presents a plot between the number of downloaded pages versus duration for all 427 web sites, independent of algorithms including a best fit regression. The blue dashed line shows the mean response estimate with a 95% confidence interval. The red dashed line shows the prediction response distribution with 95% confidence interval. 95% of the items fall within these red dashed lines.

Number of downloaded pages versus duration (427 web sites per algorithm).

 

The duration of the algorithms (427 web sites per algorithm).

 

Figure detectservices:figure:algorithmsversusduration shows a box plot of duration per site per algorithm without outliers. The figure shows that there are considerable differences between the algorithms.

The box plot shows that the LS NLLA is the fastest algorithm with a median duration 8.95 seconds. Close by comes LS LA and LS SIM with an average duration 11.24 and 12.09 seconds respectfully. The RW and SS, who both are limited to 200 pages, have a close median duration of 30.20, 28.39. Finally, SIM has a median duration of 59.09 seconds. Thus, the LS is in average a lot faster.

It is therefore arguably so that of the evaluated algorithms, LS most applicable in live searches.

Number of pages downloaded versus size of web site

 

To show an indication the performance of the algorithms in practice, this section presents empirical data on how the algorithms work on web sites with different number of pages ($|P|$). This could be used to indicate how $|P|$ influences the performance and if the empirical data indicate a complexity.5

Figure detectservices:figure:sizeversusdownload presents a plot of the number of downloaded pages versus the number of pages available in the web site for the algorithms LS SIM, LS NLLA and similarity search including a best fit regression for each algorithm. Note that due to the logarithmic y-scale, the best fit regressions appear curved, but they are in fact linear.

The plot shows that in almost all cases the LS algorithms downloads fewer web pages than the SIM. Further, the regression for LS NLLA increases faster than LS SIM. LS NLLA and LS SIM intersect at web sites with $\sim59000$ pages. This indicates that, with respect to number of downloaded pages, LS NLLA performs better if $|P|<59000$, LS SIM otherwise.

An explanation could be that the LS NLLA gives a higher priority to the anchor texts than LS SIM, which indicates that the anchor text have a more discriminatory effect on web sites with $|P|<59000$, and therefore represents the corresponding web pages better.

However, LS LA always performs better than all other algorithms and intersects only at the end of the data set at at $|P|=\sim475000$.

This suggests that LS LA scales better than all other algorithms when it comes to number of downloads versus the size of the site. Thus using a simple LA classifier has the best scalability properties in the LS.

Number of pages in the web site versus number of downloaded pages (90 randomly selected web sites per algorithm).

 

Number of pages downloaded versus number of links \analysis

This sections presents how the LS performs in respect to available links. Similar to the analysis in section detectservices:sec:numberofdownloadsversussize, this can be used to indicate the complexity of the algorithm.

Figure detectservices:figure:linksversusdownload shows a plot between the number of links analysed versus the number of pages downloaded for the LS NLLA where the LS reports that the target page has been found. For a just comparison the outliers have been remove. The outliers are evaluation results of web sites with more than 6000 anchor texts evaluated, but still very few pages downloaded.

Number of links discovered by the LS versus number of downloaded pages (427 web sites).

 

The solid black line shows a best fit regression.6 + 1.699117*10^{-02}*x - 1.820809*10^{-06}*x^2$, where $x$ is the number of evaluated links and $y$ is the number of downloaded pages. This regression has a p-value: $< 2.2*10^{-16}$.} The Pearson correlation between the data sets is 0.83.

Not surprisingly, there is a correlation between the number analysed links and downloaded pages. Despite intuition, correlation is better then linear. This indicates that the LS performs better in terms of number of downloaded pages when the number of available links increase.

Conclusion

 

This paper presents and analyses the use of the lost sheep algorithm with special focus on locating targeted information and services in government web sites. Finding government material online is a time consuming manual task part of most eGovernment surveys. Introducing the lost sheep algorithm reduces the human resources needed to carry out eGovernment surveys, which enables both more organisation to carry out surveys and larger sample of web sites to be part of existing surveys.

The lost sheep is a flexible classification based algorithm that uses a pre-classifier on link anchor texts to determine whether web pages should be downloaded and classified. This way the algorithm can locate services based on manually collected training data by only downloading a small number of pages. The lost sheep is classifier independent and can utilize the classifier that best fit the problem area.

The results in this paper show that the lost sheep outperforms all comparable algorithms. In the realistic task of locating information and services related to transparency in local government web sites, the lost sheep was able to reach an accuracy of 85% by only downloading in average 27 pages per site. In contrast, the best comparable algorithm reached an accuracy of 64% when downloading 135 pages.

Further work

In the further work additional properties of the lost sheep should be explored. This includes investigating other potential features than words in the anchor text and web page, such as text in the vicinity of the anchor text and URLs.

Additionally further work should include exploring the potential language independent features so that the lost sheep could be applied in eGovernment surveys that covers more languages. Such features could include available documents formats, telephone numbers, document IDs, size of the page, and so on.

Furthermore, how the algorithm performs with other classifiers and in other environments than what is presented in this paper should be explored.

Acknowledgement

The eGovMon project, http://www.egovmon.no, is co-funded by the Research Council of Norway under the VERDIKT program. Project no.: VERDIKT 183392/S10.

Appendix J.1 — Metrics

 

This section is cleaned up version of the analogy presented in [12]. It defines metrics for the exact match on rare item searches(EMRIS) based on widely accepted metrics from relevant fields, namely: web page classification, information retrieval and focused crawling.

Classification

 

Note that $TP$ could be written as $TP(S)$, but is for presentation reasons simplified.

The EMRIS can written as a binary classification problem:

The aim for web page classification is to maximize the number of web pages classified correctly (true positives TP, and true negatives TN), and minimize the number of web pages wrongly classified (false positives FP, and false negatives FN).

With this aim, classification algorithms are typically evaluated according to macro-precision, $P_m(S)$ and macro-recall $R_m(S)$ which for the binary classification problems can be written as following [50] where $P_m^1$ and $R_m^1$ refer to Group 1, and $P_m^2$ and $R_m^2$ refer to Group 2.

\begin{eqnarray}

P_m(S) = \frac{1}{2} \Big( \sum_{i=1}^{2} \frac{TP_i}{(TP_i + FP_i)} \Big) nbsp;\nonumber = \frac{1}{2} \Big( \frac{TP_1}{(TP_1 + FP_1)} + \frac{TP_2}{(TP_2 + FP_2)} \Big) \nonumber

&=& \frac{1}{2} \Big( P_m^1(S) + P_m^2(S) \Big)\\ 

R_m(S) = \frac{1}{2} \Big( \sum_{i=1}^{2} \frac{TP_i}{(TP_i + FN_i)} \Big)\nonumber = \frac{1}{2} \Big( \frac{TP_1}{(TP_1 + FN_1)} + \frac{TP_2}{(TP_2 + FN_2)} \Big)\nonumber

&=& \frac{1}{2} \Big( R_m^1(S) + R_m^2(S) \Big) 

 

\end{eqnarray}

Since EMRIS deliberately limits the pages to be classified as relevant to one, it separates itself from other classification problems as there are only two possible scenarios:

Scenario 1: The correct target page is found
 

Group 1: When a correct page has been found, it means that $p_t$ is correctly classified as relevant. This gives $TP_1=1$, $FP_1=0$ and $FN_1=0$.

Using equations detectservices:eq:macroprecisiondet and detectservices:eq:macrorecalldet gives:

\begin{eqnarray}

P_m^1(S) = \frac{TP_1}{(TP_1 + FP_1)} = 1,when TP_1=1\nonumber\\

R_m^1(S) = \frac{TP_1}{(TP_1 + FN_1)} = 1,,when TP_1=1 \nonumber

\end{eqnarray}

Group 2: Keeping in mind that there is only one relevant page in EMRIS, when the correct page has been found it means that all not relevant $N-1$ pages have also been correctly classified as not relevant. This gives that $TP_2=N-1$, $FP_2=0$ and $FN_2=0$.

Using equations detectservices:eq:macroprecisiondet and detectservices:eq:macrorecalldet gives:

\begin{eqnarray}

P_m^2(S) = \frac{TP_2}{(TP_2 + FP_2)} = \frac{N-1}{(N-1 + 0)}=1,when TP_1=1\nonumber\\

R_m^2(S) = \frac{TP_2}{(TP_2 + FN_2)} = \frac{N-1}{(N-1 + 0)}=1,when TP_1=1\nonumber

\end{eqnarray}

Scenario 2: The correct target page is not found
 

Group 1: When a correct page has not been found, it means that $p_t$ is wrongly classified as not relevant, which gives $TP_1=0$ and $FP_1=1$.

Using equations detectservices:eq:macroprecisiondet and detectservices:eq:macrorecalldet, if the selected page is not $p_t$, $TP_1=0$ this gives:

\begin{eqnarray}

P_m^1(S) = \frac{TP_1}{(TP_1 + FP_1)} = 0,when TP_1=0\nonumber\\

R_m^1(S) = \frac{TP_1}{(TP_1 + FN_1)} = 0,,when TP_1=0 \nonumber

\end{eqnarray}

Group 2: When $p_t$ is wrongly classified as not relevant it must be an FN. Still all but one of the not relevant pages are still correctly classified as not relevant. Similarly, another page must be wrongly classified as relevant which means that the target page must be a FP. This gives $TP_2 = N-2$, $FP_2=1$ and $FN_2=1$.

Using equations detectservices:eq:macroprecisiondet and detectservices:eq:macrorecalldet, if the selected page is not $p_t$, $TP_1=0$ this gives:

\begin{eqnarray}

P_m^2(S) = \frac{TP_2}{(TP_2 + FP_2)} = \frac{N-2}{(N-2 + 1)},when TP_1=0\nonumber\\

R_m^2(S) = \frac{TP_2}{(TP_2 + FN_2)} = \frac{N-2}{(N-2 + 1)},when TP_1=0\nonumber

\end{eqnarray}

$P_m( extbf{S
)$ and $R_m(S)$} 

Following the logic in section detectservices:sec:correctfound and detectservices:sec:correctnotfound as well as equation detectservices:eq:macroprecisiondet and detectservices:eq:macrorecalldet, $P_m(S)$ and $R_m(S)$ must, if the algorithm correctly classifies $p_t$ as relevant, be:

\begin{eqnarray}

P_m(S) = \frac{1}{2} \Big( P_m^1(S) + P_m^2(S) \Big)\nonumber = \frac{1}{2} \Big( 1+1 \Big) =1,when TP_1=1\nonumber R_m(S) = \frac{1}{2} \Big( R_m^1(S) + R_m^2(S) \Big)\nonumber

&=& \frac{1}{2} \Big( 1+1 \Big) =1,when TP_1=1\nonumber

\end{eqnarray}

If the algorithm wrongly classifies $p_t$ as not relevant $P_m(S)$ and $R_m(S)$ must be:

\begin{eqnarray}

P_m(S) = \frac{1}{2} \Big( P_m^1(S) + P_m^2(S) \Big) \nonumber = \frac{1}{2} \Big( 0+ \frac{N-2}{(N-2 + 1)} \Big)\nonumber = \frac{N-2}{2N-2},when TP_1=0\nonumber R_m(S) = \frac{1}{2} \Big( R_m^1(S) + R_m^2(S) \Big) \nonumber = \frac{1}{2} \Big( 0+ \frac{N-2}{(N-2 + 1)} \Big)\nonumber

&=&P_m(S),when TP_1=0\nonumber

\end{eqnarray}

This shows, for EMRIS, $P_m(S)$ and $R_m(S)$ are always identical. Furthermore, $P_m(S)$ and $R_m(S)$ can be derived directly from only knowing if $TP_1$ exists and the number of number pages in the directed graph, $N$.

Thus, equation detectservices:eq:macroprecision and detectservices:eq:macrorecall can for the EMRIS be simplified as following. If $TP_1$, then $R_m(S)=TP$, otherwise $R_m(S)=\frac{(|TP-1|)(N-2)}{2N-2}$, or written as following:

\begin{equation}

R_m(S) = P_m(S) = TP + \frac{(|TP-1|)(N-2)}{2N-2}

\end{equation}

Usefulness of $P_m( extbf{S
)$ and $R_m(S)$ for EMRIS} 

In addition to being simplified, despite the common usage in classification problems and used for EMRIS in, [13] it is noteworthy that $R_m(S)$ is not a good metric for the EMRIS. This can be shown by defining an algorithm which deliberately wrongly classify $p_t$ as not relevant. This is the worst algorithm possible as $TP_1$ will always be $0$ and it will always "guess wrong". Even for such an algorithm $R_m(S)$ will correctly classify $N-2$ pages as non-relevant, which is highly rewarded $R_m(S)$.

Thus, even for the worst possible algorithm large directed graphs will receive a score of $\frac{1}{2}$ for $R_m(S)$.

\begin{eqnarray}

R_m(S)\lim_{N \rightarrow \infty} = TP + \frac{(|TP-1|)(N-2)}{2N-2}\nonumber

&=& \frac{1}{2} ,when TP=0.

\end{eqnarray}

Accuracy

Another often used, but simpler, metric in the classification field is accuracy. It is defined as the percentage of $TP$ over the total number of tests applied. In EMRIS, there can at most be only one TP per site.

This further means that for more than one site

Accuracy will for the EMRIS be defined as the average $TP$ over multiple sites. :

\begin{eqnarray}

Accuracy = \sum_{i=0}^{i=T}TP(S_i)/N

\end{eqnarray}

where $S_i$ is one testes web site and $N$ is the number of sites tested.

Note that this is for multiple web sites. In EMRIS, for one $TP(S) \in \{0,1\}$, and, for all tests $T$, $Accuracy \in [0,1]$.

Keep in mind that $P_m(S),R_m(S)$, or averages of these over number of $N$, are considered more valuable metrics for classification problems as these also include $FN, FP,$ and $TN$ [50], but are not good metrics for EMRIS (see section detectservices:sec:usefulpmrm).

Information Retrieval

 

[13] presented evaluation metrics retrieved from various sources for the EMRIS:

[13] defines effectiveness as $F_1(S)=\frac{2 P_m(S) R_m(S)}{P_m(S)+R_m(S)}$, but failed to notice that $P_m(S)$ and $R_m(S)$ are identical for the EMRIS (see section detectservices:sec:realpmrm) which gives the following:

\begin{eqnarray}

F_1(S) = \frac{2 P_m(S) R_m(S)}{P_m(S)+R_m(S)}\nonumber

&=&

\frac{2 (R_m(S))^2}{2R_m(S)}\nonumber\\

&=& R_m(S) = P_m(S)

\end{eqnarray}

Thus for EMRIS $F_1(S) = R_m(S) = P_m(S)$

Additionally, [13] defined efficiency as the path length (number of downloaded pages) and duration time.

Focused crawling

 

There are two main metrics in focused crawling [51]:

\begin{description}

  • [Target Recall:] The fraction of relevant pages on the web downloaded by the crawler.
  • [Harvest Rate:] The fraction of crawled pages over those that are relevant.
  • \end{description}

    \begin{description}

  • [Recall:] Ratio of total number of relevant retrieved documents to the total number of existing relevant documents.
  • [Precision:] Ratio of total number of relevant retrieved documents to the total number of retrieved documents.
  • \end{description}

    Keeping in mind that the number of relevant pages in a directed graph is deliberately set to one for the EMRIS, the metrics can be simplified.

    Let $D(S)=\sum_{(\forall p_i \in P)} {visit(p_i)}$ be the number of downloaded pages and $R_m(S)$ be the number of relevant pages from directed graph $S$. This gives:

    \begin{eqnarray}

    Harvest Rate = \frac{D(S)}{R_m(S)}=\frac{D(S)}{1}=D(S)

    Total Recall &=& \frac{R_m(S)}{D(S)}=\frac{1}{D(S)}

    \end{eqnarray}

    Thus, from focused crawling, only the number of downloaded pages \\($\sum_{(\forall p_i \in S)} {visit(p_i)}$) is a relevant metric for the EMRIS.

    Relevant metrics

     

    As discussed in sections detectservices:sec:metricclassification, detectservices:sec:ir and detectservices:sec:focused, the relevant functions are $TP(S)$ (the number of true positives, accuracy) and $\sum_{(\forall p_i \in S)} {visit(p_i)}$ (the number of downloaded pages).

    This is not surprising as it is in line with the definition of the problem. However, this section shows that the results are denoted directly from the existing metrics in related fields. It is in contrast to papers in the literature, e.g. [13], which use $P_m(S)$, $R_m(S)$ and $F_1(S)$.

    Footnotes

    1. [This paper is in submission for journal publication. International Journal of Information Technology \& Decision Making]

    2. [The lost sheep is extended and presented in full detail in section detectservices:sec:proposedsolution]

    3. [In some classification problems the stop words do not introduce noise, but have discriminatory effects. An example of this is language detection.]

    4. [Note that a service on a web site is not the same as a web service. A service is a governmental application which could be available on the web, while a web service is an web application programming interface.]

    5. [Examples of complexities include growing linearly, exponentially, and so on.]

    6. [The formula for the best fit regression is: $y = 7.834322*10^{-01}]

    Bibliography

    [1] eGovernment measurement for policy makers,Millard, J.,European Journal of ePractice,2008,4

    [2] Is E-Government Leading to More Accountable and Transparent Local Governments? An Overall View,Pina, V. and Torres, L. and Royo, S.,Financial Accountability \& Management,3--20,2010,26,1

    [3] Testfakta kommunetest januar 2011,The Consumer Council of Norway,2011,Retrieved March 23rd, 2011, from http://forbrukerportalen.no/Artikler/2011/testfakta_kommunetest_januar_2011

    [4] Global E-Government Survey 2010, Leveraging E-government at a Time of Financial and Economic Crisis,United Nations Department of Economic and Social Affairs,Retrieved May 11th, 2010, from http://www2.unpan.org/egovkb/global_reports/10report.htm,February

    [5] Digitizing Public Services in Europe: Putting Ambition into Action,Capgemini,Retrieved March 16th, 2011, from http://www.capgemini.com/insights-and-resources/by-publication/2010-egovernment-benchmark/

    [6] Benchmarking e-Government - A Comparative Review of Three International Benchmarking Studies,Lasse Berntzen and Morten Goodwin Olsen,International Conference on the Digital Society,77-82,2009,0,978-0-7695-3526-5,http://doi.ieeecomputersociety.org/10.1109/ICDS.2009.55,Los Alamitos, CA, USA

    [7] Understanding and measuring eGovernment: international benchmarking studies,Heeks, R.,27--28,2006,UNDESA workshop,“E-Participation and E-Government: Understanding the Present and Creating the Future”, Budapest, Hungary

    [8] Web Crawling,Olston, C. and Najork, M.,Information Retrieval,175--246,2010,4,3

    [9] Web mining: a survey of current research, techniques, and software,Zhang, Q. and Segall, R.S.,International Journal of Information Technology and Decision Making,683--720,2008,7,4,0219-6220

    [10] Global Web Accessibility Analysis of National Government Portals and Ministry Web Sites,Goodwin, M. and Susar, D. and Nietzio, A. and Snaprud, M. and Jensen, C.S.,Journal of Information Technology \& Politics,41--67,2011,8,1,1933-1681

    [11] A large-scale study of robots.txt,Sun, Yang and Zhuang, Ziming and Giles, C. Lee,1123--1124,2007,WWW '07: Proceedings of the 16th international conference on World Wide Web,978-1-59593-654-7,http://doi.acm.org/10.1145/1242572.1242726,New York, NY, USA,Banff, Alberta, Canada

    [12] A solution to the exact match on rare item searches,Goodwin, M.,1,2011,Proceedings of International Conference of Web Intelligence, Mining and Semantics

    [13] Scalability of findability: effective and efficient IR operations in large information networks,Ke, W. and Mostafa, J.,74--81,2010,Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval,ACM

    [14] Web structure mining: an introduction,da Costa Jr, MG and Gong, Z.,6--pp,2005,Information Acquisition, 2005 IEEE International Conference on,IEEE,0780393031

    [15] An AI framework for the automatic assessment of e-government forms,Chun, A.H.W.,AI Magazine,52,2008,29,1,0738-4602

    [16] Topical locality in the Web,Davison, B.D.,272--279,2000,Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval,ACM,1581132263

    [17] Data mining for hypertext: A tutorial survey,Chakrabarti, S.,ACM SIGKDD Explorations Newsletter,1--11,2000,1,2

    [18] Mapping the semantics of web text and links,Menczer, F.,IEEE Internet Computing,27--36,2005,9,3

    [19] Stochastic models for the web graph,Kumar, R. and Raghavan, P. and Rajagopalan, S. and Sivakumar, D. and Tomkins, A. and Upfal, E.,57--65,2002,Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on,IEEE

    [20] Data mining on the web,Greening, D.R.,Web Techniques,2000,5,1

    [21] Information retrieval in distributed hypertexts,De Bra, P. and Houben, G.J. and Kornatzky, Y. and Post, R.,481--491,1994,Proceedings of the 4th RIAO Conference

    [22] The shark-search algorithm. An application: tailored Web site mapping,Hersovici, M. and Jacovi, M. and Maarek, Y.S. and Pelleg, D. and Shtalhaim, M. and Ur, S.,Computer Networks and ISDN Systems,317--326,1998,30,1-7,0169-7552

    [23] Ant Focused Crawling Algorithm,Dziwinski, P. and Rutkowska, D.,Artificial Intelligence and Soft Computing--ICAISC 2008,1018--1028,2008

    [24] High-performance web crawling,Najork, M. and Heydon, A.,Handbook of massive data sets,2002

    [25] Focused crawling guided by link context,Dong, J. and Zuo, W. and Peng, T.,365--369,2006,Proceedings of the 24th IASTED international conference on Artificial intelligence and applications,ACTA Press,0889865566

    [26] Learning automata based classifier,Zahiri, S.H.,Pattern Recognition Letters,40--48,2008,29,1,0167-8655

    [27] Stochastic searching on the line and its applications to parameter learning in nonlinear optimization,Oommen, B.J.,Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,733--739,1997,27,4,1083-4419

    [28] Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation,Granmo, O.C. and Oommen, B.J. and Myrer, S.A. and Olsen, M.G.,Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,166--175,2007,37,1,1083-4419

    [29] Optimal sampling for estimation with constrained resources using a learning automaton-based solution for the nonlinear fractional knapsack problem,Granmo, Ole-Christoffer and Oommen, B. John,Applied Intelligence,3--20,2010,33,0924-669X,August,1,http://dx.doi.org/10.1007/s10489-010-0228-1,1851993,Hingham, MA, USA

    [30] Architecture for large-scale automatic web accessibility evaluation based on the UWEM methodology,,Norwegian Conference for Informatics (NIK),978-8-251-923-866,November

    [31] Data Mining for Web Intelligence,Han, Jiawei and Chang, Kevin,Computer,64--70,2002,35,0018-9162,November,11,10.1109/MC.2002.1046977,622106,Los Alamitos, CA, USA

    [32] Web Content Accessibility Guidelines (WCAG) 2.0,World Wide Web Consortium,Retrieved November 4th, 2009, from http://www.w3.org/TR/REC-WCAG20-20081211/

    [33] Web page classification without the web page,Kan, Min-Yen,262--263,WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers \& posters,1-58113-912-8,http://doi.acm.org/10.1145/1013367.1013426,New York, NY, USA,New York, NY, USA

    [34] Web page classification: Features and algorithms,Qi, X. and Davison, B.D.,ACM Computing Surveys (CSUR),1--31,2009,41,2,0360-0300

    [35] Learning to classify text using support vector machines: Methods, theory, and algorithms,Joachims, T.,Computational Linguistics,656--664,2002,29,4

    [36] N-gram based Text Categorization,N{\'a,2005

    [37] Mining the web for bilingual text,Resnik, P.,527--534,1999,Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics,Association for Computational Linguistics

    [38] Sentiment analysis in multiple languages: Feature selection for opinion classification in Web forums,Abbasi, A. and Chen, H. and Salem, A.,ACM Transactions on Information Systems (TOIS),1--34,2008,26,3,1046-8188

    [39] An extensive empirical study of feature selection metrics for text classification,Forman, George,J. Mach. Learn. Res.,1289--1305,2003,3,1532-4435,March,17,944974

    [40]

    [41] Detecting near-duplicates for web crawling,Manku, G.S. and Jain, A. and Das Sarma, A.,141--150,2007,Proceedings of the 16th international conference on World Wide Web,ACM

    [42] A larger scale study of robots.txt,Kolay, Santanu and D'Alberto, Paolo and Dasdan, Ali and Bhattacharjee, Arnab,1171--1172,2008,WWW '08: Proceeding of the 17th international conference on World Wide Web,978-1-60558-085-2,http://doi.acm.org/10.1145/1367497.1367711,New York, NY, USA,Beijing, China

    [43] e-Exclusion and Bot Rights: Legal aspects of the robots exclusion standard for public agencies and other public sector bodies with Swedish examples,Lundblad, N.,First Monday,2007,12,8-6

    [44] D2.1.2 State-of-the-art review: transparency indicators,Lasse Berntzen and Annika Nietzio and Morten Goodwin Olsen and Ahmed AbdelGawad and Mikael Snaprud,Retrieved 2009 from http://www.egovmon.no/en/downloads/

    [45] Web site usability, design, and performance metrics,Palmer, J.W.,Information systems research,151--167,2003,13,2

    [46] Optimizing Web Structures Using Web Mining Techniques,Jeffrey, Jonathan and Karski, Peter and Lohrmann, Björn and Kianmehr, Keivan and Alhajj, Reda,653-662,2007,4881,Intelligent Data Engineering and Automated Learning - IDEAL 2007,Computer Science Dept, University of Calgary, Calgary, Alberta Canada,Lecture Notes in Computer Science,Yin, Hujun and Tino, Peter and Corchado, Emilio and Byrne, Will and Yao, Xin

    [47] Testing the Three-Click Rule,Joshua Porter,2003,Retrieved April 1st, 2011, from http://www.uie.com/articles/three_click_rule/

    [48] Accessing the deep web: A survey,He, B. and Patel, M. and Zhang, Z. and Chang, K.C.C.,Communications of the ACM,95--101,2007,50,5

    [49] The W3C Markup Validation Service,World Wide Web Consortium,Retrieved April 8th, 2011, from http://validator.w3.org/

    [50] An N-Gram Based Approach to Automatically Identifying Web Page Genre,Jane E. Mason and Michael Shepherd and Jack Duffy,Hawaii International Conference on System Sciences,1-10,2009,0,978-0-7695-3450-3,http://doi.ieeecomputersociety.org/10.1109/HICSS.2009.581,Los Alamitos, CA, USA

    [51] Learning to crawl: Comparing classification schemes,Pant, Gautam and Srinivasan, Padmini,ACM Trans. Inf. Syst.,430--462,2005,23,4,1046-8188,http://doi.acm.org/10.1145/1095872.1095875,New York, NY, USA

    The author of this document is:
    Morten Goodwin
    E-mail address is:
    morten.goodwin [at] tingtun.no
    Phone is:
    +47 95 24 86 79

    Valid XHTML 1.0! Valid CSS! Checked by eGovMon