# Towards Automated eGovernment Monitoring

## Paper I - A solution to the exact match on rare item searches. Introduction of the Lost Sheep algorithm

Author: Morten Goodwin

This chapter was originally published as a scientific conference or journal paper. Please see the original source for the complete paper.

Original paper author: Morten Goodwin

Morten Goodwin is with Tingtun AS, Kirkekleiva 1, 4790 Lillesand, Norway, morten.goodwin@tingtun.no.

### Abstract

This paper proposes an approach for finding a single web page in a large web site or a cloud of web pages. We formalize this problem and map it to the exact match on rare item searches (EMRIS). The EMRIS is not much addressed in the literature, but many closely related problems exist. This paper presents a state-of-the-art survey of related problems in the fields of information retrieval, web page classification and directed search.

As a solution to the EMRIS, this paper presents an innovative algorithm called the lost sheep. The lost sheep is specifically designed to work in web sites consisting of links, link texts and web pages. It works as a pre-classifier on link texts to decide whether a web page is a candidate for further evaluation.

This paper also defines sound metrics to evaluate the EMRIS. The lost sheep outperforms all comparable algorithms both when it comes to maximizing accuracy and minimizing the number of downloaded pages.

Keywords: Crawling, Classification, Search.

### Introduction

The amount of information available online on public web sites is increasing drastically. Many public web sites host huge amounts of information on diverse topics, often in dynamic environments. Furthermore, the concept of web sites is becoming less important, as it is neither clear nor important to most users where one web site starts and another ends. In these highly dynamic web sites, it is not possible to keep all information indexed at all times, which is required by traditional crawlers and search engines. Additionally, for copyright or privacy reasons, many web site owners deliberately restrict crawlers from visiting parts of their web sites using the robot exclusion protocol [1].

The consequence of these emerging properties of the public web is that large parts of published information fall into the deep web and unindexable dynamic clouds, and can neither be found by crawlers nor indexed by major search engines. This leads to navigation challenges for users even when they know the area they are looking for, and critical web pages can be very difficult to find.

This paper is about finding a single web page in a web site when only one page is relevant. This is an especially challenging area of information retrieval named the exact match on rare item searches [2], referred to in this paper as the EMRIS.

In web page classification algorithms, web pages are often treated as single isolated entities. However, web pages have more information available than what is present in the page itself: each page is part of a structured web site. The literature shows that web pages which link to each other have a semantic correlation. For example, it is more likely that a web page about sport will link to another web page about sport than to a web page about the financial market [3]. Similarly, web pages within the same topics tend to link to each other [4]. Hence, for web page classification, the fact that one web page links to another is valuable information. This information is seldom utilized in web classification algorithms.

The paper is organized as follows. The EMRIS is not much addressed in the literature; a survey introducing the EMRIS and related algorithms is presented in section Related work. Section The lost sheep algorithm (LS) presents the lost sheep, a novel approach to the EMRIS. No sound metric to evaluate the EMRIS is available in the literature; section Performance metrics presents metrics from related fields and, based on these, defines metrics for the EMRIS. The empirical results are presented in section Experiments. Finally, the conclusion and future work are presented in sections Conclusion and Future work.

### Related work

This section presents the state-of-the-art on how web sites are modeled, on finding single web pages in web sites, and relevant algorithms within each field. The algorithms presented here are also implemented and the corresponding results presented in section Results from generated directed graph.

The terms used in the following algorithm descriptions in some cases differ from those in the original cited papers. This is done intentionally to keep the description consistent throughout the paper.

#### Website / directed graph

Web sites are often modeled as graphs of nodes connected with directed edges [5],[6],[7]. This is done to enable formal analysis and controlled empirical experiments on web sites and on the algorithms interacting with them. In these models, each web page is represented as a node, and the links between the web pages are represented as directed edges between the nodes. In addition, web pages have link texts that are intended to describe the corresponding web pages with short, informative texts [5]. Link texts are not addressed in the directed graph models. Because of this, this paper extends the existing models with labels containing natural language text connected to each edge.

In this paper, a web site is modeled as a directed graph of nodes connected with directed edges and labels $S_i = \{P_i, E_i,L_i\}$. Each web page is represented as a node $p_j$. $N$ is the number of nodes in the directed graph $S_i$ so that $P_i=\{p_0 ... p_N\}$.

The nodes are connected together with directed edges so that edge $e_{j,k} \in E_i$ is a link from node $p_j$ to $p_k$. Each edge has a corresponding label so that $l_{j,k} \in L_i$ is a natural language label connected to $e_{j,k}$ and observable from node $p_j$.

This is in line with existing models of web sites in the literature [5],[6],[7], but is extended with labels, and is empirically shown to represent real web sites (see section Generated directed graphs). Table 1 presents the comparison between the directed graph model and real web sites.

Figure 1 shows an example of a simple web site $s_1$ modeled as a directed graph. $s_1$ consists of four web pages $p_s,p_t,p_1,p_2$. The web page links consist of the directed edges $e_{s,t},e_{s,1},e_{s,2}$ and the corresponding labels $l_{s,t},l_{s,1},l_{s,2}$. These labels are observable from the starting node, $p_s$, without following any of the edges.
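The directed graph model and the example web site $s_1$ can be sketched as follows; the class and method names are illustrative and not part of the model definition:

```python
# A minimal sketch of the directed graph model S_i = {P_i, E_i, L_i}.
# Names (WebSite, add_page, add_link, outgoing) are illustrative only.

class WebSite:
    """A web site modeled as nodes (pages), directed edges (links),
    and natural language labels (link texts) attached to each edge."""

    def __init__(self):
        self.pages = {}   # P_i: node id -> page text
        self.edges = {}   # E_i and L_i: (j, k) -> label l_{j,k}

    def add_page(self, node_id, text=""):
        self.pages[node_id] = text

    def add_link(self, source, target, label):
        # Edge e_{j,k} with label l_{j,k}, observable from `source`.
        self.edges[(source, target)] = label

    def outgoing(self, node_id):
        # Labels observable from node_id without following any edge.
        return {k: lbl for (j, k), lbl in self.edges.items() if j == node_id}

# The example web site s_1 from figure 1:
s1 = WebSite()
for p in ["p_s", "p_t", "p_1", "p_2"]:
    s1.add_page(p)
s1.add_link("p_s", "p_t", "About us")
s1.add_link("p_s", "p_1", "Contact us")
s1.add_link("p_s", "p_2", "Image Gallery")

print(s1.outgoing("p_s"))
# {'p_t': 'About us', 'p_1': 'Contact us', 'p_2': 'Image Gallery'}
```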


| Web site model | Notation | Real web sites |
|----------------|-----------|----------------|
| Node | $p_j$ | Web page |
| Edge | $e_{j,k}$ | Link |
| Label | $l_{j,k}$ | Link text |
| Directed graph | $S_i$ | Web site |

Table 1. Mapping between information retrieval and web sites.

Figure 1. Web site example.

#### Exact match on rare item searches (EMRIS)

The EMRIS is a problem introduced by [2], focusing on finding a single web page in a large set of web pages, such as a web site or a cloud of web pages. It was introduced for the following reasons: it is particularly challenging to find a single node in a large directed graph, and the objectivity of approaches that classify large amounts of data has been questioned. The EMRIS has not previously been formally defined.

When a classification approach is developed with the intention to classify large amounts of data, it is not practically possible to first manually classify all data to perform controlled experiments. The common approach is therefore that the classifier is run with a relatively small set of manually classified training data. Subsequently, the classification results are manually verified. Unfortunately, certifying the results is not a deterministic approach; people tend to (subconsciously) give an unfair advantage to their own algorithms. In contrast, with the EMRIS, there is only one relevant node, and this can easily be defined prior to conducting experiments. Thus, as no manual verification of the results is needed, the objectivity of the algorithms can no longer be questioned.

A practical example of the EMRIS could be to find a target web page, $p_t$, with contact information, such as physical address and phone number. Neither the location nor properties of $p_t$ are known, only that the content of $p_t$ is related to contact information. To identify contact information let $q$ be the input query

$q=[contact,about,phone,address]$.

Let $a$ be a hypothetical algorithm with the query ($q$), directed graph ($s_1$ in figure 1) and starting node ($p_s$) as input, and the expected target node as output: $a(q,s_1,p_s) \in P$. Using the directed graph $s_1$, $a$ first downloads and parses node $p_s$. $a$ observes that $p_s$ has the edges $e_{s,t},e_{s,1},e_{s,2}$ and corresponding labels $l_{s,t},l_{s,1},l_{s,2}$.

$a$ knows, from examining the edges $e_{s,t},e_{s,1},e_{s,2}$, that the nodes $p_t,p_1,p_2$ exist. However, the contents of $p_t$, $p_1$ and $p_2$ are unknown. The goal is therefore to find which of $p_t$, $p_1$ and $p_2$ has contact information and is the target node.

Since $p_s$ has been observed, the following could be known and conclusions drawn:

• [$p_{t}$:] A node with the edge $e_{s,t}$ (from $p_s$ to $p_t$) is represented with the label $l_{s,t}="About us"$. Since "about" $\in q$, $a$ may conclude that $p_t$ is likely to contain contact information.
• [$p_{1}$:] A node with the edge $e_{s,1}$ (from $p_s$ to $p_1$) is represented with the label $l_{s,1}="Contact us"$. Since "contact" $\in q$, $a$ may conclude that $p_1$ is likely to contain contact information.
• [$p_{2}$:] A node with the edge $e_{s,2}$ (from $p_s$ to $p_2$) is represented with the label $l_{s,2}="Image Gallery"$. Since neither $"image" \in q$ nor $"gallery"\in q$, $a$ may conclude that $p_2$ is not likely to have contact information.

Most web page classification algorithms use an equal amount of resources on $p_{t}$, $p_{1}$ and $p_{2}$. However, keeping in mind that the labels describe the corresponding nodes [8], it is more likely that either $p_{t}$ or $p_{1}$ is the target node. Thus, since the aim is to find the target node and it is significantly more costly to download a node than to do a memory operation [9], it makes sense to focus most of the available resources on the likely target nodes, namely $p_{t}$ and $p_{1}$, instead of $p_{2}$.

To continue, let $a$ download $p_{t}$ and $p_{1}$. By comparing the content of $p_{t}$ and $p_{1}$ with $q$, $a$ may conclude that $p_t$ matches $q$ best. $a$ then returns $p_t$ as the node most probable to be the target node.
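The walkthrough above can be sketched as a two-step procedure; the word-overlap score, the page texts, and all names below are illustrative assumptions, not the paper's exact method:

```python
# Sketch of the hypothetical algorithm a(q, s_1, p_s): filter neighbors by
# label overlap with q, download only the candidates, return the best match.
# The overlap score and the page texts are illustrative assumptions.

q = ["contact", "about", "phone", "address"]

# s_1 from figure 1: labels observable from p_s, plus hypothetical page texts.
labels = {"p_t": "About us", "p_1": "Contact us", "p_2": "Image Gallery"}
pages = {
    "p_t": "about us our address and phone number",
    "p_1": "contact form",
    "p_2": "holiday pictures",
}

def overlap(text, query):
    # Number of query words occurring in the text.
    return len(set(text.lower().split()) & set(query))

# Step 1: pre-classify on labels; only candidates are ever downloaded.
candidates = [node for node, lbl in labels.items() if overlap(lbl, q) > 0]
assert candidates == ["p_t", "p_1"]  # p_2 is never downloaded

# Step 2: compare the downloaded contents with q; return the best match.
best = max(candidates, key=lambda node: overlap(pages[node], q))
print(best)  # p_t
```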

[2] introduced three algorithms to solve the EMRIS: similarity search (SIM), degree based search (DEG) and Similarity*Degree (SIMDEG).

##### Similarity based search (SIM)

The Similarity Based Search (SIM) compares the input query $q$ to the content of neighboring nodes, and follows the edge to the node which is most similar to $q$. SIM was used in [2] for the EMRIS.

For each node $p_j$ the SIM follows all edges available from $p_j$, $\{e_{j,0}, ... e_{j,n}\}$. It calculates a similarity score between $q$ and all nodes available from $\{e_{j,0}, ...$ $e_{j,n}\}$. The algorithm pursues the node with the largest calculated similarity score.

For SIM to work, the algorithm needs to be pulled towards the target node $p_t$, meaning that the nodes which lead towards $p_t$ should have a higher similarity score than the remaining nodes. Since $p_t$ should be similar to $q$, it will have a high similarity score. Similarly, nodes close to $p_t$ (parent nodes) will be similar to $p_t$ and thus also have a high similarity score. However, grandparent nodes (nodes linked to parent nodes of $p_t$) have been shown not to be similar to $p_t$ [3],[4]. Thus, we can expect a high similarity score for parent nodes, but not for grandparent or any other nodes.

This means that, until SIM reaches one of the parent nodes, the encountered nodes will not be similar to $q$ and the similarity score will not guide SIM towards $p_t$. As a result, when the algorithm is at a node far away from the target node, as expected in the beginning of a crawl, it will walk at random. This problem is not addressed in [2].
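A minimal sketch of SIM under these definitions, using word overlap as a stand-in for the similarity score; the graph encoding and scoring are illustrative assumptions:

```python
# Sketch of the similarity based search (SIM): from the current node,
# download all neighbors and move to the one most similar to q.
# Word overlap stands in for the similarity score (illustrative).

def sim_search(graph, page_text, q, start, max_steps=10):
    """graph: node -> list of neighbors; page_text: node -> text.
    Returns the final node and the number of downloads spent."""
    current, downloads = start, 0
    for _ in range(max_steps):
        neighbors = graph.get(current, [])
        if not neighbors:
            break
        downloads += len(neighbors)  # SIM downloads every neighbor
        current = max(
            neighbors,
            key=lambda n: len(set(page_text[n].lower().split()) & set(q)),
        )
    return current, downloads

graph = {"p_s": ["p_t", "p_1", "p_2"]}
texts = {"p_t": "about us phone address",
         "p_1": "contact form",
         "p_2": "holiday pictures"}
node, cost = sim_search(graph, texts,
                        ["contact", "about", "phone", "address"], "p_s")
print(node, cost)  # p_t 3
```

Note that when all neighbor scores are zero, `max` degenerates to an arbitrary pick, which mirrors the random-walk problem described above.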

##### Degree based search (DEG)

A Degree Based Search (DEG) uses the number of edges from neighboring nodes to guide the crawler. DEG was used in [2] for the EMRIS.

The DEG approach is similar to the SIM. However, instead of using the similarity score between $q$ and the nodes, DEG pursues the neighboring node with the largest number of edges. Thus, it assumes that $p_t$ is more likely to be linked from a node with many edges compared to a node with few.

##### Similarity*Degree Search (SIMDEG)

The Similarity*Degree Search (SIMDEG) combines SIM and DEG. The SIMDEG was used in [2] for the EMRIS.

[2] showed some advantages and disadvantages of the DEG and the SIM: the DEG assumes that $p_t$ is more likely to be linked from a node with many edges, while the SIM only uses the content of the nodes and $q$ to calculate which are most similar, without using the edges.

To utilize the advantages of both the SIM and the DEG, they were combined into the SIMDEG. For each neighboring node, the SIMDEG calculates the similarity score as well as the frequency of edges. The algorithm pursues the node with the highest similarity score times frequency of edges.

In practice this means that in the beginning of a crawl, when the algorithm is far away from $p_t$, the frequency of edges will be larger than the similarity score and the SIMDEG will be guided towards the nodes with the largest number of edges. When a node $p_j$ with an edge towards $p_t$ is reachable, the similarity score of $p_j$ will increase and become larger than the frequency of edges. In this situation the SIMDEG will crawl towards $p_j$, which is close to $p_t$.

It is therefore expected that the SIMDEG outperforms both the SIM and the DEG, which was shown in [2].
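The intended interplay between the two terms can be sketched as follows; the smoothing constant and the exact scoring form are illustrative assumptions, not the formulation in [2]:

```python
# Sketch of the SIMDEG neighbor score: similarity times edge frequency.
# The smoothing constant is an illustrative assumption so that the degree
# term can still guide the crawl when all similarity scores are zero, as
# is expected far away from the target node.

def simdeg_score(similarity, out_degree, total_edges, smoothing=0.01):
    edge_frequency = out_degree / total_edges
    return (similarity + smoothing) * edge_frequency

# Far from p_t: similarity scores are all zero, so the high-degree hub wins.
far = {"hub": (0.0, 40), "leaf": (0.0, 2)}
pick = max(far, key=lambda n: simdeg_score(*far[n], total_edges=42))
assert pick == "hub"

# Near p_t: a parent node's similarity dominates the degree term.
near = {"hub": (0.0, 40), "parent_of_pt": (0.9, 2)}
pick = max(near, key=lambda n: simdeg_score(*near[n], total_edges=42))
print(pick)  # parent_of_pt
```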

#### Web page classification and surrounding pages

The state-of-the-art classification algorithms for web pages originate from text classification. Since web pages are mostly text, it is natural that most web page classification algorithms are based on text classification algorithms.

Some algorithms utilize the tagged semantic information available in web pages such as title, heading information etc. Others use information available outside of the actual web page in a classification approach.

[10] used a portion of the parent web page (the web page linking to the page to be classified). This approach works so that for each page $p_j$ to be classified, a set of neighboring pages (pages linked to and from $p_j$) is downloaded. The approach distinguishes between trusted and untrusted links, where a trusted link is a link similar to the target page.

Similarly, several approaches have attempted to use the link information for classification of web pages [11],[12]. These approaches still center around a web page and try to find the corresponding in-links, as follows: for every web page $p_j$, search for other pages that link to $p_j$ and use the descriptive link text.

Others have used the link structure and link text for analyzing web sites [9],[13]. These approaches focus on classifying complete web sites, not individual web pages.

The approaches further require significant parts of the web sites, if not the entire web sites, to be downloaded and processed, which is impractical and often impossible [14].

The above shows that in-links are powerful for web mining. However, the approaches assume knowledge of the in-links to web pages. Knowing all in-links to a page within a web site requires significant resources [14]. Knowing all in-links from other web sites is in practice impossible. A practical solution would be to use search engines such as Google for finding in-links.

[8] introduces an approach which utilizes the topical correlation between linked web pages and takes advantage of the links to other pages for describing the current page. Hence, the approach shows that even though the links on a page $p_j$ point to other pages, they can still be used to describe the content of the page $p_j$.

Despite this, existing web page classification approaches do not consider that web pages which are structured together can be classified in a sequential manner, as is common in focused crawling, where the link descriptions are used to decide whether web pages should be downloaded.

Similarly, no web page classification approach has been implemented with a pre-classification on the link texts to decide if a page should be downloaded.

There is evidence that, given a web page $p_j$, the pages linking to $p_j$ (parent pages) can be useful for classification of $p_j$. However, pages which link to parent pages of $p_j$ (grandparent pages) are not useful, as they are most often not topically related to $p_j$ [3],[4].

Traditional classification algorithms are not created to work in directed graphs. Thus, they do not have functionality to prioritize which nodes to visit.

Note that the main purpose of this paper is not to find which classification algorithms work best, but to find a practical solution to the EMRIS. Because of this, only a few algorithms are presented in this section.

##### Hidden Markov Model Classifier (HMM)

This section presents a specific type of the HMM called the Learning Automaton (LA). More specifically, the LA presented here is a two-action, multiple-state Tsetlin Learning Automaton. This is an LA with no absorbing states; it is accordingly ergodic and supports dynamic environments, such as web sites [15],[16].

This LA can be used as a binary classifier with the two actions $A = \{\beta_1,\beta_2\}$ and $S$ states. The model has a beginning state $B \in S$ and a quit state $Q \in S$, as can be seen in figure 2.

Each LA is connected to a node and has the label text and node text available. Initially, stemming is performed on the label and node text. The state transitions follow a reward-penalty scheme: whenever a word matches $q$, $s_j(t + 1) := s_j(t) + 1$; otherwise, $s_j(t + 1) := s_j(t) - 1$. Thus, whenever the classifier detects a word which is not in $q$, the model moves one state closer to the quit state $Q$. Similarly, whenever the classifier detects a word which is in the query, the model moves one state closer to the state $B$. Thus, the LA closest to the state $B$ is most probably connected to the target node, and the LA closest to $Q$ is least likely to be.

The transitions are done by comparing the detected words with $q$, and the number of comparisons is counted by $t$. Note that $t$ is not used by the LA itself, but by the HMM Twofold (section Hidden Markov Model Classifier Two Fold (HMM2F)) and the LS (section The lost sheep algorithm (LS)).

Figure 2. Hidden Markov Model
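The LA state transitions can be sketched as follows; the number of states and the word handling (no stemming or stop word removal) are illustrative simplifications:

```python
# Sketch of the two-action Tsetlin-style LA used as a binary classifier.
# State 0 is the quit state Q and state num_states-1 is the begin state B:
# a word in q moves one state towards B, any other word one towards Q.
# The state count and starting state are illustrative assumptions.

def run_automaton(words, q, num_states=4):
    """Return (comparisons t, final state); quits early on reaching Q."""
    state = num_states - 1  # start in the begin state B
    t = 0
    for w in words:
        t += 1
        if w in q:
            state = min(state + 1, num_states - 1)  # reward: towards B
        else:
            state -= 1                              # penalty: towards Q
        if state == 0:                              # reached the quit state
            return t, state
    return t, state

q = {"contact", "about", "phone", "address"}
print(run_automaton("image gallery of our holiday".split(), q))  # (3, 0)
print(run_automaton("contact us by phone".split(), q))           # (4, 2)
```

The second automaton survives all its words (larger $t$), which is exactly the property the lost sheep exploits to rank candidate nodes.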

##### Hidden Markov Model Classifier Two Fold (HMM2F)

When using the HMM, there is often a choice between accuracy and speed. By choosing many hidden states, the classifier will be able to reach a high accuracy, but will use many transitions to do so. In contrast, by using few hidden states, it is possible to converge quickly, but a high accuracy will not be reached [17].

Instead of letting the HMM move in a linear space, it has been shown that by letting the number of states moved in each transition vary, both quick convergence and high accuracy can be reached. This is done so that if $t=0$ it moves $S/2$ states, if $t=1$ it moves $S/4$ states, and so on [18].
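The varying step size can be sketched as follows; the state count $S$ and the floor of one state per move are illustrative assumptions:

```python
# Sketch of the twofold (HMM2F) transition: the step size halves with each
# comparison t, i.e. S/2, S/4, ..., giving quick coarse convergence first
# and fine-grained moves later. Flooring at one state is an assumption.

def twofold_step(t, S):
    """Number of states to move at comparison t (t = 0, 1, 2, ...)."""
    return max(S // (2 ** (t + 1)), 1)

S = 64
print([twofold_step(t, S) for t in range(7)])
# [32, 16, 8, 4, 2, 1, 1]
```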

#### Focused crawling

A focused crawler is a crawler that aims at downloading web pages of a specific predefined topic. The aim is to harvest as much relevant information as possible, maximizing the harvest rate and minimizing the loss rate.

There are some clear similarities between a focused crawl and the EMRIS, namely finding a subset of nodes in a directed graph based on relevance.

However, in contrast to the EMRIS, it is of no interest for a focused crawler to find only one node with relevant information, as this would neither maximize the harvest rate nor minimize the loss rate [19],[20].

##### Content based focused crawler using query input

There are many different strategies used for focused crawling. Of these, the content based crawlers have been shown to be the most efficient [21].

Most focused crawlers use training data specifically designed for focused crawling or user input. Unfortunately, this is not available in the directed graph (see section Website / directed graph).

[22] presented a content based focused crawler using query keywords as input. The crawler bases itself on a queue of nodes prioritized according to a satisfactory number. The satisfactory number for a node $p_j$ is calculated as the number of relevant nodes with edges linking towards $p_j$. A node is relevant if its content has at least one word in common with the input query $q$.

For this approach to work, there has to be knowledge of the nodes linking to $p_j$. This is not possible from the beginning of a crawl; thus, the approach assumes knowledge of the directed graph which is not available. This is even acknowledged by [22], which assumes that nodes have been downloaded prior to the crawl.
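The priority queue driven by satisfactory numbers can be sketched as follows; the in-link map, page texts, and relevance test are illustrative assumptions, since in practice this in-link knowledge is precisely what is unavailable:

```python
# Sketch of the crawler of [22]: nodes are prioritized by a satisfactory
# number, the count of relevant in-linking nodes, where a node is relevant
# if it shares at least one word with q. All data below is illustrative.

import heapq

def satisfactory_numbers(in_links, page_text, q):
    """in_links: node -> list of nodes linking to it (assumed known)."""
    def relevant(node):
        return len(set(page_text.get(node, "").lower().split()) & set(q)) > 0
    return {node: sum(relevant(p) for p in parents)
            for node, parents in in_links.items()}

in_links = {"p_a": ["p_1", "p_2"], "p_b": ["p_3"]}
texts = {"p_1": "contact details",
         "p_2": "contact form phone",
         "p_3": "about our phone"}
scores = satisfactory_numbers(in_links, texts, ["contact", "about", "phone"])

# Max-heap via negated scores: the most satisfactory node is crawled first.
queue = [(-s, node) for node, s in scores.items()]
heapq.heapify(queue)
print(heapq.heappop(queue)[1])  # p_a
```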

### The lost sheep algorithm (LS)

This section proposes a new algorithm for the EMRIS, named the lost sheep (LS) after the story it resembles. The LS is not a classifier, but a classification framework for the EMRIS. It can be seen as a pre-classifier on labels to decide whether a node is a candidate for further evaluation and should be downloaded.

The algorithm works as follows. A herder has a number of sheep which run out on a web site. The herder releases all sheep at the same time. Each sheep runs for a while and then returns to the herder. A sheep which discovers something interesting will spend a longer time than a sheep which does not. The last sheep to return to the herder, the lost sheep, has discovered the most interesting information. The herder, which has perfect knowledge of which nodes the sheep found, can then choose either to accept the node the lost sheep found as the most relevant node, or to move to that node and repeat the process. The LS is formally described in section Lost sheep description.

Figure 3 presents an example of the lost sheep algorithm. The figure shows a herder releasing 5 sheep $\{x_1,...,x_5\}$. $x_2$, $x_3$ and $x_5$ examine the labels and choose not to pursue the nodes further. However, $x_1$ and $x_4$ choose to follow their corresponding edges further, which means that either $x_1$ or $x_4$ finds the most interesting information and will be the lost sheep.

Note that the duration is typically not a duration in time, as this could be biased, but a number indicating the amount of interesting information found.

This has the advantage that the algorithm does not spend resources on classifying nodes which can, with reasonable certainty, be ruled out. Instead, the resources are spent on the parts of the directed graph which have the potential to contain the target node; in other words, the parts of the web site closest to the correct solution.

The authors are of course aware of the "no free lunch" theorem, which states that classifiers cannot learn what is not available in the training data and that there is no algorithm that works best for all cases [23]. This means that which algorithm works best depends on the properties of the training data and the available features. Because of the "no free lunch" theorem, the LS can be generalized to use any classification algorithm which can interact with web sites. For practical reasons, this paper only presents a specific version of the lost sheep using a hidden Markov model classifier (described in section Hidden Markov Model Classifier (HMM)). This description can be used directly as an implementation specification.
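The herder-and-sheep loop can be sketched as follows; the duration counter based on query-word counts is an illustrative stand-in for the HMM-based version formalized in the next section, and all names are hypothetical:

```python
# Sketch of the lost sheep idea: the herder releases one sheep per outgoing
# edge; each sheep returns a "duration" (how much query-relevant material it
# saw), and the herder follows the sheep with the longest duration. The
# duration counter here is an illustrative stand-in for the HMM classifier.

def release_sheep(label, page_text, q):
    # A sheep first reads the label, and only fetches the page text if the
    # label looks promising; its duration grows with every matching word.
    duration = sum(w in q for w in label.lower().split())
    if duration > 0:
        duration += sum(w in q for w in page_text.lower().split())
    return duration

def herder(edges, q):
    """edges: node -> (label, page text). Returns the lost sheep's node."""
    durations = {node: release_sheep(lbl, txt, q)
                 for node, (lbl, txt) in edges.items()}
    return max(durations, key=durations.get)

edges = {
    "p_t": ("About us", "about our phone and address"),
    "p_1": ("Contact us", "contact form"),
    "p_2": ("Image Gallery", "holiday pictures"),
}
print(herder(edges, {"contact", "about", "phone", "address"}))  # p_t
```

Note that `p_2` is never downloaded: its label alone rules it out, which is the resource saving described above.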

#### Lost sheep description

This section presents the specific LS using a hidden Markov model classifier (for details on the classifier, see section Hidden Markov Model Classifier (HMM)). The algorithm works on a single directed graph $S_i$.

• Let the input query be a list of words: $q$.
• Let a threshold indicating when the algorithm should stop be set: $\text{threshold} \in [0,1]$.
• Let a max depth to prevent infinite loops be set: $\text{maxdepth}>0$, and let $\text{numruns}:=0$.
• Let $f(x_j)=t$ at all times, and let $t:=0$ before the LS starts.
Figure 3. Lost sheep algorithm

Lost sheep with a hidden Markov model classifier

1 Let $p_s$ be the start node with the edges $E = \{e_{s,0}... e_{s,n}\}$ and corresponding labels $L = \{l_{s,0}... l_{s,n}\}$ directly observable, where $n$ is the number of edges from $p_s$.

2 The herder releases $n$ sheep $X = \{x_0, ..., x_n\}$.

3 for each $x_j \in X$

3.1 Extract the edge label $l_{s,j}$.

3.2 Apply stemming to $l_{s,j}$.

3.3 Remove stop words from $l_{s,j}$.

3.4 Split $l_{s,j}$ into a list of words: $W_{s,j}$

3.5 for each word $w_{s,j,k} \in W_{s,j}$.

3.5.1 $t:= t +1$

3.5.2 if $w_{s,j,k} \in q, x_j(t + 1) := x_j(t) + 1$,

otherwise $x_j(t + 1) := x_j(t) - 1$.

3.5.3 if $x_j(t) = Q$, $f(x_j) := t$, return.

3.6 Extract the web page text $p_{j}$.

3.7 Apply stemming to $p_{j}$.

3.8 Remove stop words from $p_{j}$.

3.9 Split $p_{j}$ into a list of words: $W_{j}$

3.10 for each word $w_{j,k} \in W_{j}$.

3.10.1 $t:= t + 1$

3.10.2 if $w_{j,k} \in q, x_j(t + 1) := x_j(t) + 1$,

otherwise $x_j(t + 1) := x_j(t) - 1$.

3.10.3 if $x_j(t) = Q$, $f(x_j) := t$, return.

4 numruns := numruns +1

5 Find $x_{best} \in X$ where $f(x_{best}) = \max_{x_j}f(x_j)$

6 If $\frac{f(x_{best})}{\sum_{j=0}^n{f(x_j)}} < \text{threshold}$ and $\text{numruns} < \text{maxdepth}$, the herder moves to the node found by $x_{best}$ and the algorithm repeats from step 1. Otherwise, the node found by $x_{best}$ is returned as the most probable target node.

### Performance metrics

No satisfactory metrics for the EMRIS exist. This section defines metrics for the EMRIS based on both existing EMRIS metrics [2] and widely accepted metrics from relevant fields, namely web page classification, information retrieval and focused crawling.

#### Classification

Web classification algorithms aim at maximizing the number of pages classified correctly (True Positives, TP, and True Negatives, TN) as well as minimizing the number of web pages wrongly classified (False Positives, FP, and False Negatives, FN). Classification algorithms are typically evaluated according to macro-precision, $P_m(S_i)$, and macro-recall, $R_m(S_i)$, for binary classification problems as follows [24].

• $P_m(S_i)= \frac{1}{2} \Big( \sum_{i=1}^{2} \frac{TP_i}{(TP_i + FP_i)} \Big)$ (Equation 1)
• $R_m(S_i) =\frac{1}{2} \Big( \sum_{i=1}^{2} \frac{TP_i}{(TP_i + FN_i)} \Big)$ (Equation 2)

For $P_m(S_i)$ and $R_m(S_i)$ to be applicable, the EMRIS can be written as a binary classification problem:

1. Classify the target node $p_t$ as relevant ($P_m^1$ and $R_m^1$).
2. Classify all $N-1$ non-relevant nodes ($\{p_j \in P : p_j \neq p_t\}$) as non-relevant ($P_m^2$ and $R_m^2$).

Note that $i$ in equations 1 and 2 refers to the different classification problems. For the EMRIS, one of the classification problems is to classify the relevant node as relevant (problem 1: $P_m^1$, $R_m^1$, $TP_1$, $TN_1$, $FP_1$ and $FN_1$); the other is to classify all non-relevant nodes as non-relevant (problem 2: $P_m^2$, $R_m^2$, $TP_2$, $TN_2$, $FP_2$ and $FN_2$). To simplify the algorithms, $P_m(S_i)$ is split into $P_m^1(S_i)$ and $P_m^2(S_i)$, and $R_m(S_i)$ into $R_m^1(S_i)$ and $R_m^2(S_i)$:

• $P_m(S_i) = \frac{1}{2} \Big( \frac{TP_1}{(TP_1 + FP_1)} + \frac{TP_2}{(TP_2 + FP_2)} \Big)= \frac{1}{2} \Big( P_m^1(S_i) + P_m^2(S_i) \Big)$ (Equation 3)
• $R_m(S_i) = \frac{1}{2} \Big( \frac{TP_1}{(TP_1 + FN_1)} + \frac{TP_2}{(TP_2 + FN_2)} \Big) = \frac{1}{2} \Big( R_m^1(S_i) + R_m^2(S_i) \Big)$ (Equation 4)

Since the EMRIS is about finding only one node, there are, in contrast to most other binary classification problems, only two possible scenarios for the EMRIS: the correct node has been found (section The correct node has been found) or the correct node has not been found (section The correct node has not been found). This means that even though both metrics $P_m(S_i)$ and $R_m(S_i)$ are relevant for binary classification problems, they can be simplified for the EMRIS.

Sections The correct node has been found and The correct node has not been found present all possible outcomes of $P_m(S_i)$ and $R_m(S_i)$ for the EMRIS. Using all possible outcomes, section $P_m(S_i)$ and $R_m(S_i)$ presents the simplified $P_m(S_i)$ and $R_m(S_i)$.

##### The correct node has been found

When a correct node has been found, it means that $p_t$ is correctly classified as relevant. This gives $TP_1=1$, $FP_1=0$ and $FN_1=0$.

Keeping in mind that there is only one relevant node in the EMRIS, when the correct node has been found it means that all $N-1$ non-relevant nodes have also been correctly classified as non-relevant. This gives $TP_2=N-1$, $FP_2=0$ and $FN_2=0$.

Using equations 3 and 4 gives $P_m(S_i) = \frac{1}{2}(1 + 1) = 1$ and $R_m(S_i) = \frac{1}{2}(1 + 1) = 1$.
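The found-correct case of equations 3 and 4 can be checked numerically with a small sketch; the web site size $N$ below is an illustrative value:

```python
# Sketch of equations 3 and 4 for the case where the correct node has been
# found: TP_1 = 1, FP_1 = FN_1 = 0 and TP_2 = N-1, FP_2 = FN_2 = 0.

def macro_precision(tp1, fp1, tp2, fp2):
    return 0.5 * (tp1 / (tp1 + fp1) + tp2 / (tp2 + fp2))

def macro_recall(tp1, fn1, tp2, fn2):
    return 0.5 * (tp1 / (tp1 + fn1) + tp2 / (tp2 + fn2))

N = 100  # illustrative number of nodes in the directed graph
print(macro_precision(1, 0, N - 1, 0), macro_recall(1, 0, N - 1, 0))
# 1.0 1.0
```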

## Bibliography

[1] Lundblad, N., "e-Exclusion and Bot Rights: Legal aspects of the robots exclusion standard for public agencies and other public sector bodies with Swedish examples," First Monday, vol. 12, no. 8, 2007.

[2] Ke, W. and Mostafa, J., "Scalability of findability: effective and efficient IR operations in large information networks," Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 74–81, 2010.

[3] Chakrabarti, S., "Data mining for hypertext: A tutorial survey," ACM SIGKDD Explorations Newsletter, vol. 1, no. 2, pp. 1–11, 2000.

[4] Menczer, F., "Mapping the semantics of web text and links," IEEE Internet Computing, vol. 9, no. 3, pp. 27–36, 2005.

[5] Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A.S., "The web as a graph: Measurements, models, and methods," Proceedings of the 5th Annual International Conference on Computing and Combinatorics, Springer-Verlag, pp. 1–17, 1999.

[6] Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. and Upfal, E., "Stochastic models for the web graph," Proceedings of the 41st Annual Symposium on Foundations of Computer Science, IEEE, pp. 57–65, 2000.

[7] Mehler, A., Dehmer, M. and Gleim, R., "Towards logical hypertext structure: A graph-theoretic perspective," Lecture Notes in Computer Science, vol. 3473, p. 136, 2006.

[8] Qi, X. and Davison, B.D., "Web page classification: Features and algorithms," ACM Computing Surveys, vol. 41, no. 2, pp. 1–31, 2009. http://doi.acm.org/10.1145/1459352.1459357

[9] Ester, M., Kriegel, H.P. and Schubert, M., "Web site mining: a new way to spot competitors, customers and suppliers in the world wide web," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, p. 258, 2002.

[10] Oh, H.J., Myaeng, S.H. and Lee, M.H., "A practical hypertext categorization method using links and incrementally available class information," Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 264–271, 2000.

[11] Fürnkranz, J., "Hyperlink ensembles: A case study in hypertext classification," Information Fusion, vol. 3, no. 4, pp. 299–312, 2002.

[12] Sun, A., Lim, E.P. and Ng, W.K., "Web classification using support vector machine," Proceedings of the 4th International Workshop on Web Information and Data Management, ACM, pp. 96–99, 2002.

[13] Tian, Y.H., Huang, T.J. and Gao, W., "Two-phase Web site classification based on Hidden Markov Tree models," Web Intelligence and Agent Systems, vol. 2, no. 4, pp. 249–264, 2004.

[14] "Architecture for large-scale automatic web accessibility evaluation based on the UWEM methodology," Norwegian Conference for Informatics (NIK), November. ISBN 978-8-251-923-866.

[15] Sastry, P.S. and Thathachar, M.A.L., "Learning automata algorithms for pattern classification," Sadhana, vol. 24, no. 4, pp. 261–292, 1999.

[16] Pal, S.K. and Pal, A., Pattern Recognition: From Classical to Modern Approaches, 2001. ISBN 9810246846.

[17] Granmo, O.C., Oommen, B.J., Myrer, S.A. and Olsen, M.G., "Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 37, no. 1, pp. 166–175, 2007.

[18] Granmo, O.C. and Oommen, B., "A hierarchy of twofold resource allocation automata supporting optimal sampling," Next-Generation Applied Intelligence, pp. 523–534, 2009.

[19] Chakrabarti, S., Van den Berg, M. and Dom, B., "Focused crawling: a new approach to topic-specific Web resource discovery," Computer Networks, vol. 31, no. 11-16, pp. 1623–1640, 1999.

[20] Novak, B., "A survey of focused web crawling algorithms," Proceedings of SIKDD, pp. 55–58, 2004.

[21] Liu, H., Milios, E. and Janssen, J., "Probabilistic models for focused web crawling," Proceedings of the 6th Annual ACM International Workshop on Web Information and Data Management, ACM, pp. 16–22, 2004.

[22] Aggarwal, C.C., Al-Garawi, F. and Yu, P.S., "Intelligent crawling on the World Wide Web with arbitrary predicates," Proceedings of the 10th International Conference on World Wide Web, ACM, pp. 96–105, 2001.

[23] Culberson, J.C., "On the futility of blind search: An algorithmic view of 'no free lunch'," Evolutionary Computation, vol. 6, no. 2, pp. 109–127, 1998.

[24] Mason, J.E., Shepherd, M. and Duffy, J., "An N-gram based approach to automatically identifying web page genre," Hawaii International Conference on System Sciences, pp. 1–10, 2009. http://doi.ieeecomputersociety.org/10.1109/HICSS.2009.581

[25] Pant, G. and Srinivasan, P., "Learning to crawl: Comparing classification schemes," ACM Transactions on Information Systems, vol. 23, no. 4, pp. 430–462, 2005. http://doi.acm.org/10.1145/1095872.1095875

[26] Bharat, K., "Patterns on the Web," String Processing and Information Retrieval: 10th International Symposium, SPIRE 2003, Manaus, Brazil, October 8-10, 2003, Springer-Verlag, p. 1, 2003.

[27] Adamic, L.A. and Huberman, B.A., "Zipf's law and the Internet," Glottometrics, vol. 3, no. 1, pp. 143–150, 2002.

[28] Norwegian Agency for Public Management and eGovernment (Difi), "Los, Ein informasjonsstruktur for offentlege tenester," 2010. Retrieved October 29th, 2010, from http://los.difi.no/

[29] Olsen, M., Snaprud, M. and Nietzio, A., "Automatic checking of alternative texts on web pages," Computers Helping People with Special Needs, pp. 425–432, 2010.

[30] Najork, M. and Heydon, A., "High-performance web crawling," Handbook of Massive Data Sets, 2002.
