Towards Automated eGovernment Monitoring

Paper D - Automatic checking of alternative texts on web pages.

  

Author: Morten Goodwin

This chapter was originally published in the Proceedings of the International Conference on Computers Helping People with Special Needs 2010. Please see the original source for the complete paper.1

Original paper authors: Morten Goodwin Olsen, Mikael Snaprud, and Annika Nietzio

Morten Goodwin Olsen and Mikael Snaprud are with Tingtun AS, Kirkekleiva 1, 4790 Lillesand, Norway, email: morten.goodwin@tingtun.no and mikael.snaprud@tingtun.no.

Annika Nietzio is with Forschungsinstitut Technologie und Behinderung (FTB) der Evangelischen Stiftung Volmarstein, Grundschötteler Str. 40, 58300 Wetter (Ruhr), Germany, email: egovmon@ftb-net.de.

Abstract

For people who cannot see non-textual web content, such as images, maps or audio files, alternative texts are crucial for understanding and using the content. Alternative texts are often automatically generated by web publishing software or not properly provided by the author of the content. Such texts may impose web accessibility barriers. Automatic accessibility checkers in use today can only detect the presence of alternative texts, but cannot determine whether the text describes the corresponding content in any useful way. This paper presents a pattern recognition approach for automatic detection of alternative texts that may impose a barrier, reaching an accuracy of more than 90%.

Introduction

The Unified Web Evaluation Methodology (UWEM) [1],[2] has been presented as a methodology for evaluating web sites according to the Web Content Accessibility Guidelines [3]. The UWEM includes both tests which can be applied manually by experts and tests which can be applied automatically by measurement tools and validators.

All automatic tests in the UWEM are deterministic, which has some drawbacks. As an example, one of the automatic UWEM tests checks whether an image (<img> element) has an alternative text. There are no automatic tests checking the validity of such alternative texts. This means that, for a web site to conform to the automatic UWEM tests, any alternative text is sufficient. People who are unable to see images, and applications such as search engines, rely on the alternative text to convey the information of non-textual web content. If this information is not present, or if the text does not describe the image well, the information conveyed in the image is lost to these users.

To make sure web sites are accessible, appropriate textual alternatives are needed in many places, such as in frame titles, labels and alternative texts of graphical elements [3],[4],[5].2 In many cases, these alternative texts have either been automatically added by the publishing software, or a misleading text has been supplied by the author of the content. Examples include alternative texts of images such as "Image 1", texts which resemble file names such as "somepicture.jpg", and placeholder texts such as "insert alternative text here". Most automatic accessibility checkers, including validators that comply with the automatic UWEM tests, check only for the existence of alternative texts. The texts mentioned above, which are undescriptive and thus not considered accessible, will not be detected by those tests. Our data shows that 80% of the alternative texts are not describing the corresponding content well.

This paper proposes an extension of UWEM with tests that use pattern recognition algorithms to automatically detect alternative texts which, in their context, are inaccessible.

To the best of our knowledge, using pattern recognition to test for undescriptive use of alternative texts in web pages has not been done previously. However, related approaches have been presented. For example, the Imergo Web Compliance Manager [6] provides results for suspicious alternative texts for images. The algorithm is not presented in the literature.

Furthermore, a technique for automatically judging the quality of alternative texts of images has been presented by Bigham [7].3 This approach judges alternative texts in correspondence with the images, including classification using words commonly found in alternative texts and a check of whether the same text is present in any other web page on the Internet. The classifier uses Google and Yahoo and, in its best experiment, has an accuracy of 86.3% using 351 images and corresponding alternative texts. However, in the presented results 7999 images are discarded because the algorithm fails to label the images. It is evident that discarding 7999 images (95.8%) is undesirable and has a severe impact on the overall accuracy. Taking the discarded images into account, the true accuracy of the presented algorithm is only 3.6%.
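The percentages in the paragraph above can be reproduced with a short calculation. The sketch below uses only the figures quoted from [7] and assumes that every discarded image is counted as a failed classification, which is our reading of how the 3.6% is obtained; the variable names are illustrative.

```python
# Figures quoted above for the best experiment reported in [7].
labelled = 351               # images the classifier was able to label
discarded = 7999             # images the classifier failed to label
accuracy_on_labelled = 0.863

total = labelled + discarded
discarded_share = discarded / total

# Assumption: each discarded image counts as an incorrect classification.
effective_accuracy = accuracy_on_labelled * labelled / total

print(f"discarded share:    {discarded_share:.1%}")
print(f"effective accuracy: {effective_accuracy:.1%}")
```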

Detecting undescriptive texts in web pages has many similarities with the detection of junk web pages and emails, where heuristics and pattern classification have been successfully applied [8],[9],[10],[11].

It is worth noting that the descriptiveness of a text could be judged in correspondence with the content. For example, if there is an image of a cat, an appropriate alternative textual description may be "cat" while "dog" would be wrong. In order to detect these situations, image processing would most likely be needed in addition to text classification. Even though this could increase the overall accuracy, it is a much more challenging task which also involves a significant increase in computational costs [12]. Image processing is not within the scope of this paper.

Approach

This paper presents a method for detecting undescriptive use of alternative texts using pattern recognition algorithms. The paper follows a traditional classification approach [13]: Section Data presents the data used for the classification. From this data, features are extracted in section Feature Extraction. The classification algorithms and results are presented in section Classification; Nearest Neighbor in section Approach 1 - Nearest Neighbor and Naïve Bayes in section Approach 2 - Naïve Bayes. Finally, sections Conclusion and Further work present the conclusions and further work.

Data

 

The home pages of 414 web sites from Norwegian municipalities were downloaded. From these, more than 11 000 alternative texts were extracted (more than 1700 unique alternative texts) and manually classified as either descriptive or undescriptive.

All web pages have been deliberately chosen to be in only one language to avoid possible bias due to language issues, such as the length of words, which is language dependent [14], or words that are known to be undescriptive, which differ between languages [15]. Despite only using web pages in one language in this paper, the algorithms presented are not expected to be limited to Norwegian. The algorithms can be applied to any language as long as appropriate training data is used.

Note that the frequent problem of absent descriptive texts [16] has deliberately been excluded from these experiments. A test for the presence of such descriptive texts is already part of UWEM as a fully automatable test [1],[2] and is thus not addressed in this paper.

Feature Extraction

 

Several features of undescriptive texts in web pages have already been presented in the literature [7],[15].

Slatin [7] found that file name extensions, such as “.jpg”, are a common attribute of undescriptive alternative texts of images. Additionally, the study indicates that a dictionary of known undescriptive words and phrases can be useful for such a classification.

Craven [15] presented additional features that characterize undescriptive alternative texts. Most significantly, he found certain words and characters that are common in undescriptive alternative texts, such as "*", "1", "click", "arrow" and "home". Additionally, he found a correlation between the size of the image and the length of the alternative text. His empirical data indicates that small images are more often used for decoration and should therefore have an empty alternative text.

In line with the literature [7],[5],[15], the following features were extracted from the collected data:

Generally speaking, features will work well as part of classifiers as long as they have a discriminatory effect on the data [13]. This means that, based on the properties of the features alone, it should be possible to separate the data belonging to the descriptive class from the data belonging to the undescriptive class.

Figure 1: Density Graphs for the features number of words and length of texts.

 

Figure 2: Bar Charts with percentage of occurrence for features.

 

The features have different properties and distributions. The features "number of words" and "length of the alternative text" are represented by positive integer values $(1,2,3,...)$. The discriminatory effect of these features is presented as density graphs in figure 1. Figure 1 shows that shorter alternative texts and fewer words are common properties of the undescriptive texts. Most noticeably, figure 1 shows that having only one word in the alternative text is a common property of the undescriptive class, while having two or more words is a common property of the descriptive class.

The remaining features are represented by a boolean value. As an example, a file name extension is either present or not present in the alternative text. Figure 2 shows the discriminatory effect of the features represented by boolean values.6 As an example, close to 50% of the undescriptive alternative texts had words which often cause accessibility barriers, while only 0.5% of the descriptive alternative texts had the same behaviour. Similarly, about 2% of the undescriptive texts had file name extensions, while only 0.05% of the descriptive texts had file name extensions.
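As an illustration of the feature types discussed above, the following minimal sketch extracts two integer features and two boolean features from an alternative text. It is not the extraction code used in the study: the word list holds the English examples quoted from [15] (the study used Norwegian translations), and the list of file name extensions and the function name `extract_features` are assumptions made for the example.

```python
import re

# Example words and characters quoted from [15]; hypothetical stand-in for the
# dictionary used in the study, which was in Norwegian.
UNDESCRIPTIVE_WORDS = {"*", "1", "click", "arrow", "home"}

# Hypothetical set of file name extensions to look for.
FILE_EXTENSION = re.compile(r"\.(jpe?g|gif|png|bmp)\b", re.IGNORECASE)


def extract_features(alt_text: str) -> dict:
    """Return integer and boolean features for one alternative text."""
    words = alt_text.split()
    # Lower-case each word and strip surrounding punctuation before lookup.
    normalised = {w.strip(".,:;!?\"'").lower() for w in words}
    return {
        "num_words": len(words),
        "length": len(alt_text),
        "has_file_extension": bool(FILE_EXTENSION.search(alt_text)),
        "has_undescriptive_word": bool(normalised & UNDESCRIPTIVE_WORDS),
    }


for text in ("somepicture.jpg", "Image 1", "Mayor opening the new library"):
    print(text, extract_features(text))
```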

Classification

 

Essential for the approach is the actual classification. In this paper we have implemented and tested two well known classification algorithms: Nearest Neighbor and Naïve Bayes [17],[13].

All algorithms have been tested with leave-one-out cross validation [18]: 1. Train with the whole data set except one instance. 2. Classify the held-out instance. 3. Select the next instance and go to 1. This ensures that the training sample is independent of the test set.
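The leave-one-out procedure described above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for the experiments; `leave_one_out_accuracy` and the baseline `majority_class` classifier are hypothetical names introduced for the example, and the toy data is made up.

```python
from typing import Callable, Sequence


def leave_one_out_accuracy(
    samples: Sequence,
    labels: Sequence,
    train_and_predict: Callable[[list, list, object], object],
) -> float:
    """Hold out each instance once, train on the rest, classify the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = [s for j, s in enumerate(samples) if j != i]
        train_y = [l for j, l in enumerate(labels) if j != i]
        if train_and_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def majority_class(train_x, train_y, query):
    # Hypothetical baseline: always predict the most frequent training label.
    return max(set(train_y), key=train_y.count)


# Made-up toy data: three descriptive texts and one undescriptive text.
toy_samples = [0, 1, 2, 3]
toy_labels = ["descriptive", "descriptive", "descriptive", "undescriptive"]
print(leave_one_out_accuracy(toy_samples, toy_labels, majority_class))
```

Any classifier can be plugged in through the `train_and_predict` argument, as long as it only sees the training portion when classifying the held-out instance.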

Approach 1 - Nearest Neighbor

 

With the Nearest Neighbor algorithm, the data are added in multidimensional feature space where each feature represents a dimension, and every record is represented by a coordinate in the feature space. The euclidean distance is calculated between the item to be classified, and all items part of the training data. This identifies the nearest neighbors, and voting between the k nearest neighbors decides the outcome of the classification. In our experiments, k was chosen to be 1.
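A minimal sketch of this classification step is given below, with k defaulting to 1 as in our experiments. The function name, the two-dimensional toy feature vectors (number of words, text length) and the use of unscaled feature values are assumptions made for the illustration.

```python
import math
from collections import Counter


def nearest_neighbor_classify(train_x, train_y, query, k=1):
    """Classify `query` by voting among its k nearest training points."""
    # Euclidean distance from the query to every training item.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up toy training data: (number of words, length of text).
train_x = [(1, 5), (1, 7), (4, 28), (6, 40)]
train_y = ["undescriptive", "undescriptive", "descriptive", "descriptive"]

print(nearest_neighbor_classify(train_x, train_y, (2, 9)))
print(nearest_neighbor_classify(train_x, train_y, (5, 35)))
```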

Figure 1 suggests that the length of the alternative texts and the number of words could be sufficient features for a working classifier. However, the empirical results do not support this: using these features alone gives the classifier an accuracy of only 66.5% and 69.0%.

By using all features described in section Feature Extraction the classifier achieves an accuracy of 90.0%, which is significantly higher than the state-of-the-art [7]. A confusion matrix with the classification results can be seen in table 1.

Table 1: Confusion Matrix with classification accuracy using Nearest Neighbor

                    descriptive    undescriptive
descriptive         93.9%          6.1%
undescriptive       27.2%          72.8%
overall accuracy    90.0%
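For reference, a confusion matrix of this form can be computed from paired lists of manual and predicted labels, as in the sketch below. The sketch assumes that rows correspond to the manually assigned class and columns to the predicted class, with each row normalised to 100%; the function name and the toy labels are made up for the illustration.

```python
from collections import Counter

CLASSES = ("descriptive", "undescriptive")


def confusion_matrix_percent(actual, predicted):
    """Return row-normalised percentages and the overall accuracy in percent."""
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in CLASSES:
        row_total = sum(counts[(a, p)] for p in CLASSES)
        matrix[a] = {
            p: 100.0 * counts[(a, p)] / row_total if row_total else 0.0
            for p in CLASSES
        }
    overall = 100.0 * sum(counts[(c, c)] for c in CLASSES) / len(actual)
    return matrix, overall


# Made-up toy labels, not data from the study.
actual = ["descriptive"] * 4 + ["undescriptive"] * 2
predicted = ["descriptive"] * 3 + ["undescriptive", "descriptive", "undescriptive"]

matrix, overall = confusion_matrix_percent(actual, predicted)
print(matrix)
print(round(overall, 1))
```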

Approach 2 - Naïve Bayes

 

How well the classification works depends on how well the features are able to discriminate between the classes. It could be that the chosen features described in section Feature Extraction are not the best way to identify descriptive and undescriptive alternative texts. The words themselves could have a significant discriminatory effect [19].

By relying on the words alone using a Naïve Bayes classifier, we get an accuracy of 91.9%. This is slightly more than Approach 1 and again significantly higher than the state-of-the-art [7]. A confusion matrix with the classification results can be seen in table 2.

Table 2: Confusion Matrix with classification results using Naïve Bayes

                    descriptive    undescriptive
descriptive         92.6%          7.4%
undescriptive       10.9%          89.1%
overall accuracy    91.9%
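Classification on the words alone can be sketched as below. The paper does not specify the Naïve Bayes variant, so the sketch assumes a multinomial model over whitespace-separated, lower-cased words with add-one smoothing; the class name `WordNaiveBayes` and the English toy texts are hypothetical and are not taken from the study's data.

```python
import math
from collections import Counter, defaultdict


class WordNaiveBayes:
    """Multinomial Naive Bayes over the words of each text (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data.
texts = [
    "image", "click here", "logo",
    "mayor opening the new library",
    "map of the town centre",
    "children playing at the school",
]
labels = ["undescriptive"] * 3 + ["descriptive"] * 3

model = WordNaiveBayes().fit(texts, labels)
print(model.predict("click image"))
print(model.predict("the new school"))
```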

Conclusion

 

This paper presents an approach for classifying alternative texts in web pages as descriptive or undescriptive. Undescriptive texts do not describe the corresponding content well and may impose an accessibility barrier. The paper presents two approaches: classification based on well known properties of undescriptive texts presented in the literature, and classification using the texts alone. Both approaches have an accuracy of 90% or more, which is better than the state-of-the-art. Furthermore, in contrast to the state-of-the-art, this paper presents approaches that are independent of third party tools.

The findings in this paper give a strong indication that undescriptive alternative texts in web pages can be detected automatically with a high degree of accuracy.

Further work

 

Further work includes looking at alternative texts in comparison with the actual images. This would include adding image processing to improve the overall accuracy.

We could expect the content of the alternative texts to be related to the text of the page itself. We would like to explore to what extent a descriptive text is topically related to the content of the web page, and how this could potentially become part of the features.

This paper only presents an approach to detect undescriptive use of alternative texts. In the future we will explore how results from the tests can be incorporated into the existing UWEM framework.

Acknowledgements

The eGovMon project 7 is co-funded by the Research Council of Norway under the VERDIKT program. Project no.: VERDIKT 183392/S10. The results in the eGovMon project and in this paper are all built on the results of an exciting team collaboration including researchers, practitioners and users.

Footnotes

1. [The paper has been published in the Proceedings of the International Conference on Computers Helping People with Special Needs 2010: 425-432. Springer-Verlag Berlin, Heidelberg © 2010]

2. [All types of textual alternatives are in this paper referred to as alternative texts. An alternative text is descriptive if it describes the corresponding content well, and undescriptive if it does not.]

3. [Note that this is limited only to alternative texts of images, while our approach includes several types of alternative texts.]

4. [Norwegian translations of these words were used.]

5. [In this study, all types of HTML is included. It is worth noticing that not every entity is problematic.]

6. [Note that the y-axis is logarithmic]

7. [http://www.egovmon.no]

Bibliography

[1] Web Accessibility Benchmarking Cluster (2007). Retrieved November 4th, 2009, from http://www.wabcluster.org/uwem1_2/

[2] Unified Web Evaluation Methodology Indicator Refinement (2007). Retrieved November 4th, 2009, from http://www.eiao.net/resources

[3] World Wide Web Consortium: Web Content Accessibility Guidelines 1.0. W3C Recommendation 5 May 1999. Retrieved November 4th, 2009, from http://www.w3.org/TR/WCAG10/

[4] World Wide Web Consortium: Web Content Accessibility Guidelines (WCAG) 2.0. Retrieved November 4th, 2009, from http://www.w3.org/TR/REC-WCAG20-20081211/

[5] Slatin, J.M.: The art of ALT: toward a more accessible Web. Computers and Composition 18(1), 73-81 (2001)

[6] Web Compliance Center of the Fraunhofer Institute for Applied Information Technology (FIT): Web Compliance Manager Demo. Retrieved November 4th, 2010, from http://www.imergo.com/home

[7] Bigham, J.P.: Increasing web accessibility by automatically judging alternative text quality. In: Proceedings of the 12th international conference on Intelligent user interfaces, p. 352. ACM (2007)

[8] Hershkop, S.: Behavior-based email analysis with application to spam detection (2006)

[9] Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, pp. 1-6. ACM New York, NY, USA (2004)

[10] Yu, B., Xu, Z.: A comparative study for content-based dynamic spam classification using four machine learning algorithms. Knowledge-Based Systems 21(4), 355-362 (2008)

[11] Kolesnikov, O., Lee, W., Lipton, R.: Filtering spam using search engines. Technical Report GITCC-04-15, Georgia Tech, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332 (2004-2005)

[12] Chen, Z., Wu, O., Zhu, M., Hu, W.: A Novel Web Page Filtering System by Combining Texts and Images. In: WI '06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 732-735. Washington, DC, USA (2006). ISBN 0-7695-2747-7. http://dx.doi.org/10.1109/WI.2006.21

[13] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification (2001)

[14] Solorio, T., Pé: A language independent method for question classification. In: Proceedings of the 20th international conference on Computational Linguistics, p. 1374. Association for Computational Linguistics (2004)

[15] Craven, T.C.: Some features of alt text associated with images in web pages. Information Research 11 (2006)

[16] Gutierrez, C.F., Loucopoulos, C., Reinsch, R.W.: Disability-accessibility of airlines’ Web sites for US reservations online. Journal of Air Transport Management 11(4), 239-247 (2005)

[17] Han, J., Kamber, M.: Data mining: concepts and techniques (2006)

[18] Wikipedia: Cross-validation (statistics) (2010)

[19] Duchrow, T., Shtatland, T., Guettler, D., Pivovarov, M., Kramer, S., Weissleder, R.: Enhancing navigation in biomedical databases by community voting and database-driven text classification. BMC bioinformatics 10(1), 317 (2009)

The author of this document is:
Morten Goodwin
E-mail address is:
morten.goodwin [at] uia.no
Phone is:
+47 95 24 86 79