 Using Internet Glossaries to Determine Interests from Home Pages

 Edwin Portscher,1 James Geller1 and Richard Scherl2

 1New Jersey Institute of Technology, Newark, New Jersey 07102

 2Monmouth University, West Long Branch, New Jersey 07764

 
 
Abstract. There are millions of home pages on the web. Each page contains valuable data about its owner that can be used for marketing purposes. To be useful, these pages have to be classified according to interests. The traditional Information Retrieval approach requires large training sets that are classified by human experts. Knowledge-based methods, which use handcrafted rules, require a significant investment to develop the rule base. Both approaches are very time consuming. We use glossaries, which are freely available on the Internet, to determine interests from home pages. Processing these glossaries can be automated and requires little human effort and time compared to the other two approaches. Once the terms have been extracted from the glossaries, they can be used to infer interests from the home pages of web users. This paper describes the system we have developed for classifying home pages by interests. In an experiment on 400 pages, the glossary with the highest number of word matches was the correct interest for 44.75% of the pages. The correct interest was among the top three returned interests for 72.25% of the pages and among the top five for 84.5% of the pages.*

 
1 Introduction

Much work has been published on web page classification in the fields of Information Retrieval and Artificial Intelligence.  There are many learning methods for classifying web pages [2,8,9]. One type of learning method is supervised learning, e.g., Nearest Neighbor Learners, Bayesian Learners and Discriminative Classification methods such as SVM. There are also unsupervised and semi-supervised learning methods, where an algorithm determines the similarity between documents. Some of these techniques make use of features specific to web pages [11]. Handcrafted rule-based methods and Inductive Learning Methods for text classification have also been developed [8,12,13]. Some approaches analyze the structure of the web pages and the characteristics of the images in them [6]. There are also knowledge-based, Artificial Intelligence approaches [5] for home page classification.

 Our approach is unique in that we use glossary information available on the web to categorize web pages. We do so without the training sets that would be needed in Information Retrieval and many Machine Learning approaches. We need neither sophisticated Natural Language Processing (NLP) methods nor complex knowledge bases, as knowledge-based approaches do [12]. We infer an interest from a single feature of the web page: the occurrence of terms that are specific to the glossary for that interest.

 
2 Extracting Glossaries from the Web

Our classification system uses terms mined from Internet glossaries (for example, Figure 1) to determine an interest from a home page. These glossaries are freely available on the Internet and cover a very wide range of topics. They are also very easy to find: a simple Google search returns many results. We have found that glossaries are also easy to process, because they tend to have regular structures. As can be seen from the sample glossary in Figure 1, the terms that we are interested in are usually in bold or highlighted in some way. This makes it easy to automate the extraction of the glossary terms.
 
 

Fig. 1.  Example Glossary

 We currently have glossaries on 30 different topics in our system. It took comparatively little time to locate the glossaries on the Web, extract terms from them and manually review the results for errors. For example, to build our glossary for baseball, we searched Google for the term “baseball glossary.” The first 30 hits of this search returned distinct baseball glossaries. Naturally, there was a good degree of overlap between those glossaries, but some of them contained words rarely found in any of the other glossaries.

 We wrote a program to process HTML files and extract words from within HTML tags, such as <b> </b>, which mark a bold word. Our program converts the terms to lower case and replaces any symbols with blank spaces. We generate one output file per glossary topic. Our baseball file starts out empty; our program puts the terms from the first Internet baseball glossary into the empty text file. Since there are many baseball glossaries, for each baseball glossary after the first, the terms are checked against the baseball file to see whether they have been encountered already. A term is added to the baseball file only if it is new. When the program is finished, the output file is manually reviewed to make sure that we have one term per line. We also remove any HTML that may have found its way into the glossary file and run a sorting program to alphabetize all glossary files for easier inspection. We have generated glossary files for 30 topic areas. A list of these topic areas can be seen in the left-most column of Table 1.
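The extraction step described above can be sketched roughly as follows. This is a minimal illustration and not our actual program: the class name GlossaryExtractor, the method addTerms and the sample HTML are hypothetical. It assumes that terms are marked with plain <b> tags, and it collapses each run of symbols into a single blank. A sorted set stands in for the duplicate check and the separate sorting pass.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GlossaryExtractor {
    // Matches the text between <b> and </b> (assumption: terms use plain bold tags).
    private static final Pattern BOLD =
        Pattern.compile("<b>(.*?)</b>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Adds the normalized bold terms of one glossary page to the topic's term set. */
    static void addTerms(String html, Set<String> terms) {
        Matcher m = BOLD.matcher(html);
        while (m.find()) {
            String term = m.group(1)
                .replaceAll("<[^>]*>", " ")           // drop markup nested inside the bold tag
                .toLowerCase(Locale.ROOT)             // convert to lower case
                .replaceAll("[^\\p{L}\\p{N}]+", " ")  // symbols (and runs of them) become one blank
                .trim();
            if (!term.isEmpty()) {
                terms.add(term);                      // a Set silently ignores terms already seen
            }
        }
    }

    public static void main(String[] args) {
        Set<String> baseball = new TreeSet<>();       // TreeSet keeps the terms alphabetized
        addTerms("<p><b>Double Play</b>: two outs ...</p><p><b>Bunt:</b> ...</p>", baseball);
        addTerms("<b>bunt</b> a soft hit <B>Hit-and-Run</B>", baseball);
        baseball.forEach(System.out::println);        // one term per line
    }
}
```

A manual review of the output would still be needed, since real glossary pages also use bold tags for text that is not a glossary term.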

 
3 Classifying Home Pages

The work described in this paper forms one module of a larger system, which has the purpose of extracting demographic information and interest information from home pages of web users. Many home pages follow a structured format, which may be enforced by a portal site. On those pages it is easy to recognize the interests of a home page owner, because they are prefixed with a keyword such as "Interests:". However, many other home pages contain interests "hidden" in paragraphs of free text. The purpose of the glossary module is to derive one interest for each free-text input home page. In other words, it classifies web pages by interests.

        Our system for categorizing home pages is written in Java and is currently set up to use Yahoo Geocities members' home pages as test data. It uses a web crawler to crawl linked pages, starting at any home page we specify. The crawler can also run through a Yahoo Geocities member page listing, which lists 20 home pages at a time. When our classifier starts, it first loads all glossary files. Each glossary file is hashed into a separate hash table. The web crawler then takes over and visits every page in the member's site. It extracts the words from the HTML page, including words from the meta tags. These words are then compared against the glossary hash tables using a sliding window of one to seven words in length. When a word or word sequence on the web page matches an entry in a glossary hash table, the matched words and the glossary in which they occurred are recorded. At the end of the page, the results are tallied and written to a final output file.
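The sliding-window comparison can be illustrated with the following sketch. It is written under stated assumptions and is not the code of our system: WindowMatcher, countMatches and the toy glossaries are hypothetical names and data. Multi-word glossary terms are assumed to be stored with single blanks between words, and overlapping matches (such as "hole" inside "hole in one") are each counted once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WindowMatcher {
    static final int MAX_WINDOW = 7; // windows of one to seven words, as described in the text

    /** Counts, per glossary topic, how many word windows of the page occur in that glossary. */
    static Map<String, Integer> countMatches(List<String> words,
                                             Map<String, Set<String>> glossaries) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= MAX_WINDOW && start + len <= words.size(); len++) {
                if (len > 1) {
                    phrase.append(' ');
                }
                phrase.append(words.get(start + len - 1)); // grow the window by one word
                String candidate = phrase.toString();
                for (Map.Entry<String, Set<String>> glossary : glossaries.entrySet()) {
                    if (glossary.getValue().contains(candidate)) { // hash lookup
                        counts.merge(glossary.getKey(), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> glossaries = new HashMap<>();
        glossaries.put("Golf", new HashSet<>(Arrays.asList("hole", "hole in one", "back nine")));
        glossaries.put("Baseball", new HashSet<>(Arrays.asList("hit")));
        List<String> page = Arrays.asList("he hit a hole in one on the back nine".split(" "));
        System.out.println(countMatches(page, glossaries));
    }
}
```

Because each lookup is a hash probe, the cost per page grows with the number of words times the window length times the number of glossaries, not with the size of the glossaries.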

       The result of the classifier for a given home page consists of a list of pairs ((glossary topic 1, number of word matches 1), (glossary topic 2, number of word matches 2) ....). Ideally, the glossary topic with the largest number of word matches should be identical to the topic of the home page that we are classifying.
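Given such a list of pairs, deciding whether the correct topic is the top result, in the top three, or in the top five could be sketched as below. The names TopicRanker, pessimisticRank and band are hypothetical. The handling of ties is an assumption: a topic that ties with another topic is ranked behind it, which is one possible convention and is not necessarily the one used in our evaluation. The sample counts are the nonzero entries for page P20 in Table 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TopicRanker {
    /**
     * Rank of a topic when ties count against it:
     * 1 + the number of other topics with at least as many word matches.
     */
    static int pessimisticRank(Map<String, Integer> counts, String topic) {
        int own = counts.getOrDefault(topic, 0);
        int rank = 1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!e.getKey().equals(topic) && e.getValue() >= own) {
                rank++;
            }
        }
        return rank;
    }

    /** Maps a rank to the bands used in the result tables: 1, 3, 5 or W (wrong). */
    static String band(int rank) {
        if (rank == 1) return "1";
        if (rank <= 3) return "3";
        if (rank <= 5) return "5";
        return "W";
    }

    public static void main(String[] args) {
        Map<String, Integer> p20 = new LinkedHashMap<>(); // nonzero counts of page P20, Table 2
        p20.put("Woodworking", 1);
        p20.put("Football", 1);
        p20.put("Rubberstamping", 2);
        p20.put("Jewelry making", 3);
        p20.put("Golf", 3);
        p20.put("Wine", 2);
        p20.put("Ceramics", 1);
        p20.put("Real Estate", 4);
        p20.put("Caving", 1);
        p20.put("Birding", 1);
        int rank = pessimisticRank(p20, "Golf");
        System.out.println("rank " + rank + ", band " + band(rank));
    }
}
```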

4 Experimental Results of Home Page Classification

We ran the classifier on 40 pages from each of 10 Geocities topics. After classifying these 400 pages, we found that the glossary with the largest number of word matches was indeed from the same topic as the home page 44.75% of the time.  If we consider it a success when the correct topic appears within the top three or top five topics returned by the classifier, then the result percentages become much better. 

    The correct interest is in the top three returned topics 72.25% of the time, and in the top five returned topics 84.5% of the time. If no words of the home page match any glossary, “interest could not be determined” is returned. This result was only returned for a page that contained “site under construction.” We consider this a correct interest analysis. Geocities groups its pages by topic, e.g., baseball. However, every once in a while there is a rogue page, which is stored in a topic area where it does not belong. If a rogue page is from a topic for which we do not have a glossary, then the result returned is effectively arbitrary.

Table 1. Golf Pages from Yahoo Geocities Golf Topic


   P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
Woodworking 16 15 2 5 83 0 2 8 48 5
Football 128 21 2 10 73 1 10 39 84 4
Soap making 10 0 2 4 0 0 0 1 4 0
Weaving 15 7 0 5 0 1 0 3 13 0
Sewing 31 5 4 5 8 1 4 11 23 1
Scrapbooks 2 2 1 1 13 0 1 0 7 1
Quilting 8 8 0 4 19 2 2 1 47 0
Rubberstamping 2 1 0 4 7 0 0 0 2 3
Baseball 71 30 5 16 39 3 6 29 99 7
Polymer clay 1 1 1 0 1 0 0 0 3 0
Needlecrafts 0 0 0 1 0 0 0 0 0 0
Knitting 4 0 4 1 0 0 5 1 8 1
Jewelry Making 1 5 3 0 1 1 1 2 9 0
Tennis 135 46 9 23 101 8 14 46 148 6
Volleyball 11 11 0 7 14 1 1 6 29 1
Golf 199 110 20 49 291 21 19 96 437 39
Archery 54 16 1 5 4 3 7 12 60 3
Fencing 27 7 1 1 12 0 2 17 30 2
Wine 56 38 6 12 166 0 6 28 107 4
Boxing 51 4 0 7 9 1 1 7 21 6
Ceramics 8 6 0 2 62 1 0 0 12 0
Egg Painting 0 0 0 1 0 0 0 0 0 0
Candle Making 0 3 0 0 0 0 1 0 3 0
Real Estate 144 26 31 14 108 5 19 57 109 14
Scuba 0 0 0 0 0 0 0 0 0 0
Mountain biking 15 6 2 3 7 0 2 2 31 1
Caving 91 15 5 15 80 4 9 16 55 4
Auto Racing 7 9 4 7 21 0 3 14 26 0
Hiking 4 3 0 2 3 0 2 1 7 3
Birding 24 15 4 0 56 0 5 7 31 1
Correct Analysis IN* 1 1 3 1 1 1 3 1 1 1

 *Correct Analysis IN: the band in which Golf, the correct topic, was returned (1 = top result, 3 = top three, 5 = top five, W = wrong)

 Tables 1, 2, 3 and 4 show the results of running our system against 40 home pages (P1--P40) from the Golf topic in Yahoo Geocities. In these tables, every number is a count of word matches. For example, in Table 1, the first number in the first row indicates that 16 words from the woodworking glossary were found in test page P1. For 32 of these test pages, Golf was the glossary with the most word matches. For 6 further pages, Golf was among the three glossaries with the most matches, and for 1 page it was among the top five. One page was not a Golf page at all. This wrong page was reached through a bad link, which led to the standard Yahoo error message “Sorry, the page you requested was not found.” Such error pages produce the same match pattern every time; in future work we will scan our results for such patterns and report an error instead of an interest, which would improve the accuracy of our system. We did not apply this correction when analyzing our results; we counted such pages as wrongly categorized.
 

Table 2. Golf Pages from Yahoo Geocities Golf Topic


   P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
Woodworking 1 4 4 0 7 0 0 2 0 1
Football 2 4 2 0 13 0 4 2 3 1
Soap making 0 0 0 0 0 0 0 0 0 0
Weaving 0 0 1 0 4 0 2 1 0 0
Sewing 0 1 0 0 1 1 1 0 1 0
Scrapbooks 0 0 1 0 0 0 0 1 0 0
Quilting 4 0 1 1 6 0 0 0 1 0
Rubberstamping 0 1 0 1 1 0 0 1 0 2
Baseball 0 2 8 1 19 1 4 3 7 0
Polymer clay 1 0 0 0 0 0 0 0 0 0
Needlecrafts 0 0 0 0 0 0 0 0 0 0
Knitting 1 1 0 0 1 1 0 0 0 0
Jewelry making 0 0 0 0 0 0 0 0 1 3
Tennis 6 3 8 0 27 1 5 3 8 0
Volleyball 0 0 1 0 10 0 3 0 0 0
Golf 30 8 12 1 83 2 10 6 19 3
Archery 0 0 4 0 9 0 1 1 4 0
Fencing 0 0 1 0 4 0 1 0 1 0
Wine 0 7 11 0 20 0 3 5 5 2
Boxing 0 0 0 0 7 0 2 0 0 0
Ceramics 0 2 0 0 2 0 0 0 0 1
Egg Painting 0 0 0 0 0 0 0 0 0 0
Candle Making 0 0 1 0 0 0 1 1 0 0
Real Estate 6 7 3 2 10 5 6 2 6 4
Scuba 0 0 0 0 0 0 0 0 0 0
Mountain biking 0 1 0 0 4 0 0 0 0 0
Caving 7 3 3 0 10 0 0 1 1 1
Auto Racing 0 1 0 0 3 1 1 0 1 0
Hiking 0 0 0 0 3 0 1 0 0 0
Birding 0 3 1 0 7 1 2 0 2 1
Correct Analysis IN* 1 1 1 5 1 W 1 1 1 3

 *Correct Analysis IN: the band in which Golf, the correct topic, was returned (1 = top result, 3 = top three, 5 = top five, W = wrong)

Table 3. Golf Pages from Yahoo Geocities Golf Topic


   P21 P22 P23 P24 P25 P26 P27 P28 P29 P30
Woodworking 5 2 38 0 0 10 6 11 2 14
Football 10 2 58 0 4 70 6 16 1 4
Soap making 2 0 18 0 0 7 0 1 0 0
Weaving 3 0 4 0 0 5 1 1 1 2
Sewing 0 5 13 0 1 39 2 3 3 1
Scrapbooks 0 0 6 0 0 0 0 0 0 2
Quilting 1 1 19 0 0 5 4 2 0 1
Rubberstamping 2 0 4 0 0 3 1 0 0 22
Baseball 16 1 58 0 0 38 11 13 4 4
Polymer clay 2 0 1 0 0 3 0 0 0 0
Needlecrafts 0 0 0 0 0 0 0 0 0 0
Knitting 2 0 15 0 0 0 1 3 2 1
Jewelry making 0 0 2 0 0 0 1 0 2 3
Tennis 37 8 121 0 4 110 27 17 9 14
Volleyball 10 1 31 0 1 17 4 1 1 3
Golf 84 20 636 1 19 122 58 80 14 64
Archery 21 1 19 0 1 22 5 3 1 6
Fencing 6 0 12 0 1 12 3 2 1 0
Wine 14 7 63 0 5 66 11 19 3 23
Boxing 7 0 5 0 2 9 0 3 1 1
Ceramics 7 0 0 0 0 1 0 2 0 0
Egg Painting 0 0 0 0 0 0 0 0 1 1
Candle Making 0 0 3 0 0 0 0 0 0 2
Real Estate 18 6 166 2 2 24 11 24 14 12
Scuba 0 0 0 0 0 0 0 0 0 0
Mountain biking 3 0 7 0 1 10 2 2 1 0
Caving 8 3 21 0 2 22 4 8 3 2
Auto Racing 3 0 40 0 0 20 4 5 2 3
Hiking 5 0 6 0 0 5 0 1 0 0
Birding 18 0 3 0 2 13 5 5 3 1
Correct Analysis IN* 1 1 1 3 1 1 1 1 3 1

Fig. 2. Example Page P11  
 
 

      Page 31 in Table 4 is a David Duval fan page. As one can see in the table, the Golf glossary has 60 word matches, which identifies Golf as the correct home page interest. There are 25 matching words from the Tennis glossary in David Duval's page. Thus, Tennis is a distant second.

 Page 11 of Table 2 is shown in Figure 2.  It is the personal home page of a Golf professional. Thirty words from the Golf glossary are found in this home page. The home page does not contain that many words from any other glossary. The second-best match is Caving, and there are only seven words of the Caving glossary in this home page. Thus, the classifier correctly recognizes the topic of this page as Golf.  

Table 4. Golf Pages from Yahoo Geocities Golf Topic


   P31 P32 P33 P34 P35 P36 P37 P38 P39 P40
Woodworking 2 5 2 4 3 1 0 14 3 3
Football 11 9 11 0 6 4 4 24 7 8
Soap making 0 3 2 0 0 0 0 1 0 2
Weaving 4 3 2 0 1 0 0 6 0 0
Sewing 2 4 7 1 4 0 2 3 5 2
Scrapbooks 0 1 0 0 0 0 10 0 0 3
Quilting 6 0 2 1 1 0 8 8 0 6
Rubberstamping 0 0 0 0 1 0 0 2 2 2
Baseball 18 13 18 3 5 3 14 27 0 15
Polymer clay 1 3 2 1 2 0 0 3 0 0
Needlecrafts 0 0 0 0 0 0 0 0 0 0
Knitting 1 1 1 0 0 1 0 2 0 1
Jewelry making 1 2 3 0 1 1 0 0 0 0
Tennis 25 27 22 10 8 3 17 42 9 13
Volleyball 1 7 5 2 1 1 1 13 0 3
Golf 60 68 34 19 16 14 38 112 36 47
Archery 8 13 4 2 3 3 4 9 2 12
Fencing 2 1 1 0 2 0 2 11 0 5
Wine 17 13 11 5 9 5 3 49 8 10
Boxing 2 3 4 0 0 0 0 6 0 3
Ceramics 3 0 1 0 0 0 0 3 2 1
Egg Painting 1 0 1 0 0 0 0 0 0 0
Candle Making 1 0 2 0 1 0 0 0 0 0
Real Estate 14 31 27 4 18 4 20 41 7 21
Scuba 0 0 0 0 0 0 0 1 0 0
Mountain biking 11 1 7 1 3 0 0 5 4 2
Caving 5 13 8 3 7 4 3 28 6 16
Auto Racing 3 5 5 2 2 2 3 5 0 4
Hiking 3 0 2 0 1 0 0 1 5 2
Birding 8 2 4 1 1 0 2 8 0 11
Correct Analysis IN* 1 1 1 1 3 1 1 1 1 1
 

 *Correct Analysis IN: the band in which Golf, the correct topic, was returned (1 = top result, 3 = top three, 5 = top five, W = wrong)

 Table 5 summarizes the results of running our system with 30 glossaries on 400 pages from 10 topics. Each row gives page counts for one topic, and each page is counted in exactly one column. Column two counts the pages for which the glossary with the largest number of word matches was the correct interest (44.75% of the 400 pages). Column three counts the additional pages for which the correct interest was among the top three returned glossaries, which raises the cumulative share to 72.25%. Column four counts the additional pages for which the correct interest was among the top five, for a cumulative share of 84.5%. The last column counts the pages whose correct interest was not among the top five glossary topics returned by our classifier. These results must be considered wrong.

Table 5. Summary of all Results


                          Pages In Top 1  Pages In Top 3  Pages In Top 5  Wrong Pages
Archery                                5              14              13            8
Auto Racing                            3              15               8           14
Baseball                              10              22               5            3
Football                              18              11               4            7
Golf                                  32               6               1            1
Quilting                              21               4               7            8
Real Estate                           38               2               0            0
Soap Making                           11              14               5           10
Tennis                                18              17               3            2
Wine                                  23               5               3            9
Totals of all 400 pages              179             110              49           62
Cumulative Percent                44.75%          72.25%           84.5%         100%
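The totals and cumulative percentages in Table 5 follow directly from the per-topic counts: 179/400 = 44.75%, (179 + 110)/400 = 72.25%, (179 + 110 + 49)/400 = 84.5%, and all 400 pages give 100%. A short sketch that recomputes them is given below; the class and method names are illustrative only, and the rows are copied from Table 5.

```java
class SummaryTotals {
    /** Sums each column over all topics and returns the running share of all pages in percent. */
    static double[] cumulativePercent(int[][] rows) {
        int columns = rows[0].length;
        int[] totals = new int[columns];
        int pages = 0;
        for (int[] row : rows) {
            for (int c = 0; c < columns; c++) {
                totals[c] += row[c];
                pages += row[c];          // every page is counted in exactly one column
            }
        }
        double[] percent = new double[columns];
        int running = 0;
        for (int c = 0; c < columns; c++) {
            running += totals[c];
            percent[c] = 100.0 * running / pages;
        }
        return percent;
    }

    public static void main(String[] args) {
        // Columns: in top 1, in top 3, in top 5, wrong (rows in the order of Table 5).
        int[][] table5 = {
            {5, 14, 13, 8}, {3, 15, 8, 14}, {10, 22, 5, 3}, {18, 11, 4, 7}, {32, 6, 1, 1},
            {21, 4, 7, 8}, {38, 2, 0, 0}, {11, 14, 5, 10}, {18, 17, 3, 2}, {23, 5, 3, 9}
        };
        for (double p : cumulativePercent(table5)) {
            System.out.println(p + "%");
        }
    }
}
```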

5 Language Independence

In a series of experiments, we found that several of our glossaries are surprisingly flexible when processing web pages written in European languages other than English. We have correctly categorized home pages in Spanish, French, Portuguese, Italian, Swedish, Dutch, and Danish. This is possible because many glossary terms are the same across languages; for example, the terms baseball, sangria, champagne, Chablis, and golf are the same in English, French, German and Italian.

Another reason why we were able to classify non-English pages with English glossaries is that many pages in other languages have English words strewn about them. In many cases this was not because the word is the same in that language, but because the page creator decided to use the English word instead of the equivalent word in their native language. For example, a Spanish page about baseball contained: “autor del único triple play sin asistencia en nuestras Series Nacionales” (“author of the only unassisted triple play in our National Series”), where “triple play” is the English term. We also found home pages that contained a single paragraph, sentence or page in English, while the rest of the site was in another language. This was enough for us to get a correct interest determination. Figure 3 is a Golf page in Spanish. The interest of the page was correctly determined by our system, with 64 glossary word matches. These results are shown in Table 3, for page P30.

Fig. 3.  Sample Foreign Language Page

6 Conclusions and Future Work

In this paper, we have presented a new method for classifying Web pages according to interests. Classic Information Retrieval methods using training sets or Artificial Intelligence methods using knowledge bases are hard to train or build. They require much time and effort. Using free, easily accessible Internet glossaries, we found the correct interest on a small sample set in the top five returned topics 84.5% of the time. In future work, we will analyze links to other pages and use that additional information to improve our answers. There has been successful work in using such links to improve page classification [1,2,4,10]. We also noticed that almost all of the foreign language pages contained links to pages in English, which we will use for improved interest determination for such pages.   

        Another topic for future research is to determine when a page should legitimately be classified as belonging to two or more interests.  For instance, we have encountered pages that express an interest in both football and baseball. How can we distinguish between a person that is truly interested in both and a person that is interested in only one, but we get a false positive for the other, because baseball and football have many words in common?  

        Many ambiguous words appear in very different glossaries.  Thus, “Diamond” could be indicative of an interest in baseball or an interest in jewelry. Clearly, such words have less discriminative power than words that appear only in one glossary.  In future work, we will use Information Retrieval methods to reduce the weights of such overlapping words.  We will also experiment with more powerful categorization methods, such as Naïve Bayes classifiers. Most importantly, we are going to add many more glossaries to our system, in order to determine whether classification results stay at an acceptable level.
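One way such a down-weighting could look is an inverse-document-frequency style weight, where each glossary plays the role of a document. This is only a sketch of a possible design, not something our current system does: the formula log(N / gf), the class GlossaryWeights and the toy glossaries are all assumptions made for illustration. Here N is the number of glossaries and gf is the number of glossaries that contain the term.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class GlossaryWeights {
    /** Weight log(N / gf) per term: terms shared by many glossaries get weights near zero. */
    static Map<String, Double> weights(Map<String, Set<String>> glossaries) {
        Map<String, Integer> glossaryFrequency = new HashMap<>();
        for (Set<String> terms : glossaries.values()) {
            for (String term : terms) {
                glossaryFrequency.merge(term, 1, Integer::sum); // count glossaries containing the term
            }
        }
        int n = glossaries.size();
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : glossaryFrequency.entrySet()) {
            weights.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> glossaries = new HashMap<>();
        glossaries.put("Baseball", new HashSet<>(Arrays.asList("diamond", "bunt")));
        glossaries.put("Jewelry Making", new HashSet<>(Arrays.asList("diamond", "clasp")));
        glossaries.put("Golf", new HashSet<>(Arrays.asList("bogey")));
        Map<String, Double> w = weights(glossaries);
        System.out.println("diamond: " + w.get("diamond")); // shared by two glossaries
        System.out.println("bunt:    " + w.get("bunt"));    // unique to one glossary
    }
}
```

With such weights, a page's score for a topic would be the sum of the weights of its matched terms instead of a plain match count, so that a word like "diamond" contributes less than a word found in only one glossary.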

References

  1. G. Attardi, A. Gullí, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105-119, Varese, IT, 1999.
  2. Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake. Using Web Structure for Classifying and Describing Web Pages. Proceedings of WWW-02, International Conference on the World Wide Web, 2002.
  3. B. Gelfand, M. Wulfekuler, and W. F. Punch. Automated concept extraction from plain text. In Papers from the AAAI 1998 Workshop on Text Categorization, pages 13--17, Madison, WI, 1998.
  4. Johannes Fürnkranz. Using Links for Classifying Web-pages. Technical Report OEFAI TR-98-29. Austrian Research Institute for Artificial Intelligence.
  5. Hisao Mase. Experiments on Automatic Web Page Categorization for IR System. Technical Report. Department of Computer Science, Stanford University, 1998.
  6. Arul Prakash Asirvathan, Kranthi Kumar Ravi. Web Page Classification based on Document Structure. International Institute of Information Technology, 2001.
  7. John M. Pierre. Practical Issues for Automated Categorization of Web Sites. ECDL 2000 Workshop on the Semantic Web.
  8. Apte, C., Damerau, F., and Weiss, S., Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, pp. 233-240, July 1994.
  9. H. Yu, K. C.-C. Chang, and J. Han. Heterogeneous Learner for Web Page Classification. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 538-545, Maebashi, Japan, December 2002.
  10. Shyh-Ming Tai, Chen-Zen Yang, Ing-Xian Chen. Improved Automatic Web-Page Classification by Neighbor Text Percolation. Department of Computer Engineering and Science, Yuan Ze University Kaohsiung, Taiwan, November 23, 2002.
  11. Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers: San Francisco, CA, 2003.
  12. Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: John Benjamins Publishing Company,  2002.
  13. P.J. Hayes and S.P. Weinstein. CONSTRUE/TIS: a system for content-based indexing of a database of news stories. In A. Rappaport and R. Smith, editors, Proceedings of IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence, pages 49--66. AAAI Press, Menlo Park, 1990.

* This research was supported by the NJ Commission for Science and Technology
