Using Internet Glossaries to Determine Interests from Home Pages
Edwin Portscher,1 James Geller1 and Richard Scherl2
1New Jersey Institute of Technology, Newark, New Jersey 07102
2Monmouth University, West Long Branch, New Jersey 07764
Abstract. There are millions of home pages on the web. Each page contains valuable data about the page’s owner that can be used for marketing purposes. These pages have to be classified according to interests. The traditional Information Retrieval approach requires large training sets that are classified by human experts. Knowledge-based methods, which use handcrafted rules, require a significant investment to develop the rule base. Both of these approaches are very time consuming. We are using glossaries, which are freely available on the Internet, to determine interests from home pages. Processing of these glossaries can be automated and requires little human effort and time compared to the other two approaches. Once the terms have been extracted from these glossaries, they can be used to infer interests from the home pages of web users. This paper describes the system we have developed for classifying home pages by interests. In an experiment on 400 pages, we found that the glossary with the highest number of word matches is the correct interest for 44.75% of the pages. The correct interest is among the top three returned interests for 72.25% of the pages, and among the top five returned interests for 84.5% of the pages.*
1 Introduction
Much work has been published on web page classification in the fields of Information Retrieval and Artificial Intelligence. There are many learning methods for classifying web pages [2,8,9]. One type of learning method is supervised learning, e.g., Nearest Neighbor Learners, Bayesian Learners and Discriminative Classification methods such as SVM. There are also unsupervised and semi-supervised learning methods, where an algorithm determines the similarity between documents. Some of these techniques make use of features specific to web pages [11]. Handcrafted rule-based methods and Inductive Learning Methods for text classification have also been developed [8,12,13]. Some approaches analyze the structure of the web pages and the characteristics of the images in them [6]. There are also knowledge-based, Artificial Intelligence approaches [5] for home page classification.
Our approach is unique in that we use glossary information available on the web to categorize web pages. We do so without the training sets that would be needed in Information Retrieval and many Machine Learning approaches. We do not need sophisticated Natural Language Processing (NLP) methods or complex knowledge bases as in knowledge-based approaches [12]. We infer an interest from the occurrence, on a web page, of terms that are specific to the glossary topic for that interest.
2 Extracting Glossaries from the Web
Our classification system uses terms mined from Internet glossaries (for example, Figure 1) to determine an interest from a home page. These glossaries are freely available on the Internet. There are glossaries on every imaginable topic. They are also very easy to find: a simple Google™ search reveals many results. We have found that glossaries are also easy to process, because they tend to have regular structures. As can be seen from the sample glossary in Figure 1, the terms that we are interested in are usually in bold or highlighted in some way. This makes it easy to automate the extraction of the glossary terms.
Fig. 1. Example Glossary
We currently have glossaries on 30 different topics in our system. It took comparatively little time to locate the glossaries on the Web, extract terms from them and manually review the results for errors. For example, to build our glossary for baseball, we searched Google™ for the term “baseball glossary.” The first 30 hits of this search returned distinct baseball glossaries. Naturally, there was a good degree of overlap between those glossaries, but some of them contained words rarely found in any of the other glossaries.
We wrote a program that processes HTML files and extracts the words enclosed in HTML tags such as <b> and </b>, which mark bold text. Our program converts the terms to lower case and replaces any symbols with blank spaces. We generate one output file per glossary topic. Our baseball file starts out empty; our program puts the terms from the first Internet baseball glossary into this empty text file. Since there are many baseball glossaries, the terms of each baseball glossary after the first are checked against the baseball file to see whether they have been encountered already. A term is added to the baseball file only if it is new. When the program is finished, the output file is manually reviewed to make sure that it contains one term per line. We also remove any HTML that may have found its way into the glossary file and run a sorting program to alphabetize all glossary files for easier inspection. We have generated glossary files for 30 topic areas. A list of these topic areas can be seen in the left-most column of Table 1.
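The extraction steps described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the program used in our experiments: the class name GlossaryExtractor, the regular expression for bold tags, the definition of a "symbol" as any character that is not a letter, digit or blank, and the collapsing of repeated blanks are all hypothetical choices made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and the regular expression are illustrative assumptions.
class GlossaryExtractor {
    // Matches text enclosed in <b>...</b>, ignoring case and spanning line breaks.
    private static final Pattern BOLD =
        Pattern.compile("<b>(.*?)</b>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Adds every bold term found in html to the topic's term set, skipping
    // terms that are already present. Returns the number of new terms.
    static int addTerms(String html, Set<String> topicTerms) {
        int added = 0;
        Matcher m = BOLD.matcher(html);
        while (m.find()) {
            String term = normalize(m.group(1));
            if (!term.isEmpty() && topicTerms.add(term)) {
                added++;
            }
        }
        return added;
    }

    // Lower-cases the term, replaces symbols with blanks, and collapses
    // repeated blanks (the collapsing step is an assumption).
    static String normalize(String raw) {
        return raw.toLowerCase(Locale.ROOT)
                  .replaceAll("[^\\p{L}\\p{N} ]", " ")
                  .replaceAll(" +", " ")
                  .trim();
    }

    public static void main(String[] args) {
        Set<String> baseball = new LinkedHashSet<>();
        // Two made-up glossary fragments; the second repeats the term "bunt".
        addTerms("<b>Bunt</b>: a soft hit. <b>Double-Play</b>: two outs.", baseball);
        addTerms("<B>bunt</B> <b>Infield Fly</b>", baseball);
        baseball.forEach(System.out::println); // one term per line
    }
}
```

Using a set for each topic makes the "add only if new" check implicit. A sketch like this would still leave markup nested inside bold tags in the output, which is one reason a manual review of the glossary files remains necessary.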
3 Classifying Home Pages
The work described in this paper forms one module of a larger system, which has the purpose of extracting demographic information and interest information from home pages of web users. Many home pages follow a structured format, which may be enforced by a portal site. On those pages it is easy to recognize the interests of a home page owner, because they are prefixed with a keyword such as "Interests:". However, many other home pages contain interests "hidden" in paragraphs of free text. The purpose of the glossary module is to derive one interest for each free-text input home page. In other words, it classifies web pages by interests.
Our system for categorizing home pages is written in Java and is currently set up to use the home pages of Yahoo Geocities members as test data. It uses a web crawler to crawl linked pages, starting at any home page we specify. The crawler can also run through a Yahoo Geocities member page listing, which lists 20 home pages at a time. When our classifier starts, it first loads all glossary files. Each glossary file is hashed into a separate hash table. The web crawler then takes over and visits every page in the member’s site. It extracts the words from each HTML page, including words from the Meta tags. These words are then compared against the glossary hash tables using a sliding window that ranges from one word to seven words in length. When a word or word sequence on the web page matches a term in a glossary hash table, the matching words and the glossary in which they occurred are recorded. At the end of the page, the results are tallied and written to a final output file.
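The sliding-window comparison can be illustrated with the following Java sketch. It is a simplified reconstruction rather than our actual classifier: the class and method names are hypothetical, the tokenizer is a stand-in for the HTML word extraction, and we assume here that every window that matches a glossary term counts as one match, even when windows overlap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the window matching step; all names are assumptions.
class SlidingWindowMatcher {
    static final int MAX_WINDOW = 7; // windows of one to seven words

    // Stand-in tokenizer: lower-cases the text and splits on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // glossaries maps a topic name to the hash set of its lower-case terms.
    // Returns, per topic, the number of windows that matched a glossary term.
    static Map<String, Integer> countMatches(List<String> pageWords,
                                             Map<String, Set<String>> glossaries) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < pageWords.size(); start++) {
            StringBuilder window = new StringBuilder();
            for (int len = 1; len <= MAX_WINDOW && start + len <= pageWords.size(); len++) {
                if (len > 1) {
                    window.append(' ');
                }
                window.append(pageWords.get(start + len - 1));
                String candidate = window.toString();
                for (Map.Entry<String, Set<String>> g : glossaries.entrySet()) {
                    if (g.getValue().contains(candidate)) {
                        counts.merge(g.getKey(), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> glossaries = new HashMap<>();
        glossaries.put("golf", Set.of("hole in one", "bogey", "double bogey"));
        glossaries.put("baseball", Set.of("double", "bunt"));
        List<String> words = tokenize("A hole in one, then a double bogey.");
        System.out.println(countMatches(words, glossaries));
    }
}
```

Because each glossary is held in a hash set, every window lookup takes constant expected time, so the cost per page grows with the number of words times the number of glossaries times the maximum window length.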
The result of the classifier for a given home page consists of a list of pairs ((glossary topic 1, number of word matches 1), (glossary topic 2, number of word matches 2) ....). Ideally, the glossary topic with the largest number of word matches should be identical to the topic of the home page that we are classifying.
4 Experimental Results of Home Page Classification
We ran the classifier on 40 pages from each of 10 Geocities topics. After classifying these 400 pages, we found that the glossary with the largest number of word matches was indeed from the same topic as the home page 44.75% of the time. If we consider it a success when the correct topic appears within the top three or top five topics returned by the classifier, then the result percentages become much better.
The correct interest is in the top three returned topics 72.25% of the time, and in the top five returned topics 84.5% of the time. If no words of the home page match any glossary, “interest could not be determined” is returned. This result was only returned for a page that contained “site under construction.” We consider this a correct interest analysis. Geocities groups its pages by topic, e.g., baseball. However, every once in a while, there is a rogue page, which is stored in a topic area where it does not belong. If a rogue page is from a topic for which we do not have a glossary, then a random result will be returned.
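The evaluation described above can be sketched as follows: given the match counts for one page, find the best topic, determine the rank of the expected topic, and map that rank to one of the labels 1, 3, 5 or W used in the result tables. The class and method names are hypothetical. How ties are ranked is not specified in the text; the sketch assumes that a tie counts against the expected topic.

```java
import java.util.Map;

// Hypothetical evaluation helper; tie handling is an assumption.
class TopKEvaluator {
    static final String UNDETERMINED = "interest could not be determined";

    // Topic with the largest match count, or UNDETERMINED if nothing matched.
    // If several topics share the largest count, which one is returned
    // depends on the iteration order of the map.
    static String bestTopic(Map<String, Integer> counts) {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best == null ? UNDETERMINED : best;
    }

    // Rank of the expected topic: 1 plus the number of other topics whose
    // count is at least as high. Returns -1 if the expected topic has no matches.
    static int rankOf(String expected, Map<String, Integer> counts) {
        int own = counts.getOrDefault(expected, 0);
        if (own == 0) {
            return -1;
        }
        int rank = 1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (!e.getKey().equals(expected) && e.getValue() >= own) {
                rank++;
            }
        }
        return rank;
    }

    // Maps a rank to the table labels: "1", "3", "5", or "W" for wrong.
    static String bucket(int rank) {
        if (rank == 1) {
            return "1";
        }
        if (rank == 2 || rank == 3) {
            return "3";
        }
        if (rank == 4 || rank == 5) {
            return "5";
        }
        return "W";
    }
}
```

For instance, with the five largest counts of page P11 in Table 2 (Golf 30, Caving 7, Tennis 6, Real Estate 6, Quilting 4), rankOf("Golf", counts) is 1 and the page falls into bucket "1".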
Table 1. Golf Pages from Yahoo Geocities Golf Topic
Glossary | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 |
Woodworking | 16 | 15 | 2 | 5 | 83 | 0 | 2 | 8 | 48 | 5 |
Football | 128 | 21 | 2 | 10 | 73 | 1 | 10 | 39 | 84 | 4 |
Soap making | 10 | 0 | 2 | 4 | 0 | 0 | 0 | 1 | 4 | 0 |
Weaving | 15 | 7 | 0 | 5 | 0 | 1 | 0 | 3 | 13 | 0 |
Sewing | 31 | 5 | 4 | 5 | 8 | 1 | 4 | 11 | 23 | 1 |
Scrapbooks | 2 | 2 | 1 | 1 | 13 | 0 | 1 | 0 | 7 | 1 |
Quilting | 8 | 8 | 0 | 4 | 19 | 2 | 2 | 1 | 47 | 0 |
Rubberstamping | 2 | 1 | 0 | 4 | 7 | 0 | 0 | 0 | 2 | 3 |
Baseball | 71 | 30 | 5 | 16 | 39 | 3 | 6 | 29 | 99 | 7 |
Polymer clay | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |
Needlecrafts | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Knitting | 4 | 0 | 4 | 1 | 0 | 0 | 5 | 1 | 8 | 1 |
Jewelry Making | 1 | 5 | 3 | 0 | 1 | 1 | 1 | 2 | 9 | 0 |
Tennis | 135 | 46 | 9 | 23 | 101 | 8 | 14 | 46 | 148 | 6 |
Volleyball | 11 | 11 | 0 | 7 | 14 | 1 | 1 | 6 | 29 | 1 |
Golf | 199 | 110 | 20 | 49 | 291 | 21 | 19 | 96 | 437 | 39 |
Archery | 54 | 16 | 1 | 5 | 4 | 3 | 7 | 12 | 60 | 3 |
Fencing | 27 | 7 | 1 | 1 | 12 | 0 | 2 | 17 | 30 | 2 |
Wine | 56 | 38 | 6 | 12 | 166 | 0 | 6 | 28 | 107 | 4 |
Boxing | 51 | 4 | 0 | 7 | 9 | 1 | 1 | 7 | 21 | 6 |
Ceramics | 8 | 6 | 0 | 2 | 62 | 1 | 0 | 0 | 12 | 0 |
Egg Painting | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Candle Making | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 |
Real Estate | 144 | 26 | 31 | 14 | 108 | 5 | 19 | 57 | 109 | 14 |
Scuba | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Mountain biking | 15 | 6 | 2 | 3 | 7 | 0 | 2 | 2 | 31 | 1 |
Caving | 91 | 15 | 5 | 15 | 80 | 4 | 9 | 16 | 55 | 4 |
Auto Racing | 7 | 9 | 4 | 7 | 21 | 0 | 3 | 14 | 26 | 0 |
Hiking | 4 | 3 | 0 | 2 | 3 | 0 | 2 | 1 | 7 | 3 |
Birding | 24 | 15 | 4 | 0 | 56 | 0 | 5 | 7 | 31 | 1 |
Correct Analysis IN* | 1 | 1 | 3 | 1 | 1 | 1 | 3 | 1 | 1 | 1 |
*Correct analysis IN top 1,3,5 results or Wrong
Tables 1, 2, 3 and 4 show the results of running our system against 40 home pages (P1-P40) from the Golf topic in Yahoo Geocities. In these tables, every number indicates a count of word matches. For example, in Table 1, the first number in the first row indicates that 16 words from the woodworking glossary were found in test page P1. For 32 of these test pages, Golf was the top glossary word match result. For 6 further pages, Golf was among the top 3 glossary word matches. For 1 page, Golf was among the top 5 results, and one page was not a Golf page at all. This wrong page was accessed due to a bad link, which brought us to a standard Yahoo error message: “Sorry, the page you requested was not found.” Such error message pages produce the same pattern every time, and in future work we will scan our results for such patterns and, in turn, return an error message. This would improve the accuracy of our system. We did not apply this correction when analyzing our results; we counted these pages as wrongly categorized.
Table 2. Golf Pages from Yahoo Geocities Golf Topic
Glossary | P11 | P12 | P13 | P14 | P15 | P16 | P17 | P18 | P19 | P20 |
Woodworking | 1 | 4 | 4 | 0 | 7 | 0 | 0 | 2 | 0 | 1 |
Football | 2 | 4 | 2 | 0 | 13 | 0 | 4 | 2 | 3 | 1 |
Soap making | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Weaving | 0 | 0 | 1 | 0 | 4 | 0 | 2 | 1 | 0 | 0 |
Sewing | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
Scrapbooks | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Quilting | 4 | 0 | 1 | 1 | 6 | 0 | 0 | 0 | 1 | 0 |
Rubberstamping | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 |
Baseball | 0 | 2 | 8 | 1 | 19 | 1 | 4 | 3 | 7 | 0 |
Polymer clay | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Needlecrafts | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Knitting | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
Jewelry making | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
Tennis | 6 | 3 | 8 | 0 | 27 | 1 | 5 | 3 | 8 | 0 |
Volleyball | 0 | 0 | 1 | 0 | 10 | 0 | 3 | 0 | 0 | 0 |
Golf | 30 | 8 | 12 | 1 | 83 | 2 | 10 | 6 | 19 | 3 |
Archery | 0 | 0 | 4 | 0 | 9 | 0 | 1 | 1 | 4 | 0 |
Fencing | 0 | 0 | 1 | 0 | 4 | 0 | 1 | 0 | 1 | 0 |
Wine | 0 | 7 | 11 | 0 | 20 | 0 | 3 | 5 | 5 | 2 |
Boxing | 0 | 0 | 0 | 0 | 7 | 0 | 2 | 0 | 0 | 0 |
Ceramics | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
Egg Painting | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Candle Making | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Real Estate | 6 | 7 | 3 | 2 | 10 | 5 | 6 | 2 | 6 | 4 |
Scuba | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Mountain biking | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
Caving | 7 | 3 | 3 | 0 | 10 | 0 | 0 | 1 | 1 | 1 |
Auto Racing | 0 | 1 | 0 | 0 | 3 | 1 | 1 | 0 | 1 | 0 |
Hiking | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 0 |
Birding | 0 | 3 | 1 | 0 | 7 | 1 | 2 | 0 | 2 | 1 |
Correct Analysis IN* | 1 | 1 | 1 | 5 | 1 | W | 1 | 1 | 1 | 3 |
*Correct analysis IN top 1,3,5 results or Wrong
Table 3. Golf Pages from Yahoo Geocities Golf Topic
Glossary | P21 | P22 | P23 | P24 | P25 | P26 | P27 | P28 | P29 | P30 |
Woodworking | 5 | 2 | 38 | 0 | 0 | 10 | 6 | 11 | 2 | 14 |
Football | 10 | 2 | 58 | 0 | 4 | 70 | 6 | 16 | 1 | 4 |
Soap making | 2 | 0 | 18 | 0 | 0 | 7 | 0 | 1 | 0 | 0 |
Weaving | 3 | 0 | 4 | 0 | 0 | 5 | 1 | 1 | 1 | 2 |
Sewing | 0 | 5 | 13 | 0 | 1 | 39 | 2 | 3 | 3 | 1 |
Scrapbooks | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
Quilting | 1 | 1 | 19 | 0 | 0 | 5 | 4 | 2 | 0 | 1 |
Rubberstamping | 2 | 0 | 4 | 0 | 0 | 3 | 1 | 0 | 0 | 22 |
Baseball | 16 | 1 | 58 | 0 | 0 | 38 | 11 | 13 | 4 | 4 |
Polymer clay | 2 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
Needlecrafts | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Knitting | 2 | 0 | 15 | 0 | 0 | 0 | 1 | 3 | 2 | 1 |
Jewelry making | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 2 | 3 |
Tennis | 37 | 8 | 121 | 0 | 4 | 110 | 27 | 17 | 9 | 14 |
Volleyball | 10 | 1 | 31 | 0 | 1 | 17 | 4 | 1 | 1 | 3 |
Golf | 84 | 20 | 636 | 1 | 19 | 122 | 58 | 80 | 14 | 64 |
Archery | 21 | 1 | 19 | 0 | 1 | 22 | 5 | 3 | 1 | 6 |
Fencing | 6 | 0 | 12 | 0 | 1 | 12 | 3 | 2 | 1 | 0 |
Wine | 14 | 7 | 63 | 0 | 5 | 66 | 11 | 19 | 3 | 23 |
Boxing | 7 | 0 | 5 | 0 | 2 | 9 | 0 | 3 | 1 | 1 |
Ceramics | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 |
Egg Painting | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Candle Making | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
Real Estate | 18 | 6 | 166 | 2 | 2 | 24 | 11 | 24 | 14 | 12 |
Scuba | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Mountain biking | 3 | 0 | 7 | 0 | 1 | 10 | 2 | 2 | 1 | 0 |
Caving | 8 | 3 | 21 | 0 | 2 | 22 | 4 | 8 | 3 | 2 |
Auto Racing | 3 | 0 | 40 | 0 | 0 | 20 | 4 | 5 | 2 | 3 |
Hiking | 5 | 0 | 6 | 0 | 0 | 5 | 0 | 1 | 0 | 0 |
Birding | 18 | 0 | 3 | 0 | 2 | 13 | 5 | 5 | 3 | 1 |
Correct Analysis IN* | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 3 | 1 |
Fig. 2. Example Page P11
Page 31 in Table 4 is a David Duval fan page. As one can see in the table, the Golf glossary has 60 word matches, which identifies Golf as the correct home page interest. There are 25 matching words from the Tennis glossary in David Duval's page. Thus, Tennis is a distant second.
Page 11 of Table 2 is shown in Figure 2. It is the personal home page of a Golf professional. Thirty words from the Golf glossary are found in this home page. The home page does not contain that many words from any other glossary. The second-best match is Caving, and there are only seven words of the Caving glossary in this home page. Thus, the classifier correctly recognizes the topic of this page as Golf.
Table 4. Golf Pages from Yahoo Geocities Golf Topic
Glossary | P31 | P32 | P33 | P34 | P35 | P36 | P37 | P38 | P39 | P40 |
Woodworking | 2 | 5 | 2 | 4 | 3 | 1 | 0 | 14 | 3 | 3 |
Football | 11 | 9 | 11 | 0 | 6 | 4 | 4 | 24 | 7 | 8 |
Soap making | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
Weaving | 4 | 3 | 2 | 0 | 1 | 0 | 0 | 6 | 0 | 0 |
Sewing | 2 | 4 | 7 | 1 | 4 | 0 | 2 | 3 | 5 | 2 |
Scrapbooks | 0 | 1 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 3 |
Quilting | 6 | 0 | 2 | 1 | 1 | 0 | 8 | 8 | 0 | 6 |
Rubberstamping | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 2 |
Baseball | 18 | 13 | 18 | 3 | 5 | 3 | 14 | 27 | 0 | 15 |
Polymer clay | 1 | 3 | 2 | 1 | 2 | 0 | 0 | 3 | 0 | 0 |
Needlecrafts | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Knitting | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 1 |
Jewelry making | 1 | 2 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
Tennis | 25 | 27 | 22 | 10 | 8 | 3 | 17 | 42 | 9 | 13 |
Volleyball | 1 | 7 | 5 | 2 | 1 | 1 | 1 | 13 | 0 | 3 |
Golf | 60 | 68 | 34 | 19 | 16 | 14 | 38 | 112 | 36 | 47 |
Archery | 8 | 13 | 4 | 2 | 3 | 3 | 4 | 9 | 2 | 12 |
Fencing | 2 | 1 | 1 | 0 | 2 | 0 | 2 | 11 | 0 | 5 |
Wine | 17 | 13 | 11 | 5 | 9 | 5 | 3 | 49 | 8 | 10 |
Boxing | 2 | 3 | 4 | 0 | 0 | 0 | 0 | 6 | 0 | 3 |
Ceramics | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 2 | 1 |
Egg Painting | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Candle Making | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Real Estate | 14 | 31 | 27 | 4 | 18 | 4 | 20 | 41 | 7 | 21 |
Scuba | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Mountain biking | 11 | 1 | 7 | 1 | 3 | 0 | 0 | 5 | 4 | 2 |
Caving | 5 | 13 | 8 | 3 | 7 | 4 | 3 | 28 | 6 | 16 |
Auto Racing | 3 | 5 | 5 | 2 | 2 | 2 | 3 | 5 | 0 | 4 |
Hiking | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 5 | 2 |
Birding | 8 | 2 | 4 | 1 | 1 | 0 | 2 | 8 | 0 | 11 |
Correct Analysis IN* | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 |
*Correct analysis IN top 1,3,5 results or Wrong
Table 5 summarizes the results of running our system with 30 glossaries on 400 pages from 10 topics. The page counts in each row are not cumulative; only the percentages in the last row are. Column two counts the pages for which the glossary with the largest number of word matches was the correct interest (44.75% of the 400 pages). Column three counts the additional pages for which the correct interest was among the top three returned glossaries, which brings the cumulative share to 72.25% of the 400 pages. Column four counts the additional pages for which the correct interest was among the top five glossary word matches, for a cumulative share of 84.5% of the 400 pages. The last column shows the number of pages whose correct interest was not contained in the top 5 glossary topics returned by our classifier. These results must be considered wrong.
Table 5. Summary of all Results
Topic | Pages In Top 1 | Pages In Top 3 | Pages In Top 5 | Wrong Pages |
Archery | 5 | 14 | 13 | 8 |
Auto Racing | 3 | 15 | 8 | 14 |
Baseball | 10 | 22 | 5 | 3 |
Football | 18 | 11 | 4 | 7 |
Golf | 32 | 6 | 1 | 1 |
Quilting | 21 | 4 | 7 | 8 |
Real Estate | 38 | 2 | 0 | 0 |
Soap Making | 11 | 14 | 5 | 10 |
Tennis | 18 | 17 | 3 | 2 |
Wine | 23 | 5 | 3 | 9 |
Totals of all 400 pages | 179 | 110 | 49 | 62 |
Cumulative Percent | 44.75% | 72.25% | 84.5% | 100% |
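The cumulative percentages in the last row of Table 5 follow directly from the column totals. The short Java sketch below reproduces the arithmetic; the class and method names are hypothetical and serve only as an illustration.

```java
// Hypothetical helper that reproduces the cumulative percentages of Table 5.
class SummaryPercentages {
    // Converts per-column page counts into cumulative percentages of the total.
    static double[] cumulativePercent(int[] columnCounts) {
        int total = 0;
        for (int c : columnCounts) {
            total += c;
        }
        double[] result = new double[columnCounts.length];
        int running = 0;
        for (int i = 0; i < columnCounts.length; i++) {
            running += columnCounts[i];
            result[i] = 100.0 * running / total;
        }
        return result;
    }

    public static void main(String[] args) {
        // Column totals from Table 5: top 1, top 3, top 5, wrong.
        int[] totals = {179, 110, 49, 62};
        for (double p : cumulativePercent(totals)) {
            System.out.printf("%.2f%%%n", p);
        }
    }
}
```

With the totals 179, 110, 49 and 62, the running sums are 179, 289, 338 and 400 pages, which correspond to 44.75%, 72.25%, 84.5% and 100% of the 400 pages.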
5 Language Independence
In a series of experiments, we have found that several of our glossaries are surprisingly flexible when processing web pages written in European languages other than English. We have been successful in correctly categorizing home pages in Spanish, French, Portuguese, Italian, Swedish, Dutch, and Danish. This is possible because many English words are the same in other languages; for example, the glossary terms baseball, sangria, champagne, Chablis, and golf are the same in English, French, German and Italian.
Another reason why we were able to classify non-English pages with English glossaries was that we found many pages in different languages that had English words strewn about them. In many cases this was not because the word is the same in that language, but because the page creator decided to use the English word instead of the equivalent word in their native language. For example, a Spanish page about baseball contained: “autor del único triple play sin asistencia en nuestras Series Nacionales.” We also found home pages that contained a single paragraph, sentence or page in English, while the rest of the site was in a foreign language. This was enough for us to get a correct interest determination. Figure 3 is a Golf page in Spanish. The interest of the page was correctly determined by our system, with 64 glossary word matches. These results are shown in Table 3, for Page P30.
Fig. 3. Sample Foreign Language Page
6 Conclusions and Future Work
In this paper, we have presented a new method for classifying Web pages according to interests. Classic Information Retrieval methods using training sets or Artificial Intelligence methods using knowledge bases are hard to train or build. They require much time and effort. Using free, easily accessible Internet glossaries, we found the correct interest among the top five returned topics 84.5% of the time on a small sample set of 400 pages. In future work, we will analyze links to other pages and use that additional information to improve our answers. There has been successful work in using such links to improve page classification [1,2,4,10]. We also noticed that almost all of the foreign language pages contained links to pages in English, which we will use for improved interest determination for such pages. Another topic for future research is to determine when a page should legitimately be classified as belonging to two or more interests. For instance, we have encountered pages that express an interest in both football and baseball. How can we distinguish between a person who is truly interested in both and a person who is interested in only one, but for whom we get a false positive for the other because baseball and football have many words in common?
Many ambiguous words appear in very different glossaries. For example, “diamond” could be indicative of an interest in baseball or an interest in jewelry. Clearly, such words have less discriminative power than words that appear only in one glossary. In future work, we will use Information Retrieval methods to reduce the weights of such overlapping words. We will also experiment with more powerful categorization methods, such as Naïve Bayes classifiers. Most importantly, we are going to add many more glossaries to our system, in order to determine whether classification results stay at an acceptable level.
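One possible way to reduce the weight of overlapping words is an inverse-document-frequency style weight computed over glossaries instead of documents. The Java sketch below is only an illustration of that idea and is not part of the current system; the weighting formula, weight(term) = ln(N / n(term)) with N glossaries of which n(term) contain the term, and all names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical weighting sketch; not part of the system described in this paper.
class GlossaryTermWeights {
    // For each term, computes ln(N / n), where N is the number of glossaries
    // and n is the number of glossaries that contain the term. A term that
    // appears in every glossary receives weight 0.
    static Map<String, Double> weights(Map<String, Set<String>> glossaries) {
        Map<String, Integer> glossaryCount = new HashMap<>();
        for (Set<String> terms : glossaries.values()) {
            for (String term : terms) {
                glossaryCount.merge(term, 1, Integer::sum);
            }
        }
        int n = glossaries.size();
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : glossaryCount.entrySet()) {
            result.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> glossaries = new HashMap<>();
        glossaries.put("baseball", Set.of("diamond", "bunt"));
        glossaries.put("jewelry making", Set.of("diamond", "carat"));
        glossaries.put("golf", Set.of("bogey"));
        // "diamond" occurs in two glossaries and gets a lower weight than "bunt".
        System.out.println(weights(glossaries));
    }
}
```

Under such a scheme, the per-topic score of a page would be the sum of the weights of the matched terms rather than a plain count of matches.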
References
* This research was supported by the NJ Commission for Science and Technology.