Using Internet Glossaries to Determine Interests from Home Pages

Edwin Portscher,¹ James Geller¹ and Richard Scherl²

¹New Jersey Institute of Technology, Newark, New Jersey 07102

²Monmouth University, West Long Branch, New Jersey 07764

Abstract. There are millions of home pages on the web. Each page contains valuable data about the page’s owner that can be used for marketing purposes. To be useful, these pages have to be classified according to interests. The traditional Information Retrieval approach requires large training sets that are classified by human experts. Knowledge-based methods, which use handcrafted rules, require a significant investment to develop the rule base. Both of these approaches are very time-consuming. We use glossaries, which are freely available on the Internet, to determine interests from home pages. Processing of these glossaries can be automated and requires little human effort and time compared to the other two approaches. Once the terms have been extracted from these glossaries, they can be used to infer interests from the home pages of web users. This paper describes the system we have developed for classifying home pages by interests. In an experiment on 400 pages, we found that the glossary with the highest number of word matches corresponds to the correct interest for 44.75% of the pages. The correct interest is among the top three returned interests for 72.25% of the pages, and among the top five returned interests for 84.5% of the pages.^*

1 Introduction

Much work has been published on web page classification in the fields of Information Retrieval and Artificial Intelligence. There are many learning methods for classifying web pages [2,8,9]. One type of learning method is supervised learning, e.g., Nearest Neighbor Learners, Bayesian Learners and Discriminative Classification methods such as SVM. There are also unsupervised and semi-supervised learning methods, where an algorithm determines the similarity between documents. Some of these techniques make use of features specific to web pages [11]. Handcrafted rule-based methods and Inductive Learning Methods for text classification have also been developed [8,12,13]. Some approaches analyze the structure of the web pages and the characteristics of the images in them [6]. There are also knowledge-based, Artificial Intelligence approaches [5] for home page classification.

Our approach is unique in that we use glossary information available on the web to categorize web pages. We do so without the training sets that would be needed in Information Retrieval and many Machine Learning approaches. We do not need sophisticated Natural Language Processing (NLP) methods or complex knowledge bases as in knowledge-based approaches [12]. We infer an interest by using features specific to web pages, namely the occurrence of terms that are specific to the glossary for a particular interest.

2 Extracting Glossaries from the Web

Our classification system uses terms mined from Internet glossaries (for example, Figure 1) to determine an interest from a home page. These glossaries are freely available on the Internet, and they exist for a very wide range of topics. They are also very easy to find: a simple Google™ search reveals many results. We have found that glossaries are also easy to process, because they tend to have regular structures. As can be seen from the sample glossary in Figure 1, the terms that we are interested in are usually in bold or highlighted in some way. This makes it easy to automate the extraction of the glossary terms.

Fig. 1. Example Glossary

We currently have glossaries on 30 different topics in our system. It took comparatively little time to locate the glossaries on the Web, extract terms from them and manually review the results for errors. For example, to build our glossary for baseball, we searched Google™ for the term “baseball glossary.” The first 30 hits of this search returned distinct baseball glossaries. Naturally, there was a good degree of overlap between those glossaries, but some of them contained words rarely found in any of the other glossaries. As noted above, the glossary terms are usually in bold or highlighted in some way.

We wrote a program to process HTML files and extract words from within HTML tags, such as <b> </b>, which mark a bold word. Our program converts the terms to lower case. Any symbols that occur are replaced by blank spaces. We generate one output file per glossary topic. Our baseball file starts out empty; our program puts the terms from the first Internet baseball glossary into the empty text file. Since there are many baseball glossaries, for each baseball glossary after the first, the terms are checked against the baseball file to see whether they have been encountered already. A term is added to the baseball file only if it is new. When the program is finished, the output file is manually reviewed to make sure that we have one term per line. We also remove any HTML that may have found its way into the glossary file and run a sorting program to alphabetize all glossary files for easier inspection. We have generated glossary files for 30 topic areas. A list of these topic areas can be seen in the left-most column of Table 1.
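The extraction steps described above can be illustrated with a minimal sketch. It is not the authors' program: it is written in Python rather than the Java of the actual system, and the names BoldTermParser, normalize and add_glossary, as well as the two HTML snippets, are hypothetical. It assumes that glossary terms appear inside <b> or <strong> tags and that "symbols" means any character that is neither a letter, a digit, nor whitespace.

```python
import re
from html.parser import HTMLParser


class BoldTermParser(HTMLParser):
    """Collects the text found inside <b>...</b> or <strong>...</strong>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of bold tags
        self.buffer = []    # text pieces of the term currently being read
        self.terms = []     # raw terms collected so far

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.terms.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def normalize(term):
    """Lower-case a term and replace symbols with blank spaces."""
    term = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(term.split())  # collapse runs of whitespace


def add_glossary(html, glossary):
    """Add the terms of one glossary page that are not yet in the topic set."""
    parser = BoldTermParser()
    parser.feed(html)
    parser.close()
    for raw in parser.terms:
        term = normalize(raw)
        if term and term not in glossary:  # only new terms are added
            glossary.add(term)
    return glossary


# Hypothetical glossary pages for one topic; the topic file starts out empty.
baseball = set()
add_glossary("<p><b>Bunt:</b> a soft hit. <b>Double-Play</b> two outs.</p>", baseball)
add_glossary("<dl><dt><b>bunt</b></dt><dt><b>Infield Fly</b></dt></dl>", baseball)

# One term per line, alphabetized for easier inspection.
for term in sorted(baseball):
    print(term)
```

A set makes the "already encountered" check implicit; the explicit test is kept only to mirror the description in the text. The manual review step is not modeled.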

3 Classifying Home Pages

The work described in this paper forms one module of a larger system, which has the purpose of extracting demographic information and interest information from home pages of web users. Many home pages follow a structured format, which may be enforced by a portal site. On those pages it is easy to recognize the interests of a home page owner, because they are prefixed with a keyword such as "Interests:". However, many other home pages contain interests "hidden" in paragraphs of free text. The purpose of the glossary module is to derive one interest for each free-text input home page. In other words, it classifies web pages by interests.

Our system for categorizing home pages is written in Java and is currently set up to use Yahoo Geocities members' home pages as test data. It uses a web crawler to crawl linked pages, starting at any home page we specify. The crawler can also run through a Yahoo Geocities member page listing, which lists 20 home pages at a time. When our classifier starts, it first loads all glossary files. Each glossary file is hashed into a different hash table. The web crawler then takes over and visits every page in the member's site. It extracts the words from the HTML page, including words from the Meta tags. These words are then compared against the glossary hash tables using a sliding window whose length ranges from one word to seven words. When a match between a word or phrase on the web page and a term in a glossary hash table occurs, the matched words and the glossary that they occurred in are recorded. At the end of the page the results are tallied and written to a final output file.

The result of the classifier for a given home page consists of a list of pairs ((glossary topic 1, number of word matches 1), (glossary topic 2, number of word matches 2), ...). Ideally, the glossary topic with the largest number of word matches should be identical to the topic of the home page that we are classifying.
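The matching step can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' Java code: the names tokenize, count_matches and classify are hypothetical, each glossary hash table is modeled as a Python set, the tiny glossaries are invented for the example, and every window that matches a glossary term is assumed to count as one match (the text does not say how multi-word matches are counted).

```python
import re
from collections import Counter

MAX_WINDOW = 7  # longest window, in words, as stated in the text


def tokenize(text):
    """Split page text into lower-case words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_matches(words, glossaries):
    """Return a Counter mapping each topic to its number of glossary matches."""
    counts = Counter()
    for start in range(len(words)):
        for size in range(1, MAX_WINDOW + 1):
            if start + size > len(words):
                break
            phrase = " ".join(words[start:start + size])
            for topic, terms in glossaries.items():
                if phrase in terms:  # hash lookup in this topic's glossary
                    counts[topic] += 1
    return counts


def classify(text, glossaries):
    """Return (topic, match count) pairs, largest count first."""
    counts = count_matches(tokenize(text), glossaries)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Invented miniature glossaries, for illustration only.
glossaries = {
    "golf": {"birdie", "hole in one", "putt", "tee"},
    "baseball": {"home run", "bunt", "diamond"},
    "tennis": {"ace", "love"},
}
page = "I love golf: my first hole in one came after a birdie putt."
result = classify(page, glossaries)
print(result)
```

In this example the word "love" produces a spurious Tennis match, which is the kind of noise that the top-three and top-five criteria in the next section tolerate.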

4 Experimental Results of Home Page Classification

We ran the classifier on 40 pages from each of 10 Geocities topics. After classifying these 400 pages, we found that the glossary with the largest number of word matches was indeed from the same topic as the home page 44.75% of the time. If we consider it a success when the correct topic appears within the top three or top five topics returned by the classifier, then the result percentages become much better.

The correct interest is in the top three returned topics 72.25% of the time, and in the top five returned topics 84.5% of the time. If no words of the home page match any glossary, “interest could not be determined” is returned. This result was only returned for a page that contained “site under construction.” We consider this a correct interest analysis. Geocities groups its pages by topic, e.g., baseball. However, every once in a while, there is a rogue page, which is stored in a topic area where it does not belong. If a rogue page is from a topic for which we do not have a glossary, then a random result will be returned.
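The top-one, top-three and top-five criteria can be made concrete with a short Python sketch. The function name rank_band is hypothetical, and the tie-breaking rule is an assumption: the paper does not say how ties are resolved, so the sketch ranks the correct topic below any topic with an equal count. The sample counts are the largest entries for page P3 in Table 1 and the non-zero entries for page P14 in Table 2.

```python
def rank_band(counts, correct_topic):
    """Return "1", "3" or "5" if the correct topic is in that band, else "W"."""
    if not any(counts.values()):
        return "interest could not be determined"
    target = counts.get(correct_topic, 0)
    if target == 0:
        return "W"
    # Assumed pessimistic rank: topics with an equal count rank ahead.
    rank = sum(1 for count in counts.values() if count >= target)
    for band in (1, 3, 5):
        if rank <= band:
            return str(band)
    return "W"


# Four largest counts for page P3 (Table 1).
p3 = {"Real Estate": 31, "Golf": 20, "Tennis": 9, "Wine": 6}
# All non-zero counts for page P14 (Table 2).
p14 = {"Real Estate": 2, "Quilting": 1, "Rubberstamping": 1, "Baseball": 1, "Golf": 1}

print(rank_band(p3, "Golf"))
print(rank_band(p14, "Golf"))
```

With this tie rule the sketch yields the same bands as the last rows of Tables 1 and 2 for these two pages, but that agreement does not establish that the authors used the same rule.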

Table 1. Golf Pages from Yahoo Geocities Golf Topic

	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10
Woodworking	16	15	2	5	83	0	2	8	48	5
Football	128	21	2	10	73	1	10	39	84	4
Soap making	10	0	2	4	0	0	0	1	4	0
Weaving	15	7	0	5	0	1	0	3	13	0
Sewing	31	5	4	5	8	1	4	11	23	1
Scrapbooks	2	2	1	1	13	0	1	0	7	1
Quilting	8	8	0	4	19	2	2	1	47	0
Rubberstamping	2	1	0	4	7	0	0	0	2	3
Baseball	71	30	5	16	39	3	6	29	99	7
Polymer clay	1	1	1	0	1	0	0	0	3	0
Needlecrafts	0	0	0	1	0	0	0	0	0	0
Knitting	4	0	4	1	0	0	5	1	8	1
Jewelry Making	1	5	3	0	1	1	1	2	9	0
Tennis	135	46	9	23	101	8	14	46	148	6
Volleyball	11	11	0	7	14	1	1	6	29	1
Golf	199	110	20	49	291	21	19	96	437	39
Archery	54	16	1	5	4	3	7	12	60	3
Fencing	27	7	1	1	12	0	2	17	30	2
Wine	56	38	6	12	166	0	6	28	107	4
Boxing	51	4	0	7	9	1	1	7	21	6
Ceramics	8	6	0	2	62	1	0	0	12	0
Egg Painting	0	0	0	1	0	0	0	0	0	0
Candle Making	0	3	0	0	0	0	1	0	3	0
Real Estate	144	26	31	14	108	5	19	57	109	14
Scuba	0	0	0	0	0	0	0	0	0	0
Mountain biking	15	6	2	3	7	0	2	2	31	1
Caving	91	15	5	15	80	4	9	16	55	4
Auto Racing	7	9	4	7	21	0	3	14	26	0
Hiking	4	3	0	2	3	0	2	1	7	3
Birding	24	15	4	0	56	0	5	7	31	1
Correct Analysis IN*	1	1	3	1	1	1	3	1	1	1

*Correct analysis in the top 1, 3 or 5 results, or Wrong (W)

Tables 1, 2, 3 and 4 show the results of running our system against 40 home pages (P1-P40) from the Golf topic in Yahoo Geocities. In these tables, every number indicates a count of word matches. For example, in Table 1, the first number in the first row indicates that 16 words from the woodworking glossary were found in test page P1. For 32 of these test pages, Golf was the top glossary word match result. For 6 pages, Golf was in the top 3 highest glossary word matches. For 1 page, Golf was in the top 5 results, and one page was not a Golf page at all. This wrong page (P16 in Table 2, marked W) was accessed due to a bad link, which brought us to a standard Yahoo error message “Sorry, the page you requested was not found.” Such error message pages produce the same pattern every time, and in future work we will scan our results for such patterns and, in turn, return an error message. This would improve the accuracy of our system. We did not apply this correction when analyzing our results; we counted such pages as wrongly categorized.

Table 2. Golf Pages from Yahoo Geocities Golf Topic

	P11	P12	P13	P14	P15	P16	P17	P18	P19	P20
Woodworking	1	4	4	0	7	0	0	2	0	1
Football	2	4	2	0	13	0	4	2	3	1
Soap making	0	0	0	0	0	0	0	0	0	0
Weaving	0	0	1	0	4	0	2	1	0	0
Sewing	0	1	0	0	1	1	1	0	1	0
Scrapbooks	0	0	1	0	0	0	0	1	0	0
Quilting	4	0	1	1	6	0	0	0	1	0
Rubberstamping	0	1	0	1	1	0	0	1	0	2
Baseball	0	2	8	1	19	1	4	3	7	0
Polymer clay	1	0	0	0	0	0	0	0	0	0
Needlecrafts	0	0	0	0	0	0	0	0	0	0
Knitting	1	1	0	0	1	1	0	0	0	0
Jewelry making	0	0	0	0	0	0	0	0	1	3
Tennis	6	3	8	0	27	1	5	3	8	0
Volleyball	0	0	1	0	10	0	3	0	0	0
Golf	30	8	12	1	83	2	10	6	19	3
Archery	0	0	4	0	9	0	1	1	4	0
Fencing	0	0	1	0	4	0	1	0	1	0
Wine	0	7	11	0	20	0	3	5	5	2
Boxing	0	0	0	0	7	0	2	0	0	0
Ceramics	0	2	0	0	2	0	0	0	0	1
Egg Painting	0	0	0	0	0	0	0	0	0	0
Candle Making	0	0	1	0	0	0	1	1	0	0
Real Estate	6	7	3	2	10	5	6	2	6	4
Scuba	0	0	0	0	0	0	0	0	0	0
Mountain biking	0	1	0	0	4	0	0	0	0	0
Caving	7	3	3	0	10	0	0	1	1	1
Auto Racing	0	1	0	0	3	1	1	0	1	0
Hiking	0	0	0	0	3	0	1	0	0	0
Birding	0	3	1	0	7	1	2	0	2	1
Correct Analysis IN*	1	1	1	5	1	W	1	1	1	3

*Correct analysis in the top 1, 3 or 5 results, or Wrong (W)

Table 3. Golf Pages from Yahoo Geocities Golf Topic

	P21	P22	P23	P24	P25	P26	P27	P28	P29	P30
Woodworking	5	2	38	0	0	10	6	11	2	14
Football	10	2	58	0	4	70	6	16	1	4
Soap making	2	0	18	0	0	7	0	1	0	0
Weaving	3	0	4	0	0	5	1	1	1	2
Sewing	0	5	13	0	1	39	2	3	3	1
Scrapbooks	0	0	6	0	0	0	0	0	0	2
Quilting	1	1	19	0	0	5	4	2	0	1
Rubberstamping	2	0	4	0	0	3	1	0	0	22
Baseball	16	1	58	0	0	38	11	13	4	4
Polymer clay	2	0	1	0	0	3	0	0	0	0
Needlecrafts	0	0	0	0	0	0	0	0	0	0
Knitting	2	0	15	0	0	0	1	3	2	1
Jewelry making	0	0	2	0	0	0	1	0	2	3
Tennis	37	8	121	0	4	110	27	17	9	14
Volleyball	10	1	31	0	1	17	4	1	1	3
Golf	84	20	636	1	19	122	58	80	14	64
Archery	21	1	19	0	1	22	5	3	1	6
Fencing	6	0	12	0	1	12	3	2	1	0
Wine	14	7	63	0	5	66	11	19	3	23
Boxing	7	0	5	0	2	9	0	3	1	1
Ceramics	7	0	0	0	0	1	0	2	0	0
Egg Painting	0	0	0	0	0	0	0	0	1	1
Candle Making	0	0	3	0	0	0	0	0	0	2
Real Estate	18	6	166	2	2	24	11	24	14	12
Scuba	0	0	0	0	0	0	0	0	0	0
Mountain biking	3	0	7	0	1	10	2	2	1	0
Caving	8	3	21	0	2	22	4	8	3	2
Auto Racing	3	0	40	0	0	20	4	5	2	3
Hiking	5	0	6	0	0	5	0	1	0	0
Birding	18	0	3	0	2	13	5	5	3	1
Correct Analysis IN*	1	1	1	3	1	1	1	1	3	1

Fig. 2. Example Page P11

Page 31 in Table 4 is a David Duval fan page. As one can see in the table, the Golf glossary has 60 word matches, which identifies Golf as the correct home page interest. There are 25 matching words from the Tennis glossary in David Duval's page. Thus, Tennis is a distant second.

Page 11 of Table 2 is shown in Figure 2. It is the personal home page of a Golf professional. Thirty words from the Golf glossary are found in this home page. The home page does not contain that many words from any other glossary. The second-best match is Caving, and there are only seven words of the Caving glossary in this home page. Thus, the classifier correctly recognizes the topic of this page as Golf.

Table 4. Golf Pages from Yahoo Geocities Golf Topic

	P31	P32	P33	P34	P35	P36	P37	P38	P39	P40
Woodworking	2	5	2	4	3	1	0	14	3	3
Football	11	9	11	0	6	4	4	24	7	8
Soap making	0	3	2	0	0	0	0	1	0	2
Weaving	4	3	2	0	1	0	0	6	0	0
Sewing	2	4	7	1	4	0	2	3	5	2
Scrapbooks	0	1	0	0	0	0	10	0	0	3
Quilting	6	0	2	1	1	0	8	8	0	6
Rubberstamping	0	0	0	0	1	0	0	2	2	2
Baseball	18	13	18	3	5	3	14	27	0	15
Polymer clay	1	3	2	1	2	0	0	3	0	0
Needlecrafts	0	0	0	0	0	0	0	0	0	0
Knitting	1	1	1	0	0	1	0	2	0	1
Jewelry making	1	2	3	0	1	1	0	0	0	0
Tennis	25	27	22	10	8	3	17	42	9	13
Volleyball	1	7	5	2	1	1	1	13	0	3
Golf	60	68	34	19	16	14	38	112	36	47
Archery	8	13	4	2	3	3	4	9	2	12
Fencing	2	1	1	0	2	0	2	11	0	5
Wine	17	13	11	5	9	5	3	49	8	10
Boxing	2	3	4	0	0	0	0	6	0	3
Ceramics	3	0	1	0	0	0	0	3	2	1
Egg Painting	1	0	1	0	0	0	0	0	0	0
Candle Making	1	0	2	0	1	0	0	0	0	0
Real Estate	14	31	27	4	18	4	20	41	7	21
Scuba	0	0	0	0	0	0	0	1	0	0
Mountain biking	11	1	7	1	3	0	0	5	4	2
Caving	5	13	8	3	7	4	3	28	6	16
Auto Racing	3	5	5	2	2	2	3	5	0	4
Hiking	3	0	2	0	1	0	0	1	5	2
Birding	8	2	4	1	1	0	2	8	0	11
Correct Analysis IN*	1	1	1	1	3	1	1	1	1	1

*Correct analysis in the top 1, 3 or 5 results, or Wrong (W)

Table 5 summarizes the results of running our system with 30 glossaries on 400 pages from 10 topics. The page counts in the four result columns do not overlap: a page counted under "Top 3" was not in the top 1, and a page counted under "Top 5" was not in the top 3. The percentages, in contrast, are cumulative. Column two shows that the glossary with the largest number of word matches was the correct interest for 44.75% of the 400 pages. Column three shows that the correct interest for the page was in the top three returned glossaries for 72.25% of the 400 pages. Column four shows that the correct interest for the page was in the top five highest glossary word matches for 84.5% of the 400 pages. The last column shows the number of pages whose correct interest was not among the top five glossary topics returned by our classifier. These results must be considered wrong.

Table 5. Summary of all Results

	Pages In Top 1	Pages In Top 3	Pages In Top 5	Wrong Pages
Archery	5	14	13	8
Auto Racing	3	15	8	14
Baseball	10	22	5	3
Football	18	11	4	7
Golf	32	6	1	1
Quilting	21	4	7	8
Real Estate	38	2	0	0
Soap Making	11	14	5	10
Tennis	18	17	3	2
Wine	23	5	3	9
Totals (all 400 pages)	179	110	49	62
Cumulative Percent	44.75%	72.25%	84.5%	100%
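The totals and cumulative percentages in Table 5 can be recomputed from the per-topic rows. The following Python sketch only re-derives the published figures; the variable names are hypothetical.

```python
# Per-topic page counts from Table 5: (top 1, top 3, top 5, wrong).
table5 = {
    "Archery": (5, 14, 13, 8),
    "Auto Racing": (3, 15, 8, 14),
    "Baseball": (10, 22, 5, 3),
    "Football": (18, 11, 4, 7),
    "Golf": (32, 6, 1, 1),
    "Quilting": (21, 4, 7, 8),
    "Real Estate": (38, 2, 0, 0),
    "Soap Making": (11, 14, 5, 10),
    "Tennis": (18, 17, 3, 2),
    "Wine": (23, 5, 3, 9),
}

# Column totals over all ten topics.
totals = [sum(column) for column in zip(*table5.values())]
pages = sum(totals)

# Running sum of the columns, expressed as a percentage of all pages.
cumulative = []
running = 0
for total in totals:
    running += total
    cumulative.append(100 * running / pages)

print(totals, pages)
print(cumulative)
```

Each topic row sums to 40 pages, and the cumulative values reproduce the 44.75%, 72.25% and 84.5% reported in the text.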

5 Language Independence

In a series of experiments, we have found that several of our glossaries are surprisingly flexible when processing web pages written in European languages other than English. We have been successful in correctly categorizing home pages in Spanish, French, Portuguese, Italian, Swedish, Dutch, and Danish. This is possible because many English words are the same in other languages; for example, the glossary terms baseball, sangria, champagne, Chablis, and golf are the same in English, French, German and Italian.

Another reason why we were able to classify non-English pages with English glossaries was that we found many pages in different languages that had English words strewn about them. In many cases this was not because the word is the same in that language, but because the page creator decided to use the English word instead of the equivalent word in their native language. For example, a Spanish page about baseball contained the English term “triple play”: “autor del único triple play sin asistencia en nuestras Series Nacionales.” We also found home pages that contained a single paragraph, sentence or page in English, while the rest of the site was in a foreign language. This was enough for us to get a correct interest determination. Figure 3 is a Golf page in Spanish. The interest of the page was correctly determined by our system, with 64 glossary word matches. These results are shown in Table 3, for Page P30.

Fig. 3. Sample Foreign Language Page

6 Conclusions and Future Work

In this paper, we have presented a new method for classifying Web pages according to interests. Classic Information Retrieval methods using training sets and Artificial Intelligence methods using knowledge bases are hard to train or build. They require much time and effort. Using free, easily accessible Internet glossaries, we found that, on a small sample of 400 pages, the correct interest was among the top five returned topics 84.5% of the time. In future work, we will analyze links to other pages and use that additional information to improve our answers. There has been successful work in using such links to improve page classification [1,2,4,10]. We also noticed that almost all of the foreign language pages contained links to pages in English, which we will use for improved interest determination for such pages.

Another topic for future research is to determine when a page should legitimately be classified as belonging to two or more interests. For instance, we have encountered pages that express an interest in both football and baseball. How can we distinguish between a person that is truly interested in both and a person that is interested in only one, but we get a false positive for the other, because baseball and football have many words in common?

Many ambiguous words appear in very different glossaries. For example, “diamond” could be indicative of an interest in baseball or an interest in jewelry. Clearly, such words have less discriminative power than words that appear in only one glossary. In future work, we will use Information Retrieval methods to reduce the weights of such overlapping words. We will also experiment with more powerful categorization methods, such as Naïve Bayes classifiers. Most importantly, we are going to add many more glossaries to our system, in order to determine whether classification results stay at an acceptable level.
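One possible way to reduce the weight of overlapping words is an inverse-document-frequency style weight computed over the glossaries. The paper does not name a specific weighting scheme, so the Python sketch below is only one candidate: the formula log(N / df), the function names term_weights and weighted_scores, and the miniature glossaries are all assumptions made for illustration.

```python
import math


def term_weights(glossaries):
    """Weight each term by log(N / df); df = number of glossaries containing it."""
    n_glossaries = len(glossaries)
    glossary_freq = {}
    for terms in glossaries.values():
        for term in terms:
            glossary_freq[term] = glossary_freq.get(term, 0) + 1
    return {term: math.log(n_glossaries / df) for term, df in glossary_freq.items()}


def weighted_scores(matched_terms, glossaries, weights):
    """Sum the weights of the matched terms for every glossary topic."""
    return {
        topic: sum(weights[term] for term in matched_terms if term in terms)
        for topic, terms in glossaries.items()
    }


# Invented miniature glossaries; "diamond" appears in two of them.
glossaries = {
    "baseball": {"diamond", "bunt", "inning"},
    "jewelry": {"diamond", "carat"},
    "golf": {"birdie", "putt"},
}
weights = term_weights(glossaries)
scores = weighted_scores(["diamond", "bunt"], glossaries, weights)
print(scores)
```

Under this scheme a term that occurs in every glossary receives a weight of zero, so it no longer influences the ranking, while a term unique to one glossary receives the largest weight.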

References

1. G. Attardi, A. Gullí, and F. Sebastiani. Automatic Web page categorization by link and context analysis. In C. Hutchison and G. Lanzarone, editors, Proceedings of THAI-99, 1st European Symposium on Telematics, Hypermedia and Artificial Intelligence, pages 105-119, Varese, IT, 1999.
2. Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake. Using Web Structure for Classifying and Describing Web Pages. Proceedings of WWW-02, International Conference on the World Wide Web, 2002.
3. B. Gelfand, M. Wulfekuler, and W. F. Punch. Automated concept extraction from plain text. In Papers from the AAAI 1998 Workshop on Text Categorization, pages 13-17, Madison, WI, 1998.
4. Johannes Furnkranz. Using Links for Classifying Web-pages. Technical Report OEFAI TR-98-29. Austrian Research Institute for Artificial Intelligence.
5. Hisao Mase. Experiments on Automatic Web Page Categorization for IR System. Technical Report. Department of Computer Science, Stanford University, 1998.
6. Arul Prakash Asirvathan, Kranthi Kumar Ravi. Web Page Classification based on Document Structure. International Institute of Information Technology, 2001.
7. John M. Pierre. Practical Issues for Automated Categorization of Web Sites. ECDL 2000 Workshop on the Semantic Web.
8. C. Apte, F. Damerau, and S. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, pages 233-240, July 1994.
9. H. Yu, K. C.-C. Chang, and J. Han. Heterogeneous Learner for Web Page Classification. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pages 538-545, Maebashi, Japan, December 2002.
10. Shyh-Ming Tai, Chen-Zen Yang, Ing-Xian Chen. Improved Automatic Web-Page Classification by Neighbor Text Percolation. Department of Computer Engineering and Science, Yuan Ze University Kaohsiung, Taiwan, November 23, 2002.
11. Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers: San Francisco, CA, 2003.
12. Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: John Benjamins Publishing Company, 2002.
13. P.J. Hayes and S.P. Weinstein. CONSTRUE/TIS: a system for content-based indexing of a database of news stories. In A. Rappaport and R. Smith, editors, Proceedings of IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence, pages 49-66. AAAI Press, Menlo Park, 1990.

^* This research was supported by the NJ Commission for Science and Technology
