A Model of Vertical Crawler Based on Hidden Markov Chain

Ye Hu, Jun Tu, Wangyu Tong


The large size and dynamic nature of the Web make it necessary to continually maintain Web-based information retrieval systems. To collect more target pages while visiting few irrelevant ones, a web crawler usually adopts a heuristic search strategy that ranks URLs by importance and visits the more important pages first. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers to harvest application-specific or topic-specific collections. This paper uses the learning ability of the Hidden Markov Model (HMM) to address the problem of topic drift in the crawler, and obtains a certain improvement.
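The heuristic strategy described in the abstract, ranking URLs by importance and visiting the most important first, can be illustrated with a priority-queue frontier. The following is only a minimal sketch and not the authors' implementation: the `Frontier` class, the `crawl_order` helper, the in-memory `links` graph, and the `score` function are all hypothetical names introduced for illustration. In an HMM-based crawler, the score would presumably come from the trained model; here it is a plain lookup.

```python
import heapq


class Frontier:
    """Best-first URL frontier: the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []     # entries are (-score, insertion_order, url)
        self._seen = set()  # URLs already queued, to avoid duplicates
        self._count = 0     # tie-breaker so equal scores pop in FIFO order

    def push(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def crawl_order(seeds, links, score):
    """Return the visit order over a hypothetical in-memory link graph.

    `links` maps a URL to its outgoing URLs; `score` maps a URL to an
    importance estimate (an assumption standing in for a learned model).
    """
    frontier = Frontier()
    for url in seeds:
        frontier.push(url, score(url))
    order = []
    while len(frontier):
        url, _ = frontier.pop()
        order.append(url)
        for out in links.get(url, []):
            frontier.push(out, score(out))
    return order


# Toy link graph and scores, invented purely for illustration.
links = {"a": ["b", "c"], "b": ["d"], "c": []}
scores = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}
order = crawl_order(["a"], links, scores.get)
```

In this toy graph, page `c` is visited before page `b` because its score is higher, even though both were discovered at the same time.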

DOI: http://dx.doi.org/10.12928/telkomnika.v12i4.981

Copyright (c) 2014 Universitas Ahmad Dahlan

TELKOMNIKA Telecommunication, Computing, Electronics and Control
ISSN: 1693-6930, e-ISSN: 2302-9293

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.