
What is a web crawler? What is it for?

A crawler is a program or script that automatically accesses the Internet and downloads website content. It works much like a robot: it fetches information from other people's websites onto your own computer and then filters, screens, summarizes, and organizes it.

What Web Crawlers Can Do: Data Collection.

A web crawler is a program that automatically downloads web pages. It fetches pages from the World Wide Web for search engines and is an important component of a search engine. A traditional crawler starts from the URLs of one or more seed pages and extracts the URLs found on those pages. As it crawls, it continually extracts new URLs from the current page and places them in a queue, until certain stop conditions of the system are met.
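To make that loop concrete, here is a minimal sketch in Python, using only the standard library, of the queue-driven process described above: seed URLs go into a queue, each page is fetched and parsed for links, and newly found URLs are appended until a stop condition is reached. The page limit used as the stop condition is an illustrative assumption.

```python
# Minimal sketch of the crawl loop described above: start from seed URLs,
# fetch each page, extract new links, and enqueue them until a stop
# condition (here: a simple page limit) is reached.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)   # URL queue
    visited = set()               # URLs already fetched
    pages = {}                    # URL -> raw HTML

    while frontier and len(pages) < max_pages:   # stop condition
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                              # skip unreachable pages
        pages[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)         # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return pages


if __name__ == "__main__":
    results = crawl(["https://example.com/"], max_pages=5)
    print(f"Fetched {len(results)} pages")
```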

Further information:

According to system structure and implementation technology, web crawlers can be roughly divided into the following types: general web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers. A practical crawler system is usually implemented by combining several of these techniques.

General web crawler

A general web crawler, also called a scalable or whole-web crawler, extends its crawl from a set of seed URLs to the entire Web. It mainly collects data for portal search engines and large web service providers; for commercial reasons, the technical details of such systems are rarely disclosed. This kind of crawler covers a huge range and volume of pages, so it demands high crawl speed and large storage space, while placing relatively low demands on the order in which pages are crawled. Because there are so many pages to refresh, general crawlers usually work in parallel, yet refreshing a page still takes a long time. Despite these drawbacks, the general web crawler is well suited to searching a broad range of topics for search engines and has strong practical value.

The structure of a general web crawler can be roughly divided into the following parts: a page fetching module, a page analysis module, a link filtering module, a page database, a URL queue, and an initial URL set. To improve efficiency, general web crawlers adopt certain crawling strategies. The most common are the depth-first strategy and the breadth-first strategy.

1) Depth-first strategy: the basic approach is to follow links in order of increasing depth until no deeper links can be followed. After finishing one crawl branch, the crawler returns to the previous link node and searches its other links; when all links have been traversed, the crawl ends. This strategy suits vertical search or in-site search, but it wastes considerable resources when crawling sites whose page content lies deep.

2) Breadth-first strategy: pages are crawled level by level according to their depth in the directory hierarchy, with pages at shallower levels crawled first. Once all pages at one level have been crawled, the crawler moves down to the next level. This strategy keeps the crawl depth under control, avoids the problem of a crawl never terminating on an infinitely deep branch, and is easy to implement without storing large numbers of intermediate nodes. Its drawback is that it takes a long time to reach pages deep in the directory hierarchy. Both orderings are sketched in the code after this list.
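The two strategies differ only in how the next URL is taken from the frontier: a FIFO queue yields breadth-first order, while a LIFO stack yields depth-first order. The sketch below uses a toy link graph in place of real link extraction; the depth limit is an illustrative assumption.

```python
# Sketch of the two frontier-ordering strategies. The crawl loop is the
# same; only the order in which URLs leave the frontier differs.
# `link_graph` is a stand-in for real link extraction: it maps each URL
# to the links found on that page.
from collections import deque


def traverse(seed, link_graph, strategy="bfs", max_depth=3):
    frontier = deque([(seed, 0)])      # (url, depth) pairs
    visited = set()
    order = []

    while frontier:
        if strategy == "bfs":
            url, depth = frontier.popleft()   # FIFO -> breadth-first
        else:
            url, depth = frontier.pop()       # LIFO -> depth-first
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)
        for link in link_graph.get(url, []):
            frontier.append((link, depth + 1))
    return order


# Toy link structure: "/" links to two section pages, which link deeper.
graph = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}
print(traverse("/", graph, "bfs"))   # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(traverse("/", graph, "dfs"))   # ['/', '/b', '/b/1', '/a', '/a/2', '/a/1']
```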

Focused web crawler

A focused crawler, also known as a topic crawler, selectively crawls pages related to predefined topics. Compared with a general web crawler, a focused crawler only needs to fetch topic-relevant pages, which greatly saves hardware and network resources. Because the number of saved pages is small, they can be refreshed quickly, and the crawler can also satisfy the needs of specific groups of users for information in specific fields.

Compared with a general web crawler, a focused web crawler adds a link evaluation module and a content evaluation module. The key to a focused crawler's strategy is evaluating the importance of page content and links; different evaluation methods compute importance differently, which leads to different orders in which links are visited.
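As one illustration of a link evaluation module, candidate links can be placed in a priority queue ordered by their estimated relevance to the topic, so the most relevant link is crawled next. The keyword-overlap score used here is an assumed, simplified measure, not a method prescribed by any particular system.

```python
# Illustrative link-evaluation module for a focused crawler: candidate
# links are scored against a set of topic keywords and stored in a
# max-priority queue, so the most topic-relevant link is crawled next.
import heapq


def relevance(anchor_text, topic_keywords):
    """Toy evaluation score: fraction of topic keywords found in the anchor text."""
    words = set(anchor_text.lower().split())
    hits = sum(1 for kw in topic_keywords if kw in words)
    return hits / len(topic_keywords)


class FocusedFrontier:
    """URL frontier ordered by topic relevance instead of discovery order."""

    def __init__(self, topic_keywords):
        self.topic = topic_keywords
        self.heap = []                      # (-score, url); negated for max-heap

    def add(self, url, anchor_text):
        score = relevance(anchor_text, self.topic)
        if score > 0:                       # skip links judged off-topic
            heapq.heappush(self.heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self.heap)[1] if self.heap else None


frontier = FocusedFrontier({"crawler", "search", "index"})
frontier.add("/about-us", "About our company")   # off-topic, discarded
frontier.add("/docs/crawler", "How the crawler builds its search index")
print(frontier.next_url())   # '/docs/crawler' - the topic-relevant link comes first
```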

Incremental web crawler

An incremental web crawler updates already downloaded pages incrementally and crawls only new or changed pages, which ensures, to some extent, that the crawled pages stay as fresh as possible. Compared with a crawler that periodically re-crawls and refreshes all pages, an incremental crawler fetches newly generated or updated pages only when necessary and does not re-download unchanged pages. This effectively reduces the volume of downloaded data, keeps the crawled pages up to date, and lowers time and space costs, but it increases the complexity and implementation difficulty of the crawling algorithm. The architecture of an incremental web crawler includes a crawling module, a ranking module, an updating module, a local page set, a set of URLs to crawl, and a set of local page URLs.

An incremental crawler has two goals: keeping the pages stored in the local page set up to date, and improving the quality of the pages in the local page set. To achieve the first goal, the incremental crawler revisits pages to update the content of the local page set. Common approaches are: 1) uniform update: the crawler revisits all pages at the same frequency, regardless of how often they change; 2) individual update: the crawler revisits each page at a frequency based on how often that particular page changes; 3) classification-based update: the crawler divides pages into a fast-changing subset and a slow-changing subset according to how frequently they change, and then visits the two classes at different frequencies.
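A rough sketch of the individual and classification-based update ideas follows: each local page carries an observed change rate, and its revisit interval is derived from that rate. The thresholds and intervals below are illustrative assumptions, not values taken from any specific crawler.

```python
# Sketch of revisit scheduling for an incremental crawler. Each local page
# keeps an observed change rate (changes per day); pages that change often
# get short revisit intervals, slowly changing pages get long ones.
from dataclasses import dataclass


@dataclass
class LocalPage:
    url: str
    change_rate: float      # observed changes per day
    last_crawled: float     # timestamp (in days, for simplicity)


def revisit_interval(page: LocalPage) -> float:
    """Classification-based policy: fast-changing vs. slow-changing pages."""
    if page.change_rate >= 1.0:     # changes at least daily
        return 0.5                  # revisit twice a day
    if page.change_rate >= 0.1:     # changes every few days
        return 2.0
    return 14.0                     # nearly static: revisit every two weeks


def due_for_recrawl(pages, now):
    """Return the pages whose revisit interval has elapsed."""
    return [p.url for p in pages
            if now - p.last_crawled >= revisit_interval(p)]


pages = [
    LocalPage("/news", change_rate=3.0, last_crawled=9.0),
    LocalPage("/blog", change_rate=0.2, last_crawled=9.0),
    LocalPage("/contact", change_rate=0.01, last_crawled=0.0),
]
print(due_for_recrawl(pages, now=10.0))   # ['/news'] - only the fast-changing page is due
```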

To achieve the second goal, the incremental crawler needs to rank pages by importance; commonly used strategies include the breadth-first strategy and the PageRank-first strategy. WebFountain, developed by IBM, is a powerful incremental web crawler. It uses an optimization model to control the crawling process and makes no statistical assumptions about how pages change; instead, it adaptively adjusts the page revisit frequency according to the results of the previous crawl cycle and the actual speed at which pages change. The incremental crawling system of Peking University's Skynet targets domestic web pages; it divides pages into changed pages and newly added pages and applies a different crawling strategy to each. To relieve the performance bottleneck of maintaining the change history of a large number of pages, it exploits the local regularity of page change times and directly crawls pages that have changed many times within a short period. To discover new pages as early as possible, it uses index pages to track them.

Deep web crawler

Web pages can be divided into surface web pages and deep web pages according to how they exist. Surface web pages are pages that traditional search engines can index; they consist mainly of static pages reachable through hyperlinks. Deep web pages are pages whose content mostly cannot be reached through static links because it is hidden behind search forms and can only be obtained when users submit keywords. For example, pages whose content is visible only after user registration belong to the deep web. In 2000, BrightPlanet estimated that the amount of accessible information on the deep web is several hundred times that of the surface web, making it the largest and fastest-growing source of new information on the Internet.
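Since deep web content sits behind search forms, a deep web crawler retrieves it by submitting form data rather than following static links. The sketch below shows the basic mechanism; the form URL and the "q" field name are hypothetical placeholders, not part of any real site.

```python
# Sketch of how a deep web crawler reaches content behind a search form:
# it submits keywords as form data instead of following a static link.
# The URL and the "q" field name are hypothetical placeholders.
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def query_search_form(form_url, keyword):
    """POST a keyword to a search form and return the result page's HTML."""
    data = urlencode({"q": keyword}).encode("utf-8")   # form field -> request body
    request = Request(form_url, data=data, method="POST")
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", "ignore")


# A deep web crawler would loop over a keyword list drawn from its topic,
# store each result page, and extract further links or records from it.
if __name__ == "__main__":
    html = query_search_form("https://example.com/search", "web crawler")
    print(len(html), "bytes of result HTML")
```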
