What are the research results and problems of web crawlers nowadays

"Web crawler" is a loose rendering of terms such as Spider, Robot, and Crawler. It is a high-efficiency information-gathering tool that builds on search engine technology and, with further technical optimization, searches the Internet to crawl and save standardized web page information written in HTML (Hypertext Markup Language).

Its working mechanism is to send a request to a specific site on the Internet, interact with the site once a connection is established, obtain the information in HTML format, then move on to the next site and repeat the process. Through this automated cycle, the target data is saved locally for later use. While visiting a page, a web crawler automatically extracts the addresses of other web pages from the hypertext link tags it encounters, which makes information acquisition efficient, standardized, and fully automatic.
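As a rough illustration of this request-parse-follow cycle, the sketch below uses only the Python standard library to download one page and pull out its hyperlinks. The start URL is a placeholder, and `fetch_links` is a hypothetical helper name, not part of any system described in this article.

```python
# Minimal sketch of the fetch-parse-follow cycle: download one page,
# extract the hyperlinks it contains, and return them as absolute URLs.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_links(url):
    """Download one page and return the absolute URLs it points to."""
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]


if __name__ == "__main__":
    for link in fetch_links("https://example.com/"):  # placeholder start page
        print(link)
```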

As the Internet is applied ever more widely across the economy and society, the amount of information it carries is growing exponentially, and that information is becoming more diverse in form and more global in distribution. Traditional search engine technology can no longer satisfy increasingly refined and specialized needs for information acquisition and processing, and it faces enormous challenges. Since its birth, the web crawler has developed rapidly and become a major research hotspot in information technology. The mainstream crawler search strategies currently in use are described below.

>>>>

Depth-first search strategy

Early crawlers mostly used a depth-first search strategy: within an HTML document, the crawler picks one hyperlink tag and follows it downward until that chain of hyperlinks reaches the bottom layer and the search of that branch ends. It then exits the current level of the loop, returns to the level above, and searches the remaining hyperlink tags there, continuing until all hyperlinks of the initial document have been traversed.

The advantage of the depth-first strategy is that it can search all the information of a single Web site, which makes it especially suitable for deeply nested document sets. The disadvantage is that, as data structures grow more complex, the vertical hierarchy of a site can deepen without limit and cross-references between levels appear, producing infinite loops; the traversal can then only be ended by forcibly terminating the program, and the information already obtained contains so much repetition and redundancy that its quality is hard to guarantee.
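A minimal depth-first sketch is shown below. It reuses the hypothetical `fetch_links` helper from the earlier sketch and adds the two safeguards the paragraph above implies: a visited set to break cross-reference cycles and a depth cap to stop unbounded vertical descent.

```python
# Depth-first crawl sketch: follow each hyperlink as deep as possible before
# backtracking. `fetch_links` is the helper from the previous sketch.
def crawl_depth_first(url, max_depth=3, visited=None):
    if visited is None:
        visited = set()
    if max_depth < 0 or url in visited:
        return visited                      # depth limit reached or cycle detected
    visited.add(url)
    try:
        links = fetch_links(url)
    except OSError:
        return visited                      # unreachable page: skip it
    for link in links:
        crawl_depth_first(link, max_depth - 1, visited)
    return visited
```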

>>>>

Breadth-first search strategy

The breadth-first search strategy is the counterpart of the depth-first strategy: it cycles from the top down, first searching all the hyperlinks in the first level of pages, then moving on to the second level once the first-level traversal is complete, and so on until the bottom level. When all the hyperlinks in a layer have been processed, the hyperlinks collected during that layer's retrieval are used as the seeds for a new round of search on the next layer, so shallow links are always given priority.
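The same traversal expressed breadth-first: a FIFO frontier guarantees that every link of the current level is processed before the next level begins, which is what gives shallow links priority. This sketch again assumes the hypothetical `fetch_links` helper from above.

```python
from collections import deque

# Breadth-first crawl sketch: process every link of the current level before
# descending, so shallower pages are always visited first.
def crawl_breadth_first(seed, max_pages=100):
    visited = {seed}
    frontier = deque([seed])                # FIFO queue of pages to visit
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            links = fetch_links(url)
        except OSError:
            continue
        for link in links:
            if link not in visited:
                visited.add(link)
                frontier.append(link)       # enqueue for a later, deeper level
    return visited
```

Because levels are explored in order, the first time such a traversal reaches a page, it has done so along a shortest hyperlink path from the seed, which is the shortest-path property mentioned in the next paragraph.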

One advantage of this model is that, no matter how complex the vertical structure of the search target is, it largely avoids dead loops; another is that a well-defined algorithm can find the shortest path between two HTML files. In general, most of the features expected of a crawler can be implemented easily with a breadth-first search strategy, which is why it is often considered optimal.

The downside is that, because of the time it consumes, the breadth-first strategy is not well suited to traversing specific sites or deeply nested HTML documents.

>>>>

Focused search strategy

Unlike the depth-first and breadth-first strategies, the focused search strategy accesses data sources on a "match-first" principle: based on a specific matching algorithm, it actively selects and prioritizes documents relevant to the topic of interest, and this guides the subsequent data crawling.

A focused crawler of this type assigns a priority score to the hyperlinks in every page it visits and inserts each link into the crawl queue according to its score, which lets the crawler track the pages with the highest potential match first until it has obtained target information of sufficient quantity and quality. Clearly, the core of the focused search strategy lies in the design of the priority scoring model, that is, in how the value of a link is judged: different scoring models give the same link different scores, and this directly affects the efficiency and quality of information collection.
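One way to read the "priority score plus queue" mechanism described above is as a best-first frontier. The sketch below keeps the frontier in a heap ordered by score; the scoring function is a deliberately naive placeholder (keyword counting over anchor text), not the scoring model of any particular crawler.

```python
import heapq

TOPIC_KEYWORDS = {"crawler", "spider", "search"}        # illustrative topic terms


def score_link(anchor_text):
    """Toy relevance score: count topic keywords in the link's anchor text."""
    words = anchor_text.lower().split()
    return sum(word in TOPIC_KEYWORDS for word in words)


def crawl_best_first(seed_links, max_pages=50):
    """seed_links: iterable of (anchor_text, url) pairs."""
    # heapq is a min-heap, so push negated scores to pop the best link first.
    frontier = [(-score_link(text), url) for text, url in seed_links]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        # Fetch the page, score its outgoing links, and push them back onto
        # the frontier here (omitted for brevity).
    return visited
```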

Under the same mechanism, the scoring model for hyperlink tags extends naturally to the evaluation of whole HTML pages: since every web page is composed of many hyperlink tags, the more valuable its links, the more valuable the page that contains them tends to be. This provides theoretical and technical support for more specialized search and for broader application of search engines. The common focused search strategies at present fall into two kinds, based on "reinforcement learning" and on "context graphs".
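The link-to-page extension can be reduced, for illustration only, to an aggregation rule over link scores; the averaging rule below is an assumption made for this sketch, not a published model.

```python
def score_page(link_scores):
    """Toy page score: average of the priority scores of the links it contains."""
    return sum(link_scores) / len(link_scores) if link_scores else 0.0
```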

In terms of practical adoption, mainstream domestic search platforms currently rely mainly on the breadth-first search strategy, chiefly because in the domestic network environment the vertical value density of information is low while the horizontal value density is high. This, however, inevitably misses web documents with low citation rates: the horizontal value-enrichment effect of the breadth-first strategy causes information sources reached by only a few links to be ignored indefinitely.

This situation can be alleviated by adopting a linear search strategy, in which newer information is continuously introduced into the existing data warehouse and multiple rounds of value judgments decide whether each item should continue to be kept, instead of crudely discarding it and shutting new information out of the loop.
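The "multiple rounds of value judgments" idea can be sketched as a periodic re-scoring pass over stored documents: newly introduced items get a grace period, and only items that stay below a threshold for several consecutive rounds are dropped. Every name, field, and threshold here is a hypothetical illustration, not part of any system described above.

```python
def prune_warehouse(documents, score_fn, threshold=0.2, max_low_rounds=3):
    """Keep a document until it has scored below `threshold` for several rounds.

    `documents` is a list of dicts with keys 'content' and 'low_rounds'
    (hypothetical field names used only for this sketch).
    """
    kept = []
    for doc in documents:
        if score_fn(doc["content"]) < threshold:
            doc["low_rounds"] += 1          # another round of low value
        else:
            doc["low_rounds"] = 0           # value recovered; reset the counter
        if doc["low_rounds"] < max_low_rounds:
            kept.append(doc)                # still worth keeping for now
    return kept
```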

>>>>

Dynamic web page data

Traditional web crawler technology is mainly limited to crawling static page information and its mode is relatively simple. In recent years, as Web 2.0, AJAX, and related technologies have become mainstream, dynamic pages, with their strong interactive capability, have become the primary vehicle of network information and have replaced static pages as the dominant form. AJAX uses a JavaScript-driven asynchronous request and response mechanism to update data continuously without refreshing the page as a whole, whereas traditional crawler technology lacks an interface to JavaScript semantics and interactive capability; it therefore cannot trigger the asynchronous calls of such no-refresh dynamic pages, cannot parse the returned content, and cannot save the required information.
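In practice, one common workaround for this problem is to bypass the HTML shell entirely and call the same asynchronous endpoint the page's JavaScript would call, then parse the JSON response. The endpoint URL, header, and response shape below are purely hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical AJAX endpoint that the page's JavaScript would call asynchronously.
API_URL = "https://example.com/api/items?page=1"


def fetch_ajax_data(url):
    """Request the JSON data behind a dynamic page instead of its empty HTML shell."""
    req = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(req, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

# data = fetch_ajax_data(API_URL)   # e.g. the records the page would have rendered
```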

In addition, the various front-end frameworks that encapsulate JavaScript, such as jQuery, adjust the DOM structure heavily, so even the main dynamic content of a page is no longer sent from the server to the client as static markup when the connection is first established; instead it is drawn dynamically through asynchronous calls in continual response to the user's actions. This model greatly improves the user experience and substantially reduces the interaction burden on the server, but it poses a huge challenge to crawlers accustomed to the relatively fixed DOM structure of static pages.

Traditional crawler programs are mainly "protocol-driven", whereas in the Web 2.0 era, in an environment built on AJAX dynamic interaction, a crawler engine must be "event-driven" to obtain a steady flow of data from the server. To achieve event-driven crawling, the program must solve three technical problems: first, the analysis and interpretation of JavaScript interactions; second, the handling and dispatch of DOM events; and third, the semantic extraction of dynamic DOM content.
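One widely used way to obtain an event-driven view of a page (a general technique, not specific to any system mentioned in this article) is to drive a real browser engine so the JavaScript runs and the DOM is fully built before extraction. A minimal sketch using the third-party Selenium package, assuming a local headless Chrome and matching driver are installed:

```python
# Sketch of event-driven extraction: let a headless browser execute the page's
# JavaScript, then read the resulting DOM. Requires the `selenium` package and
# a compatible Chrome/chromedriver installation.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def fetch_rendered_html(url):
    options = Options()
    options.add_argument("--headless")          # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)                         # triggers the page's async calls
        return driver.page_source               # DOM after JavaScript has run
    finally:
        driver.quit()
```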

The ForeSpider data collection system developed by Qianxiu (前嗅) supports all kinds of dynamic websites, and most of them can be collected through its visual interface. Websites with strict anti-crawler mechanisms can be collected easily with a small amount of code in ForeSpider's built-in scripting language.

>>>>

Distributed data collection

A distributed crawler system is a crawler system running on a cluster of computers. The crawler program on each node works in the same way as a centralized crawler; the difference is that the distributed system must coordinate task division, resource allocation, and information integration across the different machines. One computer in the cluster hosts the master node, which invokes the local centralized crawlers to do the work; on this basis, the exchange of information between nodes becomes critical, so the key to a successful distributed crawler system lies in whether task coordination can be designed and implemented well.

In addition, the underlying hardware and communication network are also very important. Because multiple nodes crawl pages in parallel and resources can be allocated dynamically, a distributed crawler system is far more efficient at searching than a centralized one.

After continuous evolution, the various distributed crawler systems each have their own characteristics in system composition, and their working mechanisms and storage structures keep being innovated, but mainstream distributed crawler systems generally adopt a "master-slave" internal structure: a master node controls the other, slave, nodes as they crawl information, handling the division of labor, resource allocation, and information integration.
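The master-slave division of labor can be illustrated with the simplest possible assignment rule: the master hashes each URL's host and assigns it to a fixed slave, so the same site is always crawled by the same node. The worker count and assignment rule are illustrative assumptions, not a description of any particular system.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def assign_to_slaves(urls, num_slaves=4):
    """Master-side sketch: partition URLs across slave nodes by host hash."""
    assignments = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        slave_id = int(digest, 16) % num_slaves   # same host -> same slave node
        assignments[slave_id].append(url)
    return assignments

# Each slave node would then run an ordinary centralized crawler over its share
# and report the results back to the master for integration.
```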

In terms of how they work, distributed crawler systems make wide use of cloud computing, exploiting the cheapness and efficiency of cloud platforms to reduce costs and to cut massively the expense of building the hardware and software platform. In terms of storage, the more popular approach at present is distributed information storage, that is, storing files in a distributed network file system, which makes it easier to manage data across multiple nodes. The distributed file system usually used is the Hadoop-based HDFS.
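As a hedged sketch of what storing crawl output on HDFS can look like from the crawler side, the snippet below assumes the third-party Python `hdfs` (WebHDFS) client package; the NameNode address, user name, and target path are placeholders.

```python
# Requires the third-party `hdfs` package (a WebHDFS client); the NameNode
# address, user name, and target path below are placeholders.
from hdfs import InsecureClient


def store_page(url, html, client):
    """Write one crawled page into the distributed file system."""
    # Derive a flat file name from the URL; real systems use richer layouts.
    file_name = url.replace("://", "_").replace("/", "_") + ".html"
    client.write(f"/crawl/pages/{file_name}", data=html,
                 overwrite=True, encoding="utf-8")

# client = InsecureClient("http://namenode:9870", user="crawler")   # placeholder
# store_page("https://example.com/", "<html>...</html>", client)
```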

At present, most general-purpose visual crawlers on the market sacrifice performance for ease of visualization. The ForeSpider crawler does not: it is written in C++ and collects more than 5 million records per day on an ordinary desktop and more than 40 million records per day on a server, more than ten times the rate of other visual crawlers on the market. ForeSpider also embeds the self-developed ForeLib database, which supports storage of more than 10 million records free of charge.

>>>>

General-purpose and theme-based crawlers

Based on the type of collection target, web crawlers can be categorized as "general-purpose web crawlers" and "theme-based web crawlers".

A general-purpose web crawler focuses on collecting data at larger scale and over a wider scope; it does not consider the order in which pages are collected or how well the target pages match a theme. In the current context of exponential growth in network information, the usefulness of general-purpose crawlers is limited by collection speed, information value density, and the degree of specialization of the information.

Theme-based web crawlers were born to ease this situation. Unlike general-purpose crawlers, they focus on how well the collected web page information matches the target theme and avoid irrelevant, redundant information; this screening is dynamic and runs through the entire workflow of theme-based crawling.
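The dynamic screening step of a theme-based crawler can be reduced, for illustration, to a relevance check on each fetched page. The keyword set and threshold below are placeholders, not a real topic model.

```python
TOPIC_TERMS = {"crawler", "spider", "indexing", "search"}   # placeholder topic


def is_relevant(page_text, threshold=3):
    """Crude topical filter: keep a page only if it mentions enough topic terms."""
    words = page_text.lower().split()
    hits = sum(word.strip('.,;:"()') in TOPIC_TERMS for word in words)
    return hits >= threshold
```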

At present, the general-purpose crawlers on the market have limited collection ability and struggle with pages whose structure is complex. The ForeSpider crawler is a general-purpose web crawler that can collect almost 100% of web pages; it has built-in support for visual filtering, regular expressions, scripts, and other kinds of filtering, so irrelevant redundant content can be filtered out completely and content can be filtered by condition. Compared with a theme-based crawler, which can only collect one class of websites, a general-purpose crawler has a wider collection range and is more economical and practical.