For the Baidu search engine, a "spider black hole" is a website that, at very low cost, generates a huge number of over-parameterized dynamic URLs whose parameters differ but whose content is essentially the same. Like an endless loop, this "black hole" traps the spider, and Baiduspider wastes large amounts of crawl resources fetching worthless pages.

For example, many websites offer filtering functions, and the pages produced by those filters are often crawled heavily by search engines even though a large share of them have little search value. Take a filter such as "rentals priced between 500 and 1000": first, the site (and the real world) has essentially no matching listings; second, neither the site's own users nor search engine users actually search this way. Having such pages crawled in bulk only eats into the site's limited crawl quota. So how can this situation be avoided?

Consider a group-buying site in Beijing as an example of how robots rules can be used to skillfully sidestep this spider black hole. For ordinary filter result pages, the site uses static links, such as:

http://bj.XXXXX.com/category/zizhucan/weigongcun

For the same filter result page, choosing a different sort order generates a dynamic link with different parameters, and even the same sort order (for example, descending by sales) produces different parameters each time. For example:

http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek
http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqi5c

For this group-buying site, it is enough to let search engines crawl the filter result pages themselves; all of the parameterized sort result pages can be refused to search engines through robots rules. Its robots.txt file contains the rule Disallow: /*?*, which forbids search engines from accessing any dynamic page on the site. In this way the site puts its high-quality pages in front of Baiduspider first and blocks the low-quality ones, providing a more spider-friendly site structure and avoiding the formation of a black hole.
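As a minimal sketch of what such a robots.txt might look like, built around the single rule the article cites: the User-agent line below is an assumption (the article's rule is not tied to one crawler), and wildcard matching with * must be supported by the target search engine, as it is by Baiduspider.

User-agent: *
Disallow: /*?*

With this in place, static filter pages such as /category/zizhucan/weigongcun remain crawlable, while any URL containing a query string (the parameterized sort pages) is blocked. Blocking all dynamic URLs this way is only safe when, as on this site, no valuable content is reachable exclusively through parameterized links.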