
[Website Building Knowledge] Using robots.txt cleverly to avoid spider black holes

Posted on 10/23/2014 10:44:58 PM

For the Baidu search engine, a "spider black hole" is a site that generates, at very low cost, a large number of dynamic URLs with different parameters but essentially identical content. Like an infinite loop, this "black hole" traps the spider, and Baiduspider ends up wasting large amounts of resources crawling invalid pages.
For example, many websites offer a filtering feature, and the pages generated by filters are often crawled in large numbers by search engines even though a large share of them have little search value. Take "apartments for rent priced between 500-1000": first, the site has essentially no matching listings (nor do such listings exist in reality); second, neither on-site users nor search-engine users actually search this way. When such pages are crawled in bulk, they only consume the site's valuable crawl quota. So how can this be avoided?
Let's take a group-buying website in Beijing as an example and see how it uses robots rules to cleverly avoid this spider black hole:

For normal filter result pages, the site uses static links, such as:
http://bj.XXXXX.com/category/zizhucan/weigongcun
On the same filter result page, selecting different sorting options generates dynamic links with different parameters; even the same sorting criterion (e.g., descending by sales) produces different parameters each time. For example:
http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek
http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqi5c
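The distinction the site relies on is simple: static filter pages have no query string, while the sorting pages carry parameters after a `?`. A minimal sketch of that check (the `is_dynamic` helper is my own illustration, using the example URLs above with the site name masked as in the article):

```python
from urllib.parse import urlparse

def is_dynamic(url: str) -> bool:
    """Return True if the URL carries a query string, i.e. is a 'dynamic' link."""
    return bool(urlparse(url).query)

# Static filter result page: no query string
print(is_dynamic("http://bj.XXXXX.com/category/zizhucan/weigongcun"))
# Parameterized sorting page: query string present
print(is_dynamic("http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek"))
```

The first call returns `False`, the second `True` — which is exactly the line the robots rule below draws.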

For this group-buying site, only the filter result pages are left open to search-engine crawling, while the sorting pages with their various parameters are blocked through robots rules.
       Its robots.txt file contains the rule Disallow: /*?*, which forbids search engines from accessing any dynamic page on the site. In this way, the site exposes its high-quality pages to Baiduspider while blocking the low-quality ones, presenting a friendlier site structure and preventing a black hole from forming.
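To see why `Disallow: /*?*` catches the sorting URLs but not the static filter pages, here is a small sketch that translates a robots path rule into a regex, following the de-facto wildcard extension (`*` matches any character sequence, a trailing `$` anchors the end). The helper names are my own; note that Python's standard `urllib.robotparser` does plain prefix matching and does not implement this wildcard extension, which is why the matching is written out by hand:

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern":
    """Translate a robots.txt path rule with '*' wildcards into a regex."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):          # trailing '$' anchors the end of the path
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def is_blocked(path: str, rule: str = "/*?*") -> bool:
    """Check whether a URL path is caught by the Disallow rule."""
    return rule_to_regex(rule).match(path) is not None

# Static filter page: no '?', so /*?* does not match and it stays crawlable
print(is_blocked("/category/zizhucan/weigongcun"))
# Parameterized sorting page: contains '?', so it is blocked
print(is_blocked("/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek"))
```

The first call prints `False`, the second `True`: one rule cleanly separates the high-value static pages from the low-value parameterized ones.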





