
[Website Building Knowledge] Using robots.txt cleverly to avoid spider black holes

Posted on 10/23/2014 10:44:58 PM

For the Baidu search engine, a "spider black hole" is a site that generates, at very low cost, a large number of dynamic URLs with different parameters but essentially identical content. Like an infinite loop, this "black hole" traps the spider, and Baiduspider ends up wasting large amounts of resources crawling invalid pages.
For example, many websites offer a filtering feature, and the pages generated by filters are often crawled in large numbers by search engines even though a large share of them have little search value. Take "apartments for rent priced between 500-1000": first, the site has essentially no matching listings (nor do such listings exist in reality); second, neither on-site users nor search-engine users actually search this way. When such pages are crawled in bulk, they only consume the site's valuable crawl quota. So how can this be avoided?
Let's take a group-buying website in Beijing as an example and see how it uses robots rules to cleverly avoid this spider black hole:

For normal filter result pages, the site uses static links, such as:
http://bj.XXXXX.com/category/zizhucan/weigongcun
On the same filter result page, selecting different sorting options generates dynamic links with different parameters; even the same sorting criterion (e.g., descending by sales) produces different parameters each time. For example:
http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek
http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqi5c
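The distinction the site relies on is simple: static filter pages have no query string, while the sorting pages carry parameters after a `?`. A minimal sketch of that check (the `is_dynamic` helper is my own illustration, using the example URLs above with the site name masked as in the article):

```python
from urllib.parse import urlparse

def is_dynamic(url: str) -> bool:
    """Return True if the URL carries a query string, i.e. is a 'dynamic' link."""
    return bool(urlparse(url).query)

# Static filter result page: no query string
print(is_dynamic("http://bj.XXXXX.com/category/zizhucan/weigongcun"))
# Parameterized sorting page: query string present
print(is_dynamic("http://bj.XXXXX.com/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek"))
```

The first call returns `False`, the second `True` — which is exactly the line the robots rule below draws.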

For this group-buying site, only the filter result pages are left open to search-engine crawling, while the sorting pages with their various parameters are blocked through robots rules.
       Its robots.txt file contains the rule Disallow: /*?*, which forbids search engines from accessing any dynamic page on the site. In this way, the site exposes its high-quality pages to Baiduspider while blocking the low-quality ones, presenting a friendlier site structure and preventing a black hole from forming.
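To see why `Disallow: /*?*` catches the sorting URLs but not the static filter pages, here is a small sketch that translates a robots path rule into a regex, following the de-facto wildcard extension (`*` matches any character sequence, a trailing `$` anchors the end). The helper names are my own; note that Python's standard `urllib.robotparser` does plain prefix matching and does not implement this wildcard extension, which is why the matching is written out by hand:

```python
import re

def rule_to_regex(rule: str) -> "re.Pattern":
    """Translate a robots.txt path rule with '*' wildcards into a regex."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):          # trailing '$' anchors the end of the path
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def is_blocked(path: str, rule: str = "/*?*") -> bool:
    """Check whether a URL path is caught by the Disallow rule."""
    return rule_to_regex(rule).match(path) is not None

# Static filter page: no '?', so /*?* does not match and it stays crawlable
print(is_blocked("/category/zizhucan/weigongcun"))
# Parameterized sorting page: contains '?', so it is blocked
print(is_blocked("/category/zizhucan/weigongcun/hot?mtt=1.index%2Fpoi.0.0.i1afqhek"))
```

The first call prints `False`, the second `True`: one rule cleanly separates the high-value static pages from the low-value parameterized ones.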





