1. HTTP request headers
Every HTTP request sent to a server carries a set of headers describing the client and its configuration. The headers a browser sends differ from the defaults sent by crawler libraries, so a mismatched or missing header set is an easy signal for anti-crawler systems and a common cause of IP bans.
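One common mitigation is to send a browser-like header set and rotate the User-Agent between requests. A minimal sketch (the User-Agent strings and header values are illustrative examples, not a definitive list):

```python
import random

# Hypothetical pool of real-browser User-Agent strings; rotating them
# keeps every request from sharing one crawler fingerprint.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def browser_headers():
    """Return a header set that resembles a normal browser request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;"
                  "q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Connection": "keep-alive",
    }
```

These headers can then be passed to whatever HTTP client the crawler uses (for example, the `headers=` argument of `requests.get`).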
2. Cookie settings
Websites track visits through cookies and may cut off access as soon as crawler-like behavior is detected, such as submitting a form unusually fast or browsing a large number of pages in a short time. While collecting a site, it is worth inspecting the cookies it sets and deciding which ones the crawler actually needs to handle.
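Once the site's cookies have been inspected, the crawler can keep only the ones it genuinely needs. A minimal sketch, assuming hypothetical cookie names (`session_id`, `csrf_token`, `_tracking_id` are illustrative, not from any specific site):

```python
def essential_cookies(all_cookies, needed=("session_id", "csrf_token")):
    """Keep only the cookies the crawler needs; drop the rest."""
    return {name: value for name, value in all_cookies.items()
            if name in needed}

# Cookies observed while browsing the target site (hypothetical values).
observed = {
    "session_id": "abc123",    # keeps the session alive between requests
    "csrf_token": "tok456",    # required for form submissions
    "_tracking_id": "xyz789",  # analytics cookie the crawler can discard
}

kept = essential_cookies(observed)
```

The filtered dict can then be attached to subsequent requests so the session looks continuous without carrying tracking cookies the crawler does not need.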
3. Access path
A crawler that always follows the same access path is easy for anti-crawler systems to recognize. Try to simulate real user behavior and visit pages in a randomized order.
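One way to randomize the path is to shuffle the target pages and sprinkle in ordinary navigation pages, so the sequence resembles real browsing. A sketch under assumed site paths (`/`, `/news`, etc. are hypothetical):

```python
import random

HOME = "/"                                  # hypothetical site paths
CATEGORIES = ["/news", "/products", "/about"]

def humanized_path(targets):
    """Shuffle the target pages and interleave occasional 'detour'
    visits to ordinary pages, mimicking a user's browsing path."""
    order = list(targets)
    random.shuffle(order)
    path = [HOME]                           # a user usually lands on the homepage
    for page in order:
        if random.random() < 0.3:           # ~30% chance of a detour first
            path.append(random.choice(CATEGORIES))
        path.append(page)
    return path
```

The returned list is the order in which the crawler then fetches pages; every run produces a different sequence.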
4. Frequency of visits
Most IP bans happen because the access frequency is too high. It is tempting to finish the crawl quickly, but pushing the speed too far gets the IP blocked, and overall efficiency drops rather than rises.
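The usual fix is to pause between requests, with random jitter so the intervals are not perfectly regular (a fixed interval is itself a fingerprint). A minimal sketch; the delay values are illustrative and should be tuned per site:

```python
import random
import time

def polite_sleep(base=2.0, jitter=3.0):
    """Sleep for a random interval between requests.

    base   -- minimum pause in seconds (illustrative default)
    jitter -- extra random pause added on top, in seconds
    Returns the delay actually used, for logging.
    """
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` after each request keeps the average rate low and the timing irregular.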
These are the basic anti-crawler strategies. Stricter sites use more defenses than these, so crawler engineers have to study each target site's protections in detail. As anti-crawler measures keep upgrading, crawling strategies must be upgraded along with them; combined with efficient, high-quality proxy IPs, the crawling work can then proceed efficiently.
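Proxy IPs are typically used as a rotating pool, so a blocked address can simply be rotated out. A minimal sketch, assuming a hypothetical pool of proxy URLs (the addresses below are placeholders from a documentation range, not real proxies); the returned mapping is the shape the `requests` library's `proxies=` argument expects:

```python
import random

# Hypothetical proxy pool; in practice this would come from a proxy provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def next_proxy():
    """Pick a proxy at random so each request can go out from a
    different IP and blocked addresses can be rotated out."""
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}
```

A crawler would call `next_proxy()` per request (or per batch) and drop entries from the pool when they start failing.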