Many crawlers on the Internet are written in Python. Some time ago I also wrote a simple crawler in ASP.NET, which can crawl whatever data you want. Nowadays many websites have anti-crawling mechanisms, which makes it much harder for crawlers to scrape data. Most websites defend against crawlers in a few common ways: CAPTCHAs, IP address restrictions, blacklists, and so on, plus some more advanced techniques. This crawler takes some countermeasures of its own, such as bypassing CAPTCHAs and using proxies. I am pasting some of the code below so we can discuss and learn together; please correct anything that is wrong! This crawler mainly targets one particular website.
After you enter a URL, the crawler fetches the page at that URL, then filters and cleans the result with XPath to extract the data you want.
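The fetch-then-filter step can be sketched roughly as follows. This is a minimal Python illustration, not the ASP.NET code from the original post. The sample page, the `div[@class='item']` layout, and the function names are all assumptions made up for the example. It uses the standard library's `xml.etree.ElementTree`, which only supports a subset of XPath and needs well-formed markup; real-world HTML usually calls for a more forgiving parser.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE_PAGE = """<html><body>
<div class="item"><a href="/post/1"> First post </a></div>
<div class="ad"><a href="/ad">Sponsored</a></div>
<div class="item"><a href="/post/2">Second post</a></div>
</body></html>"""


def fetch_page(url, timeout=10):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_items(page_source):
    """Select the wanted links with an XPath expression and clean them up."""
    root = ET.fromstring(page_source)
    results = []
    # Filter: keep only links inside <div class="item">, skipping ads.
    for link in root.findall(".//div[@class='item']/a"):
        title = (link.text or "").strip()  # clean: trim surrounding whitespace
        if title:
            results.append({"title": title, "url": link.get("href")})
    return results


if __name__ == "__main__":
    for item in extract_items(SAMPLE_PAGE):
        print(item["title"], item["url"])
```

In a real run, the string returned by `fetch_page` would be passed to `extract_items` in place of `SAMPLE_PAGE`, with the XPath expression adjusted to the target site's markup.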
To get around anti-crawling measures, you can send requests through proxy IPs. Download or scrape a list of high-anonymity proxy IPs from the Internet, then switch between them at random while crawling.
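Random proxy switching might look something like the sketch below. Again this is Python rather than the post's ASP.NET code, and everything in it is an assumption for illustration: the `PROXY_POOL` entries are placeholder addresses from a reserved documentation range, and `pick_proxy` and `build_opener_for` are hypothetical helper names.

```python
import random
import urllib.request

# Placeholder proxies (reserved documentation addresses, not real servers).
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:3128", "203.0.113.12:8000"]


def pick_proxy(pool, exclude=None):
    """Pick a random proxy, avoiding the one just used when possible."""
    candidates = [p for p in pool if p != exclude] or list(pool)
    return random.choice(candidates)


def build_opener_for(proxy):
    """Build a urllib opener that routes HTTP and HTTPS through the proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    current = pick_proxy(PROXY_POOL)
    opener = build_opener_for(current)
    # opener.open(url, timeout=10) would send the request via `current`.
    print("using proxy", current)
```

Passing the previous proxy as `exclude` means each switch actually changes the outgoing IP, as long as the pool has more than one entry.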
The code for this step first checks whether the newly selected proxy IP is actually reachable before using it. For the details, see the source code, which is provided!
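One simple way to do such a reachability check is to try opening a TCP connection to the proxy before relying on it. The Python sketch below is an assumption about how this could be done, not the author's implementation; `proxy_is_reachable` and `first_working_proxy` are hypothetical names. Note that a successful connection only shows the port is open, not that the proxy will forward requests correctly.

```python
import socket


def proxy_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def first_working_proxy(pool, check=proxy_is_reachable):
    """Return the first "host:port" entry that passes the check, else None."""
    for entry in pool:
        host, _, port = entry.rpartition(":")
        if check(host, int(port)):
            return entry
    return None


if __name__ == "__main__":
    # Placeholder addresses; these are not expected to be reachable.
    pool = ["203.0.113.10:8080", "203.0.113.11:3128"]
    print(first_working_proxy(pool))
```

The `check` parameter makes it possible to swap in a stricter test, for example one that requests a known page through the proxy and verifies the response.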
Source code download