0x00
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically scrapes information from the World Wide Web according to certain rules. Other, less commonly used names include ants, automatic indexers, emulators, or worms.
0x01
To put it simply, a crawler fetches data according to its own rules, analyzes what it has fetched, and extracts the data that is useful to it.
0x02
Web crawler optimization can be divided into two stages:
1: Optimizing while scraping the data;
2: Optimizing the processing of the scraped results.
Today we are only talking about optimization during the scraping stage!
0x03
I have summarized a few points about optimization during the scraping process:
1: Optimize for physical location. For example, if the target resource server is a Tencent Cloud host in Shanghai, try to choose a crawler server in the same region, i.e. the Shanghai region, rather than Beijing, Qingdao, or elsewhere, and ideally in the same IDC (data center). Since we know the resource website runs on Tencent Cloud, put the crawler on a Tencent Cloud server as well, not on an Alibaba Cloud server!
2: Choose a stable, fast network. Crawlers generally have high requirements on network quality, so try not to use a home network; use a company network or buy a server to do the scraping.
3: Choose a more efficient crawler language. I have heard that Python is good at crawling, but I have not used it yet and will test it later; today the examples are in .NET.
0x04
For scenarios like rush buying (flash sales), scraping speed matters a great deal; it is literally a race against the clock, and getting the data earlier increases your chance of grabbing the item. Below is a console demo I wrote to test scraping data from this website, with the results shown in the figure below:
(The shorter the time, the faster it is.)
Ranking of the results above: 1: natively optimized code; 2: native code; 3: third-party plug-in DLLs (packages).
0x05
Why do third-party plug-ins (packages) take the longest? A third-party plug-in is essentially a heavy layer of encapsulation over the native code, with a lot of extra logic and general-purpose handling, which can slow down scraping.
Here's the native code:
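A minimal sketch of what such native code typically looks like in .NET, using HttpWebRequest to fetch the page into a string (the URL here is a placeholder, not the site from the test):

```csharp
using System;
using System.IO;
using System.Net;

class NativeCrawler
{
    static void Main()
    {
        // Placeholder URL; substitute the site you are scraping.
        var request = (HttpWebRequest)WebRequest.Create("https://www.example.com/");
        request.Method = "GET";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Read the whole response body as text.
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```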
The native code above is only a few lines, yet it still averages 184 milliseconds. The simpler the code, the harder it is to optimize. So how can the code above be optimized to average 99 milliseconds? That would nearly double the speed!
0x06
If the target resource server supports gzip compression, then when the browser requests the website, the request headers include the following parameter (Accept-Encoding), and the server's response headers indicate which compression was actually applied (Content-Encoding):
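For illustration, the header exchange looks roughly like this (the host, path, and exact list of encodings are examples and vary by browser and server):

```
GET /index.html HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate, sdch, br

HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
```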
Introduction to "Accept-Encoding": https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
In layman's terms:
The client says: I support the "gzip, deflate, sdch, br" compression algorithms; use whichever you want when returning the data.
The server says: I happen to support the gzip compression algorithm, so I will compress the data with gzip and send it to you.
The client says: Okay, then I will decompress the received data with the gzip algorithm.
The gzip algorithm compresses the transferred data and greatly reduces the amount of content transmitted, so request efficiency improves. The optimized code is as follows:
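A sketch of that optimization with HttpWebRequest: setting the AutomaticDecompression property both adds gzip/deflate to the Accept-Encoding request header and transparently decompresses the response body (again, the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class GzipCrawler
{
    static void Main()
    {
        // Placeholder URL; substitute the site you are scraping.
        var request = (HttpWebRequest)WebRequest.Create("https://www.example.com/");
        request.Method = "GET";

        // Advertise gzip/deflate support and let the framework decompress the body.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```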
Although it is a small detail, the efficiency roughly doubles! Data that used to take you two days to collect can now be collected in one day. This article is dedicated to friends who are learning to crawl.
Note: The gzip compression algorithm has nothing to do with the programming language!
Finally, attach the source code:
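As a rough sketch only (not the attached original), a console benchmark that compares the plain request against the gzip-enabled one could look like this; the URL and run count are placeholders:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;

class CrawlerBenchmark
{
    const string Url = "https://www.example.com/"; // placeholder target
    const int Runs = 10;                           // placeholder run count

    static void Main()
    {
        Console.WriteLine("plain: {0} ms avg", Average(useGzip: false));
        Console.WriteLine("gzip:  {0} ms avg", Average(useGzip: true));
    }

    // Time several runs and return the average elapsed milliseconds.
    static long Average(bool useGzip)
    {
        long total = 0;
        for (int i = 0; i < Runs; i++)
        {
            var sw = Stopwatch.StartNew();
            Fetch(useGzip);
            sw.Stop();
            total += sw.ElapsedMilliseconds;
        }
        return total / Runs;
    }

    static string Fetch(bool useGzip)
    {
        var request = (HttpWebRequest)WebRequest.Create(Url);
        if (useGzip)
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```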