

[Console Program] .net/c# The path to web crawler optimization

Posted on 4/19/2018 2:21:02 PM
0x00

A web crawler (also known as a web spider or web robot; in the FOAF community it is more often called a web page chaser) is a program or script that automatically scrapes information from the World Wide Web according to certain rules. Other, less commonly used names include ant, auto-indexer, emulator, or worm.

0x01

Simply put, a crawler grabs data according to its own rules, analyzes what it has captured, and extracts the data that is useful to it.

0x02

Web crawler optimization can be divided into two stages:

1: Optimize while scraping the data;

2: Optimize the processing of the scraped results;

Today we are only talking about optimization during the scraping stage!

0x03

Here are a few points I have summarized about optimization during the crawling stage:

1: Optimize the physical location. For example, if the target resource server is a Tencent Cloud host in Shanghai, try to choose a crawler server in the same region, i.e. Shanghai, rather than Beijing, Qingdao or elsewhere, and ideally in the same IDC data center. Since we know the target website runs on a Tencent Cloud server, put the crawler on a Tencent Cloud server too, not on an Alibaba Cloud server!

2: Choose a stable, fast network. Crawlers generally place high demands on network quality, so try not to use a home network; use the company network or rent a server to collect the data.

3: Choose a more efficient crawler language. I have heard that Python is better suited to crawlers, but I have not used it yet and will test it later; today the explanation is mainly in .NET.

0x04

For scenarios like flash sales, scraping speed matters enormously; every moment counts, and getting the data earlier raises the chance of a successful grab. Below is a console demo I wrote to test scraping data from this website, as shown in the figure below:


(The shorter the time, the faster it is)

Ranking of the results above: 1: natively optimized code, 2: native code, 3: third-party plug-in DLLs (packages)
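The full console demo is attached as hidden content at the end of this post, so what follows is only a minimal sketch of how such a timing comparison might be wired up with Stopwatch; the round count and the FetchPage placeholder are assumptions, not the original demo.

using System;
using System.Diagnostics;

class CrawlBenchmark
{
    static void Main()
    {
        // Round count is an assumption; the original demo's settings are not shown.
        const int rounds = 10;
        long totalMs = 0;

        for (int i = 0; i < rounds; i++)
        {
            var sw = Stopwatch.StartNew();
            FetchPage();                      // the fetch variant being compared
            sw.Stop();

            totalMs += sw.ElapsedMilliseconds;
            Console.WriteLine($"Round {i + 1}: {sw.ElapsedMilliseconds} ms");
        }

        Console.WriteLine($"Average: {totalMs / rounds} ms");
    }

    // Placeholder: swap in the native or gzip-optimized request code shown later in the post.
    static void FetchPage()
    {
    }
}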

0x05

Why do third-party plug-ins (packages) take the longest? Third-party plug-ins wrap the native code in many layers and add a lot of conditional logic in order to stay general-purpose, which can make scraping slower.

Here's the native code:
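A plain HttpWebRequest fetch in .NET typically amounts to just a few lines like the following; this is a sketch with a placeholder URL, not necessarily the author's exact code.

using System;
using System.IO;
using System.Net;

class NativeFetch
{
    static void Main()
    {
        // Placeholder URL; the target site from the original demo is not reproduced here.
        Console.WriteLine(Fetch("https://www.example.com/").Length);
    }

    static string Fetch(string url)
    {
        // Plain, unoptimized request: no compression is negotiated, the body is read as text.
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}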



The native code is just the few lines above, yet the average time is still 184 milliseconds. The simpler the code, the harder it is to optimize. Can you see how the code above could be optimized to bring the average time down to 99 milliseconds? That is nearly twice as fast!

0x06

If the target resource server supports gzip compression, then when the browser requests the website, the request headers carry a parameter such as:

Accept-Encoding: gzip, deflate, sdch, br

Response header parameter:

Content-Encoding: gzip
Introduction to "Accept-Encoding": https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding

In layman's terms:

The client says: I support the gzip, deflate, sdch and br compression algorithms; use whichever you like when you return the data.

The server says: I happen to support the gzip compression algorithm, so I will compress the data with gzip before sending it to you.

The client says: OK, then I will decompress the data I receive with the gzip algorithm.

The gzip algorithm compresses the transmitted data and greatly reduces the amount of content transferred, so request efficiency improves. The optimized code is as follows:
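One common way to achieve this on HttpWebRequest is the AutomaticDecompression property, which both sends the Accept-Encoding header and transparently decompresses the gzip response. The sketch below assumes a placeholder URL and may differ from the author's exact optimized code.

using System;
using System.IO;
using System.Net;

class GzipFetch
{
    static void Main()
    {
        // Placeholder URL; swap in the real target.
        Console.WriteLine(Fetch("https://www.example.com/").Length);
    }

    static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        // Ask the server for gzip/deflate and let the framework decompress the body transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}

If you use HttpClient instead, the equivalent switch is the AutomaticDecompression property on HttpClientHandler.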


Although it is only a small detail, the efficiency is roughly doubled: data that used to take two days to collect can now be collected in one day. This article is dedicated to friends who are learning to write crawlers.

Note: The gzip compression algorithm has nothing to do with the programming language!

Finally, the source code is attached below:

Guests: if you want to see the hidden content of this post, please reply.






Posted on 12/31/2019 10:48:25 AM |
This newbie can't keep up. Is there any software that can collect big data with one click?
Posted on 10/15/2019 10:29:57 AM |
Seems pretty fun, having a look.
Posted on 4/20/2018 12:35:21 PM |
Thank you for sharing
Posted on 4/25/2018 11:33:55 AM |
Bookmarked, it might come in handy.
Posted on 5/17/2018 6:02:21 PM |
The Road to Web Crawler Optimization Collection
Posted on 5/18/2018 4:10:57 PM |
ooooooooooooooooooo
Posted on 7/18/2018 2:43:07 PM |
See if it works
Posted on 7/20/2018 10:09:50 AM |
DADASDSADSAD
Posted on 8/13/2018 1:06:50 PM |
Check out this source code
Posted on 8/20/2018 2:00:52 PM |

Thanks for sharing
Posted on 8/30/2018 11:42:26 AM |
srkskrskrskrskrskr