0x00
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically scrapes information from the World Wide Web according to certain rules. Other, less commonly used names include ants, automatic indexers, emulators, or worms.
0x01
To put it simply, a crawler fetches data according to its own rules, analyzes what it has fetched, and extracts the data that is useful to it.
0x02
Web crawler optimization can be divided into two stages:
1: Optimizing while scraping the data;
2: Optimizing the processing of the scraped results.
Today we are only talking about optimization during the scraping stage!
0x03
I have summarized a few points about optimization during the scraping process:
1: Optimize for physical location. For example, if the target resource server is a Tencent Cloud host in Shanghai, try to choose a crawler server in the same region, i.e. the Shanghai region, rather than Beijing, Qingdao, or elsewhere, and ideally in the same IDC (data center). Since we know the resource website runs on Tencent Cloud, put the crawler on a Tencent Cloud server as well, not on an Alibaba Cloud server!
2: Choose a stable, fast network. Crawlers generally have high requirements on network quality, so try not to use a home network; use a company network or buy a server to do the scraping.
3: Choose a more efficient crawler language. I have heard that Python is good at crawling, but I have not used it yet and will test it later; today the examples are in .NET.
0x04
For scenarios like rush buying (flash sales), scraping speed matters a great deal; it is literally a race against the clock, and getting the data earlier increases your chance of grabbing the item. Below is a console demo I wrote to test scraping data from this website, with the results shown in the figure below:
(The shorter the time, the faster it is.)
Ranking of the results above: 1: natively optimized code; 2: native code; 3: third-party plug-in DLLs (packages).
0x05
Why do third-party plug-ins (packages) take the longest? A third-party plug-in is essentially a heavy layer of encapsulation over the native code, with a lot of extra logic and general-purpose handling, which can slow down scraping.
Here's the native code:
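A minimal sketch of what such native code typically looks like in .NET, using HttpWebRequest to fetch the page into a string (the URL here is a placeholder, not the site from the test):

```csharp
using System;
using System.IO;
using System.Net;

class NativeCrawler
{
    static void Main()
    {
        // Placeholder URL; substitute the site you are scraping.
        var request = (HttpWebRequest)WebRequest.Create("https://www.example.com/");
        request.Method = "GET";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Read the whole response body as text.
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```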
The native code above is only a few lines, yet it still averages 184 milliseconds. The simpler the code, the harder it is to optimize. So how can the code above be optimized to average 99 milliseconds? That would nearly double the speed!
0x06
If the target resource server supports gzip compression, then when the browser requests the website, the request headers include the following parameter (Accept-Encoding), and the server's response headers indicate which compression was actually applied (Content-Encoding):
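For illustration, the header exchange looks roughly like this (the host, path, and exact list of encodings are examples and vary by browser and server):

```
GET /index.html HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate, sdch, br

HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
```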
Introduction to "Accept-Encoding": https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding
In layman's terms:
The client says: I support the "gzip, deflate, sdch, br" compression algorithms; use whichever you want when returning the data.
The server says: I happen to support the gzip compression algorithm, so I will compress the data with gzip and send it to you.
The client says: Okay, then I will decompress the received data with the gzip algorithm.
The gzip algorithm compresses the transferred data and greatly reduces the amount of content transmitted, so request efficiency improves. The optimized code is as follows:
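A sketch of that optimization with HttpWebRequest: setting the AutomaticDecompression property both adds gzip/deflate to the Accept-Encoding request header and transparently decompresses the response body (again, the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;

class GzipCrawler
{
    static void Main()
    {
        // Placeholder URL; substitute the site you are scraping.
        var request = (HttpWebRequest)WebRequest.Create("https://www.example.com/");
        request.Method = "GET";

        // Advertise gzip/deflate support and let the framework decompress the body.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```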
Although it is a small detail, the efficiency roughly doubles! Data that used to take you two days to collect can now be collected in one day. This article is dedicated to friends who are learning to crawl.
Note: The gzip compression algorithm has nothing to do with the programming language!
Finally, attach the source code:
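As a rough sketch only (not the attached original), a console benchmark that compares the plain request against the gzip-enabled one could look like this; the URL and run count are placeholders:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;

class CrawlerBenchmark
{
    const string Url = "https://www.example.com/"; // placeholder target
    const int Runs = 10;                           // placeholder run count

    static void Main()
    {
        Console.WriteLine("plain: {0} ms avg", Average(useGzip: false));
        Console.WriteLine("gzip:  {0} ms avg", Average(useGzip: true));
    }

    // Time several runs and return the average elapsed milliseconds.
    static long Average(bool useGzip)
    {
        long total = 0;
        for (int i = 0; i < Runs; i++)
        {
            var sw = Stopwatch.StartNew();
            Fetch(useGzip);
            sw.Stop();
            total += sw.ElapsedMilliseconds;
        }
        return total / Runs;
    }

    static string Fetch(bool useGzip)
    {
        var request = (HttpWebRequest)WebRequest.Create(Url);
        if (useGzip)
            request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```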