Without proxy IPs, crawling is difficult, which is why many crawler engineers buy efficient, stable proxies. But does a high-quality proxy IP mean you can sit back and relax? Not quite: you still need to optimize your scheme, allocate resources sensibly, and improve efficiency so the crawler runs faster and more stably.
Option 1: Each process pulls a random list of IPs from the API (for example, 100 at a time) and cycles through it, calling the API again when the list fails. The general logic is as follows:
1. Each process (or thread) retrieves a random batch of IPs from the API and loops through the list to fetch data.
2. If a request succeeds, continue crawling the next item.
3. If it fails (timeout, captcha, etc.), pull another batch of IPs from the API and keep trying.
Disadvantages of this scheme: every IP has an expiration time. If you extract 100 at once, most of the remainder may already be invalid by the time you reach the 10th. With an HTTP request configured for a 3-second connection timeout and a 5-second read timeout, each dead IP can waste 3-8 seconds, time in which dozens of successful requests could have been made.
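As a concrete illustration, here is a minimal Python sketch of this batch-based flow using requests. The extraction endpoint URL and its response format (whitespace-separated ip:port lines) are assumptions for illustration; on failure the sketch moves to the next IP in the batch and calls the API again once the batch is exhausted, which is one way to read steps 1-3.

```python
import requests

# Hypothetical extraction endpoint returning whitespace-separated ip:port lines.
API_URL = "http://proxy-provider.example/api?num=100"

def fetch_ip_batch():
    """Pull a fresh batch of ip:port strings from the extraction API."""
    return requests.get(API_URL, timeout=5).text.split()

def crawl_with_batch(urls):
    batch = fetch_ip_batch()
    idx = 0
    for url in urls:
        while True:
            if idx >= len(batch):      # batch exhausted: call the API again
                batch = fetch_ip_batch()
                idx = 0
            proxy = f"http://{batch[idx]}"
            try:
                # 3 s connection timeout, 5 s read timeout, as in the text above
                r = requests.get(url, proxies={"http": proxy, "https": proxy},
                                 timeout=(3, 5))
                if r.ok:
                    break              # success: move on to the next URL
            except requests.RequestException:
                pass                   # timeout / connection error
            idx += 1                   # failure: try the next IP in the batch
```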
Option 2: Each process takes a single random IP from the API and uses it until it fails, then calls the API again for a new IP. The general logic is as follows:
1. Each process (or thread) retrieves a random IP from the API and uses it to access the target.
2. If a request succeeds, continue crawling the next item.
3. If it fails (timeout, captcha, etc.), fetch another random IP from the API and keep trying.
Disadvantages: the API is called very frequently, which puts heavy pressure on the proxy server, affects the stability of the API, and may get your extraction rate limited. This scheme is also unsuitable and cannot run in a sustained, stable way.
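For comparison, here is a minimal sketch of this one-IP-at-a-time flow, under the same assumptions about the hypothetical extraction endpoint; note how every failure triggers another API call, which is exactly the pressure described above.

```python
import requests

# Hypothetical endpoint returning a single ip:port string per call.
API_URL = "http://proxy-provider.example/api?num=1"

def fetch_one_ip():
    """Ask the extraction API for one ip:port string."""
    return requests.get(API_URL, timeout=5).text.strip()

def crawl_one_at_a_time(urls):
    ip = fetch_one_ip()
    for url in urls:
        while True:
            proxy = f"http://{ip}"
            try:
                r = requests.get(url, proxies={"http": proxy, "https": proxy},
                                 timeout=(3, 5))
                if r.ok:
                    break              # success: keep the same IP for the next URL
            except requests.RequestException:
                pass
            ip = fetch_one_ip()        # failure: hit the API again for a fresh IP
```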
Option 3: Extract a large number of IPs in advance, import them into a local database, and have the crawler take IPs from the database. The general logic is as follows:
1. Create a table in the database and write an import script that requests the API periodically (for example, once per minute; follow the proxy provider's recommendation) and imports the returned IP list into the database.
2. Record fields such as import time, IP, port, expiration time, and availability status in the database.
3. Write a crawl script that reads available IPs from the database; each process obtains an IP from the database for its own use.
4. Run the crawl, check the results, handle cookies, and so on. Whenever a captcha appears or a request fails, discard the current IP and switch to a new one.
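A minimal sketch of the database side, assuming SQLite; the table name proxy_pool, its columns, and the extraction endpoint are illustrative stand-ins for the fields listed in step 2.

```python
import sqlite3
import time
import requests

DB_PATH = "proxy_pool.db"
# Hypothetical extraction endpoint returning whitespace-separated ip:port lines.
API_URL = "http://proxy-provider.example/api?num=50"

def init_db():
    """Create the proxy table; the columns mirror the fields listed in step 2."""
    con = sqlite3.connect(DB_PATH)
    con.execute("""
        CREATE TABLE IF NOT EXISTS proxy_pool (
            ip          TEXT,
            port        INTEGER,
            imported_at REAL,
            expires_at  REAL,
            available   INTEGER DEFAULT 1,
            PRIMARY KEY (ip, port)
        )""")
    con.commit()
    return con

def import_ips(con, ttl_seconds=300):
    """Importer: call the API (e.g. once a minute) and write the batch into the DB."""
    now = time.time()
    for line in requests.get(API_URL, timeout=5).text.split():
        ip, port = line.split(":")
        con.execute("INSERT OR REPLACE INTO proxy_pool VALUES (?, ?, ?, ?, 1)",
                    (ip, int(port), now, now + ttl_seconds))
    con.commit()

def get_ip(con):
    """Crawler side: pick one unexpired, available proxy from the pool."""
    row = con.execute(
        "SELECT ip, port FROM proxy_pool "
        "WHERE available = 1 AND expires_at > ? ORDER BY RANDOM() LIMIT 1",
        (time.time(),)).fetchone()
    return f"{row[0]}:{row[1]}" if row else None

def mark_bad(con, ip_port):
    """On a captcha or timeout, flag the IP so no worker picks it again."""
    ip, port = ip_port.split(":")
    con.execute("UPDATE proxy_pool SET available = 0 WHERE ip = ? AND port = ?",
                (ip, int(port)))
    con.commit()
```

In practice, an importer process would call import_ips on a schedule, while each crawler process calls get_ip before requesting and mark_bad whenever it hits a captcha or failure, matching steps 3 and 4 above.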
This solution avoids wasting the proxy server's resources, allocates proxy IPs effectively, and makes the crawl more efficient and stable, ensuring that the crawler can keep running reliably over the long term.