Q: One application server, one database. Operation: query the user's current balance, then deduct 3% of it as a handling fee.
A: a JVM-level lock (synchronized) or a database lock is enough.
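For the single-server case, a minimal sketch (the AccountDao interface and method names are illustrative assumptions, not from the original text):

```java
import java.math.BigDecimal;

// Hypothetical data-access interface used only for illustration.
interface AccountDao {
    BigDecimal queryBalance(long userId);
    void updateBalance(long userId, BigDecimal newBalance);
}

// Single-server case: an in-process lock (synchronized) serializes the
// read-then-deduct sequence so two threads cannot both read the old balance.
public class FeeService {
    private final AccountDao accountDao;

    public FeeService(AccountDao accountDao) {
        this.accountDao = accountDao;
    }

    public synchronized void chargeFee(long userId) {
        BigDecimal balance = accountDao.queryBalance(userId);
        BigDecimal fee = balance.multiply(new BigDecimal("0.03"));
        accountDao.updateBalance(userId, balance.subtract(fee));
    }
}
```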
Q: Two application servers, one database, same operation: query the user's current balance and deduct 3% of it as a handling fee. A: a JVM lock no longer works across machines, so we need a distributed lock.
What kind of distributed lock do we need? It must ensure that, across the application cluster, a given method is executed by only one thread on one machine at a time.
- It should preferably be a reentrant lock (to avoid deadlocks).
- It should preferably be a blocking lock (decide according to your business needs).
- It should preferably be a fair lock (again, decide according to your business needs).
- Acquiring and releasing the lock must be highly available.
- Acquiring and releasing the lock should perform well.
1. Distributed locks based on databases
Distributed locks based on table records
When we want to lock a method, execute the following SQL: insert into methodLock(method_name, desc) values ('method_name', 'desc')
Because method_name has a unique constraint, if multiple requests are submitted at the same time, the database guarantees that only one insert succeeds; we can then treat the thread whose insert succeeded as the holder of the method lock, entitled to execute the method body.
When the method has finished, release the lock by executing the following SQL: delete from methodLock where method_name = 'method_name'
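A minimal JDBC sketch of this insert/delete lock, assuming the methodLock table above with a unique constraint on method_name (error handling is simplified):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the insert/delete style lock. The unique constraint on method_name
// makes the INSERT succeed for exactly one competing client.
public class TableLock {
    private final Connection connection;

    public TableLock(Connection connection) {
        this.connection = connection;
    }

    /** Returns true if this client won the lock (the INSERT succeeded). */
    public boolean tryLock(String methodName) {
        // desc is a reserved word in MySQL, so it is back-quoted here
        String sql = "insert into methodLock(method_name, `desc`) values (?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, methodName);
            ps.setString(2, "lock for " + methodName);
            return ps.executeUpdate() == 1;
        } catch (SQLException duplicateKey) {
            return false;   // unique-key violation: someone else holds the lock
        }
    }

    /** Releases the lock by deleting the record. */
    public void unlock(String methodName) throws SQLException {
        String sql = "delete from methodLock where method_name = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, methodName);
            ps.executeUpdate();
        }
    }
}
```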
This simple implementation has the following problems:
- The lock depends on the availability of the database, which is a single point of failure; once the database goes down, the business system becomes unavailable.
- The lock has no expiration time; if the unlock operation fails, the lock record stays in the database forever and no other thread can ever acquire the lock.
- The lock is non-blocking: a failed insert returns an error immediately. Threads that fail to acquire the lock are not queued and must retry the acquisition themselves.
- The lock is non-reentrant: the same thread cannot acquire the lock again before releasing it, because the record already exists in the table.
- The lock is unfair: all waiting threads compete for it, and whichever gets lucky wins.
Of course, these problems can be mitigated in other ways.
- Single point? Deploy two databases with bidirectional synchronization; if one goes down, switch quickly to the standby.
- No expiry time? Run a scheduled task that periodically cleans up timed-out lock records.
- Non-blocking? Retry the insert in a while loop and return success only once the insert succeeds.
- Non-reentrant? Add columns recording the host and thread that currently hold the lock; on the next acquisition attempt, query first, and if the record matches the current host and thread, grant the lock directly.
- Unfair? Add an intermediate table recording all waiting threads, sort by creation time, and only allow the earliest one to acquire the lock.
Distributed locks based on database exclusive locks
In addition to inserting and deleting records in a table, a distributed lock can also be built on the locks the database itself provides.
We reuse the table created above. A distributed lock can be implemented through the database's exclusive (row) lock. With MySQL's InnoDB engine, the locking operation can be implemented as follows:
By appending for update to the query, the database adds an exclusive lock to the matched record during the query. Once a record holds an exclusive lock, no other transaction can take an exclusive lock on that row.
We can treat the thread that obtains the exclusive row lock as the holder of the distributed lock; once it has the lock, it executes the method's business logic and then unlocks as follows:
public void unlock() { connection.commit(); }
That is, the lock is released by calling connection.commit().
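A rough JDBC sketch of this for update approach (class and column names are illustrative; auto-commit must be disabled so the row lock survives until commit):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of locking via SELECT ... FOR UPDATE on the methodLock table.
// The row lock is held for the lifetime of the transaction and released by commit().
// method_name should be indexed (unique), otherwise InnoDB may lock far more than one row.
public class ForUpdateLock {
    private final Connection connection;

    public ForUpdateLock(Connection connection) {
        this.connection = connection;
    }

    public void lock(String methodName) throws SQLException {
        connection.setAutoCommit(false);   // keep the transaction (and the row lock) open
        String sql = "select method_name from methodLock where method_name = ? for update";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, methodName);
            // executeQuery blocks until the exclusive row lock is granted
            // (or the lock wait times out)
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("no lock record for " + methodName);
                }
            }
        }
    }

    public void unlock() throws SQLException {
        connection.commit();               // committing the transaction releases the row lock
    }
}
```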
This approach effectively solves the earlier problems of locks that cannot be released and of non-blocking acquisition.
Blocking? The for update statement blocks until the lock is granted and returns as soon as it succeeds.
Lock not released when the service crashes after acquiring it? With this approach, the database releases the lock itself once the connection and its transaction go away.
However, it still does not directly solve the problems of the database single point, reentrancy, or fairness.
To summarize the two database-based approaches: both rely on a table in the database. One determines whether the lock is held by the existence of a record in that table; the other uses the database's exclusive row lock.
Advantages of database-based distributed locks
It builds directly on the database and is easy to understand.
Disadvantages of database-based distributed locks
There are many corner cases, and the solution grows more and more complex as each one is patched.
Database operations carry a certain overhead, so performance has to be considered.
2. Distributed locks based on cache
Compared with the database-based solution, a cache-based implementation performs better.
There are many mature caching products at present, including Redis, memcached, etc. Here we take Redis as an example to analyze the scheme of using cache to implement distributed locks.
There are many articles online about implementing distributed locks with Redis; the main approach is to use the setNX command (for example via Jedis's setnx method).
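A minimal sketch of that setNX approach with Jedis (key, value and expiry handling are illustrative, and this naive version still has all the problems listed below):

```java
import redis.clients.jedis.Jedis;

// Naive Redis lock via SETNX: whoever sets the key first owns the lock.
public class RedisSetNxLock {
    private final Jedis jedis;

    public RedisSetNxLock(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true if this client acquired the lock. */
    public boolean tryLock(String lockKey, String ownerId, int expireSeconds) {
        if (jedis.setnx(lockKey, ownerId) == 1) {
            // the expiry is set separately here, which is not atomic with SETNX;
            // production code should prefer SET key value NX PX in one command
            jedis.expire(lockKey, expireSeconds);
            return true;
        }
        return false;
    }

    public void unlock(String lockKey) {
        jedis.del(lockKey);
    }
}
```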
The above implementation also has several problems:
1. Single point of failure.
2. The lock has no expiration time: if the unlock operation fails, the key stays in Redis forever and no other thread can acquire the lock.
3. The lock is non-blocking: setNX returns immediately whether it succeeds or fails.
4. The lock is non-reentrant: after acquiring the lock, a thread cannot acquire it again before releasing it, because the key already exists in Redis and setNX will no longer succeed.
5. The lock is unfair: all waiting threads issue setNX at the same time, and the lucky one wins.
Of course, there are also ways to address these.
- Single point? Mainstream caching services support cluster deployment, which addresses the single-point problem.
- No expiry time? Redis's expire command accepts an expiration time, after which the key is deleted automatically.
- Non-blocking? Retry in a while loop.
- Non-reentrant? After acquiring the lock, store the current host and thread information, and before the next acquisition check whether the current thread is already the owner.
- Unfair? Put waiting threads in a queue and grant the lock first-in, first-out.
However, replication inside a Redis cluster takes time: thread A may acquire the lock with a successful setNX, but before that key has been replicated to the node on which thread B runs setNX, B can also succeed, causing a concurrency problem.
Salvatore Sanfilippo, the author of Redis, proposed the Redlock algorithm, a distributed lock manager (DLM) intended to be safer and more reliable than a single node.
Redlock assumes N mutually independent Redis nodes, typically N = 5, running on different machines so that they are physically independent.
The steps of the algorithm are as follows:
1. The client obtains the current time in milliseconds.
2. The client tries to acquire the lock on all N nodes (each node is locked in the same way as the single-node cache lock described earlier), using the same key and value on every node. The client sets a per-node access timeout that is much shorter than the lock's expiry time; for example, if the lock auto-releases after 10 s, the per-node timeout might be around 5-50 ms. That way, if a Redis node is down, the client times out quickly and wastes as little of the lock's validity period as possible.
3. The client computes how long acquiring the lock took by subtracting the time recorded in step 1 from the current time. Only if the client acquired the lock on a majority of the nodes (at least 3 of the 5) and the elapsed time is less than the lock's expiry time does it consider the distributed lock acquired.
4. The validity time the client actually holds the lock for is the configured lock expiry time minus the elapsed time computed in step 3.
5. If the client failed to acquire the lock, it releases the lock on all nodes in turn.
With Redlock, the lock service keeps working even if up to 2 of the 5 nodes are down, which greatly improves availability compared with the earlier database lock and single-node cache lock.
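A simplified sketch of these steps in Java (RedisNode here is a stand-in interface for a single-node lock client such as the setNX lock above, not part of any official Redlock library):

```java
import java.util.List;

// Simplified outline of the Redlock steps above.
public class RedlockSketch {
    /** Stand-in for a single-node lock client; an assumption for this sketch. */
    interface RedisNode {
        boolean tryLock(String key, String value, long ttlMillis);  // uses a short per-node timeout
        void unlock(String key, String value);
    }

    private final List<RedisNode> nodes;   // typically N = 5 independent nodes

    public RedlockSketch(List<RedisNode> nodes) {
        this.nodes = nodes;
    }

    /** Returns the remaining validity time in ms if the lock was acquired, or -1. */
    public long lock(String key, String value, long ttlMillis) {
        long start = System.currentTimeMillis();                  // step 1
        int acquired = 0;
        for (RedisNode node : nodes) {                            // step 2
            if (node.tryLock(key, value, ttlMillis)) {
                acquired++;
            }
        }
        long elapsed = System.currentTimeMillis() - start;        // step 3
        long validity = ttlMillis - elapsed;                      // step 4
        int quorum = nodes.size() / 2 + 1;                        // e.g. 3 of 5
        if (acquired >= quorum && validity > 0) {
            return validity;
        }
        for (RedisNode node : nodes) {                            // step 5: roll back
            node.unlock(key, value);
        }
        return -1;
    }
}
```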
However, the distributed-systems expert Martin Kleppmann wrote an article, "How to do distributed locking", questioning the correctness of Redlock.
The expert points out two aspects to weigh when choosing a distributed lock: performance and correctness.
If you need a high-performance lock and strict correctness is not required, a cache-based lock is sufficient.
If you need a highly reliable lock, correctness must be taken seriously, and in the expert's view Redlock does not meet that bar. Why not? He lists several reasons.
Many languages today run on virtual machines with garbage collection. During a full GC the program stops to collect, and a full GC can occasionally take a long time, even pausing the program for minutes; the article cites HBase, which sometimes pauses for GC for several minutes, causing its leases to time out. For example: client 1 acquires the lock and, just as it is about to work on the shared resource, hits a full GC that lasts until the lock expires. Client 2 then acquires the lock and starts working on the resource. While client 2 is still processing, client 1 finishes its GC and also starts working on the resource, so both clients are operating on the shared resource at the same time.
The expert's proposed fix looks like MVCC: attach a fencing token to the lock. The token is essentially a version number that is incremented every time the lock is granted; clients carry the token when operating on the shared resource, and the resource only accepts operations carrying the expected (latest) token.
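A sketch of that fencing-token idea on the resource side (the storage class and its write method are hypothetical):

```java
// The resource (e.g. a storage service) remembers the highest fencing token it
// has seen and rejects writes that carry an older one.
public class FencedStorage {
    private long highestTokenSeen = -1;

    public synchronized boolean write(long fencingToken, String data) {
        if (fencingToken <= highestTokenSeen) {
            return false;           // stale lock holder (e.g. resumed after a long GC pause)
        }
        highestTokenSeen = fencingToken;
        doWrite(data);              // hypothetical actual write
        return true;
    }

    private void doWrite(String data) {
        // persist the data; omitted in this sketch
    }
}
```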
The expert also notes that the algorithm relies on local clocks: when handling key expiration Redis uses gettimeofday rather than a monotonic clock, which can make the measured time inaccurate. Consider a scenario with two clients, client 1 and client 2, and five Redis nodes (A, B, C, D and E):
1. Client 1 acquires the lock from A, B and C; its requests to D and E time out on the network.
2. Node C's clock runs fast, so the lock on C expires early.
3. Client 2 then acquires the lock from C, D and E; its requests to A and B time out on the network.
4. As a result, both client 1 and client 2 believe they hold the lock.
To summarize the expert's two arguments for why Redlock is unsafe:
1. GC pauses and similar stalls can happen at any time, so a client may acquire a lock, stall past the timeout, and another client acquires the lock while the first still believes it holds it. The expert's remedy is the monotonically increasing fencing token.
2. The algorithm depends on local time, and clocks can drift or jump, allowing two clients to hold the lock at once.
The expert's conclusion is that Redlock only works correctly under bounded network delay, bounded process pauses, and bounded clock error, and since none of these bounds can be guaranteed, he does not recommend Redlock. For scenarios with strong correctness requirements he recommends Zookeeper; using Zookeeper as a distributed lock is discussed later.
Response from the author of Redis
After seeing the expert's article, the Redis author responded with a blog post: he politely thanked the expert and then explained why he disagrees.
I asked for an analysis in the original Redlock specification here: http://redis.io/topics/distlock. So thank you Martin. However I don’t agree with the analysis.
The Redis author's response to using tokens to solve the lock-timeout problem can be summarized in the following five points:
Point 1: Distributed locks are generally used precisely when you have no other way to control the shared resource; if, as the expert suggests, a token can protect the shared resource, then a distributed lock is not needed at all.
Point 2: As for generating the token, to guarantee that different clients obtain reliable tokens, the token-generating service itself would need a distributed lock to be reliable.
Point 3: Regarding the monotonically increasing token, the Redis author considers it unnecessary: each client can generate a unique UUID as its token and mark the shared resource so that only the client holding that UUID may operate on it; other clients then cannot touch the resource until the lock holder releases the lock.
Point 4: If the client holding token 34 hits a GC pause during its write and the lock times out, another client can acquire the lock with token 35 and start writing, which still produces a conflict; so simply ordering the tokens cannot always be combined with the shared resource.
Point 5: The Redis author believes that in most scenarios distributed locks are used for updates in non-transactional settings; that is, there are cases where it is hard to make the shared resource token-aware, so one has to rely on the lock itself to protect the resource.
As for the clock problem the expert raises, the Redis author also gives an explanation: if acquiring the lock takes too long and exceeds the lock's expiry time, the client simply does not consider the lock acquired, so the scenario the expert describes does not arise.
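A sketch of the UUID-as-token point: each client writes its own random value and may only delete the key if the value still matches, done atomically with a small Lua script (the class and key names are illustrative):

```java
import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;

// Each client uses a random UUID as the lock value and may only delete the key
// if the value still matches, checked and deleted atomically via a Lua script.
public class UuidRedisLock {
    private static final String RELEASE_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "  return redis.call('del', KEYS[1]) " +
        "else return 0 end";

    private final Jedis jedis;

    public UuidRedisLock(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns the owner token on success, or null if the lock is held by someone else. */
    public String tryLock(String lockKey, int expireSeconds) {
        String token = UUID.randomUUID().toString();
        if (jedis.setnx(lockKey, token) == 1) {
            jedis.expire(lockKey, expireSeconds);
            return token;
        }
        return null;
    }

    /** Returns true only if this client still owned the lock and released it. */
    public boolean unlock(String lockKey, String token) {
        Object result = jedis.eval(RELEASE_SCRIPT,
                Collections.singletonList(lockKey),
                Collections.singletonList(token));
        return Long.valueOf(1L).equals(result);
    }
}
```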
Personal feelings
The first problem I see is that after a client obtains a distributed lock, the lock may be released by timeout while the client is still processing. Earlier, when discussing the database lock with a 2-minute timeout: if a task holds an order lock for more than 2 minutes, the other trading center can acquire the same order lock, and the two trading centers end up processing the same order at the same time. Normally a task finishes in seconds, but sometimes an RPC call inside the task is given too long a timeout, and a few such slow calls can easily push the task past the auto-release time. If we write in Java there may also be a full GC in the middle, and once the lock is released on timeout the client cannot perceive it, which is very serious. I do not think this is a defect of the lock itself: any of the distributed locks above that auto-release on timeout has this problem. If you rely on timeout-based release, the client must track the lock's expiry and react to it, rather than blindly continue working on the shared resource. Redlock's algorithm in fact returns, along with the lock, the time the client may hold it, and the client must honor that time and stop the task once it has elapsed.
The second problem is that the distributed expert misreads Redlock on the clock issue. A key feature of Redlock is that the validity time it grants is the lock's default expiry time minus the time spent acquiring the lock, so the time available for the client's processing is a relative duration that does not depend on local wall-clock time.
From this point of view, Redlock's correctness can be reasonably guaranteed. Analyzed carefully, compared with a single Redis node, the main thing Redlock provides is higher reliability, which matters in some scenarios. But I think Redlock pays too high a price for that reliability.
- First, 5 nodes must be deployed to make Redlock sufficiently reliable.
- Acquiring the lock requires requests to all 5 nodes. Using futures you can send the 5 requests concurrently and collect the responses together, which shortens the response time, but it is still slower than a single-node Redis lock.
- Because at least 3 of the 5 nodes must be acquired, contention can leave every client with only 1-2 nodes so that nobody gets the lock. The Redis author borrows an idea from Raft, retrying after a random delay, which greatly reduces the window of conflict, but it cannot be avoided entirely, especially on the first acquisition attempt, so the cost of acquiring the lock rises.
- If 2 of the 5 nodes are down, the lock's availability drops sharply: the client first has to wait for its requests to the two dead nodes to time out, and with only 3 live nodes left it must acquire the lock on all 3 of them, which is harder.
- If the network partitions, a client may never be able to obtain the lock at all.
After all this analysis, I think the crux of Redlock's problem is that it pushes write consistency onto the clients: the 5 backend nodes are completely independent, and every client has to operate on all 5 of them. If the 5 nodes instead had a leader, a client would only need to acquire the lock from the leader, which would take care of synchronizing the data to the others, and the partition, timeout and conflict problems above would largely disappear. So to guarantee the correctness of a distributed lock, I believe a strongly consistent distributed coordination service is the better answer.
A further question: how long should the expiration time be? If it is too short, the lock is auto-released before the method finishes and concurrency problems appear; if it is too long, other threads waiting for the lock may have to wait a long time.
This problem also exists with the use of databases to implement distributed locks.
The mainstream approach to this problem is to give each acquired lock a short timeout and start a thread that refreshes the lock's expiry shortly before it is reached, stopping that thread when the lock is released. For example, Redisson, a popular Redis client for Java that ships a distributed lock, uses this scheme.
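A rough sketch of that renewal idea (this is only the shape of the approach, not Redisson's actual watchdog; key names and intervals are illustrative):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import redis.clients.jedis.Jedis;

// Refresh the lock's TTL periodically while the business logic is still running,
// and stop refreshing when the lock is released.
public class LockRenewer {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Jedis jedis;
    private ScheduledFuture<?> renewTask;

    public LockRenewer(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Start renewing: push the expiry out again every ttl/3 seconds. */
    public void startRenewing(String lockKey, int ttlSeconds) {
        long period = Math.max(1, ttlSeconds / 3);
        renewTask = scheduler.scheduleAtFixedRate(
                () -> jedis.expire(lockKey, ttlSeconds),
                period, period, TimeUnit.SECONDS);
    }

    /** Stop renewing at the same time as releasing the lock. */
    public void stopRenewing() {
        if (renewTask != null) {
            renewTask.cancel(false);
        }
    }
}
```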
Advantages of using a cache to implement distributed locks
Good performance.
Disadvantages of using a cache to implement distributed locks
The implementation is quite involved; there are many factors to consider.
3. Distributed locks based on Zookeeper
Distributed locks based on Zookeeper's ephemeral sequential nodes.
The general idea: when a client wants to lock a method, it creates a unique ephemeral sequential node under the directory node that corresponds to that method on Zookeeper. Determining who holds the lock is simple: the client whose node has the smallest sequence number holds it. Releasing the lock is just deleting that ephemeral node, and because the node is ephemeral this also avoids the deadlock caused by a crashed service never releasing the lock.
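A condensed sketch of this idea with the raw ZooKeeper API (the lock path is illustrative, and error handling and edge cases are heavily simplified):

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the ephemeral-sequential-node lock: create a child under the lock
// directory, take the lock if your node has the smallest sequence number,
// otherwise wait for the node just before yours to disappear.
public class ZkLockSketch {
    private final ZooKeeper zk;
    private final String lockRoot = "/locks/methodLock";   // illustrative path, must already exist
    private String myNode;

    public ZkLockSketch(ZooKeeper zk) {
        this.zk = zk;
    }

    public void lock() throws KeeperException, InterruptedException {
        myNode = zk.create(lockRoot + "/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren(lockRoot, false);
            Collections.sort(children);
            if ((lockRoot + "/" + children.get(0)).equals(myNode)) {
                return;                                     // smallest sequence: we hold the lock
            }
            // wait for the node immediately before ours to be deleted
            int myIndex = children.indexOf(myNode.substring(lockRoot.length() + 1));
            String previous = lockRoot + "/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            if (zk.exists(previous, event -> latch.countDown()) != null) {
                latch.await();
            }
        }
    }

    public void unlock() throws KeeperException, InterruptedException {
        zk.delete(myNode, -1);                              // releasing = deleting the ephemeral node
    }
}
```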
Let's see if Zookeeper can solve the problems mentioned earlier.
- Lock never released? Zookeeper solves this effectively: the client creates an ephemeral node when taking the lock, and if the client crashes after acquiring it (its session breaks), the ephemeral node is deleted automatically and other clients can acquire the lock.
- Non-blocking? Zookeeper notifies the client whenever the watched node changes, and the client then checks whether its own node now has the smallest sequence number.
- Non-reentrant? When creating its node, the client writes its host and thread information into the node's data; on the next acquisition attempt it compares that with the data of the current smallest node. If it matches its own, it takes the lock directly; otherwise it creates a new ephemeral sequential node and joins the queue.
- Single point? Zookeeper solves this effectively: ZK is deployed as a cluster, and as long as more than half of the machines in the cluster are alive it can keep serving.
- Unfair? Zookeeper also gives us a fair lock: the ephemeral nodes clients create are ordered, and each time the lock is released ZK can notify the client holding the smallest remaining node, which guarantees fairness.
A further question: Zookeeper itself is deployed as a cluster, so does it have data-synchronization problems like a Redis cluster?
Zookeeper is a distributed component that guarantees weak consistency, i.e., eventual consistency.
Zookeeper uses a quorum-based protocol for data synchronization. If the cluster has N servers (N is usually odd; 3 already gives good reliability with high read/write performance, and 5 offers the best balance between reliability and performance), then a write is first synchronized to N/2 + 1 servers before the client is told the write succeeded. This quorum-based protocol determines the strength of consistency Zookeeper can offer.
Truly strongly consistent storage is rare in a distributed environment, because it requires every node to be updated synchronously whenever one node's data changes; that strategy exists in fully synchronous master-slave database replication, but it hurts write performance so much that it is seldom used in practice. Since Zookeeper writes synchronously to only N/2 + 1 nodes while the remaining N/2 are updated asynchronously, Zookeeper is not strongly consistent.
An update therefore does not guarantee that a subsequent read will immediately see the new value; it only guarantees eventual consistency. Sacrificing consistency does not mean ignoring data consistency altogether (otherwise the data would be chaos and no amount of availability or scalability would be worth anything); it only means giving up the strong consistency of a relational database and settling for eventual consistency.
Whether Zookeeper meets causal consistency depends on how the client is programmed.
Practices that do not satisfy causal consistency
- Process A writes a piece of data to Zookeeper's /z and returns successfully
- Process A informs process B that A has modified the data of /z
- B reads the data of Zookeeper's /z
- Since the Zookeeper server that B is connected to may not yet have received A's write, B may fail to read the data that A wrote
Practices that meet causal consistency
- Process B listens for data changes in /z on Zookeeper
- Process A writes a piece of data to /z; before the write returns successfully, Zookeeper must fire the watcher registered on /z, so the leader notifies B of the data change
- When process B's event callback fires, it reads the changed data, so B is guaranteed to see the new value
- Causal consistency here refers to the causality between the leader and B: the leader notifies B that the data has changed
This second, watch-based approach is also how Zookeeper is meant to be programmed, so used this way Zookeeper does satisfy causal consistency.
Therefore, when we implement a distributed lock on Zookeeper, we should use the causally consistent approach: threads waiting for the lock watch the lock node, and when the lock is released Zookeeper notifies the waiting thread that is next in line under the fairness rule.
You can also use a third-party Zookeeper client library directly, such as Apache Curator, which already encapsulates a reentrant lock service.
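For example, with Apache Curator's InterProcessMutex (the connection string and lock path are illustrative):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Curator's InterProcessMutex is a ready-made, reentrant distributed lock recipe.
public class CuratorLockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/methodLock");
        lock.acquire();                    // blocks until the lock is obtained
        try {
            // execute the protected business logic here
        } finally {
            lock.release();                // deletes this client's ephemeral node
        }
        client.close();
    }
}
```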
A lock implemented with ZK seems to match exactly what we wanted at the beginning of this article. It is not quite perfect, though: its main drawback is performance, which may not match a caching service, because every acquire and release must dynamically create and delete ephemeral nodes, and node creation and deletion in ZK can only go through the leader, which then replicates the data to all followers.
Advantages of using Zookeeper to implement distributed locks
It effectively solves the single-point, non-reentrancy, non-blocking, and lock-not-released problems, and it is relatively simple to implement.
Disadvantages of using Zookeeper to implement distributed locks
Performance is not as good as a cache-based lock, and it requires an understanding of how ZK works.
Comparison of the three options
In terms of difficulty of understanding (from low to high): database > cache > Zookeeper
In terms of implementation complexity (from low to high): Zookeeper > cache > database
In terms of performance (from high to low): cache > Zookeeper >= database
In terms of reliability (from high to low): Zookeeper > cache > database