Practical distributed locking in backend systems
Tackling race conditions in multi-server architecture
Earlier this year I implemented a feature that tracks transaction hashes and other details of blockchain transactions to trigger a series of off-chain data processing. I implemented it as a simple message queue on a PostgreSQL table, with support for retry/fail/pending states.
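A hypothetical shape of such a queue table (the table and column names here are illustrative, not the actual schema):

```sql
-- Sketch of a Postgres-backed message queue table for blockchain events
CREATE TABLE tx_events (
  id         BIGSERIAL PRIMARY KEY,
  tx_hash    TEXT NOT NULL,
  status     TEXT NOT NULL DEFAULT 'pending',  -- pending | retry | failed | done
  attempts   INT  NOT NULL DEFAULT 0,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

A worker polls for rows with status 'pending', processes them, and updates the status accordingly.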
This works as expected in a single-server architecture, where you have one instance of your server and one worker picking up pending tasks from the queue to process. Scaling this implementation up to a multi-server architecture, I ran into issues like data races and unnecessary compute usage across nodes, because multiple worker instances were racing to process a single event.
Of course, there are many ways to fix this. The most obvious is a multi-server, single-worker architecture: a single worker processes events from the queue, so there are no data races, no spikes in resource usage, and so on. But this introduces a single point of failure: when the worker goes down, all processing is halted. That is where distributed locking comes into play.
What is distributed locking?
The purpose of a lock is to ensure that a resource can only be accessed by a single node at a time. It prevents multiple nodes from doing the same work, like processing an event or performing some computation. In concurrent systems, locks are vital because they prevent two processes from accessing the same data at once. If a lock fails, it can lead to corrupted data, memory bugs (such as referencing deallocated memory), or other serious issues. To read more about distributed locking, check out this article by Martin Kleppmann.
Ways to Implement Distributed Locking
There are many ways to implement a distributed lock, whichever method you choose should depend on your architecture.
If you’re using Postgres, one option is transaction isolation. The default level is READ COMMITTED, in which each query sees only rows committed before that query began. Read committed is not ideal for our use case; one way to address this is to set the transaction isolation level to SERIALIZABLE. The PostgreSQL documentation describes it as follows:
All statements of the current transaction can only see rows committed before the first query or data-modification statement was executed in this transaction. If a pattern of reads and writes among concurrent serializable transactions would create a situation which could not have occurred for any serial (one-at-a-time) execution of those transactions, one of them will be rolled back with a serialization_failure error.
This enforces the correctness of your application, but at the cost of availability: conflicting transactions are rolled back and must be retried.
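As a sketch (the tx_events table and its columns are hypothetical), a worker claiming a pending event under SERIALIZABLE would look like this, with the application retrying whenever the transaction is rolled back with a serialization_failure (SQLSTATE 40001) error:

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;

-- Claim one pending event; if a concurrent transaction claims the same
-- row, one of the two fails with serialization_failure and must retry.
UPDATE tx_events
   SET status = 'processing'
 WHERE id = (SELECT id FROM tx_events WHERE status = 'pending' LIMIT 1);

COMMIT;
```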
Advisory locks (PostgreSQL)
This is another way of implementing distributed locking with PostgreSQL. Advisory locks provide a means for creating locks with application-defined meanings; they are called advisory because the system doesn't enforce their use.
Advisory locks can be acquired in two ways: at the session level or transaction level. Session-level locks are held until explicitly released or until the session is over, while transaction-level locks are automatically released at the end of the transaction.
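For example (the key 42 is an arbitrary application-defined lock ID):

```sql
-- Session-level: held until explicitly released or the session ends
SELECT pg_advisory_lock(42);
-- ... do work ...
SELECT pg_advisory_unlock(42);

-- Transaction-level: released automatically at COMMIT or ROLLBACK
BEGIN;
SELECT pg_advisory_xact_lock(42);
-- ... do work ...
COMMIT;
```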
SELECT pg_advisory_lock(id) FROM foo WHERE id = 12345; -- ok
SELECT pg_advisory_lock(id) FROM foo WHERE id > 12345 LIMIT 100; -- danger!
SELECT pg_advisory_lock(q.id) FROM
(
  SELECT id FROM foo WHERE id > 12345 LIMIT 100
) q; -- ok
To learn more about advisory locks, see the PostgreSQL documentation.
The explicit use of locking increases the probability of deadlocks, which occur when two (or more) transactions/processes each hold locks that the other wants. PostgreSQL detects deadlocks and resolves them by rolling back one of the transactions.
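As a sketch of how such a deadlock arises with advisory locks (lock IDs 1 and 2 are arbitrary): two sessions that take the same pair of locks in opposite order each end up waiting on the other, and PostgreSQL aborts one of them:

```sql
-- Session A                          -- Session B
BEGIN;                                BEGIN;
SELECT pg_advisory_xact_lock(1);
                                      SELECT pg_advisory_xact_lock(2);
SELECT pg_advisory_xact_lock(2);      -- Session A blocks, waiting for B
                                      SELECT pg_advisory_xact_lock(1);
                                      -- deadlock detected; one of the two
                                      -- transactions is rolled back
```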
Distributed locking with Redis
Redis is probably the world’s fastest in-memory database, but it’s much more than a cache. While researching different options for implementing a distributed lock, I came across an article by Martin Kleppmann about Redlock, an algorithm from Redis. The algorithm claims to address the challenges of data consistency and race conditions in distributed systems where multiple processes, nodes, or threads access shared resources concurrently. Redlock is based on using multiple Redis nodes (instances) to achieve distributed locking: the algorithm coordinates those instances when a client attempts to acquire a lock, and if the majority of the nodes agree on the acquisition, the client is granted the lock.
Example code from the Node.js implementation’s documentation:
import Client from "ioredis";
import Redlock from "./redlock";

const redisA = new Client({ host: "a.redis.example.com" });
const redisB = new Client({ host: "b.redis.example.com" });
const redisC = new Client({ host: "c.redis.example.com" });

const redlock = new Redlock(
  // You should have one client for each independent redis node
  // or cluster.
  [redisA, redisB, redisC],
  {
    // The expected clock drift; for more details see:
    // http://redis.io/topics/distlock
    driftFactor: 0.01, // multiplied by lock ttl to determine drift time

    // The max number of times Redlock will attempt to lock a resource
    // before erroring.
    retryCount: 10,

    // the time in ms between attempts
    retryDelay: 200, // time in ms

    // the max time in ms randomly added to retries
    // to improve performance under high contention
    // see https://www.awsarchitectureblog.com/2015/03/backoff.html
    retryJitter: 200, // time in ms

    // The minimum remaining time on a lock before an extension is automatically
    // attempted with the `using` API.
    automaticExtensionThreshold: 500, // time in ms
  }
);
Acquiring a lock looks like this:
// Acquire a lock.
let lock = await redlock.acquire(["a"], 5000);
try {
  // Do something...
  await something();

  // Extend the lock.
  lock = await lock.extend(5000);

  // Do something else...
  await somethingElse();
} finally {
  // Release the lock.
  await lock.release();
}
This looks pretty good and might suit your use case if you run multiple Redis instances. But if you only have one Redis instance in your infrastructure, there is no reason to incur the cost of running five Redis instances just to implement a distributed lock for efficiency purposes.
Depending on the use case, most people might want a lock for two major reasons: efficiency and correctness.
Efficiency: a lock is used to prevent multiple nodes from doing the same work. If the lock fails, you end up paying more $$$ for the extra work done on AWS, or suffer a minor inconvenience.
Correctness: a lock is used to prevent concurrent processes from interfering with each other and messing up the state of your system. If the lock fails and two nodes work on a shared resource concurrently, the resource is corrupted, data is lost, inconsistencies set in, and so on.
Distributed locking with a single Redis instance
You can use SET with the NX option for simple locking in Redis; it is one of the simplest ways to implement a distributed lock. NX only sets the key if it doesn’t already exist (SET if Not eXists).
Sample code in TypeScript (using the node-redis client):

async acquireLock(key: string, lockTime = this.MAX_LOCK_TIME) {
  // NX: only set the key if it does not already exist
  // EX: expire the key after lockTime seconds
  const result = await redisClient.set(key, 'true', { NX: true, EX: lockTime });
  return result === 'OK';
}

async freeLock(key: string) {
  return await redisClient.del(key);
}
Limitations of the SET NX Approach
Expiration: if a process crashes while holding the lock, the lock would never be released. To overcome this, we set a time-to-live (TTL) on the lock key when acquiring it, ensuring the lock is automatically released after expiration even if a process crashes without releasing it.
Lock release: we store a unique identifier as the lock’s value to guarantee that only the lock holder can release the lock; the delete method is invoked only if the stored value matches the caller’s identifier.
Code sample with the lock-release check implemented:

async acquireLock(key: string, identifier: string, lockTime = this.MAX_LOCK_TIME) {
  // identifier uniquely identifies this holder (e.g. a UUID per process)
  const result = await redisClient.set(key, identifier, { NX: true, EX: lockTime });
  return result === 'OK';
}

async freeLock(key: string, identifier: string) {
  const holder = await redisClient.get(key);
  // Only delete the key if we are still the holder
  if (holder === identifier) await redisClient.del(key);
}
Conclusion
Hopefully, you have learned more about distributed locking and can make an informed decision about which approach to implement for your use case.
If you want to learn more about Redlock, see: Redis Lock. Please give this post a like and share so others can benefit.
If you’re interested in Fullstack development, distributed systems, software architecture and Rust 👀 Subscribe.
References
https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
https://severalnines.com/blog/understanding-deadlocks-mysql-postgresql/
https://www.red-gate.com/simple-talk/databases/postgresql/database-concurrency-in-postgresql/