So You Want to Optimize gRPC - Part 2
How fast is gRPC? Pretty fast if you understand how modern clients and servers are built. In part 1, I showed how to get an easy 60% improvement. In this post I show how to get a 10000% improvement.
Setup
As in part 1, we will start with an existing, Java based, key-value service. The service will offer concurrent access for creating, reading, updating, and deleting keys and values. All the code can be seen here if you want to try it out.
Server Concurrency
Let’s look at the KvService
class. This service handles the RPCs sent by the client, making sure that none of them
accidentally corrupt the state of storage. To ensure this, the service uses the synchronized
keyword to ensure only one RPC is active at a time:
private final Map<ByteBuffer, ByteBuffer> store = new HashMap<>();
@Override
public synchronized void create(
CreateRequest request, StreamObserver<CreateResponse> responseObserver) {
ByteBuffer key = request.getKey().asReadOnlyByteBuffer();
ByteBuffer value = request.getValue().asReadOnlyByteBuffer();
simulateWork(WRITE_DELAY_MILLIS);
if (store.putIfAbsent(key, value) == null) {
responseObserver.onNext(CreateResponse.getDefaultInstance());
responseObserver.onCompleted();
return;
}
responseObserver.onError(Status.ALREADY_EXISTS.asRuntimeException());
}
While this code is thread safe, it comes at a high price: only one RPC can ever be active! We need some way of allowing multiple operations to happen safely at the same time. Otherwise, the program can’t take advantage of all the available processors.
Breaking the Lock
To solve this, we need to know a little more about the semantics of our RPCs. The more we know about how the RPCs are supposed to work, the more optimizations we can make. For a key-value service, we notice that operations to different keys don’t interfere with each other. When we update key ‘foo’, it has no bearing on the value stored for key ‘bar’. But, our server is written such that operations to any key must be synchronized with respect to each other. If we could make operations to different keys happen concurrently, our server could handle a lot more load.
With the idea in place, we need to figure out how to modify the server. The
synchronized
keyword causes Java to acquire a lock on this
, which is the instance of
KvService
. The lock is acquired when the create
method is entered, and released on return.
The reason we need synchronization is to protect the store
Map. Since it is implemented as a
HashMap
, modifications to it change the internal
arrays. Because the internal state of the HashMap
will be corrupted if not properly
synchronized, we can’t just remove the synchronization on the method.
However, Java offers a solution here: ConcurrentHashMap
. This class offers the ability to
safely access the contents of the map concurrently. For example, in our usage we want to check
if a key is present. If not present, we want to add it, else we want to return an error. The
putIfAbsent
method atomically checks if a value is present, adds it if not, and tells us if
it succeeded.
Concurrent maps provide stronger guarantees about the safety of putIfAbsent
, so we can swap the
HashMap
to a ConcurrentHashMap
and remove synchronized
:
private final ConcurrentMap<ByteBuffer, ByteBuffer> store = new ConcurrentHashMap<>();
@Override
public void create(
CreateRequest request, StreamObserver<CreateResponse> responseObserver) {
ByteBuffer key = request.getKey().asReadOnlyByteBuffer();
ByteBuffer value = request.getValue().asReadOnlyByteBuffer();
simulateWork(WRITE_DELAY_MILLIS);
if (store.putIfAbsent(key, value) == null) {
responseObserver.onNext(CreateResponse.getDefaultInstance());
responseObserver.onCompleted();
return;
}
responseObserver.onError(Status.ALREADY_EXISTS.asRuntimeException());
}
If at First You Don’t Succeed
Updating create
was pretty easy. Doing the same for retrieve
and delete
is easy too.
However, the update
method is a little trickier. Let’s take a look at what it’s doing:
@Override
public synchronized void update(
UpdateRequest request, StreamObserver<UpdateResponse> responseObserver) {
ByteBuffer key = request.getKey().asReadOnlyByteBuffer();
ByteBuffer newValue = request.getValue().asReadOnlyByteBuffer();
simulateWork(WRITE_DELAY_MILLIS);
ByteBuffer oldValue = store.get(key);
if (oldValue == null) {
responseObserver.onError(Status.NOT_FOUND.asRuntimeException());
return;
}
store.replace(key, oldValue, newValue);
responseObserver.onNext(UpdateResponse.getDefaultInstance());
responseObserver.onCompleted();
}
Updating a key to a new value needs two interactions with the store
:
- Check to see if the key exists at all.
- Update the previous value to the new value.
Unfortunately ConcurrentMap
doesn’t have a straightforward method to do this. Since we may not
be the only ones modifying the map, we need to handle the possibility that our assumptions
have changed. We read the old value out, but by the time we replace it, it may have been deleted.
To reconcile this, let’s retry if replace
fails. It returns true if the replace
was successful. (ConcurrentMap
asserts that the operations will not corrupt the internal
structure, but doesn’t say that they will succeed!) We will use a do-while loop:
@Override
public void update(
UpdateRequest request, StreamObserver<UpdateResponse> responseObserver) {
// ...
ByteBuffer oldValue;
do {
oldValue = store.get(key);
if (oldValue == null) {
responseObserver.onError(Status.NOT_FOUND.asRuntimeException());
return;
}
} while (!store.replace(key, oldValue, newValue));
responseObserver.onNext(UpdateResponse.getDefaultInstance());
responseObserver.onCompleted();
}
The code wants to fail if it ever sees null, but never if there is a non-null previous value. One
thing to note is that if another RPC modifies the value between the store.get()
call and the
store.replace()
call, it will fail. This is a non-fatal error for us, so we will just try again.
Once it has successfully put the new value in, the service can respond back to the user.
There is one other possibility that could happen: two RPCs could update the same value and overwrite each other’s work. While this may be okay for some applications, it would not be suitable for APIs that provide transactionality. It is out of scope for this post to show how to fix this, but be aware it can happen.
Measuring the Performance
In the last post, we modified the client to be asynchronous and use the gRPC ListenableFuture API. To avoid running out of memory, the client was modified to have at most 100 active RPCs at a time. As we now see from the server code, performance was bottlenecked on acquiring locks. Since we have removed those, we expect to see a 100x improvement. The same amount of work is done per RPC, but a lot more are happening at the same time. Let’s see if our hypothesis holds:
Before:
$ ./gradlew installDist
$ time ./build/install/kvstore/bin/kvstore
Apr 16, 2018 10:38:42 AM io.grpc.examples.KvRunner runClient
INFO: Did 24.067 RPCs/s
real 1m0.886s
user 0m9.340s
sys 0m1.660s
After:
Apr 16, 2018 10:36:48 AM io.grpc.examples.KvRunner runClient
INFO: Did 2,449.8 RPCs/s
real 1m0.968s
user 0m52.184s
sys 0m20.692s
Wow! From 24 RPCs per second to 2,400 RPCs per second. And we didn’t have to change our API or our client. This is why understanding your code and API semantics is important. By exploiting the properties of the key-value API, namely the independence of operations on different keys, the code is now much faster.
One noteworthy artifact of this code is the user
timing in the results. Previously the user time
was only 9 seconds, meaning that the CPU was active only 9 of the 60 seconds the code was running.
Afterwards, the usage went up by more than 5x to 52 seconds. The reason is that more CPU cores are
active. The KvServer
is simulating work by sleeping for a few milliseconds. In a real
application, it would be doing useful work and not have such a dramatic change. Rather than
scaling per the number of RPCs, it would scale per the number of cores. Thus, if your machine had
12 cores, you would expect to see a 12x improvement. Still not bad though!
More Errors
If you run this code yourself, you will see a lot more log spam in the form:
Apr 16, 2018 10:38:40 AM io.grpc.examples.KvClient$3 onFailure
INFO: Key not found
io.grpc.StatusRuntimeException: NOT_FOUND
The reason is that the new version of the code makes API level race conditions more apparent. With 100 times as many RPCs happening, the chance of updates and deletes colliding with each other is more likely. To solve this we will need to modify the API definition. Stay tuned for the next post showing how to fix this.
Conclusion
There are a lot of opportunities to optimize your gRPC code. To take advantage of these, you need to understand what your code is doing. This post shows how to convert a lock-based service into a low-contention, lock-free service. Always make sure to measure before and after your changes.
In Part 3, we will optimize the code even further. 2,400 RPC/s is just the beginning!