Uncover Cassandra's Throughput Boundaries With the New Adaptive Scheduler in tlp-stress

Introduction

Apache Cassandra remains the preferred choice for organizations seeking a massively scalable NoSQL database. To keep performance predictable, Cassandra administrators and developers rely on benchmarking tools like tlp-stress, nosqlbench, and ndbench to help them discover their cluster’s limits. In this post, we will explore the latest advancements in tlp-stress, highlighting the introduction of the new Adaptive Scheduler. This feature allows users to more easily uncover the throughput boundaries of Cassandra clusters while remaining within specific read and write latency targets. First, though, we’ll take a brief look at a new workload designed to stress test the Storage Attached Indexes feature coming in Cassandra 5.

Benchmarking SAI Indexes: Improving Query Flexibility

My fork of tlp-stress has early support for stress testing Storage Attached Indexes, abbreviated as SAI. SAI is a new feature coming in Cassandra 5.0 that allows for secondary indexes that are “attached” to each SSTable on disk, and are built using Lucene. This will give us the ability to add indexes to our tables supporting equality, range and Approximate Nearest Neighbor Vector search, which is heavily used in AI. The status quo is to maintain separate tables for each query we need to execute. This is probably one of the biggest complaints about working with Cassandra, and it’s a fair criticism. It’s not always practical (or enjoyable) to manually maintain our own views, especially when one or more of the following is true:

  • There are multiple fields we might want to query on.
  • The alternative views have the exact same structure as the original table, just filtered by a field.
  • The index is a Vector index that benefits from heavily optimized code like Jonathan Ellis’s JVector project.
  • A combination of several of the above, leading to an impractical combinatorial explosion in tables simply to answer queries.

The optimal use case for SAI indexes is searching within a single partition; however, global searches are also supported, and are necessary for Vector search.

Here’s what running a quick test locally looks like:

bin/tlp-stress run SAI -d 1h --rate 100 --populate 1k -r .9 \
   --replication "{'class':'SimpleStrategy','replication_factor':1}" 

This creates the schema, pre-populates the database with 1,000 rows at RF=1, then runs the workload for one hour at 90% reads, performing 100 queries per second.

I’ll be doing a deep dive on this feature in a future post as we get closer to the release and I can share some numbers and guidance that won’t be immediately outdated, particularly around global queries which are a big unknown right now.

The Adaptive Scheduler: Optimizing Cassandra Benchmarking

The recently introduced Adaptive Scheduler takes Cassandra benchmarking to the next level by allowing a user to specify target read and write latencies and letting tlp-stress find the maximum throughput for a given workload. This work is inspired by the Netflix blog post Performance Under Load, but deviates in that it allows the user to find the throughput that a cluster can sustain without exceeding its latency Service Level Objective (SLO).

Prior to this work, older versions of tlp-stress had two issues that made them a challenge to work with in certain situations. While a rate limiter existed via the --rate flag, the primary mechanism to control throughput was a concurrency limiter, -c. This allowed a specified number of in-flight queries, gated by a semaphore, with new queries executing only after in-flight ones had completed. While this worked fine in a lot of situations, it had two problems:

  1. Limiting by a concurrency control meant queries weren’t generated at a predictable rate, and it prevented the cluster from being overwhelmed by more queries than it could handle. On the surface this is great - you could just let it run, and it would give you a throughput and latency number. However, because the system lacked a scheduler, it suffered from coordinated omission, meaning the latency numbers you got back would be slightly better than they would be with a scheduler in place. It also meant you might not truly be sending requests at the rate you specified. For example, running a workload with --rate 10000000 would be a maximum limit, and almost certainly not a reflection of what latency you’d see with a cluster that’s actually handling ten million requests a second. This was documented in the TLP blog post Comparing stress tools for Apache Cassandra.
  2. If you do have a latency SLO, it could be time-consuming to figure out the right mix of --rate and -c to meet your objective. Load tests are typically run over a long period of time, the minimum being an hour, but the longer, the better. Some issues only pop up after a long time, such as full garbage collections, memory leaks, and faulty hardware. If the cost of running a test repeatedly is losing several days, then that’s a major problem.

It was the second issue that came up recently with a team that was using tlp-stress to discover their cluster’s capacity limits. This team had very strict latency goals, which meant we needed to restart the tests frequently. The ideal situation would be to specify the SLO and let tlp-stress figure out the throughput, which led me to the adaptive scheduler.

The existing work surrounding adaptive rate limiting was focused on queuing as a feedback mechanism, which is great if you want to determine the maximum overall throughput for a system in order to not overwhelm it, but it doesn’t help us if we have an SLO on the database. Instead of using the queue as a feedback mechanism, we’ll use our internal benchmarking metrics. If we’re under the SLO, we increase the throughput; if we’re over, we decrease it. Seems simple enough. The question is, by how much?

One of the tricky parts about changing throughput is that with a low SLO, small changes in throughput can make a big difference to p99 latency. To deal with this, we use the SLO to guide the absolute maximum rate at which throughput can be increased. Here’s what that Kotlin code looks like (maxLatency here is measured in milliseconds):

var maxIncrease = (1.0 + sqrt(maxLatency.toDouble()) / 100.0).coerceAtMost(1.05)

If our latency requirement is 4ms, we’ll only increase by 2% at most. This caps out at 5% to avoid large changes, and allows the system to behave reasonably well at higher latencies. Additionally, the closer tlp-stress is to the target latency, the less it’ll increase the throughput. Our goal is to gently approach the target latency, not wildly oscillate around it.
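
To make the feedback loop concrete, here’s a minimal sketch of how that adjustment could be applied at each evaluation interval. This is illustrative only, not the actual tlp-stress implementation; the names (adjustRate, currentRate, observedP99) and the back-off factor are hypothetical.

import kotlin.math.sqrt

// Illustrative sketch only - not the tlp-stress implementation.
// maxLatency is the SLO in milliseconds, observedP99 is the measured p99 in milliseconds.
fun adjustRate(currentRate: Double, observedP99: Double, maxLatency: Long): Double {
    // Cap the per-step increase based on the latency target, as described above.
    val maxIncrease = (1.0 + sqrt(maxLatency.toDouble()) / 100.0).coerceAtMost(1.05)
    return if (observedP99 <= maxLatency) {
        // Under the SLO: shrink the step as we approach the target,
        // easing into the limit instead of oscillating around it.
        val headroom = 1.0 - observedP99 / maxLatency // 1.0 = far below target, 0.0 = at target
        currentRate * (1.0 + (maxIncrease - 1.0) * headroom)
    } else {
        // Over the SLO: back off (the 5% step here is an arbitrary illustrative choice).
        currentRate * 0.95
    }
}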

To use this feature, for now you’ll need to build from source.

git clone https://github.com/rustyrazorblade/tlp-stress.git
cd tlp-stress
./gradlew shadowdist installdist
# or, if you want a zip you can copy around (look in build/distributions):
./gradlew distzip

To leverage this new feature, set your initial throughput using the --rate flag and define the maximum read and write latencies (in milliseconds) with --maxrlat and --maxwlat, respectively. The adaptive scheduler is completely optional; if you provide only a --rate, the system will use a static scheduler instead.
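
For example, reusing the SAI workload from earlier, a run with a 10 ms read and write latency target might look like the following (the starting rate, duration, and latency targets are just illustrative values):

bin/tlp-stress run SAI -d 1h --rate 5000 --maxrlat 10 --maxwlat 10

The --rate here is only the starting point; the adaptive scheduler raises or lowers it over the course of the run based on the measured latencies.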

Conclusion

The latest enhancements in tlp-stress, particularly the introduction of the Adaptive Scheduler, make it an invaluable tool for benchmarking Cassandra clusters. From benchmarking SAI indexes to optimizing resource allocation and meeting specific read and write latency targets, tlp-stress equips administrators and developers to unlock the full potential of their Cassandra clusters. Whether you’re an experienced Cassandra administrator or a newcomer, tlp-stress, with its new features, is the perfect companion in your quest to ensure your Cassandra clusters are prepared to tackle real-world workloads with ease.

For the foreseeable future, look forward to a new post on Tuesdays. November will feature four posts, all focused on profiling, which has helped me uncover all sorts of weird issues in production.

If you found this post helpful, please consider sharing it with your network. I'm also available to help you be successful with your distributed systems! Please reach out if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.