This is the third post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In the first post, I examined how streaming operations impact node density and laid the groundwork for understanding why higher node density leads to significant cost savings. In the second post, I discussed how compaction throughput is critical to node density and introduced the optimizations we implemented in CASSANDRA-15452 to improve throughput on disaggregated storage like EBS.
This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput and a recent optimization in Cassandra 5.0.4 that significantly improves it: CASSANDRA-15452.
This post assumes some familiarity with Apache Cassandra storage engine fundamentals. The documentation has a nice section covering the storage engine if you’d like to brush up before reading this post.
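If you’d like to check where your own cluster stands before diving in, compaction throughput can be read and adjusted at runtime with nodetool. Below is a minimal sketch using Python’s subprocess module; it assumes nodetool is on your PATH and talking to a local node, and the helper function names are mine, not part of Cassandra.

```python
import subprocess

def get_compaction_throughput() -> str:
    """Return the node's current compaction throughput cap."""
    result = subprocess.run(
        ["nodetool", "getcompactionthroughput"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def set_compaction_throughput(mib_per_sec: int) -> None:
    """Set a new compaction throughput cap; 0 disables throttling."""
    subprocess.run(
        ["nodetool", "setcompactionthroughput", str(mib_per_sec)],
        check=True,
    )

if __name__ == "__main__":
    print(get_compaction_throughput())
```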
In software engineering, and especially in distributed systems, continuous learning and experimentation aren’t just beneficial; they’re essential. As an engineer focused on distributed systems, particularly Apache Cassandra, I’ve taken this ethos to heart. It’s led me not only to explore the intricacies of Cassandra’s distributed architecture but also to share my experiences and findings with a broader audience. That’s why my YouTube channel has become an active platform where I stream at least once a week, engaging with viewers through coding sessions, trying new approaches, and benchmarking different Cassandra workloads.
As I promised in December, I redid my presentation from the Cassandra Summit 2023 on a live stream. You can check it out at the bottom of this post.
Going forward, I’ll be live-streaming on Tuesdays at 10AM Pacific on my YouTube channel.
Next week I’ll be taking a look at tlp-stress, which teams at some of the biggest Cassandra deployments in the world use to benchmark their clusters. You can find that here.
Welcome back to the Month of Profiling! In the first two weeks we took a look at using async-profiler to generate flame graphs and bcc-tools to understand your hardware. This week we’ll be focusing on eBPF, a rapidly evolving technology that lets you extract a ton of useful information out of the Linux kernel, even when no tool was built specifically for the question you’re asking. Think of it as a way to write small programs that run safely inside the Linux kernel, with almost no overhead, for the purpose of observability.
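To make that concrete, here’s the canonical hello-world of eBPF via the bcc Python bindings, a sketch that assumes bcc is installed and the script runs as root: a few lines of C are compiled on the fly, loaded into the kernel, and attached to the clone() syscall.

```python
from bcc import BPF

# A tiny eBPF program: print a message every time clone() is entered.
prog = r"""
int hello(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=prog)  # compiles the C above and loads it into the kernel
# Attach our function to the kernel's entry point for the clone() syscall.
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")

print("Tracing clone()... hit Ctrl-C to exit")
b.trace_print()  # stream output from the kernel trace pipe
```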
Welcome back to my series on performance profiling! In this second installment, we’re shifting our focus to a hands-on exploration of our hardware using the power of bcc-tools. Building on the foundational knowledge from our first post about flame graphs, we’re now ready to dig deeper into system performance. These tools are a great way to understand whether we’re hitting bottlenecks in our hardware.
For a little background, bcc-tools is a collection of utilities that expose underlying kernel statistics through eBPF. eBPF is an amazing technology that lets us tap into what’s going on in the Linux kernel with negligible performance overhead. Fun stuff!
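As a quick taste of the kind of statistics these utilities expose, here’s a sketch that runs bcc’s biolatency to print a single ten-second histogram of block I/O latency. The install path here is an assumption on my part; it varies by distro (Debian and Ubuntu package it as biolatency-bpfcc), and like all eBPF tooling it needs root.

```python
import subprocess

# biolatency takes [interval [count]]: one 10-second histogram, then exit.
subprocess.run(["/usr/share/bcc/tools/biolatency", "10", "1"], check=True)
```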
Welcome to the first article of ‘The Month of Profiling,’ a series I will be presenting throughout November. We’ll explore a range of technologies that are crucial for identifying and resolving performance issues, as well as for fine-tuning your systems. The series will conclude with a detailed walkthrough on how to integrate these technologies for effective performance optimization.
Introduction
Optimizing the performance of a complex application like a database requires a deep understanding of multiple components: its resource utilization, the behavior of the JVM, and how nodes interact with each other. There are several useful approaches to understanding an application’s behavior, whether you’re performance tuning or dealing with an outage. In most circumstances, teams will rely on a set of dashboards built from various metrics emitted by the database and the underlying operating system. However, the picture is only as complete as the metrics that are available, and in my experience, there can never be enough metrics to get a complete understanding of an application’s behavior. In an ideal world, we’d know exactly where our time is spent and what is allocating memory, two things that have an enormous impact on performance. By understanding what’s happening in our database and the tuning options that are available, we can often tweak a few settings for a significant improvement in performance.
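As a sketch of what that looks like in practice, async-profiler can answer both questions, CPU time and allocations, with a flame graph. This assumes async-profiler 3.x, where the launcher is called asprof (older releases shipped profiler.sh), and the PID below is a placeholder for your JVM.

```python
import subprocess

JVM_PID = "12345"  # placeholder: the PID of the Cassandra (or other JVM) process

# Sample CPU for 60 seconds and write an HTML flame graph.
subprocess.run(["asprof", "-d", "60", "-e", "cpu", "-f", "cpu.html", JVM_PID], check=True)

# The same run with the alloc event shows what is allocating memory.
subprocess.run(["asprof", "-d", "60", "-e", "alloc", "-f", "alloc.html", JVM_PID], check=True)
```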