Cassandra

Building easy-cass-mcp: An MCP Server for Cassandra Operations

3 min read

I’ve started working on a new project that I’d like to share, easy-cass-mcp, an MCP (Model Context Protocol) server specifically designed to assist Apache Cassandra operators.

After spending over a decade optimizing Cassandra clusters in production environments, I’ve seen teams consistently struggle to interpret system metrics, configuration settings, and schema design, and most importantly, to understand how they all impact each other. While many teams have solid monitoring through JMX-based collectors, extracting and contextualizing specific operational metrics for troubleshooting or optimization can still be cumbersome. The good news is that we now have the infrastructure to make all this operational knowledge accessible through conversational AI.

cassandra mcp operations
Read more

easy-cass-stress Joins the Apache Cassandra Project

3 min read

I’m taking a quick break from my series on Cassandra node density to share some news with the Cassandra community: easy-cass-stress has officially been donated to the Apache Software Foundation and is now part of the Apache Cassandra project ecosystem as cassandra-easy-stress.

Why This Matters

Over the past decade, I’ve worked with countless teams struggling with Cassandra performance testing and benchmarking. The reality is that stress testing distributed systems requires tools that can accurately simulate real-world workloads. Many tools make this difficult by requiring the end user to learn complex configurations and nuance. While consulting at The Last Pickle, I set out to create an easy-to-use tool that lets people get up and running in just a few minutes.

cassandra
Read more

Optimizing Cassandra Repair Processes for Higher Node Density

5 min read

This is the third post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In the first post, I examined how streaming operations impact node density, and in the second post, we explored compaction strategies.

Now, we’ll tackle another critical aspect of Cassandra operations that directly impacts how much data you can efficiently store per node: repair processes. Having worked with repairs across hundreds of clusters, I’ve developed strong opinions on what works and what doesn’t when you’re pushing the limits of node density.

cassandra repair performance
Read more

Query Throughput Optimization for Maximum Cassandra Node Density

7 min read

This is the fourth post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In previous posts, we covered streaming operations, compaction strategies, and repair processes. Now, we’ll focus on optimizing query throughput, a critical aspect that can become a bottleneck as node density increases.

At a high level, these are the leading factors that impact node density:

  • Streaming Throughput
  • Compaction Throughput and Strategies
  • Various Aspects of Repair
  • Query Throughput (this post)
  • Garbage Collection and Memory Management
  • Efficient Disk Access
  • Compression Performance and Ratio
  • Linearly Scaling Subsystems with CPU Core Count and Memory

Why Query Throughput Matters for Node Density

When I first started working with high-density Cassandra deployments, I made a critical mistake: focusing solely on storage capacity while overlooking query efficiency. I learned this lesson the hard way when helping a client scale from 2TB to 8TB per node. Everything looked fine during initial testing, but once their application traffic ramped up, query latencies skyrocketed and throughput plummeted. What happened?

cassandra query throughput performance
Read more

Garbage Collection and Memory Management for High-Density Cassandra Nodes

6 min read

This is the fifth post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In previous posts, we covered streaming operations, compaction strategies, repair processes, and query throughput optimization. Now, we’ll tackle one of the most critical yet often misunderstood aspects of Cassandra performance: garbage collection and memory management.

At a high level, these are the leading factors that impact node density:

  • Streaming Throughput
  • Compaction Throughput and Strategies
  • Various Aspects of Repair
  • Query Throughput
  • Garbage Collection and Memory Management (this post)
  • Efficient Disk Access
  • Compression Performance and Ratio
  • Linearly Scaling Subsystems with CPU Core Count and Memory

Why Memory Management Matters for Node Density

Cassandra is a JVM-based application, which means it’s subject to the constraints and behaviors of Java’s memory management system. As node density increases, efficient memory usage becomes increasingly critical for several reasons:

cassandra garbage collection JVM tuning
Read more

Compaction Strategies, Performance, and Their Impact on Cassandra Node Density

8 min read

This is the third post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In the first post, I examined how streaming operations impact node density and laid out the groundwork for understanding why higher node density leads to significant cost savings. In the second post, I discussed how compaction throughput is critical to node density and introduced the optimizations we implemented in CASSANDRA-15452 to improve throughput on disaggregated storage like EBS.

cassandra compaction performance
Read more

Compression Performance and Ratio: The Final Frontier for Cassandra Node Density

9 min read

This is the seventh post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. We’ve already covered streaming operations, compaction strategies, repair processes, query throughput optimization, garbage collection, and efficient disk access. Now, we’ll focus on the final major factor impacting node density: compression performance and ratio.

At a high level, these are the leading factors that impact node density:

  • Streaming Throughput
  • Compaction Throughput and Strategies
  • Various Aspects of Repair
  • Query Throughput
  • Garbage Collection and Memory Management
  • Efficient Disk Access
  • Compression Performance and Ratio (this post)
  • Linearly Scaling Subsystems with CPU Core Count and Memory

Why Compression Matters for Node Density

Compression is one of the most overlooked yet impactful factors affecting Cassandra node density. It directly influences:

cassandra compression performance tuning
Read more

Efficient Disk Access: Optimizing Storage for High-Density Cassandra Nodes

9 min read

This is the sixth post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. We’ve already covered streaming operations, compaction strategies, repair processes, query throughput optimization, and garbage collection. Now, we’ll focus on one of the most fundamental aspects of database performance: efficient disk access.

For a quick refresher, these are the leading factors that impact node density:

  • Streaming Throughput
  • Compaction Throughput and Strategies
  • Various Aspects of Repair
  • Query Throughput
  • Garbage Collection and Memory Management
  • Efficient Disk Access (this post)
  • Compression Performance and Ratio
  • Linearly Scaling Subsystems with CPU Core Count and Memory

Why Disk Access Matters for Node Density

In my time of operating and optimizing Cassandra clusters, I’ve found that efficient disk access becomes significantly more important as node density increases. When your nodes hold just 1-2TB of data, you might barely notice inefficient disk access patterns. But push that to 10-20TB per node, and these same inefficiencies transform into critical bottlenecks that can cripple your entire system.

cassandra disk IO storage
Read more

The Ultimate Guide to Time Series Data with Apache Cassandra - Part 1: Fundamentals

6 min read

Time series data represents one of the most common and challenging workloads for modern data platforms. From IoT device metrics to financial market data, the ability to efficiently store and query chronological measurements at scale has become a critical requirement for many organizations. After years of helping clients implement time series solutions with Apache Cassandra, I’ve developed a set of best practices that can make the difference between a system that struggles and one that scales effortlessly.

time series cassandra data modeling
Read more

Cassandra Compaction Throughput Performance Explained

12 min read

This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput, and a recent optimization in Cassandra 5.0.4 that significantly improves it, CASSANDRA-15452.

This post assumes some familiarity with Apache Cassandra storage engine fundamentals. The documentation has a nice section covering the storage engine if you’d like to brush up before reading this post.

cassandra compaction performance
Read more

How Cassandra Streaming, Performance, Node Density, and Cost are All Related

16 min read

This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra and have spent tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.

A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.

cassandra streaming compaction
Read more

Cassandra 5 Released! What's New and How to Try it

3 min read

Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.

You can grab the latest release on the Cassandra download page.

cassandra easy-cass-lab
Read more

Variable Int Encoding: Optimizing Storage Efficiency in Cassandra

4 min read

When evaluating a database for your application, it’s essential to consider the total cost of ownership. This includes the obvious expenses like team salaries and hardware costs, but also extends to more subtle factors like storage efficiency and network bandwidth usage. In my years of working with Cassandra clusters, I’ve found that optimizing how data is encoded can have a surprising impact on both performance and operating costs.

Why Encoding Matters for Database Efficiency

Storage efficiency directly affects several aspects of database operations:

java cassandra optimization
Read more

easy-cass-lab updated with Cassandra 5.0 RC-1 Support

5 min read

I’m excited to announce that the latest version of easy-cass-lab now supports Cassandra 5.0 RC-1, which was just made available last week! This update marks a significant milestone, providing users with the ability to test and experiment with the newest Cassandra 5.0 features in a simplified manner. This post will walk you through how to set up a cluster, SSH in, and run your first stress test.

For those new to easy-cass-lab, it’s a tool designed to streamline the setup and management of Cassandra clusters in AWS, making it accessible for both new and experienced users. Whether you’re running tests, developing new features, or just exploring Cassandra, easy-cass-lab is your go-to tool.

easy-cass-lab cassandra
Read more

Benchmarking Disk Performance with FIO: A Practical Guide for Database Administrators

5 min read

In my previous post about LVM, I explored how we can use Logical Volume Management to efficiently manage our disks. I briefly touched on LVM’s powerful caching feature that can significantly accelerate disk access, but left out an important piece: how to properly test and measure these performance improvements. Today, I’ll walk you through creating realistic disk benchmarks using FIO (Flexible I/O Tester) that can help you understand your storage subsystem’s true capabilities.

fio benchmarking cassandra
Read more

easy-cass-lab now available in Homebrew

4 min read

I’m happy to share some exciting news for all Cassandra enthusiasts! My open source project, easy-cass-lab, is now installable via a homebrew tap. This powerful tool is designed to make testing any major version of Cassandra (or even builds that haven’t been released yet) a breeze, using AWS. A big thank-you to Jordan West who took the time to make this happen!

What is easy-cass-lab?

easy-cass-lab is a versatile testing tool for Apache Cassandra. Whether you’re dealing with the latest stable releases or experimenting with unreleased builds, easy-cass-lab provides a seamless way to test and validate your applications. With easy-cass-lab, you can ensure compatibility and performance across different Cassandra versions, making it an essential tool for developers and system administrators. easy-cass-lab is used extensively for my consulting engagements, my training program, and to evaluate performance patches destined for open source Cassandra. Here are a few examples:

cassandra open-source easy-cass-lab
Read more

Cassandra Training Signups For July and August Are Open!

1 min read

I’m pleased to announce that I’ve opened training signups for Operator Excellence to the public for July and August. If you’re interested in stepping up your game as a Cassandra operator, this course is for you. Head over to the training page to find out more and sign up for the course.

training cassandra
Read more

Streaming My Sessions With Cassandra 5.0

1 min read

As a longtime participant in the Cassandra project, I’ve witnessed firsthand the evolution of this incredible database. From its early days to the present, our journey has been marked by continuous innovation, challenges, and a relentless pursuit of excellence. I’m thrilled to share that I’ll be streaming several working sessions over the next several weeks as I evaluate the latest builds and test out new features as we move toward the 5.0 release.

youtube benchmarking cassandra
Read more

Streaming Cassandra Workloads and Experiments

3 min read

Streaming

In the world of software engineering, especially within the realm of distributed systems, continuous learning and experimentation are not just beneficial; they’re essential. As a software engineer with a focus on distributed systems, particularly Apache Cassandra, I’ve taken this ethos to heart. My journey has led me to not only explore the intricacies of Cassandra’s distributed architecture but also to share my experiences and findings with a broader audience. This is why my YouTube channel has become an active platform where I stream at least once a week, engaging with viewers through coding sessions, trying new approaches, and benchmarking different Cassandra workloads.

youtube performance linux
Read more

Cassandra Summit Recap: Performance Tuning and Cassandra Training

3 min read

Hello, friends in the Apache Cassandra community!

I recently had the pleasure of speaking at the Cassandra Summit in San Jose. Unfortunately, we ran into an issue with my screen refusing to cooperate with the projector, so my slides were pretty distorted and hard to read. While the talk is online, I think it would be better to have a version with the right slides as well as a little more time. I’ve decided to redo the entire talk via a live stream on YouTube. I’m scheduling this for 10am PST on Wednesday, January 17 on my YouTube channel. My original talk was done in a 30-minute slot; this one will be a full hour, giving plenty of time for Q&A.

cassandra consulting training
Read more

Cassandra Summit, YouTube, and a Mailing List

2 min read

I am thrilled to share some significant updates and exciting plans with my readers and the Cassandra community. As we draw closer to the end of the year, I’m preparing for an important speaking engagement and mapping out a year ahead filled with engaging and informative activities.

Cassandra Summit Presentation: Mastering Performance Tuning

I am honored to announce that I will be speaking at the upcoming Cassandra Summit. My talk, titled “Cassandra Performance Tuning Like You’ve Been Doing It for Ten Years,” is scheduled for December 13th, from 4:10 pm to 4:40 pm. This session aims to equip attendees with advanced insights and practical skills for optimizing Cassandra’s performance, drawing from a decade’s worth of experience in the field. Whether you’re new to Cassandra or a seasoned user, this talk will provide valuable insights to enhance your database management skills.

cassandra speaking youtube
Read more

Profiling (almost) everything you can think of with bpftrace

7 min read

Welcome back to the month of profiling! The first two weeks we took a look at using the async-profiler to generate flame graphs and bcc-tools to understand your hardware. This week we’ll be focusing on eBPF, a rapidly evolving technology that lets you extract a ton of useful information out of the Linux Kernel even if a tool wasn’t built specifically for it. Think of it as a tool that lets you create small programs that run safely in the Linux Kernel with almost no overhead for the purpose of observability.

performance cassandra profiling
Read more

Hardware profiling with bcc-tools

8 min read

Welcome back to my series on performance profiling! In this second installment, we’re shifting our focus towards a hands-on exploration of our hardware using the power of bcc-tools. Building on the foundational knowledge from our first post about flame graphs, we are now ready to delve deeper into the intricacies of system performance. These tools are a great way of understanding if we’re seeing bottlenecks in our hardware.

For a little background, bcc-tools is a collection of utilities that expose underlying kernel statistics through eBPF. eBPF is an amazing technology that lets us tap into what’s going on in the Linux kernel with negligible performance overhead. Fun stuff!

performance cassandra profiling
Read more

Unveiling Performance Bottlenecks With Flamegraphs using async-profiler

6 min read

Welcome to the first article of ‘The Month of Profiling,’ a series I will be presenting throughout November. We’ll explore a range of technologies that are crucial for identifying and resolving performance issues, as well as for fine-tuning your systems. The series will conclude with a detailed walkthrough on how to integrate these technologies for effective performance optimization.

Introduction

Optimizing the performance of a complex application like a database requires a deep understanding of multiple components: its resource utilization, the behavior of the JVM, and how nodes interact with each other. There are several useful approaches to understanding an application’s behavior, either when performance tuning or dealing with an outage. In most circumstances, teams will rely on a set of dashboards based on various metrics emitted from the database and the underlying operating system. However, the picture is only as complete as the metrics that are available, and in my experience, there can never be enough metrics to get a complete understanding of an application’s behavior. In an ideal world, we’d know exactly where our time is spent and what is allocating memory, two things that have an enormous impact on performance. By understanding what’s happening in our database and the tuning options that are available, we can often tweak some settings for a significant improvement in performance.

performance cassandra profiling
Read more

Uncover Cassandra's Throughput Boundaries with the New Adaptive Scheduler in tlp-stress

7 min read

Introduction

Apache Cassandra remains the preferred choice for organizations seeking a massively scalable NoSQL database. To guarantee predictable performance, Cassandra administrators and developers rely on benchmarking tools like tlp-stress, nosqlbench, and ndbench to help them discover their cluster’s limits. In this post, we will explore the latest advancements in tlp-stress, highlighting the introduction of the new Adaptive Scheduler. This brand-new feature allows users to more easily uncover the throughput boundaries of Cassandra clusters while remaining within specific read and write latency targets. First though, we’ll take a brief look at the new workload designed to stress test the new Storage Attached Indexes feature coming in Cassandra 5.

cassandra benchmarking tlp-stress
Read more

AxonOps Review - An Operations Platform for Apache Cassandra

10 min read

Note: Before we dive into this review of AxonOps and their offerings, it’s important to note that this blog post is part of a paid engagement in which I provided product feedback. AxonOps had no influence or say over the content of this post and did not have access to it prior to publishing.

In the ever-evolving landscape of data management, companies are constantly seeking solutions that can simplify the complexities of database operations. One such player in the market is AxonOps, a company that specializes in providing tooling for operating Apache Cassandra.

cassandra axonops
Read more

Profiling Apache Cassandra with async-profiler

1 min read

In this post I will show you how to use the async-profiler for Java to quickly get to the bottom of some production performance issues. We’ll look at an example using Apache Cassandra.

What is a flame graph?

A flame graph is a stack trace visualization invented by Brendan Gregg that makes it easy to visually identify where some activity occurs in software. Often that’s CPU time, but we can also use flame graphs to identify where allocations occur or when lock contention happens. This makes it an incredibly useful and powerful tool for performance analysis. Netflix makes no secret of the usefulness of flame graphs and at the end of this post we’ll briefly discuss another project, FlameScope.

cassandra java profling
Read more

Benchmarking Apache Cassandra with tlp-stress

6 min read

This post will introduce you to tlp-stress, a tool for benchmarking Apache Cassandra. I started tlp-stress back when I was working at The Last Pickle. At the time, I was spending a lot of time helping teams identify the root cause of performance issues and needed a way of benchmarking. I found cassandra-stress to be difficult to use and configure, so I ended up writing my own tool that worked in a manner that I found to be more useful. If you’re looking for a tool to assist you in benchmarking Cassandra, and you’re looking to get started quickly, this might be the right tool for you.

cassandra benchmarking
Read more

Back to Consulting!

1 min read

Saying “it’s been a while since I wrote anything here” would be an understatement, but I’m back, with a lot to talk about in the upcoming months.

First off - if you’re not aware, I continued writing, but on The Last Pickle blog. There are quite a few posts there; here are the most interesting ones:

Now the fun part - I’ve spent the last 3 years at Apple, then Netflix, neither of which gave me much time to continue my writing. As of this month, I’m officially no longer at Netflix and have started Rustyrazorblade Consulting!

cassandra consulting
Read more

I Am Still Writing!

1 min read

If you were to take a look at my blog, you’d think I’d flipped a table and left the tech industry. Not the case at all. I’m still writing, but less frequently, and on the TLP blog. I intend to start writing here again, but the material will likely focus around topics other than Cassandra, since I’m already writing about it elsewhere. Here are the posts I’ve authored in the last 6 months or so:

cassandra
Read more

Instaclustr Now Supporting Apache Cassandra 3.7 as LTS

2 min read

Instaclustr announced on the Apache Cassandra user list that they are making their supported branch of the Cassandra 3.7 Tick Tock release publicly available (see GitHub repo). Bug fixes that go into 3.8, 3.9, etc. will be backported to the Instaclustr LTS. You can read the blog post about the decision.

Some people I’ve talked to are concerned about having different commercial entities doing long term supported releases, and this concern is understandable. The obvious preference is for the project maintainers to handle this and make an official LTS available. The big concern here is that third party LTS could fracture the project in the long term.

cassandra
Read more

Working Relationally With Cassandra

6 min read

I’ve spent the last 4 years working in the big data world with Cassandra because it’s the only practical solution if you have a requirement to scale out, uptime is a priority, and you need predictable performance. I’ve heard different ways of describing where Cassandra fits in your architecture, but I think the best way to think of it is close to your customer. Think of the servers your mobile apps communicate with or what holds your product inventory.

spark cassandra sql
Read more

Cassandra Dataset Manager Preview 1 Released

2 min read

One of the problems of learning a new database is getting used to a new way of data modeling. PostgreSQL looks different from Redis, which is different from a graph, and is different from Cassandra.

Cassandra Dataset Manager aims to reduce the time spent in a frustrating trial-and-error process trying to learn proper data modeling techniques for Apache Cassandra and DataStax Enterprise by providing curated data models which have been designed by professionals with years of experience. Think of it as a package manager for Cassandra data models and sample data.

cassandra cdm
Read more

Cassandra Dataset Manager Video Preview

1 min read

I posted a short preview showing off some of the work I’ve been doing recently on Cassandra Dataset Manager, a tool to help new Cassandra users learn how to create proper data models.

There’s documentation, but it’s still under heavy development.

cassandra
Read more

Cassandra 3.3 Released

1 min read

Apache Cassandra 3.3 was released last week. As per the Tick Tock release schedule, this release is focused on bug fixes and no new features were introduced. For practical purposes, consider this a bug fix release to Cassandra 3.2. All told there were almost 50 bugs fixed in this release. Many of the bugs fixed in this version also applied to Cassandra 3.0.3, which was also released last week.

With any Cassandra release, it’s a good idea to read the Changelog and News before upgrading.

cassandra
Read more

Cassandra Secondary Index Preview #1

7 min read

If you’ve looked into using Cassandra at all, you probably have heard plenty of warnings about its secondary indexes. If you’ve come from a relational background, you may have been surprised when you were told to create multiple tables (materialized views) instead of relying on indexes. This is because Cassandra is a distributed database, and the impact of doing a query that hits your entire cluster is that you lose your linear scalability. If you’re capped at 25K queries per second per server, it doesn’t matter if you have one or a thousand servers, you’re still only able to handle 25K queries per second, total.
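The arithmetic above can be sketched in a few lines of Python. This is a simplified model, not a benchmark: it assumes every index query touches every server while a partition-key query touches exactly one, and it ignores replication. The function names are hypothetical, and the 25,000 figure is the illustrative cap from the paragraph above.

```python
def cluster_qps_fanout(per_server_qps: int, servers: int) -> int:
    """Cluster-wide throughput when every query must hit every server.

    Each query consumes capacity on all servers, so adding servers adds
    no headroom: the cluster is capped at the per-server limit.
    """
    if servers < 1:
        raise ValueError("servers must be >= 1")
    return per_server_qps


def cluster_qps_single_partition(per_server_qps: int, servers: int) -> int:
    """Cluster-wide throughput when each query is served by one server.

    Simplifying assumption: load is spread evenly and replication is ignored.
    """
    if servers < 1:
        raise ValueError("servers must be >= 1")
    return per_server_qps * servers


if __name__ == "__main__":
    for servers in (1, 10, 1000):
        print(
            servers,
            cluster_qps_fanout(25_000, servers),
            cluster_qps_single_partition(25_000, servers),
        )
```

Under these assumptions the fan-out number stays flat no matter how many servers are added, while the single-partition number grows with the cluster.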

cassandra open source
Read more

Async Python and Cassandra with Gevent

6 min read

Introduction

Building a web app relying on database calls with CPython (the standard Python distribution) is pretty easy, but can suffer from performance problems. Python itself isn’t particularly fast, and in 2.x, its concurrency story is especially weak.

For starters, there’s the dreaded GIL. The GIL prevents us from taking advantage of multi-core systems, so even if we try to use threads we’re missing out on their main performance benefit, which is parallel computation.

python cassandra gevent
Read more

Cassandra 3.2 Overview

2 min read

The 3.0 release of Apache Cassandra marked an important milestone. One of the biggest updates was CASSANDRA-8099, the JIRA to modernize the storage engine. It was also the first release in the new Tick Tock cycle, which lands a new release of Cassandra every month. Even .x numbers (such as 3.2) are feature releases, and odd .x numbers (such as 3.1) are bug fix releases. Cassandra 3.2, released about a week ago, is the first feature release following 3.0. This post will briefly cover the changes.
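As a small illustration of the even/odd rule described above, here is a minimal sketch that classifies a version string. The function name is hypothetical, and restricting it to the 3.x line is an assumption made for this example; it only encodes the rule as stated in the paragraph.

```python
def tick_tock_release_type(version: str) -> str:
    """Classify a 3.x Tick Tock version by its minor number.

    Even minor numbers (3.0, 3.2, ...) are treated as feature releases,
    odd minor numbers (3.1, 3.3, ...) as bug fix releases.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if major != 3:
        # Assumption for this sketch: only the 3.x line is modeled.
        raise ValueError("this sketch only models 3.x versions")
    return "feature" if minor % 2 == 0 else "bugfix"


if __name__ == "__main__":
    for v in ("3.0", "3.1", "3.2", "3.3"):
        print(v, tick_tock_release_type(v))
```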

cassandra
Read more

KillrAnswers Status Update, and Introducing Frank Dux

2 min read

In a previous post, I introduced a new project, KillrAnswers. I had originally planned on writing KillrAnswers using Rust, leveraging the Cap’n Proto library for RPC and object serialization.

I’ve had some time to think about this, and decided to switch back to Python. I also started my own RPC project, FrankDux, based on ZeroMQ and MessagePack for object serialization instead of Cap’n Proto.

Let’s get the obvious question out of the way - why not use Rust?

killranswers python cassandra
Read more

RAMP Made Easy

7 min read

Introduction

In this post I’ll introduce RAMP, a family of algorithms for performing atomic reads across partitions when working with distributed databases. The original paper, Scalable Atomic Visibility with RAMP Transactions, was written by Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein and Ion Stoica, of UC Berkeley and University of Sydney. Peter has graciously reviewed this blog post to ensure its accuracy. As part of the overview, I’ll explain the RAMP-Fast algorithm, the first of 3 algorithms covered in the paper. RAMP-Small and RAMP-Hybrid will be covered in follow up posts. First, let’s take a look at a few properties of database systems.

db cassandra algorithms
Read more

Introducing KillrAnswers

3 min read

The last few months have been a nonstop whirlwind of traveling and speaking. I’ve been very fortunate to have spoken at Strata New York, given a couple of sessions at the Cassandra Summit, and even had a few minutes on stage for the Cassandra Summit keynote (I’m at minute 22 with Luke Tillman). When I have time, I end up hacking on random projects. For example, a couple of months ago I was working on a recommendation engine for KillrVideo. I also end up playing with bleeding edge builds of Cassandra and Spark.

spark cassandra python
Read more

Migrating from MySQL to Cassandra Using Spark

15 min read

MySQL is a popular choice for new projects. It’s a flexible database that’s easy to set up and start querying. There’s loads of documentation, examples and frameworks it works with, such as WordPress, Pandas, Ruby on Rails, and Django.

From the above paragraph it reads like a pretty fantastic database, and at small scale it can be great. The problem arises when you need to scale past a single server or have high availability needs. MySQL’s solution to both of these needs is replication. Replication is OK at handling read-heavy workloads in a single datacenter, but it falls on its face under heavy writes or if you need multiple datacenters. Fortunately, Cassandra excels at scalability and high availability. It’s a common story for people to migrate from a relational database to Cassandra for one or both of these reasons. (For further reading on choosing Cassandra even with small datasets, read Matt Kennedy’s Little Big Data article.)

mysql cassandra python
Read more

Cassandra + PySpark DataFrames Revisited

4 min read

A little while back I wrote a post on working with DataFrames from PySpark, using Cassandra as a data source. DataFrames are, in my opinion, a fantastic, flexible API that makes Spark roughly 14 orders of magnitude nicer to work with than RDDs. When I wrote the original blog post, the only way to work with DataFrames from PySpark was to get an RDD and call toDF().

Sounds freaking amazing - what’s the problem?

spark cassandra python
Read more

Joining DataFrames with Pandas

3 min read

In this post I’ll walk through the process of reading in various plain text database files using Pandas, and then joining together the different DataFrames. All my work was done through an IPython notebook.

I decided to mess around with the labor statistics database that’s up on Amazon. My end goal was to save all the relevant information into Cassandra for future analysis with PySpark. If the files were bigger, I’d do all the initial loading with PySpark, but they’re pretty small and Pandas has a lot of functionality that’s still missing on the Spark side.

python dataframes pandas
Read more

You're Already Eventually Consistent

7 min read

People new to Apache Cassandra are often concerned about the phrase “eventual consistency.” It’s one of those things that seems so foreign, especially if you’re coming from a relational database. When I am with my RDBMS I get wrapped in the sweet cocoon of ACID transactions!

Is the entire system really safe though? Are we perfectly ACID throughout our entire application? Probably not. Let’s see how it breaks down and where the tradeoffs are.

cassandra rdbms
Read more

Spark Streaming With Python and Kafka

5 min read

Last week I wrote about using PySpark with Cassandra, showing how we can take tables out of Cassandra and easily apply arbitrary filters using DataFrames. This is great if you want to do exploratory work or operate on large datasets. What if you’re interested in ingesting lots of data and getting near real time feedback into your application? Enter Spark Streaming.

Spark Streaming is the process of ingesting and operating on data in microbatches, which are generated repeatedly on a fixed window of time. You can visualize it like this:
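As a rough stand-in for the idea, here is a minimal sketch in plain Python. It does not use the Spark Streaming API; the `microbatches` function, the sample events, and the 2 second window are all assumptions made up for illustration. It only shows how events with timestamps fall into fixed-width windows, each of which would be processed as one batch.

```python
from collections import defaultdict


def microbatches(events, window_seconds):
    """Group (timestamp, payload) events into fixed-width time windows.

    Hypothetical helper for illustration only; not part of Spark.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Every event in the same window shares the same window start time.
        window_start = int(ts // window_seconds) * window_seconds
        batches[window_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]


# Made-up events: (seconds since start, payload)
events = [(0.5, "a"), (1.2, "b"), (1.9, "c"), (4.1, "d")]
batches = microbatches(events, window_seconds=2)

for start, batch in batches:
    print(start, batch)
```

With these inputs the first three events land in the window starting at 0 and the last one in the window starting at 4. The window from 2 to 4 received no events, so this sketch simply produces no batch for it.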

spark pyspark cassandra
Read more

On The Bleeding Edge - PySpark, DataFrames, and Cassandra

4 min read

A few months ago I wrote a post on Getting Started with Cassandra and Spark.

I’ve worked with Pandas for some small personal projects and found it very useful. The key feature is the data frame, which comes from R. Data Frames are new in Spark 1.3 and were covered in this blog post. Till now I’ve had to write Scala in order to use Spark. Since I don’t know Scala very well, I’ve spent a lot of time hunting for Scala libraries when I could have recalled the equivalent Python library in less than a second (JSON being an example).

spark pyspark cassandra
Read more

Introduction to Spark & Cassandra

6 min read

I’ve been messing with Apache Spark quite a bit lately. If you aren’t familiar, Spark is a general purpose engine for large scale data processing. Initially it comes across as simply a replacement for Hadoop, but that would be selling it short. Big time. In addition to bulk processing (goodbye MapReduce!), Spark includes:

  • SQL engine
  • Stream processing via Kafka, Flume, ZeroMQ
  • Machine Learning
  • Graph Processing

Sounds awesome, right? That’s because it is, babaganoush. The next question is where do we store our data? Spark works with a number of projects, but my database of choice these days is Apache Cassandra. Easy scale out and always up. It’s approximately this epic:

spark cassandra tutorial
Read more

Diagnosing Problems in Production Webinar Posted

1 min read

The webinar from Nov 18, Diagnosing Problems in Production, has been posted to YouTube. I’ve embedded it at the bottom of this post.

The webinar is an extended version of the talk I gave at the Cassandra Summit with Blake Eggleston, which I recapped in my blog as well. I had almost double the time to talk in the webinar and so I was able to go into more detail.

cassandra devops
Read more

Getting Started With Pandas and HDF5

2 min read

Yesterday I was pulling down some stock data from Yahoo, with the goal of building out a machine learning training set using Spark and Cassandra. If you haven’t tried Cassandra yet, it’s a database built for high availability and linear scalability. I’ve got an intro talk up here. Spark is another Apache project that kicks Cassandra into overdrive by providing a framework for batch analytics, streaming, and machine learning. On the way is support for graph operations, which makes me giddy.

python pandas hdf5
Read more

Cassandra Summit Recap: Diagnosing Problems in Production

10 min read

Introduction

Last week at the Cassandra Summit I gave a talk with Blake Eggleston on diagnosing performance problems in production. We spoke to about 300 people for about 25 minutes followed by a healthy Q&A session. I’ve expanded on our presentation to include a few extra tools, screenshots, and more clarity on our talking points.

There’s finally a lot of material available for someone looking to get started with Cassandra. There are several introductory videos on YouTube by both me and Patrick McFadin as well as videos on time series data modeling. I’ve posted videos for my own project, cqlengine (intro & advanced), and plenty more on the PlanetCassandra channel. There’s also a boatload of getting started material on PlanetCassandra written by Rebecca Mills.

cassandra devops
Read more

CQLEngine now using the Python Native Driver

2 min read

I’m happy to announce that cqlengine is now using the Python Native Driver. For the most part, this should be a trivial upgrade. See the notes below on upgrading.

The Good News

  • Significantly less code to maintain in cqlengine itself. We no longer need to maintain connection pools or deal with failover, dead servers, server discovery, or server removal.
  • Native driver multiplexes queries over each socket, so fewer sockets stay open.
  • Notifications can be sent back to the client from the server. An example of this is a schema modification or when a new server is added.
  • You can now use the policies for load balancing and failover. See the policies api of the native driver for more information.

Upgrading

If you’re using an earlier version of cqlengine, there are a few caveats to upgrading.

cqlengine cassandra native-driver
Read more

No Downtime Database Migrations

4 min read

Introduction

Back at my last job, we successfully migrated from MongoDB to Cassandra without any downtime. We did two webinars with DataStax at the time (I am now a DataStax employee). Our first webinar was a general overview of the migration. In the second, we covered some of the lessons we learned after being in production with Cassandra for a while. We touched on our migration process, but didn’t get deep into the details. This post will discuss the strategy, its goals, and what we learned along the way. The strategy applies to any database migration, and is not scoped only to moving between databases either.

nosql cassandra databases
Read more

Cassandra FAQ: Can I start with a Single Node?

3 min read

A frequently asked question on the mailing list by developers new to Cassandra is whether it’s possible to start with a single node and scale up as their needs grow. This seems to come most often from people familiar with MySQL, Mongo, or another database which uses replication to scale reads.

The short answer to this question is yes, you can absolutely run a one node cluster. However, it’s important to understand the caveats of doing so. Cassandra was built with the intention of running in a cluster. This means that several defaults that are reasonable for a cluster either aren’t practical or don’t apply with a single node.

cassandra
Read more

What's new in cqlengine 0.7

3 min read

Recently we released version 0.7 of cqlengine, the Python object mapper for CQL3. We’ve been steadily moving towards full support of all of CQL3 for both queries and for table configuration. This post will outline the new features and provide examples on how to use them.

Counters

With counter support finally included it’s now possible to create and use tables with counter columns. They are exposed to the Python application as simple integers, and changes to their values will be sent as deltas to Cassandra. Let’s take a look at an example. I’ll assume you already have Cassandra running locally.
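To make the "sent as deltas" idea concrete, here is a minimal sketch in plain Python. It is not cqlengine's implementation: the `CounterField` class and its methods are hypothetical names invented for illustration. It only shows how a field that looks like an ordinary integer can remember the value it was loaded with, so that a save emits a relative update rather than an absolute value.

```python
class CounterField:
    """Hypothetical stand-in for a counter column; not the cqlengine API."""

    def __init__(self, loaded_value=0):
        self._loaded = loaded_value  # value as last read from the database
        self.value = loaded_value    # value the application works with

    def pending_delta(self):
        """Difference between the current value and the loaded value."""
        return self.value - self._loaded

    def to_cql(self, column):
        """Render the change as a relative CQL counter assignment."""
        delta = self.pending_delta()
        sign = "+" if delta >= 0 else "-"
        return f"{column} = {column} {sign} {abs(delta)}"


hits = CounterField(loaded_value=10)
hits.value += 3  # the application just does integer arithmetic
assignment = hits.to_cql("hits")
print(assignment)
```

Because only the difference is sent, two clients incrementing the same counter at the same time do not overwrite each other's changes.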

cassandra cql
Read more

Cassandra, CQL3, and Time Series Data with timeuuid

3 min read

Cassandra is a BigTable inspired database created at Facebook. It was open sourced several years ago and is now an Apache project.

In Cassandra, a row can be very wide and is identified by a key. Think of it as more like a giant array. The data is stored on disk sorted by the key you pick, meaning if you pick the right sort option and key you can have some really fast queries. Here we’ll go over a time series.
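As a small illustration of why a timeuuid suits time series data, here is a sketch using only Python's standard library. A timeuuid is a version 1 UUID, which embeds a timestamp, so the value is both unique and orderable by time. The helper name `timeuuid_to_datetime` and the `event_ids` list are assumptions for the example, and sorting by the embedded timestamp in Python only approximates how Cassandra orders timeuuid columns.

```python
import uuid
from datetime import datetime, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals since
# 1582-10-15; this constant is the offset to the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u):
    """Extract the embedded timestamp from a version 1 UUID."""
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)


# Each event gets a unique, time-based identifier.
event_ids = [uuid.uuid1() for _ in range(3)]

# Ordering by the embedded timestamp gives chronological order.
ordered = sorted(event_ids, key=lambda u: u.time)

for u in ordered:
    print(u, timeuuid_to_datetime(u).isoformat())
```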

cassandra CQL python
Read more

Setting up RAID0 in Ubuntu 12.04 in AWS High I/O

4 min read

Amazon announced high I/O instances today. This is huge for anyone with a database larger than available memory, as it’s been a complete nightmare dealing with EBS up till now. Now your Cassandra, MongoDB, MySQL, or whatever you’re using should be able to perform well without needing to keep your entire dataset in memory.

With each instance you get 2x1TB of disk. In this tutorial I’ll be setting it up as a RAID0 to get a single 2TB disk which should deliver excellent performance.

amazon cassandra mongodb
Read more

Analyzing Cassandra Performance with Flame Graphs

One of the challenges of running large scale distributed systems is being able to pinpoint problems. It’s all too common to blame a random component (usually a database) whenever there’s a hiccup, even when there’s no evidence to support the claim. We’ve already discussed the importance of monitoring tools, graphing and alerting metrics, and using distributed tracing systems like Zipkin to correctly identify the source of a problem in a complex system.

cassandra performance tuning flame graphs
Read more

Apache Cassandra Performance Tuning - Compression with Mixed Workloads

This is our third post in our series on performance tuning with Apache Cassandra. In our first post, we discussed how we can use Flame Graphs to visually diagnose performance problems. In our second post, we discussed JVM tuning, and how the different JVM settings can have an effect on different workloads.

In this post, we’ll dig into a table level setting which is usually overlooked: compression. Compression options can be specified when creating or altering a table, and compression defaults to enabled if not specified. The default is great when working with write heavy workloads, but can become a problem on read heavy and mixed workloads.

cassandra compression performance
Read more

Cassandra Time Series Data Modeling For Massive Scale

One of the big challenges people face when starting out working with Cassandra and time series data is understanding how their write workload will affect the cluster. Writing too quickly to a single partition can create hot spots that limit your ability to scale out. Partitions that get too large can lead to issues with repair, streaming, and read performance. Reading from the middle of a large partition carries a lot of overhead, and results in increased GC pressure. Cassandra 4.0 should improve the performance of large partitions, but it won’t fully solve the other issues I’ve already mentioned. For the foreseeable future, we will need to consider their performance impact and plan for them accordingly.

cassandra data modeling time series
Read more

TWCS part 1 - how does it work and when should you use it ?

In this post we’ll explore a new compaction strategy available in Apache Cassandra. We’ll dig into its use cases and limitations, and share our experiences of using it with various production clusters.

Time Window Compaction Strategy : how does it work and when should you use it ?

Cassandra uses a Log Structured Merge Tree engine, which allows high write throughput by flushing immutable chunks of data, in the form of SSTables, to disk and deferring consistency to the read phase. Over time, more and more SSTables are written to disk, resulting in a partition having chunks in multiple SSTables, which slows down reads. To limit fragmentation of data, we use a process called compaction to merge SSTables together. Cassandra offers several compaction strategies, each designed for different workloads and data models.
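The merge step can be pictured with a toy sketch in Python. This is a deliberately simplified model, not how Cassandra implements compaction: each "SSTable" is assumed to be a plain dict mapping a key to a (write timestamp, value) pair, and the `compact` function and sample data are invented for illustration. Real compaction also has to deal with tombstones, TTLs, and per-cell reconciliation.

```python
def compact(sstables):
    """Merge SSTable-like dicts, keeping the newest write for each key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            # The write with the highest timestamp wins.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # SSTables are sorted by key, so return the merged result sorted too.
    return dict(sorted(merged.items()))


# Two flushed "SSTables" holding overlapping keys (made-up data).
older = {"sensor-1": (100, "20.1C"), "sensor-2": (100, "18.4C")}
newer = {"sensor-1": (200, "21.3C"), "sensor-3": (200, "19.9C")}

compacted = compact([older, newer])
print(compacted)
```

Before the merge, a read for sensor-1 has to consult both tables and reconcile the two versions. After it, a single table holds one copy of each key.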

cassandra compaction twcs
Read more

TWCS Part 2 - Using before Cassandra 3.0

In our first post about TimeWindowCompactionStrategy, Alex Dejanovski discussed use cases and the reasons for its introduction in 3.0.8 as a replacement for DateTieredCompactionStrategy. In our experience switching production environments storing time series data to TWCS, we have seen the performance of many production systems improve dramatically.

The examples Alex gives for making use of TWCS work great for recent versions of Cassandra. However, a significant number of users are still using 2.0, 2.1, and 2.2. If you’re in this group, you can still use TWCS, but it’ll require a little extra work. Let’s take a look at how to achieve this.

cassandra twcs operations
Read more

Understanding the Nuance of Compaction in Apache Cassandra

Compaction in Apache Cassandra isn’t usually the first (or second) topic that gets discussed when it’s time to start optimizing your system. Most of the time we focus on data modeling and query patterns. An incorrect data model can turn a single query into hundreds of queries, resulting in increased latency, decreased throughput, and missed SLAs. If you’re using spinning disks, the problem is magnified by time-consuming disk seeks.

cassandra
Read more