Cassandra is a BigTable-inspired database created at Facebook. It was open-sourced several years ago and is now an Apache project.
In Cassandra, a row can be very wide and is identified by a key. Think of it as more like a giant array. The data is stored on disk in sorted order, so if you pick the right key and sort option you can get some really fast queries. Here we’ll walk through a time series example.
A time series is a naturally sorted list, since things happen in order over time. Sensor readings or live chat are good examples. In older versions of Cassandra, you’d use a timestamp as your column name, and the value would be the actual data. This gives you your list of data, sorted in time order. The benefit is that your queries will likely be looking at slices of time, and with the data stored sequentially on disk you get very fast reads, since only one seek is needed (if the data isn’t already in memory).
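To make the slice idea concrete, here is a toy in-memory sketch of one wide row. It is not how Cassandra is implemented; the `WideRow` class and its methods are hypothetical names used only for illustration. The point is that when columns are kept sorted by timestamp, a time-range query becomes one contiguous slice.

```python
import bisect


class WideRow:
    """Toy model of a single wide row: columns kept sorted by timestamp.

    Hypothetical illustration only, not Cassandra's storage engine.
    """

    def __init__(self):
        self._timestamps = []  # column names, always kept sorted
        self._values = []      # column values, parallel to _timestamps

    def insert(self, timestamp, value):
        # Find the sorted position so the row stays ordered by time.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def slice(self, start, end):
        # A time-range query is one contiguous run of columns (inclusive bounds).
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


row = WideRow()
# Insert out of order; the row still ends up sorted by timestamp.
for ts, reading in [(30, "c"), (10, "a"), (20, "b"), (40, "d")]:
    row.insert(ts, reading)

window = row.slice(15, 35)
print(window)
```

Because the matching columns sit next to each other, reading the window never has to jump around, which is the same reason the on-disk layout makes time slices cheap.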
To make it insanely unlikely that two timestamps would ever conflict, the column name would actually be a uuid1, which has an embedded timestamp. DataStax gave a good example of a table definition back in Cassandra 0.8:
[Code listing not preserved: DataStax’s Cassandra 0.8 table definition]
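As a quick sanity check on the "insanely unlikely to conflict" claim, here is a small sketch using Python’s standard `uuid` module. It generates a burst of version 1 UUIDs back to back, which is the worst case for plain timestamps, and counts how many distinct values come out. The burst size of 1000 is an arbitrary choice for the demo.

```python
import uuid

# Generate a burst of version 1 UUIDs as fast as possible. Plain
# millisecond timestamps would collide here; uuid1 values should not.
ids = [uuid.uuid1() for _ in range(1000)]

unique_count = len(set(ids))
all_version_1 = all(u.version == 1 for u in ids)

# Each uuid1 carries its creation time in the .time attribute, counted
# in 100-nanosecond intervals since 1582-10-15.
first_embedded_time = ids[0].time

print(unique_count, all_version_1, first_embedded_time)
```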
As Cassandra has matured, it has gained really nice schema definition options, giving you the choice of some additional structure if you want it. CQL is a SQL-ish language for defining tables, where you specify the column names beforehand. This makes storing our time series data a little challenging, since you can’t possibly know in advance all the timestamps you’ll be using. The upcoming version of the language is CQL3. Here’s a great DataStax blog post on some of the CQL3 features.
In particular, the Cassandra team has introduced two important items. One is the timeuuid type, and the other is the ability to specify compound primary keys with compact storage. Together, these cause the data to be stored sequentially by the timeuuid column, exactly like a really wide row. Starting cqlsh with the -3 option gives us a CQL3 console. Here we define our schema:
[Code listing not preserved: CQL3 schema definition entered in cqlsh]
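For a rough idea of what a schema along these lines can look like, here is a minimal sketch held in a Python string. The table name `sensor_readings` and the columns `sensor_id`, `event_time`, and `reading` are assumptions made up for illustration, not the names from the original listing. The parts that matter are the timeuuid column, the compound primary key, and the compact storage option described above.

```python
# Hypothetical CQL3 schema for a time series table. All table and column
# names are illustrative assumptions.
CREATE_TABLE_CQL = """
CREATE TABLE sensor_readings (
    sensor_id  text,      -- first key part: identifies the wide row
    event_time timeuuid,  -- second key part: sorts entries within the row
    reading    text,      -- the actual data point
    PRIMARY KEY (sensor_id, event_time)
) WITH COMPACT STORAGE;
"""

print(CREATE_TABLE_CQL.strip())
```

The first component of the primary key picks the row, and the second determines the order of entries inside it, so all readings for one sensor end up stored together in time order.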
Here’s a little Python script to put in some example data:
[Code listing not preserved: Python script that inserts example data]
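A script of this kind might look roughly like the sketch below. It assumes a hypothetical `sensor_readings` table with `sensor_id`, `event_time` (timeuuid), and `reading` columns, and a `session` object exposing `execute(query, parameters)` the way the DataStax Python driver’s `Session` does. The function name and the random values are invented for the example.

```python
import random
import uuid

# Hypothetical table and column names; %s placeholders follow the
# parameter style used by the DataStax Python driver.
INSERT_CQL = (
    "INSERT INTO sensor_readings (sensor_id, event_time, reading) "
    "VALUES (%s, %s, %s)"
)


def insert_example_readings(session, sensor_id, count=5):
    """Insert `count` made-up readings, each keyed by a fresh uuid1.

    `session` is assumed to be any object with execute(query, parameters),
    for example a cassandra-driver Session obtained along the lines of:
        from cassandra.cluster import Cluster
        session = Cluster(["127.0.0.1"]).connect("my_keyspace")
    """
    inserted = []
    for _ in range(count):
        event_time = uuid.uuid1()  # version 1 UUID with embedded timestamp
        reading = str(round(random.uniform(15.0, 25.0), 2))  # fake value
        session.execute(INSERT_CQL, (sensor_id, event_time, reading))
        inserted.append(event_time)
    return inserted
```

Calling `insert_example_readings(session, "sensor-1")` would write five rows for one sensor, each with its own uuid1 as the time column.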
And the result (edited to fit on 1 line):
[Output listing not preserved: query results from cqlsh]
Notice that even though I was generating a uuid1, Cassandra shows us a timestamp.
Huge thanks to everyone who has worked on Cassandra to get it to this point. It’s an absolutely amazing piece of software.
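The same conversion can be done by hand, which helps show where that timestamp comes from. The sketch below uses only the standard library; `uuid1_to_datetime` is a hypothetical helper name. A version 1 UUID counts 100-nanosecond intervals since 1582-10-15, so subtracting the offset to the Unix epoch and dividing by ten million gives Unix seconds.

```python
import uuid
from datetime import datetime, timezone

# Number of 100-ns intervals between 1582-10-15 (the UUID epoch)
# and 1970-01-01 (the Unix epoch).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_to_datetime(u):
    """Return the UTC datetime embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    unix_seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


now_id = uuid.uuid1()
print(now_id, "->", uuid1_to_datetime(now_id))
```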