In a previous post, I introduced a new project, KillrAnswers. I had originally planned on writing KillrAnswers using Rust, leveraging the Cap’n Proto library for RPC and object serialization.
I’ve had some time to think about this, and I’ve decided to switch back to Python. I’ve also started my own RPC project, FrankDux, built on ZeroMQ for transport and MessagePack for object serialization in place of Cap’n Proto.
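The post doesn’t show FrankDux’s API, but the pattern underneath it is easy to sketch: a ZeroMQ REQ/REP socket pair shipping MessagePack-encoded dicts back and forth. Everything below (the port, the `add` method, the message shape) is invented for illustration, not taken from FrankDux itself:

```python
import msgpack
import zmq

def serve():
    """Reply to RPC requests over a ZeroMQ REP socket."""
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://*:5555")
    while True:
        # Decode a MessagePack-encoded request off the wire...
        request = msgpack.unpackb(socket.recv(), raw=False)
        result = {"result": request["a"] + request["b"]}
        # ...and ship the response back the same way.
        socket.send(msgpack.packb(result))

def call(a, b):
    """Send one request and block for the reply."""
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://localhost:5555")
    socket.send(msgpack.packb({"method": "add", "a": a, "b": b}))
    return msgpack.unpackb(socket.recv(), raw=False)
```

The appeal of the combination is that both pieces are dead simple: ZeroMQ handles the sockets and queuing, MessagePack handles fast, compact encoding, and there’s no schema compiler in the loop.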
Let’s get the obvious question out of the way: why not use Rust?
Last week I wrote about using PySpark with Cassandra, showing how we can take tables out of Cassandra and easily apply arbitrary filters using DataFrames. This is great if you want to do exploratory work or operate on large datasets. But what if you want to ingest lots of data and get near-real-time feedback in your application? Enter Spark Streaming.
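As a quick recap of that workflow, here’s a minimal sketch of reading a Cassandra table as a DataFrame and filtering it. It assumes the DataStax spark-cassandra-connector is on the classpath; the keyspace, table, and column names are placeholders, not anything from last week’s post:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cassandra-filters")
sqlContext = SQLContext(sc)

# Load a Cassandra table as a DataFrame via the spark-cassandra-connector.
users = (sqlContext.read
         .format("org.apache.spark.sql.cassandra")
         .options(keyspace="tutorial", table="user")
         .load())

# Arbitrary filters are just ordinary DataFrame operations.
users.filter(users.age > 30).show()
```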
Spark Streaming is the process of ingesting and operating on data in microbatches, which are generated repeatedly over a fixed window of time. You can visualize it like this:
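In code, that loop is a `StreamingContext` with a fixed batch interval: every interval, Spark collects whatever arrived and runs your transformations over that microbatch. A minimal sketch follows; the ten-second interval and the socket text source on port 9999 are assumptions for illustration, not details from this post:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="microbatch-demo")
# Each microbatch covers a 10-second window.
ssc = StreamingContext(sc, 10)

# A text stream over a socket stands in for a real ingest source.
lines = ssc.socketTextStream("localhost", 9999)

# Classic word count, recomputed once per microbatch.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```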