Python

Async Python and Cassandra with Gevent

6 min read

Introduction

Building a web app that relies on database calls with CPython (the standard Python distribution) is pretty easy, but it can suffer from performance problems. Python itself isn’t particularly fast, and in 2.x, its concurrency story is especially weak.

For starters, there’s the dreaded GIL. The GIL prevents us from taking advantage of multi-core systems, so even if we try to use threads we’re missing out on their main performance benefit, which is parallel computation.
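Here’s a minimal sketch of that effect, using only the standard library. The function names (`count_primes`, `run_sequential`, `run_threaded`) are made up for illustration. On a standard CPython build, the threaded version of this CPU-bound work usually takes about as long as the sequential one, because only one thread executes Python bytecode at a time. Exact timings will vary by machine and interpreter build.

```python
import threading
import time


def count_primes(limit):
    """CPU-bound work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_sequential(limits):
    """Do each chunk of work one after another on the main thread."""
    return [count_primes(limit) for limit in limits]


def run_threaded(limits):
    """Do each chunk of work in its own thread and collect the results."""
    results = [None] * len(limits)

    def worker(index, limit):
        results[index] = count_primes(limit)

    threads = [threading.Thread(target=worker, args=(i, limit))
               for i, limit in enumerate(limits)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    limits = [20000] * 4
    for label, func in (("sequential", run_sequential),
                        ("threaded", run_threaded)):
        start = time.time()
        func(limits)
        print("%s: %.2fs" % (label, time.time() - start))
```

Both versions return the same answers; the point is only that adding threads doesn’t buy parallel computation for this kind of work.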

python cassandra gevent
Read more

FrankDux RPC Preview #1

3 min read

In my previous post, I briefly mentioned FrankDux, a new project I’m working on. FrankDux is a framework for quickly building RPC microservices in Python. This is a preview of its functionality and is subject to change.

A goal of FrankDux is to provide a means of building stateless microservices that’s as easy as working with Flask or Bottle, but with the conveniences of Cap’n Proto, of which I’m a huge fan. Here’s the classic Hello World example, using Bottle:

python rpc frankdux
Read more

KillrAnswers Status Update, and Introducing Frank Dux

2 min read

In a previous post, I introduced a new project, KillrAnswers. I had originally planned on writing KillrAnswers using Rust, leveraging the Cap’n Proto library for RPC and object serialization.

I’ve had some time to think about this, and decided to switch back to Python. I also started my own RPC project, FrankDux, based on ZeroMQ and MessagePack for object serialization instead of Cap’n Proto.

Let’s get the obvious question out of the way - why not use Rust?

killranswers python cassandra
Read more

Introducing KillrAnswers

3 min read

The last few months have been a nonstop whirlwind of traveling and speaking. I’ve been very fortunate to have spoken at Strata New York, given a couple of sessions at the Cassandra Summit, and even had a few minutes on stage for the Cassandra Summit keynote (I’m at minute 22 with Luke Tillman). When I have time, I end up hacking on random projects. For example, a couple of months ago I was working on a recommendation engine for KillrVideo. I also end up playing with bleeding-edge builds of Cassandra and Spark.

spark cassandra python
Read more

Migrating from MySQL to Cassandra Using Spark

15 min read

MySQL is a popular choice for new projects. It’s a flexible database that’s easy to set up and start querying. There are loads of documentation, examples, and frameworks it works with, such as WordPress, Pandas, Ruby on Rails, and Django.

From the above paragraph it reads like a pretty fantastic database, and at small scale it can be great. The problem arises when you need to scale past a single server or have high availability needs. MySQL’s solution to both of these needs is replication. Replication is OK at handling read-heavy workloads in a single datacenter, but it falls on its face under heavy writes or if you need multiple datacenters. Fortunately, Cassandra excels at scalability and high availability. It’s a common story for people to migrate from a relational database to Cassandra for one or both of these reasons. (For further reading on choosing Cassandra even with small datasets, read Matt Kennedy’s Little Big Data article.)

mysql cassandra python
Read more

Cassandra + PySpark DataFrames revisited

4 min read

A little while back I wrote a post on working with DataFrames from PySpark, using Cassandra as a data source. DataFrames are, in my opinion, a fantastic, flexible API that makes Spark roughly 14 orders of magnitude nicer to work with than RDDs. When I wrote the original blog post, the only way to work with DataFrames from PySpark was to get an RDD and call toDF().

Sounds freaking amazing - what’s the problem?

spark cassandra python
Read more

Joining DataFrames with Pandas

3 min read

In this post I’ll walk through the process of reading in various plain text database files using Pandas, and then joining together the different DataFrames. All my work was done through an IPython notebook.

I decided to mess around with the labor statistics database that’s up on Amazon. My end goal was to save all the relevant information into Cassandra for future analysis with PySpark. If the files were bigger, I’d do all the initial loading with PySpark, but they’re pretty small and Pandas has a lot of functionality that’s still missing on the Spark side.

python dataframes pandas
Read more

Spark Streaming With Python and Kafka

5 min read

Last week I wrote about using PySpark with Cassandra, showing how we can take tables out of Cassandra and easily apply arbitrary filters using DataFrames. This is great if you want to do exploratory work or operate on large datasets. What if you’re interested in ingesting lots of data and getting near real time feedback into your application? Enter Spark Streaming.

Spark Streaming is the process of ingesting and operating on data in microbatches, which are generated repeatedly on a fixed window of time. You can visualize it like this:
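One way to picture the batching in code: the sketch below groups timestamped events into fixed, non-overlapping windows. This is plain Python, not the Spark Streaming API, and the `microbatches` function and the sample events are made up for illustration. Spark does the equivalent grouping for you as data arrives.

```python
from collections import defaultdict


def microbatches(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Returns a list of (window_start, [values]) pairs ordered by window start.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event in the same window maps to the same start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        batches[window_start].append(value)
    return sorted(batches.items())


events = [(0.5, "a"), (1.2, "b"), (2.3, "c"), (4.1, "d")]
for window_start, values in microbatches(events, window_seconds=2):
    print(window_start, values)
```

Each batch is then processed as a unit, which is why the feedback is near real time rather than truly per-event.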

spark pyspark cassandra
Read more

Hangout Announcement - Python Performance Profiling

1 min read

Just wanted to let everyone know I’m going to be doing a Google Hangout on Air on Thursday, 2 PM PT / 5 PM ET, on Python Performance Profiling. I’m going to be covering several tools and a variety of ways of understanding your applications. You can RSVP on the event page.

I’ll be answering questions along the way, so be sure to have yours ready and upvote the ones you find useful!

hangouts python
Read more

Getting Started With Pandas and HDF5

2 min read

Yesterday I was pulling down some stock data from Yahoo, with the goal of building out a machine learning training set using Spark and Cassandra. If you haven’t tried Cassandra yet, it’s a database built for high availability and linear scalability. I’ve got an intro talk up here. Spark is another Apache project that kicks Cassandra into overdrive by providing a framework for batch analytics, streaming, and machine learning. On the way is support for graph operations, which makes me giddy.

python pandas hdf5
Read more

Say Hello to Meatbot

3 min read

What is Meatbot?

Meatbot is a HipChat bot for managing status updates for our growing team of Evangelists at DataStax. It’s built in Python 2.7, utilizing the Will library. The status updates are stored in Cassandra using cqlengine. Yep, it’s up on GitHub.

There are a few simple commands. First, you tell Meatbot about each project you work on.

Once you’ve got your projects, you can list them with lsproject or delete them with rmproject.

python hipchat
Read more

Python for Programmers

7 min read

When I started learning Python, there were a few things I wish I had known about. It took a while to learn them all. This is my attempt to compile the highlights into a single post. This post is targeted towards experienced programmers just getting started with Python who want to skip the first few months of researching the Python equivalents of tools they are already used to. The sections on package management and standard tools will be helpful to beginners as well.

python
Read more

The Myth of Schema-less

7 min read

I have grown increasingly frustrated with the world as people have become more and more convinced that “schema-less” is actually a feature to be proud of (or even exists). For over ten years I’ve worked with close to a dozen different databases in production and have not once seen “schemaless” truly manifest. What’s extremely frustrating is seeing this from vendors, who should really know better. At best, we should be using the description “provides little to no help in enforcing a schema” or “you’re on your own, good luck.”

databases nosql python
Read more

CQLEngine now using the Python Native Driver

2 min read

I’m happy to announce that cqlengine is now using the Python Native Driver. For the most part, this should be a trivial upgrade. See the notes below on upgrading.

The Good News

  • Significantly less code to maintain in cqlengine itself. We no longer need to maintain connection pools or deal with failover, dead servers, server discovery, and server removal.
  • The native driver multiplexes queries over each socket, so fewer sockets stay open.
  • Notifications can be sent back to the client from the server. An example of this is a schema modification or when a new server is added.
  • You can now use the policies for load balancing and failover. See the policies api of the native driver for more information.

Upgrading

If you’re using an earlier version of cqlengine, there are a few caveats to upgrading.

cqlengine cassandra native-driver
Read more

Creating AWS Cloudwatch Alarms Using Boto

2 min read

In this post I’ll walk through the process of setting up CloudWatch alarms programmatically in Python through Boto. We’ll be setting up a single alarm for the StatusCheckFailed metric, but you can configure other alarms as well. Check the AWS alarms console for the full list.

This post assumes you already have an instance, instance_id, AWS, and your boto config set up. Also assumed is that you’ve created an SNS Topic already. My SNS Topic is called “Server_Down”, and is simply an email that gets sent to me when a server fails a status check.

aws boto python
Read more

I've Moved to Pelican

1 min read

As of Sunday, August 25, rustyrazorblade is now powered by Pelican. So far, no complaints.

It was easy to get started. I installed it through pip into a virtualenv and was up and running in just a few minutes. It was a significantly better experience than my attempt at using Octopress, which mixed theming, code, and my content all into one mess of a project.

The new blog is just a folder of content (markdown), my theme, and a Makefile. When I want to publish, I just do make s3_upload and all the content is built and synced to an S3 bucket which serves static content. Building a preview via make html takes less than 3 seconds, which is significantly better than the 5 minutes Octopress took.

blog python
Read more

Cassandra, CQL3, and Time Series Data with timeuuid

3 min read

Cassandra is a BigTable-inspired database created at Facebook. It was open sourced several years ago and is now an Apache project.

In Cassandra, a row can be very wide and is identified by a key. Think of it as more like a giant array. The data is stored on disk sorted by the key you pick, meaning if you pick the right sort option and key you can have some really fast queries. Here we’ll go over a time series.
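To make the “giant sorted array” idea concrete, here’s a toy model in plain Python. It is not CQL and not how Cassandra is implemented. The `WideRow` class is hypothetical, and the standard library’s `uuid.uuid1()` stands in for a timeuuid, since both are time-based (version 1) UUIDs. The sketch orders columns by the UUID’s timestamp only, which is a simplification.

```python
import bisect
import uuid


class WideRow(object):
    """Toy model of one wide row: columns kept sorted by a time-based key."""

    def __init__(self):
        self._keys = []     # UUID timestamps (100ns ticks), ascending
        self._columns = []  # (timeuuid, value) pairs in the same order

    def insert(self, value, timeuuid=None):
        """Store a value under a time-based UUID, keeping columns sorted."""
        if timeuuid is None:
            timeuuid = uuid.uuid1()
        position = bisect.bisect_right(self._keys, timeuuid.time)
        self._keys.insert(position, timeuuid.time)
        self._columns.insert(position, (timeuuid, value))
        return timeuuid

    def slice(self, start, end):
        """Return values whose key falls between start and end, inclusive."""
        lo = bisect.bisect_left(self._keys, start.time)
        hi = bisect.bisect_right(self._keys, end.time)
        return [value for _, value in self._columns[lo:hi]]


row = WideRow()
ids = [row.insert("reading %d" % i) for i in range(5)]
print(row.slice(ids[1], ids[3]))
```

Because the columns are already sorted, a time range query is two binary searches and one contiguous read, which is the property that makes time series a good fit.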

cassandra CQL python
Read more

Installing vim-ipython with MacVim

2 min read

I got really excited at the notion of having IPython built into MacVim (vim-ipython), so over the last few days I’ve spent some time mucking around trying to get this whole thing to work. Unfortunately there’s not a lot of documentation on how to fix the issues that might pop up, so hopefully this will help some people. (Spoiler: the MacVim download is 32-bit, and zeromq is 64-bit.)

First, your prerequisites. I’m assuming you’re using the awesome HomeBrew. If you’re not, you’re on your own for some of these sections.

python vim
Read more

Splitmytab ready for the public!

1 min read

Splitmytab.net is finally ready for the public to check out. Splitmytab is a bill splitting and IOU system for friends. It uses Facebook’s login, so you won’t need to put in anyone’s emails or names, or get people to sign up for an account.

It’ll automatically keep balances of who owes who, so you can keep a running tab with friends and always know who’s buying the next case of beer.

Please note: I’m not a designer, so there are a few rough edges, but what’s there is simple and it works.

coffeescript mysql python
Read more

New Project: Jester

1 min read

I’ve started a new open source project called Jester. Jester is a rules engine for points and badges, also known as Gamification.

Jester is written in Python, using Redis for storage.

I’ve created a tiny Domain Specific Language for defining rules using the pyparsing library.

A couple examples of rules:

create rule on game_play award 5 points
create rule on game_play award badge game_addict when game_play occurs 5 times in 1 day
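To give a feel for what parsing these rules involves, here’s a rough sketch that handles the two examples above. It uses the standard library’s `re` module rather than pyparsing, purely to keep the example dependency-free, so it is not Jester’s actual grammar. The `parse_rule` function and the shape of the returned dict are assumptions for illustration.

```python
import re

# Matches: create rule on <event> award (<n> points | badge <name>)
#          [when <event> occurs <n> times in <n> <unit>]
RULE_PATTERN = re.compile(
    r"^create rule on (?P<event>\w+) award "
    r"(?:(?P<points>\d+) points|badge (?P<badge>\w+))"
    r"(?: when (?P<when_event>\w+) occurs (?P<times>\d+) times"
    r" in (?P<span>\d+) (?P<unit>\w+))?$"
)


def parse_rule(text):
    """Parse one rule string into a dict, or raise ValueError."""
    match = RULE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError("not a valid rule: %r" % text)
    rule = {"event": match.group("event")}
    if match.group("points") is not None:
        rule["award"] = ("points", int(match.group("points")))
    else:
        rule["award"] = ("badge", match.group("badge"))
    # The "when ..." clause is optional.
    if match.group("when_event") is not None:
        rule["condition"] = {
            "event": match.group("when_event"),
            "times": int(match.group("times")),
            "span": int(match.group("span")),
            "unit": match.group("unit"),
        }
    return rule


print(parse_rule("create rule on game_play award 5 points"))
```

A single regex gets unwieldy quickly as the language grows, which is a good reason to reach for a real parsing library like pyparsing.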

This project is in the very early stages and is not yet functional as of this posting. However, I expect to have a rough working version of it up by the end of next week.

jester pyparsing python
Read more

Installing MySQLdb on MacOS Lion

1 min read

I ran into an issue installing the MySQLdb module.

>>> import MySQLdb
/Library/Python/2.7/site-packages/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg/_mysql.py:3: UserWarning: Module _mysql was already imported from /Library/Python/2.7/site-packages/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg/_mysql.pyc, but /Users/jhaddad/Downloads/MySQL-python-1.2.3 is being added to sys.path
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "MySQLdb/__init__.py", line 19, in <module>
    import _mysql
  File "build/bdist.macosx-10.7-intel/egg/_mysql.py", line 7, in <module>
  File "build/bdist.macosx-10.7-intel/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dlopen(/var/root/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
  Referenced from: /var/root/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so
  Reason: image not found

I fixed it by doing the following:

mac mysql python
Read more

Another Attempt At Python

3 min read

I tried Python out a while ago, but stopped trying to learn it after some major frustrations. Maybe I didn’t dig deep enough into it. I found the documentation hard to read, and the module layout seemed a little random at times. For some reason I found executing an external process and getting the results to be a little convoluted. (Since then I’ve learned to use popen(..).communicate())
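For reference, here’s a minimal sketch of that pattern with the standard library’s `subprocess` module. The `run` helper is a made-up name, and the child process is just the current Python interpreter printing a word, so the example works anywhere Python does.

```python
import subprocess
import sys


def run(command):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str instead of bytes
    )
    # communicate() reads both pipes and waits for the process to exit.
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr


code, out, err = run([sys.executable, "-c", "print('hello')"])
print(code, out.strip())
```

Using communicate() instead of reading the pipes by hand avoids deadlocks when the child writes a lot to both stdout and stderr.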

I ended up messing with other languages to try to find one that suits my tastes, like Erlang and D. I read through Seven Languages in Seven Weeks, but didn’t really get a lot out of it. I didn’t fall in love with Ruby at all, and I’m not going to actually use Prolog anywhere, even if I thought it was pretty cool. I never liked Java, and I wasn’t impressed with Scala.

erlang java python
Read more

Python Module Docs

1 min read

I’ve been trying to get into Python in my spare time, since it’s got such a huge volume of modules and looks like I should be ripping it up in no time. But of course, I have my complaints.

Fortunately, I don’t need to write a long blog post; this guy did it for me. It’s kind of alarming that his post is from 2 years ago and the docs are still a major problem.

python
Read more

PyDev Tutorial

1 min read

I found a good tutorial on IBM on doing Python development with Eclipse. It might be a little out of date, but I think only the screens got moved around a little bit. It includes details on how to use ant, which I’ve recently started using with cruise control and PHP, so I’m becoming a fan.

I did run into an issue where I’d get the error “Variable references empty selection: ${project_loc}”, but a quick google brought me to a solution here.

eclipse python
Read more

Installing NumPy on MacOS X Snow Leopard

1 min read

NumPy is a requirement to work with PyTables. This is the second step in the install process, after getting HDF5 set up.

These instructions are based on the ones found on the NumPy site, but I’m summarizing things for my own use later on.

First, you’ll need to install the Fortran compiler for OSX (gFortran). Fortunately there’s a Fortran universal binary installer.

Next, get the NumPy source. You can find it on Sourceforge.

Now, build with gFortran.

fortran mac numpy
Read more

Issues Compiling HDF5 1.8.3 on MacOS X Snow Leopard

1 min read

I’m trying to evaluate PyTables as a replacement for very large Python dictionaries, but I’m having some issues getting HDF5 installed on my Mac (OS X Snow Leopard).

I’ve been getting this error:

configure: error: C compiler cannot create executables

I haven’t been able to figure out what’s wrong yet - anyone have any ideas? I’ve got Xcode Tools installed, and I’ve compiled Apache, PHP, and Memcached without issue (prior to the Snow Leopard update).

hdf5 python
Read more

Mutable class instance variables in Python act like they're static

2 min read

There’s a weird behavior in Python when dealing with mutable types such as dictionaries: when you modify a dictionary defined as a class attribute, you’re actually modifying a single dictionary shared among all instances of the class. This seemed weird to me. You can read the lovely discussion about it, if you want. Or, just follow my code for a demo on how to deal with the issue. I just started Python on Monday night, so please overlook my n00bness.
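A minimal sketch of the behavior and the usual fix. The class names `Shared` and `Separate` are made up for illustration; the fix is simply to create the dictionary in `__init__` so each instance gets its own.

```python
class Shared(object):
    settings = {}  # class attribute: one dict shared by every instance


class Separate(object):
    def __init__(self):
        self.settings = {}  # instance attribute: a fresh dict per instance


a, b = Shared(), Shared()
a.settings["debug"] = True
print(b.settings)  # {'debug': True}

c, d = Separate(), Separate()
c.settings["debug"] = True
print(d.settings)  # {}
```

Note that this only bites when you mutate the shared object in place. Assigning `a.settings = {...}` would create a new instance attribute on `a` and leave the class attribute alone.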

python
Read more

Getting phpsh to work on a mac

1 min read

I had an issue getting phpsh to work on my mac - I kept getting the following error:

Traceback (most recent call last):
  File "./phpsh", line 20, in <module>
    import readline

OK, seems easy enough. So I compiled python with readline support.

./configure --prefix=/usr/local/python --enable-readline

I changed the PATH variable in my .bash_profile to point to the /usr/local/python directory first, and source’d it to get the new PATH settings. I still got the same error.

php python
Read more