MySQL is a popular choice for new projects. It’s a flexible database that’s easy to set up and start querying. There’s loads of documentation and examples, and it works with popular frameworks and tools such as WordPress, Pandas, Ruby on Rails, and Django.
From the above paragraph it reads like a pretty fantastic database, and at small scale it can be great. The problem arises when you need to scale past a single server or have high availability needs. MySQL’s solution to both of these needs is replication. Replication is OK at handling read-heavy workloads in a single datacenter, but it falls on its face under heavy writes or if you need multiple datacenters. Fortunately, Cassandra excels at scalability and high availability. It’s a common story for people to migrate from a relational database to Cassandra for one or both of these reasons. (For further reading on choosing Cassandra even with small datasets, read Matt Kennedy’s Little Big Data article.)
I have grown increasingly frustrated with the world as people have become more and more convinced that “schema-less” is actually a feature to be proud of (or that it even exists). For over ten years I’ve worked with close to a dozen different databases in production and have not once seen “schemaless” truly manifest. What’s extremely frustrating is seeing this from vendors, who should really know better. At best, we should be using the description “provides little to no help in enforcing a schema” or “you’re on your own, good luck.”
Amazon announced high I/O instances today. This is huge for anyone with a database larger than available memory, as dealing with EBS up till now has been a complete nightmare. Now your Cassandra, MongoDB, MySQL, or whatever you’re using should be able to perform well without keeping your entire dataset in memory.
With each instance you get 2x1TB of disk. In this tutorial I’ll be setting it up as a RAID0 to get a single 2TB disk which should deliver excellent performance.
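The exact device names vary by instance, so treat this as a sketch (check yours with lsblk), but building the array with mdadm looks roughly like this:

# device names are assumptions; verify with lsblk or fdisk -l
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.xfs /dev/md0    # or your filesystem of choice
mount /dev/md0 /mnt/data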
I decided to take a look at Drizzle today and was encouraged by what I saw. Here’s my favorite part:
* There is no UNSIGNED (as per the standard).
* There are no spatial data types GEOMETRY, POINT, LINESTRING & POLYGON (go use Postgres).
* No YEAR field type.
* There are no FULLTEXT indexes for the MyISAM storage engine (the only engine FULLTEXT was supported in). Look at Lucene, Sphinx, or Solr.
* No “dual” table.
* The “LOCAL” keyword in “LOAD DATA LOCAL INFILE” is not supported.
Splitmytab.net is finally ready for the public to check out. Splitmytab is a bill splitting and IOU system for friends. It uses Facebook’s login, so you won’t need to put in anyone’s emails or names, or get people to sign up for an account.
It’ll automatically keep balances of who owes who, so you can keep a running tab with friends and always know who’s buying the next case of beer.
Please note: I’m not a designer, so there are a few rough edges, but what’s there is simple and it works.
>>> import MySQLdb
/Library/Python/2.7/site-packages/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg/_mysql.py:3: UserWarning: Module _mysql was already imported from /Library/Python/2.7/site-packages/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg/_mysql.pyc, but /Users/jhaddad/Downloads/MySQL-python-1.2.3 is being added to sys.path
Traceback (most recent call last):
File "", line 1, in
File "MySQLdb/__init__.py", line 19, in
import _mysql
File "build/bdist.macosx-10.7-intel/egg/_mysql.py", line 7, in
File "build/bdist.macosx-10.7-intel/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dlopen(/var/root/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
Referenced from: /var/root/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.7-intel.egg-tmp/_mysql.so
Reason: image not found
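One common fix, assuming MySQL is installed under /usr/local/mysql: point the dynamic linker at the client library, either with an environment variable or a symlink.

# option 1: tell dyld where to look
export DYLD_LIBRARY_PATH=/usr/local/mysql/lib:$DYLD_LIBRARY_PATH
# option 2: symlink the library onto the default search path
sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/lib/libmysqlclient.18.dylib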
I was just working today on a report consisting of a series of queries that takes about a minute to generate. Some of the data is created in a temporary table and queried against multiple times for performance reasons, then ultimately spit out into a CSV file for someone to examine later. I’d also like to be able to return the result set and run further queries against it, which is much faster than querying a view.
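The pattern looks roughly like this (table and column names invented for illustration):

CREATE TEMPORARY TABLE report_base AS
    SELECT customer_id, SUM(total) AS revenue
    FROM orders
    GROUP BY customer_id;

-- query it as many times as needed...
SELECT COUNT(*) FROM report_base WHERE revenue > 1000;

-- ...then spit the whole thing out as CSV
SELECT * FROM report_base
INTO OUTFILE '/tmp/report.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';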
Looking for instances of a particular column in a large schema can be a pain. Fortunately, the information schema makes this pretty easy, as long as your columns follow a consistent naming convention.
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE column_name LIKE '%some_name%';
Now, if we want to wrap this up into an easy to use stored procedure, we can do something like this:
DROP PROCEDURE IF EXISTS find_column;
delimiter //
CREATE PROCEDURE find_column(c VARCHAR(255))
BEGIN
    SET @a = CONCAT("%", c, "%");
    SELECT table_schema, table_name, column_name, column_type
    FROM information_schema.columns
    WHERE column_name LIKE @a;
END
//
delimiter ;
We need CONCAT to build the pattern so the wildcards wrap the value of c; putting "%c%" directly in the LIKE clause would match the literal string "c" instead.
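Calling it is then a one-liner (the column fragment here is just an example):

CALL find_column('user');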
I don’t remember having to do this before, but I compiled MySQL 5.1.34 recently on my dev box (OS X) and saw a warning on a CREATE TABLE statement. It turns out InnoDB was not enabled (or even listed among the available storage engines).
Before:
mysql> show engines;
+------------+---------+-----------------------------------------------------------+--------------+----+------------+
| Engine     | Support | Comment                                                   | Transactions | XA | Savepoints |
+------------+---------+-----------------------------------------------------------+--------------+----+------------+
| CSV        | YES     | CSV storage engine                                        | NO           | NO | NO         |
| MRG_MYISAM | YES     | Collection of identical MyISAM tables                     | NO           | NO | NO         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables | NO           | NO | NO         |
| MyISAM     | DEFAULT | Default engine as of MySQL 3.23 with great performance    | NO           | NO | NO         |
+------------+---------+-----------------------------------------------------------+--------------+----+------------+
4 rows in set (0.00 sec)
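If I recall correctly, the fix for a source build is to compile the engine in explicitly; something along these lines, though double-check the flag against your 5.1 source tree since I’m going from memory:

./configure --with-plugins=innobase
make && make install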
Really interesting read about how to examine what’s stored in memcached.
Peep uses ptrace to freeze a running memcached server, dump the internal key metadata, and return the server to a running state. If you have a good host ejection mechanism in your client, such as in the Twitter libmemcached builds, you won’t even have to change the production server pool. The instance is not restarted, and no data is lost.
I have never, ever seen this before. I don’t even know how the table was created with a different collation. However, I had all my other tables created with the character set utf8, no collation specified.
I had to convert the second table to match the character set. This probably wouldn’t have been a problem if I hadn’t been joining on a character field.
alter table exclusion CONVERT TO CHARACTER SET utf8;
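If you want to audit the rest of the schema for mismatches before they bite you again, the information schema can list every table’s collation (substitute your own schema name):

SELECT table_name, table_collation
FROM information_schema.tables
WHERE table_schema = 'your_db';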
I haven’t been using vim for very long, but I’ve gotten over the initial learning curve of getting used to the different editing modes. With some help from the guys in #vim on irc.freenode.net, I managed to get this gem:
Altering a big table sucks, and to make it worse you have no idea what’s happening or how long it will take. I’d like a progress bar, or some status output, or something that gives me the feeling like my server didn’t die.
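One low-tech trick I’ve resorted to (a workaround of my own, not a real progress bar): the ALTER rebuilds the table into a temporary #sql-* file in the data directory, so you can watch that file grow toward the size of the original table:

watch -n 10 "ls -lh /var/lib/mysql/mydb/ | grep '#sql'"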
A materialized view is a database object that contains the results of a query. The FROM clause of the query can name tables, views, and other materialized views. (from Oracle).
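MySQL doesn’t support them natively, but you can fake one with a summary table that you rebuild on demand. A minimal sketch, with hypothetical names:

CREATE TABLE daily_sales_mv AS
    SELECT sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date;

-- "refresh" it by rebuilding
TRUNCATE daily_sales_mv;
INSERT INTO daily_sales_mv
    SELECT sale_date, SUM(amount)
    FROM sales
    GROUP BY sale_date;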
I was cruising the MySQL Forge Worklog when I came across the idea of Time Delayed Replication. I had never considered the benefits of deliberately keeping a slave server behind a master.
Kristian Koehntopp gives a good example:
TDS: Time delayed SQL_THREAD (Have a replication slave that is always
lagging 30 minutes behind).
Currently, replication is a rolling recovery: To set up replication you
restore from a full dump with a binlog position. You then continuously
download binlog and roll forward. In case of a master crash a slave is a
readily recovered instance (as opposed to a backup, which still has to be
restored).
I’m not sure how I didn’t see this earlier, but it looks like MySQL 5.1 is coming with a load balancer for replicated servers. I’m absolutely pumped about this - we’ve got a few sites running with multiple db slaves and it’s so annoying having to check if they’re behind the master.
It looks like the load balancer will automatically pull slaves out if they fall behind, and route connections to the ones that are the most up to date. It is based on MySQL Proxy, which is currently in Alpha.
While I’ve known how floating-point values are stored, I didn’t know if DECIMAL was stored any differently, or if it was more or less accurate. According to a post on the MySQL list:
Bruno Rodrigues Silva wrote:
Dear all.
The MySQL Manual informs that the DECIMAL data type is used
to represent exact numbers, but like the floating-point data
types, DECIMAL uses rounding when the fractional part is not
sufficient. Therefore, what is the difference?
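The short version, as I understand it: FLOAT is a binary IEEE 754 type and only approximates most decimal fractions, while DECIMAL stores exact base-10 digits up to its declared precision (rounding only happens when a value exceeds that precision). A quick way to see the difference:

CREATE TABLE t (f FLOAT, d DECIMAL(10,2));
INSERT INTO t VALUES (0.1, 0.1);
SELECT f = 0.1, d = 0.1 FROM t;
-- returns 0 and 1: the FLOAT only holds something close to 0.1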
I need to compile something that uses the MySQL C++ library. I have mysql and mysql++ already compiled; I won’t go over how to do that now.
I added the following code to the top of my source:
#include <mysql++.h>
I got an error
/Users/jhaddad/dev/search_engine/main.cpp:4:21: error: mysql++.h: No such file or directory
Not cool.
How to fix:
In Xcode, open up the project settings (under the Project menu). Click the Build tab, go down to Search Paths, and change your Header Search Paths to the locations where you installed whatever you’re looking for. In this case, mine were /usr/local/includes and /usr/local/mysql/.
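For reference, the command-line equivalent is just the -I flags (plus -L when you link); paths here assume a default install, so adjust to wherever your copies landed:

g++ main.cpp -I/usr/local/include -I/usr/local/mysql/include \
    -L/usr/local/lib -L/usr/local/mysql/lib -lmysqlpp -lmysqlclient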
I’m setting up my first mysql cluster, and just wanted some clarification on a few things.
In the manual, it says:
Online schema changes. It is not possible to make online schema changes such as those accomplished using ALTER TABLE or CREATE INDEX, as the NDB Cluster engine does not support autodiscovery of such changes. (However, you can import or create a table that uses a different storage engine, and then convert it to NDB using ALTER TABLE tbl_name ENGINE=NDBCLUSTER. In such a case, you must issue a FLUSH TABLES statement to force the cluster to pick up the change.)
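So the conversion path described in the manual boils down to:

ALTER TABLE tbl_name ENGINE=NDBCLUSTER;
FLUSH TABLES;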
We were using replication to deal with certain queries that were producing table scans. I realize this is not a great long-term solution, but we were migrating a web site that was set up this way, so it wasn’t really a choice.
We had a database that was a mix of InnoDB and a few MyISAM tables. The MyISAM tables were used for fulltext searches.
To get a database dump, we were using the command
mysqldump --all-databases --single-transaction --master-data=1 > dumpfile.sql
You can see here that we have 24MB of unused query cache. This is essentially wasted memory. Make sure you’re not using a ridiculously high amount, since every time you perform an insert or update on a table, any cached queries that reference that table get invalidated.
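To check your own numbers:

SHOW VARIABLES LIKE 'query_cache_size';
SHOW STATUS LIKE 'Qcache%';
-- Qcache_free_memory is the unused portion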
I didn’t find any simple ways to determine how hard our mysql database was working, so I whipped this up. It uses a 10 second delay to figure out the queries per second average.
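The idea is simple enough to sketch (this is a rough reconstruction, not the exact script): sample the Questions counter twice, ten seconds apart, and divide.

Q1=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
sleep 10
Q2=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
echo "QPS: $(( (Q2 - Q1) / 10 ))"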
After banging my head against a wall for a while, I finally found this thread from June 2005 in the MySQL support forums regarding using variables in the LIMIT clause. They don’t support it yet, and there’s no timetable indicating when it will be supported either. It’s very unfortunate; I was looking forward to benchmarking the difference between our ad hoc code and the stored proc version.
This code will not work:
SELECT field from table LIMIT limit_var
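The workaround people suggest is to build the statement dynamically with a prepared statement, which happily accepts a concatenated-in limit:

SET @sql = CONCAT('SELECT field FROM table LIMIT ', limit_var);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;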
Aggregates are awesome. But sometimes you want to do two counts in a query, and have one of them be more restrictive than the other. Let’s say our database focuses on pictures, and rating them on a scale of 1 to 100. We want to know the average rating, the minimum rating, the max, the number of ratings, and the number of ratings over 75.
First, let’s set up the table.
create table rating (
    picture int not null,
    userid int not null,
    rating int not null,
    primary key (picture, userid)
) engine=innodb;
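The trick is that in MySQL a boolean expression evaluates to 0 or 1, so you can SUM it to get the conditional count alongside the normal aggregates:

SELECT picture,
       AVG(rating) AS avg_rating,
       MIN(rating) AS min_rating,
       MAX(rating) AS max_rating,
       COUNT(*) AS num_ratings,
       SUM(rating > 75) AS ratings_over_75
FROM rating
GROUP BY picture;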
MySQL won’t let you delete from a table using a subquery that references itself. Fair enough. To get around this, up till now, I’ve used temporary tables. However, I’ve never really liked it, and I’ve always wanted a JOIN delete. Well, apparently it exists.
I’m cleaning out some data from a really old database dump. I needed to preserve the data in the article table, but remove the rows in the city table based on a column in the article table.
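The multi-table DELETE syntax lets you name which table to delete from while joining against the other. Column names here are hypothetical:

DELETE city
FROM city
JOIN article ON article.city_id = city.id
WHERE article.needs_cleanup = 1;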
Using mysql’s -e feature, combined with awk and xargs, I was able to call an existing stored procedure repeatedly for a result set. Yes, I could have written another stored procedure to do this, I realize. But I guess I like doing things the hard way. Either that, or this is just less code. Or I wanted the awk and xargs practice. Whatever.
mysql database -e "select id from category where foreign_key in (2771, 2769, 2766, 2772, 2767)" | awk -F\| '{print $1}' | xargs -ivar mysql database -e "call move_category(var, 5666)"
Every now and then, we find that we will have a sudden increase in the number of apache processes, load average will spike up, and then go back down to normal. In rare cases, we will see the same thing happen, and the load avg spike WAY up, all queries appear locked up, and the server must be rebooted. I am looking for ways of determining what caused this. I should note that it happens extremely rarely, and has never shown up in a load test.
As of MySQL 5.0 there is no SHOW CREATE TRIGGER, which is pretty annoying. I don’t like using “show triggers like ”, so I figured I’d write a wrapper around mysqldump. For the record, this was a huge pain in the ass. I did this on Fedora, GNU sed version 4.1.5. You need to have your password stored in ~/.my.cnf for this to work, I believe. I haven’t tried it without it.
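The shape of the approach is roughly this (a loose sketch of mine rather than the original script; database and table names are placeholders): dump the table definition without data, then slice the trigger body out with sed.

mysqldump --no-data --skip-opt mydb mytable | sed -n '/TRIGGER/,/END/p'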
Our sysadmin has a nice blog post with a few tips for using mysqldump, especially if your database is used for more than a basic site, or if you have stored procedures and/or triggers.
A while ago when I started my first job in LA, I was using MSSQL. I was told to never use count(*), and rather to use count(1). Several people insisted this performs better than count(*), and since I really didn’t care to argue or look into it, I just started doing it.
Fast forward to today, where I am still writing count(1), but on MySQL. I’m at a different company now, but working with a lot of the same people. The person who used to be my boss is now my equal, and while mytop was running, noticed the count(1) and was happy to see it. Which reminded me to wonder - does it even matter?
REPLACE INTO will actually perform a delete and then an insert, while INSERT … ON DUPLICATE KEY UPDATE will perform an update (as the name suggests). I would think the latter would be faster. I have not done any performance testing between the two, but it only seems logical that the update would be faster than the delete/insert. Please correct me if I’m wrong.
Since the two statements both end up with the same result, I’m not sure yet what the benefit of REPLACE INTO is.
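One real difference beyond speed: ON DUPLICATE KEY UPDATE can reference the existing row, while REPLACE just blows it away. A contrived example:

CREATE TABLE counter (id INT PRIMARY KEY, hits INT);
-- delete + insert: any columns you don't supply get reset
REPLACE INTO counter VALUES (1, 1);
-- update: can build on the old value
INSERT INTO counter (id, hits) VALUES (1, 1)
    ON DUPLICATE KEY UPDATE hits = hits + 1;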
For a long time, whenever I wanted to do paging to browse through a table, I used to run 2 queries. The first would get the results, and the second would be an almost identical query with a count(*) instead of fields, and I’d use the result of the second query to figure out how many pages I would need.
Now there’s a better solution, and it’s called found_rows(). Say you have a simple select statement:
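Something like this, with a made-up table name:

SELECT SQL_CALC_FOUND_ROWS *
FROM picture
LIMIT 20, 10;

-- then, without re-running the query:
SELECT FOUND_ROWS();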
Maybe not breaking news, but I think it’s an interesting enough point, and I didn’t really find anything about the topic when I googled it. If you do any addition, subtraction, multiplication, or division (and probably a lot more mathematical functions for that matter) and NULL is one of your values, the entire expression will evaluate to NULL.
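A quick demonstration:

SELECT 1 + NULL;              -- NULL
SELECT 5 * NULL;              -- NULL
SELECT COALESCE(NULL, 0) + 5; -- 5, after defaulting the NULL first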
If you’ve worked with any of the commercial databases other than MySQL, you may have already had some experience with triggers.
I’m glad that the MySQL developers finally managed to squeeze in this extremely important feature. I no longer dread trying to defend MySQL from the Microsoft and Oracle developers who have had triggers for so many years.
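For anyone who hasn’t seen one, here’s a minimal example (table and logic invented for illustration) that caps a rating at 100 before it’s inserted:

DELIMITER //
CREATE TRIGGER cap_rating
BEFORE INSERT ON rating
FOR EACH ROW
BEGIN
    IF NEW.rating > 100 THEN
        SET NEW.rating = 100;
    END IF;
END
//
DELIMITER ;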