Saturday, November 12, 2011

Why I Like PostgreSQL

Today I gave a short presentation at work about PostgreSQL, and why I much prefer it to MySQL.

PostgreSQL vs MySQL: Eternal Battle


I may be misreading this, but it seems that there is a recent trend within startups to move away from MySQL, probably thanks to folks like Heroku on one side (who use PostgreSQL to the extreme, and help and contribute to it's development), vs folks like Oracle on the other side, tainting the "open source pureness" of MySQL :)

At my work we currently use a mid-sized MySQL 5.1 Percona instance, which is holding up quite well I must admit. Both PostgreSQL and MySQL have definitely converged to cover most features that people want, but my leaning is still towards PostgreSQL. I just agree with it's focus on data integrity, recovery, constraints, extensibility, while some of the early decisions in MySQL's design do not agree with me at all (like truncating long strings, 1/0 instead of booleans, ambiguous group by, etc). I think that data integrity may not have been the top priority in MySQL's early design (and it wasn't, MyISAM was fast, but not great at integrity).

I work for an e-commerce company, where transactions are very, very important. Loosing data is just not acceptable. Creating data that fails validation is a huge pain for our business analysts and accounting, who have to make sense of it. Early MySQL did not have sufficient constraints and foreign keys to encourage data modellers to use it.

I even believe that this lack of support for constraints and validation was the reason early Rails adopters rejected database constraints, and pushed people towards 100% in-application validation, while treating your database sort of like a document store (early Rails books and applications rarely create non-null columns, or even give varchar columns length specification, leaving MySQL to create all string columns with the default 256 character limit).

Among examples of where MySQL data integrity seems like an afterthought, is one where we recently discovered that a MySQL Replica being used for reporting allows writes, and was in fact quite out of sync with the master. How can you trust your reports that run off a replica, which is no longer a true replica?

Anyway, MySQL has plenty of support, fans and still enjoys wide spread usage. But if you are ready to try out PostgreSQL, here's my crash course of PostgreSQL install, features and some recent gotchas I had to deal with.

Note: parts of this post were inspired by a related post on data and PostgreSQL on SquareUp Technical blog.

PostgreSQL Basics

Installers are now available for Mac OS-X and Windows.

But I prefer compiling from sources. Compiling on unixes is very easy. Download the tar ball, unpack it and then run:
/configure --prefix=/usr/local/pgsql-9.1
make
make install

then let's create a database in directory /db on this server:
/usr/local/pgsql-9.1/bin/initdb –D /db
/usr/local/pgsql-9.1/bin/pg_ctl –D /db start
/usr/local/pgsql-9.1/bin/psql –U postgres postgres

Note that if you use an installer, your default user may be not "postgres" but your Mac or Windows username.

Configuration

Two critical files in /db:

postgresql.conf

Most db settings, performance, memory, optimizer, network interface to listen on, go into this file.

You will generally want to change the following (let me know if anyone is interested, and I can make some recommendations about which values I use).
shared_buffers 
temp_buffers 
work_mem 
maintenance_work_mem 
checkpoint_segments 
wal_keep_segments 
effective_cache_size
I also like to enable logger for slow queries in pg_log directory:
logging_collector = on 
log_directory = 'pg_log' 
log_filename = 'postgresql-%Y-%m-%d.log' 
log_rotation_age = 1d 
log_rotation_size = 0 
log_min_error_statement = error 
log_min_duration_statement = 200 
log_lock_waits = on 
log_statement = 'none'

pg_hba.conf

Access, including network, replication, etc. This is the file you want to modify to allow remote replication, remote access, decide which authentication method to use, etc.

Most of the time I use trust on a local system, and md5 for remote authentication.

Some Neat Features

PostgreSQL 9 boast a set of pretty cool features, some of them are listed below:
  • Partial Indexes (Reduce size of the index (say if only 10% of products are active):
create index on products (category_id) where isactive = true;
  • Function Indexes (instead of creating another column with lower case email):
create index on users to_lower(email);
select * from users where to_lower(‘MyEmail@GMAIL.COM’) = ‘myemail@gmail.com’;

More Cool Stuff

  • Create indexes concurrently (without table locks)
  • Schema modifications can be done in a transaction
  • Instant non-locking adding of nullable columns to large table (major issue today with MySQL)
  • Schemas, table spaces!  Can create indexes on a different table space (ie disk partition)
  • Extensible stored procedures: Java, Perl, Python, Ruby, Tcl. C/C++ and its own PL/pgSQL
  • Cost-based optimizer is generally better than rule base optimizer.  Takes into consideration data distribution
  • Full featured text search
  • INTERSECT and EXCEP in addition to UNION
  • Built-in performance statistics: pg_stat_activity

PostgreSQL In Practice: Replication

PostgreSQL 9.1.1 is the latest stable version as of this writing, and since version 9.0 PG supports streaming replication which I have recently setup on several servers.

There are a number of decent guides out there, for example here, and also here.

While setting up replication on PostgreSQL 9.1.1, I had a tiny trouble being able to replicate from the master db. The error was being printed on the master: "FATAL: must be replication role to start walsender"

Somehow, superuser "postgres" was not explicitly given "Replication" role. Weird, considering it's a super user. I think this may be a recent change in PostgreSQL default permissions. But to go around this, either add the REPLICATION role to user "postgres", or create a new role for replication only (must also have LOGIN role):
CREATE ROLE REPLICATOR REPLICATION LOGIN ENCRYPTED PASSWORD '....';

Then in pg_hba.conf:
host replication  replicator  10.0.0.0/32    md5

Using Replica for Queries

Another bump I ran into is the following: I wanted to use the PostgreSQL replica to run reports, pg_dump and other long-running queries. But they immediately failed with an error: "ERROR: canceling statement due to conflict with recovery"
See this thread for more info: http://postgresql.1045698.n5.nabble.com/Hot-Standby-ERROR-canceling-statement-due-to-conflict-with-recovery-td3402417.html
The solution is to increase the following parameter to allow queries longer than 10 minutes to successfully execute:
max_standby_streaming_delay = 600s
This also means your replica may be up to 10 minutes behind the master, but in my case this was an acceptable compromise.

I am guessing if someone wanted to setup a true hot-standby with minimum delay, it would not be very usable for reporting. So perhaps the following setup provides both redundancy and a reporting instance:
[master-db] <- [hot standby] <-- [hot standby, 2 hrs delay, reports]

Where is my Processlist?

MySQL users will no doubt miss infamous "show processlist" command. Don't fret: there is such thing in PostgreSQL too:
select * from pg_stat_activity;

This used to be in a separate contrib module, but now included by default. Which is great, because it's fast and provides locking information (whether queries wait on locks). This module uses a tiny temporary storage (usually in pg_stat_tmp) directory, which if you want to be really fancy, you would mount on a RAM disk partition. It looks like the size of the file inside that directory is constant (does not grow).

Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1             9.7M  226K  9.0M   3% /db/data9/pg_stat_tmp

Locks, Waits, and Deadlocks

Locks and deadlocks are bane of any database application because they suck: at least one process would have to abort, sometimes more, and depending on how good your error handing is, this may have some undesired consequences.

But how to you find and eliminate deadlocks? Short answer is -- there is no short answer.

On many applications I worked, whether they were written in C, Perl, Java or Ruby, I've seen deadlocks happen again and again. Debugging deadlocks is a painful exercise and there is no prescribed answer that works in all cases. Debugging distributed deadlocks (that happen when distributed transactions spanning multiple databases lock up), is a lot harder than debugging deadlocks in a single database.

But in both cases, being able to determine who is locking who, is very very important. PostgreSQL keep lock information in several supporting tables, which can be queried. I found this info invaluable, as you can see which processes are blocking, and which ones are waiting, and eventually figure out how reorder operations in your application or reduce contention on the same database object. The point is that information here is key, and PostgreSQL luckily provides a good deal of it.

If you experience deadlocks, please see this page for detailed queries on lock contentions and deadlocks: http://wiki.postgresql.org/wiki/Lock_Monitoring

Summary

I wanted to share some of the recent finds, gotchas and also excitement about PostgreSQL 9 database, it's features and capabilities. I hope you found this post informative, and if anything maybe PostgreSQL will peek your interest. Feel free to leave a comment on any of the related topics.