So, we love Postgresql at iZettle. We really do (we also have another kind of love for Cassandra, but that's another story).
Why do we love it? Because it's stringent, adhering to standards and almost never surprises us in unpleasant ways. It might not be THE greatest open source application ever written, but it's definitely up there at the top of the list somewhere.
We have a fairly distributed environment with multiple different Postgres DBs around. Instead of scaling an individual DB horizontally by sharding the data, or using complex or expensive third party cluster solutions (such as Postgres-XL etc), we've chosen to have multiple identical RO slaves and partition the table data by age.
Postgresql has support for basic table partitioning, but they come with a catch, which I will try to describe below.
A partition in Postgresql is actually just a normal table. Only thing that differentiates it from other tables is that is has a parent table. Also it might (should) have constraints on the range of data you can insert into it. The parent table in turn (the one that you as a client query and insert data into), should have triggers that delegates data inserts and queries to relevant child tables. All this should be invisible for the client.
* Data retention:
When you want to delete data that's older than a certain age (and you should!), you'd in a classical scenario just issue something like select
DELETE FROM cheese WHERE best_before < cutoff_time;. That's fine until you have a very large table. For large tables, these kind of statements will take a very long time. Also, it will cause quite a bit of movement in the table, triggering a long running VACUUM process that can make your database server(s) groan. If you, on the other hand have lots of partitions behind the parent table
cheese, you could simply just execute:
ALTER TABLE cheese_2014 NO INHERIT; DROP TABLE cheese_2014;. This would be instant, not cause any vacuuming at all and lastly: free up space on disc.
* More efficient memory usage on the DB server:
If you query the large table often, the DB server will try to keep that tables indices in memory. These indices might be huge and eventually require more hardware to be added. If nothing else, they might push other important indices down to disc. If you know that you'll almost always query the large table for the latest data, you'd be better off with partitions as these are logically different tables. Then the old partitions might be completely stale and unused and their indices can stay on disc, while only the newest partitions will be in memory.
* Joins over smaller indices are faster
So, if the DB server can figure out that only a specific partition will contain the relevant data (constraint exclusion), it can use that partitions small index when joining instead of a big one belonging to a huge table.
How we do it:
We have automated jobs running continuously, assuring that:
* We have enough partitions in the future so we are not suddenly refused to insert data because of a partition constraint.
* Each partition has the indices it needs to have to be queryable.
* Partitions with expired data are dropped.
What's the problem?
Today, most application servers, no matter the language you're using, will use prepared statements. This is to allow the database to use the optimizer to perform it's magic only once per session instead of each time the statement is executed.
Let's say that we have the empty parent table
cheese, with two child tables (e.g. partitions)
cheese_201702. Given the query
SELECT country FROM cheese WHERE created = '2017-01-15 15:00:00';, the query planner will directly tell me that it would ignore the parent table (as it's empty), and go directly to the
cheese_201701 table. Exactly as intended! But, inside a reasonably modern application, you'd rather have a prepared statement looking something like this:
SELECT country FROM cheese WHERE created = ?;. This statement is sent to the server, prepared and saved with a reference. The server side preparation of this statement doesn't have any constraint information available to make any assumptions on what partition to use! So when the prepared statement is actually referenced by the client, only then providing the parameter value, the cached query plan for this statement is basically worthless. The query plan would almost certainly always include both partitions, making two of the assumptions above invalid!
Exactly this turned out to be our problem
We have a couple of large - at least in our standards - tables, having a couple of billion rows (no, they're not about cheese). They're partitioned, so that each partition only has a small fraction of that data, making each partition nice and queryable.
After having the system running in production for almost a year, we suddenly started to see some infrequent but really long response times. The slow queries could take anything between a couple of minutes up to more than an hour(!). These became more and more frequent which caused quite some concern. When we took the query, exactly as it was logged (with parameter values and all), and ran it in psql against the same DB machine with
explain analyze, the query plan was fine and took less than 100ms! How's this even possible? I mean: of course there's different times of day when your load differs, but a factor of 10000 slower on our application servers vs. running it manually? Such huge differences can not really be explained by anything other than application behavioral differences. Hardware and traffic would not alone suffice to create such a huge difference.
So, what kinds of behaviors can be different when we query from an application server vs from within psql? It could of course be that, at the time of the long running queries, the database had the wrong indices in memory, and different concurrent queries causing indices to be swapped to disc. Reading disc vs reading memory could possibly explain the factor 10000. The next question then becomes: Why would it have the wrong indices in memory, and why would not all of the relevant fit there to begin with?
Could it possibly be that the DB tried to have the indices for ALL partitions in memory? That would explain the lack of physical memory, as the total size of indices for these tables amounted to around 1 TB. This certainly seemed to be the case. After some googling, we realized that prepared statements really shouldn't be done for queries over partitioned tables. So, what to do now? Change all application code to execute raw SQL strings instead? The horror!
There was actually a better way. We're using the JDBC driver, and there's a parameter controlling the threshold for how many times a query should be executed before server side PREPARE takes place. Setting this value to 0, simply means: don't ever prepare my statements. Said and done, we added the parameter to the driver initialization url:
prepareThreshold=0. As per usual we rolled out this change quickly to production without user interruption as this is how we’ve designed most of our ever growing fleet of micro services.
(The vertical dotted line represents the time of the release.)
Of course there's a cost for not preparing your queries. The server has to do query planning for each incoming query. But, given the difference between 100ms and 1h, accepting this extra cost was a no-brainer.
More on partitioning in postgres for the curious: https://www.postgresql.org/docs/current/static/ddl...
Also, newer versions of Postgresql may handle the planning of prepared statements better, even if the docs already, for our fairly old version, would have you believe it did: https://www.postgresql.org/docs/devel/static/relea...