A PostgreSQL Backup Journey
I figured for PGSQL Phriday #002 , telling the story from when I looked after an energy insights database would be the most interesting. During the course of the project we used three differing backup tools and approaches, mainly driven by the ongoing exponential growth of the system.
I also want to cover my biggest learning from that project. Which was something David Steele said at pgconf.eu:
Make recovery part of your everyday processes.
The project took smart energy meter reads and customer billing data from a major UK energy company. This data was used to display insights about customer energy usage, using a range of analytics.
PostgreSQL was highly integral to the whole system with all of the data processing happening inside PostgreSQL with stored functions.
When I joined the project the system ingested data from 600k meters, by time I left the project it was over 5M meters. This took the database from 600GB to over 5TB. We also went through a couple of iterations of physical tin in a datacentre and eventually a forced cloud migration (which did not improve anything).
The data was mostly received as a batch workload (the data is collected from meters overnight when GSM data is cheap). So each morning we had two hours to:
- Load the meter data
- Load the billing data
- Compute all the pricing and analytics
This process would effectively max all CPU and disk IO for two hours, it was parallelised over ~32 CPU cores, generating a lot of WAL very quickly.
Another interesting facet of the system meant that a lot of the analytics data was recalculated more than once. So the system would typically update a large amount of the data marts on each data load.
Good Old pg_dump
At first the database was just backed up using
pg_dump , running nightly. This would dump all database tables, using a parallel dump in compressed binary format. This was dumped onto a low grade (a consumer thing brought from a local shop, because... startup) NAS box, over a 1GbE connection.
As you can imagine, this process was not great, it was slow and error prone. It typically took over 8 hours to dump, even when the DB was less than 1TB. Most of the errors came from the NAS box deciding to have problems, before one time when it completely failed and we lost all the backups. At this point the NAS was at least replaced by a more enterprise box, which was... at least expensive.
But by far the worst side here was trying to do a restore. That would typically take almost 48 hours. Needless to say we didn't do this very often, if ever.
Moving On Up To pg_basebackup
I moved the backups to
pg_basebackup pretty quickly. Pretty much for one reason: the speed up.
We actually first started using
pg_basebackup as a way to rebuild replica nodes quickly, but also to mirror onto our test systems. Using
pg_basebackup we could typically do this in roughly four hours.
We then started using
pg_basebackup for our backups, using the same nightly schedule, manual retention scheme and NAS target. This took half the time and a fraction of the CPU usage of
Recovery was improved, but in general it was still quicker and less error prone to stream from the primary database whenever we chose to rebuild a node.
As such our backups were largely untested!
Another downside was having no point in time recovery options. This wasn't a major problem at first since we could always just replay our data loads.
Relaxing Into pgBackRest
I first learnt about
pgBackRest at pgconf.eu 2015 from a presentation by David Steele. I could immediately see it addressing some the problems we were facing and how it could enhance our rather lackluster approach. However it took us about 9 months to get around to deploying
pgBackRest , by which point our databases were around the 2TB point.
At the time we were rebuilding our database hardware, this gave me the chance to build a dedicated PostgreSQL backup server, by juggling around the hard drives from our older servers. We ended up with a dedicated box for
pgBackRest with 24TB of storage.
pgBackRest we were able to implement full, differential and incremental backups. This allowed us to have roughly 1 months worth of daily backups. Coupled with continuous WAL archiving we also gained full point in time recovery.
We had fun getting
pgBackRest able to archive the WAL fast enough and did end up having to use the async archiving approach. This was back in the day it was all Perl, but we were bottle-necking on the network mostly.
Even with incremental backups, it was surprising how much space these took. The loading and recalculations meant we had about 10% to 20% churn per day in the underlying data files.
However the advantages of
pgBackRest where night a day for us. It sure let me relax more knowing our backups were safe and well taken care of.
Making Recovery Part Of You Everyday Processes
By far the biggest advantage we gained from
pgBackRest was being able to make recovery part of our everyday processes. We typically rebuilt our test boxes from production once a month, if not more often. Due to the nature of what the product did testing with real data was crucial.
pgbackrest meant restores were quick enough to run directly from the backup server onto the test database servers, typically scheduling these to run overnight when our testing team wanted upto date data. This speed up was a mix of
pgBackRest and the fact we had decent dedicated hardware for the backup server, an approach we carried into the cloud too.
This meant we were able to test our backup on a regular basis, without having to think about it or make time to do it. This was such an advantage know our backups actually worked. I'm of the opinion that you don't have a backup unless you have tried restoring from it!
So to answer the bonus question, yeh
pg_dump is a backup, but, only if you tested you can restore it.
I hope you found this story interesting. I recommend
pgBackRest for an awful lot of situations. However there are many times when
pg_basebakup are perfectly good options.
Just test your backups regularly!
Feel free to message me on Mastodon: @firstname.lastname@example.org