I have read this report about the incident at GitLab.
The incident (TL;DR)
GitLab was being hammered by spammers (incident 1), which in turn put their replication system under stress (incident 2). An on-call sysop who was investigating the replication issues accidentally deleted around 300 GB of data from the production database server (incident 3). A restore attempt showed that none of the several backup methods in use was working properly, and the latest backup available was a manual snapshot taken around 6 hours earlier.
So 6 hours of production data were lost; furthermore, GitLab had to be taken offline to complete the recovery.
Today I learned that:
- Spam can harm you in many ways besides filling your site with garbage. Putting spam-protection mechanisms in place is better done sooner rather than later.
- Backup systems are not useful if you cannot restore data. Test your backups, otherwise you will have a Schrödinger backup, i.e. a backup for which you don’t know whether it works until you attempt a restore.
- Running commands with administrative privileges is dangerous. As the old saying goes: “think before you type”. If possible, use a safer command: in this case, since the target directory should have been empty, `rmdir` would have been better than `rm -rf`, because `rmdir` refuses to delete a non-empty directory.
- Give meaningful names to your servers. It is very easy to get confused between db1.cluster.gitlab.com (production) and db2.cluster.gitlab.com (not production). Use hero names for production and villain names for testing (or vice versa if you are working for an evil company).
- That final command that destroys your production data is not THE mistake. The mistake is that you arrived in that situation, i.e. the chain of errors that led a lone sysop to issue that command. By then, you and all of your team have already lost the game. Game over. If you find yourself in this situation, be as open and transparent as possible: we are all going to learn something.
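The backup lesson above boils down to one habit: every backup run should be followed by an actual restore into a separate location, plus a comparison. Here is a minimal sketch of that idea using `tar` and throwaway `/tmp` paths (all names here are made up for illustration; GitLab’s real backups were PostgreSQL dumps, but the principle is identical):

```shell
#!/bin/sh
# Sketch of a restore test: a backup only counts once you have restored it.
# All paths below are throwaway examples, not GitLab's actual layout.
set -e

mkdir -p /tmp/backup_src /tmp/backup_restore
echo "important data" > /tmp/backup_src/file.txt

# Take the backup...
tar -czf /tmp/backup.tar.gz -C /tmp/backup_src .

# ...then immediately prove it restores, into a separate directory.
tar -xzf /tmp/backup.tar.gz -C /tmp/backup_restore

# If the restored tree differs from the source, diff exits non-zero
# and, because of `set -e`, the whole check fails loudly.
diff -r /tmp/backup_src /tmp/backup_restore
echo "backup verified"
```

For a real database you would restore the dump into a scratch instance and run sanity queries instead of `diff`, but the shape of the check, backup, restore elsewhere, verify, is the same.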
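On the “safer command” point: `rmdir` only succeeds on an empty directory, so it acts as a built-in sanity check, whereas `rm -rf` deletes whatever happens to be there. A quick demonstration (the directory name is hypothetical):

```shell
#!/bin/sh
# rmdir refuses to remove a non-empty directory; rm -rf would not.
mkdir -p /tmp/demo_dir
echo "production data" > /tmp/demo_dir/data.txt

if rmdir /tmp/demo_dir 2>/dev/null; then
    echo "removed"
else
    # This branch runs: the directory still contains data.txt.
    echo "refused: directory not empty"
fi
```

Had the sysop typed `rmdir` on the wrong host, the command would have failed on the non-empty data directory instead of wiping it.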
Check your backups and try to restore data from one of them. NOW.