
GitLab Dev Deletes Entire Production Database

In this article, we’ll look at how a developer’s mistake caused GitLab to lose six hours of data from GitLab.com. We’ll walk through what happened, how they recovered, and what they learned from it. For GitLab and its users, the incident was painful.

GitLab is one of the most popular source-code hosting and DevOps platforms. But on January 31, 2017, GitLab had a serious incident: one of their engineers accidentally deleted the production database, wiping out roughly six hours’ worth of data from GitLab.com. It was one of GitLab’s worst nightmares.

The Problem: Spam and Database Overload

The trouble started around 6 pm UTC, when GitLab noticed spammers creating large numbers of snippets (small shared pieces of code) on GitLab.com, pushing database load up and making it unstable. GitLab began blocking the spammers by IP address and deleting their users and snippets.

Around 9 pm UTC the database load got worse, write operations started to stall, and parts of the site went down. GitLab also found that one user was using a project as a CDN, with about 47,000 IPs signing in through the same account, which added even more load. That user was removed as well.

Around 10 pm UTC, GitLab got an alert that replication to the secondary database, which they relied on as a standby, had stopped: there was simply too much data to copy, and the secondary could not keep up with the primary. GitLab decided to fix the secondary by wiping its data directory and re-seeding replication from scratch.
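For readers less familiar with PostgreSQL replication, re-seeding a standby usually means emptying its data directory and taking a fresh base backup from the primary. Here is a minimal sketch of that operation with pg_basebackup, assuming a GitLab Omnibus install; the hostnames, user names, and paths are illustrative, not GitLab’s actual configuration.

```bash
# Runs ON THE SECONDARY only. Hostnames, users, and paths are illustrative,
# not GitLab's real configuration.

# Stop the replica's PostgreSQL (GitLab Omnibus service name).
sudo gitlab-ctl stop postgresql

# pg_basebackup refuses to copy into a non-empty directory, so clear it first.
sudo rm -rf /var/opt/gitlab/postgresql/data/*

# Stream a fresh base backup from the primary:
#   -h  primary host            -D  target data directory
#   -U  replication user        -X stream  include WAL while copying
#   -R  write standby recovery settings    -P  show progress
sudo -u gitlab-psql pg_basebackup \
  -h primary.db.example.com \
  -D /var/opt/gitlab/postgresql/data \
  -U gitlab_replicator \
  -X stream -R -P

# Start the replica again; it should begin streaming from the primary.
sudo gitlab-ctl start postgresql
```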

The Mistake: Right Command, Wrong Server

But the re-seeding did not work and kept failing with errors. GitLab tried adjusting some replication and connection settings on the primary, but that left PostgreSQL refusing to start, complaining that too many semaphores were open.
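The settings in question are standard PostgreSQL replication and connection parameters. As an illustration only (the database name and the example notes are placeholders, not GitLab’s actual values), this is how one might inspect them:

```bash
# Illustrative only: inspect the replication/connection parameters involved.
# The database name and the notes in the comments are placeholders.
psql -U postgres -d gitlabhq_production -c "SHOW max_wal_senders;"    # must be > 0 for streaming replication
psql -U postgres -d gitlabhq_production -c "SHOW max_connections;"    # set far too high, PostgreSQL can fail to start (semaphore exhaustion)
psql -U postgres -d gitlabhq_production -c "SHOW wal_keep_segments;"  # how much WAL the primary keeps for replicas that fall behind
```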

Around 11 pm UTC, one of the engineers (team-member-1) suspected that the re-seed was failing because the data directory on the secondary already existed (even though it was empty). He decided to delete it with rm -rf /var/opt/gitlab/postgresql/data/*.

But he made a critical mistake: he ran the command on the primary database server instead of the secondary. By the time he realized and stopped it, almost all of GitLab.com’s production data was gone.
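One common safeguard against exactly this kind of slip (not something the postmortem says GitLab had in place at the time) is to wrap destructive commands in a script that refuses to run on the wrong host. A minimal sketch, with the hostname and path purely illustrative:

```bash
#!/usr/bin/env bash
# Minimal guard: refuse to wipe the data directory unless we really are
# on the intended secondary. Hostname and path are illustrative.
set -euo pipefail

EXPECTED_HOST="db-secondary.example.com"   # the only machine this may run on (illustrative)
DATA_DIR="/var/opt/gitlab/postgresql/data"

if [[ "$(hostname -f)" != "$EXPECTED_HOST" ]]; then
  echo "Refusing to run: this is $(hostname -f), not $EXPECTED_HOST" >&2
  exit 1
fi

echo "About to delete everything under $DATA_DIR on $(hostname -f)."
read -r -p "Type the hostname to confirm: " answer
[[ "$answer" == "$EXPECTED_HOST" ]] || { echo "Aborted." >&2; exit 1; }

# The :? guard aborts if DATA_DIR is ever unset or empty.
rm -rf "${DATA_DIR:?}"/*
```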

The Solution: Use an Old Backup

As soon as team-member-1 realized what he had done, he alerted his teammates and GitLab.com was taken offline. The team started looking for backups to restore the data.

It turned out they had several backup mechanisms in place, but none of them actually worked:

  • Disk snapshots were not enabled for the database servers

  • The S3 bucket that was supposed to hold backups was empty

  • The regular database dumps had been failing silently (a PostgreSQL version mismatch), so the most recent ones were stale

  • Replication to the secondary was broken (that was the problem being worked on in the first place)

The only usable backup was one that team-member-1 had made by hand about six hours before the incident. It contained most of GitLab.com’s data, but everything created or changed in those six hours, including issues, merge requests, users, comments, and snippets, was lost.
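The post doesn’t say exactly how that manual backup was taken. For illustration only, a hand-run logical backup of a PostgreSQL database typically looks something like this sketch; the database name, user, and paths are placeholders:

```bash
# Illustrative only: a hand-run logical backup with pg_dump.
# (The post does not say exactly how GitLab's manual backup was taken.)
BACKUP_FILE="/backups/gitlabhq_production_$(date +%Y%m%d_%H%M).dump"

pg_dump -U gitlab -d gitlabhq_production \
  --format=custom \
  --file="$BACKUP_FILE"

# Quick sanity check: the dump should be readable and list real tables.
pg_restore --list "$BACKUP_FILE" | head
```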

GitLab decided to restore from this backup to get GitLab.com working again as quickly as possible. They also asked users to help recover lost data by sending in screenshots or local copies of their recent work.

The restore took a long time and involved many steps (a rough sketch of a generic database restore follows the list):

  • Copying the backup onto a fresh database server

  • Pointing GitLab at the new database server

  • Checking and repairing the restored data

  • Starting the GitLab services and verifying that everything worked

  • Communicating with users and keeping them informed of progress
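As promised above, here is a rough, generic sketch of what restoring a logical backup onto a fresh server involves. This is not GitLab’s documented procedure; the hosts, users, paths, and the table name in the final check are illustrative.

```bash
# Generic sketch of restoring a logical backup onto a new database server.
# NOT GitLab's documented procedure; hosts, users, and paths are illustrative.

# 1. Copy the backup to the new server.
scp /backups/gitlabhq_production.dump new-db.example.com:/tmp/

# 2. On the new server: create an empty database and restore into it.
createdb -U postgres gitlabhq_production
pg_restore -U postgres -d gitlabhq_production --jobs=4 /tmp/gitlabhq_production.dump

# 3. Spot-check the data before pointing the application at it.
psql -U postgres -d gitlabhq_production -c "SELECT count(*) FROM projects;"
```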

GitLab.com was finally back online around 6:14 pm UTC on February 1st, roughly 18 hours after the data was lost.

The Lesson: Learn from Mistakes

GitLab investigated the incident thoroughly and published a public postmortem on their blog. They identified what went wrong and what made it worse, including:

  • Human error: team-member-1 deleted the data directory on the wrong server

  • Lack of verification: none of the backup mechanisms were tested or monitored

  • Lack of documentation: there was no clear procedure for restoring from backups

  • Lack of communication: there was no good process for coordinating during the incident

  • Fatigue: team-member-1 had been working late at night and was tired

They also drew up a list of improvements to prevent similar problems in the future, including:

  • Enabling disk snapshots and verifying the S3 backups

  • Improving backup documentation and regularly testing restore procedures

  • Adding monitoring and alerts for backup failures (see the sketch after this list)

  • Adding role-based access control and audit logging for database servers

  • Providing training and tooling for PostgreSQL replication

  • Building a blameless culture and a process for learning from mistakes
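To make the monitoring item above concrete, here is a small sketch of a backup-freshness check (an illustration, not GitLab’s tooling) that a cron job or monitoring agent could run and alert on:

```bash
#!/usr/bin/env bash
# Illustrative backup-freshness check, not GitLab's actual tooling.
# Exits non-zero if no dump in BACKUP_DIR is newer than MAX_AGE_HOURS,
# so a cron job or monitoring agent can alert on the failure.
set -euo pipefail

BACKUP_DIR="/backups"   # where dumps are written (illustrative)
MAX_AGE_HOURS=24

# Find any dump modified within the last MAX_AGE_HOURS and stop at the first hit.
latest=$(find "$BACKUP_DIR" -name '*.dump' -mmin "-$((MAX_AGE_HOURS * 60))" -print -quit)

if [[ -z "$latest" ]]; then
  echo "ALERT: no backup newer than ${MAX_AGE_HOURS}h found in $BACKUP_DIR" >&2
  exit 1
fi

echo "OK: recent backup found: $latest"
```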

The incident was very damaging for GitLab and its users. It showed how much they needed reliable, regularly tested backups and a clear, written procedure for restoring them.

GitLab was open and honest about the incident, sharing what they found and learned with everyone. They apologized to their users and offered compensation for the data loss. In return they received a great deal of feedback and support from their community, who appreciated the transparency and the effort.

Conclusion
GitLab made a serious mistake and lost data, but they recovered and learned from it. The incident is a reminder for everyone who works with data and databases to be careful, deliberate, and prepared: double-check your commands, test your backups, document your procedures, communicate with your team, and learn from your mistakes.

