How to handle a technical crisis at Gymglish (and live to tell the tale)

If our article on a typical day in the life of the Gymglish tech team was a bit too theoretical for you, you may need a concrete example of our crack staff in action handling a crisis.

Our CTO, Leo Tingvall, will now take you through the incident report of a 4-hour server downtime on a Saturday morning. Warning: this article may evoke feelings of anxiety, heart palpitations and/or confusion.

0h31: Basically, our primary database server discovered that it was unable to use the table ‘nonlessonmail’. This is the table used to store all the emails (!!!) sent to the users (except lessons/corrections). The server said the table was ‘corrupt’ and unreadable. “Please restore from backup” said an (un)friendly message in the logs.

0h31-8h: We started getting alert emails about errors during the night (thankfully we were sleeping). 

8h04: Bogdan, always an early riser, travelled 1h back in time (through something called a “Romanian time zone”) and informed us that we had, tam tam tam, a broken database! What a great start to a Saturday! RETD emergency was quickly summoned using Telegram. One by one the engineers started joining the conversation.

8h45: After some investigation, we realized it was almost the same issue we had in October 2020. We’ve fixed this once, so we should be able to fix it again!

9h04: Antoine makes sure the roles of Incident Commander (Leo), Scribe (Bogdan) and Customer Liaison Officer (Antoine) are established. These roles will change hands at least twice throughout the morning, but having assigned roles works very well!

9h15: After some discussion and checks, we decided that the solution was to dump the table from the replication server db7 and import it into the broken db8. The table is 60GB. We’re in for some time. Surprisingly, the compressed dump is only 1.8GB. We wonder why, and try to make sure we really did get everything. In the end it looks good.
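For the curious, here is one way a suspiciously small dump can be sanity-checked. This is only a sketch, not the procedure we followed that morning: it assumes the dump is a gzip-compressed file, and the function name inspect_dump is made up for the illustration. It reads the whole file through the decompressor, which fails if the file was cut short, and reports how many bytes come out the other end.

```python
import gzip
import sys


def inspect_dump(path, chunk_size=1024 * 1024):
    """Stream-decompress a gzip file and return (ok, uncompressed_bytes).

    Assumption: the dump is gzip-compressed. `ok` is False when the
    stream is truncated or otherwise unreadable.
    """
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (EOFError, OSError):
        # EOFError: file ended before the end-of-stream marker.
        # OSError (incl. gzip.BadGzipFile): not gzip, or CRC mismatch.
        return False, total
    return True, total


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, size = inspect_dump(sys.argv[1])
    print("complete" if ok else "TRUNCATED OR CORRUPT", size, "bytes")
```

A clean read only proves the archive is intact, not that every row made it in, so it complements rather than replaces a comparison against the source table. For scale, 60GB down to 1.8GB is a ratio of roughly 33 to 1.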

9h30: Here is one of the epic moments in the tek team: Leo and Antoine apologize as they each attempt a modified “Irish goodbye” (claiming childcare organization). Meanwhile, Aurélien is investigating shirtless (it’s Saturday), Thomas is debugging things seemingly while driving his car (he claims he has stopped the car, but still, he’s in his car…), and Bogdan is running the big operation while eyeing the clock to make sure he won’t miss his flight back to Paris.

10h14: The import of the dumped nonlessonmail table starts. Now it’s just a matter of time: how many GB per minute do we get? The clock starts.

11h02: We’re at 24GB. So not slow, but not really fast either…
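The "GB per minute" question can be answered with a bit of arithmetic on the timestamps above. The snippet below is just a back-of-the-envelope sketch: the helper names are invented for the example, and it assumes the import speed stays constant, which real imports rarely do.

```python
def to_minutes(clock):
    """Convert a timestamp like '10h14' to minutes since midnight."""
    hours, minutes = clock.split("h")
    return int(hours) * 60 + int(minutes)


def estimate(start, now, done_gb, expected_gb):
    """Return (GB per minute, estimated finish time as 'HhMM').

    Assumes a constant import rate from `start` to the end.
    """
    elapsed = to_minutes(now) - to_minutes(start)
    rate = done_gb / elapsed
    remaining = (expected_gb - done_gb) / rate
    finish = to_minutes(now) + round(remaining)
    return rate, f"{finish // 60}h{finish % 60:02d}"


# Numbers from the timeline: started 10h14, 24GB done at 11h02, 60GB expected.
rate, eta = estimate("10h14", "11h02", 24, 60)
print(f"{rate:.2f} GB/min, expected finish around {eta}")
```

With these numbers the rate is 0.5 GB per minute (24GB in 48 minutes), which for a 60GB table points to a finish around 12h14.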

11h54: The import is finished. Looks good. Only 45GB in the end (and not the 60GB we expected…). Where is the rest? Probably a lot of scared looks going around (Leo wasn’t there to see…). Looks like it’s not needed. Great! Spring-cleaning!

12h02: testing services: everything looks okay. 

12h12: Removing emergency mode, putting website and services back online. Everything looks fine. 

12h25: Success is declared! We can all go on with our weekends. 

Lessons learned: 

  • keep an emergency procedure close by and up-to-date
  • training is good! The second time around is much easier
  • we really want to update our database servers: although we’re not sure of the cause here, we hope things will improve with the updated operating system.

Leo, on behalf of the Gymglish TEK team



