On Dec 6 at 09:16 UTC, Chargify suffered a major outage due to the unexpected termination of multiple database servers in our cluster. This was our first major outage in over a year. I would like to apologize for letting you down and for the disruption to your business.
We now know what happened and have addressed it on multiple levels. Read on if you would like to learn what happened and what we have done about it.
At approximately 09:16 UTC, our primary database server ran out of memory and the database process was killed. Our automation immediately failed over to one of the secondaries, but that server also quickly ran out of memory and its database process was killed as well. At this point, with the cluster in an inconsistent state, our automation was unable to elect any other candidate as primary, and we were down, without access to our primary database.
Our team quickly diagnosed the issue and determined that we could elect a primary and bring the cluster back online. We spent some time ensuring that there was no detectable data corruption or loss, and once we were satisfied, we brought the cluster back online at 09:36 UTC, ending a 20-minute outage.
What We Did
We then searched for the reason the memory was exhausted in the first place. We could not find an immediate cause, so we installed monitoring on the memory usage of the database processes themselves on all machines, and doubled the memory of each machine for good measure.
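As a rough illustration of what per-process memory monitoring can look like (this is a generic, Linux-only sketch, not Chargify's actual tooling, and the function names are our own), a process's resident memory can be sampled by parsing /proc/&lt;pid&gt;/status:

```python
# Linux-only sketch: sample a database process's resident memory (VmRSS)
# from procfs. Function names are illustrative, not Chargify's code.
import re

def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (resident set size, in KiB) from the
    contents of /proc/<pid>/status. Returns None if the field is absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def rss_of_pid(pid):
    """Read the current resident memory (KiB) of a running process."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kib(f.read())
```

A cron job or monitoring agent could call a function like rss_of_pid for each database process and ship the samples to a metrics system.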
The next morning, we saw a sharp increase in memory usage at the same time of day, but with the extra capacity we had installed it was not a problem. We were able to trace the cause to an hourly query that, for various reasons, would become a memory hog on its 09:15 UTC run. What's interesting is that the memory was not returned to the system afterward (which we suspect to be a vendor bug), so the effects of this problem were cumulative: it might take a few days, but it would come back!
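The cumulative pattern we saw, where resident memory climbs after each run and is never released, is the kind of thing a simple heuristic over periodic samples can flag. A minimal sketch (the threshold rule and monotonic-growth rule here are assumptions for illustration, not our production alerting):

```python
# Illustrative leak-detection heuristic over periodic RSS samples (bytes).
# Both rules below are assumptions for this sketch, not Chargify's
# actual alerting configuration.

def should_alert(rss_samples, threshold_bytes, min_samples=3):
    """Alert if the latest sample breaches the hard threshold, or if
    memory has grown across every consecutive pair of samples (a
    never-released, leak-like pattern)."""
    if not rss_samples:
        return False
    if rss_samples[-1] > threshold_bytes:
        return True
    growing = all(b > a for a, b in zip(rss_samples, rss_samples[1:]))
    return growing and len(rss_samples) >= min_samples
```

The monotonic-growth check is what catches a slow leak well before the hard threshold would fire.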
We fixed the offending query and have been monitoring ever since without seeing any problems.
Also Fixed: A Side Effect on Invoice Events
On Dec 10th it came to our attention that new events in our Invoice Events API had not been returned since around the time of the Dec 6 outage. We did not detect this immediately because the events were still being created; they simply were not surfaced as new events in the API, and instead appeared as old historical events.
We use an ID on these events to indicate ordering. For example, an event with ID 124 comes after an event with ID 123. At the time of the Dec 6th outage, the current ID on events was around 525,000. While we repaired the data, we bumped this ID to 1,000,000. Then a few minutes later, we began inserting the missing events with IDs in the range of 700,000 to 800,000.
So, if you were fetching your events using the since_id parameter during the period of Dec 6 - Dec 10, make sure that you set that parameter back to around 500,000 to ensure you don't miss any events. Unfortunately, there was a short period where the events in the 700,000 range were missing while the events above 1,000,000 were already available.
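To illustrate why resetting since_id matters, here is a toy model of cursor-based polling. The function and data are hypothetical, but the ID arithmetic mirrors the incident: once a client's cursor has advanced past 1,000,000, events backfilled in the 700,000 range are invisible to it.

```python
# Toy model of since_id polling. Illustrative only; this is not the
# real Chargify API, just the cursor arithmetic behind the advice above.

def fetch_since(event_ids, since_id):
    """Return event IDs strictly greater than since_id, ascending."""
    return sorted(i for i in event_ids if i > since_id)

events = [524_999, 525_000, 1_000_001]  # state just after the ID bump
cursor = max(events)                    # a client caught up to 1,000,001

events += [700_000, 700_001]            # backfilled events inserted later
missed = fetch_since(events, cursor)      # [] -- the backfill is skipped
recovered = fetch_since(events, 500_000)  # resetting the cursor finds them
```

This is exactly why polling by a monotonically increasing ID breaks when records are inserted behind the cursor: the fix is to move the cursor back before the gap.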
Also note: some of the events at low IDs (below 15,000), representing much older data, may have been corrupted during this incident. We are repairing those now.
We feel we’ve fixed the issues that caused this outage and have learned lessons that will help us deliver service more reliably in the future. Thanks for your support!