First and foremost: We are extremely sorry. This weekend we had two outages (for 25 minutes and 65 minutes) that represent the biggest hit to our uptime in the last four years. We know that Chargify is a critical system for many businesses and one of our most important tenets is that we must be the most reliable and trustworthy service you use. We’re always improving and we hope this letter will give you some insight into the steps we’re taking to make sure we’re the best.
The ultimate cause of each outage was different, but the result was the same: our primary database cluster went down, which made our major services (API, Web App, & Public Pages) unavailable. During this time any requests sent to us were met with error responses (or possibly a timeout). However, there was no loss of any stored or confirmed data and once service was restored we caught up almost immediately on all pending renewals, emails, webhooks, and notifications. Continue reading at the bottom of this post to learn about the technical details of the outage.
Two outages back-to-back is surely of special concern for some. But we are very confident that the causes were unrelated and are not indicative of a larger or systemic problem. All our information simply points to bad luck that they happened in quick succession.
We have analyzed and investigated the immediate and root causes of both outages and will be implementing several fixes focused around: more monitoring, more automation, more protections, and more training.
I’d like to highlight some parts of this experience that I’m particularly proud of. Hopefully this demonstrates our absolute commitment to keeping Chargify up and secure:
As a reminder, the best place to look when you have questions about our availability is at http://status.chargify.io where we immediately post updates to our response.
And as always, our support team is here to help and will be happy to answer any other questions you have. Just email email@example.com.
Thank you again for trusting us with your most sensitive financial data and that of your customers.
Technical details follow. You don’t have to keep reading, but if you’re curious:
We regularly do internal security scans of our systems to look for malicious activity, configurations that don’t comply with our policies, or just anything out of the ordinary. These scans are heavily automated and the first step when launching a scan is to adjust our firewall rules (AWS Security Groups). We grant the scanner extra temporary access to various parts of the network (otherwise, the scanner would be completely blocked from communicating with other systems and wouldn’t do much good).
The script that performs the firewall adjustments is one we use regularly. But it has some interesting idiosyncrasies. Within the last year, AWS changed how its API returns responses that describe a Security Group. Previously, every Ingress Permission Rule was split so that it identified only a single grant in each rule. Our script was initially written to add a new rule for the scanner, allow the scan to complete, then revoke this rule.
After the API change, however, each Rule can be returned with multiple grants for a single shared port. Here’s an example:
Before the API change (one grant per rule):
Groups: [peer_group] - Protocol: ICMP - Port: Any
Groups: [scanner_group] - Protocol: ICMP - Port: Any
After the API change (grants combined into one rule):
Groups: [scanner_group, peer_group] - Protocol: ICMP - Port: Any
As you can see, after the change, revoking the rule would (unintentionally!) remove ping access not just for the scanner group, but for the peer group as well.
We learned of this change some time ago, then tested and adjusted our script accordingly. It has correctly handled these edge cases for many months. But the fix left a latent bug in the logic that issues the revoke command, one that could still allow it to remove more permissions than intended.
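To make the hazard concrete, here is a minimal Python sketch of the difference between revoking a whole combined rule and revoking only the scanner's grant from it. This is not our actual script and it makes no AWS calls: the `Rule` class, the function names, and the data layout are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for one ingress rule (not the AWS response format)."""
    protocol: str
    port: str
    groups: frozenset


def naive_revoke(rules, group):
    """Unsafe: drops every rule that mentions `group`, including shared ones."""
    return [rule for rule in rules if group not in rule.groups]


def revoke_group(rules, group):
    """Safer: removes only `group`'s grant and keeps the other grants in place."""
    remaining = []
    for rule in rules:
        kept = rule.groups - {group}
        if kept:
            # Other groups share this rule, so keep it minus the revoked group.
            remaining.append(Rule(rule.protocol, rule.port, kept))
        # If nothing is left, the rule belonged only to `group` and is dropped.
    return remaining


# The combined rule from the example above.
combined = [Rule("ICMP", "Any", frozenset({"scanner_group", "peer_group"}))]

unsafe = naive_revoke(combined, "scanner_group")  # peer_group loses ping too
safe = revoke_group(combined, "scanner_group")    # peer_group keeps ping
```

With the combined rule as input, the naive version leaves no ICMP rule at all, while the per-grant version leaves a rule that still covers `peer_group`.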
On Friday, Jan 13th we ran this script as part of some regular scans. Unknowingly (because of the previously unencountered bug) the rule permitting ICMP ping between the systems in our database cluster was revoked. This had no immediate effect. Many other cluster ports were still open and the cluster and replication protocol continued to work normally. The engineer performing the scan closed out the task with no anomalies.
Two days later (Sunday, Jan 15th), a transient network interruption caused a small amount of traffic between the members of the database cluster to be lost. This is both normal and unremarkable. The typical response of the cluster is to re-evaluate its own status by sending ICMP pings to all its members as a first step in deciding if failover action needs to be taken. However, all three databases in the cluster found they could not ping each other.
Each database therefore saw itself as “in the minority” (1 of 3). As a result, every database went into failsafe mode and refused to accept traffic or participate in the cluster (since that’s the job of the “majority” members).
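The majority rule at work here can be sketched in a few lines of Python. This is a generic quorum check written for illustration, not the logic of our database software; the function name and parameters are hypothetical.

```python
def has_majority(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus the peers it can reach form a strict majority."""
    visible = reachable_peers + 1  # count this node itself
    return visible * 2 > cluster_size


# Normal three-node cluster: each node can ping both peers.
healthy = has_majority(reachable_peers=2, cluster_size=3)

# With ping blocked, each node can see only itself (1 of 3).
isolated = has_majority(reachable_peers=0, cluster_size=3)
```

In the isolated case every node fails the check at once, so no member is left to act as the majority, which matches what we saw on Sunday.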
Our health checks immediately saw the databases become unavailable and began to raise alerts. Our incident response team was activated and we began to diagnose the cause and status of the database within 4 minutes of the first sign of trouble.
After performing a series of basic checks, we were able to quickly correlate the symptom of the systems being unable to ping each other (despite all systems being up and available) with the recent firewall change from Friday.
We reversed that change, restored ping connectivity, and were able to bring the cluster back online very easily after that.
As part of good operating procedure, we patch our systems every month. Doing so involves performing a “safe” switchover of the primary database in our cluster to one of the secondaries. This is a well-practiced procedure that combines manual steps with automated commands where possible. We perform it on a regular basis and it has proven to be a very reliable approach that allows us to perform critical maintenance without scheduled downtime.
One of the commands in our checklist sequence was executed incorrectly, which caused the primary database to shut down. The cluster at this point began a failover to a secondary. However, the failover did not succeed. The cluster moved into a ‘shunned’ mode that cuts off access until data integrity can be assured (the alternative would be to potentially have a split-brain scenario where two systems think they are both the primary, likely resulting in a loss of data).
Our incident response team was immediately mobilized. We proceeded to diagnose the cluster status and follow our restore playbooks. However, the error being thrown was one we have never seen before in five years of operating this cluster. We immediately paged our database support vendor and engaged with their support team for further help.
With their assistance we found the cause of the error and were able to correct it, which quickly restored operation to the cluster and the entire Chargify application suite.
Ultimately, this outage was caused by a very small area of our operations where our automation isn’t as sophisticated as we’d like. Coordinating a database switchover is a very complex task and incorrectly written automation can be even more unsafe than doing it manually. But we firmly believe the ultimate solution here is to programmatically add the sophistication necessary to allow our automation framework to perform every step of our database maintenance through the use of double and triple checks and operator oversight, but without the need for manual action that might cause mistakes. This will take time for us to develop and test thoroughly until we’re certain that it increases our reliability.
Our standard ongoing maintenance was a proximate cause in both outages, so I’d like to touch on that aspect. Ultimately, every change we make has risk. In following the best practices for our development and operations teams, we aim to constantly make small, isolated, easy-to-understand changes. We automate them at every opportunity and do them continuously because the best way to avoid major problems is to always be well-practiced in our procedures.
The alternative is to release less often, do maintenance less often, or make changes less often. But this is a bad tradeoff. In slowing down, your automation and practice suffer while the size of your changes increases. This has the side effect of greatly increasing risk, not reducing it.
We have very closely re-evaluated the risks of each maintenance procedure we perform. We feel confident that these continue to represent a “low” risk when done in very small, automated, and well-tested changes.
-- Drew Blas, Director of Operations, Chargify.com