Main Application Down
Incident Report for Chargify

Summary

First and foremost: We are extremely sorry. This weekend we had two outages (of 25 minutes and 65 minutes) that represent the biggest hit to our uptime in the last four years. We know that Chargify is a critical system for many businesses, and one of our most important tenets is that we must be the most reliable and trustworthy service you use. We’re always improving, and we hope this letter gives you some insight into the steps we’re taking to make sure we’re the best.

The ultimate cause of each outage was different, but the result was the same: our primary database cluster went down, leaving our major services (API, Web App, & Public Pages) unavailable. During this time, any requests sent to us were met with error responses (or possibly a timeout). However, there was no loss of any stored or confirmed data, and once service was restored we caught up almost immediately on all pending renewals, emails, webhooks, and notifications. Continue reading at the bottom of this post for the technical details of the outages.

Two back-to-back outages are surely of special concern to some. But we are very confident that the causes were unrelated and are not indicative of a larger or systemic problem. Everything we know points to simple bad luck that they happened in quick succession.

We have analyzed and investigated the immediate and root causes of both outages and will be implementing several fixes focused on more monitoring, more automation, more protections, and more training.

I’d like to highlight some parts of this experience that I’m particularly proud of. Hopefully this demonstrates our absolute commitment to keeping Chargify up and secure:

  • Our incident response procedures, which we practice constantly, worked exactly as designed. Our alerts, pagers, and conference bridge all worked and we got all the critical parties involved immediately, even during the middle of the night (especially thanks to our 24/7 operations staff).
  • We communicated our response to merchants immediately, so that everyone knew we were aware of the problem and working to fix it fast.
  • All merchant data remained safe and secure during the entire process.

As a reminder, the best place to look when you have questions about our availability is at http://status.chargify.io where we immediately post updates to our response.

And as always, our support team is here to help and will be happy to answer any other questions you have. Just email support@chargify.com.

Thank you again for trusting us with your and your customers’ most sensitive financial data.

Technical details follow. You don’t have to keep reading, but if you’re curious:

Outage 1

15 Jan, 2017 12:22 UTC - 12:47 UTC (25 minutes)

We regularly run internal security scans of our systems to look for malicious activity, configurations that don’t comply with our policies, or anything else out of the ordinary. These scans are heavily automated, and the first step when launching a scan is to adjust our firewall rules (AWS Security Groups). We grant the scanner temporary extra access to various parts of the network (otherwise, the scanner would be completely blocked from communicating with other systems and wouldn’t do much good).

The script that performs the firewall adjustments is one we use regularly, but it has some interesting idiosyncrasies. Within the last year, AWS changed how its API returns responses that describe a Security Group. Previously, every Ingress Permission Rule was split so that it identified only a single grant per rule. Our script was originally written to add a new rule for the scanner, allow the scan to complete, then revoke that rule.

After the API change, however, each Rule can be returned with multiple grants for a single shared port. Here’s an example:

Rules Before:

    Groups: [peer_group] - Protocol: ICMP - Port: Any
    Groups: [scanner_group] - Protocol: ICMP - Port: Any

Rules After:

    Groups: [scanner_group, peer_group] - Protocol: ICMP - Port: Any

As you can see, after the change, revoking the rule would unintentionally remove ping access not just for the scanner group, but for the peer group as well.
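
To make this concrete, here is a minimal sketch of the safe approach: the revoke names only the scanner’s grant, and a follow-up check verifies that the peer grants survived. This is an illustration using boto3 with hypothetical group IDs, not our production script.

    import boto3

    ec2 = boto3.client("ec2")

    DB_GROUP_ID = "sg-database-cluster"   # hypothetical security group IDs
    SCANNER_GROUP_ID = "sg-scanner"

    def revoke_scanner_icmp(db_group_id, scanner_group_id):
        """Remove only the scanner's ICMP grant, leaving the other grants in the
        shared rule untouched."""
        ec2.revoke_security_group_ingress(
            GroupId=db_group_id,
            IpPermissions=[{
                "IpProtocol": "icmp",
                "FromPort": -1,
                "ToPort": -1,
                # Naming the specific grant is what keeps the peer group's access
                # from being revoked along with the scanner's.
                "UserIdGroupPairs": [{"GroupId": scanner_group_id}],
            }],
        )

    def assert_grants_still_present(db_group_id, expected_group_ids):
        """Post-change check: every grant we expected to keep must still exist."""
        group = ec2.describe_security_groups(GroupIds=[db_group_id])["SecurityGroups"][0]
        granted = {
            pair["GroupId"]
            for perm in group["IpPermissions"]
            for pair in perm.get("UserIdGroupPairs", [])
        }
        missing = set(expected_group_ids) - granted
        if missing:
            raise RuntimeError(f"Revoke removed unexpected grants: {missing}")

The post-change assertion is the same idea as the “before and after states match” check in the lessons below.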

We learned of this change some time ago, then tested and adjusted our script accordingly, and it has correctly handled these edge cases for many months. But the fix left a latent bug in the logic that issues the revoke command, one that could still allow it to remove more permissions than intended.

On Friday, Jan 13th we ran this script as part of some regular scans. Because of the previously unencountered bug, the rule permitting ICMP ping between the systems in our database cluster was unknowingly revoked. This had no immediate effect: many other cluster ports were still open, and the cluster and replication protocol continued to work normally. The engineer performing the scan closed out the task with no anomalies.

Two days later (Sunday, Jan 15th), an intermittent network interruption caused a small amount of traffic between the database cluster members to be lost. This is both normal and unremarkable. The cluster’s typical response is to re-evaluate its status by sending ICMP pings to all of its members as a first step in deciding whether failover action needs to be taken. This time, however, all three databases in the cluster found they could not ping each other.

Each database therefore saw itself as being in a minority of one out of three, so every database went into failsafe mode and refused to accept traffic or participate in the cluster (since that is the job of the “majority” members).
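
As a simplified illustration of that majority rule (not our cluster software’s actual algorithm, and using hypothetical hostnames): a member that cannot reach a majority of the cluster, counting itself, must assume it is in the minority and stop serving.

    import subprocess

    CLUSTER_HOSTS = ["db1.internal", "db2.internal", "db3.internal"]  # hypothetical

    def can_ping(host):
        """True if a single ICMP echo to the host succeeds within a second."""
        return subprocess.run(
            ["ping", "-c", "1", "-W", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0

    def in_majority(self_host, hosts=CLUSTER_HOSTS):
        """Count ourselves plus every peer we can ping; require a strict majority."""
        reachable = 1 + sum(can_ping(h) for h in hosts if h != self_host)
        return reachable > len(hosts) // 2

With the ICMP rule gone, each node could reach only itself (1 of 3), so all three dropped into failsafe mode at the same time.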

Our health checks immediately saw the databases become unavailable and began to raise alerts. Our incident response team was activated and we began to diagnose the cause and status of the database within 4 minutes of the first sign of trouble.

After performing a series of basic checks, we were able to quickly correlate the symptom of the systems being unable to ping each other (despite all systems being up and available) with the recent firewall change from Friday.

We reversed that change, restored ping connectivity, and were able to bring the cluster back online very easily after that.

Lessons learned and actions to be taken:

  • We’ll do significant additional testing on any script that makes firewall rule changes and expand the error checking to ensure that the before and after states match the expected results.
  • We have already added additional independent monitoring to ensure that communication among our systems is not being impacted (even something as simple as ping) and to report and alert on any anomalies so they’re found right away (a minimal sketch of such a check follows this list).
  • We will also add database-specific checks and procedures to ensure communication between the cluster members isn’t interrupted.
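
The independent connectivity monitoring mentioned above could be as simple as the following sketch, run periodically on each database host. The hostnames and alert endpoint are hypothetical; any existing alerting pipeline would work.

    import socket
    import subprocess
    import urllib.request

    PEERS = ["db1.internal", "db2.internal", "db3.internal"]   # hypothetical hosts
    ALERT_URL = "https://alerts.example.internal/notify"       # hypothetical endpoint

    def ping_ok(host):
        """True if the host answers ICMP echoes."""
        return subprocess.run(
            ["ping", "-c", "3", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0

    def main():
        me = socket.gethostname().split(".")[0]
        unreachable = [h for h in PEERS if h.split(".")[0] != me and not ping_ok(h)]
        if unreachable:
            report = f"{me} cannot ping: {', '.join(unreachable)}".encode()
            urllib.request.urlopen(urllib.request.Request(ALERT_URL, data=report))

    if __name__ == "__main__":
        main()

Had something like this been in place, the revoked ICMP rule would likely have been caught on Friday rather than surfacing during Sunday’s cluster re-evaluation.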

Outage 2

16 Jan, 2017 9:17 UTC - 10:22 UTC (1 hour 5 minutes)

As part of good operating procedure, we patch our systems every month. Doing so involves performing a “safe” switchover of the primary database in our cluster to one of the secondaries. This is a well-practiced procedure that combines manual steps with automated commands wherever possible. We perform it on a regular basis, and it has proven to be a very reliable approach that lets us do critical maintenance without scheduled downtime.

One of the commands in our checklist sequence was executed incorrectly, which caused the primary database to shut down. At this point the cluster began a failover to a secondary. However, the failover did not succeed. The cluster moved into a “shunned” mode that cuts off access until data integrity can be assured (the alternative would be to risk a split-brain scenario where two systems each think they are the primary, likely resulting in a loss of data).

Our incident response team was immediately mobilized. We proceeded to diagnose the cluster status and follow our restore playbooks. However, the error being thrown was one we had never seen before in five years of operating this cluster. We immediately paged our database support vendor and engaged their support team for further help.

With their assistance we found the cause of the error and were able to correct it, which quickly restored operation to the cluster and the entire Chargify application suite.

Ultimately, this outage was caused by a very small area of our operations where our automation isn’t as sophisticated as we’d like. Coordinating a database switchover is a very complex task, and incorrectly written automation can be even more unsafe than doing it manually. But we firmly believe the right solution is to add the sophistication necessary for our automation framework to perform every step of our database maintenance itself, with double and triple checks and operator oversight, but without the manual actions that can cause mistakes. This will take time to develop and test thoroughly until we’re certain it increases our reliability.
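
To illustrate the shape we have in mind (a sketch only, not our actual maintenance framework): every step in the checklist carries a precondition, the command itself, and a postcondition, and the runner refuses to continue unless the checks pass and an operator confirms.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Step:
        name: str
        precheck: Callable[[], bool]    # must pass before the command may run
        run: Callable[[], None]         # the actual command for this step
        postcheck: Callable[[], bool]   # must pass before moving to the next step

    def run_checklist(steps: List[Step], confirm=input):
        for step in steps:
            if not step.precheck():
                raise RuntimeError(f"Precondition failed before '{step.name}'; aborting.")
            # Operator oversight is reduced to a yes/no gate: a typo can abort
            # the run, but it can never substitute a different command.
            if confirm(f"Run step '{step.name}'? [y/N] ").strip().lower() != "y":
                raise RuntimeError(f"Operator declined '{step.name}'; aborting.")
            step.run()
            if not step.postcheck():
                raise RuntimeError(f"Postcondition failed after '{step.name}'; aborting.")

The important property is that a mistyped command can no longer reach the database at all; the worst a mistake at the keyboard can do is abort the run.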

Lessons learned and fixes to implement:

  • Additional automation will be added to our operating procedure so that it confirms each step completed correctly before moving to the next one.
  • We will add more restrictive checks to prevent commands from being run if they could cause problems with the cluster.
  • During maintenance we will make more aggressive use of our cluster’s “maintenance” mode which will prevent premature failovers in cases where it can do more harm than good.
  • The initial failover did not succeed because of a background script on the new primary: it kept a database lock held open that prevented the new primary from taking traffic. We will ensure that the lock is not held open, and that a system holding such a lock is never chosen as the failover primary (a sketch of this kind of check follows this list).
  • Our cluster restore playbooks now have additional guidance on how to recover from the particular errors seen during this outage so we can deal with them in the future.
  • We discovered some health check scripts that do not handle a database outage well: they lock up and start to “pile up”. Although this did not cause an issue this time, we will fix these scripts and test them better to prevent future problems.
  • Some people on our team did not have the access needed to contact our database support vendor directly. This did not cause a problem this time because enough people did. But we’ll do additional training to make sure everyone is up to speed on those procedures.
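
For the lock issue called out above, the promotion check could look something like this sketch. It assumes a MySQL-compatible database and a standard DB-API connection, which is an assumption for illustration rather than a description of our exact setup.

    LONG_TXN_SECONDS = 5   # hypothetical threshold for a transaction that is too old

    def safe_to_promote(conn):
        """Refuse to promote a secondary that is holding table locks or
        long-running transactions."""
        cur = conn.cursor()

        # Any table still marked in-use means a lock is being held open.
        cur.execute("SHOW OPEN TABLES WHERE In_use > 0")
        if cur.fetchall():
            return False

        # Any transaction older than the threshold could block the new primary.
        cur.execute(
            "SELECT COUNT(*) FROM information_schema.innodb_trx "
            "WHERE trx_started < NOW() - INTERVAL %s SECOND",
            (LONG_TXN_SECONDS,),
        )
        (long_txns,) = cur.fetchone()
        return long_txns == 0

A failover controller would call a check like this on each candidate secondary and skip any that report unsafe, rather than discovering the held lock only after promotion.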

Final note

Our standard ongoing maintenance was one of the proximate causes here, so I’d like to touch on that aspect. Ultimately, every change we make carries risk. Following the best practices for our development and operations teams, we aim to constantly make small, isolated, easy-to-understand changes. We automate them at every opportunity and do them continuously, because the best way to avoid major problems is to always be well-practiced in our procedures.

The alternative is to release less often, do maintenance less often, or make changes less often. But this is a bad tradeoff. In slowing down, your automation and practice suffer while the size of your changes increases. That greatly increases risk rather than reducing it.

We have very closely re-evaluated the risks of each maintenance procedure we perform. We feel confident that they continue to represent low risk when done as very small, automated, and well-tested changes.

-- Drew Blas, Director of Operations, Chargify.com

Posted Jan 17, 2017 - 15:02 CST

Resolved
The app is online and stable. We will follow up with a post-mortem within a few days.
Posted Jan 16, 2017 - 04:55 CST
Monitoring
The app is back online. We're beginning to audit the services to ensure everything is clear.
Posted Jan 16, 2017 - 04:24 CST
Identified
We're continuing to work to bring the application back online safely and without data loss. Please stand by.
Posted Jan 16, 2017 - 04:07 CST
Investigating
Chargify is currently unreachable - we're working to repair an error in one of our database clusters.
Posted Jan 16, 2017 - 03:31 CST