Subscription renewal processing currently delayed
Incident Report for Chargify
Postmortem

The first of the month is always a busy day for subscription processing at Chargify. We process about 100 times more subscriptions on the first of the month than on any other day of the month!

We strive to scale and tune things so that we process subscriptions as faithfully on the 1st as we do on the 21st. However, on December 1st, 2018, we failed to live up to our expectations for renewing subscriptions in a timely manner.

First of all, we’d like to apologize for letting you down and disrupting your business. Read on if you’d like to learn what went wrong and what we plan to do about it so that we can all ring in New Year 2019 with uninterrupted cash flow!

There were two main contributors to our delayed processing incident: 1) more subscriptions than ever and 2) a new unanticipated bottleneck.

More subscriptions

We processed about 68% more subscriptions at the start of December 2018 than at the start of November 2018. We knew the increase was coming, and we prepared in advance by doubling our subscription-processing worker capacity.

A bottleneck

However, there was a new bottleneck that we didn’t anticipate. Most of the new subscriptions in December are on our new Relationship Invoicing architecture. Every invoice gets its own unique-per-site invoice number, and the generator that assigns these numbers failed to keep up under intense load. On top of that, because of the nature of this bottleneck, the new capacity we put in place actually made things worse instead of better. We saw our average time to process a single subscription jump from about 1 second to 30 seconds. This quickly filled up all of our available workers and pushed subscription renewals into a growing queue. At the peak, this queue was nearly 12 hours long.
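As a rough back-of-the-envelope illustration of why the queue grew so quickly, the sketch below compares idealized throughput at 1 second and at 30 seconds per subscription. The 1 second and 30 second figures and the doubling of workers come from the account above; the baseline worker count is a made-up number, and the model assumes every worker is always busy, which is a simplification.

```python
def throughput_per_second(workers: int, seconds_per_subscription: float) -> float:
    """Idealized throughput: each worker finishes one subscription
    every `seconds_per_subscription` seconds."""
    return workers / seconds_per_subscription


# Hypothetical baseline; the real worker count is not stated in this report.
BASELINE_WORKERS = 100
DOUBLED_WORKERS = 2 * BASELINE_WORKERS

# What the doubled capacity would deliver at the usual ~1 second per subscription.
planned = throughput_per_second(DOUBLED_WORKERS, 1.0)

# What it delivers once each subscription takes ~30 seconds.
actual = throughput_per_second(DOUBLED_WORKERS, 30.0)

# How many times slower the system drains its queue than planned.
slowdown = planned / actual

print(f"planned: {planned:.1f}/s, actual: {actual:.1f}/s, slowdown: {slowdown:.0f}x")
```

Under these assumptions the slowdown equals the ratio of the two processing times, regardless of the worker count chosen, so even doubled capacity falls far short of an arrival rate that was itself about 68% higher than the month before.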

What we did

The first thing we did was limit the number of subscriptions each site could process concurrently. This was important because the new-architecture subscriptions that needed the number generator were taking up all of the resources, leaving no capacity for subscriptions that didn’t need it. Once we made this change, the backlog finally started to clear. If you had few subscriptions to process you would have seen results very quickly. If you had many subscriptions to process, the progress would have been slow (because we limited your concurrency) but consistent (because resources were actually available for you).
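The effect of a per-site concurrency limit can be sketched as follows. This is a minimal illustration and not Chargify's actual scheduler: the function name, the queue layout, the site names, and the limit values are all assumptions made for the example.

```python
from collections import Counter, deque


def pick_next_batch(queue, free_workers, per_site_limit, running=None):
    """Pick up to `free_workers` jobs from the front of `queue`, never letting
    one site hold more than `per_site_limit` jobs at once.

    `queue` is a deque of (site, subscription_id) tuples (hypothetical layout).
    Jobs skipped because their site is at its limit go back to the front of
    the queue in their original order.
    """
    running = Counter(running or {})
    picked, skipped = [], deque()
    while queue and len(picked) < free_workers:
        site, sub_id = queue.popleft()
        if running[site] < per_site_limit:
            running[site] += 1
            picked.append((site, sub_id))
        else:
            skipped.append((site, sub_id))
    # extendleft reverses its input, so reverse first to preserve order.
    queue.extendleft(reversed(skipped))
    return picked


def make_queue():
    # Six renewals for one large site queued ahead of two for a small site.
    jobs = [("big-site", i) for i in range(1, 7)]
    jobs += [("small-site", i) for i in range(1, 3)]
    return deque(jobs)


# With a limit equal to the worker count, the large site takes every worker.
unlimited = pick_next_batch(make_queue(), free_workers=4, per_site_limit=4)

# With a limit of 2, the small site gets a turn in the same batch.
limited_queue = make_queue()
limited = pick_next_batch(limited_queue, free_workers=4, per_site_limit=2)

print("without limit:", unlimited)
print("with limit:   ", limited)
```

In this toy run the large site's remaining renewals stay queued and are picked up as its running jobs finish, which matches the "slow but consistent" behavior described above.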

What we’re going to do

We’re pretty motivated to have a relaxing New Year’s Day, so here’s what we’ve got in store:

1. Replacing the number generator: we already have a pattern for a much more scalable number generator in use throughout our app, so now the new invoices will be updated to use that instead of the problematic one.

2. Tuning capacity: we’re looking at increasing capacity again and tuning the per-site concurrency threshold to find that sweet spot where sites with a lot of subscriptions process as quickly as possible but don’t hog all of the resources.

3. Load testing: we’ll be doing a pretty intense load test once all of our changes are in so we can make sure we’re ready.

4. Better monitoring and alerting: we’ll be installing some newly tuned alerts so we can respond more proactively next time.
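On item 1, this report does not describe how the more scalable number generator works. One common way to reduce contention on a shared per-site counter is block allocation, where each worker reserves a range of numbers in a single contended step and then hands them out locally. The sketch below illustrates that general idea only; it is an assumption, not Chargify's implementation, and every class and method name in it is hypothetical.

```python
import threading
from collections import defaultdict


class CentralCounter:
    """Stand-in for the shared per-site counter (for example, a database row).
    Hypothetical; it only exists to count how often it is contended."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = defaultdict(lambda: 1)  # next unreserved number per site
        self.reservations = 0

    def reserve(self, site, count):
        """Reserve `count` consecutive numbers for `site`; return the first."""
        with self._lock:
            start = self._next[site]
            self._next[site] = start + count
            self.reservations += 1
            return start


class BlockAllocator:
    """Per-worker allocator that touches the central counter once per block.
    Intended to be owned by a single worker, so it is not itself thread-safe."""

    def __init__(self, central, block_size):
        self._central = central
        self._block_size = block_size
        self._blocks = {}  # site -> (next number to hand out, end of block)

    def next_number(self, site):
        nxt, end = self._blocks.get(site, (0, 0))
        if nxt >= end:  # no block yet, or the current block is used up
            nxt = self._central.reserve(site, self._block_size)
            end = nxt + self._block_size
        self._blocks[site] = (nxt + 1, end)
        return nxt


central = CentralCounter()
worker_a = BlockAllocator(central, block_size=100)
worker_b = BlockAllocator(central, block_size=100)

numbers_a = [worker_a.next_number("site-1") for _ in range(150)]
numbers_b = [worker_b.next_number("site-1") for _ in range(150)]

print(len(set(numbers_a + numbers_b)), "unique numbers,",
      central.reservations, "trips to the shared counter")
```

The trade-off is that numbers are unique per site but not strictly sequential across workers, and unused parts of a block leave gaps. Whether that is acceptable for invoice numbering depends on requirements this report does not state.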

Onward!

We’re sorry for letting you down this past weekend. We know it is important to renew your subscriptions in a timely manner, no matter what day it is. We appreciate your patience as we make these changes to serve you better!

Posted Dec 05, 2018 - 07:54 CST

Resolved
This incident has been resolved. We will follow up with a postmortem.
Posted Dec 02, 2018 - 08:18 CST
Update
Subscription processing is still running up to several hours behind, but it's recovering quickly.
Posted Dec 01, 2018 - 20:11 CST
Monitoring
A fix has been implemented and we're monitoring renewal processing.
Posted Dec 01, 2018 - 18:18 CST
Identified
The issue has been identified and a fix is being implemented.
Posted Dec 01, 2018 - 18:06 CST
Investigating
We are currently investigating this issue.
Posted Dec 01, 2018 - 15:37 CST
This incident affected: Scheduled Payment Processing.