
Some customers experienced build failures
Started 30 Mar at 01:32pm PDT.
Created
Yesterday at about 21:09 UTC, one of our deployment dependencies silently failed as we rolled out a change to our Capture Cloud infrastructure. The change was deployed successfully, but was slightly misconfigured due to the silent failure.
About a half hour later, at 21:44 UTC, this began causing some builds to fail during the build verification process. Many accounts were unaffected by this, but accounts with large Storybooks saw more frequent build failures, scaling with the size of the Storybook, with the largest Storybooks resulting in complete unavailability.
The problem went unrecognized because a very small percentage of builds were failing, below the threshold that would trigger our alerting system, significantly delaying our response. Today at 14:15 UTC, responding to reports from our users, we discovered the problem and pushed out a fix.
We are now deciding upon changes to prevent this class of incident from recurring, in addition to improvements to our incident response practices to address the underlying causes of our long response time in this case.