Incident 02 Jul. 2021: Payment Gateway increased errors

Availability: Sales down

% of stores affected: 100%

Duration of incident: 1 hour and 22 minutes

Symptom
From 17h25 to 18h47 UTC, some customers who tried to check out on VTEX
stores were unable to complete their purchases.

Summary
At 17h37 UTC, we identified an anomaly in our Payments Gateway's behavior, and
shortly after, we were notified of a significant drop in our platform’s sales rate. Our
engineering team determined that the problem was caused by an incorrect configuration applied
to our Payments Gateway and immediately started the process of reverting it.
At 18h47 UTC, our platform had fully recovered.

Timeline
[2021-07-02 17:25 UTC] Sales started gradually decreasing.

[2021-07-02 17:37 UTC] Our engineering team was notified that there was a problem with
our Payments Gateway, and an investigation was initiated.

[2021-07-02 17:43 UTC] The whole engineering team was notified that there was a problem
with our platform’s sales rate.

[2021-07-02 17:58 UTC] The problem was identified and the process to revert the incorrect
configuration on our Payments Gateway was started.

[2021-07-02 18:46 UTC] The process was completed by flushing the cache systems
associated with this configuration.

[2021-07-02 18:47 UTC] Normal operations were reestablished.
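
The intervals implied by this timeline can be checked with a little arithmetic. The sketch below uses only the timestamps listed above; the event labels and variable names are illustrative and do not come from the report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the timeline; the keys are illustrative labels.
timeline = {
    "sales_drop_started": datetime.strptime("2021-07-02 17:25", FMT),
    "gateway_alert": datetime.strptime("2021-07-02 17:37", FMT),
    "cause_identified": datetime.strptime("2021-07-02 17:58", FMT),
    "cache_flushed": datetime.strptime("2021-07-02 18:46", FMT),
    "recovered": datetime.strptime("2021-07-02 18:47", FMT),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two labelled timeline events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between("sales_drop_started", "gateway_alert")       # 12
time_to_identify = minutes_between("sales_drop_started", "cause_identified")  # 33
rollback_duration = minutes_between("cause_identified", "cache_flushed")      # 48
total_duration = minutes_between("sales_drop_started", "recovered")           # 82

print(f"Total: {total_duration // 60} h {total_duration % 60} min")
```

The total of 82 minutes matches the reported duration of 1 hour and 22 minutes, and the rollback itself (48 minutes) accounts for more than half of it.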

Mitigation Strategy
At 17h58 UTC, our team began reverting the incorrect configuration applied to
our Payments Gateway, and at 18h46 UTC, we completed the rollback by flushing the
cache systems associated with it. At 18h47 UTC, our services were operating normally again.
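
One plausible reading of why the rollback was only complete once the caches were flushed is that readers keep serving a cached copy of the old value until it is invalidated. The sketch below illustrates that general pattern only. `ConfigStore`, `CachedConfigReader`, and the `setting` key are hypothetical names, and nothing here is meant to describe the actual Payments Gateway architecture.

```python
from typing import Dict, List, Optional


class ConfigStore:
    """Hypothetical versioned store that keeps every configuration applied."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._versions: List[Dict[str, str]] = [dict(initial)]

    def apply(self, config: Dict[str, str]) -> None:
        self._versions.append(dict(config))

    def rollback(self) -> None:
        """Discard the latest version so the previous one becomes current."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()

    def current(self) -> Dict[str, str]:
        return dict(self._versions[-1])


class CachedConfigReader:
    """Hypothetical reader that caches the configuration until flushed."""

    def __init__(self, store: ConfigStore) -> None:
        self._store = store
        self._cached: Optional[Dict[str, str]] = None

    def get(self) -> Dict[str, str]:
        if self._cached is None:
            self._cached = self._store.current()
        return self._cached

    def flush(self) -> None:
        self._cached = None


store = ConfigStore({"setting": "known-good"})
reader = CachedConfigReader(store)

store.apply({"setting": "incorrect"})
served_during_incident = reader.get()["setting"]  # the bad value is now cached

store.rollback()
served_after_rollback = reader.get()["setting"]   # still the stale cached value

reader.flush()
served_after_flush = reader.get()["setting"]      # the reverted value is read again
```

In this toy model the revert alone changes nothing for readers; only the flush makes the restored configuration visible.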

Follow-up Actions: preventing future failures

As follow-up actions to this incident, we will improve the procedure for
updating our Payments Gateway configuration, with better validations before each rollout and
better monitoring of the results afterwards, so that we detect any degradation of our
service faster.
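
As a rough illustration of what "validate before rollout, monitor after rollout" could look like, here is a minimal sketch. Everything in it is an assumption made for the example: the required keys, the 5% error-rate threshold, the `apply` and `error_rate` callables, and the placeholder endpoint values are hypothetical and are not taken from the report.

```python
from typing import Callable, Dict, List

REQUIRED_KEYS = {"endpoint", "timeout_seconds"}  # hypothetical schema


def validate(config: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the config may roll out."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    timeout = config.get("timeout_seconds")
    if timeout is not None:
        is_number = isinstance(timeout, (int, float)) and not isinstance(timeout, bool)
        if not is_number or timeout <= 0:
            errors.append("timeout_seconds must be a positive number")
    return errors


def rollout(
    new_config: Dict[str, object],
    current_config: Dict[str, object],
    apply: Callable[[Dict[str, object]], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.05,  # hypothetical threshold
) -> str:
    """Validate first, apply, then revert if the observed error rate is too high."""
    errors = validate(new_config)
    if errors:
        return "rejected: " + "; ".join(errors)
    apply(new_config)
    if error_rate() > max_error_rate:
        apply(current_config)  # automatic revert to the last known-good config
        return "rolled back"
    return "rolled out"


# Usage with stand-ins for the real apply step and the real monitoring signal.
applied: List[Dict[str, object]] = []
good = {"endpoint": "https://example.invalid/pay", "timeout_seconds": 10}
candidate = {"endpoint": "https://example.invalid/pay-v2", "timeout_seconds": 5}

result_invalid = rollout({"endpoint": "x"}, good, applied.append, lambda: 0.0)
result_degraded = rollout(candidate, good, applied.append, lambda: 0.40)
result_healthy = rollout(candidate, good, applied.append, lambda: 0.01)
```

A real procedure would likely sample the error rate over a window and roll out to a fraction of traffic first; the single check here only keeps the sketch short.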
