Building Reliability: How to Create a Payment Gateway That Scales

Remember September 7–8, 2023?
Square (part of Block Inc.) went offline for almost 14 hours. A DNS misconfiguration, just a small mistake, was enough to bring down Cash App and Square POS around the world. No payments. No checkouts. Just silence at the counters and growing queues.

It wasn’t a server crash. It wasn’t an outage caused by traffic spikes. It was something no one expected, and no one caught in time.

If you’re Googling how to create a payment gateway right now, don’t start with the API docs. Start with this kind of failure, because the problem isn’t in the code. It’s in the architecture, in fault tolerance, and in how your system reacts when DNS fails, a Safari bug interrupts callbacks, or your fraud engine updates without a safety net.

Contents

1 Reliability Doesn’t Start With Code. It Starts With Crash Simulation
- 1.1 Here’s what the “before/after” looked like in a real case
2 One Failure And Everything Stops Working. Why Modular Architecture Is Needed
3 System Brains: Fallback and Behavior-Based Routes
4 The Conclusion

Reliability Doesn’t Start With Code. It Starts With Crash Simulation

If you’re a CTO, you’ve had a day like this. On Friday at 7:08 p.m., traffic peaked. Everything was going smoothly until auth froze. Four minutes later, 112 transactions had failed. Eleven minutes later, a barrage of “Where’s my money?” tickets began pouring in.

If the team had simulated 100k RPS with unstable routing across regions beforehand, the gateway would likely not have fallen. Now you have to restore loyalty, pay compensation, and gather the team for a retrospective. As Shopify engineers like to point out, reliability begins with timeout configuration and failure simulation, before anything goes live.
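To make "timeout configuration and failure simulation" concrete, here is a toy-scale sketch. It assumes a hypothetical upstream call, `flaky_psp_call`, that stalls at random, and it counts how many requests a given timeout would cut off. The function names, rates, and durations are illustrative only and are not taken from Shopify or Tranzzo.

```python
import asyncio
import random


async def flaky_psp_call(rng: random.Random, stall_rate: float, stall_s: float) -> str:
    """Hypothetical upstream call: stalls for `stall_s` seconds with probability `stall_rate`."""
    if rng.random() < stall_rate:
        await asyncio.sleep(stall_s)
    return "approved"


async def call_with_timeout(rng: random.Random, stall_rate: float,
                            stall_s: float, timeout_s: float) -> str:
    """Wrap the upstream call in a hard timeout instead of waiting indefinitely."""
    try:
        return await asyncio.wait_for(
            flaky_psp_call(rng, stall_rate, stall_s), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return "timeout"


async def simulate(n_requests: int, stall_rate: float, stall_s: float,
                   timeout_s: float, seed: int = 0) -> dict:
    """Fire `n_requests` concurrent calls and tally the outcomes."""
    rng = random.Random(seed)
    results = await asyncio.gather(
        *(call_with_timeout(rng, stall_rate, stall_s, timeout_s)
          for _ in range(n_requests))
    )
    return {"approved": results.count("approved"),
            "timeout": results.count("timeout")}


def run_simulation(n_requests: int, stall_rate: float, stall_s: float,
                   timeout_s: float, seed: int = 0) -> dict:
    """Synchronous entry point for scripts and tests."""
    return asyncio.run(simulate(n_requests, stall_rate, stall_s, timeout_s, seed))


if __name__ == "__main__":
    # Illustrative numbers: 10% of calls stall for 200 ms, timeout is 50 ms.
    print(run_simulation(n_requests=1000, stall_rate=0.1, stall_s=0.2, timeout_s=0.05))
```

A real simulation would run against staging infrastructure with production-like traffic shapes. The point of the sketch is only that the timeout value and the injected stall are explicit parameters you can vary before launch.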

Here’s what the “before/after” looked like in a real case

(based on actual rollout data from our team at Tranzzo, a payment infrastructure provider focused on high-risk markets, where the team helped decouple fraud logic and set up fallback routes across six PSPs):

Indicator	Before the transition	After implementing fault tolerance
Downtime in August	7 hours	<40 minutes (including scheduled work)
SLA to LATAM	96%	99.2% (including Friday peak hours)
Average route time	5.2 s	1.7 s (including fallback logic)
Fallback coverage	12%	83% (across 7 countries, 4 card types and 6 PSPs)
Auto-routing share	<20%	91% (without support)
Error rate (code 406)	3.4%	0.6% (due to Safari 13 tracking)

The critical point? It’s where you least expect it. For example, the problem was not Boleto itself but Safari 13, where session expiry breaks the callback (implementation-dependent, of course).

One Failure And Everything Stops Working. Why Modular Architecture Is Needed

Is the monolith dead? Not quite. But if fraud detection sits in the same deployable unit as auth and payouts, a bug in any one of them brings everything down.

This is what needs to be done:

Fraud detection: a separate microservice with its own SLA and versioning.
Routing: autonomous logic that doesn’t wait for the API to respond with “please.”
Auth gateway: isolated, with caches keyed on IP and fingerprint.

As long as everything is in one place, you are a hostage. You cannot update fraud without risk. You cannot restart payouts without touching auth.
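One common way to keep a failing fraud service from taking auth down with it is a circuit breaker around the fraud call. The sketch below is a minimal illustration, not Tranzzo's implementation. The `CircuitBreaker` class, the `authorize` function, the verdict strings, and the choice to fall back to a "review" verdict when fraud is unavailable are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency for `reset_after_s` seconds."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.reset_after_s:
            # Half-open: let one trial call through; a single failure re-opens.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
            return False
        return True

    def call(self, fn: Callable[[], str], fallback: str) -> str:
        if self.is_open():
            return fallback  # do not touch the failing dependency at all
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback
        self._failures = 0
        return result


def authorize(payment: dict, fraud_check: Callable[[dict], str],
              breaker: CircuitBreaker) -> str:
    """Auth keeps working when fraud is down (assumed policy: flag for review)."""
    verdict = breaker.call(lambda: fraud_check(payment), fallback="review")
    if verdict == "block":
        return "declined"
    if verdict == "review":
        return "authorized_pending_review"
    return "authorized"
```

Whether a payment should pass, queue for review, or be declined while fraud scoring is unavailable is a business decision. The sketch only shows that the decision becomes explicit once the two modules are separated.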

And here you start looking, almost unconsciously, for the answer to one question: how to create a payment gateway that does not turn into dominoes, where one fall does not pull everything down. The answer? Separate, remove, simulate.

Simulated the load? Great. But if you didn’t simulate the failure of one module in isolation, it’s not architecture, it’s a visualization of a dream.
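Simulating the failure of one module in isolation can be as simple as a fault-injection matrix: take each module down in turn and record which business flows still complete. The sketch below assumes three hypothetical modules (`auth`, `fraud`, `payout`) and two hypothetical flows. Treating fraud as a soft dependency of the payment flow is an assumption for illustration.

```python
from typing import Callable, Dict

Service = Callable[[], str]
Flow = Callable[[Dict[str, Service]], str]


def down() -> str:
    """Injected fault: stands in for a module that is unavailable."""
    raise ConnectionError("module unavailable")


def pay_flow(services: Dict[str, Service]) -> str:
    services["auth"]()  # hard dependency: failure propagates
    try:
        services["fraud"]()
    except ConnectionError:
        pass  # assumed soft dependency: degrade instead of failing
    return "paid"


def payout_flow(services: Dict[str, Service]) -> str:
    services["payout"]()  # hard dependency
    return "paid_out"


def failure_matrix(services: Dict[str, Service],
                   flows: Dict[str, Flow]) -> Dict[str, Dict[str, bool]]:
    """For each module taken down alone, report which flows still succeed."""
    matrix: Dict[str, Dict[str, bool]] = {}
    for victim in services:
        injected = {**services, victim: down}  # exactly one module is down
        matrix[victim] = {}
        for name, flow in flows.items():
            try:
                flow(injected)
                matrix[victim][name] = True
            except ConnectionError:
                matrix[victim][name] = False
    return matrix


if __name__ == "__main__":
    def healthy() -> str:
        return "ok"

    result = failure_matrix(
        {"auth": healthy, "fraud": healthy, "payout": healthy},
        {"pay": pay_flow, "payout": payout_flow},
    )
    for victim, outcome in result.items():
        print(victim, outcome)
```

Any `False` in a row where you expected `True` is a hidden coupling between modules.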

System Brains: Fallback and Behavior-Based Routes

Chile, Monday, 7:04 a.m. Decline. Another one a minute later. Then, a whole string of them. 37 declines in the morning. Operators are digging through the logs. The reason? The fallback went haywire after the fraud engine update.

It could have been different. On a decline, reroute to SPEI. If the IP belongs to an unstable subnet, don’t wait 5 seconds for a response; send a duplicate request.
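Sending a duplicate instead of waiting out the full timeout is often called a hedged request. The sketch below is one possible shape for it using `asyncio`, where `primary` and `backup` are hypothetical coroutine functions standing in for two routes. It assumes both routes honor the same idempotency key, so the duplicate cannot charge the customer twice; without that guarantee, hedging payment calls is unsafe.

```python
import asyncio
from typing import Awaitable, Callable


async def hedged_call(primary: Callable[[], Awaitable[str]],
                      backup: Callable[[], Awaitable[str]],
                      hedge_after_s: float) -> str:
    """Start `primary`; if it is still pending after `hedge_after_s`,
    start `backup` as well and return whichever finishes first."""
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # primary answered in time, no duplicate sent

    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower request is no longer needed
    # Note: if the first finisher raised, the exception propagates here.
    # A production version might wait for the other task instead.
    return done.pop().result()


if __name__ == "__main__":
    async def slow_route() -> str:
        await asyncio.sleep(1.0)  # stands in for an unstable subnet
        return "primary"

    async def fast_route() -> str:
        await asyncio.sleep(0.05)
        return "backup"

    print(asyncio.run(hedged_call(slow_route, fast_route, hedge_after_s=0.1)))
```

The hedge delay is the tunable part: set it too low and every request is duplicated, set it too high and you are back to waiting 5 seconds.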

Here is a list of patterns that should be in a live payment architecture:

fallback by geography + time of day (for example, Brazil after 6 p.m. — unstable routing at Banco do Brasil)
device recognition: Safari 12? We don’t send 3DS there.

A system that doesn’t change the route after 2 consecutive failures is not a system, but a blind button.
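The patterns above can be sketched as a small rule layer plus a router that abandons a route after two consecutive failures. The route names `psp_a` and `psp_b`, the `Router` class, and the helper functions are hypothetical. The Brazil and Safari 12 rules simply encode the examples from the list and would be configuration, not code, in a real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def route_order(country: str, local_hour: int) -> List[str]:
    """Geography + time-of-day rule (example: Brazil after 6 p.m.)."""
    if country == "BR" and local_hour >= 18:
        return ["psp_b", "psp_a"]  # avoid the route assumed unstable in the evening
    return ["psp_a", "psp_b"]


def skip_3ds(browser: str, major_version: int) -> bool:
    """Device rule (example: do not send 3DS to Safari 12)."""
    return browser == "Safari" and major_version == 12


@dataclass
class Router:
    routes: List[str]  # ordered by preference
    max_consecutive_failures: int = 2
    _failures: Dict[str, int] = field(default_factory=dict, init=False)

    def pick(self) -> str:
        """Return the first route that has not hit the failure limit."""
        for route in self.routes:
            if self._failures.get(route, 0) < self.max_consecutive_failures:
                return route
        # Every route is exhausted: reset the counters and start over.
        self._failures.clear()
        return self.routes[0]

    def report(self, route: str, ok: bool) -> None:
        """A success resets the counter; a failure increments it."""
        self._failures[route] = 0 if ok else self._failures.get(route, 0) + 1


if __name__ == "__main__":
    router = Router(route_order("BR", local_hour=19))
    for outcome in (False, False):
        route = router.pick()
        router.report(route, ok=outcome)
    print(router.pick())  # the router has moved off the twice-failed route
```

A production router would also need a way to probe a demoted route and bring it back, which this sketch leaves to the caller via `report(route, ok=True)`.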

Results of implementing behavioral fallback for one of our clients:

+7.4% conversion in the PER-COL region (compared YoY)
-58% support requests on the topic of “I can’t pay”
LTV of the VIP segment grew by 11% due to reduced frustration

The Conclusion

We are still asked: “Is it possible to avoid simulation and simply hope for the best?” It is possible. However, it is like constructing a tower without a foundation. As long as there is no wind, it will remain standing. Then Tuesday arrives.

You don’t have to build your payment architecture from scratch. But if you do decide to, don’t just Google “how to create a payment gateway.” Google “how not to ruin it at the first failure of the fraud system.”

And if your gateway doesn’t know that Safari 13 + an Ecobank card in Nairobi at night is a potential failure, then it’s not a gateway. It’s just a button click. And a button won’t save you when your business scales into three new regions.

Do you know how much one “small” loss on Friday evening costs? According to Tranzzo’s internal Q1 2024 report, the median revenue dip during unplanned payment interruptions in LATAM was 6.3% per hour.

Maybe it’s time to ask: what would it cost you — and would your team even notice it in time?

Open up the architecture. Not for yourself. For your team. Otherwise, it will open up on its own one day — in production.
