VaultPay is a wallet microservice I built on top of AuthShield. This is the first technical post in the series; if you want the origin story, that's in the previous post.
Previous parts:
Part 1 is here: I Built AuthShield and Immediately Knew It Wasn't Enough
How I learned that good error handling and actual data consistency are two completely different things
When I started designing the transfer flow for VaultPay, my first instinct was simple. Validate the request, check the balance, move the money, log it. Linear steps, each one after the other.
That instinct was wrong. Not in a subtle way. In a "this would silently destroy data in production" kind of way.
Here's the scenario that changed how I thought about it.
Imagine a transfer of ₹500. The sender's balance gets deducted. Then network hiccup, database timeout, process crash, anything - the credit to the receiver never happens. The money is gone from one wallet and hasn't appeared in the other. No error the user can act on. No audit trail that captures what actually happened. Just a wrong number sitting in a database, quietly.
Auth had nothing to say about this. AuthShield could verify the user perfectly - correct JWT, correct role, correct permissions - and you'd still end up with corrupted financial state. Identity is fully solved. Consistency under failure is a completely different problem.
That's what I wanted to understand. Not invent a solution. Just understand how real systems actually handle this.
The Full Send Money Flow
Before getting into the transfer engine itself, it helps to understand everything that runs before it.
The send money flow in VaultPay has two distinct phases. The first is a series of guards - each one checking a condition and failing fast before any money moves. The second is the atomic transfer itself.
The guards run in this order:
1. The client sends POST /transactions/send with a JWT, PIN, amount, receiver wallet ID, and an idempotency key.
2. VaultPay validates the JWT locally using the shared secret - fast path, no round-trip to AuthShield.
3. The idempotency key gets checked against Redis. If this exact request has already been processed, the cached result comes back immediately - no re-processing, no double transfer.
4. The sender's IP gets checked against the known IP trust cache.
5. The PIN gets verified with bcrypt. On failure, a Redis counter increments; after 5 consecutive failures the wallet locks for 24 hours.
6. Wallet status, KYC verification, daily limits, monthly limits, and per-transaction limits all get checked.
Only after every single guard passes does the system attempt to move money.
This ordering is intentional. Every check that can fail without touching balances runs first. You never want to deduct from a sender and then discover the receiver's wallet is frozen two operations later. By the time the transfer engine runs, every condition that can be validated without starting a database transaction has already been validated.
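The fail-fast ordering above can be sketched as a chain of checks where the first failure stops everything before any money moves. This is a minimal illustration, not VaultPay's actual guard API - the guard names, the request dict keys, and the return convention (None on success, an error string on failure) are all assumptions for the sake of the example:

```python
# Minimal sketch of fail-fast guard ordering. Each guard is a callable
# that returns None on success or an error string on failure; the chain
# stops at the first failure, before any balance is touched.
def run_guards(request, guards):
    for name, check in guards:
        error = check(request)
        if error is not None:
            return f"{name}: {error}"  # fail fast, balances untouched
    return None  # every guard passed; safe to enter the atomic transfer

# Ordering mirrors the post: cheap identity checks first, limit checks
# last. The lambdas are illustrative stand-ins for the real checks.
guards = [
    ("jwt", lambda r: None if r.get("jwt_valid") else "invalid token"),
    ("idempotency", lambda r: "duplicate request" if r.get("seen_before") else None),
    ("pin", lambda r: None if r.get("pin_ok") else "wrong PIN"),
    ("daily_limit", lambda r: "limit exceeded" if r["amount"] > r["daily_remaining"] else None),
]
```

The payoff of this shape is that every guard is cheap and side-effect-free, so a rejected request costs almost nothing and leaves no state to clean up.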
But guards aren't the hard part. The hard part is what happens next.
The 6-Step Atomic Core
Moving money between two wallets isn't one database operation. It's six, and they all have to succeed or fail together:
1. Lock the sender's wallet row.
2. Lock the receiver's wallet row.
3. Check the sender's balance one final time inside the transaction.
4. Deduct from the sender.
5. Credit the receiver.
6. Write both transaction records - a debit row for the sender's history and a credit row for the receiver's.
Six writes. If any one of them fails mid-way - or if the process dies between writes - you're left with partial state. Some writes happened. Some didn't. And now you have to manually figure out which.
The approach I landed on, which is how real financial systems handle it, is wrapping all six operations in a single atomic database transaction. Either all six complete and commit together, or none of them do.
```python
from sqlalchemy import select

async with db.begin():  # Atomic transaction - all or nothing
    # Lock both wallet rows before touching anything
    sender = (
        await db.execute(
            select(Wallet)
            .where(Wallet.id == sender_id)
            .with_for_update()  # Row-level lock
        )
    ).scalar_one()
    recipient = (
        await db.execute(
            select(Wallet)
            .where(Wallet.id == recipient_id)
            .with_for_update()  # Row-level lock
        )
    ).scalar_one()

    # Final balance check inside the lock - no race condition possible
    if sender.balance < amount:
        raise InsufficientBalanceError()

    # All four writes happen atomically
    sender.balance -= amount
    recipient.balance += amount
    db.add(Transaction(wallet_id=sender_id, type="debit", amount=amount))
    db.add(Transaction(wallet_id=recipient_id, type="credit", amount=amount))
# COMMIT - everything lands together, or PostgreSQL rolls it all back
```
If the process crashes after the deduct but before the credit, PostgreSQL rolls back the entire transaction automatically. The sender's balance is restored. Neither transaction record exists. The database is in exactly the state it was before the transfer started.
Clean failure. Every time. Not because of clever error handling - because the database guarantees it structurally.
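You can see the guarantee in miniature with nothing but the standard library. This toy uses sqlite3 instead of PostgreSQL and a two-row wallets table of my own invention, but the mechanism is the same: an exception anywhere inside the transaction rolls back every write, including the debit that already executed.

```python
# Tiny stdlib demonstration of all-or-nothing: a simulated crash between
# the debit and the credit rolls back the debit too.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wallets (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO wallets VALUES (?, ?)", [("a", 500), ("b", 0)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute("UPDATE wallets SET balance = balance - 500 WHERE id = 'a'")
        raise RuntimeError("simulated crash between debit and credit")
except RuntimeError:
    pass

# The debit was rolled back - the sender's balance is untouched
balance = conn.execute("SELECT balance FROM wallets WHERE id = 'a'").fetchone()[0]
```

After the "crash", the sender still has the full 500 and the receiver has nothing - exactly the pre-transfer state.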
The full sequence diagram for this flow - including every guard step and the post-transaction Redis operations — is in the VaultPay engineering repo if you want to trace the exact path end to end.
The Mid-Freeze Problem Nobody Mentions
There's an edge case that took me longer than I'd like to admit to think through.
The guard phase checks that both wallets are active before the atomic transaction starts. That check passes, so you enter the transaction. But what if an admin freezes the receiver's wallet in the milliseconds between the status check and the first write? The guard passed. The wallet is now frozen. The transfer shouldn't complete.
Without row locks, this is a real race condition. The status check and the wallet mutation are two separate operations, and something else can happen between them.
SELECT FOR UPDATE closes that gap. Before VaultPay touches any balance or writes any record, it acquires a row-level lock on both wallet rows. Any concurrent operation trying to modify those rows - including a freeze — has to wait until the current transaction either commits or rolls back.
Now the mid-freeze scenario has only two possible outcomes. Either the freeze completes first - the transfer sees a frozen wallet when it acquires its lock and fails cleanly. Or the transfer completes first - the freeze runs after commit and succeeds against the updated wallet. No overlap. No silent partial state. The lock forces operations to be sequential even when they arrive simultaneously.
Guard checks and transactional writes aren't automatically connected. The window between them is real, and in a concurrent system it gets exploited. Row-level locking is how you close it.
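The lock-then-recheck pattern can be modeled without a database at all. In this sketch an in-memory dict stands in for the wallets table and a threading.Lock per row plays the role of SELECT FOR UPDATE; the wallet names, field names, and return strings are all illustrative. The essential move is the status check sitting inside the locked section, not before it:

```python
import threading

# In-memory stand-in for the wallets table; one Lock per row plays the
# role of SELECT ... FOR UPDATE.
wallets = {
    "alice": {"status": "active", "balance": 1000, "lock": threading.Lock()},
    "bob": {"status": "active", "balance": 200, "lock": threading.Lock()},
}

def transfer(sender_id, recipient_id, amount):
    # Acquire both locks in a fixed (sorted) order to avoid deadlock
    first, second = sorted([sender_id, recipient_id])
    with wallets[first]["lock"], wallets[second]["lock"]:
        sender, recipient = wallets[sender_id], wallets[recipient_id]
        # Re-check status INSIDE the lock: a freeze that landed between
        # the guard phase and this point is visible here and fails cleanly
        if sender["status"] != "active" or recipient["status"] != "active":
            return "wallet_frozen"
        if sender["balance"] < amount:
            return "insufficient_balance"
        sender["balance"] -= amount
        recipient["balance"] += amount
        return "ok"
```

The sorted lock order is its own small lesson: two concurrent transfers in opposite directions would otherwise each grab one lock and wait forever on the other.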
When the Transaction Rolls Back, You Still Need a Record
A clean rollback is exactly what you want when something goes wrong. But it creates a secondary problem - the rollback takes everything with it, including any audit trail you tried to write inside the transaction.
For a financial system that's unacceptable. An operation that failed because of insufficient balance is useful information. An operation that failed because of a database timeout mid-transfer is even more useful - the kind of event you want to query, investigate, and trace.
VaultPay handles this by separating failure logging from the transfer transaction entirely.
```python
try:
    async with db.begin():
        # ... all 6 atomic steps ...
        pass  # COMMIT happens when this block exits cleanly

    # Post-commit: update Redis only after the DB is confirmed
    await redis.incrbyfloat(f"vp:txn:daily:{wallet_id}", float(amount))
    await redis.delete(f"vp:pin:attempts:{wallet_id}")
    await redis.set(f"vp:idempotency:{idempotency_key}", result, ex=86400)
except Exception as e:
    # This write happens OUTSIDE the rolled-back transaction
    await db.execute(
        insert(FailedTransaction).values(
            sender_id=sender_id,
            recipient_id=recipient_id,
            amount=amount,
            failure_reason=str(e),
            failed_at_step="atomic_transfer",
            requested_at=timestamp,
        )
    )
    await db.commit()  # the failure record needs its own commit
    raise
```
The failed transaction record is a completely separate write. It runs after the rollback, not inside the original transaction. It will always exist regardless of what happened to the transfer itself.
This didn't occur to me until I thought carefully about what a rollback actually means. When a transaction rolls back, everything inside it disappears - including INSERTs to audit tables you added trying to be thorough. The audit record has to live outside the thing it's auditing.
Redis Ordering: The Part That's Easy to Get Wrong
The post-transaction Redis writes - daily spend tracking, PIN attempt counter reset, idempotency key caching - all happen after the database commit. Not before.
This matters because Redis has no rollback.
If you update the daily spend counter before the database commits and then the commit fails, your Redis state now says the user spent ₹500 today when the database has no record of it. The next time they try to transfer, they might hit a daily limit that was never actually reached.
The rule I kept coming back to: Redis is truth for fast lookups. PostgreSQL is truth for what actually happened. Any time Redis gets updated, it should be reflecting something already committed in the database - not something that's about to commit.
The PIN counter reset follows the same logic. PIN verification happens in the guard phase, before the atomic transaction. On a successful transfer, VaultPay deletes vp:pin:attempts:{wallet_id} after the commit. If you delete it before commit and the commit fails, you've cleared failure history for an operation that never completed.
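Here's the counter logic in isolation. A plain dict stands in for Redis so the ordering rule is visible on its own; in VaultPay the increment would be a Redis INCR with an expiry and the reset a DEL, and the 5-attempt threshold comes straight from the guard description above. Function names and the dict-based interface are my own for this sketch:

```python
MAX_ATTEMPTS = 5  # from the guard phase: 5 consecutive failures lock the wallet

def record_pin_failure(attempts: dict, wallet_id: str) -> bool:
    """Increment the failure counter; return True once the wallet should lock.

    Stands in for a Redis INCR (plus a 24-hour EXPIRE) on the same key.
    """
    key = f"vp:pin:attempts:{wallet_id}"
    attempts[key] = attempts.get(key, 0) + 1
    return attempts[key] >= MAX_ATTEMPTS

def reset_pin_failures(attempts: dict, wallet_id: str) -> None:
    """Clear the counter - called only AFTER the database commit."""
    attempts.pop(f"vp:pin:attempts:{wallet_id}", None)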
Small ordering decisions. Each one encodes a specific guarantee about what the system's state means.
What the Notification Step Actually Is
After everything commits and Redis is updated, VaultPay creates notification records for both the sender and receiver. This is the final step in the full flow - outside the atomic core, but inside the overall request handler.
Notifications are informational. If a notification write fails, the transfer has already committed - the money moved, the records exist. VaultPay treats notification failures as non-critical and logs them separately rather than propagating the error back to the client.
This is another deliberate separation: operations that must succeed atomically versus operations that should succeed but don't affect financial state if they don't. Keeping those categories clearly separated prevents a notification failure from rolling back a completed transfer — which would be the wrong outcome in every scenario.
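That separation is easy to express in code: the notification loop swallows and logs every failure instead of re-raising. This is a sketch under my own naming - notify() is a hypothetical stand-in for the real record-creation call, and the logger name is invented:

```python
import logging

logger = logging.getLogger("vaultpay.notifications")

def send_notifications(notify, sender_id, recipient_id, amount):
    """Best-effort notification writes for both parties: log failures, never re-raise."""
    delivered = []
    for wallet_id, kind in [(sender_id, "debit"), (recipient_id, "credit")]:
        try:
            notify(wallet_id, kind, amount)
            delivered.append(wallet_id)
        except Exception:
            # Non-critical: the transfer already committed, so a failed
            # notification must never propagate back to the client
            logger.exception("notification failed for wallet %s", wallet_id)
    return delivered
```

Note the asymmetry with the transfer core: there, any exception aborts everything; here, exceptions are deliberately contained because the financial state is already settled.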
What I Actually Learned
I went into this thinking consistency mostly meant writing good error handling. Catch exceptions cleanly, return useful error messages, add some retry logic.
That's not what consistency is.
Consistency is designing the system so partial states are structurally impossible — not handling them after the fact, but making them impossible to produce in the first place. The atomic transaction doesn't prevent failures. It guarantees that every failure leaves the database in a state that's either fully committed or fully unchanged.
The ordering of everything else - guards before writes, locks before mutations, Redis writes after commit, failure logs outside transactions — isn't convention or style. Each decision encodes a specific guarantee about what can and can't happen when multiple operations run concurrently or when something fails in the middle.
Financial systems made this obvious because money is the thing you can get wrong most silently with the highest consequences. But the same thinking applies to any system where partial state is worse than a clean failure.
Next up: how VaultPay handles a completely different kind of problem: a request arriving from an IP address the system has never seen before, and why simply blocking it creates as many problems as it solves.
Engineering docs + code samples: Vaultpay-Engineering