What a Backend Architecture Review Actually Looks Like
A startup reached out last week. They wanted a backend architecture review — two hours, video call, someone to look at what they had built and tell them if it was going to fall over.
I do these occasionally. They are useful in a specific way: you get to see how a team thinks about their system, not just what the system does. The gap between those two things is usually where the problems live.
The data model first, always
Before I look at any code, I want to see the data model. Not the ORM layer, not the API — the actual tables or documents or whatever they are using to store state. The data model is the most honest thing in a system. It reflects every decision that was made, including the ones nobody remembers making.
In this case: a PostgreSQL schema that had grown organically over eighteen months. You could see the history in it. Early tables were clean and simple. Later tables had columns like is_new_flow and legacy_user_id — the archaeological record of features added without refactoring what came before. That is not a criticism. That is what systems look like when they are actually being used.
Where does the complexity live?
Every system has a centre of gravity — a place where most of the business logic accumulates. In well-designed systems this is intentional. In most systems it is wherever the first senior engineer felt comfortable writing code.
This one had its complexity in the API layer, which is common and not ideal. Business rules were scattered across endpoint handlers rather than in a domain layer. This means the same rule gets implemented in three slightly different ways across three endpoints, and over time the three implementations drift further apart. The demo always works. The divergence shows up six months later when someone adds a fourth endpoint and copies the wrong version. The fix is not complicated. It is just work: extract the logic, test it independently, call it from the handlers.
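To make "extract the logic, test it independently, call it from the handlers" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the discount rule, the Order type, and the handler names are invented for illustration and are not the team's actual code. The shape is the point: one function owns the rule, and each handler only parses input and formats output.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Order:
    """Hypothetical domain object; field names are assumptions."""
    total: Decimal
    is_first_order: bool


def discount_for(order: Order) -> Decimal:
    """Single home for the (invented) rule: 10% off first orders of 50 or more."""
    if order.is_first_order and order.total >= Decimal("50"):
        return (order.total * Decimal("0.10")).quantize(Decimal("0.01"))
    return Decimal("0.00")


# Handlers stay thin: parse the payload, call the domain function, shape the response.
def checkout_handler(payload: dict) -> dict:
    order = Order(Decimal(payload["total"]), payload["is_first_order"])
    return {"discount": str(discount_for(order))}


def cart_preview_handler(payload: dict) -> dict:
    order = Order(Decimal(payload["total"]), payload["is_first_order"])
    return {"estimated_discount": str(discount_for(order))}
```

Because discount_for takes a plain value and returns a plain value, it can be unit tested without a web framework or a database, and a fourth endpoint has only one version to copy.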
What are the actual failure modes?
I ask teams to walk me through what happens when things go wrong. Not the happy path — the failure path. What happens when the database is slow? When an external API times out? When a job queue backs up?
Most teams have thought about one of these. Few have thought about all three simultaneously. The interesting failures are usually combinations: the database gets slow because the job queue backed up because the external API timed out. Each individual failure is handled. The cascade is not.
This team had good database timeout handling. They had no circuit breaker on their payment provider integration. That is the one I would fix first.
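For readers who have not built one, a circuit breaker can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not what this team shipped: the class name, the threshold of three failures, and the thirty-second cool-down are all invented, and the sketch is not thread-safe. In production I would usually reach for an existing library rather than hand-rolling this.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Minimal single-threaded sketch; thresholds are illustrative assumptions."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast instead of waiting on a provider that is down.
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The payment call would then be wrapped as something like breaker.call(lambda: charge(order)), where charge is a stand-in for whatever function talks to the provider. When the provider is down, requests fail in microseconds instead of holding a worker and a database connection until the timeout expires, which is what stops the cascade described above.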
The honest part
At the end of these calls I try to be direct about what I found. Not brutal — direct. There is a difference. Brutal is listing everything wrong. Direct is saying: here are the two things that will actually cause you problems, here is why, here is roughly what fixing them involves.
In this case: the business logic scattered across endpoint handlers is a maintenance problem that will get worse slowly. The missing circuit breaker on the payment integration is a reliability problem that will cause an incident suddenly. Fix the second one this week. Plan the first one for next quarter.
They seemed relieved to have a priority order. Most teams already know something is wrong. They just want someone to tell them which thing to fix first. That is most of what architecture review is. Reading the system carefully, asking why decisions were made, and helping the team see what they are too close to see themselves.
Abstraction is a loan you pay back in debugging time. The interest rate on this one was manageable. I have seen worse.
