Trust is Not An Opinion

There’s a fundamental gap in engineering circles, something that’s affected the entire industry for years, but one that’s widening lately thanks to the acceleration provided by better tooling. It’s a gap in trust.

Trust is not based on simple observation. You don’t have trust because you saw a system work once; you have trust when you can establish conditions and repeatedly produce the same result. That’s what proof looks like in software: not a successful run, but a reproducible one.

In practice, teams trust what they produce based on momentum: it's trust-shaped, rather than actual knowledge.

The code runs; it hasn't fallen over. When people created new code, someone started the application and went clicky-clicky here, typey-typey there, and they looked for smoke and saw only mirrors: time to merge.

In teams, time builds a certain social momentum, a certain social capital. You know Christian and Sam and Deanna - Christian's code is elegant but his documentation is utter fiction if it exists at all, Sam's code is intimately familiar with scope creep and may be on the Wikipedia page for it, Deanna's code is fairly direct and tends to miss the edge cases. You know who to give what, and when, and what to look for, and so do they.

But then Sam blows it: he was certain, which meant you were certain, and suddenly you're facing the idea that everything he's done for the past six months might have had a problem that got past your defenses in a big way.

And that's with accountability. Sam's motivated to fix stuff, because he's on the team... and if he isn't motivated, well, that's ... information, too. Sam has trust - or did - and can rebuild on reciprocity and consequences.

Code has none of that. It has no reputation to uphold, no skin in the game, no shame, no memory. It can't be argued with or negotiated with; it can't explain assumptions. Trust in code cannot be the same as social trust; you can't say "It's been running for years, it's solid," because none of that matters.

You're trusting the people who wrote it, or the streak of days without incident. That's anthropomorphizing code, and the object of your trust does not and cannot care.

So what can a system do to deserve trust?

Trust in a system is deserved and earned when the system itself repeatedly and verifiably demonstrates its function.

A good system is one whose users and authors know what it's supposed to do, and can prove it. Maybe it processes 100,000 orders with 3,000 simultaneous users in less than an hour, using no more than 35% CPU and 400MB of RAM - who knows? The actual requirements depend on the nature of the system being implemented.

But those goals, those requirements, establish a baseline. They're verifiable: you can theoretically build a test harness to crank in 100,000 orders, with 3,000 simultaneous emulated users, and measure CPU and RAM usage along the way. You can say "this passed the requirements" or it didn't.
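
That harness doesn't have to be exotic. Here's a minimal sketch in Python, with the thresholds mirroring the hypothetical numbers above - the real values would come from the system's actual requirements - that boils the pass/fail decision down to a single function:

```python
# Hypothetical requirement thresholds, echoing the example above.
REQUIREMENTS = {
    "orders_processed": 100_000,  # minimum orders completed
    "concurrent_users": 3_000,    # minimum simultaneous emulated users
    "max_duration_s": 3_600,      # under an hour
    "max_cpu_pct": 35.0,          # CPU ceiling
    "max_ram_mb": 400.0,          # RAM ceiling
}

def meets_requirements(run: dict) -> bool:
    """Return True only if a measured run satisfies every requirement."""
    return (
        run["orders_processed"] >= REQUIREMENTS["orders_processed"]
        and run["concurrent_users"] >= REQUIREMENTS["concurrent_users"]
        and run["duration_s"] <= REQUIREMENTS["max_duration_s"]
        and run["peak_cpu_pct"] <= REQUIREMENTS["max_cpu_pct"]
        and run["peak_ram_mb"] <= REQUIREMENTS["max_ram_mb"]
    )

# A run either passed the requirements or it didn't - no opinion involved.
good_run = {"orders_processed": 100_000, "concurrent_users": 3_000,
            "duration_s": 2_950, "peak_cpu_pct": 31.2, "peak_ram_mb": 380.0}
slow_run = dict(good_run, duration_s=4_100)

print(meets_requirements(good_run))  # True
print(meets_requirements(slow_run))  # False
```

The measurement itself is the hard part, of course; the point is that once the numbers exist, the verdict is mechanical.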

A lot of engagements don't have those sorts of requirements: they have intentions and hopes and an informal "this is what we do, and it works, from what we can tell." That's a system whose users think it works. That's not a demonstration of working. It's a demonstration of getting by.

That's better than not getting by - but it's still not good.

When a system is "getting by," engineers do what they can to keep it getting by. They run the app and try to spot any flaws. They scan the code manually, as carefully as they can. They hunt for as much information at the code level as they can, because inference is risk; they argue for what they're used to doing; they calcify, just as the project calcifies.

None of this is irrational. It is exactly what you would expect.

Without a clear definition of correctness, this kind of visual check is all that's left: engineers end up arguing with themselves. They have to convince themselves, as they read the code, that it's right.

That works to a point, but it doesn’t scale or transfer. What seems obvious in one context becomes opaque in another. And most importantly, it ignores who - or what - the code is actually for. It's not written for people as its primary audience; it's written for a CPU.

And thus debates about syntax and grammar start to matter more than they should. Those aspects of the code become a proxy for confidence.

You want real confidence, not hope. You want to know, not presume.

Where Trust Actually Lives

Consider a system with very few unit tests, but a large, presumably comprehensive integration suite. That's what happens when you don't bake testability into the early design - maybe it's a project that uses waterfall as a methodology, maybe it's pressure to deliver early. It's hard to lay blame without knowing specific circumstances, and it's rare that competent programmers literally choose to ignore testability; sometimes it's just difficult and it's unpleasant to explain to management why you're spending most of your lines of code on nondeliverables.

That's how projects go sometimes. Someone ends up putting data in and checking the output, and if they're lucky they can put in bad data and know what to get out as a result, and that set of data - good and bad - turns into a giant set of integration tests that hopefully some deployment engineer verifies in the end. And that can go on for years, and it's "just what the project does" because fixing it means going back and baking in testability to a codebase that never anticipated it. There's always something more urgent than fixing the process.

And the delivery is "when the suite passes." That's the new metric of "good." That's "we made everything work." It's definitely one form of validation... but it comes with costs:

Velocity, flexibility, comprehension.

The integration tests become this black box and you're relying on it. Maybe it's fast, if you're lucky, and if you're not, you're looking at an hours-long validation cycle, and you can't test the change you made because it's buried in a giant validation mechanism. The scope of your change doesn't matter - if you're trying to validate that you are disallowing bad values in your validation cycle, you still have to run everything. You'd have to do that anyway, of course, but this is now baked into everything you do, every step of the way.

You end up building yourself into an untenable position, because every change has a massive cost, and every failure means you wasted the entire integration test cycle.

What happens then?

Humans are made for cost-benefit analysis. In this kind of situation - which is distressingly common - we fall back to what we can do to limit the damage: we code extremely carefully, we code very conservatively, we rely on whatever we can to limit exposure, we judge, and judge, and judge - ourselves and everyone else.

Our velocity slows down because nobody can trust themselves or others. Anyone who pushes the envelope presents risk.

And that becomes culture.

The Substitution Effect

Correctness is difficult to replace. Well, wait, that's not true: it's just difficult to replace well. But you can tell when it's starting to happen: you start focusing on readability over validation, you start focusing on style instead of safety, and consistency becomes king: it's the way we do it, so that's what we shall do, prevailing wisdom be damned. What we did worked for us in the past somehow, so it'll continue to work for us in the future.

These things aren't... bad, necessarily. It's good to be readable (Knuth agrees, surely!) - it's good to have a consistent style, it's good to keep the momentum a codebase has. But these are still adaptive behaviors, and they're attempts to replace correctness with consistency.

The consistency makes it easy for us humans to form opinions: how similar is the new to the old? Very? Then it must be okay!

But that's not evidence. That's not proof. That's just treading paths that have been well-trodden, without asking whether they should have been explored in the first place.

That's Kodak saying that people want cheaper ways to create physical photographs while the digital revolution passes them by, to the point where even saying "Kodak" feels quaint.

AI and the End of Implicit Trust

AI introduces a new kind of pressure on these adaptive systems: not because it produces worse code, but because it removes the signals humans use to apply those adaptations.

Humans build implicit trust from things like:

  • familiarity with patterns, or
  • knowledge of the author, or
  • consistency with past decisions based on past reasoning, or
  • intuition about how the code "feels."

AI provides none of that. It produces code that is ... well, it can be hard to describe. It's tempting to say "It's fluent, well-structured, well-informed!"[1] and because it's written in a form that repeats what humans use to describe code to other humans, it often looks quite plausible.

That plausibility is the problem.

You can read AI-generated code and feel confident - and be completely, amazingly, bafflingly wrong.

AI doesn't make code less trustworthy. It does mean that trust has to be earned - by a system that has no reason to desire or understand trust. AI can break everything in amazing ways, and that means you'd better understand trust, unless you're deliberately choosing to wear blinders[2].

Shifting the Source of Trust

Trust is an interesting thing: it's incredibly powerful, but it's brittle. You get to say "trust me" one time - as soon as you've done that, the strength is gone. That's code, too, except we'd do well to remember that code has no reason to say "trust me."

At least between humans, it's a request - it's an expenditure of social capital, capital that's not easy to retrieve, but at least we can recognize the interaction and the exploitation of a relationship. Sam can ask you to work within his limitations; Rust can't.

Thus, we should demand more of our code than we do. We shouldn't be hoping the code is right: we should have the expectation of proof, of evidence, as often as we can muster.

Tests as Evidence

Tests are not valuable because they exist. They are valuable if they can fail when the system is wrong.

A test suite that does not meaningfully detect failure is not evidence. It is ceremony. A smoke test is useful, sure, but only for about ten seconds - it proves the system can spin up, and not only is that a low bar, but it's also typically a useless bar once it passes.
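
One quick way to tell the two apart is to ask whether the test could ever go red. A sketch in Python - `ceremony_test` and `evidence_test` are hypothetical names, with a trivial tax function standing in for the system:

```python
def add_tax(amount: float, rate: float) -> float:
    """The unit under test: apply a tax rate to an amount."""
    return amount * (1 + rate)

def ceremony_test() -> bool:
    """Calls the code but checks nothing. It passes no matter what
    add_tax does - it cannot fail, so it proves nothing."""
    add_tax(100.0, 0.1)  # result ignored
    return True

def evidence_test() -> bool:
    """Pins the expected result. If add_tax goes wrong, this goes red."""
    return abs(add_tax(100.0, 0.1) - 110.0) < 1e-9

print(ceremony_test())  # True - but it would be True for broken code, too
print(evidence_test())  # True - and it would be False for broken code
```

Both print True today; only one of them is evidence.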

Personally, I'm maniacal nearly to the point of absurdity with tests; it gets some eyerolls, but here's the thing: they work. In one example, we had a known failure in production: customers reported bugs with filenames in certain character sets. Someone mentioned the issue to me offhand, and ...well, I had a test battery loaded, why not add that to my tests? And our test suite broke. Not my code - internationalization isn't magic in Java - but the CI/CD pipeline, which wasn't constructed well enough for our inputs.

That was 45 seconds of looking up some text in a foreign language, and slapping it into a few JSON files ("construct lorem ipsum with this foreign-language filename" and "construct lørém îpßuµ with this English-language filename" and, of course, a file with "foreign content" and a foreign name.) But the result was that not only did we know my code could handle the foreign content, but we found out none of the rest of our system could prove it.
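
The shape of those cases, sketched here in Python rather than the original Java and JSON - `process_upload` is a hypothetical stand-in for the real pipeline step, which should return the filename untouched:

```python
# Hypothetical stand-in for the pipeline step under test: whatever it
# does with the content, it must not mangle the filename.
def process_upload(filename: str, content: str) -> str:
    return filename  # identity here; the real system did real work

# Good data and bad, mixing character sets in both directions:
cases = [
    ("lorem-ipsum.txt", "lørém îpßuµ"),  # English name, foreign content
    ("lørém-îpßuµ.txt", "lorem ipsum"),  # foreign name, English content
    ("lørém-îpßuµ.txt", "lørém îpßuµ"),  # foreign name, foreign content
]

for filename, content in cases:
    result = process_upload(filename, content)
    assert result == filename, f"mangled: {filename!r} -> {result!r}"

print("all character-set cases survived the round trip")
```

The cases are cheap to write; what they bought us was expensive to learn any other way.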

That gave us an opportunity to fix it. There's more proving to go, of course, but at least we forged a path.

That’s the difference between presumed and demonstrated trust: if you’ve seen a problem and it isn’t a failing test, it’s not part of your definition of correctness.

Red/Green as Proof

In red/green testing, you first show a failure. You write the test with an expected outcome first: maybe before the actual code is written, or maybe to show what the expected output is supposed to be. The point of that initial test is to find the problem, to identify it, to apply the whole behavioral-design mindset going in.

You might identify the input (the filename), the process (the translation engine), and what you expect out of it (the same filename, in this case). If you run a test that constructs that scenario and you don't get an error, well, the issue's missing something: why was it reported as a bug, if the code doesn't show the error? That suggests you're looking in the wrong place for the error.

If that test does fail, well, there you are! You now have a concrete thing to fix, one you can observe and verify: when that test runs successfully ("goes green"), you can say, definitively, that you've not only found the bug (that's the red test) but fixed the bug (that's the green test) - and if anyone breaks that code later, you'll know, because the test has gone red again.
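
A compressed sketch of that cycle in Python - `translate_filename` is hypothetical, with the bug deliberately present so the red test has something to catch:

```python
import unicodedata

def translate_filename_buggy(name: str) -> str:
    # The hypothetical bug: an over-eager ASCII "normalization" step
    # that strips the very characters the customer reported.
    return unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()

def translate_filename_fixed(name: str) -> str:
    # The fix: leave the character set alone.
    return name

name = "lørém-îpßuµ.txt"

# Red: write the expectation first and watch it fail against the bug.
assert translate_filename_buggy(name) != name   # the test goes red

# Green: the same expectation passes once the bug is fixed - and stays
# behind as a tripwire for anyone who reintroduces it.
assert translate_filename_fixed(name) == name
print("red observed, green achieved")
```

In real life the red and green runs are the same test against two versions of the code, separated by the fix; they're collapsed into one script here only so the whole cycle is visible at once.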

Otherwise you're running on hope.

Here, we see the definition - doneness for that aspect of the issue is when we know the output filename has the right characters in it - and the fulfillment of the issue. The actual issue was much more complex than this, but the example holds: the definition of that aspect of the issue specified what doneness was, and we can use that to measure progress, absolutely and definitively.

And because we can measure the progress absolutely and definitively, we can actually trust the solution for as long as that test exists, and if we've built the test properly, it should exist for as long as the production code exists.

Reviews as Verification

This actually aids in review, too: the production code is not as important as it used to be. Without tests, you have to check changelogs and chase method calls to make sure everything is being called correctly; you're doing whitebox inspection.

With good tests, you're still able to do whitebox inspection, but you can literally look at what the code is supposed to do and measure it, as a precursor to how it's supposed to do it - if the issue is about language translation, and the tests validate the language translation, most of your work is done.

It doesn't matter as much if the code "looks right," because working code wins over pretty code. This is the suffering-oriented programming paradigm, also highly influential for me:

Make it work, then make it pretty, then make it fast.

An effective review, then, centers on different questions than code-first reviews.

Code-first reviews ask if the code is pretty, and hope the code is fast; it's the kind of review that suggests using a LinkedList for this structure because it's indexed and adding things to it is quick as long as you're adding to the beginning or the end of the list. It tends to regurgitate conventional wisdom, rather than apply wisdom.

An effective review with extensive tests looks at the tests. Are we covering the cases we need to cover? Do they pass? Does the code work? If not, well, there we are: we need to fix that. If it does work, then everything else about the code is gravy, because those tests will help us make sure the code still works if we "make it pretty" and will work even when we make it fast.

We can trust the code if we can validate the tests.

Code review without verification is just another form of visual trust. It's better than nothing, but it's expensive and easy to mess up.

If you have to read the code to know it works, you don't know it works. This isn't an accusation; it's observable throughout the industry: we find bugs even now that have been in deployed code for decades, despite thousands of interested eyes.

We’re still guessing. We're still hoping the code runs the way we think it does, and that the way we run the system is the way everyone else runs it with similar inputs.

We can work that way. Many systems do. Many systems have. It's a facet of our industry from the very beginning.

But it’s worth recognizing what it is: it's a consequence of where trust lives. We trust our code because we want to, not because we can... and that's a fixable problem.

Go fix it.


  1. Worth noting: an AI suggested that description of its own code to me. I added the qualification to the statement. The AI was rather confident.

  2. Don't wear blinders. They're a fashion abomination; trust me, they look really geeky, really awful. Even mine, with the peonies on them. My kids make fun of me endlessly when I wear them. I mean, uh, don't wear blinders.
