Are your tech shortcuts killing your startup?

June 29th, 2024 - Agile & Software Development

“Do things that don’t scale”. It’s the top-tier “how to do a startup” advice from the great Paul Graham himself. And, the thing is, I don’t disagree with it per se - it’s pretty decent try-and-build-a-product-company advice. But is it the right way to approach how you’re solving your startup’s technical dilemmas too?

(Spoiler: no, I don’t think it is on its own, but let me explain why)

But what if…?

There’s a common phrase that I’ve come across a bunch in my 13+ years of spewing software: “But, what if…”. It rears its head in a few different places, but I tend to encounter it in 2 obvious ones:

  • Planning sessions: when you’re scoping and figuring out what to build.
  • Kick-offs and code reviews: when you’re debating how it’s going to be built or how it has been built.

I think these “what if”s come from roughly the same place, just at different levels of granularity: a worry that we’re not dealing with all the potential eventualities of what might happen in the code/product. They emerge from our brains’ anxiety about all the ways something might go wrong, and in all honesty they’re incredibly useful - they’re essentially our core-gameplay-loop in the game that is software development.

OK, so what’s the problem then?

Well, in my experience the problem isn’t the “What if” itself, it’s the way we try to answer these questions. This is definitely an oversimplification - there’s totally some nuance here and exceptions to the rule - but I’ve spotted 2 common ways that this question gets answered, and both can lead to problematic outcomes:

  • “Let’s worry about that later, it’s not part of scope for now so we can come back to it”
  • “Ooooh, ok yeah we should totally cater for that case, let’s implement it like this…”

So let’s dig into the potential issues here:

Worry about it later:

Okaaaay, when is later? I’ve often heard “next sprint” as a potential later, but then we get to planning and oh, that super urgent work needs to get done… OK, let’s make it the sprint after. Soon enough, this “later” slips back towards “never”, and we now have a permanent hole in our system. Worse, at some point people might start thinking it was intentional!

But, let’s be more positive. Later could end up being “in 23 days”, in which case it’s not that bad cause you actually sorted it out, right? OK maybe, and maybe sometimes that’s a good idea. But often, it just means that we’re finally plugging the gap with a lot less context. We might’ve remembered everything that was the issue 23 days ago, or we might have captured everything perfectly in our acceptance criteria. But we might not have.

And, a bit more meta than that: it’s been “broken” all the while. So yes, obviously this might mean that things have gone wrong in the meantime, or gotten into an inconsistent state. That’s bad, but you can probably unpick it with a bunch of manual database tinkering 😬. Potentially worse, though, is that in the meantime all of the ongoing work has been built on top of something that’s “wrong” or “broken”. Does this mean that it is also broken? Will it now break when we roll our fix out? Or worse, will it keep working, but in some slightly disjointed and inconsistent manner?

Ugh, icky. All in all it’s a bit scary to me - much nicer to just solve it at the time.

Fix it now:

OK great, but the exact reason that the above solution (“let’s do it later”) comes up so often is not invalid: fixing it now is scope creep. Sometimes it’s unavoidable, but it’s always, to some extent, bad for the general working of the team and the delivery of the work you’re doing.

More than this though, you’re working out a solution off the cuff and implementing it straight away. This makes it super easy to skip all the good practices of refinement and planning that lead to a good implementation of the functionality you’re trying to achieve. That’s likely to produce a solution that doesn’t really match what’s needed: have we gotten all our stakeholder input? Have we given the devs some time to stew on the implementation approach, to realise most of the potential flaws? Or to come up with a better way?

And further, whilst the devs are stopping what they’re doing to scope and plan this, what impact is that having on the rest of the work that needs to get out by the end of this sprint?

The correct solution(s)

In all honesty, there isn’t one. This kind of scenario is something that teams often face, and it nearly always leads to some version of the above solution(s) that is rarely satisfactory for anyone involved. We can’t magic up a perfect solution here; instead we need to find a way to reach the right type of compromise for our situation. And, hopefully, learn enough from this event to reduce the chances of it happening with something else next sprint, or at least reduce the effect when it does. Here are some ideas that have worked for me:

Can I not do it, but have a manual fallback/gateway available until we implement it?

This is a nice way out if we can find it. We still need to explicitly write down the manual process now, but often this has little impact on the in-progress work (beyond making sure it’s explicitly erroring/starting a manual process), and much of the manual-process planning can be covered by the PM/business team. Even better, this manual process can become a set of acceptance criteria for how we might automate it later. And depending on frequency, we might choose not to bother implementing it at all.
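As a rough sketch of what that explicit error-and-escalate gateway could look like - all the names here (Order, notifySupportTeam, issueAutomatedRefund, the GBP-only rule) are invented for illustration, not a real API:

```typescript
// Sketch of a "fail loudly, hand over to manual" gateway.
// Everything here is a hypothetical stand-in.

interface Order {
  id: string;
  currency: string;
}

class UnsupportedCaseError extends Error {}

async function notifySupportTeam(details: Record<string, string>): Promise<void> {
  // In reality: a Slack webhook, a support ticket, an email, etc.
  console.warn("Manual process needed:", details);
}

async function issueAutomatedRefund(order: Order): Promise<void> {
  // The happy path we *have* built.
}

async function processRefund(order: Order): Promise<void> {
  if (order.currency !== "GBP") {
    // The "what if" we chose not to automate yet: surface it explicitly
    // and kick off the documented manual process instead of failing silently.
    await notifySupportTeam({ reason: "manual-refund-required", orderId: order.id });
    throw new UnsupportedCaseError(`Order ${order.id} needs a manual refund`);
  }
  await issueAutomatedRefund(order);
}
```

The key bit is that the gap is explicit: the unhandled case throws loudly and pings a human, rather than quietly doing the wrong thing for 23 days.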

Can I just not do it? Like, at all?

My favourite kind of solution to a technical problem is working out how to make the problem go away from a product perspective rather than a technical one. In this instance, it’s important for the team to take a step back and consider if our assumptions are correct: do we actually need to do it like this? Is there a smarter way that’s “just/almost as good” but is a bunch easier? It’s not always possible, but if we can work with our product-buddies to get to a solution that reduces (or ideally completely removes) the extra “fix it now” effort required, then we’ve managed to find a best of both worlds.

An added bonus of this approach: we’re not only reducing the total amount of work needed to achieve our goals, we’re also (probably) reducing the overall complexity of our codebase. This can be a real big win in the long-term.

Can I not release? Or not release for the edge case? Can we feature-flag instead until it’s all done?

This is the nicest way out, really. It’s 2024 - if you’ve not heard of feature flags yet, it’s probably time for a google. In this scenario, is there a way to move this specific piece of functionality behind a feature flag (new or old)? This way the team can crack on, merge and deploy without interrupting the sprint, but this part of the product functionality won’t be available until all of the needed features are there.

In this scenario we’re essentially re-drawing the lines of where the release is, rather than re-drawing the lines of what scope we need to get done this sprint.

(An example of where this fits well: if our edge case only applies to a certain “type” of customer - we could use a feature flag to still deliver to all our other segments, but not to the “edge case segment” until we’ve catered to the potential issues)
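To make that segment-based flagging concrete, here’s a minimal sketch - the flag name, segments, and the featureFlags object are all invented for the example; real providers (LaunchDarkly, Unleash, or a homegrown config table) expose something similar:

```typescript
// Minimal sketch: gate the new flow behind a flag, keyed on customer segment.

type Segment = "standard" | "enterprise";

interface Customer {
  id: string;
  segment: Segment;
}

const featureFlags = {
  // Hardcoded here; in practice this would come from your flag provider.
  isEnabled(flag: string, customer: Customer): boolean {
    if (flag === "new-billing-flow") {
      // Enterprise is our hypothetical "edge case segment": keep them on
      // the old flow until the edge cases are properly handled.
      return customer.segment !== "enterprise";
    }
    return false;
  },
};

function runNewBillingFlow(customer: Customer): void {
  // Merged and deployed, but dark for the edge case segment.
}

function runOldBillingFlow(customer: Customer): void {
  // Existing behaviour, untouched.
}

function startBilling(customer: Customer): void {
  if (featureFlags.isEnabled("new-billing-flow", customer)) {
    runNewBillingFlow(customer);
  } else {
    runOldBillingFlow(customer);
  }
}
```

The nice design property: the new code ships and lives on main from day one, and “done” becomes a config change rather than a merge.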

Important things to consider

What metrics do I have so I know if/when this happens, and I can prioritise accordingly?

If we are going to have customers hitting this “not done yet” flow, how will we know how many? How bad is the experience for them? A mild annoyance for lots of people might actually be more impactful to the business than a completely broken experience for only a couple.

The kinds of things these metrics can be:

  • in-production UI recordings of the broken experience
  • a count of endpoint failures
  • a manual tally of customer reach-outs by the support team
  • a daily/weekly SQL query someone runs to see how many “bad things” happened in the last few days

^ the point being that it isn’t hard to get some sort of metric in place, moving the “how important is this” decision from opinion to data-driven fact.
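As one example of just how cheap this can be, a bare-bones counter around the fallback path might be all you need - the metrics object and metric name below are placeholders for whatever you already run (StatsD, Prometheus, a DB table):

```typescript
// Bare-bones sketch: count every hit on the "not done yet" path so
// prioritisation is driven by real numbers.

const metrics = {
  counts: new Map<string, number>(),
  increment(name: string): void {
    this.counts.set(name, (this.counts.get(name) ?? 0) + 1);
  },
};

function handleRequest(isEdgeCase: boolean): void {
  if (isEdgeCase) {
    // One line of instrumentation turns "how often does this happen?"
    // from a guess into a queryable number.
    metrics.increment("billing.edge_case_hit");
    return; // fall back to the manual process for now
  }
  // ...normal flow...
}
```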

What this should be teaching you if it’s happening more than once in a blue moon

“Reactivity: Evolutionary architecture and product design”

If we’ve gotten into this position, it’s most likely a failure of planning, scoping, or both. Don’t get me wrong, it’s not always 100% possible to predict, but I am confident that it is possible to scope out, plan, and approach the work in a way that is always building on what came before. This means that when these kinds of issues do come up, their impact is way less significant. This probably has something to do with aiming for an evolutionary architecture.

More specifically, this means aiming to optimise your development process to be good (and quick) at changing software all of the time, rather than just optimising for its final form. The idea being that software will inevitably change, so trying to make it perfect for an unpredictable end goal is just silly. Instead, aim to optimise for making it better: acceleration vs speed. You’ll always win this way when the direction is (again, inevitably) changing.

Summary

At the end of the day though, this is just my experience. You might have solved this better or have a completely different take on it. If you do, feel free to let me know.