Make Invalid States Unrepresentable

I just read a fantastic article about the "Make Invalid States Unrepresentable" principle. I had never heard of this principle by name, but I have seen code several times that violates it, and that code was considerably harder to read and understand as a result.

To quickly summarise the article: the core principle states that data in a program should be structured in such a way that it is impossible for an invalid state to exist in memory. As an example, suppose a series of time periods must be contiguous, with no gaps and no overlaps. One might be tempted to use a representation of the form List[(start: date, end: date)]. However, nothing prevents the end of one period from failing to match the start of the next, so gaps or overlaps can occur. If the data are instead represented as Set[(start: date)], with the implication that each period ends at the next start date in order, then a state with non-contiguous periods simply cannot be represented.
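To make the two shapes concrete, here is a minimal sketch in Scala. The article gives no code; the type names and the use of java.time.LocalDate are my own assumptions.

```scala
import java.time.LocalDate
import scala.collection.immutable.SortedSet

object Periods {
  // Invalid states are representable: nothing stops a gap or an overlap
  // between one pair's end date and the next pair's start date.
  type AsList = List[(LocalDate, LocalDate)]

  // Invalid states are unrepresentable: each period implicitly runs from its
  // start date to the next start date in sorted order, so a gap or overlap
  // cannot even be written down.
  type AsSet = SortedSet[LocalDate]
}
```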

Why is this principle important? Because redundant information in memory makes it possible to surreptitiously introduce bugs. A glance at the List representation shows that gaps or overlaps are possible, while a glance at the Set representation guarantees that they are not. The Set representation precludes an entire class of bugs; the List representation welcomes them with open arms. With the List, the only way to ensure the memory stays consistent is to thoroughly audit every piece of code that touches it. The onus is on the code which encapsulates the data to provide the same guarantee that the Set gives for free, for instance by enforcing an invariant check like the one sketched below.
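As an illustration (my own sketch, not code from the article), this is the kind of contiguity check the List representation obliges you to write, and to remember to call, at every point where the data is modified:

```scala
import java.time.LocalDate

object PeriodInvariants {
  // Returns true only if each period's end date equals the next period's
  // start date. The Set representation needs no such check, because the
  // invariant is built into the shape of the data.
  def isContiguous(periods: List[(LocalDate, LocalDate)]): Boolean =
    periods.zip(periods.drop(1)).forall { case ((_, end), (nextStart, _)) =>
      end == nextStart
    }
}
```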

Or, you could just trust that the code works, but please don't ever do that. It will bite you, or whoever is responsible for maintaining that code, at the worst possible time.

Of course, it would be remiss to write off the List representation as having no advantages over the Set representation. Consider, for example, a function to determine the length of a given period. With the List, simply locate the element and compute the difference between its start and end dates. With the Set, the same operation is more involved: the logic for finding the next start date is non-trivial, and can be more expensive computationally as well, O(log n) if the set is backed by a binary tree, or O(n) if it is backed by a hash set.
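For concreteness, here is one possible implementation of that lookup under the Set representation. This is again my own sketch; the SortedSet choice and the explicit Ordering are assumptions, not details from the article.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import scala.collection.immutable.SortedSet

object PeriodLength {
  // java.time.LocalDate needs an Ordering to live in a SortedSet.
  implicit val dateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

  // Length in days of the period beginning at `start`: find the next start
  // date and take the difference. With a SortedSet the lookup is O(log n);
  // with a plain HashSet it would degrade to an O(n) scan.
  def lengthInDays(starts: SortedSet[LocalDate], start: LocalDate): Option[Long] =
    starts.rangeFrom(start.plusDays(1)).headOption
      .map(next => ChronoUnit.DAYS.between(start, next))
}
```

Note that the result is an Option because the final period has no next start date, which is itself a consequence of the Set representation that the List form never has to think about.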

This is certainly a cost, but not the end of the world. Often, developers choose to structure their memory representations in a particular way because it makes the code easier to write, even if it allows invalid states that violate the constraints of the system. But I still think it is better to write the code such that the memory representation stays simple, even if the code itself becomes more complex.

This is because of another drawback of allowing invalid states: you have to write tests to handle and/or prevent them. The number of tests needed to cover every combination grows multiplicatively with the number of possible states (or categories of state), so many, many more tests must be written just to ensure that the memory stays consistent.

So in short, I think this is an important principle to follow, but it certainly introduces trade-offs. This is why software is an engineering endeavour. There are no perfect answers, and sometimes there is no solution that is optimal in all respects. But I do think that following this principle goes a long way to making a codebase easier to understand and validate.
