Software duplication on the dryness scale, is it all bad?


Within our software industry we have multiple principles and acronyms that explain that introducing duplicate code is a bad practice and should be avoided. An example is DRY (Don’t Repeat Yourself), or a more lenient version sometimes used for test code: DAMP (Descriptive and Meaningful Phrases). But there seem to be some counterweights as well: AHA (Avoid Hasty Abstractions) and “duplication is cheaper than the wrong abstraction”. And we have YAGNI (You aren’t going to need it): only build something when it’s needed, not when you merely foresee that you might need it.
To make this topic even more confusing, Robert C. Martin mentions that there are different kinds of duplication:

“But there are different kinds of duplication. There is true duplication, in which every change to one instance necessitates the same change to every duplicate of that instance. Then there is false or accidental duplication. If two apparently duplicated sections of code evolve along different paths—if they change at different rates, and for different reasons—then they are not true duplicates. Return to them in a few years, and you’ll find that they are very different from each other.”

― Robert C. Martin, Clean Architecture

What is duplication?

Code duplication I define as follows: “Code that’s defined more than once, with the same semantic meaning, that changes at the same rate and for the same reasons.”

That means that when code is repeated with the exact same syntax but represents something else, it is not considered duplication, but divergent code.

Types of Duplication

To make the different forms of duplication clearer, we will start with an overview, followed by code examples.

Syntactic similarities

Let’s show an example of syntax duplication:
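As a hypothetical illustration (the type names are invented for this sketch, and the types are nested in an interface only so that everything fits in one file):

```java
// Holder that keeps the four types in a single compilation unit.
interface SyntaxSamples {
    public class Customer { }
    public class Order { }
    public class Invoice { }
    public class Payment { }
}
```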

The keywords public and class have already appeared four times. This syntax is clearly duplicated, but we don’t consider this a problem, nor would we even call it duplication under normal conditions.

The same applies for annotation usages:

Surely, these annotations can be found by the dozen in a Spring/Java project. Nor would we consider JUnit @Test definitions to be duplication. Better to have multiple small tests than one huge test per application, right?

Semantic duplication

Now that we have the obviously non-problematic syntax duplication out of the way, let’s take a look at real semantic code duplication.
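A minimal sketch of what such duplication could look like. The class and method names are hypothetical; the point is that one and the same rule, "the subscription has expired", is written out three times:

```java
import java.time.LocalDate;

// Hypothetical access check: blocks users whose subscription has expired.
final class AccessController {
    static boolean isBlocked(LocalDate subscriptionEnd) {
        return LocalDate.now().isAfter(subscriptionEnd);
    }
}

// Hypothetical reminder job: the same expiry rule, written out again.
final class ReminderJob {
    static boolean shouldSendExpiryNotice(LocalDate subscriptionEnd) {
        return LocalDate.now().isAfter(subscriptionEnd);
    }
}

// Hypothetical billing job: and a third copy of the same rule.
final class BillingJob {
    static boolean shouldStopBilling(LocalDate subscriptionEnd) {
        return LocalDate.now().isAfter(subscriptionEnd);
    }
}
```

LocalDate.now() uses the system default time zone, so making the expiry rule time-zone aware would mean editing all three methods in the same way.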

The above code snippet implements the same logic three times. So what if we want to change the implementation because it doesn’t work well, or because we want to add time zones to it? The same change would have to be applied at three different locations: real (semantic) duplication.

Divergent code

Let’s take a look at a YAML configuration file. It defines three external clients where we have to log in with a password:
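A sketch of such a file, with client names, keys and values that are purely hypothetical:

```yaml
# Hypothetical client names and keys, for illustration only.
clients:
  billing:
    username: app
    password: changeit
  shipping:
    username: app
    password: changeit
  notifications:
    username: app
    password: changeit
```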

Currently, all clients happen to use the same password, so the three password entries look duplicated. But are they duplication? Do they represent the same concept, and do they change for the same reasons at the same time? No, they do not. If we reused the same property for the three clients and one had to change, we might inadvertently break the other two.

What about these two possibilities, would you pick the reused approach? Or go for the divergent one?
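The two possibilities could be sketched as follows. The keys are hypothetical, and a YAML anchor is used here as just one possible way to express the reuse:

```yaml
# Option 1, reused: one shared definition, referenced by every client.
shared-password: &shared-password changeit
clients:
  billing:
    password: *shared-password
  shipping:
    password: *shared-password
  notifications:
    password: *shared-password
---
# Option 2, divergent: every client owns its own entry,
# even though the values are currently identical.
clients:
  billing:
    password: changeit
  shipping:
    password: changeit
  notifications:
    password: changeit
```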

With the first example, which reuses the property, any change to it changes the behavior of all clients that use it. That might have a negative and unexpected impact. So why not start with the reused example and only split the definition when it’s actually needed? The challenge is: how do you communicate this to the developer who will make that change in the future? What if they forget to check all the places where it’s used? By choosing the divergent approach from the start, these side effects simply disappear.

A (divergent) use case with enums

Consider a domain-centric architecture that handles Users of certain Types. It has multiple input channels, an external API and an event listener. It’s also able to persist users to the database via a repository.

The only supported user type so far is: Regular.

This could be defined by a single Enum in the domain layer:
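A minimal sketch of that starting point (the constant name follows the text, the rest is an assumption):

```java
// Domain layer: the only user type supported so far.
enum UserType {
    REGULAR
}
```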

That’s then reused in the Api, EventListener and Repository. This code I consider dry, very dry. Dry and dusty. Every time we touch it, dust falls off unexpectedly, dirtying other parts of the code base. Why? Let’s consider the use case where the EventListener will also accept ‘Administrators’. What if we now update our domain model?
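A sketch of that update, together with a hypothetical API handler (ExternalApi and parseUserType are invented names) that parses request input straight into the shared enum:

```java
// Shared domain enum after adding the new type for the event listener.
enum UserType {
    REGULAR,
    ADMINISTRATOR
}

// Hypothetical external API handler that reuses the domain enum directly.
final class ExternalApi {
    private ExternalApi() { }

    // Accepts whatever constant names the shared enum happens to contain.
    static UserType parseUserType(String requestValue) {
        return UserType.valueOf(requestValue);
    }
}
```

Nothing in ExternalApi changed, yet ExternalApi.parseUserType("ADMINISTRATOR") now succeeds.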

This would inadvertently allow administrators to be provided via the external API, which we do not want at all: they should only be allowed via our internal event bus. So we split that definition, and the Api gets its own definition that only allows Regular users. The EventListener also gets its own definition, so that user types we might later support only via the Api do not leak into the EventListener.
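One way this split could look, assuming Java 14 or later for switch expressions. ApiUserType, EventUserType and toDomain are hypothetical names:

```java
// Domain layer.
enum UserType {
    REGULAR,
    ADMINISTRATOR
}

// External API boundary: only regular users may be supplied.
enum ApiUserType {
    REGULAR;

    UserType toDomain() {
        return switch (this) {
            case REGULAR -> UserType.REGULAR;
        };
    }
}

// Event listener boundary: administrators are accepted here as well.
enum EventUserType {
    REGULAR,
    ADMINISTRATOR;

    UserType toDomain() {
        return switch (this) {
            case REGULAR -> UserType.REGULAR;
            case ADMINISTRATOR -> UserType.ADMINISTRATOR;
        };
    }
}
```

With this split, ApiUserType.valueOf("ADMINISTRATOR") throws an IllegalArgumentException, while the event listener can still map ADMINISTRATOR to the domain.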

So what about our Repository? It uses the UserType defined in our domain to store values in the database, so “REGULAR” and “ADMINISTRATOR”. What if we do a simple refactor of that enum:
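A sketch of the renamed enum, next to a hypothetical repository (UserRepository and its methods are assumptions) that persists the constant name directly:

```java
// Domain enum after the rename: REGULAR became NORMAL.
enum UserType {
    NORMAL,
    ADMINISTRATOR
}

// Hypothetical repository that stores and reads the enum constant's name as-is.
final class UserRepository {
    private UserRepository() { }

    // The stored value silently follows every rename of the constant.
    static String toColumnValue(UserType type) {
        return type.name();
    }

    // Throws IllegalArgumentException when the stored name no longer exists.
    static UserType fromColumnValue(String stored) {
        return UserType.valueOf(stored);
    }
}
```

Rows written before the rename still contain "REGULAR", so UserRepository.fromColumnValue("REGULAR") now fails at run time.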

So, Regular just became Normal. After deploying this change to production, we might find that a lot of errors are logged. The database still holds the old values, and we didn’t mean to change the values stored in and read from the database. To prevent such accidental problems, it’s wise not to share these enum values across architectural boundaries, but to define an explicit representation per boundary and map between them, with compile-time safety.
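A minimal sketch of such a mapper for the persistence boundary, assuming Java 14 or later so that a switch expression over an enum must be exhaustive. UserTypeDbMapper and its method names are hypothetical:

```java
// Domain layer: constants can be renamed freely.
enum UserType {
    NORMAL,
    ADMINISTRATOR
}

// Persistence boundary: owns the exact strings stored in the database.
final class UserTypeDbMapper {
    private UserTypeDbMapper() { }

    static String toDatabaseValue(UserType type) {
        // No default branch on purpose: a new UserType constant makes
        // this switch expression non-exhaustive, so compilation fails.
        return switch (type) {
            case NORMAL -> "REGULAR"; // stored value survives the rename
            case ADMINISTRATOR -> "ADMINISTRATOR";
        };
    }

    static UserType fromDatabaseValue(String value) {
        // A switch over strings cannot be exhaustive, so it needs a default.
        return switch (value) {
            case "REGULAR" -> UserType.NORMAL;
            case "ADMINISTRATOR" -> UserType.ADMINISTRATOR;
            default -> throw new IllegalArgumentException(
                    "Unknown user type in database: " + value);
        };
    }
}
```

The compile-time guarantee only covers the direction from the domain enum to the database value; reading back from a string still has to be checked at run time.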

Once an enum is refactored, this switch will automatically make sure all mappings are still correct and the same values are written to the database. If a new value is added to the domain enum, the mapper will no longer compile until a branch is added to map that new value. Please note that this only works when there is no default case. In other words, by diverging the enum representations to one per boundary, with a compile-time-safe mapper, we have made sure we spot bugs at compile time instead of at run time!

So somebody might ask the question again as stated before: why shouldn’t we start with a shared domain model used everywhere, and only create a specific implementation when it’s actually deviating? I believe the same arguments still stand: how do you communicate this dependency to other developers that make the change in the future, what if they forget to check these edge cases? It’s so easy to rename a field in an enum while it can cause runtime errors. This moves the problem from compile time to runtime, which makes fixing the problem way more expensive.

Code with syntactic similarities but semantic differences is not considered a form of duplication. Sharing it anyway might be DRY, but it still smells, and no amount of deodorant will help.

