Straightforward vs. Structured, Non-repetitive Code: Which Would You Choose? (DB-Backed Set)

It is not always clear which code is better or worse as it might depend on the needs and the team in question. Let’s have a look at two different implementations of a database-backed Set, one that is straightforward and easy to understand and another one that has more structure and less duplication at the expense of understandability. Which one would you choose?

Tip: When designing a readable solution, start by writing implementations of the public methods in terms of hypothetical lower-level methods that you would like to have, without being concerned about their implementation details. Only then try to find out if there is a way to implement these lower-abstraction-level methods that would fit your needs. Thus your implementation will be driven by the public interface and by “what” the class does rather then “how,” resulting in better abstractions and readability.  (Kent Beck’s & others’ “thinking from outside in.”)

General solutions are easier to reason about than specific ones.

Background

The goal of the PersistentDatafileSet is to keep information about processed data files in a database so that none will be processed repeatedly with respect to a particular target data storage, identified by a table name and storage type. We will have a dedicated instance of PersistentDatafileSet for each storage. The data files themselves are identified by a filename and timestamp.

Side note: I really like the idea of using a Set backed by a database table for this purpose. The Set contract is familiar to all Java developers and the fact that it is backed by a database is a well-encapsulated “implementation detail.” Reusing known abstractions is a good practice, decreasing the cognitive load of our programs.

1. A straightforward though repetitive implementation

Pros:

  • Straightforward, explicit, easy to understand

Cons:

  • Connection and statement creation is repeated four times together with error handling, in each of the Set methods. The knowledge of the structure of the underlying table is explicitly and implicitly required at multiple places
  • The class has too many responsibilities – implementing Set, creating SQL strings, managing DB connections, performing DB operations

It could be argued that the repetition and mixing of concerns are on such a small scale that they are acceptable and are outweighted by the easiness of understanding.

2. More involved implementation with separated concerns and minimized duplication

Pros:

  • Simplification/Abstraction All the public (Set) methods are now one- or two-liners (aside of iterator) and it’s clear how they are implemented.
  • Concern separation, de-duplication The details of statement creation and execution have been moved to a generic StatementExecutor independent on the “business logic” and most of the repetition has been removed, the remaining duplication is at least isolated and co-located
  • More context in error messages (SQL, parameters) simplifies troubleshooting
  • A positive side-effect is that the code could be made unit-testable by hiding StatementExecutor behind an interface and using a fake implementation during tests, though the benefit of a non-integration test for a class highly dependent on a database is questionable.

Cons:

  • Considerably harder to read and understand (?) – The creation of ResultSet processing code looks complex and is unnecessary verbose due to Java’s lack of closures (lambdas); we have three classes instead of one
  • The knowledge about the target table’s structure is still repeated at several places, mainly  the SQL strings and the insert/select methods are implicitly coupled to the SQL strings defined elsewhere by the need to pass the right parameters in the right order (=> bad locality of change)
  • There is still some duplication between executeUpdateStatementOrFail and executeSelectStatement; the cost of its removal (in readability and code complexity) would be higher than the benefit (though I could have extracted & reused the setting of statement parameters)

Other comments:

  • The idea of a ResultExtractor isn’t mine, it is a copy of SpringJDBC‘s ResultSetExtractor. Using SpringJDBC would actually really simplify the code and if the choice was mine, I would nearly always prefer SpringJDBC over plain JDBC (it can be well used on its own, without anything else of Spring)
  • You might have noticed that I have changed the detection of duplicates in add to rely on the database’s integration constraints and SQLIntegrityConstraintViolationException rather then performing a costly select prior to an insert (though it might be regarded as less straitghforward than an explicit check and if the performance doesn’t matter then it might be prefered)
  • I have also switched to PreparedStatement since it made the implementation simpler and also protects again (even inadvertent) SQL injection
  • I would normally comment the classes and methods but have left that out for the sake of space; do not take that as a good example!

Summary

We have seen two implementations: The former one has straightforward, easy to understand code but is repetitive and has multiple mixed responsibilities. The latter one has factored out low-level database access methods (execute*Statement, createConnection) and implements the high-level Set methods in terms of these primitives, while also mostly removing the repetitive connection creation and exception management – at the expense of understandability due to the use of higher-order functions and Java’s cumbersome syntax for that.

I have discussed the code with a couple of people and the opinions differ, as you might have guessed. Some, especially those with background in Smalltalk, prefer the more structured code. Some would only agree to the introduction of the abstraction if it (the StatementExecutor) actually was used at least 2 or 3 times somewhere, since abstractions make it harder to read – you need to invest time to learn them. People don’t like to navigate too much, going through multiple levels, to find out what the code really does (while someone prefers details abstracted away). It is a lot a matter of personal experiences (have you been bitten more by duplication or by debugging onion-like code?), expectations (Smalltalkers are used to many small, focused objects and find it easier to work with them thanks to the widespread conventions while PHPers are driven crazy by that approach), and trust (do you trust the author that StatementExecutor.executeUpdateStatement(sql, params…) does what its signature indicates or do you need to drill into it to find it out for yourself?).

The main principles guiding the refactoring were Don’t Repeat Yourself (DRY) and Single Responsibility Principle. Other concerns such as performance were not considered (we might want to use connection pooling, caching etc., based on our actual performance needs and tests).

Quiz time!

Advertisements

4 thoughts on “Straightforward vs. Structured, Non-repetitive Code: Which Would You Choose? (DB-Backed Set)

  1. Michael

    Interesting comparison. I would much rather maintain the second implementation, as it is much easier to know where to make a change if you need to. Additionally, if there is some in for example the connection logic, you would need to change it in several places.

    Only other thing, the structured code example is still vulnerable to sql injection, through the table name and storage type parameters, which may or may not be an issue depending upon where these are coming from.

    Reply
  2. Anonymous Coward

    As long as you don’t get to reuse the result extractor and the statement executor, you have just added cost with no benefit. It may look better, but it actually is just gold plating.

    From another point of view, refactoring the first solution into the second one when there’s no obvious need for it is a lot like premature optimization – which, according to Djikstra, is the root of all evil.

    As programmers, we should learn to look at cost. Switching from the first solution to the second, we have not only invested additional effort without any visible gain, we have also forced future maintainers to do the same thing – maintain two classes instead of one, with one of them being most likely quite expensive to maintain, due to its generality (the statement executor).

    There are definitely some things to improve in the first version. I’d extract the SQL templates into private static final members, and not inline parameters – you open your code up for SQL injection that way – and I’d gather the knowledge about running an SQL statement in a single place – this would somewhat shorten the code, which means less code for bugs to hide in. Both of these refactorings would provide immediate gains. But I would definitely not add another interface and a class to the design, as long as there is no other place where to use them – these would just provide more places for bugs to hide.

    Reply
    1. Jakub Holý Post author

      Dear Coward, thank you for your comment. I am very happy to have readers with varied opinions 🙂 I absolutely agree that we should always consider costs of what we do (though we might all end up with a different result). And I’d love to see your solution of ” I’d gather the knowledge about running an SQL statement in a single place”.
      I have already experienced that different people have different opinion on this topic so I do not expect to persuade you but I would still like to address some of the points.

      – “… added cost with no benefit” – less duplication is a clear benefit for me (though the gain/cost calculation may differ for you), code focused on one thing is easier to read for me than one doing everything (though other people prefer the opposit); YMMV but for me this isn’t a “no benefit” change
      – “.. refactoring … when there’s no obvious need for it” – well, it was mostly a learning exercise for me; I ddidn’t like the code and wanted to understand why and what would be a code that I would like more so for me the refactoring certainly had a value
      – “maintain two classes instead of one, with one of them being most likely quite expensive to maintain, due to its generality” – it is disputable how much more difficult it is to maintain two classes focused on one thing each than one do-it-all class, as you can see, f.ex. Michael has a different estimation of the cost than you;

      Some friends of mine have argued in accord with you that S.E. is unnecessarily generic. But there is another way to look at it. If you look at a PreparedStatement, what does it do? It takes a SQL string and a number of parameters (depending on the string) in a particular order and turns them into a result. This is exactly what S.E. does, its interface reflects this directly.
      Making S.E. more specific (e.g. taking a “filename” and “timestamp” parameters) would hide the fact that is only wraps a Prep.St., would push some logic from the set implementation into it, making the two more coupled. *For me* it is easier to understand a generic but otherwise simple class by just looking at its interface than if it contains some business-specific logic.
      But that is just me. YMMV.
      Thank you!

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s