Today we will discuss a conflict between the design values of keeping things simple, stupid (KISS) and robustness, between underdesign and overdesign.
We were writing a batch Java application and needed to ensure that at maximum one instance is running at a time on the server. A team member had the good idea of using lock files, which indeed worked and helped us a lot. However the original implementation wasn’t very robust, which has cost us valuable people time and expensive context switches due to troubleshooting the damn application rejecting to run and locating the lock file.
As Øyvind Bakksjø of Comoyo has recently explained, a software engineer is distinguished from a mere coder by thinking and caring not only the happy path through the code but also about the unhappy cases. Good engineers think about possible problems and try to handle them gracefuly so that code that depends on them and their users have easier time dealing with problematic situation. Robustness includes catching errors early, handling them in a good way, and providing useful and helpful error messages. On the other hand, simplicity [TBD: Hickey] is a crucial characteristic of systems. It is always too easy to spend too much time on making code bullet-proof instead of focusing the effort somewhere where it would be more valuable to the business.
The overly simple implementation
The original implementation was rather simple:
The main problem is that if the app fails or is killed, it will leave the lock file behind and the next time it will reject to start with an unhelpful error message. You will need to know/read the code to understand how to fix the problem.
It has been argued that such failures and intentional kills would only happen so rarely that the effort to make the code more robust wasn’t justified. However we need to invest very little effort to make the code more friendly and robust, f.ex. by including the lock file path in the error message and explaining why it might be there and how to fix it (such as “if the app isn’t running, the lock is a leftover from a failed run and may be deleted”). Making sure that the file is removed upon failure is a few trivial lines of code that can save some confusion and time. Also, it is worthwile to make it little more robust so that it doesn’t need that many manual interventions – be nice to your ops people. (Which, I hope, is you.)
The more robust implementation
This is the improved version, with a helpful error message and lock removal upon failure:
- Create the directory storing the locks if it doesn’t exist (its non-existence and resulting confusing error message of “Already running”) has bitten us
- Helpful error message “Lock file <absolute path of the file> already exists.” => easy to copy and paste int rm.
- Helpful error message with the file path and error when we fail to create the lock (lack of space, insufficient directory permissions, …)
- Wrapping the whole main into try – finally and making sure to always delete the lock file
The code still isn’t perfect – if you kill the app, the lock file will be still left behind. There are ways to handle that (f.ex. include the app’s pid in the file, upon start check not only its existence but also that the pid actually exists / is the app) but the cost of implementing them in terms of time and increased complexity is indeed higher than the benefits.
Both KISS and robustness are important objectives and will often be in conflict. Making your code more robust than necessary makes it too complex and costs time and has an (missed) opportunity cost. Making the code too simple costs you or its users time due to troubleshooting. Achieving the right balance requires experience and iterating towards it. If your team can’t find a consensus, it is perhaps better to start with a simpler code and collect hard data on its actual robustness needs rather then overdesigning it upfront. Don’t be perfectionist like me – but also be good to your users and fellow developers. If you can make your app more robust with little effort, do it. If it requires more work, go and collect data to justify (or not) the work.