Working with Legacy Code

20 July 2015

The phrase “legacy code” can have several different interpretations; for the sake of discussion, let’s define it as “code written by someone else, and which you now have to use”. These past couple of weeks I have had the opportunity to work with legacy code written by an employee of the company. This opportunity has proven both challenging and rewarding, especially since it forces me to think harder about good programming and software development.

Of particular interest to me is the fact that this code is written in SAS, a language I was not familiar with prior to joining Prescio. Since I did not know the language, I did not have any preconceptions about what the language could (or should!) do. Instead of looking at the code and saying “This is how I would have done that”, I had to say “Hmm…what does this piece of code do, and how can I work with it?” This shift in attitude made me appreciate how hard it is to write code which is useful to, and usable by, others.

After a couple of weeks of frustrations and joys, I have come to realize there is a good way to go about working with legacy code. Mostly it involves making incremental changes, verifying that those changes do not break anything, and only then adding the features I want.

Some tips on working with legacy code:

  • Figure out the structure of the codebase. Look at the organization of the directory structure. Is code clearly and cleanly separated from outputs or inputs? Where does the documentation go? If there are multiple contributors to the codebase, how are those contributions managed? The structure of the codebase tells you something about what the author thought the purpose of the code was, and may yield insights into how the code works. (For instance, if there is an “Inputs” folder, you had better look for variables or pathnames within the code which reference those inputs, probably towards the top of the code. Similarly, are all results of the code written to an “Outputs” folder? What if there are multiple kinds of output? How much information is encoded in filenames or directories?)
  • Put the code under some kind of version control, if applicable. Version control lets you undo changes you make, and, if the code has “worked” previously, you’ll want to be able to roll back to the original code if necessary. Personally, I like using git for version control, but any version control software is better than none.
  • Just run the code and see what logging output it generates, if any. Look for error messages and write those down for yourself. Sometimes the error messages are actually anticipated by the previous author (“Don’t worry about this, it’s actually caused by X, not super-critical thing Y, but we haven’t had time to fix it.”). Sometimes the error messages are telling you something you’ll need to fix before making your own changes. For instance, if pathnames are absolute, and reference a different user of the computer…well, you’ll need to change those before you can hope to get any useful output! (Also, this is an argument in favor of using relative pathnames.)
  • After fixing all the major errors, you can start adding the features you wanted. Make small changes, and commit those changes frequently. If possible, re-run the code after each change to make sure no unexpected errors occur.
  • Help make the code easier to read and use. Refactor as needed. If you do refactor, make sure the code runs without errors; preferably, make sure the outputs are unchanged. If there are tests, make sure the refactored code passes them. Write documentation for functions or procedures invoked in the code. Even an unhelpful “I have no idea what this does, but apparently it solves issue Z” lets people down the line know that someone took a look at this, so they are not left thinking “It must be me who does not understand”.
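
One way to check that outputs are unchanged after a refactor is to fingerprint every file in the output folder before and after the change and compare the two. The following is a minimal sketch in Python, not part of the original SAS codebase. The function names `snapshot_outputs` and `compare_snapshots` are hypothetical, and it assumes the legacy code writes all of its results under a single folder such as “Outputs”.

```python
import hashlib
from pathlib import Path


def snapshot_outputs(output_dir):
    """Map each file's path (relative to output_dir) to a SHA-256 digest of its bytes."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_snapshots(before, after):
    """Report which files were added, removed, or changed between two snapshots."""
    shared = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

In use, you would call `snapshot_outputs` on the output folder after running the original code, make your change, re-run the code, take a second snapshot, and pass both to `compare_snapshots`. Three empty lists suggest the change did no harm. Note that this compares files byte for byte, so outputs which embed timestamps or run dates will show up as changed even when the results are the same.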

Conclusion

As you may surmise, the best rule of thumb for working with legacy code is “First, do no harm.” The maxim, popularly associated with the Hippocratic Oath, works in medicine, and it can also work in coding. Even though the code may be written in a bewildering way, may be poorly documented, and may have mistakes, it worked well enough to be used. By not breaking the code, unless absolutely necessary, you ensure the outputs of the code can continue to be used by other people in the company.

While adhering to that rule, we should always strive to improve the code, to leave behind a codebase which is slightly easier to understand and use. Of course, it is not possible to fix all the mistakes, to refactor all the code, or to write all the documentation. New features need to be implemented, and deadlines met. If every coder down the line makes small improvements, then, in time, a much more developer-friendly codebase can be produced.

Travis Scholten

Quantitative Analyst Intern at Prescio Consulting
Travis is a Physics PhD student and summer intern at Prescio Consulting. He's interested in solving problems related to statistical inference and writing code for data mining and analysis. He also likes to work on writing new algorithms related to his studies and work at Prescio.