Given that a series of highly unusual winter weather conditions has resulted in Chicago having colder same-day temperatures than the South Pole (seriously), many people in the city opted to work from home on the most inclement days. That includes me, and working disconnected from my normal environment caused me to consider how important it is to make your solutions repeatable.
What does repeatability mean?
In the “pure” sciences, which the Sheldon Coopers of the world are all too quick to point out do not include Computer Science, the concept of repeatability roughly means that an independent third party can reproduce the results of an experiment given only the documentation of the experiment itself. Ideally, when published research is done well, a team in Thailand should be able to reproduce the results of a team in France without contacting the French team for additional information. Research that cannot be reproduced is quickly dismissed.
In my opinion, the same discipline can be applied to software projects, but unfortunately this is rarely the case. Anyone who has been in the software development profession knows how satisfying it is when new source code pulled from a repository compiles and runs on the first try. We also know the frustration when the project doesn’t compile, and the time wasted hunting for the original author who will hopefully provide the undocumented steps or libraries necessary to run it. In my experience, the latter scenario is far more common. All too often I’ve had to track down the developer who originally wrote the code six years ago, only to have him tell me “Oh, you need to have [obscure library] in [random, unrelated directory]. I’ve got a copy of the library somewhere. I’ll email it to you.” This kind of tribal knowledge is great for job security, but is clearly not repeatable. What happens when the original author leaves the company, or gets a new machine and loses the essential files?
As professionals, we can increase our productivity and bring respect to our craft by ensuring that our solutions are repeatable. Once a project is beyond the earliest phases, anyone who pulls the source should be able to build and run taking minimal, intuitive steps. Ten years ago, this was not an easy task, usually requiring complicated batch scripts and company-standard machine configurations. Thankfully, technology has reached a point where we can easily configure any project for repeatability.
How can repeatability be achieved?
Be directory agnostic
It’s unlikely that any two developers will set their machines up in exactly the same way. Everyone has a personal environment that works best for them. The structure of the project itself can usually be safely assumed, because that structure is committed to source control; the file structure above the root directory of the project should not be assumed, however. Wherever possible, use relative paths. Whenever an absolute file path is found in code, consider any possible way to make it relative. If it can’t be made relative, or making it relative isn’t reliable enough, consider making the path part of the configuration.

In Visual Studio projects, the avoidance of concrete paths should extend beyond source code to the project files. When adding a reference to another library, if that library is another project in the solution, add a “Project Reference” instead of a “File Reference”. If referencing an external library (and NuGet isn’t an option, covered next), create a directory in the solution root for external libraries and reference the library in that location, so that the path is still relative to the solution. The main idea is that the code should be able to run from any location, given standard privileges. Whenever a file path is encountered, ask “would this work on any machine?” If not, make it so.
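The idea above is language-agnostic, so here is a minimal sketch in Python (the principle applies identically to C#): given a path found in code, rewrite it relative to the project root so the project can run from any location. The paths and the `make_relative` helper are hypothetical, invented for illustration.

```python
import os.path

# Hypothetical helper: rewrite an absolute path relative to the
# project root so it works on any machine. Paths are made up.
def make_relative(path: str, project_root: str) -> str:
    if os.path.isabs(path):
        # relpath computes the path from project_root to the target.
        return os.path.relpath(path, start=project_root)
    return path  # already relative; leave it alone

print(make_relative("/home/dev/proj/libs/foo.dll", "/home/dev/proj"))
```

Applying a check like this while reviewing code is a quick way to catch the paths that would break on a colleague’s machine.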
Use a package manager
Using a package management framework is probably one of the biggest steps toward repeatability. In the Microsoft world, the most common package manager is NuGet; Java has Maven, and Objective-C has CocoaPods for the same purpose. There is a main online NuGet repository, but private repositories can also be configured. NuGet allows adding references to third-party libraries without directly including those libraries in the project. When NuGet Package Restore is enabled for a solution, building the solution will cause Visual Studio to download any missing NuGet references. (This can also be configured to work with automated builds, so build servers can continue to function.) For example, if a project has a reference to Entity Framework added through the NuGet package manager, any developer who builds that project with Package Restore turned on will automatically pull the correct version of Entity Framework if they don’t already have it. While it is often still necessary to have an “External Libraries” folder in the solution for very specific libraries, using a package manager like NuGet dramatically cuts down the number of libraries committed to source control.
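To make this concrete, the Entity Framework dependency above would appear as a single declarative entry in the project’s `packages.config` file, which is all that gets committed; the binaries themselves are restored on build. The version and target framework below are illustrative, not a recommendation:

```xml
<?xml version="1.0" encoding="utf-8"?>
<packages>
  <!-- Declared dependency; the actual DLLs are downloaded by Package
       Restore, not committed to source control. -->
  <package id="EntityFramework" version="6.2.0" targetFramework="net462" />
</packages>
```

Because the file is plain text, dependency changes also show up cleanly in diffs and merges.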
Keep everything required by the project in the project
In combination with relative paths and a package manager, it is imperative to keep files required by the project natively stored in the project. A package manager is the preferred method for storing external libraries, but if a library can’t be managed with NuGet, create an “External Libraries” directory in the solution and store the library there. Then reference the library in that relative location. Any developer (or automated build process) pulling the solution will implicitly receive the required library as well. If a particular project requires some other software installed on the machine, include the installer (if it is legal to distribute) or a clearly labelled “Read Me” file with instructions on where to download the software and how to install it. If the unit tests (and there should be unit tests) require files full of test data, store that data within the test project as well. Disk space is cheap – store everything needed to run the solution until the admins push back.
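Putting the last three sections together, a hypothetical solution following these rules might be laid out like this (all names invented for illustration):

```
MySolution/
  MySolution.sln
  ExternalLibraries/        <- libraries NuGet can't manage, referenced relatively
    ObscureVendor.dll
  src/
    App/
      App.csproj
      packages.config
  tests/
    App.Tests/
      TestData/             <- test data lives with the tests that use it
        sample-input.csv
  ReadMe.txt                <- manual install steps, if any are unavoidable
```

Anyone pulling this tree gets everything the build needs, with no hunting through email or tribal knowledge.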
Use configuration transformations
There are multiple ways to achieve configuration transformation. The built-in transformation included with Visual Studio is usually sufficient, but other tools do exist. Scott Hanselman has a great explanation here. The benefit of configuration transformations is that environment-specific configurations can be stored independently. This further reduces the dependence on standardized machine configuration. When I began my career over a decade ago, everyone on the team was required to set their machine up a certain way because “that’s how the server is set up”. It didn’t work very well, and when developer machines and the server diverged, problems were often difficult to track down. Using configuration transformations, production, staging, and test environment-dependent settings can be stored separately from the configuration used by the team. Moreover, developers can make changes to the local configuration to support their work without the danger of accidentally impacting other environments. For example, if the local configuration is changed to log all database queries, this should not have any effect on the production environment because it uses a separate configuration.
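As a sketch of how Visual Studio’s built-in (XDT) transforms handle the logging example, a `Web.Release.config` file might override a single setting for Release builds while the base `Web.config` keeps the developer-friendly value. The `LogAllQueries` key is a hypothetical setting invented for this illustration:

```xml
<?xml version="1.0"?>
<configuration xmlns:xdt="http://schemas.microsoft.com/XML-Document-Transform">
  <appSettings>
    <!-- Overrides only this key when the Release transform is applied;
         the local Web.config can leave it set to "true". -->
    <add key="LogAllQueries" value="false"
         xdt:Transform="SetAttributes" xdt:Locator="Match(key)" />
  </appSettings>
</configuration>
```

Each environment gets its own small transform file, and none of them can accidentally leak into another environment’s deployment.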
Create a database deployment plan
This is quite open-ended and can be implemented to varying degrees. Most modern applications of significant size require some sort of database, and it is still common practice for all developers on a team to develop against one shared database instance. While necessary in some cases, this frequently results in one developer making a change to the database that breaks the application for the others. I am an advocate for individual developer database instances instead: at the most basic level, developers should be able to deploy an instance of the database locally. Each developer has a local database instance with basic test data that they can modify as needed. The database is committed to source control as part of the project, so when other developers download the solution, they can deploy the latest database configuration as well. This guarantees that the database and code stay up to date with one another.

The most primitive way to achieve this is to commit a backup of the database to source control, which each team member can restore. However, this is time consuming, does not allow for easy merging of changes, and takes up much more disk space than necessary. A more flexible approach is to include SQL scripts that generate the database, including test data. Most database management systems provide a method to do this, or Visual Studio can create a database project which includes all the required scripts. By including database scripts in the solution, anyone who downloads the solution can deploy a test instance of the database, and because the scripts are text based, individual developers can merge their database changes. Not only does this increase the repeatability of the solution, but it is also incredibly valuable when developing disconnected from the main database environment.
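A minimal sketch of such a script in T-SQL, written idempotently so any developer can re-run it safely against a local instance (the table and seed data are hypothetical):

```sql
-- Hypothetical creation script committed to source control.
-- Guarded so it can be run repeatedly without errors.
IF OBJECT_ID(N'dbo.Customer', N'U') IS NULL
BEGIN
    CREATE TABLE dbo.Customer (
        CustomerId INT IDENTITY(1,1) PRIMARY KEY,
        Name       NVARCHAR(100) NOT NULL
    );
END;

-- Seed basic test data only if it is missing.
IF NOT EXISTS (SELECT 1 FROM dbo.Customer WHERE Name = N'Test Customer')
    INSERT INTO dbo.Customer (Name) VALUES (N'Test Customer');
```

Because scripts like this are text, two developers who each add a column or a row can merge their changes the same way they merge source code.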
There are certainly more steps that can be taken in the name of repeatable project development, which I hope to outline in future posts. Just as the sciences consider repeatability the sign of good research, software professionals should consider it the mark of a well-designed solution. In general, work toward the goal that anyone who downloads your project source should be able to build and run it without ever talking to you. Whenever manual steps are taken as part of the build or deployment process, consider whether there is some way to automate those steps so that they are transparent to the next developer, yet not so complicated as to become confusing.