This post is the fourth in my “5 Keys” to EIM series, covering the core principles of enterprise information management (EIM).
This time we are looking at data development, which is where many of today’s distinguished data management professionals began their careers.
Figure 1: 10 Data Management Disciplines (adapted from the Data Management Association)
Data development is about producing a technology-driven solution to move data from place to place. It may be moving things from a data generation system to a data reporting system, or writing queries for reports, or a variety of systems development tasks that enable one of the many other DAMA domains to function correctly – but in the end, it all comes back to moving data from place to place. Let’s keep it simple and look at the keys:
KEY 1: Data development doesn’t develop data directly
Opportunities are rare, but I like to incorporate tongue-twisters in this “5 Keys” series whenever possible. This key says that data development doesn’t create new data – data development is focused on the movement or changing of data. Think of your data systems as cities, and data development as the highways connecting these cities. And remember, nobody takes a trip to visit a highway. Data development is similarly a means to an end, and not much of a destination unto itself. Keep this highway analogy in mind, as we’ll keep referring back to it.
KEY 2: Data development changes data very little…or entirely
Data development should compartmentalize its activities into “move” activities and “change” activities, and do only one of them in any process. This wasn’t always considered the best practice: it used to be that we would move data and change it as part of the move so we wouldn’t waste space storing something in the same way twice.
The downside of the old way is that when move and change activities become intertwined, solution complexity grows quickly, because every business rule is tangled up with the mechanics of the move. Developers end up spending a lot of time unwinding the embedded business rules any time a change is necessary, and mistakes are common in that process (see Key 4 below). Fortunately, space is now cheaper than it used to be, so if you must waste something to promote clarity, space tends to be less costly than developers’ time.
KEY 3: “ETL” is usually inferior to “ELT”
ETL is a relatively technical term that stands for “Extract, Transform, and Load.” This means you pull data out of somewhere, change it somehow, and then put it somewhere else. What did we just say in the last key? This is exactly what we said not to do most of the time!
ELT on the other hand means “Extract, Load, and Transform.” First you pull data out of somewhere, then move it somewhere else, and THEN make changes to it! This process has a clear separation between moving the data and changing the data. Now business rules, documentation, and future changes all become easier to understand and adapt.
Going back to our city/highway analogy, this method keeps things moving on the highways, and puts all the exciting business stuff in the cities where it belongs. Let’s keep our data highways flowing fast and free by not littering business rules all over!
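For readers who think in code, here is a minimal sketch of that separation using Python's built-in sqlite3 module. The table names (stg_orders, rpt_orders), the sample rows, and the business rules are all made up for illustration. The point is only that load() moves data without touching it, and transform() is the one place where anything changes.

```python
import sqlite3

# Hypothetical source rows, standing in for an extract from a data generation system.
SOURCE_ROWS = [
    ("1001", " alice ", "2500"),
    ("1002", "BOB", "1200"),
]


def extract():
    """Extract: pull rows out of the source exactly as they are."""
    return list(SOURCE_ROWS)


def load(conn, rows):
    """Load: a pure "move". Rows land in the staging table unchanged."""
    conn.execute(
        "CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount_cents TEXT)"
    )
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform: a pure "change". Every (made-up) business rule lives here."""
    conn.execute(
        """
        CREATE TABLE rpt_orders AS
        SELECT CAST(order_id AS INTEGER)             AS order_id,
               UPPER(TRIM(customer))                 AS customer,
               CAST(amount_cents AS INTEGER) / 100.0 AS amount_dollars
        FROM stg_orders
        """
    )


conn = sqlite3.connect(":memory:")
load(conn, extract())   # the highway: nothing is altered in transit
transform(conn)         # the city: all the business logic happens here

staged = conn.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall()
report = conn.execute("SELECT * FROM rpt_orders ORDER BY order_id").fetchall()
```

If the rule for cleaning up customer names changes next month, only transform() needs to be touched, and the staged copy is still there to re-run it against.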
KEY 4: Your data developers need more understanding of the business
Too few developers ask “why?” Most of the time when developers are told to make one system talk to another, they jump into their tool of choice and start building, based on file formats or whatever they are given as parameters. The problem is that too often they receive some input and output parameters, but a lot of the underlying context is lost.
Back to the highway analogy: developers are told to expect a certain number of cars going from city A to city B, and that they should build a 4-lane highway. That may seem sufficient, but what kind of average speed is adequate? Are there data rush hours? What happens if some data doesn’t make it to the destination in a certain amount of time? Is it more important to be fast or redundant?
If a developer asks a business person which RAID (redundant array of independent disks) level to use, the conversation is already over. But if the developer understands the business priorities, they can determine which RAID level is most appropriate without confusing the heck out of anybody. If by saying that I have just confused the heck out of you, just know that the RAID level determines how the disk drives behind a database trade speed against safety. Every level involves tradeoffs, and not knowing the business priorities well enough will lead to choosing the wrong one.
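To make the tradeoff a little more concrete, here is a rough sketch of the arithmetic behind a few common RAID levels: how much usable space you get, and how many disk failures the array is guaranteed to survive. The function name raid_summary is made up for this post, the RAID 10 case assumes two-way mirrors, and speed is deliberately left out because it depends heavily on hardware and workload.

```python
def raid_summary(level, disks, disk_tb):
    """Return (usable_tb, guaranteed_failures_survived) for a RAID array.

    Illustrative helper only; covers RAID 0, 1, 5, 6 and 10.
    """
    if level == 0:  # striping only: all the space, no safety net
        if disks < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        return disks * disk_tb, 0
    if level == 1:  # mirroring: every disk holds a full copy
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_tb, disks - 1
    if level == 5:  # striping with one disk's worth of parity
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_tb, 1
    if level == 6:  # striping with two disks' worth of parity
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return (disks - 2) * disk_tb, 2
    if level == 10:  # striped two-way mirrors (assumption: pairs of disks)
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, 4 or more")
        return (disks // 2) * disk_tb, 1
    raise ValueError(f"unsupported RAID level: {level}")


# Same four 2 TB disks, three very different answers.
fast_but_fragile = raid_summary(0, 4, 2.0)
balanced = raid_summary(5, 4, 2.0)
mirrored = raid_summary(10, 4, 2.0)
```

None of these numbers tells you which option is right. That answer comes from knowing whether the business cares more about capacity, speed, or never losing a row.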
KEY 5: Extensibility is an Art
I preach proactive laziness. If I can do a little work now to avoid a lot of work later, I inherently want to do that. It’s not a precise science, but all good data developers learn to make similar tradeoffs.
This is why we will often build an ELT process to load all of the data from a source into a staging area even if we only need 25% of it right now – it only takes a small amount of extra time to map the additional fields up front, and it is painful to go back and do it later. It’s also likely that if some of the data from a source is useful, other data from that source will prove useful later.
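Here is a small sketch of what that looks like in practice, again using Python's built-in sqlite3 module. The source fields, sample customers, and table name stg_customer are invented for illustration. The load maps all eight fields even though today's report needs only two of them.

```python
import sqlite3

# Hypothetical source extract: eight fields, of which today's report needs two.
SOURCE_FIELDS = [
    "customer_id", "name", "region", "segment",
    "signup_date", "email", "phone", "credit_limit",
]
SOURCE_ROWS = [
    (1, "Acme", "Midwest", "Retail", "2011-03-01", "ap@acme.example", "555-0100", 5000),
    (2, "Globex", "West", "Wholesale", "2012-07-15", "ap@globex.example", "555-0101", 12000),
]


def load_staging(conn):
    """Map every source field up front, not only the ones needed today."""
    columns = ", ".join(SOURCE_FIELDS)
    placeholders = ", ".join("?" for _ in SOURCE_FIELDS)
    conn.execute(f"CREATE TABLE stg_customer ({columns})")
    conn.executemany(
        f"INSERT INTO stg_customer VALUES ({placeholders})", SOURCE_ROWS
    )


conn = sqlite3.connect(":memory:")
load_staging(conn)

# Today's requirement uses 2 of the 8 staged fields (25%).
today = conn.execute(
    "SELECT customer_id, name FROM stg_customer ORDER BY customer_id"
).fetchall()

# A later requirement becomes a new query only; load_staging() is untouched.
later = conn.execute(
    "SELECT name, credit_limit FROM stg_customer WHERE region = 'West'"
).fetchall()
```

The second query is the payoff: when someone asks about credit limits by region six months from now, nobody has to reopen the load process to answer.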
Though these are important concepts, don’t let your developers get too carried away. My first draft of this key went an entire page by itself, and I’m going to put it into a separate post later. Bottom line, if it is overwhelmingly better to do the work now, let the developers do it. If they can’t tell you why without resorting to terms like “RAID” then tell them to skip it for now and spend the next hour learning more about the business.
Anthony J. Algmin is a Manager in the Business Intelligence Practice at West Monroe Partners.