August 17, 2016
Big Data Insights – Dealing with Dramatic Changes to your Source Data
We were feeling good. Our data processes were running like a well-oiled machine. Information came in and valuable insights flowed out. In the middle of it all sat a series of processes that dissected information from a very complicated source, smoothed it out, and made it useful. Then the Department of Labor released a statement saying they were going to make major changes to our data source, the Form 5500. Along with the statement came one thousand pages of supporting documentation (okay, technically it was 918… but who’s counting!).
When working with a massive data set like the 5500 (about a million records a year, each with approximately 400 data elements), any disruption to the form and format of the data can have a serious impact on the business you’ve built on top of that data. We at ALM Benefits Intelligence are exactly one month into dealing with these newly announced changes, and I thought it might be interesting to document the process of coping with such a big issue.
Step 1 – Don’t panic
If your data source changes, it’s important to understand that the sun has not fallen from the sky. Recognize that you’re in for a turbulent couple of months or years, but understand that you’re not exactly back to square one. Even if the source data changes drastically, there are likely data elements, or whole groups of elements, that carry over from the old to the new. Yes, you might have to change your own table structures to accommodate the altered information, but if you let an “it’s like we’re starting from scratch!” attitude infect your team, you are setting yourself up for failure.
Step 2 – Get out those highlighters
This step probably should be called “understanding the changes,” but it amounts to the same thing. Before you can do anything constructive, you need to understand what you’ve got to play with. In our case, we ran through all 918 pages and flagged every data element that was new. Stuff that got moved from one place to another, or is now called something different… all of that can be handled by creating a good map from the old dataset to the new. It’s a pain, sure, but it’s not exactly difficult.
Understanding what’s new allows you to figure out what additional insights you may be able to bring to the product.
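As a rough illustration of that old-to-new map, here is a minimal Python sketch. The field names are made up for the example (they are not taken from the actual Form 5500 layouts), and the rename map is assumed to be built by hand while reading the documentation. The sketch sorts the new layout into fields that are unchanged, fields that were only renamed or moved, and fields that are genuinely new.

```python
# Hypothetical field names for illustration only; these are NOT the
# real Form 5500 element names.
OLD_FIELDS = {"PLAN_NAME", "SPONSOR_EIN", "TOT_PARTCP_CNT", "TOT_ASSETS_AMT"}
NEW_FIELDS = {
    "PLAN_NAME",
    "SPONSOR_EIN",
    "PARTICIPANT_TOTAL",
    "TOT_ASSETS_AMT",
    "FEE_DISCLOSURE_IND",
}

# Hand-built map of renamed or relocated elements: new name -> old name.
RENAMED = {"PARTICIPANT_TOTAL": "TOT_PARTCP_CNT"}


def classify_fields(old_fields, new_fields, renamed):
    """Split the new layout into unchanged, renamed, new, and dropped fields."""
    unchanged = old_fields & new_fields
    # Keep only rename entries that actually connect the two layouts.
    mapped = {
        new: old
        for new, old in renamed.items()
        if new in new_fields and old in old_fields
    }
    truly_new = new_fields - unchanged - set(mapped)
    dropped = old_fields - unchanged - set(mapped.values())
    return {
        "unchanged": unchanged,
        "renamed": mapped,
        "new": truly_new,
        "dropped": dropped,
    }


report = classify_fields(OLD_FIELDS, NEW_FIELDS, RENAMED)
print(sorted(report["new"]))
```

Whatever lands in the "new" bucket is the list worth highlighting: those are the elements that could feed additional insights.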
Step 3 – Determine what you want the end-tool to look like
Not all changes have to flow into your finished product. In my case that product is an online B2B database, but your finished product may be an internal data warehouse, a flashy report, or something in between.
A change in the data can mean a change in the finished product, but it doesn’t have to. If you need to deal with an unexpected change very quickly, you can simply map the new data layout into the same form and format that your old processes can handle. In the case of the 5500, we have a few years to get ready, so we’ll be looking to alter our front end to display the new information and the insights we develop from it. We’ll be revamping the whole thing.
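The quick fix described above, mapping the new layout back into the old one, might look something like the following Python sketch. The field names, the `NEW_TO_OLD` map, and the `to_legacy_layout` helper are all hypothetical, invented for this example rather than drawn from our actual processes.

```python
# Hypothetical names for illustration; not the real Form 5500 layout.
NEW_TO_OLD = {"PARTICIPANT_TOTAL": "TOT_PARTCP_CNT"}
LEGACY_FIELDS = ["PLAN_NAME", "TOT_PARTCP_CNT", "TOT_ASSETS_AMT"]


def to_legacy_layout(new_record, new_to_old, legacy_fields):
    """Reshape a record from the new layout into the layout old processes expect.

    Renamed fields are translated back to their old names, fields with no
    legacy equivalent are left out, and legacy fields missing from the new
    record come through as None.
    """
    renamed = {new_to_old.get(name, name): value for name, value in new_record.items()}
    return {field: renamed.get(field) for field in legacy_fields}


new_record = {
    "PLAN_NAME": "Acme 401(k)",
    "PARTICIPANT_TOTAL": 250,
    "TOT_ASSETS_AMT": 1000000,
    "FEE_DISCLOSURE_IND": "Y",  # new element, ignored by the legacy layout
}
legacy_record = to_legacy_layout(new_record, NEW_TO_OLD, LEGACY_FIELDS)
print(legacy_record)
```

An adapter like this buys time: downstream processes keep running unchanged while the new elements wait for a proper redesign.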
Step 4 – Get your hands on the new data as soon as possible
Even if you can’t get the full dataset early (and with a government source you certainly won’t), it may be possible to get your hands on a few sample records. Having just a few well-populated sample records lets you test any new processes and find holes in your product well before you need to go to market. Treat your development team like a bunch of young doctors. Get them a cadaver to work on before they have to hit the big time with a million new records.
Step 5 – Check EVERYTHING
Once your development team tells you they’ve got the new system up and running, you’ve got to go check their work. Start with a single record exactly as it came to you from the data source. Then go look at that same record in the finished product, and make sure that all of the fields populated correctly. Don’t just check to see if they are populated, but make sure the correct values are in place as well. If the source data says a field should be $1,000,000 and your product shows $1,000, there’s an issue that you need to run down.
Once you’ve done that for a single record, do it nine more times.
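If both the source records and the finished-product records can be loaded as dictionaries keyed by a record ID, part of this check can be scripted. The Python sketch below rests on that assumption. The `compare_record` and `spot_check` helpers and the field names are hypothetical. It flags fields that are missing and fields that are populated with the wrong value, then repeats the check on a sample of ten records.

```python
import random


def compare_record(source, product):
    """Return a list of (field, expected, actual, problem) for one record."""
    problems = []
    for field, expected in source.items():
        if field not in product:
            problems.append((field, expected, None, "missing"))
        elif product[field] != expected:
            # Populated, but not with the value the source says it should have.
            problems.append((field, expected, product[field], "wrong value"))
    return problems


def spot_check(source_by_id, product_by_id, sample_size=10, seed=0):
    """Compare a random sample of records; the seed makes the sample repeatable."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(source_by_id), min(sample_size, len(source_by_id)))
    return {
        rid: compare_record(source_by_id[rid], product_by_id.get(rid, {}))
        for rid in ids
    }


# Hypothetical record: the source says $1,000,000 but the product shows $1,000.
source = {"PLAN_NAME": "Acme 401(k)", "TOT_ASSETS_AMT": 1000000}
product = {"PLAN_NAME": "Acme 401(k)", "TOT_ASSETS_AMT": 1000}
issues = compare_record(source, product)
print(issues)
```

A script like this does not replace eyeballing a record in the finished product, but it makes it cheap to rerun the same ten checks after every fix.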
When the very data on which you’ve built your business is scheduled to undergo a big change, you need to be ready to change with it. It can seem intimidating, but if you take a deep breath and roll up your sleeves, you may find that there is much more opportunity than crisis.