David Dietrich has a useful post, “The Dirty Little Secret of Big Data Projects,” on EMC’s InFocus Blog. In it, David makes the important point that the biggest challenges with big data often don’t get the most attention:
[I]t occurs to me that many times people look to improve and push boundaries for things that we are already pretty good at, while spending less time on improving areas that are important, but may be difficult or less sexy.[…]
From my experience, and from input I’ve received from others who are experienced Data Scientists, Data Prep can easily absorb 80% of the time of a project. But there has been a real lag in the development of tools for data prep. Many times I see leaders who want to get their data science projects going quickly, so their teams jump right into making models, only to slide back a few phases, because they are dealing with messy or dirty data. They must then try to regroup and create predictive models.
Dealing with the data cleansing and conditioning can be a very unsexy part of a project. It can be painful, tedious, time consuming, and sometimes thankless to clean, integrate and normalize data sets so that you can later get it into a shape and structure to analyze later on. Rarely do people pound their chests at the end of a project and talk about all of the fabulous data transformations they performed in order to get the data into the right structure and format to analyze. This is not where the sizzle is, but, like many things, it’s what separates the novices from the masters. In fact, because of the amount of thought and decision-making related to how data is merged, integrated, and filtered, I believe more and more that the data prep cannot be separated from the analytics, and is intrinsically part of the Data Analytics Lifecycle and process.
I heartily agree with David’s account of these difficulties. UF’s Data Management/Curation Task Force is dealing with these and other difficulties on two fronts: establishing new supports for data management/curation, and further developing the culture and overall socio-technical supports for data management/curation so that the data collected is cleaner and more robustly defined from the start. There’s a lot of work ahead, and it’s important to note that some of the work that may not seem glamorous is critical to making the most exciting things possible.