Steven,
I got a kick out of your two examples of data errors. I'll chip in a
couple of my own (from the same corporate data source, back in the punch
card era). One of the fields was mileage. For whatever reason, some of
the people doing data entry would leave that blank. To the program
reading the punch cards, a field with no punches was a zero, not an NA.
This led to rather curious "shortest path" results in an application using the data.
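A minimal sketch of that failure mode, in modern Python with a hypothetical four-column mileage field (nothing below is the original punch-card code):

import math

def parse_mileage(field: str) -> float:
    """Treat an all-blank field as missing (NaN), not as zero."""
    stripped = field.strip()
    return float(stripped) if stripped else math.nan

records = ["0123", "    ", "0045"]  # second record's mileage was never punched

naive = [float(r) if r.strip() else 0.0 for r in records]  # blank -> 0
safer = [parse_mileage(r) for r in records]                # blank -> NaN

print(naive)  # [123.0, 0.0, 45.0] -- a zero-mileage edge wins any shortest-path search
print(safer)  # [123.0, nan, 45.0] -- missing legs can be dropped or imputed instead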
Another field was an alpha code (one to four characters) for a location.
We discovered (the hard way) that "D" meant Dallas ... or Denver ... or
Detroit ... or "delivery point".
I've also encountered time series where something about the units of
measurement or method of calculation (such as using calendar year v.
fiscal year, or first of the month v. fifteenth of the month) changed
mid-series, with no notation in the database.
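For what it's worth, here is a crude screen for that sort of undocumented break (the thresholds and data are invented for illustration; it flags abrupt, persistent level shifts, nothing more):

import statistics

def level_shifts(series, window=12, factor=3.0):
    # Flag index i when the mean of the next `window` points differs from the
    # mean of the previous `window` points by more than `factor` local spreads.
    flags = []
    for i in range(window, len(series) - window + 1):
        before = series[i - window:i]
        after = series[i:i + window]
        spread = statistics.stdev(before) or 1e-9  # guard against a constant window
        if abs(statistics.mean(after) - statistics.mean(before)) > factor * spread:
            flags.append(i)
    return flags

# A series whose definition silently changes at index 24:
data = [100 + i % 3 for i in range(24)] + [160 + i % 3 for i in range(24)]
print(level_shifts(data))  # flags a cluster of indices around the undocumented break at 24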
I'm starting to think people do this intentionally, to confuse machine
learning algorithms and delay "the Singularity" (when SkyNet takes over).
Paul
------Original Message------
The "janitor work" question is an interesting one. Our firm has delivered hundreds of analysis engagements and depending on how you parse that experience set, there are at least three kinds of "janitor work" related to data.
First, there is the classic data cleaning problem. The range of errors in data is bewildering and constantly amazing. Here are two favorites:
- the maintenance people who were confused about which information went to which field; they logged the tail number of the aircraft in the field reserved for minutes to perform a task. As a result, the "average time" to perform any task was dominated by these errors, since the tail numbers were six-digit numbers (a small numeric sketch follows this list).
- the students who figured out no one audited their practice logs; one enterprising young fellow logged over 170 separate learning objectives in a 70-minute training period. Even more impressive, some of those objectives could only be done at night, while others were daylight only.
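As promised, a small numeric sketch (invented values) of the tail-number effect; a handful of six-digit entries swamp the mean, while the median barely notices:

import statistics

minutes = [35, 42, 28, 50, 31, 44, 38, 472913, 160584]  # two tail numbers mislogged as minutes

print(statistics.mean(minutes))    # 70418.33... "average minutes" per task -- nonsense
print(statistics.median(minutes))  # 42 -- robust to the mislogged records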
I could go on. Some errors come from automatic logging too, but humans are more inventive in error creation than machines.
Second: data collection. In many cases a client pays for research/data collection as part of the analysis. In some cases data which seemed available is not easy to obtain. The budget and schedule can be a real problem when this happens. Inventive polling methods, subject matter interviews, obscure databases in government and academia, and purchased analysis reports are all alternatives. In these cases data economics is critical: what is the real value of obtaining more data? In the era of big data, it is easy to be confused about the power of small data sets. If we weighed only two mice and two elephants, we'd know a lot about the differences between the two species.
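A toy version of the mice-and-elephants point, with made-up weights; when the between-group difference dwarfs the within-group spread, two observations per group are already decisive:

mice = [0.02, 0.03]           # kg, invented
elephants = [4000.0, 5500.0]  # kg, invented

# The groups are perfectly separated by orders of magnitude, so no
# significance test or large sample is needed to tell them apart.
print(max(mice) < min(elephants))  # True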
A different version of this problem is finding data readily available from multiple sources which seems to be contradictory. This is particularly true when data has been collected to "prove" an organization is good. Another problem is data collected to arbitrary standards because they are easy to obtain (we measure what we can, rather than what we need to measure). A closely related problem is the arbitrary standard set by convention (we've always measured it this way). These lead to multiple data sets using the same semantics but which clearly can't be measuring the same thing... or even worse, the differences are subtle and the analyst fails to notice they are not measuring the same thing.
Third: what we call the "fourth great lie" - clients who say, "Don't worry, we have that data"... but they never do.
------------------------------
Steven Roemerman
Chairman
Lone Star
Addison TX
------------------------------