INFORMS Open Forum


Resources for data preparation

  • 1.  Resources for data preparation

    Posted 04-27-2016 07:15

    When discussing analytics projects, the typical claim is that the data preparation phase is 80 to 90% of the effort.  At the same time, data preparation seems to be the area that is "glossed over" in terms of how to perform this step.  I realize that the data preparation process is obviously project specific, but I have to believe that there are some general resources on this topic that provide examples and starting recommendations on the data cleaning process.

     

    As I work on our analytics curriculum, I am therefore looking for any material (web sites, books, software, etc.) that I can use as resources for teaching both undergraduate and graduate classes.  In addition to these types of resources, any recommendations on where else I can post this question are also appreciated.

     

    I will share back to the list any responses I receive.

     

    Thanks.

     

    Jerry

     

    "No trees were harmed in the sending of this message; however, a large number
    of electrons were slightly inconvenienced..."


    Dr. Jerry Flatto, Professor, Information Systems Department - School of Business

    University of Indianapolis, Indianapolis, Indiana, USA mailto:jflatto@uindy.edu

     

    Confidentiality Notice: This communication and/or its content are for the sole use of the intended recipient, and may be privileged, confidential, or otherwise protected from disclosure by law.  If you are not the intended recipient, please notify the sender and then delete all copies of it.  Unless you are the intended recipient, your use or dissemination of the information contained in this communication may be illegal.

     



  • 2.  RE: Resources for data preparation

    Posted 04-28-2016 11:34

    The 80% figure for "janitor work" breaks down into several distinct parts (that likely will require distinct references): acquiring the data; finding and fixing errors (distinguishing genuine outliers from recording errors); and reorganizing/reformatting the data (what Hadley Wickham calls "tidying" it). For the third part, a very good starting place is one of Wickham's papers: Hadley Wickham, "Tidy Data", Journal of Statistical Software 59, 10 (2014).
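    To make the "tidying" idea concrete, here is a minimal Python/pandas sketch of the most common reshaping Wickham describes: converting a table with one column per year into one row per observation. The data and column names are invented for illustration; pandas' melt plays the role of tidyr's gather/pivot_longer.

```python
import pandas as pd

# A "messy" table: one row per site, one column per year (values are counts).
messy = pd.DataFrame({
    "site": ["A", "B"],
    "2014": [10, 20],
    "2015": [12, 25],
})

# Tidy form: one row per (site, year) observation, one column per variable.
tidy = messy.melt(id_vars="site", var_name="year", value_name="count")
print(tidy)
```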

    ------------------------------
    Paul Rubin
    Professor Emeritus
    Michigan State University
    East Lansing MI



  • 3.  RE: Resources for data preparation

    Posted 04-29-2016 07:31

    Paul - the referenced paper is really excellent - thank you.

    ------------------------------
    Steven Roemerman
    Chairman
    Lone Star
    Addison TX



  • 4.  RE: Resources for data preparation

    Posted 04-29-2016 17:06

    I am interested in references for #2: finding and fixing errors (distinguishing genuine outliers from recording errors).

    I teach a Data Analysis class that begins with cleaning data. I have found it impossible to locate a good text for undergraduate students in MIS. References appreciated.

    Thanks,

    Kaye

     
    ------------------------------
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    President, UCA Faculty Senate, AY 2016-2017
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Kaye McKinzie, Ph.D.
    NIS, College of Business, 305 C
    University of Central Arkansas
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    201 Donaghey Ave.
    Conway, AR 72035



  • 5.  RE: Resources for data preparation

    Posted 04-30-2016 12:05
    Kaye,

    Sadly, I never found any useful references to cleaning data. I usually think of it in three stages. First, identify outliers. Second, figure out (guess?) which outliers are errors in the data, which are observations from subpopulations that should be analyzed separately or discarded (the somewhat dated example I used was looking at income in the Seattle area and including Bill Gates in the sample), and which are just legitimate observations in the tail of some distribution. Third, for the first group (errors) try to get corrected values or drop the observations. (For the second group, you can rejigger the model to explain the subpopulation, split it off into a separate analysis, or just drop it.)
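    A minimal Python sketch of the first stage (identification), using the standard box-plot rule that flags points outside 1.5 times the interquartile range from the quartiles. This is one common convention, not the only one, and the income figures are made up; note that the flagged value still needs the second-stage judgment call (error vs. subpopulation vs. legitimate tail).

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# One "Bill Gates" in a sample of Seattle-area incomes (in $thousands).
incomes = [40, 42, 45, 48, 50, 52, 55, 60, 10_000]
print(iqr_outliers(incomes))  # the extreme value is flagged for review, not auto-deleted
```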

    Fixing the erroneous observations is probably too case-specific to show up in a text, other than perhaps as a few anecdotes like the ones Steven Roemerman posted. The closest I've come to the identification part are a few scattered references to outliers in books or online courses. For instance, "Applied Linear Regression Models" by Kutner, Nachtsheim and Neter contains a few pages devoted to detecting outliers in linear regression residuals. I've seen references to using box plots (single variables) and scatter plots (paired variables) for spotting outliers, but again those references are widely scattered (and often just tossed out in passing).

    This seems like a topic that the CAP folks ought to cover in detail in one of their continuing ed courses, but I'm not familiar with the content of those (not being a CAP candidate).

    Sorry I can't be of any real help with this. Having taught data analysis to both MBAs and doctoral students, I feel your pain.

    Paul






  • 6.  RE: Resources for data preparation

    Posted 04-29-2016 07:12

    The "janitor work" question is an interesting one.  Our firm has delivered hundreds of analysis engagements and depending on how you parse that experience set, there are at least three kinds of "janitor work" related to data. 

    First, there is the classic data cleaning problem.  The range of errors in data is bewildering and constantly amazing.  Here are two favorites:

    - the maintenance people who were confused about which information went in which field; they logged the tail number of the aircraft in the field reserved for minutes to perform a task.  As a result, the "average time" to perform any task was dominated by these errors, since tail numbers were six-digit numbers.

    - the students who figured out that no one audited their practice logs; one enterprising young fellow logged over 170 separate learning objectives in a 70-minute training period.  Even more impressive, some objectives could only be done at night, while others were daylight only.

    I could go on.  Some errors come from automatic logging too, but humans are more inventive in error creation than machines.  
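    Both of those failure modes (a tail number landing in a minutes field, an implausible objective count) are catchable with simple plausibility checks before any analysis.  A minimal sketch; the field names and thresholds here are invented for illustration, not from any real logging system:

```python
def plausibility_errors(record):
    """Return a list of rule violations for one log record.
    Thresholds are illustrative placeholders."""
    errors = []
    # A six-digit "minutes" value is almost certainly a mis-entered tail number:
    # no single task plausibly takes more than a day.
    if not (0 < record["task_minutes"] < 24 * 60):
        errors.append("task_minutes outside plausible range")
    # 170 objectives in a 70-minute period fails a simple rate check
    # (more than one objective per minute).
    if record["objectives_logged"] > record["period_minutes"]:
        errors.append("objective rate implausibly high")
    return errors

bad = {"task_minutes": 123456, "objectives_logged": 170, "period_minutes": 70}
print(plausibility_errors(bad))
```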

    Second: data collection.  In many cases a client pays for research/data collection as part of the analysis.  In some cases data which seemed available is not easy to obtain.  The budget and schedule can be a real problem when this happens.  Inventive polling methods, subject matter interviews, obscure databases in government and academia, and purchased analysis reports are all alternatives.  In these cases data economics is critical - what is the real value of obtaining more data?  In the era of big data, it is easy to be confused about the power of small data sets.  If we weighed only two mice and two elephants, we'd know a lot about the differences between the two species.

    A different version of this problem is finding data readily available from multiple sources which seems to be contradictory.  This is true in particular when data has been collected to "prove" an organization is good.  Another problem is data collected to arbitrary standards because they are easy to obtain (we measure what we can, rather than what we need to measure).  A closely related problem is the arbitrary standard set by convention (we've always measured it this way).  These lead to multiple data sets using the same semantics but which clearly can't be measuring the same thing... or even worse, the differences are subtle and the analyst fails to notice they are not measuring the same thing.

    Third: what we call the "fourth great lie" - clients who say, "don't worry, we have that data"... but they never do.

    ------------------------------
    Steven Roemerman
    Chairman
    Lone Star
    Addison TX



  • 7.  RE: Resources for data preparation

    Posted 04-30-2016 11:38
    Steven,

    I got a kick out of your two examples of data errors. I'll chip in a
    couple of my own (from the same corporate data source, back in the punch
    card era). One of the fields was mileage. For whatever reason, some of
    the people doing data entry would leave that blank. To the program
    reading the punch cards, a field with no punches was a zero, not an NA.
    This led to some rather curious "shortest paths" in an application using the data.
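    The blank-means-zero trap still bites with modern tools. When loading such a file, it is worth being explicit that an empty field is missing, never zero; a minimal sketch with Python's csv module and an invented three-column file:

```python
import csv
import io

# Toy file: one record has a blank mileage field.
raw = "origin,dest,mileage\nA,B,120\nB,C,\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # Treat an empty string as missing (None), not as 0 -- a 0 here would
    # create a phantom zero-length arc in any shortest-path calculation.
    row["mileage"] = float(row["mileage"]) if row["mileage"] else None
    rows.append(row)

print(rows[1]["mileage"])  # None, not 0.0
```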

    Another field was an alpha code (one to four characters) for a location.
    We discovered (the hard way) that "D" meant Dallas ... or Denver ... or
    Detroit ... or "delivery point".

    I've also encountered time series where something about the units of
    measurement or method of calculation (such as using calendar year v.
    fiscal year, or first of the month v. fifteenth of the month) changed
    mid-series, with no notation in the database.

    I'm starting to think people do this intentionally, to confuse machine
    learning algorithms and delay "the Singularity" (when SkyNet takes over).

    Paul




  • 8.  RE: Resources for data preparation

    Posted 04-29-2016 09:35

    I don't have any particular resource, but the first lab assignment in my business statistics class is to download and "clean" a demographic data set from IPUMS USA (it's free).  Each record in the data extract is a person, but the variables include household variables. That alone makes it impossible to use the data as-is to address household questions. Therefore the first step for household questions is to remove duplicate records. Then students need to look at the data dictionary to find the numeric codes for missing data and filter out those records before doing descriptive statistics. I also talk about retaining the original data in case the question changes from household to individual or to a variable where the missing data is in a different record.

    It's far from comprehensive, but doing this early in the course helps the students understand that usable data in the real world doesn't magically appear. I've attached the directions for getting the data set. This doesn't include the filtering steps since I do that in class.
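    The same two filtering steps (one record per household, then dropping the sentinel codes for missing data) look roughly like this in pandas.  The column names and the 9999999 missing-data code are placeholders; the real codes come from the IPUMS data dictionary:

```python
import pandas as pd

# Toy extract: person-level records that carry household-level variables,
# so household values repeat once per household member.
extract = pd.DataFrame({
    "serial": [1, 1, 2, 3],                       # household identifier
    "hhincome": [50000, 50000, 9999999, 72000],   # 9999999 = "missing" code (placeholder)
})

# Step 1: for household questions, keep one record per household.
households = extract.drop_duplicates(subset="serial")

# Step 2: filter out the numeric code that means "missing" before
# computing any descriptive statistics.
households = households[households["hhincome"] != 9999999]

print(len(households))  # 2 usable households remain
```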

    I recorded part of one class and put it on YouTube. It's not great but it gives you an idea of what I do.

    Filtering IPUMS Data in Excel

    This was recorded during one of my classes on February 12, 2016. From about 3:50 through 4:01 there's a lag due to a student's question that cannot be heard in the video. The data is a sample from the 2010 US Census that was obtained through IPUMS USA, University of Minnesota, www.ipums.org.
    ------------------------------
    Thomas Groleau
    Carthage College
    Kenosha WI




  • 9.  RE: Resources for data preparation

    Posted 04-30-2016 07:07

    Thomas Groleau,

    Thank you for the IPUMS reference and instructions. I have a data set I have developed over the years, but it is nice to have another source.

    Kaye

    ------------------------------
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    President, UCA Faculty Senate, AY 2016-2017
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Kaye McKinzie, Ph.D.
    NIS, College of Business, 305 C
    University of Central Arkansas
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    201 Donaghey Ave.
    Conway, AR 72035



  • 10.  RE: Resources for data preparation

    Posted 05-01-2016 17:56

    Here's a common problem with data cleansing.  You receive a spreadsheet containing the data you requested from a client and begin editing that data in Excel, fixing spellings, eliminating useless records, etc.  The next day, you receive word from the client that the data you received was not the right data, along with an updated version.  You start the time-consuming process of cleaning up the new data and realize that you are not sure you recall all of the checks and edits you made to the prior version.  Instead of operating directly on the data, you might use ETL-like tools (e.g., those included with popular data mining packages) to create a stream, starting with importing the data and applying various transformations/fixes, which can be saved and applied to new versions of the data set with ease, achieving consistent and rapid cleansing.  Moreover, the original data set is preserved in case questions regarding quality, conventions, etc. arise over time.
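    The same point holds even without a dedicated ETL tool: if each fix is a function and the pipeline is just an ordered list of functions, rerunning everything on a corrected data file is a one-liner.  A minimal sketch; the specific fixes and the correction table are invented for illustration:

```python
def fix_spelling(rows):
    """Apply a saved correction table instead of one-off hand edits."""
    corrections = {"Chigaco": "Chicago"}  # illustrative correction table
    return [{**r, "city": corrections.get(r["city"], r["city"])} for r in rows]

def drop_useless(rows):
    """Drop records with no city value."""
    return [r for r in rows if r["city"]]

# The saved, repeatable "stream": order matters, originals are never modified.
PIPELINE = [fix_spelling, drop_useless]

def clean(rows):
    for step in PIPELINE:
        rows = step(rows)
    return rows

# When the client sends version 2 of the data, the same pipeline applies unchanged.
v2 = [{"city": "Chigaco"}, {"city": ""}, {"city": "Dallas"}]
print(clean(v2))  # [{'city': 'Chicago'}, {'city': 'Dallas'}]
```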

    ------------------------------
    Erick Wikum
    Principal Scientist
    TCS
    Maineville OH



  • 11.  RE: Resources for data preparation

    Posted 05-02-2016 09:07

    Thank you for all the comments related to Data Preparation.  Please keep them coming. 

     

    Any recommendations for specific tools to use for data cleaning?  In our curriculum, we have access to some commercial data mining software that also has some data preparation capabilities.  I am trying to move away from this software to open source (Python, for example) simply because the software is very expensive.  As such, I don't see many of my students having access to this commercial software once they graduate, as many of them go to work for small to medium-sized organizations that are not focused on analytics.

     

    Jerry

     


     

     






  • 12.  RE: Resources for data preparation

    Posted 05-03-2016 11:29

    Jerry,

    I've lost track of what all has been mentioned, but in case this wasn't, there's a Coursera course out of Johns Hopkins ("Getting and Cleaning Data") that you might want to look at. It uses R.

    There are a number of R packages for getting data, tailored to different tasks (inhaling Excel spreadsheets, scraping data from web pages, using web APIs, accessing specific open databases, ...). One just popped up recently that apparently will extract tables from PDF files and convert them into what R calls data frames.

    For cleaning data in R, I think the two key packages are dplyr and tidyr, both by Hadley Wickham. They simplify the most common manipulations for massaging someone else's data into a format you are comfortable working with.

    Cheers,

    Paul

    ------------------------------
    Paul Rubin
    Professor Emeritus
    Michigan State University
    East Lansing MI



  • 13.  RE: Resources for data preparation

    Posted 05-11-2016 11:48

    A colleague just pointed me to a set of free online courses called the "School of Data". In particular, they have one titled "A gentle introduction to cleaning data". The focus seems to be on data sets housed in spreadsheets. (They have other courses that include getting data from online repositories and so forth.) Course content appears to be covered by a Creative Commons license, so there should be no problem adopting/adapting it for use in your class(es).

    Paul

    ------------------------------
    Paul Rubin
    Professor Emeritus
    Michigan State University
    East Lansing MI