INFORMS Open Forum

  • 1.  Looking for dirty data

    Posted 12-28-2016 09:58

    I have an unusual question for folks.  Does anyone have any good sources for semi-dirty or dirty data for class projects? 

     

    I am teaching a class in unstructured data analysis for undergraduates. The first part of the class will look at cleaning data using OpenRefine. Most of the data sets that I have are relatively clean, and the data sets available online also seem to be relatively clean. Thus, counter to the usual request, I am looking for data sets that need to be cleaned that I can give as class projects for students to work on. While many organizations have dirty data, I have found it difficult to get them to share any of it for proprietary reasons.

     

    Any ideas are appreciated.  You can reply to this posting or email me directly at jflatto@uindy.edu.

     

    Jerry Flatto

     

     

    "No trees were harmed in the sending of this message; however, a large number
    of electrons were slightly inconvenienced..."


    Dr. Jerry Flatto, Professor, Information Systems Department - School of Business

    University of Indianapolis, Indianapolis, Indiana, USA, jflatto@uindy.edu

     


     



  • 2.  RE: Looking for dirty data

    Posted 12-29-2016 10:39

    In my work, we come across lots of dirty data (also proprietary and unshareable), and the challenge is always determining truth. The solution often involves generating some simulated data and trying to accurately clean it up.

    So have you considered taking a clean dataset and "dirtying" it? If you dirty up the data on your own, you can test the accuracy of different techniques against various forms of dirt (e.g., missing data, misspelled entries, swapped entries, erroneous entries--intentional and unintentional).
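
    As a rough illustration of that idea, here is a minimal Python sketch that injects three of the kinds of dirt listed above (missing, misspelled, and swapped entries) into a clean table and keeps a log of what was changed, so a cleaning technique can be scored against known truth. The function name `dirty_records`, the `rate` and `seed` parameters, and the sample rows are all hypothetical, and the sketch assumes every value is a string.

```python
import random

def dirty_records(records, rate=0.3, seed=0):
    """Return (dirty_copy, log); log holds (row index, kind of error) pairs.

    Assumes each record is a dict whose values are all strings.
    """
    rng = random.Random(seed)              # seeded so the dirt is reproducible
    dirty = [dict(r) for r in records]     # never modify the clean originals
    log = []
    for i, row in enumerate(dirty):
        if rng.random() >= rate:           # leave most rows clean
            continue
        fields = list(row)
        kind = rng.choice(["missing", "misspelled", "swapped"])
        if kind == "missing":
            row[rng.choice(fields)] = ""
        elif kind == "misspelled":
            f = rng.choice(fields)
            s = row[f]
            if len(s) < 2:
                continue
            j = rng.randrange(len(s) - 1)
            # transpose two adjacent characters, a common typing error
            row[f] = s[:j] + s[j + 1] + s[j] + s[j + 2:]
        else:
            if len(fields) < 2:
                continue
            f, g = rng.sample(fields, 2)
            row[f], row[g] = row[g], row[f]   # values land in the wrong columns
        if row != records[i]:              # log only rows that really changed
            log.append((i, kind))
    return dirty, log

# Hypothetical sample data
vehicles = [
    {"make": "Ford", "model": "Mustang"},
    {"make": "Dodge", "model": "Charger"},
    {"make": "Honda", "model": "Civic"},
]
dirty_vehicles, error_log = dirty_records(vehicles, rate=0.5, seed=42)
```

    Because the log records which rows were altered and how, students could compare their cleaned output to the original and report accuracy per error type.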

    That being said, if you find some, do share with the group.

    ------------------------------
    William Christian
    Severn MD



  • 3.  RE: Looking for dirty data

    Posted 12-30-2016 12:31

    The databases available online from the Federal Railroad Administration are dirty. There is the Railroad Equipment Accident database and the Grade Crossing accident database. The databases contain accident reports with missing, incorrectly coded, and misplaced entries, as well as incorrect latitude and longitude values.
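
    For the latitude and longitude problem, a class exercise might start with a simple plausibility check like the sketch below. The function name `coordinate_problems` and the bounding box are assumptions: the box is only a rough rectangle around the United States (including Alaska and Hawaii), not an official boundary, and the check says nothing about whether a point actually lies on a rail line.

```python
# Rough, assumed bounding box for the United States (including Alaska and Hawaii)
US_LAT = (18.0, 72.0)
US_LON = (-180.0, -65.0)

def _inside(value, bounds):
    return bounds[0] <= value <= bounds[1]

def coordinate_problems(lat, lon):
    """Return a list of reasons a (lat, lon) pair looks wrong; empty if it looks fine."""
    if lat is None or lon is None:
        return ["missing"]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ["out of range"]
    if _inside(lat, US_LAT) and _inside(lon, US_LON):
        return []
    if _inside(lon, US_LAT) and _inside(lat, US_LON):
        return ["possibly swapped"]           # latitude and longitude exchanged
    if _inside(lat, US_LAT) and _inside(-lon, US_LON):
        return ["longitude sign possibly dropped"]
    return ["outside US"]
```

    Flagging the likely cause (swapped values, a dropped minus sign) rather than just rejecting the record gives students a starting point for repairing it.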

    The website is at FRA-Homepage

    I hope this is useful.

    ------------------------------
    Trefor Williams
    Professor of Civil Engineering
    Rutgers University
    Princeton NJ



  • 4.  RE: Looking for dirty data

    Posted 12-30-2016 14:23

    I've come across a couple of messy data sets.

    One is traffic violations in Montgomery County, Maryland (Traffic Violations | Open Data Portal). The vehicle information is very inconsistent. For example, Ford Mustangs could be Mstg, 2D Mustng, 2DT Mstng, etc. I think there are even some mismatches in brand and model (like a Dodge Mustang).
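
    To show how variants like these might be collapsed to one spelling, here is a small hand-rolled Python sketch. It is not OpenRefine's clustering and not anything from the county's portal. It assumes a hypothetical reference list `CANONICAL_MODELS`, assumes that tokens containing digits (such as "2D") are body-style codes rather than part of the model name, and treats an abbreviation as a match when its consonants appear in order within a reference name.

```python
import re

# Hypothetical reference list; a real one would come from a vehicle catalog
CANONICAL_MODELS = ["Mustang", "Focus", "Civic"]

def skeleton(text):
    """Lowercase, drop tokens containing digits, then strip non-letters and vowels."""
    tokens = [t for t in text.lower().split() if not any(c.isdigit() for c in t)]
    letters = re.sub(r"[^a-z]", "", "".join(tokens))
    return re.sub(r"[aeiou]", "", letters)

def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)

def match_model(raw, models=CANONICAL_MODELS):
    """Return the single reference model that `raw` abbreviates, or None."""
    key = skeleton(raw)
    if not key:
        return None
    hits = [m for m in models if is_subsequence(key, skeleton(m))]
    return hits[0] if len(hits) == 1 else None   # refuse to guess when ambiguous

for raw in ["Mstg", "2D Mustng", "2DT Mstng"]:
    print(raw, "->", match_model(raw))
```

    This only handles dropped letters. Entries with wrong or extra letters, and very short abbreviations that fit several models, come back as None and would still need manual review or a fuzzier method.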

    The Chicago data portal (City of Chicago | Data Portal) has many data sets. I haven't looked at it in a while, but there were some oddities in the building permits. There are many permits issued for jobs estimated to cost $0, and some of the contractor fields seem inconsistently used.

    The Chicago data also has Red Light Camera violations. I don't recall any obvious errors in the data, but there are several records with a missing Camera ID. Those records have addresses, which can help clean up the data, but many addresses have two cameras (east-west & north-south), so it might be impossible to completely clean/fix this data.
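
    The address-based repair described above can be sketched in Python as follows. The field names `address` and `camera_id` are hypothetical stand-ins for whatever the portal's columns are called. A missing ID is filled only when the complete records show exactly one camera at that address. Records at two-camera addresses are returned separately as unresolved.

```python
from collections import defaultdict

def fill_camera_ids(records):
    """Fill missing camera IDs from the address where that is unambiguous.

    `records` is a list of dicts with hypothetical keys 'address' and 'camera_id'.
    Returns (filled_copy, unresolved), leaving the input untouched.
    """
    # Learn which camera IDs appear at each address from the complete records
    cameras_at = defaultdict(set)
    for r in records:
        if r["camera_id"]:
            cameras_at[r["address"]].add(r["camera_id"])

    filled, unresolved = [], []
    for r in records:
        r = dict(r)                               # work on a copy
        if not r["camera_id"]:
            candidates = cameras_at.get(r["address"], set())
            if len(candidates) == 1:              # one camera: safe to fill
                r["camera_id"] = next(iter(candidates))
            else:                                 # zero or two-plus: cannot decide
                unresolved.append(r)
        filled.append(r)
    return filled, unresolved
```

    The size of the unresolved list gives a concrete measure of how much of the data cannot be fixed from the address alone.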

    ------------------------------
    Thomas Groleau
    Business Division Chair
    Carthage College
    Kenosha WI