INFORMS Open Forum

  • 1.  Looking for dirty data

    Posted 01-02-2017 17:18

    Hi all.  I recently posted a question about finding dirty data for a class project using OpenRefine, a very nice open-source (i.e., free) software tool.  I posted my question on a couple of different sites that I am on, including the American Statistical Association mailing list and the INFORMS mailing list.  I received a number of useful responses from folks providing ideas, as well as some postings from folks who are in a similar situation.  I want both to thank everyone who responded and to share the responses I received.  Depending on what mailing list you are on, you may have seen some of the responses but not all.

     

    In addition to the responses on the mailing lists, I also performed some Google searches and wanted to share that information also.

     

    I have included the posts below.  I tried to include 5 or 6 blank lines between the postings to separate them.  I also removed names from the postings to protect the guilty.  :-)

     

    I also realize that posting this message back to the mailing lists will make for a very long posting that may repeat some postings that already appeared.  However, since a number of folks sent me messages individually and I do not know where they saw the original posting, ASA or INFORMS, I prefer to err on the side of making sure that everyone sees all the responses.

     

    Jerry

     

    "No trees were harmed in the sending of this message; however, a large number
    of electrons were slightly inconvenienced..."


    Dr. Jerry Flatto, Professor, Information Systems Department - School of Business

    University of Indianapolis, Indianapolis, Indiana, USA  jflatto@uindy.edu

     


     

     

    I came across the following webpage related to dirty data:  http://opendata.stackexchange.com/questions/1850/collection-of-messy-data  About halfway down the page, I found a discussion related to GEDIS, a free online test data generator. The individual provided details on how to start with clean data and "mess it up" using GEDIS.  This might be well suited for class projects, since the instructor can generate multiple sets of data and keep copies of both the original clean data and the messy data.  http://www.gedis-studio.com/  The webpage linked at the start of this paragraph also includes other suggestions for dirty data.

     

     

     

     

    Hello,

    I saw your request for uncleaned data sources. The NYPD has annual stop-question-frisk data available publicly at  http://www.nyc.gov/html/nypd/html/analysis_and_planning/stop_question_and_frisk_report.shtml  We used it in our freshman-level statistics course. We did some initial cleaning of the data (e.g., removing fields, converting yes/no to binary) just to simplify the task for our students, but there were still some good data-cleaning opportunities for the students. For instance, many empty fields were coded as 999; students had to notice that there were 999-year-old people in NYC and do some sanity checking to determine what a reasonable filter would be.
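    A minimal sketch of that sanity check in Python (pandas), assuming the data has been exported to CSV and that the age field is named "age"; the file name and column name are assumptions, so check them against the NYPD codebook:

        import numpy as np
        import pandas as pd

        # Hypothetical file and column names; verify against the actual download.
        stops = pd.read_csv("sqf_2015.csv", low_memory=False)

        # Fields coded 999 are really "missing"; a 999-year-old New Yorker is a red flag.
        stops["age"] = pd.to_numeric(stops["age"], errors="coerce")
        stops.loc[(stops["age"] < 10) | (stops["age"] > 100), "age"] = np.nan

        print(stops["age"].describe())      # sanity-check the cleaned distribution
        print(stops["age"].isna().mean())   # share of stops with unusable age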

     

    Cheers,

     

     

     

     

     

    I noticed your post on the Informs email list looking for dirty data.

    I used some loan data from Lending Club for a class project on logistic regression. It's somewhat sparse and slightly dirty, depending on your definition of dirty. It did require some data cleaning, which is an excellent exercise in itself.

    This is the link:
    https://www.lendingclub.com/info/download-data.action
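    A minimal sketch of the kind of cleaning involved, in Python (pandas). The file layout and column names below (an extra notice row at the top, "int_rate" stored as strings like "13.5%", and "loan_status") are assumptions about the download, so adjust them to match the actual file:

        import pandas as pd

        # skiprows=1 assumes the first row is a notice line rather than the header.
        loans = pd.read_csv("LoanStats.csv", skiprows=1, low_memory=False)

        # Interest rates arrive as strings such as "13.56%"; strip the sign and coerce to float.
        loans["int_rate"] = pd.to_numeric(loans["int_rate"].str.rstrip("%"), errors="coerce")

        # Build a binary target for logistic regression, dropping ambiguous or unparseable rows.
        status_map = {"Fully Paid": 0, "Charged Off": 1}
        loans["default"] = loans["loan_status"].map(status_map)
        loans = loans.dropna(subset=["default", "int_rate"])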

     

     

     

    I like your junk data idea. Here are some suggestions.

     

    A standard method is to start with a nice clean data set and then introduce problems: outliers, data that is missing at random, and data that is missing not at random.
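    A minimal sketch of that approach in Python, using a synthetic data set (the variable names, sizes, and corruption rates are arbitrary illustrations, not anything taken from the postings):

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(42)
        clean = pd.DataFrame({"x": rng.normal(50, 10, 500),
                              "y": rng.normal(100, 20, 500)})
        dirty = clean.copy()

        # Outliers: inflate a handful of x values by an order of magnitude.
        idx = rng.choice(dirty.index, size=5, replace=False)
        dirty.loc[idx, "x"] *= 10

        # Missing at random: y is more likely to be missing when the observed x is large.
        mar = (dirty["x"] > dirty["x"].quantile(0.8)) & (rng.random(len(dirty)) < 0.5)
        dirty.loc[mar, "y"] = np.nan

        # Missing not at random: large y values themselves are preferentially lost.
        mnar = (dirty["y"] > dirty["y"].quantile(0.9)) & (rng.random(len(dirty)) < 0.5)
        dirty.loc[mnar, "y"] = np.nan

    Keeping the clean copy lets the instructor check afterwards how well a student's cleaning or imputation strategy recovered it.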

     

    There is lots of air quality data on the EPA website.  Mostly people just accept it at face value; I don't think the data quality has been carefully evaluated, or at least I've not run across a careful treatment.  Of interest would be the South Coast air basin, basically Los Angeles.  There are multiple air quality monitoring sites, and people often use some sort of funky average over the basin.  There should be lots of air components beyond the usual small particles (PM2.5) and ozone.  Ozone is a trip to examine: levels vary with the time of day and likely with the monitoring site, and there is a lot of ad hoc treatment.  But there are lots of other components, minerals, etc.  (Look around for a paper by Zanobetti, who is at Harvard; it covers some 18 minor components.  I'm not able to quickly find it, but I have it somewhere.)

    Some of those components are not collected every day.  What to do with the missing days?  Students might be able to infer missing data from relationships among the components.  For LA it is also easy to get weather data.  Then there are forest fires that increase PM2.5.  You can chase down satellite images of LA.
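    A minimal sketch of that "infer the missing days from related components" idea, using a simple linear regression in Python. The file and column names (la_basin_daily.csv, pm25, nitrate, sulfate) are invented for illustration; the EPA files use their own naming:

        import pandas as pd
        from sklearn.linear_model import LinearRegression

        air = pd.read_csv("la_basin_daily.csv", parse_dates=["date"])

        # Fit on the days where the sparsely sampled component was actually observed...
        train = air.dropna(subset=["sulfate", "pm25", "nitrate"])
        model = LinearRegression().fit(train[["pm25", "nitrate"]], train["sulfate"])

        # ...and predict it on the days where only the daily components are available.
        missing = air["sulfate"].isna() & air[["pm25", "nitrate"]].notna().all(axis=1)
        air.loc[missing, "sulfate"] = model.predict(air.loc[missing, ["pm25", "nitrate"]])

    Whether such a fill-in is defensible is itself a good discussion question for students.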

     

    Don't get me started on people not sharing data. Your students should ask for data just to learn that sharing is not common practice.

    Hopefully they will get angry about the situation. Write their congressman. Etc. We paid for the data and researchers mostly will not share.

     

    Let me know how your project progresses.

     

     

     

     

    For several years I've had my class collect data like this:

     

    Choose a book you like and find it on Amazon.

    Scroll down to find their data about the book (dimensions, weight, prices, pub date, etc.).

    Choose a random number between 1 and 10 (random.org, for example).

    Count out that many books along the "people who purchased this also..." list.

    That's your next book.

     

    rinse and repeat.

     

    Merge the class data.

     

    Amazon is inconsistent about the order of dimensions, for example.

    Textbooks are much more expensive than novels.

    Coffee table books are just different.

    Kids screw up collecting and recording data.

     

    In about 100 books you are almost guaranteed some errors and some legitimate outliers

     

    Pass out a spreadsheet for recording them, so you can combine them easily.

    Have them include the URL of each book, so the outliers can easily be checked and corrected when they do their data cleaning.
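    A minimal sketch of the merge-and-check step in Python (pandas), assuming each student turns in a CSV with the same columns, including "url" and a numeric "price" (the file locations and the 1.5-IQR rule are illustrative choices):

        import glob
        import pandas as pd

        # Combine every student's spreadsheet into one table.
        books = pd.concat((pd.read_csv(f) for f in glob.glob("submissions/*.csv")),
                          ignore_index=True)

        # Flag candidate outliers with a simple IQR rule, so students can follow the URL and recheck them.
        q1, q3 = books["price"].quantile([0.25, 0.75])
        iqr = q3 - q1
        suspect = books[(books["price"] < q1 - 1.5 * iqr) | (books["price"] > q3 + 1.5 * iqr)]
        print(suspect[["url", "price"]])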

     

     

     

     

     

    I have worked a bit with data available at http://data.cambridgema.gov and found that it usually needs some work to be usable. In particular the tree data has many missing values and inconsistent coding. The assessor's data might also be a good source.

    https://data.cambridgema.gov/Public-Works/Street-Trees/ni4i-5bnn

    https://data.cambridgema.gov/Assessing/Cambridge-Property-Database-FY16-FY17/eey2-rv59
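    A minimal sketch of a first audit of the street-tree file in Python (pandas), assuming it has been exported from the portal as a CSV; the file name and the "species" column name are assumptions, so check them against the actual export:

        import pandas as pd

        trees = pd.read_csv("street_trees.csv", low_memory=False)

        # How much of each column is missing?
        print(trees.isna().mean().sort_values(ascending=False).head(10))

        # Inconsistent coding often shows up as near-duplicate category labels.
        print(trees["species"].str.strip().str.lower().value_counts().head(20))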

     

    I imagine most cities' open data portals are similar. I know Boston and San Francisco have extensive open data portals.

     

     

     

     

    Data.gov may be a good source.  Also look at the hazardous liquids data reference I posted to LinkedIn (it's dirty in more than one way!).  It's good to stress the importance of understanding why the data are dirty/messy (and to fix it at the source when possible); otherwise the seemingly 'good' data may not be credible either.  Good topic - good luck.

     

     

     

    You might want to look at the old IRI Marketing Data Set:

     

    http://pubsonline.informs.org/doi/suppl/10.1287/mksc.1080.0450

     

    Some cleaning has been done, but there are known issues:

    • Panelists who don't report regularly (there are files to address this in the dataset, but you could have the students create their own)
    • Hi-cone (see the notes about 0-unit observations in the dataset)

     

    Good luck.

     

     

     

     

    Hi, try looking for the Mexican climate database; that one is really messy.

     

     

     

     

    In my work, we come across lots of dirty data (all of it proprietary and unshareable), and the challenge is always determining the truth. The solution often involves generating some simulated data and trying to clean it up accurately.

    So have you considered taking a clean dataset and "dirtying" it? If you dirty up the data on your own, you can test the accuracy of different techniques against various forms of dirt (e.g., missing data, misspelled entries, swapped entries, erroneous entries--intentional and unintentional).
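    A minimal sketch of that idea for text fields in Python (pandas); the values and corruption rates are arbitrary illustrations. The point is that keeping the untouched original lets you score any cleaning technique afterwards:

        import random
        import pandas as pd

        random.seed(1)
        clean = pd.DataFrame({"city": ["Indianapolis", "Bloomington", "Fort Wayne"] * 100})
        dirty = clean.copy()

        def misspell(s):
            """Drop one random character to simulate a typo."""
            i = random.randrange(len(s))
            return s[:i] + s[i + 1:]

        # Misspell roughly 10% of the entries.
        for i in dirty.sample(frac=0.10, random_state=1).index:
            dirty.loc[i, "city"] = misspell(dirty.loc[i, "city"])

        # Swap one adjacent pair of entries to simulate a data-entry slip.
        dirty.loc[[0, 1], "city"] = dirty.loc[[1, 0], "city"].values

        # Any cleaning attempt can now be scored against the untouched original.
        cleaned = dirty.copy()   # a student's cleaning step would replace this line
        accuracy = (cleaned["city"] == clean["city"]).mean()
        print(f"fraction of entries matching the clean original: {accuracy:.2%}")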

    That being said, if you find some, do share with the group.

     

     

     

    One standard demonstration set is federal records of contracting.

     

    These data sets are always bad. Mixed measures, missing entries, spelling errors...

     

     

     

    The databases available online from the Federal Railroad Administration are dirty. There is the Railroad Equipment Accident database and the Grade Crossing Accident database. These databases contain accident reports with missing, incorrectly coded, and misplaced entries, as well as incorrect latitude and longitude values.
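    A minimal sketch of screening the coordinate fields in Python (pandas), assuming the FRA export is a CSV with columns named "Latitude" and "Longitude"; those names, the file name, and the rough continental-US bounding box are assumptions to adjust (the box would also flag Alaska and Hawaii):

        import pandas as pd

        crossings = pd.read_csv("fra_grade_crossing_accidents.csv", low_memory=False)

        lat = pd.to_numeric(crossings["Latitude"], errors="coerce")
        lon = pd.to_numeric(crossings["Longitude"], errors="coerce")

        # Coordinates that are missing or fall outside a rough continental-US box are suspect.
        bad = lat.isna() | lon.isna() | ~lat.between(24, 50) | ~lon.between(-125, -66)
        print(f"{bad.mean():.1%} of records have missing or implausible coordinates")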

    The website is at FRA-Homepage

    I hope this is useful.

     

     

     

    Hello,

    Please try the USDA's ERS data sets.  They have some data tables with suppressed values in them; those can be difficult to correct.

     

     

     

     

    @JerryFlatto you could use the data set linked in this lesson http://data-lessons.github.io/library-openrefine/01-introduction/ @OpenRefine

     

     

     

     

     

    Try the height and weight data in the NYC Stop & Frisk database. See: www.nyc.gov/html/nypd/html/analysis_and_planning/... and journals.plos.org/plosone/article?id=10.1371/....

     

    Best Wishes,

     

     

     

     

    How about data from open data portals, universities, and government? More and more data is being shared these days. Open portals are especially appealing. Not only is it real data, but you have at least one portal close to you, making the project more interesting to your students (http://data.indy.gov/).

    Just to give you an idea, here are some examples based on the Puerto Rico open data portal:

    university admission data - almost 70,000 admissions, including the campus admitted to, program, IGS (a function of SAT and GPA), GPA, the high school attended, and gender. Among the projects students can do are comparing mean IGS across programs, high schools, universities, or genders. Another possibility is estimating the minimum IGS a school program has. The data includes misspellings of schools and programs, among other challenges.

    annual average traffic volume - we noticed that this data has duplicates which need to be removed before any statistical procedure (a short de-duplication sketch appears at the end of this posting). Also, the data has the type of street (primary, etc.) in both Spanish and English.

    crime data - murder rates from different police regions can be compared (we did a great case study comparing murder rates between regions on the island, comparing the island murder rate with the US and some states, and also comparing the US murder rate to other developed countries). Many other crimes are available.

    power consumption - evaluate power consumption from different sectors throughout time.

    high school graduation rates - evaluate the rates based on school, gender, poverty level, etc. Technically, these rates are parameters, not statistics. A possibility is to use the rates to make inferences about next year's rates (but probably not too far into the future).

    Here's an example case study: data.pr.gov/Seguridad-P-blica/...

    We have developed R code for simple case studies for those interested. It completely reproduces the case study, even installing or loading whatever libraries are necessary. You can find R code for this and other examples here: github.com/EstadisticasPR/Examples-for-Data.PR.gov
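    In the same spirit as those R case studies, here is a minimal Python (pandas) sketch of the de-duplication and group-mean comparison mentioned above; the file and column names (traffic_volume.csv, university_admissions.csv, program, igs) are invented for illustration:

        import pandas as pd

        # Traffic-volume file: drop exact duplicate records before doing any statistics.
        traffic = pd.read_csv("traffic_volume.csv").drop_duplicates()

        # Admissions file: compare mean IGS across programs.
        admissions = pd.read_csv("university_admissions.csv")
        print(admissions.groupby("program")["igs"].agg(["mean", "count"]).sort_values("mean"))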

     

     

     

     

    The Uniform Crime Report (UCR) data has many missing values, and some values may be incorrect. Take murder or rape as an example. Another example is data from the CDC on pediatric cancers. Many values are suppressed.

     

     

     

     

    For the past few years I have been working with students on analyzing the police records from the NYPD.  These stop and frisk datasets (www.nyc.gov/html/nypd/html/analysis_and_planning/...)  contain individual police reports for every stop made.

     

    I also have a list of sites to find data on my website. Most of the "Large Data Sets" are somewhat messy.

    web.grinnell.edu/individuals/kuipers/stat2labs/...

     

    Robin Lock gave a very nice presentation at JSM 2016 about finding datasets online, ww2.amstat.org/meetings/jsm/2016/onlineprogram/....  

     

     

     

     

    John Holcomb had a Datasets and Stories article in JSE a few years back on this topic (with a dataset for students to clean).

    ww2.amstat.org/publications/jse/v13n3/...

     

     

     

    I would recommend data from this very neat article about the proposed Dakota Access Pipeline.

    Story: Contextualizing the Dakota Access Pipeline: A roundup of visualizations - Storybench


    Since August, thousands of Americans have flowed into North Dakota to help the Standing Rock Sioux and dozens of other Native American groups protest the construction of a 1,172-mile crude oil pipeline that will cross the Missouri River and encroach upon the tribe's drinking water and sacred grounds.


    Data (csv format): www.nytimes.com/newsgraphics/2014/09/30/spills-database/...

    Best,