INFORMS Open Forum

Responses to "Resources for R and Python"


    Posted 03-29-2016 12:46

    I recently posted a request for resources related to R and Python.  Thank you to everyone who responded.  All the responses are provided below.

     

    After some research and thinking, I am planning to go with Python in my classes.  I am including my thoughts on how I arrived at this decision in case this might be helpful to others in a similar situation.  Feel free to email me at jflatto@uindy.edu to kick this around or to tell me why I should rethink my plan.  While leaning towards Python, I can still be swayed.  :-)

     

    My business students generally do not know programming and are not generally going to be statistical experts.  I do not see them pushing the boundaries of data science but rather working for organizations that want to improve their decision-making process but will not be "bleeding edge" in most cases.

     

    Rather, I see them spending time capturing data from various sources and having to clean the data before the analysis.  As such, Python seems to be a better fit.  I also see more natural language processing in the curriculum, which Python seems to handle better.  I incorporate Tableau in the curriculum, which helps with visualization.  I do not have a philosophical issue with open source versus commercial software; rather, I do not want to use commercial software so expensive that my students are very unlikely to have it after they graduate.  Tableau is popular enough that I can easily see my students having it available.  Some of my other commercial software is just "too expensive" for many companies to have.

     

    As for the option of teaching them R and Python, I am concerned that if I go this route, the students will not get enough depth in either one to be "dangerous".

     

    Some of the online discussions I have looked at on R versus Python include:

     

    http://www.dataschool.io/python-or-r-for-data-science/

     

    https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis

     

    https://www.dataquest.io/blog/python-vs-r/

     

    Jerry

     

     

    "No trees were harmed in the sending of this message; however, a large number
    of electrons were slightly inconvenienced..."


    Dr. Jerry Flatto, Professor, Information Systems Department - School of Business

    University of Indianapolis, Indianapolis, Indiana, USA mailto:jflatto@uindy.edu

     


     

     

    This is probably the best resource I have found so far, and it is available for free: https://cran.r-project.org/doc/contrib/Zhao_R_and_data_mining.pdf

     

     

     

    You may find this useful:

     

    > https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis

     

     

     

     

    RStudio is a good IDE and the server version is free for universities.

     

    Feel free to use my slides.

     

    http://richardtwatson.com/dm6e/Reader/slides.html

    Chapters 14-18

     

     

     

    Hi Jerry, in 2013 a graduate student and I developed a set of five R tutorials that we submitted to a competition but never heard back about.  Your request reminded me of them, and I just uploaded them to the Teradata University Network.  Have you been there yet, by the way?  It's teradatauniversitynetwork.com, and a lot of faculty upload their teaching materials there to share.

    Here's the link to the material on TUN:

    http://www.teradatauniversitynetwork.com/Library/Items/Five-Tutorials-for-Data-Visualization-and-Analysis-with-R/

     

     

     

     

     

    I would highly recommend DataCamp (https://www.datacamp.com/home), a site with several online courses that specialize in R, statistics and analytics (and to a lesser degree, Python). The format is short videos followed by hands-on exercises hosted on their cloud R service. I haven't used it for teaching, but this is what I've been using to learn R myself, and I find the quality of the content and pedagogy to be excellent (with the sole exception of their data.table course). The academic price is USD 9 per month, but a couple of introductory courses are free, and the first chapter of every course is free, so you can easily try it out.

     

     

     

    Your choice to provide instruction in R is wise. I wish I had learned it during my Ph.D.  I am learning it now.  The learning curve is steep at first but R is much more powerful and flexible than SPSS.  

     

    There are quite a few free books available in pdf format online. 

     

    R for Beginners https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

    The R Inferno.  http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

    Statistics with R is a webpage (http://zoonek2.free.fr/UNIX/48_R/all.html) but they provide a pdf version of their site.  http://zoonek2.free.fr/UNIX/48_R/all.pdf.bz2

    R tips. http://pj.freefaculty.org/R/Rtips.pdf

    http://cran.r-project.org/doc/manuals/

    http://heather.cs.ucdavis.edu/~matloff/132/NSPpart.pdf

    https://media.readthedocs.org/pdf/little-book-of-r-for-multivariate-analysis/latest/little-book-of-r-for-multivariate-analysis.pdf

     

    There are free books and resources available on specific topics as well. 

     

    An Introduction to Statistical Learning with Applications in R. http://www-bcf.usc.edu/~gareth/ISL/

    http://ggplot2.org/book/qplot.pdf

     

    There are some really good channels on YouTube that provide instruction on R.

     

    https://www.youtube.com/channel/UC0MxOB6BCL976Dm2kPK-HgA

    https://www.youtube.com/user/TheLearnR

    https://www.youtube.com/user/marinstatlectures

    https://www.youtube.com/user/Tutorlol

    https://www.youtube.com/channel/UClYj39vwP_hdlG8mBXz2y4w

     

    I have watched videos on YouTube. I have taken courses on Udemy. The best instruction I have taken so far has been on www.datacamp.com. Learning how to use R by watching videos is a bit like learning mathematics by watching someone else do it.  The only real way to learn is by doing it.  It is important to do a lot of exercises. DataCamp allows me to do that. However, it is not free.

     

     

     

     

    This might not be aimed at the right audience for you, Jerry, but there are some great resources for the beginner here:

     

    http://inventwithpython.com/

     

    With a free online book:

     

    http://inventwithpython.com/chapter1.html

     

     

     

     

    For Python, I recommend "Python for Data Analysis" by McKinney (O'Reilly Media).  The author is the creator of the 'pandas' library, which is very useful for data preparation, and he covers a bit of visualization as well.  If your students need to start from scratch in the language, I've heard great reviews of Learn Python the Hard Way (learnpythonthehardway.org); it's a free text, but they can pay a small fee for video lessons.

     

    Definitely use Anaconda Scientific Python Distribution from https://continuum.io.  It's free, and bundles the latest versions of Python with all the commonly-used packages for data analysis and visualization.  Also, if students have Python already installed for work, Anaconda installs a separate copy so it doesn't disrupt their current installation.  Best of all, it has an "install for self" mode which means that students can install it on computer lab computers without having Administrator access... so I can bypass going to the university's IT department.

     

     

     

     

    I think considering R/Python instead of SAS/SPSS is a very good idea for analytics programs. If you are looking for books with R, you may want to consider the following:

    1. An Introduction to Statistical Learning with Applications in R by Gareth James et al.

    2. R and Data Mining: Examples and Case Studies by Yanchang Zhao

    3. Data Mining and Business Analytics with R by Johannes Ledolter

     

     

    The decision as to which one to use really depends on how much analysis and mathematics will be needed.

     

    For heavy data analysis and mathematics, here are the recommended open source options:

     

    1. R 

    2. Octave

     

    When you are ready to take an algorithm to a production state or drop it into another application's workflow, Python with either the pylab or pandas packages is the way to go.

     

    For machine learning, R and Octave are again the best for developing the math; Python is the preferred language for implementing it in an application.

     

    For resources, here are some courses/tutorials that my team has found useful:

     

    https://www.udemy.com/r-programming/

    https://www.udemy.com/applied-data-science-with-r/

    https://www.udemy.com/applied-data-science-with-python/

    https://www.udemy.com/data-analysis-in-python-with-pandas/

     

    Although it is a bit dated at this point, I still really love Stanford's course on machine learning.  This course does require some pretty heavy mathematics/stats, so you might want to brush up on those things before taking it:

     

    https://www.coursera.org/learn/machine-learning

     

    One thing you didn't request is how to visualize the insights or outputs.  For this you can certainly leverage the packages of R or Python to provide some nice visualization capabilities; however, for more advanced and explorable options, there is a JavaScript library that has plugins for both R and Python, D3.js. You might want to research this as well.

     

     

    One very useful resource you may consider is PyCharm, the integrated development environment for Python by JetBrains. It is free for professors and students. You can find it here: https://www.jetbrains.com/pycharm/

     

     

    First, I think it's great that you are moving towards open source, flexible data analysis tools. This will really help your students think about what they are doing and let them be more creative. However, with that comes a price: your students need a modicum of comfort with programming, or the ability to think like a programmer, to use these tools. There are no buttons to just click on and no pretty tables to view data. It's all through programming commands.

    Here are the books that I've found most useful.

    Note: Unless noted otherwise, all the resources below have been made freely available by their authors, but they are also available for purchase from places like Amazon.com.

    R Programming Language Resources 

    • Books by Hadley Wickham (a prominent R developer who has written a lot of very useful packages for R)
      • R for Data Science: focused on using R for statistics.
      • Advanced R: focused on R as a programming language, not on how to do statistics.
    • CRAN Task Views: a page maintained by the R Project team that thematically organizes the myriad packages in R.
      • Pros: Well organized and has decent descriptions and links to many packages.
      • Cons: Not exhaustive. More experimental or relatively new packages are not always there (however, this may not be a bad thing).
    • Cookbook for R: takes a "just tell me what to do" approach to many common tasks in R.

    Python Programming Resources
    Note: There are currently two versions of Python out there: Python 2 and Python 3. Normally, the developers try to maintain backwards compatibility, but they deviated from that principle for Python 3. The vast majority of Python 2 code will run with Python 3, but there are a few gotchas. I've included a reference that I think does a good job describing both languages. I'd recommend having your students use Python 3, as it's where the language is going.
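
    To make a few of those gotchas concrete, here is a minimal sketch written for Python 3. The function name and the sample data are made up for illustration, and the comments describe Python 2 behavior from memory rather than from the reference mentioned above.

```python
def summarize(values):
    """Return (mean, floored mean) for a list of integers."""
    total = sum(values)
    # Python 3: "/" is true division and returns a float.
    # Python 2: int / int floors, so 7 / 3 would give 2.
    mean = total / len(values)
    # "//" is floor division in both Python 2 and Python 3.
    whole = total // len(values)
    return mean, whole


mean, whole = summarize([1, 2, 4])

# print is a function in Python 3; the Python 2 statement form
# (print "mean:", mean) is a syntax error here.
print("mean:", mean, "floor:", whole)

# dict.keys() returns a view in Python 3 rather than a list,
# so wrap it (here with sorted) when an actual list is needed.
scores = {"ann": 90, "bob": 85}
names = sorted(scores.keys())

# Text (str) and bytes are separate types in Python 3.
encoded = "caf\u00e9".encode("utf-8")   # bytes
decoded = encoded.decode("utf-8")       # back to str
```

    Division and print are probably the two differences beginning students will hit first when they copy older Python 2 examples from the web.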

    • Official Python 3 Documentation --  Decently written, comprehensive overview of Python's standard library.
    • Core External Packages for Data Analysis: Unlike R, Python's data science toolkit is comprised of a few "mega packages" as opposed to many small, focused packages. Also, these packages almost have a life of their own, with their own conferences and generally well-documented, decent looking web pages (unlike R's sparse help files).
      • SciPy.org: Not a package, but the SciPy organization makes most of the packages below.
      • NumPy: Convenient array-like objects that are more user-friendly than Python's base arrays for numerical computations.
      • SciPy: The package for scientific computing. It has tons of stuff, from calculus to statistics to image processing, linear algebra, optimization, and more.
        • Scikit-learn: SciPy has a number of "kits" that add additional functionality. This one has a bunch of cool machine learning algorithms with generally user-friendly APIs (so they are more accessible to non-ML experts). Since machine learning is pretty hot right now, and the idea of AI and computers learning through statistics is just plain cool, even a brief foray into this area would be well received by students (e.g., lots of classification algorithms boil down to a linear model, albeit in a transformed space).
      • Pandas: Its major contribution is the DataFrame, which is meant to have functionality similar to R's popular data.frame. It has lots of nice data import/export features too (e.g., pandas.read_csv("filename.csv") creates a nice data frame right from a local CSV file).

      • Matplotlib: Emulates a lot of MATLAB's plotting functionality, again with a generally user-friendly API.
        • Seaborn: This is a package that uses matplotlib behind the scenes, but it makes a lot of the choices for you regarding formatting and display...generally good choices ;-) I use it a lot because I don't like fiddling with tons of parameters.
    • (NOT FREE) Python Essential Reference by David Beazley. This is a very concise (but well written) reference manual on Python programming (note: it does not have a statistics focus). However, it does a good job pointing out the quirks in the language and how its internals work, so Python will seem less mysterious.
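
    As a rough illustration of the pandas item above, here is a small sketch of loading a CSV and doing some basic cleaning. It uses pandas.read_csv, the CSV reader in current pandas releases. The column names and data are invented for the example, and an in-memory string stands in for a local file such as "filename.csv".

```python
import io

import pandas as pd

# Hypothetical messy data; in class this would be a real CSV file on disk.
raw_csv = io.StringIO(
    "region,units,price\n"
    "North,10,2.5\n"
    "South,,3.0\n"
    "north ,7,2.5\n"
)

df = pd.read_csv(raw_csv)

# Clean up: trim whitespace and normalize capitalization in the text column.
df["region"] = df["region"].str.strip().str.title()

# Treat missing unit counts as zero (an assumption made for this example).
df["units"] = df["units"].fillna(0).astype(int)

# Derive a new column and summarize it by region.
df["revenue"] = df["units"] * df["price"]
revenue_by_region = df.groupby("region")["revenue"].sum()

print(df)
print(revenue_by_region)
```

    Whether a missing value should become zero, be dropped, or be flagged is a judgment call that depends on the data, which makes it a good discussion point for students.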

    New(er) Data Formats

    It may also be helpful for you to briefly describe how to use JSON and YAML data formats. They aren't super difficult to learn, but both R and Python can parse these files into useful data structures and they allow for expressing more complex data (like nested lists). It also helps if your students aren't tied to CSV files, useful as they may be for basic statistics.

    • JSON: Less "human readable" but widely used.
    • YAML: More readable and a personal favorite of mine for developing configuration files and expressing complex data.
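
    Here is a minimal sketch of the JSON half of that suggestion, using only Python's standard json module. The record and its field names are made up for illustration. YAML works much the same way but needs a third-party parser, so it is left out here.

```python
import json

# Hypothetical record with nested lists, which a flat CSV cannot express directly.
raw = """
{
  "customer": "Acme",
  "orders": [
    {"id": 1, "items": ["widget", "gear"], "total": 19.5},
    {"id": 2, "items": ["bolt"], "total": 4.25}
  ]
}
"""

# json.loads turns the text into ordinary dicts and lists.
record = json.loads(raw)

order_totals = [order["total"] for order in record["orders"]]
grand_total = sum(order_totals)
item_count = sum(len(order["items"]) for order in record["orders"])

# json.dumps goes the other way; indent and sort_keys make it easier to read.
round_trip = json.dumps(record, indent=2, sort_keys=True)

print(grand_total, item_count)
print(round_trip)
```

    Once parsed, the data is just dictionaries and lists, so students can reuse what they already know about those types.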

    Finally: Don't underestimate YouTube. There is lots of great stuff related to the above, and it's generally easier to digest a 15-minute example.

    As a practicing data scientist, I regularly use all the above items, and they have helped me learn a lot of techniques.

    Hope it helps you and your students.

     

     

    • If your students are going to work with R, they most definitely should install RStudio, which is a great IDE (dare I say "industry standard"?) for R.
    • Johns Hopkins offers a data science specialization on Coursera. The specialization itself has a fee, but the courses are free, they are based on R, and they're done well. In particular, the second course is an introduction to R programming.
    • The swirl package can be installed from CRAN. It is a learn-by-doing approach to R and related topics. Once installed, it lets you choose from a list of courses and then walks you through entering and executing code. Someone shifting from, say, Python to R might find it a tad basic, but for a beginner it's a fairly painless introduction to R coding.
    • There's an active Statistics and R Google+ community where people can seek help.

     

     

    I could recommend some textbooks, but you have enough by now. Besides, it would be useful to visit some interesting sites showing R applications. Here is a suggestion:  R Stats + Digital Analytics: 8 Blogs you should Follow

     

     

    Learning Base R,

    by Lawrence M. Leemis,

    2016, Lightning Source, ISBN: 978-0-9829174-8-0.

    Available on Amazon.

     

    *Learning Base R* provides an introduction to the R language for those with limited or no prior programming experience.  It introduces the key topics, listed below, that are needed to begin analyzing data and programming in R.

    The focus is on the R language rather than a particular application.  Nearly 200 exercises make the book appropriate for classroom use.

     

     

    You might want to take a look at R for Marketing Research and Analytics. The first half of the book focuses on basic statistical operations that ought to be fairly universal (plotting, cross-tabulating, ANOVA and linear regression).  The second half covers a variety of more specific methods that are useful in marketing, including factor analysis, choice modeling and hierarchical modeling. It wasn't intended as a textbook, but a few marketing faculty have adopted it. They are creating slides and exercises to go with the book and should be posting them in the next week or so.  You can read a review of the book in the Journal of Statistical Software.