Data Science  

Aug 13, 2015 • Michael Chen

Although Unix command-line utilities are revised and updated every year, the overall interface and logic of Unix remain the same. In contrast, data science is still a young field in its infancy. What sparks fly when infant data science meets 40-year-old Unix? Janssens shows the interaction between these two diverse computing skills in a cookbook style, demonstrating many interesting use cases of data skills on the Unix command line.

About the book:

In the first chapter, the author tells us that data science is OSEMN (obtaining, scrubbing, exploring, modeling, and interpreting data). The second chapter elucidates the basic concepts of the Unix command line, which is good for newcomers to Unix or GNU/Linux. He then explains each step, except interpreting data, in a separate chapter. In addition, three intermezzo chapters augment the reader's command-line skills. Unix is nothing new, but its application to data science is innovative and creative.

In this book, CSV (comma-separated values) is the format of central importance. Unlike XML, JSON, HTML, or other specialized text formats, CSV is line-oriented and thus well suited to the text-processing utilities of Unix. In addition, we can feed CSV files into R or Python as spreadsheets for further statistics and modeling. Therefore, csvkit and other related tools are heavily used, and the author suggests converting other file formats into CSV for further processing. Moreover, the author created several small utilities to process CSV files and perform other tasks. These scripts are not complex; however, they are handy and work well in his workflow.
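As a minimal sketch of why line orientation matters (my own example, not one from the book; the file name and columns are hypothetical), even plain coreutils can slice a CSV file:

```shell
# Create a tiny hypothetical CSV file.
printf 'region,amount\neast,10\nwest,20\neast,5\n' > sales.csv

# Because CSV is line-oriented, standard Unix tools compose naturally:
# skip the header, take column 1, then count occurrences of each region.
tail -n +2 sales.csv | cut -d, -f1 | sort | uniq -c | sort -rn

rm sales.csv
```

This prints each region prefixed by its count (east twice, west once). csvkit's `csvcut` and friends handle the cases where this naive approach breaks, such as quoted fields containing commas.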

The author introduces several utilities for data wrangling. However, sed, AWK, and Perl are not well illustrated, and Perl one-liners do not appear in this book at all. These are all important and common text-processing utilities on Unix. Not all data arrive as well-formatted CSV, JSON, or XML; sometimes data scientists still need to munge data themselves. Text-processing languages like AWK and Perl give us swift and potent tools for data that simple command-line utilities cannot handle. Moreover, AWK, Perl, and Ruby are among the few Unix tools with a native ability to deal with columns in text files, which is important in data munging. They can also run as both one-liners and full scripts, providing flexible and capable means for data scientists.
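To show what that native column handling looks like in practice (a sketch of mine, not an example from the book; the data file is hypothetical), AWK splits each line into fields automatically, so column-wise work becomes a one-liner:

```shell
# Hypothetical whitespace-separated data: a label and a numeric column.
printf 'a 1\nb 2\nc 3\nd 6\n' > scores.txt

# Mean of the second column: $2 is the second field, NR the line count.
awk '{ sum += $2 } END { print sum / NR }' scores.txt    # prints 3

# Keep only rows whose second column exceeds 2.
awk '$2 > 2' scores.txt

rm scores.txt
```

Doing the same with cut, grep, and a calculator would take several commands; this is the kind of task where AWK or Perl earn their place in a data-munging toolbox.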

Unix itself is good at grabbing and munging data, but it does not provide utilities for statistics and machine learning. We still need real data science software for further exploration, summarization, visualization, and modeling. Python and R are both interactive data science workbenches and scientific programming languages, suitable for interactive tasks and batch processing alike. In this book, the author wraps R in a Bash script and runs it from the command line. Sometimes this approach works well with other utilities, but sometimes it is suboptimal. Many computers have GUIs, and there is plenty of IDE-style software for data science, e.g. RStudio, Canopy, and Spyder. Feeding R interactive commands, for example, is easier in RStudio than from the Unix command line.
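The book's wrapper scripts are more elaborate, but the core idea can be sketched like this (a hypothetical script of mine, assuming `Rscript` is on the PATH; it is not the author's actual tool):

```shell
#!/usr/bin/env bash
# mean.sh: wrap an inline R expression so it behaves like a Unix filter,
# reading numbers from stdin and writing their mean to stdout.
# Usage (hypothetical): printf '1\n2\n3\n4\n' | ./mean.sh
Rscript -e 'x <- scan("stdin", quiet = TRUE); cat(mean(x), "\n", sep = "")'
```

Once wrapped this way, R's statistics slot into ordinary pipelines alongside grep, sort, and awk, which is exactly the appeal of the author's approach.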

Overall, this book provides interesting and useful examples and applications of data science in the Unix environment. Whether you are a newcomer to Unix (or GNU/Linux) or a sophisticated Unix user, you can pick up some tricks and skills from this book and merge them into your own workflow. GUI and CLI are just different interfaces for operating a computer; learn when each one is the proper choice for your tasks.