My Data Science Tool Box

This post describes the tools I currently use for working with data. People often ask me to recommend specific tools, and I always hesitate, because so much boils down to personal preference. I recently added a workshop to the DSS lineup providing an overview of popular tools for working with data. The core idea is that researchers have many options when choosing tools to implement a reproducible workflow. For example, it doesn't really matter whether you choose to learn R or Python; the important thing is that you write and document code of some kind so that your analysis can be reproduced. Similarly, it doesn't matter much whether you choose to use RStudio or Jupyter notebooks; the important thing is that you have a development and authoring environment that encourages good research practices. Still, inquiring minds want to know: what do you use?

The short answer is as follows:

  • Operating system: Arch Linux
  • Programming language: R and Python
  • Editor / IDE: Emacs
  • Markup language: Org Mode and LaTeX
  • Revision control system: git
  • Shell: fish

Those curious to know why I prefer these tools and how I've customized them to suit my needs and preferences can read on.

I use the Arch Linux operating system

Operating systems are often excluded from "toolkit" discussions, presumably because most popular tools are cross-platform and abstract away a lot of the differences across operating systems. Nevertheless, operating systems are not all created equal, and in my opinion Linux-based operating systems are currently the best option for working with data.

Linux is similar in many ways to OS X (they are both UNIX-like), but without the annoying restrictions and limitations. Linux gives you a package manager and the freedom to install whatever tools you need, and to configure them however you like. All this freedom can make it easier to shoot yourself in the foot, but I find that preferable to being restricted by technical limitations (Windows) or greedy corporate policy (Apple OS X).

It is worth noting that the world of Linux is diverse and varied, and that there can be significant differences among different Linux distributions. One frequent issue is that many Linux distributions are very conservative about releasing software updates. This might make sense for embedded devices or servers (though I have my doubts) but it makes absolutely no sense on a laptop or personal machine. Generally speaking, I want to run the latest stable release of all the programs installed on my computers. Arch Linux is one of the few Linux distributions that makes it easy to keep your applications up-to-date. If you are new to Linux you may wish to start with Manjaro Linux, a pre-configured Arch Linux derivative.

I use the Emacs text editor

I use the Emacs text editor because I haven't yet found anything that I like better. It has a lot of legacy baggage that makes it feel alien and intimidating at first, but I got used to these quirks after using it for a week or two. The defining feature of Emacs is that it can be customized and extended without limit. It is both a text editor and a toolkit for building your own editing environment. A large and robust ecosystem of community-developed packages makes it easy to use Emacs not only as a text editor, but also as a Git and GitHub front-end, an email client, and a Stack Overflow browser (to give but a few examples).

Emacs customization

I customize Emacs extensively to improve the user interface, to provide better support for specific programming and markup languages, and to provide front-ends for reading email and managing Git repositories.

Highlights of my Emacs configuration include

  • improved support for running R, Python, or other programming languages inside Emacs,
  • support for LaTeX and other markup languages,
  • support for literate programming using Org mode or R Markdown,
  • consistent and familiar code evaluation using CTRL-RETURN,
  • consistent and familiar indentation and code completion using the TAB key,
  • powerful and simple search-based tools for finding commands, files, and buffers, inserting citations, etc.,
  • more standard select/copy/paste keys and right-click behavior, which make Emacs more familiar to newcomers,
  • more powerful and convenient window management.

If you are interested in giving Emacs a try, take a look at the instructions and report any problems you may encounter.

I write documents using the Org Mode, Markdown, and LaTeX markup languages

Most of the documents I produce these days are technical or training materials that include lots of example code. I often use literate programming techniques to keep the examples and the output produced by those examples together in a single document. The markup language I use for this purpose depends on the complexity of the document.

Markdown is the simplest and most ubiquitous of the markup languages I use. I often use it for simple documents, or those on which non-Emacs users are collaborating. Org mode is a more powerful markup language for which adequate support is available only in Emacs. I use it for many things, including most of my workshop notes. Finally, LaTeX is the most powerful, complex, and verbose of the markup languages I use. It is useful when you need more control over the appearance of the resulting document.

Markdown customization

I write Markdown using markdown-mode in Emacs. To "typeset" the documents for printing or posting online I use pandoc to convert Markdown to .pdf files (via LaTeX), .html, or .ipynb (Jupyter notebook) format. I have written a couple of scripts to make this process easier in specific cases, e.g. these scripts convert Markdown documents to Jupyter notebooks and this one converts Markdown to .html using a custom template.
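To give a rough idea of what such a conversion involves, here is a minimal Python sketch that builds pandoc command lines for the three output formats mentioned above. It is not one of the scripts referred to in this post: the function name pandoc_command, the FORMATS table, and the file names are made up for illustration, and the .ipynb case assumes a pandoc version recent enough to include the ipynb writer.

```python
from pathlib import Path

# Output extensions mentioned in the text, mapped to pandoc writer names.
# None means "let pandoc infer the writer from the file extension", which
# is how .pdf output (produced via LaTeX) is normally requested.
# This table is an illustrative assumption, not the author's configuration.
FORMATS = {".pdf": None, ".html": "html", ".ipynb": "ipynb"}


def pandoc_command(source, target, template=None):
    """Build (but do not run) a pandoc command converting source to target."""
    target = Path(target)
    suffix = target.suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported output format: {suffix}")
    cmd = [
        "pandoc",
        str(source),
        "--from=markdown",
        "--standalone",
        f"--output={target}",
    ]
    writer = FORMATS[suffix]
    if writer is not None:
        cmd.append(f"--to={writer}")
    if template is not None:
        # Hypothetical custom template file supplied by the caller.
        cmd.append(f"--template={template}")
    return cmd


# Show the command that would be run for an HTML export.
print(" ".join(pandoc_command("notes.md", "notes.html")))
```

To actually run the conversion, the returned list can be passed to subprocess.run(cmd, check=True), provided pandoc (and a LaTeX installation, for .pdf output) is available on the system.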

Org mode customization

Since many of the documents I prepare in org-mode include code examples, I've configured org-mode support for bash, R, Python and Matlab. I also use a custom template to export R workshop notes for publication on https://tutorials.iq.harvard.edu.

LaTeX customization

Although I rarely write directly in LaTeX these days (Markdown and Org mode are much simpler and I strongly prefer them), LaTeX remains important as a backend. For example, I may prepare notes for a presentation in Markdown and then export to LaTeX in order to typeset the slides using beamer.

I have developed a custom beamer theme using IQSS colors that might be useful to you, especially if you are an IQSS affiliate. Highlights include

  • a modern font that includes math symbols,
  • simple and clean layout (e.g., only section name and page numbers in the footer),
  • plenty of IQSS orange!

If you have any questions or difficulties using this theme please open an issue.

I use the R and Python programming languages

I use R and Python for working with data because they provide a good balance of flexibility and convenience. I prefer them to statistics packages like SPSS or SAS because both R and Python are full-fledged programming languages that give me the power and flexibility I need to address unusual or complicated tasks. At the same time, their substantial standard libraries and huge package repositories make it easy to accomplish standard or common data management and analysis tasks.
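As a small illustration of the "standard library" point, the following Python sketch computes group means from CSV text using only modules that ship with Python. The data and the function name mean_by_group are invented for this example; real work would typically read from a file and often use a dedicated data package instead.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up example data; in practice this would come from a file on disk.
RAW = """group,score
a,1.0
a,3.0
b,2.0
b,4.0
b,6.0
"""


def mean_by_group(text, key, value):
    """Return the mean of the `value` column for each level of `key`."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row[key]].append(float(row[value]))
    return {name: statistics.mean(vals) for name, vals in groups.items()}


print(mean_by_group(RAW, "group", "score"))  # {'a': 2.0, 'b': 4.0}
```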

Notably, I do not choose these languages because of their core language design; in both cases I don't particularly like the languages. It really is the ecosystem of packages that keeps me using these tools instead of running off to a shiny new thing like Julia or Go, or even wandering off to an old but more interesting environment like Haskell.

R customization

My customization of R is mostly limited to the installation of packages. My .Rprofile just sets a default CRAN repository and prints an amusing quote.

I find the following R packages especially useful:

  • ggplot2: robust graphics package
  • lavaan: structural equation models
  • lme4: mixed effects models in R
  • Amelia, mitools, mice: multiple imputation
  • purrr: consistent and clean functional programming tools
  • stringi: powerful text manipulation tools
  • xml2, jsonlite: powerful tools for manipulating and converting XML and JSON data
  • httr: a web client written in R

Other than installing these packages I mostly use the default R configuration.

Python customization

As with R, I don't really customize Python much. I mostly use Python from Emacs, sometimes using Org mode with Python code blocks for literate programming.

I use the fish shell in the Terminology emulator

It has been my experience that kids these days don't really like shells. Honestly, I don't blame them. Shell technology has been stuck in the 80s for far too long. That situation is starting to change, but bash (a shell from the 80s!) is still by far the most commonly used. I think it is time for a change, and fish is the most well-developed alternative at the moment. It doesn't make using the command line fun exactly, but it feels a lot less like being forced to time-travel 40 years into the past.

I use the Terminology emulator because unlike other terminal emulators it can display images and video directly in your terminal. This also helps avoid the forced-time-travel feeling commonly induced by using bash in a typical terminal emulator.

Fish customization

One of the nice things about the fish shell is that it has all the bells and whistles turned on by default. Very little configuration is needed to have a pleasant environment. My fish configuration is limited to a handful of convenience functions (AKA aliases), e.g., to update my system or ssh to a particular computer.

I use Git and GitHub for revision control

I don't particularly like git, but since everyone uses it I don't feel like I have much of a choice. If it were up to me I would use something simpler like Mercurial, but it's not, so I use git. It is much more complicated and frustrating than it needs to be, but it doesn't suck once you get the hang of it.

Git customization

I mostly use git from a terminal, but I often launch graphical tools from the command line, e.g., gitg for viewing history and meld for viewing and merging diffs.

Commonalities and alternatives

If you've read this far you must be really interested in tools for working with data! While I hope it was interesting to read about my choices, I encourage you to try out some alternatives and pick a set of tools that works well for you.

In reflecting on my own tool choices I notice that customization and community activity are key values for me. For example, I like R not so much because of the design of the language, but because it is flexible and has an active community building and sharing tools in the form of R packages. Similarly, I value Emacs because it is easy to configure and because there is an active community developing Emacs packages. I value these tools not because of their design per se, but because they are actually platforms that their user and developer communities have built tools on top of. The downside of this preference for power and flexibility is that these tools are often complex. Some people prefer simpler tools, e.g., Stata instead of R or Jupyter Notebooks instead of Emacs with Org mode, and that is perfectly reasonable.

My data science tools workshop notes describe some alternative tools and are a good place to start if you're not sure what you should use for a particular task.