A Data Analysis Workflow Based on Git, Luigi and Jupyter Notebooks
Structuring a data analysis project is hard. One could think that “it is just code, and there are great tools to manage code projects, so that should be easy”, but that is wrong, for the following reasons:
- In software development, the code itself is the output. In data analysis, the insights are the output.
- In particular, data pretty much always comes in messy, and needs cleaning. Cleaning might never be “done” completely: every time you move forward in the analysis, you might encounter issues you will need to clean.
- However, versioning data is not a good idea: it takes a lot of space, and diffs tend not to be useful
Over the years, I iteratively improved my approach, and I now have a process I am quite happy with. The guiding principles are:
- the only versioned data is the initial, raw data
- the cleaning/munging code is version controlled, so that each commit corresponds to a particular state of that process
- the cleaned files are not stored in a place where they are meant to be read by hand; rather, every piece of analysis code calls a method that either regenerates that data or reads a cached version of it.
- every step in the process is parameterized. This means, for instance, that repeating a multi-step process on different sub-samples of the data only requires changing a few characters.
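To make the last two principles concrete, here is a minimal sketch that uses only the standard library. The names `load_cleaned` and `clean`, the CSV columns, the `every_nth` parameter and the temporary directories are all hypothetical, made up for this illustration; they are not taken from a real project.

```python
import csv
import tempfile
from pathlib import Path

# Stand-ins for data/raw/ and data/cache/ (temporary so the sketch runs anywhere).
ROOT = Path(tempfile.mkdtemp())
RAW_FILE = ROOT / "raw" / "measurements.csv"
CACHE_DIR = ROOT / "cache"

# Fake "raw data" containing one messy row (empty value), for illustration only.
RAW_FILE.parent.mkdir(parents=True)
RAW_FILE.write_text("id,value\n1,3.5\n2,\n3,4.0\n4,1.5\n")


def clean(rows, every_nth):
    """Cleaning step: drop rows without a value, then keep every n-th row."""
    valid = [row for row in rows if row["value"] != ""]
    return valid[::every_nth]


def load_cleaned(every_nth=1):
    """Return the cleaned rows, regenerating the cached file only if it is missing."""
    cached = CACHE_DIR / f"cleaned_every_{every_nth}.csv"
    if not cached.exists():
        with RAW_FILE.open(newline="") as f:
            rows = clean(list(csv.DictReader(f)), every_nth)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        with cached.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "value"])
            writer.writeheader()
            writer.writerows(rows)
    with cached.open(newline="") as f:
        return list(csv.DictReader(f))


full = load_cleaned()                  # cleans and caches on the first call
subsample = load_cleaned(every_nth=2)  # another sub-sample: only one argument changes
```

Analysis code only ever calls `load_cleaned`, never opens the cached file directly. Note that nothing in this sketch invalidates the cache when `clean` changes; deleting the cache directory is the blunt way to force a rebuild.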
Workflow Management with Luigi
Why Workflow Management
In any non-trivial project, you will have code that transforms raw data into cleaner datasets, code that combines those datasets in new ways, and code that consumes those datasets.
Over the years, I have seen quite a few ways one should not do that (some of them my own, of course), the most common being:
- Have all transformation and analysis in one script, that you copy and modify for new analyses. This makes it very difficult to retrofit corrections to your cleaning code to earlier analyses.
- Have separate scripts that transform the data and write the results to files. This goes in the right direction, but when things get complex and deadlines come close, this tends to lead to inconsistent states (where, say, B depends on A, but was not regenerated after the transformation producing A changed)
- Just look at the data in an interactive R/SPSS/Python session or Excel, and “clean” by hand (don’t. ever.)
- Get a “clean” dataset from someone else by e-mail, apply transformations to it, and send that new version to yet another person via file-sharing, sometimes (if you are lucky) “versioned” by relying on poetic names such as “awesome_data_cleaned_v0.7_final_modified_2.csv”. I would also laugh, if I had not seen this kind of thing happen so often in real life. Really. And some of those people had PhDs, and all of them were smart.
All those approaches share a few issues:
- It can be difficult to know where a particular transformed dataset comes from
- It can be difficult to propagate corrections in a dataset to all downstream analyses.
Workflow management tools help prevent those issues by encoding the flow of data from the raw data to the end analyses, which makes it possible to re-run the full pipeline if needed. The following options exist:
- Hand-written scripts. Fine for simple use-cases or very linear processes.
- GNU Make. Every developer knows it, it is older than I am and will probably still be in use after my death. However, it is designed to build software projects, and adapting it to data analysis projects can become cumbersome. In particular, Make becomes very difficult to adapt if you have some rules that create more than one target (file), which is often useful when working with data.
- Snakemake: pretty much a GNU Make on steroids for data analysis.
- Renku: a git extension that keeps track of data transformations in the repository, lets you replay them, etc. I did not find it useful as such, because the workflow gets hidden and it is difficult to parameterize. It is part of a bigger project, though, which aims at making sharing and traceability of results much easier. I highly recommend checking it out!
- Luigi: a Python library developed by Spotify to describe workflows in Python itself. It is flexible and, when using Python for analysis, allows for very interesting interactive workflows, where one can just ask for something, and get it built if needed.
Luigi lets you describe workflows with two basic elements:
- Targets describe any kind of input or output
- Tasks are processes that compute Targets. They might depend on other Tasks.
This is very similar to GNU Make’s targets and rules, but instead of going for a Domain Specific Language, Luigi is simply a “normal” Python library. This makes it a bit more verbose than DSL-based tools, but incredibly flexible!
The documentation has a nice example, which I will not reproduce here.
I guess that if you landed here, you know what a Jupyter Notebook is. Within a few years, they became the go-to tool for people using scientific Python. Their ability to mix code, markdown-formatted text and outputs gives them the interactivity of a REPL, the persistence of a script, and the expressiveness of a document. They are great for exploring data, and make it possible to come back to an analysis a few months later and still understand it!
A pain point with Jupyter notebooks is how to share data transformations between notebooks: they are great for data cleaning, but using that cleaned data in another notebook can become a nightmare.
This is where Luigi comes in: since it is plain Python, you can actually execute Luigi Tasks directly in your notebooks!
This is as simple as writing:
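    import luigi

    # ServeCustomer is a Luigi Task defined elsewhere in the project
    task = ServeCustomer(ingredients=['egg', 'spam', 'bacon', 'spam', 'spam'])
    # if no central scheduler is running, also pass local_scheduler=True
    luigi.build([task])
    task_output = task.output()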
which will run the ServeCustomer task and any of its dependencies, if their outputs do not exist yet, using the specified parameters.
The workflow becomes the following:
- do your exploration in a Jupyter notebook
- once you are satisfied, extract the relevant bits into a Python module
- create a Luigi Task that performs those transformations
- you are now able to call those tasks directly from other notebooks, potentially with other parameters (for instance to use a different sub-sample of the dataset)
Doing so, rather than manually writing results to files, makes the whole process much less error-prone. The alternative would be to always execute the whole pipeline, which is often prohibitively expensive. Plus, Luigi gives you (some) parallelism for free, as it can execute independent tasks in separate workers!
For more, look at the relevant part of the documentation.
We now come to the last point, which is also one of the most important: how to version control that thing?
Version Control of Data Science Projects
Since I am involved in software engineering as much as in data science, version control is part of my daily hygiene. And since git made version control so cheap (and local), I do not think I have written a single line of code without creating a repository first. The benefits are countless:
- Programming is iterative in nature: start with an initial solution, and improve upon it as you discover edge cases or new applications. Sometimes, one of those iterative improvements ends up not being an improvement at all, in addition to being buggy. Version controlled? Just revert and enjoy life. No version control? Congrats, you just gave yourself a week of free work!
- Commits act as a documentation of the process that led to the current state. Wonder why things are like they are? Sometimes, looking at the commit history can save you quite a few headaches.
- Version control makes it easy to collaborate: have a central repository somewhere, and have everyone push and pull from it. No files exchanged by e-mail with random naming conventions, no shared folder with 10 “versions” of the same script, no “who broke that crap, it was working last week!” nervous breakdowns, no scripts with more lines commented out than not as a way to “document” non-working solutions.
- Version control makes you sexy and improves your overall health, by reducing your overall stress level. It is said that getting a “git reset-branch” command right from the first try without looking at StackOverflow releases more endorphins in your brain than all other human activities combined, making you the happiest person on Earth!
However, versioning data analysis projects is hard:
- contrary to a software engineering project, where code is the output, in data analysis, insights are the output, and code is just a means to that end. Jupyter notebooks (or other literate programming methods) and meaningful commit messages help, but this is an additional difficulty.
- versioning the data itself is often a bad idea: the diffs are unreadable, and anything other than the raw data can be regenerated, so version-controlling it just introduces the risk of incoherent states (where the transformed data was generated with a different version of the code than the one in the same commit).
It gets worse: versioning Jupyter notebooks sucks.
- As much as I like them, I cannot understand why on Mother Earth they chose JSON as the underlying format. Try opening a notebook in your favorite text editor to see what I mean. Un. Read. A. Ble. It is not as if it were impossible to design a literate programming format that is readable: look at RMarkdown. This makes diffs rather difficult to read.
- The second design decision that puzzles me is how Jupyter notebooks contain both the code and the output. I just consider this plain wrong. Again, it would not have been very difficult to just generate the output in a separate file, with exactly the same user interface. RMarkdown again.
The main problem with this is that, as you progress through your exploration, you might update some steps (for instance data cleaning) making some existing outputs inconsistent with the current state of code. So you do not want to version control outputs: anybody checking out your code can re-generate them, with the guarantee that they are up to date.
Except, of course, that it sometimes makes sense to version control outputs. This is the case for notebooks related to a particular talk or paper: you might want those to be available without the need for re-computation.
At least the second point can be solved with a tool called nbstripout. See the documentation to see how to activate it as a git hook. It will remove the output in the version that is version controlled, but not from the working directory, and using glob patterns in the .gitattributes file you can select for which notebooks you want it to run (more on this later).
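To give an idea of what stripping outputs means in practice, the sketch below removes outputs and execution counts from a hand-written, minimal notebook in the nbformat 4 JSON layout. It only illustrates the idea with the standard library and is not nbstripout's actual implementation; the function name `strip_outputs` and the sample notebook are made up.

```python
import copy
import json

# A minimal, hand-written notebook in the nbformat 4 layout (made up for illustration).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Exploration"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook with code cell outputs and counters removed."""
    stripped = copy.deepcopy(nb)
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


clean_notebook = strip_outputs(notebook)
# The code cell keeps its source, but no longer carries any output.
print(json.dumps(clean_notebook["cells"][1], indent=1))
```

With the outputs gone, a diff between two commits only shows changes to code and text cells, which is the part worth reviewing.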
Given those remarks, my repository structure typically looks like this:
    project_root/
    |_ data/
    |  |_ raw/
    |  |_ cache/
    |_ python_lib/
    |  |_ domain_one.py
    |  |_ domain_two.py
    |  |_ luigi/
    |     |_ pipeline.py
    |_ 001_SomeAnalysis.ipynb
    |_ 002_SomeOtherAnalysis.ipynb
    |_ x_001_SomeSummaryPresentationOrReport.ipynb
    |_ .gitignore
    |_ .gitattributes
It is very similar to what lots of others advise, but specific to the use of Luigi:
- I have two data directories: raw, where the (surprise!) raw data lies, and cache, where all files created by Luigi land (more on that in another post). One could add a final directory if one wants to share a cleaned dataset, but I almost never need to.
- I keep all Python modules specific to the project in a python_lib package. I have a specific package for Luigi files.
- I number my exploration notebooks, and seldom correct them after the fact. They act as a documentation of the exploration process, but can be re-run when the Luigi tasks get updated/debugged.
- I prefix presentations or reports with an x_, and do not run nbstripout on them, by having the following in .gitattributes:

    ???_*.ipynb filter=nbstripout
    ???_*.ipynb diff=ipynb

That is, I only run it on notebooks whose names start with 3 characters and an underscore.
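A quick way to convince oneself of which notebooks the pattern selects is to test it against the file names from the layout above. The sketch uses Python's `fnmatch` module as an approximation of Git's pattern matching; that is an assumption which should hold for a simple pattern like this one, but not necessarily for every `.gitattributes` pattern.

```python
from fnmatch import fnmatchcase

PATTERN = "???_*.ipynb"  # the pattern used in .gitattributes

notebooks = [
    "001_SomeAnalysis.ipynb",
    "002_SomeOtherAnalysis.ipynb",
    "x_001_SomeSummaryPresentationOrReport.ipynb",
]

# True means the nbstripout filter would apply to that notebook.
stripped = {name: fnmatchcase(name, PATTERN) for name in notebooks}

for name, is_stripped in stripped.items():
    print(f"{name}: {'outputs stripped' if is_stripped else 'outputs kept'}")
```

One thing to watch for: a report named along the lines of `x_1_Report.ipynb` would match as well, since `x_1` is three characters followed by an underscore, so the number after `x_` needs at least two digits for the trick to work.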
Wrapping it Up
This blog post presented the simple workflow I settled on for data science/data analysis projects. The introduction of Luigi in particular was a game changer for me, and I highly recommend it to anyone struggling to manage even moderately complex data science pipelines in Python.
If you want a few more tips and tricks for Luigi, make sure to check the companion blog post.
How does your workflow differ? Be sure to mention it in the comments!