Version control for datasets

Version control is as important for data as it is for code and documentation when results need to be reproducible. Git, however, handles large data files, or large numbers of them, poorly, because every version of every file is stored in the repository, so the repository grows with each change. So how can we formally version our data?

git-annex

git-annex is a command-line tool built on git that keeps all file content in a separate directory in the repository called the annex (.git/annex/objects). Only the file names and some metadata are placed under git version control. When you push a git repository with an annex to GitHub, the annex is not uploaded; the file content can instead be stored with an external hosting service. Thus, a copy (clone) of the GitHub repository contains only the version history and not the data files themselves. The content of any file can then be downloaded from the external storage with git annex get.
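The split between tracked names and separately stored content can be illustrated with a small toy model. This is only a sketch of the idea, not how git-annex is implemented: the key format, the directory names (annex/objects, tracked) and all function names below are assumptions made up for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def make_key(content: bytes) -> str:
    """Derive a storage key from the content (toy format, not git-annex's)."""
    return "sha256-" + hashlib.sha256(content).hexdigest()


def toy_add(repo: Path, name: str, content: bytes) -> str:
    """Store the content in the object store; track only a small pointer."""
    key = make_key(content)
    objects = repo / "annex" / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    (objects / key).write_bytes(content)
    tracked = repo / "tracked"
    tracked.mkdir(exist_ok=True)
    # The tracked file holds only the key, never the data itself.
    (tracked / name).write_text(key)
    return key


def toy_clone(repo: Path, dest: Path) -> None:
    """Copy only the tracked pointers, as a clone without annex content would."""
    shutil.copytree(repo / "tracked", dest / "tracked")


def toy_get(clone: Path, name: str, storage: Path) -> bytes:
    """Read the key for a file name and fetch its content from external storage."""
    key = (clone / "tracked" / name).read_text()
    objects = clone / "annex" / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    shutil.copy(storage / key, objects / key)
    return (objects / key).read_bytes()


def demo() -> dict:
    """Add a file, clone without content, then fetch the content on demand."""
    with tempfile.TemporaryDirectory() as tmp:
        origin, clone = Path(tmp) / "origin", Path(tmp) / "clone"
        data = b"x" * 100_000
        key = toy_add(origin, "scan.dat", data)
        toy_clone(origin, clone)
        pointer_bytes = (clone / "tracked" / "scan.dat").stat().st_size
        present_before_get = (clone / "annex" / "objects" / key).exists()
        # In this toy, the origin's object store plays the external storage.
        restored = toy_get(clone, "scan.dat", origin / "annex" / "objects")
        return {
            "pointer_bytes": pointer_bytes,
            "present_before_get": present_before_get,
            "restored_ok": restored == data,
        }


if __name__ == "__main__":
    print(demo())
```

In this toy, the clone holds only a pointer of a few dozen bytes for a 100 kB file, and the content appears locally only after it has been fetched explicitly.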

DataLad

DataLad is a version control system for datasets of any size. It builds on git and git-annex and is relatively simple to use. It also offers many additional features and has a comprehensive handbook. Note that DataLad is a command-line tool, so some previous experience with command-line git is advantageous.
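As a rough sketch of what a first DataLad session might look like, the snippet below assembles three basic commands (datalad create, datalad save and datalad get) and runs them only if DataLad is installed. The dataset name, file path and commit message are hypothetical placeholders, and the helper functions are illustrative wrappers, not part of DataLad.

```python
import shutil
import subprocess


def workflow_commands(dataset: str, data_file: str, message: str):
    """Return (command, working directory) pairs for a basic DataLad session."""
    return [
        # Create a new, empty dataset in the current directory.
        (["datalad", "create", dataset], "."),
        # Record the current state of the dataset with a message.
        (["datalad", "save", "-m", message], dataset),
        # Retrieve file content; mostly relevant in a clone where the
        # content is not yet present locally.
        (["datalad", "get", data_file], dataset),
    ]


def run_workflow(steps) -> bool:
    """Run the commands in order; return False if DataLad is not installed."""
    if shutil.which("datalad") is None:
        return False
    for cmd, cwd in steps:
        subprocess.run(cmd, cwd=cwd, check=True)
    return True


if __name__ == "__main__":
    # Hypothetical names; nothing is executed here, the commands are only printed.
    for cmd, cwd in workflow_commands("my-dataset", "data/scan.dat", "Add raw scans"):
        print(f"(in {cwd}) {' '.join(cmd)}")
```

In practice the data files would be copied into the dataset between the create and save steps, and these commands are usually typed directly in a terminal rather than run through Python.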