From 44a33f2c72cabd110b4254e247f014bae3680eaf Mon Sep 17 00:00:00 2001
From: Fabien Freling
Date: Wed, 8 Jan 2020 18:55:18 +0100
Subject: [PATCH] add git large files article

---
 articles/git_large_files.md | 110 ++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 articles/git_large_files.md

diff --git a/articles/git_large_files.md b/articles/git_large_files.md
new file mode 100644
index 0000000..f64a522
--- /dev/null
+++ b/articles/git_large_files.md
@@ -0,0 +1,110 @@
+---
+title: Git and large files
+date: 2021-12-29
+---
+
+Git is a cornerstone of software development nowadays: it has become the
+de facto version control system.
+
+Its interface is a bit complex to work with, but a lot of tooling has been
+developed over the years that lessens the pain of dealing with it.
+
+One shortcoming of git though (and of version control systems in general) is
+dealing with huge binary files. These files are usually media assets that are
+not meant to be diffable, but they do belong to the project nonetheless.
+
+There are different ways to deal with these files:
+
+### 1. treat them like regular text files
+
+This is the easiest solution: do nothing special. It works perfectly and you
+keep a clean history. However, as you modify your assets, the repository size
+will grow and it will become slower to clone on your CI pipeline. It will also
+put more load on your git server.
+
+### 2. keep them out of your repository
+
+Out of the repository, out of trouble! If you keep your large assets in a
+separate directory (Dropbox for instance), your repository will stay light. But
+now you need to synchronize your external storage with your repository for your
+project. Most of the time, only the latest version is kept around, making it
+impossible to inspect an older revision with the appropriate assets.
+
+### 3. store a pointer to external storage
+
+As a compromise, you can store a pointer to external storage in your repository.
+Every time you check out a specific revision, you fetch the corresponding data
+from external storage and inject it into the project.
+
+
+Solution 3 is the most convenient one: we keep the regular git workflow,
+and move the burden of hosting large files out of git itself.
+
+## Git Large File Storage
+
+[Git Large File Storage](https://git-lfs.github.com/) (Git LFS) is the most
+widespread implementation of this mechanism. It is developed by GitHub and is
+available on all repositories on their platform. It works out of the box: you
+set it once and you can forget about it.
+
+However, there are some shortcomings with Git LFS.
+
+### 1. your project is no longer self-contained in git
+
+If you decide to use Git LFS, you will tie your project to the LFS storage
+server. You won't be able to walk through your history without having a storage
+server. GitHub's LFS server implementation is currently closed-source and only a
+"non production ready" reference server is available.
+
+Major hosting platforms have written their own implementations and it is
+possible to migrate your data among compatible hosting platforms. But your
+local copy of the repository will never hold all the data needed for your
+project. In a way, the storage server becomes a centralized piece. You can fetch
+all data locally to have it available but it won't be considered a source; it is
+more like a cache.
+
+### 2. you can't easily manage storage in LFS
+
+If you commit a bunch of files, then push your changes, all the files will be
+stored on the LFS server. If you want to remove them (e.g. you uploaded unwanted
+files), you can do it locally by doing a rebase, then calling `git lfs prune`.
+However, that will only clean up your local copies of files. What has been
+pushed will stay on the server.
+
+If you wish to remove files from the server, your options depend on the server
+implementation:
+- on GitHub, your only option to reclaim LFS quota and truly delete files from
+  LFS is to [delete and recreate your repository](https://docs.github.com/en/repositories/working-with-files/managing-large-files/removing-files-from-git-large-file-storage#git-lfs-objects-in-your-repository)
+- on Bitbucket Cloud, you can browse LFS files in the web UI and delete specific
+  files
+
+## Git Annex
+
+[git-annex](https://git-annex.branchable.com/) is a complete solution to deal
+with external files in your git repository. It is also more complex than Git
+LFS. As you can see in their
+[walkthrough](https://git-annex.branchable.com/walkthrough/), you need to
+explicitly set remotes for your files, and sync content between remotes.
+
+Data is shared among local repositories in `.git/annex`, but it won't be
+available in common source forges such as GitHub. To make this data available to
+everyone on the project, you can use [special
+remotes](https://git-annex.branchable.com/special_remotes/), which serve as
+data stores, akin to Git LFS (which can itself be used as a special remote).
+
+Contrary to Git LFS, you can see what content is currently
+[unused](https://git-annex.branchable.com/walkthrough/unused_data/) and [delete
+unwanted files](https://git-annex.branchable.com/tips/deleting_unwanted_files/).
+It is a more complex solution but it is more flexible.
+
+## What I recommend
+
+I think git-annex gives users more control over their data: it can be fully
+decentralized and offers tools to manage its content.
+
+Git LFS is simpler and more widely used, but once you hit one of its limitations,
+it can be costly to break free.
+
+## Links
+
+- [Large files with Git: LFS and git-annex](https://lwn.net/Articles/774125/)