add git large files article

Fabien Freling 2020-01-08 18:55:18 +01:00
parent b7325dfe07
commit 44a33f2c72

articles/git_large_files.md Normal file

@ -0,0 +1,110 @@
---
title: Git and large files
date: 2021-12-29
---
Git is a cornerstone of software development nowadays: it has become the
de facto version control system.
Its interface is a bit complex to work with, but a lot of tooling has been
developed over the years that lessens the pain of dealing with it.
One shortcoming of git though (and of version control systems in general) is
dealing with huge binary files. These files are usually media assets that are
not meant to be diffable, but they do belong to the project nonetheless.
There are different ways to deal with these files:
### 1. treat them like regular text files
This is the easiest solution: do nothing special. It works perfectly and you
keep a clean history. However, as you modify your assets, the repository size
will grow and it will become slower to clone in your CI pipeline. It will also
put more load on your git server.
### 2. keep them out of your repository
Out of the repository, out of trouble! If you keep your large assets in a
separate directory (Dropbox for instance), your repository will stay light. But
now you need to synchronize your external storage with your repository for your
project. Most of the time, only the latest version is kept around, making it
impossible to inspect an older revision with the appropriate assets.
### 3. store a pointer to external storage
As a compromise, you can store a pointer to external storage in your repository.
Every time you check out a specific revision, you fetch the corresponding data
from external storage and inject it into the project.
Solution 3 is the most convenient one: we keep the regular git workflow and
move the burden of hosting large files out of git itself.
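
To make the pointer idea concrete, here is a minimal sketch of a
content-addressed store. The `store` and `fetch` helpers and the storage layout
are hypothetical, written for illustration only; real tools add transfer,
locking and git integration on top of this.

```python
import hashlib
from pathlib import Path


def store(asset: Path, storage: Path) -> str:
    """Copy the asset into external storage and return the pointer to commit."""
    data = asset.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    storage.mkdir(parents=True, exist_ok=True)
    # The object is stored under its own hash, so identical content is stored once.
    (storage / digest).write_bytes(data)
    return f"sha256:{digest}\n"


def fetch(pointer: str, storage: Path) -> bytes:
    """Resolve a pointer back to the asset content, verifying its hash."""
    digest = pointer.strip().split(":", 1)[1]
    data = (storage / digest).read_bytes()
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("stored object does not match its pointer")
    return data
```

Only the short pointer text would be committed; the asset itself lives in
`storage`, which can be any directory or mounted remote.
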
## Git Large File Storage
[Git Large File Storage](https://git-lfs.github.com/) (Git LFS) is the most
widespread implementation of this mechanism. It is developed by GitHub and is
available on all repositories on their platform. It works out of the box: you
set it once and you can forget about it.
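
For reference, what Git LFS commits in place of a tracked file is a small text
pointer. The sketch below builds and parses one, following the pointer layout
as I understand it from the Git LFS specification (a `version` line, then the
`oid` and `size` keys); `make_pointer` and `parse_pointer` are hypothetical
helper names, not part of Git LFS.

```python
import hashlib
from typing import Dict

# Version URL used by Git LFS pointer files (assumed from the LFS spec).
LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def make_pointer(content: bytes) -> str:
    """Build the pointer text that stands in for `content` in the repository."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_VERSION}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer into a dictionary entry."""
    return dict(line.split(" ", 1) for line in text.splitlines() if line)
```

The pointer stays a few lines long whatever the size of the asset, which is
why the repository remains light.
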
However, there are some shortcomings with Git LFS.
### 1. your project is no longer self-contained in git
If you decide to use Git LFS, you will tie your project to the LFS storage
server. You won't be able to walk through your history without having a storage
server. GitHub's LFS server implementation is currently closed-source and only a
"non production ready" reference server is available.
Major hosting platforms have written their own implementations, and it is
possible to migrate your data between compatible hosting platforms. But your
local copy of the repository will never hold all the data needed for your
project. In a way, the storage server becomes a centralized piece. You can fetch
all data locally to have it available, but it won't be considered a source; it
is more like a cache.
### 2. you can't easily manage storage in LFS
If you commit a bunch of files, then push your changes, all the files will be
stored on the LFS server. If you want to remove them (e.g. you uploaded unwanted
files), you can do it locally by rebasing them out of your history, then calling
`git lfs prune`. However, that will only clean up your local copies of the
files. What has been pushed will stay on the server.
If you wish to remove files from the server, your options depend on the server
implementation:
- on GitHub, your only option to reclaim LFS quota and truly delete files from
LFS is to [delete and recreate your repository](https://docs.github.com/en/repositories/working-with-files/managing-large-files/removing-files-from-git-large-file-storage#git-lfs-objects-in-your-repository)
- on Bitbucket Cloud, you can browse LFS files in the web UI and delete specific
files
## Git Annex
[git-annex](https://git-annex.branchable.com/) is a complete solution to deal
with external files in your git repository. It is also more complex than Git
LFS. As you can see in their
[walkthrough](https://git-annex.branchable.com/walkthrough/), you need to
explicitly set remotes for your files, and sync content between remotes.
Data is shared among local repositories in `.git/annex`, but it won't be
available in common source forges such as GitHub. To make this data available to
all people in the project, you can use [special
remotes](https://git-annex.branchable.com/special_remotes/), which are used as
data stores, akin to Git LFS (which can itself be used as a special remote).
Contrary to Git LFS, you can see what content is currently
[unused](https://git-annex.branchable.com/walkthrough/unused_data/) and [delete
unwanted files](https://git-annex.branchable.com/tips/deleting_unwanted_files/).
It is a more complex solution but it is more flexible.
## What I recommend
I think git-annex gives users more control over their data: it can be fully
decentralized and offers tools to manage its content.
Git LFS is simpler and more widely used, but once you hit one of its
limitations, it can be costly to break free.
## Links
- [Large files with Git: LFS and git-annex](https://lwn.net/Articles/774125/)