---
title: Git and large files
date: 2021-12-29
---

Git is a cornerstone of software development nowadays: it has become the
de facto version control system.

Its interface is a bit complex to work with, but a lot of tooling has been
developed over the years that lessens the pain of dealing with it.

One shortcoming of git though (and of version control systems in general) is
dealing with huge binary files. These files are usually media assets that are
not meant to be diffable, but they do belong to the project nonetheless.

There are different ways to deal with these files:

### 1. treat them like regular text files

This is the easiest solution: do nothing special. It works and you keep a
clean history. However, every version of an asset you modify stays in that
history, so the repository will grow and become slower to clone in your CI
pipeline. It will also put more load on your git server.

### 2. keep them out of your repository

Out of the repository, out of trouble! If you keep your large assets in a
separate location (Dropbox for instance), your repository will stay light. But
now you need to keep that external storage in sync with your repository. Most
of the time, only the latest version is kept around, making it impossible to
inspect an older revision with the appropriate assets.

### 3. store a pointer to external storage

As a compromise, you can store a pointer to external storage in your repository.
Every time you check out a specific revision, you fetch the corresponding data
from external storage and inject it into the project.

Solution 3 is the most convenient: we keep the regular git workflow and move
the burden of hosting large files out of git itself.

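To make the pointer idea concrete, here is a minimal Python sketch. It is not
how any particular tool implements it. It assumes the external storage is a
plain directory and that the pointer is the SHA-256 digest of the content;
`store_asset` and `fetch_asset` are hypothetical names chosen for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def store_asset(data: bytes, storage_dir: Path) -> str:
    """Save data in external storage and return the pointer to commit."""
    # Assumption: the pointer is simply the SHA-256 hex digest of the content.
    pointer = hashlib.sha256(data).hexdigest()
    storage_dir.mkdir(parents=True, exist_ok=True)
    (storage_dir / pointer).write_bytes(data)
    return pointer


def fetch_asset(pointer: str, storage_dir: Path) -> bytes:
    """Resolve a pointer back to the asset and check it was not corrupted."""
    data = (storage_dir / pointer).read_bytes()
    if hashlib.sha256(data).hexdigest() != pointer:
        raise ValueError(f"corrupted asset for pointer {pointer}")
    return data


# Demo: a temporary directory stands in for the external storage.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "external-storage"
    asset = b"pretend this is a huge video"
    pointer = store_asset(asset, storage)
    print(pointer)  # the short string that would live in the repository
    print(fetch_asset(pointer, storage) == asset)
```

Only the 64-character digest would be committed. The asset itself lives in the
storage directory and is fetched back when a revision is checked out.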
## Git Large File Storage

[Git Large File Storage](https://git-lfs.github.com/) (Git LFS) is the most
widespread implementation of this mechanism. It is developed by GitHub and is
available on all repositories on their platform. It works out of the box: you
set it up once and you can forget about it.

However, there are some shortcomings with Git LFS.

### 1. your project is no longer self-contained in git

If you decide to use Git LFS, you tie your project to the LFS storage server.
You won't be able to walk through your history without access to a storage
server. GitHub's LFS server implementation is currently closed-source and only
a "non production ready" reference server is available.

Major hosting platforms have written their own implementations, and it is
possible to migrate your data between compatible hosting platforms. But your
local copy of the repository will never hold all the data needed for your
project. In a way, the storage server becomes a centralized piece. You can fetch
all data locally to have it available, but it won't be considered a source: it
is more like a cache.

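The reason the repository is not self-contained is that git only versions a
small text pointer for each tracked file, while the content sits on the LFS
server. Here is a rough Python sketch of that pointer, assuming the layout
described in the Git LFS specification (a version line, a SHA-256 `oid`, and a
size in bytes). `lfs_pointer` and `parse_lfs_pointer` are hypothetical helper
names, not part of any Git LFS API.

```python
import hashlib


def lfs_pointer(data: bytes) -> str:
    """Build the text of a Git LFS style pointer for the given content."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Split a pointer into its key/value pairs (one pair per line)."""
    return dict(line.split(" ", 1) for line in text.splitlines())


asset = b"pretend this is a huge video"
pointer = lfs_pointer(asset)
print(pointer, end="")  # this small text is all that git itself stores

fields = parse_lfs_pointer(pointer)
print(fields["oid"])  # the identifier used to request the content from the server
```

Walking through history gives you these few lines for every revision of the
file. Turning them back into the real asset requires a server (or a local
cache) that still holds the object with that `oid`.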
### 2. you can't easily manage storage in LFS

If you commit a bunch of files, then push your changes, all the files will be
stored on the LFS server. If you want to remove them (e.g. you uploaded unwanted
files), you can do it locally by rebasing, then calling `git lfs prune`.
However, that will only clean up your local copies of the files. What has been
pushed will stay on the server.

If you wish to remove files from the server, your options depend on the server
implementation:

- on GitHub, your only option to reclaim LFS quota and truly delete files from
  LFS is to [delete and recreate your repository](https://docs.github.com/en/repositories/working-with-files/managing-large-files/removing-files-from-git-large-file-storage#git-lfs-objects-in-your-repository)
- on Bitbucket Cloud, you can browse LFS files in the web UI and delete specific
  files

## Git Annex

[git-annex](https://git-annex.branchable.com/) is a complete solution to deal
with external files in your git repository. It is also more complex than Git
LFS. As you can see in their
[walkthrough](https://git-annex.branchable.com/walkthrough/), you need to
explicitly set remotes for your files, and sync content between remotes.

Data is stored in `.git/annex` and shared among local repositories, but it won't
be available on common source forges such as GitHub. To make this data available
to everyone on the project, you can use [special
remotes](https://git-annex.branchable.com/special_remotes/), which act as data
stores, akin to Git LFS (which can itself be used as a special remote).

Contrary to Git LFS, you can see what content is currently
[unused](https://git-annex.branchable.com/walkthrough/unused_data/) and [delete
unwanted files](https://git-annex.branchable.com/tips/deleting_unwanted_files/).
It is a more complex solution, but it is more flexible.

## What I recommend

I think git-annex gives users more control over their data: it can be fully
decentralized and offers tools to manage its content.

Git LFS is simpler and more widely used, but once you hit one of its
limitations, it can be costly to break free.

## Links

- [Large files with Git: LFS and git-annex](https://lwn.net/Articles/774125/)