add git large files article
parent b73a81a129
commit 6911b53616

articles/git_large_files.md (new file, 110 lines)

---
title: Git and large files
date: 2021-12-29
---

Git is a cornerstone of software development nowadays: it has become the
de facto version control system.

Its interface is a bit complex to work with, but a lot of tooling has been
developed over the years to lessen the pain of dealing with it.

One shortcoming of git though (and of version control systems in general) is
dealing with huge binary files. These files are usually media assets that are
not meant to be diffable, but they do belong to the project nonetheless.

There are different ways to deal with these files:

### 1. treat them like regular text files

This is the easiest solution: do nothing special. It works perfectly and you
keep a clean history. However, every version of an asset stays in that history,
so as you modify your assets, the repository will grow and become slower to
clone in your CI pipeline. It will also put more load on your git server.

### 2. keep them out of your repository

Out of the repository, out of trouble! If you keep your large assets in a
separate directory (Dropbox for instance), your repository will stay light. But
now you need to keep your external storage in sync with your repository. Most
of the time, only the latest version is kept around, making it impossible to
inspect an older revision with the appropriate assets.

### 3. store a pointer to external storage

As a compromise, you can store a pointer to external storage in your repository.
Every time you check out a specific revision, you fetch the corresponding data
from external storage and inject it into the project.
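
To make the pointer idea concrete, here is a minimal sketch under assumptions of my own: the external storage is a plain local directory, objects are named after their SHA-256 hash, and the pointer is a two-line text file. The names `store_asset` and `fetch_asset` and the pointer layout are hypothetical and do not match any specific tool.

```python
import hashlib
import tempfile
from pathlib import Path


def store_asset(asset: Path, store: Path, pointer: Path) -> str:
    """Copy `asset` into the external store and write a small pointer file."""
    data = asset.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    # Content-addressed storage: the object's name is its hash.
    (store / digest).write_bytes(data)
    # Only this tiny pointer file would be committed to git (assumed layout).
    pointer.write_text(f"sha256:{digest}\nsize:{len(data)}\n")
    return digest


def fetch_asset(pointer: Path, store: Path, target: Path) -> None:
    """Resolve a pointer back into the real content, e.g. after a checkout."""
    first_line = pointer.read_text().splitlines()[0]
    digest = first_line.split(":", 1)[1]
    data = (store / digest).read_bytes()
    # Refuse content that does not match the hash recorded in the pointer.
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError(f"content mismatch for object {digest}")
    target.write_bytes(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        asset = root / "logo.psd"
        asset.write_bytes(b"\x00fake binary asset\x00")
        store_asset(asset, root / "store", root / "logo.psd.pointer")
        fetch_asset(root / "logo.psd.pointer", root / "store", root / "restored.psd")
        print((root / "restored.psd").read_bytes() == asset.read_bytes())
```

Because each revision commits its own pointer, checking out an old revision yields the hash of the matching asset version, as long as that object is still present in the store.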

Solution 3 is the most convenient: we keep the regular git workflow and move
the burden of hosting large files out of git itself.

## Git Large File Storage

[Git Large File Storage](https://git-lfs.github.com/) (Git LFS) is the most
widespread implementation of this mechanism. It is developed by GitHub and is
available on all repositories on their platform. It works out of the box: you
set it up once and you can forget about it.

However, there are some shortcomings with Git LFS.

### 1. your project is no longer self-contained in git

If you decide to use Git LFS, you tie your project to the LFS storage
server. You won't be able to walk through your history without having a storage
server. GitHub's LFS server implementation is currently closed-source and only a
"non production ready" reference server is available.

Major hosting platforms have built their own implementations, and it is
possible to migrate your data between compatible hosting platforms. But your
local copy of the repository will not hold all the data needed for your
project by default. In a way, the storage server becomes a centralized piece.
You can fetch all data locally to have it available, but it won't be considered
a source; it is more like a cache.

### 2. you can't easily manage storage in LFS

If you commit a bunch of files, then push your changes, all the files will be
stored on the LFS server. If you want to remove them (e.g. you uploaded unwanted
files), you can do it locally by doing a rebase, then calling `git lfs prune`.
However, that will only clean up your local copies of files. What has been
pushed will stay on the server.
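
The asymmetry can be pictured with a toy model. This is a deliberately simplified sketch: two plain dictionaries stand in for the local LFS cache and the server, and the hypothetical `push` and `prune_local` functions only mimic the behaviour described above. Nothing here talks to a real Git LFS server, and the real `git lfs prune` applies more rules about which objects it keeps.

```python
from typing import Dict, Set


def push(local: Dict[str, bytes], server: Dict[str, bytes]) -> None:
    """Upload every locally cached object to the server."""
    server.update(local)


def prune_local(local: Dict[str, bytes], still_referenced: Set[str]) -> None:
    """Drop cached objects that the rewritten history no longer references."""
    for oid in list(local):
        if oid not in still_referenced:
            del local[oid]


local = {"aaa": b"wanted asset", "bbb": b"unwanted asset"}
server: Dict[str, bytes] = {}
push(local, server)

# After the rebase, only "aaa" is still referenced by the history.
prune_local(local, still_referenced={"aaa"})

print(sorted(local))   # ['aaa']
print(sorted(server))  # ['aaa', 'bbb']
```

The local cache shrinks, but the server copy of the unwanted object is untouched: nothing in the prune step ever reaches it.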

If you wish to remove files from the server, your options depend on the server
implementation:
- on GitHub, your only option to reclaim LFS quota and truly delete files from
  LFS is to [delete and recreate your repository](https://docs.github.com/en/repositories/working-with-files/managing-large-files/removing-files-from-git-large-file-storage#git-lfs-objects-in-your-repository)
- on Bitbucket Cloud, you can browse LFS files in the web UI and delete specific
  files

## Git Annex

[git-annex](https://git-annex.branchable.com/) is a complete solution to deal
with external files in your git repository. It is also more complex than Git
LFS. As you can see in their
[walkthrough](https://git-annex.branchable.com/walkthrough/), you need to
explicitly set remotes for your files, and sync content between remotes.

Data is kept in `.git/annex` and shared between local repositories, but it
won't be available in common source forges such as GitHub. To make this data
available to everyone on the project, you can use [special
remotes](https://git-annex.branchable.com/special_remotes/), which are used as
data stores, akin to Git LFS (which can itself be used as a special remote).

Contrary to Git LFS, you can see what content is currently
[unused](https://git-annex.branchable.com/walkthrough/unused_data/) and [delete
unwanted files](https://git-annex.branchable.com/tips/deleting_unwanted_files/).
It is a more complex solution but it is more flexible.
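
Conceptually, finding unused content boils down to a set difference between what is stored and what the history still points to. The sketch below illustrates only that idea: `unused_objects` and the sample keys are hypothetical, and this is not how git-annex is actually implemented.

```python
from typing import Dict, Set


def unused_objects(stored: Set[str], revisions: Dict[str, Set[str]]) -> Set[str]:
    """Return stored object keys that no revision points to."""
    referenced: Set[str] = set()
    for keys in revisions.values():
        referenced |= keys
    return stored - referenced


# Keys present in the storage backend (made-up values).
stored = {"k1", "k2", "k3"}
# Keys referenced by each branch or tag (made-up values).
revisions = {
    "main": {"k1"},
    "v1.0": {"k1", "k2"},
}

print(sorted(unused_objects(stored, revisions)))  # ['k3']
```

Once the unused keys are known, deleting them becomes an explicit and reviewable step, which is the kind of control the pages linked above describe.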

## What I recommend

I think git-annex gives users more control over their data: it can be fully
decentralized and offers tools to manage its content.

Git LFS is simpler and more widely used, but once you hit one of its
limitations, it can be costly to break free.

## Links

- [Large files with Git: LFS and git-annex](https://lwn.net/Articles/774125/)