Why Are Git Repositories Getting Larger and Larger?

Have you ever encountered a situation where, despite having a fast internet connection, cloning a Git repository gets stuck at the “Receiving objects” step? Or have you ever waited ages to commit code because the repository was hiding some “giant” files? In reality, Git repositories tend to grow for these reasons:

  • Accidentally committed large files: Such as log files, installation packages, video/audio files, or temporary large files generated by development tools.
  • “Old baggage” in history: Some large files might have been committed multiple times across history. Even if you delete them locally, these “ghost” files in previous versions still occupy space.
  • Unoptimized submodules: If your project uses Git submodules and the submodules themselves are large, the entire repository becomes “bulky.”

The Troubles of a Too-Large Repository

  • Slow cloning/downloads: When others clone the repo, they have to download tens of GB of historical data, reducing team collaboration efficiency.
  • Time-consuming backups and transfers: Backing up the repository involves transferring large files repeatedly, consuming storage space.
  • Local operation lag: Git operations (e.g., git log, git diff) become slow, and the editor also lags when opening the project.
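
Before cleaning anything up, it is worth confirming that the repository really is bloated. Two standard commands give a quick read:

git count-objects -vH  # Packed size of the object database (see the "size-pack" line)
du -sh .git            # Or measure the .git directory directly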

1. Root Cause Removal: Clean Up Recently Committed Large Files

If you just committed a large file and haven’t pushed it to the remote repository yet, you can simply delete it immediately!

Scenario 1: Just committed a large file, not yet pushed to remote

Steps:
1. Identify the large file: Use git show --stat HEAD to see which files the last commit touched, or ls -lh filename to check a file's size (e.g., ls -lh mybigfile.zip).
2. Remove the file from the index: Suppose the large file is named mybigfile.zip:

   git rm --cached mybigfile.zip  # Removes from Git's index only, keeps the local file

3. Commit the removal:
   git commit -m "Remove large file mybigfile.zip"
4. Push the update: The local copy is still on disk but no longer tracked; push the change to the remote repository:
   git push

Tip: a removal commit still leaves the file's blob in the previous commit. Because nothing has been pushed yet, you can avoid that entirely by amending instead, as shown in the sketch below.
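
A minimal sketch of the amend alternative, using the same example filename; it folds the removal into the offending commit itself so the file never appears in history:

git rm --cached mybigfile.zip       # Untrack the file but keep it on disk
git commit --amend --no-edit        # Rewrite the last commit without the file
echo "mybigfile.zip" >> .gitignore  # Optional: ignore it so it is not re-added by accident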

Scenario 2: Need to delete a large file from a specific commit in history

If the large file has already been committed multiple times (say, it appears in the last 5 commits), deleting it from the latest commit is not enough; you need to clean it out of the history itself.
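
Before rewriting anything, it helps to see exactly which commits touched the file. Standard Git can answer that:

git log --all --oneline -- mybigfile.zip  # Every commit on any branch that added, changed, or deleted the file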

2. Rewriting History: Delete Large Files from History

For this, we use git filter-repo (faster and safer than the older git filter-branch; recommended).

Step 1: Install git filter-repo

  • Mac: Install via Homebrew:
  brew install git-filter-repo
  • Linux (Debian/Ubuntu):
  sudo apt-get install git-filter-repo
  • Windows: Use WSL (Windows Subsystem for Linux) or Chocolatey:
  choco install git-filter-repo  # Requires Chocolatey first
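
If none of these package managers suit you, git-filter-repo is also published on PyPI as a single Python script, so pip works anywhere Python 3 is available:

pip install git-filter-repo  # Official PyPI distribution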

Step 2: Delete large files from history

Assume the large file is mybigfile.zip and it appears throughout the history:

# Navigate to the repository root and run the command (replace the path/filename)
git filter-repo --path mybigfile.zip --invert-paths

The --invert-paths flag is essential here: --path on its own tells filter-repo which paths to keep, and adding --invert-paths deletes them instead. The command traverses all commits and completely removes mybigfile.zip from the history. (Note that git filter-repo refuses to run in anything but a fresh clone unless you pass --force, precisely to protect you from accidents.)
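
If you want to preview the effect before actually rewriting anything, git filter-repo offers a dry-run mode:

git filter-repo --path mybigfile.zip --invert-paths --dry-run  # Report what would change without modifying the repo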

Step 3: Verify the changes

Check the commit history to confirm the large file is gone:

git log --oneline

Check the repository size:

du -sh .git

The size should decrease significantly.
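
If .git is still noticeably large after the rewrite, stale objects may be lingering in the reflog. git filter-repo usually expires reflogs and repacks for you, but you can force the cleanup manually:

git reflog expire --expire=now --all  # Drop stale reflog entries
git gc --prune=now --aggressive       # Repack and delete unreachable objects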

Step 4: Push to the remote repository

⚠️ WARNING: Rewriting history means the remote's history must be overwritten, so force push with caution. Also note that git filter-repo deletes the origin remote as a safety measure, so re-add it before pushing:

git remote add origin <repository-url>   # Re-add the remote that filter-repo removed
git push --force-with-lease origin main  # Push the rewritten history to the main branch

Critical Note: A force push overwrites the remote history (--force-with-lease is the safer variant: it refuses to push if the remote has moved since your last fetch). Ensure no team members are working on the old history! For collaborative projects, communicate with the team first or perform this operation in your personal fork.
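
Teammates who already have clones should either re-clone or hard-reset their local branch onto the rewritten remote. A sketch of the reset approach (it discards any local commits on that branch, so they should stash or back up first):

git fetch origin              # Download the rewritten history
git checkout main
git reset --hard origin/main  # Point the local branch at the new history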

3. Ultimate Solution: Clean Up All Large Files (Including Submodules)

If the repository contains multiple large files or you need to bulk-clean:

git filter-repo --path-glob "*.log" --invert-paths  # --invert-paths deletes matching files
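
When you do not even know all the offending filenames, git filter-repo can instead strip every blob above a size threshold:

git filter-repo --strip-blobs-bigger-than 10M  # Remove all files over 10 MB from the entire history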

For Git submodules (which might be large themselves), remember that each submodule is an independent repository:
1. Clean the submodule's own history first, using the method above (see the sketch after these steps).
2. Reinitialize the submodules in the parent repository:

   git submodule update --init
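
Since each submodule is an independent repository, the cleanup happens inside it, and the parent repository then records the rewritten commit. A rough sketch, using libs/big-module as a hypothetical submodule path and <submodule-url> as a placeholder:

cd libs/big-module
git filter-repo --strip-blobs-bigger-than 10M  # Rewrite the submodule's own history
git remote add origin <submodule-url>          # filter-repo removes origin; re-add it
git push --force-with-lease origin main
cd ../..
git add libs/big-module                        # Record the submodule's new commit in the parent
git commit -m "Point submodule at cleaned history"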

4. Pitfall Avoidance: Must-Knows Before Operation

  1. Backup the repository: Create a copy of the repository before starting to prevent data loss.
  2. Avoid force-pushing shared branches: In team collaborations, a force push leaves teammates' clones out of sync with the rewritten history. Coordinate with the team first.
  3. Check for missed deletions: After cleanup, list the ten largest objects left in history to verify:
   git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sort -k3 -n | tail -10
  4. Be careful with plain git rm: Without --cached, git rm deletes the local file as well as untracking it; git rm --cached only removes the file from the index and leaves your local copy intact, which is the safer choice here.

5. Long-Term Optimization: Manage Large Files with Git LFS

If your project must include large files (e.g., videos, model files), do NOT commit them directly to Git! Use Git LFS (Large File Storage) instead:

  1. Install Git LFS:
   git lfs install
  2. Track large files (e.g., all .zip files):
   git lfs track "*.zip"
  3. Commit the .gitattributes file:
   git add .gitattributes
   git commit -m "Track zip files with Git LFS"
  4. Future large-file commits will automatically go through LFS, keeping the repository itself small.
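
Keep in mind that git lfs track only affects future commits; files already buried in history stay where they are. Git LFS ships a migrate command that rewrites history to move existing files into LFS (the same force-push caveats from earlier apply):

git lfs migrate import --include="*.zip" --everything  # Rewrite all refs so existing .zip files live in LFS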

Summary

The core approach to cleaning a Git repository is: Delete unnecessary large files + Rewrite history. The key tool is git filter-repo. Always back up and communicate before force-pushing. For large files, use Git LFS to manage them from the source, reducing repository size.

Regularly check repository sizes and adopt the habit of “committing small files and using LFS for large ones” to ensure smoother team collaboration!

Xiaoye