Categories
git git-rebase git-rewrite-history version-control

How to remove/delete a large file from commit history in the Git repository?

960

I accidentally dropped a DVD-rip into a website project, then carelessly git commit -a -m ..., and, zap, the repo was bloated by 2.2 gigs. Next time I made some edits, deleted the video file, and committed everything, but the compressed file is still there in the repository, in history.

I know I can start branches from those commits and rebase one branch onto another. But what should I do to merge the 2 commits so that the big file doesn’t show in the history and is cleaned in the garbage collection procedure?

10

278

Why not use this simple but powerful command?

git filter-branch --tree-filter 'rm -f DVD-rip' HEAD

The --tree-filter option runs the specified command after each checkout of the project and then recommits the results. In this case, you remove a file called DVD-rip from every snapshot, whether it exists or not.

If you know which commit introduced the huge file (say 35dsa2), you can replace HEAD with 35dsa2^..HEAD to avoid rewriting too much history, thus avoiding diverging commits if you haven’t pushed yet. The ^ matters: the range 35dsa2..HEAD excludes 35dsa2 itself, so the commit that added the file would be left untouched. This comment courtesy of @alpha_989 seems too important to leave out here.
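If you don't remember which commit added the file, git log can tell you. Here is a minimal sketch in a throwaway repository; the file contents, the identity settings, and the variable names (added, root_before, root_after) are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING is only there to silence the notice that newer Git versions print before running filter-branch.

```shell
#!/bin/sh
# Sketch only: builds a scratch repository, so nothing real is touched.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"

echo "<html></html>" > index.html
git add index.html
git commit -q -m "Index"

echo "pretend this is 2.2 GB" > DVD-rip
git add DVD-rip
git commit -q -m "Careless"

git rm -q DVD-rip
git commit -q -m "Remove DVD-rip"

root_before=$(git rev-list --max-parents=0 HEAD)

# --diff-filter=A keeps only the commit(s) that added the path.
added=$(git log --diff-filter=A --format=%h -- DVD-rip)
echo "DVD-rip was added in $added"

# Rewrite only from that commit onward. The trailing ^ starts the range at
# the parent, so the commit that added the file is itself rewritten.
FILTER_BRANCH_SQUELCH_WARNING=1 \
  git filter-branch --tree-filter 'rm -f DVD-rip' "$added^..HEAD"

# Commits before the range keep their hashes.
root_after=$(git rev-list --max-parents=0 HEAD)
echo "root before: $root_before"
echo "root after:  $root_after"
```

Because --prune-empty is not used here, the rewritten "Careless" and "Remove DVD-rip" commits stay in the history as empty commits.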

See this link.

8

  • 5

    Much better than bfg. I was unable to clean file from a git with bfg, but this command helped

    – podarok

    Jul 1, 2016 at 11:56

  • 4

    This is great. Just a note for others that you’ll have to do this per branch if the large file is in multiple branches.

    – James

    Aug 19, 2016 at 1:38

  • 1

    This worked for me on a local commit that I couldn’t upload to GitHub. And it seemed simpler than the other solutions.

    – Richard G

    Feb 3, 2017 at 16:32

  • 5

    If you know the commit where you put the file in (say 35dsa2), you can replace HEAD with 35dsa2..HEAD; that way it won’t try to check out all the commits and rewrite them, which is what it does if you use HEAD. Also note that tree-filter is much slower than index-filter.

    – alpha_989

    Jan 21, 2018 at 20:10

  • 6

    After running the above command, you then have to run git push --all --force to get remote’s history to match the amended version you have now created locally (@stevec)

    Jun 16, 2020 at 19:05


664

What you want to do is highly disruptive if you have published history to other developers. See “Recovering From Upstream Rebase” in the git rebase documentation for the necessary steps after repairing your history.

You have at least two options: git filter-branch and an interactive rebase, both explained below.

Using git filter-branch

I had a similar problem with bulky binary test data from a Subversion import and wrote about removing data from a git repository.

Say your git history is:

$ git lola --name-status
* f772d66 (HEAD, master) Login page
| A     login.html
* cb14efd Remove DVD-rip
| D     oops.iso
* ce36c98 Careless
| A     oops.iso
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

Note that git lola is a non-standard but highly useful alias. (See the addendum at the end of this answer for details.) The --name-status switch to git log shows tree modifications associated with each commit.

In the “Careless” commit (whose SHA1 object name is ce36c98) the file oops.iso is the DVD-rip added by accident and removed in the next commit, cb14efd. Using the technique described in the aforementioned blog post, the command to execute is:

git filter-branch --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \
  --tag-name-filter cat -- --all

Options:

  • --prune-empty removes commits that become empty (i.e., do not change the tree) as a result of the filter operation. In the typical case, this option produces a cleaner history.
  • -d names a temporary directory that does not yet exist to use for building the filtered history. If you are running on a modern Linux distribution, specifying a tree in /dev/shm will result in faster execution.
  • --index-filter is the main event and runs against the index at each step in the history. You want to remove oops.iso wherever it is found, but it isn’t present in all commits. The command git rm --cached -f --ignore-unmatch oops.iso deletes the DVD-rip when it is present and does not fail otherwise.
  • --tag-name-filter describes how to rewrite tag names. A filter of cat is the identity operation. Your repository, like the sample above, may not have any tags, but I included this option for full generality.
  • -- specifies the end of options to git filter-branch
  • --all following -- is shorthand for all refs. Your repository, like the sample above, may have only one ref (master), but I included this option for full generality.
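To try the command safely before running it on a real repository, something like the following should work. It is a sketch that rebuilds the sample history in a scratch directory; commit_file is a hypothetical helper, the file contents are placeholders, and -d /dev/shm/scratch is left out so the script does not depend on Linux. FILTER_BRANCH_SQUELCH_WARNING just silences the notice newer Git versions print.

```shell
#!/bin/sh
# Sketch only: reproduces the example history in a throwaway repository.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"

# Hypothetical helper: create a file named $1 and commit it with message $2.
commit_file() {
    echo "$1" > "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file index.html "Index"
commit_file admin.html "Admin page"

echo "stand-in for the DVD rip" > oops.iso
echo "other" > other.html
git add oops.iso other.html
git commit -q -m "Careless"

git rm -q oops.iso
git commit -q -m "Remove DVD-rip"

commit_file login.html "Login page"

# Same filter as above, minus the -d scratch directory.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --prune-empty \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \
  --tag-name-filter cat -- --all

# "Remove DVD-rip" became empty and was pruned; oops.iso is gone from HEAD's history.
git log --format='%h %s' --name-status
```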

After some churning, the history is now:

$ git lola --name-status
* 8e0a11c (HEAD, master) Login page
| A     login.html
* e45ac59 Careless
| A     other.html
|
| * f772d66 (refs/original/refs/heads/master) Login page
| | A   login.html
| * cb14efd Remove DVD-rip
| | D   oops.iso
| * ce36c98 Careless
|/  A   oops.iso
|   A   other.html
|
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

Notice that the new “Careless” commit adds only other.html and that the “Remove DVD-rip” commit is no longer on the master branch. The branch labeled refs/original/refs/heads/master contains your original commits in case you made a mistake. To remove it, follow the steps in “Checklist for Shrinking a Repository.”

$ git update-ref -d refs/original/refs/heads/master
$ git reflog expire --expire=now --all
$ git gc --prune=now
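If you want to confirm that the blob is really gone after these three commands, you can record its object name beforehand and ask git cat-file about it afterwards. A rough sketch in a scratch repository follows; the 1 MiB random file stands in for the DVD rip, and the variable names (blob, branch) are assumptions for illustration. It looks up the current branch with git symbolic-ref instead of hard-coding master.

```shell
#!/bin/sh
# Sketch only: checks that the cleanup steps actually drop the big blob.
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"

echo "<html></html>" > index.html
git add index.html
git commit -q -m "Index"

# 1 MiB of random data as a stand-in for the DVD rip.
dd if=/dev/urandom of=oops.iso bs=1024 count=1024 2>/dev/null
git add oops.iso
git commit -q -m "Careless"
blob=$(git rev-parse HEAD:oops.iso)

FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --prune-empty \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" -- --all

# The blob is still stored: refs/original and the reflogs keep it reachable.
git cat-file -e "$blob" && echo "after filter-branch: blob still stored"

branch=$(git symbolic-ref HEAD)            # e.g. refs/heads/master
git update-ref -d "refs/original/$branch"
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then
    echo "after gc: blob still stored"
else
    echo "after gc: blob is gone"
fi
git count-objects -vH
```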

For a simpler alternative, clone the repository to discard the unwanted bits.

$ cd ~/src
$ mv repo repo.old
$ git clone file:///home/user/src/repo.old repo

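Using a file:///... clone URL makes git copy objects through its normal transport rather than just hardlinking the old object files, so objects that are no longer reachable are left behind in repo.old.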

Now your history is:

$ git lola --name-status
* 8e0a11c (HEAD, master) Login page
| A     login.html
* e45ac59 Careless
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

The SHA1 object names for the first two commits (“Index” and “Admin page”) stayed the same because the filter operation did not modify those commits. “Careless” lost oops.iso and “Login page” got a new parent, so their SHA1s did change.

Interactive rebase

With a history of:

$ git lola --name-status
* f772d66 (HEAD, master) Login page
| A     login.html
* cb14efd Remove DVD-rip
| D     oops.iso
* ce36c98 Careless
| A     oops.iso
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

you want to remove oops.iso from “Careless” as though you never added it, and then “Remove DVD-rip” is useless to you. Thus, our plan going into an interactive rebase is to keep “Admin page,” edit “Careless,” and discard “Remove DVD-rip.”

Running $ git rebase -i 5af4522 starts an editor with the following contents.

pick ce36c98 Careless
pick cb14efd Remove DVD-rip
pick f772d66 Login page

# Rebase 5af4522..f772d66 onto 5af4522
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# If you remove a line here THAT COMMIT WILL BE LOST.
# However, if you remove everything, the rebase will be aborted.
#

Executing our plan, we modify it to

edit ce36c98 Careless
pick f772d66 Login page

# Rebase 5af4522..f772d66 onto 5af4522
# ...

That is, we delete the line with “Remove DVD-rip” and change the operation on “Careless” to be edit rather than pick.

Save-quitting the editor drops us at a command prompt with the following message.

Stopped at ce36c98... Careless
You can amend the commit now, with

        git commit --amend

Once you are satisfied with your changes, run

        git rebase --continue

As the message tells us, we are on the “Careless” commit we want to edit, so we run three commands.

$ git rm --cached oops.iso
$ git commit --amend -C HEAD
$ git rebase --continue

The first removes the offending file from the index. The second modifies or amends “Careless” to be the updated index and -C HEAD instructs git to reuse the old commit message. Finally, git rebase --continue goes ahead with the rest of the rebase operation.
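If you'd like to rehearse the whole rebase without typing into an editor, the todo-list edit can be scripted through GIT_SEQUENCE_EDITOR. This is only a sketch against a scratch copy of the sample history: commit_file and edit-todo.sh are hypothetical names, and the sed expressions assume the todo list starts with the “Careless” line followed by “Remove DVD-rip”, as in the listing above.

```shell
#!/bin/sh
# Sketch only: replays the interactive rebase non-interactively.
set -e
work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo"
git init -q
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"

# Hypothetical helper: create a file named $1 and commit it with message $2.
commit_file() {
    echo "$1" > "$1"
    git add "$1"
    git commit -q -m "$2"
}

commit_file index.html "Index"
commit_file admin.html "Admin page"
base=$(git rev-parse HEAD)                 # plays the role of 5af4522

echo "stand-in for the DVD rip" > oops.iso
echo "other" > other.html
git add oops.iso other.html
git commit -q -m "Careless"
git rm -q oops.iso
git commit -q -m "Remove DVD-rip"
commit_file login.html "Login page"

# Stand-in for the manual edit: line 1 becomes "edit", line 2 is deleted.
cat > "$work/edit-todo.sh" <<'EOF'
#!/bin/sh
sed -e '1s/^pick/edit/' -e '2d' "$1" > "$1.new" && mv "$1.new" "$1"
EOF
chmod +x "$work/edit-todo.sh"

GIT_SEQUENCE_EDITOR="$work/edit-todo.sh" git rebase -i "$base"

# The rebase is now stopped at "Careless".
git rm -q --cached oops.iso
git commit -q --amend -C HEAD
GIT_EDITOR=true git rebase --continue

# --cached left the file behind as an untracked file; remove it by hand.
rm -f oops.iso
git log --format='%h %s' --name-status
```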

This gives a history of:

$ git lola --name-status
* 93174be (HEAD, master) Login page
| A     login.html
* a570198 Careless
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

which is what you want.

Addendum: Enable git lola via ~/.gitconfig

Quoting Conrad Parker:

The best tip I learned at Scott Chacon’s talk at linux.conf.au 2010, Git Wrangling – Advanced Tips and Tricks was this alias:

lol = log --graph --decorate --pretty=oneline --abbrev-commit

This provides a really nice graph of your tree, showing the branch structure of merges etc. Of course there are really nice GUI tools for showing such graphs, but the advantage of git lol is that it works on a console or over ssh, so it is useful for remote development, or native development on an embedded board …

So, just copy the following into ~/.gitconfig for your full color git lola action:

[alias]
        lol = log --graph --decorate --pretty=oneline --abbrev-commit
        lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
[color]
        branch = auto
        diff = auto
        interactive = auto
        status = auto
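If you'd rather not edit ~/.gitconfig by hand, the same settings can be written with git config. A small sketch; note that it modifies your global configuration.

```shell
#!/bin/sh
# Writes the same aliases and color settings into the global git config.
set -e
git config --global alias.lol  "log --graph --decorate --pretty=oneline --abbrev-commit"
git config --global alias.lola "log --graph --decorate --pretty=oneline --abbrev-commit --all"
git config --global color.branch auto
git config --global color.diff auto
git config --global color.interactive auto
git config --global color.status auto

# Show what was stored.
git config --global --get alias.lol
git config --global --get alias.lola
```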

15

  • 5

    Why i can’t push when using git filter-branch, failed to push some refs to ‘[email protected]:product/myproject.git’ To prevent you from losing history, non-fast-forward updates were rejected Merge the remote changes before pushing again.

    Feb 4, 2013 at 10:49


  • 11

    Add the -f (or --force) option to your git push command: “Usually, the command refuses to update a remote ref that is not an ancestor of the local ref used to overwrite it. This flag disables the check. This can cause the remote repository to lose commits; use it with care.”

    Feb 4, 2013 at 23:47

  • 6

    This is a wonderfully thorough answer explaining the use of git-filter-branch to remove unwanted large files from history, but it’s worth noting that since Greg wrote his answer, The BFG Repo-Cleaner has been released, which is often faster and easier to use – see my answer for details.

    Jan 15, 2014 at 15:09

  • 2

    After I do either of the procedures above, the remote repository (on GitHub) does NOT delete the large file. Only the local does. I force push and nada. What am I missing?

    – 4Z4T4R

    May 13, 2014 at 21:11


  • 1

    this also works on dirs. ... "git rm --cached -rf --ignore-unmatch path/to/dir"...

    – rynop

    Aug 20, 2014 at 16:08
