Fun with git - breaking apart a large PR into more manageable chunks

Sometimes a team moves so quickly on a project that they end up with a rather large feature branch in a short amount of time. The sheer size of the change set makes it difficult to merge the feature branch code into the broader code base.

On a recent project, our team faced just this situation. We were cranking along, building a new anti-fraud integration for Bigcommerce Enterprise, when we (seemingly) suddenly neared the end of the project. We put in a pull request, which showed us how much code we had added and changed- a lot. The PR added or changed around 7500 lines of code. Needless to say, it would be challenging for one of our colleagues to review all of this code at once, so we decided to submit these changes in a few change sets to make code review more manageable.

Starting Point: +7500 LOC in feature branch

Removing schema change code

At Bigcommerce, we make schema changes with some releases and not others, so the first code we split off from our anti-fraud-feature branch was a schema change we needed to make.

To do this, we checked out a new branch off of our main branch.

git checkout master
git checkout -b anti-fraud-schema-change

Then we checked out the schema change file from our anti-fraud-feature branch.

git checkout origin/anti-fraud-feature -- path/to/schema/change/anti-fraud.sql
git commit -a -m "Add schema changes we need for new anti-fraud feature to master"
git push origin anti-fraud-schema-change

We pushed the code to our remote anti-fraud-schema-change branch, creating a new branch, and put in a pull request from this branch to master.

100 LOC removed, +7400 in feature branch

We also manually removed the schema change files from our anti-fraud-feature branch and pushed the resulting code to our remote anti-fraud-feature branch.

This schema change introduced a UI change that we wanted to prevent merchants from seeing during rollout, so we also reverted the UI change until we had better controls around it.

git checkout master
git fetch origin 
git checkout -b anti-fraud-ui-update
git cherry-pick shaForUIChange

We pushed this branch up to a remote anti-fraud-ui-update branch, and then reverted the commit in our feature branch.

git checkout anti-fraud-feature
git revert shaForUIChange
git push origin anti-fraud-feature`

5 LOC removed, +7395 in feature branch

Moving libraries to their own git branches

Next, we decided to separate a library that moves around data for our integration into a new branch called anti-fraud-data-sender. We chose to remove this library because it is not integrated with the existing Bigcommerce codebase. We also pulled the unit tests for this library into the branch to validate our code during the process of separating it from our feature branch. To do this, we checked out the library and test directories from our anti-fraud-feature branch onto a new branch created from master.

git checkout master
git checkout -b anti-fraud-data-sender
git checkout origin/anti-fraud-feature -- path/to/data/sending/
git checkout origin/anti-fraud-feature -- path/to/unittest/data/sending/
git commit -a -m "Separate out code that sends anti-fraud data"`

Failing unit tests showed us where we did not have appropriate mocks for classes not present in this branch. We updated our unit test mocks to remove these dependencies and amended our previous commit to include our minor changes. We put in a pull request to master for this branch so we could get feedback from our colleagues.

2300 LOC copied from feature branch

While waiting for feedback on that pull request, we pulled a helper library with some tie-ins to the main Bigcommerce codebase into a new branch. Each of these data gathering classes has unit tests, so we pulled the library and its tests into an anti-fraud-data-gather branch.

git checkout master
git checkout -b anti-fraud-data-gather
git checkout origin/anti-fraud-feature -- path/to/data/gather/
git checkout origin/anti-fraud-feature -- path/to/unittest/data/gather/
git commit -a -m "Separate out code that gathers anti-fraud data"`

Naturally, the unit tests failed because we had made some small changes in other parts of the codebase to support our data collection. We cherry-picked those commits over to the current branch, pushed it to our remote anti-fraud-data-gather branch, and put in a pull request for review.

3495 LOC copied from feature branch

Removing code from git history

After both the anti-fraud-data-sender and anti-fraud-data-gather branches were merged into master, we looked at several paths forward for the remaining code in the feature branch. It was important to get this code in relatively quickly, as it is the code that ties our anti-fraud libraries in with the rest of the codebase and makes our feature viable and many of these files change often.

The first option, which is how we typically move forward with smaller change sets, was to git pull --rebase origin master into our anti-fraud-feature branch. This would pull the current master and then try to add our changes on top; we would resolve major conflicts manually. This turned out to be a terrible idea for many reasons, including the 140 commits and the many updates we had made to the libraries we merged into master during the project— git rebase --abort.

Next we looked at taking master, git mergeing our anti-fraud-feature branch into it, and merging that into master. That has the unfortunate effect of changing the git history relative to master and doesn't follow our typical development flow, so that path is non-ideal as well.

We asked ourselves, "Wouldn't it be great if we could scrub our data-sending and data-gathering directories from our git history like they were never there so we could merge our remaining code?"

That's when we discovered git filter-branch. git filter-branch allows you to remove files or even entire directories from your git history (if you use the tree-filter option). It's like those files never existed.

First, we found the last common commit SHA with our master branch by looking at our git history in github. This was important because tree-filtering is no small feat and takes a good while even for 140 commits. Then we backed up our feature branch to 3 locations, just in case.

Then we ran the filter-branch git command with the tree-filter option. We force removed the directory that housed our data gathering code, from the last common commit SHA to the current commit/ HEAD.

git filter-branch --tree-filter 'rm -rf path/to/data/gather/' ThisIsOurLastCommonSHA..HEAD

And waited. And waited. And waited. (It probably took about 3 minutes, which seems like a long time when waiting to see if you broke a feature your team worked on for several months.) When it was done, our path/to/data/gather/ directory was gone. We repeated this process with the other directories we wanted git to forget in our anti-fraud-feature branch.

git filter-branch --tree-filter 'rm -rf path/to/data/sending/ path/to/unittest/data/sending/ path/to/unittest/data/gather/' ThisIsOurLastCommonSHA..HEAD

When our filter-branch finished, we confirmed that the directories that were supposed to be gone were removed, and then updated our anti-fraud-feature branch with changes made to master, using our standard git pull --rebase origin master workflow. As expected, there were no conflicts pulling in our libraries and unit tests. We proceeded to resolve other conflicts manually; changes had been made to several of the files we modified for our integration and these needed resolution.

5795 LOC removed, +1600 in feature branch

Resolving conflicts and recovering lost commits

The most difficult conflict to resolve was a mass whitespace change affecting 2 files we modified for our integration. There was no good way to resolve this change, so we opted to
git checkout --ours /path/to/modified/file to ensure our team's changes stayed around. We pushed the rebased anti-fraud-feature branch to our remote and staged a pull request into master to compare those files. Then we manually updated a few lines that were changed by someone outside our project team, and committed those changes.

In the meantime, some changes had been made to configuration files, so we needed to rebase again. We did git pull --rebase origin master as usual and pushed the result up to our remote. When we went to stage the PR again, we noticed that the last changes we'd made were not applied at the end of the rebase, and we'd just removed that commit from our history- local and remote. Not a good moment.

Since we'd recently had a version of our anti-fraud-feature branch checked out that contained our change, we used git reflog to show the recent history:

niue57d HEAD@{0}: rebase finished: returning to refs/heads/anti-fraud-feature
4fab17d HEAD@{1}: pull --rebase origin master: Update configs
d00617c HEAD@{2}: pull --rebase upstream master: checkout LastCommitSHAOnMaster
b9a612a HEAD@{3}: commit: Manual update of files affected by whitespace change`

and used git reset --hard HEAD@{3} to move back in time to before the rebase. We tried it again. This time, the rebase worked as expected, and applied our changes on top of those from master. We pushed our anti-fraud-feature branch for the last time, pre-review. The whitespace changes to those files added nearly 3000 changed lines of code, so we asked people to view the Github PR with ?w=1 at the end of the url to hide the whitespace changes.

Over the course of a week or so, we were able to whittle down our feature branch pull request so that our final anti-fraud-feature PR contained 1600 LOC (4500 including whitespace changes). This change set, while not small, is considerably smaller than the 7500 lines we started with. Once that pull request was reviewed (and fixed), we were able to merge our last group of changes into our master branch.