History-Preserving Modularization #4
Comments
troy remarks:
|
I think I see what you mean, and in the back of my mind I anticipated this problem. I guess one possibility would be to use git-svn with the --ignore-paths option. I wonder if there's also a way to strip out all the irrelevant stuff from the git repo. I'm guessing not :-/ |
In older versions of boost the modules were not so cleanly separated, right? This means that their development histories were intertwined and only recent versions can be separated. For full preservation, each repository would have the full history of all boost right up until the modularization point, at which time the beginning of each module would be a commit that moves things into place and removes everything else. This commit would be separate for each module. After that you could merge updated changes from boost svn and Git may be able to follow the renames and resolve the merge (mostly) automatically.
Just so I'm clear, Troy's concern is that the pre-modularization history would be duplicated in every module's repository?
This is related to an alternative use case for Git submodules which has been brought up on Git's mailing list many times (in slight variations). Currently the "git submodule" command is meant for a super-repository that refers to sub-repositories that each have completely distinct content. Another valid use case that is not supported right now is when the super-project shares content with the subproject (we have this problem at Kitware because ParaView shares VTK content and does not build without it). Boost modules have this problem with a slight twist: the modules share content historically but not in their modern revisions.
In the shared-content submodule use-case it makes more sense for the objects of the submodule to be stored in the same .git/objects database as the outer project. In the Boost case this would avoid duplicating all the old history in each module's repository while allowing it to appear so conceptually.
I've spent some time investigating how to do this with Git, and it is possible. However, it requires deep understanding of Git and careful manual set up of the work tree (and requires symlinks in some cases). More work is needed in Git proper to provide a nice interface for it. When I get more time to spend on it I plan to propose a solution to Git upstream. |
git filter-branch is one way to do this. But perhaps even better, we could use the technique described in http://progit.org//2010/03/17/replace.html So cool! |
Specifically, see http://progit.org//2010/03/17/replace.html?dsq=41051056#comment-41051056 |
Yes, for stripping irrelevant stuff from history, filter-branch is the way to go. I'm very familiar with it because I've been using it extensively to do manual cleanup of automatic cvs->git conversion results for permanent one-time conversions. I'm happy to answer any questions about it.
The "replace" approach looks like a porcelain around Git grafts. I've also used grafts extensively during cvs->git conversion cleanup. Basically you can just put a .git/info/grafts file in your repo. Each line is the hash of a commit followed by the hashes of other commits to pretend are the parents (and ignore the real parents). Note that grafts are meaningful only in the local repository. They can also break push/fetch if a transmitted commit is grafted.
Grafts are certainly a reasonable approach for the modularization-with-history problem. Just be sure that the graft for the first commit in each module points its pretend parent at a commit in the non-modularized history that has the same content for each file. Otherwise Git may not track the renames correctly (it can tolerate some edits with renames but disables this feature when more paths change than some threshold). Unfortunately it is up to each user that wants full history to fetch it from the full repo and add the correct graft locally. Perhaps a script or at least some documentation for each module can specify the proper graft line. |
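For readers unfamiliar with filter-branch, here is a minimal sketch of stripping a history down to one library's files. Everything in it is an assumption for illustration: the toy repository, the `libs/foo` and `libs/bar` paths, and the choice of `--subdirectory-filter` (a real Boost module spans several directories and would need a more elaborate filter).

```shell
#!/bin/sh
# Sketch only: toy monolithic repo, hypothetical paths libs/foo and libs/bar.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer Git prints a notice and pauses otherwise

work=$(mktemp -d)
cd "$work"
git init -q mono
cd mono
git symbolic-ref HEAD refs/heads/master  # fix the branch name regardless of local config

mkdir -p libs/foo libs/bar
echo foo1 > libs/foo/foo.hpp
echo bar1 > libs/bar/bar.hpp
git add .
git commit -q -m "add foo and bar"
echo bar2 >> libs/bar/bar.hpp
git commit -q -am "change bar only"
echo foo2 >> libs/foo/foo.hpp
git commit -q -am "change foo only"

# Rewrite master so that libs/foo becomes the project root.
# Commits that never touched libs/foo drop out of the rewritten history.
git filter-branch --subdirectory-filter libs/foo -- master >/dev/null 2>&1

foo_commits=$(git rev-list --count master)
foo_files=$(git ls-tree -r --name-only master)
echo "commits kept: $foo_commits"
echo "files kept: $foo_files"
```

The rewritten branch has new commit hashes, so it cannot be merged with the original history afterwards; that is the trade-off against the graft approach.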
I was thinking of doing it this way; comments appreciated:
|
As long as the SVN repo is active, no one can really use the modularized git repositories for development anyway, so why is this important? When doing this as a one time thing for the final conversion, we can do it with grafts or possibly just use filter-branch to clean up the history and get rid of unrelated changes. But why do it now? What am I missing? Also, I'm not sure about (5). Sure, merges MIGHT work, but they will fail when new files are added or old files are deleted, what do we do when that happens? Also the history will be full of merge commits, one every time we sync, which I guess is for every commit if it's automatic. There doesn't seem to be any point in doing merges here rather than just rebuilding the branch anyway, since the actual modularized repo (which doesn't have any history) would have its branch head reset anyway. |
Depends what you mean by "development." We can work on the build system, work on the testing system, update CMakeLists.txt files, keep things in synch, etc. I'm not sure whether the modularized repo would have its branch head reset. But in any case, please make a specific recommendation if you don't like this plan. I can't easily evaluate the consequences of your objections. One of my goals is to not take all of Boost "offline" for a week to do a conversion. |
Right, I meant anything that merges back to the SVN repo.
It doesn't carry any history, so there's nothing else you can do. There's no commit to merge in, so you'd have to rebuild it from scratch from the modularized source and then graft it on top of the new branch head from the repo that DOES carry history.
OK. My point was that maybe we don't have to do anything fancy. Just:
When the final conversion happens we can use filter-branch and do a complicated conversion that takes a week, and then just have every branch rebased on top of that. |
I had never intended to do anything that goes back into SVN. Everything you wrote after 1) above lacks enough detail to be sure I understand it:
|
Small update. Dave and I talked off-ticket about this, and here's a small drawing to clarify what the original plan was:
(1) is some arbitrary old commit in the repo, perhaps the first commit ever. (2) is subversion HEAD as of today, the state where we start from. (A) is the modularized state for HEAD as of today. (B), (C), etc. are the merges that sync with updated subversion. (A1), (B1), etc. are the tree states of (A), (B), etc. in the new library repository. The dotted lines represent graft relationships. |
I have asked about our conversion to git on the git mailing list. The thread is here: http://thread.gmane.org/gmane.comp.version-control.git/150270 The suggestions so far include git-filter-branch, --tree-filter, and svn2git. The latter looks like an interesting suggestion, and is how KDE migrated. Like ours, their migration also included a refactoring into separate repositories. Does anybody have any experience with it? |
These are in three different categories:
It seems, whatever else happens, one has to start with #3 or some equivalent. The only reason I can see to use filter-branch would be to rewrite history so that each library's own repo gets a history that contains only the files owned by that library. However, doing that correctly seems difficult at best and even if we could do it, I don't think the result would be all that useful, because it wouldn't reflect the true nature of pre-modularized boost: there are some tangled dependencies, and occasionally sweeping changes are made by one person across several libraries. In this thread I've been suggesting that each library's Git history include the un-modularized state and the modularization changes (file moves and deletes that take us from un-modularized to modularized). One way we could do that is to start every library's Git repo from a clone of the SVN mirror, but that would be very inefficient for anyone with multiple boost library repositories on his machine. So instead I think we should have the first commit of every library repo look identical: a snapshot of the latest SVN state (no history). Then, if someone wants to see further back in a given library's history, he can graft on the changes from the final state of the SVN mirror. Make sense? |
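A snapshot commit of the kind described above (same tree as the mirror's tip, no parents) can be made with plumbing. This is a minimal sketch under assumptions: the toy `mirror` repository stands in for the SVN mirror, and the branch name `library-start` is hypothetical.

```shell
#!/bin/sh
# Sketch only: "mirror" is a toy stand-in for the git mirror of boost svn.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"
git init -q mirror
cd mirror
git symbolic-ref HEAD refs/heads/master

echo a > a.txt
git add a.txt
git commit -q -m "r1"
echo b > b.txt
git add b.txt
git commit -q -m "r2"

# Create a parentless commit that reuses the tree of the mirror's tip.
snapshot=$(git commit-tree "master^{tree}" -m "Snapshot of latest SVN state (no history)")
git branch library-start "$snapshot"   # hypothetical branch name

snapshot_count=$(git rev-list --count library-start)
mirror_tree=$(git rev-parse "master^{tree}")
snapshot_tree=$(git rev-parse "library-start^{tree}")
echo "commits reachable from snapshot: $snapshot_count"
```

Because the snapshot's tree is identical to the mirror tip's tree, a later graft from the snapshot onto that tip introduces no content change at the join.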
I think so, but you and I should get together so I can be sure I'm understanding. IIUC, you'd like to leave ancient (pre-modularized) history in a frozen repository cloned from boost svn. And when you pull down an individual library, you still pull down all of boost, but minus the history. That means pulling down all of boost means pulling down boost >100 times (without history), is that right? I think I'm still confused. |
You're very close. "You still pull down all of boost" is true in a sense; i.e. the repository would contain a copy of just the latest state of every boost file. However, Git is really efficient at storing things so that probably wouldn't take up much space. Those files wouldn't exist in a typical working copy. Well, now that you mention it, Boost is quite large, even without the history. I suppose the optimal solution would be to move the graft point forward in time by one commit. So:
Howzat? |
OK. So pre-modularized boost lives on a server in the sky (github?) and is never downloaded, ever? But users have the option of adding grafts in their local repo pointing to the server in the sky -- in fact, each library would point to its own branch on the server in the sky. Have I got that right? "Going up to the Server in the Sky. It's where I'm gonna go when I die...." (with apologies to Norman Greenbaum). I didn't understand the first bit, though. In the scheme as I described it, you said git would only save the pre-modularized boost locally once (because they have the same hash). But that's only if library X and library Y share the same object store on the local machine. If I checked out library X into directory A and library Y into directory B, I'd still get two full copies of boost. |
Whether or not the pre-modularized boost repo gets cloned just depends if anyone is interested. But grafts don't "point at servers," or we would be able to build the grafts into the original library repos. Developers would have the option of fetching from pre-modularized boost (i.e. pulling one or more branches into their local modularized repo) and grafting the initial commit in their modularized repo onto the tip of one of those branches. You'd only get two full copies of boost if you decided to graft on history in both repos, but nobody will do that. Grafting is just something you'd do for exploring history on a local repository. Nobody will be pushing boost's history into the master repository of an individual library, so nobody will get that history in their local clone automatically. |
".. graft on history in both repos..." I don't know what two repos you're referring to. I think we had better save this discussion until we're face to face. It's not getting any clearer for me. |
For everyone else and the sake of posterity: I mean the repos for X and Y that are in directories A and B. |
If I'm understanding how things should work, each library's git repository will contain a branch called 'history' (or something similar) which contains all the pre-modularized boost history, and the library repo's master history will be rewritten so that the first commit after the pre-modularized boost has a dummy parent. Anyone who wants to view the full history will then have to fetch the branch from Github and use git-replace to link the first commit after the pre-modularized boost to the history's HEAD, which eventually shows the linear history. I just read http://progit.org//2010/03/17/replace.html?dsq=41051056#comment-41051056 Dave, and unfortunately when you clone a repo, it's going to do just that, clone whatever is in the repo. Maybe compressing the repositories would be an option to help with the large histories, but the Linux kernel development team doesn't seem to mind. Besides, I'm not sure if Github supports aggressive compression of the repositories on their end anyway, so any gains from repository compression would only be local. |
No, that's exactly what we want to avoid as noted above. 100 boost library clones means storing all of boost's history 100x.
This is not news to me, which is why I wrote that “It would be interesting if Git had a way...”
Local is the only important consideration unless we fear exceeding GitHub's storage limits for free repos |
IIRC, Git does not clone refs outside of refs/heads/ and refs/tags/ by default. You can push the 'history' branch to a non-standard ref in the main repo:
    git push origin history:refs/ancient/history
Others that clone the main repo can do
    git fetch origin refs/ancient/history:refs/ancient/history
to get the objects, and then add the graft. |
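The push and fetch above can be tried end to end against a local bare repository. This is a sketch with made-up repository names (`main.git`, `dev`, `clone`); only the `refs/ancient/history` ref name comes from the comment.

```shell
#!/bin/sh
# Sketch only: main.git, dev and clone are hypothetical local repositories.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"
git init -q --bare main.git
git --git-dir=main.git symbolic-ref HEAD refs/heads/master

git init -q dev
cd dev
git symbolic-ref HEAD refs/heads/history
echo old > file.txt
git add file.txt
git commit -q -m "ancient history"

# Start an unrelated root commit on master for the modularized state.
git checkout -q --orphan master
git commit -q -m "modularized root"

git remote add origin ../main.git
git push -q origin master
git push -q origin history:refs/ancient/history   # store history under a non-standard ref

cd ..
git clone -q main.git clone
cd clone
# A plain clone only sets up refs/heads/* and tags, so refs/ancient is absent.
refs_before=$(git for-each-ref --format='%(refname)' refs/ancient)

git fetch -q origin refs/ancient/history:refs/ancient/history
refs_after=$(git for-each-ref --format='%(refname)' refs/ancient)
echo "before: '$refs_before'  after: '$refs_after'"
```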
Oh, that is pretty cool. Thanks, Brad! |
set up a live clone of boost svn in a git repo (A) (done)
develop a script that modularizes boost. Test locally against repo (A). The script will:
Create a boost ryppl project
Add ryppl metadata pointing to the repositories of boost libraries
Now:
|
Cool! What tool did you use to create the live clone and how are you keeping it in sync? |
Ha! You misunderstand. This is my TODO list. I was asking for feedback about whether these are the right things to do, and if they're in the right order. |
OK, but what does “done” mean in
|
troy did that already. |
Oh, yeah, but it's incomplete IIRC. Only tracks trunk and release, right? |
I suggest creating a history repository on github for the "history" branch as a normal head. Then fork that to create each individual library repository. After forking, then move the history branch to refs/ancient/history. Finally, leave only the modularized history in each repo's refs/heads/. This approach should help github re-use disk space for all the ancient history objects. It will also provide a first-class historical reference repository. However, I'm not sure off the top of my head what other effects on the apparent organization that might have. |
OK, this sounds good. Thanks Brad. |
That's a neat idea, thanks for sharing Brad! |
True, it's a neat idea, but after some consideration I'm not sure we get much of an advantage by having ancient history in each library's repo. The user is going to have to fetch those commits explicitly and make a graft either way, i.e. the average user will need instructions. I don't think those would be simplified much by not having to reference the ancient history remote. |
Agreed. It makes more sense to fetch directly from the ancient [...]
A graft is just a line in the local ".git/info/grafts" file with the [...]
    # A -> B (this line is a comment)
    aaaaaaaa bbbbbbbb
where "aaaaaaaa" is the 40-byte SHA-1 of commit A, "bbbbbbbb" is the [...]
In our use case, commit A is the root commit of one Boost module, and [...]
The set of modules that can be extracted from the monolithic source is [...]
    $ git remote add history git://somewhere/boost-history.git
    $ git fetch history
    $ git show history/master:grafts > .git/info/grafts
This assumes that the "master" branch of the history repository has a [...] |
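A local dry run of the fetch-and-graft flow might look like the following. It is a sketch under assumptions: the `boost-history` and `module` repositories are toy stand-ins, and `git replace --graft` is swapped in for the `.git/info/grafts` file because newer Git versions flag the grafts file as deprecated. The equivalent grafts line is shown in a comment.

```shell
#!/bin/sh
# Sketch only: toy repositories; git replace --graft substituted for the grafts file.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for the ancient-history repository.
git init -q boost-history
cd boost-history
git symbolic-ref HEAD refs/heads/master
echo v1 > lib.hpp
git add lib.hpp
git commit -q -m "ancient 1"
echo v2 > lib.hpp
git commit -q -am "ancient 2"
cd ..

# Module repository whose root commit has the same content as the history tip.
git init -q module
cd module
git symbolic-ref HEAD refs/heads/master
echo v2 > lib.hpp
git add lib.hpp
git commit -q -m "module root"
echo v3 > lib.hpp
git commit -q -am "module work"

count_before=$(git rev-list --count master)

git remote add history ../boost-history
git fetch -q history

root=$(git rev-list --max-parents=0 master)
# Equivalent grafts-file line would be: "<hash of $root> <hash of history/master>"
git replace --graft "$root" history/master

count_after=$(git rev-list --count master)
echo "visible commits: $count_before before graft, $count_after after"
```

Like a grafts file, the replacement ref lives only in the local repository unless someone pushes `refs/replace/*` explicitly.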
perfect. |
Boost has already been modularized in http://gitorious.org/boost. However, these modules don't bring along the Boost development history.
I think if we want to keep the history, we need to remake each of these repos a clone of http://gitorious.org/boost/svn, which is automatically tracking our SVN repository, then make the changes required to modularize the repo, so we'll be able to continue pulling changes in from our Git SVN mirror. I think it's important to make these changes fine-grained, so you don't move/rename and modify a file in a single commit, so that Git can succeed with later merges. It would be a good idea if each module contained a script (or other record) of the exact steps required to create and update it.
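To illustrate the fine-grained point, here is a minimal sketch that moves a file in one commit and edits it in the next, then checks that Git sees the move as an exact rename. The paths and file contents are hypothetical.

```shell
#!/bin/sh
# Sketch only: hypothetical paths boost/foo/foo.hpp -> include/boost/foo/foo.hpp.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

mkdir -p boost/foo
printf 'line1\nline2\nline3\n' > boost/foo/foo.hpp
git add .
git commit -q -m "pre-modularization state"

# Commit 1: move only, no content change.
mkdir -p include/boost/foo
git mv boost/foo/foo.hpp include/boost/foo/foo.hpp
git commit -q -m "Move foo.hpp into modular layout (no edits)"

# Commit 2: edit only, no path change.
echo 'line4' >> include/boost/foo/foo.hpp
git commit -q -am "Edit foo.hpp"

# The move commit is reported as a 100% similarity rename.
rename_status=$(git diff -M --name-status HEAD~2 HEAD~1 | cut -f1)
# --follow walks back across the rename to the original path.
followed=$(git log --follow --format=%s -- include/boost/foo/foo.hpp | wc -l | tr -d ' ')
echo "rename status: $rename_status, commits followed: $followed"
```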