Managing XML in Git or Mercurial? Watch out for your merges

It's increasingly common to manage XML-based tech docs in distributed version control systems such as Git and Mercurial. For limited requirements, this is quite feasible. Every contributor gets their own full version of the repository, available locally, offline. Different contributors can work on the same set of files concurrently. Changes can be co-ordinated via a central server (of which GitHub and BitBucket are examples, though it's easy to set up one's own), or even on a peer-to-peer basis. 

Of course, this relies on decent branch and merge capabilities. This is an area where distributed version control has made software development notably easier. When two sets of changes are merged, they're automatically compared with their common ancestor to see what's actually new (and presumably should be kept). With program code development, even concurrent changes to the same file are usually easy to merge. If one contributor edits one part of the file and another edits another part, comparison with the common ancestor shows that both these changes are new, and so the resulting automatically merged file incorporates them both.
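
The rule described above can be sketched in a few lines of Python. This is only an illustration, not how Git or Mercurial implement merging: it assumes the three versions have already been aligned into the same number of slots (real tools work out that alignment with a diff algorithm), and the names `merge_slot`, `merge_three` and `MergeConflict` are made up for this example.

```python
class MergeConflict(Exception):
    """Raised when both sides changed the same slot in different ways."""


def merge_slot(base, a, b):
    """Apply the three-way rule to one aligned unit (for example, one line)."""
    if a == b:        # both sides agree: both unchanged, or both made the same change
        return a
    if a == base:     # only B differs from the ancestor, so B's version is new
        return b
    if b == base:     # only A differs from the ancestor, so A's version is new
        return a
    raise MergeConflict(f"both sides changed {base!r}: {a!r} vs {b!r}")


def merge_three(base, a, b):
    """Merge three equally long, pre-aligned lists of units (an assumption of this sketch)."""
    if not (len(base) == len(a) == len(b)):
        raise ValueError("this sketch assumes pre-aligned versions")
    return [merge_slot(x, y, z) for x, y, z in zip(base, a, b)]


base = ["intro", "middle", "end"]
author_a = ["intro (edited by A)", "middle", "end"]
author_b = ["intro", "middle", "end (edited by B)"]
print(merge_three(base, author_a, author_b))
```

Because each author touched a different slot, both edits survive. If both had edited "intro" differently, the sketch would raise `MergeConflict` rather than guess.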

However, this principle can fall down when working with XML documents and using the version control system's built-in diff/merge features, or using standard line-based merge tools such as KDiff3 or the diff tool in Eclipse. In program code, each line is often a meaningful unit. In XML documents, the smallest meaningful unit is generally an element, which often spans multiple lines. Line-based merge tools can be fiddly to work with and you can easily end up with invalid files or missing content.

Some time ago, I blogged about these potential problems and an XML-aware 3-way merge tool that alleviates them. I mentioned this on a thread on the DITA-Users mailing list, and Ron Wheeler replied:

I would be interested in some example files where Project:Merge would do something that Eclipse would not handle as well.

To answer this question I've been playing around with some examples. First, here's one where you would think that the diff algorithm should take the new content from each author, but it doesn't.

Base content:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files.</p>
</conbody>

Author A adds to the paragraph:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
    is necessary when a file is modified by two people on two different computers at the same
    time. When two branches are merged, the result is a single collection of files that contains
    both sets of changes.</p>
</conbody>

However, Author B, working from the same base, adds a definition list:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files.</p>
  <dl>
    <dlentry>
      <dt>Automatic merging</dt>
      <dd>Automatic merging is what revision control software does when it reconciles changes that
        have happened simultaneously (in a logical sense). </dd>
    </dlentry>
    <dlentry>
      <dt>Manual merging</dt>
      <dd>Manual merging is what people have to resort to (possibly assisted by merging tools)
        when they have to reconcile files that differ.</dd>
    </dlentry>
  </dl>
</conbody>

Using a standard 3-way merge algorithm, KDiff3 reports two automatically resolved conflicts and one that requires manual resolution:

Here's the result:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
<Merge Conflict>
      <dd>Automatic merging is what revision control software does when it reconciles changes that
      have happened simultaneously (in a logical sense). </dd>
    </dlentry>
    <dlentry>
      <dt>Manual merging</dt>
      <dd>Manual merging is what people have to resort to (possibly assisted by merging tools)
        when they have to reconcile files that differ.</dd>
    </dlentry>
  </dl>
</conbody>

Here's how it looks in KDiff3:

If I use the tool to pick lines from A, I get this malformed fragment:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
    is necessary when a file is modified by two people on two different computers at the same
    time. When two branches are merged, the result is a single collection of files that contains
    both sets of changes.</p>
      <dd>Automatic merging is what revision control software does when it reconciles changes that
        have happened simultaneously (in a logical sense). </dd>
    </dlentry>
    <dlentry>
      <dt>Manual merging</dt>
      <dd>Manual merging is what people have to resort to (possibly assisted by merging tools)
        when they have to reconcile files that differ.</dd>
    </dlentry>
  </dl>
</conbody>

If I use the tool to pick lines from B, I get this malformed fragment:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
  <dl>
    <dlentry>
      <dt>Automatic merging</dt>
      <dd>Automatic merging is what revision control software does when it reconciles changes that
        have happened simultaneously (in a logical sense). </dd>
    </dlentry>
    <dlentry>
      <dt>Manual merging</dt>
      <dd>Manual merging is what people have to resort to (possibly assisted by merging tools)
        when they have to reconcile files that differ.</dd>
    </dlentry>
  </dl>
</conbody>
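
One way to catch this kind of damage before committing is to run the merged file through an XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the shortened sample text are assumptions for illustration, and the check covers well-formedness only, not validity against the DITA DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single well-formed XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Shortened stand-in for the fragment above: <p> is opened but never closed.
picked_from_b = """<conbody>
  <p>Merging is a fundamental operation. Most often, it
  <dl>
    <dlentry>
      <dt>Automatic merging</dt>
      <dd>What revision control software does.</dd>
    </dlentry>
  </dl>
</conbody>"""

# The same content with the paragraph completed and closed.
repaired = picked_from_b.replace("Most often, it", "Most often, it is necessary.</p>")

print(is_well_formed(picked_from_b))  # the unclosed <p> makes the parser fail
print(is_well_formed(repaired))
```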

With Project: Merge, an XML-aware 3-way merge tool, I get to pick whole elements from A or B if I launch it manually. If I launch it via the Git/Mercurial client SourceTree, the correct new elements are selected by default:

This is the result of that correct merge:

<conbody>
  <p>Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
    is necessary when a file is modified by two people on two different computers at the same
    time. When two branches are merged, the result is a single collection of files that contains
    both sets of changes.</p>
  <dl>
    <dlentry>
      <dt>Automatic merging</dt>
      <dd>Automatic merging is what revision control software does when it reconciles changes that
        have happened simultaneously (in a logical sense). </dd>
    </dlentry>
    <dlentry>
      <dt>Manual merging</dt>
      <dd>Manual merging is what people have to resort to (possibly assisted by merging tools)
        when they have to reconcile files that differ.</dd>
    </dlentry>
  </dl>
</conbody>
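
To show what merging whole elements rather than lines can mean, here is a deliberately naive sketch. I don't know how Project: Merge works internally, so this is not its algorithm: it assumes the child elements of the root can be aligned simply by position, and the names `element_key` and `merge_children` are invented for the example.

```python
import copy
import xml.etree.ElementTree as ET
from itertools import zip_longest


def element_key(elem):
    """Serialize an element without its tail (the text after its closing tag)."""
    if elem is None:
        return None
    clone = copy.deepcopy(elem)
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


def merge_children(base_xml, a_xml, b_xml):
    """Three-way merge of the root's child elements, aligned by position (naive)."""
    base, a, b = (ET.fromstring(text) for text in (base_xml, a_xml, b_xml))
    merged = ET.Element(base.tag, dict(base.attrib))
    for base_child, a_child, b_child in zip_longest(base, a, b):
        key_base = element_key(base_child)
        key_a = element_key(a_child)
        key_b = element_key(b_child)
        if key_a == key_b:          # both sides agree
            pick = a_child
        elif key_a == key_base:     # only B changed or added this element
            pick = b_child
        elif key_b == key_base:     # only A changed or added this element
            pick = a_child
        else:
            raise ValueError("both authors changed the same element differently")
        if pick is not None:
            kept = copy.deepcopy(pick)
            kept.tail = None
            merged.append(kept)
    return ET.tostring(merged, encoding="unicode")


base_doc = "<conbody><p>One.</p></conbody>"
doc_a = "<conbody><p>One. Two.</p></conbody>"
doc_b = "<conbody><p>One.</p><dl><dlentry><dt>T</dt><dd>D</dd></dlentry></dl></conbody>"
print(merge_children(base_doc, doc_a, doc_b))
```

With these shortened inputs, the merged document keeps A's longer paragraph and B's definition list, because each element was changed by only one author. If both authors change the same element in different ways, the sketch raises an error instead of guessing.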

If the closing </p> tag is on its own line, line-based diff tools can also merge the file automatically and correctly. However, I'm not sure whether XML editors can be configured to always put tags on their own lines (I couldn't find an option for that in Oxygen), and in any case the addition of attributes could still break the consistency of opening tags. Even with tags on separate lines, if the tags overlap then line-based diff tools can't cope.
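
The effect of putting </p> on its own line can be made visible with Python's standard `difflib`. The sketch below lists which base lines each author's version touches and reports whether the two sets of changes overlap or touch. Treating touching regions as a collision is a simplifying assumption (tools differ in how they handle adjacent changes), and the helper names and shortened sample lines are invented for the example.

```python
from difflib import SequenceMatcher


def changed_base_ranges(base, other):
    """Return (start, end) ranges of base lines that `other` replaces, deletes or inserts at."""
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _j1, _j2 in matcher.get_opcodes() if tag != "equal"]


def regions_collide(base, a, b):
    """True if any change by A overlaps or touches any change by B (assumed conflict rule)."""
    for a1, a2 in changed_base_ranges(base, a):
        for b1, b2 in changed_base_ranges(base, b):
            if a1 <= b2 and b1 <= a2:
                return True
    return False


# Closing tag shares a line with the text, as in the first example.
inline_base = ["<conbody>", "<p>First sentence.</p>", "</conbody>"]
inline_a = ["<conbody>", "<p>First sentence. Second", "sentence.</p>", "</conbody>"]
inline_b = ["<conbody>", "<p>First sentence.</p>", "<dl>...</dl>", "</conbody>"]

# Same edits, but with every tag on its own line.
split_base = ["<conbody>", "<p>", "First sentence.", "</p>", "</conbody>"]
split_a = ["<conbody>", "<p>", "First sentence. Second", "sentence.", "</p>", "</conbody>"]
split_b = ["<conbody>", "<p>", "First sentence.", "</p>", "<dl>...</dl>", "</conbody>"]

print(regions_collide(inline_base, inline_a, inline_b))
print(regions_collide(split_base, split_a, split_b))
```

In the inline layout, A rewrites the line that ends in </p> and B inserts directly after that same line, so the changes touch. In the split layout, the unchanged </p> line sits between them and keeps the two edits apart.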

Here's an example of a true conflict; one that the 3-way merge algorithm shouldn't automatically resolve.

Base content:

<conbody>
  <p>
    Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
    is necessary when a file is modified by two people on two different computers at the same
    time. When two branches are merged, the result is a single collection of files that contains
    both sets of changes.
  </p>
</conbody>

Author A splits it into a shorter paragraph and a note:

<conbody>
  <p>
    Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files.
  </p> 
  <note>
    Most often, it is necessary when a file is modified by two people on two different computers 
    at the same time. When two branches are merged, the result is a single collection of files 
    that contains both sets of changes.
  </note>
</conbody>

Author B also splits it, but at a different place:

<conbody>
  <p>
    Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files. Most often, it
    is necessary when a file is modified by two people on two different computers at the same
    time.
  </p> 
  <note>
    When two branches are merged, the result is a single collection of files that contains
    both sets of changes.
  </note>
</conbody>

KDiff3 reports four automatically resolved conflicts (it shouldn't really be resolving any in this case) and two remaining ones:

<conbody>
  <p>
    Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files.
<Merge Conflict>
  </p> 
  <note>
<Merge Conflict>
  </note>
</conbody>

In KDiff3, if I accept A's changes, I get exactly what A had. However, if I accept B's changes, the result becomes malformed, gaining a duplicate </note> tag and also losing some of the content:

<conbody>
  <p>
    Merging (also called integration) in revision control, is a fundamental operation that
    reconciles multiple changes made to a revision-controlled collection of files.
    time.
  </p> 
  <note>
    When two branches are merged, the result is a single collection of files that contains
    both sets of changes.
  </note>
  </note>
</conbody>

In Project: Merge, for each element (<p> and <note>), there's a preview and a button to pick the version I want to keep.

Other XML-aware tools

The reason I've focused on Project: Merge is that it's cheap and it integrates very easily with distributed version control systems and clients.

DeltaXML provide a powerful set of tools to enable N-way merging (i.e. more than just two simultaneous changes to a base ancestor), keeping clear info on who changed what and integrating with editors and CMSs to provide friendly functionality based on that. I haven't yet looked in any detail at how it might integrate with distributed version control systems.

I don't think there's much else out there. There are a couple of defunct XML merge projects that seem to have ceased active development in the mid-2000s, but that's it.

2016-12-03 update: I've just been made aware that Oxygen XML introduced an XML-aware 3-way merge feature in version 18. I've tried the feature out, and although it seems to take a simpler approach than Project: Merge to predicting the changes you'll want to keep, it offers a lot of control over which differences between files you merge in. There are now new ways to use Oxygen with Git anyway, and the merge feature is likely to keep improving.

Back to the big picture

An increasing number of organizations are managing XML content in distributed version control systems such as Git or Mercurial, and for most it works quite well. If each author is responsible for certain files exclusively (at least for any given release), then merge problems will be greatly reduced. Even when multiple authors need to work on the same file concurrently, problems can be avoided by setting your tools up to always trigger a manual merge. (I believe this is what Adrian Warman's team at IBM did.) However, if you want to work concurrently with the same ease that you do with code, and make the most of these systems' 3-way merge features, then an XML-aware merge tool seems like a wise investment.
