QA Plugin: Solving for Attribute Chunk

The Issue

The @chunk="to-content" requirement for the QA plugin has always been a bit sticky. Honestly, I hadn't thought much about it, since we run the QA plugin through a self-service web server and that attribute is handled by a Python controller. When I started thinking in terms of local builds, though, it became evident that setting @chunk by hand would quickly become a tiresome routine.

Besides attribute handling, the web server also masks another consideration—the QA plugin may not be running in isolation from other plugins.

The First Iteration

The first iteration of moving this functionality into the plugin itself resulted in a new build target extending the chunk preprocess step.

In plugin.xml:

<feature extension="depend.preprocess.chunk.pre" value="setchunk"/>

The target in build_qadata.xml:

<target name="setchunk" description="Set @chunk to-content on the temp input bookmap" if="if.chunk">
  <replace file="${dita.temp.dir}/${user.input.file}"
           token="&lt;bookmap " value="&lt;bookmap chunk='to-content' " />
  <replace file="${dita.temp.dir}/${user.input.file}"
           token="&lt;map " value="&lt;map chunk='to-content' " />
</target>

The new target used Ant's token-based replace task to add the chunk attribute just before processing began in the temporary build directory. This solved the problem of manually setting the attribute, but it also extended the chunk pre-processing to sibling plugins.
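For example, assuming a bookmap whose start tag is <bookmap id="taskbook"> (the id attribute is just illustrative), the replace rewrites it in the temp directory as:

<bookmap chunk='to-content' id="taskbook">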

The Solution

It's possible to add an if-condition to a target to look for the presence of a command-line parameter, but I needed to look for a parameter with a specific value. A second iteration added a double-hop if-condition to the Ant call.

<condition property="if.chunk">
  <equals arg1="${setchunk}" arg2="true" casesensitive="false" />
</condition>

<target name="setchunk" description="Set @chunk to-content on the temp input bookmap" if="if.chunk">
  <replace file="${dita.temp.dir}/${user.input.file}"
           token="&lt;bookmap " value="&lt;bookmap chunk='to-content' " />
  <replace file="${dita.temp.dir}/${user.input.file}"
           token="&lt;map " value="&lt;map chunk='to-content' " />
</target>

This approach looks for the presence of the setchunk switch with a value of true before applying the target. The build is then called with:

dita -f qa -i samples/taskbook.ditamap -Dsetchunk=true

So if you run the QA plugin alongside any others, you can leave off the switch to avoid unwanted chunk attributes.

Word Conversion Tool

I initially drafted this post about a Word conversion tool two years ago, and for some forgotten reason, never published it. Eventually the project was scrapped, but we did use the tool for quite a while. I’ve gone back and made the article past-tense, and added some reflection. I think there’s still some good stuff in here…

Word. It ain't going away anytime soon. As a DITA (and, more generally, structured authoring) evangelist, I've long loathed the Wild West approach to documentation that Word engenders. Okay, sure, it is easy to use. It doesn't really require any training. I suppose, yes, one could argue that writing in Word lets authors focus on the content rather than on "tagging." (I think that argument is faulty, as I'm sure you do, but we won't go into that here.) The point is, if Word is going to be here awhile, what might we do about it?


Word2DITA Plugin (DITA4Publishers)

One of the groups my team supports is solutions engineering. This group figures out the best way to run 3rd-party applications on our platform. Although they are not writers, one of their primary deliverables is documentation, and a lot of it. My team provides editorial services throughout the entire lifecycle.

Now that we’ve made great progress on content quality, which is the most important thing, here’s the problem: how to improve the user experience of the content. It will surely come as no surprise that the solutions documentation is authored in Word. You can get a minimally viable PDF from Word, but that’s about it.

I dream of a time when there's a DITA authoring interface nice enough that they could create DITA topics directly, but that time isn't now. These authors are more than casual authors, but writing and content management isn't close to 100% of their time either. As such, authoring and reviewing in Word is a requirement.

At the same time, we already have a refined process to publish to our support portal in a searchable fashion. This process is based on DITA to HTML.

You can see where I’m going. How to get from Word to DITA so we can use our existing publishing pipeline?

The first solution that comes to mind is the DITA4Publishers Word2DITA plugin.


The instructions here seem to be out of date. The good news is that the reality is easier.

  1. Install the DITA4Publishers plugins.
  2. Copy the sample file from GitHub to the samples folder under DITA-OT.
  3. Run the transform.
    ant -f build.xml -Dtranstype=word2dita -Dargs.input=word2dita_single_doc_to_map_and_topics_01.docx

In the out folder you’ll see a map and topics. This is a sample document with the default style-to-tag mapping.


Unfortunately, images are not extracted. The solution (found here) is to open the Word file in oXygen and extract the media folder to the topics folder created by word2dita. This was a bitter disappointment: I had envisioned a single build target that would convert the Word file to DITA and then to PDF, HTML, and ePUB. There would have to be a manual step in there to extract the images.

But then I put 2 and 2 together: the DOCX file is a ZIP, and Ant has an unzip task. So I added these lines to the target:

<unzip src="${args.input}" dest="${temp}" />
<copy todir="${out}/topics/media" failonerror="false">
  <fileset dir="${temp}/word/media" />
</copy>

Now the DITA output is complete.
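For the record, here's roughly how those lines can sit in a wrapper target. The target name and the property plumbing are illustrative, not the exact build file I used:

<target name="word2dita-with-images">
  <!-- Run the word2dita transform the same way the command line does. -->
  <ant antfile="build.xml">
    <property name="transtype" value="word2dita" />
    <property name="args.input" value="${args.input}" />
  </ant>
  <!-- A DOCX is a ZIP archive, so pull the images straight out of it. -->
  <unzip src="${args.input}" dest="${temp}" />
  <copy todir="${out}/topics/media" failonerror="false">
    <fileset dir="${temp}/word/media" />
  </copy>
</target>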

QA Check Compiler

We’ve been working on some enhancements for the QA plugin that are now available. You can download the plugin from GitHub.

The first enhancement I want to talk about is the QA check compiler.

Writing a QA script in PowerShell was a pretty keen idea even if I do say so myself. Moving to an Open Toolkit plugin was an even better idea with better execution. One of the drawbacks to the OT mechanism, however, is how complicated the expression of a simple check is.

For example, let's say you want to flag occurrences of "utilize" and suggest "use" instead. This is the expression you have to write:

<xsl:if test="descendant::*[not($excludes)]/text()[matches(.,'utilize', 'i')]">
  <data type="msg" outputclass="term mmstp" importance="recommended">Found "utilize". Use "use".</data>
</xsl:if>

The contents of the matches call and the value and attributes of the data element are all significant and also very repetitive. As we all know, repetition leads to errors.

Authoring Checks for use with the Compiler

With the QA check compiler, you author the checks in an abbreviated form. The checks go in a properties table within a DITA reference topic. To express the example rule above, you just add a row to a properties table specifying the severity, expression, and message.

The QA compiler, executed by the compilechecks target, takes care of converting the rows in the properties tables into checks that the plugin can execute.

  • The propdesc becomes the message for the check.
  • The propvalue becomes the argument to the matches function in the XPath expression.
  • The proptype becomes the @importance.
  • The @id of the parent properties table becomes the @outputclass of the check.

You can have as many properties tables as you want. If the @id is term_mmstp, the resulting category will be term mmstp. (Spaces aren't allowed in @id, so an underscore is necessary; it's replaced with a space in the output.) These categories are unconstrained: you can make them whatever you want.

The proptype element is limited to the values for @importance: default, deprecated, high, low, normal, obsolete, optional.
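For example, the utilize check from above can be authored as a row like this (a sketch based on the mapping just described; the exact table headings in qa_checks_r.dita may differ):

<properties id="term_mmstp">
  <property>
    <proptype>recommended</proptype>
    <propvalue>utilize</propvalue>
    <propdesc>Found "utilize". Use "use".</propdesc>
  </property>
</properties>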

Enabling the QA Compiler

The result of the QA compiler isn't enabled by default. To enable it, uncomment the xsl:include call in xsl/qa_checks/_qa_checks.xsl and also remove the term template from that stylesheet. The QA compiler produces a template called term to make it easy to integrate, and you can't have two templates with the same name. Once the result is included, you can start adding and modifying checks in tools/qacompiler/qa_checks_r.dita, which is a DITA reference topic. Don't forget to run ant compilechecks after editing the DITA topic.
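After those edits, the relevant part of _qa_checks.xsl would look something like this (the include href is a placeholder; use whatever filename the compiler actually writes):

<!-- Checks produced by "ant compilechecks"; the href is a placeholder. -->
<xsl:include href="qa_checks_compiled.xsl" />
<!-- The hand-written "term" template that used to live here is removed,
     because the compiled stylesheet supplies its own template named "term". -->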

XML-aware diff with Git

One of the less-than-perfect aspects of using Git for XML is comparing versions of a file. Standard diff tools are not optimized for files that contain markup. Not only is the markup exposed, but irrelevant details (like indentation or line length) can appear far more significant than they really are. Although you can reduce the impact by telling the diff tool to ignore whitespace, such tools will never be semantically aware.

The Windows client TortoiseGit includes a graphical diff tool. If you select a revision of a file in the Git repository, you can diff it with previous or later versions. This is a convenient feature, but it's disappointing that the diff is not XML aware.

I just found out that oXygen includes a graphical diff tool called diffFiles.exe. (Graphical is worth noting, because it can't write output to the console.) I wondered whether there was a way to have TortoiseGit use diffFiles rather than TortoiseGitMerge.

It turns out that there is. Go to TortoiseGit > Settings > Diff Viewer and click Advanced. Create new entries for .dita and .xml, setting the following (adjusting the file path as needed for your environment) as the Program:

 C:\Program Files\Oxygen XML Editor 16\diffFiles.exe %base %mine

Now when you tell TortoiseGit to compare DITA or other XML files it will use the oXygen XML-aware diff rather than TortoiseGitMerge.

There are a couple of limitations. One is that you can't use oXygen's diff to do a 3-way merge, which can be useful if you have merge conflicts. However, I never do this with XML files. The other limitation is that oXygen diff takes much longer to start than TortoiseGitMerge. TortoiseGitMerge is almost instantaneous, while oXygen diff takes several seconds.

QA Plugin Use Case: Learning Engagement

I thought it would be useful to share a use case for the QA Plugin from the Education team at Citrix. In addition to the metrics in the open-source code, we've added a number of our own that measure the quality of instructional design in our courses. For example, we calculate what we call an "engagement ratio", which is the ratio of words to interactions. We find a good target is 250 words to each interaction. The ratio gives us a single metric that tells us, at least directionally, whether the course will offer a sound experience for the student.
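As a rough illustration of the idea (a sketch, not the actual Citrix implementation), an XSLT pass could compute the ratio like this, assuming interactions are tagged with a recognizable outputclass:

<xsl:template name="engagement-ratio">
  <!-- Sketch only: the outputclass value "interaction" is an assumption. -->
  <xsl:variable name="words"
      select="count(tokenize(normalize-space(string(/)), '\s+'))"/>
  <xsl:variable name="interactions"
      select="count(//*[@outputclass = 'interaction'])"/>
  <!-- Divide by at least 1 to avoid a divide-by-zero on courses with no interactions. -->
  <data type="msg" importance="normal">
    Engagement ratio: <xsl:value-of select="round($words div max(($interactions, 1)))"/> words per interaction
  </data>
</xsl:template>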

Of course, if the content uses a lot of “click to see more text” interactions, then a low ratio may be misleading. That’s why we also total up the number of each interaction type. Showing these two metrics together gives us a solid understanding of the variety and frequency of interaction in a course.

In addition, we are able to calculate reading time vs. other activities, like videos, labs, and simulations, as well as an estimated total course length. Therefore, we have language metrics telling us about terminology and style use, interaction metrics telling us about variety and frequency, and timing metrics about various activity types. Those metrics combined give us an accurate picture of how engaging a course will be, without having to read a single page.

But, you know, you should still read the course. :-) With the QA plugin, though, you know where to focus, what issues you are likely to encounter, and how much work you are likely to need in order to get the course ready for release.

If you have a use case for the QA plugin, please let us know! We’d be more than happy to feature it here on ditanauts.

QA Plugin Updated! (Finally, right?)

Hi folks, I’m happy to let you know that we have posted a major update to the QA Plugin. The ditanauts team owes a huge debt of gratitude to Don Day and Michael Boses for their work on this update. What’s new you ask? Well…

  • Reports are prettier. The HTML report we generate uses Google Charts to render visual elements.
  • We create a data file (written in DITA) rather than generating the report HTML directly from the DITA input. With the data file, you can then render whatever you want using normal OT processing. The plugin creates an HTML report and a .csv file from the data file.
  • @chunk is set automatically on bookmaps. One of the really annoying things with the old version was that you had to set the @chunk attribute manually before a build. That is no longer the case when building from a bookmap!
I’ve updated the install and run sections of the how-to page; I will be updating the customization section soon.
Let us know what you think!

QA Plugin XSLT: Locating Distinct Values for Duplicate IDs

As part of a new framework enhancement, I needed a method to ensure that certain DITA elements carried unique @id values. Our authoring tool does a good job identifying duplicate @id values within a topic, but does not indicate whether those values also exist in other topics referenced in the DITA map.

In this case, the best fit was to add a new check to the QA plugin.

The check should:

• Identify duplicate @id values on specific elements
• Return only distinct values (i.e. if 123 appears several times, then return 123 only once)

After some forum research, my first thought was to use a key match.

<xsl:key name="duplicateIds" match="elementName" use="@id" />

And reference it with

key('duplicateIds',@id)[2] and count(.|key('duplicateIds', @id)[1]) = 1

So the if statement for the check looked like this:

<xsl:for-each select="//parentElement">
  <xsl:if test="key('duplicateIds',@id)[2] and count(.|key('duplicateIds', @id)[1]) = 1">
    Remove duplicate elementName @id=<xsl:value-of select="@id"/>
  </xsl:if>
</xsl:for-each>

Even though it met my requirements, it's an outdated approach, and it didn't leave me with a job-well-done sense of completion. I wanted a cleaner solution that fit somewhere closer to the XSLT 2.0 realm.

I decided to use grouping to identify each distinct @id value. Then I could wrap an if statement to test for any groups that contained a second (duplicate) item. The result:

<xsl:for-each-group select="//elementName" group-by="@id">
  <xsl:if test="current-group()[2]">
    Remove duplicate elementName @id=<xsl:value-of select="@id"/>
  </xsl:if>
</xsl:for-each-group>
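To fold this into the QA plugin's reporting, the grouping can emit the same kind of data element the other checks use. A minimal sketch (elementName remains a placeholder, and the outputclass value is illustrative):

<xsl:template name="duplicate-id-check">
  <!-- Report each duplicated @id once, using XSLT 2.0 grouping. -->
  <xsl:for-each-group select="//elementName" group-by="@id">
    <xsl:if test="current-group()[2]">
      <data type="msg" outputclass="structure ids" importance="high">
        Remove duplicate elementName @id=<xsl:value-of select="current-grouping-key()"/>
      </data>
    </xsl:if>
  </xsl:for-each-group>
</xsl:template>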

Have a better approach? Leave a comment.

Flattening HTML output

My DITA repository has a number of subdirectories to keep maps and topics organized. This strategy is convenient, but it can be a drawback when I need to further process HTML output, as I had to do for a recent publishing project. The HTML output type in the DITA Open Toolkit retains the organization of the source files, so every processing task turned into file-tree navigation with tools that aren't suited to it.

The DITA For Publishers HTML2 plugin provides a mechanism for flattening the output: the html2.file.organization.strategy Ant parameter. To make flattening a viable approach, there needs to be a provision for avoiding collisions. For example, say you have two directories, indir1 and indir2, each of which contains a topic file topic1.dita. The single-directory output, then, can't be outdir/topic1.html because there are two topic1 files.

The plugin deals with this requirement by appending a string, created by generate-id(), to the file name. So indir1/topic1.dita would become outdir/topic1_d97.html and indir2/topic1.dita would become outdir/topic1_d84.html. The exact expression (in the get-result-topic-base-name-single-dir template) is

concat(relpath:getNamePart($topicUri), '_', generate-id(.))

While that's a reasonable approach, it's not the one that I want to use, because the filenames ultimately get exposed to my customers. Since I don't know the algorithm for generating the unique ID, it's not deterministic enough, and a bookmarked link might break without my knowing. Instead, I'd like to prepend the parent directory name, so I modified the expression to this:

concat(relpath:getNamePart(relpath:getParent($topicUri)), '_', relpath:getNamePart($topicUri))

The result for the two example files then would be outdir/indir1_topic1.html and outdir/indir2_topic1.html. This approach has the added advantage that the output doesn’t lose information about its location in the source.

REST API documentation: From JSON to DITA (but skip the JS)

In our most recent release, a REST API was added to the product. Included in the API was some built-in documentation that is exposed in the user interface. Users can also send API calls and see the results. All this is done with Swagger and is a really nice way to get familiar with the API.

Then, of course, the request for a comprehensive API reference arrived. If you aren’t familiar with the API, it’s hard to get an overview and find what you’re looking for in the version presented in the UI. Or you may not have access to a system where you can poke around.

Since there is at least a representation of the API structure with definitions, there’s no way I was going to start from nothing or spend a lot of time copying and pasting. And since the built-in documentation was available as JSON I was sure that I’d be able to get from that to a document in a fairly straight line. Since all the other documentation is in DITA, using DITA as the target format that could then be processed in the normal manner seemed like the way to go.

Since JSON stands for "JavaScript Object Notation," I thought that using JavaScript would be the way to go for the initial pass. Boy, was I wrong. Something about not having a DOM context when embedding JavaScript in Ant. So I could read the JSON files but not output them as XML.

After wasting enough time with JavaScript I checked into using (as a frequent reader of this blog, should such a person exist, would guess) PowerShell. After some digging around I found this blog post and my problems were pretty much solved. I use the Convert-JsonToXml function exactly as given in that post. Calling it is another simple matter:

$strJson = Get-Content $inputFile

$xml = Convert-JsonToXml $strJson
Try {
    $xml.Get_OuterXml() | Out-File $outputFile
    Write-Host "$outputFile"
} Catch {
    Write-Host "Could not save $outputFile"
}
The $inputFile variable is the JSON file, and $outputFile is the same as $inputFile, just with an .xml extension rather than .json.

A record in the API JSON file looks like this:

      "path": "/alerts/hardware",
      "operations": [{
        "method": "GET",
        "summary": "Get the list of hardware Alerts.",
        "notes": "Get the list of hardware Alerts generated in the cluster.",
        "type": "void",
        "nickname": "getHardwareAlerts",
        "parameters": [{
          "name": null,
          "description": "Filter criteria",
          "required": false,
          "allowMultiple": false,
          "type": "AlertRequestDTO",
          "paramType": "query"
        "responseMessages": [{
          "code": 500,
          "message": "Any internal exception while performing this operation"

And the XML from PowerShell looks like this:

<item type="object">
  <path type="string">/alerts/hardware</path>
  <operations type="array">
    <item type="object">
      <method type="string">GET</method>
      <summary type="string">Get the list of hardware Alerts.</summary>
      <notes type="string">Get the list of hardware Alerts generated in the cluster.</notes>
      <type type="string">void</type>
      <nickname type="string">getHardwareAlerts</nickname>
      <parameters type="array">
        <item type="object">
          <name type="null" />
          <description type="string">Filter criteria</description>
          <required type="boolean">false</required>
          <allowMultiple type="boolean">false</allowMultiple>
          <type type="string">AlertRequestDTO</type>
          <paramType type="string">query</paramType>
      <responseMessages type="array">
        <item type="object">
          <code type="number">500</code>
          <message type="string">Any internal exception while performing this operation</message>

So then I just need some XSLT to convert that into DITA, which is straightforward enough. The overall publishing pipeline is JSON through PowerShell to well-formed but non-validating XML through Ant/XSLT to DITA through the Open Toolkit to PDF and HTML and whatever else I might use in the future.
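For a flavor of that step, here's a sketch (not the project's actual stylesheet; the topic structure and id scheme are illustrative) of a template that turns each record into a DITA reference topic:

<xsl:template match="item[path]">
  <!-- Illustrative mapping: one API path record becomes one reference topic. -->
  <reference id="{translate(path, '/', '_')}">
    <title><xsl:value-of select="path"/></title>
    <refbody>
      <section>
        <title><xsl:value-of select="operations/item/method"/></title>
        <p><xsl:value-of select="operations/item/notes"/></p>
      </section>
    </refbody>
  </reference>
</xsl:template>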

Here is the result. I was anxious that this was woefully inadequate API documentation, but after discussing it with other attendees at the TC Camp unconference this weekend, I realized it's not as deficient as I feared.