QA Plugin XSLT: Locating Distinct Values for Duplicate IDs

As part of a new framework enhancement, I needed a method to ensure that certain DITA elements carried unique @id values. Our authoring tool does a good job identifying duplicate @id values within a topic, but does not indicate whether those values also exist in other topics referenced in the DITA map.

In this case, the best fit was to add a new check to the QA plugin.

The check should:

• Identify duplicate @id values on specific elements
• Return only distinct values (i.e. if 123 appears several times, then return 123 only once)

After some forum research, my first thought was to use a key match.

<xsl:key name="duplicateIds" match="elementName" use="@id" />

And reference it with

key('duplicateIds',@id)[2] and count(.|key('duplicateIds', @id)[1]) = 1

So the if statement for the check looked like

<xsl:for-each select="//elementName">
  <xsl:if test="key('duplicateIds',@id)[2] and count(.|key('duplicateIds', @id)[1]) = 1">
    Remove duplicate elementName @id=<xsl:value-of select="@id"/>
  </xsl:if>
</xsl:for-each>

Even though it met my requirements, it’s still an outdated approach that didn’t leave me with a job-well-done sense of completion. I wanted a cleaner solution that fit somewhere closer to the 2.0 realm.

I decided to use grouping to identify each distinct @id value. Then I could wrap an if statement to test for any groups that contained a second (duplicate) item. The result:

<xsl:for-each-group select="//elementName" group-by="@id">
  <xsl:if test="current-group()[2]">
    Remove duplicate elementName @id=<xsl:value-of select="@id"/>
  </xsl:if>
</xsl:for-each-group>
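
For anyone who wants to sanity-check the grouping logic outside the QA plugin, here is a minimal Python sketch of the same idea: count every @id on the element of interest across several topics, then report each value that occurs more than once exactly one time. The function name and the two sample topics are made up for illustration; this is not part of the plugin.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def duplicate_ids(topics, element_name):
    """Return each @id that occurs more than once on element_name, listed once."""
    counts = Counter()
    for xml_text in topics:
        root = ET.fromstring(xml_text)
        # iter() walks the root and all of its descendants in document order.
        for elem in root.iter(element_name):
            elem_id = elem.get("id")
            if elem_id is not None:
                counts[elem_id] += 1
    # Keep only ids seen at least twice; first-seen order is preserved.
    return [elem_id for elem_id, n in counts.items() if n > 1]

# Hypothetical sample topics: id 123 appears three times across two files.
topic_a = '<topic id="t1"><elementName id="123"/><elementName id="456"/></topic>'
topic_b = ('<topic id="t2"><elementName id="123"/><elementName id="123"/>'
           '<elementName id="789"/></topic>')

print(duplicate_ids([topic_a, topic_b], "elementName"))
```

Because the counts are collected across all topics before filtering, a value duplicated in different files is caught just like one duplicated within a single file.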

Have a better approach? Leave a comment.

Flattening HTML output

My DITA repository has a number of subdirectories to keep maps and topics organized. This strategy is convenient, but it can be a drawback when I need to further process HTML output, as I had to do for a recent publishing project. The HTML output type in the DITA Open Toolkit retains the organization of the source files, so that every processing task turned into file tree navigation with tools that aren’t suited for it.

The DITA For Publishers HTML2 plugin provides a mechanism for flattening the output: the html2.file.organization.strategy Ant parameter. To make flattening a viable approach, there needs to be a provision for avoiding collisions. For example, say you have two directories, indir1 and indir2, each of which contains a topic file topic1.dita. The single-directory output, then, can’t be outdir/topic1.html because there are two topic1 files.

The plugin deals with this requirement by appending a string, created by generate-id(), to the file name. So indir1/topic1.dita would become outdir/topic1_d97.html and indir2/topic1.dita would become outdir/topic1_d84.html. The exact expression (in the get-result-topic-base-name-single-dir template) is

concat(relpath:getNamePart($topicUri), '_', generate-id(.))

While that’s a reasonable approach, it’s not the one that I want to use because the filenames ultimately get exposed to my customers. Since I don’t know the algorithm for generating the unique ID, it’s not deterministic enough, and a bookmarked link might become invalidated without my knowing. Instead, I’d like to prepend the parent directory name, so I modified the expression to this:

concat(relpath:getNamePart(relpath:getParent($topicUri)), '_', relpath:getNamePart($topicUri))

The result for the two example files then would be outdir/indir1_topic1.html and outdir/indir2_topic1.html. This approach has the added advantage that the output doesn’t lose information about its location in the source.
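
The naming rule can be sketched in a few lines of Python. This is only an illustration of the parent-directory prefix, not the plugin's relpath functions; flat_name and flatten are hypothetical helpers, and the collision check is an extra that the modified expression does not perform.

```python
import posixpath

def flat_name(topic_uri):
    """Build '<parent dir>_<base name>' from a source topic path."""
    parent = posixpath.basename(posixpath.dirname(topic_uri))
    base, _ext = posixpath.splitext(posixpath.basename(topic_uri))
    return f"{parent}_{base}" if parent else base

def flatten(topic_uris, out_dir="outdir"):
    """Map each source path to its flattened output path, failing on collisions."""
    result = {}
    seen = {}
    for uri in topic_uris:
        target = posixpath.join(out_dir, flat_name(uri) + ".html")
        if target in seen:
            raise ValueError(f"{uri} and {seen[target]} both map to {target}")
        seen[target] = uri
        result[uri] = target
    return result

print(flatten(["indir1/topic1.dita", "indir2/topic1.dita"]))
```

One caveat the sketch makes visible: only the immediate parent directory is used, so two topics with the same name in identically named directories deeper in the tree (say a/indir1/topic1.dita and b/indir1/topic1.dita) would still collide.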

REST API documentation: From JSON to DITA (but skip the JS)

In our most recent release, a REST API was added to the product. Included in the API was some built-in documentation that is exposed in the user interface. Users can also send API calls and see the results. All this is done with Swagger and is a really nice way to get familiar with the API.

Then, of course, the request for a comprehensive API reference arrived. If you aren’t familiar with the API, it’s hard to get an overview and find what you’re looking for in the version presented in the UI. Or you may not have access to a system where you can poke around.

Since there is at least a representation of the API structure with definitions, there’s no way I was going to start from nothing or spend a lot of time copying and pasting. And since the built-in documentation was available as JSON I was sure that I’d be able to get from that to a document in a fairly straight line. Since all the other documentation is in DITA, using DITA as the target format that could then be processed in the normal manner seemed like the way to go.

Since JSON stands for “JavaScript Object Notation” I thought that using JavaScript would be the way to go for the initial pass. Boy was I wrong. Something about not having a DOM context when embedding JavaScript in Ant. So I could read the JSON files but not output them in XML.

After wasting enough time with JavaScript I checked into using (as a frequent reader of this blog, should such a person exist, would guess) PowerShell. After some digging around I found this blog post and my problems were pretty much solved. I use the Convert-JsonToXml function exactly as given in that post. Calling it is another simple matter:

$strJson =  Get-Content $inputFile

$xml = Convert-JsonToXml $strJson
Try {
    $xml.Get_OuterXml() | out-file $outputFile
    write-host "$outputFile"
} Catch {
    write-host "Could not save $outputFile"
}

The $inputFile variable is the JSON file and $outputFile is the same as $inputFile, just with .xml extension rather than .json.

A record in the API JSON file looks like this:

    {
      "path": "/alerts/hardware",
      "operations": [{
        "method": "GET",
        "summary": "Get the list of hardware Alerts.",
        "notes": "Get the list of hardware Alerts generated in the cluster.",
        "type": "void",
        "nickname": "getHardwareAlerts",
        "parameters": [{
          "name": null,
          "description": "Filter criteria",
          "required": false,
          "allowMultiple": false,
          "type": "AlertRequestDTO",
          "paramType": "query"
        }],
        "responseMessages": [{
          "code": 500,
          "message": "Any internal exception while performing this operation"
        }]
      }]
    }

And the XML from PowerShell looks like this:

<item type="object">
  <path type="string">/alerts/hardware</path>
  <operations type="array">
    <item type="object">
      <method type="string">GET</method>
      <summary type="string">Get the list of hardware Alerts.</summary>
      <notes type="string">Get the list of hardware Alerts generated in the cluster.</notes>
      <type type="string">void</type>
      <nickname type="string">getHardwareAlerts</nickname>
      <parameters type="array">
        <item type="object">
          <name type="null" />
          <description type="string">Filter criteria</description>
          <required type="boolean">false</required>
          <allowMultiple type="boolean">false</allowMultiple>
          <type type="string">AlertRequestDTO</type>
          <paramType type="string">query</paramType>
        </item>
      </parameters>
      <responseMessages type="array">
        <item type="object">
          <code type="number">500</code>
          <message type="string">Any internal exception while performing this operation</message>
        </item>
      </responseMessages>
    </item>
  </operations>
</item>

So then I just need some XSLT to convert that into DITA, which is straightforward enough. The overall publishing pipeline is JSON through PowerShell to well-formed but non-validating XML through Ant/XSLT to DITA through the Open Toolkit to PDF and HTML and whatever else I might use in the future.
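
To make the first stage of that pipeline concrete, here is a rough Python sketch of a JSON-to-XML mapping that yields the same shape as the XML shown above: one element per key, a type attribute on every element, and item elements for array members. It is not the Convert-JsonToXml function from the post I used; json_type and to_element are names invented for this illustration, and the sketch assumes every JSON key is also a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET

def json_type(value):
    """Name the JSON type of a decoded Python value."""
    if value is None:
        return "null"
    # bool must be tested before numbers because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def to_element(name, value):
    """Recursively convert a decoded JSON value into a typed XML element."""
    elem = ET.Element(name, {"type": json_type(value)})
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_element(key, child))
    elif isinstance(value, list):
        for child in value:
            elem.append(to_element("item", child))
    elif isinstance(value, bool):
        elem.text = "true" if value else "false"
    elif value is not None:
        elem.text = str(value)
    return elem

# Abbreviated version of the record shown earlier.
record = json.loads(
    '{"path": "/alerts/hardware", "operations": [{"method": "GET",'
    ' "parameters": [{"name": null, "required": false}],'
    ' "responseMessages": [{"code": 500}]}]}'
)
xml_text = ET.tostring(to_element("item", record), encoding="unicode")
print(xml_text)
```

The type attribute is what makes the later XSLT step easy: templates can match on type="array" or type="null" without guessing from the content.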

Here is the result. I was anxious that this was woefully inadequate API documentation, but after discussing with other attendees at the TC Camp unconference this weekend, I realized it’s not as deficient as I feared.

How to build a lot of maps a lot of times with consistent filtering

…and not go crazy.

The mechanism for applying conditions while building with the DITA Open Toolkit has vexed me for a while. You have to specify the ditaval file at the time you initiate the build. That’s probably ok if you are working on 1 or 2 maps with 1 or 2 ditavals. I’m faced with over 50 maps that heavily reuse topics and over 10 ditaval files. Here’s the vexing part: every map is built with the same ditaval every time. I don’t want to have to manually specify the ditaval every time I build since it never changes (per map). And my target outputs are at least two. That makes the problem twice as bad.

Over 50 maps going to two outputs is, if my math is correct, something over 100 build events if I want to generate all my docs. There’s no way I’m going to do that manually. It’s virtually guaranteed that I’ll overlook some map or apply the wrong ditaval and that I’ll be driven to the brink of madness by the time the whole thing is done.

My feeling is that the ditaval is really a property of the map. So what I’d really like to do is specify the ditaval in the map and have a build routine that passes the info. This way I only have to specify the filtering set once, when the map is created, and not every infernal time I want output. Also I want to get out of the business of production some day, and there needs to be a repeatable process for doing all this.

Then it would be nice to be able to specify sets of maps to build based on wildcards. For a while I was trying to maintain lists of files as build targets inside Ant build files. But I already have lists: it’s files on a filesystem. That’s duplicated content. I hate duplicated content; it means mismatches. And anyway what I want to build is different from day to day. Maybe I need all admin docs, or all docs for a particular version, or hardware replacement docs for one platform, or hardware replacement docs for one component on all platforms, or maybe just one doc all by itself.

So those are the requirements: specify the ditaval in each map, and specify the maps based on wildcards in the filenames.

How to do this? For the first requirement, insert some type of metadata in the map. For the second, write a script that takes a set of files as an argument, reads the metadata from each map, then calls Ant. Given my past positive experience with PowerShell that’s what I’ll use.

The metadata part is easy enough. And while I’m at it I’ll specify the build targets as well.

<othermeta name="filter" content="build/external.ditaval" />
<othermeta name="targets" content="pdf epub" />

Now I need a script to make use of those elements.

The core is a loop over the items in the build set, which is specified as a command-line argument.

$buildSet = Get-ChildItem -recurse $buildSet
ForEach ($input in $buildSet) {

Then read in the metadata.

Try {
  $targets = [string]$fileContent.SelectSingleNode('/bookmap/bookmeta/othermeta[@name="targets"]/@content').get_InnerText()
} Catch [system.exception] {
  $targets = $defaultTargets
  write-host "No build targets specified. Using default `"$targets`"."
}

$targets = $targets -split " " | ForEach-Object {$_ = "`"$_`""; $_}

Try {
  $filter = [string]$fileContent.SelectSingleNode('/bookmap/bookmeta/othermeta[@name="filter"]/@content').get_InnerText()
  $filter = Join-Path -path $filePath -childpath $filter
} Catch [system.exception] {
  $filter = $defaultFilter
  write-host "No filter specified. Using default `"$filter`"."
}

That middle ugly line is because the targets passed to Ant have to be in double quotation marks. This punctuation doesn’t seem to be necessary in the legacy Windows shell but is in PowerShell (figuring this out cost me a distressing amount of time).
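
The read-with-fallback logic is independent of the scripting language. A minimal Python sketch of the same lookup follows; the default values and the read_build_settings name are hypothetical, and unlike the PowerShell above the sketch does not join the filter path to the map's directory.

```python
import xml.etree.ElementTree as ET

DEFAULT_FILTER = "build/default.ditaval"  # hypothetical default
DEFAULT_TARGETS = ["pdf"]                 # hypothetical default

def read_build_settings(bookmap_xml):
    """Return (filter, targets) from bookmeta/othermeta, with fallbacks."""
    root = ET.fromstring(bookmap_xml)  # root is the bookmap element

    def othermeta(name):
        node = root.find(f"bookmeta/othermeta[@name='{name}']")
        return node.get("content") if node is not None else None

    ditaval = othermeta("filter") or DEFAULT_FILTER
    # The targets value is a space-separated list, as in the map metadata.
    targets = (othermeta("targets") or "").split() or DEFAULT_TARGETS
    return ditaval, targets

bookmap = """<bookmap>
  <bookmeta>
    <othermeta name="filter" content="build/external.ditaval"/>
    <othermeta name="targets" content="pdf epub"/>
  </bookmeta>
</bookmap>"""

print(read_build_settings(bookmap))
print(read_build_settings("<bookmap/>"))  # no metadata: falls back to defaults
```

Returning the targets as a list rather than a single string sidesteps the quoting problem: each target stays a separate argument when it is eventually handed to Ant.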

And finally call Ant.

Try {
  ant -f mybuild.xml "-Dargs.input=$input" "-Dargs.filter=$filter" "-Dargs.xhtml.toc=$fileName" $targets
} Catch [Exception] {
  write-host "Build failed."
}

The complete script is a bit longer of course because of the need to set up some defaults and manipulate paths but is still only 74 lines.

There are a few things that I need outside the script: a generic build file and an OT start script that I keep at the top level of my repository along with the script itself. These are all standard Open Toolkit requirements.

Some examples.

    • All admin docs for version 3.5:
      > ./build.ps1 -b maps_administration/*v3_5*.ditamap
    • Just the setup guide:
      > ./build.ps1 -b maps_administration/Setup_Guide-v3_5.ditamap
    • All hardware replacement docs for the 3000 product:
      > ./build.ps1 -b maps_hardware_replacement/*3000*.ditamap
    • Power supply replacement for all products:
      > ./build.ps1 -b maps_hardware_replacement/Power_Supply*.ditamap
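
The wildcard selection and the Ant call can be sketched together in Python. This is an illustration, not the build script: select_maps and ant_command are invented names, two of the sample map paths are made up, and the sketch assumes that the TOC name is the map's file name without its extension. Nothing is executed; the sketch only assembles the argument list.

```python
from fnmatch import fnmatchcase

def select_maps(all_maps, pattern):
    """Return the map paths that match a shell-style wildcard pattern."""
    return sorted(m for m in all_maps if fnmatchcase(m, pattern))

def ant_command(map_path, ditaval, targets, build_file="mybuild.xml"):
    """Build the argument list for one Ant run; nothing is executed here."""
    # Assumption: the TOC name is the map file name without its extension.
    toc_name = map_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return [
        "ant", "-f", build_file,
        f"-Dargs.input={map_path}",
        f"-Dargs.filter={ditaval}",
        f"-Dargs.xhtml.toc={toc_name}",
        *targets,
    ]

maps = [
    "maps_administration/Setup_Guide-v3_5.ditamap",
    "maps_administration/Admin_Guide-v3_0.ditamap",         # hypothetical
    "maps_hardware_replacement/Power_Supply-3000.ditamap",  # hypothetical
]

selected = select_maps(maps, "maps_administration/*v3_5*.ditamap")
for map_path in selected:
    print(ant_command(map_path, "build/external.ditaval", ["pdf", "epub"]))
```

Passing such a list to subprocess.run(cmd, check=True) would start the build, with each target and property delivered as its own argument.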

Now I have the flexibility and repeatability I want.

Fixing part/chapter numbering in PDF

Apparently part and chapter numbering for bookmaps has been broken in the Open Toolkit PDF output since the beginning. Instead of numbering the chapters in a series from beginning to end, chapter numbering resets to 1 for every part so the TOC looks like this:

  • Part I
    • Chapter 1
    • Chapter 2
    • Chapter 3
  • Part II
    • Chapter 1
    • Chapter 2
    • Chapter 3
  • Part III
    • Chapter 1
    • Chapter 2
    • Chapter 3

The correct behavior would be a single chapter series that goes from 1 to 9. OT issue 1418 indicates that this was fixed in OT 1.7 but I didn’t see any change when I tried it out. Instead I changed my PDF plugin and thought I’d document the change here for anyone else who might need to do it in the future since I wasn’t able to find the answer anywhere.

There are two templates that need to be updated: one for the TOC and one for the chapter first page.

The template for the TOC is below.

<xsl:template match="*[contains(@class, ' bookmap/chapter ')] |
  *[contains(@class, ' bookmap/bookmap ')]/opentopic:map/*[contains(@class, ' map/topicref ')]" mode="tocPrefix" priority="-1">
  <xsl:call-template name="insertVariable">
    <xsl:with-param name="theVariableID" select="'Table of Contents Chapter'"/>
    <xsl:with-param name="theParameters">
      <number>
        <!-- Flat copy of every chapter in the map, regardless of part -->
        <xsl:variable name="id" select="@id" />
        <xsl:variable name="topicChapters">
          <xsl:copy-of select="$map//*[contains(@class, ' bookmap/chapter ')]" />
        </xsl:variable>
        <!-- Position of this chapter in the flat list -->
        <xsl:variable name="chapterNumber">
          <xsl:number format="1"
            value="count($topicChapters/*[@id = $id]/preceding-sibling::*) + 1" />
        </xsl:variable>
        <xsl:value-of select="$chapterNumber" />
      </number>
    </xsl:with-param>
  </xsl:call-template>
</xsl:template>

The significant parts here are the $topicChapters and $chapterNumber variables.

The template for the chapter first page is insertChapterFirstpageStaticContent. It’s too long to reproduce in its entirety here, but the code is the same as what’s inside the number element in the TOC template. The number element that contains the $topicChapters and $chapterNumber variables needs to replace the one inside the xsl:when test="$type = 'chapter'" block.
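
The numbering idea is easier to see stripped of the stylesheet machinery: collect every chapter into one flat list in document order, then number each chapter by its position in that list. A small Python sketch of that idea follows. The element names are simplified stand-ins for the class-based matches, and the ids are made up.

```python
import xml.etree.ElementTree as ET

def chapter_numbers(bookmap_xml):
    """Number chapters in one series across all parts, keyed by chapter id."""
    root = ET.fromstring(bookmap_xml)
    # iter() visits chapters in document order, whichever part holds them,
    # which plays the role of the flat $topicChapters copy.
    return {
        chapter.get("id"): number
        for number, chapter in enumerate(root.iter("chapter"), start=1)
    }

bookmap = """<bookmap>
  <part id="p1"><chapter id="c1"/><chapter id="c2"/></part>
  <part id="p2"><chapter id="c3"/><chapter id="c4"/></part>
</bookmap>"""

numbers = chapter_numbers(bookmap)
print(numbers)
```

The first chapter of the second part comes out as 3 rather than resetting to 1, which is the behavior the two modified templates are after.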

Chunking vs. nesting

At the blog I’d Rather Be Writing, there is some concern about navigability when the content set has too many small files. The solution discussed is to use @chunk="to-content", but that’s not the only option.

Although it’s common practice to have one topic per file, it’s not required. If you are not going to be reusing topics in multiple contexts you might prefer to have multiple topics in the same file.

For example:

<!DOCTYPE dita PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<dita>
  <concept id="concept_bm1_q4m_yj">
    <title>About widgets</title>
    <conbody><p>Widgets are very useful.</p></conbody>
    <task id="task_bm1_q4m_yj">
      <title>Create a widget</title>
      <taskbody><steps> ... </steps></taskbody>
    </task>
    <task id="task_bm1_q4m_yk">
      <title>Drag a widget in place</title>
      <taskbody><steps> ... </steps></taskbody>
    </task>
  </concept>
</dita>

The question about relying primarily on TOC is an interesting one discussed by Jonatan Lundin here and Mark Baker here.

DITA to PPT mapping

In a discussion about my previous post on the LinkedIn DITA Awareness group, the question of mapping DITA elements to PPT was raised. Because the answer is too long to comfortably fit in the discussion thread I’ll answer it here. There are two aspects: conditions and element selection/mapping.

For the conditions, the normal Open Toolkit filtering with ditaval is used.

  • @otherprops="slide" is explicitly included in the PPT and implicitly included in PDF.
  • @audience="instructor" is explicitly included in the PPT and explicitly excluded from PDF.
  • @otherprops="noslide" is explicitly excluded from PPT and implicitly included in PDF.

The question of element selection and mapping is a bit more complicated. First, the PPT template has three bullet levels defined. The first is meant for unbulleted items, for example the learning objectives lead-in. The second is meant for bulleted and numbered items, for example step and li elements. The third is meant for command and output references, for example codeblock and msgblock.

In the PPT utility, XPath expressions are mapped to one of these three paragraph styles or to an object.

XPath expression                               PPT mapping
$t//ph[$s]                                     P1 unbulleted
$t//p[$s]                                      P1 unbulleted
$t//*[$s]/li                                   P2 bulleted
$t//sl[$s]/sli                                 P2 unbulleted
$t//dl[$s]/dlentry/dt                          P2 bulleted
$t//codeblock[$s]                              P3 fixed-width
$t//msgblock[$s]                               P3 fixed-width
$t//steps[$s]/step/cmd                         P2 numbered
$t//steps[$s]/step/info/codeblock              P3 fixed-width
$t//fig/image/@href                            image object
$t//image/@href                                image object
$t//note[contains(@audience,'instructor')]     note object
$t//lcObjectivesStem                           P1 unbulleted
$t//lcObjectivesGroup//lcObjective             P2 bulleted
$t//table[$s]                                  table object
$t//simpletable[$s]                            table object

The variables $t and $s are also XPath expressions that I put into variables to keep the other expressions legible:

  • $t = "child::*[contains(@class, ' topic/body ')]"
  • $s = "contains(@otherprops, 'slide')"
Edit: correction on filtering logic.
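
To show how such a mapping might be applied, here is a minimal Python sketch covering only a few rows of the table. It is not the PPT utility: STYLE_BY_TAG and slide_paragraphs are invented for the illustration, the function takes the topic body element directly instead of evaluating $t, and a plain substring test stands in for $s.

```python
import xml.etree.ElementTree as ET

# Subset of the mapping table: element name -> paragraph style.
STYLE_BY_TAG = {
    "ph": "P1 unbulleted",
    "p": "P1 unbulleted",
    "codeblock": "P3 fixed-width",
    "msgblock": "P3 fixed-width",
}

def text_of(elem):
    """Concatenate all text inside an element, including nested inline markup."""
    return "".join(elem.itertext()).strip()

def slide_paragraphs(body):
    """Yield (style, text) pairs for content flagged with otherprops 'slide'."""
    for elem in body.iter():
        # Stand-in for $s: contains(@otherprops, 'slide')
        if "slide" not in elem.get("otherprops", ""):
            continue
        if elem.tag in STYLE_BY_TAG:
            yield STYLE_BY_TAG[elem.tag], text_of(elem)
        # Stand-in for $t//*[$s]/li: list items of a flagged parent.
        for li in elem.findall("li"):
            yield "P2 bulleted", text_of(li)

body = ET.fromstring("""<body>
  <p otherprops="slide">Widgets are useful.</p>
  <p>Not on a slide.</p>
  <ul otherprops="slide"><li>First</li><li>Second</li></ul>
  <codeblock otherprops="slide">ls -l</codeblock>
</body>""")

slides = list(slide_paragraphs(body))
print(slides)
```

Note that the substring test, like contains() in XPath, would also match otherprops="noslide", so the sketch assumes the ditaval filtering described above has already removed that content before the PPT pass.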

PowerPoint Output in an Ant Task

For several years now I have been working on utilities to convert DITA to PowerPoint, as I have discussed elsewhere. The original implementation was as a VBA macro stored inside a PPT template. I won’t go on at length about all the limitations that VBA has; suffice it to say that it’s verbose and not very robust. My colleagues in the training group requested some enhancements and I couldn’t bear the thought of spending much time in VBA.

So I decided to rewrite it in PowerShell. Other posts in Ditanauts have detailed working with XML files in PowerShell, which I have found to work pretty well. In addition, PowerShell has much better capabilities for exception handling, and it can manipulate PPT objects.

Configuring Fonts for the Open Toolkit with Apache FOP

While I can’t justify the cost of an expert stylesheet developer or a fancy PDF renderer, I don’t want my PDFs to look like garbage either. Jarno Elovirta’s PDF plugin generator is a great place to start for PDF customization, and Apache FOP (bundled with the Open Toolkit) is my only option for a PDF renderer. Once I had done a lot of the work in a PDF plugin, I got to the point where I wanted to change fonts. Despite its overuse, there’s nothing too offensive about Times New Roman + Arial + Courier, but that font set doesn’t conform to any branding guidelines that have been given any thought.

Should be pretty easy to do what I wanted, right?

Reporting on your repository with PowerShell, part 2

A couple months ago, some developers and support engineers were looking over some documentation and said to me, “These procedures are too complicated!” To which I said, “I know! I made them as simple as possible, but I can do only so much within the constraints of the interface.” Then the engineers asked me an astounding question: “Can you give us some complexity measure on each procedure so we know where to start making things simpler?” Because I knew how to use PowerShell to get information out of my set of DITA topics, I calmly said, “Let me look into it” while inside I was bursting with excitement.