DITA > HTML > JSON

At Information Development World 2015 some attendees expressed interest in the JSON documentation format that feeds my documentation portal.

Starting from DITA source, there is a series of two transformations:

  1. HTML2 from DITA4Publishers, which flattens the directory structure.
  2. A custom XSLT that reads the resulting index and creates nested structures representing the document.

Each topic in the map becomes a “document” element in the JSON that is made up of the following pieces:

  Field        Source
  Title        Topic title
  ID           Topic filename
  Unique key   Top-level document filename + topic filename
  Ancestors    List of ancestor topics at all levels
  Summary*     Topic shortdesc
  Body         Topic body
  HREF         Topic path + topic filename
  Documents*   List of sub-documents
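
As a rough illustration of the nesting described above, here is a minimal Python sketch that builds one such entry and serializes it. The key names (`title`, `id`, `key`, and so on), the plain string concatenation for the unique key and HREF, and the sample values are all assumptions made for illustration; the real structure is whatever ditahtml2json.xsl emits, as seen in hierarchy.json.

```python
import json


def make_document(title, topic_file, top_level_file, topic_path,
                  ancestors, summary, body, documents=()):
    """Build one "document" entry. All key names here are hypothetical."""
    return {
        "title": title,                      # topic title
        "id": topic_file,                    # topic filename
        "key": top_level_file + topic_file,  # top-level document filename + topic filename
        "ancestors": list(ancestors),        # ancestor topics at all levels
        "summary": summary,                  # topic shortdesc
        "body": body,                        # topic body
        "href": topic_path + topic_file,     # topic path + topic filename
        "documents": list(documents),        # nested sub-documents
    }


# Hypothetical sample data: a top-level topic with one child topic.
child = make_document("Installing", "installing.html", "guide.html", "topics/",
                      ancestors=["guide.html"], summary="How to install.",
                      body="<p>Steps go here.</p>")
parent = make_document("Guide", "guide.html", "guide.html", "topics/",
                       ancestors=[], summary="Overview.",
                       body="<p>Welcome.</p>", documents=[child])

print(json.dumps(parent, indent=2))
```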

The JSON created in stage 2 is loaded into MongoDB for rendering on the documentation portal. As the loader and the rest of the portal infrastructure were developed by the support tools team, I can’t give any insight there except to say that cross-references and image links presented a bit of a challenge.

The XSLT (ditahtml2json.xsl) and a sample JSON (hierarchy.json) generated from the DITA-OT hierarchy.ditamap are available from GitHub. More background is available in the slides from the IDW presentation.

Word2DITA Plugin (DITA4Publishers)

One of the groups my team supports is solutions engineering. This group figures out the best way to run 3rd-party applications on our platform. Although they are not writers, one of their primary deliverables is documentation, and a lot of it. My team provides editorial services throughout the entire lifecycle.

Now that we’ve made great progress on content quality, which is the most important thing, here’s the problem: how to improve the user experience of the content. It will surely come as no surprise that the solutions documentation is authored in Word. You can get a minimally viable PDF from Word, but that’s about it.

I dream of the time when there is such a nice DITA authoring interface that they could create DITA topics, but that time isn’t now. These authors are more than casual authors, but writing and content management aren’t close to 100% of their time either. As such, authoring and reviewing in Word is a requirement.

At the same time, we already have a refined process to publish to our support portal in a searchable fashion. This process is based on DITA to HTML.

You can see where I’m going. How to get from Word to DITA so we can use our existing publishing pipeline?

The first solution that comes to mind is the DITA4Publishers Word2DITA plugin.

Installation

The instructions here seem to be out of date. The good news is that the reality is easier.

  1. Install the DITA4Publishers plugins.
  2. Copy the sample file from GitHub to the samples folder under DITA-OT.
  3. Run the transform.
    ant -f build.xml -Dtranstype=word2dita -Dargs.input=word2dita_single_doc_to_map_and_topics_01.docx

In the out folder you’ll see a map and topics. This is a sample document with the default style-to-tag mapping.

Images

Unfortunately, images are not extracted. The solution (found here) is to open the Word file in oXygen and extract the media folder to the topics folder created by word2dita. This was a bitter disappointment: I had envisioned a single build target that would convert the Word file to DITA and then to PDF, HTML, and EPUB. Instead, it looked as though there would have to be a manual step in there to extract the images.

But then I put 2 and 2 together: the DOCX file is a ZIP, and Ant has an unzip task. So I added these lines to the target:

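<!-- A DOCX file is a ZIP archive; unpack it into the temp directory. -->
<unzip src="${args.input}" dest="${temp}" />
<!-- Copy any embedded images next to the generated topics; failonerror="false" keeps the build going when there is no media folder. -->
<copy todir="${out}/topics/media" failonerror="false">
  <fileset dir="${temp}/word/media" />
</copy>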

Now the DITA output is complete.
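
For anyone running the conversion outside Ant, the same unzip-and-copy idea can be sketched in Python with the standard zipfile module. This is only an illustration, not part of the plugin: the function name `extract_docx_media` and the paths in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, media_dir):
    """Copy every file under word/media/ in the DOCX into media_dir.

    Returns the list of written paths. The list is empty when the document
    has no images, which mirrors failonerror="false" in the Ant target.
    """
    media_dir = Path(media_dir)
    written = []
    with zipfile.ZipFile(docx_path) as docx:
        for info in docx.infolist():
            # Skip directory entries and anything outside word/media/.
            if info.is_dir() or not info.filename.startswith("word/media/"):
                continue
            media_dir.mkdir(parents=True, exist_ok=True)
            target = media_dir / Path(info.filename).name
            target.write_bytes(docx.read(info))
            written.append(target)
    return written
```

A call such as `extract_docx_media("sample.docx", "out/topics/media")` would then play the role of the two Ant tasks above.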


QA Plugin Updated! (Finally, right?)

Hi folks, I’m happy to let you know that we have posted a major update to the QA Plugin. The ditanauts team owes a huge debt of gratitude to Don Day and Michael Boses for their work on this update. What’s new, you ask? Well…

  • Reports are prettier. The HTML report we generate uses Google Charts to render visual elements.
  • We create a data file (written in DITA) rather than generating the report HTML directly from the DITA input. With the data file, you can then render whatever you want using normal OT processing. The plugin creates an HTML report and a .csv file from the data file.
  • @chunk is set automatically on bookmaps. One of the really annoying things with the old version was that you had to set the @chunk attribute manually before a build. That is no longer the case when building from a bookmap!

I’ve updated the install and run sections of the how-to page; I will be updating the customization section soon.

Let us know what you think!

QA Plugin XSLT: Locating Distinct Values for Duplicate IDs

As part of a new framework enhancement, I needed a method to ensure that certain DITA elements carried unique @id values. Our authoring tool does a good job identifying duplicate @id values within a topic, but does not indicate whether those values also exist in other topics referenced in the DITA map.

In this case, the best fit was to add a new check to the QA plugin.

The check should:

• Identify duplicate @id values on specific elements
• Return only distinct values (i.e. if 123 appears several times, then return 123 only once)

After some forum research, my first thought was to use a key match.

<xsl:key name="duplicateIds" match="elementName" use="@id" />

And reference it with

key('duplicateIds',@id)[2] and count(.|key('duplicateIds', @id)[1]) = 1

So the if statement for the check looked like

<xsl:for-each select="//elementName">
<xsl:if test="key('duplicateIds',@id)[2] and count(.|key('duplicateIds', @id)[1]) = 1">
Remove duplicate elementName @id=<xsl:value-of select="@id"/>
</xsl:if>
</xsl:for-each>

Even though it met my requirements, it was still an outdated approach that didn’t leave me with a job-well-done sense of completion. I wanted a cleaner solution that fit somewhere closer to the XSLT 2.0 realm.

I decided to use grouping to identify each distinct @id value. Then I could wrap an if statement to test for any groups that contained a second (duplicate) item. The result:

<xsl:for-each-group select="//elementName" group-by="@id">
<xsl:if test="current-group()[2]">
Remove duplicate elementName @id=<xsl:value-of select="@id"/>
</xsl:if>
</xsl:for-each-group>
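
To sanity-check the grouping logic outside the toolkit, the same "group by @id, report groups with a second member" idea can be sketched in Python. This is an illustration under assumptions, not the plugin's code: the function name `duplicate_ids` and the sample markup are hypothetical, and it reads a single XML string rather than the merged map the QA plugin works on.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text, element_name):
    """Return each @id value used more than once on element_name, once each."""
    root = ET.fromstring(xml_text)
    # Group by @id: count how many matching elements carry each value.
    counts = Counter(
        el.get("id")
        for el in root.iter(element_name)
        if el.get("id") is not None
    )
    # Keep only groups with a second (duplicate) member, in first-seen order.
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample input; <p id="123"> is ignored because only fig is checked.
sample = (
    '<map><fig id="123"/><fig id="123"/><fig id="456"/>'
    '<p id="123"/><fig id="123"/><fig id="789"/><fig id="789"/></map>'
)
for value in duplicate_ids(sample, "fig"):
    print("Remove duplicate fig @id=" + value)
```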

Have a better approach? Leave a comment.

Flattening HTML output

My DITA repository has a number of subdirectories to keep maps and topics organized. This strategy is convenient, but it can be a drawback when I need to further process HTML output, as I had to do for a recent publishing project. The HTML output type in the DITA Open Toolkit retains the organization of the source files, so that every processing task turned into file tree navigation with tools that aren’t suited for it.

The DITA For Publishers HTML2 plugin provides a mechanism for flattening the output: the html2.file.organization.strategy Ant parameter. To make flattening a viable approach, there needs to be a provision for avoiding collisions. For example, say you have two directories, indir1 and indir2, each of which contains a topic file topic1.dita. The single-directory output, then, can’t be outdir/topic1.html because there are two topic1 files.

The plugin deals with this requirement by appending a string, created by generate-id(), to the file name. So indir1/topic1.dita would become outdir/topic1_d97.html and indir2/topic1.dita would become outdir/topic1_d84.html. The exact expression (in the get-result-topic-base-name-single-dir template) is

concat(relpath:getNamePart($topicUri), '_', generate-id(.))

While that’s a reasonable approach, it’s not the one that I want to use because the filenames ultimately get exposed to my customers. Since I don’t know the algorithm for generating the unique ID, it’s not deterministic enough, and a bookmarked link might become invalidated without my knowing. Instead, I’d like to prepend the parent directory name, so I modified the expression to this:

concat(relpath:getNamePart(relpath:getParent($topicUri)), '_', relpath:getNamePart($topicUri))

The result for the two example files then would be outdir/indir1_topic1.html and outdir/indir2_topic1.html. This approach has the added advantage that the output doesn’t lose information about its location in the source.
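
The naming rule is simple enough to check outside XSLT. Here is a small Python sketch of the same idea, for illustration only: the function name `flat_name` is hypothetical, and the fallback for a topic with no parent directory is my own assumption, since the relpath functions may behave differently in that case.

```python
from pathlib import PurePosixPath


def flat_name(topic_uri, extension=".html"):
    """Build a flat output filename as <parent directory>_<topic name>."""
    path = PurePosixPath(topic_uri)
    parent = path.parent.name
    if not parent:
        # Assumption: with no parent directory, fall back to the bare name.
        return path.stem + extension
    return parent + "_" + path.stem + extension


print(flat_name("indir1/topic1.dita"))
print(flat_name("indir2/topic1.dita"))
```

Because the result depends only on the source path, the same topic gets the same output name on every build, which is the property that matters for bookmarked links.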

Fixing part/chapter numbering in PDF

Apparently part and chapter numbering for bookmaps has been broken in the Open Toolkit PDF output since the beginning. Instead of numbering the chapters in a series from beginning to end, chapter numbering resets to 1 for every part so the TOC looks like this:

  • Part I
    • Chapter 1
    • Chapter 2
    • Chapter 3
  • Part II
    • Chapter 1
    • Chapter 2
    • Chapter 3
  • Part III
    • Chapter 1
    • Chapter 2
    • Chapter 3

The correct behavior would be a single chapter series that goes from 1 to 9. OT issue 1418 indicates that this was fixed in OT 1.7, but I didn’t see any change when I tried it out. Instead, I changed my PDF plugin, and I thought I’d document the change here for anyone else who might need to do it in the future, since I wasn’t able to find the answer anywhere.

There are two templates that need to be updated: one for the TOC and one for the chapter first page.

The template for the TOC is below.

<xsl:template match="*[contains(@class, ' bookmap/chapter ')] |
  *[contains(@class, ' bookmap/bookmap ')]/opentopic:map/*[contains(@class, ' map/topicref ')]" mode="tocPrefix" priority="-1">
  <xsl:call-template name="insertVariable">
    <xsl:with-param name="theVariableID" select="'Table of Contents Chapter'"/>
    <xsl:with-param name="theParameters">
        <number>
          <xsl:variable name="id" select="@id" />
          <xsl:variable name="topicChapters">
            <xsl:copy-of select="$map//*[contains(@class, ' bookmap/chapter ')]" />
          </xsl:variable>
          <xsl:variable name="chapterNumber">
            <xsl:number format="1"
              value="count($topicChapters/*[@id = $id]/preceding-sibling::*) + 1" />
          </xsl:variable>
          <xsl:value-of select="$chapterNumber" />
        </number>
    </xsl:with-param>
  </xsl:call-template>
</xsl:template>

The significant parts here are the $topicChapters and $chapterNumber variables.

The template for the chapter first page is insertChapterFirstpageStaticContent. It’s too long to reproduce in its entirety here, but the code is the same as what’s inside the number element in the TOC template. The number element that contains the $topicChapters and $chapterNumber variables needs to replace the one inside the xsl:when test="$type = 'chapter'" block.
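
The counting logic in those two variables can be sketched in Python to show why it yields one continuous series. This is only an illustration: the data layout (a list of parts, each holding a list of chapters with an id) and the function name `chapter_number` are hypothetical stand-ins for the merged map.

```python
def chapter_number(book, chapter_id):
    """Number a chapter by its position among all chapters in the book."""
    # Like $topicChapters: every chapter in document order, regardless of part.
    all_chapters = [ch for part in book for ch in part["chapters"]]
    # Like count(preceding-sibling::*) + 1 on the flattened list.
    ids = [ch["id"] for ch in all_chapters]
    return ids.index(chapter_id) + 1


# Hypothetical book: three parts with three chapters each.
book = [
    {"title": "Part I", "chapters": [{"id": "p1c1"}, {"id": "p1c2"}, {"id": "p1c3"}]},
    {"title": "Part II", "chapters": [{"id": "p2c1"}, {"id": "p2c2"}, {"id": "p2c3"}]},
    {"title": "Part III", "chapters": [{"id": "p3c1"}, {"id": "p3c2"}, {"id": "p3c3"}]},
]

for part in book:
    print(part["title"])
    for ch in part["chapters"]:
        print("  Chapter", chapter_number(book, ch["id"]))
```

Because the chapters are copied into one flat list before counting, the part boundaries no longer reset the number.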

Configuring Fonts for the Open Toolkit with Apache FOP

While I can’t justify the cost of an expert stylesheet developer or a fancy PDF renderer, I don’t want my PDFs to look like garbage either. Jarno Elovirta’s PDF plugin generator is a great place to start for PDF customization, and Apache FOP (bundled with the Open Toolkit) is my only option for a PDF renderer. Once I had done a lot of the work in a PDF plugin, I got to the point where I wanted to change fonts. Despite their overuse, there’s nothing too offensive about Times New Roman + Arial + Courier, but that font set doesn’t conform to any branding guidelines that have been given any thought.

Should be pretty easy to do what I wanted, right?


QR Codes in DITA Output

Inspired by a thread started by Sean Healy, and building on the instructions posted by Kevin Brown, I added the ability to generate and insert QR Codes into PDF output to the mypdf plugin.

I ignore QR Codes in marketing, but I think they could be a great way to link to resources, such as videos, from printed technical documents. Readers can simply zap the codes with their phones to pull up the content.


Example Constraints Plugin

One of the most important features of DITA 1.2, in my opinion, is the constraints mechanism. In short, constraints let you reduce the elements and attributes available to your authors. You can also specify when elements/attributes are required, and which tag structures are legal (and, therefore, which are illegal). Eliot Kimber wrote a great tutorial on how to set up constraints, but if you’d like an example plugin, you can download the one I’ve created from SourceForge.
