Reporting on your DITA repository with PowerShell

As a long-time Unix bigot, it pains me to say this, but I have to be honest: I really like PowerShell, Microsoft’s new-ish scripting language. In fact, it has an advantage over Unix shell scripting languages in that it is fundamentally based on objects rather than on strings. Now what else do I use daily that is object-based? XML of the DITA kind, of course.

While there are a number of good articles on using PowerShell with XML, I thought I’d take you through a use case specifically with DITA.

I’m not going to cover a lot of preliminaries because those are already copiously documented elsewhere. I’ll just say that if you have execution problems, do a search for Set-ExecutionPolicy.

Since Patrick and Dusty have already superseded my PowerShell QA script with an OT plugin, I’ll talk about using PowerShell for auditing your topic set. Last week my colleague said to me, “Ben, I think the titles of the topics have sort of drifted away from their filenames. Is there any way we can compare them?” Because our content is stored in a version control repository, not a CCMS, naming files according to our defined scheme is an important aspect of content manageability. We don’t have a huge number of topics, but there are enough that I don’t want to go through them one by one. In other words, I need to do some automation.

I’ll start off with getting a list of the filenames. Since I only want to display the relative path and not the full path, I need to convert the path from an object to a string then mask out the root.

$root = 'C:\Users\ben\Documents\documentation\source\'
$topicTitles = @{}
Get-ChildItem $root -Recurse -Include *.dita,*.ditamap |
    ForEach-Object {
        $file = $_
        # Resolve-Path returns a PathInfo object, so cast it to a string
        $fileFull = Join-Path -Path $file.DirectoryName -ChildPath $file.Name | Resolve-Path
        $fileFull = [string]$fileFull
        # Mask out the root to leave the relative path
        $fileRel = $fileFull.Replace($root, '')
        Write-Host $fileRel
    }

Now I need to get the titles of the topics and maps. Here’s where I have to work around PowerShell a little bit. By default, the XML parser tries to fetch the DTD named in each file’s DOCTYPE declaration, but there is no simple way to tell it where the DTD is. So I’ll create a little function that opens the file and tells the parser to ignore the DTD.

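Function Get-XML ($filePath) {
    $fileContent = New-Object System.Xml.XmlDocument
    # A null resolver stops the parser from looking for the external DTD
    $fileContent.XmlResolver = $null
    Try {
        $fileContent.Load($filePath)
    }
    Catch [System.Exception] {
        Write-Host "Could not open file $filePath"
    }
    return $fileContent
}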
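
To see the resolver trick in isolation, here is a minimal sketch that parses a made-up concept topic from a string instead of a file. The sample content and the variable names are my own assumptions for illustration. The DOCTYPE points at a DTD that does not exist locally, which is exactly the situation the function has to survive. As far as I know, the same resolver setting applies when reading from a file with Load.

```powershell
# Hypothetical sample topic; concept.dtd is deliberately not available.
$sample = @'
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="c_sample">
  <title>Sample concept</title>
  <conbody><p>Body text.</p></conbody>
</concept>
'@

$doc = New-Object System.Xml.XmlDocument
$doc.XmlResolver = $null   # do not go looking for concept.dtd
$doc.LoadXml($sample)      # LoadXml parses a string rather than a file

$rootName = $doc.DocumentElement.Name
$topicId  = $doc.DocumentElement.GetAttribute('id')
Write-Output "Root element: $rootName, id: $topicId"
```

Because the DTD is never read, any attribute defaults it defines (such as @class) will not be present on the loaded elements, so select by element name rather than by class.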

Once the function is declared, I can open the DITA files. I like associative arrays (or hashes), which are name/value pairs. So inside the Get-Childitem loop I’ll open the file, get the title, and create a record consisting of the filename and title.

$fileContent = Get-XML $fileFull
$title = $fileContent.SelectSingleNode("/*/title | /bookmap/booktitle/mainbooktitle").InnerText
$topicTitles.Add($fileRel, $title)
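
One thing the snippet above does not handle is a file with no title at all, in which case SelectSingleNode returns nothing and reading the text fails. Here is a hedged sketch of the same XPath wrapped in a guard. The function name Get-DitaTitle, the '(no title)' placeholder, and the sample documents are all hypothetical, not part of my original script.

```powershell
# Hypothetical helper: returns the title text, or a placeholder if none exists.
function Get-DitaTitle ([xml]$doc) {
    $node = $doc.SelectSingleNode('/*/title | /bookmap/booktitle/mainbooktitle')
    if ($null -eq $node) { return '(no title)' }
    return $node.InnerText.Trim()
}

# Made-up sample documents standing in for files opened from the repository.
$topic = [xml]'<task id="t_install"><title>Install the <ph>widget</ph></title><taskbody/></task>'
$book  = [xml]'<bookmap><booktitle><mainbooktitle>Widget Guide</mainbooktitle></booktitle></bookmap>'
$bare  = [xml]'<map><topicref href="a.dita"/></map>'

$titles = @{}
$titles.Add('tasks/t_install.dita', (Get-DitaTitle $topic))
$titles.Add('maps/guide.ditamap',   (Get-DitaTitle $book))
$titles.Add('maps/bare.ditamap',    (Get-DitaTitle $bare))

$titles.GetEnumerator() | ForEach-Object { Write-Output "$($_.Key): $($_.Value)" }
```

InnerText flattens inline markup such as ph, so the report shows the title as a reader would see it.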

Finally, I want to output the contents of the hash and write it to a file.

$topicTitles.GetEnumerator() | Sort-Object Name |
Format-Table -Autosize | Tee-Object $root\audit.txt

When I run the script, a 2-column report shows the topic name and its title for us to compare. If you have especially long file paths or titles, you might want to use Format-List rather than Format-Table.
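
As a rough illustration of the Format-List variant, the sketch below uses a two-entry sample hash in place of the real $topicTitles. The File and Title column names are my own choice, not something the original script defines.

```powershell
# Sample data standing in for the $topicTitles hash built by the audit loop.
$topicTitles = @{
    'tasks/t_install.dita'     = 'Installing the widget'
    'concepts/c_overview.dita' = 'Overview'
}

# Turn each name/value pair into an object with explicit column names.
$report = @($topicTitles.GetEnumerator() | Sort-Object Key |
    ForEach-Object { [pscustomobject]@{ File = $_.Key; Title = $_.Value } })

# Format-List prints one property per line, so long values are not truncated.
$listing = $report | Format-List | Out-String
Write-Output $listing
```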

Once you are able to iterate over some files and select elements using XPath, there are a lot of possibilities. Here are some ideas:

  • Extend this script to select the shortdesc in addition to the title to get a summary of your documentation set.
  • Check the filenames with a regular expression to determine if they conform to your naming standards.
  • Extract the topic @id and make sure you don’t have any duplicates in your repository.
  • Rather than select files based on directory location, select the values of @href in a ditamap.
  • Count used elements in preparation for implementing constraints. If an element is used frequently, you need to consider keeping it; if it’s used rarely, you might be able to replace it with something more appropriate.
  • Write output as an HTML table or XML structure of your devising (such as I describe here) that can be nicely formatted with XSLT.
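
To give a flavor of two of these ideas, here is a small sketch that checks filenames against a pattern and looks for duplicate topic ids. Everything in it is an assumption for illustration: the three sample topics, and especially the naming scheme (a c_, t_, or r_ prefix followed by lowercase letters, digits, and underscores), which you would replace with your own.

```powershell
# Hypothetical sample set: relative path -> parsed topic. In a real audit these
# would come from the Get-ChildItem loop and the Get-XML function.
$topics = [ordered]@{
    'concepts/c_overview.dita' = [xml]'<concept id="overview"><title>Overview</title></concept>'
    'tasks/t_install.dita'     = [xml]'<task id="install"><title>Install</title></task>'
    'tasks/Install2.dita'      = [xml]'<task id="install"><title>Install again</title></task>'
}

# Assumed naming scheme; -cnotmatch makes the comparison case-sensitive.
$namePattern = '^[ctr]_[a-z0-9_]+\.dita$'
$badNames = @($topics.Keys |
    Where-Object { ($_ -split '/')[-1] -cnotmatch $namePattern })

# Group topics by the id on their root element and keep ids used more than once.
$duplicateIds = @($topics.GetEnumerator() |
    Group-Object { $_.Value.DocumentElement.GetAttribute('id') } |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Name })

Write-Output "Nonconforming names: $($badNames -join ', ')"
Write-Output "Duplicate ids: $($duplicateIds -join ', ')"
```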

A copy of the full script audit_ps1 with some additional comments is attached to this post. Change the name from audit_ps1.txt to audit.ps1.

Edit: added link to script.

