Recursive DITA parsing using Perl and XML Twig, Part I

If you’re doing any text parsing and manipulation, it’s hard to ignore the processing power of Perl, especially considering that it comes pre-installed on OS X. There are several XML parsing modules out there, and plenty of sites compare the features and usability of each one.

Personally, I like Michel Rodriguez’s (mirod) XML::Twig module. It’s available on CPAN.

Besides the standard DTD support, twig handlers, twig roots, and push-pull functionality, it’s designed for parsing large tree structures by moving the parsing to ‘twig’ sub-sections.  This is great for a couple of reasons:

1) You can process very large documents through a combined stream/tree mode.

2) You can process smaller documents straight from memory.
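To make the two modes concrete, here’s a minimal sketch. The sample document and variable names are made up for illustration; parse, purge, and descendants are regular XML::Twig methods. The first parser handles each p element as soon as it closes and then purges what has been parsed so far, while the second loads the whole document and queries the finished tree.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: a tiny DITA-like topic standing in for a large file.
my $xml = <<'XML';
<concept id="c1">
  <title>Sample</title>
  <conbody>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </conbody>
</concept>
XML

# Stream/tree mode: handle each <p> when its end tag is seen, then purge
# everything parsed so far to release memory.
my @seen;
my $stream = XML::Twig->new(
    twig_handlers => {
        p => sub {
            my ($twig, $elt) = @_;
            push @seen, $elt->text;
            $twig->purge;
        },
    },
);
$stream->parse($xml);

# In-memory mode: parse everything first, then walk the complete tree.
my $tree = XML::Twig->new;
$tree->parse($xml);
my @all = map { $_->text } $tree->root->descendants('p');

print "streamed:  @seen\n";
print "in memory: @all\n";
```

Both approaches collect the same paragraphs here; the difference only matters once a document is too large to hold comfortably in memory.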

More recently, mirod also developed an xml_spellcheck based on the twig module.

Oh, and it’s fast to boot.

The twig module works really well with both HTML and XML. It has safe-parsing methods (safe_parse and safe_parsefile) that return false instead of dying when a document isn’t well-formed, although if you’re working in a regulated DITA environment, this shouldn’t be an issue for you.

Building a recursive DITA parser isn’t all that complicated in Perl.
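As a quick illustration of the safe-parsing mode mentioned above, here’s a small sketch. The helper name parses_cleanly and both sample strings are hypothetical; safe_parse itself is an XML::Twig method that wraps the parse in an eval and leaves the error message in $@ on failure.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns 1 if the XML string parses, 0 otherwise.
sub parses_cleanly {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    # safe_parse returns a false value instead of dying on malformed input.
    return $twig->safe_parse($xml) ? 1 : 0;
}

my $good = '<concept id="c1"><title>Fine</title></concept>';
my $bad  = '<concept id="c1"><title>Oops</concept>';   # <title> is never closed

for my $doc ($good, $bad) {
    if (parses_cleanly($doc)) {
        print "parsed OK\n";
    }
    else {
        print "skipped malformed document: $@";
    }
}
```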

A sample selection of DITA tags would look something like this:

use XML::Twig;

# Route the text-bearing DITA elements to one shared handler.
my $twig = XML::Twig->new(
    TwigHandlers => {
        'p'          => \&processing_sub,
        'li'         => \&processing_sub,
        'glossterm'  => \&processing_sub,
        'stepresult' => \&processing_sub,
        'step'       => \&processing_sub,
        'userinput'  => \&processing_sub,
        'ph'         => \&processing_sub,
        'info'       => \&processing_sub,
        'note'       => \&processing_sub,
        'dd'         => \&processing_sub,
        'dt'         => \&processing_sub,
        'filepath'   => \&processing_sub,
        'varname'    => \&processing_sub,
        'codeblock'  => \&processing_sub,
        'strow'      => \&processing_sub,
        'stentry'    => \&processing_sub,
        # Drop short descriptions from the tree.
        'shortdesc'  => sub { $_->delete; },
    },
    # Only build twigs for concept and task topics.
    TwigRoots => { concept => 1, task => 2 },
);

Twig handlers work much like template matching in XSLT: the tag a handler names can match at any level within the tree, or the trigger can be written as an XPath-like expression (XML::Twig supports a subset of XPath) for better control. In this example, most handlers reference the same subroutine, but you could just as easily call a unique function from each handler. As you might imagine, the twig roots mark where tree building begins: only the listed elements and their descendants are built into twigs. So if you are working with very large trees, you can parse just a small sub-set of the entire structure.
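Here’s a small sketch of how handlers and twig roots interact. The sample document, the step/cmd trigger, and the variable names are my own choices for illustration and are not part of the configuration above. It assumes one handler keyed on a bare tag, one keyed on a partial path, and twig roots limited to concept and task, so the paragraph inside the reference topic should never reach a handler.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: three topics, only two of which are twig roots.
my $xml = <<'XML';
<dita>
  <concept id="c1">
    <title>Concept</title>
    <shortdesc>Dropped by the handler.</shortdesc>
    <conbody><p>Concept paragraph.</p></conbody>
  </concept>
  <reference id="r1">
    <refbody><section><p>Reference paragraph.</p></section></refbody>
  </reference>
  <task id="t1">
    <taskbody><steps><step>
      <cmd>Do it.</cmd>
      <info><p>Step paragraph.</p></info>
    </step></steps></taskbody>
  </task>
</dita>
XML

my (@any_p, @cmds);
my $dropped = 0;

my $twig = XML::Twig->new(
    # Only concept and task subtrees are built; reference is skipped.
    twig_roots    => { concept => 1, task => 1 },
    twig_handlers => {
        # Bare tag: matches <p> at any depth inside a twig root.
        'p'         => sub { push @any_p, $_->text },
        # Partial path: matches <cmd> only when its parent is <step>.
        'step/cmd'  => sub { push @cmds, $_->text },
        # Remove short descriptions as they are parsed.
        'shortdesc' => sub { $dropped++; $_->delete },
    },
);
$twig->parse($xml);

print "paragraphs: @any_p\n";
print "commands:   @cmds\n";
print "shortdescs dropped: $dropped\n";
```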

Once you’ve identified the necessary twig handlers, you need a way to select the sub-set of DITA files to parse. The find function (from the File::Find module, part of most basic Perl releases; I like Strawberry) recursively searches a directory and calls a subroutine of your choice for everything it finds, which is where pattern matching comes in. It doesn’t get much simpler than this:

find(\&process_wanted_files, $dir);

For each file or directory found, the find function calls the process_wanted_files subroutine with the current name in $_. You can filter inside that subroutine:

if (-f and /\.dita$/) { ... }    # your processing goes here

You can then directly set your twig handlers within the filtered results.  Of course, how you approach this would depend on the size of your DITA documents and what your content looks like.
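Putting the pieces together, here’s one way the whole thing could look. This is only a sketch: the temporary directory, the sample files, the write_file helper, and the %paragraphs hash are all invented so the example runs on its own, and building a new twig for every file is just one reasonable choice.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use XML::Twig;

# Hypothetical fixture: a throwaway directory tree with one DITA file
# and one file that should be ignored.
my $dir = tempdir(CLEANUP => 1);
mkdir "$dir/topics" or die "Cannot create $dir/topics: $!";
write_file("$dir/topics/intro.dita",
    '<concept id="intro"><title>Intro</title>'
  . '<conbody><p>Hello DITA.</p></conbody></concept>');
write_file("$dir/notes.txt", 'not XML');

my %paragraphs;    # full file path => list of paragraph texts

find(\&process_wanted_files, $dir);

for my $path (sort keys %paragraphs) {
    print "$path: @{ $paragraphs{$path} }\n";
}

# Called by find() for every file and directory; $_ holds the bare name.
sub process_wanted_files {
    return unless -f and /\.dita$/;
    my $file = $_;                   # name relative to the current directory
    my $path = $File::Find::name;    # full path, used as the hash key
    my $twig = XML::Twig->new(
        twig_handlers => {
            p => sub { push @{ $paragraphs{$path} }, $_->text },
        },
    );
    $twig->parsefile($file);
}

# Small helper to create the sample files.
sub write_file {
    my ($path, $content) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!";
}
```

A fresh twig per file keeps state from leaking between documents. For very large files, a purge call inside the handler would keep memory use down.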

This is by no means an exhaustive look at the real capabilities of XML Twig. Check the XML::Twig documentation for more information about internal and external DTD handling, extended functionality, and a slew of useful code snippets.

In part II, I’ll walk through how to develop a DITA copy editing tool with Perl and XML Twig.

7 thoughts on “Recursive DITA parsing using Perl and XML Twig, Part I”

  1. Perl has been my glue language for publishing for years — mainly on post-processed HTML that needed a little tweaking before delivery. We’re still using a mix of non-structured/structured FrameMaker, but when we finally get to DITA-based FM (next year in Jerusalem, lol), I plan on doing some of the heavy lifting in Perl during my pre-publication routine — checking status of short descriptions, etc.

    • We do a lot of that kind of thing through XSLT and a DITA OT QA plugin, but I think Perl is a great fit for it.

      What do you use as a parser?

