Word Conversion Tool

I initially drafted this post about a Word conversion tool two years ago, and for some forgotten reason, never published it. Eventually the project was scrapped, but we did use the tool for quite a while. I’ve gone back and made the article past-tense, and added some reflection. I think there’s still some good stuff in here…

Word. It ain’t going away anytime soon. As a DITA–and more generally, as a structured authoring–evangelist, I’ve long loathed the wild West approach to documentation Word engenders. Okay sure, it is easy to use. It doesn’t really require any training. I suppose, yes, one could argue that writing in Word lets authors focus on the content rather than on “tagging.” (I think that argument is faulty, as I’m sure you do, but we won’t go into that here.) The point is, if Word is going to be here awhile, what might we do about it?

In 2013, my organization at the time decided to embrace that reality, and we were fortunate enough to be able to invest in developing tools to support Word-based input while maintaining data in DITA format. At the time, we had 25 or so authors producing content in Word, not to mention all the SMEs, consultants, or other content contributors. It was kind of liberating, actually. When I talked to teams about content collaboration, I didn’t have to:

  1. Explain what DITA is.
  2. Explain why we use it.
  3. Convince them they have a content problem.
  4. Convince them DITA is the solution.

Instead, I could talk about this cool web app where you can author content, or import from Word, and in the background we can produce PDFs or eLearning or whatever you need. We provided production and publication work as a service.

What was it?

We built a web app we called CourseBuilder, because, after all, we used it to build courses. The app let users log in, set up a project, and then either import a Word doc or start a chapter of content from scratch. After importing, authors could edit the content, drag-and-drop to rearrange topic order, adjust nesting, specify a topic type, and create a few specific information types, like lists, notes, code blocks, and comments. It was all done in a very simple, Word-like interface. It was actually easier to use than the WordPress post editor I’m using now.

How did it work?

When a user uploaded a Word document, that file was sent to a server running OpenOffice. OpenOffice returned an HTML conversion of the file, which was loaded into a customized version of Redactor. Then a bunch of JavaScript magic happened that I don’t really understand. When authors edited, they were actually editing the HTML markup in a “tags-off” view. We applied classes to block elements, and those classes determined how each block would be transformed to DITA. So maybe we were cheating? We weren’t really converting Word to DITA, but instead converting Word to HTML to DITA.
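To make the class-driven idea concrete, here is a minimal Python sketch of an HTML-to-DITA step where the class on a block wins over its tag. It is not the CourseBuilder code: the class names, the two mapping tables, and the function names are all assumptions for illustration, and it expects well-formed XHTML as input.

```python
import xml.etree.ElementTree as ET

# Hypothetical class names; the real editor classes are not documented in this post.
CLASS_TO_DITA = {"note": "note", "codeblock": "codeblock"}
# Fallback mapping by HTML tag when a block carries no recognized class.
TAG_TO_DITA = {"p": "p", "ul": "ul", "ol": "ol", "li": "li", "pre": "codeblock"}


def block_to_dita(html_el):
    """Convert one HTML element to a DITA element; the class wins over the tag."""
    css_class = html_el.get("class", "")
    name = CLASS_TO_DITA.get(css_class) or TAG_TO_DITA.get(html_el.tag, html_el.tag)
    dita_el = ET.Element(name)
    dita_el.text = html_el.text
    for child in html_el:
        converted = block_to_dita(child)
        converted.tail = child.tail  # keep text that follows the child element
        dita_el.append(converted)
    return dita_el


def html_to_dita_body(xhtml):
    """Wrap the converted top-level blocks of an XHTML fragment in a DITA <body>."""
    root = ET.fromstring(xhtml)
    body = ET.Element("body")
    for block in root:
        body.append(block_to_dita(block))
    return ET.tostring(body, encoding="unicode")


print(html_to_dita_body('<div><p class="note">Save first.</p><p>Done.</p></div>'))
```

The point of the design is that the author never touches markup: the editor sets a class, and the conversion reads it.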

As authors adjusted heading levels in the editor, those changes were immediately reflected in the table of contents view. The TOC also served as a “jump to topic” navigation aid.
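A table of contents like this can be derived straight from the heading elements in the editor's HTML. The Python sketch below shows the idea under the assumption of well-formed XHTML; `build_toc` and `render_toc` are hypothetical names, and the real app did this in JavaScript in the browser.

```python
import xml.etree.ElementTree as ET


def build_toc(xhtml):
    """Return (level, title) pairs for each h1-h6 element, in document order."""
    root = ET.fromstring(xhtml)
    toc = []
    for el in root.iter():
        if len(el.tag) == 2 and el.tag[0] == "h" and el.tag[1] in "123456":
            title = "".join(el.itertext()).strip()
            toc.append((int(el.tag[1]), title))
    return toc


def render_toc(toc):
    """Indent each title by two spaces per heading level below the first."""
    return "\n".join("  " * (level - 1) + title for level, title in toc)


doc = "<div><h1>Course</h1><p>x</p><h2>Setup</h2><h2>Usage</h2></div>"
print(render_toc(build_toc(doc)))
```

Rebuilding the list on every edit is cheap, which is what makes the "immediately reflected" behavior practical.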

Challenges

From a presentation perspective, Word is a WYSIWYG editor. From a markup perspective, it assuredly is not. What you see on the screen, in many cases, has little to do with the markup behind the presentation. What appears to be a continuous bulleted list, for example, is often constructed of several separate lists in the Word markup. As a result, we often saw “broken” lists in the HTML rendition of the content, and those would persist in the DITA markup if they weren’t caught and fixed in Redactor.
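One way to repair broken lists automatically is to merge sibling lists of the same type when nothing sits between them. The Python sketch below illustrates that approach; it is an assumption about how such a cleanup could work, not what the tool did, and it would also merge two lists an author deliberately kept separate, so the result still needs a human check.

```python
import xml.etree.ElementTree as ET


def merge_adjacent_lists(parent):
    """Merge sibling <ul>/<ol> elements of the same type that directly follow each other."""
    previous = None
    for child in list(parent):  # iterate over a copy because we remove children
        same_kind = (
            previous is not None
            and child.tag in ("ul", "ol")
            and child.tag == previous.tag
            and not (previous.tail or "").strip()  # no text between the two lists
        )
        if same_kind:
            previous.extend(list(child))  # move the items into the first list
            previous.tail = child.tail
            parent.remove(child)
        else:
            previous = child
    return parent


root = ET.fromstring(
    "<div><ul><li>a</li></ul><ul><li>b</li></ul><p>x</p><ul><li>c</li></ul></div>"
)
print(ET.tostring(merge_adjacent_lists(root), encoding="unicode"))
```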

Then there is Redactor. It’s a pretty good HTML editor, but it wasn’t made with conversion in mind. Redactor likes to add elements, especially paragraphs, all over the place. For example, when you add a new list item, you end up with a <p> nested in an <li>. That’s annoying when you are just creating a bulleted list, but it breaks conversions when that is an ordered list that will become a set of steps in DITA.
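The stray paragraph problem lends itself to a small cleanup pass before conversion: if a list item contains nothing but a single <p>, lift the paragraph's content up into the item. This Python sketch shows one way to do that on well-formed XHTML; the function name is hypothetical and this is not the fix the project actually shipped.

```python
import xml.etree.ElementTree as ET


def unwrap_list_paragraphs(root):
    """Replace <li><p>text</p></li> with <li>text</li> when the <p> is the only content."""
    for li in list(root.iter("li")):  # snapshot first, since we edit the tree
        children = list(li)
        if len(children) != 1 or children[0].tag != "p":
            continue
        p = children[0]
        # Skip items that have their own text around the paragraph.
        if (li.text or "").strip() or (p.tail or "").strip():
            continue
        li.remove(p)
        li.text = p.text
        li.extend(list(p))  # keep inline children such as <b>
    return root


steps = ET.fromstring("<ol><li><p>Click <b>OK</b>.</p></li><li>Save.</li></ol>")
print(ET.tostring(unwrap_list_paragraphs(steps), encoding="unicode"))
```

Items with more than one paragraph are left alone on purpose, since flattening those would lose structure the author may have intended.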

The other major challenge we faced was image handling, which was really a consequence of the technology stack we chose and the available API bindings for OpenOffice. We used a PHP framework for CourseBuilder, but to get the image handling features we wanted, we had to switch over to the PyUNO bindings to talk to the conversion server. Not the end of the world, but it did introduce another language (Python) into the mix of technologies: HTML5/JavaScript, PHP, DITA, and Python, with a MongoDB database. But hey, at least it wasn’t Word! 😉

Assumptions

We made a couple of key assumptions that largely determined our approach. First, we knew we would never hold all authors to a single template, and even when we could convince folks to use one, we knew there was a good chance they’d stray from the guidelines. So we wanted to make the conversion as generalized as possible. To that end, we developed early against the worst use case: a .txt file. We assumed, then, a use case where all the styles in Word were thrown out. How would we convert the content?

Our second assumption was that we would never get to a place where we’d always guess the user’s intention correctly. We thought we could get close, but never 100%. Word is just too wonky, and besides, people change their minds.

These two assumptions led to our design philosophy: make really good guesses based on as many factors as possible, but give users a chance to override those decisions.
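Here is a small Python sketch of that "guess, then let the user override" philosophy, applied to the worst case of plain text with no styles. The three scoring rules and every name in it are assumptions made up for illustration; the real heuristics used more factors than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    text: str
    guess: str                      # "heading" or "paragraph", set by the heuristic
    override: Optional[str] = None  # set by the author in the editor

    @property
    def kind(self):
        """The author's choice always beats the guess."""
        return self.override or self.guess


def guess_kind(line):
    """Short, capitalized lines without sentence-ending punctuation look like headings."""
    score = 0
    if len(line.split()) <= 6:
        score += 1
    if not line.rstrip().endswith((".", "!", ":", ";", ",")):
        score += 1
    if line[:1].isupper():
        score += 1
    return "heading" if score == 3 else "paragraph"


def classify(text):
    """Turn plain text into guessed blocks, one per non-empty line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return [Block(ln, guess_kind(ln)) for ln in lines]


blocks = classify("Challenges\n\nWord is a WYSIWYG editor.\n\nHow did it work?")
blocks[0].override = "paragraph"  # the author disagrees with the guess
print([(b.text, b.kind) for b in blocks])
```

Keeping the guess and the override as separate fields means the heuristics can be improved later without clobbering decisions authors have already made.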

Conversion in stages

What we ended up doing was breaking the process down into the following stages, not all of which were implemented before the project was scrapped. After a document was uploaded, it went through:

  1. Pre-flight check [not implemented]: first, we would identify any major issues that would prevent a successful outcome, for example a document without any heading styles. We’d flag those and give the user options for fixing them. Sometimes it is easier to make large edits in Word before importing.
  2. Editing: second, we enabled the author to make whatever changes they wanted to the document, while continuing to help them recognize and recover from errors.
  3. Automated QA [not implemented]: third, we would provide automated QA (using the QA Plugin) to identify terminology and standard language issues, highlighting errors in context (similar to a spell checker). For example, perhaps in step 2 they set a topic to a task, but the title did not start with an infinitive. We’d flag that for them. Users would have to correct all critical QA errors before we’d enable conversion to DITA.
  4. Conversion: finally, users could convert the content and download a .zip of DITA maps and topics to upload and process in our CMS. The plan was to replace this with a publication step, where users would never have to see the DITA markup. They’d simply request an output type, and the service would return the published output.
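Since the pre-flight and QA stages were never built, the following Python sketch is purely hypothetical. It shows what the two example checks from the list might look like: one flags a document with no headings, the other flags a task title that does not start with a verb. The issue codes and the tiny verb list are invented for illustration, and this does not use the QA Plugin.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Hypothetical and deliberately tiny; a real check would need a proper verb list.
TASK_VERBS = {"install", "configure", "create", "import", "convert"}


def preflight(xhtml):
    """Return blocking issues found in the converted XHTML before editing starts."""
    root = ET.fromstring(xhtml)
    issues = []
    if not any(el.tag in HEADING_TAGS for el in root.iter()):
        issues.append("no-headings")
    return issues


def qa_task_title(title):
    """Flag task titles whose first word is not a known verb in base form."""
    words = title.split()
    first = words[0].lower() if words else ""
    return [] if first in TASK_VERBS else ["task-title-not-verb-first"]


print(preflight("<div><p>Just text</p></div>"))
print(qa_task_title("Importing a Word document"))
```

Returning a list of issue codes, rather than raising on the first problem, fits the spell-checker model described above: show everything in context and let the author work through it.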

Results

For the 18 months or so that it was in use, authors seemed to really like it. Training was very short, and we were fairly quick to adapt the system as new needs and use cases arose. From a pure conversion perspective, we reduced the time it took to convert contributed content by more than 80%. (And I think we could have beaten that number if we’d improved the conversion routines.)

Why’d we scrap the project?

The team wanted to use Word to author content because they felt it would help them move faster than if they used DITA. Well, after 18 months back in Word, it was pretty clear that we were not moving any faster. So we went back to DITA, but with a drastically simplified content model. This, of course, had its own pros and cons, but it struck a good balance between ease of use and the benefits of structured markup.


One thought on “Word Conversion Tool”

  1. Interesting that your thought process started with .txt. That makes me wonder: what about using markdown as the intermediate format? pandoc can convert Word to markdown, and convert markdown to many different formats. http://pandoc.org/ I’ve used it a little bit and it seems pretty robust.
