LaTeX to Word: the basic issue

It seems to me that there are two possible approaches to converting LaTeX source to Microsoft Word, a challenge humanists have to take up whenever they collaborate with non-TeXnical co-authors and editors. One route is to parse the source and output Word or, more likely, a less closed intermediate markup format. Humanists are actually better off here than scientists, since they don’t use the major feature that TeX handles very differently from word processors (and far, far better): mathematical equation typesetting. And assuming that Word is not going to be the format for final publication, there’s no need to cry (much) over the loss of typographical quality that comes from subjecting your paragraphs to Word’s justification and pagination “algorithms” rather than TeX’s. The goal is most likely just to share and edit content. For these purposes (from shared authorship to substantial editing to copy-editing), LaTeX markup for humanists can be pretty simple; indeed, practically equivalent to the text-markup subset of HTML: paragraph, header, section (division), span. For this very minimal version of LaTeX, which is entirely plausible for humanists, conversion to Word is a matter of getting a parser for the minimal LaTeX and outputting its equivalent in XML.
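To make the "parser for minimal LaTeX" idea concrete, here is a rough sketch in Python. It assumes a deliberately tiny subset (\section, \subsection, \emph, \textbf, and blank-line-separated paragraphs), and the output element names are made up for illustration; they are not ODT or any real schema. It is a toy, not a substitute for a real converter.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from a tiny LaTeX subset to HTML-like element names.
INLINE = {"emph": "em", "textbf": "strong"}
HEADINGS = {"section": "h1", "subsection": "h2"}

# An inline command whose argument contains no further braces.
INLINE_CALL = re.compile(r"\\(emph|textbf)\{([^{}]*)\}")
# A block consisting of nothing but a heading command.
HEADING_CALL = re.compile(r"\\(section|subsection)\{(.*)\}")


def convert_inline(text):
    # Escape &, < and > first so that only the tags added below are markup.
    text = escape(text)

    def to_span(match):
        tag = INLINE[match.group(1)]
        return "<{0}>{1}</{0}>".format(tag, match.group(2))

    # Innermost commands match first; repeat until nothing is left to replace.
    while True:
        text, count = INLINE_CALL.subn(to_span, text)
        if count == 0:
            return text


def latex_to_xml(source):
    elements = []
    # Blank lines separate blocks, as in LaTeX itself.
    for block in re.split(r"\n\s*\n", source.strip()):
        block = " ".join(block.split())  # collapse line breaks and extra spaces
        heading = HEADING_CALL.fullmatch(block)
        if heading:
            tag = HEADINGS[heading.group(1)]
            body = convert_inline(heading.group(2))
            elements.append("<{0}>{1}</{0}>".format(tag, body))
        else:
            elements.append("<p>{}</p>".format(convert_inline(block)))
    return "<doc>\n" + "\n".join(elements) + "\n</doc>"
```

Calling latex_to_xml on a string of minimal LaTeX returns one element per block. Anything outside the subset (a macro definition, a citation command) passes through untouched, which is exactly where this route stops working.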

Numerous converters do this. The one I have nearest to hand is the astonishing pandoc, an all-purpose converter. Pandoc’s “native” format is a plain-text format called markdown, which pretty much corresponds to the minimal text markup I mentioned above. And pandoc writes lots of markup formats, including HTML and OpenDocument. I’m writing this post in markdown and using pandoc to output the HTML. If every construct in your LaTeX has a markdown equivalent, pandoc can very robustly produce an ODT. Then NeoOffice (etc.) can convert it to .doc format.
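For the record, the pandoc step can be scripted. This is a small sketch in Python, assuming pandoc is installed and on your PATH; the file names are placeholders. The -f, -t and -o options are pandoc's reader, writer and output-file options.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    # -f selects the input format, -t the output format, -o the output file.
    return ["pandoc", "-f", "latex", "-t", "odt", "-o", target, source]


def convert(source, target):
    # Fail early with a readable message if pandoc is not installed.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(source, target), check=True)
```

Something like convert("article.tex", "article.odt") would then write the ODT next to the source, ready to be opened in a word processor and saved as .doc.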

As far as I can tell, this is also the approach taken by the Python-based Word converters supplied for LyX. But I haven’t tried those.

There’s a problem, however. What if your LaTeX markup exceeds the capacities of markdown? After all, TeX itself is a Turing-complete programming language (sez Wikipedia), and markdown, lacking loops and conditionals, definitely isn’t. If your LaTeX uses any of TeX’s programming powers to generate your text, the magic of pandoc will not, at least in its present version, be powerful enough for you. (I’d love to be wrong about this, but I’m pretty sure this argument holds. Because pandoc is written in Haskell, I guess a robust TeX parser is in principle possible for pandoc, but that’s not really in the spirit of pandoc’s minimalism. Perhaps some kind of compromise involving markdown with embedded Haskell fragments would work. Sounds painful.)

But humanists—when would they use such powers? Alas, they will if they want to use those lovely bibliography-generation capabilities. A lot of algorithmic work goes into lining up all those nice Chicago-style footnotes and short references and ibids.

Now we come to the other avenue for TeX-to-Word conversion: reading the output rather than the input and converting that. On the plus side, all the algorithmic hard work will have already taken place, so all that nice generated text will be easy alphanumeric characters, spaces, and punctuation. On the minus side, TeX’s output is a DVI or a PDF, images of pages with much less semantic structure and lots and lots of non-semantic layout information. Turning semantic markup into layout is the whole point of LaTeX! I guess you could use a PDF-to-Word converter, like the one embedded in Acrobat Pro; but the layout-not-semantics problems quickly spiral out of control. (I’ve tried. The results with footnotes make you cry.) The converted Word document may sort of look like the PDF you make with LaTeX, but it will be very hard to use it in collaborating on content.

Now I’ve often wondered whether there isn’t some intermediate stage in the LaTeX processing that would be more suitable for conversion into ODT (which is just XML markup). After all, a package like biblatex doesn’t output DVI code when it generates a citation, it outputs TeX. Isn’t there some mid-processing version of my article (say) which consists of all the LaTeX I wrote, but with all the commands replaced with their output? Without knowing biblatex’s internals, I think we can be pretty sure that it’s not so simple. Lots of programmatic magic happens in a single citation command, magic that depends on global knowledge of the TeX processing run (decisions about pagination, information about sections and chapters, counters, etc. etc.).

So what’s left? tex4ht. The ingenious tactic used by this general-purpose TeX-to-markup converter is to piggyback on the TeX processing run, allowing TeX to do the work of output generation but annotating the result (a DVI file) with reminders of the original semantic structure. Then tex4ht reads the annotated TeX output back in and converts it to a new markup format; the package speaks XML and can output several flavors of HTML and, the key desideratum, OpenDocument XML. To produce the annotations it needs, tex4ht redefines basic TeX/LaTeX commands. Of course you can immediately see the challenge. This means:

  1. tex4ht needs to “annotate” every command your document uses.

  2. Which means tex4ht needs to redefine every command you use—or a generating subset of them (i.e. a set {f_1, f_2, f_3, …} such that every command you use can be expressed in terms of the f_i).

  3. And those redefinitions are supposed to behave just like the original commands, modulo output format: fancy TeX typesetting can be lost, but not any actual text or basic style information.
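The annotate-then-convert tactic can be mimicked with a toy in Python. To be clear about the assumptions: this is not how tex4ht is implemented, every name here is invented, and the "TeX run" is a regex that expands two made-up commands. The point is only the shape of the idea: the redefined commands generate the same text as the originals (including state-dependent text like "Ibid."), but leave behind a reminder of which command produced it.

```python
import re
from xml.sax.saxutils import escape

# Toy bibliography; the key and the entry are invented for illustration.
BIB = {"smith": "Smith, A Book"}

# Control characters stand in for the annotations; they never occur in the text.
OPEN, SEP, CLOSE = "\x02", "\x1f", "\x03"

CALL = re.compile(r"\\([a-z]+)\{([^{}]*)\}")
MARKED = re.compile(OPEN + "([a-z]+)" + SEP + "([^" + OPEN + CLOSE + "]*)" + CLOSE)


def cite(key, state):
    # Output depends on the state of the run: a repeated citation becomes "Ibid."
    text = "Ibid." if state.get("last") == key else BIB[key]
    state["last"] = key
    return text


def emph(text, state):
    # In plain text output the emphasis leaves no trace at all.
    return text


ORIGINAL = {"cite": cite, "emph": emph}


def annotate(name, command):
    # Redefine a command: same generated text, wrapped in a reminder of its name.
    def wrapped(arg, state):
        return OPEN + name + SEP + command(arg, state) + CLOSE
    return wrapped


ANNOTATED = {name: annotate(name, command) for name, command in ORIGINAL.items()}


def run(source, commands):
    # Stand-in for the TeX run: expand each call left to right with shared state.
    state = {}
    return CALL.sub(lambda m: commands[m.group(1)](m.group(2), state), source)


def to_xml(annotated_output):
    # Second pass: read the annotated output back and turn reminders into tags.
    return "<p>" + MARKED.sub(r"<\1>\2</\1>", escape(annotated_output)) + "</p>"
```

Running the same source through run(source, ORIGINAL) gives bare text with the structure gone, while to_xml(run(source, ANNOTATED)) keeps both the generated text and the structure. Note that every command needs an entry in the annotated table, which is the catch-up problem in miniature.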

The result is that the maintainers of tex4ht are constantly playing catch-up with the entire TeX ecosystem, writing “.4ht” workalike packages to convert the commands offered by popular packages into working surrogates for tex4ht. The task has been even more challenging because tex4ht was the brainchild of one person, Eitan Gurari, who died suddenly in 2009; the current maintainers have had to plunge into his work in medias res.

So there you have it: the basic issue. As I blog on, I’ll discuss the little ways of coping with it. Amazingly, you really can.

5 responses to “LaTeX to Word: the basic issue”

  1. Alex Roberts

    tex4ht is indeed quite impressive; I’ve managed to get it to work with footnotes and biblatex, two crucial details. The only thing missing now for my documents, unless I am mistaken, is compatibility with fontspec: SimpleTeX4ht returns an error if my document has fontspec. The only trouble is I don’t know how to use Unicode characters (diacritical marks for Semitic languages, Greek, Arabic & Syriac scripts) without fontspec. Until I can get tex4ht to work with Unicode, my documents are stuck just this side of Word-land.

  2. Andrew Goldstone

    Alex, belatedly: yes, this is a serious problem. Basically there’s very little cross-compatibility between tex4ht and xetex, though I have not tried to make the “htxelatex” commands work. fontspec is just not supported by tex4ht in any form right now. As you can see in this tex4ht mailing list thread, the tex4ht developers acknowledge the basic issues but I don’t think they’ve found a fix yet. My workaround has been to factor out my xetex packages when doing Word conversions. Then you need “normal” latex support for all your Unicode. \usepackage[english]{babel} and \usepackage[utf8]{inputenc} will get you basic Unicode support, which worked for my book with French and German, but you need fancier packages in order to get support for Unicode Greek, Arabic, and Syriac. Do those even exist? Try googling around your basic issues, and posting to the tex4ht and xetex mailing lists (both pretty active). It’s frustrating. I want to do more work on this issue, and I know others do too.

  3. Jonathan

    Have you tried LaTeX2RTF? It’s the best way I’ve found to do LaTeX-to-Word conversions. It gets footnotes and bibliographies right (using BibTeX; I haven’t tried biblatex), which is what counts for me.

  4. I stumbled across this blog via the fix for the Biber cache problem in the next post. A few months ago, I realized Adobe Acrobat can generate Word documents from PDFs, including those produced by LaTeX (I’d assume also XeTeX). It’s just a “Save As…” command in Acrobat. Detailed instructions and caveats are here: http://jefais.tumblr.com/post/27916537053/a-very-easy-but-slightly-expensive-way-to-convert.

    • Andrew Goldstone

      Thanks for linking your post on the Acrobat possibility! I’ve tried it but didn’t find it satisfactory; I’m pretty sure the last time I used it footnotes were not converted as footnotes but as text frames at the bottoms of pages. In other words, it throws out too much in the way of semantics. Maybe it’s improved since I did this experiment years ago. Might be okay for the purpose of sharing drafts to co-edit, but what we academo-TeXies really need is a way to put TeX into publishers’ editorial & typesetting pipelines, which are (in humanities) set up to start from Word, with no hope of change in sight. Will blog about this eventually.
