Traces

It’s come to my attention that this blog has an occasional visitor. Because of wordpress.com’s intrusive video ads policy (and because I’m already paying for hosting elsewhere), I’m continuing my LaTeX blogging over on my main website, andrewgoldstone.com. The archive of this blog is also visible there under the category “tex,” where I’ll also continue to add new posts. The RSS feed for that blog is http://andrewgoldstone.com/atom.xml. I recently added there, for example, a post on my experience TeXing my book, Fictions of Autonomy.

Filed under General Reflections

On Wishing Death to Word

So I keep seeing links to this denunciation of the dysfunction of Microsoft Word in Slate: Tom Scocca, “Death to Word”.

One heartily endorses the sentiment. Scocca’s example of what pasted Microsoft Word XML looks like is comedy gold (and all too familiar from student blog posts). The heart sinks, however, when it turns out that the only alternative Scocca knows to mention is TextEdit, even though his explicit concern is with the crippling defects of Word when it comes to moving documents between print and HTML. In other words, the entire universe of text editing software (as opposed to word processors) is invisible to the writer of the article. No doubt he can’t imagine any way to break Word’s near-monopoly, let alone that there are both open-source and commercial systems of long standing that are much more versatile.

As I keep learning when I try to explain my use of LaTeX to humanists, the first obstacle is that the very concept of text processing is alien to most word-processing, WYSIWYG-expecting users. The response to a screen full of my TeX source is, “How do you print that?” Such users have long accepted the endless frustrations of Word in exchange for the relative simplicity with which it allows them to produce printable documents and share them. Or they have accepted the frustrations because the alternatives are unknown, maybe inconceivable without a different kind of conceptual framework.

But it is baffling, in a way, that though people who write are willing to spend many, many hours learning to persuade Word to do its job and fighting with its problems, the same people are unlikely to spend the hours (probably fewer, in the end) needed to become adept at text processing. Somehow the digital facts of life about text (markup, text encoding, processing) are quarantined in Code Land, the forbidden zone where only the Techies dare to venture. And everyone knows it’s okay for humanists and literary people not to be Techies. In spite of that they become, by default, technicians of Word but not technicians of text.

The latter would be better. Why is this not part of everyone’s basic digital literacy curriculum? Oh, wait, we don’t have a widespread basic digital literacy curriculum. But we should, as part of the goal of distributing the cultural capital of genuinely useful literacy as widely as possible. And it should include some lessons on two distinct tasks: composing text in a digital medium and processing digitally-composed texts into other formats (including print). Everyone should have a chance to learn what it’s like to write text in a text editor and then do something with that text in a processor.

Extra thoughts

I really wish the popular blogging platforms made the ability to swap between HTML and WYSIWYG editing more prominent, and encouraged everyone to do it. It seems to me that if more of the people who are writing on the web could be encouraged to play with this ready-made demonstration of how they are really first composing marked-up text and then rendering it in a browser, many more people could become technicians of text. And the day when we dance on Word’s grave would come a little closer.

I also think very highly of markdown. tumblr, bless its heart, allows you to compose posts in markdown. Markdown is easy to write, and its relation to HTML is easy to understand. Thus you can actually see how your composed markdown text leads to HTML, and then you can render it yourself in-browser.

E-mail clients too. Who thought RTF would be a good “rich” e-mail format?

To come, maybe

A guide for the perplexed on how to gain “reading knowledge” of LaTeX, if you are ever working with a TeXhead like me who shares their source and gets crabby if you ask for it “in Word.”

Filed under Conversion, General Reflections, Word

Percent-Ampersand Is Not Shebang

%& is not a shebang.

Somewhere on the internets I picked up the idea that the first line of a tex file should specify what flavor of TeX it’s written in, where flavor was one of TeX, LaTeX, XeLaTeX, XeTeX, etc. I think I was imagining that it was some kind of TeX-internal version of the “shebang” #! that tells a Unix shell what program to use to execute a script. Indeed my standard template file had such a line (but not any more).

Wrong.

After a lot of googling and digging and frustration, I have learned the following. (This sort of thing happens a lot when you want to know about anything that has to do with the running of TeX instead of just typesetting with it.) The %& comment, which I propose to call the “peramp,” is a mechanism for feeding a “format” to the tex engine. A format, as one learns from Victor Eijkhout’s book TeX By Topic, is a kind of precompiled bundle of macros that the basic TeX engine can use. (Even “plain” TeX, it turns out, is the TeX engine using a “plain” format.) It is possible to make your own formats and feed them to the TeX engine at the command line (cf. the tex manpage) with a command like tex &myformat.fmt (in most shells the & has to be quoted or escaped). As one might guess, LaTeX is implemented as a format. The pdftex engine, furthermore, supports switching among plain TeX, LaTeX (both dvi output), pdfTeX, and pdfLaTeX using formats. So (this is kind of cool) if your source file has a peramp line of the form %&engine, you can process it with any of latex, pdftex, and pdflatex and get consistent results equivalent to processing it with engine at the command line. In fact it seems that for a while all of these engines have secretly been the single pdftex program invoked with the appropriate format.
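
As a rough sketch of what a peramp line looks like in practice, here is a tiny file whose first line names the pdflatex format. This assumes a pdfTeX-based setup in which first-line parsing is switched on (some installations enable it by default, others need it turned on explicitly), and that a format called pdflatex is installed.

```latex
%&pdflatex
% The line above is the "peramp". It has to be the very first line of the file.
% It asks the engine to load the precompiled format named pdflatex.
\documentclass{article}
\begin{document}
Hello, format.
\end{document}
```

If the mechanism works as described, running latex, pdftex, or pdflatex on this file should all give the same result.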

Apparently some other TeX varieties I don’t use can also be handled this way. Unfortunately, XeTeX, which I do use, is not secretly a TeX format. As the author of XeTeX testily explains in this mailing list post I found, xetex is an independent engine and must be invoked as such on the command line. pdftex will not switch over to xelatex if it finds %&xelatex at the start of the file, and xelatex will not switch over to pdftex in the converse situation. The engine xetex does support TeX formats compiled specifically for it—that is what XeLaTeX is. So if you run xetex on a file that begins %&xelatex it will indeed be processed as XeLaTeX and not as plain XeTeX.

The water is further muddied because I seem not to be the only person to have picked up this idea about peramp-as-shebang, and there are programs on the internet that use this method or a variant. In particular the GUI front-end TeXShop does support this kind of engine detection, with its own distinctive first-line syntax: %! TS-program = XeLaTeX (note the caps). But this is specific to TeXShop, not a feature of TeX.

Working with TeX one often feels one has stepped into a kind of Bizarro *nix Land: a lot of things look very similar to, but not quite the same as, things in a Unix-style programming environment. This is, I think, mostly evidence of TeX’s age (older than the GNU project) and devotion to backward compatibility. It’s also evidence of the fact that TeX users have not, by and large, been programmers but scientists, who (in my small experience) seem to specialize in Bizarro Programming.

[Edit, same day: If you use tex at the command line, it will not process a peramp-line for one of the other formats.]

Filed under General Reflections, running tex

Funny old thing (LaTeX class for Yale PhD thesis)

Here’s a funny thing. This is a LaTeX document class file for a Yale Ph.D. dissertation: mythesis.cls. When I submitted mine, I discovered an older document class other people had worked on, and tweaked it to bring it up to the 2006 spec. Might be useful for current Yalies (if they somehow find this page by googling) or others TeXing their dissertations. Invoke with \documentclass[12pt]{mythesis}.

I’m adding it to the small github repository in which I’m going to start keeping tex sources for people to play with.

Coming next: a document class/style file/sample source for a student paper. One day I am going to persuade a student to TeX a paper.

I can dream, anyway.

Filed under General Reflections, kludgetastic, Sample files

How to Begin

So you just want to begin. The TeX Users Group website has a Getting Started Page with the essentials: introductory documents, examples, and, most importantly, links to the software itself: TeX Live on Unix, MacTeX on Mac, proTeXt on Windows (all free, and rather enormous, downloads). If you are on a Mac, look for the TeXShop application; if you are on Windows, look for TeXnicCenter. If Unix, pffft; open up an xterm and vim/emacs. However you do things, here is a file to start playing around with: xelatex.sample.tex. If your setup is working like mine—and if you have the fonts I use—you should be able to typeset it with XeLaTeX and get a result that looks like this: xelatex.sample.pdf.

Engines

In the rest of this post, I’ll walk through that source file, but before we can do that, there’s one technicality to get out of the way: what engine will you use? The engine is the program that converts your TeX source code into a presentational format like PDF. TeX has been around long enough to have developed a bunch of variants, each with its corresponding processing engine. First of all there’s the contrast between the original or “plain” TeX and LaTeX. Plain TeX is a specialist taste. Stick with LaTeX, which is easier and much more in the spirit of contemporary document markup (XML, etc.). A TeX distribution comes with three important LaTeX engines: latex, pdflatex, and xelatex. (There are more out there, but never mind them.)

How do you use an “engine”? If you are working in a graphical front-end program, look for a pulldown menu that allows you to choose one. Here is what it looks like in the MacOS program TeXShop: [screenshot of the TeXShop engine menu]

Click the “Typeset” button to produce a PDF. TeXShop will automatically use pdflatex if you choose “LaTeX” from its menu. I think that’s a bit confusing.

If you like the command line, the command is simply

[ENGINE] [FILENAME]

as in

xelatex my-article.tex

Classic latex processes LaTeX source into DVI (“device independent”) format, a TeX-specific filetype devised in pre-PDF days. At this stage, think no more about it.

pdflatex processes LaTeX source directly into PDF. Pretty much anything you can latex you can pdflatex. If you are reading introductions to LaTeX (or Mittelbach and Goossens’s big reference book, the LaTeX Companion) and you want to try out their example code, try it in pdflatex.

xelatex processes XeLaTeX source to PDF. XeLaTeX source is LaTeX with a certain preamble and Unicode characters used freely throughout. Unlike (pdf)latex, xelatex uses your system fonts.

I think the font-and-Unicode combo is compelling enough that humanists should use XeLaTeX, so for the rest of this post I’ll talk about that. But if you run into trouble, try switching to pdflatex.

Here is a sample file in LaTeX, to be processed with pdflatex: pdflatex.sample.tex. Its output should look like this: pdflatex.sample.pdf. Enjoy!

A minimal sample document

But on to the promised description of the sample.

A LaTeX document has two sections, the preamble and the body. Ideally the preamble describes the details of the appearance of the final page, whereas the body describes the structured content of the document. The preamble commands are trickier to master than the body commands, so what you really need is someone else’s preamble to get you going. I’ll talk about the body first and then come back to the preamble.

In general

Here are the most important things about TeX syntax.

  • A comment begins with a % sign. Everything from % to the end of a line is ignored by the typesetter.
  • TeX commands—instructions to the typesetter—always begin with \.
  • What programmers call the grouping operator—used to delimit blocks of code and parameters to commands—is the curly brace, {}.
  • Optional parameters to commands use square brackets, [].
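
The four rules above can be seen together in a few lines. This is only an illustrative sketch; the text in the body is made up.

```latex
% This whole line is a comment; the typesetter never sees it.
\documentclass[12pt]{article} % \ starts a command; [12pt] is optional, {article} is required
\begin{document}
Some text, with \emph{one group of words} handed to a command in braces. % trailing comment
\end{document}
```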

The body

The body begins, naturally enough, with the command \begin{document}. It ends with \end{document}. In general, simply type the text you want set. TeX turns multiple spaces into just one space, and it treats a single line break in the source as a space rather than as a line break.

Paragraph breaks are made with one (or more) blank lines. The nature of a paragraph break—how much to indent the paragraph, whether to leave extra whitespace between paragraphs—is an aspect of layout and as such should be specified in the preamble. A forced line break is made with a double backslash, \\.
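
A small sketch of how spacing and breaks in the source map onto the output (the sentences are placeholders):

```latex
\documentclass{article}
\begin{document}
These    words   are   typed   with   extra   spaces,
and this line break in the source does not start a new line in the output.

A blank line came before this sentence, so it starts a new paragraph.
This sentence ends with a forced break\\
and this text continues on a new line within the same paragraph.
\end{document}
```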

Two more idiosyncrasies. TeX always produces curly quotes, but you must tell it which kind you want. Double quotes are typed as double backticks `` and double apostrophes ''; single quotes as ` and '; and the apostrophe as itself '. A good TeX editor will automatically type these for you when you type a regular double quote. (In TeXShop, you’ll have to make sure Source > Key Bindings > Toggle On/Off is checked.)

TeX also cares deeply, very deeply, about dashes. The em dash is written as a triple hyphen, ---, the en dash (for e.g. ranges of numbers like 4–5) as --, and the hyphen as itself, -.
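
Here is a short sketch of the quote and dash conventions in running text, with made-up sentences:

```latex
\documentclass{article}
\begin{document}
``Double quotes'' use two backticks and two apostrophes.
`Single quotes' use one of each, and it's fine to type an apostrophe as itself.

A well-read critic uses a hyphen, cites pages 4--5 with an en dash,
and interrupts herself---like this---with an em dash.
\end{document}
```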

That’s almost it! If all you’re doing is writing free-form paragraphs, as in a blog post, you know what you need. But if your document has more structure, you need to know some LaTeX markup commands.

The simplest one is emphasis: \emph{my emphasized text}. To understand the difference between typesetting italic font and marking up emphasis, consider that you can nest emphasis commands: \emph{my emphasis has a \emph{further} emphasis within it}. The inner emphasis appears in roman type, as it should.

And the favorite humanist command: \footnote{my footnote text}. This is the kind of thing where LaTeX really shines. You put the footnote command right where you want the note “anchor” (i.e. the little superscript number) to appear, so you never lose track of how your notes and your body text are related. LaTeX numbers your notes for you and thinks hard about how best to lay out your pages, deciding if it’s necessary to continue footnotes onto the next page, making sure your body text fills out the page, and so on.
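
A minimal sketch of both commands at work; the wording of the notes is invented for illustration:

```latex
\documentclass{article}
\begin{document}
This claim is \emph{important, and this part is \emph{doubly} so}.\footnote{The
note text goes here, at the point where the anchor should appear.}
The sentence after the note carries on as usual.\footnote{A second note,
numbered automatically.}
\end{document}
```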

Then there are commands that describe the structure of the document. In LaTeX these are called \section{}, \subsection{}, and \subsubsection{}. The title of the section goes in between the braces. If you specify \documentclass{book} (see the discussion of the preamble below) then you can also use \chapter{}. All these commands not only typeset your section headings distinctively, but can also number them (if you wish) and remember them for a table of contents (if you wish).
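
As an illustration, a skeleton article using these commands might look like the following. The headings are placeholders, and the starred form in the last heading is a standard LaTeX variant not discussed above.

```latex
\documentclass{article}
\begin{document}
\tableofcontents % built from the headings below; needs two typesetting runs

\section{Introduction}
Opening paragraph.

\subsection{A narrower point}
More text.

\subsubsection{A narrower point still}
Yet more text.

\section*{Acknowledgments} % starred form: unnumbered and left out of the contents
Thanks.
\end{document}
```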

A little more complex is the construct called an environment. These are made up of two commands: \begin{environment-name} and \end{environment-name}. Between these two statements comes text that you want typeset differently from body paragraphs. The most important one for the humanities is the quote environment for blockquotes. There is also a verse environment, as well as listing environments for numbered or bullet-pointed lists. Bullet points are distasteful, but numbered lists are useful. They begin with \begin{enumerate} and end with \end{enumerate}. Each item begins with the command \item, which LaTeX converts to the item number.
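
A brief sketch of the quote and enumerate environments, with placeholder text:

```latex
\documentclass{article}
\begin{document}
As the critic puts it:
\begin{quote}
A short quotation, set off from the surrounding text as a block.
\end{quote}

Three steps follow.
\begin{enumerate}
  \item Write the text.
  \item Typeset it.
  \item Read the result.
\end{enumerate}
\end{document}
```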

And that’s really it.

The preamble

As I say, the preamble is a bit trickier, and it’s probably best to begin with someone else’s preamble and modify it to suit. But the basic idea is straightforward.

First you declare the “class” of the document: the important ones are article and book. The declaration also specifies a base point-size as an option:

\documentclass[12pt]{article}

The rest of the preamble combines invocations of packages and layout commands. Packages are self-contained modules of code that extend LaTeX’s capabilities, either by modifying what existing commands do or by giving you access to new commands. LaTeX is supported by an enormous open-source library of packages called CTAN (large chunks of which will be installed with your latex distribution). A package is invoked with the command \usepackage[options]{package-name}.
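
For instance, a preamble might load one package with an option and another without. The particular option here (one-inch margins) is just an example of the syntax, not a recommendation.

```latex
\documentclass[12pt]{article}
\usepackage[margin=1in]{geometry} % [margin=1in] is an option passed to geometry
\usepackage{setspace}             % no options: load the package with its defaults
\begin{document}
Text.
\end{document}
```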

pdflatex doesn’t need anything between the document class declaration and \begin{document}. XeLaTeX documents always start, after the document class declaration, with the following package invocations:

 \usepackage{fontspec}
 \usepackage{xunicode}
 \usepackage{xltxtra}

Then comes a font declaration:

\defaultfontfeatures{Ligatures=TeX,Numbers=OldStyle}
\setmainfont{Hoefler Text} % Or the full name of any other font on your system

It is rather disconcerting for first-time users to discover that by default LaTeX has big margins. The margins are chosen to make your lines not too wide—the typical 6-inch line of a word-processed document is much longer than any book designer would use for 12-point font in most cases. But this may look too odd when you begin. Fortunately the geometry package gives you an easy way to reassert control:

\usepackage{geometry}
\geometry{width=6 in,height=8.5 in}

If you really want that Microsoft Word-y look, you can doublespace:

\usepackage{setspace}
\doublespacing

Though very elaborate things are possible with headers and footers in LaTeX (look up the fancyhdr package), the next command gives you a barebones page-number-footer:

\pagestyle{plain}
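
For comparison, here is a small sketch of what a slightly more elaborate setup with fancyhdr might look like. The header text is a placeholder, and this is only one of many possible arrangements.

```latex
\documentclass{article}
\usepackage{fancyhdr}
\pagestyle{fancy}                    % hand header/footer layout over to fancyhdr
\fancyhf{}                           % clear all default header and footer fields
\fancyhead[R]{Draft}                 % right-hand header text (placeholder)
\fancyfoot[C]{\thepage}              % page number centered in the footer
\renewcommand{\headrulewidth}{0.4pt} % thin rule under the header
\begin{document}
Text.
\end{document}
```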

Finally, you may be mystified about those numbered sections. The following incantation ensures that no section numbers will be typeset:

\setcounter{secnumdepth}{-2}
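
Put together, the preamble pieces above might be assembled into a complete file roughly like this, to be typeset with xelatex. It is a sketch, not the sample file itself: it assumes the Hoefler Text font is installed (swap in any font you have), and the body text is invented. On newer installations fontspec alone may be enough without the other two packages.

```latex
\documentclass[12pt]{article}

% XeLaTeX setup, as listed above
\usepackage{fontspec}
\usepackage{xunicode}
\usepackage{xltxtra}
\defaultfontfeatures{Ligatures=TeX,Numbers=OldStyle}
\setmainfont{Hoefler Text} % assumption: this font is installed on your system

% Layout
\usepackage{geometry}
\geometry{width=6in,height=8.5in}
\usepackage{setspace}
\doublespacing
\pagestyle{plain}
\setcounter{secnumdepth}{-2}

\begin{document}
\section{A First Section}
``Hello,'' she said---in Unicode, naïvely.\footnote{A first footnote.}
\end{document}
```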

To be continued!

That should be all you need to start experimenting. There’s much more to play with—e.g., bibliographies and citations, images, and of course mathematical equations—but this should be enough to get things underway.

I recommend continuing by looking at Tobias Oetiker’s Not-So-Short Introduction to LaTeX2e. Contact me or comment here if you like, too!

Edit 3/10/12: Changed sample file links to point to their new home, in a repository on github.

Filed under General Reflections, Sample files

biber first aid for “data source not found”

I don’t know what causes it, but every now and then biber gives up working on my system. Then I start getting error messages like data source /var/folders/m6/bn7r45zx6cx55rr9g4qh6s6w0000gn/T/par-agoldst/cache-87e533530c5239fb9f8e3ff008979f1f16ea0e5e//inc/lib/Biber/LaTeX/recode_data.xml not found in .

There is a magic incantation, which I found somewhere on the biber sourceforge forums, to get biber to clear the cobwebs by reinstalling itself using perl’s PAR tool, as it does the first time it is run on a new system. Just clear the PAR cache, whose location under /var is helpfully specified in the error message:

 sudo rm -rf /var/folders/m6/bn7r45zx6cx55rr9g4qh6s6w0000gn/T/par-agoldst

You will have different stuff in /var/folders.../par-<username>.

Since I’ve seen other people complaining of this, I’m convinced it’s not my fault but the result of some deeply buried sporadic bug in biber. I’d try harder to figure it out, but it’s just too easy to make this one go away when it pops up once every few months. TeX is hacking, people!

Filed under biber, kludgetastic

LaTeX to Word: the basic issue

It seems to me that there are two possible avenues for the challenge of converting LaTeX source to Microsoft Word—a challenge which humanists will have to take up whenever they collaborate with non-TeXnical co-authors and editors. One route is to parse the source and output Word or, more likely, a less closed intermediate markup format. Humanists are actually better off here than scientists, since they don’t use the major feature that TeX handles very differently from word processors—and far, far better: namely, mathematical equation typesetting. And assuming that Word is not going to be the format for final publication, there’s no need to cry (much) over the loss of typographical quality resulting from subjecting your paragraphs to Word’s justification and pagination “algorithms” rather than TeX’s. The goal is most likely just to share and edit content. For these purposes (from shared authorship to substantial editing to copy-editing), LaTeX markup for humanists can be pretty simple—indeed, practically equivalent to the text-markup subset of HTML itself: paragraph, header, section (division), span. For this very minimal version of LaTeX—entirely conceivable as a possibility for humanists—conversion to Word is a matter of getting a parser for the minimal LaTeX and outputting its equivalent in XML.
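
To make the “very minimal version of LaTeX” concrete, here is a sketch of a document that stays inside that subset. The rough HTML equivalents in the comments are my own gloss, and the text is a placeholder.

```latex
\documentclass{article}
\begin{document}
\section{A Heading}       % roughly an HTML heading
A paragraph of ordinary prose, with one \emph{emphasized span}. % roughly <p> and <em>

\subsection{A Subheading} % a lower-level heading
Another paragraph.
\end{document}
```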

Numerous converters do this. The one I have nearest to hand is the astonishing pandoc, an all-purpose converter. Pandoc’s “native” format is a plain-text format called markdown, which pretty much corresponds to the minimal text markup I mentioned above. And pandoc writes lots of markup formats, including HTML and OpenDocument. I’m writing this post in markdown and using pandoc to output the HTML. If your LaTeX has a markdown equivalent, pandoc can very robustly produce an ODT. Then NeoOffice (etc) can convert it to .doc format.

As far as I can tell, this is also the approach taken by the python-based Word converters supplied for LyX. But I haven’t tried those.

There’s a problem, however. What if your LaTeX markup exceeds the capacities of markdown? After all, TeX itself is a Turing-complete programming language (sez Wikipedia), and markdown, lacking loops and conditionals, definitely isn’t. If your LaTeX uses any of LaTeX’s more robust algorithmic powers to generate your text, the magic of pandoc will not, at least in its present version, be powerful enough for you. (I’d love to be wrong about this, but I’m pretty sure this argument holds. Because pandoc is written in Haskell, a full programming language, I guess a robust TeX parser is in principle possible for pandoc, but that’s not really in the spirit of pandoc’s minimalism. Possibly some kind of compromise involving markdown with embedded Haskell fragments would be possible. Sounds painful.)

But humanists—when would they use such powers? Alas, they will if they want to use those lovely bibliography-generation capabilities. A lot of algorithmic work goes into lining up all those nice Chicago-style footnotes and short references and ibids.
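
As a sketch of the kind of generated text at issue, here is a small biblatex document. It uses the standard verbose-ibid style as a stand-in for a Chicago-style setup, and the bibliography entry, its key, and the file name are all made up for illustration. It needs a run of biber between typesetting runs.

```latex
% A made-up entry, written out to a hypothetical file, just to have something to cite.
\begin{filecontents}{example-refs.bib}
@book{placeholder,
  author    = {Author, Ann},
  title     = {A Hypothetical Book},
  publisher = {Imaginary Press},
  location  = {Nowhere},
  year      = {2000},
}
\end{filecontents}
\documentclass{article}
\usepackage[style=verbose-ibid,backend=biber]{biblatex}
\addbibresource{example-refs.bib}

\begin{document}
A first citation gives the full reference.\autocite[12]{placeholder}
An immediate repeat is shortened by the style.\autocite[15]{placeholder}
\printbibliography
\end{document}
```

None of the footnote text appears in the source. The package generates it during the run, which is exactly what a source-level converter has trouble reproducing.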

Now we come to the other avenue for TeX-to-Word conversion: reading the output rather than the input and converting that. On the plus side, all the algorithmic hard work will have already taken place, so all that nice generated text will be easy alphanumeric characters, spaces, and punctuation. On the minus side, TeX’s output is a DVI or a PDF, images of pages with much less semantic structure and lots and lots of non-semantic layout information. That’s the whole point of LaTeX! I guess you could use a PDF-to-Word converter, like the one embedded in Acrobat Pro; but the layout-not-semantics problems quickly spiral out of control (I’ve tried. The results with footnotes make you cry). The converted Word document may sort of look like the PDF you make with LaTeX, but it will be very hard to use it in collaborating on content.

Now I’ve often wondered whether there isn’t some intermediate stage in the LaTeX processing that would be more suitable for conversion into ODT (which is just XML markup). After all, a package like biblatex doesn’t output DVI code when it generates a citation; it outputs TeX. Isn’t there some mid-processing version of my article (say) which consists of all the LaTeX I wrote, but with all the commands replaced with their output? Without knowing biblatex’s internals, I think we can be pretty sure that it’s not so simple. Lots of programmatic magic happens in a call, magic that depends on global knowledge of the TeX processing run (decisions about pagination, information about sections and chapters, counters, etc. etc.).

So what’s left? tex4ht. The ingenious tactic used by this general-purpose TeX-to-markup converter is to piggyback on the TeX processing run, allowing TeX to do the work of output generation but annotating the result (a DVI file) with reminders of the original semantic structure. Then tex4ht reads the annotated TeX output back in and converts it to a new markup format; the package speaks XML and can output several flavors of HTML and, the key desideratum, OpenDocument XML. tex4ht redefines basic TeX/LaTeX commands to produce the annotations it needs. You can immediately see the challenge. This means:

  1. tex4ht needs to “annotate” every command your document uses.

  2. Which means tex4ht needs to redefine every command you use—or a generating subset of them (i.e. a set {f_1, f_2, f_3, …} such that every command you use can be expressed in terms of the f_i).

  3. And those redefinitions are supposed to behave just like the original commands, modulo output format: fancy TeX typesetting can be lost, but not any actual text or basic style information.

The result is that the maintainers of tex4ht are constantly playing catch-up with the entire TeX ecosystem, writing “.4ht” workalike packages to convert the commands offered by popular packages into working surrogates for tex4ht. The task has been even more challenging because tex4ht was the brainchild of one person, Eitan Gurari, who died suddenly in 2009; the current maintainers have had to plunge into his work in medias res.

So there you have it: the basic issue. As I blog on, I’ll discuss the little ways of coping with it. Amazingly, you really can.

Filed under Conversion, General Reflections, tex4ht, Word