Haskell: Parsing LaTeX .aux Files

July 31, 2020

These programs are moving toward the goal of allowing references from Java source code to LaTeX documents. The short program below gathers the necessary information from the LaTeX documents.

LaTeX Refresher

LaTeX is a wonderful thing, and if you're not familiar with it, then you should be! LaTeX also happens to be the format used for "block-style" (.lhs) literate Haskell files.

When composing a LaTeX document, the \label command lets you mark a location so that you can refer to it later with \pageref or \ref. For example, if the document contains \label{i-care-about-this}, then \pageref{i-care-about-this} will appear in the output file as "8" or "452", or whatever page the \label ended up on. Exactly what \ref{i-care-about-this} becomes depends on the context in which the \label was given. Usually it's a section number, as in "3.1.2", but it may also be an equation number, a figure number, or a number in an enumerated list.

The LaTeX document generation system works like a two-pass compiler. The first pass generates an .aux file that contains the \label values (and other information), and the second pass uses the .aux information to fill in any \ref or \pageref values. Here's a short excerpt from an .aux file:

\relax 
\@writefile{toc}{\contentsline {section}{\numberline {1}Introduction}{1}}
\newlabel{becomes-a-note}{{2}{2}}
\newlabel{goals-and-concerns}{{4}{8}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.3}Implementation}{13}}
\newlabel{five-atomic-steps}{{5.4.2}{14}}
\@writefile{toc}{\contentsline {subsubsection}{\numberline {5.4.7}Global Undo}{19}}

All kinds of stuff can appear in an .aux file; e.g., the lines above that start with \@writefile have to do with generating a table of contents. For the current task, the only lines we care about start with \newlabel. For example, the line

\newlabel{five-atomic-steps}{{5.4.2}{14}}

means that the original document had \label{five-atomic-steps} somewhere in Section (5.4.2), which was on page 14. Those two pieces of information, and the label itself, are what we care about.

Parsing an .aux File

The program below parses out the three parts of a \newlabel, and folds them into a larger "aux database" file. The idea is to run the program on a series of .aux files to create a single file that contains the label definitions from all of the files. The aux database file is just a flat file; it consists of a series of lines, each of which has the format

file_name label_type latex_label section page

where latex_label, section and page are taken directly from the .aux file, file_name is the name of the file from which these values came, and label_type doesn't serve any purpose yet (it's there for future-proofing).

The entire program is available here, and each part is described below. First, the aux database file is described by

 
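{-# LANGUAGE LambdaCase #-}  -- needed for the \case in main

import Control.Monad (when)
import Control.Monad.Extra (allM)  -- assumed source: the "extra" package
import Data.List (isPrefixOf, union)
import System.Directory (doesFileExist)
import System.Environment (getArgs)

-- One line of the aux database file.
data LabelEntry = LabelEntry {
  fileName :: String,
  labelType :: String,
  latexLabel :: String,
  section :: String,
  page :: String
} deriving (Eq)

-- Render an entry as one space-separated database line.
instance Show LabelEntry where
  show (LabelEntry f1 f2 f3 f4 f5) = unwords [f1, f2, f3, f4, f5]

-- Parse one database line back into an entry.
readLabelEntry :: String -> LabelEntry
readLabelEntry s = case words s of
  (f1 : f2 : f3 : f4 : f5 : _) -> LabelEntry f1 f2 f3 f4 f5
  _ -> error ("malformed database line: " ++ s)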

and this "database" is read into memory by

-- Given the name for a database file, read it in.
readDB :: String -> IO [LabelEntry]
readDB fname = do
  contents <- readFile fname
  return $ map readLabelEntry $ lines contents
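As a quick check of the flat-file format, here is a minimal sketch of one entry being written out and read back. It repeats the record so that it runs by itself; the file name design.aux and the names sample and roundTrips are made up for illustration, and this readLabelEntry is a slightly stricter variant that insists on exactly five fields.

```haskell
-- Standalone copy of the record from the post, so this block runs by itself.
data LabelEntry = LabelEntry
  { fileName   :: String
  , labelType  :: String
  , latexLabel :: String
  , section    :: String
  , page       :: String
  } deriving (Eq)

-- One database line: the five fields separated by single spaces.
instance Show LabelEntry where
  show (LabelEntry f l x s p) = unwords [f, l, x, s, p]

-- Inverse of show; assumes exactly five whitespace-free fields.
readLabelEntry :: String -> LabelEntry
readLabelEntry line = case words line of
  [f, l, x, s, p] -> LabelEntry f l x s p
  _               -> error ("malformed database line: " ++ line)

-- Hypothetical sample entry built from the .aux excerpt above.
sample :: LabelEntry
sample = LabelEntry "design.aux" "arbitrary" "five-atomic-steps" "5.4.2" "14"

-- True when a value survives being written out and read back.
roundTrips :: LabelEntry -> Bool
roundTrips e = readLabelEntry (show e) == e
```

Because the line is split with words, the round trip only holds when no field contains whitespace. A file name with a space in it, for instance, would need a different separator or some quoting.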

The individual .aux files are read into memory and parsed by

-- Given an .aux file name, parse the file.
readAuxFile :: String -> IO [LabelEntry]
readAuxFile fname = do
  contents <- readFile fname

  -- Parse out the items we care about.
  let rawEntries =
          -- Get rid of the empty strings.
          map (filter (not . null)) $
          -- Each line of input becomes a list of strings, some of which
          -- are empty.
          map parseItem $
          -- Drop the first 10 characters from each line (i.e., "\newlabel{").
          map (drop 10) $
          -- Keep only those lines that start with "\newlabel".
          filter ("\\newlabel" `isPrefixOf`) $
          lines contents

  return $ map (rawToLabelDB fname "arbitrary") rawEntries

-- Breaks a line from an .aux file into the three pieces we care about. It is
-- assumed that we have stringXstringX, etc., where X is some combination of 
-- '{' and '}'. We want to split on any combination of these. This generates
-- lots of empty strings which need to be filtered out.
parseItem :: String -> [String]
parseItem "" = []
parseItem s = firstString : (parseItem rest) where
  firstString = takeWhile (\c -> (c /= '{') && (c/= '}')) s
  rest = drop (length firstString + 1) s
  
-- Convert already parsed data from an .aux file to a LabelEntry value.
rawToLabelDB :: String -> String -> [String] -> LabelEntry
rawToLabelDB fname labelType xs = 
  LabelEntry {
    fileName = fname,
    labelType = labelType,
    latexLabel = xs !! 0,
    section = xs !! 1,
    page = xs !! 2
  }
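To make the splitting concrete, here is a small worked example that runs parseItem on the sample \newlabel line from the excerpt. parseItem is repeated so that the block runs by itself; the names sampleLine, rawPieces and fields are only for illustration.

```haskell
-- Same definition as in the post, repeated so the block runs by itself.
parseItem :: String -> [String]
parseItem "" = []
parseItem s = firstString : parseItem rest
  where
    firstString = takeWhile (\c -> c /= '{' && c /= '}') s
    rest = drop (length firstString + 1) s

-- The sample line from the .aux excerpt.
sampleLine :: String
sampleLine = "\\newlabel{five-atomic-steps}{{5.4.2}{14}}"

-- Pieces after dropping the 10-character "\newlabel{" prefix; every brace
-- that directly follows another brace contributes an empty string.
rawPieces :: [String]
rawPieces = parseItem (drop 10 sampleLine)
-- ["five-atomic-steps","","","5.4.2","","14",""]

-- The three fields that remain once the empty strings are filtered out.
fields :: [String]
fields = filter (not . null) rawPieces
-- ["five-atomic-steps","5.4.2","14"]
```

The filtered list is exactly what rawToLabelDB indexes into: label first, then section, then page.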

Pulling it all together, main takes two file names as command-line arguments: the database file and an .aux file. It reads the data from each of these files, combines it, and overwrites the database file with the combined data.

-- Note: \case needs {-# LANGUAGE LambdaCase #-} at the top of the file.
main :: IO ()
main = do
  args <- getArgs

  argsValid args >>= \case
    False -> putStrLn "Provide database file, then aux file."
    True -> do
      knownReferences <- readDB (args !! 0)
      newReferences <- readAuxFile (args !! 1)

      let combinedReferences = union newReferences knownReferences

      -- Careful here since Haskell's lazy IO can cause problems
      -- reading/writing to the same file. One way to deal with this
      -- would be to write to a temporary file, then copy to the final
      -- destination. That's safest since nothing is lost in a crash.
      -- Another way (done here) is to force the length of
      -- combinedReferences, which makes the program read the database
      -- file to the end before writeFile opens it.
      let totSize = length combinedReferences

      when (totSize > 0) $
        writeFile (args !! 0) (unlines $ map show combinedReferences)

      putStrLn $ "total number of labels: " ++ show totSize

-- Valid means exactly two arguments, both naming existing files.
argsValid :: [String] -> IO Bool
argsValid names
  | length names /= 2 = return False
  | otherwise = doFilesExist names

doFilesExist :: [String] -> IO Bool
doFilesExist names = allM doesFileExist names
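The comment in main mentions writing to a temporary file as the safer alternative. A minimal sketch of that idea follows, assuming a sibling file with a ".tmp" suffix is acceptable and using a rename where the comment says copy. The helper writeViaTemp is hypothetical and not part of the original program.

```haskell
import System.Directory (renameFile)

-- Hypothetical helper (not in the original program): write the new contents
-- to a sibling temporary file, then rename it over the destination.
writeViaTemp :: FilePath -> String -> IO ()
writeViaTemp dest contents = do
  let tmp = dest ++ ".tmp"   -- assumed naming scheme for the temporary file
  writeFile tmp contents     -- the destination is untouched while this runs
  renameFile tmp dest        -- replaces dest only after the write finished
```

In main this would stand in for the writeFile call, as in writeViaTemp (args !! 0) (unlines $ map show combinedReferences). Since the old database is only replaced after the new contents are fully written, a crash partway through should leave the original file in place. How atomic the rename is depends on the platform.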
