Haskell: Basic Parsec

July 29, 2020

An intermediate goal, which will be reached in few steps is to parse out the comments from Java source code so that they can be manipulated in some way. As a way of easing into the problem, begin by parsing a file into individual lines.

The main part of the program is straightforward, with a framwork similar to an earlier example. The program below consumes a text file, breaks it into lines, then prints each line. The outcome is no different than Linux cat, although it takes a major detour to get there.

{-# LANGUAGE LambdaCase #-}

import System.Environment
import System.Directory
import Text.ParserCombinators.Parsec

main = do
    args <- getArgs
    
    (argsValid args) >>= \case
        False -> putStrLn "Give me a single valid file name."
        True -> do
            contents <- readFile $ head args
            let result = parseJava contents
            case result of 
                Left err -> putStrLn (show(err))
                Right valid -> mapM_ putStr valid

argsValid :: [String] -> IO Bool
argsValid names = do
    if (null names) || (length names /= 1)
      then return False
      else doesFileExist $ head $ names

The question is now how to implement parseJava. The parser below uses Parsec, and the best sources I've found concerning Parsec are Chapter 16 of Real World Haskell, the paper by the original authors of Parsec, and a longer tutorial.

The parser below is equivalent to Haskell's lines function. It splits the input text on the \n character.

parseJava :: String -> Either ParseError [String]
parseJava input = parse many_lines "error_preface" input

many_lines :: Parser [String]
many_lines = many either_line

either_line :: Parser String
either_line = try (full_line) <|> partial_line
               
partial_line :: Parser String
partial_line = many1 (noneOf "\n")

full_line :: Parser String
full_line = do
        x <- partial_line
        y <- string ("\n")
        return (x ++ y)

The entry-point to Parsec is the call to parse, where "error_preface" is a message provided with any parse errors. For this program, a parse error should be impossible since any run of text can always be split on newlines. Most of the program should be clear, after looking up the meaning of many, many1, noneOf and string, but there are some less obvious points.

The type specification for parseJava ends with [String], which must match the type produced by the many_lines function. many_lines uses Parsec's many function, which produces a list of the results from either_line. many and many1 produce a list of items returned by the function they call.

The try function allows for lookahead (or backing up, depending on your point of view). Above, in either_line, the parser tries to pull out a full_line (which must end with a \n); if it can't find a newline, then the parse fails, the "mark" is restored to its initial position, and the parser tries to generate a partial_line. The <|> looks like "or" and that's exactly what it means. It would be natural to assume that <|> implictly assumes a try on each alternative, but that's not how Parsec operates; you need to explicitly try each option. Without the try, when a particular alternative fails, the consumed characters are not "put back" for the next alternative to consume.

When developing a parser, a common error seen when running the parser is something along the lines of

combinator 'many' is applied to a parser that accepts an empty string.

Above, if the many1 in partial_line is changed to many, then you'll get this error when you try to parse a file. The problem is that many accepts zero or more characters, which means that partial_line could validly return infinitely many empty results.

Finally, the implemenation above is inefficient. If full_line fails, then the parser backs all the way to the point where the line started, then starts over to produce a partial_line. This is a minor inefficiency since it can only happen on the last line, but it can be eliminated, making the parser shorter too, although the result is more difficult to understand:

parseJava :: String -> Either ParseError [String]
parseJava input = parse many_lines "error_preface" input

many_lines :: Parser [String]
many_lines = many either_line

either_line :: Parser String
either_line = do
    x <- many1 (noneOf "\n")
    y <- (try $ string "\n") <|> (string "")
    return $ x ++ y

Prev

Contact

Next