July 29, 2020
An intermediate goal, which will be reached in few steps is to parse out the comments from Java source code so that they can be manipulated in some way. As a way of easing into the problem, begin by parsing a file into individual lines.
The main
part of the program is straightforward, with a framwork similar to an
earlier example. The program below consumes a text file,
breaks it into lines, then prints each line. The outcome is no different than Linux cat
,
although it takes a major detour to get there.
{-# LANGUAGE LambdaCase #-} import System.Environment import System.Directory import Text.ParserCombinators.Parsec main = do args <- getArgs (argsValid args) >>= \case False -> putStrLn "Give me a single valid file name." True -> do contents <- readFile $ head args let result = parseJava contents case result of Left err -> putStrLn (show(err)) Right valid -> mapM_ putStr valid argsValid :: [String] -> IO Bool argsValid names = do if (null names) || (length names /= 1) then return False else doesFileExist $ head $ names
The question is now how to implement parseJava
. The parser below uses Parsec,
and the best sources I've found concerning Parsec are Chapter 16 of
Real World Haskell, the
paper by the original authors of Parsec, and
a longer tutorial.
The parser below is equivalent to Haskell's lines
function. It splits
the input text on the \n
character.
parseJava :: String -> Either ParseError [String] parseJava input = parse many_lines "error_preface" input many_lines :: Parser [String] many_lines = many either_line either_line :: Parser String either_line = try (full_line) <|> partial_line partial_line :: Parser String partial_line = many1 (noneOf "\n") full_line :: Parser String full_line = do x <- partial_line y <- string ("\n") return (x ++ y)
The entry-point to Parsec is the call to parse
, where
"error_preface"
is a message provided with any parse errors. For this
program, a parse error should be impossible since any run of text can always
be split on newlines. Most of the program should be clear, after looking up the meaning of
many
, many1
, noneOf
and string
,
but there are some less obvious points.
The type specification for parseJava
ends with
[String]
, which must match the type produced by the
many_lines
function. many_lines
uses Parsec's many
function, which produces a list of the results from either_line
.
many
and many1
produce a list of items returned by the function
they call.
The try
function allows for lookahead (or backing up, depending on your
point of view). Above, in either_line
, the parser tries to pull out a
full_line
(which must end with a \n
); if it can't find a newline,
then the parse fails, the "mark" is restored to its initial position, and the parser tries
to generate a partial_line
. The <|>
looks like "or" and that's
exactly what it means. It would be natural to assume that <|>
implictly assumes
a try
on each alternative, but that's not how Parsec operates; you need to
explicitly try
each option. Without the try
, when a particular
alternative fails, the consumed characters are not "put back" for the next alternative
to consume.
When developing a parser, a common error seen when running the parser is something along the lines of
combinator 'many' is applied to a parser that accepts an empty string.
Above, if the many1
in partial_line
is changed to many
,
then you'll get this error when you try to parse a file. The problem is that many
accepts zero or more characters, which means that partial_line
could
validly return infinitely many empty results.
Finally, the implemenation above is inefficient. If full_line
fails, then the
parser backs all the way to the point where the line started, then starts over to produce
a partial_line
. This is a minor inefficiency since it can only happen on the
last line, but it can be eliminated, making the parser shorter too,
although the result is more difficult to understand:
parseJava :: String -> Either ParseError [String] parseJava input = parse many_lines "error_preface" input many_lines :: Parser [String] many_lines = many either_line either_line :: Parser String either_line = do x <- many1 (noneOf "\n") y <- (try $ string "\n") <|> (string "") return $ x ++ y