July 29, 2020
An intermediate goal, which will be reached in few steps is to parse out the comments from Java source code so that they can be manipulated in some way. As a way of easing into the problem, begin by parsing a file into individual lines.
The main part of the program is straightforward, with a framwork similar to an
earlier example. The program below consumes a text file,
breaks it into lines, then prints each line. The outcome is no different than Linux cat,
although it takes a major detour to get there.
{-# LANGUAGE LambdaCase #-}
import System.Environment
import System.Directory
import Text.ParserCombinators.Parsec
main = do
args <- getArgs
(argsValid args) >>= \case
False -> putStrLn "Give me a single valid file name."
True -> do
contents <- readFile $ head args
let result = parseJava contents
case result of
Left err -> putStrLn (show(err))
Right valid -> mapM_ putStr valid
argsValid :: [String] -> IO Bool
argsValid names = do
if (null names) || (length names /= 1)
then return False
else doesFileExist $ head $ names
The question is now how to implement parseJava. The parser below uses Parsec,
and the best sources I've found concerning Parsec are Chapter 16 of
Real World Haskell, the
paper by the original authors of Parsec, and
a longer tutorial.
The parser below is equivalent to Haskell's lines function. It splits
the input text on the \n character.
parseJava :: String -> Either ParseError [String]
parseJava input = parse many_lines "error_preface" input
many_lines :: Parser [String]
many_lines = many either_line
either_line :: Parser String
either_line = try (full_line) <|> partial_line
partial_line :: Parser String
partial_line = many1 (noneOf "\n")
full_line :: Parser String
full_line = do
x <- partial_line
y <- string ("\n")
return (x ++ y)
The entry-point to Parsec is the call to parse, where
"error_preface" is a message provided with any parse errors. For this
program, a parse error should be impossible since any run of text can always
be split on newlines. Most of the program should be clear, after looking up the meaning of
many, many1, noneOf and string,
but there are some less obvious points.
The type specification for parseJava ends with
[String], which must match the type produced by the
many_lines function. many_lines uses Parsec's many
function, which produces a list of the results from either_line.
many and many1 produce a list of items returned by the function
they call.
The try function allows for lookahead (or backing up, depending on your
point of view). Above, in either_line, the parser tries to pull out a
full_line (which must end with a \n); if it can't find a newline,
then the parse fails, the "mark" is restored to its initial position, and the parser tries
to generate a partial_line. The <|> looks like "or" and that's
exactly what it means. It would be natural to assume that <|> implictly assumes
a try on each alternative, but that's not how Parsec operates; you need to
explicitly try each option. Without the try, when a particular
alternative fails, the consumed characters are not "put back" for the next alternative
to consume.
When developing a parser, a common error seen when running the parser is something along the lines of
combinator 'many' is applied to a parser that accepts an empty string.
Above, if the many1 in partial_line is changed to many,
then you'll get this error when you try to parse a file. The problem is that many
accepts zero or more characters, which means that partial_line could
validly return infinitely many empty results.
Finally, the implemenation above is inefficient. If full_line fails, then the
parser backs all the way to the point where the line started, then starts over to produce
a partial_line. This is a minor inefficiency since it can only happen on the
last line, but it can be eliminated, making the parser shorter too,
although the result is more difficult to understand:
parseJava :: String -> Either ParseError [String]
parseJava input = parse many_lines "error_preface" input
many_lines :: Parser [String]
many_lines = many either_line
either_line :: Parser String
either_line = do
x <- many1 (noneOf "\n")
y <- (try $ string "\n") <|> (string "")
return $ x ++ y