July 30, 2020
An earlier post created a parser that is about as simple as possible: it converts a text file to a [String], with one String for each line. In this step of the series, the parser will distinguish between comments and code in a Java source file. The same parser will work with other languages too, perhaps with some adjustments, depending on how the other language expresses comments and literal strings.
The program starts off exactly as before:
{-# LANGUAGE LambdaCase #-}

import System.Environment
import System.Directory
import Text.ParserCombinators.Parsec
import Control.Monad

main = do
  args <- getArgs
  (argsValid args) >>= \case
    False -> putStrLn "Give me a single valid file name."
    True -> do
      contents <- readFile $ head args
      let result = parseJava contents
      case result of
        Left err -> putStrLn (show(err))
        Right valid -> mapM_ putStr valid

argsValid :: [String] -> IO Bool
argsValid names = do
  if (null names) || (length names /= 1)
    then return False
    else doesFileExist $ head $ names
but a different parseJava function is needed.
Java allows two types of comments. Multi-line comments are enclosed in /* and */, while single-line comments run from // to the end of the line. How about the following?
parseJava :: String -> Either ParseError [String]
parseJava input = parse parseJavaInput "" input
parseJavaInput :: Parser [String]
parseJavaInput = manyTill javaPart eof
javaPart :: Parser String
javaPart = parseSLComment <|> parseMLComment <|> javaCode
javaCode :: Parser String
javaCode = manyTill anyChar $ (lookAhead $ try endCode)
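-- Each alternative needs its own try: string consumes input on a partial
-- match, so without it "/*" would never be tried after "//" fails on '/'.
endCode :: Parser String
endCode = try (string "//") <|> try (string "/*") <|> myEOF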
myEOF :: Parser String
myEOF = do
try eof
return ""
parseSLComment :: Parser String
parseSLComment = do
try $ string "//"
guts <- manyTill anyChar (try $ string "\n")
return ("//" ++ guts ++ "\n")
parseMLComment :: Parser String
parseMLComment = do
try $ string "/*"
guts <- manyTill anyChar (try $ string "*/")
return ("/*" ++ guts ++ "*/")
The parser above generates one javaPart at a time, where each part is either a comment or arbitrary javaCode. For the two comment-generating parsers, it's important to realize that manyTill consumes and discards the "till" part. Neither the opening characters of the comment (matched by string, whose result is ignored) nor the closing characters are kept by the parser, so they must be re-inserted by return. Note also that manyTill does not treat EOF as satisfying the "till" condition: if the input ends before the terminator appears, the parser fails. A // comment on the very last line must therefore end with a newline, and a /* comment must be closed.
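To see what manyTill keeps and what it throws away, here is a minimal sketch that can be loaded into GHCi. The names untilClose, bodyAndRest and demo are made up for this illustration and are not part of the program above.

```haskell
import Text.ParserCombinators.Parsec

-- Everything up to the closing "*/". The terminator is consumed,
-- but it is not part of the result.
untilClose :: Parser String
untilClose = manyTill anyChar (try $ string "*/")

-- Run untilClose, then report whatever input is left over.
bodyAndRest :: Parser (String, String)
bodyAndRest = do
  body <- untilClose
  rest <- getInput
  return (body, rest)

-- Hypothetical helper for trying inputs by hand.
demo :: String -> Either ParseError (String, String)
demo = parse bodyAndRest ""
```

Here demo " a comment */int x;" gives Right (" a comment ","int x;"): the */ is gone from both halves. demo "never closed" gives a Left, because the input runs out before the terminator is found.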
javaCode must be the last term in javaPart because it consumes every character until it hits either a comment or EOF. When it does hit this "till" condition, lookAhead rewinds to the opening characters of the comment; this allows the comment parsers to determine which type of comment is being dealt with. If javaCode were not last, then the parser would enter an infinite loop when it reaches a comment, because javaCode would keep succeeding there without consuming any input.
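The effect of lookAhead can be checked in isolation. The sketch below uses invented names (commentOpener, codeThenRest) and wraps each opener in try so that a partial match, such as a lone / used for division, does not consume input.

```haskell
import Text.ParserCombinators.Parsec

-- Opening characters of either comment type.
commentOpener :: Parser String
commentOpener = try (string "//") <|> try (string "/*")

-- Code up to the next comment opener. lookAhead checks for the opener
-- without consuming it, so the opener is still in the remaining input.
codeThenRest :: Parser (String, String)
codeThenRest = do
  code <- manyTill anyChar (lookAhead commentOpener)
  rest <- getInput
  return (code, rest)
```

In GHCi, parse codeThenRest "" "int x; // note\n" gives Right ("int x; ","// note\n"). The // is still at the front of the leftover input, ready for a comment parser to pick up.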
The last bit of cleverness is myEOF. It would be natural to define
endCode = string "//" <|> string "/*" <|> eof
but Parsec's built-in eof returns () rather than a String, and every term of endCode must share a common type.
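As an aside, the same type adjustment can be written in one line with the Functor operator <$, which replaces a parser's result with a fixed value. This is only an alternative spelling, not the definition used in this post; eofAsString and stopMarker are names invented for the sketch.

```haskell
import Text.ParserCombinators.Parsec

-- eof returns (), so give it a String result before combining it
-- with the string parsers.
eofAsString :: Parser String
eofAsString = "" <$ eof

-- All three alternatives now have type Parser String.
stopMarker :: Parser String
stopMarker = try (string "//") <|> try (string "/*") <|> eofAsString
```

On empty input, parse stopMarker "" "" gives Right "".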
Most Java files are properly parsed by the above, but not all. Something like
morlock = "// Foiled again!";
is incorrectly parsed. The parser above treats morlock = " as ordinary code, but when it reaches //, it goes off the rails and takes the rest of the line as a comment, because the parser isn't aware of the surrounding string context.
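The failure is easy to reproduce. Below is a condensed copy of the first-version parser, folded into a single function so it can be pasted into a file and loaded in GHCi. The name firstVersion and the local names are made up for this sketch, and each comment opener is wrapped in try so that a partial match does not consume input.

```haskell
import Text.ParserCombinators.Parsec

-- Condensed first-version parser: comments and code, no string awareness.
firstVersion :: String -> Either ParseError [String]
firstVersion = parse (manyTill part eof) ""
  where
    part, code, stop, slComment, mlComment :: Parser String
    part = slComment <|> mlComment <|> code
    -- Code runs until a comment opener or the end of input.
    code = manyTill anyChar (lookAhead stop)
    stop = try (string "//") <|> try (string "/*") <|> ("" <$ eof)
    slComment = do
      _    <- try (string "//")
      guts <- manyTill anyChar (char '\n')
      return ("//" ++ guts ++ "\n")
    mlComment = do
      _    <- try (string "/*")
      guts <- manyTill anyChar (try (string "*/"))
      return ("/*" ++ guts ++ "*/")
```

With the line above plus a trailing newline as input, firstVersion "morlock = \"// Foiled again!\";\n" returns Right ["morlock = \"","// Foiled again!\";\n"]. The string literal has been split in two, and its second half is reported as a comment.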
If the parser needs to be on the lookout for opening double-quotes, then it also needs to look for single-quotes. Otherwise, something like
drat = '"';
double_drat = '\"';
triple_drat = '\'';
could gum up the works.
To handle these possibilities, redefine the javaCode parser function, eliminate endCode and myEOF, and add some new functions for string and character parsing:
javaCode :: Parser String
javaCode = fmap concat $ many1 javaBite

javaBite :: Parser String
javaBite = stringCode <|> quoteCode <|> nonStringCode

stringCode :: Parser String
stringCode = do
  try $ char '"'
  x <- manyTill stringChar (string "\"")
  return ("\"" ++ concat x ++ "\"")

stringChar :: Parser String
stringChar = stringNonEscape <|> stringEscape

stringNonEscape :: Parser String
stringNonEscape = do
  x <- noneOf "\\\""
  return [x]

stringEscape :: Parser String
stringEscape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf'u"
  return [d,c]

quoteCode :: Parser String
quoteCode = do
  try $ char '\''
  x <- manyTill quoteChar (string "'")
  return ("'" ++ concat x ++ "'")

quoteChar :: Parser String
quoteChar = quoteNonEscape <|> quoteEscape

quoteNonEscape :: Parser String
quoteNonEscape = do
  x <- noneOf "\\'"
  return [x]

quoteEscape :: Parser String
quoteEscape = do
  d <- char '\\'
  c <- oneOf "\\\"0nrvtbf'u"
  return [d,c]

nonStringCode :: Parser String
nonStringCode = (many1 $ noneOf "/'\"") <|> try goodSlash

goodSlash :: Parser String
goodSlash = do
  x <- char '/'
  y <- noneOf "/*"
  return [x,y]
The parser above does allow a few things to fall through the cracks. Java translates Unicode escapes as the first step of compilation. This won't often matter, but there are some strange corner cases. For instance,
// \u000d System.out.println("weird");
does not comment out the println(). \u000d is a carriage return, which Java treats as a line terminator, so the compiler sees the above as
//
System.out.println("weird");
while parseJava considers everything after the // to be a comment.
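One possible way to close this gap, not something done in this post, is to translate Unicode escapes in a pass over the raw text before parsing. The sketch below is a rough approximation: translateEscapes is an invented name, and the backslash-pairing rule is a simplified reading of how Java decides whether a \u is a real escape.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)

-- Replace \uXXXX (one or more u's, four hex digits) with the character
-- it names. A doubled backslash is copied through untouched, so the u
-- after it is not treated as the start of an escape.
translateEscapes :: String -> String
translateEscapes ('\\' : '\\' : rest) = '\\' : '\\' : translateEscapes rest
translateEscapes ('\\' : 'u' : rest)
  | length digits == 4 && all isHexDigit digits =
      chr (fst (head (readHex digits))) : translateEscapes (drop 4 afterUs)
  where
    afterUs = dropWhile (== 'u') rest
    digits  = take 4 afterUs
-- Anything else, including a malformed escape, is copied as-is.
translateEscapes (c : rest) = c : translateEscapes rest
translateEscapes [] = []
```

Applied to the example line, the \u000d becomes a literal carriage return. That alone would not be enough: the single-line comment parser stops only at "\n", so it would also need to accept a carriage return as the end of a comment.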