July 30, 2020
In an earlier post, a parser was created that is
about the simplest possible. It converts a text file to a [String],
with one String for each line. In this step of the series,
the parser will distinguish between comments and code in a Java source file. The same
parser will work with other languages too, perhaps with some adjustments,
depending on how the other language expresses comments and literal strings.
The program starts off exactly as before:
{-# LANGUAGE LambdaCase #-}
import System.Environment
import System.Directory
import Text.ParserCombinators.Parsec
import Control.Monad
main = do
args <- getArgs
(argsValid args) >>= \case
False -> putStrLn "Give me a single valid file name."
True -> do
contents <- readFile $ head args
let result = parseJava contents
case result of
Left err -> putStrLn (show(err))
Right valid -> mapM_ putStr valid
argsValid :: [String] -> IO Bool
argsValid names = do
if (null names) || (length names /= 1)
then return False
else doesFileExist $ head $ names
but a different parseJava function is needed.
Java allows two types of comments. Multi-line comments are enclosed in /*
and */, while single-line comments run from // to the end of the
line. How about the following?
parseJava :: String -> Either ParseError [String]
parseJava input = parse parseJavaInput "" input
parseJavaInput :: Parser [String]
parseJavaInput = manyTill javaPart eof
javaPart :: Parser String
javaPart = parseSLComment <|> parseMLComment <|> javaCode
javaCode :: Parser String
javaCode = manyTill anyChar $ (lookAhead $ try endCode)
endCode :: Parser String
endCode = string "//" <|> string "/*" <|> myEOF
myEOF :: Parser String
myEOF = do
try eof
return ""
parseSLComment :: Parser String
parseSLComment = do
try $ string "//"
guts <- manyTill anyChar (try $ string "\n")
return ("//" ++ guts ++ "\n")
parseMLComment :: Parser String
parseMLComment = do
try $ string "/*"
guts <- manyTill anyChar (try $ string "*/")
return ("/*" ++ guts ++ "*/")
The parser above will generate one javaPart at a time, where each of these
parts is either a comment or arbitrary javaCode. For the two
comment-generating parsers, it's important to realize that manyTill consumes
and discards the "till" part. Neither the opening characters for the comment,
nor the closing character(s) are kept by the parser, so they must be re-inserted
by return. Also, it is implicit in manyTill that EOF satisfies
the "till" condition.
javaCode must be the last term in javaPart because it consumes
every character until it hits either a comment or EOF. When it does hit this "till" condition,
lookAhead rewinds back to the opening characters of the comment; this allows the
comment parsers to determine which type of comment is being dealt with. If javaCode
were not last, then the parser would enter an infinite loop when it reaches a comment.
The last bit of cleverness is myEOF. It would be natural to define
endCode = string "//" <|> string "/*" <|> eof
but Parsec's built-in eof does not return a String, and
every term of endCode must share a common type.
Most Java files are properly parsed by the above, but not all. Something like
morlock = "// Foiled again!";
is incorrectly parsed. The parser above treats morlock = " as ordinary code,
but when it reaches //, it goes off the rails because the parser isn't aware
of the surrounding string context.
If the parser needs to be on the lookout for opening double-quotes, then it also needs to look for single-quotes. Otherwise, something like
drat = '"';
double_drat = '\"';
triple_drat = '\'';
could gum up the works.
To handle these possibilities, replace the code that appears in red above. That is, redefine the
javaCode parser function, eliminate endCode, myEOF, and
add some new functions for string and character parsing (or here is the final version):
javaCode :: Parser String
javaCode = fmap concat $ many1 javaBite
javaBite :: Parser String
javaBite = stringCode <|> quoteCode <|> nonStringCode
stringCode :: Parser String
stringCode = do
try $ char '"'
x <- manyTill stringChar (string "\"")
return ("\"" ++ concat x ++ "\"")
stringChar :: Parser String
stringChar = stringNonEscape <|> stringEscape
stringNonEscape :: Parser String
stringNonEscape = do
x <- noneOf "\\\""
return [x]
stringEscape :: Parser String
stringEscape = do
d <- char '\\'
c <- oneOf "\\\"0nrvtbf'u"
return [d,c]
quoteCode :: Parser String
quoteCode = do
try $ char '\''
x <- manyTill quoteChar (string "'")
return ("'" ++ concat x ++ "'")
quoteChar :: Parser String
quoteChar = quoteNonEscape <|> quoteEscape
quoteNonEscape :: Parser String
quoteNonEscape = do
x <- noneOf "\\'"
return [x]
quoteEscape :: Parser String
quoteEscape = do
d <- char '\\'
c <- oneOf "\\\"0nrvtbf'u"
return [d,c]
nonStringCode :: Parser String
nonStringCode = (many1 $ noneOf "/'\"") <|> try goodSlash
goodSlash :: Parser String
goodSlash = do
x <- char '/'
y <- noneOf "/*"
return [x,y]
The parser above does allow a few things to fall through the cracks. Java processes unicode characters as the first step of compilation. This won't often matter, but there are some strange corner-cases. For instance,
// \u000d System.out.println("weird");
does not comment out the println(). \u000d is newline, so the compiler
sees the above as
//
System.out.println("weird");
while parseJava considers everything after the // to be a comment.