Parsing out Comments with Parsec

July 30, 2020

In an earlier post, a parser was created that is about the simplest possible. It converts a text file to a [String], with one String for each line. In this step of the series, the parser will distinguish between comments and code in a Java source file. The same parser will work with other languages too, perhaps with some adjustments, depending on how the other language expresses comments and literal strings.

The program starts off exactly as before:

{-# LANGUAGE LambdaCase #-}

import System.Environment
import System.Directory
import Text.ParserCombinators.Parsec
import Control.Monad

main = do
    args <- getArgs
    
    (argsValid args) >>= \case
        False -> putStrLn "Give me a single valid file name."
        True -> do
            contents <- readFile $ head args
            let result = parseJava contents
            case result of 
                Left err -> putStrLn (show(err))
                Right valid -> mapM_ putStr valid

argsValid :: [String] -> IO Bool
argsValid names = do
    if (null names) || (length names /= 1)
      then return False
      else doesFileExist $ head $ names

but a different parseJava function is needed.

First Try

Java allows two types of comments. Multi-line comments are enclosed in /* and */, while single-line comments run from // to the end of the line. How about the following?

parseJava :: String -> Either ParseError [String]
parseJava input = parse parseJavaInput "" input

parseJavaInput :: Parser [String]
parseJavaInput = manyTill javaPart eof

javaPart :: Parser String
javaPart = parseSLComment <|> parseMLComment <|> javaCode

javaCode :: Parser String
javaCode = manyTill anyChar $ (lookAhead $ try endCode)

endCode :: Parser String
endCode = string "//" <|> string "/*" <|> myEOF

myEOF :: Parser String
myEOF = do
    try eof
    return ""

parseSLComment :: Parser String
parseSLComment = do
    try $ string "//"
    guts <- manyTill anyChar (try $ string "\n")
    return ("//" ++ guts ++ "\n")

parseMLComment :: Parser String
parseMLComment = do
    try $ string "/*"
    guts <- manyTill anyChar (try $ string "*/")
    return ("/*" ++ guts ++ "*/")

The parser above will generate one javaPart at a time, where each of these parts is either a comment or arbitrary javaCode. For the two comment-generating parsers, it's important to realize that manyTill consumes and discards the "till" part. Neither the opening characters for the comment, nor the closing character(s) are kept by the parser, so they must be re-inserted by return. Also, it is implicit in manyTill that EOF satisfies the "till" condition.

javaCode must be the last term in javaPart because it consumes every character until it hits either a comment or EOF. When it does hit this "till" condition, lookAhead rewinds back to the opening characters of the comment; this allows the comment parsers to determine which type of comment is being dealt with. If javaCode were not last, then the parser would enter an infinite loop when it reaches a comment.

The last bit of cleverness is myEOF. It would be natural to define


  endCode = string "//" <|> string "/*" <|> eof

but Parsec's built-in eof does not return a String, and every term of endCode must share a common type.

Second Try

Most Java files are properly parsed by the above, but not all. Something like


  morlock = "// Foiled again!";

is incorrectly parsed. The parser above treats morlock = " as ordinary code, but when it reaches //, it goes off the rails because the parser isn't aware of the surrounding string context.

If the parser needs to be on the lookout for opening double-quotes, then it also needs to look for single-quotes. Otherwise, something like


  drat = '"';

  double_drat = '\"';

  triple_drat = '\'';

could gum up the works.

To handle these possibilities, replace the code that appears in red above. That is, redefine the javaCode parser function, eliminate endCode, myEOF, and add some new functions for string and character parsing (or here is the final version):

javaCode :: Parser String
javaCode = fmap concat $ many1 javaBite

javaBite :: Parser String
javaBite = stringCode <|> quoteCode <|> nonStringCode

stringCode :: Parser String
stringCode = do
    try $ char '"'
    x <- manyTill stringChar (string "\"")
    return ("\"" ++ concat x ++ "\"")

stringChar :: Parser String
stringChar = stringNonEscape <|> stringEscape

stringNonEscape :: Parser String
stringNonEscape = do
    x <- noneOf "\\\""
    return [x]

stringEscape :: Parser String
stringEscape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf'u"
    return [d,c]

quoteCode :: Parser String
quoteCode = do
    try $ char '\''
    x <- manyTill quoteChar (string "'")
    return ("'" ++ concat x ++ "'")

quoteChar :: Parser String
quoteChar = quoteNonEscape <|> quoteEscape

quoteNonEscape :: Parser String
quoteNonEscape = do
    x <- noneOf "\\'"
    return [x]

quoteEscape :: Parser String
quoteEscape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf'u"
    return [d,c]

nonStringCode :: Parser String
nonStringCode = (many1 $ noneOf "/'\"") <|> try goodSlash

goodSlash :: Parser String
goodSlash = do
    x <- char '/'
    y <- noneOf "/*"
    return [x,y]

The parser above does allow a few things to fall through the cracks. Java processes unicode characters as the first step of compilation. This won't often matter, but there are some strange corner-cases. For instance,


  // \u000d System.out.println("weird");

does not comment out the println(). \u000d is newline, so the compiler sees the above as


  // 

  System.out.println("weird");

while parseJava considers everything after the // to be a comment.

Contact