July 30, 2020
  In an earlier post, a parser was created that is 
  about the simplest possible. It converts a text file to a [String],
  with one String for each line. In this step of the series,
  the parser will distinguish between comments and code in a Java source file. The same
  parser will work with other languages too, perhaps with some adjustments, 
  depending on how the other language expresses comments and literal strings.
The program starts off exactly as before:
{-# LANGUAGE LambdaCase #-}
import System.Environment
import System.Directory
import Text.ParserCombinators.Parsec
import Control.Monad
main = do
    args <- getArgs
    
    (argsValid args) >>= \case
        False -> putStrLn "Give me a single valid file name."
        True -> do
            contents <- readFile $ head args
            let result = parseJava contents
            case result of 
                Left err -> putStrLn (show(err))
                Right valid -> mapM_ putStr valid
argsValid :: [String] -> IO Bool
argsValid names = do
    if (null names) || (length names /= 1)
      then return False
      else doesFileExist $ head $ names
  but a different parseJava function is needed.
  Java allows two types of comments. Multi-line comments are enclosed in /*
  and */, while single-line comments run from // to the end of the
  line. How about the following?
parseJava :: String -> Either ParseError [String]
parseJava input = parse parseJavaInput "" input
parseJavaInput :: Parser [String]
parseJavaInput = manyTill javaPart eof
javaPart :: Parser String
javaPart = parseSLComment <|> parseMLComment <|> javaCode
javaCode :: Parser String
javaCode = manyTill anyChar $ (lookAhead $ try endCode)
endCode :: Parser String
endCode = string "//" <|> string "/*" <|> myEOF
myEOF :: Parser String
myEOF = do
    try eof
    return ""
parseSLComment :: Parser String
parseSLComment = do
    try $ string "//"
    guts <- manyTill anyChar (try $ string "\n")
    return ("//" ++ guts ++ "\n")
parseMLComment :: Parser String
parseMLComment = do
    try $ string "/*"
    guts <- manyTill anyChar (try $ string "*/")
    return ("/*" ++ guts ++ "*/")
  The parser above will generate one javaPart at a time, where each of these
  parts is either a comment or arbitrary javaCode. For the two 
  comment-generating parsers, it's important to realize that manyTill consumes
  and discards the "till" part. Neither the opening characters for the comment,
  nor the closing character(s) are kept by the parser, so they must be re-inserted 
  by return. Also, it is implicit in manyTill that EOF satisfies 
  the "till" condition.
  javaCode must be the last term  in javaPart because it consumes 
  every character until it hits either a comment or EOF. When it does hit this "till" condition, 
  lookAhead rewinds back to the opening characters of the comment; this allows the
  comment parsers to determine which type of comment is being dealt with. If javaCode
  were not last, then the parser would enter an infinite loop when it reaches a comment.
  The last bit of cleverness is myEOF. It would be natural to define
  endCode = string "//" <|> string "/*" <|> eof
  but Parsec's built-in eof does not return a String, and 
  every term of endCode must share a common type.
Most Java files are properly parsed by the above, but not all. Something like
  morlock = "// Foiled again!";
  is incorrectly parsed. The parser above treats morlock = " as ordinary code,
  but when it reaches //, it goes off the rails because the parser isn't aware 
  of the surrounding string context.
If the parser needs to be on the lookout for opening double-quotes, then it also needs to look for single-quotes. Otherwise, something like
  drat = '"';
  double_drat = '\"';
  triple_drat = '\'';
could gum up the works.
  To handle these possibilities, replace the code that appears in red above. That is, redefine the
  javaCode parser function, eliminate endCode, myEOF, and 
  add some new functions for string and character parsing (or here is the final version):
javaCode :: Parser String
javaCode = fmap concat $ many1 javaBite
javaBite :: Parser String
javaBite = stringCode <|> quoteCode <|> nonStringCode
stringCode :: Parser String
stringCode = do
    try $ char '"'
    x <- manyTill stringChar (string "\"")
    return ("\"" ++ concat x ++ "\"")
stringChar :: Parser String
stringChar = stringNonEscape <|> stringEscape
stringNonEscape :: Parser String
stringNonEscape = do
    x <- noneOf "\\\""
    return [x]
stringEscape :: Parser String
stringEscape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf'u"
    return [d,c]
quoteCode :: Parser String
quoteCode = do
    try $ char '\''
    x <- manyTill quoteChar (string "'")
    return ("'" ++ concat x ++ "'")
quoteChar :: Parser String
quoteChar = quoteNonEscape <|> quoteEscape
quoteNonEscape :: Parser String
quoteNonEscape = do
    x <- noneOf "\\'"
    return [x]
quoteEscape :: Parser String
quoteEscape = do
    d <- char '\\'
    c <- oneOf "\\\"0nrvtbf'u"
    return [d,c]
nonStringCode :: Parser String
nonStringCode = (many1 $ noneOf "/'\"") <|> try goodSlash
goodSlash :: Parser String
goodSlash = do
    x <- char '/'
    y <- noneOf "/*"
    return [x,y]
  
The parser above does allow a few things to fall through the cracks. Java processes unicode characters as the first step of compilation. This won't often matter, but there are some strange corner-cases. For instance,
  // \u000d System.out.println("weird");
  does not comment out the println(). \u000d is newline, so the compiler
  sees the above as
  // 
  System.out.println("weird");
  while parseJava considers everything after the // to be a comment.