ESC/Java2
© 2003,2004,2005 David Cok and Joseph Kiniry
© 2005 UCD Dublin
© 2003,2004 Radboud University Nijmegen
© 1999,2000 Compaq Computer Corporation
© 1997,1998,1999 Digital Equipment Corporation
All Rights Reserved

javafe.parser
Class Lex

java.lang.Object
  extended by javafe.parser.Token
      extended by javafe.parser.Lex
Direct Known Subclasses:
EscPragmaLex

public class Lex
extends Token

A Lex object generates a sequence of Java "input elements" (that is, tokens) by converting the sequence of input characters and line terminators generated by an underlying CorrelatedReader.

The conversion of input characters occurs according to the lexical rules in Chapter Three of the Java Language Specification. This specification describes three lexical translation steps: the first two steps translate a raw Unicode input stream into a "cooked" one in which Unicode escape characters from the raw stream have been processed and in which line terminators have been identified; the last step translates this cooked stream into a sequence of Java "input elements" (comments, white space, identifiers, tokens, literals, and punctuation). Lex objects perform the last of these translation steps; the first two are performed by an underlying CorrelatedReader that is given to the Lex object as the source of input characters.

Before a newly-created Lex object can be used, its restart method must be called giving a CorrelatedReader to scan. At any point, a Lex can be restarted on a different underlying reader.

The Lex class is thread safe, but instances of Lex are not. That is, two different threads may safely access two different instances of Lex concurrently, but they may not access the same instance of Lex concurrently.

Simple scanning

The getNextToken method of Lex objects returns the translated token sequence, one token at a time. It discards white space, and it processes comments as described below. In addition, getNextToken fills in the Token fields of this (Lex is a subclass of Token); for example, ttype gets an integer code defining the type of token returned and startingLoc gives the location of the first character making up the token. If the token is an identifier, identifierVal indicates which one; if the token is a literal, auxVal gives its value.
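The pattern described above, a scanner that is itself a token and overwrites its own fields on each call, can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: MiniToken, MiniLex, the numeric token codes, and the use of a String in place of a CorrelatedReader are all hypothetical and are not part of javafe.

```java
// Hypothetical, simplified model; not the javafe implementation.
class MiniToken {
    static final int EOF = 0, IDENT = 1, INTLIT = 2; // made-up token codes
    int ttype;            // code of the current token
    int startingLoc;      // offset of the token's first character
    String identifierVal; // set only when ttype == IDENT
    Object auxVal;        // set only when the token is a literal
}

class MiniLex extends MiniToken {
    private String input = "";
    private int pos;

    /** Installs new input and scans the first token. */
    int restart(String newInput) {
        input = newInput;
        pos = 0;
        return getNextToken();
    }

    /** Scans the next token into the fields of this and returns its code. */
    int getNextToken() {
        // White space is discarded, never returned as a token.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        identifierVal = null;
        auxVal = null;
        startingLoc = pos;
        if (pos >= input.length()) return ttype = EOF;
        int start = pos;
        char c = input.charAt(pos);
        if (Character.isJavaIdentifierStart(c)) {
            while (pos < input.length() && Character.isJavaIdentifierPart(input.charAt(pos))) pos++;
            identifierVal = input.substring(start, pos);
            return ttype = IDENT;
        }
        if (Character.isDigit(c)) {
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
            auxVal = Integer.valueOf(input.substring(start, pos));
            return ttype = INTLIT;
        }
        throw new IllegalStateException("unexpected character at offset " + pos);
    }
}
```

A client of such a scanner would call restart once and then loop on getNextToken until the end-of-file code is returned, reading ttype, identifierVal, and auxVal from the scanner object itself after each call.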

Lex objects report errors by calling both the fatal and error methods of ErrorSet. The fatal errors are unexpected characters, unterminated comments and string and character literals, and IO errors in the underlying input stream. Recoverable errors are overflows in literals (including overflows in octal escape sequences), non-octal digits in integer literals, the string 0x (interpreted as a malformed integer literal), missing digits in a floating-point exponent, bad escape sequences in character and string literals, character literals containing no or multiple characters, and the character literal ''' (interpreted as '\'').

Lookahead

Lex objects allow their clients to peek ahead into the token stream by calling lookahead. This method returns the token code for the future token, but it does not affect the Token fields of this.

If a call to lookahead needs to look past the set of tokens already scanned, and those tokens have errors, then the errors are reported immediately.
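One plausible way to provide lookahead without disturbing the current token is to buffer scanned-but-unconsumed tokens in a queue. The sketch below assumes tokens are whitespace-separated words; PeekingScanner and its codes are hypothetical names, and the real Lex keeps full Token objects in its TokenQueue rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: tokens are whitespace-separated words; codes are made up.
class PeekingScanner {
    static final int EOF = 0, WORD = 1;

    private final String[] words;
    private int next;                                      // index of the next unscanned word
    private final List<String> queue = new ArrayList<>();  // scanned but not yet consumed
    int ttype = EOF;  // current-token fields, changed only by getNextToken
    String text;

    PeekingScanner(String input) {
        String trimmed = input.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        getNextToken();  // scan the first token, as restart does in the text
    }

    /** Makes sure the queue holds at least n tokens, if the input has that many. */
    private void fill(int n) {
        while (queue.size() < n && next < words.length) queue.add(words[next++]);
    }

    /** Advances to the next token and updates the current-token fields. */
    int getNextToken() {
        fill(1);
        if (queue.isEmpty()) {
            text = null;
            return ttype = EOF;
        }
        text = queue.remove(0);
        return ttype = WORD;
    }

    /** Returns the code of the kth future token; k = 0 is the current token. */
    int lookahead(int k) {
        if (k == 0) return ttype;
        fill(k);  // scanning ahead happens here, so scan errors would surface here too
        return k <= queue.size() ? WORD : EOF;
    }
}
```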

Extensibility: Keywords, punctuation

A keyword is a Java identifier with a special token code. Ordinarily, identifiers are associated with the code TagConstants.IDENT. Keywords, while matching the lexical grammar of identifiers, are associated with different codes. In fact, each keyword is typically associated with its own code.

The set of identifiers that a Lex object recognizes as keywords is extensible. A keyword is added to a Lex object by calling the addKeyword method. As a convenience, a boolean given to the Lex constructor indicates whether a newly-constructed Lex object should automatically have all Java keywords added to it.

A punctuation string is a string of punctuation characters recognized by a Lex object to be a token. (Punctuation characters are non-alphanumeric ASCII characters whose ASCII codes are between 33 ('!') and 126 ('~') inclusive.) As with keywords, the set of punctuation strings recognized by a Lex object is extensible. A punctuation string is added to a Lex object by calling the addPunctuation method. A boolean given to the Lex constructor indicates whether a newly-constructed Lex object should automatically have all Java punctuation strings added to it.
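The two rules above (keywords are identifiers with a non-default code, and punctuation characters are the non-alphanumeric ASCII characters from 33 to 126) can be written down in a few lines. This is a minimal sketch: KeywordTable, codeFor, isPunctuationChar, and the IDENT value are hypothetical and only mirror the behaviour the text describes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an extensible keyword table; not the javafe code.
class KeywordTable {
    static final int IDENT = 1;  // made-up code for ordinary identifiers
    private final Map<String, Integer> keywords = new HashMap<>();

    /** Registers a keyword; adding the same keyword twice is rejected. */
    void addKeyword(String newKeyword, int code) {
        if (keywords.containsKey(newKeyword)) {
            throw new IllegalArgumentException("keyword already added: " + newKeyword);
        }
        keywords.put(newKeyword, code);
    }

    /** Returns the keyword's own code, or IDENT for any other identifier. */
    int codeFor(String identifier) {
        return keywords.getOrDefault(identifier, IDENT);
    }

    /** True for non-alphanumeric ASCII characters between '!' (33) and '~' (126). */
    static boolean isPunctuationChar(char c) {
        return c >= 33 && c <= 126 && !Character.isLetterOrDigit(c);
    }
}
```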

Extensibility: comments and pragmas

The handling of comments is special in two ways: the set of punctuation strings that start comments is extensible, and the text of comments can be parsed for pragmas.

Comment recognition

Ordinarily, keywords and punctuation strings are mapped to token codes that are returned by calls to getNextToken. However, two token codes are treated specially: TagConstants.C_COMMENT and TagConstants.EOL_COMMENT. These codes are used to indicate the start of C-like comments (/*...*/) and end-of-line comments (//...), respectively. When a keyword or punctuation string is mapped to these codes, it is handled like a comment initiator rather than a regular token.

For all newly-created Lex objects, /* is mapped to TagConstants.C_COMMENT and // is mapped to TagConstants.EOL_COMMENT. Other punctuation strings can be made comment initiators by mapping them to comment-initiating codes. This is more useful for TagConstants.EOL_COMMENT than for TagConstants.C_COMMENT, since the string */ is hard-wired as the terminator of C-like comments.
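To illustrate how mapping a punctuation string to a comment code turns it into a comment initiator, here is a small stand-alone sketch. CommentInitiators, stripComments, and the numeric codes are hypothetical; the sketch also ignores string and character literals, which a real scanner must not do.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: punctuation strings mapped to comment codes start comments.
class CommentInitiators {
    static final int NULL = 0, C_COMMENT = 1, EOL_COMMENT = 2;  // made-up codes
    private final Map<String, Integer> punctuation = new HashMap<>();
    private int maxLen = 0;

    CommentInitiators() {
        addPunctuation("/*", C_COMMENT);    // defaults described in the text
        addPunctuation("//", EOL_COMMENT);
    }

    void addPunctuation(String s, int code) {
        punctuation.put(s, code);
        maxLen = Math.max(maxLen, s.length());
    }

    /** Code of the longest registered string starting at pos, or NULL; length goes in lenOut[0]. */
    private int match(String src, int pos, int[] lenOut) {
        for (int len = Math.min(maxLen, src.length() - pos); len > 0; len--) {
            Integer code = punctuation.get(src.substring(pos, pos + len));
            if (code != null) {
                lenOut[0] = len;
                return code;
            }
        }
        return NULL;
    }

    /** Returns src with every comment removed; line terminators are kept. */
    String stripComments(String src) {
        StringBuilder out = new StringBuilder();
        int[] len = new int[1];
        int pos = 0;
        while (pos < src.length()) {
            int code = match(src, pos, len);
            if (code == EOL_COMMENT) {
                int eol = src.indexOf('\n', pos);
                pos = eol < 0 ? src.length() : eol;     // stop before the terminator
            } else if (code == C_COMMENT) {
                int end = src.indexOf("*/", pos + len[0]);  // terminator is hard-wired
                if (end < 0) throw new IllegalStateException("unterminated comment");
                pos = end + 2;
            } else {
                out.append(src.charAt(pos++));
            }
        }
        return out.toString();
    }
}
```

In this model, calling addPunctuation("#", EOL_COMMENT) is enough to make # start an end-of-line comment, while a new C-like initiator would still have to be closed by */.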

Pragma parsing
Lex objects are designed to support annotation of Java programs through pragmas. A separate document describes our overall approach to pragmas. In brief, our front-end supports two kinds of pragmas: control pragmas that can appear anywhere in an input file and are collected in a list apart from the parse tree, and syntax pragmas that can only appear in certain grammatical contexts and become part of the parse tree. Pragmas always appear in Java comments, at most one pragma per comment. These comments must have one of the following forms:
  • /*tag white-space pragma-text*/
  • //tag white-space-minus-EOL pragma-text EOL

When a Lex object is created, it can optionally be associated with a PragmaParser object. If a Lex object has no PragmaParser, it discards all comments. Otherwise, the Lex object passes the first character of the comment (or -1 if the comment is empty) to the checkTag method of the PragmaParser object, which returns false if the comment definitely does not contain any pragmas. If the comment may contain pragmas, the Lex object bundles the text between the delimiters of a comment into a CorrelatedReader which it passes to the restart method of its PragmaParser. (This text excludes both the opening /* or // and the closing */ or line terminator.)

    The Lex object then calls getNextPragma to read the pragmas out of the comment one at a time. The Lex object does this in a lazy manner; that is, it reads a pragma, returns it to the parser, and waits until the parser calls for the next token before it attempts to read another pragma. The getNextPragma method returns a boolean, returning false if there are no more pragmas to be parsed. The getNextPragma method takes a Token as an argument, storing information about the pragma parsed into this argument.

    When PragmaParser.getNextPragma returns a LexicalPragma, the Lex object puts it in an internal list rather than returning it to the parser. The list of collected lexical pragmas can be retrieved by calling getLexicalPragmas.
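The checkTag, restart, getNextPragma sequence described above can be sketched as follows. Everything here is an assumption made for illustration: SketchPragmaParser, SketchPragma, CommentHandler, and AtTagParser are hypothetical stand-ins, the comment text is passed as a String rather than a CorrelatedReader, and the loop is eager where the real Lex reads pragmas lazily.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins; the real javafe types have different signatures.
interface SketchPragmaParser {
    boolean checkTag(int firstChar);          // false: comment definitely holds no pragma
    void restart(String commentText);         // text between the comment delimiters
    boolean getNextPragma(SketchPragma dst);  // false: no more pragmas
}

class SketchPragma {
    String text;
    boolean lexical;  // true for pragmas kept apart from the parse tree
}

class CommentHandler {
    private final SketchPragmaParser parser;  // may be null
    private final List<String> lexicalPragmas = new ArrayList<>();

    CommentHandler(SketchPragmaParser parser) { this.parser = parser; }

    /** Processes one comment body; returns the pragmas destined for the parser. */
    List<String> handleComment(String body) {
        List<String> forParser = new ArrayList<>();
        if (parser == null) return forParser;           // no parser: comment is discarded
        int first = body.isEmpty() ? -1 : body.charAt(0);
        if (!parser.checkTag(first)) return forParser;  // cheap rejection on the tag
        parser.restart(body);
        SketchPragma p = new SketchPragma();
        while (parser.getNextPragma(p)) {
            if (p.lexical) lexicalPragmas.add(p.text);  // collected in a side list
            else forParser.add(p.text);
        }
        return forParser;
    }

    /** Returns the collected lexical pragmas and clears the internal list. */
    List<String> getLexicalPragmas() {
        List<String> result = new ArrayList<>(lexicalPragmas);
        lexicalPragmas.clear();
        return result;
    }
}

/** Example parser: a comment starting with '@' holds exactly one pragma. */
class AtTagParser implements SketchPragmaParser {
    private String pending;
    public boolean checkTag(int firstChar) { return firstChar == '@'; }
    public void restart(String commentText) { pending = commentText.substring(1).trim(); }
    public boolean getNextPragma(SketchPragma dst) {
        if (pending == null) return false;
        dst.text = pending;
        dst.lexical = pending.startsWith("nowarn");
        pending = null;
        return true;
    }
}
```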

    See Also:
    CorrelatedReader, TagConstants, Token, PragmaParser

    Field Summary
    protected  boolean inPragma
               
    protected  boolean javakeywords
               
    protected  java.util.Hashtable keywords
              Unenforceable invariant: no tokenType in this table requires a non-null auxVal.
     LexicalPragmaVec lexicalPragmas
               
    protected  TokenQueue lookaheadq
               
    protected  CorrelatedReader m_in
              Current state of input stream underlying the scanner minus the first character.
    protected  int m_nextchr
              Each call to getNextToken reads ahead one character and leaves the result in m_nextchr.
    protected  boolean onlyjavakeywords
               
    protected  PragmaParser pragmaParser
               
    private  PunctuationPrefixTree punctuationTable
               
    protected  Token savedState
               
    private  char[] stringLit
               
    private  int stringLitLen
               
    protected  char[] text
              The characters that constitute the current token.
    protected  int textlen
              The number of characters in the current token.
     
    Fields inherited from class javafe.parser.Token
    auxVal, CLEAR, endingLoc, identifierVal, startingLoc, ttype
     
    Constructor Summary
    Lex(PragmaParser pragmaParser, boolean isJava)
              Creates a lexical analyzer that will tokenize the characters read from an underlying CorrelatedReader.
     
    Method Summary
     void addJavaKeywords()
              Add all of Java's keywords to the scanner.
     void addJavaPunctuation()
              Add all of Java's punctuation strings to the scanner.
     void addKeyword(java.lang.String newkeyword, int code)
              Add a keyword to a Lex object with the given code.
     void addPunctuation(java.lang.String punctuation, int code)
              Add a punctuation string, associated with the given code, to the scanner.
    protected  void append(int c)
              Append 'c' to text, expanding if necessary.
     void close()
              Closes the CorrelatedReader underlying this, clears the set of collected lexical pragmas, and in other ways frees up resources associated with this.
    private  int finishFloatingPointLiteral(int nextchr)
              Finishes scanning a floating-point literal.
     LexicalPragmaVec getLexicalPragmas()
              Returns the set of lexical pragmas collected.
     int getNextToken()
              Scans next token from input stream.
     int lookahead(int k)
              Returns token type of the kth future token, where k=0 is the current token.
     Token lookaheadToken(int k)
               
     LexicalPragma popLexicalPragma()
              Remove the first LexicalPragma from our set of lexical pragmas collected, returning it or null if our set is empty.
     void replaceLookaheadToken(int k, Token t)
               
     int restart(CorrelatedReader in)
              Start scanning a new CorrelatedReader.
    private  int scanCharOrString(int nextchr)
              Scan a character or string constant.
    private  void scanComment(int commentKind)
              Handle a comment.
    protected  int scanJavaExtensions(int nextchr)
              Scans a Java extension.
    private  int scanNumber(int nextchr)
              Scans a numeric literal.
    private  int scanPunctuation(int nextchr)
              Scans a punctuation string or a floating-point number.
    private  int scanToken()
              Returns the code of the next token in the token stream, updating the Token fields of this along the way.
    private  void stringLitAppend(int c)
               
     void zzz(java.lang.String prefix)
              Checks invariants (assumes that Token fields haven't been mucked with by outside code).
     
    Methods inherited from class javafe.parser.Token
    clear, copyInto, ztoString, zzz
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Field Detail

    m_in

    protected CorrelatedReader m_in
    Current state of input stream underlying the scanner minus the first character. See m_nextchr.

    This is null iff we are closed.


    m_nextchr

    protected int m_nextchr
    Each call to getNextToken reads ahead one character and leaves the result in m_nextchr. In other words, between calls to getNextToken, the stream of characters yet to be scanned consists of the character in m_nextchr followed by the characters remaining in m_in.
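The one-character read-ahead invariant can be shown with a plain java.io.Reader. This is a minimal sketch, assuming whitespace-delimited words as tokens; ReadAheadScanner, nextWord, and peekChar are hypothetical names, and IO failures are wrapped in an unchecked exception much as the text describes for close.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch of a one-character read-ahead scanner.
class ReadAheadScanner {
    private final Reader in;  // everything after the read-ahead character
    private int nextchr;      // the read-ahead character, or -1 at end of input

    ReadAheadScanner(Reader in) {
        this.in = in;
        this.nextchr = read();  // establish the invariant before the first scan
    }

    private int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next whitespace-delimited word, or null at end of input. */
    String nextWord() {
        while (nextchr != -1 && Character.isWhitespace(nextchr)) nextchr = read();
        if (nextchr == -1) return null;
        StringBuilder sb = new StringBuilder();
        while (nextchr != -1 && !Character.isWhitespace(nextchr)) {
            sb.append((char) nextchr);
            nextchr = read();
        }
        return sb.toString();  // nextchr now holds the character just past the word
    }

    /** Exposes the read-ahead character for inspection. */
    int peekChar() { return nextchr; }
}
```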


    text

    protected char[] text
    The characters that constitute the current token. Only the first textlen characters are part of the current token; the actual length of text may be bigger. The lexer may occasionally need to resize text, so the same array might not be used throughout the lifetime of the lexer.
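A buffer with these properties (a char array that may be longer than the token and is replaced when it fills up) is commonly written as below. TokenText, the initial capacity of 4, and the doubling policy are assumptions for illustration; the documentation does not say how Lex chooses its sizes.

```java
import java.util.Arrays;

// Hypothetical sketch of a growable token-text buffer.
class TokenText {
    char[] text = new char[4];  // deliberately small so that growth is easy to observe
    int textlen = 0;            // only text[0..textlen) belongs to the current token

    /** Appends c, replacing text with a larger array when it is full. */
    void append(int c) {
        if (textlen == text.length) {
            text = Arrays.copyOf(text, text.length * 2);
        }
        text[textlen++] = (char) c;
    }

    /** Starts a new token without releasing the array. */
    void clear() { textlen = 0; }

    /** Returns the current token's characters as a String. */
    String current() { return new String(text, 0, textlen); }
}
```

Because append may replace the array, callers in this model should re-read the text field after each call instead of caching a reference to it.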


    textlen

    protected int textlen
    The number of characters in the current token. The "current token" is the one parsed by the previous call to getNextToken (there is no "current token" between creation of a lexer and the first call to getNextToken).


    lookaheadq

    protected final TokenQueue lookaheadq

    lexicalPragmas

    public LexicalPragmaVec lexicalPragmas

    pragmaParser

    protected PragmaParser pragmaParser

    inPragma

    protected boolean inPragma

    savedState

    protected Token savedState

    stringLit

    private char[] stringLit

    stringLitLen

    private int stringLitLen

    punctuationTable

    private PunctuationPrefixTree punctuationTable

    keywords

    protected java.util.Hashtable keywords
    Unenforceable invariant: no tokenType in this table requires a non-null auxVal (cf. Token.auxVal).


    javakeywords

    protected boolean javakeywords

    onlyjavakeywords

    protected boolean onlyjavakeywords
    Constructor Detail

    Lex

    public Lex(PragmaParser pragmaParser,
               boolean isJava)
    Creates a lexical analyzer that will tokenize the characters read from an underlying CorrelatedReader. Before the newly-created scanner can be used, its restart method must be called on a CorrelatedReader. The pragmaParser object is used to parse pragmas out of comments; if it is null, all comments are discarded. The isJava flag controls the initial set of keywords and punctuation strings: if true, the new scanner will recognize Java's keywords and punctuation strings; if false, it will recognize no keywords or punctuation strings. If isJava is true, the token codes used for Java's keywords and punctuation strings are those defined by the TagConstants class.

    Method Detail

    append

    protected void append(int c)
    Append 'c' to text, expanding if necessary.


    restart

    public int restart(CorrelatedReader in)
    Start scanning a new CorrelatedReader. First closes the old CorrelatedReader associated with this (if there was one) and clears out the set of collected lexical pragmas. In addition to (re)setting the underlying input stream, this method scans the first token, returning the token kind of the result and setting the Token fields of this. Note: The argument in is "captured" in the internal, private state of the resulting scanner and should not be used by other parts of the program.


    close

    public void close()
    Closes the CorrelatedReader underlying this, clears the set of collected lexical pragmas, and in other ways frees up resources associated with this. After being closed, a Lex object can be restarted by calling restart. (An IO exception raised by closing the underlying input stream will be converted into a javafe.util.FatalError runtime exception.)


    replaceLookaheadToken

    public void replaceLookaheadToken(int k,
                                      Token t)

    getNextToken

    public int getNextToken()
    Scans next token from input stream. Returns the code of the next token in the token stream and fills in the Token fields of this. Note that the startingLoc and endingLoc fields of this are not accurate for the end-of-file token.


    lookahead

    public int lookahead(int k)
    Returns token type of the kth future token, where k=0 is the current token. If k is past the end of the token stream, TagConstants.EOF is returned.


    lookaheadToken

    public Token lookaheadToken(int k)

    getLexicalPragmas

    public LexicalPragmaVec getLexicalPragmas()
    Returns the set of lexical pragmas collected. It also clears the set of lexical pragmas so that the next call will not include them. (If this lexer has no PragmaParser, then an empty vector is returned.)


    popLexicalPragma

    public LexicalPragma popLexicalPragma()
    Remove the first LexicalPragma from our set of lexical pragmas collected, returning it or null if our set is empty.


    scanToken

    private int scanToken()
    Returns the code of the next token in the token stream, updating the Token fields of this along the way. Advances underlying stream to the character just past the last character of the token returned, and changes the internal buffer used by getTokenText to contain the text of this token.

    In most cases, this method leaves m_nextchr holding the character just after the token scanned and m_in pointing to the character after that. However, if TagConstants.C_COMMENT or TagConstants.EOL_COMMENT is returned, it leaves m_in pointing to the character just after the token scanned and m_nextchr undefined. This aids in pragma processing.


    scanComment

    private void scanComment(int commentKind)
    Handle a comment. m_in points to the character just after the "//" or "/*". The mark is set at the last character read.


    scanCharOrString

    private int scanCharOrString(int nextchr)
    Scan a character or string constant.


    scanNumber

    private int scanNumber(int nextchr)
    Scans a numeric literal. Requires nextchr is a decimal digit. Reads a numeric literal into text. Depending on the kind of literal found, will return one of TagConstants.INTLIT, TagConstants.LONGLIT, TagConstants.FLOATLIT or TagConstants.DOUBLELIT. If an error is detected, a message is sent to ErrorSet, m_in is advanced to what appears to be the end of the erroneous token, and a legal literal is left in text.


    finishFloatingPointLiteral

    private int finishFloatingPointLiteral(int nextchr)
    Finishes scanning a floating-point literal.

    Requires: text contains a possibly empty sequence of decimal digits followed by an optional '.'; also, text cannot be empty. Further, let s be the sequence of characters consisting of the characters in text followed by nextchr followed by the characters in m_in. This routine requires that a prefix of s match the syntax of floating-point literals as defined by the Java language specification.

    Ensures: Scans the floating-point literal in s. Depending on the type of the literal, returns TagConstants.FLOATLIT or TagConstants.DOUBLELIT and sets auxVal to either a Float or Double. If an error is encountered, a message is sent to ErrorSet and recovery is performed.


    scanPunctuation

    private int scanPunctuation(int nextchr)
    Scans a punctuation string or a floating-point number. If input doesn't match either a floating-point number or any punctuation, returns TagConstants.NULL. Assumes startingLoc already filled in.

    The routine may change the mark arbitrarily.

    This method leaves m_in in a different state than the previous ones do. Ordinarily, scanXXX routines return with m_nextchr holding the character just after the token scanned and m_in pointing to the character after that. scanPunctuation does too, but only when the value returned is not TagConstants.C_COMMENT or TagConstants.EOL_COMMENT; in those two cases, it returns with m_in pointing to the character just after the token scanned and m_nextchr undefined. This aids in pragma processing. Also, if TagConstants.NULL is returned, then m_nextchr is undefined and m_in is where it was on entry.
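Longest-match scanning over a prefix tree, with the input restored when nothing matches, can be sketched with Reader.mark and Reader.reset. PunctuationScanner, its Node class, and the codes are hypothetical; the sketch does not handle floating-point literals or the m_nextchr conventions described above, and it assumes the reader supports mark/reset and that no candidate string approaches 64 characters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of longest-match punctuation scanning over a prefix tree.
class PunctuationScanner {
    static final int NULL = 0;  // made-up "no token" code

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int code = NULL;  // NULL means no punctuation string ends at this node
    }

    private final Node root = new Node();
    private final Reader in;

    PunctuationScanner(Reader in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("reader must support mark/reset");
        }
        this.in = in;
    }

    void addPunctuation(String punctuation, int code) {
        Node n = root;
        for (char c : punctuation.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.code = code;
    }

    /** Reads one character, or -1 at end of input. */
    int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Consumes the longest registered string and returns its code, or NULL with input untouched. */
    int scanPunctuation() {
        try {
            in.mark(64);  // remember the entry position
            Node n = root;
            int best = NULL, bestLen = 0, len = 0, c;
            while ((c = in.read()) != -1 && (n = n.children.get((char) c)) != null) {
                len++;
                if (n.code != NULL) {  // a complete string ends here; remember it
                    best = n.code;
                    bestLen = len;
                }
            }
            in.reset();  // back to the entry position
            for (int i = 0; i < bestLen; i++) in.read();  // consume only the matched prefix
            return best;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```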


    scanJavaExtensions

    protected int scanJavaExtensions(int nextchr)
    Scans a Java extension. If input doesn't match any Java extension, returns TagConstants.NULL. Assumes startingLoc already filled in, and assumes textlen is 0.

    The routine may change the mark arbitrarily.

    If a Java extension is matched, returns with m_nextchr holding the character just after the token scanned and m_in pointing to the character after that.


    stringLitAppend

    private void stringLitAppend(int c)

    addJavaKeywords

    public void addJavaKeywords()
    Add all of Java's keywords to the scanner. The token codes used for these keywords are those defined by the TagConstants class. Requires that none of these keywords have been added already.


    addKeyword

    public void addKeyword(java.lang.String newkeyword,
                           int code)
    Add a keyword to a Lex object with the given code. Requires that newkeyword is a Java identifier and that code is not TagConstants.NULL or a tokenType that requires auxVal to be non-null (cf. Token.auxVal). Also requires that the keyword hasn't already been added.


    addJavaPunctuation

    public void addJavaPunctuation()
    Add all of Java's punctuation strings to the scanner. The codes used for these punctuation strings are those found in the TagConstants class. Requires that none of these punctuation strings have been added before.


    addPunctuation

    public void addPunctuation(java.lang.String punctuation,
                               int code)
    Add a punctuation string, associated with the given code, to the scanner. Requires that the characters in the punctuation string are all punctuation characters, that is, non-alphanumeric ASCII characters whose codes are between 33 ('!') and 126 ('~') inclusive. Also requires that the code is not TagConstants.NULL and that the punctuation string hasn't already been added.


    zzz

    public void zzz(java.lang.String prefix)
    Checks invariants (assumes that Token fields haven't been mucked with by outside code). prefix is used to prefix error messages with context provided by the caller.

