ESC/Java2 © 2003,2004,2005 David Cok and Joseph Kiniry © 2005 UCD Dublin © 2003,2004 Radboud University Nijmegen © 1999,2000 Compaq Computer Corporation © 1997,1998,1999 Digital Equipment Corporation All Rights Reserved |
java.lang.Object
  javafe.parser.Token
    javafe.parser.Lex
A Lex object generates a sequence of Java "input elements"
(that is, tokens) by converting the sequence of input characters and
line terminators generated by an underlying CorrelatedReader.
The conversion of input characters occurs according to the lexical
rules in Chapter Three of the Java Language Specification. This
specification describes three lexical translation steps: the first two
steps translate a raw Unicode input stream into a "cooked" one in which
Unicode escape characters from the raw stream have been processed and
in which line terminators have been identified; the last step
translates this cooked stream into a sequence of Java "input elements"
(comments, white space, identifiers, tokens, literals, and
punctuation). Lex objects perform the last of these translation steps;
the first two are performed by an underlying CorrelatedReader that is
given to the Lex object as the source of input characters.
Before a newly-created Lex object can be used, its restart method must
be called with a CorrelatedReader to scan. At any point, a Lex can be
restarted on a different underlying reader.
The Lex class is thread safe, but instances of Lex are not. That is, two different threads may safely access two different instances of Lex concurrently, but they may not access the same instance of Lex concurrently.
The getNextToken method of Lex objects returns the translated token
sequence, one token at a time. It discards white space, and it
processes comments as described below. In addition, getNextToken fills
in the Token fields of this (Lex is a subclass of Token); for example,
ttype gets an integer code defining the type of token returned and
startingLoc gives the location of the first character making up the
token. If the token is an identifier, identifierVal indicates which
one; if the token is a literal, auxVal gives its value.
Lex objects report errors by calling both the fatal and error methods
of ErrorSet. The fatal errors are unexpected characters, unterminated
comments and string and character literals, and IO errors in the
underlying input stream. Recoverable errors are overflows in literals
(including overflows in octal escape sequences), non-octal digits in
integer literals, the string 0x (interpreted as a malformed integer
literal), missing digits in a floating-point exponent, bad escape
sequences in character and string literals, character literals
containing no or multiple characters, and the character literal '''
(interpreted as '\'').
Lex objects allow their clients to peek ahead into the token stream by
calling lookahead. This method returns the token code for the future
token, but it does not affect the Token fields of this. If a call to
lookahead needs to look past the set of tokens already scanned, and
those tokens have errors, then the errors are reported immediately.
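The lookahead contract described above can be illustrated with a small buffer of already-scanned token codes. This is only a sketch under stated assumptions, not the javafe implementation: the class LookaheadSketch, the EOF value, and the use of a pre-tokenized list of int codes in place of a real scanner are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy stand-in for the lookahead contract; not the javafe.parser.Lex code. */
class LookaheadSketch {
    static final int EOF = -1;   // hypothetical stand-in for TagConstants.EOF

    private final Iterator<Integer> source;                 // stands in for the underlying scanner
    private final List<Integer> queue = new ArrayList<>();  // scanned but not yet consumed
    int ttype;                                              // code of the current token

    LookaheadSketch(List<Integer> codes) {
        source = codes.iterator();
        ttype = scan();          // like restart, scan the first token right away
    }

    // Produces the next token code, or EOF once the input is exhausted.
    private int scan() {
        return source.hasNext() ? source.next() : EOF;
    }

    // Advances to the next token, consuming buffered lookahead first.
    int getNextToken() {
        ttype = queue.isEmpty() ? scan() : queue.remove(0);
        return ttype;
    }

    // Returns the code of the kth future token; k = 0 is the current token.
    // The current token (ttype) is left untouched.
    int lookahead(int k) {
        if (k == 0) return ttype;
        while (queue.size() < k) queue.add(scan());
        return queue.get(k - 1);
    }
}
```

With the codes 10, 20, 30 as input, lookahead(2) yields 30 while ttype stays 10, and any lookahead past the end yields EOF.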
A keyword is a Java identifier with a special token code. Ordinarily,
identifiers are associated with the code TagConstants.IDENT. Keywords,
while matching the lexical grammar of identifiers, are associated with
different codes. In fact, each keyword is typically associated with its
own code.
The set of identifiers that a Lex object recognizes as keywords is
extensible. A keyword is added to a Lex object by calling the
addKeyword method. As a convenience, a boolean given to the Lex
constructor indicates whether a newly-constructed Lex object should
automatically have all Java keywords added to it.
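The keyword mechanism amounts to a table from identifier text to token code, consulted after an identifier has been scanned. The following is a minimal sketch of that idea, not the actual Lex code: the class name, the numeric values of IDENT and NULL, and the choice to throw IllegalArgumentException when a precondition is violated are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy keyword table; the constants are hypothetical stand-ins for TagConstants. */
class KeywordTableSketch {
    static final int NULL = 0;    // stand-in for TagConstants.NULL
    static final int IDENT = 1;   // stand-in for TagConstants.IDENT

    private final Map<String, Integer> keywords = new HashMap<>();

    // Registers a keyword. Throwing on a violated precondition is an
    // assumption of this sketch; the documentation only states requirements.
    void addKeyword(String newkeyword, int code) {
        if (code == NULL) {
            throw new IllegalArgumentException("code must not be NULL");
        }
        if (keywords.containsKey(newkeyword)) {
            throw new IllegalArgumentException("keyword already added: " + newkeyword);
        }
        keywords.put(newkeyword, code);
    }

    // Returns the keyword's own code, or IDENT for an ordinary identifier.
    int codeFor(String identifier) {
        Integer code = keywords.get(identifier);
        return code != null ? code : IDENT;
    }
}
```

After addKeyword("while", 42), codeFor("while") returns 42, while an unregistered identifier such as "whilst" still maps to IDENT.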
A punctuation string is a string of punctuation characters recognized
by a Lex object to be a token. (Punctuation characters are
non-alphanumeric ASCII characters whose ASCII codes are between 33
('!') and 126 ('~') inclusive.) As with keywords, the set of
punctuation strings recognized by a Lex object is extensible. A
punctuation string is added to a Lex object by calling the
addPunctuation method. A boolean given to the Lex constructor indicates
whether a newly-constructed Lex object should automatically have all
Java punctuation strings added to it.
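One plausible way to recognize an extensible set of punctuation strings is a prefix tree searched for the longest match, which is how the Java lexical rules resolve tokens such as > and >>. The sketch below assumes that approach; it is not taken from the PunctuationPrefixTree class, and the class name, the NULL value, and the thrown exceptions are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy prefix tree with longest-match lookup; not the javafe implementation. */
class PunctuationTrieSketch {
    static final int NULL = 0;   // hypothetical stand-in for TagConstants.NULL

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int code = NULL;         // NULL means no punctuation string ends here
    }

    private final Node root = new Node();
    int matchedLength;           // length of the most recent match, 0 if none

    // Registers a punctuation string made of non-alphanumeric ASCII 33..126.
    void addPunctuation(String punctuation, int code) {
        if (code == NULL || punctuation.isEmpty()) {
            throw new IllegalArgumentException("bad code or empty string");
        }
        Node n = root;
        for (char c : punctuation.toCharArray()) {
            if (c < 33 || c > 126 || Character.isLetterOrDigit(c)) {
                throw new IllegalArgumentException("not a punctuation character: " + c);
            }
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        if (n.code != NULL) {
            throw new IllegalArgumentException("already added: " + punctuation);
        }
        n.code = code;
    }

    // Returns the code of the longest registered string that starts at
    // input[start], or NULL if none matches; also records its length.
    int longestMatch(String input, int start) {
        Node n = root;
        int best = NULL;
        matchedLength = 0;
        for (int i = start; i < input.length(); i++) {
            n = n.children.get(input.charAt(i));
            if (n == null) break;
            if (n.code != NULL) {
                best = n.code;
                matchedLength = i - start + 1;
            }
        }
        return best;
    }
}
```

With ">", ">>" and ">>>=" registered, scanning ">>>x" matches ">>": the third '>' is only a prefix of ">>>=", so the last complete match wins.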
The handling of comments is special in two ways: the set of punctuation strings that start comments is extensible, and the text of comments can be parsed for pragmas.
Ordinarily, keywords and punctuation strings are mapped to token codes
that are returned by calls to getNextToken. However, two token codes
are treated specially: TagConstants.C_COMMENT and
TagConstants.EOL_COMMENT. These codes are used to indicate the start of
C-like comments (/*...*/) and end-of-line comments (//...),
respectively. When a keyword or punctuation string is mapped to these
codes, it is handled like a comment initiator rather than a regular
token.
For all newly-created Lex objects, /* is mapped to
TagConstants.C_COMMENT and // is mapped to TagConstants.EOL_COMMENT.
Other punctuation strings can be made comment initiators by mapping
them to comment-initiating codes. This is more useful for
TagConstants.EOL_COMMENT than for TagConstants.C_COMMENT, since the
string */ is hard-wired as the terminator of C-like comments.
Lex objects are designed to support annotation of Java programs through
pragmas. A separate document describes our overall approach to pragmas.
In brief, our front-end supports two kinds of pragmas: control pragmas
that can appear anywhere in an input file and are collected in a list
apart from the parse tree, and syntax pragmas that can only appear in
certain grammatical contexts and become part of the parse tree.
Pragmas always appear in Java comments, at most one pragma per
comment. These comments must have one of the following forms:
When a Lex object is created, it can optionally be associated with a
PragmaParser object. If a Lex object has no PragmaParser, it discards
all comments. Otherwise, the Lex object passes the first character of
the comment (or -1 if the comment is empty) to the checkTag method of
the PragmaParser object, which returns false if the comment definitely
does not contain any pragmas. If the comment may contain pragmas, the
Lex object bundles the text between the delimiters of a comment into a
CorrelatedReader which it passes to the restart method of its
PragmaParser. (This text excludes both the opening /* or // and the
closing */ or line terminator.)
The Lex object then calls getNextPragma to read the pragmas out of the
comment one at a time. The Lex object does this in a lazy manner; that
is, it reads a pragma, returns it to the parser, and waits until the
parser calls for the next token before it attempts to read another
pragma. The getNextPragma method returns a boolean, returning false if
there are no more pragmas to be parsed. The getNextPragma method takes
a Token as an argument, storing information about the pragma parsed
into this argument.
When PragmaParser.getNextPragma returns a LexicalPragma, the Lex object
puts it in an internal list rather than returning it to the parser. The
list of collected lexical pragmas can be retrieved by calling
getLexicalPragmas.
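The checkTag / restart / getNextPragma handshake described above can be sketched with stand-in types. Everything here is hypothetical: PragmaParserSketch is not the real PragmaParser interface (which works with a CorrelatedReader and a Token rather than a String and a StringBuilder), the '@' tag convention is invented for the example, and the pragmas are collected eagerly where the real Lex reads them lazily.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the PragmaParser protocol. */
interface PragmaParserSketch {
    boolean checkTag(int tag);                 // false: definitely no pragmas
    void restart(String commentText);          // real method takes a CorrelatedReader
    boolean getNextPragma(StringBuilder dst);  // real method fills in a Token
}

/** Toy parser: a comment whose first character is '@' holds one pragma. */
class AtSignPragmaParser implements PragmaParserSketch {
    private String pending;   // pragma text not yet handed out, or null

    public boolean checkTag(int tag) {
        return tag == '@';
    }

    public void restart(String commentText) {
        String body = commentText.length() > 1 ? commentText.substring(1).trim() : "";
        pending = body.isEmpty() ? null : body;
    }

    public boolean getNextPragma(StringBuilder dst) {
        if (pending == null) return false;   // no more pragmas in this comment
        dst.setLength(0);
        dst.append(pending);
        pending = null;
        return true;
    }
}

class CommentHandlerSketch {
    // commentText is the text between the comment delimiters.
    static List<String> pragmasIn(String commentText, PragmaParserSketch parser) {
        List<String> out = new ArrayList<>();
        if (parser == null) return out;                 // no parser: discard comment
        int tag = commentText.isEmpty() ? -1 : commentText.charAt(0);
        if (!parser.checkTag(tag)) return out;          // definitely no pragmas
        parser.restart(commentText);
        StringBuilder dst = new StringBuilder();
        while (parser.getNextPragma(dst)) {
            out.add(dst.toString());
        }
        return out;
    }
}
```

In this sketch the comment text "@ non_null" yields the single pragma "non_null", while an ordinary comment, an empty comment, or a null parser yields nothing.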
See Also: CorrelatedReader, TagConstants, Token, PragmaParser
Field Summary
protected boolean | inPragma |
protected boolean | javakeywords |
protected java.util.Hashtable | keywords | Unenforceable invariant: all tokenTypes in this table do not require a non-null auxVal.
LexicalPragmaVec | lexicalPragmas |
protected TokenQueue | lookaheadq |
protected CorrelatedReader | m_in | Current state of input stream underlying the scanner minus the first character.
protected int | m_nextchr | Each call to getNextToken reads ahead one character and leaves the result in m_nextchr.
protected boolean | onlyjavakeywords |
protected PragmaParser | pragmaParser |
private PunctuationPrefixTree | punctuationTable |
protected Token | savedState |
private char[] | stringLit |
private int | stringLitLen |
protected char[] | text | The characters that constitute the current token.
protected int | textlen | The number of characters in the current token.
Fields inherited from class javafe.parser.Token
auxVal, CLEAR, endingLoc, identifierVal, startingLoc, ttype
Constructor Summary
Lex(PragmaParser pragmaParser, boolean isJava) | Creates a lexical analyzer that will tokenize the characters read from an underlying CorrelatedReader.
Method Summary
void | addJavaKeywords() | Add all of Java's keywords to the scanner.
void | addJavaPunctuation() | Add all of Java's punctuation strings to the scanner.
void | addKeyword(java.lang.String newkeyword, int code) | Add a keyword to a Lex object with the given code.
void | addPunctuation(java.lang.String punctuation, int code) | Add a punctuation string to a scanner associated with a given code.
protected void | append(int c) | Append 'c' to text, expanding if necessary.
void | close() | Closes the CorrelatedReader underlying this, clears the set of collected lexical pragmas, and in other ways frees up resources associated with this.
private int | finishFloatingPointLiteral(int nextchr) | Finishes scanning a floating-point literal.
LexicalPragmaVec | getLexicalPragmas() | Returns the set of lexical pragmas collected.
int | getNextToken() | Scans next token from input stream.
int | lookahead(int k) | Returns token type of the kth future token, where k=0 is the current token.
Token | lookaheadToken(int k) |
LexicalPragma | popLexicalPragma() | Remove the first LexicalPragma from our set of lexical pragmas collected, returning it or null if our set is empty.
void | replaceLookaheadToken(int k, Token t) |
int | restart(CorrelatedReader in) | Start scanning a new CorrelatedReader.
private int | scanCharOrString(int nextchr) | Scan a character or string constant.
private void | scanComment(int commentKind) | Handle a comment.
protected int | scanJavaExtensions(int nextchr) | Scans a Java extension.
private int | scanNumber(int nextchr) | Scans a numeric literal.
private int | scanPunctuation(int nextchr) | Scans a punctuation string or a floating-point number.
private int | scanToken() | Returns the code of the next token in the token stream, updating the Token fields of this along the way.
private void | stringLitAppend(int c) |
void | zzz(java.lang.String prefix) | Checks invariants (assumes that Token fields haven't been mucked with by outside code).
Methods inherited from class javafe.parser.Token
clear, copyInto, ztoString, zzz
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail |
protected CorrelatedReader m_in
Current state of input stream underlying the scanner minus the first
character (see m_nextchr). This is null iff we are closed.
protected int m_nextchr
Each call to getNextToken reads ahead one character and leaves the
result in m_nextchr. In other words, between calls to getNextToken, the
stream of characters yet to be scanned consists of the character in
m_nextchr followed by the characters remaining in m_in.
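The one-character read-ahead invariant for m_nextchr and m_in can be shown in miniature. This is a sketch under assumptions, not the Lex source: the class ReadAheadSketch, its methods, and the use of a plain java.io.Reader in place of a CorrelatedReader are hypothetical. IO failures are rethrown as an unchecked exception, loosely mirroring the conversion to a runtime FatalError that the close method describes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Toy scanner keeping one character of read-ahead; not the javafe code. */
class ReadAheadSketch {
    private final Reader in;   // plays the role of m_in
    private int nextchr;       // plays the role of m_nextchr; -1 at end of input

    ReadAheadSketch(Reader in) {
        this.in = in;
        this.nextchr = read(); // prime the read-ahead before any token is scanned
    }

    // Reads one character, converting checked IO failures to unchecked ones.
    private int read() {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Scans a run of Java identifier characters; returns "" if none starts here.
    // On return, nextchr holds the first character after the identifier.
    String scanIdentifier() {
        StringBuilder text = new StringBuilder();
        if (nextchr != -1 && Character.isJavaIdentifierStart(nextchr)) {
            do {
                text.append((char) nextchr);
                nextchr = read();
            } while (nextchr != -1 && Character.isJavaIdentifierPart(nextchr));
        }
        return text.toString();
    }

    // The character that has been read ahead but not yet consumed.
    int peek() {
        return nextchr;
    }

    // Consumes the read-ahead character and fetches the next one.
    void skip() {
        if (nextchr != -1) nextchr = read();
    }
}
```

On the input "ab+c", scanIdentifier returns "ab" and leaves '+' in the read-ahead slot, so the unscanned input is always the read-ahead character followed by whatever remains in the reader.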
protected char[] text
protected int textlen
protected final TokenQueue lookaheadq
public LexicalPragmaVec lexicalPragmas
protected PragmaParser pragmaParser
protected boolean inPragma
protected Token savedState
private char[] stringLit
private int stringLitLen
private PunctuationPrefixTree punctuationTable
protected java.util.Hashtable keywords
protected boolean javakeywords
protected boolean onlyjavakeywords
Constructor Detail |
public Lex(PragmaParser pragmaParser, boolean isJava)
Creates a lexical analyzer that will tokenize the characters read from
an underlying CorrelatedReader. Before the newly-created scanner can be
used, its restart method must be called on a CorrelatedReader. The
pragmaParser object is used to parse pragmas out of comments; if it is
null, all comments are discarded. The isJava flag controls the initial
set of keywords and punctuation strings: if true, the new scanner will
recognize Java's keywords and punctuation strings; if false, the new
scanner will recognize no keywords or punctuation strings. If isJava is
true, the token codes used for Java's keywords and punctuation strings
are those defined by the TagConstants class.
Method Detail |
protected void append(int c)
Append 'c' to text, expanding if necessary.
public int restart(CorrelatedReader in)
Start scanning a new CorrelatedReader. First closes the old
CorrelatedReader associated with this (if there was one), and clears
out the set of collected lexical pragmas. In addition to (re)setting
the underlying input stream, this method scans the first token,
returning the token kind of the result and setting the Token fields of
this. If a CorrelatedReader is already underlying this, it is closed
before the new reader is installed. Note: The argument in is "captured"
in the internal, private state of the resulting scanner and should not
be used by other parts of the program.
public void close()
Closes the CorrelatedReader underlying this, clears the set of
collected lexical pragmas, and in other ways frees up resources
associated with this. After being closed, a Lex object can be restarted
by calling restart. (An IO exception raised by closing the underlying
input stream will be converted into a javafe.util.FatalError runtime
exception.)
public void replaceLookaheadToken(int k, Token t)
public int getNextToken()
Scans next token from input stream, returning its token code and
updating the Token fields of this. Note that the startingLoc and
endingLoc fields of this are not accurate for the end-of-file token.
public int lookahead(int k)
Returns token type of the kth future token, where k=0 is the current
token. If k is past the end of the token stream, TagConstants.EOF is
returned.
public Token lookaheadToken(int k)
public LexicalPragmaVec getLexicalPragmas()
Returns the set of lexical pragmas collected. (If this Lex object has
no PragmaParser, then an empty vector is returned.)
public LexicalPragma popLexicalPragma()
private int scanToken()
Returns the code of the next token in the token stream, updating the
Token fields of this along the way. Advances underlying stream to the
character just past the last character of the token returned, and
changes the internal buffer used by getTokenText to contain the text of
this token.
In most cases, this method leaves m_nextchr holding the character just
after the token scanned and m_in pointing to the character after that.
However, if TagConstants.C_COMMENT or TagConstants.EOL_COMMENT is
returned, it leaves m_in pointing to the character just after the token
scanned and m_nextchr undefined. This aids in pragma processing.
private void scanComment(int commentKind)
private int scanCharOrString(int nextchr)
private int scanNumber(int nextchr)
Scans a numeric literal. Requires that nextchr is a decimal digit.
Reads a numeric literal into text. Depending on the kind of literal
found, will return one of TagConstants.INTLIT, TagConstants.LONGLIT,
TagConstants.FLOATLIT or TagConstants.DOUBLELIT. If an error is
detected, a message is sent to ErrorSet, m_in is advanced to what
appears to be the end of the erroneous token, and a legal literal is
left in text.
private int finishFloatingPointLiteral(int nextchr)
Finishes scanning a floating-point literal.
Requires: text contains a possibly empty sequence of decimal digits
followed by an optional '.'; also, text cannot be empty. Further, let s
be the sequence of characters consisting of the characters in text
followed by nextchr followed by the characters in m_in. This routine
requires that a prefix of s match the syntax of floating-point literals
as defined by the Java language specification.
Ensures: Scans the floating-point literal in s. Depending on the type
of the literal, returns TagConstants.FLOATLIT or TagConstants.DOUBLELIT
and sets auxVal to either a Float or Double. If an error is
encountered, a message is sent to ErrorSet and recovery is performed.
private int scanPunctuation(int nextchr)
Scans a punctuation string or a floating-point number. If nothing is
matched, returns TagConstants.NULL. Assumes startingLoc already filled
in.
The routine may change the mark arbitrarily.
This method leaves m_in in a different state than the previous ones do.
Ordinarily, scanXXX routines return with m_nextchr holding the
character just after the token scanned and m_in pointing to the
character after that. scanPunctuation does too, but only when the value
returned is not TagConstants.C_COMMENT or TagConstants.EOL_COMMENT; in
those two cases, it returns with m_in pointing to the character just
after the token scanned and m_nextchr undefined. This aids in pragma
processing. Also, if TagConstants.NULL is returned, then m_nextchr is
undefined and m_in is where it was on entry.
protected int scanJavaExtensions(int nextchr)
Scans a Java extension. If nothing is matched, returns
TagConstants.NULL. Assumes startingLoc already filled in, and assumes
textlen is 0.
The routine may change the mark arbitrarily.
If a Java extension is matched, returns with m_nextchr holding the
character just after the token scanned and m_in pointing to the
character after that.
private void stringLitAppend(int c)
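public void addJavaKeywords()
Add all of Java's keywords to the scanner. The token codes used for
these keywords are those defined by the TagConstants class. Requires
that none of these keywords have been added already.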
public void addKeyword(java.lang.String newkeyword, int code)
Add a keyword to a Lex object with the given code. Requires that
newkeyword is a Java identifier and that code is not TagConstants.NULL
or a tokenType that requires auxVal to be non-null (cf. Token.auxVal).
Also requires that the keyword hasn't already been added.
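public void addJavaPunctuation()
Add all of Java's punctuation strings to the scanner. The token codes
used for these punctuation strings are those defined by the
TagConstants class. Requires that none of these punctuation strings
have been added before.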
public void addPunctuation(java.lang.String punctuation, int code)
Add a punctuation string to a scanner associated with a given code.
Requires that code is not TagConstants.NULL and that the punctuation
string hasn't already been added.
public void zzz(java.lang.String prefix)
Checks invariants (assumes that Token fields haven't been mucked with
by outside code). prefix is used to prefix error messages with context
provided by the caller.