org.apache.uima.internal.util
Class TextStringTokenizer

java.lang.Object
  extended by org.apache.uima.internal.util.TextStringTokenizer

public class TextStringTokenizer
extends java.lang.Object

An implementation of a text tokenizer for whitespace-separated natural language text.

The tokenizer knows about four different character classes: regular word characters, whitespace characters, sentence delimiters and separator characters. Tokens can consist of

The character classes are completely user-definable. By default, the whitespace characters are the Unicode whitespace characters, and all other characters are word characters. The two separator classes (sentence delimiters and other separators) are empty by default. The classes may overlap. When determining the class of a character, the user-defined classes are checked in the following order: end-of-sentence delimiters, then other separators, then whitespace, then word characters. That is, if a character is defined to be both a separator and a whitespace character, it is treated as a separator.
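The lookup order described above can be sketched as follows. This is a minimal, self-contained stand-in written for illustration, not the UIMA implementation: the class name CharClassSketch and the constant values are hypothetical, and Character.isWhitespace() is assumed as an approximation of "Unicode whitespace".

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative stand-in for the character-class lookup; not the UIMA class. */
final class CharClassSketch {
    // Hypothetical values; the real ones are listed under "Constant Field Values".
    static final int EOS = 0;
    static final int SEP = 1;
    static final int WSP = 2;
    static final int WCH = 3;

    private final Set<Character> eosChars = new HashSet<>();
    private final Set<Character> sepChars = new HashSet<>();
    private final Set<Character> wspChars = new HashSet<>();

    private static void addAll(Set<Character> target, String chars) {
        for (int i = 0; i < chars.length(); i++) {
            target.add(chars.charAt(i));
        }
    }

    void addToEndOfSentenceChars(String chars) { addAll(eosChars, chars); }
    void addSeparators(String chars) { addAll(sepChars, chars); }
    void addWhitespaceChars(String chars) { addAll(wspChars, chars); }

    /** Checks the classes in the documented order: EOS, SEP, WSP, then WCH. */
    int getCharType(char c) {
        if (eosChars.contains(c)) return EOS;
        if (sepChars.contains(c)) return SEP;
        // Assumption: user-defined whitespace is checked alongside the built-in test.
        if (wspChars.contains(c) || Character.isWhitespace(c)) return WSP;
        return WCH; // everything else is a word character
    }

    public static void main(String[] args) {
        CharClassSketch classes = new CharClassSketch();
        classes.addToEndOfSentenceChars(".!?");
        classes.addSeparators(",;-.");   // '.' is also listed as a separator
        classes.addWhitespaceChars("-"); // '-' is also listed as whitespace

        System.out.println(classes.getCharType('.') == EOS); // true: EOS wins over SEP
        System.out.println(classes.getCharType('-') == SEP); // true: SEP wins over WSP
        System.out.println(classes.getCharType(' ') == WSP); // true: built-in whitespace
        System.out.println(classes.getCharType('a') == WCH); // true: default class
    }
}
```

Because the checks run in a fixed order, overlapping definitions never produce an ambiguous result; the first matching class wins.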

By default, the tokenizer will return all tokens, including whitespace. That is, appending the sequence of tokens will recover the original input text. This behavior can be changed so that whitespace and/or separator tokens are skipped.
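The round-trip property can be illustrated with a small stand-in that uses only the default classes (whitespace and word characters). This is a sketch, not the UIMA code: the class RoundTripSketch and its tokenize() helper are hypothetical, and it assumes that a token is a maximal run of characters of the same class.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in showing the round-trip property; not the UIMA class. */
final class RoundTripSketch {

    /**
     * Splits text into maximal runs of whitespace or non-whitespace characters.
     * When showWhitespace is false, whitespace runs are skipped.
     */
    static List<String> tokenize(String text, boolean showWhitespace) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            boolean ws = Character.isWhitespace(text.charAt(start));
            int end = start + 1;
            // Extend the run while the character class stays the same.
            while (end < text.length()
                    && Character.isWhitespace(text.charAt(end)) == ws) {
                end++;
            }
            if (!ws || showWhitespace) {
                tokens.add(text.substring(start, end));
            }
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "Hello  brave new\tworld";

        // With whitespace tokens included, concatenation recovers the input.
        List<String> all = tokenize(input, true);
        System.out.println(String.join("", all).equals(input)); // true

        // With whitespace skipped, only the word tokens remain.
        System.out.println(tokenize(input, false)); // [Hello, brave, new, world]
    }
}
```

On the real class, the corresponding switches are setShowWhitespace(boolean) and setShowSeparators(boolean).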

A tokenizer provides a standard iterator interface similar to StringTokenizer. The validity of the iterator can be queried with hasNext(), and the next token can be queried with nextToken(). In addition, getNextTokenType() returns the type of the token as an integer. Note that you need to call getNextTokenType() before calling nextToken(), since calling nextToken() will advance the iterator. These names do not appear in the method summary below, which lists isValid(), getToken(), getTokenType() and setToNext() for the corresponding operations.
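The call order matters, so here is a minimal sketch of the pattern using the method names from the paragraph above. The class PeekingTokenizerSketch is a hypothetical stand-in, not the UIMA class; it handles only whitespace and word characters, its constant values are made up, and returning -1 when no token is left follows the getTokenType() description further down this page.

```java
import java.util.NoSuchElementException;

/** Illustrative stand-in for the query-then-advance pattern; not the UIMA class. */
final class PeekingTokenizerSketch {
    // Hypothetical values for the two default character classes.
    static final int WSP = 0;
    static final int WCH = 1;

    private final String text;
    private int pos = 0;

    PeekingTokenizerSketch(String text) {
        this.text = text;
    }

    boolean hasNext() {
        return pos < text.length();
    }

    /** Type of the upcoming token, or -1 if there is none. Does not advance. */
    int getNextTokenType() {
        if (!hasNext()) {
            return -1;
        }
        return Character.isWhitespace(text.charAt(pos)) ? WSP : WCH;
    }

    /** Returns the upcoming token and advances past it. */
    String nextToken() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more tokens");
        }
        int type = getNextTokenType();
        int start = pos;
        while (hasNext() && getNextTokenType() == type) {
            pos++;
        }
        return text.substring(start, pos);
    }

    public static void main(String[] args) {
        PeekingTokenizerSketch tok = new PeekingTokenizerSketch("to be");
        while (tok.hasNext()) {
            int type = tok.getNextTokenType(); // query the type first...
            String token = tok.nextToken();    // ...because this call advances
            System.out.println((type == WSP ? "WSP" : "WCH") + " [" + token + "]");
        }
        // Prints:
        // WCH [to]
        // WSP [ ]
        // WCH [be]
    }
}
```

Swapping the two calls inside the loop would report the type of the following token rather than the one just returned.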

Version:
$Id: TextStringTokenizer.java,v 1.6 2003/04/07 14:50:11 goetz Exp $

Field Summary
static int EOS
          Sentence delimiter character/word type.
static int SEP
          Separator character/word type.
static int WCH
          Word character/word type.
static int WSP
          Whitespace character/word type.
 
Constructor Summary
TextStringTokenizer(java.lang.String string)
          Construct a tokenizer from a Java string.
 
Method Summary
 void addSeparators(java.lang.String chars)
          Add to the set of separator characters.
 void addToEndOfSentenceChars(java.lang.String chars)
          Add to the set of sentence delimiters.
 void addWhitespaceChars(java.lang.String chars)
          Add to the set of whitespace characters.
 void addWordChars(java.lang.String chars)
          Add to the set of word characters.
 int getCharType(char c)
          Get the type of an individual character.
 java.lang.String getToken()
          Return the next token.
 int getTokenEnd()
          Get the end of the token.
 int getTokenStart()
          Get the start of the token.
 int getTokenType()
          Get the type of the token returned by the next call to nextToken().
 boolean isValid()
          Return true iff there is a next token.
 void setEndOfSentenceChars(java.lang.String chars)
          Set the set of sentence delimiters.
 void setSeparators(java.lang.String chars)
          Set the set of separator characters.
 void setShowSeparators(boolean b)
          Set the flag for showing separator tokens.
 void setShowWhitespace(boolean b)
          Set the flag for showing whitespace tokens.
 void setToFirst()
          Reset the tokenizer at any time.
 void setToNext()
          Compute the next token.
 void setWhitespaceChars(java.lang.String chars)
          Set the set of whitespace characters (in addition to the Unicode whitespace chars).
 void setWordChars(java.lang.String chars)
          Set the set of word characters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EOS

public static final int EOS
Sentence delimiter character/word type.

See Also:
Constant Field Values

SEP

public static final int SEP
Separator character/word type.

See Also:
Constant Field Values

WSP

public static final int WSP
Whitespace character/word type.

See Also:
Constant Field Values

WCH

public static final int WCH
Word character/word type.

See Also:
Constant Field Values

Constructor Detail

TextStringTokenizer

public TextStringTokenizer(java.lang.String string)
Construct a tokenizer from a Java string.

Parameters:
string - The string to tokenize.

Method Detail

setShowWhitespace

public void setShowWhitespace(boolean b)
Set the flag for showing whitespace tokens.

Parameters:
b - The whitespace flag.

setShowSeparators

public void setShowSeparators(boolean b)
Set the flag for showing separator tokens.

Parameters:
b - The flag.

setEndOfSentenceChars

public void setEndOfSentenceChars(java.lang.String chars)
Set the set of sentence delimiters.

Parameters:
chars - A string containing EOS chars.

addToEndOfSentenceChars

public void addToEndOfSentenceChars(java.lang.String chars)
Add to the set of sentence delimiters.

Parameters:
chars - A string containing EOS chars.

setSeparators

public void setSeparators(java.lang.String chars)
Set the set of separator characters.

Parameters:
chars - The separator chars.

addSeparators

public void addSeparators(java.lang.String chars)
Add to the set of separator characters.

Parameters:
chars - Separator chars.

setWhitespaceChars

public void setWhitespaceChars(java.lang.String chars)
Set the set of whitespace characters (in addition to the Unicode whitespace chars).

Parameters:
chars - Whitespace chars.

addWhitespaceChars

public void addWhitespaceChars(java.lang.String chars)
Add to the set of whitespace characters.

Parameters:
chars - Whitespace chars.

setWordChars

public void setWordChars(java.lang.String chars)
Set the set of word characters.

Parameters:
chars - Word chars.

addWordChars

public void addWordChars(java.lang.String chars)
Add to the set of word characters.

Parameters:
chars - Word chars.

getTokenType

public int getTokenType()
Get the type of the token returned by the next call to nextToken().

Returns:
The token type, or -1 if there is no next token.

isValid

public boolean isValid()
Return true iff there is a next token.

Returns:
true iff there is a next token.

setToFirst

public void setToFirst()
Reset the tokenizer at any time.


getToken

public java.lang.String getToken()
Return the next token.

Returns:
The next token.

getTokenStart

public int getTokenStart()
Get the start of the token.

Returns:
The start of the token.

getTokenEnd

public int getTokenEnd()
Get the end of the token.

Returns:
The token end.

setToNext

public void setToNext()
Compute the next token.


getCharType

public int getCharType(char c)
Get the type of an individual character.

Returns:
The char type.


Copyright © 2011. All Rights Reserved.