org.apache.commons.csv
Class CSVParser

java.lang.Object
  extended by org.apache.commons.csv.CSVParser

public class CSVParser
extends java.lang.Object

Parses CSV files according to the specified configuration. Because CSV appears in many different dialects, the parser supports many configuration settings by allowing the specification of a CSVStrategy.

Parsing of a csv-string having tabs as separators, '"' as an optional value encapsulator, and comments starting with '#':

  // Requires java.io.StringReader; getAllValues() throws java.io.IOException.
  String[][] data =
      new CSVParser(new StringReader("a\tb\nc\td"), new CSVStrategy('\t', '"', '#')).getAllValues();
 

Parsing of a csv-string in Excel CSV format:

  // Requires java.io.StringReader; getAllValues() throws java.io.IOException.
  String[][] data =
      new CSVParser(new StringReader("a;b\nc;d"), CSVStrategy.EXCEL_STRATEGY).getAllValues();
 

Internal parser state is completely covered by the strategy and the reader-state.

See the package documentation for more details.


Nested Class Summary
(package private) static class CSVParser.Token
          Token is an internal token representation.
 
Field Summary
private  CharBuffer code
           
private static java.lang.String[] EMPTY_STRING_ARRAY
          Immutable empty String array.
private  ExtendedBufferedReader in
           
private static int INITIAL_TOKEN_LENGTH
          Length of the initial token (content) buffer.
private  java.util.ArrayList record
          A record buffer for getLine().
private  CSVParser.Token reusableToken
           
private  CSVStrategy strategy
           
protected static int TT_EOF
          Token (which can have content) when end of file is reached.
protected static int TT_EORECORD
          Token with content when end of a line is reached.
protected static int TT_INVALID
          Token has no valid content, i.e. is in its initialized state.
protected static int TT_TOKEN
          Token with content, at beginning or in the middle of a line.
private  CharBuffer wsBuf
           
 
Constructor Summary
CSVParser(java.io.InputStream input)
          Deprecated. Use CSVParser(Reader) instead.
CSVParser(java.io.Reader input)
          CSV parser using the default CSVStrategy.
CSVParser(java.io.Reader input, char delimiter)
          Deprecated. Use CSVParser(Reader,CSVStrategy) instead.
CSVParser(java.io.Reader input, char delimiter, char encapsulator, char commentStart)
          Deprecated. Use CSVParser(Reader,CSVStrategy) instead.
CSVParser(java.io.Reader input, CSVStrategy strategy)
          Customized CSV parser using the given CSVStrategy
 
Method Summary
private  CSVParser.Token encapsulatedTokenLexer(CSVParser.Token tkn, int c)
          An encapsulated token lexer. Encapsulated tokens are surrounded by the given encapsulating-string.
 java.lang.String[][] getAllValues()
          Parses the CSV according to the given strategy and returns the content as an array of records (where each record is an array of single values).
 java.lang.String[] getLine()
          Parses from the current point in the stream until the end of the current line.
 int getLineNumber()
          Returns the current line number in the input stream.
 CSVStrategy getStrategy()
          Obtains the specified CSV strategy.
private  boolean isEndOfFile(int c)
           
private  boolean isEndOfLine(int c)
          Greedy: accepts \n and \r\n. This checker silently consumes the second control character...
private  boolean isWhitespace(int c)
           
protected  CSVParser.Token nextToken()
          Convenience method for nextToken(null).
protected  CSVParser.Token nextToken(CSVParser.Token tkn)
          Returns the next token.
 java.lang.String nextValue()
          Parses the CSV according to the given strategy and returns the next csv-value as string.
private  int readEscape(int c)
           
 CSVParser setStrategy(CSVStrategy strategy)
          Deprecated. The strategy should be set in the constructor CSVParser(Reader,CSVStrategy).
private  CSVParser.Token simpleTokenLexer(CSVParser.Token tkn, int c)
          A simple token lexer. Simple tokens are tokens which are not surrounded by encapsulators.
protected  int unicodeEscapeLexer(int c)
          Decodes Unicode escapes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INITIAL_TOKEN_LENGTH

private static final int INITIAL_TOKEN_LENGTH
Length of the initial token (content) buffer.

See Also:
Constant Field Values

TT_INVALID

protected static final int TT_INVALID
Token has no valid content, i.e. is in its initialized state.

See Also:
Constant Field Values

TT_TOKEN

protected static final int TT_TOKEN
Token with content, at beginning or in the middle of a line.

See Also:
Constant Field Values

TT_EOF

protected static final int TT_EOF
Token (which can have content) when end of file is reached.

See Also:
Constant Field Values

TT_EORECORD

protected static final int TT_EORECORD
Token with content when end of a line is reached.

See Also:
Constant Field Values

EMPTY_STRING_ARRAY

private static final java.lang.String[] EMPTY_STRING_ARRAY
Immutable empty String array.


in

private final ExtendedBufferedReader in

strategy

private CSVStrategy strategy

record

private final java.util.ArrayList record
A record buffer for getLine(). Grows as necessary and is reused.


reusableToken

private final CSVParser.Token reusableToken

wsBuf

private final CharBuffer wsBuf

code

private final CharBuffer code
Constructor Detail

CSVParser

public CSVParser(java.io.InputStream input)
Deprecated. Use CSVParser(Reader) instead.

The parser uses the default CSVStrategy.

Parameters:
input - an InputStream containing "csv-formatted" input

CSVParser

public CSVParser(java.io.Reader input)
CSV parser using the default CSVStrategy.

Parameters:
input - a Reader containing "csv-formatted" input

CSVParser

public CSVParser(java.io.Reader input,
                 char delimiter)
Deprecated. Use CSVParser(Reader,CSVStrategy) instead.

Customized value delimiter parser. The parser follows the default CSVStrategy except for the delimiter setting.

Parameters:
input - a Reader based on "csv-formatted" input
delimiter - the char used for value separation

CSVParser

public CSVParser(java.io.Reader input,
                 char delimiter,
                 char encapsulator,
                 char commentStart)
Deprecated. Use CSVParser(Reader,CSVStrategy) instead.

Customized CSV parser. The parser parses according to the given CSV dialect settings. Leading whitespace is trimmed, Unicode escapes are not interpreted, and empty lines are ignored.

Parameters:
input - a Reader based on "csv-formatted" input
delimiter - the char used for value separation
encapsulator - the char used as value encapsulation marker
commentStart - the char used for comment identification

CSVParser

public CSVParser(java.io.Reader input,
                 CSVStrategy strategy)
Customized CSV parser using the given CSVStrategy

Parameters:
input - a Reader containing "csv-formatted" input
strategy - the CSVStrategy used for CSV parsing
Method Detail

getAllValues

public java.lang.String[][] getAllValues()
                                  throws java.io.IOException
Parses the CSV according to the given strategy and returns the content as an array of records (where each record is an array of single values).

The returned content starts at the current parse-position in the stream.

Returns:
matrix of records x values ('null' when end of file)
Throws:
java.io.IOException - on parse error or input read-failure

nextValue

public java.lang.String nextValue()
                           throws java.io.IOException
Parses the CSV according to the given strategy and returns the next csv-value as string.

Returns:
next value in the input stream ('null' when end of file)
Throws:
java.io.IOException - on parse error or input read-failure

getLine

public java.lang.String[] getLine()
                           throws java.io.IOException
Parses from the current point in the stream until the end of the current line.

Returns:
array of values until the end of the line ('null' when end of file has been reached)
Throws:
java.io.IOException - on parse error or input read-failure

getLineNumber

public int getLineNumber()
Returns the current line number in the input stream. ATTENTION: if your CSV contains multi-line values, the returned number does not correspond to the record number.

Returns:
current line number
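
The difference between line numbers and record numbers can be illustrated with a small standalone sketch. It does not use CSVParser; the class name LineVersusRecordCount and its count method are hypothetical, and it assumes '"' as encapsulator and that every record, including the last, ends with \n.

```java
final class LineVersusRecordCount {

    /**
     * Returns {physicalLines, records} for CSV text in which every record,
     * including the last one, ends with a line feed and '"' is the encapsulator.
     */
    static int[] count(String csv) {
        int lines = 0;
        int records = 0;
        boolean insideEncapsulator = false;
        for (int i = 0; i < csv.length(); i++) {
            char ch = csv.charAt(i);
            if (ch == '"') {
                // A doubled "" toggles twice, so the state ends up unchanged.
                insideEncapsulator = !insideEncapsulator;
            } else if (ch == '\n') {
                lines++; // every line break advances the line count
                if (!insideEncapsulator) {
                    records++; // only a line break outside a value ends a record
                }
            }
        }
        return new int[] {lines, records};
    }
}
```

For the input "a,\"first\nsecond\"\nb,c\n" this sketch counts 3 lines but only 2 records, because the first line break sits inside an encapsulated value.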

nextToken

protected CSVParser.Token nextToken()
                             throws java.io.IOException
Convenience method for nextToken(null).

Throws:
java.io.IOException

nextToken

protected CSVParser.Token nextToken(CSVParser.Token tkn)
                             throws java.io.IOException
Returns the next token. A token corresponds to a term, a record change or an end-of-file indicator.

Parameters:
tkn - an existing Token object to reuse. The caller is responsible for initializing the Token.
Returns:
the next token found
Throws:
java.io.IOException - on stream access error

simpleTokenLexer

private CSVParser.Token simpleTokenLexer(CSVParser.Token tkn,
                                         int c)
                                  throws java.io.IOException
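A simple token lexer. Simple tokens are tokens which are not surrounded by encapsulators. A simple token might contain escaped delimiters (such as \, or \;). The token is finished when one of the following conditions becomes true: the end of the line is reached (TT_EORECORD), the end of the stream is reached (TT_EOF), or an unescaped delimiter is reached (TT_TOKEN).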

Parameters:
tkn - the current token
c - the current character
Returns:
the filled token
Throws:
java.io.IOException - on stream access error
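
The handling of escaped delimiters in simple tokens can be sketched as follows. This is not the code of simpleTokenLexer; the class SimpleTokenSketch and its split method are hypothetical, the sketch works on a single line held in a String, and it assumes that an escaped character is kept literally.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleTokenSketch {

    /** Splits one line into values at every delimiter that is not escaped. */
    static List<String> split(String line, char delimiter, char escape) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == escape && i + 1 < line.length()) {
                // Escaped character: keep it literally, even if it is the delimiter.
                current.append(line.charAt(++i));
            } else if (ch == delimiter) {
                // Unescaped delimiter: the current value is finished.
                values.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        values.add(current.toString()); // end of line finishes the last value
        return values;
    }
}
```

With ',' as delimiter and '\' as escape character, the line a\,b,c yields the two values "a,b" and "c".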

encapsulatedTokenLexer

private CSVParser.Token encapsulatedTokenLexer(CSVParser.Token tkn,
                                               int c)
                                        throws java.io.IOException
An encapsulated token lexer. Encapsulated tokens are surrounded by the given encapsulating-string. The encapsulator itself might be included in the token using a doubling syntax (as "", '') or using escaping (as in \", \'). Whitespace before and after an encapsulated token is ignored.

Parameters:
tkn - the current token
c - the current character
Returns:
a valid token object
Throws:
java.io.IOException - on invalid state
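
The doubling syntax can be illustrated with a standalone sketch. It is not the code of encapsulatedTokenLexer; the class EncapsulatedTokenSketch and its unquote method are hypothetical, the input is assumed to start with the opening encapsulator, and backslash escaping and surrounding whitespace are left out.

```java
final class EncapsulatedTokenSketch {

    /**
     * Reads one encapsulated value. The input must start with the opening
     * encapsulator; a doubled encapsulator inside the value stands for one
     * literal encapsulator character.
     */
    static String unquote(String input, char encapsulator) {
        if (input.isEmpty() || input.charAt(0) != encapsulator) {
            throw new IllegalArgumentException("input does not start with the encapsulator");
        }
        StringBuilder value = new StringBuilder();
        int i = 1;
        while (i < input.length()) {
            char ch = input.charAt(i);
            if (ch == encapsulator) {
                boolean doubled = i + 1 < input.length() && input.charAt(i + 1) == encapsulator;
                if (!doubled) {
                    return value.toString(); // closing encapsulator reached
                }
                value.append(encapsulator); // doubled encapsulator: keep one copy
                i += 2;
            } else {
                value.append(ch);
                i++;
            }
        }
        throw new IllegalArgumentException("missing closing encapsulator");
    }
}
```

With ' as encapsulator, the input 'it''s' yields the value it's.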

unicodeEscapeLexer

protected int unicodeEscapeLexer(int c)
                          throws java.io.IOException
Decodes Unicode escapes. Interpretation of "\\uXXXX" escape sequences where XXXX is a hex-number.

Parameters:
c - current char which is discarded because it's the "\\" of "\\uXXXX"
Returns:
the decoded character
Throws:
java.io.IOException - on wrong unicode escape sequence or read error
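
Decoding the four hex digits of such an escape can be sketched as follows. This is not the code of unicodeEscapeLexer; the class UnicodeEscapeSketch and its decode method are hypothetical, and the sketch assumes that the reader is positioned directly after the escape prefix, on the first hex digit.

```java
import java.io.IOException;
import java.io.Reader;

final class UnicodeEscapeSketch {

    /**
     * Reads exactly four hex digits from the reader and returns the
     * character code they denote.
     */
    static int decode(Reader in) throws IOException {
        int code = 0;
        for (int i = 0; i < 4; i++) {
            int ch = in.read();
            // Character.digit returns -1 for anything that is not a hex digit.
            int digit = (ch == -1) ? -1 : Character.digit((char) ch, 16);
            if (digit < 0) {
                throw new IOException("invalid unicode escape sequence");
            }
            code = code * 16 + digit;
        }
        return code;
    }
}
```

Given a reader over the digits 0041, decode returns 65, the code of the letter A. A premature end of input or a non-hex character results in an IOException, in line with the Throws clause above.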

readEscape

private int readEscape(int c)
                throws java.io.IOException
Throws:
java.io.IOException

setStrategy

public CSVParser setStrategy(CSVStrategy strategy)
Deprecated. The strategy should be set in the constructor CSVParser(Reader,CSVStrategy).

Sets the specified CSV strategy.

Returns:
current instance of CSVParser to allow chained method calls

getStrategy

public CSVStrategy getStrategy()
Obtains the specified CSV strategy.

Returns:
strategy currently being used

isWhitespace

private boolean isWhitespace(int c)
Returns:
true if the given char is a whitespace character

isEndOfLine

private boolean isEndOfLine(int c)
                     throws java.io.IOException
Greedy: accepts \n and \r\n. This checker silently consumes the second control character...

Returns:
true if the given character is a line-terminator
Throws:
java.io.IOException
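
The greedy behaviour can be sketched with a java.io.PushbackReader. This is not the code of isEndOfLine, which works on the parser's own ExtendedBufferedReader; the class GreedyEndOfLineSketch and the extra reader parameter are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.PushbackReader;

final class GreedyEndOfLineSketch {

    /**
     * Returns true for a line feed and for a carriage return that is directly
     * followed by a line feed. In the second case the line feed is consumed
     * from the reader; any other following character is pushed back.
     */
    static boolean isEndOfLine(int c, PushbackReader in) throws IOException {
        if (c == '\r') {
            int next = in.read();
            if (next == '\n') {
                c = next; // the second control character is consumed silently
            } else if (next != -1) {
                in.unread(next); // not part of a line terminator: put it back
            }
        }
        return c == '\n';
    }
}
```

In this sketch a carriage return that is not followed by a line feed is not treated as a line terminator, which matches the "accepts \n and \r\n" wording above.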

isEndOfFile

private boolean isEndOfFile(int c)
Returns:
true if the given character indicates end of file