A Field Guide to Parsing and Creating Flat Files
Using the Humble Flatworm
For Version 1.2
Last Revised August 01, 2007
Flat files. Much as we live in an XML/SOAP/Web Services world,
there's still a ton of data being moved around between proprietary and
legacy applications that consists of fixed length fields delimited by
EOLs. Around about the time I wrote my 20th Java class whose only
purpose in life was to suck up a flat file, use String.substring
to break it up into pieces, and then populate a bean with it, I decided
there had to be a better way. This package represents the fruit of
that frustration.
What is Flatworm?
Flatworm is a Java library intended to allow a developer to describe the
format of a flat file using an XML definition file, and then to be able
to automatically read lines from that file, and have one or more beans
be instantiated for each logical record.
There are a few powerful features in Flatworm worth mentioning.
For one thing, a record may consist of one or more physical lines in the
file. A record may contain more than one bean once decoded.
A flat file may contain more than one type of record, and Flatworm can
use line length and substring matching to determine which type of record
a line begins.
Besides fielded buffer flat files, Flatworm also supports text files where
the different fields are separated by a separator character, e.g. CSV (comma separated values) files.
Last but not least, Flatworm is able to produce flat files from beans
and the same definition file.
Requirements
In addition to the flatworm jar file, you will also need to have the
following jars in your classpath in order for Flatworm to thrive:
- commons-beanutils (from Apache Commons)
- commons-collections (from Apache Commons)
- commons-logging (from Apache Commons)
- log4j (www.log4j.org)
Recent versions of all of these packages are available in the source jar
file.
Downloading
The latest version of Flatworm is Release 1.2. You can download it
from SourceForge.
A Simple Example
Before diving into the complexities of Flatworm, let's look at a simple
example that illustrates the basic operation. Imagine the
following input file which contains new hire data for a company:
NHJAMES          TURNER         M123-45-67890004224345
NHJOHN           JONES          M987-65-43210104356745
The layout of the file is as follows:
RECORD NAME | TYPE   | LENGTH
recordtype  | char   | 2
firstname   | char   | 15
lastname    | char   | 15
gender      | char   | 1
ssn         | char   | 11
salary      | double | 10 (2 decimal places)
We want to suck this file into a Java bean called Employee that has
properties firstName, lastName, ssn, gender and salary. These are
available via the standard JavaBean mechanisms.
To do this, we start by writing the Flatworm XML descriptor for the
file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE file-format SYSTEM "http://www.blackbear.com/dtds/flatworm-data-description_1_0.dtd">
<file-format>
    <converter name="char" class="com.blackbear.flatworm.converters.CoreConverters" method="convertChar" return-type="java.lang.String"/>
    <converter name="decimal" class="com.blackbear.flatworm.converters.CoreConverters" method="convertDecimal" return-type="java.lang.Double"/>
    <record name="newhire">
        <record-ident>
            <field-ident field-start="0" field-length="2">
                <match-string>NH</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <bean name="employee" class="Employee"/>
            <line>
                <record-element length="2"/>
                <record-element length="15" beanref="employee.firstName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="15" beanref="employee.lastName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="1" beanref="employee.gender" type="char"/>
                <record-element length="11" beanref="employee.ssn" type="char">
                    <conversion-option name="strip-chars" value="non-numeric"/>
                </record-element>
                <record-element length="10" beanref="employee.salary" type="decimal">
                    <conversion-option name="decimal-places" value="2"/>
                    <conversion-option name="decimal-implied" value="true"/>
                    <conversion-option name="pad-character" value="0"/>
                    <conversion-option name="justify" value="right"/>
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>
The file-format tag is required, and specifies the beginning of the
actual description. The first thing that we must do is to register
converters for the datatypes used in the file. There are a number
of predefined converter methods in the provided class
com.blackbear.flatworm.converters.CoreConverters:
- convertChar - Simply returns the field specified, optionally
stripping leading or trailing (or both) padding characters, and
removing unwanted characters.
- convertDecimal - As above, but converts the value to a Double.
The decimal place may be implied by position, or explicit.
- convertDate - Parses the date using the default (MM-dd-yyyy)
or a user defined format.
- convertInteger - Parses to an Integer
- convertLong - Parses to a Long
- convertBigDecimal - Parses to a BigDecimal
In order to be used in record definitions, a converter must always be
registered first. Next in the file, a record is defined. A
file may contain several different types of records, so the record-ident
tag is used to specify which record definition is appropriate for a given
line. There are two different ways to identify a record: by a
substring match on a specific section of the line, or by the overall
length of the line. Later, you will see how multiple record types
can be read from the same file; for the moment only one is defined,
which matches on the characters NH (new hire) in the first two positions
of the line (field-start 0, field-length 2). If no record-ident is
defined, all records will match.
Once we're sure that we are dealing with the correct record type, we can
define the record. We start by defining the beans that will be
returned. Each bean has a name which is used to reference it
inside the definition, and a class (fully qualified) with which to
create objects. The class specified must have a public
zero-argument constructor.
Finally the record is broken down line by line (since a record is
allowed to span multiple lines). Record-elements (fields) may be defined
in terms of:
- a length alone, in which case they are considered to
span from the end of the last field to that position plus the specified
length
- a start position and a length, in which case they span from
the start position to that position plus the length
- a start and end position, in which case they span from the
start to end position (not inclusive of the end)
- an end position alone, in which case they span from the last
end position to the specified end position (not inclusive of the end)
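To make the four ways of describing a span concrete, here is a rough plain-Java sketch of the bookkeeping involved. It is not Flatworm's own code; the FieldSpans and Spec names are made up for illustration, and it is written in current Java rather than the 1.4-era Java used elsewhere in this guide.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics how a field's span can be derived from
// length, start and end attributes. Not part of Flatworm.
class FieldSpans {
    /** A field specification; null means the attribute was not given. */
    static final class Spec {
        final Integer start, end, length;

        Spec(Integer start, Integer end, Integer length) {
            this.start = start;
            this.end = end;
            this.length = length;
        }
    }

    /** Cuts the line into substrings, remembering where the previous field ended. */
    static List<String> slice(String line, List<Spec> specs) {
        List<String> fields = new ArrayList<>();
        int cursor = 0; // end of the previous field
        for (Spec s : specs) {
            int from = (s.start != null) ? s.start : cursor;
            int to;
            if (s.end != null) {
                to = s.end;              // end position is exclusive
            } else if (s.length != null) {
                to = from + s.length;    // length counted from the start
            } else {
                throw new IllegalArgumentException("field needs a length or an end");
            }
            fields.add(line.substring(from, to));
            cursor = to;
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Spec> specs = new ArrayList<>();
        specs.add(new Spec(null, null, 1));  // length alone
        specs.add(new Spec(1, 10, null));    // start and end
        specs.add(new Spec(null, 16, null)); // end alone
        System.out.println(slice("V002346542002355", specs)); // [V, 002346542, 002355]
    }
}
```

The only state carried from one field to the next is the end of the previous field, which is why a length alone or an end alone is enough to locate a field.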
Each record element also defines the beanref (according to the standard
used in the Apache Commons BeanUtils package), and the type (which should
match one of the converter names registered at the top of the file). Record
elements may also have conversion-options, which are specific to the
converter specified. In the example above, the firstName and
lastName fields should have any trailing spaces removed, the social
security number should be stripped of all non-numeric characters, and the
salary has two implied decimal places and may be left-padded with zeros,
which should be removed.
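As a rough illustration of what these options amount to, the following plain-Java sketch applies the same clean-up steps by hand to the fields of the first sample line. The class and method names are invented for this example, and the reading of "justify" (left-justified values carry trailing padding, right-justified values carry leading padding) is an assumption based on the description above, not a statement about Flatworm's internals.

```java
// Illustrative only: hand-written equivalents of the conversion options
// used in the descriptor above. Not part of Flatworm.
class ConversionOptionsSketch {
    /** justify=left: the value sits at the left, so trailing pad characters go. */
    static String stripTrailing(String s, char pad) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == pad) {
            end--;
        }
        return s.substring(0, end);
    }

    /** justify=right: the value sits at the right, so leading pad characters go. */
    static String stripLeading(String s, char pad) {
        int start = 0;
        while (start < s.length() && s.charAt(start) == pad) {
            start++;
        }
        return s.substring(start);
    }

    /** strip-chars=non-numeric: keep digits only. */
    static String stripNonNumeric(String s) {
        return s.replaceAll("[^0-9]", "");
    }

    /** decimal-implied=true with decimal-places=n: shift the decimal point n places left. */
    static double impliedDecimal(String digits, int places) {
        if (digits.isEmpty()) {
            return 0.0; // an all-padding field is treated as zero in this sketch
        }
        return Double.parseDouble(digits) / Math.pow(10, places);
    }

    public static void main(String[] args) {
        System.out.println(stripTrailing("TURNER         ", ' '));                // TURNER
        System.out.println(stripNonNumeric("123-45-6789"));                       // 123456789
        System.out.println(impliedDecimal(stripLeading("0004224345", '0'), 2));   // 42243.45
    }
}
```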
Now we're ready to fire it all up. Here's a simple Java class that
parses the input file and prints out the beans produced:
import java.io.*;
import java.util.HashMap;
import com.blackbear.flatworm.ConfigurationReader;
import com.blackbear.flatworm.FileFormat;
import com.blackbear.flatworm.MatchedRecord;
import com.blackbear.flatworm.errors.*;
public class SimpleFlatwormExample {
    public static void main(String[] args) {
        ConfigurationReader parser = new ConfigurationReader();
        try {
            // args[0] is the XML descriptor, args[1] the flat file to parse
            FileFormat ff = parser.loadConfigurationFile(args[0]);
            InputStream in = new FileInputStream(args[1]);
            BufferedReader bufIn = new BufferedReader(new InputStreamReader(in));
            MatchedRecord results;
            // getNextRecord returns null once the input is exhausted
            while ((results = ff.getNextRecord(bufIn)) != null) {
                if (results.getRecordName().equals("newhire")) {
                    System.out.println(results.getBean("employee"));
                }
            }
        } catch (FlatwormUnsetFieldValueException flatwormUnsetFieldValueError) {
            flatwormUnsetFieldValueError.printStackTrace();
        } catch (FlatwormConfigurationValueException flatwormConfigurationValueError) {
            flatwormConfigurationValueError.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (FlatwormInvalidRecordException e) {
            e.printStackTrace();
        } catch (FlatwormInputLineLengthException e) {
            e.printStackTrace();
        } catch (FlatwormConversionException e) {
            e.printStackTrace();
        }
    }
}
The location of the configuration file is passed in as the first
argument to the method, and the file to be parsed as the second. A
ConfigurationReader object is created, and the loadConfigurationFile
method is called with the path to the file as the argument. A
FileFormat is returned. After opening the input file and morphing
it into a BufferedReader, the BufferedReader is passed in to the
getNextRecord method of the FileFormat. getNextRecord returns either
null, if the input file has been exhausted, or a MatchedRecord
object. The getRecordName method lets us know which type of record
is being returned (remembering again that a file can have several types
of records), and we can access specific beans with the getBean method.
When we run this test program, the results are as expected:
C:\j2sdk1.4.2_04\bin\java SimpleFlatwormExample simple-example.xml import1.txt
Employee@120a47e[TURNER, JAMES, 123456789, M, 42243.45]
Employee@f73c1[JONES, JOHN, 987654321, M, 1043567.45]
Process terminated with exit code 0
Defining Your Own Converters
If you want to define a new converter to use in your application,
it's quite simple. For each type to be converted, a converter has to
offer two methods:
- A method to convert a string read from the file to the target type (parsing). The signature
of such a method looks like this (where T is the type to be parsed):
public T convertT(String str, HashMap options) throws FlatwormConversionException;
- A method to convert a value of the target type into a string representation (generation). The
signature of such a method looks like this (where T is the type to be written):
public String convertT(Object obj, HashMap options);
To be a bit more specific, let's look at the definition
of convertDecimal from the CoreConverters file. First, the parsing method:
public Double convertDecimal(String str, HashMap options) throws FlatwormConversionException
{
    try
    {
        int decimalPlaces = 0;
        ConversionOption conv = (ConversionOption) options.get("decimal-places");
        String decimalPlacesOption = null;
        if (null != conv)
            decimalPlacesOption = conv.getValue();
        boolean decimalImplied = "true".equals(Util.getValue(options, "decimal-implied"));
        if (decimalPlacesOption != null)
            decimalPlaces = Integer.parseInt(decimalPlacesOption);
        if (str.length() == 0)
            return new Double(0.0D);
        if (decimalImplied)
            return new Double(Double.parseDouble(str) / Math.pow(10D, decimalPlaces));
        else
            return Double.valueOf(str);
    } catch (NumberFormatException ex)
    {
        cat.error(ex);
        throw new FlatwormConversionException(str);
    }
}
All parsing converter methods must accept exactly two arguments, a String and a
HashMap. The String contains the substring text from the input
line. The HashMap contains the key/value pairs from the
conversion-options tags. It's a good policy to call removePadding
first, since it automatically handles removing any left or right padding
as specified by the options, strips out unwanted characters, and
returns a default value if the string is empty. Converters should
return an object as opposed to a primitive, since the value must
eventually end up in a HashMap. Finally, if any errors are
encountered during processing, you should throw a
FlatwormConversionException with some useful diagnostic information.
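Putting these rules together, a custom converter for a Y/N flag (such as a "dual layer" indicator) might look roughly like the sketch below. Everything here is hypothetical: the YesNoConverter class and its convertYesNo methods are not part of Flatworm, the nested ConversionException stands in for FlatwormConversionException so the snippet compiles on its own, and the options map is simplified to plain name/value strings, whereas the listing above shows the real map holding ConversionOption objects.

```java
import java.util.HashMap;

// Hypothetical custom converter. The stand-in types keep the sketch
// compilable without the Flatworm jar on the classpath.
class YesNoConverter {
    /** Stand-in for com.blackbear.flatworm.errors.FlatwormConversionException. */
    static class ConversionException extends Exception {
        ConversionException(String message) {
            super(message);
        }
    }

    /**
     * Parsing direction: "Y"/"N" text to a Boolean object.
     * An empty field falls back to a "default-value" option if one is given
     * (simplified here to a plain String lookup).
     */
    public Boolean convertYesNo(String str, HashMap<String, String> options)
            throws ConversionException {
        String text = str.trim();
        if (text.isEmpty()) {
            String fallback = options.get("default-value");
            if (fallback == null) {
                throw new ConversionException("empty Y/N field and no default-value");
            }
            text = fallback;
        }
        if (text.equalsIgnoreCase("Y")) {
            return Boolean.TRUE;
        }
        if (text.equalsIgnoreCase("N")) {
            return Boolean.FALSE;
        }
        throw new ConversionException("expected Y or N but found '" + str + "'");
    }

    /** Generation direction: the first parameter is declared as Object. */
    public String convertYesNo(Object obj, HashMap<String, String> options) {
        if (obj == null) {
            return null;
        }
        return ((Boolean) obj).booleanValue() ? "Y" : "N";
    }

    public static void main(String[] args) throws ConversionException {
        YesNoConverter converter = new YesNoConverter();
        HashMap<String, String> options = new HashMap<>();
        System.out.println(converter.convertYesNo("Y", options));                    // true
        System.out.println(converter.convertYesNo((Object) Boolean.FALSE, options)); // N
    }
}
```

Note that the sketch returns a Boolean object rather than a boolean, and reports bad input through an exception carrying the offending text, in line with the advice above.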
Now let's take a look at the CoreConverters method for writing a decimal:
public String convertDecimal(Object obj, HashMap options)
{
    Double d = (Double) obj;
    if (d == null)
    {
        return null;
    }
    int decimalPlaces = 0;
    ConversionOption conv = (ConversionOption) options.get("decimal-places");
    String decimalPlacesOption = null;
    if (null != conv)
        decimalPlacesOption = conv.getValue();
    boolean decimalImplied = "true".equals(Util.getValue(options, "decimal-implied"));
    if (decimalPlacesOption != null)
        decimalPlaces = Integer.parseInt(decimalPlacesOption);
    DecimalFormat format = new DecimalFormat();
    format.setDecimalSeparatorAlwaysShown(!decimalImplied);
    format.setGroupingUsed(false);
    if (decimalImplied)
    {
        format.setMaximumFractionDigits(0);
        d = new Double(d.doubleValue() * Math.pow(10D, decimalPlaces));
    } else
    {
        format.setMinimumFractionDigits(decimalPlaces);
        format.setMaximumFractionDigits(decimalPlaces);
    }
    return format.format(d);
}
The generating converter methods are subject to a similar restriction as the parsing methods,
except that the first parameter must be of type Object rather than the actual attribute type, so
that Flatworm remains compatible with Java versions below 5.0.
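To see the generation direction with concrete numbers, here is a small standalone sketch that formats the salary from the first example back into its 10-character field. It borrows the implied-decimal idea from the listing above but is not the library's code: the ImpliedDecimalWriter class and the padLeft helper are assumptions made for illustration, and how Flatworm itself applies padding on output is not covered here.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Illustrative only: formats a value with an implied or explicit decimal
// point, then pads it to a fixed width. Not part of Flatworm.
class ImpliedDecimalWriter {
    static String format(double value, int decimalPlaces, boolean decimalImplied) {
        // Locale.ROOT keeps digits and the decimal separator independent of the JVM's locale
        DecimalFormat df = new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        df.setGroupingUsed(false);
        if (decimalImplied) {
            // Shift the decimal point right and drop the fraction entirely
            df.setMaximumFractionDigits(0);
            return df.format(value * Math.pow(10, decimalPlaces));
        }
        df.setMinimumFractionDigits(decimalPlaces);
        df.setMaximumFractionDigits(decimalPlaces);
        return df.format(value);
    }

    /** Hypothetical helper: right-justifies s in a field of the given width. */
    static String padLeft(String s, int width, char pad) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append(pad);
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        String digits = format(42243.45, 2, true);
        System.out.println(digits);                   // 4224345
        System.out.println(padLeft(digits, 10, '0')); // 0004224345
        System.out.println(format(49.95, 2, false));  // 49.95
    }
}
```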
Record Matching
As promised, let's look at a more complex example now. This
example combines multiple beans in a single record, and multiple record
types in a single file. Let's assume we're in the IT department at
MegaMart, and we're importing a mixed flat file containing books,
videotapes and DVDs. Unfortunately, the three product types have
three different formats.
DVD
RECORD NAME  | TYPE   | LENGTH
title        | char   | 30
studio name  | char   | 30
release date | date   | 8 (YYYYMMDD)
sku          | char   | 9
price        | double | 7
dual layer   | char   | 1
The DVD record is a single-line 85 character record, 30 characters each
for the title and studio name, an 8 character date field, 9 for the
product SKU, 7 for the price with explicit decimal point, and a single
character Y/N field that says if the DVD is dual layer.
By contrast, the videotape record is a two-line record:
Videotape
RECORD NAME | TYPE   | LENGTH
recordtype  | char   | 1 ('V')
sku         | char   | 9
price       | double | 6 (implied decimal, 2 places, zero pad)

RECORD NAME  | TYPE | LENGTH
title        | char | 30
studio       | char | 30
release date | char | 10 (YYYY-MM-DD)
This record starts with a line with a leading 'V' character followed by
the SKU and price without a decimal point, then a second line with
title, studio and release date.
Finally, the book record is a single-line record, similar to the DVD
record:
Book
RECORD NAME  | TYPE   | LENGTH
sku          | char   | 9
title        | char   | 30
author       | char   | 30
price        | double | 7 (explicit decimal)
release date | date   | 10 (YYYY-MM-DD)
Further complicating things, we want to use a common "Film" bean to store
the film-related info from both the DVD and Videotape records, but store
the rest in separate DVD or Videotape beans. Finally, some of the
dates are missing, and should be given a default value on import.
As it turns out, this is a piece of cake for Flatworm:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE file-format SYSTEM "http://www.blackbear.com/dtds/flatworm-data-description_1_0.dtd">
<file-format>
    <converter name="char" class="com.blackbear.flatworm.converters.CoreConverters" method="convertChar" return-type="java.lang.String"/>
    <converter name="decimal" class="com.blackbear.flatworm.converters.CoreConverters" method="convertDecimal" return-type="java.lang.Double"/>
    <converter name="date" class="com.blackbear.flatworm.converters.CoreConverters" method="convertDate" return-type="java.util.Date"/>
    <record name="dvd">
        <record-ident>
            <length-ident minlength="85" maxlength="85"/>
        </record-ident>
        <record-definition>
            <bean name="dvd" class="Dvd"/>
            <bean name="film" class="Film"/>
            <line>
                <record-element length="30" beanref="film.title" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="30" beanref="film.studio" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="8" beanref="film.releaseDate" type="date">
                    <conversion-option name="format" value="yyyyMMdd"/>
                    <conversion-option name="default-value" value="19990101"/>
                </record-element>
                <record-element length="9" beanref="dvd.sku" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="7" beanref="dvd.price" type="decimal">
                    <conversion-option name="justify" value="right"/>
                </record-element>
                <record-element length="1" beanref="dvd.dualLayer" type="char"/>
            </line>
        </record-definition>
    </record>
    <record name="videotape">
        <record-ident>
            <field-ident field-start="0" field-length="1">
                <match-string>V</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <bean name="video" class="Videotape"/>
            <bean name="film" class="Film"/>
            <line>
                <record-element start="1" end="10" beanref="video.sku" type="char">
                    <conversion-option name="justify" value="right"/>
                    <conversion-option name="pad-character" value="0"/>
                </record-element>
                <record-element start="10" end="16" beanref="video.price" type="decimal">
                    <conversion-option name="decimal-implied" value="true"/>
                    <conversion-option name="decimal-places" value="2"/>
                    <conversion-option name="justify" value="right"/>
                    <conversion-option name="pad-character" value="0"/>
                </record-element>
            </line>
            <line>
                <record-element start="0" end="30" beanref="film.title" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element start="30" end="60" beanref="film.studio" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element start="60" end="70" beanref="film.releaseDate" type="date">
                    <conversion-option name="default-value" value="1980-01-01"/>
                </record-element>
            </line>
        </record-definition>
    </record>
    <record name="book">
        <record-definition>
            <bean name="book" class="Book"/>
            <line>
                <record-element length="9" beanref="book.sku" type="char"/>
                <record-element length="30" beanref="book.title" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="30" beanref="book.author" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="7" beanref="book.price" type="decimal">
                    <conversion-option name="justify" value="right"/>
                </record-element>
                <record-element length="10" beanref="book.releaseDate" type="date">
                    <conversion-option name="default-value" value="1970-01-01"/>
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>
Without rehashing old ground, you can see that in this example, we have
three different scenarios for matching records. DVDs are matched
based on record length. Videotapes are matched based on a leading V
character. And books, with no record matching tags, match anything
that remains. Flatworm processes record definitions in the order
they are defined in the file, and applies the first one that successfully
matches.
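The "first definition that matches wins" rule can be pictured with a few lines of plain Java. This is only a sketch of the idea, not Flatworm's implementation: the RecordMatcherSketch class is invented for this example, the three rules are hand-coded equivalents of the record-ident tags above, and it uses lambdas from current Java.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: tries the record rules in definition order and
// returns the name of the first one that accepts the line.
class RecordMatcherSketch {
    // A LinkedHashMap is used because it preserves insertion order
    static final Map<String, Predicate<String>> RULES = new LinkedHashMap<>();

    static {
        RULES.put("dvd", line -> line.length() == 85);        // length-ident, min 85, max 85
        RULES.put("videotape", line -> line.startsWith("V")); // field-ident at 0, length 1
        RULES.put("book", line -> true);                      // no record-ident: matches anything
    }

    static String identify(String line) {
        for (Map.Entry<String, Predicate<String>> rule : RULES.entrySet()) {
            if (rule.getValue().test(line)) {
                return rule.getKey();
            }
        }
        return null; // not reached while a catch-all rule is present
    }

    public static void main(String[] args) {
        System.out.println(identify("V002346542002355"));        // videotape
        System.out.println(identify("546234476 SOME BOOK"));     // book
        // A made-up 85-character line starting with V: the dvd rule is tried first
        System.out.println(identify(String.format("%-85s", "V"))); // dvd
    }
}
```

The last call shows why the order of definitions matters: a line that satisfies more than one rule is claimed by whichever definition comes first, so the catch-all book record has to be listed last.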
You can also see multiple beans being defined in a single record, and
the use of the format conversion option with a date. Given this input
file:
DIAL J FOR JAVA               RUN ANYWHERE STUDIO           2004011555512121   49.95Y
546234476HE KNOWS WHEN YOU"RE CODING   JAVALANG OBJECT                 13.952003-11-10
V002346542002355
WHEN A STRANGER IMPLEMENTS    NULL POINTER PRODUCTIONS      2003-03-12
546543476THE GC ALWAYS RINGS TWICE     JAVAUTIL HASHMAP                23.432004-12-19
V002435542001955
DATA AND DATATYPES            PRETENTIOUS FILMS LTD
And the following test program:
import java.io.*;
import java.util.HashMap;
import com.blackbear.flatworm.ConfigurationReader;
import com.blackbear.flatworm.FileFormat;
import com.blackbear.flatworm.MatchedRecord;
import com.blackbear.flatworm.errors.*;
public class ComplexFlatwormExample {
    public static void main(String[] args) {
        ConfigurationReader parser = new ConfigurationReader();
        try {
            FileFormat ff = parser.loadConfigurationFile(args[0]);
            InputStream in = new FileInputStream(args[1]);
            BufferedReader bufIn = new BufferedReader(new InputStreamReader(in));
            MatchedRecord results;
            while ((results = ff.getNextRecord(bufIn)) != null) {
                // Each record type exposes a different set of beans
                if (results.getRecordName().equals("dvd")) {
                    System.out.println(results.getBean("dvd"));
                    System.out.println(results.getBean("film"));
                }
                if (results.getRecordName().equals("videotape")) {
                    System.out.println(results.getBean("video"));
                    System.out.println(results.getBean("film"));
                }
                if (results.getRecordName().equals("book")) {
                    System.out.println(results.getBean("book"));
                }
                System.out.println("");
            }
        } catch (FlatwormUnsetFieldValueException flatwormUnsetFieldValueError) {
            flatwormUnsetFieldValueError.printStackTrace();
        } catch (FlatwormConfigurationValueException flatwormConfigurationValueError) {
            flatwormConfigurationValueError.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (FlatwormInvalidRecordException e) {
            e.printStackTrace();
        } catch (FlatwormInputLineLengthException e) {
            e.printStackTrace();
        } catch (FlatwormConversionException e) {
            e.printStackTrace();
        }
    }
}
The following output is produced:
Dvd@3901c6[55512121, 49.95, Y]
Film@a37368[Thu Jan 15 00:00:00 EST 2004, DIAL J FOR JAVA, RUN ANYWHERE STUDIO]
Book@ae506e[Mon Nov 10 00:00:00 EST 2003, HE KNOWS WHEN YOU"RE CODING, JAVALANG OBJECT,546234476,13.95]
Videotape@ba6c83[2346542, 23.55]
Film@12a1e44[Wed Mar 12 00:00:00 EST 2003, WHEN A STRANGER IMPLEMENTS, NULL POINTER PRODUCTIONS]
Book@29428e[Sun Dec 19 00:00:00 EST 2004, THE GC ALWAYS RINGS TWICE, JAVAUTIL HASHMAP,546543476,23.43]
Videotape@161f10f[2435542, 19.55]
Film@1193779[Tue Jan 01 00:00:00 EST 1980, DATA AND DATATYPES, PRETENTIOUS FILMS LTD]
CSV files
Flatworm also supports reading and writing of CSV (comma separated values) files.
The CSV mode is activated by the optional delimit attribute of the <line> tag,
where the delimiter character (e.g. a comma, a semicolon, etc.) is specified. The following
example shows the respective part of the XML descriptor:
<?xml version="1.0" encoding="UTF-8"?>
<file-format>
    ...
    <record name="header">
        <record-ident>
            <field-ident field-start="0" field-length="14">
                <match-string>foobar</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <line delimit=";">
                <record-element length="0" type="char">
                    <conversion-option name="default-value" value="field1" />
                </record-element>
                <record-element length="0" type="char">
                    <conversion-option name="default-value" value="field2" />
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>
The example also shows that the length attribute of record elements must be set to 0, since
in CSV files the length of each field is variable, and hence meaningless.
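For comparison with the fixed-width case, splitting a delimited line by hand comes down to something like the sketch below. It is plain Java rather than Flatworm code, the DelimitedLineSketch name is made up, and it makes no attempt at quoted fields; how Flatworm treats quoting is not covered in this guide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: splits one line on a single delimiter character.
class DelimitedLineSketch {
    static List<String> split(String line, char delimiter) {
        // Pattern.quote stops characters such as '|' or '.' being read as regex syntax;
        // the limit of -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(Pattern.quote(String.valueOf(delimiter)), -1));
    }

    public static void main(String[] args) {
        System.out.println(split("field1;field2", ';')); // [field1, field2]
        System.out.println(split("a|b|", '|').size());   // 3
    }
}
```

Because field boundaries come from the delimiter rather than from positions, there is nothing for a length attribute to describe, which is why it is set to 0 in the descriptor.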
Further Reading
The JavaDoc for Flatworm provides details on the core converters provided with the
package.