A Field Guide to Parsing and Creating Flat Files
Using the Humble Flatworm

For Version 1.2
Last Revised August 01, 2007



Flat files.  Much as we live in an XML/SOAP/Web Services world, there's still a ton of data being moved around between proprietary and legacy applications that consists of fixed-length fields delimited by EOLs.  Around about the time I wrote my 20th Java class whose only purpose in life was to suck up a flat file, use String.substring to break it up into pieces, and then populate a bean with it, I decided there had to be a better way.  This package represents the fruit of that frustration.

What is Flatworm?

Flatworm is a Java library intended to allow a developer to describe the format of a flat file using an XML definition file, and then to be able to automatically read lines from that file, and have one or more beans be instantiated for each logical record.

There are a few powerful features in Flatworm worth mentioning.  For one thing, a record may consist of one or more physical lines in the file.  A record may contain more than one bean once decoded.  A flat file may contain more than one type of record, and Flatworm can use line length and substring matching to determine which type of record a line begins.

Besides fielded buffer flat files, Flatworm also supports text files where the different fields are separated by a separator character, e.g. CSV (comma separated values) files.

Last but not least, Flatworm is able to produce flat files from beans and the same definition file.

Requirements

In addition to the flatworm jar file, you will also need to have the following jars in your classpath in order for Flatworm to thrive:
Recent versions of all of these packages are available in the source jar file.

Downloading

The latest version of Flatworm is Release 1.2.  You can download it from SourceForge.

A Simple Example

Before diving into the complexities of Flatworm, let's look at a simple example that illustrates the basic operation.  Imagine the following input file which contains new hire data for a company:
NHJAMES          TURNER         M123-45-67890004224345
NHJOHN           JONES          M987-65-43210104356745
The layout of the file is as follows:
RECORD NAME   TYPE     LENGTH
recordtype    char     2
firstname     char     15
lastname      char     15
gender        char     1
ssn           char     11
salary        double   10 (2 decimal places)

We want to suck this file into a Java bean called Employee that has properties firstName, lastName, ssn, gender and salary.  These are available via the standard JavaBean mechanisms.
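
The guide does not list the Employee class itself, so the following is only a minimal sketch of what such a bean might look like. The field types, the zero-argument constructor and the toString format are assumptions, chosen to be consistent with the converter return types and the program output shown later in this section.

```java
// Hypothetical Employee bean; the real class used by the guide is not shown.
class Employee {
    private String firstName;
    private String lastName;
    private String ssn;
    private String gender;
    private Double salary;   // an object type, matching a converter that returns java.lang.Double

    // JavaBeans are expected to offer a public zero-argument constructor.
    public Employee() { }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }

    public String getSsn() { return ssn; }
    public void setSsn(String ssn) { this.ssn = ssn; }

    public String getGender() { return gender; }
    public void setGender(String gender) { this.gender = gender; }

    public Double getSalary() { return salary; }
    public void setSalary(Double salary) { this.salary = salary; }

    // Appends the property values to the default Object.toString() text.
    @Override
    public String toString() {
        return super.toString() + "[" + lastName + ", " + firstName + ", "
                + ssn + ", " + gender + ", " + salary + "]";
    }
}
```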

To do this, we start by writing the Flatworm XML descriptor for the file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE file-format SYSTEM "http://www.blackbear.com/dtds/flatworm-data-description_1_0.dtd">
<file-format>
    <converter name="char" class="com.blackbear.flatworm.converters.CoreConverters" method="convertChar" return-type="java.lang.String"/>
    <converter name="decimal" class="com.blackbear.flatworm.converters.CoreConverters" method="convertDecimal" return-type="java.lang.Double"/>
    <record name="newhire">
        <record-ident>
            <field-ident field-start="0" field-length="2">
                <match-string>NH</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <bean name="employee" class="Employee"/>
            <line>
                <record-element length="2"/>
                <record-element length="15" beanref="employee.firstName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="15" beanref="employee.lastName" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="1" beanref="employee.gender" type="char"/>
                <record-element length="11" beanref="employee.ssn" type="char">
                    <conversion-option name="strip-chars" value="non-numeric"/>
                </record-element>
                <record-element length="10" beanref="employee.salary" type="decimal">
                    <conversion-option name="decimal-places" value="2"/>
                    <conversion-option name="decimal-implied" value="true"/>
                    <conversion-option name="pad-character" value="0"/>
                    <conversion-option name="justify" value="right"/>
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>
The file-format tag is required, and specifies the beginning of the actual description.  The first thing that we must do is to register converters for the datatypes used in the file.  A number of predefined converter methods are available in the provided class com.blackbear.flatworm.converters.CoreConverters.

In order to be used in record definitions, a converter must always be registered first.  Next in the file, a record is defined.  A file may contain several different types of records; the record-ident tag is used to specify which record definition is appropriate for a given line.  There are two different ways to identify a record: by a substring match on a specific section of the line, or by the overall length of the line.  Later, you will see how multiple record types can be read from the same file; for the moment only one is defined, which matches on the characters NH (new hire) in the first two positions of the line (field-start 0, field-length 2).  If no record-ident is defined, all records will match.

Once we're sure that we are dealing with the correct record type, we can define the record.  We start by defining the beans that will be returned.  Each bean has a name which is used to reference it inside the definition, and a class (fully qualified) with which to create objects.  The class specified must have a valid zero-argument constructor.

Finally the record is broken down line by line (since a record is allowed to span multiple lines). Record-elements (fields) may be defined in terms of:
Each record element also defines the beanref (according to the standard used in the Apache Commons BeanUtils package), and the type (which should match one of the types defined at the top of the file).  Record elements also may have conversion-options, which are specific to the converter specified.  For example, in the above example, the lastName field should have any trailing spaces removed, the social security number should be stripped of all non-numeric characters, and the salary has two implied decimal places and may be left-padded with zeros which should be removed.
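
To make the effect of these conversion options concrete, here is a small plain-Java sketch that applies the same three transformations by hand to the first sample line. It does not use Flatworm at all; the class and method names are hypothetical, and the field offsets are simply derived from the layout table above.

```java
// Stand-alone illustration of what the conversion options achieve.
// This is NOT Flatworm code; all names here are made up for the example.
class ConversionOptionSketch {

    // justify=left: the value sits at the left edge, so trailing padding is dropped.
    static String stripTrailing(String field, char pad) {
        int end = field.length();
        while (end > 0 && field.charAt(end - 1) == pad) {
            end--;
        }
        return field.substring(0, end);
    }

    // strip-chars=non-numeric: keep only the digits.
    static String stripNonNumeric(String field) {
        return field.replaceAll("[^0-9]", "");
    }

    // decimal-implied=true with decimal-places=N: shift the decimal point N places left.
    static double impliedDecimal(String field, int decimalPlaces) {
        return Double.parseDouble(field) / Math.pow(10, decimalPlaces);
    }

    public static void main(String[] args) {
        // Rebuild the first sample line with explicit padding so the widths are exact.
        String line = "NH" + String.format("%-15s%-15s", "JAMES", "TURNER")
                + "M" + "123-45-6789" + "0004224345";

        System.out.println(stripTrailing(line.substring(17, 32), ' ')); // lastName
        System.out.println(stripNonNumeric(line.substring(33, 44)));    // ssn
        System.out.println(impliedDecimal(line.substring(44, 54), 2));  // salary
    }
}
```

Run on its own, the sketch prints TURNER, 123456789 and 42243.45, the same values that appear for the first employee in the program output further down.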

Now we're ready to fire it all up.  Here's a simple Java class that parses the input file and prints out the beans produced:

import java.io.*;
import java.util.HashMap;

import com.blackbear.flatworm.ConfigurationReader;
import com.blackbear.flatworm.FileFormat;
import com.blackbear.flatworm.MatchedRecord;
import com.blackbear.flatworm.errors.*;

public class SimpleFlatwormExample {
    public static void main(String[] args) {
        ConfigurationReader parser = new ConfigurationReader();
        try {
            FileFormat ff = parser.loadConfigurationFile(args[0]);
            InputStream in = new FileInputStream(args[1]);
            BufferedReader bufIn = new BufferedReader(new InputStreamReader(in));
            MatchedRecord results;
            while ((results = ff.getNextRecord(bufIn)) != null) {
                if (results.getRecordName().equals("newhire")) {
                    System.out.println(results.getBean("employee"));
                }
            }

        } catch (FlatwormUnsetFieldValueException flatwormUnsetFieldValueError) {
            flatwormUnsetFieldValueError.printStackTrace();
        } catch (FlatwormConfigurationValueException flatwormConfigurationValueError) {
            flatwormConfigurationValueError.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (FlatwormInvalidRecordException e) {
            e.printStackTrace();
        } catch (FlatwormInputLineLengthException e) {
            e.printStackTrace();
        } catch (FlatwormConversionException e) {
            e.printStackTrace();
        }
    }

}
The location of the configuration file is passed in as the first argument to the main method, and the file to be parsed as the second.  A ConfigurationReader object is created, and the loadConfigurationFile method is called with the path to the file as the argument.  A FileFormat is returned.  After opening the input file and wrapping it in a BufferedReader, the BufferedReader is passed in to the getNextRecord method of the FileFormat.  getNextRecord either returns null if the input file has been exhausted, or a MatchedRecord object.  The getRecordName method lets us know which type of record is being returned (remembering again that a file can have several types of records), and we can access specific beans with the getBean method.

When we run this test program, the results are as expected:

C:\j2sdk1.4.2_04\bin\java SimpleFlatwormExample simple-example.xml import1.txt
Employee@120a47e[TURNER, JAMES, 123456789, M, 42243.45]
Employee@f73c1[JONES, JOHN, 987654321, M, 1043567.45]
Process terminated with exit code 0

Defining Your Own Converters

If you want to define a new converter to use in your application, it's quite simple.  For each type to be converted, a converter has to offer two methods:
  1. A method to convert a string read from the file to the target type (parsing). The signature of such a method looks like this (where T is the type being parsed):
    public T convertT(String str, HashMap options) throws FlatwormConversionException;
  2. A method to convert a value of the target type into a string representation (generation). The signature of such a method looks like this (where T is the type being written):
    public String convertT(Object obj, HashMap options);
To be a bit more specific, let's look at the definition of convertDecimal from the CoreConverters class, starting with the parsing method:
    public Double convertDecimal(String str, HashMap options) throws FlatwormConversionException
    {
        try
        {
            int decimalPlaces = 0;
            ConversionOption conv = (ConversionOption) options.get("decimal-places");

            String decimalPlacesOption = null;
            if (null != conv)
                decimalPlacesOption = conv.getValue();

            boolean decimalImplied = "true".equals(Util.getValue(options, "decimal-implied"));

            if (decimalPlacesOption != null)
                decimalPlaces = Integer.parseInt(decimalPlacesOption);

            if (str.length() == 0)
                return new Double(0.0D);

            if (decimalImplied)
                return new Double(Double.parseDouble(str) / Math.pow(10D, decimalPlaces));
            else
                return Double.valueOf(str);

        } catch (NumberFormatException ex)
        {
            cat.error(ex);
            throw new FlatwormConversionException(str);
        }
    }
All parsing converter methods must accept exactly two arguments, a String and a HashMap.  The String contains the substring text from the input line.  The HashMap contains the key/value pairs from the conversion-options tags.  It's a good policy to call removePadding first, since it automatically handles removing any left or right padding as specified by the options, strips out unwanted characters, and returns a default value if the string is empty.  Converters should return an object as opposed to a primitive, since the value must eventually end up in a HashMap.  Finally, if any errors are encountered during processing, you should throw a FlatwormConversionException with some useful diagnostic information.

Now let's take a look at the CoreConverters method for writing a decimal:
    public String convertDecimal(Object obj, HashMap options)
    {
        Double d = (Double) obj;
        if (d == null)
        {
            return null;
        }

        int decimalPlaces = 0;
        ConversionOption conv = (ConversionOption) options.get("decimal-places");

        String decimalPlacesOption = null;
        if (null != conv)
            decimalPlacesOption = conv.getValue();

        boolean decimalImplied = "true".equals(Util.getValue(options, "decimal-implied"));

        if (decimalPlacesOption != null)
            decimalPlaces = Integer.parseInt(decimalPlacesOption);

        DecimalFormat format = new DecimalFormat();
        format.setDecimalSeparatorAlwaysShown(!decimalImplied);
        format.setGroupingUsed(false);
        if (decimalImplied)
        {
            format.setMaximumFractionDigits(0);
            d = new Double(d.doubleValue() * Math.pow(10D, decimalPlaces));
        } else
        {
            format.setMinimumFractionDigits(decimalPlaces);
            format.setMaximumFractionDigits(decimalPlaces);
        }
        return format.format(d);
    }
The generating converter methods are subject to a similar restriction as the parsing methods, except that the first parameter must be of type Object. It is declared as Object rather than as the actual attribute type so that Flatworm remains compatible with Java versions below 5.0.
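
The implied-decimal branch of the writing method can be tried out in isolation. The following sketch reproduces only that arithmetic in plain Java, without any Flatworm classes. The final zero-padding step is an assumption about what the pad-character and justify options would add afterwards (the converter shown above returns just the digits), and the class and method names are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Stand-alone illustration of writing a value with an implied decimal point.
// NOT Flatworm code; names and the padding step are assumptions for the example.
class ImpliedDecimalWriteSketch {

    static String formatImplied(double value, int decimalPlaces, int width) {
        // Locale.ROOT keeps the digits ASCII regardless of the platform default.
        DecimalFormat format = new DecimalFormat("0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        format.setGroupingUsed(false);
        format.setMaximumFractionDigits(0);

        // Shift the decimal point to the right, then round to a whole number.
        String digits = format.format(value * Math.pow(10, decimalPlaces));

        // Left-pad with zeros up to the field width (non-negative values only).
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(formatImplied(42243.45, 2, 10));
        System.out.println(formatImplied(1043567.45, 2, 10));
    }
}
```

With a width of 10 and two decimal places, the two salaries from the simple example come out as the same ten-character fields found in the input file, which makes this a handy round-trip check when writing a converter pair of your own.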

Record Matching

As promised, let's look at a more complex example now.  This example combines multiple beans in a single record, and multiple record types in a single file.  Let's assume we're in the IT department at MegaMart, and we're importing a mixed flat file containing books, videotapes and DVDs.  Unfortunately, the three product types have three different formats.

DVD

RECORD NAME    TYPE     LENGTH
title          char     30
studio name    char     30
release date   date     8 (YYYYMMDD)
sku            char     9
price          double   7
dual layer     char     1

The DVD record is a single-line 85 character record, 30 characters each for the title and studio name, an 8 character date field, 9 for the product SKU, 7 for the price with explicit decimal point, and a single character Y/N field that says if the DVD is dual layer.

By contrast, the videotape record is a two-line record:

Videotape

RECORD NAME    TYPE     LENGTH
recordtype     char     1 ('V')
sku            char     9
price          double   6 (implied decimal, 2 places, zero pad)

RECORD NAME    TYPE     LENGTH
title          char     30
studio         char     30
release date   char     10 (YYYY-MM-DD)

This record starts with a line with a leading 'V' character followed by the SKU and price without a decimal point, then a second line with title, studio and release date.
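
As a point of comparison, this is roughly what slicing such a two-line record by hand would involve. The sketch uses only String.substring with the offsets from the two tables above; the class and method names are hypothetical, and treating each end offset as exclusive is an assumption borrowed from String.substring.

```java
// Hand-rolled slicing of the two-line videotape record, for illustration only.
// NOT Flatworm code; names are made up and offsets come from the layout tables.
class VideotapeRecordSketch {

    // Drops leading pad characters from a right-justified field.
    static String stripLeading(String field, char pad) {
        int start = 0;
        while (start < field.length() && field.charAt(start) == pad) {
            start++;
        }
        return field.substring(start);
    }

    // Line 1: 'V', then a 9 character SKU, then a 6 character price.
    static String sku(String firstLine) {
        return stripLeading(firstLine.substring(1, 10), '0');
    }

    static double price(String firstLine) {
        // Two implied decimal places.
        return Double.parseDouble(firstLine.substring(10, 16)) / 100.0;
    }

    // Line 2: 30 character title, 30 character studio, 10 character date.
    static String title(String secondLine) {
        return secondLine.substring(0, 30).trim();
    }

    static String studio(String secondLine) {
        return secondLine.substring(30, 60).trim();
    }

    static String releaseDate(String secondLine) {
        return secondLine.substring(60, 70);
    }

    public static void main(String[] args) {
        String first = "V002346542002355";
        String second = String.format("%-30s%-30s%s",
                "WHEN A STRANGER IMPLEMENTS", "NULL POINTER PRODUCTIONS", "2003-03-12");

        System.out.println(sku(first) + ", " + price(first));
        System.out.println(title(second) + ", " + studio(second) + ", " + releaseDate(second));
    }
}
```

Every new record type means another class like this one, which is exactly the repetitive work a declarative description is meant to replace.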

Finally, the book record is a single-line record, similar to the DVD record:

Book

RECORD NAME    TYPE     LENGTH
sku            char     9
title          char     30
author         char     30
price          double   7 (explicit decimal)
release date   date     10 (YYYY-MM-DD)

Further complicating things, we want to use a common "Film" bean to store the film-related info from both the DVD and Videotape records, but store the rest in separate DVD or Videotape beans.  Finally, some of the date fields are missing, and should be given a default value on import. As it turns out, this is a piece of cake for Flatworm:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE file-format SYSTEM "http://www.blackbear.com/dtds/flatworm-data-description_1_0.dtd">
<file-format>
    <converter name="char" class="com.blackbear.flatworm.converters.CoreConverters" method="convertChar" return-type="java.lang.String"/>
    <converter name="decimal" class="com.blackbear.flatworm.converters.CoreConverters" method="convertDecimal" return-type="java.lang.Double"/>
    <converter name="date" class="com.blackbear.flatworm.converters.CoreConverters" method="convertDate" return-type="java.util.Date"/>
    <record name="dvd">
        <record-ident>
            <length-ident minlength="85" maxlength="85"/>
        </record-ident>
        <record-definition>
            <bean name="dvd" class="Dvd"/>
            <bean name="film" class="Film"/>
            <line>
                <record-element length="30" beanref="film.title" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="30" beanref="film.studio" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="8" beanref="film.releaseDate" type="date">
                    <conversion-option name="format" value="yyyyMMdd"/>
                    <conversion-option name="default-value" value="19990101"/>
                </record-element>
                <record-element length="9" beanref="dvd.sku" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="7" beanref="dvd.price" type="decimal">
                    <conversion-option name="justify" value="right"/>
                </record-element>
                <record-element length="1" beanref="dvd.dualLayer" type="char"/>
            </line>
        </record-definition>
    </record>
    <record name="videotape">
        <record-ident>
            <field-ident field-start="0" field-length="1">
                <match-string>V</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <bean name="video" class="Videotape"/>
            <bean name="film" class="Film"/>
            <line>
                <record-element start="1" end="10" beanref="video.sku" type="char">
                    <conversion-option name="justify" value="right"/>
                    <conversion-option name="pad-character" value="0"/>
                </record-element>
                <record-element start="10" end="16" beanref="video.price" type="decimal">
                    <conversion-option name="decimal-implied" value="true"/>
                    <conversion-option name="decimal-places" value="2"/>
                    <conversion-option name="justify" value="right"/>
                    <conversion-option name="pad-character" value="0"/>
                </record-element>
            </line>
            <line>
                <record-element start="0" end="30" beanref="film.title" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element start="30" end="60" beanref="film.studio" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element start="60" end="70" beanref="film.releaseDate" type="date">
                    <conversion-option name="default-value" value="1980-01-01"/>
                </record-element>
            </line>
        </record-definition>
    </record>
    <record name="book">
        <record-definition>
            <bean name="book" class="Book"/>
            <line>
                <record-element length="9" beanref="book.sku" type="char"/>
                <record-element length="30" beanref="book.title" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="30" beanref="book.author" type="char">
                    <conversion-option name="justify" value="left"/>
                </record-element>
                <record-element length="7" beanref="book.price" type="decimal">
                    <conversion-option name="justify" value="right"/>
                </record-element>
                <record-element length="10" beanref="book.releaseDate" type="date">
                    <conversion-option name="default-value" value="1970-01-01"/>
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>

Without rehashing old ground, you can see that in this example, we have three different scenarios for matching records.  DVDs are matched based on record length. Videotapes are matched based on a leading V character.  And books, with no record matching tags, match anything that remains.  Flatworm processes record definitions in the order they are defined in the file, and applies the first one that successfully matches.
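
The first-match-wins behaviour described above can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Flatworm's actual implementation; the class name, the map of rules and the identify method are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Illustration of ordered, first-match-wins record identification.
// NOT Flatworm code; every name here is made up for the example.
class RecordMatchSketch {

    // A LinkedHashMap preserves insertion order, mirroring the order of the record tags.
    static final Map<String, Predicate<String>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("dvd", line -> line.length() == 85);        // length match: exactly 85 characters
        RULES.put("videotape", line -> line.startsWith("V")); // substring match at position 0
        RULES.put("book", line -> true);                      // no identification: matches anything
    }

    // Returns the name of the first rule that accepts the line, or null if none does.
    static String identify(String line) {
        for (Map.Entry<String, Predicate<String>> rule : RULES.entrySet()) {
            if (rule.getValue().test(line)) {
                return rule.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(identify(String.format("%-85s", "DIAL J FOR JAVA")));
        System.out.println(identify("V002346542002355"));
        System.out.println(identify("546234476HE KNOWS WHEN"));
    }
}
```

One consequence of this ordering is worth keeping in mind: in this model an 85 character line that happens to start with V is classified as a DVD, because the length rule is consulted first. The catch-all record therefore belongs at the end of the definition file.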

You can also see multiple beans being defined in a single record, and the use of the format conversion option with a date. Given this input file:
DIAL J FOR JAVA               RUN ANYWHERE STUDIO           2004011555512121   49.95Y
546234476HE KNOWS WHEN YOU"RE CODING   JAVALANG OBJECT                 13.952003-11-10
V002346542002355
WHEN A STRANGER IMPLEMENTS    NULL POINTER PRODUCTIONS      2003-03-12
546543476THE GC ALWAYS RINGS TWICE     JAVAUTIL HASHMAP                23.432004-12-19
V002435542001955
DATA AND DATATYPES            PRETENTIOUS FILMS LTD
And the following test program:
import java.io.*;
import java.util.HashMap;

import com.blackbear.flatworm.ConfigurationReader;
import com.blackbear.flatworm.FileFormat;
import com.blackbear.flatworm.MatchedRecord;
import com.blackbear.flatworm.errors.*;

public class ComplexFlatwormExample {
    public static void main(String[] args) {
        ConfigurationReader parser = new ConfigurationReader();
        try {
            FileFormat ff = parser.loadConfigurationFile(args[0]);
            InputStream in = new FileInputStream(args[1]);
            BufferedReader bufIn = new BufferedReader(new InputStreamReader(in));

            MatchedRecord results;
            while ((results = ff.getNextRecord(bufIn)) != null) {
                if (results.getRecordName().equals("dvd")) {
                    System.out.println(results.getBean("dvd"));
                    System.out.println(results.getBean("film"));
                }
                if (results.getRecordName().equals("videotape")) {
                    System.out.println(results.getBean("video"));
                    System.out.println(results.getBean("film"));
                }
                if (results.getRecordName().equals("book")) {
                    System.out.println(results.getBean("book"));
                }
                System.out.println("");
            }

        } catch (FlatwormUnsetFieldValueException flatwormUnsetFieldValueError) {
            flatwormUnsetFieldValueError.printStackTrace();
        } catch (FlatwormConfigurationValueException flatwormConfigurationValueError) {
            flatwormConfigurationValueError.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (FlatwormInvalidRecordException e) {
            e.printStackTrace();
        } catch (FlatwormInputLineLengthException e) {
            e.printStackTrace();
        } catch (FlatwormConversionException e) {
            e.printStackTrace();
        }
    }

}
The following output is produced:
Dvd@3901c6[55512121, 49.95, Y]
Film@a37368[Thu Jan 15 00:00:00 EST 2004, DIAL J FOR JAVA, RUN ANYWHERE STUDIO]

Book@ae506e[Mon Nov 10 00:00:00 EST 2003, HE KNOWS WHEN YOU"RE CODING, JAVALANG OBJECT,546234476,13.95]

Videotape@ba6c83[2346542, 23.55]
Film@12a1e44[Wed Mar 12 00:00:00 EST 2003, WHEN A STRANGER IMPLEMENTS, NULL POINTER PRODUCTIONS]

Book@29428e[Sun Dec 19 00:00:00 EST 2004, THE GC ALWAYS RINGS TWICE, JAVAUTIL HASHMAP,546543476,23.43]

Videotape@161f10f[2435542, 19.55]
Film@1193779[Tue Jan 01 00:00:00 EST 1980, DATA AND DATATYPES, PRETENTIOUS FILMS LTD]

CSV files

Flatworm also supports reading and writing of CSV (comma separated values) files. The CSV mode is activated by the optional delimit attribute of the <line> tag, where the delimiter character (e.g. a comma, a semicolon, etc.) is specified. The following example shows the respective part of the XML descriptor:
<?xml version="1.0" encoding="UTF-8"?>
<file-format>

    ...
    
    <record name="header">
        <record-ident>
            <field-ident field-start="0" field-length="14">
                <match-string>foobar</match-string>
            </field-ident>
        </record-ident>
        <record-definition>
            <line delimit=";">
                <record-element length="0" type="char">
                    <conversion-option name="default-value" value="field1" />
                </record-element>
                <record-element length="0" type="char">
                    <conversion-option name="default-value" value="field2" />
                </record-element>
            </line>
        </record-definition>
    </record>
</file-format>
The example also shows that the length attribute of record elements must be set to 0: in CSV files the length of each field is variable, so a fixed length is meaningless.
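
For readers unfamiliar with delimited files, the following plain-Java sketch shows the basic idea of breaking a line at a delimiter instead of at fixed offsets. It is a deliberately naive illustration with hypothetical names: it does not handle quoted fields, and it makes no claim about how Flatworm itself splits lines.

```java
import java.util.regex.Pattern;

// Naive delimiter-based field splitting, for illustration only.
// NOT Flatworm code; quoting and escaping are not handled.
class DelimitedLineSketch {

    static String[] splitFields(String line, char delimiter) {
        // Pattern.quote keeps regex metacharacters such as '|' or '.' literal;
        // the -1 limit keeps trailing empty fields instead of dropping them.
        return line.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    public static void main(String[] args) {
        for (String field : splitFields("field1;field2", ';')) {
            System.out.println(field);
        }
    }
}
```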

Further Reading

The JavaDoc for Flatworm provides details on the core converters provided with the package.