August 01, 2005 02:44
Programming, Java, Design Patterns, ATG Dynamo
For the last project I had, we depended on data from the business users. There wasn't very much data, but it was very variable and being revised in an excel spreadsheet on a daily basis. Not only that, but it was changing in a structural basis as well -- new rows and columns would be added, links between workbooks were used. We couldn't use a snapshot, because the snapshot would be obsolete in short order, and the business users couldn't use a database that would restrict them.

In the past, the business users had transcribed the excel data that they had one row at a time into the database through a web or Swing UI. We needed something faster this time around. We needed a way to import the excel spreadsheet directly.

This idea had appeal on an intuitive level. The business users could work in the structure they were most experienced at, and we could get fresh data almost as soon as they'd revised it. We knew that there were some data validation issues where values had to be restricted to a list, but we thought that Excel's validation features would cope with that.

I looked into reading Excel in its XML dialect, but quickly discarded the idea after eyeballing the XML Excel generated itself. Instead, I looked for a library which would do the work of conversion for me, and present an easy to use API.

There was only one freely-available Java library that handled Excel spreadsheets, and that was POI. POI is a library written specifically to handle Microsoft Office formats, and their solution for Excel was called HSSF (Horrible SpreadSheet Format).

On the database side of things, we had an advantage in that we were using ATG's persistence solution, the SQL repository (or Data Anywhere Architecture, as marketing calls it). Because the SQL repository can be fed data imported through XML, we could convert the Excel spreadsheet to a file in XML format, and then run startSQLRepository to import data.

The documentation for HSSF was eclectic, but servicable. Once I'd found the document to read an Excel document, I could iterate through the rows, and call row.getCellAt(0) to get the first cell in the row.

Once I'd done this, I copied the data from the row to a JavaBean object, passed a list of those object to the export class, and generated XML from them.

There were a number of issues that popped up in implementing this solution. The first one was that POI is abandonware. The last update was from 2004. Using Excel XP with some features (validation of specific cells, for example) cause NullPointerExceptions when it tries to parse it. The solution here was to not use those features.

Another issue was that the Excel data was linked between different workbooks, but was not linked tightly together. Because all the fields were text, any text that was not specifically validated had to match between workbooks exactly. Identifying data mismatches between workbooks was a time consuming process, and I didn't think of a good way to automate it.

Finally, as new data was added, the columns would be moved around or added to as the business users tweaked it. Sometimes I would be notified of this change, and sometimes I would find out by having the script fail.

Importing data directly from an Excel spreadsheet is a good solution in many circumstances. However, it's probably a good idea to have a 'business' that can be quickly modified, and a 'data' spreadsheet with immutable columns specifically for data export. Given a static schema, it's easy to use and extend, and is much easier to modify than tweaking XML directly.

« Flex | Home | Enterprise Architect »

There's also JExcelAPI which is quite good and far less "abandonware" thant POI. You should probably check it out. The last time I used it, it was more powerful than POI and finally easier to use.
http://jexcelapi.sourceforge.net/

What about turning the problem on its ass, and writing VBA code to generate your target XML output from within the spreadsheet. That way you're guaranteed full and correct access to the data. You can put that code into a seperate workbook on your machine and have it reach out into the files you get from your users (i.e. so you don't need to acutally modify what they're sending you). I'm not familiar enough with Office development to say if there's an even nicer .NET way of doing this... ahc.

AHC, there's a couple of reasons that won't work.

One is that we need source code to be under version control, and having the VBA code inside the spreadsheet would make it inaccessible.

Second, VBA source code is not well understood by everyone at the client site. It would involve giving up Eclipse, the build system, a bunch of metrics, etc.

Third, we'd need to copy over the VBA code for every spreadsheet.

Fourth, it would bulk up the size of the spreadsheet and make it more difficult to email. Spreadsheets can get hefty.

name
url