Microsoft .NET Case Study: Import Excel/CSV Files via Xml/Xsl
Large-scale ETL from Excel and CSV Files
Our client is a software manufacturer that develops a comprehensive ERP system. Their customers needed a
facility that would extract, transform and load (ETL) data from many files in many locations. They asked us
to develop this system because their existing system was far too slow and inaccurate. Here's what we had to do:
Input Variations:
Accepts both CSV and Excel input.
Accepts comma-delimited (CSV) import files and strips enclosing quotation marks, if present.
For CSV input, optionally ignores column headings in the first row.
Supports both Excel 2003 and 2007 file formats.
Accepts orders in 3 different Excel configurations: one order with multiple items per row on one sheet; one order with items in
multiple rows on one sheet; and one order per row in a header sheet with items in multiple rows in a detail sheet.
The mapping of input file columns to database table columns is controlled by user-defined import templates.
Imports user-defined variable fields for orders and order items.
Supports user-defined generic recipient attributes.
Supports user-defined enumerated values for the Region, Format and Language recipient attributes.
Ignores blank lines in import files.
Windows.Forms GUI can process multiple files that share a common filename prefix with a single invocation.
Generates empty Excel import files that conform to a selected template to make it easy for new users to get started.
Validation and Database Update Features:
Imports sales orders, contact lists, recipients and shipping requests. Also exports contact lists.
Validates data that spans 35 tables across 3 databases, translating user-friendly lookup table codes
into their corresponding unique database IDs.
Outputs detailed error logs to Excel that can be quickly edited and resubmitted.
An Excel error log looks like the input file with the addition of a right-most “Errors” column.
Outputs success logs to Excel so that there is an audit trail of what worked.
Moves input files to Archive directory so that users know what has been processed.
Displays an import error log and opens the error log in Excel on demand.
Architectural Features:
Field validations, e.g. required fields and datatypes, are driven by Xsd schemas.
If the host has a multi-core CPU, the system uses parallel tasks to improve performance via the
Parallel Extensions to .NET Framework 3.5.
Packaged as a WCF service to support third party access and interfaces.
Includes comprehensive automated unit tests that can run unattended.
Integrates with log4net error logging.
Recipient import rules are consistent whether they are imported stand-alone, inside ContactLists or inside
Sales Orders because all of the code is reused.
Used inheritance extensively for data validation classes; the Xsd schema determined which classes were instantiated.
Lookup table validations were lazy-loaded and cached.
ETL System Architecture
We first tried to locate third-party software that would do all of the above. But two strong requirements necessitated custom development:
Each customer uses a different subset of input fields, and the mapping of fields to columns is also customer-specific.
The templates that define the mapping are stored in a database.
There could be no runtime licenses or additional installation processes.
The architecture that we planned appears above. We had customers with both Windows.Forms and Web interfaces.
We wanted to isolate shared components behind WCF so that they could be maintained in one place. We developed and
tested the WCF interface, but ended up bundling the assemblies without it. We came away with a few lessons learned:
We wanted to reuse Xml schema validations rather than write our own. But we were required to report errors by input row.
So, we marked each element that corresponded to an input row with a row number attribute. We developed a bottom-up technique
for validating XmlDocument sub-trees against the schema. When a validator found an error, we attached an error attribute to
the root of the sub-tree being validated. We only validated the next higher level if all the lower-level validations passed.
At the end of processing, we concatenated all of the errors below elements with row number attributes and displayed them in
the rightmost column of the appropriate row of the Excel output file.
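A minimal sketch of the row-tagging and error-aggregation idea follows. The class, attribute names (rowNumber, error) and sample Xml are illustrative assumptions, not the original API:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Illustrative sketch: each element that corresponds to an input row carries
// a rowNumber attribute. Validators attach an "error" attribute to the root
// of any sub-tree they reject; a final pass concatenates the errors per row
// to fill the rightmost column of the Excel error log.
static class RowErrorAggregator
{
    public static string ErrorsForRow(XmlElement row)
    {
        var messages = new List<string>();
        // Collect errors on the row element itself and on any descendant.
        foreach (XmlElement el in row.SelectNodes("descendant-or-self::*[@error]"))
            messages.Add(el.GetAttribute("error"));
        return string.Join("; ", messages.ToArray());
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<Orders><Order rowNumber='2'>" +
                    "<Qty error='Qty must be numeric'>abc</Qty>" +
                    "</Order></Orders>");
        var row = (XmlElement)doc.SelectSingleNode("//Order[@rowNumber='2']");
        Console.WriteLine(ErrorsForRow(row)); // Qty must be numeric
    }
}
```

Validating bottom-up and stopping at the first failing level, as described above, keeps an invalid leaf from producing a cascade of redundant parent-level errors for the same row.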
XmlDocument validations are CPU-intensive. Fortunately, they can be executed in parallel in a Parallel.For loop supported
by the Parallel Extensions to .NET 3.5.
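The parallel validation pattern can be sketched as follows; the class name, payloads and the trivial "has a Qty element" check are placeholders for the real schema validation:

```csharp
using System;
using System.Threading.Tasks;
using System.Xml;

// Sketch of per-document validation in a Parallel.For loop. Each iteration
// builds and checks its own XmlDocument, so there is no shared mutable state
// beyond the per-index result slot.
static class ParallelValidator
{
    public static bool[] ValidateAll(string[] xmlPayloads)
    {
        var valid = new bool[xmlPayloads.Length];
        Parallel.For(0, xmlPayloads.Length, i =>
        {
            var doc = new XmlDocument();          // one document per iteration
            doc.LoadXml(xmlPayloads[i]);
            valid[i] = doc.DocumentElement.SelectSingleNode("Qty") != null;
        });
        return valid;
    }

    static void Main()
    {
        var results = ValidateAll(new[]
        {
            "<Order><Qty>1</Qty></Order>",
            "<Order><Note/></Order>",
        });
        Console.WriteLine(results[0] + " " + results[1]); // True False
    }
}
```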
Excel import is done by opening the file on disk as though it were a database via a Jet database provider. When we were
developing this application, there was no 64-bit Jet DB provider. Therefore, we planned to move the Excel-to-Xml translation
to the other side of the interface so that the WCF processes could run on an x64 server.
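For context, the Jet-style connection strings look roughly like the sketch below (the helper is ours; the provider names and Extended Properties values are the standard Jet/ACE conventions). A worksheet is then queried like a table, e.g. SELECT * FROM [Sheet1$]:

```csharp
using System;

// Connection strings for opening a workbook as a database. The Jet 4.0
// provider reads .xls (Excel 2003) but is 32-bit only; the ACE 12.0
// provider (a separate redistributable) reads .xlsx and has a 64-bit
// build. HDR controls whether the first row is treated as column headings.
static class ExcelConnection
{
    public static string Build(string path, bool firstRowIsHeader, bool isXlsx)
    {
        string hdr = firstRowIsHeader ? "Yes" : "No";
        if (isXlsx)
            return "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path +
                   ";Extended Properties=\"Excel 12.0 Xml;HDR=" + hdr + "\"";
        return "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path +
               ";Extended Properties=\"Excel 8.0;HDR=" + hdr + "\"";
    }

    static void Main()
    {
        Console.WriteLine(Build("Orders.xls", true, false));
    }
}
```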
We needed to call Microsoft XPath Extension Functions in Xslt for metadata.
The SqlXml bulk loader is packaged as a COM object that reads and writes files in its own single-threaded apartment.
Therefore, we had to launch the SqlXml BulkLoader in its own STA thread when running under ASP.NET.
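The threading pattern is sketched below; the actual COM call to the bulk loader is elided and indicated only by a comment:

```csharp
using System;
using System.Threading;

// The SQLXML Bulk Load component is a COM object that must run in a
// single-threaded apartment (STA). ASP.NET worker threads are MTA, so the
// load is launched on a dedicated STA thread and joined.
static class StaLauncher
{
    public static ApartmentState RunOnStaThread(Action work)
    {
        ApartmentState observed = ApartmentState.Unknown;
        var t = new Thread(() =>
        {
            observed = Thread.CurrentThread.GetApartmentState();
            work(); // e.g. create the SQLXMLBulkLoad object and Execute it here
        });
        t.SetApartmentState(ApartmentState.STA); // must be set before Start()
        t.Start();
        t.Join();
        return observed;
    }

    static void Main()
    {
        Console.WriteLine(RunOnStaThread(() => { })); // STA
    }
}
```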
It is much faster to use the SqlXml BulkLoader to populate staging tables and then use SQL to perform updates than to
process UpdateGrams directly.
VB.NET supports Xml as first-class literals with expression holes that make relatively short work of Xml conversion tasks.
This feature earned VB.NET its own assembly in our otherwise C# solution.
WCF can't serialize XmlDocuments, just elements. So, we had to pass root elements instead of documents.
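A small sketch of that workaround, round-tripping the root element in memory with the same DataContractSerializer that WCF uses (the class and sample Xml are illustrative):

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Xml;

// DataContractSerializer handles XmlElement but not XmlDocument, so the
// document's root element is what crosses the WCF boundary; the receiving
// side wraps it in a fresh XmlDocument.
static class ElementRoundTrip
{
    public static XmlElement RoundTrip(XmlElement element)
    {
        var serializer = new DataContractSerializer(typeof(XmlElement));
        var buffer = new MemoryStream();
        serializer.WriteObject(buffer, element);    // works for XmlElement
        buffer.Position = 0;
        return (XmlElement)serializer.ReadObject(buffer);
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<Orders><Order rowNumber='1'/></Orders>");

        XmlElement received = RoundTrip(doc.DocumentElement);

        // Receiver re-hydrates a document around the element:
        var rebuilt = new XmlDocument();
        rebuilt.AppendChild(rebuilt.ImportNode(received, true));
        Console.WriteLine(rebuilt.DocumentElement.Name); // Orders
    }
}
```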
Our Performance Objectives Were Met
With excellent guidance from our client's development staff, WinMetrics developed over 22,000
lines of C# in less than 6 months elapsed time to hit the target:
Imports 10,000 new recipients, addresses and credit cards in less than 65 seconds on a quad-core PC in debug mode.
Imports 5,000 new orders with 5,000 new order items in just over one minute on a quad-core PC in debug mode.
"Carl Kelley is a thoroughbred professional who loves his job and is extremely good at it.
I worked with Carl in re-engineering an ETL solution in our product suite. It was a complicated part of our Order
Entry that dealt with multiple file types, user defined formats, high volumes and complex business rules. He speeded
it up by 15 times while enhancing it to run with a web interface as well. Carl not only grasps the big picture pretty
quickly, but dives deeper to find holes and mishaps waiting to happen. He exhibited an enormous talent for spotting
things that could go wrong before they went wrong. He was then very persistent in coming up with elegant solutions to
plug them. He is also a very keen software engineer, who was not satisfied with code that works once. He was very
methodical about going back and re-engineering or refactoring code whenever he had to program new requirements.
His work ethic was a treat to watch: working long hours; persistent and methodical. Carl also has an excellent sense of
humor and is a great communicator of ideas, concepts and issues. He also juggled multiple projects,
never once losing his élan no matter, as he would say, how hot the kitchen got." -- Srinivasan Venkataraman, Senior Developer