Flat files, despite being a venerable technology, are still in common use across countless Data Warehouse ETL/LT processes. While many (perhaps most) flat files follow the standard of using comma- or tab- delimiters, with frequency, users encounter files for data testing with QuerySurge that use unconventional delimiters. This article shows the setup for testing the data in flat files with nonstandard delimiters.
Sample Data for this article
For the examples in this article, we use some basic sales data. In its "true" CSV format, our sample data looks like:
Row_Number,source_order_id,category,order_dt,ship_dt,status
1,1,Ground,2007-04-14,2007-04-17,Shipped
2,2,Overnight,2007-04-16,null,Pending
3,3,Express,2007-04-13,null,Ordered
Files with Multicharacter Delimiters
Files with multicharacter delimiters are uncommon but not unheard-of. QuerySurge's builtin flat file driver handles multicharacter delimiters. As with all unconventional delimiters, a multicharacter delimiter is set up using the Other option in the Delimiter Options section of the QuerySurge Connection Wizard.
The following sample flat file rows use the sequence '@$$@' as delimiter:
Row_Number@$$@source_order_id@$$@category@$$@order_dt@$$@ship_dt@$$@status
1@$$@1@$$@Ground@$$@2007-04-14@$$@2007-04-17@$$@Shipped
2@$$@2@$$@Overnight@$$@2007-04-16@$$@null@$$@Pending
3@$$@3@$$@Express@$$@2007-04-13@$$@null@$$@Ordered
The setup in the Connection Wizard Delimiter Options looks like this:
Files with "C-style Escape" Delimiters
QuerySurge supports some of the C-style Escape characters as delimiters. The most common of these, the tab escape ('\t') is handled as a standard choice due to its continued frequency. Less common options are the "bell" character ('\a') and the vertical tab ('\v'), which are handled using their escapes as convenience sytnax.
The following sample flat file rows use the bell character ('\a') as delimiter:
Row_NumberBELsource_order_idBELcategoryBELorder_dtBELship_dtBELstatus
1BEL1BELGroundBEL2007-04-14BEL2007-04-17BELShipped
2BEL2BELOvernightBEL2007-04-16BELnullBELPending
3BEL3BELExpressBEL2007-04-13BELnullBELOrdered
The setup in the Connection Wizard Delimiter Options looks like this:
Files with Hex-specified Delimiters
In some cases, you may find yourself confronted with files delimited by a non-printing character. Examples might come from the ASCII control character range, such as the "Start of Heading" (SOH) character or the "Group Separator" (GS) character. Delimiters like these can be set up in the Connection Wizard, using their hex values with the "0x" syntax.
The following sample flat file rows use the GS character (hex value: 0x001d) as delimiter:
Row_NumberGSsource_order_idGScategoryGSorder_dtGSship_dtGSstatus
1GS1GSGroundGS2007-04-14GS2007-04-17GSShipped
2GS2GSOvernightGS2007-04-16GSnullGSPending
3GS3GSExpressGS2007-04-13GSnullGSOrdered
The setup in the Connection Wizard Delimiter Options looks like this:
Note: The built-in CSV driver handles hex-defined delimiter characters in the range 0x0000 - 0x007f only. These comprise the ASCII control characters and the ASCII printable characters. The extended ASCII characters (0x0080 - 0x00ff) are not supported. If you need to handle delimiters in this unsupported range (or higher), solutions include: a) preprocessing your file with a custom function to change the delimiter, or b) using a commercial driver that supports a broader delimiter range.
Comments
0 comments
Article is closed for comments.