Flat Files with the Byte Order Mark
The Byte Order Mark (BOM) is a Unicode character that may be used as a signal at the beginning of a text stream. Because of this, it can end up as the first character in a Flat File. QuerySurge handles the BOM. However, for Flat Files that have a header row, the BOM is pre-pended to the name of the first column, and for Flat Files with no header row, it is pre-pended to the first data field. This can lead to a variety of issues: if the BOM is pre-pended to a column name, then if the column is explicitly used in a SQL query, the BOM must be part of the column name. If the BOM is pre-pended to a data field, the comparison will have to be constructed to account for the presence of the BOM in the first field.
The value of the BOM is 0xFEFF. The BOM Unicode representation is: 0xEF,0xBB,0xBF. It may appear as the following characters in your data, if the bytes have been re-encoded as Cp1252:
If you see these characters appear, it is likely that your file begins with the BOM.
Note: If you copy the characters above and try to use them in (for example) your SQL, this will not work, because these characters are not the BOM, they are the Cp1252 encoding of the bytes of the BOM. See Approach (3) below for further discussion of this approach.
Approaches for Working with the Byte Order Mark
- Manually pre-process the file to remove the BOM. This only practical if there are a small number of files involved.
- Pre-process the file to remove the BOM using an automated method. QuerySurge's Flat File JDBC driver accepts custom functions that can be called in QuerySurge to pre-process files. A custom function can be implemented to remove the BOM from the beginning of the file. The custom function is called as part of the QueryPair before the SELECT on the file is executed. You can find an example of a custom function discussed here.
- If headers are in the file, pre-pend the BOM to the first column name in your query. This is as simple as copying and pasting; the problem is finding a place to copy the BOM from. The file in the Resources section below is a small HTML file that contains a BOM between double-quotes; you may be able to download this and use it to copy the BOM.
If you can copy the BOM, then the usage is as follows (we use the capital 'B' for the BOM in this example; note however if you copy an actual BOM into the QuerySurge editor, it doesn't show up - but it is there):
SELECT "B""COL NAME""" FROM ...
With an actual BOM in place of the 'B', the query will appear in the editor as:
SELECT """COL NAME""" FROM ...
Note that the extra double quotes in this usage are required to escape the quotes around the column name (which allows the driver to process the column name with all the characters).
- Another option, if headers are in the file, is to use a custom schema file and URL to override the column names in the file. A custom schema.xml with "name" and "pos" attributes (and with " suppressHeaders" = "false") will set the column names to the values you select for the "name" attributes. This eliminates the problem of the first column name having the BOM as its first byte. A sample illustrating this is in the Resources section (custom-schema-override-ff-col-names.zip).
- If there are no headers in the file, then there is the option of writing SQL to either pre-pend the BOM to the intial value in the first field of the opposing query in the QueryPair (Target if the BOM is in Source or Source if the BOM is in Target), or of trying to remove the BOM via SQL.
If you are only dealing with a few instances of the BOM, approaches (1), (3) or (4) might work for you. However, because the BOM does not appear visually as a character, it can make SQL difficult to maintain. This is an important drawback to consider before proceeding along this path.
Approach (2) is the most reliable over the long term, so if you have large numbers of files that lead with the BOM, it is probably the best approach. However, the up-front implementation cost must be considered, since the custom function needs to be written to deal with any specific features your files have.