Apache Drill is a powerful tool for querying a variety of structured and partially structured data stores, including a number of different types of files. This article describes the use of QuerySurge with Drill for dealing with delimited Flat Files. Because Drill offers strong performance especially with large files, for QuerySurge data testing and verification that involves big files, Drill should be considered.
Note: You can get the setup basics and background for Apache Drill in our Knowledge Base articles about Drill installation and setup on Windows, or on Linux, and on Drill Storage Plugins.
Drill Configuration for Delimited Flat Files
In order to configure Drill for a delimited Flat File, the following configuration tasks need to be performed:
- Choose a Storage Plugin that defines the file format to modify - for delimited files this usually is the dfs plugin.
- Modify the plugin JSON for the proper file extension for your file, via the extensions property.
- Modify the plugin JSON for the proper delimiter for your file, using the delimiter property. Only single-character delimiters are allowed.
- If the file has headers in the first row, add or modify the extractHeader property.
- If you need to skip the first file line, add or modify the skipFirstLine property.
Storage Plugin Configuration
The default configuration for Flat Files with the dfs plugin is for a Flat File with a ".csv" extension and the standard comma delimiter:
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
}, ...
Note:Drill only supports delimiters within the ASCII character range. Delimiters with a code outside of the range 0 - 127 will return an exception.
Note: Note that this sample entry in the Storage Plugin JSON terminates with a comma, since it is only one element in a (potentially) long JSON. When you edit Storage Plugin JSON, make sure to conform to proper JSON syntax as you make your changes.
Note: If the plugin is misconfigured for the file, Drill returns an exception.
Our first sample Storage Plugin modification, which uses the default dfs plugin as a template, is for a delimited file that has the extension ".out" (instead of the default ".csv"), looks like:
"csv": {
"type": "text",
"extensions": [
"out"
],
"delimiter": ","
}, ...
It is important to note that the default dfs configuration comes with three different delimited file types pre-configured:
- comma delimited with ".csv" extension
- pipe-delimited with ".tbl" extension
- tab-delimited with ".tsv" extension
Note: You can connect to and query any file configuration in your dfs configuration on your Drill connection. This means that only a single Drill Connection is required in QuerySurge to connect to any configured file.
Let's consider a more elaborate Storage Plugin example, for the following 3-column, tilde-delimited flat file layout with headers:
sample.til:
id~animal~weight
1~dog~35.2
2~cat~8.4
3~pigeon~1.9
We start off with the default dfs configuration, which, by default, only specifies the file type, the extension, and the delimiter character:
"csv": {
"type": "text",
"extensions": [
"til"
],
"delimiter": "~"
},
For Drill to use the first row as the header, the extractHeader property must be added and set to "true":
"csv": {
"type": "text",
"extensions": [
"til"
],
"delimiter": "~",
"extractHeader": true
},
Alternately, if you wanted to ignore the first line, include the skipFirstLine property with a value of true:
"csv": {
"type": "text",
"extensions": [
"til"
],
"delimiter": "~",
"skipFirstLine": true
},
Note: You can find a more complete listing and discussion of dfs plugin properties here.
Once the Storage Plugin settings have been updated, you are ready to create a Connection in QuerySurge to query the file.
QuerySurge and Drill with Delimited Files
Deploy the Drill JDBC Driver
Make sure that you've deployed the Apache Drill JDBC driver to your QuerySurge Agents. See the Deploy the Drill JDBC Driver discussion in either of the installation articles (Windows or Linux articles) for instructions if you haven't deployed the driver yet.
Create a QuerySurge Connection to Drill
To create a new Drill Connection in QuerySurge, use the Connection Extensibility feature of the Connection Wizard. You can find a description of how to use Extensibility in this article. For this example suppose our file exists at the following directory location:
C:\files\sample.til (Windows)
/files/sample.til (Linux)
There are a couple of ways to set up a QuerySurge Connection for Drill, depending on whether you want to specify the Storage Plugin on the Connection in the JDBC URL or in your SQL (see this article about Storage Plugins for more details):
Specify the Storage Plugin in the JDBC URL
To specify the Storage plugin in the JDBC URL, use the following syntax:
jdbc:drill:schema=dfs;drillbit=<server-or-IP>:<port>
Note: In this case, your query specifies only the file path: SELECT ... FROM `/files/sample.til`;
Specify the Storage Plugin in your SQL Query
To specify the Storage plugin in the SQL, use the following URL syntax (which omits the schema):
jdbc:drill:drillbit=<server-or-IP>:<port>
Note: In this case, your query specifies both the Storage Plugin and the file path: SELECT ... FROM dfs.`/files/sample.til`;
Basic Notes on Drill SQL with Delimited Files
In the default configuration for delimited Flat Files, Drill represents the file columns as elements of an array named "columns". If you are using the default configuration, and if you want to specify columns, you must use the "columns[n] syntax" in the SELECT list:
SELECT columns[0], columns[1], columns[2] FROM `/files/sample.til`;
Note: The "columns[n] syntax" syntax uses a zero-based index, so the first column is column 0.
Note: See documentation for the "columns[n] syntax" here.
If your Flat File has headers, and you have configured your Storage Plugin to 'extract' headers with the extractHeader property, then you must use the headers in your SQL (the "columns[n] syntax" will result in an exception):
SELECT id, animal, weight FROM `/files/sample.til`;
Note: In the FROM clause of your SELECT statement, the path to the file is always relative to the root location property. The root location property is defined in the storage plugin. So, if the root location is /files, then SELECT * FROM `/data/myfile.csv` refers to a file at: /files/data/myfile.csv.
Note: You can find the full documentation for the Drill SQL dialect in the Drill SQL Reference. Specifics for querying a filesystem are found in this discussion.
Advanced Options
- Query directories instead of files, which gives the effect of file unions. Details on this option can be found in this article.
- Compressed delimited files can be queried with Drill; details for this setup are found in this article.
Comments
0 comments
Please sign in to leave a comment.