QuerySurge and Apache Drill - Parquet Files – Customer Support

Apache Drill is a powerful tool for querying a variety of structured and partially structured data stores, including a number of different types of files. This article describes the use of QuerySurge with Drill for dealing with Apache Parquet files. Because Drill offers strong performance especially with large files, for QuerySurge data testing and verification that involves big files, Drill should be considered.

Note: You can get the setup basics and background for Apache Drill in our Knowledge Base articles about Drill installation and setup on Windows, or on Linux, and on Drill Storage Plugins.

Drill Configuration for Parquet Files

Storage Plugin Configuration

To read Parquet data using Drill, you'll need to use a file system storage plugin that defines the Parquet format. You can use the dfs storage plugin, which includes the following definition:

"parquet": { "type": "parquet" }, ...

Note: Note that this sample entry in the Storage Plugin JSON terminates with a comma, since it is only one element in a (potentially) long JSON. When you edit Storage Plugin JSON, make sure to conform to proper JSON syntax as you make your changes.

Note: If the plugin is misconfigured for the file, Drill returns an exception.

Note: You can find a more complete listing and discussion of dfs plugin properties here.

Including the above definition within a Storage Plugin other than dfs will allow it to be used for Parquet file handling. Once the Storage Plugin settings have been updated (if necessary), you are ready to create a Connection in QuerySurge to query the file.

QuerySurge and Drill with Parquet Files

Deploy the Drill JDBC Driver

Make sure that you've deployed the Apache Drill JDBC driver to your QuerySurge Agents. See the Deploy the Drill JDBC Driver discussion in either of the installation articles (Windows or Linux articles) for instructions if you haven't deployed the driver yet.

Create a QuerySurge Connection to Drill

To create a new Drill Connection in QuerySurge, use the Connection Extensibility feature of the Connection Wizard. You can find a description of how to use Extensibility in this article.

There are a couple of ways to set up a QuerySurge Connection for Drill, depending on whether you want to specify the Storage Plugin on the Connection in the JDBC URL or in your SQL (see this article about Storage Plugins for more details):

Specify the Storage Plugin in the JDBC URL

To specify the Storage plugin in the JDBC URL, use the following syntax:

jdbc:drill:schema=dfs;drillbit=<server-or-IP>:<port>

Note: In this case, your query specifies only the file path: SELECT ... FROM `/drill/sample-data/region.parquet`;

Specify the Storage Plugin in your SQL Query

To specify the Storage plugin in the SQL, use the following URL syntax (which omits the schema):

jdbc:drill:drillbit=<server-or-IP>:<port>

Note: In this case, your query specifies both the Storage Plugin and the file path: SELECT ... FROM dfs.`/drill/sample-data/region.parquet`;

Querying Parquet Files with Drill

For this example we'll use the region.parquet file provided by Drill as sample data during installation. The file is located in the following path:

C:\drill\sample-data\region.parquet (Windows)

/opt/drill/sample-data/region.parquet (Linux)

Using ANSI SQL syntax standards, we can create a QueryPair that retrieves all fields within the Parquet file:

SELECT * FROM `/drill/sample-data/region.parquet`;

Note: In the FROM clause of your SELECT statement, the path to the file is always relative to the root location property. The root location property is defined in the storage plugin. So, if the root location is /files, then SELECT * FROM `/data/myfile.parquet` refers to a file at: /files/data/myfile.parquet.

Note: You can find the full documentation for the Drill SQL dialect in the Drill SQL Reference. Specifics for querying a filesystem are found in this discussion.

SQL Data Types to Parquet Types

Out of the box, Drill translates standard SQL data types to native Parquet data types behind the scenes. The following is a list of SQL datatypes and their Parquet counterparts:

SQL Type	Parquet Type	Description
BIGINT	INT64	8-byte signed integer
BOOLEAN	BOOLEAN	TRUE (1) or FALSE (0)
FLOAT	FLOAT	4-byte single precision floating point number
DOUBLE	DOUBLE	8-byte double precision floating point number
INTEGER	INT32	4-byte signed integer
VARBINARY	INT96	12-byte signed integer
VARCHAR	VARCHAR	Arbitrarily long character string

Note: Check out the Drill Parquet Format documentation for more information on Parquet datatype handling.

Note: If you have parquet columns that are displaying as clob or hex fields i.e.(00 d8 a0 52 dc 3e 00 00 b1 6f 25 00), you will need to modify the Drill\conf\drill-override.conf file to include

options: {
store.parquet.reader.int96_as_timestamp: true
}

within the drill.exec: function and restart drill.

Querying Parquet files in HDFS

Apache Drill can access and query flat files located in HDFS using the dfs Storage Plugin. With regard to the Parquet format, if you query a directory, Drill will merge the results from all of its files into one result set. In order to query an individual parquet file in a specified directory, the user under which the drillbit is running must have "execute" rights to it. These rights can be assigned either by changing the file's permission settings or via User Impersonation.

Articles in this section

Related articles