Apache Drill is a powerful tool for querying a variety of structured and partially structured data stores, including a number of different types of files. This article describes the use of QuerySurge with Drill to access and analyze flat files stored in Amazon Web Services S3. Using the S3a library, Drill can query files stored on Amazon's S3 cloud storage.
Drill Configuration for Amazon S3
Setup Amazon Web Services Credentials
For Drill to access your Amazon S3 cloud, it must be given the proper credentials to your AWS account.
Start by logging into your AWS dashboard and navigating to the "My Security Credentials" option under your username drop-down menu.
Under the "Access keys (access key ID and secret access key)" list item, click the "Create New Access Key" button to generate a new access key. We recommend downloading a key file in case your Secret Access Key is forgotten.
Note: This is the first and only time AWS will show your Secret Access Key. Please store it in a safe location for later reference.
Now that you have the necessary credentials, we can give them to Drill to enable its S3a support.
Edit the core-site.xml located at <Drill Install Dir>/conf/core-site.xml, and replace ENTER_YOUR_ACCESSKEY and ENTER_YOUR_SECRETKEY with your newly acquired Access Key and Secret Access Key respectively.
Storage Plugin Configuration
Enable the provided S3 Storage Plugin via the Drill Web Console and update it.
Replace the "connection" value of "file:///" with "s3a://" followed by your bucket name. Afterward, add the Access Key and Secret Key key-value pairs in a "config" property under the "connection" property:
QuerySurge and Drill with Amazon S3
Deploy the Drill JDBC Driver
Make sure that you've deployed the Apache Drill JDBC driver to your QuerySurge Agents. See the Deploy the Drill JDBC Driver discussion in either of the installation articles (Windows or Linux articles) for instructions if you haven't deployed the driver yet.
Create a QuerySurge Connection to Drill
To create a new Drill Connection in QuerySurge, use the Connection Extensibility feature of the Connection Wizard. You can find a description of how to use Extensibility in this article.
There are a couple of ways to set up a QuerySurge Connection for Drill, depending on whether you want to specify the Storage Plugin on the Connection in the JDBC URL or in your SQL (see Storage Plugins for more details):
Specify the Storage Plugin in the JDBC URL
To specify the Storage plugin in the JDBC URL, use the following syntax:
Note: In this case, your query specifies only the file path: SELECT ... FROM `sample.csv`;
Specify the Storage Plugin in your SQL Query
To specify the Storage plugin in the SQL, use the following URL syntax (which omits the schema):
Note: In this case, your query specifies both the Storage Plugin and the file path: SELECT ... FROM s3.`sample.csv`;
Querying Flat Files In S3 with Drill
At this point, querying flat files within the Amazon S3 cloud is identical to querying them from your local filesystem. and would in require both the same Storage Plugin edits and SQL syntax as when using the dfs Storage Plugin. (See our list of available flat file - Apache Drill articles below)
For this example we'll use a simple four-row CSV file provided below.
Similar to using the dfs Storage Plugin to query CSV files, the "columns[n]" zero-based index syntax is required when querying the file:
SELECT columns, columns, columns FROM `/drill/sample-data/region.parquet`;
Note: Keep in mind that drill returns all values as CLOBs. Proper casting should be implemented to avoid CLOB datatypes in your resultsets.