With increasing frequency, QuerySurge users need to query against schema-free data stores, including large Flat Files, large JSON files, Parquet files, Avro files, etc. Apache Drill supports these options and much more. In this article, we show how to set up Apache Drill on Windows so that you can use it with QuerySurge. Additional installation details can be found on Apache Drill's site, if needed.
Note: Apache Drill is originally a Linux solution. However, when installed in embedded mode, Apache Drill can be installed natively on the Windows OS.
Requirements
- Oracle JDK 8 or OpenJDK 8
- A minimum of 4GB RAM dedicated for Drill
- A file compression/expansion tool such as Winrar or 7zip to handle tar.gz files
Note: We recommend installing Apache Drill on its own hardware/VM. We do not recommend installing on your QuerySurge hardware/VM.
Pre-Installation Steps
Windows Environment Variables
Create the following Windows Environment variables:
- JAVA_HOME environment variable set to point to the JDK installation
- PATH environment variable includes the bin directory of JDK installation.
Note: It is recommended to save/install the Java JDK in a directory path that does not contain spaces; Apache Drill cannot use file paths with spaces.
Note: While adding the %JAVA_HOME%\bin directory to the PATH environment variable, ensure the %SYSTEMROOT%\system32 (generally %SYSTEMROOT% = C:\Windows) directory is also on the PATH environment variable. Some environments (including Azure virtual machine setups) do not include this entry on the PATH environment variable by default. However, this is required for Drill to run successfully.
Apache Drill Directory Structures
The user login that Drill runs under must have the following required directory structures. To create these directories, run the following in a command prompt for your Drill user:
mkdir "%userprofile%\drill"
mkdir "%userprofile%\drill\udf"
mkdir "%userprofile%\drill\udf\registry"
mkdir "%userprofile%\drill\udf\tmp"
mkdir "%userprofile%\drill\udf\staging"
takeown /R /F "%userprofile%\drill"
Note: The Windows environmental variable %userprofile% is a pre-set variable that exists whenever a user session starts. As indicated above, this means that you'll need to set up these directories for the user login under which Drill will be running.
Apache Drill Installation
Install Apache Drill in Embedded Mode
Start by downloading the latest version of Apache Drill. Expand the tar.gz download file to your desired install directory, e.g. C:\drill. This is all that is required for installation.
To start Drill, in a Windows command prompt, navigate to the Drill installation\bin directory and run the following command :
drill-embedded.bat
Note: In order to use Drill with QuerySurge or to perform any Drill-related activities, the drill-embedded.bat file must be running. By default, drill logs are output to: C:\drill\log\sqlline.out.
Note: You can use the contents of this log file to help troubleshoot Drill issues you may encounter.
Deploy the Drill JDBC Driver
For QuerySurge to use Drill, the Drill JDBC driver must be set up on all QuerySurge Agents. The JDBC driver is named drill-jdbc-all-*.jar (where the * is the Drill version) and is located in C:\drill\jars\jdbc-driver\.
To add a JDBC driver to an Agent deployed on Windows, see this Knowledge Base article.
To add a JDBC driver to an Agent deployed on Linux, see this Knowledge Base article.
Once the driver is deployed and your Agent is operational, you are ready to create a Connection in QuerySurge using the Drill JDBC driver.
Create a Connection
To create a new Connection in QuerySurge, we use the Connection Extensibility feature of the Connection Wizard. You can find a full description of this procedure in this Knowledge Base article. Key points include the following:
- Navigate to: Administration > Connections > Add. Check "Advanced Mode" and choose "* All Other JDBC Connections (Connection Extensibility)" in the Data Source dropdown
- For the Driver Class, enter: org.apache.drill.jdbc.Driver
- For the Connection URL, enter:
jdbc:drill:drillbit=localhost;schema=dfs
The two properties used in the Connection URL are:
drillbit: The location of the Drill instance you set up, which can be specified by server name, IP address or 'localhost' (as in this example).
schema: The name of a Storage Plugin to handle our specific filetype. Storage Plugins are software modules that connect Drill to data sources. Apache Drill includes a Storage Plugin configuration named dfs that points to the local file system on your machine by default - this is the Storage Plugin used in the example above.
Note: Check out our Knowledge Base article on Storage Plugins for more detail.
Test your Drill JDBC Connection
Create a small CSV file called sample.csv and deploy it on an Agent box. Write a simple query to pull information from a basic CSV file called sample.csv. For a CSV file deployed to:
C:\Users\myuser\Documents\sample.csv
the Drill query has the following format:
SELECT ... FROM `/Users/myuser/Documents/sample.csv`
Note that the query defines the path to the file, but omits the partition label (C: in this example). Also, notice that forward-slashes (/) are used in the path instead of back-slashes (\), and that the path is enclosed with back-tics (`), not single quotes.
To complete our SELECT, another aspect of Drill syntax comes into play. In the default configuration for comma delimited files, Drill represents the file columns as elements of an array named "columns". Because of this, you must use the columns[n] syntax in the SELECT list:
SELECT columns[0] FROM `/Users/myuser/Documents/sample.csv`
Note: This syntax uses a zero-based index, so the first column is column 0.
You can set up a QueryPair in QuerySurge to use this query to test out your Drill Connection. The simplest way to set up the test is to use the same query for both Source and Target.
Note: You can find the full documentation for the Drill SQL dialect in the Drill SQL Reference.
Comments
0 comments
Please sign in to leave a comment.