With increasing frequency, QuerySurge users need to query against schema-free data stores, including large Flat Files, large JSON files, Parquet files and Avro files. Apache Drill supports these options and much more. In this article, we show how to set up Apache Drill on Linux so that you can use it with QuerySurge.
Note: We recommend installing Apache Drill on its own hardware/VM. We do not recommend installing on your QuerySurge hardware/VM.
- Oracle JDK 7 or 8 JDK
- A minimum of 4GB RAM dedicated for Drill
Apache Drill Installation
Install Apache Zookeeper
To access Apache Drill via JDBC, Drill must be installed in distributed mode on one or multiple nodes in a cluster. For Drill to operate in distributed mode, a drillbit must be started on each node of the cluster. Drillbits are daemon processes that Drill uses to execute the user’s queries on a single or distributed device.
Drillbits rely on Apache’s ZooKeeper service. ZooKeeper is an open-source coordination service for distributed applications and processes. ZooKeeper allows these processes to coordinate with each other through a shared namespace which is organized similarly to a standard filesystem.
A key Zookeeper concept is the idea of a quorum. The ZooKeeper quorum is a replicated group of servers in the same cluster. One of the quorum members is designated as the leader, and the remainder are termed followers. The Zookeeper quorum leader coordinates the tasks of the quorum. The Zookeeper configuration reflects certain aspects of the leader and follower.
To get started, download the latest version of Apache ZooKeeper. Expand the tar.gz download file to /opt/zookeeper, with the following commands:
[sudo] tar -xzf zookeeper-n.tar.gz -C /opt
mv /opt/zookeeper-n /opt/zookeeper
rm -f zookeeper-*.tar.gz
Note: The n indicates the zookeeper version number.
For Drill to use ZooKeeper, ZooKeeper must be configured on every node of the cluster (you may only have one node, as in this example). To configure Zookeeper, navigate to the conf directory within the ZooKeeper install directory, e.g. /opt/zookeeper/conf. ZooKeeper comes with a sample configuration file which you can use as a template. Start by copying the zoo_sample.cfg file and renaming it to zoo.cfg. The zoo.cfg file will contain your active configuration:
[sudo] cp /opt/zookeeper/conf/zoo_sample.cfg /opt/zookeeper/conf/zoo.cfg
By default, the ZooKeeper configuration file comes with the following base properties:
- tickTime: The basic time unit in milliseconds used by ZooKeeper and its properties like initLimit and syncLimit.
- dataDir: The location to store the in-memory database snapshots; default update transaction log directory
- clientPort: The port to listen for client connections
- initLimit: The timeout used to limit the length of time the ZooKeeper servers must connect to a leader.
- syncLimit: The time a server can go without being connected to a leader.
When setting up the ZooKeeper quorum (a replicated group of servers in the same application), an additional property is needed in zoo.cfg:
- server.n: A server that makes up the ZooKeeper service; n is an integer counter starting from 1.
Note: The default values for these properties are fine for getting started, but they likely can be calibrated to your individual machine specifications as you get more experience with Drill.
In our example, we use the local machine as the only server in our ZooKeeper quorum. The zoo.cfg file contains:
For the server.n property, note the two port numbers after each server name: "2888" and "3888". Zookeeper quorum followers use the first port ("2888" in this example) to connect to the quorum leader. The second port ("3888") is used for 'leader election' - this port is used to designate the leader.
Additionally, a file named myid containing the server number of the machine must exist in the data directory specified in by the dataDir property in zoo.cfg (in our example: /tmp/zookeeper).
If either the myid file or the /tmp/zookeeper directory do not exist, you will need to create them.
The contents of the myid file are numeric - simply the numeric portion of the server.n entry in the zoo.cfg file. So, for server.1, the myid file simply contains a 1.
When the service starts up, it knows which server it is running on by looking for the file myid in dataDir (in our case /tmp/zookeeper/myid).
If the installation and configuration are correct, you can use the following command within the Cygwin terminal to start ZooKeeper:
[sudo] /opt/zookeeper/bin/zkServer.sh start
Install Apache Drill in Distributed Mode
Start by downloading the latest version of Apache Drill. Expand the tar.gz download file to your desired install directory, e.g. /opt/drill with the following:
[sudo] tar -xzf apache-drill-n.tar.gz -C /opt
mv /opt/apache-drill-n /opt/drill
rm -f apache-drill-n.tar.gz
Note: The n indicates the drill version number.
Like ZooKeeper, Drill must also be configured on all machines in your cluster. To do so, modify the drill-override.conf in every machine’s /opt/drill/conf directory. If the file does not yet exist, copy the included drill-override-example.conf to a file named drill-override.conf (which is the active configuration file). The basic drill-override.conf file contains:
By default, a unique drill-override.conf file’s cluster-id value is set to "drillbits1" and the zk.connect value is set to the ZooKeeper host name (from the Zookeeper server.n property) and clientPort numbers in zoo.cfg. If you install Drill on multiple nodes, assign the same cluster-id to each Drill node so that all Drill nodes share the same cluster-id.
A few notes about these the drill-override.conf file contents:
a) If the cluster-id property in all the machines of your cluster are identical, the actual value is irrelevant (meaning, as long as all nodes match, you can use "drillbits1" for all nodes, or "drillbits2", etc.). Because our example deals with only one server, we can simply use the default value.
b) The zk.connect value uses the server.n domain and clientPort properties in our zoo.cfg separated by a colon (:).
To start Drill, use the following command in the Cygwin terminal:
[sudo] /opt/drill/bin/drillbit.sh start
By default, drill will output its logs in /drill/log/drillbit.out.
Note: You can use the contents of this log file to help troubleshoot Drill issues you may encounter.
Deploy the Drill JDBC Driver
For QuerySurge to use Drill the Drill JDBC driver must be set up on all QuerySurge Agents. The JDBC driver is named drill-jdbc-all-*.jar (where the * is the Drill version) and is located in /opt/drill/jars/jdbc-driver/.
To add a JDBC driver to an Agent deployed on Windows, see this Knowledge Base article.
To add a JDBC driver to an Agent deployed on Linux, see this Knowledge Base article.
Once the driver is deployed and your Agent is operational, you are ready to create a Connection in QuerySurge using the Drill JDBC driver.
Create a Connection
To create a new Connection in QuerySurge, we use the Connection Extensibility feature of the Connection Wizard. You can find a full description of this procedure in this article. Key points include the following:
- Navigate to: Administration > Connections > Add. Check "Advanced Mode" and choose "* All Other JDBC Connections (Connection Extensibility)" in the Data Source dropdown.
- For the Driver Class, enter: org.apache.drill.jdbc.Driver.
- For the Connection URL, enter: jdbc:drill:drillbit=<your-drillbit-IP>;schema=dfs
The two properties used in the Connection URL are:
drillbit: The location of the drillbit you set up, which can be specified by server name, IP address or 'localhost' (as in this example).
schema: The name of a Storage Plugin to handle our specific filetype. Storage Plugins are software modules that connect Drill to data sources. Apache Drill includes a Storage Plugin configuration named
dfs that points to the local file system on your machine by default - this is the Storage Plugin used in the example above.
Note: Check out our Knowledge Base article on Storage Plugins for more detail.
Test your Drill JDBC Connection
Create a small CSV file called sample.csv and deploy it on an Agent box. Write a simple query to pull information from a basic CSV file called sample.csv. For a CSV file deployed to:
the Drill query has the following format:
SELECT ... FROM `/home/myuser/Documents/sample.csv`
Note that the query defines the path to the file. Also, notice that the path is enclosed with back-tics (`), not single quotes.
To complete our SELECT, another aspect of Drill syntax comes into play. In the default configuration for comma delimited files, Drill represents the file columns as elements of an array named "columns". Because of this, you must use the columns[n] syntax in the SELECT list:
SELECT columns FROM `/home/myuser/Documents/sample.csv`
Note: This syntax uses a zero-based index, so the first column is column 0.
You can set up a QueryPair in QuerySurge to use this query to test out your Drill Connection. The simplest way to set up the test is to use the same query for both Source and Target.