Drill Storage Plugins
Drill Storage Plugins are software modules that connect Drill to data sources. Storage Plugins are used to connect to data sources, to spedify the location of the data, to configure the workspace and file formats for reading data, and to optimize the execution of Drill queries. Drill comes with several default storage plugins:
- cp: Points to files in the Drill classpath.
- dfs: Points to the local file system, but you can configure this storage plugin to point to any distributed file system, such as a Hadoop or S3 file system.
- hbase: Provides a connection to HBase.
- hive: Integrates Drill with the Hive metadata abstraction of files, HBase, and libraries to read data and operate on SerDes and UDFs.
- mongo: Provides a connection to MongoDB data.
Note: Storage Plugins are generic in the following sense: they work for broad technology categories. So, for example, the dfs plugin will be used for all types of flat files, including delimited files, Parquet files, Avro files, JSON files, etc.
Creating a Storage Plugin
While the default Storage Plugins provided with the Apache Drill installation cover a broad range of applications, you may find the need to customize an existing Storage Plugin, or create a new customization. The ability to create and modify Storage Plugins is a vital part in maintaining an organized Drill workflow.
To create and configure a new Storage Plugin, access the Drill Web Console located at:
http://<drill-server>:8047/storage
If HTTPS support is enabled:
https://<drill-server>:8047/storage
To use a Storage Plugin with a specific data store, the general recommendation is to start with a default Storage Plugin, and if modifications are necessary, to use an default plugin as a template. If you need to create a new Storage Plugin from a default plugin, open the existing Storage Plugin by clicking on the Update button for the plugin. Copy the contents of the entire Storage Plugin Configuration - this is a JSON specifying the plugin behavior.
Return to the Storage tab of the Drill web client and enter the desired name of your new Storage Plugin under New Storage Plugin:
Press the Create button and in the null Configuration field, paste the configuration JSON that you copied into the new Storage Plugin Configuration text box:
Edit the configuration for your needs. Common edits include: removing a file type, changing a file extension, specifying a delimiter (for CSV files), include/exclude headers, etc. See the Drill documentation for your specific file type for details.
Once you've edited the configuration JSON, click the Create button to apply any changes made and create the new Storage Plugin.
Modifying a Storage Plugin
To modify an existing Storage Plugin, click the Update button next to the desired Storage Plugin name under the Enabled Storage Plugins/Disabled Storage Plugins sections on the Storage tab. Modify the JSON configuration as needed.
Specifying the Storage Plugin for Querying
When preparing a query, you can specify the Storage Plugin for your query in one of three ways:
- Specify the Storage Plugin in the JDBC URL (schema property)
- Specify the Storage Plugin in the FROM clause of the query (and not in the JDBC URL)
- Specify the Storage Plugin with the USE keyword (and not in the JDBC URL)
Examples of each of these options follow. These examples assume that the delimited file being queried is stored at one of the following paths (depending on OS):
C:\path\to\delimitedfile.ext (Windows)
/path/to/delimitedfile.ext (Linux)
Specify the Storage Plugin in the JDBC URL
To specify the Storage plugin in the JDBC URL, use the following syntax:
jdbc:drill:schema=dfs;drillbit=<server-or-IP>:<port>
Note: The default Drill JDBC port is 31010
A sample URL is:
jdbc:drill:schema=dfs;drillbit=192.168.10.1:31010
In this case, your query specifies only the file path:
SELECT columns[0] FROM `/path/to/delimitedfile.ext`;
Note: The query defines the path to the file, but for files on Windows, the path must omit the partition label (C: in this example); for files on Linux, the standard path is used. Also, notice that: (a) Linux filesystem path notation is used on both Windows and Linux, with forward-slashes (/) in the path, and (b) that the path is enclosed with back-tics (`), not single quotes.
Note: The path used in a query is actually a relative path, which by default is relative to the filesystem root (so it appears like an absolute path). The query path is relative to the root location set in the Storage Plugin configuration:
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
...
If you modify the location property from the root, then all your query paths will be modified, as they are relative to this location property.
Note: In your query FROM clause, the path to the JSON file is relative to the root location property in the Storage Plugin.
Specify the Storage Plugin in your SQL Query
To specify the Storage plugin in the SQL, use the following URL syntax (which omits the schema):
jdbc:drill:drillbit=<server-or-IP>:<port>
A sample URL is:
jdbc:drill:drillbit=192.168.10.1:31010
In this case, your query must specify both the Storage Plugin and the file:
SELECT columns[0] FROM dfs.`/path/to/delimitedfile.ext`;
An alternative to this option is to employ the USE keyword in your QueryPair:
USE dfs;
SELECT columns[0] FROM `/path/to/delimitedfile.ext`;
Note: If you use multiple statements in a query (the case when you use the USE keyword), all statements must be semi-colon terminated.
Comments
0 comments
Please sign in to leave a comment.