Organizations using Azure Blob Storage often require testing of these data assets with QuerySurge. However, this Azure technology does not natively offer access via JDBC, which QuerySurge requires. Apache Drill is a powerful tool for querying a variety of structured and partially structured data stores, including a number of different file types. This article describes the use of QuerySurge with Drill to access and analyze flat files stored in public or private Azure Blob Storage.
Note: You can get the setup basics and background for Apache Drill in our Knowledge Base articles about Drill installation and setup on Windows, or on Linux, and on Drill Storage Plugins.
Drill Configuration for Azure Blob Storage
Deploy Azure drivers to Drill
Drill needs Java drivers to access various Azure services. Run the following commands in the $DRILL_HOME\jars\3rdparty directory to retrieve the drivers needed to access Azure Blob storage.
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.3/hadoop-azure-2.7.3.jar wget http://central.maven.org/maven2/com/microsoft/azure/azure-storage/4.4.0/azure-storage-4.4.0.jar
If you are using Windows, download the jar files from the repositories above and copy them to the same directory.
Configure Data Sources
If you are using a public blob, you can skip this step. If you are using a private blob you need to enter the credentials in a Apache Drill configuration file stored in $DRILL_HOME\conf\core-site.xml. Edit this file (carefully!) in any text editor and add the following property under the configuration tag.
<property>
<name>fs.azure.account.key.YourAzureStorageAccountname.blob.core.windows.net</name>
<value>YourStorageAccountKey</value>
</property>
Each blob data source that needs credentials must be entered in this XML file. Replace YourAzureStorageAccountname and YourStorageAccountKey with the values assigned to your storage container and account. To find your storage account name and access keys, navigate to the storage account that contains the blob container and select access keys.
Note: If the key used in the Drill configuration is regenerated it will need to be updated in the core-site.xml file.
Configuring Drill
Storage Plugin Configuration
Create a new Azure Blob Storage Plugin via the Drill Web Console and add the following JSON configuration. Make sure that YourContainer and YourAzureStorageAccountName are replaced with the correct values of the blob container name and the storage account name respectively.
{
"type":"file",
"enabled":true,
"connection":"wasbs://YourContainer@YourAzureStorageAccountname.blob.core.windows.net/",
"config":null,
"workspaces":{
"root":{
"location":"/",
"writable":false,
"defaultInputFormat":null
},
"tmp":{
"location":"/tmp",
"writable":true,
"defaultInputFormat":null
}
},
"formats":{
"psv":{
"type":"text",
"extensions":[
"tbl"
],
"delimiter":"|"
},
"csv":{
"type":"text",
"extensions":[
"csv"
],
"extractHeader":true,
"delimiter":","
},
"tsv":{
"type":"text",
"extensions":[
"tsv"
],
"delimiter":"\t"
},
"parquet":{
"type":"parquet"
},
"json":{
"type":"json",
"extensions":[
"json"
]
},
"avro":{
"type":"avro"
},
"sequencefile":{
"type":"sequencefile",
"extensions":[
"seq"
]
},
"csvh":{
"type":"text",
"extensions":[
"csvh"
],
"extractHeader":true,
"delimiter":","
}
}
}
After creating the plugin, ensure it is enabled in the web console.
Note: The name you give your storage plugin will need to be used later when defining the connection URL or choosing a schema in your SQL statements.
Note: Any file types you would like to query must be included as an entry under the formats tag in valid JSON format.
QuerySurge and Drill with Azure Blob Storage
Deploy the Drill JDBC Driver
Make sure that you've deployed the Apache Drill JDBC driver to your QuerySurge Agents. See the Deploy the Drill JDBC Driver discussion in either of the installation articles (Windows or Linux articles) for instructions if you haven't deployed the driver yet.
Create a QuerySurge Connection to Drill
To create a new Drill Connection in QuerySurge, use the Connection Extensibility feature of the Connection Wizard. You can find a description of how to use Extensibility in this article.
There are a couple of ways to set up a QuerySurge Connection for Drill, depending on whether you want to specify the Storage Plugin on the Connection in the JDBC URL or in your SQL (see Storage Plugins for more details):
Specify the Storage Plugin in the JDBC URL
To specify the Storage plugin in the JDBC URL, use the following syntax where the schema will be equal to the name of the storage plugin we created earlier.
jdbc:drill:schema=azureblob;drillbit=<server-or-IP>:<port>
Note: In this case, your query specifies only the file path: SELECT ... FROM `sample.csv`;
Specify the Storage Plugin in your SQL Query
To specify the Storage plugin in the SQL query, use the following URL syntax (which omits the schema):
jdbc:drill:drillbit=<server-or-IP>:<port>
Note: In this case, your query specifies both the Storage Plugin and the file path: SELECT ... FROM azureblob.`sample.csv`;
Querying Flat Files In Blob Storage with Drill
At this point, querying flat files within an Azure Blob Storage is identical to querying them from your local filesystem and would in require both the same Storage Plugin edits and SQL syntax as when using the dfs Storage Plugin. For more information on working with storage plugins with Drill, see this article.
For this example we'll use a simple four-row CSV file provided below.
Similar to using the dfs Storage Plugin to query CSV files, the "columns[n]" zero-based index syntax is required when querying the file:
SELECT columns[0], columns[1], columns[2] FROM azureblob.`sample.csv`;
Note: You can find the full documentation for the Drill SQL dialect in the Drill SQL Reference. Specifics for querying a filesystem are found in this discussion.
Comments
0 comments
Please sign in to leave a comment.