Organizations using Azure Data Lake Storage (ADLS) often require testing of ADLS data assets with QuerySurge. However, ADLS does not natively offer access via JDBC, which QuerySurge requires. Apache Drill is a powerful tool for querying a variety of structured and semi-structured data stores, including many different file formats. This article describes the use of QuerySurge with Drill to access and analyze flat files stored in Azure Data Lake Storage.
Note: You can get the setup basics and background for Apache Drill in our Knowledge Base articles on Drill installation and setup on Windows or Linux, and on Drill Storage Plugins.
Drill Configuration for Azure Data Lake Storage
Deploy Azure drivers to Drill
Drill needs Java drivers to access various Azure services. Run the following commands in the $DRILL_HOME/jars/3rdparty directory to retrieve the drivers needed to access ADLS.
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/3.0.0-alpha3/hadoop-azure-datalake-3.0.0-alpha3.jar
wget http://central.maven.org/maven2/com/microsoft/azure/azure-data-lake-store-sdk/2.1.5/azure-data-lake-store-sdk-2.1.5.jar
If you are on Windows, download the jar files from the repositories above and copy them to the same directory.
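Before restarting the Drillbit, it is worth confirming that both jars actually landed in the 3rd-party directory. A minimal Python sketch of such a check (the directory path and helper name are illustrative, not part of Drill):

```python
import os

# Jar filenames from the wget commands above.
REQUIRED_JARS = [
    "hadoop-azure-datalake-3.0.0-alpha3.jar",
    "azure-data-lake-store-sdk-2.1.5.jar",
]

def missing_azure_jars(third_party_dir):
    """Return the list of required jars missing from the directory."""
    present = set(os.listdir(third_party_dir))
    return [jar for jar in REQUIRED_JARS if jar not in present]

# Example: missing_azure_jars(os.path.join(drill_home, "jars", "3rdparty"))
# should return an empty list once both drivers are in place.
```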
Register an application in the Azure Portal
For Drill to access your Azure Data Lake Store, it must be added in the Azure portal as an Azure Active Directory web application with the proper access to your files. To create an Azure AD application, app registrations must be enabled in Azure Active Directory. Navigate to Azure Active Directory, select User settings and check the App registrations setting.
This value can only be set by an administrator. If set to Yes, any user in the Azure AD tenant can register an app. If the app registrations setting is set to No, only global administrators can register apps.
If you are able to register apps, the next step is to create an Azure AD application. Log into the Azure portal, select Azure Active Directory, then App registrations. Create a new application registration with a name and URL, and select the type Web app / API from the drop-down.
Note: Take note of the name of the app as it will be used in the Drill configuration file later on.
Note: While the Sign-on URL parameter must follow the standard HTTP address format, the value you enter is completely arbitrary and is not used by Azure.
To access resources in your subscription, you must assign the application you have created to a role. From the left pane, select All services, then Subscriptions. Select the subscription to assign the application to.
Note: If you don't see the subscription you're looking for, select the global subscriptions filter. Make sure the subscription you want is selected for the portal.
Select Access control (IAM), then Add. Find the name of the app you created earlier, assign it a role and click Save.
Gather Configuration Credentials
Now that the application is registered, we need four configuration values for Drill: the Display name, the Application (client) ID, the Directory (tenant) ID and an authentication Key value.
To get the tenant ID, go to Azure Active Directory and select Properties. The Directory ID shown there is the tenant ID.
To get the application ID and the authentication key, first go to the App registrations blade in Azure Active Directory and select the application you created. Save the Application ID from this page. Next, click Settings and add a key with a description and duration. Once you click Save, the secret access key will be displayed.
Note: This is the first and only time Azure will show your Secret Access Key. Please store it in a safe location for later reference.
Assign the Azure AD application to the Azure Data Lake Store
In the Azure portal, navigate to your Data Lake Store and click Data Explorer. In the Data Explorer blade, select Access, then Add. Search for the application name you created earlier. For permissions, select Read, Write and Execute, then click OK to add the application.
Note: If you have multiple groups to search from ensure you are searching in the correct group where you initially created the application.
Note: Permissions are not granted recursively by default. For child directories, you will need to grant permissions separately, or select Add to this folder and all children from the parent directory.
Configuring Drill
Drill's core-site.xml
Edit or create core-site.xml in the conf directory under the Drill home directory. Add the following lines within the <configuration> element, replacing the {TENANT ID}, {APPLICATION ID} and {AUTHENTICATION KEY} placeholders in the configuration segment below.
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/{TENANT ID}/oauth2/token</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>{APPLICATION ID}</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>{AUTHENTICATION KEY}</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
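If you manage several Drill environments, it can be less error-prone to generate this fragment than to hand-edit it. A hedged Python sketch that assembles the same six properties (the function name and placeholder IDs are illustrative):

```python
import xml.etree.ElementTree as ET

def adls_core_site(tenant_id, application_id, authentication_key):
    """Build the <configuration> fragment for core-site.xml as a string."""
    props = {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.refresh.url":
            "https://login.microsoftonline.com/%s/oauth2/token" % tenant_id,
        "dfs.adls.oauth2.client.id": application_id,
        "dfs.adls.oauth2.credential": authentication_key,
        "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
        "fs.AbstractFileSystem.adl.impl": "org.apache.hadoop.fs.adl.Adl",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Example (placeholder values, not real credentials):
fragment = adls_core_site("{TENANT ID}", "{APPLICATION ID}", "{AUTHENTICATION KEY}")
```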
Drill's ADLS Storage Plugin
Connect to the Drill console by navigating to http://<IP Address>:8047 in a browser. Click the Storage tab; at the bottom of the screen you will see a New Storage Plugin section where you can register a new data source. Enter a friendly name for the ADLS data source, since this name is used in your SQL statements to reference the data in the Data Lake. Click the Create button.
A large text box will open, where you need to paste the following JSON configuration. Replace YOURDATALAKEACCOUNT with the name of your Data Lake Store account. You can find this in the Azure Portal by navigating to the resource; the name is displayed in the header.
{
  "type": "file",
  "enabled": true,
  "connection": "adl://YOURDATALAKEACCOUNT.azuredatalakestore.net/",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": ["tbl"],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "extractHeader": true,
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": ["tsv"],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": ["json"]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": ["seq"]
    },
    "csvh": {
      "type": "text",
      "extensions": ["csvh"],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
Click Create to register the Azure Data Lake Store data source for use in queries within Drill.
Note: The name you give your storage plugin will need to be used later when defining the connection URL or choosing a schema in your SQL statements.
Note: Any file types you would like to query must be included as an entry under the formats tag in valid JSON format.
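Since the Drill console rejects malformed JSON, it can help to assemble and validate the plugin configuration in code before pasting it in. A minimal Python sketch, assuming a trimmed-down config with only the root workspace and a CSV format (the helper name and YOURDATALAKEACCOUNT placeholder are illustrative):

```python
import json

def adls_plugin_config(account, formats):
    """Build a minimal ADLS storage-plugin configuration as a JSON string."""
    return json.dumps({
        "type": "file",
        "enabled": True,
        "connection": "adl://%s.azuredatalakestore.net/" % account,
        "config": None,
        "workspaces": {
            "root": {"location": "/", "writable": False,
                     "defaultInputFormat": None},
        },
        "formats": formats,
    }, indent=2)

# One "formats" entry per file type you intend to query.
csv_format = {"csv": {"type": "text", "extensions": ["csv"],
                      "extractHeader": True, "delimiter": ","}}
config = adls_plugin_config("YOURDATALAKEACCOUNT", csv_format)
```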
QuerySurge and Azure Data Lake Storage
Deploy the Drill JDBC Driver
Make sure that you've deployed the Apache Drill JDBC driver to your QuerySurge Agents. See the Deploy the Drill JDBC Driver discussion in either of the installation articles (Windows or Linux) for instructions if you haven't deployed the driver yet.
Create a QuerySurge Connection to Drill
To create a new Drill Connection in QuerySurge, use the Connection Extensibility feature of the Connection Wizard. You can find a description of how to use Extensibility in this article.
There are a couple of ways to set up a QuerySurge Connection for Drill, depending on whether you want to specify the Storage Plugin on the Connection in the JDBC URL or in your SQL (see Storage Plugins for more details):
Specify the Storage Plugin in the JDBC URL
To specify the Storage Plugin in the JDBC URL, use the following syntax, where the schema value is the name of the storage plugin created earlier.
jdbc:drill:schema=azuredatalake;drillbit=<server-or-IP>:<port>
Note: In this case, your query specifies only the file path: SELECT ... FROM `sample.csv`;
Specify the Storage Plugin in your SQL Query
To specify the Storage plugin in the SQL query, use the following URL syntax (which omits the schema):
jdbc:drill:drillbit=<server-or-IP>:<port>
Note: In this case, your query specifies both the Storage Plugin and the file path: SELECT ... FROM azuredatalake.`sample.csv`;
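The two URL variants above differ only in whether the schema parameter is present. A small Python sketch of that construction (the host name and helper are illustrative; 31010 is Drill's default drillbit port, and "azuredatalake" is the plugin name chosen earlier):

```python
def drill_jdbc_url(host, port, schema=None):
    """Build a Drill JDBC URL, optionally pinning a storage plugin as schema."""
    url = "jdbc:drill:"
    if schema:
        url += "schema=%s;" % schema
    return url + "drillbit=%s:%d" % (host, port)

# With the plugin in the URL, queries reference only the file path;
# without it, queries must prefix the plugin name themselves.
url_with_schema = drill_jdbc_url("drill.example.com", 31010, schema="azuredatalake")
url_without = drill_jdbc_url("drill.example.com", 31010)
```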
Querying Flat Files In ADLS with Drill
At this point, querying flat files within an Azure Data Lake Store is identical to querying them from your local filesystem, and requires the same Storage Plugin format entries and SQL syntax as the dfs Storage Plugin.
In the following example, we'll use a simple CSV file with four columns. Similar to using the dfs Storage Plugin to query local CSV files, the "columns[n]" zero-based index syntax can be used to query the file:
SELECT columns[0], columns[1] FROM azuredatalake.`sample.csv`;
Note: You can find the full documentation for the Drill SQL dialect in the Drill SQL Reference. Specifics for querying a filesystem are found in this discussion.
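To make the columns[n] projection concrete, here is a local Python analogue (not Drill itself) that performs the same zero-based column selection on a small in-memory CSV sample with four columns; the sample data is invented for illustration:

```python
import csv
import io

# Four-column CSV sample, analogous to the sample.csv queried above.
sample = "anna,smith,ny,10001\nbob,jones,ca,90210\n"

# Drill's columns[0], columns[1] projection selects the first two fields
# of each record; csv.reader gives us the same zero-based access.
rows = [(cols[0], cols[1]) for cols in csv.reader(io.StringIO(sample))]
# rows → [('anna', 'smith'), ('bob', 'jones')]
```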