Organizations using Azure Data Lake Storage (ADLS) often require testing of ADLS data assets with QuerySurge. However, ADLS does not natively offer access via JDBC, which QuerySurge requires. Dremio is a tool designed for querying (via JDBC) a variety of structured and partially structured data stores, and supports Azure Data Lake Storage.
Note: You can get the setup basics and background for Dremio in our Knowledge Base article about Dremio installation and setup.
Dremio Configuration for Azure Data Lake Storage
- Register Dremio in the Azure Portal
In order for Dremio to access your Azure Data Lake Store, it must be added as a Active Directory web application with proper access to your files in the Azure portal.
Note: To create an Azure AD application, app registrations must be enabled in the Azure active directory. As an Administrator, navigate to the Azure Active Directory, select user settings and check the app registrations setting. If set to Yes, any user in the Azure AD tenant can register an app. If the app registrations setting is set to No, only global administrators can register apps.
Note: If you already have an registered AD application with appropriate permissions you can skip this step and move to Gather Configuration Credentials.
Once you have permissions to register apps, you can create an Azure AD application for Dremio. Log into the Azure portal and select the Azure Active Directory, and then App registrations. Create a new application registration with a Name, the type Web app / API (from the drop-down) and a Sign-on URL.
Note: Take note of the name of the app as it will be used in the Dremio configuration.
Note: While the Sign-on URL parameter must follow the standard HTTP address format, the value you enter is completely arbitrary and is not used by Azure.
To access resources in your Azure subscription, you must assign the application you have created to a role. From the left pane select All services and Subscriptions. Select the particular subscription to assign the application to.
Note: If you don't see the subscription you're looking for, select global subscriptions filter. Make sure the subscription you want is selected for the portal.
Select Access control (IAM) then Add. Find the name of the app you created earlier, assign a role and click Save.
Note: The name of the app to select is the name you assigned in step 1.
- Gather Configuration Credentials
Now that the application is registered, we need to retrieve the configuration values for Dremio; specifically we need:
- the Display name
- the Application (client) ID
- the Directory (tenant) ID
- an Access Key value
To get the application ID and the authentication ID, first go to the App registrations blade in the Azure active directory and select the application you created. Save the application ID from this page.
On the same page, click Settings and keys. Enter a description and duration. Once you click Save, the secret access key will be displayed.
Note: This is the first and only time Azure will show your Secret Access Key. Store it in a safe location for later reference.
- Assign the Azure AD application to the Azure Data Lake Store
In the Azure portal, navigate to to your Data Lake storage folder and click Data Explorer. In the data explorer blade select Access then Add. Search for the application name you created earlier. For permissions select: Read, Write and Execute then OK to add the application.
Note: If you have multiple groups to search from ensure you are searching in the correct group where you initially created the application.
Note: Permissions are not granted recursively by default. For child directories you will need to grant permissions separately or select add to this folder and all children from the parent directory.
- Create a New Data Source in Dremio
In the Dremio web console (by default located at the address <IP Address>:9047) add a new data source and choose Azure Data Lake Store. Give your data source a name and add the previously collected values to the form and click Save.
Note: After your first successful connection to ADLS, Dremio recommends that you add the following line to your dremio.conf file and restart your cluster:
fs.adl.impl.disable.cache : false
Once the connection is added, you will be able to see your files in the Dremio web console. For Dremio to use the ADLS flat files, they must be first represented as datasets in Dremio. To do this, click the dataset icon in the Action column:
In the example above, both sample.csv and test.csv are flat files in ADLS that Dremio has picked up as members of this data source. After Dremio's representation of the sample.csv file is changed to a dataset representation, its icon changes to show the difference between the different members in the data source:
QuerySurge and ADLS with Dremio
Finally, you'll need to create a QuerySurge Connection to Dremio. Follow the steps in this article to set up the Connection. Once you have connected QuerySurge to Dremio, you can query ADLS files via Dremio using the following syntax:
SELECT ... FROM <DATA STORE>.[<"PATH">...]<"FILE NAME">
In the example below, we query all columns from the sample.csv file that is hosted on our Azure Data Lake.
Note: Dremio's complete SQL reference guide can be found here.