Amazon Athena is an interactive query service that allows users to analyze data in Amazon S3 using a standard SQL syntax. This article shows how to connect QuerySurge to Athena and query data hosted on S3 using IAM Role permissions assigned to your EC2 Instance.
Prerequisites
This only applies to AWS hosted QuerySurge Agents. It requires an EC2 instance with an IAM Role that has access to the following resources in your AWS cloud environment:
- Athena
- S3 Query Output Bucket
Deploy the Athena JDBC Driver
For QuerySurge to connect to Athena, the Simba Athena JDBC driver with AWS SDK must be deployed to all Agents. The driver can be downloaded from the JDBC 42-compatible link found here.
To deploy drivers to your QuerySurge Agents, see this article for Windows or this article for Linux.
Note: The JDBC 42 drivers are compatible with QuerySurge 6.4.x or higher (Java 8 compatible).
Create a QuerySurge Connection to Athena
To create a new Connection in QuerySurge, use the Connection Extensibility feature of the Connection Wizard. You can find a full description of this procedure in this article. You'll need an Admin login to QuerySurge.
Navigate to: Administration > Connections > Add. Check "Advanced Mode" and choose "* All Other JDBC Connections (Connection Extensibility)" in the Data Source dropdown.
For the Driver Class, enter the following:
com.simba.athena.jdbc.Driver
The JDBC URL template for this driver is:
jdbc:awsathena://AwsRegion=[REGION];S3OutputLocation=[LOCATION];AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.DefaultAWSCredentialsProviderChain;
To build your JDBC URL, you'll need a region code and an S3 Output location.
- The Region Code
The region endpoint of your Athena instance can be found in the top right of the web console. In the example below, note that the instance is based in US East (Ohio) which corresponds top the us-east-2 region code. To find the region code from a region name consult this listing.
- The S3 Output Location
A query output location in S3 is required for the connection string. Athena caches all query results this location (more information can be found here). In this example, we use the directory test-results that we have created, residing in our sample-bucket on S3. This yields the following location s3://sample-bucket/test-results
The JDBC URL is:
jdbc:awsathena://AwsRegion=us-east-2;S3OutputLocation=s3://sample-bucket/test-results;AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.DefaultAWSCredentialsProviderChain;
Test Your Connection
To test your connection from QuerySurge to Athena enter the query SELECT 1 in the Test Connection field, and click on the Test Connection button.
S3 Databases and Tables
To query files hosted on S3, you'll need to create both a database and at least one table in S3. To start this process, log into the Athena web console and select Query Editor.
If there are no databases, you can create a database (called 'sample' in this example) with the following SQL statement:
CREATE DATABASE sample;
Once you have a database to work with, select the database you wish to use in the drop down menu then click Create Table. In the popup, select from S3 bucket data.
Choose a Table Name and the Location of Input Data Set; the location is a path to a directory in S3. Valid paths are of the format s3://bucket/folder/. In the following example, we use s3://sample-bucket/sample-folder/ as the location.
Note: For this example, the sample file used is named sample.csv, located in the directory //sample-bucket/sample-folder/. Note that if there are multiple files of the specified type in the directory (in this case CSV), when you query, data is extracted from all files in the directory and included in the result set.
Next select a data format - in this example we use CSV files.
In step 3, columns can be added and defined for the table. Columns can also be added in bulk with the syntax (col_name data_type), for example (id int, name string). You may also add partitions if you'd like on the final step 4.
After these steps have been completed, select Save and ddl to create your table is auto-generated. Select Run query to create the table in Athena.
Querying S3 Data
Now, you can run a select statement to ensure the data is being pulled from the location specified. The result set returned is shown in the Results section.
Note: Headers in flat files are included in the result sets.
You are now ready to query the bucket data from QuerySurge. In the example below, we query all columns from the sample.csv file that is hosted on S3 and defined by sampletable in Athena.
Note: Documentation for Athena SQL syntax can be found here.