QuerySurge Technical Whitepaper No. 11
QuerySurge users typically connect QuerySurge to Hadoop through a JDBC-based entry point (e.g., a JDBC connection to Hive or HBase). However, in some cases, users want to build tests around HDFS files directly, and in order for QuerySurge to access files on HDFS, the files have to be pulled off HDFS to a file system that QuerySurge can work with directly (i.e. Windows or Linux).
The whole process - pulling files off of HDFS and querying them - can be automated by QuerySurge. This is done via QuerySurge's Flat File JDBC driver and a custom function, built for the driver, that calls into the HDFS API. This article shows the basics of how to build the custom function and deploy it.
A Custom HDFS Function
The Hadoop File System API offers full access to the file system. In terms of building a custom function to access HDFS, the general strategy is to obtain a reference to an org.apache.hadoop.fs.FileSystem object by specifying a URL for the file at the path that you're interested in. The general form of the URL is:
hdfs://<hdfsServer>:<hdfsPort>/hdfs-path/hdfs-file
A sample URL is:
hdfs://myhadoop.mycompany.com:8020/user/dev/purchase-data.csv
In this sample URL, note that the port used (port 8020) is the default port; you'll need to verify whether your Hadoop instance uses the default or not. It is also important to underscore that the file path in this sample (/user/dev/purchase-data.csv) is (of course) an HDFS path, not an OS file system path.
See Resources section: hdfs-api-jar-list.txt lists the required client API jar files for HDP 2.4 and CDH 5.8. For other distros and versions, you'll need to search for the jar file names that match your Hadoop distribution and version, and additional jars may be required.
In terms of writing code, there are multiple ways to structure HDFS API calls. In this example, we'll use the FileSystem method copyToLocalFile() to bring the target file down from HDFS to a local file system (where the QuerySurge Agent can access it):
// required imports: java.io.File, java.io.IOException,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileStatus,
// org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path
// LINUXSEPARATOR is a class constant: private static final String LINUXSEPARATOR = "/";
public static void hdfsFileDownload(String hdfsServer,
                                    String hdfsPort,
                                    String hdfsDirPath,
                                    String fileName,
                                    String targetPath) throws IOException {

    // handle the hdfs dir path to make sure the path ends with a separator char
    String formattedHdfsDirPath =
            (hdfsDirPath.endsWith(LINUXSEPARATOR)) ?
                    hdfsDirPath : hdfsDirPath + LINUXSEPARATOR;

    // handle the local OS file system dir path
    // to make sure that the path ends with a separator char
    String formattedTargetDirPath = (targetPath.endsWith(File.separator)) ?
            targetPath : targetPath + File.separator;

    // set up a Configuration pointing to the hdfs server
    Configuration conf = new Configuration();
    String path = "hdfs://" + hdfsServer + ":" + hdfsPort;
    conf.set("fs.default.name", path);

    // get a ref to the hdfs FileSystem
    FileSystem fs = FileSystem.get(conf);

    try {
        // get a FileStatus ref for the specified hdfs file
        FileStatus[] filestatus = fs.listStatus(new Path(formattedHdfsDirPath + fileName));

        // make sure the hdfs file path is a file path and not a dir path
        if (filestatus[0].isFile()) {
            // get an hdfs Path ref to the hdfs file...
            Path hdfsFilepath = filestatus[0].getPath();
            // ...and copy the file to the local file system
            // (delSrc = false: keep the file on HDFS;
            //  useRawLocalFileSystem = true: skip writing a local .crc checksum file)
            fs.copyToLocalFile(false, hdfsFilepath, new Path(formattedTargetDirPath), true);
        } else {
            System.out.println("Directory specified, not File");
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // cleanup
        fs.close();
    }
}
To complete a class implementation of this method, we can add a convenience wrapper method for specifying the server and port in a single argument, modifications for offering verbose output, and a main method for testing outside of QuerySurge. The complete sample class file is available in the Resources section.
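For illustration, the convenience wrapper might look something like the sketch below. This is a sketch only: it assumes the wrapper simply splits a combined "server:port" argument and adds a verbose flag before delegating to the method above; the actual implementation in HdfsClient.java (see Resources) may differ in its details.
// convenience wrapper: accepts "server:port" in a single argument and
// offers basic verbose output before delegating to hdfsFileDownload()
public static void hdfsFileDownload(String hdfsServerAndPort,
                                    String hdfsDirPath,
                                    String fileName,
                                    String targetPath,
                                    boolean verbose) throws IOException {
    // split e.g. "myhadoop.mycompany.com:8020" into server and port
    String[] serverAndPort = hdfsServerAndPort.split(":");
    if (verbose) {
        System.out.println("Downloading " + hdfsDirPath + "/" + fileName
                + " from " + hdfsServerAndPort + " to " + targetPath);
    }
    hdfsFileDownload(serverAndPort[0], serverAndPort[1],
            hdfsDirPath, fileName, targetPath);
}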
See Resources section: Download the full HdfsClient.java class.
Build and Deploy your Custom Jar File
Once you've set up your code, you'll need to compile it and package it as a jar file. (For guidance on writing and building a Custom Function for files in QuerySurge, see this worked example.) A critical task for this implementation is to get the required HDFS API library jars onto the classpath. One option is to set the Class-Path attribute in the jar manifest to refer to the required HDFS library jars. Another option is to include the API jar files inside the custom jar with a custom ClassLoader or "onejar" packaging.
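If you use the manifest approach, note that Class-Path entries are resolved relative to the directory containing your custom jar. A minimal sketch of such a manifest follows; the jar names here are placeholders, so substitute the jars from hdfs-api-jar-list.txt that match your distro and version:
Manifest-Version: 1.0
Class-Path: hadoop-common-2.7.3.jar hadoop-hdfs-2.7.3.jar hadoop-auth-2.7.3.jar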
Once the jar is complete, you can test it from a batch file. A template batch file called hdfs-client-1.0.0-template.bat is in the Resources section; you'll have to modify the file to work in your environment. Once the function can be successfully called from the batch file, you can deploy it to your Agent(s).
See Resources section: Use the hdfs-client-1.0.0-template.bat file to test the HDFS file download outside of QuerySurge.
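As a rough illustration of what such a test script does, the call typically looks along these lines (the jar name, library location, and main-class argument order here are assumptions; use the template in Resources as your starting point):
@echo off
REM hypothetical invocation -- adjust the classpath, arguments and paths
REM to match your build and environment
java -cp hdfs-client-1.0.0.jar;C:\hadoop-client-libs\* ^
  com.rttsweb.querysurge.HdfsClient ^
  myhadoop.mycompany.com:8020 /user/dev purchase-data.csv C:\Users\myuser\Documents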
Deploy the jar file to each Agent's JDBC directory, along with the HDFS API library jars (consistent with your chosen classpath). Remember that you'll need to do this for all Agents that you expect to run these "HDFS" QueryPairs on.
Once this is done, you'll be ready to modify the Agent configuration (agentconfig.xml) to access the custom HDFS call. For details on how to do this, see Steps 3 - 5 in this Knowledge Base article. Your file modification will look like:
<connectionProps>
...
<driverProp driver="jstels.jdbc.csv.CsvDriver2" prop="function:hdfsFileDownload" type="void"
value="com.rttsweb.querysurge.HdfsClient.hdfsFileDownload" />
...
</connectionProps>
Note: The function name (which appears in both the prop and value attributes) and the fully qualified class name (in the value attribute) must match the names of your custom method and class; if you have renamed either of these in the sample class, the tag must change accordingly.
Calling the Custom Function
Once you've:
- deployed your jar and libraries to your Agent(s),
- modified the Agent(s) configuration for the custom function and re-started your Agent(s),
then you're ready to set up a Flat File Connection and call the function in a QueryPair.
For the Flat File Connection setup, if you're not familiar with this process, you can start with the basics here (for delimited Flat Files) or here (for fixed-width Flat Files). Set up the Connection to point at the local file system location to which the file will be delivered. The file keeps the same name it has on HDFS (i.e. it is not renamed in the process), so you can use that name in your SQL.
For your QueryPair, the sample syntax is:
CALL hdfsFileDownload('myhadoop.mycompany.com:8020','/user/dev', 'purchase-data.csv', 'C:\Users\myuser\Documents', true);
SELECT * FROM "purchase-data";
Some important notes:
- The calling function signature shown here is for the convenience wrapper method (not shown above; see HdfsClient.java in Resources)
- As is customary with SQL calls, the string arguments to the custom function use single quotes.
- All statements in multi-statement queries (the CALL and the SELECT statements in this example) must be semi-colon terminated.
When executed, this query will download the specified file from HDFS to the specified local file system, and from then on, the query process is the same as the standard Flat File query execution.
HDFS Security
One point that should be stressed about the example shown here is that there is no default security on API access to HDFS. The calls in the sample code above specify the HDFS URL, but do not include any credentials. Your instance may well have more elaborate security in place (typically either SSL or Kerberos authentication). You can find discussions on implementing a client with either of these technologies here, at the Hortonworks Community Connection site.
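As an illustration only (this is not part of the sample class), a Kerberos-enabled client typically authenticates through Hadoop's UserGroupInformation API before any FileSystem calls are made. In the fragment below, the principal and keytab path are placeholders for your environment:
// fragment: authenticate to a Kerberized cluster before calling FileSystem.get()
// requires org.apache.hadoop.security.UserGroupInformation (in hadoop-common)
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://myhadoop.mycompany.com:8020");
conf.set("hadoop.security.authentication", "kerberos");
UserGroupInformation.setConfiguration(conf);
// placeholder principal and keytab -- substitute values from your Kerberos admin
UserGroupInformation.loginUserFromKeytab("querysurge@MYCOMPANY.COM",
        "/etc/security/keytabs/querysurge.keytab");
FileSystem fs = FileSystem.get(conf);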
Large File Sizes
An important topic to consider before you put the effort into building a custom HDFS function for QuerySurge is the size of the files involved. Files on HDFS may be quite large (from hundreds of MB to the GB range or higher). In these cases, it may make sense to consider a different approach. QuerySurge does a great deal of processing to download and access the files as database tables, and the cost of this processing in disk space, CPU and memory scales up with file size. With files that are large relative to the resources available, it may make sense to import the files into a full database product for querying with QuerySurge. Where this is possible, the performance and efficiency gains may make this approach advantageous.
Resources