
Hadoop HDFS JDBC Driver

TIQ Solutions has released the Hadoop HDFS JDBC Driver. You can download a limited demo here:

Hadoop HDFS JDBC Driver demo for Hadoop version 1.0.3 (14.2 MB)

Hadoop HDFS JDBC Driver demo for Hadoop version 2.0.1 (21.4 MB)

Introduction

The Hadoop HDFS JDBC Driver is designed to connect to Hadoop HDFS from an external system (outside the Hadoop cluster) via JDBC and to extract relational data in a line-based or CSV-based format from an HDFS (Hadoop Distributed File System).

There are already several ways to extract data from Hadoop, but no really easy way to use existing query or analysis software on the HDFS itself, at the file level. For instance, if you want to perform a quick check of the latest MapReduce results, you need an interface that lets you keep track of your data promptly.

Use of the HDFS JDBC driver with Squirrel

Preparation

The following settings are mandatory for configuring the Hadoop cluster. Ensure that they are set properly when using the Hadoop HDFS JDBC Driver.

There are some files in your Hadoop install directory (referred to below as $HADOOP_HOME) that should be adjusted. To be precise, they are located in $HADOOP_HOME/conf. They exist as .xml files that contain a set of property elements. A property element is a key-value pair, so you have to set a name tag and a value tag.

Take a look at the example configuration file: core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop</value>
    <description>A base for other temporary directories. The user running the HDFS daemon needs permission to write to this directory.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<IP ADDRESS>:<PORT></value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. To be able to connect to the HDFS, use the IP address of the namenode machine in your local network. To avoid problems, you should also configure a valid DNS setup between your cluster nodes.</description>
  </property>
</configuration>
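To make the name/value structure concrete, here is a minimal sketch that reads one property from such a file using only the Java standard library. The class `CoreSiteReader` and its method `readProperty` are hypothetical names for this illustration, and the sample value `hdfs://localhost:9000` is an example, not a required setting. It is also only an assumption that the host and port found in `fs.default.name` are the ones to use in the JDBC URL later on.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CoreSiteReader {
    // Hypothetical helper: returns the <value> of the <property> whose <name>
    // equals the given name, or null if no such property exists.
    static String readProperty(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String key = prop.getElementsByTagName("name").item(0).getTextContent().trim();
                if (key.equals(name)) {
                    return prop.getElementsByTagName("value").item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Example content; in practice this would be read from $HADOOP_HOME/conf/core-site.xml.
        String xml = "<?xml version=\"1.0\"?>"
                + "<configuration><property>"
                + "<name>fs.default.name</name><value>hdfs://localhost:9000</value>"
                + "</property></configuration>";
        URI fs = URI.create(readProperty(xml, "fs.default.name"));
        System.out.println(fs.getHost() + ":" + fs.getPort());
    }
}
```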

Usage

You can include the driver like any other JDBC driver. Make sure that the Java library files (.jar) it uses are in a directory called "lib" next to the connector's jar file.

Make sure that you have the appropriate Java libraries for your Hadoop deployment. There are library versions for Hadoop version 1.0.3 (current stable release, e.g. Cloudera Version CDH3) and for version 2.0.1 (current alpha release, e.g. Cloudera Version CDH4).

Specify the driver class

The JDBC mechanism uses the class name of a driver to load it via reflection at runtime. Thus, you need to tell your application the driver name: de.tiq.hadoop.TIQHdfsDriver
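In plain Java, this step could look like the following minimal sketch. The class name is the one given above; the helper `isDriverAvailable` is a hypothetical name and only reports whether the class can be found on the classpath. Drivers written for JDBC 4 may also register automatically, but whether this driver does so is not stated here, so the explicit `Class.forName` call is the cautious choice.

```java
final class DriverCheck {
    // Driver class name as given in the text above.
    static final String DRIVER_CLASS = "de.tiq.hadoop.TIQHdfsDriver";

    // Hypothetical helper: true if the named class can be loaded from the classpath.
    static boolean isDriverAvailable(String className) {
        try {
            // Loading the class typically lets a JDBC driver register itself with DriverManager.
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isDriverAvailable(DRIVER_CLASS)) {
            System.out.println("Driver loaded: " + DRIVER_CLASS);
        } else {
            System.out.println("Driver not found; check that the driver jar and its lib directory are on the classpath.");
        }
    }
}
```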

Connect to your HDFS

To connect, specify the following JDBC URL: jdbc:hdfs://<theHDFShost>:<PORT>?PARAMS=VALUE

Currently, there are no security protocols implemented.

Settings for your connection

There are parameters (key-value pairs) that are appended to the URL after the question mark, like ordinary HTTP GET parameters:

user=<YourHDFSUserName> - connects you to the HDFS with the specified user name, e.g. `...?user=root`

recursive=<boolean> - if true, the driver will match files in any subdirectory when querying the HDFS with a regular expression. Use: `...?recursive=true` The default is false.

separator=<MaskedSeparatorSign> - sets the column separator of your data, e.g. `...?separator=,` The default is a TAB ('\t'). You may also URL-encode special characters: `...?separator=%3B` (semicolon)

skip_header=<boolean> - if true, it will skip the header lines when concatenating a set of files, e.g. `...?skip_header=true` The default is false.
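The parameters listed above can also be assembled programmatically. The following is a minimal sketch, not part of the driver: `HdfsJdbcUrl.build` is a hypothetical helper that joins the key-value pairs with `&` and URL-encodes each value, so that a semicolon separator becomes `%3B`. Host, port and user name are example values.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class HdfsJdbcUrl {
    // Hypothetical helper: builds jdbc:hdfs://host:port?key=value&... with URL-encoded values.
    // Uses the URLEncoder.encode(String, Charset) overload, available since Java 10.
    static String build(String host, int port, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            query.add(p.getKey() + "=" + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        String base = "jdbc:hdfs://" + host + ":" + port;
        return params.isEmpty() ? base : base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user", "hduser");
        params.put("recursive", "true");
        params.put("separator", ";");
        params.put("skip_header", "true");
        System.out.println(build("localhost", 9000, params));
    }
}
```

Note that `URLEncoder` uses form encoding, so a space would become `+`; whether the driver decodes that back to a space is not covered by the description above.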

The whole URL might look like: jdbc:hdfs://localhost:9000?user=hduser&recursive=true&separator=%3B&skip_header=true&raiseUnsupportedOperationException=false

Create Statements

The main functionality of the driver consists of issuing SQL queries, which fetch data from a file in the HDFS. Remember that the data should be organized in a relational way, e.g. .csv files.

Single file: To retrieve data from a file in the HDFS, set the path to the file in the from clause of the select statement. For a single file the syntax is: select * | CommaSeparatedColumnList from /path/to/hdfs/object where the from clause represents a path in the HDFS such as `/input/data.csv`. For a single file, you may omit the .csv suffix.

Set of files: When concatenating files, please keep in mind that the first file determines the structure of the whole table. All other files are interpreted as if they had the same column structure. Consequently, missing columns are treated as NULL values and additional columns are lost.
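Put together with the driver class and URL from the earlier sections, a single-file query from Java might look like the sketch below. It uses only standard `java.sql` calls; whether the driver implements every one of them (for example `ResultSetMetaData`) is not confirmed by this post. The helper names `buildSelect` and `printRows`, the host, and the user name are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

final class HdfsSingleFileQuery {
    // Hypothetical helper: "select * from <path>" or "select a,b from <path>".
    static String buildSelect(List<String> columns, String hdfsPath) {
        String cols = columns.isEmpty() ? "*" : String.join(",", columns);
        return "select " + cols + " from " + hdfsPath;
    }

    // Runs the query, prints each row tab-separated, and returns the row count.
    static int printRows(String jdbcUrl, String sql) throws SQLException {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            int rows = 0;
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    if (i > 1) line.append('\t');
                    line.append(rs.getString(i));
                }
                System.out.println(line);
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        // Example host, port and user; adjust to your cluster.
        String url = "jdbc:hdfs://localhost:9000?user=hduser&separator=%3B";
        String sql = buildSelect(List.of(), "/input/data.csv");
        try {
            Class.forName("de.tiq.hadoop.TIQHdfsDriver");
            System.out.println(printRows(url, sql) + " rows");
        } catch (ClassNotFoundException | SQLException e) {
            System.err.println("Query failed: " + e.getMessage());
        }
    }
}
```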

We assume that you don't want to mix unrelated data. If you need join functionality, consider a MapReduce job in Hadoop to transform the data into the appropriate structure. You can state a set of files as a comma-separated list, e.g. select * from /path/file1,/path/file2,/file3 (the files don't need to be in the same directory). There must be no whitespace between the different files. The first path in such an expression is called the "base path". If you want to extract more files from the directory of the base path, you can use the following query: select * from /path/file1,file2,file3 which returns the data of file1, file2 and file3 from the /path directory of your HDFS.

Regular expression: You can use a (Java) regular expression to retrieve a set of files that match a given pattern. Again, they should have the same column structure, because the first file still determines the header. A pattern is introduced by a hash character (#). For example: select * from /path/#csv$ would extract all files whose names end with csv. The search is recursive if you specify the `recursive=true` parameter in the URL.

Directories: To provide an easy way to navigate, you can also issue a query to retrieve directory information. Try: select * from / and you will see a basic Unix-like listing of the root directory of your HDFS.
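To make the file-set rules above more concrete (base path and `#` patterns), here is a small Java illustration. It is not the driver's code, only a sketch of the behavior as described: `expandFileList` and `matchFiles` are hypothetical helpers, and it is an assumption that the pattern is applied with `find()` semantics, which is what makes `csv$` match names ending in csv.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class FromClauseRules {
    // Illustrates the base-path rule: names without a leading '/' are
    // resolved against the directory of the first path in the list.
    static List<String> expandFileList(String fromClause) {
        String[] parts = fromClause.split(",");
        String first = parts[0];
        String baseDir = first.substring(0, first.lastIndexOf('/') + 1);
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            result.add(part.startsWith("/") ? part : baseDir + part);
        }
        return result;
    }

    // Illustrates the '#' rule: the text after '#' is a Java regex; a file is
    // selected if the pattern is found in its name (assumed find() semantics).
    static List<String> matchFiles(String regex, List<String> fileNames) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (pattern.matcher(name).find()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expandFileList("/path/file1,file2,file3"));
        System.out.println(expandFileList("/path/file1,/path/file2,/file3"));
        System.out.println(matchFiles("csv$", List.of("a.csv", "b.txt", "csv.bak")));
    }
}
```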

Use of the HDFS JDBC driver with QlikView

QlikView script example to read out a whole data model from HDFS

Demo mode

The demo version of this software is limited to fetching 1000 rows of data. As a demo, this program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. If you have any questions, feel free to ask us: ralf.becher (at) tiq-solutions (dot) de

Update:

This image shows the configuration of the Hadoop HDFS JDBC Driver and the additionally required Java libraries in the QlikView JDBC Connector:

#hadoop #hdfs #jdbc #driver #suirrel #qlikview #bigdata #nosql
