Good news: The nut harvest has begun.
Red squirrel/ekorre.
Hadoop HDFS JDBC Driver
TIQ Solutions has released the Hadoop HDFS JDBC Driver. You can download a limited demo here:
Hadoop HDFS JDBC Driver demo for Hadoop version 1.0.3 (14.2 MB)
Hadoop HDFS JDBC Driver demo for Hadoop version 2.0.1 (21.4 MB)
Introduction
The Hadoop HDFS JDBC Driver is designed to connect to Hadoop HDFS from an external system (outside the Hadoop cluster) via JDBC and to extract relational data in a line-based or CSV-based format from an HDFS (Hadoop Distributed File System).
There are already several ways to extract data from Hadoop, but no really easy way to use existing query or analysis software on the HDFS itself, at file level. For instance, if you want to quickly check the latest MapReduce results, you need an interface that lets you inspect your data promptly.
use of HDFS JDBC driver with Squirrel
Preparation
The following settings are mandatory for configuring the Hadoop cluster. Ensure that they are set properly when using the Hadoop HDFS JDBC Driver.
Some files in your Hadoop install directory (referred to below as $HADOOP_HOME) need to be adjusted. More precisely, they are located in $HADOOP_HOME/conf. They are .xml files that contain a set of property elements. A property element is a key-value pair, so you have to set a name tag and a value tag.
Take a look at the example configuration file: core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop</value>
    <description>A base for other temporary directories. The user running the HDFS daemon needs permission to write to this directory.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<IP ADDRESS>:<PORT></value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. To be able to connect to the HDFS, use the IP address of the namenode machine in your local network. To avoid problems, you should also configure a valid DNS setup between your cluster nodes.</description>
  </property>
</configuration>
Usage
You can include the driver like any other JDBC driver. Make sure that the required Java library files (.jar) are in a directory called "lib" next to the connector's JAR file.
Make sure that you have the appropriate Java libraries for your Hadoop deployment. There are library versions for Hadoop version 1.0.3 (current stable release, e.g. Cloudera Version CDH3) and version 2.0.1 (current alpha release, e.g. Cloudera Version CDH4).
Specify the driver class
The JDBC mechanism uses the class name of a driver to load it via reflection at runtime. Thus, you need to tell your application the driver class name: de.tiq.hadoop.TIQHdfsDriver
Connect to your HDFS
To connect, specify a JDBC URL of the following form: jdbc:hdfs://<theHDFShost>:<PORT>?PARAMS=VALUE
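As a minimal sketch of the two steps above, the following Java class loads the driver class by name and opens a connection through the standard `java.sql.DriverManager`. The class name `HdfsConnect` and its helper methods are hypothetical names chosen for illustration; the URL scheme `jdbc:hdfs://` follows the full example URL given in this post. Whether the demo driver also registers itself automatically is not stated here, so the explicit `Class.forName` call is kept.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper class; only the driver class name comes from the text.
class HdfsConnect {
    // Driver class name as given in the section "Specify the driver class".
    static final String DRIVER_CLASS = "de.tiq.hadoop.TIQHdfsDriver";

    // Builds the base JDBC URL; host and port are supplied by the caller.
    static String baseUrl(String host, int port) {
        return "jdbc:hdfs://" + host + ":" + port;
    }

    // Loads the driver reflectively, then opens a connection to the namenode.
    // Requires the driver JAR and its "lib" directory on the classpath.
    static Connection open(String host, int port)
            throws ClassNotFoundException, SQLException {
        Class.forName(DRIVER_CLASS);
        return DriverManager.getConnection(baseUrl(host, port));
    }
}
```

A caller would typically use `HdfsConnect.open("localhost", 9000)` inside a try-with-resources block so the connection is closed afterwards.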
Currently, no security protocols are implemented.
Settings for your connection
Parameters (key-value pairs) are appended to the URL after the question mark, like ordinary HTTP GET parameters:
user=<YourHDFSUserName> - connects you to the HDFS with the specified user name, e.g. `...?user=root`
recursive=<boolean> - if true, the driver will match files in any subdirectory when querying the HDFS with a regular expression, e.g. `...?recursive=true` The default is false.
separator=<MaskedSeparatorSign> - sets the column separator of your data, e.g. `...?separator=,` The default is a TAB ('\t'). You may also URL-encode special characters: `...?separator=%3B` (semicolon)
skip_header=<boolean> - if true, it will skip the header lines when concatenating a set of files, e.g. `...?skip_header=true` The default is false.
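The parameters above can also be assembled programmatically. The sketch below is one possible way to do it in Java, assuming that the driver accepts standard percent-encoding for values (the text only confirms `%3B` for a semicolon). The class `HdfsJdbcUrl` and its `build` method are hypothetical helpers, not part of the driver.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper for assembling the JDBC URL from key-value parameters.
class HdfsJdbcUrl {
    // Appends each parameter as key=value, URL-encoding the value.
    // Use a LinkedHashMap if the parameter order should be preserved.
    static String build(String host, int port, Map<String, String> params) {
        StringBuilder url = new StringBuilder("jdbc:hdfs://")
                .append(host).append(':').append(port);
        char delimiter = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                url.append(delimiter)
                   .append(e.getKey())
                   .append('=')
                   .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                delimiter = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
        return url.toString();
    }
}
```

Note that `URLEncoder` encodes a space as `+`; whether the driver decodes that form is not documented here, so separators other than simple punctuation should be tested first.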
The whole URL might look like: jdbc:hdfs://localhost:9000?user=hduser&recursive=true&separator=%3B&skip_header=true&raiseUnsupportedOperationException=false
Create Statements
The main functionality of the driver is issuing SQL queries, which read their data from files in the HDFS. Remember that the data should be organized in a relational way, e.g. as .csv files.
Single file: To retrieve data from a file in the HDFS, put the path to the file in the from clause of the select statement. For a single file the syntax is:
select * | CommaSeparatedColumnList from /path/to/hdfs/object
where the from clause is a path in the HDFS such as `/input/data.csv`. For a single file, you may omit the .csv suffix.
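A single-file query can be issued through the ordinary JDBC statement API. The following is a sketch under assumptions: `HdfsQuery`, `selectFrom`, and `printRows` are hypothetical names, and it assumes the driver supports `ResultSet.getMetaData()` and `getString(int)`, which this post does not confirm.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for building and running a single-file query.
class HdfsQuery {
    // Builds "select <columns> from <path>"; with no columns given, selects "*".
    static String selectFrom(String hdfsPath, String... columns) {
        String cols = columns.length == 0 ? "*" : String.join(",", columns);
        return "select " + cols + " from " + hdfsPath;
    }

    // Runs the query on an open connection and prints each row, tab-separated.
    // Assumes the driver implements result set metadata (not confirmed here).
    static void printRows(Connection conn, String sql) throws SQLException {
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            int columnCount = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= columnCount; i++) {
                    if (i > 1) line.append('\t');
                    line.append(rs.getString(i));
                }
                System.out.println(line);
            }
        }
    }
}
```

For example, `HdfsQuery.printRows(conn, HdfsQuery.selectFrom("/input/data.csv"))` would print the rows of that file; keep in mind that the demo version stops after 1000 rows.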
Set of files: When concatenating files, keep in mind that the first file determines the structure of the whole table. All other files are interpreted as if they had the same column structure. Consequently, missing columns are treated as NULL values and additional columns are dropped.
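The alignment rule just described can be illustrated with a small stand-alone sketch. This is not the driver's code; `ColumnAlignment.align` is a hypothetical function that only mimics the stated behavior: a row from a later file is fitted to the column count of the first file, padding with null and dropping surplus fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: mimics how rows of later files are fitted to the
// column structure determined by the first file.
class ColumnAlignment {
    // columnCount is the number of columns found in the first file.
    static List<String> align(int columnCount, String line, String separator) {
        // Quote the separator so characters like "|" are taken literally;
        // limit -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(separator), -1);
        List<String> row = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            // Missing columns become null; fields beyond columnCount are ignored.
            row.add(i < fields.length ? fields[i] : null);
        }
        return row;
    }
}
```

With three columns in the first file, a line `a;b` from a later file yields `[a, b, null]`, and a line `a;b;c;d` yields `[a, b, c]`.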
We assume that you don't want to mix unrelated data. If you need join functionality, consider running a MapReduce job in Hadoop to transform the data into the appropriate structure. You can state a set of files as a comma-separated list, e.g. select * from /path/file1,/path/file2,/file3. The files don't need to be in the same directory. There must be no whitespace between the file names. The first path in such an expression is called the "base path". We assume that you usually want to extract more files from the directory of the base path, so you can use the following query: select * from /path/file1,file2,file3, which returns the data of file1, file2 and file3 from the /path directory of your HDFS.
Regular expression: You can use a (Java) regular expression to retrieve a set of files that match a given pattern. Again, the files should share the same column structure, because the first file still determines the header. A pattern is introduced by a hash character (#). For example: select * from /path/#csv$ extracts all files whose names end with the suffix csv. The search is recursive if you specify the `recursive=true` parameter in the URL.
Directories: To provide an easy way to navigate, you can also query directory information. Try: select * from / and you will see a basic Unix-like listing of the root directory of your HDFS.
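To make the base-path and pattern rules concrete, here is an illustrative Java sketch. It is not taken from the driver: `HdfsFromClause` and both methods are hypothetical, and the pattern filter assumes partial matching (`Matcher.find()`), which is consistent with `#csv$` selecting every name that ends in csv.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: models how a from clause might be interpreted.
class HdfsFromClause {
    // Expands "/path/file1,file2,/other/file3" into absolute paths.
    // Relative entries are resolved against the directory of the first
    // (base) path, which is assumed to be absolute.
    static List<String> resolveFileList(String fromClause) {
        String[] parts = fromClause.split(",");
        String base = parts[0];
        String baseDir = base.substring(0, base.lastIndexOf('/'));
        List<String> paths = new ArrayList<>();
        paths.add(base);
        for (int i = 1; i < parts.length; i++) {
            paths.add(parts[i].startsWith("/") ? parts[i] : baseDir + "/" + parts[i]);
        }
        return paths;
    }

    // Keeps the file names matched by the regular expression after '#'.
    // Assumption: a partial match (find) is enough for a name to qualify.
    static List<String> filterByPattern(List<String> fileNames, String fromClause) {
        String regex = fromClause.substring(fromClause.indexOf('#') + 1);
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String name : fileNames) {
            if (pattern.matcher(name).find()) {
                matched.add(name);
            }
        }
        return matched;
    }
}
```

For instance, resolving `/path/file1,file2,file3` gives `/path/file1`, `/path/file2` and `/path/file3`, and filtering the names `a.csv`, `b.txt` and `csv.bak` with `/path/#csv$` keeps only `a.csv`.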
use of HDFS JDBC driver with QlikView
QlikView script example to read out a whole data model from HDFS
Demo mode
The demo version of this software is limited to fetching 1000 rows of data. As a demo, this program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. If you have any questions, feel free to ask us: ralf.becher (at) tiq-solutions (dot) de
Update:
This image shows the configuration of the Hadoop HDFS JDBC Driver and the additionally required Java libraries in the QlikView JDBC Connector:
made this pic yesterday
Happy Valentine's Day! (by Raymond Lee Photography)