Discover Top Posts Tagged with #yooreeka

This is a brief video introduction to Yooreeka -- setup and execution of an example.

#yooreeka #data mining #machine learning

Check out the code and start using intelligent search, recommendations, classifiers and clustering algorithms.

You can find the Yooreeka 2.0 API (Javadoc) here, and you can also visit us at our Google+ home. The official blog for Yooreeka is http://blog.yooreeka.com. Lastly, Yooreeka 2.0 is licensed under the Apache License 2.0. The library is written entirely in Java.

#yooreeka #data mining #machine learning

The Sammon Mapping: Nonlinear Projections for High-Dimension Datasets

In the context of a data mining project, or as part of constructing a machine learning algorithm, we are often given data sets whose records have a lot of attributes; think of a table with many columns, if you are more familiar with modeling your data through a relational database. This situation is not problematic for a wide range of data mining and machine learning algorithms. At times, however, it can be overwhelming for the purpose of analysis. In particular, such data can quickly saturate our ability to visualize information, since human perception is limited to a small number of dimensions. Thus, in this post, the words "many" and "a lot" are relative to the number three (3).

Perhaps the simplest technique for dimensionality reduction that comes to mind is a straightforward linear projection, such as the one described by Principal Component Analysis (PCA). However, the objective of PCA is very different from the needs of the analysis that we discuss here. The objective of PCA is to compress the information in the dataset so that most of it can be described with fewer coordinates per data point. In mathematical terms, PCA creates a sequence of dimensions such that the first dimension accounts for as much of the variability of the original dataset as possible, and each subsequent dimension attempts to do the same for the remaining variability, under the constraint that all dimensions are mutually perpendicular (orthogonal, and therefore linearly independent).
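To make the "first dimension" of PCA concrete, here is a minimal sketch in Java. It is not taken from Yooreeka; the class and method names (PcaSketch, covariance, firstComponent) are hypothetical. It computes the covariance matrix of the records and then uses power iteration to approximate the direction of greatest variability. It assumes the starting vector is not perpendicular to that direction.

```java
// Hypothetical sketch, not part of Yooreeka: the first principal component
// of a small dataset, found by power iteration on the covariance matrix.
class PcaSketch {

    /** Sample covariance matrix; each row of data is one record. */
    static double[][] covariance(double[][] data) {
        int n = data.length, m = data[0].length;
        double[] mean = new double[m];
        for (double[] row : data) {
            for (int k = 0; k < m; k++) {
                mean[k] += row[k] / n;
            }
        }
        double[][] cov = new double[m][m];
        for (double[] row : data) {
            for (int a = 0; a < m; a++) {
                for (int b = 0; b < m; b++) {
                    cov[a][b] += (row[a] - mean[a]) * (row[b] - mean[b]) / (n - 1);
                }
            }
        }
        return cov;
    }

    /**
     * Power iteration: repeatedly multiply a unit vector by the covariance
     * matrix and renormalize. Assumption: the start vector (all components
     * equal) is not perpendicular to the dominant direction.
     */
    static double[] firstComponent(double[][] cov, int iterations) {
        int m = cov.length;
        double[] v = new double[m];
        for (int a = 0; a < m; a++) {
            v[a] = 1.0 / Math.sqrt(m);
        }
        for (int it = 0; it < iterations; it++) {
            double[] w = new double[m];
            double normSq = 0.0;
            for (int a = 0; a < m; a++) {
                for (int b = 0; b < m; b++) {
                    w[a] += cov[a][b] * v[b];
                }
                normSq += w[a] * w[a];
            }
            if (normSq == 0.0) {
                break; // the start vector carried no variance; give up
            }
            double norm = Math.sqrt(normSq);
            for (int a = 0; a < m; a++) {
                v[a] = w[a] / norm;
            }
        }
        return v;
    }
}
```

For records that lie on the line y = 2x, the returned vector points along that line; projecting each record onto it gives a one-dimensional description of the data.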

While PCA is very useful in a number of cases, it makes no effort to preserve any geometric structures that might be present in the original data set. For certain purposes of visualization and data analysis, the preservation of the high-dimensional structures is more valuable and very revealing of the information that the data set contains.

One algorithm that aims to achieve exactly that is the so-called Sammon Mapping, named after John Sammon, who proposed it in 1969 (see "A nonlinear mapping for data structure analysis" by Sammon JW in IEEE Transactions on Computers 18, pp. 401–409). You can find an implementation of that algorithm in Yooreeka; just click here.

The results can be visualized by using the class ScatterGui.java.
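To give a feel for what the Sammon Mapping optimizes, here is a minimal sketch in Java. It is not the Yooreeka implementation, and the names (SammonSketch, stress, refine) are hypothetical. The quantity being minimized is Sammon's stress: the sum, over all pairs of points, of the squared difference between the original distance and the projected distance, divided by the original distance, with the total normalized by the sum of all original distances. For simplicity, the sketch uses plain gradient descent with step halving rather than the update rule from the 1969 paper, and it assumes that all input points are distinct.

```java
// Hypothetical sketch of Sammon's stress and a gradient-descent refinement.
// Not Yooreeka's code. Assumes all input points are distinct (no zero distances).
class SammonSketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Symmetric matrix of all pairwise distances. */
    static double[][] pairwise(double[][] pts) {
        int n = pts.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                d[i][j] = distance(pts[i], pts[j]);
                d[j][i] = d[i][j];
            }
        }
        return d;
    }

    /** Sammon's stress of the projection y against the original distances dHigh. */
    static double stress(double[][] dHigh, double[][] y) {
        double[][] dLow = pairwise(y);
        double total = 0.0, error = 0.0;
        for (int i = 0; i < y.length; i++) {
            for (int j = i + 1; j < y.length; j++) {
                double diff = dHigh[i][j] - dLow[i][j];
                total += dHigh[i][j];
                error += diff * diff / dHigh[i][j];
            }
        }
        return error / total;
    }

    /** Gradient descent from a starting projection; a step is kept only if it lowers the stress. */
    static double[][] refine(double[][] dHigh, double[][] start, int iterations, double rate) {
        int n = start.length, m = start[0].length;
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                total += dHigh[i][j];
            }
        }
        double[][] y = new double[n][];
        for (int i = 0; i < n; i++) {
            y[i] = start[i].clone(); // leave the caller's array untouched
        }
        double best = stress(dHigh, y);
        for (int it = 0; it < iterations; it++) {
            double[][] next = new double[n][m];
            for (int i = 0; i < n; i++) {
                System.arraycopy(y[i], 0, next[i], 0, m);
                for (int j = 0; j < n; j++) {
                    if (j == i) continue;
                    double dLow = Math.max(distance(y[i], y[j]), 1e-12);
                    double w = (dHigh[i][j] - dLow) / (dHigh[i][j] * dLow);
                    for (int k = 0; k < m; k++) {
                        // step along the negative gradient of the stress
                        next[i][k] += rate * (2.0 / total) * w * (y[i][k] - y[j][k]);
                    }
                }
            }
            double s = stress(dHigh, next);
            if (s < best) {
                y = next;
                best = s;
            } else {
                rate *= 0.5; // step was too large; try a smaller one next time
            }
        }
        return y;
    }
}
```

A typical use would be to compute dHigh = SammonSketch.pairwise(records), choose a starting two-dimensional layout (random, or the first two PCA coordinates), and pass both to refine. Because each pair's error is divided by its original distance, small distances weigh more heavily, which is what lets the mapping preserve local structure.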

#yooreeka

Yooreeka Roadmap

I am working on the following changes for Yooreeka:

The BeanShell scripting library has served our project well. We investigated replacing it, but we determined that it is better to continue using it. We also evaluated the Spring Shell project and concluded that the benefits were not worth the effort. Another option, as a more powerful execution environment, would be to load the Yooreeka library in the execution shell of an interactive JVM language such as Groovy or Scala. However, this is trivial if you are already using Groovy or Scala, so there will be no changes to the Yooreeka project in that regard.

I am preparing a document that explicitly states the other open source projects that are used by Yooreeka, and their associated licenses. All projects that do not have a sufficiently permissive open source license will be replaced by others that do, or they will be cleanly decoupled from the Yooreeka core JAR. Starting with Yooreeka 2.0, we adopted the Apache 2.0 license with the intention that you can use Yooreeka within your projects, whether those are work related (proprietary) or not.

Lastly, I will introduce the Processing language for the purpose of creating powerful visualizations. If you are not familiar with Processing, I highly recommend that you take a look. Briefly, the Processing language offers the following:

Interactive programs with 2D, 3D or PDF output

OpenGL integration for accelerated 3D

Over 100 libraries that extend the core software beyond simple graphs and visualization

I think, and hope, that the above efforts will make Yooreeka more attractive and useful to researchers in academia and industry alike.

If you have any specific requests for the Yooreeka library, please create a ticket on the project's website or send me an email at yooreeka AT marmanis DOT com.

#yooreeka