Apps for Big Data @bigdataapps - Tumblr Blog

Seasonality in R

Some good options for seasonality. I was looking for an updated Holt-Winters implementation and came across this:

https://github.com/welch/seasonal

Blogs.Oracle.Com - Oracle R Enterprise

Using forecast and SES for a peek into what's to come...

Data Expo

I needed to find demo data for visualizations - this is a great set of data used bi-annually for a visualization challenge. I am interested in additional publicly available data sources; put them in the comments!

http://stat-computing.org/dataexpo/

R based systems - recommendation

Recommendation System in R

Interesting article on an R-based recommender: http://blog.yhathq.com/posts/recommender-system-in-r.html

Oracle gets behind R!!

Interesting news for the R community and another validating data point that R is mainstream and has impact for Enterprises.

http://www.oracle.com/technetwork/topics/bigdata/r-offerings-1566363.html

Soccer + Pagerank

Great post on how soccer can be analyzed using network theory.

http://www.technologyreview.com/view/428399/pagerank-algorithm-reveals-soccer-teams/

Quick-R

Great site for getting up to speed on R:

http://www.statmethods.net/index.html

R 2.15 is out

R 2.15 is out; detailed release information is here: http://www.r-bloggers.com/r-2-15-0-is-released/

Lots of small improvements and tweaks across the landscape. Some of the new load-balancing features in the parallel package (clusterMap's new argument, plus parLapplyLB and parSapplyLB) are worth digging into a bit more. As I get into this release I may have more to say.

#rstats #visualization

Huge list of great visualization resources.

#visualization

TidBits

Networking things I use rarely enough that I cannot remember them:

sudo tcpdump -i eth0 -s 65535 -w tcpoutput

tshark -i eth1 -f 'host 1.2.3.4' -R 'http' -S -V -l | awk '/^[HL]/ {p=30} /^[^ HL]/ {p=0} /^ / {--p} {if (p>0) print}'

Start a terminal session other people can join and watch

screen -d -R watch_me_code

to join the session:

screen -x watch_me_code

#linux #tidbits

HCP encode script

One-liner for encoding credentials for HCP HTTP-based access:

echo `echo -n $1 | base64`:`echo -n $2 | md5sum` | awk '{print $1}'
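For anyone who prefers to build the same string outside the shell, here is a minimal Python sketch of what the one-liner produces: the base64 of the first argument, a colon, then the MD5 hex digest of the second. The function name hcp_token is made up for illustration, and treating the two arguments as username and password is an assumption based on the description above.

```python
import base64
import hashlib


def hcp_token(username: str, password: str) -> str:
    """Return base64(username):md5(password), mirroring the shell one-liner.

    Assumption: $1 is the username and $2 is the password.
    """
    # base64 prints plain ASCII, so decode the bytes back to str.
    user_part = base64.b64encode(username.encode("utf-8")).decode("ascii")
    # md5sum prints the digest as lowercase hex; hexdigest() does the same.
    password_part = hashlib.md5(password.encode("utf-8")).hexdigest()
    return f"{user_part}:{password_part}"
```

Like echo -n in the shell version, this encodes the values without a trailing newline.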

#hcp #sh

Machine Generated Data: TempDuino II

This is the second article on MGD, the first is here. In that article I had set up a simple sensor to capture temperature and was recording that value every second into a file. We left off with the sensor running. Now that we have some data, let's get into it a bit and see what we can learn.

$ wc -l raw_temp_data.csv
43948 raw_temp_data.csv

Nice, almost 44,000 observations. Keep in mind that the majority of time in analysis is spent in data preparation and cleaning, especially if you have data from different sources in different formats. In this simplified example we begin to see some of what that data preparation and cleaning will look like using some basic Linux shell commands.

We know our data should all be of the form date,reading and here is a sample:

2012/01/17 07:07:59,56.26

The following is a regular expression representation of the form above, anchored at both ends so that nothing else is allowed on the line.

'^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$'

(If you are unfamiliar with regular expressions and are interested in this subject, that's a good area to invest your time.) The command "grep -v pattern" will return all lines which do not match the given pattern. That is perfect to see how "clean" our data is.

$ grep -v -E '^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$' raw_temp_data.csv

2012/01/16 22:03:16,60.660.65
2012/01/16 22:03:25,60.660.65
2012/01/17 04:45:22012/01/17 07:07:59,56.26
2012/01/17

Interesting. Four "dirty" entries out of 44k. The first two are double reads from the sensor. Then comes the interesting bit in the third line, where we lost a few hours. The time goes from 4:45:22 to 07:07:59. It turns out my computer kernel panicked at 4:45am. I didn't get to it until 7:07am. This is a classic missing data problem, but that's for another post. For now, we will simply drop the offending lines (in this case drop the -v from grep and redirect to another file) and move on.

$ grep -E '^[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}$' raw_temp_data.csv > temp_data.csv
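The same clean/dirty split can be sketched in Python for cases where grep is not handy. This is only an illustration, not what I ran: the names RECORD and split_records are made up, and the pattern is matched against the whole line (an assumption about how strict the check should be), so a line with junk before or after a valid reading counts as dirty.

```python
import re

# Same shape as the grep pattern: "YYYY/MM/DD HH:MM:SS,NN.NN".
# fullmatch() requires the entire line to match, so leading or
# trailing junk makes the line dirty.
RECORD = re.compile(
    r"[0-9]{4}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{2}\.[0-9]{2}"
)


def split_records(lines):
    """Return (clean, dirty) lists of lines with trailing newlines removed."""
    clean, dirty = [], []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD.fullmatch(line):
            clean.append(line)
        else:
            dirty.append(line)
    return clean, dirty
```

Passing an open file handle for raw_temp_data.csv as lines would give both lists in one pass, instead of running grep twice.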

Good enough for this experiment. What does this look like? Time series data lends itself pretty naturally to plots, and plots present a nice visual playground for understanding. Time to launch R.

Once R is running, I load the data which we have just cleaned into a data frame using R's built-in read.csv function. The file has no header row, so header=FALSE keeps the first observation from being read as column names.

> data <- read.csv("./temp_data.csv", header=FALSE)
> plot(data[,2])

That has some oddities in it, most likely a bad sensor read. Let's dig around a bit for low temps.

$ awk -F"," '{if ($2 < 45.00) print $2}' temp_data.csv
21.98
33.41
19.34
29.01

Since the outdoor temp overnight was 28, these values are probably garbage.

$ awk -F"," '{if ($2 > 45.00) print $1","$2}' temp_data.csv > cleaned_temp_data.csv
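The threshold filter is simple enough to sketch in Python as well. The function name drop_low_readings is made up, the 45.0 cutoff is the one used in the awk command, and the input is assumed to have already passed the format check, so every row is timestamp,reading.

```python
import csv


def drop_low_readings(lines, minimum=45.0):
    """Keep (timestamp, reading) pairs whose reading is strictly above minimum.

    Assumption: each line is already validated as "timestamp,reading".
    """
    kept = []
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows rather than crash
        timestamp, reading = row[0], float(row[1])
        if reading > minimum:  # same strict comparison as the awk filter
            kept.append((timestamp, reading))
    return kept
```

A fixed cutoff is a blunt instrument; it works here only because the bad reads sit far below anything the room could plausibly reach.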

Add some color, labels, legend and a spline:

Conclusions:

The first thing that pops into my mind is that my temperature sensor is not very good. I mean, look at all those outliers! The most interesting thing, however, is that the very simplest sensor system captures enough data that really interesting insights can be derived. Extrapolate that across industries and different data available, be it from log files on a web server or sensors in a factory, and it's interesting to think about what can happen. That's the big deal about machine generated data. There is real value here currently not being leveraged.

Thanks to @statsinthewild for help on the finer points on R's plot command.

#MGD #R #bigdata

The core components

I believe in abstractions. Here is one I use; it may seem obvious to some.

When I am building a technology system, there are three fundamental things I want to do with information: move it, compute it, save it for later. I like to decompose how I think of systems into those basic pieces, and build up from there. It is a simple method, but one I have come to rely on.

I came to this mindset shortly after getting my head around the schedule, scope and resources argument from The Mythical Man-Month. If you are not familiar with that book, its concepts are still relevant.

Usually I will draw a triangle and move the lengths of the sides based on how much of each I will need to solve a particular problem, or how much time I will need to spend on each component.

Other times when I am digesting a new complex system I like to think how the system reacts as I vary each of those three knobs. This invariably helps me understand the system better, which leads to knowing where to look to solve whatever comes up. Because something always does.

Move it: For any system to be useful it has to be able to get data. This can be over PCI, serial, USB, Ethernet, SCSI or Fibre Channel: technologies whose purpose is to get the bits and bytes from point A to B. Design point - the bits I send from A arrive at B. At this level ordering, error correction, bandwidth and latency are all implementation details.

Save it: A non-volatile place to put my 1's and 0's. USB stick, hard drive, SSD, tape, all the way to big enterprise storage systems from the big guys (EMC and HDS, my past and present employers). The design point of this technology is all largely the same - when 1's and 0's are put into the device, they can be retrieved at a later point in time. Implementation details - reliability, availability, throughput, latency.

Compute it: People call this business logic, which is a phrase that is frustrating to me since it's just a series of if-then-else statements, but marketing needs something to do. This is interesting from the TempDuino up to large scale Hadoop clusters of tens of thousands of nodes (a few years ago I would have used Beowulf cluster here). High level design point - adding 1 and 1 gives me 10 or 2 (binary or decimal); the difference is in the implementation.

This simple method has helped me through all of my experience in technology. It was instrumental in building a deduplicating block storage system at EMC, and in grokking the complex telephony and ad targeting systems at Jingle (acquired by Marchex). Currently I am applying this to the large scale analytics and scale out storage world.

Would love to hear your thoughts, and what simple schemes are working for you.

Machine Generated Data: TempDuino

I've been hearing a LOT about machine generated data (MGD) lately. I am incredibly interested in getting into the nuts and bolts of what the value of all of this data is, how to manage it, and how to draw insights from it. So in that light, I started wondering how I could better get my head around this problem space and form a fresher perspective.

First - what is machine generated data? The general format in my experience is:

time, value, value, value

where "," can be any arbitrary field separator and values can be sensor values, IP addresses etc. Whatever it is (server log files, sensor data, CDRs or financial trades), MGD is almost always a time, value(s) pair.
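To make the "time, value(s)" shape concrete, here is a minimal Python sketch of a record parser. Everything about it is illustrative: the name parse_record is made up, the default timestamp format is borrowed from the TempDuino output shown later in this post, and real sources will each need their own separator and time format.

```python
from datetime import datetime


def parse_record(line, sep=",", time_format="%Y/%m/%d %H:%M:%S"):
    """Split one MGD record into (timestamp, [values]).

    Assumption: the first field is the time and every remaining
    field is a value, left as a string for the caller to interpret.
    """
    fields = line.strip().split(sep)
    timestamp = datetime.strptime(fields[0], time_format)
    return timestamp, fields[1:]
```

Values stay as strings on purpose, since one source may carry temperatures and another IP addresses.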

My thinking was straightforward: how to generate the simplest set of interesting and new (to me) machine data and then dig around in it. I decided to use my Arduino Uno circuit board and a simple temperature sensor. Using a single sensor (temperature here - not that it is terribly important) and capturing data every second from the Arduino will give me a reasonable set of data to start - 86,400 observations a day.

For the curious, the temperature circuit is a slightly modified version of the one from "Arduino Experimentation Kit Example Code".

I am intentionally NOT using application logs (like Apache) in an effort to broaden my thinking here. I've spent tons of time looking at web clicks and doing call trending. What about tractors with soil testing sensors, pressure sensors on oil/natural gas pipelines, and car/plane black boxes? How similar is the problem space? My gut says it is nearly identical, but I haven't proven that to myself.

Back to the experiment. This is what my "machine", the TempDuino, spits out:

2012/01/17 10:52:38,62.41
2012/01/17 10:52:39,62.41

Note: the TempDuino actually just writes some bytes to the serial port. These are read by a simple Python script, which inserts the timestamp and writes to stdout; the output is piped to a file.
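The timestamping step can be sketched roughly as follows. This is not the actual script: the names timestamp_lines and main are made up, and stdin stands in for the serial port so the sketch has no dependency on a serial library.

```python
import sys
from datetime import datetime


def timestamp_lines(readings, now=datetime.now):
    """Yield "YYYY/MM/DD HH:MM:SS,reading" for each non-empty reading.

    readings is any iterable of text lines; now is injectable so the
    clock can be faked when testing.
    """
    for raw in readings:
        reading = raw.strip()
        if not reading:
            continue  # ignore blank lines from the port
        yield now().strftime("%Y/%m/%d %H:%M:%S") + "," + reading


def main():
    # Assumption: readings arrive on stdin, one per line, in place of
    # the serial port. Flush so a downstream file sees each line promptly.
    for line in timestamp_lines(sys.stdin):
        print(line, flush=True)
```

Calling main() and redirecting stdout to a file reproduces the "piped to a file" arrangement described in the note.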

In this version, each observation is a timestamp and a sensor value. One reading every second. I set this up around 10pm EST and let it rip all night. For the next post I will dig through the data a bit, and hopefully draw some nice conclusions.

The proliferation of MGD is only going to increase. (If anyone has studies showing reasonably researched MGD growth rates, I'd be very appreciative.)

#python #arduino #MGD #machine generated data

Data Analysis

I recently started to read Data Analysis from O'Reilly Media. As the space for analytics and "Big Data" grows, the quality and quantity of our (the product and practitioner side) learning resources will need to expand, and this book is a significant entrant. It is broad enough in its applicability and provides enough high level references to concrete tools that it should find broad appeal.

In my experience there are two camps - "modelers" and "data miners". These roughly correlate to educational background, with modelers clearly stemming from statistics and the data miners more focused on computer science. There is not a lot of cross pollination to date. I am sure there are exceptions to this, but I am speaking in gross generalities. Data Analysis spends roughly the same number of pages on each, so it should be useful for those deep in one or the other.

As I have long stated, R and to a lesser extent Octave will lead (and already are leading) the first round of open source applications in this space. So with that in mind I was extremely pleased by the organization of the book, and the inclusion of what Janert calls "workshops", which are short subsections devoted to applying the lessons from the chapter. The workshop for R was short, but provides enough of a flavor of that environment that interested parties will explore it further.

With its broad coverage and useful workshops, from applications like R and the scripting language Python to Berkeley DB and SQLite for data management, this book provides a base set of skills that make it a great starting point for getting any data project off the ground.

#books #r #octave

Big Data. And apps

It's been a while since I was actively blogging; hopefully this will change as I begin to look into the wide world of enterprise computing on this blog. It seems fitting as I move further into my tenure as a technologist with Hitachi Data Systems in the File and Content areas.

I will probably post random bits of technology here as well.
