Marine Scientist Turned Data Scientist @datasciencegirl - Tumblr Blog

Supervised Learning Algorithms Done Blind

Have you ever had to analyze a data set without knowing what the features were called or what the data set consisted of? This was a first for me. Luckily, this data challenge had guidelines. It was a fun exercise, and I was rewarded when the description of the data set was revealed and I was able to infer what my results really meant!

The Problem

Given an X and Y training data set (first hint: Supervised Learning) of over 4000 instances and 600+ features and an `Xtest` data set of more than 2600 instances, I was to provide accuracy estimates for a classification model of 22 classes, and apply the model to the test data set.

Steps of the Analysis

I first needed to determine which classification algorithm to use. Since I decided to perform analyses in Python, I referred to the Scikit-learn flowchart (shown below) to determine which algorithm to dive into first.

With plenty of observations and the need to predict categories of a labelled data set, the chart suggested that I apply a linear Support Vector Classification to the data set. As a requirement for this analysis, all features should be on the same scale; inspection of the minimums and maximums of all features indicated that they were.

    import pandas as pd

    # xTrain is the pandas DataFrame of training features
    xTrain.describe()

Next, I randomly split the training data set into x and y training and validation data sets, so that the model could be checked against held-out data.

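    import numpy as np
    from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

    # Hold out a third of the training data for validation
    x_train, x_val, y_train, y_val = train_test_split(
        xTrain, yTrain, test_size=0.33, random_state=42)

    # Flatten the label columns into 1-D arrays, as scikit-learn expects
    y_train = np.ravel(y_train)
    y_val = np.ravel(y_val)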

I then trained the model by applying Python’s scikit-learn linear SVM algorithm to the randomly sampled x and y training data set. And I modeled the prediction of the x validation data set.

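    from sklearn import svm

    # The fitting step is not shown in the original snippet; LinearSVC is assumed here
    clf = svm.LinearSVC()
    clf.fit(x_train, y_train)

    preds = clf.predict(x_val)
    score = clf.score(x_val, y_val)  # mean accuracy

    # Manual accuracy check
    correct = 0
    for i in range(len(y_val)):
        if y_val[i] == preds[i]:
            correct += 1
    acc = correct / float(len(y_val)) * 100.0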

I then calculated the precision, recall, F-score and support for each class by comparing the predicted response to the y validation response variable. These descriptors are defined below in terms of True Positives (TP), False Positives (FP) and False Negatives (FN):

precision (the fraction of predictions for a class that are correct) = TP/(TP + FP)

recall (the fraction of a class's instances that are found) = TP/(TP + FN)

F-score: weighted harmonic mean of precision and recall. Values range from 0 to 1, where 1 is a perfect score; the default F1 score gives equal importance to recall and precision.

support: the number of occurrences in each class
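To make these definitions concrete, here is a minimal sketch in plain Python that tallies TP, FP and FN for each class and derives the four descriptors. It is only an illustration, not the code used in the analysis (scikit-learn's `classification_report` does this work there); the function name `per_class_metrics` and the toy labels are made up for the example.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # this class was predicted, but wrongly
            fn[truth] += 1  # an instance of the true class was missed

    results = {}
    for label in sorted(set(y_true)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1, support)
    return results


# Toy labels, invented for illustration only
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 0, 1, 0, 2]
results = per_class_metrics(y_true, y_pred)
for label, (p, r, f1, n) in results.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2), n)
```

Note that a class can score well on precision and poorly on recall (or the reverse), which is why both are reported alongside overall accuracy.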

    from sklearn.metrics import classification_report

    print(classification_report(y_val, preds))

As shown in the table below, class accuracies ranged from 0.72 to 1.0. A manual check of mean accuracy (correct predictions/total number of predictions) agreed exactly with the model's own score (94%). Another point to note is the discrepancy between precision and recall for the 13th and 14th classes.

Results were also visualized in a confusion matrix, as shown below. As you can see more clearly in this figure than in the chart above, the 13th and 14th classes (12th and 13th if you count from zero!) were not classified as well as the others. I wonder why that is?
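For readers who want to see what sits behind such a figure, the sketch below builds a confusion matrix by hand in plain Python and pulls out the most frequently confused pair of classes. It is an assumption-laden illustration rather than the plotting code from the analysis: the helper names and toy labels are invented, and the rows-are-true, columns-are-predicted layout simply follows the same convention as scikit-learn's `confusion_matrix`.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def most_confused_pair(matrix, labels):
    """Return (count, true label, predicted label) for the largest off-diagonal cell."""
    return max(
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    )


# Toy labels, invented for illustration only
labels = [0, 1, 2]
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 0, 1, 2, 2]

matrix = confusion_matrix(y_true, y_pred, labels)
worst = most_confused_pair(matrix, labels)
for row in matrix:
    print(row)
print("most confused (count, true, predicted):", worst)
```

Large off-diagonal counts that mirror each other across the diagonal are the signature of two classes being mistaken for one another.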

Out of curiosity, I compared the prediction accuracy of other algorithms (CART, K-nearest neighbors, and polynomial and radial basis function (RBF) SVC kernels) to linear SVC. Linear SVC surpassed the accuracy of the decision tree (90%), K-nearest neighbors (87%) and polynomial SVC (90%). RBF SVC did as well as the linear SVC. However, I would argue that the linear model is the most parsimonious.

    from sklearn import metrics, svm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Random forest (its results are not discussed in the text)
    forest = RandomForestClassifier(n_estimators=50, oob_score=True)
    forest.fit(x_train, y_train)

    # CART decision tree
    model = DecisionTreeClassifier()
    model.fit(x_train, y_train)
    predicted = model.predict(x_val)
    cartreport = metrics.classification_report(y_val, predicted)

    # K-nearest neighbors
    knn = KNeighborsClassifier()
    knn.fit(x_train, y_train)
    knnpreds = knn.predict(x_val)
    knnscore = knn.score(x_val, y_val)
    knnreport = metrics.classification_report(y_val, knnpreds)

    # SVC with a degree-3 polynomial kernel
    poly3svc = svm.SVC(kernel='poly', degree=3)
    poly3svc.fit(x_train, y_train)
    poly3svcPreds = poly3svc.predict(x_val)
    poly3svcScore = poly3svc.score(x_val, y_val)

    # SVC with a radial basis function kernel
    rbfsvc = svm.SVC(kernel='rbf')
    rbfsvc.fit(x_train, y_train)
    rbfsvcPreds = rbfsvc.predict(x_val)
    rbfsvcScore = rbfsvc.score(x_val, y_val)

Finally, the model was applied to the X test data set, and the predictions (row numbers and predicted responses) were submitted to the judge.

predTest = clf.predict(xTest)

Results and Inferences

The accuracy of my test set was the same as that calculated for the validation set (94% - yes!!). And the meaning of the results became clear when I was told what the data represented: it was a digital recording of people reciting the first 22 letters of the alphabet! Can you guess why classes 13 and 14 were poorly categorized? Well, it turns out they represent the letters m and n, which can easily be mistaken for one another when spoken/recorded.

This was a fun exercise: to go into a data set blindly, classify it and test the accuracy of the model, and then be rewarded with its meaning post-hoc!

The code snippets above were extracted from the full script which can be found on my GitHub repo.

#machinelearning #scikit-learn

Natural Language Processing: a simple back-off ngram algorithm

I have a love-hate relationship with predictive text. At times it can improve the efficiency of my communications. And other times it can serve as a form of entertainment. Or it can get me into trouble.

With the aim of better understanding how predictive text algorithms work, I developed a language prediction system similar to what you find on smartphones and word processors. This small project would give me an appreciation for those sometimes-pesky word predictors, and familiarize me with Natural Language Processing (NLP) tools, packages, and methods.

I took the following general steps, which are detailed in the remainder of the post:

Obtain a corpus that is representative of the language of interest.

Clean this corpus so that it is free of special characters, numbers, excess whitespace, etc.

Tokenize the corpus into ngrams of different lengths (e.g., quadgrams, or 4-word terms).

Formulate an algorithm that is trained on the cleaned and tokenized corpus, that can take a sequence of words and predict the subsequent term.
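The four steps above can be sketched in a few lines of plain Python. This is a hedged illustration of a simple back-off scheme (try the longest matching context first, then fall back to shorter ones, with no smoothing or discounting), not the implementation from the project itself. The function names, the tiny corpus and the `max_n=4` default are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Crude cleaning step: lowercase and keep only alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def build_model(tokens, max_n=4):
    """Map each context tuple (0 to max_n - 1 words) to counts of the next word."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[tuple(context)][word] += 1
    return model


def predict_next(model, words, max_n=4):
    """Try the longest available context, backing off one word at a time."""
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    while True:
        if context in model:
            return model[context].most_common(1)[0][0]
        if not context:
            return None  # empty model: nothing to predict from
        context = context[1:]  # back off: drop the oldest word


# Tiny made-up corpus, for illustration only
corpus = "The cat sat on the mat. The cat ate the fish. The cat sat on the sofa."
model = build_model(tokenize(corpus))

print(predict_next(model, ["the", "cat"]))  # context seen in the corpus
print(predict_next(model, ["dog", "sat"]))  # unseen bigram, backs off to "sat"
print(predict_next(model, ["zebra"]))       # unseen word, backs off to unigrams
```

Backing off all the way to the empty context means the predictor always has an answer (the most frequent word overall), at the cost of that answer being less and less specific to what was typed.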

#nlp #text prediction

Algo(s) of the Day: kNN and K-Means (or "Which of these things is most like the other?")

k-Nearest Neighbor (kNN) and k-Means clustering are two of the most commonly used, and relatively easy to comprehend, methods for grouping data by similarity. However, these algorithms serve different, but potentially overlapping, functions and are easily confused with one another.

k-Means allows you to take an unlabelled data set of instances and put each instance into one of k groups. The use of an unlabelled data set makes this method unsupervised.

For example, let's say you get a big pile of plant species that have never been named, but you know features (or measurements) of the plant species such as size, color, shape, weight, etc. You can use these features to start putting them into piles or groups of "similar" plants.
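As a rough sketch of the plant-sorting idea, here is k-Means written out in plain Python (the classic assign-then-update loop, often called Lloyd's algorithm). Everything specific here is an assumption for illustration: the function name, the seeded random initialization, and the two made-up measurements per plant.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Group points into k clusters; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Made-up (size, weight) measurements for six unnamed plants
plants = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
assignments, centroids = kmeans(plants, k=2)
print(assignments)
```

The cluster numbers themselves are arbitrary; what matters is which plants end up sharing a number.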

kNN allows you to use a labelled data set to classify an unlabelled instance based on its proximity to its k nearest neighbors. In this supervised method, the "training" of the model is carried out at the time of classifying an unlabelled instance. Depending on the data type, the hard work could be in creating the labelled data set (k-Means could potentially be used to create these labels) rather than training a model; the data set is the model. The expense comes when testing unknowns and, depending on the size of the data set, storing all that data.

For example, you obtain a pile of identified plant species, each described by a number of features (or measurements). But then a scientist discovers a new species and your job is to figure out which known species it is most similar to.
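The new-species example can be sketched just as briefly. In this minimal plain-Python version, the labelled data set really is the model: classifying a newcomer means measuring its Euclidean distance to every known plant and letting the k closest ones vote. The function name, the species labels and the measurements are hypothetical.

```python
import math
from collections import Counter


def knn_classify(labelled, query, k=3):
    """labelled is a list of (features, label) pairs; returns the majority label
    among the k labelled points nearest to query."""
    nearest = sorted(labelled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up (size, weight) measurements for plants of two known species
known_plants = [
    ((1.0, 1.1), "species_a"),
    ((0.9, 1.0), "species_a"),
    ((1.2, 0.8), "species_a"),
    ((8.0, 8.2), "species_b"),
    ((7.9, 8.1), "species_b"),
    ((8.3, 7.7), "species_b"),
]

new_plant = (7.5, 8.0)
closest_species = knn_classify(known_plants, new_plant, k=3)
print(closest_species)
```

Notice that there is no fitting step at all, and that every query scans the whole labelled data set, which is the storage and testing expense mentioned above.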

Basic Assumptions: For these analyses, k represents either the number of clusters created (k-Means) or the number of neighbors (kNN) used for classification. Because the analyst must determine the optimal number for k, neither method runs entirely hands-off (this is not the same as semi-supervised learning, which mixes labelled and unlabelled data).

Both algorithms calculate some sort of similarity (often a distance metric) between data points. For example, Euclidean distance is a common metric calculated in feature space.

Both methods are non-parametric and therefore lack a need to assume anything about the distribution of the data (which is one less assumption to worry about, yay!).

In upcoming posts I will go into more detail (and Python code) on how these algos work. I might even compare the explicitly written code with (for example) scipy or R functions in an effort to demonstrate how data science can be as much an art form as it is a form of science; beware of situations that lead to inaccurate solutions, which can be avoided through visualizations of your data (if possible). For example, how would you categorize the yellow tomato below?

#k-mean #knn #algorithm #data science

Curious About Predicted Temperature Changes Across the World? Then use this app!

I have neglected my blog while diving into a sea of Coursera MOOCs. While it has been fun honing my data science skills, it has pulled me away from demonstrating them!

Here I show an app created in Coursera's Developing Data Products, the final of 9 courses that require completion before taking on the Capstone Project. In this simple project, we are meant to create an app using R's web application framework Shiny and deploy this app on the Shinyio host.

Motivation and Scope: Having an interest in maps and climate and model predictions, as well as a limited amount of time to tackle the project, I decided to create maps of predicted temperature across all the countries on our globe. I utilized data from the World Bank Knowledge Portal which was accessible via the rWBclimate R package interface (which can be found on GitHub and is described in the R vignette and pdf).

About the Data Used: This World Bank data set offers predicted responses of temperature and precipitation to rising greenhouse gases across all countries according to 15 different General Circulation Models (GCMs). These are the models used in the Intergovernmental Panel on Climate Change 4th Assessment Report. Each GCM produces a hindcast as well as a forecast for two different climate scenarios (IPCC, 2000):

scenario `a2` corresponds to relatively unconstrained growth in emissions that increase over time and reach 850 ppm by 2100,

scenario `b1` eliminates increases in global emissions by allowing them to level off around mid-century and reach 550 ppm by 2100, therefore predicting a world with fewer emissions.

The 15 GCMs are compiled into ensembles (with quartiles across the range in model predictions) representing each of these climate scenarios. To simplify interpretation of model outputs, I only considered the median value for each ensemble type, for each country and 20-year time period between 1920 and 2100.

To use this app, simply select a hindcast ("pre-2000") or a prediction climate scenario (scenario `a2` or `b1`) and then select the desired two-decade range in years (e.g., "1920-1939") in order to see the mean annual temperature for each selection (top and middle plots) as well as the difference between selections across the globe (bottom plot). These data represent average temperature predictions by ensembles of general circulation models for past, high (`a2`) and/or moderate (`b1`) changes in greenhouse gases between the years 1920 and 2099. Be patient, as this app can take a few minutes to load. I show an example snapshot of the page below. In this snapshot, I compare the high greenhouse gas emission scenario for 2080-2099 to the hindcasted years 1920-1939. The difference between these two suggests that, to avoid the most extreme rise in temperature, you should avoid high latitudes (or find another hospitable planet to live on!).

More information on how this was constructed is shown in this report.

The R shiny code can be found on this GitHub repo.

#climate #worldbank #gcm #shiny #r

Diversity is the Key to Phytoplankton Productivity

My submission to the PacX challenge discussed in my previous posts was motivated by the main thrust of my research: modeling phytoplankton diversity in the ocean.

Thousands of phytoplankton species live throughout the ocean. These microscopic, plant-like, single-celled organisms differ in their preferences for nutrients, light, and temperature. Some can cope with very low nutrient concentrations in oceanic gyres, while others cannot live without high pulses of nutrients and light. These differences allow phytoplankton species to fill various ecological niches and therefore avoid competition for resources.

Despite this large variety of phytoplankton species in the ocean, ecosystem modelers often only consider a handful of types - usually TWO, to be exact. Depending on the question at hand, including only one "big" and one "small" phytoplankton does the trick. But when questioning the functioning of an ecosystem with a variety of niches, it may be important to represent the diverse array of life in the ocean.

A novel approach to model phytoplankton diversity in the ocean has been developed by Mick Follows, Penny Chisholm and other members of The Darwin Project at MIT. This group has formulated a self-emergent ecosystem model that includes on the order of 100 phytoplankton types (78 to be exact). While these analogs for phytoplankton species are not exactly representative of the large number of species in the ocean, this representation of modeled diversity comes closer than ever before. The modeled phytoplankton types are "designed" by randomly assigning a set of traits that enable survival in a range of realistic conditions (e.g., low or high nutrients, warm or cold water, high or low light). All of these phytoplankton types are "thrown" into a realistic physical and chemical simulated ocean, and the games begin - phytoplankton "emerge" in regions where they are suited to thrive.

The Darwin Project group applies this modeling approach on a global scale, providing enlightening insights into the factors that drive phytoplankton diversity patterns across the entire ocean, as shown in Andrew Barton's figure of global phytoplankton diversity above. At the University of California Santa Cruz, Chris Edwards, John Zehr and I applied this modeling approach on a local scale to the California Current System (CCS). The model performed well, capturing observations of silicate-imbibing diatoms in the high nutrient, chilly coastal waters as well as the small phytoplankton types able to withstand the low nutrient, relatively calm and warm waters offshore. The model also represented seasonal/temporal variation in the phytoplankton types, with large diatom blooms that coincided with the spring upwelling season, and an increase in small phytoplankton types in winter.

The ability of the model to capture these general temporal (seasonal) and spatial (biogeographies) variations provided confidence in the model performance. However, of the 78 phytoplankton types modeled, only a dozen or so contributed to the upper 99% of the total modeled phytoplankton biomass (or productivity). What about the remaining 65 phytoplankton types just hanging around at low, background concentrations? Would these types play an important role with changing environmental conditions? Or were we "wasting" a bunch of computational effort to calculate the activity of 65 phytoplankton types that simply "hung around" in the background contributing very little?

Ecological research questions aside, I was slightly fed-up with the time it took to run these computer-intensive model simulations (!). And I could not help but wonder, "How many phytoplankton do we actually need to model the CCS?". This led to a new application of this modeling approach that tested the link between phytoplankton diversity and their function in the ocean ecosystem.

#phytoplankton #ecosystemmodel

Measuring Phytoplankton Patchiness (and Who Won the Competition?)

So far, the descriptive analysis of the Wave Glider observations provided value in my effort to refine our estimates of phytoplankton in the ocean. But what about the replicated Wave Glider observations that I raved about earlier? When do those come into play?

I applied an autospectral analysis to replicate Wave Glider observations. Briefly described, this analysis is used to measure the covariance between the original data set and a lagged duplicate data set (which is carried out after cleansing the data set by removing temporal trends and interpolating data over equally spaced sampling distances). This correlation indicates how similar one data point is to its neighbor. This process is replicated with each lag of the data set (over a set distance) until the original and lagged data sets are no longer correlated; the distance at which the data sets are no longer correlated, as determined by a set threshold (dotted line below), is known as the “Decorrelation Length Scale” (DLS).
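The lag-and-correlate procedure described above can be sketched in plain Python. This is an illustration under several stated assumptions, not the analysis code used for the Wave Glider data: the series is taken to be already detrended and evenly spaced, the threshold of 1/e is one common convention (the post does not say which threshold was used), and the sine-wave "chlorophyll" signal and 0.5 km spacing are invented for the example.

```python
import math


def autocorrelation(series, lag):
    """Correlation between the series and a copy of itself shifted by lag samples."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def decorrelation_length(series, spacing_km, threshold=1 / math.e):
    """Distance at which the autocorrelation first drops below the threshold."""
    for lag in range(1, len(series)):
        if autocorrelation(series, lag) < threshold:
            return lag * spacing_km
    return None  # the series never decorrelates over the sampled distance


# Synthetic signal: patches that repeat every 20 samples (10 km at 0.5 km spacing)
series = [math.sin(2 * math.pi * i / 20) for i in range(200)]
dls = decorrelation_length(series, spacing_km=0.5)
print("decorrelation length scale (km):", dls)
```

A shorter decorrelation length means neighboring samples stop resembling each other sooner, i.e. the signal is patchier at small scales.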

This DLS distance gives a measure of spatial “patchiness” or variability of chlorophyll in the ocean. I applied this analysis to each Wave Glider transect across the variety of ecosystem types (Coastal Upwelling, Transition Zone, Equatorial Upwelling, and Oligotrophic). Means and standard deviations of the replicate patchiness measures (i.e. autospectra or DLS) indicated how patchiness varied among ecosystem types. Results demonstrated a decrease in patchiness with distance offshore that corresponded nicely to an increase in the spatial scale of eddies.

So what does this all mean? Why would this measure of patchiness be of interest to an oceanographer?

Although Wave Gliders can collect similar types of data as ships at a fraction of the cost (and with virtually zero environmental impact), ship time is still necessary to carry out experiments and make more complex measurements. That said, the high costs and environmental impacts of ships emphasize the need to spend time at sea efficiently. Therefore, knowing when, where, and how frequently to sample is of high value to oceanographers.

The autospectral analysis above suggests that sampling distances of 4 to 7 km in the coastal upwelling zone off of CA are adequate for capturing phytoplankton patchiness. This distance increases to 30 km in offshore waters. However, in the transition zone between these two regions, one must sample at a minimum of every 8 km to have any hope of capturing the phytoplankton variability in this highly dynamic region! And let us not forget that none of this applies to the daily variations in phytoplankton, which would require virtually stationary sampling on the order of less than every 24 hours in high growth systems.

Furthermore, Wave Gliders can serve as scouts that locate the optimal sampling region for a ship, while collecting a continuous record of the conditions in that region before the ship arrives and until after it departs.

With regards to my own research, I see Wave Gliders as the ultimate observer for those hard-to-get-to regions. This type of data is extremely valuable for ecosystem modelers who wish to check the consistency of their models, and even assimilate data into their models, for regions where little is known or observations are highly variable.

So how did my research fare in the PacX Challenge? Well, the competition was stiff and all the finalists carried out some great research. But I was beaten out in the end by the similar approach of Tracy Villareal of the University of Texas. I wish Tracy all the best in his future Wave Glider endeavours!

#PacX #Wave Glider #autospectra #patchiness #phytoplankton #chlorophyll

What Technology Tools Measure the Ocean Best (Part II)?

In order to compare the Wave Glider data to traditional methods of data collection, as described in my last post, my first task was to clean up the Wave Glider data set. There were gaps between observations, likely during energy conservation and servicing periods. These gaps would be critical when it came time to apply autospectral analyses, as I will discuss shortly.

Once unreliable data and gaps were omitted, it was necessary to calibrate the raw fluorometer measurements so that they represented actual chlorophyll concentrations. This is commonly done with either simultaneous in situ measurements or in the lab with lab-grade standards. Liquid Robotics had performed the latter at the start and end of the long 13+ month transect. However, large drift in the sensor output rendered lab measurements useless. The sparsity of in situ chlorophyll measurements meant relying on Satellite Images (yes, that data source that gives great synoptic coverage, but at the cost of a coarser sampling resolution compared to Wave Gliders!). It was the comparison of these two methods of data collection, and their differences, that revealed my first insights about the Wave Glider observations.

In point-by-point comparisons of Wave Glider and Satellite Image observations at the same location and time, Satellite Image data points could only account for 20% of Wave Glider data (due to the positioning of the Satellite in relation to the rotation of the earth). This resulted in poor agreement between Wave Glider and Satellite Image data in dynamic, often cloud-covered coastal waters, where phytoplankton concentrations were changing faster (i.e., advected by eddies, physiological changes, growth and death on time scales of less than one day) than could be captured by daily Satellite Image snapshots. However, in the open ocean systems, such as that between CA and HI, Wave Gliders actually captured the same large scale features as Satellite Images averaged over the same time-span as the duration of the WG journey (shown below as dots that overlie the SI).

The most surprising result was revealed along the transect from HI to the equator. Again, Wave Gliders captured the low and high phytoplankton regions measured by Satellite Image. In addition, WGs could detect daily changes in phytoplankton physiological responses, which are currently impossible to detect by Satellite (we would need many more of these up in space to accomplish this - a far more expensive option!).

So overall, I found that despite differences in the way that Wave Gliders and Satellite Images measure the proxy for phytoplankton (i.e., chlorophyll) in the ocean over space and time, their observations were complementary. Wave Gliders provided the detail not provided by Satellite Images, and Satellite Images provided the synoptic coverage not available through the replicate number of Wave Gliders deployed. So it is difficult to proclaim that one is better than the other. With these two technologies, I really do think we could reduce our highly variable estimates of phytoplankton biomass in the ocean!

But what can we do with the replicate Wave Glider observations? tbc.

#Wave Glider #Satellite #phytoplankton #chlorophyll

What Technology Tools Measure the Ocean Best (Part I)?

In my last post, I talked about my proposed idea to improve estimates of phytoplankton biomass in the ocean using the unique data set collected by sea-faring robots. What is it about the Wave Glider data that sets it apart from the many other, more traditional measures of phytoplankton in the ocean?

Traditionally, phytoplankton are measured by boat, stationary moorings, and satellite images. While all of these methods have their merits, they either lack spatial coverage (even the synoptic coverage of satellites, often blocked by cloud cover) or temporal variability (often offering only a snapshot in time).

Wave Gliders overcome many of these drawbacks by sampling at such high spatial resolution (every quarter mile or less, if needed) that they can behave as a stationary mooring (more on that in upcoming posts). And by sending many Wave Gliders out in the ocean they can cover more ground and sample many ecosystem types.

Another benefit of being able to send more than one Wave Glider into the ocean is the ability to replicate observations. This is a data scientist's dream: to measure the same regions of the ocean at the same time with independent sensors. No traditional methods of observing phytoplankton in the ocean do this. This trait would be the key to answering my research question, by providing error bars on my calculations!

Robots Collect a Sea of Data

Liquid Robotics of Sunnyvale, CA sent four remotely operated robots, called Wave Gliders (WGs), across the Pacific Ocean. Propelled along the surface by the motion of the ocean (using a set of undulating underwater wings, as pictured below) and equipped with an array of sensors powered by the sun, the WGs collected streams of data: salinity, temperature, oxygen, fluorescence (a proxy for chlorophyll-containing phytoplankton - more on that soon!), GPS location, and weather. Liquid Robotics wanted to know what scientists could do with this data set, and held a competition to find the best idea.

I had a great idea! I proposed to improve estimates of the amounts of phytoplankton in the ocean. But why do we need these measurements? And why don't we know how much phytoplankton is in the ocean already, you ask?

Keep in mind that these microscopic plant-like cells are tiny but mighty, and come in all shapes and sizes. Phytoplankton provide the primary source of fuel (think Carbon) for marine life in the ocean (e.g., ultimately affecting organisms such as the fish we eat). Meanwhile, phytoplankton provide our atmosphere with more than half of the oxygen we breathe, while sequestering much of the carbon we expel. Therefore phytoplankton play a critical role in the health of our ocean and atmosphere - and by default, the well-being of life on earth.

The importance of phytoplankton is rivaled by the difficulty in measuring them. The ocean is big (0.3 billion cubic miles) and phytoplankton are tiny (on the order of microns in size). In addition, phytoplankton grow and die rapidly with changes in their environment, while being transported over the expanse of the ocean by currents. All this variability in the biology of phytoplankton and the physics of the ocean results in patches of phytoplankton that change on the order of hours to days throughout the ocean, as demonstrated in the video below (these images are from an ecosystem model that I run).

All this variability results in estimates of phytoplankton that are no better than plus or minus 50%, at best. Without better estimates of phytoplankton in the ocean, predictions of oxygen and carbon in the Earth's ocean and atmosphere are compromised. This is critical during the present phase of rapid climate change, where we need answers soon and fast!

Liquid Robotics liked my idea enough to place me in the top five finalists who would compete for 6 months of Wave Glider time and $50,000 of grant funds to carry out the accompanying research. In the meantime, I had some work cut out for me to develop and apply my idea to the data set!

Embarking on an exciting, new adventure

My Biological Oceanography career is coming to a close as I open the doors to new opportunities in the big wide world of data.

Why and how did I decide to make this transition? I love the ocean, phytoplankton and ecosystem modeling. But I equally love DATA and all the accompanying programming languages and techniques. Funding challenges have dissuaded me from the academic track. However the vast availability of data has lured me toward a new and exciting Data Science track! As I finish off a few manuscripts and model simulations in the ocean sciences, I continue to hone my data analytical and machine learning skills.

I am diving into interesting data sets with wild abandon. I have competed as a finalist in the PacX challenge, where Marine Scientists were asked to propose what they would do with a data set collected by autonomous robots that traversed the Pacific ocean.

And I also plan on checking out the awesome Kaggle competitions (although they don't allow for creativity in formulating research questions, they provide the opportunity to work on some challenging data!).

The more I learn the more I want to learn more and more! I am loving learning new skills and applying them to real-world, challenging questions. And that is what this blog is all about. Stay tuned!

P.S. As of April 2014, I entered my first Kaggle competition, where I placed in the top 80 (out of 500+) on my first go!
