Discover Top Posts Tagged with #debrisfinder

Goal: I'm going to restate this a lot, because I think it's important to keep at the forefront of your mind when interpreting the data. My goal is to take image data sets and trim the number of images so that humans only need to analyze a fraction of the whole set when searching for anomalies. This saves time and money, and most importantly, it can potentially save lives.

A word about machine learning and linear discriminant analysis:

Below (in the previous post) you'll see a blue and red line graph of two normal-looking curves. Those lines represent the frequency (how often something occurs) of the values my algorithm has assigned to satellite images. You can think of these values as the "probability this image contains an anomaly"; how those values are reached is explained below (hint: a training set teaches my computer, through machine learning, which images contain anomalies). Since these curves are approximately normal, we can run a great deal of analysis on them, particularly linear discriminant analysis.

Here's the idea: I choose a bunch of images and decide for myself whether each one contains an anomaly. I mark "does contain" with a 1 and "doesn't contain" with a 0 (I literally created a hits-and-misses array of 1s and 0s for this). Then I let my algorithms assign values to these images based on the criteria in the previous post. Now I have two numbers assigned to each image: a 1 or 0 representing yes or no (respectively), and a value the computer has chosen. (These values range from approximately 2.71 to 49.12. That's not really relevant, but it clarifies that these are not discrete values; they are real-number representations of the probability that an image contains an anomaly.) Then I do what is called linear discriminant analysis on those values.

Linear discriminant analysis looks at those two values, the 1 or 0 and the real-number representation of the probability that an image contains an anomaly, and builds a function. To get a feel for what it does, scroll down to the blue and red line graph from the previous post and imagine a vertical line drawn somewhere in the region where the two curves overlap. Depending on the location of this line, all satellite images that land to its right are flagged as a hit, and all images that land to its left are marked as a miss.
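To make the "vertical line" idea concrete, here is a minimal sketch of how a one-dimensional, two-class LDA cutoff can be computed. This is not my actual R code; the function name lda_threshold and the toy scores are made up for illustration, and it assumes both classes share one variance and that hits score higher than misses on average.

```python
import math
from statistics import mean


def lda_threshold(scores, labels):
    """Return the 1-D LDA cutoff between class 0 (miss) and class 1 (hit).

    Assumes both classes are roughly normal with a shared variance and
    that the mean score of the hits is larger than that of the misses.
    """
    hits = [s for s, y in zip(scores, labels) if y == 1]
    misses = [s for s, y in zip(scores, labels) if y == 0]
    mu1, mu0 = mean(hits), mean(misses)

    # Pooled within-class variance: LDA uses one variance for both classes.
    squared_dev = sum((s - mu1) ** 2 for s in hits)
    squared_dev += sum((s - mu0) ** 2 for s in misses)
    pooled_var = squared_dev / (len(scores) - 2)

    # Class priors estimated from how often each label occurs.
    p1 = len(hits) / len(scores)
    p0 = len(misses) / len(scores)

    # The score at which both classes are equally likely.
    return (mu0 + mu1) / 2 + pooled_var / (mu1 - mu0) * math.log(p0 / p1)


# Toy data (hypothetical, not my real image scores).
toy_scores = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
toy_labels = [0, 0, 0, 1, 1, 1]
cutoff = lda_threshold(toy_scores, toy_labels)

# Everything to the right of the line is flagged as a hit.
flagged = [s > cutoff for s in toy_scores]
```

With equal numbers of hits and misses the cutoff lands halfway between the two class means; unequal class sizes push it toward the rarer class.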

Now we have an LDA (linear discriminant analysis) function. Instead of finding new data, I run my test set back through it and have the LDA function predict which images in that set contain an anomaly. Those predictions give you the graphs above.

Interpreting the graphs:

Figures 1 & 2 legend explanation:

False Alarm: Photo does not contain anomaly, but was flagged.

Complete Miss: Photo contains anomaly, but was not flagged.

Correct Hit: Photo contains anomaly, was flagged.

Correct Miss: Photo does not contain anomaly, was not flagged.
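The four legend categories can be sketched as a small tally, assuming each image has a hand-marked label (1 or 0) and a flag from the algorithm (1 or 0). The function names outcome and tally and the example lists are hypothetical, chosen only for illustration.

```python
from collections import Counter


def outcome(has_anomaly, flagged):
    """Map one image's true label and predicted flag to a legend category."""
    if flagged:
        return "Correct Hit" if has_anomaly else "False Alarm"
    return "Complete Miss" if has_anomaly else "Correct Miss"


def tally(truth, predictions):
    """Count how many images fall into each of the four categories."""
    return Counter(outcome(t, p) for t, p in zip(truth, predictions))


# Made-up example: five images, hand labels vs. algorithm flags.
example_truth = [1, 1, 0, 0, 1]
example_flags = [1, 0, 1, 0, 1]
counts = tally(example_truth, example_flags)
```

The "Complete Miss" count is the one to watch, since those are the images containing something that nobody would look at.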

Figure 1: Here we see two prediction sets, and their successes based on minor optimizations. Instead of running the basic probability values through my LDA tool in R, I chose to run the log base 10 of those values. This is because the "misses" data shows a somewhat exponential distribution when we don't take the log of the values. Taking the log transforms the data to be more "normal" without sacrificing its predictive quality, and also allows me to do sweet things like linear discriminant analysis.
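Here is a small sketch of why the log helps, using made-up values rather than my real scores. The helper names log_scores and skewness are hypothetical; skewness here is the plain moment-based measure, where 0 means symmetric and a positive value means a long right tail.

```python
import math


def log_scores(scores):
    """Return log base 10 of each score; scores must be positive."""
    if any(s <= 0 for s in scores):
        raise ValueError("log10 is only defined for positive scores")
    return [math.log10(s) for s in scores]


def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, exponential-looking data with a long right tail.
raw = [1, 10, 100, 1000, 10000]
logged = log_scores(raw)

raw_skew = skewness(raw)        # strongly positive: long right tail
logged_skew = skewness(logged)  # symmetric after the transform
```

Because log10 is strictly increasing, the ordering of the images is unchanged; only the spacing between the values changes.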

Figure 2: This is technically the most important part of the experiment. Getting the highlighted value down to zero is extremely important, because those are images that contain something but didn't get flagged. This is a problem. The only images we want left unflagged are those that contain no anomalies, which my algorithms don't achieve yet. I'll be working on this.

Figures 2 & 3: Pretty self-explanatory, just the percentage of successful finds and misses.

THE TAKE AWAY: So far, I have demonstrated that my algorithms can successfully trim an image data set from 114 images down to 46, with a success rate of about 88%. This is good, but we can do better.
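As a quick bit of arithmetic on those two counts (and only those; this sketch makes no assumption about how the 88% figure is computed):

```python
total_images = 114  # size of the original image set (from this post)
kept_images = 46    # images left for humans to review (from this post)

fraction_kept = kept_images / total_images
fraction_removed = 1 - fraction_kept

print(f"Humans review {fraction_kept:.1%} of the set")    # 40.4%
print(f"Workload reduced by {fraction_removed:.1%}")      # 59.6%
```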

#DebrisFinder

This is my preliminary data for my airplane wreckage debris finder, and this is very good news. Let me explain my process:

First, my goal was to create an automated process for narrowing down the images that humans had to search after the Malaysian flight disappeared over the ocean. Many images were needlessly searched by humans, and I wanted to decrease the work required. The same idea is used in the real world by companies like Microsoft, which use similar tactics to narrow down the documents that must be sent to their legal team for discovery when they are being sued. If they can narrow down the documents their lawyers have to wade through, their legal costs go down. Very cool.

Next, I have a training set of data. I've collected images of the ocean from the internet, some showing anomalies, some showing nothing. I then mark the images I want flagged and the images I want the algorithm to ignore. In the graph above, the red line represents images I want ignored, and the blue line represents images I want the algorithm to keep.

Next, I run these images through a MATLAB script that turns each image into a black and white binary image, which is the image above that looks like outer space. The black and white image shows things that stand out (white) vs. the background (black). Next, I lower the threshold sensitivity so that it picks up only large anomalies. I divide the total area of the anomalies by the number of anomalies to get the value charted on the x-axis of my chart above. The larger that number gets, the higher the likelihood that the image contains an anomaly.
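The steps above can be sketched in plain Python. This is not my MATLAB script, just an illustration under stated assumptions: the image is a grid of grayscale brightness values, an "anomaly" is a group of white pixels touching edge to edge, and making the detector less sensitive is modeled as raising the brightness cutoff. The names binarize, blob_areas, and anomaly_score and the tiny example image are all hypothetical.

```python
def binarize(gray, threshold):
    """Turn a grayscale grid into 1 (stands out) / 0 (background)."""
    return [[1 if px > threshold else 0 for px in row] for row in gray]


def blob_areas(binary):
    """Return the pixel area of each 4-connected white region."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Flood fill from this pixel to measure one region.
            stack = [(r, c)]
            seen[r][c] = True
            area = 0
            while stack:
                y, x = stack.pop()
                area += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            areas.append(area)
    return areas


def anomaly_score(gray, threshold):
    """Total anomaly area divided by the number of anomalies."""
    areas = blob_areas(binarize(gray, threshold))
    return sum(areas) / len(areas) if areas else 0.0


# Tiny made-up "ocean" image: one bright 2x2 object and one dimmer speck.
ocean = [
    [10, 10, 10, 10, 10],
    [10, 200, 210, 10, 10],
    [10, 220, 205, 10, 120],
    [10, 10, 10, 10, 10],
]
sensitive_score = anomaly_score(ocean, threshold=100)  # object and speck
strict_score = anomaly_score(ocean, threshold=150)     # object only
```

At the sensitive setting the speck drags the average down to 2.5; once it drops out, the score rises to 4.0, which is why large, isolated objects push an image toward the "hit" side of the chart.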

Here is the reason a normal distribution is good: I can use linear discriminant analysis to find a cutoff value where everything above is a hit and everything below is a miss.

GREAT NEWS!
