Goal: I'm going to restate this a lot, because I think it's important to keep at the forefront of your mind when interpreting the data. My goal is to take image data sets and trim the number of images so that humans only need to analyze a fraction of the whole set when searching for anomalies among the images. This saves time and money and, most importantly, it can potentially save lives.
A word about machine learning and linear discriminant analysis:
Below (previous post) you'll see a blue and red line graph of two normal-looking curves. Those lines represent the frequency (how often something occurs) of the values that my algorithm has given to images of satellite data. You can think of these values as the "probability this image contains an anomaly", and how those values are reached is explained below (hint: it uses a training set to teach my computer, through machine learning, which images contain anomalies). Because these curves are roughly normal, we can run a great deal of analysis on them, particularly linear discriminant analysis.
Here's the idea: I choose a bunch of images and decide for myself whether each one contains an anomaly. I mark "does contain" with a 1 and "doesn't contain" with a 0 (I literally created a hits-and-misses array of 1's and 0's for this). Then I let my algorithms assign values to these images based on the criteria in the previous post. Now I have two numbers assigned to each image: a 1 or 0 representing a yes or no (respectively), and some value the computer has chosen. (These values range from approximately 2.71 to 49.12. That's not really relevant, but it clarifies that they are not discrete values; they are real-number representations of the probability an image contains an anomaly.) Then I do what is called linear discriminant analysis on those values.
Linear discriminant analysis looks at those two values, the 1 or 0 and the real-number representation of the probability an image contains an anomaly, and builds a function. To get a feel for what it does, scroll down to the blue and red line graph from the previous post and imagine a vertical line drawn somewhere in the region where the two curves overlap. Depending on the location of this line, all satellite images whose values land to its right are flagged as a hit, and all images whose values land to its left are marked as a miss.
Now we have an LDA (linear discriminant analysis) function. So instead of finding new data, I run my test set back through and have the LDA function predict which images in that set contain an anomaly. With those predictions, you get the graphs above.
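To make the "vertical line" idea concrete, here is a minimal sketch of two-class LDA on a single feature, written in plain Python rather than the R tool used for the actual experiment. The function names and the toy scores are made up for illustration, and the sketch assumes that hits score higher than misses on average; it is not the code behind the graphs.

```python
import math

def fit_lda_threshold(scores, labels):
    """Fit two-class, one-feature LDA and return the cutoff (the vertical line).

    Assumes the mean score of the hits (label 1) is higher than the mean
    score of the misses (label 0).
    """
    hits = [s for s, y in zip(scores, labels) if y == 1]
    misses = [s for s, y in zip(scores, labels) if y == 0]
    mean_hit = sum(hits) / len(hits)
    mean_miss = sum(misses) / len(misses)

    # LDA assumes both classes share one variance, so pool the two groups.
    squared_dev = sum((s - mean_hit) ** 2 for s in hits)
    squared_dev += sum((s - mean_miss) ** 2 for s in misses)
    pooled_var = squared_dev / (len(scores) - 2)

    # Class priors are estimated from how common each label is.
    prior_hit = len(hits) / len(scores)
    prior_miss = len(misses) / len(scores)

    # The cutoff is the midpoint of the two means, shifted toward the rarer class.
    midpoint = (mean_hit + mean_miss) / 2
    shift = pooled_var * math.log(prior_miss / prior_hit) / (mean_hit - mean_miss)
    return midpoint + shift

def predict(scores, threshold):
    """Flag (1) every score to the right of the cutoff, otherwise 0."""
    return [1 if s > threshold else 0 for s in scores]

# Hypothetical scores and hand-made labels, NOT the real satellite data.
toy_scores = [0.5, 0.7, 0.9, 1.3, 1.0, 1.4, 1.5, 1.6]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_lda_threshold(toy_scores, toy_labels)
predictions = predict(toy_scores, threshold)
print(threshold, predictions)
```

Because the two toy classes overlap, one clean image lands to the right of the cutoff and one anomalous image lands to the left, which is exactly the kind of error the figures break down.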
Interpreting the graphs:
Figures 1 & 2 legend explanation:
False Alarm: Photo does not contain anomaly, but was flagged.
Complete Miss: Photo contains anomaly, but was not flagged.
Correct Hit: Photo contains anomaly, was flagged.
Correct Miss: Photo does not contain anomaly, was not flagged.
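The four legend categories above can be tallied by comparing my hand-made 1/0 labels against the flags the LDA function produces. The sketch below is one way that might look in Python; the function names and the toy arrays are hypothetical.

```python
from collections import Counter

def categorize(contains_anomaly, was_flagged):
    """Map one (truth, prediction) pair onto the legend used in the figures."""
    if contains_anomaly and was_flagged:
        return "Correct Hit"
    if contains_anomaly and not was_flagged:
        return "Complete Miss"
    if not contains_anomaly and was_flagged:
        return "False Alarm"
    return "Correct Miss"

def tally(labels, predictions):
    """Count how many images fall into each legend category."""
    return Counter(categorize(t, p) for t, p in zip(labels, predictions))

# Hypothetical labels and predictions, NOT the real experiment's arrays.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_predictions = [0, 0, 0, 1, 0, 1, 1, 1]

counts = tally(toy_labels, toy_predictions)
for category, count in counts.items():
    print(category, count)
```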
Figure 1: Here we see two prediction sets and their successes based on minor optimizations. Instead of running the basic probability values through my LDA tool in R, I chose to run the log base 10 of those values. This is because the "misses" data shows somewhat of an exponential distribution when we don't take the log of the values. Taking the log transforms the data to be more "normal" without sacrificing its predictive quality, and also allows me to do sweet things like linear discriminant analysis.
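As a rough illustration of why the log helps, the sketch below takes the log base 10 of a right-skewed list of values and compares the skewness before and after. Only the endpoints (2.71 and 49.12) come from the range mentioned earlier; every value in between is invented, and the skewness helper is my own stand-in for whatever check you prefer.

```python
import math

def skewness(values):
    """Population skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical raw scores with a long right tail, NOT the real data.
raw = [2.71, 3.0, 3.4, 4.1, 5.2, 7.0, 10.5, 18.0, 49.12]

# The transform itself: log base 10 of every score.
logged = [math.log10(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(raw_skew, log_skew)
```

The log is monotonic, so the ordering of the images by score is unchanged; only the spacing between scores is compressed at the high end.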
Figure 2: This is technically the most important part of the experiment. Getting the highlighted value (the Complete Misses) down to zero is extremely important, because those are images that contain something and didn't get flagged. This is a problem. The only images we want left unflagged are the ones that contain no anomalies, and my algorithms aren't there yet. I'll be working on this.
Figures 2 & 3: Pretty self-explanatory, just the percentage of successful finds and misses.
THE TAKEAWAY: So far, I have demonstrated that my algorithms can successfully trim an image data set from 114 images down to 46, with a success rate of about 88%. This is good, but we can do better.
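For a sense of scale, the arithmetic behind that trim can be worked out directly from the two image counts. The variable names below are just for illustration.

```python
total_images = 114   # size of the full image set
flagged_images = 46  # images left for a human to review

fraction_kept = flagged_images / total_images
fraction_removed = 1 - fraction_kept

print(f"Kept for review: {fraction_kept:.1%}")
print(f"Trimmed away:    {fraction_removed:.1%}")
```

In other words, a human reviewer would look at roughly two fifths of the original set.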








