How to Compute With Data You Can't See
http://spectrum.ieee.org/computing/software/how-to-compute-with-data-you-cant-see
@richardadenman
In Data Mining, "What are we trying to learn?"
What we are trying to find, the result of the learning process, is a description that is intelligible, in that it can be understood, discussed and disputed, and operational, in that it can be applied to actual examples.
What are the distinctions?
Four basically different styles of learning appear in data mining applications.
In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples. In association learning, any association among features is sought, not just ones that predict a particular class value. In clustering, groups of examples that belong together are sought. In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity.
Classifying Your Data
A classification problem is sometimes called a supervised learning problem: supervised because you get to know the class values of the training instances. We take as input a data set of classified examples; these are independent examples, each with a class value attached.
The idea is to automatically produce some kind of model that can classify new examples. That's a "classification" problem. An "instance" has a fixed set of features, each with its own attribute value; we add the class to the instance to get the classified example.
That's what we have to have in our training dataset. These attributes, or features, can be discrete or continuous. Discrete values that belong to a certain fixed set are called nominal attribute values; the alternative is numeric, or continuous, values. The class can also be discrete or continuous. A machine learning problem with a continuous class, where you're trying to predict a number, is called a "regression" problem in the trade.
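As a rough illustration of the vocabulary above, here is a minimal Python sketch of a classified example. The names `ClassifiedExample` and `problem_kind` are hypothetical, and representing nominal values as strings and numeric values as floats is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Assumption for this sketch: nominal values are strings, numeric values are floats.
Value = Union[str, float]


@dataclass(frozen=True)
class ClassifiedExample:
    """An instance (a fixed set of attribute values) plus its class value."""
    attributes: Dict[str, Value]
    class_value: Value


def problem_kind(examples: List[ClassifiedExample]) -> str:
    """Return 'classification' for nominal classes, 'regression' for numeric ones."""
    if all(isinstance(e.class_value, str) for e in examples):
        return "classification"
    if all(isinstance(e.class_value, (int, float)) and not isinstance(e.class_value, bool)
           for e in examples):
        return "regression"
    raise ValueError("class values mix nominal and numeric types")


# Made-up toy data, not taken from any real dataset.
weather = [
    ClassifiedExample({"outlook": "sunny", "temperature": 29.0}, "no"),
    ClassifiedExample({"outlook": "rainy", "temperature": 18.5}, "yes"),
]
print(problem_kind(weather))
```

The same structure holds a regression example if `class_value` is a number instead of a label.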
It's always a good idea to check for reasonableness when you're looking at datasets. It's really important to get down and dirty with your data and make sure things look real.
The Data Mining Process - Revisited
This might be your vision of the data mining process:
You've got some data or someone gives you some data. You've got some nice machine learning software. You apply your machine learning software to the data, you get some kind of cool result from that, and everyone's happy.
It's not going to be like that at all.
Really, this would be a better way to think about it. You're going to have a circle; you're going to go round and round the circle. It's true that your software is important -- it's in the very middle of the circle. It's going to be crucial, but it's only a small part of what you have to do.
Perhaps the biggest problem is going to be to ask the right kind of question. You need to be answering a question, not just vaguely exploring a collection of data. Then, you need to get together the data that you can get hold of that gives you a chance of answering this question using data mining techniques. It's hard to collect the data. You're probably going to have an initial dataset, but you might need to add some data, some data about other stuff. You're going to have to go to the web and find more information to augment your dataset. Then you'll merge all that together: do some database hacking to get a dataset that contains all the attributes that you think you might need.
Then you're going to have to clean the data. The bad news is that real world data is always very messy. That's a long and painstaking process of looking around, looking at the data, trying to understand it, trying to figure out what the anomalies are and whether it's good to delete them or not. That's going to take a while. Then you're going to need to define some new features, probably. This is the feature engineering process, and it's the key to successful data mining. Then, finally, you're going to use your data mining software, of course. You might go around this circle a few times to get a nice algorithm for classification, and then you're going to need to deploy the algorithm in the real world.
Each of these processes is difficult. You need to think about the question that you want to answer. "Tell me something cool about this data" is not a good enough question. You need to know what you want to know from the data. Then you need to gather it.
They say that more data beats a clever algorithm. So rather than spending time trying to optimize the exact algorithm you're going to use, your time might be better employed getting more and more data. Then you've got to clean it, and real data is very mucky. That's going to be a painstaking matter of looking through it and looking for anomalies.
Features or Algorithms?
Which is more important: engineering features or choosing the absolute best algorithm?
In most Kaggle competitions, participants talk about selecting what they call golden features, and how that matters more than selecting the classification algorithm.
In practice, the success of a machine learning algorithm depends heavily on how you present the data. These golden features can be extracted in two ways: by a human expert, or by automated feature extraction methods such as PCA (principal component analysis) or deep learning tools such as DBNs (deep belief networks). The two approaches can also be stacked on top of each other. To evaluate the goodness of each feature, there are criteria such as the Gini index, information gain, likelihood ratio, etc.
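To make two of the criteria named above concrete, here is a small sketch that scores a nominal feature by information gain and computes the Gini index of a set of labels. The function names and the toy data are assumptions for illustration; real toolkits offer more complete versions of these measures.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini index (impurity) of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up toy data: feature_a separates the classes perfectly, feature_b not at all.
labels = ["yes", "yes", "no", "no"]
feature_a = ["x", "x", "y", "y"]
feature_b = ["p", "q", "p", "q"]

print(information_gain(feature_a, labels))
print(information_gain(feature_b, labels))
print(gini(labels))
```

A feature with higher information gain tells you more about the class, which is one way to rank candidate features before any algorithm is chosen.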
When your goal is to get the best possible results from a predictive model, you need to get the most from what you have. This includes getting the best results from the algorithms you are using. It also involves getting the most out of the data for your algorithms to work with.
How do you get the most out of your data for predictive modeling?
This is the problem that the process and practice of feature engineering solves.
How credible is your learned algorithm?
In numeric prediction, errors are not simply present or absent as in classification; they come in different sizes. Several measures can be used to evaluate the success of numeric prediction on the test data. Mean-squared error is the principal and most commonly used measure; sometimes the square root is taken to give it the same dimensions as the predicted value itself. Mean-squared error tends to exaggerate the effect of outliers. Mean absolute error is an alternative: just average the magnitude of the individual errors without taking account of their sign. Relative squared error takes the total squared error and divides it by what it would have been if a simple predictor had been used, where the simple predictor is just the average of the actual values from the training data. Relative absolute error is the total absolute error with the same kind of normalization. The correlation coefficient measures the statistical correlation between the actual values (the a's) and the predicted values (the p's); it ranges from 1 for perfectly correlated results, through 0 when there is no correlation, to -1 when the results are perfectly correlated negatively. Which measure is appropriate? That can only be determined by studying the application, and it is not easy.
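The measures described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `error_measures`, the dictionary keys, and the toy numbers are all assumptions, and `training_mean` stands in for the simple predictor (the average of the actual values from the training data).

```python
import math


def error_measures(actual, predicted, training_mean):
    """Compute common numeric-prediction error measures on test data."""
    n = len(actual)
    errors = [p - a for p, a in zip(predicted, actual)]
    sum_sq = sum(e * e for e in errors)
    sum_abs = sum(abs(e) for e in errors)

    # Errors the simple predictor (always predict the training mean) would make.
    base_sq = sum((a - training_mean) ** 2 for a in actual)
    base_abs = sum(abs(a - training_mean) for a in actual)

    # Pearson correlation between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "mse": sum_sq / n,
        "rmse": math.sqrt(sum_sq / n),
        "mae": sum_abs / n,
        "relative_squared_error": sum_sq / base_sq,
        "relative_absolute_error": sum_abs / base_abs,
        "correlation": cov / math.sqrt(var_a * var_p),
    }


# Made-up numbers for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 5.0]
for name, value in error_measures(actual, predicted, training_mean=2.5).items():
    print(f"{name}: {value:.4f}")
```

A relative error below 1 means the model beats the simple predictor; above 1 means it does worse.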
ROC Curves - Two Class Classifiers
We are interested in the "true positive rate", which is the accuracy on class "a" (the number of true positives divided by the total size of class "a"), and the "false positive rate", which is the number of false positives divided by the total number of negative instances. That's 1 minus the accuracy on class "b".
There's a tradeoff between these things. You can trade off the accuracy on class "a" against the accuracy on class "b". You can get better accuracy on class "a" at the expense of accuracy on class "b", and vice versa.
Plot these points on a graph: the accuracy on class "a" (the true positive rate) on the vertical axis against 1 minus the accuracy on class "b" (the false positive rate) on the horizontal axis. The top left-hand corner corresponds to perfect accuracy on class "a" and perfect accuracy on class "b", so curves that push up toward that corner are better. That is where you want to be.
One way of evaluating the overall merit of a particular classifier is to look at the area under the curve. If that area is large, then we're going to get a better classifier evaluated across all the different possible tradeoffs, the different thresholds. The area under the curve is a way of measuring classifier accuracy independent of the particular tradeoff that you happen to choose.
Look at threshold curves that plot the accuracy on one class against the accuracy on the other class and that depict the tradeoff between these two things. ROC curves plot the true positive rate against the false positive rate. They go from the lower left to the upper right, and good ones stretch up towards the top left corner. In fact, a diagonal line corresponds to a random decision, so a useful classifier shouldn't fall below the diagonal line.
The area under the curve is a measure of the overall quality of a classifier. It turns out that it's equal to the probability that the classifier ranks a randomly chosen positive test instance above a randomly chosen negative one.
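The equivalence just stated can be checked numerically. The sketch below builds ROC points by lowering a threshold through the classifier's scores, integrates the area with the trapezoid rule, and compares it with the fraction of positive/negative pairs that are ranked correctly. The function names and the toy scores are assumptions for illustration; labels are taken to be 1 for the positive class and 0 for the negative class.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Instances with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_by_ranking(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores: higher means "more likely positive".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc_trapezoid(roc_points(scores, labels)))
print(auc_by_ranking(scores, labels))
```

Both numbers agree on this toy data, which is the point of the equivalence: the area summarizes how well the classifier ranks instances, independent of any one threshold.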
A Basic Classifier Model
A basic classifier algorithm learns what you might call a "one-level decision tree": a tree that branches only at the root node, depending on the value of one particular attribute. Equivalently, it learns a set of rules that all test the value of that one attribute.
First choose an attribute, then make one branch for each possible value of that attribute. Each branch assigns the most frequent class among the instances that come down that branch. The error rate is the proportion of instances that don't belong to the majority class of their corresponding branch. Choose the attribute with the smallest error rate.
The algorithm runs as follows. For each attribute, make a set of rules: for each value of the attribute, count how often each class appears, find the most frequent class, and make the rule assign that class to this attribute-value combination. Then calculate the error rate of this attribute's rules. Repeat that for each attribute in the dataset, and choose the attribute with the smallest error rate.
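The procedure above fits in a short Python function. This is a minimal sketch for nominal attributes only; the function name `one_level_rules`, the dictionary-per-row representation, and the toy weather data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def one_level_rules(rows, class_key):
    """Return (attribute, rules, error_rate) for the attribute whose rules err least.

    rows is a list of dicts mapping attribute names (and class_key) to nominal values.
    """
    best = None
    attributes = [key for key in rows[0] if key != class_key]
    for attribute in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attribute]][row[class_key]] += 1
        # Each value's rule predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        error_rate = errors / len(rows)
        if best is None or error_rate < best[2]:
            best = (attribute, rules, error_rate)
    return best


# Made-up toy data, not a real dataset.
rows = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
attribute, rules, error_rate = one_level_rules(rows, class_key="play")
print(attribute, rules, error_rate)
```

On this toy data the rules on "outlook" misclassify one instance out of six, while the rules on "windy" misclassify two, so "outlook" is chosen.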
How can such a simple method work so well? Some datasets really are simple, and others are so small, noisy, or complex that you can't learn anything from them. So it's always worth trying the simplest things first.
Visualize your Classifier Boundaries
It may be helpful to visualize your classifier boundaries to become more familiar with how things are working inside the instance space and with your algorithms. Plot the training data with its classes, together with the decision boundary that the classifier scheme creates.
Classifiers create boundaries in instance space and different classifiers have different capabilities for carving up instance space. That's called the "bias" of the classifier -- the way in which it's capable of carving up the instance space.
This kind of visualization is restricted to numeric attributes and 2-dimensional plots, so it's not a very general tool, but it certainly helps you think about different classifiers.
How much Test Data do you need?
How much test data do you need for best performance?
General rules: if you have a large separate test set, use the test set. If you have lots of data, use holdout. Otherwise, the best way to get a reliable performance estimate out of a limited amount of data is to use 10-fold cross-validation and repeat it.
What is "a lot"? There is no single answer. It depends on the number of attributes and on the structure of the domain. Are you looking for complicated decision boundaries? It depends on the kind of model and the sort of decision boundaries it makes. A machine learning technique that looks for linear decision boundaries is pretty simple, so you might not need as much data as you would for techniques that look for more convoluted nonlinear boundaries, or for decision trees, perhaps.
The only way to settle it is empirically, using learning curves. Generally, as the size of the training data increases, the performance gets better and better, but of course it levels off toward an asymptote. The point where it starts to level off is probably enough training data to get a reliable estimate.
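For the limited-data case, repeated 10-fold cross-validation can be sketched as follows. Everything named here is an assumption for illustration: the helper names, the learner interface (a function that takes training instances and labels and returns a predict function), and the majority-class baseline used as a stand-in learner. The folds are plain random splits; stratified folds are a common refinement that this sketch leaves out.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_class_learner(train_instances, train_labels):
    """Baseline learner: always predict the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority


def cross_validate(instances, labels, learner, k=10, seed=0):
    """Accuracy over all instances, each predicted by a model that never saw it."""
    correct = 0
    for fold in k_fold_indices(len(labels), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        model = learner([instances[i] for i in train], [labels[i] for i in train])
        correct += sum(model(instances[i]) == labels[i] for i in fold)
    return correct / len(labels)


def repeated_cross_validation(instances, labels, learner, k=10, repeats=10):
    """Average the cross-validation estimate over several different shuffles."""
    estimates = [cross_validate(instances, labels, learner, k, seed)
                 for seed in range(repeats)]
    return sum(estimates) / repeats


# Made-up data: 70 instances of class "a" and 30 of class "b".
instances = list(range(100))
labels = ["a"] * 70 + ["b"] * 30
print(repeated_cross_validation(instances, labels, majority_class_learner))
```

Running the same function on growing subsets of the data and plotting the estimates gives the learning curve mentioned above.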
Data Mining - The Nuggets
There is no magic in Data Mining. There is a huge array of techniques and they are all straightforward algorithms. There is no single universal best method. Data Mining is an experimental science. You should discover what works best with your problem. You should not blindly click around in your application and view the results. Understand what you are doing and evaluate the significance of the results. Spend more time looking at the features and how the problem is described and the operational context that you are working in. Don't stress over getting the absolute best algorithm. Use your time wisely!
Integration of Data
Integrate Data into an overall Data Warehouse
-To avoid information islands and manual processes.
-To avoid overloading source systems with reporting and analysis requirements.
-To integrate data from many different source systems.
-To create a historical data foundation for data that can be changed or removed in the source systems.
-To aggregate performance and data.
-To add new business terms, rules and logic to data.
-To establish central reporting and analysis.
-To hold documentation of metadata centrally.
-To secure scalability.
-To ensure consistency and valid data definitions.
Personal Data - A New Class
-Personal data is the new economic asset class: our own personal data and its economic importance.
-Once, we learned from data and the internet revolution (e.g., Wikipedia), but now we have a new economic class.
-Trust, definition of boundaries, and understanding between individuals, governments and the private sector are vital to take full advantage of this new economic asset.
Data Science and Wisdom
-Wisdom is the value attached to knowledge.
-"Knowledge speaks, but wisdom listens." - Jimi Hendrix
-This is very important.
-Ponder this as you make your way in your Data Science practice.
How Much Data Do You Need?
-If you have a large, separate test set, then use the test set. If you have lots of data, use the holdout method. If you have a limited amount of data, use 10-fold cross-validation and repeat.
-What is a "lot" of data? It depends.
-It depends on the number of attributes.
-It depends on the structure of the domain, i.e. complicated decision boundaries, linear decision boundaries, etc.
-Look at the learning curve: as the size of the training data increases, the performance gets better, but it levels off. The point at which it levels off is probably enough training data to get a reliable estimate.
Neural Networks and the Perceptron
-It is not necessary to perform probability estimation if the purpose is only to predict class labels.
-Instead, learn a hyperplane that separates the instances according to their classes.
-If the data fall into two classes that a hyperplane can separate, the separation is linear.
-A Perceptron algorithm can be used to determine the hyperplane. The Perceptron can be represented as a graph with nodes and weighted edges: a network of neurons.
-When an instance is presented to the algorithm, its attribute values activate the input layer, are multiplied by the weights, and are summed up at the output node. If the weighted sum is greater than 0, the output is 1, representing the first class; otherwise it is -1, representing the second class.
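The prediction step described in the last bullet, together with a training loop, can be sketched as follows. Two things here are assumptions beyond the notes above: the bias term (a weight on a constant input of 1) and the update rule, which is the standard Perceptron rule of adding the misclassified instance, scaled by its class, to the weights. The function names and toy points are also made up for illustration.

```python
def predict(weights, bias, instance):
    """Output 1 if the weighted sum is greater than 0, otherwise -1."""
    total = bias + sum(w * x for w, x in zip(weights, instance))
    return 1 if total > 0 else -1


def train_perceptron(instances, labels, max_epochs=100):
    """Learn a separating hyperplane for labels in {1, -1}.

    Stops early once an entire pass makes no mistakes, which is only
    guaranteed to happen when the data are linearly separable.
    """
    weights = [0.0] * len(instances[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for instance, label in zip(instances, labels):
            if predict(weights, bias, instance) != label:
                # Standard Perceptron update: move the hyperplane toward the instance.
                weights = [w + label * x for w, x in zip(weights, instance)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


# Made-up, linearly separable toy points.
instances = [(2.0, 3.0), (3.0, 3.0), (3.0, 4.0), (-1.0, -1.0), (-2.0, -1.0), (0.0, -2.0)]
labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_perceptron(instances, labels)
print(weights, bias)
print([predict(weights, bias, x) for x in instances])
```

If the two classes cannot be separated by a hyperplane, the loop simply runs for `max_epochs` passes and returns the last weights, so the result should be checked against held-out data.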
The Data Analytics Center of Excellence
The DACE is a forum that includes analytical, business and IT competencies. This combination ensures that Data Analytics has the necessary impact on the organization. The most limiting factors in the successful creation and execution of a Data Analytics strategy are barriers related to competencies and organizational structure.
The Center of Excellence is a problem solving forum that maximizes the revenue flow from initiatives and makes Data Analytics a business process rather than an IT process. It ensures that the business needs drive all technical initiatives, realizes the full potential of Data Analytics and ensures that the analytical competencies are present and accessible.