Machine Learning - Introduction
Time to start blogging again. I am starting to study Machine Learning, and I am going to try to write up my notes here as blog posts. This way, hopefully I can help someone, and others can tell me if they notice any mistakes in my understanding.
Machine Learning vs Data Mining:
Machine Learning is the subset of Artificial Intelligence that focuses on using data for self-learning. Note that AI itself doesn't need to involve learning at all.
Data Mining is digging into large amounts of data using ML techniques to discover patterns that were not immediately apparent. Unlike ML, DM is done by a person, using ML tools.
Attribute: Variable (e.g., mileage).
Feature: Variable + Value (e.g., mileage = 15K).
Label / Response: Dependent variable / attribute.
Predictor: Independent variable / attribute.
A) Supervised Learning
Learn a model from labeled training data. Use it to make predictions on unseen or future data. For example, train a spam filter with emails marked spam or not-spam, and then the filter will predict whether future emails are spam or not.
Classification: Assign data to discrete class labels. E.g., this email is spam vs not (binary classification) or this handwritten letter is 'X' as opposed to any other letter from A to Z (multi-class classification).
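A minimal sketch of binary classification, using k-Nearest Neighbors as the algorithm. The two features (number of links, number of ALL-CAPS words) and all the data points are made up for illustration; a real spam filter would use far richer features.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features per email: (number of links, number of ALL-CAPS words)
emails = [
    ((0, 1), "not-spam"),
    ((1, 0), "not-spam"),
    ((1, 2), "not-spam"),
    ((7, 9), "spam"),
    ((8, 6), "spam"),
    ((9, 8), "spam"),
]

prediction_a = knn_predict(emails, (8, 8))  # close to the spam examples
prediction_b = knn_predict(emails, (0, 0))  # close to the not-spam examples
print(prediction_a, prediction_b)
```

The "model" here is just the stored training data; prediction is a vote among the closest labeled examples.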
Regression: The response is a continuous value. Given a number of predictor variables and a continuous response variable, try to estimate the relationship between them so that responses can be predicted in the future. E.g., given H hours of studying, your SAT score will be S. Note: Regression can be used for classification as well. E.g., the response can be a value that corresponds to the probability of belonging to a given class (like, 20% chance of being spam).
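A small sketch of the hours-vs-score example, fitting a straight line with ordinary least squares. The numbers are invented and deliberately lie exactly on a line so the result is easy to check by hand; real data would be noisy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: hours studied (predictor) and score (response)
hours = [1, 2, 3, 4, 5]
scores = [1000, 1050, 1100, 1150, 1200]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # predicted score for 6 hours of study
print(slope, intercept, predicted)
```

With this data the fitted line is S = 50 * H + 950, so six hours of study predicts a score of 1250.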
Example Supervised Learning Algorithms: k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees and Random Forests, Neural Networks.
B) Unsupervised Learning
Deal with data that has unknown structure or no labels, and extract meaningful information without any known outcomes or reward signals.
Clustering: Organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of the nature or types of these clusters. Clustering is also called "Unsupervised Classification." E.g., marketers may want to cluster customers into subgroups based on their behavior without knowing exactly what makes those customers similar. If you use a Hierarchical Clustering algorithm, it may also subdivide clusters into sub-clusters.
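As an illustration of clustering, here is a rough one-dimensional k-means sketch. The choice of k-means, the customer spend values, and the starting centers are all assumptions made for the example; the notes above do not prescribe a particular algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Made-up monthly spend per customer; no labels are given to the algorithm
monthly_spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(monthly_spend, centers=[0.0, 50.0])
print(centers, clusters)
```

The algorithm is never told that there are "low spenders" and "high spenders"; the two groups emerge from the distances alone.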
Dimensionality Reduction: Compressing data into a lower-dimensional subspace (i.e., removing features / dimensions) by reducing noise, without any prior knowledge of which features / dimensions can be removed. A lower-dimensional space can also be achieved when a few dimensions are highly correlated (e.g., a car's mileage and its age). Dimensionality Reduction can be the first step that prepares data for another ML step (e.g., supervised learning).
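The correlated-features case can be sketched as follows. This is the simplest possible form of dimensionality reduction (dropping one of two highly correlated columns), not a full technique such as PCA. The car data and the 0.95 cutoff are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up rows: (age in years, mileage in thousands)
cars = [(1, 12), (2, 25), (3, 35), (4, 49), (5, 60)]
ages = [row[0] for row in cars]
mileages = [row[1] for row in cars]
r = pearson(ages, mileages)

THRESHOLD = 0.95  # hypothetical cutoff for "highly correlated"
if abs(r) > THRESHOLD:
    reduced = [(age,) for age, _ in cars]  # keep age, drop mileage
else:
    reduced = cars
print(round(r, 3), reduced)
```

Because age and mileage move together almost perfectly here, keeping both adds a dimension without adding much information.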
Anomaly Detection: Finding data points that deviate markedly from the rest. E.g., spotting unusual credit card transactions, catching manufacturing defects, etc. Anomaly Detection can also be the first ML step that prepares data for another ML step (e.g., supervised learning) by removing outliers from the data.
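One simple way to flag anomalies is a z-score rule: mark any value that lies more than some number of standard deviations from the mean. This is only a sketch; the transaction amounts and the threshold of 2.0 are assumptions, and real systems typically use more robust methods.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold std devs."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]

# Made-up credit card transaction amounts
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]

outliers = find_outliers(amounts)
cleaned = [v for v in amounts if v not in outliers]  # data passed to the next ML step
print(outliers, cleaned)
```

The `cleaned` list is what a later step (e.g., supervised learning) would train on once the outliers are removed.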
Association Rule: Discover relations between attributes. E.g., sales logs show that people who buy ketchup also buy potato chips.
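Association rules are commonly scored with two numbers: support (how often the items appear together) and confidence (how often the consequent appears given the antecedent). A minimal sketch with invented shopping baskets:

```python
# Made-up sales log: each basket is the set of items in one purchase
baskets = [
    {"ketchup", "chips", "soda"},
    {"ketchup", "chips"},
    {"ketchup", "bread"},
    {"chips", "soda"},
    {"ketchup", "chips", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"ketchup", "chips"})
rule_confidence = confidence({"ketchup"}, {"chips"})
print(rule_support, rule_confidence)
```

Here ketchup and chips appear together in 3 of 5 baskets, and 3 of the 4 ketchup baskets also contain chips, so the rule "ketchup implies chips" has support of about 0.6 and confidence of about 0.75.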
C) Reinforcement Learning
Develop a system (agent) that improves its performance based on interactions with its environment. The environment typically provides a reward signal. The agent learns a series of actions that maximizes this reward through a mix of trial and error and deliberate planning (by the programmer). E.g., learning to play chess by treating the board (and game rules?) as the environment, and winning or losing the game as the reward signal.
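Chess is far too large for a short example, so here is a toy sketch instead: an agent in a five-cell corridor that is rewarded only for reaching the last cell. It uses tabular Q-learning with purely random exploration, which is one of many possible approaches. The corridor, the learning rate, and the discount factor are all assumptions made for illustration.

```python
import random

N_STATES = 5            # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)      # step left or step right
GAMMA, ALPHA = 0.9, 0.5 # discount factor and learning rate (hypothetical values)

def step(state, action):
    """Environment: move within the corridor, reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # trial and error: explore at random
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: nudge the estimate toward reward + discounted future value
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train()
# Greedy policy: in each non-goal cell, pick the action with the highest learned value
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy should be to step right (+1) in every cell, even though the agent was never told where the goal is; it only ever saw the reward signal.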
Matrix item: To represent data in a Matrix, I will use M(i,j) to indicate an individual item in the matrix where i represents the training sample (think row) and j represents feature / dimension (think column).
Vector: A single row (one training sample) or a single column (one feature) of the matrix is called a vector (a row vector or a column vector, respectively).
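The matrix and vector notation maps onto nested lists like this. The feature names and values are made up, and note that Python indexes from 0, so M(i,j) here means the (i+1)-th sample and (j+1)-th feature.

```python
# Columns (features): mileage, age. Rows: one training sample each (made-up values).
feature_names = ["mileage", "age"]
M = [
    [15000, 2],
    [42000, 5],
    [88000, 9],
]

i, j = 1, 0
item = M[i][j]                          # M(i,j): sample i, feature j
row_vector = M[i]                       # all features of training sample i
column_vector = [row[j] for row in M]   # feature j across all training samples
print(item, row_vector, column_vector)
```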
Preprocessing: Preprocessing is the first step of machine learning. The goal is to give the raw data the shape and form that makes a learning algorithm most effective. Preprocessing may consist of randomization (shuffle rows to remove any ordering bias), feature extraction (find meaningful features), feature scaling (rescale each feature to the range 0 to 1, or standardize it to zero mean and unit variance), dimensionality reduction (remove irrelevant or highly correlated features), and division (randomly divide data into training and test sets).
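Three of these preprocessing steps (randomization, feature scaling, and division) can be sketched with the standard library alone. The function names, the 25% test fraction, and the data are assumptions for illustration.

```python
import random
import statistics

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def shuffle_and_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows, then divide it into (training set, test set)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

mileage = [10, 20, 30, 40, 50]  # made-up feature values, in thousands
scaled = min_max_scale(mileage)
standardized = standardize(mileage)
train_rows, test_rows = shuffle_and_split(list(range(8)))
print(scaled, train_rows, test_rows)
```

Standardizing changes the mean and spread of a feature but not the shape of its distribution, so a skewed feature stays skewed after scaling.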
The best ML algorithm: There's no single all-powerful ML algorithm. Different algorithms have different biases and assumptions. So, one should pick all relevant algorithms, train them on the training set, evaluate them on the test set, and pick the one that performs best for that kind of data. Note: Even within an algorithm, there are configurable parameters (think knobs, usually called hyperparameters) that can be tuned to improve effectiveness further.
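The "try several and keep the best" idea can be sketched as below. The two candidate algorithms (a majority-class baseline and a 1-nearest-neighbor classifier) and the tiny one-feature dataset are assumptions chosen to keep the example short.

```python
from collections import Counter

def majority_baseline(train):
    """Always predict the most frequent training label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: most_common

def one_nearest_neighbor(train):
    """Predict the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(model, test):
    """Fraction of test samples the model labels correctly."""
    return sum(model(x) == label for x, label in test) / len(test)

# Made-up one-feature data: (value, label)
train_set = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b")]
test_set = [(2.5, "a"), (7.5, "b"), (9, "b"), (0, "a")]

candidates = {
    "majority baseline": majority_baseline,
    "1-nearest neighbor": one_nearest_neighbor,
}
# Train each candidate on the training set, then score it on the held-out test set
results = {name: accuracy(fit(train_set), test_set) for name, fit in candidates.items()}
best = max(results, key=results.get)
print(results, best)
```

On this data the baseline gets half the test samples right and the nearest-neighbor model gets all of them, so the latter is selected. The same loop could iterate over different knob settings of a single algorithm instead of over different algorithms.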