Summary of the book Machine Learning Yearning – Andrew Ng
The book focuses on practical thinking when building machine learning models, stressing the importance of strategy, error analysis, and data quality.
It emphasizes that success starts with a simple model that is then improved incrementally.

Machine Learning with Andrew Ng from @stanford University. I recently finished the Machine Learning course from Stanford Online on @Coursera, taught by Andrew Ng (Professor, Stanford University). The course was very useful and practical for me, and I really appreciate the effort Andrew Ng and Stanford University put into making it simple to get a 360-degree view of what machine learning is, with various coding exercises as well. My friends, please take education seriously. #coursera #Stanford #andrewng Completing the very useful machine learning courses with Professor Andrew Ng and, at the end, receiving the machine learning certificate from Stanford University 😊😍✌ https://www.instagram.com/p/B_9duqdgzoy/?igshid=1bhyozc05oln3
AI is the new electricity tshirt available on our shop with free shipping. #AI #artificial #artificialinteligence #andrewNg #deeplearning #machinelearning #neuralNet #neuralnetworks #aishirt #aitshirt #tshirt #tshirtdesign #cottonshirts #fashion #computer #computerscience #datascience #love #lovetshirt #datasciencetshirt #datascienceshirt #tee #freetshirt #freeshipping #freeshippingtshirt #computersciencetshirt #cnn #lstm #gans https://www.instagram.com/p/B0E_SbQgPGE/?igshid=1nzkaeu1q2jcw
Andrew Ng: Deep Learning, Self-Taught Learning and Unsupervised Feature Learning Graduate Summer School: Deep Learning, Feature Learning "Deep Learning, Self-Taught Learning and Unsupervised Feature Learning (Part 1 Slides1-68; Part 2 Slides 69-109)" source
deeplearning - Improving Deep Neural Networks
EFFICIENT DATA SPLIT
A good practice is to split your entire data into 3 parts, namely:
Train set
Development set (also called hold-out cross validation set)
Test set
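As a rough illustration of the three-way split, here is a minimal sketch. The function name split_indices and the 80/10/10 fractions are assumptions made for this example; the notes do not prescribe specific proportions.

```python
import numpy as np

def split_indices(n_examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle example indices and cut them into train/dev/test parts.

    The fractions are illustrative assumptions, not values from the notes.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_dev = int(n_examples * dev_frac)
    n_test = int(n_examples * test_frac)
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]
    return train, dev, test

train_idx, dev_idx, test_idx = split_indices(1000)
```

Shuffling before cutting keeps the three sets drawn from the same distribution, which is what makes the dev and test errors meaningful estimates.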
BIAS VARIANCE TRADEOFF
Errors on the above sets give us an estimate of bias and variance. The optimal error is the usual benchmark against which the train-set error is compared to judge bias. In most cases, the optimal error is approximated by human-level error (for example, what the human eye achieves on an image task).
If the train error is comparable to the optimal error, the bias of the model is minimal. If the train error is close to the dev-set error, the model generalizes well and the variance is low.
Bias | Variance | Statistics | Model
High | Low | Underfitting | Weak
Low | High | Overfitting | Not generalized
Low | Low | Perfect | Best
High | High | Underfitting and overfitting | Worst
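The comparison described above can be sketched as two simple error gaps. The helper name diagnose and the error values are hypothetical, chosen only to illustrate a low-bias, high-variance case.

```python
def diagnose(optimal_error, train_error, dev_error):
    """Return (avoidable_bias, variance) as simple error gaps."""
    avoidable_bias = train_error - optimal_error  # gap to the benchmark
    variance = dev_error - train_error            # gap between train and dev
    return avoidable_bias, variance

# Hypothetical errors: 1% optimal, 2% train, 11% dev.
bias_gap, variance_gap = diagnose(optimal_error=0.01, train_error=0.02, dev_error=0.11)
```

Here the bias gap is small and the variance gap is large, which corresponds to the "Low bias, High variance" (overfitting) row of the table.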
INITIALIZATION
A well chosen initialization can:
Speed up the convergence of gradient descent
Increase the odds of gradient descent converging to a lower training (and generalization) error
For weights initialized with large random values, the cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets such an example wrong it incurs a very high loss for that example. Indeed, when log(a[3]) = log(0), the loss goes to infinity.
Bench-marking different initialization methods:
Model | Train accuracy | Problem/Comment
3-layer NN with zeros initialization | 50% | fails to break symmetry
3-layer NN with large random initialization | 83% | too large weights
3-layer NN with He initialization | 99% | recommended method
He initialization uses a scaling factor of sqrt(2./layers_dims[l-1]), while Xavier initialization uses a scaling factor of sqrt(1./layers_dims[l-1]).
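A minimal sketch of the two scaling factors, assuming layers_dims lists the layer sizes from input to output. The function name initialize_parameters and the dictionary keys are illustrative choices, not a fixed API.

```python
import numpy as np

def initialize_parameters(layers_dims, method="he", seed=0):
    """Scaled random initialization; method is "he" or "xavier"."""
    rng = np.random.default_rng(seed)
    numerator = 2.0 if method == "he" else 1.0  # Xavier uses 1.0
    params = {}
    for l in range(1, len(layers_dims)):
        # Scale by sqrt(numerator / size of the previous layer)
        scale = np.sqrt(numerator / layers_dims[l - 1])
        params[f"W{l}"] = rng.standard_normal((layers_dims[l], layers_dims[l - 1])) * scale
        params[f"b{l}"] = np.zeros((layers_dims[l], 1))  # biases can start at zero
    return params

params = initialize_parameters([2, 10, 5, 1], method="he")
```

Only the weights need random values to break symmetry; the biases are left at zero here.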
GRADIENT CHECKING
Steps to perform Gradient Checking:
Put all the parameters in one giant vector Θ, and put the derivatives of the cost with respect to all those parameters (as computed by backprop) in a vector dΘ.
For every i, approximate dΘ[i] numerically from the limit definition of the derivative (a two-sided difference with a small ε), then find the Euclidean distance between this approximation and the original dΘ computed above.
If the distance is very small (on the order of 10^-7), the gradient computation is very likely correct.
Use this only when debugging the code; it does not work with dropout.
http://ufldl.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization
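The steps above can be sketched as follows. Two assumptions are made here: the distance is normalized by the sum of the two gradient norms (a common convention, not stated in the notes), and the toy cost sum(θ²) stands in for a real network cost.

```python
import numpy as np

def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Compare an analytic gradient with a two-sided numerical estimate."""
    grad = grad_fn(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    # Normalized Euclidean distance (normalization is an assumption)
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator

cost = lambda t: np.sum(t ** 2)   # toy cost
good_grad = lambda t: 2 * t       # correct gradient
bad_grad = lambda t: 3 * t        # deliberately wrong gradient
theta0 = np.array([1.0, -2.0, 0.5])
good_diff = gradient_check(cost, good_grad, theta0)
bad_diff = gradient_check(cost, bad_grad, theta0)
```

A correct gradient gives a tiny difference, while the deliberately wrong one gives a difference many orders of magnitude larger, which is the signal to look for when debugging.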
REGULARIZATION
The value of λ is a hyperparameter that you can tune using a dev set.
L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to "oversmooth", resulting in a model with high bias.
What is L2-regularization actually doing?
L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values: large weights become too costly. This leads to a smoother model in which the output changes more slowly as the input changes. The weights end up smaller, which is why this is also called "weight decay".
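As a sketch of the penalty described above, the following assumes the common convention of scaling the sum of squared weights by λ/(2m), where m is the number of training examples. The helper names and the example weights are illustrative.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / (2m)) * sum of squared weights over all layers."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad_term(W, lambd, m):
    """Extra term the penalty adds to dW: (lambda / m) * W."""
    return (lambd / m) * W

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.7, m=10)
```

Because the extra gradient term is proportional to W itself, every update shrinks each weight slightly toward zero, which is where the name "weight decay" comes from.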
DROPOUT
The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of any one other specific neuron, because that other neuron might be shut down at any time. Apply dropout during both forward and backward propagation, and drop the same nodes in both passes within one iteration.
Common mistakes:
Adding dropout to input layer and output layer. You should use dropout only on the middle layers
Using dropout for both training and testing. You should use dropout (randomly eliminate nodes) only in training
Statistics:
Model | Train accuracy | Test accuracy
3-layer NN without regularization | 95% | 91.5%
3-layer NN with L2-regularization | 94% | 93%
3-layer NN with dropout | 93% | 95%
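A minimal sketch of dropout on one hidden layer's activations, reusing the same mask in the backward pass. Dividing by keep_prob ("inverted dropout") is an assumption added here so that the expected activation stays unchanged; the function names are illustrative.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Randomly shut down units; returns dropped activations and the mask."""
    mask = rng.random(A.shape) < keep_prob
    A_dropped = A * mask / keep_prob  # rescale to keep the expected value
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask (and scaling) to the incoming gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 5))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)
```

At test time neither function is called: the full network is used without any mask.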
MINI BATCH GRADIENT DESCENT
Taking each gradient step with respect to all m examples is called Batch Gradient Descent. In Mini-batch gradient descent, a single pass through the training set (i.e. one epoch) allows you to take (num_chunks) gradient descent steps.
Think of mini-batch gradient descent as a baby step of batch gradient descent.
With mini-batch gradient descent, the cost trends downwards but with some noise. This makes sense: the current parameters may fit mini-batch 1 well, while mini-batch 2 contains harder, contrasting examples.
Mini-batch size is a hyper-parameter, and its value gives three variants of gradient descent:
Batch size | Name | Advantages / Disadvantages
m | Batch Gradient Descent | Too long per iteration
1 | Stochastic Gradient Descent | Loses the speedup from vectorization; never converges exactly to the minimum
somewhere in between | Mini-batch Gradient Descent | Takes advantage of vectorization; not too long per iteration
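One way to build the mini-batches for an epoch is sketched below, assuming examples are stored as columns (X has shape (n_features, m)). The function name random_mini_batches and the batch size of 64 are illustrative.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then cut them into chunks."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
        for k in range(0, m, batch_size)
    ]

X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64)
```

With m = 150 and a batch size of 64 this gives three mini-batches (64, 64 and 22 examples), so one epoch takes three gradient steps instead of one.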
GRADIENT DESCENT WITH MOMENTUM
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent. This method almost always works better than plain gradient descent.
Compute an exponentially weighted moving average of the gradients, and use that average (instead of the raw gradient) in the weight update step.
Ideally, gradient descent would learn more slowly in the vertical direction (where it oscillates) and faster in the horizontal direction (toward the minimum).
(If β = 0, then it becomes standard gradient descent without momentum)
How do you choose β?
The larger the momentum β is, the smoother the update because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.
Common values for β range from 0.8 to 0.999. If you don't feel inclined to tune this, β=0.9 is often a reasonable default.
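The update can be sketched on a single scalar parameter. The toy cost (w - 5)^2 and the learning rate are assumptions made for the example.

```python
def momentum_step(w, dw, v, beta=0.9, learning_rate=0.01):
    """One gradient descent step using the moving average v of gradients."""
    v = beta * v + (1 - beta) * dw   # exponentially weighted average
    w = w - learning_rate * v        # update with the average, not raw dw
    return w, v

# Toy problem: minimize (w - 5)^2, whose gradient is 2 * (w - 5).
w, v = 0.0, 0.0
for _ in range(2000):
    dw = 2 * (w - 5.0)
    w, v = momentum_step(w, dw, v, beta=0.9, learning_rate=0.1)

# With beta = 0 the step reduces to plain gradient descent.
w_plain, _ = momentum_step(1.0, 2.0, 0.0, beta=0.0, learning_rate=0.1)
```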
RMSPROP
Suppose we have a contour plot of the cost, where a red dot marks the minimum cost. The zig-zag pattern is the path traversed by plain gradient descent. In that case, most learning is happening in the vertical direction and less learning is happening in the horizontal direction, which is not the ideal case as highlighted above.
Now, let parameter w be on the x-axis (horizontal) and b be on the y-axis (vertical).
RMSprop divides each update by the square root of a moving average of the squared gradient (Sdw for w, Sdb for b). In the horizontal direction, the slope is small, so dW is small, Sdw is small, and the weight update of W becomes relatively big. Therefore, you advance horizontally very fast. Similarly, in the vertical direction, the slope is large, so db is big, Sdb is big, and the weight update of b becomes relatively small. Therefore, the vertical direction gets damped out.
Combining RMSProp and Momentum results in Adam optimization algorithm.
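A minimal sketch of the RMSprop update on scalars. The default values for beta2, learning_rate and epsilon are commonly used ones and are assumptions here, as is the function name.

```python
import numpy as np

def rmsprop_step(param, grad, s, beta2=0.999, learning_rate=0.01, epsilon=1e-8):
    """Divide the gradient by the root of its squared moving average s."""
    s = beta2 * s + (1 - beta2) * grad ** 2
    param = param - learning_rate * grad / (np.sqrt(s) + epsilon)
    return param, s

# Shallow direction (small gradient) versus steep direction (large gradient)
w_new, s_w = rmsprop_step(0.0, grad=0.1, s=0.0)
b_new, s_b = rmsprop_step(0.0, grad=10.0, s=0.0)
```

Although the two gradients differ by a factor of 100, the resulting steps are nearly the same size: the small gradient is scaled up and the large one is scaled down, which is the damping effect described above.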
ADAM OPTIMIZATION ALGORITHM
Adam optimization algorithm = Gradient Descent with Momentum + RMSprop + bias correction + a small ε in the denominator to avoid division by zero
ADAM = Adaptive moment estimation
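Putting the pieces together, one Adam step can be sketched as below. The default hyperparameters are the commonly cited ones and the toy cost (w - 5)^2 is an assumption made for the example.

```python
import numpy as np

def adam_step(param, grad, v, s, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    v = beta1 * v + (1 - beta1) * grad        # momentum term
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSprop term
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    # epsilon keeps the denominator away from zero
    param = param - learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
    return param, v, s

# Toy problem: minimize (w - 5)^2, whose gradient is 2 * (w - 5).
w, v, s = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 5.0)
    w, v, s = adam_step(w, grad, v, s, t, learning_rate=0.01)
```

Thanks to bias correction, the very first step has a size close to the learning rate regardless of the gradient's magnitude, instead of being shrunk by the zero-initialized averages.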
LEARNING RATE DECAY
Usually, the learning rate should be high when model training starts, so that gradient descent can take quick steps. It should be low once gradient descent starts to converge; otherwise it will keep bouncing around the minimum.
One way to solve this issue is a method called learning rate decay, which decays the learning rate as the number of epochs increases. Below are some ways to do it:
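The original list of schedules did not survive, so the following is a sketch of three commonly used ones, offered as an assumption about what was meant. The function names, decay_rate, base and k are illustrative parameters.

```python
import math

def inverse_time_decay(alpha0, epoch, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch)"""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    """alpha = alpha0 * base ** epoch"""
    return alpha0 * base ** epoch

def sqrt_decay(alpha0, epoch, k=1.0):
    """alpha = alpha0 * k / sqrt(epoch), defined for epoch >= 1"""
    return alpha0 * k / math.sqrt(epoch)

# Learning rate for epochs 0..3 under inverse-time decay, starting at 0.2
schedule = [inverse_time_decay(0.2, e) for e in range(4)]
```

All three start near alpha0 and shrink as training progresses; they differ only in how quickly the rate falls.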
On the moon dataset, here are some statistics:
Optimization method | Accuracy | Cost shape
Gradient descent | 79.7% | oscillations
Momentum | 79.7% | oscillations
Adam | 94% | smoother
Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.
Adam, on the other hand, clearly outperforms mini-batch gradient descent and Momentum. The other two methods also reach good accuracy when trained for more epochs. However, Adam converges a lot faster.
TENSORFLOW
Code for finding the best parameters corresponding to lowest cost
tensorflow.py
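import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.train)

# Parameter to optimize, starting at 0
W = tf.Variable(0, dtype=tf.float32)
# cost = W^2 - 10W + 25 = (W - 5)^2, minimized at W = 5
cost = tf.add(tf.add(W**2, tf.multiply(-10., W)), 25)
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train)
    print(sess.run(W))  # This is the parameter after the cost function is optimized 1000 times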
A placeholder is an object whose value you can specify only later. To specify values for a placeholder, you can pass in values by using a "feed dictionary"
tensorflow_1.py
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

coefficients = np.array([[1.], [-20.], [25.]])
W = tf.Variable(0, dtype=tf.float32)
x = tf.placeholder(tf.float32, [3, 1])  # value supplied later via feed_dict
cost = x[0][0]*W**2 + x[1][0]*W + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        sess.run(train, feed_dict={x: coefficients})
    print(sess.run(W))
Writing and running programs in TensorFlow has the following steps:
Create Tensors (variables) that are not yet executed/evaluated.
Write operations between those Tensors.
Initialize your Tensors.
Create a Session.
Run the Session. This will run the operations you'd written above.
When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph.
A SOFTMAX layer generalizes SIGMOID to when there are more than two classes
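To make the relationship concrete, here is a small sketch showing that softmax over the two logits [z, 0] gives the same probability for the first class as sigmoid(z). The example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of logits."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = softmax(np.array([2.0, 1.0, 0.1]))   # three classes
two_class = softmax(np.array([1.5, 0.0]))    # two classes
```

The three-class output sums to 1, and in the two-class case the first entry matches sigmoid(1.5).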
HYPERPARAMETER TUNING
Random sampling works better than a grid search. Why?
Say HP1 is very important but HP2 is not. In grid search, after 25 iterations (a 5 x 5 grid) we would have checked only 5 distinct values of HP1, whereas in random search we would have checked 25 distinct values of HP1.
Coarse to fine search process
At the beginning of the run, start off with coarse values and find the area which has good accuracy. Next, zoom in to that region and sample hyper-parameters at finer resolution. In the later iterations, this method concentrates the search on the useful range of HPs.
Choose HPs on a log scale, not a linear scale. Say we are searching for the best learning rate (α). At the lower end of the scale, the results are very sensitive to the value. The search should therefore spend more of its samples in this high-sensitivity region than in the low-sensitivity region (higher values of α). A logarithmic scale samples more densely where α is small, which distributes the samples so as to explore the space of possible outcomes more efficiently.
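Sampling on a log scale can be sketched by drawing the exponent uniformly and then raising 10 to it. The range 10^-4 to 1 and the function name are assumptions chosen for illustration.

```python
import random

def sample_learning_rate(rng, low_exp=-4, high_exp=0):
    """Draw alpha = 10**r with r uniform in [low_exp, high_exp]."""
    r = rng.uniform(low_exp, high_exp)
    return 10 ** r

rng = random.Random(0)
alphas = [sample_learning_rate(rng) for _ in range(1000)]
# Fraction of samples that fall below 0.01
share_below_1e_2 = sum(a < 1e-2 for a in alphas) / len(alphas)
```

Roughly half of the samples land below 0.01, whereas uniform sampling on a linear scale over the same range would put only about 1% of them there.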
Two approaches for hyperparameter tuning in practice:
Panda approach: Computational resources are very low. Babysit one model, trying different HPs every day and tracking the reduction in error.
Caviar approach: Computational resources are huge. Run many models in parallel and pick the best one.
COVARIATE SHIFT
Most supervised machine learning techniques are built on the assumption that data at the training and production stages follow the same distribution.
Under covariate shift, the distribution of the inputs used as predictors (covariates, or queries) changes between the training and prediction stages, while the conditional distribution of outputs (answers) given the inputs is unchanged. This is normally due to changes in the state of latent variables, which could be temporal (even changes to the stationarity of a temporal process).
BATCH NORMALIZATION
Normalizing the input features X can help a neural network learn. Batch norm applies the same normalization process to the values deep in the hidden layers of a NN: it normalizes the mean and variance of a hidden layer's values (Z).
Steps to implement gradient descent with batch normalization:
Compute forward propagation on the input mini-batch. In each hidden layer, use batch normalization to convert Z to Z̃ (Z-tilde).
Use backprop to compute dW, dβ and dγ.
Update the parameters W, β and γ using their respective gradients.
The update can use any optimization algorithm, such as momentum, RMSProp or Adam.
Input normalization puts all features in the input X on the same scale. Say there are two input features in X, one ranging from 1 to 10 and the other from 1 to 1000. Normalizing the input puts both features on the same scale, so that the contours of the cost function look like concentric circles rather than elongated ellipses. This makes it easier to find the minimum. Batch normalization does a similar thing for the values in hidden units.
Batch normalization helps overcome covariate shift by making weights deeper in the neural network more robust to changes in weights earlier in the network. If the distribution of the input changes, the mean and variance of the hidden layers' values stay the same. This makes layers deep in the network more robust, since they see data with a similar mean and variance.
Batch normalization limits the extent to which updating the parameters in the earlier layers (and the input too) can affect the distribution of values that deeper layers see. It weakens the coupling between the earlier layers' parameters and the later layers' parameters, which lets each layer learn more independently.
Each mini-batch is scaled by the mean/variance computed on just that mini-batch. This adds some noise to the values Z[l] within that minibatch. So similar to dropout, it adds some noise to each hidden layer’s activations. This has a slight regularization effect.
During training, we calculate the mean and variance over a mini-batch of input data.
At test time, however, we may process only one example at a time, and calculating a mean and variance over a single test example doesn't make sense. To handle this, estimate the mean and variance with exponentially weighted averages of the mini-batch statistics seen during training, and use those estimates at test time.
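The training-time and test-time behaviour can be sketched as follows, assuming values are stored as (n_units, batch_size). The function names, the running dictionary and the momentum of 0.9 for the moving averages are illustrative choices.

```python
import numpy as np

def batchnorm_train(Z, gamma, beta, running, momentum=0.9, epsilon=1e-8):
    """Normalize each unit over the mini-batch and update running averages."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)
    # Exponentially weighted averages, kept for use at test time
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * Z_norm + beta  # Z-tilde

def batchnorm_test(Z, gamma, beta, running, epsilon=1e-8):
    """Use the running estimates instead of statistics of the test input."""
    Z_norm = (Z - running["mu"]) / np.sqrt(running["var"] + epsilon)
    return gamma * Z_norm + beta

rng = np.random.default_rng(0)
n_units = 3
gamma, beta = np.ones((n_units, 1)), np.zeros((n_units, 1))
running = {"mu": np.zeros((n_units, 1)), "var": np.ones((n_units, 1))}
Z = rng.normal(loc=5.0, scale=2.0, size=(n_units, 64))
batch_mean = Z.mean(axis=1, keepdims=True)
Z_tilde = batchnorm_train(Z, gamma, beta, running)
single = batchnorm_test(Z[:, :1], gamma, beta, running)
```

With gamma = 1 and beta = 0, each unit of Z_tilde has mean 0 and variance 1 over the mini-batch, and the single test example is normalized without computing any statistics of its own.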
Portrait Elizabeth @epentagrama #beauty #hair by @damienboissinothair #makeup by @karimrahmanmakeup #fashion #andrewng #paris #topmodel #photo #eddyming #newyork (at Daylightstudio)
... often, you first become good at something, and then you become passionate about it. And I think most people can become good at almost anything.
Andrew Ng, for Huffington Post http://www.huffingtonpost.com.au/2015/05/13/andrew-ng_n_7267682.html
Inside The Mind That Built Google Brain: On Life, Creativity, And Failure
+Andrew Ng on Life, Creativity and Failure.
#GoogleBrain #AndrewNg
https://plus.google.com/113839778987212293817/posts/iyGhceENr9e