
Regularization

Overfitting and underfitting:


[Figure: underfitting vs. overfitting]
This image should give you a rough idea of what underfitting and overfitting are.
When the curve does not fit the data well enough, it is underfitting; overfitting occurs when the curve fits the training data too closely.

Underfitting is also called "high bias" and overfitting is also called "high variance".

Overfitting:

If we have too many features, the learned hypothesis may fit the training set very well (so that the cost is nearly 0) but fail to generalize to new examples.

Addressing overfitting:

  • Reduce the number of features, manually.
  • Regularization (see below).

Regularization:
  • Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
  • It works well when we have a lot of features, each of which contributes a bit to predicting y.
  • Whenever we regularize a cost function, we add to the unregularized cost function the sum of the squared thetas multiplied by some parameter.
For example, if $J(\theta)$ is the unregularized cost, we add $\lambda \sum_j \theta_j^2$, where $\lambda$ is a regularization parameter and the summation runs over each $\theta_j$ in the Theta matrix.
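As a minimal sketch of what this looks like in code (assuming a squared-error cost for linear regression; the scaling of lambda by 2m is one common convention, and by convention the bias term theta[0] is not penalized):

import numpy as np

def regularized_cost(theta, X, y, lam):
    # Unregularized squared-error cost.
    m = len(y)
    h = X @ theta                                   # predictions
    J = np.sum((h - y) ** 2) / (2 * m)
    # Add the L2 penalty, skipping the bias term theta[0].
    return J + lam * np.sum(theta[1:] ** 2) / (2 * m)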

PCA: choosing the number of principal components

Choosing the number of principal components in PCA:


Choosing K:

We typically choose K to be the smallest value so that:
\[\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x_{approx}^{(i)}\right\|^2}{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^2} \le 0.01\]

The logic is simple: the numerator is the average squared projection error $\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - x_{approx}^{(i)}\|^2$, and the denominator is the total variance in the data, $\frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}\|^2$.

When the ratio is 0.01, or 1%, we say that 99% of the variance is retained for this value of k; similarly, when it is 0.05, or 5%, we say that 95% of the variance has been retained.

Algorithm for choosing k:

  • Try PCA with k = 1.
  • Compute Ureduce, z(1), z(2), ..., z(m), and xapprox.
  • Check the ratio shown above: if it is less than 0.01, you are done; otherwise increment k and loop. A sketch is shown below.
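Here is a minimal NumPy sketch of this loop (assuming X is an (m, n) matrix that has already been feature-scaled; the SVD is computed once, outside the loop):

import numpy as np

def choose_k(X, threshold=0.01):
    m, n = X.shape
    sigma = (X.T @ X) / m                  # covariance matrix
    U, S, V = np.linalg.svd(sigma)
    total_variance = np.sum(X ** 2) / m    # (1/m) * sum of ||x(i)||^2
    for k in range(1, n + 1):
        Ureduce = U[:, :k]
        Z = X @ Ureduce                    # project onto k components
        Xapprox = Z @ Ureduce.T            # reconstruct
        error = np.sum((X - Xapprox) ** 2) / m
        if error / total_variance < threshold:
            return k                       # smallest k retaining 99% of the variance
    return n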
One extra trick:

Supervised Learning speedup:

You can speed up supervised learning by reducing the number of features in the training set: simply extract the inputs (the Xs) and apply PCA to them to get the Zs.

Now the training set becomes Z and y. The same mapping should also be applied to the cross-validation and test sets.
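A minimal sketch of this speedup (the variable names are hypothetical; the key point is that Ureduce is learned from the training inputs only and then reused):

import numpy as np

def pca_mapping(X_train, k):
    # Learn the compression mapping from the training inputs only.
    sigma = (X_train.T @ X_train) / X_train.shape[0]
    U, S, V = np.linalg.svd(sigma)
    return U[:, :k]                  # Ureduce

# Ureduce = pca_mapping(X_train, k)
# Z_train = X_train @ Ureduce       # train on (Z_train, y_train)
# Z_cv    = X_cv    @ Ureduce       # apply the same mapping to the other sets
# Z_test  = X_test  @ Ureduce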

Dimension Reduction: Data Compression

Data Compression:

What can you really do with data compression? Why learn it?
  • Reduce the data from 2D to 1D
  • Or you can even reduce it from 3D to 2D.
  • Visualizing 2D data is easier than visualizing 3D or higher-dimensional data; such data can be reduced to dimensions that can be easily visualized.

Principal Component Analysis (PCA):

We want to, for example, reduce the dimension from 2D to 1D. What we have to do is find a direction onto which to project the data so that the projection error (the sum of squared distances of the points from this line) is minimized.

Another example: to reduce the data from n dimensions to k dimensions, we find k vectors onto which to project the data, so as to minimize the projection error.


PCA algorithm:

Suppose the training set is $\{x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(m)}\}$, where each $x^{(i)}$ is an n-element vector.
First, we preprocess the data by applying feature scaling (and mean normalization).
Now, in case we want to reduce the data from n dimensions to k dimensions:
Compute the "covariance matrix":
\[\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T\]

Now use the standard SVD function, np.linalg.svd(sigma), which gives us three results:
\[U, S, V = np.linalg.svd(sigma)\]
U contains the n eigenvectors; take its first k columns and store them in Ureduce to get the k eigenvectors.

The reduced data is z = Ureduce.T @ x for a single example x (or Z = X @ Ureduce for a row-wise data matrix X).

Reconstruction from compressed representation:

The process is simple: \[Xapprox = Z @ Ureduce.T\] (for a single example, xapprox = Ureduce @ z).
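Putting the pieces together, a minimal NumPy sketch of the algorithm above (assuming X is an (m, n) row-wise data matrix that has already been mean-normalized and feature-scaled):

import numpy as np

def pca(X, k):
    # Compress X from n dimensions down to k.
    m = X.shape[0]
    sigma = (X.T @ X) / m              # covariance matrix, (n, n)
    U, S, V = np.linalg.svd(sigma)
    Ureduce = U[:, :k]                 # first k eigenvectors, (n, k)
    Z = X @ Ureduce                    # compressed data, (m, k)
    return Z, Ureduce

def reconstruct(Z, Ureduce):
    # Approximate the original data from its compressed form.
    return Z @ Ureduce.T               # (m, k) @ (k, n) -> (m, n)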


K-means algorithm

How to do random initialization?


Recall that the K-means algorithm is:


Repeat {
           for i = 1 to m
                      assign x(i) to the cluster centroid closest to it

           for k = 1 to K
                      set mu(k) to the average (mean) of the points assigned to cluster k
}

Random initialization:

One apparent condition is that K, the number of clusters you want to have, should be less than the number of points in the training set.
What experts recommend is to randomly choose K training points and set the centroids to these K examples.


Cost function:

For one set of centroids and the points assigned to them, the cost function is as follows:

\[J = \frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - \mu_{c^{(i)}}\right\|^2\]
where $\mu_{c^{(i)}}$ is the centroid to which $x^{(i)}$ is assigned.
The logic is simple: for any arrangement, we calculate for each point its squared distance from the associated centroid, and sum this over all the points. Intuitive, no?
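In NumPy, with X an (m, n) data matrix, centroids a (K, n) matrix, and idx holding the centroid index assigned to each point (hypothetical names for this sketch), the cost is a one-liner:

import numpy as np

def kmeans_cost(X, idx, centroids):
    # Mean squared distance of each point to its assigned centroid.
    return np.sum((X - centroids[idx]) ** 2) / X.shape[0]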

Choosing the number of clusters:

What we do is plot the cost J versus K. We want the value of K after which the cost appears to stop falling. This method is also called the elbow method.

[Figure: cost J plotted against the number of clusters K, showing the "elbow"]



You might sometimes want to choose K based on the situation rather than on this graph.

Unsupervised Learning: K-means clustering

Clustering:

In unsupervised learning, the training set does not include the labels; instead, it contains only the features.
Thus, a common training set is as follows:

Training set: \[\{x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(m)}\}\], where each $x^{(i)}$ denotes a vector with n features.

K-means algorithm:

[Figure: the K-means algorithm]

Suppose that we want to group the data into two clusters. The approach that we use in K-means clustering is:


  • We feed in the data and the number of groups (K), here two.
  • Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K), here only two.
  • Then assign each point in the data to whichever centroid it is closest to; here, to one of the two points based on proximity.
  • Now, for each resulting group, calculate its mean and move the corresponding centroid to that mean.
  • Again assign every point to one of these centroids, and again calculate the means.
  • Repeat this until two consecutive assignments result in the same arrangement. A sketch is shown below.
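A minimal NumPy sketch of this loop (initialization picks K random training points, as recommended above, and the loop stops when the assignments no longer change):

import numpy as np

def kmeans(X, K, max_iters=100):
    m = X.shape[0]
    # Random initialization: pick K distinct training points as the centroids.
    centroids = X[np.random.choice(m, K, replace=False)].astype(float)
    idx = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_idx = np.argmin(dists, axis=1)
        if idx is not None and np.array_equal(new_idx, idx):
            break                          # two consecutive assignments matched
        idx = new_idx
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(K):
            if np.any(idx == k):
                centroids[k] = X[idx == k].mean(axis=0)
    return centroids, idx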

Support Vector Machine, Decision Boundary

Support Vector Machine: Linearly Separable case

[Figure: a linearly separable dataset with lines A, B, and C]
This is an example of a large margin classifier: the gap between A and B is called the margin, and the case is called "large margin" because this margin is large. In this case, the SVM will learn line C.

Recall that the cost function for SVM was:

\[C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right]\]
Where the summation runs over all the examples.

We are going to focus on the parameter C:
[Figure: how the SVM decision boundary varies with C]

A large C will make sure the classifier fits every example: each example gets a huge weight when deciding the decision boundary. In some cases this can be useful, but in most cases you don't want it, at least when the training set is very large.
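If you want to see this effect yourself, scikit-learn's SVC exposes the same C parameter (a sketch on made-up data):

import numpy as np
from sklearn.svm import SVC

# Made-up, roughly linearly separable 2D data.
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Large C: every example weighs heavily, so the boundary bends to fit outliers.
strict = SVC(kernel='linear', C=100.0).fit(X, y)
# Small C: a softer margin that tolerates some misclassification.
soft = SVC(kernel='linear', C=0.01).fit(X, y)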

Support Vector Machine

Support Vector Machine:

Recall the cost of one example in logistic regression: \[-y\log(h) - (1 - y)\log(1 - h)\]
where h was: \[h = \frac{1}{1 + e^{-\theta^T x}}\]
And our optimization objective was to minimize the mean of this cost over all the examples.

The support vector machine is very similar to logistic regression: \[\min_{\theta}\; C\sum_{i=1}^{m}\left[y^{(i)}\,\mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\,\mathrm{cost}_0(\theta^T x^{(i)})\right]\]

You can add the regularization term if you want to.

The cost1 and cost0 functions are as follows:
[Figure: the SVM cost functions cost1(z) and cost0(z)]

cost0(z) is zero if z is less than -1, and cost1(z) is zero if z is greater than 1.
Thus, if y for a certain example is 1, the cost for that example becomes zero if $\theta^T x$ is greater than 1. And if y is 0 for some example, then the cost becomes zero if $\theta^T x$ is less than -1.
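A minimal sketch of these two hinge-like functions (assuming the standard piecewise-linear shapes from the figure):

import numpy as np

def cost1(z):
    # Cost when y = 1: zero for z >= 1, growing linearly below that.
    return np.maximum(0, 1 - z)

def cost0(z):
    # Cost when y = 0: zero for z <= -1, growing linearly above that.
    return np.maximum(0, 1 + z)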

Neural Networks: Back propagation algorithm and code

The Backpropagation algorithm:

Calculating the errors:

[Figure: a four-layer neural network]
The detailed intuition of how backpropagation learns the thetas is beyond what this blog covers, but you'll get what you need to know to use the algorithm.

First, we will calculate delta(l)(j): the "error" of node j in layer l.

For layer 4, this delta matrix, which in this case contains only one node, is:

\[Delta4 = h - y\]
where h is the output of the neural network.
\[Delta3 = theta3.T @ Delta4 * (a3 * (1 - a3))\]

Note here that we use the operators as in Python: @ is the matrix multiplication operator and * is the element-wise multiplication operator. Similarly, for layer 2, we have:
\[Delta2 = theta2.T @ Delta3 * (a2 * (1 - a2))\]

In case you have more layers, the same method can be used to calculate the error for the output layer and each of the hidden layers.
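As a sketch of that recursion in code (hypothetical names: thetas[l] is the weight matrix mapping layer l to layer l+1, a[l] is the activation column vector of layer l; bias handling is elided):

def layer_errors(h, y, thetas, a, L):
    # Output-layer error first, then walk backwards through the hidden layers.
    deltas = {L: h - y}
    for l in range(L - 1, 1, -1):          # no error term for the input layer
        deltas[l] = (thetas[l].T @ deltas[l + 1]) * (a[l] * (1 - a[l]))
    return deltas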


The Backpropagation algorithm:

Here is the algorithm to get the theta matrices for our neural network.

Step 1: Have a training set X and y. Let m be the number of training examples in it.

Step 2:
  • Set $a1 = X$ (feeding all the training examples into the neural network at once).
  • Perform forward propagation on this data and compute the activations for each layer as a2, a3, etc., and h for the last layer.
  • Now compute the errors for each layer: if L is the number of layers, calculate $DeltaL = h - y$, and from it compute Delta(L-1), ..., Delta2.
  • Set $gamma1 = Delta2.T@a1$; similarly, $gamma2 = Delta3.T@a2$ and $gamma3 = Delta4.T@a3$.
  • $Theta1_gradient = gamma1/m$, $Theta2_gradient = gamma2/m$, $Theta3_gradient = gamma3/m$.
  • Now apply the usual gradient descent.


The detailed explanation of why this works is beyond the scope of this blog.

The code:

Here is the code that does exactly the same thing:

import numpy as np

def nnCostFunction(self, theta1, theta2, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda):
    m = X.shape[0]
    I = np.eye(num_labels)
    Y = np.zeros((m, num_labels))  # one-hot encoding of the labels
    for i in range(m):
        l = y[i, 0]
        Y[i, :] = I[l - 1, :]      # label l becomes row l-1 of the identity matrix
    # Forward propagation.
    a1 = np.hstack((np.ones((m, 1)), X))   # add the bias column
    z2 = a1 @ theta1.T
    temp = self.sigmoid(z2)
    a2 = np.hstack((np.ones((z2.shape[0], 1)), temp))
    h = self.sigmoid(a2 @ theta2.T)
    # Regularization term; the bias columns are excluded from the penalty.
    p = np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2)
    J = (np.sum((-Y) * np.log(h) - (1 - Y) * np.log(1 - h)) / m) + ((Lambda * p) / (2 * m))
    # Till now forward propagation has been done. Now, backpropagation.
    sigma3 = h - Y
    sigma2 = (sigma3 @ theta2) * self.sigmoidGradient(np.hstack((np.ones((z2.shape[0], 1)), z2)))
    sigma2 = sigma2[:, 1:]         # drop the error of the bias unit
    delta2 = sigma3.T @ a2
    delta1 = sigma2.T @ a1
    p1 = (Lambda / m) * np.hstack((np.zeros((theta1.shape[0], 1)), theta1[:, 1:]))
    p2 = (Lambda / m) * np.hstack((np.zeros((theta2.shape[0], 1)), theta2[:, 1:]))
    Theta_grad1 = (delta1 / m) + p1
    Theta_grad2 = (delta2 / m) + p2
    return J, Theta_grad1, Theta_grad2
The parameter Lambda is used for regularization, and you can pass zero for it. Regularization is a general trick to improve our predictions on new data.

These gradients can now be used to perform gradient descent, as follows.

The gradient descent:

def gradientDescent(self, X, y, initial_nn_params, alpha, num_iters, Lambda,
                    input_layer_size, hidden_layer_size, num_labels):
    # Unroll the flat parameter vector into the two theta matrices.
    theta1 = initial_nn_params[:hidden_layer_size * (input_layer_size + 1)].reshape(
        hidden_layer_size, input_layer_size + 1)
    theta2 = initial_nn_params[hidden_layer_size * (input_layer_size + 1):].reshape(
        num_labels, hidden_layer_size + 1)
    J_history = []
    for i in range(num_iters):
        cost, grad1, grad2 = self.nnCostFunction(theta1, theta2, input_layer_size,
                                                 hidden_layer_size, num_labels, X, y, Lambda)
        theta1 = theta1 - (alpha * grad1)
        theta2 = theta2 - (alpha * grad2)
        J_history.append(cost)
    return theta1, theta2, J_history
Again, Lambda is used for regularization and can be passed as zero. The gradient descent here is the same as that used in linear regression and logistic regression.

Now, these thetas can be used to predict the output using the concept of forward propagation:
def predict(self, theta1, theta2, X):
    m = X.shape[0]
    a1 = np.hstack((np.ones((m, 1)), X))   # add the bias column
    z2 = a1 @ theta1.T
    temp = self.sigmoid(z2)
    a2 = np.hstack((np.ones((z2.shape[0], 1)), temp))
    a3 = self.sigmoid(a2 @ theta2.T)
    return np.argmax(a3, axis=1) + 1       # labels are 1-indexed

Note that the sigmoid gradient used in the above code is nothing but a disguised form of something we have already discussed:
def sigmoidGradient(self, z):
    return self.sigmoid(z) * (1 - self.sigmoid(z))
You might also want to initialize the thetas randomly; for that, you implement the corresponding function as follows:
def randInitializeWeights(self, L_in, L_out):
    # epsilon = sqrt(6) / sqrt(L_in + L_out)
    epsilon = np.sqrt(6) / np.sqrt(L_in + L_out)
    W = np.random.rand(L_out, L_in + 1) * (2 * epsilon) - epsilon
    return W
L_in and L_out are the numbers of units in the layer on which the theta operates and the layer whose result it produces, respectively. All the values in W lie within the range -epsilon to epsilon.
