How to do random initialization?
Recall that the K-means algorithm is:
Repeat{
for i =1 to m
assign each point to cluster centroid closest to x(i)
for k=1 to K
assign each mu to average(mean) of points assigned to the cluster k
}
Random initialization:
One apparent condition is that K the number of clusters you want to have should be less than the number points in the training set.
What experts recommend is, to randomly to choose K training points, and set the centroids to these K examples.
Cost function:
for one set of centroids, and points assigned to it, the cost function is as follows:
\[J = (1/m) * ⅀||x(i) - mu(c, i)||^2\], where the summation runs from i=1 to m.
The logic is simple, for any arrangement, we calculate for each point its distance from the associated centroid, sum this over all the points. Intuitive, no?
Choosing the number of clusters:
What we do is plot cost vs. K graph. We want the value of K, after which the price appears to have stopped falling. This method is also called as the elbow method.
You might sometimes want to choose K, based on the situation rather than on this graph.
No comments:
Post a Comment