# KMeans

> KMeans is a method that aims to cluster data in K groups of equal variance. The conventional KMeans algorithm has performance bottleneck; however，when implemented with PS, KMeans achieves the same level of accuracy with better performance.

## 1. Introduction 

The KMeans algorithm assigns each data point to its *nearest* cluster, where the *distance* is measured between the data point and the cluster's *centers*. In general, Kmeans algorithm is implemented in an iterative way as shown below:  

![kmeans](../img/kmeans.png)   

where, ![xi](../img/xi.png) is the ith sample and ![ci](../img/ci.png) is its nearest cluster; ![miu_j](../img/miu_j.png) is the centers of the jth cluster. 


## Mini-batch KMeans
"Web-Scale K-Means Clustering"[1] proposes an improved KMeans algorithm to address the latency, scalability and sparsity requirements in user-facing web applications, using mini-batch optimization for training. As shown below:

![mini_batch_kmeans](../img/mini_batch_kmeans.png)


## 2. Distributed Implementation on Angel

### Model Storage
KMeans on Angel stores the K centers and K-centers counts on ParameterServer，using a K×N matrix represents the K centers and a K×1 vector represents the K-centers counts, where K is the number of clusters and N is the dimension of data，i.e. number of features. 

### Model Updating
KMeans on Angel is trained in an iterative way; during each iteration, the centers are updated by mini-batch. 

### Algorithm
KMeans on Angel algorithm as follows:
 
![KMeans_on_Angel](../img/KMeans_on_Angel.png)  


## 3. Execution & Performance

### Input Format

* Data format is set in "ml.data.type", which supports "libsvm", "dense" and "dummy" formats. For details, see [Angel Data Format](data_format_en.md)

### Parameters

* IO Parameters
  * ml.feature.index.range: number of features
  * ml.data.type: [Angel Data Format](data_format_en.md), can be "dummy" or "libsvm"
  * angel.train.data.path: input path for train
  * angel.predict.data.path：input path for predict
  * angel.save.model.path: save path for trained model
  *	angel.predict.out.path：output path for predict
  * angel.log.path: save path for the log
  
* Algorithm Parameters
  * ml.epoch.num: number of iterations
  * ml.minibatch.size：: samples for mini-batch
  * ml.kmeans.center.num: K, number of clusters
  * ml.kmeans.c：learning rate
  
* Resource Parameters
  * angel.workergroup.number: number of workers  
  * angel.worker.memory.mb: worker's memory requested in G   
  * angel.worker.task.number: number of tasks on each worker, default is 1   
  * angel.ps.number: number of PS 
  * angel.ps.memory.mb: PS's memory requested in G  

###  **Submit Command**    

* **Training Job**
    
	```java
	./bin/angel-submit \
		--action.type=train \
		--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
		--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
		--angel.train.data.path=$traindata \
		--angel.save.model.path=$modelout \
		--angel.output.path.deleteonexist=true \
		--angel.log.path=$logpath \
		--ml.data.type=libsvm \
		--ml.model.type=T_DOUBLE_DENSE \
		--ml.kmeans.center.num=$centerNum  \
		--ml.kmeans.c=0.15 \
		--ml.epoch.num=10 \
		--ml.feature.index.range=$featureNum \
		--ml.feature.num=$featureNum \
		--angel.workergroup.number=4 \
		--angel.worker.memory.mb=5000  \
		--angel.worker.task.number=1 \
		--angel.ps.number=4 \
		--angel.ps.memory.mb=5000 \
		--angel.job.name=kmeans_train
	```

* **IncTraining Job**
	```java
	./bin/angel-submit \
		--action.type=inctrain \
		--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
		--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
		--angel.train.data.path=$traindata \
		--angel.load.model.path=$modelout \
		--angel.save.model.path=$modelout \
		--angel.output.path.deleteonexist=true \
		--angel.log.path=$logpath \
		--ml.data.type=libsvm \
		--ml.model.type=T_DOUBLE_DENSE \
		--ml.kmeans.center.num=$centerNum \
		--ml.kmeans.c=0.15 \
		--ml.epoch.num=10 \
		--ml.feature.index.range=$featureNum \
		--ml.feature.num=$featureNum \
		--angel.workergroup.number=4 \
		--angel.worker.memory.mb=5000  \
		--angel.worker.task.number=1 \
		--angel.ps.number=4 \
		--angel.ps.memory.mb=5000 \
		--angel.job.name=kmeans_inctrain
	```

* **Prediction Job**

	```java
	./bin/angel-submit \
		--action.type=predict \
		--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
		--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
		--angel.predict.data.path=$predictdata \
		--angel.load.model.path=$modelout \
		--angel.predict.out.path=$predictout \
		--angel.output.path.deleteonexist=true \
		--angel.log.path=$logpath \
		--ml.data.type=libsvm \
		--ml.model.type=T_DOUBLE_DENSE \
		--ml.kmeans.center.num=$centerNum \
		--ml.feature.index.range=$featureNum \
		--ml.feature.num=$featureNum \
		--angel.workergroup.number=4 \
		--angel.worker.memory.mb=5000  \
		--angel.worker.task.number=1 \
		--angel.ps.number=4 \
		--angel.ps.memory.mb=5000 \
		--angel.psagent.cache.sync.timeinterval.ms=500 \
		--angel.job.name=kmeans_predict
	```

### Performance
* data：SVHN，3×10^3 features，7×10^4 samples
* resource：
	* Angel：executor：4，5G memory，1 task； ps：4，5G memory
* Time of 100 epochs:
	* Angel：45min
	
## 4. References
[1] Sculley D. Web-scale k-means clustering[C]// International Conference on World Wide Web, WWW 2010, Raleigh, North         	Carolina, Usa, April. DBLP, 2010:1177-1178.