#### Transcript Slide 1

Clustering I

#### Transcript Slide 2

The Task

- Input: a collection of instances with no special class-label attribute.
- Output: clusters (groups) of instances, where the members of a cluster are more 'similar' to each other than they are to the members of other clusters.
  - The similarity between instances needs to be defined.
- Because there are no predefined classes to predict, the task is to find useful groups of instances.

#### Transcript Slide 3

Example – Iris Data

[Figure: plot of the Iris data set.]

#### Transcript Slide 4

Example Clusters?

[Figure: scatter plot of petal width against petal length, suggesting candidate clusters.]

#### Transcript Slide 5

Different Types of Clusters

- Clustering is not an end in itself; it is only a means to an end defined by the goals of the parent application.
  - A garment manufacturer might be looking to group people into different sizes and shapes (segmentation).
  - A biologist might be looking for groups of genes with similar functions (natural groupings).
- Application-dependent notions of clusters are therefore possible.
- This means you need different clustering techniques to find different types of clusters.
- You also need to learn to choose a clustering technique that matches your objective.
  - Sometimes an ideal match is not possible, so you adapt an existing technique.
  - Sometimes a 'wrongly' selected technique shows unexpected clusters, which is what data mining is all about.

#### Transcript Slide 6

Cluster Representations

Several representations are possible:

- Partitioning of the 2D space showing the instances
- Table of cluster probabilities, e.g.:

  | Instance | 1   | 2   | 3   |
  |----------|-----|-----|-----|
  | a        | 0.4 | 0.1 | 0.5 |
  | b        | 0.1 | 0.8 | 0.1 |
  | c        | 0.3 | 0.3 | 0.4 |
  | d        | 0.1 | 0.1 | 0.8 |

- Venn diagram
- Dendrogram

#### Transcript Slide 7

Several Techniques

- Partitioning methods: k-means and k-medoids
- Hierarchical methods: agglomerative
- Density-based methods: DBSCAN

#### Transcript Slide 8

K-means Clustering

- Input: the number of clusters k and a collection of n instances.
- Output: a set of k clusters that minimizes the squared-error criterion.
- Method:
  - Arbitrarily choose k instances as the initial cluster centers.
  - Repeat:
    - (Re)assign each instance to the cluster to which it is most similar, based on the mean value of the instances in the cluster.
    - Update the cluster means (compute the mean value of the instances in each cluster).
  - Until there is no change in the assignment.
- Squared-error criterion: $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$, where $m_i$ is the mean of cluster $C_i$ and $p$ ranges over the points assigned to $C_i$.
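To make Slide 8's loop concrete, here is a minimal sketch in Python with NumPy. It is not from the slides: the `kmeans` name, the `max_iter` safety cap, and the seeded random choice of initial centers are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k instances as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # (Re)assign each instance to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change in the assignment: converged
        labels = new_labels
        # Update the cluster means
        for i in range(k):
            members = X[labels == i]
            if len(members):  # k-means can produce empty clusters (see Slide 12)
                centers[i] = members.mean(axis=0)
    # Squared-error criterion E = sum_i sum_{p in C_i} |p - m_i|^2
    E = sum(((X[labels == i] - centers[i]) ** 2).sum() for i in range(k))
    return labels, centers, E
```

Calling `kmeans(X, k)` for k = 1, 2, 3, ... and comparing the returned E values is one way to explore the choice of k discussed on Slide 10, keeping in mind that E always shrinks as k grows.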
#### Transcript Slide 9

Interactive Demo

- http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

#### Transcript Slide 10

Choosing the Number of Clusters

- The k-means method expects the number of clusters k to be specified as part of the input.
- How do we choose an appropriate k?
- Options:
  - Start with k = 1 and work up to a small number, selecting the result with the lowest squared error.
    - Warning: by this measure, the result that puts each instance in a cluster of its own would always be the 'best'.
  - Create two initial clusters and split each cluster independently:
    - When splitting a cluster, create two new 'seeds': one seed at a point one standard deviation away from the cluster center, and the second one standard deviation away in the opposite direction.
    - Apply k-means to the points in the cluster with the two new seeds.

#### Transcript Slide 11

Choosing Initial Centroids

- Randomly chosen centroids may produce poor-quality clusters.
- How do we choose the initial centroids?
- Options:
  - Take a sample of the data, apply hierarchical clustering, and extract k cluster centroids.
    - This works well; small sample sizes are desired.
  - Take k well-separated points:
    - Take a sample of the data and select the first centroid as the centroid of the sample.
    - Choose each subsequent centroid to be the point farthest from any of the already selected centroids.

#### Transcript Slide 12

Additional Issues

- K-means might produce empty clusters, so an alternative centroid may need to be chosen. Options:
  - Choose the point farthest from the existing centroids.
  - Choose a point from the cluster with the highest squared error.
- The squared error may need to be reduced further, because k-means may converge to a local minimum. It is possible to reduce the squared error without increasing k:
  - Apply alternating phases of cluster splitting and merging.
  - Splitting: split the cluster with the largest squared error.
  - Merging: redistribute the elements of a cluster to its neighbors.
- Outliers in the data cause problems for k-means. Options:
  - Remove the outliers.
  - Use k-medoids instead.

#### Transcript Slide 13

K-medoids

- Input: the number of clusters k and a collection of n instances.
- Output: a set of k clusters that minimizes the sum of the dissimilarities of all the instances to their nearest medoids.
- Method (a code sketch follows Slide 14 below):
  - Arbitrarily choose k instances as the initial medoids.
  - Repeat:
    - (Re)assign each remaining instance to the cluster with the nearest medoid.
    - Randomly select a non-medoid instance O_r.
    - Compute the total cost S of swapping a medoid O_j with O_r.
    - If S < 0, swap O_j with O_r to form the new set of k medoids.
  - Until there is no change.

#### Transcript Slide 14

K-means vs K-medoids

- Both require k to be specified in the input.
- K-medoids is less influenced by outliers in the data.
- K-medoids is computationally more expensive.
- Both methods assign each instance to exactly one cluster; fuzzy partitioning techniques relax this.
- Partitioning methods are generally good at finding spherical clusters.
- They are suitable for small and medium-sized data sets; extensions are required to work on large data sets.
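As promised under Slide 13, here is a PAM-style k-medoids sketch under the same assumptions (Python/NumPy; the `kmedoids` name and the `max_iter` cap are illustrative). One deliberate difference from the slide: instead of randomly sampling one non-medoid per round, this sketch tries every medoid/non-medoid swap and accepts each improving one, the classic exhaustive variant.

```python
import numpy as np

def kmedoids(X, k, max_iter=100, seed=0):
    """PAM-style k-medoids sketch: each medoid is an actual instance."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Pairwise distance matrix; affordable for small/medium data sets (Slide 14)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Arbitrarily choose k instances as the initial medoids
    medoids = rng.choice(len(X), size=k, replace=False)

    def total_cost(meds):
        # Sum of dissimilarities of all instances to their nearest medoid
        return D[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for j in range(k):              # candidate medoid O_j to replace
            for r in range(len(X)):     # candidate non-medoid O_r
                if r in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = r
                cost = total_cost(trial)
                if cost < best:         # S < 0: the swap pays off
                    medoids, best = trial, cost
                    improved = True
        if not improved:
            break                       # no change: converged
    labels = D[:, medoids].argmin(axis=1)  # assign each instance to its nearest medoid
    return medoids, labels, best
```

Because instances are compared only through the precomputed distance matrix `D`, any dissimilarity measure can be plugged in, and using actual instances as cluster centers is what makes k-medoids less sensitive to outliers than the means used by k-means.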