Mining Arbitrary Shaped Clusters in Large Dataset

No Thumbnail Available
Journal Title
Journal ISSN
Volume Title
Cluster analysis or clustering, groups objects into a meaningful manner, is an important task in pattern recognition, data mining, bio-informatics, image processing, information retrieval, intrusion detection system, etc. To meet demands of clustering activities in these domains, many clustering methods have been developed over the years. These clustering methods can be categorized from different perspectives. An important perspective is algorithmic perspective. From algorithmic perspective (view), clustering methods are classified into distance based and density based methods (algorithms). Naturally, clusters are arbitrary shaped and different sizes in a dataset. However, finding clusters of arbitrary shapes and sizes is a difficult (challenging) task, especially in large size dataset. This thesis focuses at finding clusters of arbitrary shapes and sizes using both distance based and density based algorithms in large dataset. Distance based clustering methods usually find convexed shaped clusters. The classical single-link is a distance based method, which can find clusters of arbitrary shapes and sizes. However, it scans dataset many times and has quadratic time and space complexity. This prohibits single-link to apply to large dataset. To address these issues, a distance based hybrid clustering method has been proposed, which is termed as l-SL method. It produces a flat clustering for arbitrary shaped clusters in large dataset. It has two stages. In the first stage, leaders clustering method is applied to a dataset to obtain many convex shaped small clusters. In the second stage, these convexed shaped clusters are merged using single-link method (given minimum distance criteria). Proposed l-SL method scans a dataset once and clustering results of l-SL method is close to the final clustering of that of the single-link method. An improvement of the l-SL method has also been proposed in order to produce same clustering results as that of the single-link clustering method. Proposed l-SL and its improved methods output a flat clustering of a given dataset. A new data summarization scheme termed as data sphere has been proposed, which exploits leaders method to acquire summary of a dataset. Proposed data sphere is utilized to work with hierarchical single-link method in large dataset for generating a hierarchy of clustering. However, clustering results of data sphere based single-link method deteriorates at high compression ratio. To address this problem, tolerance rough set theory based data summarization scheme termed rough bubble has been proposed. Tolerance rough set theory is exploited to resolve uncertainty associated with cluster membership of leaders method, which collects directional distance statistics of each rough bubble. Main disadvantage of proposed distance based l-SL, improved l-SL, data sphere and rough bubble methods is that they cannot find arbitrary shaped clusters in presence of outliers, varying density clusters. To address these issues, a dynamic parameter termed Nearest Neighbor Factor(NNF) is introduced to determine relative position of an object in its neighborhood (local) region. A density based clustering method which uses NNF is introduced for finding clusters of arbitrary shapes, different sizes and variable density in a dataset..
Supervisor: Sukumar Nandi