Density-Based Mining Algorithms for Dynamic Data: An Incremental Approach
No Thumbnail Available
Typically an algorithm designed for carrying data mining tasks is fed with a static set of input. This class of algorithms remain prone to certain disadvantages in scenarios where the input data and extracted results change temporally. The prominent bottlenecks may include redundant computation, high response time along with increased consumption of available resources. Given the importance of handling dynamic data in a real time environment eg: traffic monitoring, medical research, recommendation systems etc., this thesis focuses on developing incremental mining algorithms particularly in the eld of density based clustering and outlier detection. Density-based algorithms display robustness in extracting clusters of varying granularity or ltering outliers from variable density sub-spaces. In this thesis, we propose incremental extensions to two density based clustering algorithms: MBSCAN (Mass-based Clustering of Spatial Data with Application of Noise) and SNNDBSCAN (Shared Nearest Neighbor Density Based Clustering of Large Spatial Data with Application of Noise). While dealing with outlier detection, an incremental density based approach is proposed for the K-Nearest Neighbor Outlier Detection algorithm known as KNNOD. The incremental extensions to MBSCAN and KNNOD are approximate in nature facilitating single point insertions. However for SNN-DBSCAN, we propose exact incremental solutions facilitating both addition and deletion of data in batch mode. Our first contribution known as the iMass (Incremental Mass Based Clustering) clustering algorithm o ers an approximate incremental solution to the static MBSCAN algorithm. The goal of this work is to identify the expensive building blocks of MBSCAN and reconstruct them incrementally post every new insertion. Observations combining six real world and two synthetic datasets showed that the proposed iMass algorithm outperformed the naive MBSCAN method by achieving a maximum efficiency upto an order of 2.28 ( 191 times). Around 60.375% of mean clustering accuracy was observed post final insertion for three unlabeled datasets. The cluster quality evaluation through Normalized Mutual Information (NMI), Rand index (RI) measure and F1-score for five class labeled datasets showed similar or improved results for iMass as compared to MBSCAN. The efforts laid in our first contribution therefore motivated us to expand our research towards proposing exact incremental solutions. The second contribution in form of our proposed clustering algorithm BISDBadd (Batch Incremental Shared Nearest Neighbor Density Based Clustering Algorithm for addition) provides an exact incremental solution to the naive SNN-DBSCAN algorithm while adding points in batch mode. BISDBadd comprises of two proposed sub-variant algorithms viz. Batch - Inc1, Batch - Inc2 and is the most efficient comparatively. BISDBadd targets all the components of SNN-DBSCAN incrementally unlike its sub-variant methods. BISDBadd achieved a maximum efficiency upto an order of 3 ( 1000 times) over five (three real world and two synthetic) datasets. An identical cluster similarity was also observed with that of the SNN-DBSCAN algorithm. Complementing addition of data, the third contribution proposes the algorithm BISDBdel (Batch Incremental Shared Nearest Neighbor Density Based Clustering Algorithm for deletion) thereby providing an exact incremental solution to SNNDBSCAN while deleting points in batch mode. Similar to BISDBadd, BISDBdel comprises of two proposed sub-variant algorithms viz. Batch-Dec1, Batch-Dec2 and is the most efficient comparatively. BISDBdel targets all the components of SNN-DBSCAN incrementally when points are deleted from the dataset unlike its sub-variant methods. On comparing with SNN-DBSCAN, the maximum efficiency achieved by BISDBdel reached upto an order of 4 ( 10000 times) over five (three real world and two synthetic) datasets. The set of clusters obtained were identical to the SNN-DBSCAN algorithm. Moving from the paradigm of clustering, our fourth and nal contribution focuses on dynamic extraction of at most top-N global outliers against single point insertions. Our proposed approximate incremental algorithm KAGO (Adaptive Grid Based Outlier Detection Approach using Kernel Density Estimate (KDE)) uses Gaussian kernel in a grid-partitioned space to determine the local density of a point. The local density obtained through KDE is used to filter the local outliers which are integrated to extract at most top-N global outliers. The KAGO algorithm outperformed KNNOD by achieving a maximum efficiency upto an order of 3.91 ( 8304 times) over two intrusion detection datasets and a bidding data for market advertisement related to a search engine. Outliers' evaluation on these datasets using RI and F1-score showed a mean improved accuracy of around 3.3% in case of KAGO. The thesis therefore strives towards developing approximate and exact incremental algorithms in the eld of density-based clustering and outlier detection thereby facilitating real time data analysis.
Supervisor: Pinaki Mitra
COMPUTER SCIENCE AND ENGINEERING