Outlier analysis in data mining identifies and examines data points that deviate significantly from the typical pattern or behavior of the rest of the dataset.

This article will take you through the concepts, techniques, practical applications, and a code example of outlier analysis in data mining.

An outlier is an observation that appears to deviate markedly from the other observations in the sample. Understanding outliers is critical in data mining because they can provide insights into the data that are not immediately apparent.

## Types of Outliers

- **Global Outliers:** These are data points that are extreme compared to the whole data distribution.
- **Contextual Outliers:** These outliers depend on the context of the data and may not be outliers in a different context.
- **Collective Outliers:** A group of data points that, taken together, deviates significantly from the rest of the dataset, even if the individual points do not.

## Methods of Outlier Analysis

**Statistical Methods**

**Z-Score**

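The Z-Score measures how many standard deviations a data point lies from the mean: z = (x - mean) / standard deviation. A point whose Z-Score exceeds 3 in absolute value is commonly treated as an outlier, although lower cutoffs such as 2 or 2.5 are also used, depending on the application.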
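As a minimal sketch of the Z-Score check, the snippet below uses only Python's standard library. The function name `zscore_outliers`, the default `threshold` of 3, the use of the population standard deviation, and the sample values are all assumptions made for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose absolute Z-Score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return []  # all values are identical, so nothing can be flagged
    scored = [(x, (x - mean) / std) for x in values]
    return [(x, z) for x, z in scored if abs(z) > threshold]

# Hypothetical sample: readings near 10 plus one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
        10, 9, 11, 10, 12, 9, 10, 11, 10, 60]
flagged = zscore_outliers(data)
print(flagged)
```

Because the mean and standard deviation are themselves pulled toward extreme values, this check tends to work best on reasonably large, roughly bell-shaped samples.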

**IQR (Interquartile Range)**

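The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), so it spans the middle 50% of the data. Under the common 1.5 × IQR rule, values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as potential outliers.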
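The IQR approach can be sketched with the standard library's `statistics.quantiles`. The fences used here follow the widely used 1.5 × IQR convention; the function name `iqr_outliers`, the multiplier `k`, the choice of the "inclusive" quantile method, and the sample data are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return outliers, (lower, upper)

# Hypothetical sample with one unusually large value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 48]
outliers, (lower, upper) = iqr_outliers(data)
print(outliers, lower, upper)
```

Quartiles are barely affected by a few extreme values, which makes this rule less sensitive to the outliers themselves than a mean-based check.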

**Machine Learning Methods**

**Isolation Forest**

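Isolation Forest is a tree-based algorithm for detecting outliers. Rather than profiling normal data points, it isolates anomalies by recursively partitioning the data with random splits; anomalies tend to be separated from the rest in fewer splits than normal points.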
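A short sketch using scikit-learn's `IsolationForest` shows the idea on synthetic data. The data, the two planted anomalies, and the parameter values (`contamination=0.01`, `n_estimators=100`) are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption): 200 points around the origin plus 2 planted anomalies.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, planted])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)        # -1 = flagged as outlier, 1 = inlier
scores = model.decision_function(X)  # lower scores mean more anomalous

outlier_idx = np.where(labels == -1)[0]
print(outlier_idx)
```

The `contamination` parameter is the expected fraction of outliers and sets the cutoff on the scores, so it should reflect what is known about the data rather than be copied from this example.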

**One-Class SVM**

One-Class SVM is used for novelty detection: it learns a boundary around the training data and flags new observations that fall outside that boundary.
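The following sketch fits scikit-learn's `OneClassSVM` on synthetic training data and then scores two new observations. The training data, the test points, and the settings `nu=0.05` and `gamma="scale"` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic training data (an assumption): points assumed to represent normal behavior.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Two hypothetical new observations: one typical, one far from the training data.
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
pred = model.predict(X_new)  # 1 = consistent with training data, -1 = novelty
print(pred)
```

Here `nu` is an upper bound on the fraction of training points treated as errors, so it roughly controls how tight the learned boundary is.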

**Clustering Techniques**

**K-Means Clustering**

After clustering, outliers can be recognized as points that fall into sparse or single-member clusters, or that lie unusually far from their assigned centroid.
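One way to act on this is to run K-Means and flag every point that lands in a very small cluster. The sketch below uses scikit-learn's `KMeans`; the synthetic data, the number of clusters, and the `min_size` threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption): two tight groups and one isolated point.
rng = np.random.default_rng(0)
blob_a = rng.normal([0, 0], 0.5, size=(50, 2))
blob_b = rng.normal([10, 10], 0.5, size=(50, 2))
lone = np.array([[30.0, -20.0]])
X = np.vstack([blob_a, blob_b, lone])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Count members per cluster and flag points that sit in sparse clusters.
sizes = np.bincount(km.labels_, minlength=3)
min_size = 3  # hypothetical cutoff for "sparse"
sparse_clusters = np.where(sizes < min_size)[0]
outlier_idx = np.where(np.isin(km.labels_, sparse_clusters))[0]
print(sizes, outlier_idx)
```

This only works when the number of clusters leaves room for outliers to form their own groups; with too few clusters, an extreme point is absorbed into a large cluster and pulls its centroid instead.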

**Hierarchical Clustering**

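This method finds outliers from the structure of the dendrogram: points that join the rest of the data only at a large linkage distance, or that remain in very small clusters when the tree is cut, are outlier candidates.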
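A rough sketch with SciPy's `linkage` and `fcluster` illustrates one dendrogram-based heuristic: cut the tree into a few clusters and treat very small clusters as outliers. The synthetic data, the average linkage method, the cut into three clusters, and the `min_size` threshold are all assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data (an assumption): two groups and one distant point.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(30, 2)),
    rng.normal([6, 6], 0.5, size=(30, 2)),
    [[20.0, 20.0]],
])

Z = linkage(X, method="average")                 # build the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into at most 3 clusters

# Flag points that end up in very small clusters after the cut.
ids, counts = np.unique(labels, return_counts=True)
min_size = 3  # hypothetical cutoff
small = ids[counts < min_size]
outlier_idx = np.where(np.isin(labels, small))[0]
print(outlier_idx)
```

The last rows of `Z` record the final merges and their heights, so inspecting them is another way to see which points join the tree unusually late.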

**Distance-Based Approaches**

**KNN**

The k-Nearest Neighbours (k-NN) method identifies data points that have few close neighbours, typically by flagging points whose distance to their k-th nearest neighbour is unusually large.
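A simple k-NN style score is the distance from each point to its k-th nearest neighbour. The NumPy sketch below computes it by brute force; the function name `knn_distance_scores`, the choice `k=3`, and the sample points are assumptions for illustration.

```python
import numpy as np

def knn_distance_scores(X, k=3):
    """Distance from each point to its k-th nearest neighbour (larger = more isolated)."""
    diffs = X[:, None, :] - X[None, :, :]       # pairwise coordinate differences
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    dist_sorted = np.sort(dist, axis=1)
    return dist_sorted[:, k]  # column 0 is each point's distance to itself (0)

# Hypothetical sample: five nearby points and one far away.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [1.1, 1.2], [0.8, 0.9], [8.0, 8.0]])
scores = knn_distance_scores(X, k=3)
print(scores)
print(int(np.argmax(scores)))  # index of the most isolated point
```

The brute-force distance matrix needs memory proportional to the square of the number of points, so larger datasets usually call for a tree-based or approximate neighbour search.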

**DBSCAN**

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points according to their density and labels points that do not belong to any dense region as noise; these noise points can be treated as outliers.
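The sketch below runs scikit-learn's `DBSCAN` on synthetic data and reads off the noise points, which the library labels `-1`. The data and the values `eps=1.0` and `min_samples=5` are assumptions for illustration; both parameters usually need tuning to the scale of the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (an assumption): two dense groups and two isolated points.
rng = np.random.default_rng(0)
dense_a = rng.normal([0, 0], 0.3, size=(50, 2))
dense_b = rng.normal([5, 5], 0.3, size=(50, 2))
isolated = np.array([[2.5, 10.0], [-6.0, 4.0]])
X = np.vstack([dense_a, dense_b, isolated])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

noise_idx = np.where(labels == -1)[0]  # DBSCAN marks noise with the label -1
n_clusters = len(set(labels.tolist()) - {-1})
print(n_clusters, noise_idx)
```

Unlike K-Means, DBSCAN does not need the number of clusters in advance, and outliers never distort a cluster because they are simply left unassigned.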