Anomaly Detection Tutorial (ANO101) - Level Beginner

Date Updated: Feb 25, 2020

1.0 Objective of Tutorial

Welcome to the Anomaly Detection Tutorial (#ANO101). This tutorial assumes that you are new to PyCaret and looking to get started with anomaly detection using the pycaret.anomaly module.

In this tutorial we will learn:

  • Getting Data: How to import data from the PyCaret repository?
  • Setting up Environment: How to set up an experiment in PyCaret to get started with building anomaly models?
  • Create Model: How to create a model and assign anomaly labels to the original dataset for analysis?
  • Plot Model: How to analyze model performance using various plots?
  • Predict Model: How to assign anomaly labels to a new and unseen dataset based on the trained model?
  • Save / Load Model: How to save / load a model for future use?

Read Time : Approx. 25 Minutes

1.1 Installing PyCaret

The first step to get started with PyCaret is to install pycaret. Installation is easy and takes only a few minutes. Follow the instructions below:

Installing PyCaret in Local Jupyter Notebook

pip install pycaret

Installing PyCaret on Google Colab or Azure Notebooks

!pip install pycaret

1.2 Pre-Requisites

  • Python 3.x
  • Latest version of pycaret
  • Internet connection to load data from pycaret's repository
  • Basic Knowledge of Anomaly Detection

1.3 For Google Colab users:

If you are running this notebook on Google Colab, the code cell below must be run at the top of the notebook to display interactive visuals.

from pycaret.utils import enable_colab
enable_colab()

2.0 What is Anomaly Detection?

Anomaly detection is the task of identifying rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. There are three broad categories of anomaly detection techniques:

  • Unsupervised anomaly detection: Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set.

  • Supervised anomaly detection: This technique requires a data set that has been labeled as "normal" and "abnormal" and involves training a classifier.

  • Semi-supervised anomaly detection: This technique constructs a model representing normal behavior from a given normal training data set, and then tests the likelihood that a test instance was generated by the learnt model.

The pycaret.anomaly module supports unsupervised and supervised anomaly detection techniques. In this tutorial we will cover only the unsupervised technique.
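
To make the unsupervised idea concrete, here is a minimal sketch that flags values sitting far from the bulk of an unlabeled sample. It uses a plain z-score rule with only the Python standard library; it is not one of PyCaret's algorithms, and the function name, the readings and the threshold of 2.5 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return 1 for values more than `threshold` standard deviations from the mean, else 0."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing stands out
        return [0] * len(values)
    return [1 if abs(v - mu) / sigma > threshold else 0 for v in values]


# Hypothetical unlabeled readings: nine similar values and one that fits least
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0]
labels = zscore_outliers(readings, threshold=2.5)
print(labels)
```

No labels were needed to fit this rule: the only assumption is that most readings are normal, which is the same assumption the unsupervised techniques above rely on.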

Learn More about Anomaly Detection

3.0 Overview of Anomaly Detection Module in PyCaret

PyCaret's anomaly detection module (pycaret.anomaly) is an unsupervised machine learning module which performs the task of identifying rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

PyCaret's anomaly detection module provides several pre-processing features that can be configured when initializing the setup through the setup() function. It has over 12 algorithms and a few plots to analyze the results of anomaly detection. The module also implements a unique function, tune_model(), that allows you to tune the hyperparameters of an anomaly detection model to optimize a supervised learning objective such as AUC for classification or R2 for regression.

4.0 Dataset for the Tutorial

For this tutorial we will use a dataset from UCI called Mice Protein Expression. The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse. Click Here to read more about the dataset.

Dataset Acknowledgement:

Clara Higuera Department of Software Engineering and Artificial Intelligence, Faculty of Informatics and the Department of Biochemistry and Molecular Biology, Faculty of Chemistry, University Complutense, Madrid, Spain. Email: clarahiguera@ucm.es

Katheleen J. Gardiner, creator and owner of the protein expression data, is currently with the Linda Crnic Institute for Down Syndrome, Department of Pediatrics, Department of Biochemistry and Molecular Genetics, Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Aurora, Colorado, USA. Email: katheleen.gardiner@ucdenver.edu

Krzysztof J. Cios is currently with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA, and IITiS Polish Academy of Sciences, Poland. Email: kcios@vcu.edu

The original dataset and data dictionary can be found here.

5.0 Getting the Data

You can download the data from the original source found here and load it using pandas (Learn How), or you can use PyCaret's data repository to load the data using the get_data() function (this requires an internet connection).

In [1]:
from pycaret.datasets import get_data
dataset = get_data('mice')
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... pCFOS_N SYP_N H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class
0 309_1 0.503644 0.747193 0.430175 2.816329 5.990152 0.218830 0.177565 2.373744 0.232224 ... 0.108336 0.427099 0.114783 0.131790 0.128186 1.675652 Control Memantine C/S c-CS-m
1 309_2 0.514617 0.689064 0.411770 2.789514 5.685038 0.211636 0.172817 2.292150 0.226972 ... 0.104315 0.441581 0.111974 0.135103 0.131119 1.743610 Control Memantine C/S c-CS-m
2 309_3 0.509183 0.730247 0.418309 2.687201 5.622059 0.209011 0.175722 2.283337 0.230247 ... 0.106219 0.435777 0.111883 0.133362 0.127431 1.926427 Control Memantine C/S c-CS-m
3 309_4 0.442107 0.617076 0.358626 2.466947 4.979503 0.222886 0.176463 2.152301 0.207004 ... 0.111262 0.391691 0.130405 0.147444 0.146901 1.700563 Control Memantine C/S c-CS-m
4 309_5 0.434940 0.617430 0.358802 2.365785 4.718679 0.213106 0.173627 2.134014 0.192158 ... 0.110694 0.434154 0.118481 0.140314 0.148380 1.839730 Control Memantine C/S c-CS-m

5 rows × 82 columns

In [2]:
#check the shape of data
dataset.shape
Out[2]:
(1080, 82)

In order to demonstrate the predict_model() function on unseen data, a sample of 5% (54 samples) is taken out of the original dataset to be used for predictions at the end of the experiment. This should not be confused with a train/test split. This particular split is performed to simulate a real-life scenario. Another way to think about this is that these 54 samples were not available at the time the experiment was performed.

In [3]:
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
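
One detail of the cell above is worth a closer look: reset_index() is called before drop(), so data.index holds the labels 0 to 1025 rather than the labels of the sampled rows, and drop() then appears to remove the first 1026 rows of dataset instead of the sampled ones. The shapes still come out as expected, but the two frames are not guaranteed to be disjoint. A minimal sketch of a split that keeps them disjoint is shown below on a hypothetical toy frame (the names toy, train and holdout are assumptions, not part of the tutorial).

```python
import pandas as pd

# Hypothetical toy frame standing in for `dataset`
toy = pd.DataFrame({"id": range(20), "value": [i * 0.5 for i in range(20)]})

# Sample first and keep the original index, so drop() removes exactly the sampled rows
train = toy.sample(frac=0.8, random_state=786)
holdout = toy.drop(train.index)

# Reset the indices only after the split is complete
train = train.reset_index(drop=True)
holdout = holdout.reset_index(drop=True)

# No row should appear in both frames
overlap = set(train["id"]) & set(holdout["id"])
print(len(train), len(holdout), len(overlap))
```

The only change from the cell above is the order of operations: the index is reset after drop() rather than before it.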

6.0 Setting up Environment in PyCaret

The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes only one mandatory parameter: a pandas DataFrame. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).

When setup() is executed, PyCaret's inference algorithm automatically infers the data types for all features based on certain properties. The data type is inferred correctly most of the time, but not always. Therefore, after setup() is executed, PyCaret displays a table containing the features and their inferred data types. At this stage, you can inspect the table and press enter to continue if all data types are correctly inferred, or type quit to end the experiment. Identifying data types correctly is of fundamental importance in PyCaret, as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These pre-processing tasks are performed differently for each data type, so it is very important that data types are correctly configured.

In later tutorials we will learn how to overwrite PyCaret's inferred data types using the numeric_features and categorical_features parameters in setup().

In [4]:
from pycaret.anomaly import *
In [5]:
exp_ano101 = setup(data, normalize = True, 
                   ignore_features = ['MouseID'],
                   session_id = 123)
Setup Succesfully Completed!
    Description                    Value
0   session_id                     123
1   Original Data                  (1026, 82)
2   Missing Values                 True
3   Numeric Features               77
4   Categorical Features           5
5   Ordinal Features               False
6   High Cardinality Features      False
7   Transformed Data               (1026, 91)
8   Numeric Imputer                mean
9   Categorical Imputer            constant
10  Normalize                      True
11  Normalize Method               zscore
12  Transformation                 False
13  Transformation Method          None
14  PCA                            False
15  PCA Method                     None
16  PCA components                 None
17  Ignore Low Variance            False
18  Combine Rare Levels            False
19  Rare Level Threshold           None
20  Numeric Binning                False
21  Remove Multicollinearity       False
22  Multicollinearity Threshold    None
23  Group Features                 False

Once the setup is successfully executed it prints an information grid that contains several important pieces of information. Much of the information is related to the pre-processing pipeline which is constructed when setup() is executed. Most of these features are out of scope for this tutorial. However, a few important things to note at this stage are:

  • session_id : A pseudo-random number distributed as a seed to all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment session_id is set to 123 for later reproducibility.

  • Missing Values : When there are missing values in the original data this shows as True. Notice that Missing Values in the information grid above is True, as the data contains missing values; these are automatically imputed using the mean for numeric features and a constant for categorical features. The method of imputation can be changed using the numeric_imputation and categorical_imputation parameters in setup().

  • Original Data : Displays the original shape of dataset. In this experiment (1026, 82) means 1026 samples and 82 features.

  • Transformed Data : Displays the shape of the transformed dataset. Notice that the shape of the original dataset (1026, 82) is transformed into (1026, 91). The number of features has increased due to the encoding of categorical features in the dataset.

  • Numeric Features : Number of features inferred as numeric. In this dataset, 77 out of 82 features are inferred as numeric.

  • Categorical Features : Number of features inferred as categorical. In this dataset, 5 out of 82 features are inferred as categorical. Also notice that we have ignored one categorical feature, MouseID, using the ignore_features parameter.

Notice how a few tasks that are imperative for modeling, such as missing value imputation and categorical encoding, are handled automatically. Most of the other parameters in setup() are optional and are used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but as you progress to the intermediate and expert levels we will cover them in much more detail.
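
As a rough picture of what these automatic steps do, the sketch below applies mean imputation, z-score normalization and one-hot encoding by hand with pandas. It is not PyCaret's internal pipeline: the toy frame, its column names and the choice of a population standard deviation (ddof=0) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one missing numeric value and one categorical column
raw = pd.DataFrame({
    "protein_a": [0.50, 0.44, None, 0.62],
    "protein_b": [2.8, 2.4, 2.6, 3.0],
    "genotype": ["Control", "Ts65Dn", "Control", "Ts65Dn"],
})
numeric_cols = ["protein_a", "protein_b"]
categorical_cols = ["genotype"]

prepared = raw.copy()

# Mean imputation: replace missing numeric values with the column mean
prepared[numeric_cols] = prepared[numeric_cols].fillna(prepared[numeric_cols].mean())

# z-score normalization: centre each numeric column and scale it to unit variance
col_mean = prepared[numeric_cols].mean()
col_std = prepared[numeric_cols].std(ddof=0)
prepared[numeric_cols] = (prepared[numeric_cols] - col_mean) / col_std

# One-hot encoding: each category level becomes its own indicator column
prepared = pd.get_dummies(prepared, columns=categorical_cols)

print(raw.shape, "->", prepared.shape)
```

The categorical column expands into one column per level, which is the same effect that grows the feature count between Original Data and Transformed Data in the grid above.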

7.0 Create a Model

Creating an anomaly detection model in PyCaret is simple and similar to how you would create a model in the supervised modules of pycaret. An anomaly detection model is created using the create_model() function, which takes one mandatory parameter: the name of the model as a string. This function returns a trained model object. See an example below:

In [6]:
iforest = create_model('iforest')
In [7]:
print(iforest)
IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
    random_state=123, verbose=0)

We have created an Isolation Forest model using create_model(). Notice that the contamination parameter is set to 0.05, which is the default value when you do not pass the fraction parameter to create_model(). The fraction parameter determines the proportion of outliers in the dataset. In the example below, we will create a One Class Support Vector Machine model with a fraction of 0.025.

In [8]:
svm = create_model('svm', fraction = 0.025)
In [9]:
print(svm)
OCSVM(cache_size=200, coef0=0.0, contamination=0.025, degree=3, gamma='auto',
   kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001,
   verbose=False)

Just by replacing iforest with svm inside create_model() we have now created an OCSVM anomaly detection model. There are 12 ready-to-use models available in the pycaret.anomaly module. To see the complete list, please see the docstring. If you would like to read more about the different models, their limitations and use cases, you may click here to read more about them.
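
The effect of fraction can be pictured as a cut-off on the anomaly scores: the larger the fraction, the more samples end up on the outlier side. The sketch below is a simplified illustration of that idea in plain Python, not the code PyCaret or its underlying library runs; the helper label_by_fraction and the scores are hypothetical.

```python
def label_by_fraction(scores, fraction=0.05):
    """Mark the top `fraction` of scores (highest = most anomalous) as outliers (1)."""
    n_outliers = round(fraction * len(scores))
    if n_outliers == 0:
        return [0] * len(scores)
    # The score of the n-th most anomalous sample acts as the cut-off
    cutoff = sorted(scores, reverse=True)[n_outliers - 1]
    # Ties at the cut-off are all flagged, so the count can exceed n_outliers
    return [1 if s >= cutoff else 0 for s in scores]


# Hypothetical anomaly scores for 20 samples
scores = [-0.08, -0.07, 0.05, -0.06, -0.004, 0.01, -0.07, -0.06, -0.03, -0.05,
          -0.09, -0.02, -0.04, -0.08, -0.05, -0.06, -0.07, -0.03, -0.01, -0.02]

flagged_5pct = sum(label_by_fraction(scores, fraction=0.05))
flagged_10pct = sum(label_by_fraction(scores, fraction=0.10))
print(flagged_5pct, flagged_10pct)
```

Because the fraction fixes how many samples are flagged, it is best treated as a prior belief about how common anomalies are rather than something the model discovers on its own.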

8.0 Assign a Model

Now that we have created a model, we would like to assign the anomaly labels to our dataset (1026 samples) to analyze the results. We will achieve this by using the assign_model() function. See an example below:

In [10]:
iforest_results = assign_model(iforest)
iforest_results.head()
Out[10]:
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class Label Score
0 3501_12 0.344930 0.626194 0.383583 2.534561 4.097317 0.303547 0.222829 4.592769 0.239427 ... 0.252700 0.218868 0.249187 1.139493 Ts65Dn Memantine S/C t-SC-m 0 -0.003593
1 3520_5 0.630001 0.839187 0.357777 2.651229 4.261675 0.253184 0.185257 3.816673 0.204940 ... 0.155008 0.153219 NaN 1.642886 Control Memantine C/S c-CS-m 0 -0.073378
2 3414_13 0.555122 0.726229 0.278319 2.097249 2.897553 0.222222 0.174356 1.867880 0.203379 ... 0.136109 0.155530 0.185484 1.657670 Ts65Dn Memantine C/S t-CS-m 0 -0.062198
3 3488_8 0.275849 0.430764 0.285166 2.265254 3.250091 0.189258 0.157837 2.917611 0.202594 ... 0.127944 0.207671 0.175357 0.893598 Control Saline S/C c-SC-s 0 -0.078707
4 3501_7 0.304788 0.617299 0.335164 2.638236 4.876609 0.280590 0.199417 4.835421 0.236314 ... 0.245277 0.202171 0.240372 0.795637 Ts65Dn Memantine S/C t-SC-m 0 -0.070426

5 rows × 84 columns

Notice that two columns, Label and Score, are added towards the end. In Label, 0 stands for inliers and 1 for outliers/anomalies. Score is the value computed by the algorithm; outliers are assigned larger anomaly scores. Notice that iforest_results also includes the MouseID feature that we ignored during setup(). It wasn't used for the model and is only appended to the dataset when you use assign_model(). In the next section we will see how to analyze the results of anomaly detection using plot_model().
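
Since the result is an ordinary pandas DataFrame, the Label and Score columns can be explored with the usual pandas operations. The sketch below uses a small hypothetical frame in place of iforest_results (the MouseID values and scores are made up) to pull out the flagged rows and rank them by score.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by assign_model()
results = pd.DataFrame({
    "MouseID": ["m1", "m2", "m3", "m4", "m5"],
    "Label": [0, 1, 0, 0, 1],
    "Score": [-0.0036, 0.0115, -0.0622, -0.0787, 0.0488],
})

# Keep only the rows flagged as outliers, most anomalous first
outliers = results[results["Label"] == 1].sort_values("Score", ascending=False)

# Label is 0/1, so its mean is the share of samples flagged as outliers
outlier_share = results["Label"].mean()

print(outliers[["MouseID", "Score"]])
print(f"Share flagged: {outlier_share:.0%}")
```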

9.0 Plot a Model

plot_model() function can be used to analyze the anomaly detection model over different aspects. This function takes a trained model object and returns a plot. See examples below:

9.1 T-distributed Stochastic Neighbor Embedding (t-SNE)

In [11]:
plot_model(iforest)

9.2 Uniform Manifold Approximation and Projection

In [12]:
plot_model(iforest, plot = 'umap')

10.0 Predict on unseen data

The predict_model() function is used to assign anomaly labels to a new, unseen dataset. We will now use our iforest model to predict the data stored in data_unseen. This was created at the beginning of the experiment and contains 54 samples that were set aside before setup() was called.

In [13]:
unseen_predictions = predict_model(iforest, data=data_unseen)
unseen_predictions.head()
Out[13]:
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class Label Score
0 3517_7 0.361782 0.565987 0.376145 2.774771 4.450250 0.262906 0.238968 6.041007 0.229808 ... 0.208611 NaN 0.270820 0.782176 Ts65Dn Saline S/C t-SC-s 1 0.011522
1 3517_8 0.329361 0.503216 0.326145 2.487893 3.925842 0.231177 0.195611 5.337306 0.210556 ... 0.211478 NaN 0.255582 0.882451 Ts65Dn Saline S/C t-SC-s 0 -0.056883
2 3517_9 0.328314 0.500465 0.363265 2.409370 3.796988 0.226250 0.187024 5.330917 0.203198 ... 0.201974 NaN 0.241815 0.881029 Ts65Dn Saline S/C t-SC-s 0 -0.067360
3 3517_10 0.377490 0.528331 0.379560 2.689263 3.978008 0.266753 0.229495 5.441138 0.227684 ... 0.262620 NaN 0.262620 0.781328 Ts65Dn Saline S/C t-SC-s 1 0.048812
4 3517_11 0.367813 0.478322 0.346371 2.356975 3.561499 0.252121 0.226437 4.868520 0.204288 ... 0.264670 NaN 0.247165 0.783654 Ts65Dn Saline S/C t-SC-s 0 -0.036365

5 rows × 84 columns

The Label column indicates whether a sample is an outlier (1 = outlier, 0 = inlier). Score is the value computed by the algorithm; outliers are assigned larger anomaly scores. You can also use the predict_model() function to label the training data. See the example below:

In [14]:
data_predictions = predict_model(iforest, data = data)
data_predictions.head()
Out[14]:
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class Label Score
0 3501_12 0.344930 0.626194 0.383583 2.534561 4.097317 0.303547 0.222829 4.592769 0.239427 ... 0.252700 0.218868 0.249187 1.139493 Ts65Dn Memantine S/C t-SC-m 0 -0.003593
1 3520_5 0.630001 0.839187 0.357777 2.651229 4.261675 0.253184 0.185257 3.816673 0.204940 ... 0.155008 0.153219 NaN 1.642886 Control Memantine C/S c-CS-m 0 -0.073378
2 3414_13 0.555122 0.726229 0.278319 2.097249 2.897553 0.222222 0.174356 1.867880 0.203379 ... 0.136109 0.155530 0.185484 1.657670 Ts65Dn Memantine C/S t-CS-m 0 -0.062198
3 3488_8 0.275849 0.430764 0.285166 2.265254 3.250091 0.189258 0.157837 2.917611 0.202594 ... 0.127944 0.207671 0.175357 0.893598 Control Saline S/C c-SC-s 0 -0.078707
4 3501_7 0.304788 0.617299 0.335164 2.638236 4.876609 0.280590 0.199417 4.835421 0.236314 ... 0.245277 0.202171 0.240372 0.795637 Ts65Dn Memantine S/C t-SC-m 0 -0.070426

5 rows × 84 columns
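
The fit-once, score-later pattern behind predict_model() can be sketched outside PyCaret as well. The example below uses scikit-learn's IsolationForest directly on synthetic data; it is an illustration under stated assumptions, not PyCaret's implementation. The data, the variable names and the sign flip on decision_function (applied so that larger scores mean more anomalous, matching the convention described above) are all choices made for this sketch.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(123)

# Synthetic training data: 200 samples from a standard normal cloud
train = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

# Hypothetical new batch: four ordinary points plus one far from the training cloud
new = np.vstack([rng.normal(size=(4, 3)), [[8.0, 8.0, 8.0]]])

detector = IsolationForest(n_estimators=100, contamination=0.05, random_state=123)
detector.fit(train)

# scikit-learn returns -1 for outliers; convert to the 1 = outlier convention
new_labels = (detector.predict(new) == -1).astype(int)

# Negate decision_function so that larger values mean more anomalous
new_scores = -detector.decision_function(new)

print(new_labels)
print(new_scores.round(3))
```

The detector is fitted once on the training data and then reused unchanged on the new batch, which is why the labels for new samples depend only on the trained model and not on the other rows in the batch.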

11.0 Saving the model

We have now finished the experiment by using our iforest model to predict outlier labels on unseen data. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no. You don't need to rerun the entire experiment and reconstruct the pipeline to generate predictions on new data. PyCaret's built-in function save_model() allows you to save the model along with the entire transformation pipeline for later use.

In [15]:
save_model(iforest,'Final IForest Model 08Feb2020')
Transformation Pipeline and Model Succesfully Saved

12.0 Loading the saved model

To load a saved model at a future date in the same or a different environment, we use PyCaret's load_model() function and then apply the saved model to new unseen data for prediction.

In [16]:
saved_iforest = load_model('Final IForest Model 08Feb2020')
Transformation Pipeline and Model Sucessfully Loaded

Once the model is loaded in the environment, you can simply use it to predict on any new data using the same predict_model() function. Below we have applied the loaded model to predict the same data_unseen that we used in section 10 above.

In [17]:
new_prediction = predict_model(saved_iforest, data=data_unseen)
In [18]:
new_prediction.head()
Out[18]:
MouseID DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N pCREB_N ... H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype Treatment Behavior class Label Score
0 3517_7 0.361782 0.565987 0.376145 2.774771 4.450250 0.262906 0.238968 6.041007 0.229808 ... 0.208611 NaN 0.270820 0.782176 Ts65Dn Saline S/C t-SC-s 1 0.011522
1 3517_8 0.329361 0.503216 0.326145 2.487893 3.925842 0.231177 0.195611 5.337306 0.210556 ... 0.211478 NaN 0.255582 0.882451 Ts65Dn Saline S/C t-SC-s 0 -0.056883
2 3517_9 0.328314 0.500465 0.363265 2.409370 3.796988 0.226250 0.187024 5.330917 0.203198 ... 0.201974 NaN 0.241815 0.881029 Ts65Dn Saline S/C t-SC-s 0 -0.067360
3 3517_10 0.377490 0.528331 0.379560 2.689263 3.978008 0.266753 0.229495 5.441138 0.227684 ... 0.262620 NaN 0.262620 0.781328 Ts65Dn Saline S/C t-SC-s 1 0.048812
4 3517_11 0.367813 0.478322 0.346371 2.356975 3.561499 0.252121 0.226437 4.868520 0.204288 ... 0.264670 NaN 0.247165 0.783654 Ts65Dn Saline S/C t-SC-s 0 -0.036365

5 rows × 84 columns

Notice that the results of unseen_predictions and new_prediction are identical.
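
The reason the two results match is that everything needed to score new data travels with the saved file. The sketch below illustrates that round trip with Python's standard pickle module and a toy "pipeline" stored as a plain dictionary. It does not show how save_model() works internally; the fit and predict helpers, the file name and the data are hypothetical.

```python
import os
import pickle
import tempfile
from statistics import mean, pstdev


def fit(values, threshold=2.0):
    """Learn the scaling parameters and bundle them with the decision threshold."""
    return {"mu": mean(values), "sigma": pstdev(values), "threshold": threshold}


def predict(state, values):
    """Scale new values with the stored parameters and flag those beyond the threshold."""
    return [1 if abs(v - state["mu"]) / state["sigma"] > state["threshold"] else 0
            for v in values]


pipeline = fit([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])
unseen = [10.0, 14.0, 9.9]

# Save the fitted state to disk and load it back, as a later session would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.pkl")
    with open(path, "wb") as fh:
        pickle.dump(pipeline, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# Predictions from the restored state match those from the original
same = predict(pipeline, unseen) == predict(restored, unseen)
print(same)
```

Storing the pre-processing parameters together with the decision rule is what lets the restored object score raw data without refitting anything.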

13.0 Wrap-up / Next Steps?

What we have covered in this tutorial is the entire machine learning pipeline: data ingestion, pre-processing, training the anomaly detector, prediction on unseen data and saving the model for later use. We have completed all this in fewer than 10 commands, which are naturally constructed and intuitive to remember, such as create_model(), assign_model() and plot_model(). Re-creating the entire experiment without PyCaret would have taken well over 100 lines of code in most libraries.

In this tutorial, we have covered only the basics of pycaret.anomaly. In the following tutorials, we will go deeper into advanced pre-processing techniques that allow you to fully customize your machine learning pipeline and that are a must-know for any data scientist.

See you at the next tutorial. Follow the link to Anomaly Detection Tutorial (ANO102) - Level Intermediate