Date Updated: Feb 25, 2020
Welcome to the Clustering Tutorial (CLU101) - Level Beginner. This tutorial assumes that you are new to PyCaret and looking to get started with clustering using the pycaret.clustering module.
In this tutorial we will learn how to get data, set up the PyCaret environment, create clustering models, assign cluster labels to the dataset, and analyze the results with plots.
Read Time: Approx. 25 Minutes
The first step to get started with PyCaret is to install pycaret. Installation is easy and will only take a few minutes. Follow the instructions below:

Installing pycaret from the command line:

pip install pycaret

Installing pycaret from within a notebook (note the leading exclamation mark, which runs the command in a shell):

!pip install pycaret
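Once installed, you can confirm which version you are running. A quick check (assuming your PyCaret release exposes __version__, as recent releases do):

import pycaret
print(pycaret.__version__)  # print the installed PyCaret version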
If you are running this notebook on Google colab, run the following code at top of your notebook to display interactive visuals.
from pycaret.utils import enable_colab
enable_colab()
Clustering is the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups. It is an exploratory data mining activity, and a common technique for statistical data analysis used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression and computer graphics. Some common real-life use cases of clustering are customer segmentation, grouping documents by topic, and image segmentation.
PyCaret's clustering module (pycaret.clustering) is an unsupervised machine learning module which performs the task of grouping a set of objects in such a way that those in the same group (called a cluster) are more similar to each other than to those in other groups.

PyCaret's clustering module provides several pre-processing features that can be configured when initializing the setup through the setup() function. It has 9 algorithms and several plots to analyze the results. PyCaret's clustering module also implements a unique function called tune_model() that allows you to tune the hyperparameters of a clustering model to optimize a supervised learning objective such as AUC for classification or R2 for regression.
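For example, a minimal sketch of tune_model() is shown below. The parameter names follow the pycaret.clustering docstring, and 'class' is assumed here to be a supervised target column present in the data passed to setup():

# tune the number of clusters of a K-Means model against a supervised objective
# (assumes the data contains a 'class' column to supervise on)
tuned_kmeans = tune_model(model = 'kmeans', supervised_target = 'class')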
For this tutorial we will use a dataset from UCI called Mice Protein Expression. The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse. Click Here to read more about the dataset.
Clara Higuera Department of Software Engineering and Artificial Intelligence, Faculty of Informatics and the Department of Biochemistry and Molecular Biology, Faculty of Chemistry, University Complutense, Madrid, Spain. Email: email@example.com
Katheleen J. Gardiner, creator and owner of the protein expression data, is currently with the Linda Crnic Institute for Down Syndrome, Department of Pediatrics, Department of Biochemistry and Molecular Genetics, Human Medical Genetics and Genomics, and Neuroscience Programs, University of Colorado, School of Medicine, Aurora, Colorado, USA. Email: firstname.lastname@example.org
Krzysztof J. Cios is currently with the Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA, and IITiS Polish Academy of Sciences, Poland. Email: email@example.com
The original dataset and data dictionary can be found here.
from pycaret.datasets import get_data
dataset = get_data('mice')
5 rows × 82 columns
# check the shape of data
dataset.shape
In order to demonstrate the predict_model() function on unseen data, a sample of 5% (54 records) has been withheld from the original dataset to be used for predictions at the end of the experiment. This should not be confused with a train/test split, as this particular split is performed to simulate a real-life scenario. Another way to think about this is that these 54 samples were not available at the time this experiment was performed.
data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)  # drop the sampled rows before resetting the index
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: a pandas dataframe. Since clustering is unsupervised, no target column is required. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).
When setup() is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, enter can be pressed to continue, or quit can be typed to end the experiment. Ensuring that the data types are correct is of fundamental importance in PyCaret as it automatically performs a few pre-processing tasks which are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured.
In later tutorials we will learn how to overwrite PyCaret's inferred data types using the numeric_features and categorical_features parameters in setup().
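As a quick illustration (the column names 'col1' and 'col2' below are placeholders for illustration only, not columns of the mice dataset):

# force PyCaret to treat 'col1' as categorical and 'col2' as numeric
exp = setup(data, categorical_features = ['col1'], numeric_features = ['col2'])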
from pycaret.clustering import *
exp_clu101 = setup(data, normalize = True, ignore_features = ['MouseID'], session_id = 123)
Setup Succesfully Completed!
| #  | Description               | Value      |
|----|---------------------------|------------|
| 1  | Original Data             | (1026, 82) |
| 6  | High Cardinality Features | False      |
| 7  | Transformed Data          | (1026, 91) |
| 17 | Ignore Low Variance       | False      |
| 18 | Combine Rare Levels       | False      |
| 19 | Rare Level Threshold      | None       |
Once the setup has been successfully executed it prints the information grid which contains several important pieces of information. Most of the information is related to the pre-processing pipeline which is constructed when setup() is executed. The majority of these features are out of scope for the purposes of this tutorial; however, a few important things to note at this stage include:
- session_id: If no session_id is passed, a random number is automatically generated that is distributed to all functions. In this experiment, the session_id is set as 123 for later reproducibility.
- Missing Values: True in the information grid above, as the data contains missing values. These are automatically imputed using mean for numeric features and constant for categorical features. The method of imputation can be changed using the numeric_imputation and categorical_imputation parameters in setup().
Notice how a few tasks that are imperative to modeling, such as missing value imputation and categorical encoding, are automatically handled. Most of the parameters in setup() are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial, but as you progress to the intermediate and expert levels, we will cover them in much greater detail.
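For instance, a minimal sketch of changing the imputation strategy (exp_clu102 is just an illustrative variable name; numeric_imputation accepting 'median' follows the setup() docstring):

# same setup as above, but impute numeric features with the median instead of the mean
exp_clu102 = setup(data, normalize = True, ignore_features = ['MouseID'],
                   numeric_imputation = 'median', session_id = 123)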
Creating a cluster model in PyCaret is simple and similar to how you would create a model in the supervised learning modules. A clustering model is created using the create_model() function, which takes one mandatory parameter: the name of the model as a string. This function returns a trained model object. See an example below:
kmeans = create_model('kmeans')
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto', random_state=123, tol=0.0001, verbose=0)
We have created a kmeans model using create_model(). Notice that the n_clusters parameter is set to 4, which is the default when you do not pass a value to the num_clusters parameter. In the example below we will create a kmodes model with 6 clusters.
kmodes = create_model('kmodes', num_clusters = 6)
KModes(cat_dissim=<function matching_dissim at 0x0000014BA29F4AF8>, init='Cao', max_iter=100, n_clusters=6, n_init=1, n_jobs=1, random_state=123, verbose=0)
create_model() has created a
kmodes clustering model. There are 9 models available in the
pycaret.clustering module. To see the complete list, please see the docstring. If you would like to read more about the use cases and limitations of different models, you may click here to read more.
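To view the docstring, and with it the complete list of model strings, from within the notebook:

# display the create_model docstring, which lists the available models
help(create_model)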
Now that we have created a model, we would like to assign the cluster labels to our training dataset (1026 samples) to analyze the results. We will achieve this by using the assign_model() function. See an example below:
kmean_results = assign_model(kmeans)
kmean_results.head()
5 rows × 83 columns
Notice that a new column called Cluster has been added to the original dataset. kmean_results also includes the MouseID feature that we ignored during setup(); it was not used by the model and is only appended back to the dataset when you use assign_model(). In the next section we will see how to analyze the results of clustering using plot_model().
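A quick way to inspect the size of each cluster directly from the assigned labels is a plain pandas call on the kmean_results dataframe from above:

# count how many samples were assigned to each cluster
kmean_results['Cluster'].value_counts()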
The plot_model() function can be used to analyze different aspects of the clustering model. This function takes a trained model object and returns a plot. See examples below:
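Calling plot_model() with only the trained model renders its default plot, which in PyCaret's clustering module is the 2D cluster PCA plot described below:

# default plot: 2D cluster PCA plot
plot_model(kmeans)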
The cluster labels are automatically colored and shown in a legend. When you hover over the data points you will see additional features, which by default use the first column of the dataset (in this case MouseID). You can change this by passing the feature parameter, and you may also set label to True if you want labels to be printed on the plot.
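For example, using both parameters together (feature and label, as described above):

# hover tooltips use MouseID; cluster labels are printed on the plot
plot_model(kmeans, plot = 'cluster', feature = 'MouseID', label = True)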
plot_model(kmeans, plot = 'elbow')
The elbow method is a heuristic for interpreting and validating consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset. In this example, the elbow plot above suggests that 5 is the optimal number of clusters.
plot_model(kmeans, plot = 'silhouette')
Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. In other words, the silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
plot_model(kmeans, plot = 'distribution')  # to see the size of clusters