Clustering


In many data-processing contexts one wants to group units of the same type. Clustering refers to an analysis of the data at hand in which each data point is assigned a class: data points that are similar to each other are grouped together in a cluster. Clustering can be used to automatically add new data to existing groups or to analyse existing data for grouping. No sample data that has previously been divided into predefined groups by hand is needed; clustering is an unsupervised learning procedure.

Application for chatbots

In the context of chatbots, clustering is helpful for an initial analysis when historical data is available. For example, one can determine from the chat history which intents should be created. During the operation of a chatbot, collected sentences that did not match any intent can also be brought into a structure through clustering, and new intents can thus be identified.

Functionality

For clustering, one first needs a data representation. Normally, data are represented as numerical vectors, with the individual dimensions of a vector reflecting properties of the data point. In order to group these vectors in a meaningful way, a measure of similarity is needed; in most cases, the Euclidean distance is used. [1] Depending on the algorithm, both this metric and the way the data are assigned to the groups can vary.
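As a minimal sketch of such a similarity measure, the Euclidean distance between two feature vectors can be computed like this (the example vectors are made up for illustration):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equally long numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two data points represented as 3-dimensional feature vectors.
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # → 5.0
```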

Nearest Neighbour

The simplest approach is to randomly assign a class to a certain number of data points at the start, and then to give every other point the class of its nearest neighbour.

If this method is too inaccurate, it can be extended by considering the classes of several neighbours instead of just one and choosing the class that occurs most frequently among them. [2]
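The k-nearest-neighbours variant described above can be sketched in a few lines; the labelled 2-D points below are invented purely for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(point, labelled_points, k=3):
    """Assign `point` the class occurring most often among its k nearest neighbours."""
    neighbours = sorted(labelled_points, key=lambda item: euclidean(point, item[0]))[:k]
    counts = Counter(label for _, label in neighbours)
    return counts.most_common(1)[0][0]

# A few hand-labelled points forming two groups.
data = [([0.0, 0.0], "A"), ([0.2, 0.1], "A"), ([0.1, 0.3], "A"),
        ([1.0, 1.0], "B"), ([0.9, 1.1], "B")]
print(knn_classify([0.05, 0.1], data, k=3))  # → A
```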

Centroid-based processes

The results are even better with centroid-based methods. Here, a centre point (centroid) is determined for each cluster, for example the mean of all points in the cluster, and this centroid serves as the reference point for the assignment. At the beginning, the centroids can be placed randomly. For each data point, the distance to all centres is calculated, and the point is placed in the cluster whose centre is closest. The centroid is then adjusted accordingly, because adding a point changes the mean, so the centre moves over the iterations. As soon as a convergence criterion is met, for example when the centroids barely change any more, the algorithm stops. [3][4]
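The loop described above corresponds to the classic k-means algorithm; a minimal sketch in plain Python, with invented 2-D sample points, could look like this:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iterations=100, tolerance=1e-6):
    """Basic k-means: random initial centroids, assign, recompute, repeat until convergence."""
    centroids = random.sample(points, k)  # random starting centres
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centre in place
        # Convergence criterion: stop once the centroids barely move.
        if all(euclidean(c, n) < tolerance for c, n in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return centroids, clusters

random.seed(0)
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
centroids, clusters = k_means(points, k=2)  # separates the two groups of three points
```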

Text representation: Bag of Words

Text data can be represented as vectors in various ways. In simple use cases, one can define a vocabulary, which can be thought of as a table in which each word corresponds to an integer ID. With the help of such a table, a sentence can be converted into a vector: each dimension of the vector corresponds to one word of the vocabulary. For each word that occurs in the sentence, the corresponding dimension is set to 1; the dimensions of all words that do not occur remain 0. [5]
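The vocabulary table and the binary sentence vectors can be sketched directly; the example sentences are made up for illustration:

```python
def build_vocabulary(sentences):
    """Map each distinct word to an integer ID, as in the table described above."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bag_of_words(sentence, vocab):
    """Binary vector: dimension i is 1 if the word with ID i occurs in the sentence."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector

sentences = ["the bot answers questions", "the bot learns"]
vocab = build_vocabulary(sentences)  # {'the': 0, 'bot': 1, 'answers': 2, 'questions': 3, 'learns': 4}
print(bag_of_words("the bot learns", vocab))  # → [1, 1, 0, 0, 1]
```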

Text Representation: Deep Learning Models

For a representation that captures more of the semantic properties of the text, vectors created by deep learning methods are used. These methods require a lot of data. For example, recurrent neural networks (RNNs) can be used to train language models by always predicting the next word for a given word sequence. This works by converting the sequence into a vector with an encoder RNN and using a decoder RNN to translate that vector into the following word. If one then uses only the encoder, one obtains a vector representation of the sentence that has significantly fewer dimensions than a bag-of-words vector and more semantic content.
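The encoder step can be sketched with a simple Elman-style RNN in NumPy. Note that the weights below are random and untrained, and all sizes are made-up toy values; in practice the weights come from training the encoder-decoder pair on a large corpus. The sketch only shows how a word sequence of any length is folded into one fixed-size vector:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab_size, embed_dim, hidden_dim = 10, 8, 4  # illustrative toy sizes

# Randomly initialised weights -- in a real system these are learned by
# training the encoder-decoder pair on next-word prediction.
embeddings = rng.normal(size=(vocab_size, embed_dim))
W_in = rng.normal(size=(hidden_dim, embed_dim)) * 0.1
W_rec = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1

def encode(word_ids):
    """Elman-style encoder RNN: fold a word-ID sequence into one hidden vector."""
    h = np.zeros(hidden_dim)
    for idx in word_ids:
        h = np.tanh(W_in @ embeddings[idx] + W_rec @ h)
    return h  # fixed-size sentence representation

sentence_vector = encode([3, 1, 4, 1])
print(sentence_vector.shape)  # (4,) -- far fewer dimensions than the 10-dim bag of words
```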


Sources

[1] https://sites.google.com/a/erhard-rainer.com/erhard-rainer/mathematik/distanz-zwischen-zwei-punkten
[2] k-nearest neighbours, https://towardsdatascience.com/k-nearest-neighbours-introduction-to-machine-learning-algorithms-18e7ce3d802a
[3] https://www.youtube.com/watch?v=_aWzGGNrcic
[4] https://www-m9.ma.tum.de/material/felix-klein/clustering/Methoden/K-Means.php
[5] https://machinelearningmastery.com/gentle-introduction-bag-words-model/