In many data processing contexts, you may want to group items of the same type. Clustering is an analysis of the available data in which each data point is assigned to a group (a cluster), so that similar data points end up in the same cluster. Clustering methods can be used to analyze how existing data is grouped or to automatically add new data to existing groups. This does not require example data that has been sorted into given groups by hand beforehand: clustering is an unsupervised learning process.

Application for chatbots

In the context of chatbots, clustering is useful for initial data analysis when historical data is available. For example, you can use the chat history to determine which intents should be created. In addition, while a chatbot is in operation, the collected sentences that did not match any intent can be clustered in order to identify new intents.

For clustering you first need a data representation. Normally, data is represented as numeric vectors, with each dimension of the vector reflecting a property of the data. In order to group these vectors in a meaningful way, you need a measure of similarity. In most cases the Euclidean distance is used for this. [1] Depending on the algorithm, this metric and the way the data is assigned to the groups may vary.
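As a minimal sketch of the similarity measure mentioned above, the Euclidean distance between two vectors can be computed as follows. The function name `euclidean_distance` is chosen here for illustration only.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Square the difference in every dimension, sum up, take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

A smaller distance means the two data points are more similar.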

Nearest Neighbour

The simplest approach is to randomly assign a class to a certain number of data points at the beginning, and then give every other point the class of its nearest neighbor.

If this method is too inaccurate, it can be extended by considering the classes of several neighbors instead of just one and choosing the class that occurs most frequently among them. [2]
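The neighbor vote described above might be sketched like this. The function name `knn_label`, the seed points, and the labels "A" and "B" are made up for illustration. With `k=1` the function reduces to the plain nearest-neighbor rule.

```python
import math
from collections import Counter

def knn_label(point, labelled, k=3):
    """Return the most frequent class among the k labelled points closest to `point`.

    `labelled` is a list of (vector, class) pairs.
    """
    # Sort the labelled points by distance to `point` and keep the k closest.
    nearest = sorted(labelled, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, the class that was counted first (the closer neighbor) wins.
    return votes.most_common(1)[0][0]

# Hypothetical seed points that already carry a class.
seeds = [
    ((0.0, 0.0), "A"), ((0.5, 0.2), "A"), ((0.1, 0.6), "A"),
    ((5.0, 5.0), "B"), ((5.5, 4.8), "B"), ((4.9, 5.4), "B"),
]

print(knn_label((0.3, 0.3), seeds, k=3))  # A
print(knn_label((5.2, 5.1), seeds, k=1))  # B
```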

Centroid-based processes

Results often improve further when centroid-based processes are used. A center point (centroid) is determined for each cluster, for example the mean value of all points in the cluster, and is used as the reference point for the assignment. At the beginning these centroids can be placed randomly. The distance from each data point to all centroids is then calculated, and the data point is assigned to the cluster whose centroid is closest. The centroid is then adjusted accordingly, because adding a point creates a new mean value, so the centers move over time. As soon as a convergence criterion is met (for example, the centroids change by less than a small threshold), the algorithm stops. [3][4]
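The following is a minimal sketch of such a centroid-based procedure, not a definitive implementation. It uses the batch variant commonly known as k-means, which recomputes every centroid after all points have been assigned rather than after each single point. The function name `k_means`, its parameters, and the choice to start from randomly picked data points are assumptions made for this example.

```python
import math
import random

def k_means(points, k, initial=None, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` into k groups; return (centroids, assignments)."""
    rng = random.Random(seed)
    if initial is None:
        # Assumption: use k randomly chosen data points as starting centroids.
        centroids = [list(p) for p in rng.sample(points, k)]
    else:
        centroids = [list(c) for c in initial]

    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid

        # Convergence criterion: stop once no centroid moved more than `tol`.
        shift = max(math.dist(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break

    assignments = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, assignments

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = k_means(data, k=2)
print(centroids)
print(assignments)
```

Because the starting centroids are random, the cluster numbers can differ between runs with different seeds, and a poor start can lead to a weaker grouping. Running the procedure several times with different seeds is a common precaution.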

Text representation: Bag of Words

Text data can be represented as vectors in different ways. In simple use cases, for example, you can define a vocabulary, which you can picture as a table in which each word corresponds to an integer (like an ID). With the help of such a table, a sentence can be converted into a vector in which each dimension corresponds to a word of the vocabulary. For each word that occurs in the sentence, the corresponding dimension is set to 1. The dimensions of all words that do not occur remain at 0. [5]
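A small sketch of this conversion is shown below. The helper names `build_vocabulary` and `bag_of_words` and the example sentences are invented for illustration, and splitting on whitespace after lowercasing is a simplifying assumption about tokenization.

```python
def build_vocabulary(sentences):
    """Map every distinct word to an integer ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bag_of_words(sentence, vocab):
    """Return a 0/1 vector with one dimension per vocabulary word."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        index = vocab.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] = 1
    return vector

corpus = ["where is my order", "cancel my order", "where is the store"]
vocab = build_vocabulary(corpus)
# vocab: where=0, is=1, my=2, order=3, cancel=4, the=5, store=6
print(bag_of_words("cancel the order", vocab))  # [0, 0, 0, 1, 1, 1, 0]
```

Vectors produced this way have as many dimensions as the vocabulary has words, and they can be passed directly to a distance measure such as the Euclidean distance.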

Text representation: Deep Learning Models

For a representation that captures more of the semantic properties of the text, vectors created by deep learning techniques are used. This type of procedure requires a lot of data. For example, recurrent neural networks (RNNs) can be used to train language models by repeatedly predicting the next word for a given word sequence. This works by converting the sequence into a vector with an encoder RNN and translating that vector into the following word with a decoder RNN. If you then use only the encoder, you get a vector representation of the sentence that has significantly fewer dimensions than a bag-of-words vector and carries more semantic content.


[2] k-nearest neighbors