Clustering – Faster! – Using GenAI

Clustering is the task of dividing unlabeled data points into groups such that points in the same cluster are more similar to each other than to points in other clusters. The aim is to find groups with similar traits and assign each group to its own cluster.

Sentence-Transformers can be used in different ways to cluster small or large sets of sentences.

The SBERT fast clustering algorithm can be used to cluster large datasets (50k sentences in less than 5 seconds). It searches a large list of sentences for local communities, where a local community is a set of highly similar sentences.

We can configure the cosine-similarity threshold above which two sentences are considered similar, as well as the minimum size of a local community. Together, these two settings produce either large coarse-grained clusters or small fine-grained clusters.
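To make the role of the two settings concrete, here is a minimal pure-Python sketch of threshold-based community detection on toy 2-D vectors. It is a simplified, quadratic-time illustration and not the library's implementation; the function names `cosine_similarity` and `detect_communities` and the toy data are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def detect_communities(embeddings, threshold, min_community_size):
    """Simplified sketch: group points whose similarity to a center meets the threshold."""
    n = len(embeddings)
    candidates = []
    for i in range(n):
        # All points (including i itself) that are similar enough to point i.
        members = [
            j for j in range(n)
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold
        ]
        if len(members) >= min_community_size:
            candidates.append(members)

    # Larger candidate communities get first pick of their members.
    candidates.sort(key=len, reverse=True)

    assigned = set()
    communities = []
    for members in candidates:
        # Drop points that already belong to an earlier community.
        remaining = [j for j in members if j not in assigned]
        if len(remaining) >= min_community_size:
            communities.append(remaining)
            assigned.update(remaining)
    return communities


# Hypothetical toy "embeddings": three vectors pointing right, two pointing up, one outlier.
toy_embeddings = [
    (1.0, 0.0), (0.98, 0.05), (0.95, 0.10),  # group A
    (0.0, 1.0), (0.05, 0.98),                # group B
    (-1.0, -1.0),                            # outlier
]

coarse = detect_communities(toy_embeddings, threshold=0.9, min_community_size=2)
strict = detect_communities(toy_embeddings, threshold=0.9, min_community_size=3)
print(coarse)  # [[0, 1, 2], [3, 4]]
print(strict)  # [[0, 1, 2]]
```

Raising the minimum community size drops the two-element group, and the outlier is never assigned to any cluster because nothing else is similar enough to it.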

Getting Started

First, set up a new Python environment in VSCode. Open a PowerShell terminal within VSCode and run the following command:

PowerShell
python -m venv .venv

This creates a virtual environment. Activate it in the terminal for your workspace using:

PowerShell
.venv\Scripts\Activate.ps1

Now, install the required libraries using pip:

PowerShell
pip install sentence_transformers

If you wish to use the GPU, install the relevant CUDA drivers from NVIDIA.

You can check which CUDA version torch was built with using:

PowerShell
python -c "import torch; print(torch.version.cuda)"
>12.1

It should print the CUDA version that the installed torch build was compiled with.

If it prints None, the installed torch build has no CUDA support. Use nvidia-smi to find the CUDA version supported by your driver:

PowerShell
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.61                 Driver Version: 531.61       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti    WDDM | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P0               N/A /  N/A|    619MiB /  4096MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Then install the matching torch build using pip:

PowerShell
pip uninstall torch
pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/cu121/torch_stable.html
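To confirm that the installed torch build can actually see the GPU, the earlier one-line check can be extended into a small script. This is a sketch: the helper name `describe_torch_cuda` and the keys of the returned dictionary are assumptions made for this example, while `torch.version.cuda` and `torch.cuda.is_available()` are the standard PyTorch attributes.

```python
def describe_torch_cuda():
    """Return a small report on the installed PyTorch build and its CUDA support."""
    try:
        import torch
    except ImportError:
        # torch is not installed in the active environment.
        return {"torch_installed": False, "built_with_cuda": None, "gpu_available": False}

    return {
        "torch_installed": True,
        # CUDA version the torch build was compiled with; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime.
        "gpu_available": torch.cuda.is_available(),
    }


report = describe_torch_cuda()
print(report)
```

If `built_with_cuda` is set but `gpu_available` is False, the torch build supports CUDA and the likely problem is the driver or the GPU itself.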

Code

We will use the Hugging Face model all-mpnet-base-v2 for clustering. The Sentence-Transformers documentation describes all-mpnet-base-v2 as its best-quality general-purpose model, tuned for many use cases. It was fine-tuned from the base model microsoft/mpnet-base on a large and diverse dataset of over 1 billion training pairs. It can be loaded directly from Hugging Face or downloaded locally using:

PowerShell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
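Since the model can come either from a local clone or straight from the Hub, a small helper can pick whichever is available. This is a sketch under assumptions: the helper name `resolve_model_source` is hypothetical, and it assumes that a cloned model directory contains a top-level config.json file.

```python
from pathlib import Path

# Model id on the Hugging Face Hub, taken from the clone URL above.
HUB_MODEL_ID = "sentence-transformers/all-mpnet-base-v2"


def resolve_model_source(local_dir):
    """Return the local clone if it looks usable, otherwise the Hub model id."""
    path = Path(local_dir)
    # Assumption: a cloned model repository has a config.json at its top level.
    if path.is_dir() and (path / "config.json").is_file():
        return str(path)
    return HUB_MODEL_ID


# A directory that does not exist falls back to the Hub id.
model_source = resolve_model_source("definitely-missing-model-dir")
print(model_source)

# Usage (requires sentence-transformers; downloads the model if it is not cached):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer(resolve_model_source(r"..\Models\all-mpnet-base-v2"))
```

Falling back to the Hub id means the script still runs on a machine where the repository was never cloned, at the cost of a download on first use.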

Now, create a new Python file named clustering.py:

Python
from sentence_transformers import SentenceTransformer, util
import time
import torch

# Path to the locally cloned Hugging Face model
model_path = r"..\Models\all-mpnet-base-v2"
# Path to the dataset, read as one sentence per line
dataset_path = r"..\Datasets\clustering_dataset.txt"

# Model for computing sentence embeddings
model = SentenceTransformer(model_path)

# Get all unique, non-empty sentences from the file.
# UTF-8 also reads plain ASCII files and avoids decode errors on other characters.
with open(dataset_path, encoding="utf-8") as fIn:
    corpus_sentences = list({line.strip() for line in fIn if line.strip()})

print("Encode the corpus. This might take a while")
# Use the GPU if it is available and the drivers are installed
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
corpus_embeddings = model.encode(
    corpus_sentences,
    batch_size=64,
    show_progress_bar=True,
    convert_to_tensor=True,
    device=device,
)

print("Start clustering")
start_time = time.time()

# Two parameters to tune:
# min_community_size: only keep communities that have at least 25 elements
# threshold: consider sentence pairs with a cosine similarity larger than threshold as similar
clusters = util.community_detection(
    corpus_embeddings, min_community_size=25, threshold=0.75
)
print("Clustering done after {:.2f} sec".format(time.time() - start_time))

# Print the top 3 and bottom 3 elements of each cluster
for i, cluster in enumerate(clusters):
    print("\nCluster {}, #{} Elements ".format(i + 1, len(cluster)))
    for sentence_id in cluster[0:3]:
        print("\t", corpus_sentences[sentence_id])
    print("\t", "...")
    for sentence_id in cluster[-3:]:
        print("\t", corpus_sentences[sentence_id])

A sample dataset can be downloaded from here.

If everything is configured correctly, the output should look something like this:

PowerShell
(.venv) PS C:\Code\Python\Environment\Clustering\Code> python .\clustering.py
Encode the corpus. This might take a while
cuda
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 782/782 [02:29<00:00,  5.24it/s]
Start clustering
Clustering done after 3.01 sec
Cluster 1, #110 Elements
         How do I enhance my English?
         How could I improve my English?
         How can I specifically improve my English?
         ...
         Can I improve my English in a month?
         How can I speak English fluently and fast?
         How can I speak English in front of people?
Cluster 2, #104 Elements
         What will be the result of banning 500 and 1000 rupees note in India?
         What is the effect of demonetization of 500 and 1000 rupees note?
         How will the 500 & 1000 rupee note ban affect India?
         ...
         What do you think about the Modi's sudden decision to scrap 500 and 1000 rs denomination?
         What are your views on the decision of Narendra Modi to discontinue the use of 500 and 1000 currency notes?
         How much black money will be controlled by banning Rs 500 and Rs 1000 note?
Cluster 3, #84 Elements
         How could I make money online?
         How can one make money online?
         How do I to make money online?
         ...
         How can l earn $100 online daily?
         Is there any way I can earn money online without any kind of investment?
         How can you earn $100 online?
.
.
.
Cluster 56, #25 Elements
         What is your favorite book of al time?
         What is your favorite book of all time and why?
         What 's your favorite book?
         ...
         What kind of books do you enjoy reading the most?
         What is the most significant book that you have read and why?
         What are your absolute favorite 7 books?
(.venv) PS C:\Code\Python\Environment\Clustering\Code> 

GitHub: https://github.com/threadwaiting/GenAIClustering