Skip to content

Quickstart

Install

pip install semaclust

Cluster a list of strings

from semaclust import TextClusterer

texts = [
    "New York", "NYC", "new york city",
    "Los Angeles", "LA",
    "San Francisco", "San Fran", "SF",
]

clusterer = TextClusterer(distance_threshold=1.0)
clusterer.fit(texts)

clusterer.n_clusters_       # 3
clusterer.labels_           # ndarray of cluster ids
clusterer.clusters_         # {0: ['New York', 'NYC', ...], ...}
clusterer.representatives_  # {0: 'NYC', 1: 'LA', 2: 'SF'}

Replace values with their cluster representative

clusterer.transform()
# ['NYC', 'NYC', 'NYC', 'LA', 'LA', 'SF', 'SF', 'SF']

clusterer.fit_transform(texts)
# same as fit(texts).transform()

Tune the threshold

distance_threshold controls cluster granularity. Smaller values produce tighter, more numerous clusters; larger values merge more aggressively. The useful range with the default encoder (all-MiniLM-L6-v2) under ward + euclidean linkage is roughly 0.7 to 1.4 for unit-norm embeddings. See benchmarks.md for per-model sweet spots.

Custom representative

Pick the longest string as the representative instead of the shortest:

clusterer = TextClusterer(
    distance_threshold=1.0,
    representative_selector=lambda texts: max(texts, key=len),
)

Bring your own encoder

import numpy as np
from semaclust import TextClusterer

class HashEncoder:
    def encode(self, texts: list[str]) -> np.ndarray:
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(texts), 64)).astype(np.float32)

TextClusterer(encoder=HashEncoder()).fit_predict(["a", "b"])

CLI

semaclust cluster items.txt --threshold 1.0 --output clusters.json
cat items.txt | semaclust replace --threshold 1.0 > normalized.txt