API reference

TextClusterer

semaclust.TextClusterer

Cluster semantically similar texts via embeddings + agglomerative clustering.

Parameters

encoder: Either a semaclust.encoders.Encoder instance or a string passed to semaclust.encoders.SentenceTransformerEncoder.
distance_threshold: Linkage distance above which clusters will not be merged. Smaller values produce more, tighter clusters. The default 1.0 is tuned for ward + euclidean linkage on unit-norm sentence-transformer embeddings; see benchmarks.md for per-model sweet spots.
metric: Distance metric for sklearn.cluster.AgglomerativeClustering.
linkage: Linkage criterion. "ward" only supports metric="euclidean".
batch_size: Encoding batch size. Forwarded to the default encoder only.
normalize: Lowercase, strip whitespace, and drop double quotes before encoding.
representative_selector: Function picking a representative text from a cluster's member list. Defaults to the shortest text (ties broken alphabetically).
random_state: Optional seed applied to random, numpy, and torch before encoding to make embedding-time stochastic ops reproducible.
device: Torch device for the auto-created encoder. "auto" (default) picks CUDA if available, then MPS on Apple Silicon, else falls back to CPU. Pass "cuda", "mps", or "cpu" to override, or None to let sentence-transformers decide. Ignored when encoder is a custom semaclust.encoders.Encoder instance.
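
The normalize and representative_selector descriptions above can be sketched in plain Python. This is a minimal illustration, not the library's code: the names normalize_text and shortest_text are hypothetical, and both the order of the normalization steps and the use of plain (case-sensitive) string comparison for the alphabetical tie-break are assumptions.

```python
from typing import Sequence


def normalize_text(text: str) -> str:
    """Lowercase, strip surrounding whitespace, and drop double quotes.

    Hypothetical helper; the order of the three steps is an assumption.
    """
    return text.replace('"', "").strip().lower()


def shortest_text(members: Sequence[str]) -> str:
    """Pick the shortest member; ties go to the alphabetically first text.

    Hypothetical stand-in for the documented default selector. The tie-break
    uses plain string comparison, which is an assumption.
    """
    return min(members, key=lambda t: (len(t), t))


print(normalize_text('  "New York" '))                 # new york
print(shortest_text(["apple inc", "apply", "apple"]))  # apple
```

Any callable with the same shape as shortest_text (a cluster's member list in, one text out) matches the description of representative_selector given above.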

Examples

>>> from semaclust import TextClusterer
>>> clusterer = TextClusterer(distance_threshold=1.0)
>>> texts = ["New York", "NYC", "Los Angeles", "LA"]
>>> _ = clusterer.fit(texts)  # doctest: +SKIP
>>> clusterer.transform()  # doctest: +SKIP
['New York', 'New York', 'Los Angeles', 'Los Angeles']

fit(texts)

Cluster texts and store fitted attributes on this instance.

fit_predict(texts)

Cluster texts and return the per-text cluster labels.

fit_transform(texts)

Cluster texts and return each text replaced by its representative.

transform(texts=None)

Replace texts with their cluster representative.

With texts=None, operates on the training set. Given an iterable, replaces texts that were seen at fit time and passes the rest through unchanged.
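
The pass-through behaviour for unseen texts can be sketched as a dictionary lookup with a fallback. The helper apply_replacements and the example mapping are hypothetical; they only illustrate the documented semantics and are not taken from the library.

```python
from typing import Dict, Iterable, List


def apply_replacements(mapping: Dict[str, str], texts: Iterable[str]) -> List[str]:
    """Replace known texts with their representative; pass unseen texts through."""
    return [mapping.get(text, text) for text in texts]


# Hypothetical mapping, shaped like the text -> representative map of a fitted clusterer.
seen_at_fit = {"NYC": "New York", "New York": "New York"}

print(apply_replacements(seen_at_fit, ["NYC", "Boston"]))  # ['New York', 'Boston']
```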

get_replacement_map()

Return the text -> representative mapping for the training set.

ClusterResult

semaclust.ClusterResult dataclass

Outcome of clustering a list of texts.

Attributes

labels: Cluster id for each input text, in input order.
clusters: Mapping from cluster id to the list of texts in that cluster.
representatives: Mapping from cluster id to a single representative text.
texts: The original input texts, kept for transform.
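
To make the relationship between labels and clusters concrete, here is a small sketch that derives one from the other. The function group_by_label and the label values are illustrative assumptions, not library output.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_label(texts: Sequence[str], labels: Sequence[int]) -> Dict[int, List[str]]:
    """Group texts by cluster id, preserving input order within each cluster."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[label].append(text)
    return dict(clusters)


texts = ["New York", "NYC", "Los Angeles", "LA"]
labels = [0, 0, 1, 1]  # illustrative labels, not real clustering output

print(group_by_label(texts, labels))
# {0: ['New York', 'NYC'], 1: ['Los Angeles', 'LA']}
```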

replacement_map()

Build a text -> representative mapping for the input texts.

When the same text appears in multiple clusters (possible if the input had duplicates that normalized differently), the first occurrence wins.
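
A minimal sketch of the "first occurrence wins" rule, assuming that "first" means the order in which clusters and their members are iterated. The function build_replacement_map and the sample data are hypothetical.

```python
from typing import Dict, List


def build_replacement_map(
    clusters: Dict[int, List[str]],
    representatives: Dict[int, str],
) -> Dict[str, str]:
    """Map each text to a representative; a text seen twice keeps its first mapping."""
    mapping: Dict[str, str] = {}
    for cluster_id, members in clusters.items():
        for text in members:
            # setdefault only writes when the key is absent, so the first entry wins.
            mapping.setdefault(text, representatives[cluster_id])
    return mapping


# "NYC" deliberately appears in both clusters to exercise the rule.
clusters = {0: ["New York", "NYC"], 1: ["Los Angeles", "NYC"]}
representatives = {0: "New York", 1: "Los Angeles"}

print(build_replacement_map(clusters, representatives)["NYC"])  # New York
```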

transform(texts=None)

Replace each text with its cluster representative.

Texts not seen during clustering pass through unchanged.

Encoders

semaclust.Encoder

Bases: Protocol

Anything that can turn a list of strings into a 2D float array.
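
As a sketch of what a custom encoder might look like, the toy class below satisfies a structurally typed protocol with a single encode method. The protocol name EncoderLike and its signature are assumptions made for illustration; nested lists stand in for the 2D float array to keep the example dependency-free, whereas a real encoder would more likely return a NumPy array.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class EncoderLike(Protocol):
    """Hypothetical stand-in for the Encoder protocol; the signature is assumed."""

    def encode(self, texts: Sequence[str]) -> List[List[float]]: ...


class LengthEncoder:
    """Toy encoder: maps each text to [character count, word count]."""

    def encode(self, texts: Sequence[str]) -> List[List[float]]:
        return [[float(len(t)), float(len(t.split()))] for t in texts]


encoder = LengthEncoder()
print(isinstance(encoder, EncoderLike))    # True
print(encoder.encode(["New York", "LA"]))  # [[8.0, 2.0], [2.0, 1.0]]
```

Because the protocol is structural, LengthEncoder does not need to inherit from anything; having a compatible encode method is enough.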

semaclust.SentenceTransformerEncoder

Default encoder using sentence-transformers.

The underlying model is loaded lazily on the first encode call, so constructing the encoder is cheap and import time stays free of side effects.
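
The lazy-loading pattern described here can be sketched as follows. LazyEncoder and fake_loader are hypothetical names; the fake loader stands in for the expensive model construction so the example runs without sentence-transformers.

```python
from typing import Callable, List, Optional

Model = Callable[[List[str]], List[List[float]]]


class LazyEncoder:
    """Defers model construction until encode() is first called."""

    def __init__(self, loader: Callable[[], Model]) -> None:
        self._loader = loader
        self._model: Optional[Model] = None  # nothing is loaded at construction time

    def encode(self, texts: List[str]) -> List[List[float]]:
        if self._model is None:
            self._model = self._loader()  # load once, then reuse
        return self._model(texts)


load_calls: List[str] = []


def fake_loader() -> Model:
    """Hypothetical loader that records each time it is invoked."""
    load_calls.append("loaded")
    return lambda texts: [[float(len(t))] for t in texts]


encoder = LazyEncoder(fake_loader)
print(len(load_calls))  # 0
encoder.encode(["a"])
encoder.encode(["bb"])
print(len(load_calls))  # 1
```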

Parameters

model_name: Hugging Face model name passed to SentenceTransformer.
batch_size: Batch size used when encoding.
device: Torch device. "auto" (default) picks CUDA if available, then MPS on Apple Silicon, and otherwise lets sentence-transformers fall back to CPU. Pass an explicit string ("cuda", "mps", "cpu") to override, or None to delegate the choice entirely to sentence-transformers.

effective_device property

The device string that will actually be passed to the model.

Resolves "auto" lazily on first access; for explicit values returns what the user passed.
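
The resolution order can be sketched as a pure function. The name resolve_device and its boolean arguments are hypothetical, and returning the literal "cpu" in the final branch is an assumption, since the documentation only says the choice falls back to CPU.

```python
from typing import Optional


def resolve_device(
    requested: Optional[str],
    cuda_available: bool,
    mps_available: bool,
) -> Optional[str]:
    """Resolve "auto" to a concrete device; return explicit values unchanged."""
    if requested != "auto":
        return requested  # explicit strings and None are passed through as-is
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"  # assumption: the fallback is expressed as the string "cpu"


print(resolve_device("auto", cuda_available=False, mps_available=True))  # mps
print(resolve_device("cpu", cuda_available=True, mps_available=True))    # cpu
print(resolve_device(None, cuda_available=True, mps_available=False))    # None
```

In a real setup the two flags would presumably come from torch.cuda.is_available() and torch.backends.mps.is_available(); they are plain arguments here so the sketch runs without torch installed.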

Exceptions

semaclust.NotFittedError

Bases: RuntimeError

Raised when accessing fitted attributes on an unfitted clusterer.
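
A minimal sketch of the guard pattern this exception supports. The NotFittedError class below is a local stand-in with the documented base class, and TinyModel with its labels_ attribute is hypothetical; neither is the library's implementation.

```python
from typing import List, Optional


class NotFittedError(RuntimeError):
    """Local stand-in mirroring the documented base class (RuntimeError)."""


class TinyModel:
    """Hypothetical model that guards a fitted attribute."""

    def __init__(self) -> None:
        self._labels: Optional[List[int]] = None

    def fit(self, texts: List[str]) -> "TinyModel":
        self._labels = list(range(len(texts)))  # placeholder, not real clustering
        return self

    @property
    def labels_(self) -> List[int]:
        if self._labels is None:
            raise NotFittedError("Call fit() before accessing labels_.")
        return self._labels


model = TinyModel()
try:
    model.labels_
except NotFittedError as exc:
    print(f"caught: {exc}")  # caught: Call fit() before accessing labels_.

print(model.fit(["a", "b"]).labels_)  # [0, 1]
```

Because the exception derives from RuntimeError, callers that already catch RuntimeError will also catch it.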