API reference

TextClusterer

`semaclust.TextClusterer`

Cluster semantically similar texts via embeddings + agglomerative clustering.

Parameters

`encoder`: Either a `semaclust.encoders.Encoder` instance or a string passed to `semaclust.encoders.SentenceTransformerEncoder`.

`distance_threshold`: Linkage distance above which clusters will not be merged. Smaller values produce more, tighter clusters. The default 1.0 is tuned for ward + euclidean linkage on unit-norm sentence-transformer embeddings; see benchmarks.md for per-model sweet spots.

`metric`: Distance metric for `sklearn.cluster.AgglomerativeClustering`.

`linkage`: Linkage criterion. "ward" only supports metric="euclidean".

`batch_size`: Encoding batch size. Forwarded to the default encoder only.

`normalize`: Lowercase, strip whitespace, and drop double quotes before encoding.

`representative_selector`: Function picking a representative text from a cluster's member list. Defaults to the shortest text (ties broken alphabetically).

`random_state`: Optional seed applied to `random`, `numpy`, and `torch` before encoding to make embedding-time stochastic ops reproducible.

`device`: Torch device for the auto-created encoder. "auto" (default) picks CUDA if available, then MPS on Apple Silicon, else falls back to CPU. Pass "cuda", "mps", or "cpu" to override, or None to let sentence-transformers decide. Ignored when `encoder` is a custom `semaclust.encoders.Encoder` instance.
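The `normalize` and `representative_selector` descriptions above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names `normalize_text` and `shortest_text` are hypothetical, and the order in which the normalization steps run is an assumption.

```python
def normalize_text(text: str) -> str:
    """Hypothetical helper: drop double quotes, strip whitespace, lowercase."""
    # The order of the three steps is assumed; the reference only lists them.
    return text.replace('"', "").strip().lower()


def shortest_text(members: list[str]) -> str:
    """Hypothetical selector: shortest member, ties broken alphabetically."""
    # Sorting key (length, text) makes the alphabetical tie-break explicit.
    return min(members, key=lambda t: (len(t), t))


print(normalize_text('  "New York" '))             # new york
print(shortest_text(["nyc", "new york", "abc"]))   # abc
```

A custom selector with the same shape (a list of member texts in, one text out) could presumably be passed as `representative_selector`, for example one that prefers the longest member.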

Examples

    >>> from semaclust import TextClusterer
    >>> clusterer = TextClusterer(distance_threshold=1.0)
    >>> texts = ["New York", "NYC", "Los Angeles", "LA"]
    >>> _ = clusterer.fit(texts)  # doctest: +SKIP
    >>> clusterer.transform()  # doctest: +SKIP
    ['New York', 'New York', 'Los Angeles', 'Los Angeles']

`fit(texts)`

Cluster texts and store fitted attributes on this instance.

`fit_predict(texts)`

Cluster texts and return the per-text cluster labels.

`fit_transform(texts)`

Cluster texts and return each text replaced by its representative.

`transform(texts=None)`

Replace texts with their cluster representative.

With `texts=None`, operates on the training set. With an iterable, replaces texts that were seen at fit time and passes the rest through unchanged.
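The pass-through behavior for unseen texts amounts to a dictionary lookup with a default. The sketch below assumes a plain text -> representative dict and a hypothetical helper `apply_replacements`; it ignores any normalization the real method may apply before the lookup.

```python
def apply_replacements(texts, replacement_map):
    """Hypothetical helper: map seen texts to representatives, keep the rest."""
    # dict.get(t, t) returns the text itself when it was not seen at fit time.
    return [replacement_map.get(t, t) for t in texts]


# Illustrative mapping, as it might look after fitting on four city names.
replacement_map = {
    "New York": "New York",
    "NYC": "New York",
    "Los Angeles": "Los Angeles",
    "LA": "Los Angeles",
}

result = apply_replacements(["NYC", "LA", "Chicago"], replacement_map)
print(result)  # ['New York', 'Los Angeles', 'Chicago']
```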

`get_replacement_map()`

Return the text -> representative mapping for the training set.

ClusterResult

`semaclust.ClusterResult` `dataclass`

Outcome of clustering a list of texts.

Attributes

`labels`: Cluster id for each input text, in input order.

`clusters`: Mapping from cluster id to the list of texts in that cluster.

`representatives`: Mapping from cluster id to a single representative text.

`texts`: The original input texts, kept for `transform`.

`replacement_map()`

Build a text -> representative mapping for the input texts.

When the same text appears in multiple clusters (possible if the input had duplicates that normalized differently) the first occurrence wins.
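One way to get "first occurrence wins" is `dict.setdefault`, which never overwrites an existing key. This is a sketch under assumptions: `build_replacement_map` is a hypothetical function, and "first" is taken to mean first in input order.

```python
def build_replacement_map(texts, labels, representatives):
    """Hypothetical helper: text -> representative, first occurrence wins."""
    mapping = {}
    for text, label in zip(texts, labels):
        # setdefault keeps the entry from the first time a text is seen.
        mapping.setdefault(text, representatives[label])
    return mapping


# "NYC" appears twice and lands in two different clusters.
texts = ["NYC", "New York", "LA", "NYC"]
labels = [0, 0, 1, 1]
representatives = {0: "New York", 1: "LA"}

mapping = build_replacement_map(texts, labels, representatives)
print(mapping)  # {'NYC': 'New York', 'New York': 'New York', 'LA': 'LA'}
```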

`transform(texts=None)`

Replace each text with its cluster representative.

Texts not seen during clustering pass through unchanged.

Encoders

`semaclust.Encoder`

Bases: Protocol

Anything that can turn a list of strings into a 2D float array.
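A structural protocol of this kind might look like the sketch below. The method name `encode` is an assumption borrowed from the default encoder's description, `EncoderLike` and `LengthEncoder` are hypothetical names, and nested lists stand in for the 2D float array so the example needs only the standard library.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class EncoderLike(Protocol):
    """Hypothetical stand-in for the library's Encoder protocol."""

    def encode(self, texts: list[str]) -> Sequence[Sequence[float]]: ...


class LengthEncoder:
    """Toy encoder: maps each text to [character count, word count]."""

    def encode(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t)), float(len(t.split()))] for t in texts]


encoder = LengthEncoder()
vectors = encoder.encode(["New York", "NYC"])

# No inheritance needed: the class satisfies the protocol by shape alone.
print(isinstance(encoder, EncoderLike))  # True
print(vectors)                           # [[8.0, 2.0], [3.0, 1.0]]
```

A real encoder would more likely return a NumPy array of shape (number of texts, embedding dimension).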

`semaclust.SentenceTransformerEncoder`

Default encoder using sentence-transformers.

The underlying model is loaded lazily on the first `encode` call, so constructing the encoder is cheap and import time stays free of side effects.
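The lazy-loading pattern described here can be illustrated without sentence-transformers. In this sketch `LazyEncoder` and `fake_loader` are hypothetical; the loader stands in for whatever constructs the real model.

```python
class LazyEncoder:
    """Hypothetical encoder that defers model loading until first use."""

    def __init__(self, model_name, loader):
        self.model_name = model_name
        self._loader = loader
        self._model = None  # nothing is loaded at construction time

    @property
    def model(self):
        # Load once on first access, then reuse the cached instance.
        if self._model is None:
            self._model = self._loader(self.model_name)
        return self._model

    def encode(self, texts):
        return self.model(texts)


load_calls = []


def fake_loader(name):
    """Stand-in loader that records each call and returns a toy model."""
    load_calls.append(name)
    return lambda texts: [[float(len(t))] for t in texts]


encoder = LazyEncoder("hypothetical-model", fake_loader)
calls_after_init = len(load_calls)   # 0: construction did not load anything
encoder.encode(["a", "bb"])
encoder.encode(["ccc"])
calls_after_encode = len(load_calls)  # 1: loaded once, reused afterwards
```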

Parameters

`model_name`: Hugging Face model name passed to `SentenceTransformer`.

`batch_size`: Batch size used when encoding.

`device`: Torch device. "auto" (default) picks CUDA if available, then MPS on Apple Silicon, and otherwise lets sentence-transformers fall back to CPU. Pass an explicit string ("cuda", "mps", "cpu") to override, or None to delegate the choice entirely to sentence-transformers.

`effective_device` `property`

The device string that will actually be passed to the model.

Resolves "auto" lazily on first access; for explicit values returns what the user passed.
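The resolution order can be sketched as a pure function. `resolve_device` is a hypothetical helper; the availability flags are passed in so the example runs without torch, whereas real code would likely query `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. Returning "cpu" explicitly is an assumption: the library may instead leave the final fallback to sentence-transformers.

```python
from typing import Optional


def resolve_device(
    requested: Optional[str],
    cuda_available: bool,
    mps_available: bool,
) -> Optional[str]:
    """Hypothetical helper mirroring the documented "auto" order."""
    if requested != "auto":
        # Explicit strings and None are returned exactly as passed.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"  # assumed fallback; see the note above


print(resolve_device("auto", cuda_available=True, mps_available=True))    # cuda
print(resolve_device("auto", cuda_available=False, mps_available=True))   # mps
print(resolve_device("auto", cuda_available=False, mps_available=False))  # cpu
print(resolve_device(None, cuda_available=True, mps_available=False))     # None
```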

Exceptions

`semaclust.NotFittedError`

Bases: RuntimeError

Raised when accessing fitted attributes on an unfitted clusterer.
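A guard of this kind is typically a property that checks for fitted state before returning it. The sketch below defines its own stand-in `NotFittedError` and a hypothetical `TinyClusterer` with an assumed `labels_` attribute, so it runs without the library installed.

```python
class NotFittedError(RuntimeError):
    """Stand-in for semaclust.NotFittedError (same base class)."""


class TinyClusterer:
    """Hypothetical clusterer showing the unfitted-access guard."""

    def __init__(self):
        self._labels = None

    @property
    def labels_(self):
        # Attribute name is an assumption made for this illustration.
        if self._labels is None:
            raise NotFittedError("Call fit() before accessing labels_.")
        return self._labels

    def fit(self, texts):
        # Placeholder "clustering": every text gets its own cluster id.
        self._labels = list(range(len(texts)))
        return self


try:
    TinyClusterer().labels_
    raised = False
except NotFittedError:
    raised = True

fitted_labels = TinyClusterer().fit(["NYC", "LA"]).labels_
```

Because the exception derives from `RuntimeError`, callers that already catch `RuntimeError` will also catch it.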
