API reference
TextClusterer
semaclust.TextClusterer
Cluster semantically similar texts via embeddings + agglomerative clustering.
Parameters
encoder:
Either a semaclust.encoders.Encoder instance or a string, which is
passed to semaclust.encoders.SentenceTransformerEncoder to create the default encoder.
distance_threshold:
Linkage distance above which clusters will not be merged. Smaller
values produce more, tighter clusters. The default 1.0 is tuned
for ward + euclidean linkage on unit-norm sentence-transformer
embeddings; see benchmarks.md for per-model sweet spots.
metric:
Distance metric for sklearn.cluster.AgglomerativeClustering.
linkage:
Linkage criterion. "ward" only supports metric="euclidean".
batch_size:
Encoding batch size. Forwarded to the default encoder only.
normalize:
Lowercase, strip whitespace, and drop double quotes before encoding.
representative_selector:
Function picking a representative text from a cluster's member list.
Defaults to the shortest text (ties broken alphabetically).
random_state:
Optional seed applied to random, numpy, and torch before
encoding to make embedding-time stochastic ops reproducible.
device:
Torch device for the auto-created encoder. "auto" (default) picks
CUDA if available, then MPS on Apple Silicon, else falls back to CPU.
Pass "cuda", "mps", or "cpu" to override, or None to let
sentence-transformers decide. Ignored when encoder is a custom
semaclust.encoders.Encoder instance.
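The normalize and representative_selector defaults described above can be approximated in a few lines. This is a minimal sketch rather than the library's own code: normalize_text and shortest_first are hypothetical names, and the order of the normalization steps is an assumption.

```python
from typing import Sequence


def normalize_text(text: str) -> str:
    """Lowercase, drop double quotes, and strip surrounding whitespace.

    Assumption: quotes are removed before stripping, so whitespace that
    sat just inside the quotes is trimmed as well.
    """
    return text.lower().replace('"', "").strip()


def shortest_first(members: Sequence[str]) -> str:
    """Pick the shortest text; break ties alphabetically."""
    # The key is (length, text): length decides first, then ordinary
    # string ordering resolves candidates of equal length.
    return min(members, key=lambda t: (len(t), t))
```

A custom selector would presumably follow the same shape: a callable that receives a cluster's member list and returns one text, for example `lambda members: max(members, key=len)` to prefer the longest member.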
Examples
>>> from semaclust import TextClusterer
>>> clusterer = TextClusterer(distance_threshold=1.0)
>>> texts = ["New York", "NYC", "Los Angeles", "LA"]
>>> _ = clusterer.fit(texts)  # doctest: +SKIP
>>> clusterer.transform()  # doctest: +SKIP
['New York', 'New York', 'Los Angeles', 'Los Angeles']
fit(texts)
Cluster texts and store fitted attributes on this instance.
fit_predict(texts)
Cluster texts and return the per-text cluster labels.
fit_transform(texts)
Cluster texts and return each text replaced by its representative.
transform(texts=None)
Replace texts with their cluster representative.
With texts=None, it operates on the training set. With an iterable, it
replaces texts that were seen at fit time and passes the rest through
unchanged.
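The pass-through behaviour can be illustrated with a plain dictionary lookup. The sketch below is an assumption about the semantics, not the library's implementation: apply_replacements and fitted_map are hypothetical, and matching is done on the exact raw string.

```python
from typing import Iterable, Mapping


def apply_replacements(texts: Iterable[str], mapping: Mapping[str, str]) -> list[str]:
    """Replace known texts with their representative and keep the rest."""
    # mapping.get(t, t) returns the representative when t was seen at
    # fit time and falls back to t itself otherwise.
    return [mapping.get(t, t) for t in texts]


# Hypothetical mapping, shaped like the result of the example above.
fitted_map = {
    "New York": "New York",
    "NYC": "New York",
    "Los Angeles": "Los Angeles",
    "LA": "Los Angeles",
}
result = apply_replacements(["NYC", "Chicago", "LA"], fitted_map)
```

Here "Chicago" was never part of the training set, so it comes back unchanged while the two known texts are replaced.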
get_replacement_map()
Return the text -> representative mapping for the training set.
ClusterResult
semaclust.ClusterResult
dataclass
Outcome of clustering a list of texts.
Attributes
labels:
Cluster id for each input text, in input order.
clusters:
Mapping from cluster id to the list of texts in that cluster.
representatives:
Mapping from cluster id to a single representative text.
texts:
The original input texts, kept for transform().
replacement_map()
Build a text -> representative mapping for the input texts.
When the same text appears in multiple clusters (possible if the input had duplicates that normalized differently) the first occurrence wins.
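One way to get the "first occurrence wins" behaviour is dict.setdefault, which never overwrites an existing key. The sketch below is hypothetical: build_replacement_map is not the library's function, and it assumes "first" means the first cluster encountered when iterating over clusters.

```python
def build_replacement_map(
    clusters: dict[int, list[str]],
    representatives: dict[int, str],
) -> dict[str, str]:
    """Map every clustered text to its cluster's representative."""
    mapping: dict[str, str] = {}
    for cluster_id, members in clusters.items():
        for text in members:
            # setdefault keeps an existing entry, so the first cluster
            # that contains a text decides its representative.
            mapping.setdefault(text, representatives[cluster_id])
    return mapping


# Illustrative data: "NYC" appears in both clusters.
clusters = {0: ["New York", "NYC"], 1: ["Los Angeles", "LA", "NYC"]}
representatives = {0: "New York", 1: "Los Angeles"}
replacement = build_replacement_map(clusters, representatives)
```

With this data, "NYC" maps to "New York" because cluster 0 is visited first; the later occurrence in cluster 1 is ignored.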
transform(texts=None)
Replace each text with its cluster representative.
Texts not seen during clustering pass through unchanged.
Encoders
semaclust.Encoder
Bases: Protocol
Anything that can turn a list of strings into a 2D float array.
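Because the encoder is a Protocol, any object with a matching method should qualify without inheriting from anything. The sketch below defines its own stand-in protocol, EncoderLike, since the exact signature of semaclust.Encoder is not shown here; the method name encode and its argument are assumptions. Nested lists stand in for the 2D float array to keep the example dependency-free; with NumPy installed, the result could be wrapped in np.asarray.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class EncoderLike(Protocol):
    """Hypothetical stand-in for the Encoder protocol."""

    def encode(self, texts: Sequence[str]) -> Sequence[Sequence[float]]:
        ...


class LetterCountEncoder:
    """Toy encoder: one row per text, one column per letter a-z."""

    def encode(self, texts: Sequence[str]) -> list[list[float]]:
        rows = []
        for text in texts:
            counts = [0.0] * 26
            for ch in text.lower():
                if "a" <= ch <= "z":
                    counts[ord(ch) - ord("a")] += 1.0
            rows.append(counts)
        return rows


encoder = LetterCountEncoder()
vectors = encoder.encode(["NYC", "LA"])
# Structural check: no inheritance needed, only a matching method.
is_encoder = isinstance(encoder, EncoderLike)
```

Letter counts are a poor semantic embedding; the point is only the shape of the contract, a list of strings in and one fixed-width row of floats per string out.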
semaclust.SentenceTransformerEncoder
Default encoder using sentence-transformers.
The underlying model is loaded lazily on the first encode() call, so
constructing the encoder is cheap and import time stays free of side effects.
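The lazy-loading pattern described above can be sketched as follows. LazyEncoder and fake_loader are hypothetical; a real encoder would presumably build a SentenceTransformer inside the loader instead of the toy function used here.

```python
from typing import Any, Callable, Optional, Sequence


class LazyEncoder:
    """Defers building the model until encode() is first called."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        # Only the loader is stored; nothing heavy runs here.
        self._loader = loader
        self._model: Optional[Any] = None

    def _get_model(self) -> Any:
        if self._model is None:
            self._model = self._loader()
        return self._model

    def encode(self, texts: Sequence[str]) -> Any:
        return self._get_model()(list(texts))


load_calls = []


def fake_loader() -> Callable[[list], list]:
    """Hypothetical loader: records each call and returns a toy model."""
    load_calls.append(1)
    return lambda texts: [[float(len(t))] for t in texts]


encoder = LazyEncoder(fake_loader)
calls_after_init = len(load_calls)  # the loader has not run yet
first = encoder.encode(["NYC", "LA"])
second = encoder.encode(["Los Angeles"])
```

The loader runs once, on the first encode() call, and the cached model is reused afterwards.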
Parameters
model_name:
Hugging Face model name passed to SentenceTransformer.
batch_size:
Batch size used when encoding.
device:
Torch device. "auto" (default) picks CUDA if available, then MPS
on Apple Silicon, and otherwise lets sentence-transformers fall back
to CPU. Pass an explicit string ("cuda", "mps", "cpu") to
override, or None to delegate the choice entirely to
sentence-transformers.
effective_device
property
The device string that will actually be passed to the model.
Resolves "auto" lazily on first access; for explicit values returns
what the user passed.
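The resolution order can be written as a small pure function. This is a sketch under assumptions: resolve_device is a hypothetical helper, availability is passed in as booleans so the logic runs without PyTorch, and the fallback is reported as "cpu" (the real property may behave differently in that case).

```python
from typing import Optional


def resolve_device(
    requested: Optional[str],
    cuda_available: bool,
    mps_available: bool,
) -> Optional[str]:
    """Resolve a device request following the CUDA > MPS > CPU order."""
    if requested != "auto":
        # Explicit strings and None are returned untouched.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    # Assumption: the fallback is reported as "cpu".
    return "cpu"


# With PyTorch installed, the two flags could come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
choices = [
    resolve_device("auto", cuda_available=True, mps_available=True),
    resolve_device("auto", cuda_available=False, mps_available=True),
    resolve_device("auto", cuda_available=False, mps_available=False),
    resolve_device("cpu", cuda_available=True, mps_available=False),
    resolve_device(None, cuda_available=True, mps_available=False),
]
```

Keeping the hardware probes outside the function makes the precedence easy to test on any machine.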
Exceptions
semaclust.NotFittedError
Bases: RuntimeError
Raised when accessing fitted attributes on an unfitted clusterer.
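Since the exception derives from RuntimeError, callers can catch either type. The sketch below uses a local stand-in class with the same base and a hypothetical TinyClusterer to show the guard pattern; it does not reproduce the library's internals.

```python
class NotFittedError(RuntimeError):
    """Local stand-in with the same base class as semaclust.NotFittedError."""


class TinyClusterer:
    """Hypothetical estimator that guards its fitted state."""

    def __init__(self) -> None:
        self._labels = None

    def fit(self, texts):
        # Placeholder "clustering": every text gets its own label.
        self._labels = list(range(len(texts)))
        return self

    @property
    def labels(self):
        if self._labels is None:
            raise NotFittedError("call fit() before reading labels")
        return self._labels


clusterer = TinyClusterer()
try:
    clusterer.labels
    raised = False
except RuntimeError:
    # Catching the base class also catches NotFittedError.
    raised = True

fitted_labels = clusterer.fit(["NYC", "LA"]).labels
```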