semaclust¶
semaclust clusters semantically similar strings using sentence embeddings and agglomerative clustering. It is a small, focused tool for deduplicating free-text fields, normalizing user-entered values, and collapsing spelling or formatting variants into a canonical form.
from semaclust import TextClusterer
clusterer = TextClusterer(distance_threshold=1.0)
clusterer.fit(["New York", "NYC", "Los Angeles", "LA"])
print(clusterer.clusters_)
Why semaclust¶
- Drop-in API. scikit-learn style
fit/fit_predict/fit_transform. - Pluggable encoders. Anything with
encode(list[str]) -> np.ndarrayworks. - CLI included.
semaclust cluster items.txtreturns JSON. - Typed. Ships with
py.typed, mypy strict in CI.
Next steps¶
- Read the quickstart
- Browse the API reference
- Coming from 0.1.x? See the migration guide