Migration from 0.1.x to 0.3.0
The 0.3 release deliberately breaks with the old per-call API in favor of a scikit-learn-style estimator. (Version 0.2.0 was a stale draft on PyPI; 0.3.0 is the first modernized release.)
At a glance
| 0.1.x | 0.3.x |
|---|---|
| `TextClusterer().cluster(texts)` | `TextClusterer().fit(texts).clusters_` |
| `TextClusterer().get_replacement_map(texts)` | `TextClusterer().fit(texts).get_replacement_map()` |
| `TextClusterer().replace_values(texts)` | `TextClusterer().fit_transform(texts)` |
| `replace_values(texts, representative_selector=fn)` | `TextClusterer(representative_selector=fn).fit_transform(texts)` |
Why the break
- Predictable state. Each `fit` call stores its outcome on the instance (`labels_`, `clusters_`, etc.), letting downstream code inspect the result without re-running the clustering.
- Composable. Following the scikit-learn convention means the estimator drops cleanly into pipelines.
- One configuration site. Per-call options like the representative selector are now constructor arguments. Configure once, run many times.
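The fit-then-inspect convention described above can be sketched with a toy estimator. `ToyClusterer` below is a hypothetical stand-in that groups texts by their stripped, case-folded form; it only mirrors the attribute and method names mentioned in this guide and is not semaclust's clustering algorithm.

```python
from collections import defaultdict
from typing import Callable, Sequence


def first(items: Sequence[str]) -> str:
    """Default selector for this sketch: pick the first member of a cluster."""
    return items[0]


class ToyClusterer:
    """Hypothetical stand-in; NOT semaclust's implementation."""

    def __init__(self, representative_selector: Callable[[Sequence[str]], str] = first) -> None:
        # Configuration lives on the constructor, mirroring the 0.3 design.
        self.representative_selector = representative_selector

    def fit(self, texts: Sequence[str]) -> "ToyClusterer":
        groups = defaultdict(list)
        for text in texts:
            groups[text.strip().casefold()].append(text)
        keys = list(groups)  # insertion order gives stable cluster ids
        # Results are stored on the instance so callers can inspect them later.
        self.clusters_ = [groups[key] for key in keys]
        self.labels_ = [keys.index(text.strip().casefold()) for text in texts]
        return self

    def get_replacement_map(self) -> dict:
        # Uses the stored clusters, so no texts argument is needed after fit.
        return {
            text: self.representative_selector(cluster)
            for cluster in self.clusters_
            for text in cluster
        }

    def transform(self, texts: Sequence[str]) -> list:
        mapping = self.get_replacement_map()
        return [mapping.get(text, text) for text in texts]

    def fit_transform(self, texts: Sequence[str]) -> list:
        return self.fit(texts).transform(texts)


texts = ["Apple", "apple ", "Pear"]
model = ToyClusterer().fit(texts)
print(model.labels_)    # [0, 0, 1]
print(model.clusters_)  # [['Apple', 'apple '], ['Pear']]
print(model.get_replacement_map())
```

Because the fitted state stays on `model`, the labels, clusters, and replacement map all come from a single clustering pass.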
Step by step
Before
```python
from semaclust import TextClusterer

clusterer = TextClusterer()
clusters = clusterer.cluster(texts)
mapping = clusterer.get_replacement_map(texts, representative_selector=lambda xs: xs[0])
replaced = clusterer.replace_values(texts)
```
After
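```python
from semaclust import TextClusterer

# Options such as the representative selector are set once, on the constructor.
clusterer = TextClusterer(representative_selector=lambda xs: xs[0])
clusterer.fit(texts)

# Fitted results are read from the instance.
clusters = clusterer.clusters_
mapping = clusterer.get_replacement_map()

# Equivalent to clusterer.fit_transform(texts) in a single step.
replaced = clusterer.transform(texts)
```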
Other changes
- Skipped 0.2.0, which was an early draft published to PyPI.
- Minimum Python is now 3.10.
- Built-in CLI: `semaclust cluster` and `semaclust replace`.
- The package ships `py.typed` and is mypy strict.
- Encoders are pluggable via the `Encoder` protocol.
- Empty input no longer raises; it returns an empty result.
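Protocol-based pluggability generally means that any object with the right methods is accepted, with no base class to inherit from. The sketch below illustrates the idea with `typing.Protocol`. The names `SketchEncoder`, `LengthEncoder`, and `embed`, and the `encode` method with its signature, are assumptions made for illustration; check semaclust's own `Encoder` definition for the real interface.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class SketchEncoder(Protocol):
    # ASSUMPTION: method name and signature are illustrative only,
    # not semaclust's actual Encoder protocol.
    def encode(self, texts: Sequence[str]) -> list: ...


class LengthEncoder:
    """Toy encoder: embeds each text as a one-element vector holding its length."""

    def encode(self, texts: Sequence[str]) -> list:
        return [[float(len(text))] for text in texts]


def embed(encoder: SketchEncoder, texts: Sequence[str]) -> list:
    # Structural typing: LengthEncoder is accepted without subclassing SketchEncoder.
    return encoder.encode(texts)


print(isinstance(LengthEncoder(), SketchEncoder))  # True
print(embed(LengthEncoder(), ["ab", "abc"]))       # [[2.0], [3.0]]
print(embed(LengthEncoder(), []))                  # []
```

The last call returns an empty list for empty input rather than raising, which is the kind of behavior a custom encoder would need in order to fit the empty-input change listed above.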