- Add `byte` scheme
- Add support for `class` scheme -- for multi-class classification field
- Feature: shrink an existing vocabulary to a given dataset (useful for parent-child transfer)
- Fix `nlcodec` CLI bug
- Improve help messages with epilog
- Add `nlcodec-learn` interface for learning vocabulary with PySpark
- add `nlcodec-freqs` CLI to setup.py
- log time and memory usage for `learn` task
- log BPE merge operations once every 2s instead of all operations
- using `__slots__`: ~25% faster, ~30% less memory for BPE with 3M word types
- `nlcodec.db.core` with `Db` and `MultipartDb`
- `nlcodec.db.batch` with `Batch` and `BatchIterable`
- CLI `nlcodec.learn` for learning BPE using pyspark
- CLI `nlcodec.bitextdb` to build a database from parallel text
- fix issue with `name` as class property (#24, #25)
- Option to supply preconfigured `spark` session object
- Add docs
- Option to accept term frequencies as input
- PySpark backend to compute word and char frequencies
- `--min-co-ev` of BPE is now a CLI argument
- FIX: `find_packages()` in `setup.py` file to include nested packages
- uploaded to pypi: `pip install nlcodec`
- public repository with apache license 2.0
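
The `__slots__` entry above reports speed and memory gains for BPE over millions of word types. The following is a minimal, generic sketch of why `__slots__` helps. It is not taken from the nlcodec code base, and the class names `TypeWithDict` and `TypeWithSlots` and their fields are hypothetical, chosen only for illustration. The exact savings depend on the Python version and the number of fields.

```python
import sys


class TypeWithDict:
    """Hypothetical record type: every instance owns a __dict__."""

    def __init__(self, name, freq):
        self.name = name
        self.freq = freq


class TypeWithSlots:
    """Same fields, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("name", "freq")

    def __init__(self, name, freq):
        self.name = name
        self.freq = freq


a = TypeWithDict("hello", 3)
b = TypeWithSlots("hello", 3)

# The regular instance pays for the object plus its attribute dictionary;
# the slotted instance stores attributes in fixed positions on the object.
dict_size = sys.getsizeof(a) + sys.getsizeof(a.__dict__)
slots_size = sys.getsizeof(b)

has_dict_a = hasattr(a, "__dict__")
has_dict_b = hasattr(b, "__dict__")

print(f"with __dict__: {dict_size} bytes, with __slots__: {slots_size} bytes")
```

The trade-off is that slotted instances cannot gain new attributes at runtime, which is usually acceptable for small record-like objects created in very large numbers.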