---
title: Sanskrit Tokenizer Demo
emoji: 💻
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
short_description: Sanskrit Tokenizer Demo
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Sanskrit BPE Tokenizer

A Byte-Pair Encoding (BPE) tokenizer designed specifically for Sanskrit text. It provides a web interface for both training the tokenizer and using it to encode/decode text.

## Features

- Train a BPE tokenizer on custom Sanskrit text
- Tokenize Sanskrit text using the trained model
- Verify tokenization accuracy through an encode/decode round trip
- User-friendly web interface

## Usage

1. Go to the "Train" tab and paste your Sanskrit training text
2. Click "Train Tokenizer" to train the model
3. Switch to the "Tokenize" tab to tokenize new text
4. Enter text and click "Tokenize" to see the results

## Example Text

```
चीराण्यपास्याज्जनकस्य कन्या नेयं प्रतिज्ञा मम दत्तपूर्वा।
यथासुखं गच्छतु राजपुत्री वनं समग्रा सह सर्वरत्नैः॥
```

# To deploy this to Hugging Face Spaces

1. Create a new Space on Hugging Face:

```bash
huggingface-cli login
huggingface-cli repo create sanskrit-tokenizer-demo --type space
```

2. Initialize git and push your code:

```bash
git init
git add .
git commit -m "Initial commit"
git remote add origin https://huggingface.co/spaces/your-username/sanskrit-tokenizer-demo
git push -u origin main
```

# Steps to Run Locally

1. Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Install the requirements and the Hugging Face Hub client:

```bash
pip install -r requirements.txt
pip install --upgrade huggingface-hub
```

3. Run the app:

```bash
python src/app.py
```

The interface will be available at `http://localhost:7860` by default.

# Logs

Original (before the Hugging Face Space code):

```
(venv) gitesh.grover@Giteshs-MacBook-Pro ai-era-assignment11 % python src/main.py
Reading file...
Length of training dataset for token 620958
UTF-8 tokens length (without considering regex) 1719701
Preprocessing tokens...
UTF-8 Split tokens length (due to regex) 77799
Training BPE...
Learning BPE: 100%|████████████████████████████████████████| 4744/4744 [06:21<00:00, 12.45it/s]
Tokenizer training completed in 381.08 seconds
Vocab size: 5000
Original tokens length 1719701, while updated tokens length 77799
Compression Ratio 22.10
Saving Tokenizer Vocab in files...
Testing the validity of tokenizer...
True
```

After the Hugging Face Space implementation:

```
Learning BPE: 100%|████████████████████████████████████████| 4744/4744 [03:25<00:00, 23.07it/s]
(Huggingface app output)
Training completed! Vocabulary size: 5000 and Compression Ratio: 9.42
```
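# Illustrative Sketches

For reference, here is a minimal, self-contained sketch of the byte-level BPE loop the logs describe: start from raw UTF-8 byte tokens, repeatedly merge the most frequent adjacent pair until the target vocabulary size is reached, and report the compression ratio (original UTF-8 token count divided by merged token count, as in the `Compression Ratio 22.10` log line). This is illustrative only; the names below are assumptions, and the actual `src/main.py` additionally applies a regex pre-split before training.

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int = 5000):
    """Toy byte-level BPE trainer (illustrative, not the actual src/main.py)."""
    tokens = list(text.encode("utf-8"))           # start from raw UTF-8 bytes (ids 0..255)
    merges = {}                                   # (left, right) -> new token id, in merge order
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent token pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)          # most frequent pair becomes a new token
        merges[pair] = next_id
        merged, i = [], 0
        while i < len(tokens):                    # replace every occurrence of the pair
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1
    return merges, tokens

def decode(ids, merges):
    """Invert the merges back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), idx in merges.items():     # dicts preserve insertion (merge) order
        vocab[idx] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

text = "यथासुखं गच्छतु राजपुत्री वनं समग्रा सह सर्वरत्नैः"
merges, ids = train_bpe(text, vocab_size=300)
print("compression ratio:", len(text.encode("utf-8")) / len(ids))
print("round trip ok:", decode(ids, merges) == text)  # the validity test from the logs
```

The final `decode(...) == text` check mirrors the "Testing the validity of tokenizer... True" log line: a lossless encode/decode round trip is the correctness criterion for the trained vocabulary.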
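Likewise, a hedged sketch of how the two-tab Gradio interface described under Usage might be wired up. The real `app.py` may be structured differently; `train_tokenizer` and `tokenize_text` are placeholder names standing in for the actual training and encoding logic.

```python
import gradio as gr

def train_tokenizer(training_text: str) -> str:
    # Placeholder: train the BPE tokenizer on the pasted text and report stats.
    return "Training completed! Vocabulary size: ... and Compression Ratio: ..."

def tokenize_text(text: str) -> str:
    # Placeholder: encode with the trained tokenizer and show the token ids.
    return "token ids: ..."

with gr.Blocks(title="Sanskrit BPE Tokenizer") as demo:
    with gr.Tab("Train"):
        train_in = gr.Textbox(label="Sanskrit training text", lines=10)
        train_btn = gr.Button("Train Tokenizer")
        train_out = gr.Textbox(label="Training result")
        train_btn.click(train_tokenizer, inputs=train_in, outputs=train_out)
    with gr.Tab("Tokenize"):
        tok_in = gr.Textbox(label="Text to tokenize", lines=4)
        tok_btn = gr.Button("Tokenize")
        tok_out = gr.Textbox(label="Tokens")
        tok_btn.click(tokenize_text, inputs=tok_in, outputs=tok_out)

demo.launch()  # serves on http://localhost:7860 by default
```

Two `gr.Tab` blocks inside a single `gr.Blocks` context give exactly the Train/Tokenize split described in the Usage steps, with each button's `.click` handler mapping one textbox to one output.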