---
license: apache-2.0
language:
- en
pipeline_tag: fill-mask
tags:
- url
- cybersecurity
- urls
- links
- classification
- phishing-detection
- tiny
- phishing
- malware
- defacement
- transformers
- urlbert
- bert
- malicious
- base
---

urlbert-tiny-base-v4 is a lightweight BERT-based model optimized for URL analysis. It improves on the previous version in several ways:

- Trained using a teacher-student architecture
- Uses masked token prediction as the primary pre-training task
- Incorporates knowledge distillation from a larger model's logits
- Receives additional training on 3 specialized tasks to improve its understanding of URL structure

The result is an efficient model that can be fine-tuned quickly for URL classification tasks with minimal computational resources.

## Model Details

- **Parameters:** 3.72M
- **Tensor Type:** F32
- **Previous Version:** [urlbert-tiny-base-v3](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v3)

## Usage Example

```python
from transformers import BertTokenizerFast, BertForMaskedLM, pipeline
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

model_name = "CrabInHoney/urlbert-tiny-base-v4"

tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
model.to(device)

# The pipeline takes a device index: 0 for the first GPU, -1 for the CPU.
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Each input contains one [MASK] token for the model to predict.
sentences = [
    "http://example.[MASK]/"
]

for sentence in sentences:
    print(f"\nInput: {sentence}")
    results = fill_mask(sentence)
    # Results are candidate tokens for the masked position, each with a score.
    for result in results:
        token_str = result['token_str']
        score = result['score']
        print(f"Predicted token: {token_str}, probability: {score:.4f}")
```

### Sample Output

```
Input: http://example.[MASK]/
Predicted token: com, probability: 0.7307
Predicted token: net, probability: 0.1319
Predicted token: org, probability: 0.0881
Predicted token: info, probability: 0.0094
Predicted token: cn, probability: 0.0084
```
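
### Building Masked Inputs

The usage example writes the masked URL by hand. When probing many URLs, it may be convenient to generate such inputs programmatically. The sketch below is one possible way to do that using only the Python standard library. `mask_tld` and `best_prediction` are hypothetical helper names, not part of this model's repository or of `transformers`. The sketch assumes that the TLD is the last dot-separated label of the hostname and that the fill-mask pipeline returns a flat list of dicts with `token_str` and `score` keys, as it does for the single-mask input shown above.

```python
from urllib.parse import urlsplit

MASK_TOKEN = "[MASK]"  # standard BERT mask token, as used in the usage example


def mask_tld(url: str, mask_token: str = MASK_TOKEN) -> str:
    """Return `url` with the last hostname label (the TLD) replaced by the mask token.

    Hypothetical helper for illustration only. `urlsplit` lowercases the
    hostname, and this sketch drops any `user:password@` part of the URL.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if not parts.scheme or not host or "." not in host:
        raise ValueError(f"expected an absolute URL with a dotted hostname: {url!r}")

    # Split off the final label and substitute the mask token for it.
    stem, _tld = host.rsplit(".", 1)
    netloc = f"{stem}.{mask_token}"
    if parts.port is not None:
        netloc += f":{parts.port}"

    # Reassemble the remainder of the URL unchanged.
    rest = parts.path
    if parts.query:
        rest += f"?{parts.query}"
    if parts.fragment:
        rest += f"#{parts.fragment}"
    return f"{parts.scheme}://{netloc}{rest}"


def best_prediction(results: list[dict]) -> tuple[str, float]:
    """Pick the highest-scoring candidate from a fill-mask result list."""
    top = max(results, key=lambda r: r["score"])
    return top["token_str"], top["score"]


if __name__ == "__main__":
    print(mask_tld("http://example.com/"))  # http://example.[MASK]/

    # Scores copied from the sample output above, not produced by this script.
    sample = [
        {"token_str": "com", "score": 0.7307},
        {"token_str": "net", "score": 0.1319},
        {"token_str": "org", "score": 0.0881},
    ]
    print(best_prediction(sample))  # ('com', 0.7307)
```

The string returned by `mask_tld` can be passed directly to `fill_mask(...)` from the usage example, and `best_prediction` can then be applied to the list it returns.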