---
license: cc-by-4.0
language:
- multilingual
tags:
- zero-shot-classification
- text-classification
- pytorch
metrics:
- recall
- precision
- f1-score
extra_gated_prompt: >-
  Our models are intended for academic use only. If you are not affiliated
  with an academic institution, please provide a rationale for using our
  models. Please allow us a few business days to manually review
  subscriptions.
extra_gated_fields:
  Name: text
  Country: country
  Institution: text
  Institution Email: text
  Please specify your academic use case: text
---

# xlm-roberta-large-pooled-cap-media-v2

## Model description

An `xlm-roberta-large` model fine-tuned on multilingual (English, German, Hungarian, Spanish, Slovak) training data labelled with [major topic codes](https://www.comparativeagendas.net/pages/master-codebook) from the [Comparative Agendas Project](https://www.comparativeagendas.net/). Furthermore, we used 7 additional media codes, following [Boydstun (2013)](https://www.amber-boydstun.com/uploads/1/0/6/5/106535199/nyt_front_page_policy_agendas_codebook.pdf):

* State and Local Government Administration (24)
* Weather and Natural Disaster (26)
* Fires (27)
* Sports and Recreation (29)
* Death Notices (30)
* Churches and Religion (31)
* Other, Miscellaneous and Human Interest (99)

## How to use the model

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
pipe = pipeline(
    model="poltextlab/xlm-roberta-large-pooled-cap-media-v2",
    task="text-classification",
    tokenizer=tokenizer,
    use_fast=False,
    token=""  # your Hugging Face access token
)

text = "We will place an immediate 6-month halt on the finance driven closure of beds and wards, and set up an independent audit of needs and facilities."
pipe(text)
```

### Gated access

Because access to the model is gated, you must pass the `token` parameter when loading it. In earlier versions of the Transformers package, you may need to use the `use_auth_token` parameter instead.
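For post-processing predictions, the seven additional media codes listed above can be mapped back to readable labels with a small lookup table. This is a minimal sketch; the `label_for` helper is a hypothetical convenience function, not part of the model or the Transformers API:

```python
# Additional media codes used on top of the CAP major topic codes,
# following Boydstun (2013). Labels copied from the list above.
MEDIA_CODES = {
    24: "State and Local Government Administration",
    26: "Weather and Natural Disaster",
    27: "Fires",
    29: "Sports and Recreation",
    30: "Death Notices",
    31: "Churches and Religion",
    99: "Other, Miscellaneous and Human Interest",
}

def label_for(code: int) -> str:
    """Return the media-code label, or a generic CAP marker for other codes."""
    return MEDIA_CODES.get(code, f"CAP major topic {code}")

print(label_for(27))  # Fires
print(label_for(3))   # CAP major topic 3
```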
## Overall Performance:

* **Accuracy:** 74%
* **Macro Avg:** Precision: 0.76, Recall: 0.74, F1-score: 0.73
* **Weighted Avg:** Precision: 0.76, Recall: 0.74, F1-score: 0.73

## Per-Class Metrics:

| Class | precision | recall | f1-score | support |
|:----------------------------------------------|------------:|---------:|-----------:|------------:|
| 1: Macroeconomics | 0.773585 | 0.82 | 0.796117 | 50 |
| 2: Civil Rights | 0.714286 | 0.6 | 0.652174 | 50 |
| 3: Health | 0.803922 | 0.82 | 0.811881 | 50 |
| 4: Agriculture | 0.857143 | 0.84 | 0.848485 | 50 |
| 5: Labor | 0.666667 | 0.68 | 0.673267 | 50 |
| 6: Education | 0.86 | 0.86 | 0.86 | 50 |
| 7: Environment | 0.829787 | 0.78 | 0.804124 | 50 |
| 8: Energy | 0.851852 | 0.92 | 0.884615 | 50 |
| 9: Immigration | 0.888889 | 0.8 | 0.842105 | 50 |
| 10: Transportation | 0.661765 | 0.9 | 0.762712 | 50 |
| 12: Law and Crime | 0.679245 | 0.72 | 0.699029 | 50 |
| 13: Social Welfare | 0.842105 | 0.64 | 0.727273 | 50 |
| 14: Housing | 0.666667 | 0.8 | 0.727273 | 50 |
| 15: Banking, Finance, and Domestic Commerce | 0.714286 | 0.6 | 0.652174 | 50 |
| 16: Defense | 0.596154 | 0.62 | 0.607843 | 50 |
| 17: Technology | 0.709091 | 0.78 | 0.742857 | 50 |
| 18: Foreign Trade | 0.88 | 0.88 | 0.88 | 50 |
| 19: International Affairs | 0.534483 | 0.62 | 0.574074 | 50 |
| 20: Government Operations | 0.790698 | 0.68 | 0.731183 | 50 |
| 21: Public Lands | 0.808511 | 0.76 | 0.783505 | 50 |
| 23: Culture | 0.678571 | 0.76 | 0.716981 | 50 |
| 24: State and Local Government Administration | 0.587302 | 0.74 | 0.654867 | 50 |
| 26: Weather and Natural Disasters | 0.913043 | 0.84 | 0.875 | 50 |
| 27: Fires | 0.942857 | 0.66 | 0.776471 | 50 |
| 29: Sports and Recreation | 0.843137 | 0.86 | 0.851485 | 50 |
| 30: Death Notices | 0.956522 | 0.88 | 0.916667 | 50 |
| 31: Churches and Religion | 0.782609 | 0.72 | 0.75 | 50 |
| 99: Other, Miscellaneous, and Human Interest | 0.378947 | 0.72 | 0.496552 | 50 |
| 998: No Policy and No Media Content | 0.75 | 0.06 | 0.111111 | 50 |
| accuracy | | | 0.736552 | 1450 |
| macro avg | 0.757315 | 0.736552 | 0.731373 | 1450 |
| weighted avg | 0.757315 | 0.736552 | 0.731373 | 1450 |

## Inference platform

This model is used by the [CAP Babel Machine](https://babel.poltextlab.com), an open-source and free natural language processing tool designed to simplify and speed up projects for comparative research.

## Cooperation

Model performance can be significantly improved by extending our training sets. We appreciate every submission of CAP-coded corpora (of any domain and language) at poltextlab{at}poltextlab{dot}com or through the [CAP Babel Machine](https://babel.poltextlab.com).

## Debugging and issues

This architecture uses the `sentencepiece` tokenizer. To run the model with Transformers versions earlier than `4.27`, you need to install it manually. If you encounter a `RuntimeError` when loading the model with the `from_pretrained()` method, passing `ignore_mismatched_sizes=True` should solve the issue.
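A note on the evaluation numbers: since every class in the per-class table has the same support (50 documents across 29 classes, 1450 in total), the macro and weighted averages necessarily coincide. A minimal sketch verifying this from the F1 column of the table:

```python
# Per-class F1 scores copied from the table above (29 classes, support 50 each).
f1_scores = [
    0.796117, 0.652174, 0.811881, 0.848485, 0.673267, 0.86, 0.804124,
    0.884615, 0.842105, 0.762712, 0.699029, 0.727273, 0.727273, 0.652174,
    0.607843, 0.742857, 0.88, 0.574074, 0.731183, 0.783505, 0.716981,
    0.654867, 0.875, 0.776471, 0.851485, 0.916667, 0.75, 0.496552, 0.111111,
]
supports = [50] * len(f1_scores)

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_scores) / len(f1_scores)
# Weighted average: mean weighted by class support.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro F1    = {macro_f1:.6f}")     # ≈ 0.731373, matching the table
print(f"weighted F1 = {weighted_f1:.6f}")  # identical, since supports are equal
```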