igmochang/CR-biodiversity-preprocessed-sentence-similarity-es

Sentence Similarity

sentence-transformers

feature-extraction

Generated from Trainer

dataset_size:2748

loss:MultipleNegativesRankingLoss

text-embeddings-inference


igmochang committed on Oct 30, 2024

Commit 0bfe83e (verified) · 1 Parent(s): 4cd873d

Update README.md

Added the preprocess function used.

Files changed (1)

README.md +29 -0


@@ -571,6 +571,35 @@ print(similarities.shape)
 # [3, 3]
 ```
 <!--
 ### Direct Usage (Transformers)

 # [3, 3]
 ```
+Preprocess function:
+```python
+import re
+import nltk
+from nltk.corpus import stopwords
+from nltk.stem import SnowballStemmer
+from nltk.tokenize import word_tokenize
+# Download NLTK data, then initialize the Spanish stopwords and stemmer
+nltk.download('punkt')  # recent NLTK releases look for 'punkt_tab' instead
+nltk.download('stopwords')
+spanish_stopwords = set(stopwords.words('spanish'))
+stemmer = SnowballStemmer('spanish')
+# Function for preprocessing text (lowercase, remove punctuation, stopwords, and apply stemming)
+def preprocess_text(text):
+    # Convert to lowercase
+    text = text.lower()
+    # Remove punctuation and special characters, keeping ¿, ? and %
+    text = re.sub(r'[^\w\s¿?%]', '', text)
+    # Tokenize
+    words = word_tokenize(text)
+    # Remove stopwords and apply stemming
+    words = [stemmer.stem(word) for word in words if word not in spanish_stopwords]
+    # Rejoin the words
+    return ' '.join(words)
+```
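To see what the character filter in `preprocess_text` keeps and drops, here is a minimal standard-library sketch of just the lowercase and regex steps. It leaves out tokenization, stopword removal, and stemming, so it does not reproduce the full function. The helper name `strip_punctuation` and the sample sentence are hypothetical and are not part of the commit.

```python
import re

# Same character class as in preprocess_text: keep word characters,
# whitespace, and the symbols ¿ ? %; drop everything else.
PUNCTUATION_PATTERN = re.compile(r'[^\w\s¿?%]')


def strip_punctuation(text):
    """Hypothetical helper: only the lowercase + regex steps of preprocess_text."""
    return PUNCTUATION_PATTERN.sub('', text.lower())


# Hypothetical sample sentence, chosen to exercise kept and dropped symbols.
sample = "¿Cuántas especies de aves hay en Costa Rica? ¡Más del 10%!"
print(strip_punctuation(sample))
# → ¿cuántas especies de aves hay en costa rica? más del 10%
```

Accented letters survive because `\w` matches Unicode word characters in Python 3, while `¡` and `!` are removed because they are not in the character class. Underscores are also kept, since `\w` includes `_`.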
 <!--
 ### Direct Usage (Transformers)