Commit 6468dfb by Vivian · 1 parent: 95c8add
update
markdown/solution.md +5 -2
markdown/solution.md CHANGED
@@ -1,6 +1,9 @@
 ### Solution
-To
+To build the training data, I scraped tweets from Twitter's trending topics and filtered out tweets in other languages, keeping only tweets in English. The training dataset contains more than 30k tweets.
+
+
+Then, a BERT model is trained on the fill-mask task to solve this problem.
 Some special modifications I made to the model are as follows:
-1. Hashtags usually consist of multiple words without spaces. When these hashtags are tokenized, they are split into separate words and therefore can no longer form a hashtag. Therefore, I added the top 1000 trending hashtags to the token dictionary and assigned a dedicated token_id to each hashtag.
+1. Hashtags usually consist of multiple words without spaces. When these hashtags are tokenized, they are split into separate words and therefore can no longer form a hashtag when we decode these tokens. For example, "#TheFirstLady" passed to a regular tokenizer will be split into "#", "The", "First", "#Lady". Therefore, I added the top 1000 trending hashtags to the token dictionary and assigned a dedicated token_id to each hashtag.
 2. When masking the tweets during training, I intentionally mask the tokens that are hashtags, so that the model learns specifically to predict the hashtags at those positions.
 3. After training the model, we limit the potential candidates for [MASK] to the topics that exist on Twitter, to get the relevant hashtags the user can add.
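
The three modifications listed in the diff can be illustrated with a small, dependency-free sketch. This is not the author's code: the real solution uses a BERT tokenizer and model, while the toy tokenizer, the three-entry hashtag list, the function names (`tokenize`, `mask_hashtags`, `restrict_candidates`), and the candidate scores below are all assumptions made up for illustration.

```python
import random
import re

MASK = "[MASK]"

# Hypothetical stand-in for the "top 1000 trending hashtags" list.
TRENDING_HASHTAGS = ["#TheFirstLady", "#Oscars", "#NBAFinals"]


def tokenize(text, hashtags=()):
    """Split text into tokens, keeping any known hashtag as a single token."""
    known = set(hashtags)
    tokens = []
    for word in text.split():
        if word in known:
            tokens.append(word)  # whole hashtag has its own vocabulary entry
        else:
            # Plain word/punctuation split; a hashtag breaks into "#" + rest.
            tokens.extend(re.findall(r"\w+|[^\w\s]", word))
    return tokens


def mask_hashtags(tokens, hashtags, rng, mask_prob=1.0):
    """Replace hashtag tokens with [MASK]; return masked tokens and labels."""
    known = set(hashtags)
    masked, labels = [], []
    for tok in tokens:
        if tok in known and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # target the model should recover
        else:
            masked.append(tok)
            labels.append(None)  # position not used as a training target
    return masked, labels


def restrict_candidates(scores, allowed, top_k=3):
    """Keep only allowed hashtags from a token -> score map, best first."""
    kept = [(tok, s) for tok, s in scores.items() if tok in allowed]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in kept[:top_k]]


tweet = "Watching #TheFirstLady tonight"
plain = tokenize(tweet)                     # hashtag is split apart
aware = tokenize(tweet, TRENDING_HASHTAGS)  # hashtag stays whole
masked, labels = mask_hashtags(aware, TRENDING_HASHTAGS, random.Random(0))

# Made-up scores standing in for a model's predictions at the [MASK] position.
scores = {"the": 0.40, "#TheFirstLady": 0.30, "#Oscars": 0.25, "#Unknown": 0.05}
suggestions = restrict_candidates(scores, set(TRENDING_HASHTAGS))
print(plain, aware, masked, suggestions, sep="\n")
```

With Hugging Face Transformers, the first step is commonly done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the author used exactly these calls is not stated in the commit.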