Spaces: vivianhuang88/hashtag_rec

Vivian committed on Apr 19, 2022

Commit be4e820

1 Parent(s): 0322c01

update

Files changed (3)

app.py CHANGED

@@ -9,6 +9,8 @@ st.set_page_config(layout='wide',
 def main():
     st.title('Twitter Hashtag Recommender')
+    st.markdown('markdown/overview.md')
+    st.markdown('markdown/critical.md')
     # about_stripnet = read_md('markdown/about_stripnet.md')
     # st.markdown(about_stripnet)
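
Note that `st.markdown` renders the string it receives as Markdown, so the two added calls would display the literal paths rather than the file contents. The commented-out lines call a `read_md` helper whose body is not shown in this hunk. A minimal sketch of what such a helper might look like, assuming it only needs to read the file as UTF-8 text (the body below is hypothetical, not the repository's code):

```python
from pathlib import Path


def read_md(path: str) -> str:
    """Return the text of a Markdown file so it can be handed to st.markdown.

    Hypothetical body: the real helper in this repository is not shown in the diff.
    """
    return Path(path).read_text(encoding="utf-8")
```

With a helper like this, the added lines would presumably become `st.markdown(read_md('markdown/overview.md'))` and `st.markdown(read_md('markdown/critical.md'))`.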

markdown/critical.md ADDED

+## Critical Analysis
+1. For efficiency, we added only the top 1000 hashtag topics to our vocabulary. However, the actual number of potential hashtags is likely in the millions. Adding all of them to the vocabulary would greatly increase the computational cost of this approach, but leaving them out shrinks the set of candidate hashtags that users can choose from. In addition, a topic cannot be predicted if the model was not trained on it.
+2. The training data is small as well: its size is about 30k.
+3. A future modification to this model might be to add weights to different topics. For example, more recent topics would be weighted higher than older ones.
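
The recency weighting suggested in the last item could take many forms. As one illustration, here is a minimal sketch that rescales each hashtag's model score by an exponential decay on its age. The function names, the half-life of 7 days, and the idea of multiplying the score by the weight are all assumptions made for this example, not something the project implements.

```python
from typing import Dict, List


def recency_weight(age_days: float, half_life_days: float = 7.0) -> float:
    """Exponential decay: the weight halves every `half_life_days` (assumed value)."""
    return 0.5 ** (age_days / half_life_days)


def rerank_by_recency(
    scores: Dict[str, float],
    ages_days: Dict[str, float],
    half_life_days: float = 7.0,
) -> List[str]:
    """Sort hashtags by model score multiplied by their recency weight."""
    weighted = {
        tag: score * recency_weight(ages_days[tag], half_life_days)
        for tag, score in scores.items()
    }
    return sorted(weighted, key=weighted.get, reverse=True)
```

For example, a hashtag scored 0.6 that is two half-lives old ends up at 0.15 and ranks below a brand-new hashtag scored 0.5.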

markdown/overview.md ADDED

+## Overview
+### Background
+On Twitter, adding a '#' to the beginning of a word or phrase creates a hashtag. When people click on a hashtag, they jump to a topic with a list of tweets related to it. Hashtags on Twitter help people easily follow topics that they are interested in.
+### Problem
+When people send out a tweet and want to add a hashtag to it, the hashtag list that pops up is neither closely related to the context of the tweet nor sorted by trending rank.
+### Solution
+To solve this problem, I trained a BERT model and used the fill-mask task.
+Some special modifications I made to the model are as follows:
+1. Hashtags usually consist of multiple words without spaces. When tokenized, such a hashtag is split into several tokens and therefore can no longer form an existing hashtag. Therefore, I added the top 1000 trending hashtags to the token dictionary and assigned a special token_id to each hashtag.
+2. When masking the tweets during training, I intentionally masked the tokens that are hashtags, so that the model learns to predict hashtag positions specifically.
+3. After training the model, we limit the potential candidates for [MASK] to the topics that exist on Twitter, to get the relevant hashtags the user can add.
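
The masking and candidate-filtering steps above can be illustrated without any model. The sketch below is a simplified, library-free version: it assumes tweets are already split into tokens with known hashtags kept whole (step 1), and that the model's output at a `[MASK]` position is available as a token-to-score mapping. The function names and data shapes are hypothetical. If the project uses Hugging Face Transformers, which the source does not state, step 1 would typically be done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
from typing import Dict, List, Optional, Set, Tuple

MASK = "[MASK]"


def mask_hashtags(
    tokens: List[str], hashtag_vocab: Set[str]
) -> Tuple[List[str], List[Optional[str]]]:
    """Step 2: replace each known hashtag with [MASK].

    Returns the masked tokens plus per-position labels: the original
    hashtag where a mask was placed, None everywhere else.
    """
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok in hashtag_vocab:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def top_hashtags(
    mask_scores: Dict[str, float], hashtag_vocab: Set[str], k: int = 3
) -> List[str]:
    """Step 3: drop non-hashtag candidates, then return the k best-scoring hashtags."""
    candidates = {t: s for t, s in mask_scores.items() if t in hashtag_vocab}
    return sorted(candidates, key=candidates.get, reverse=True)[:k]
```

A possible usage, with made-up scores:

```python
vocab = {"#nbafinals", "#oscars"}
masked, labels = mask_hashtags("so excited for #nbafinals tonight".split(), vocab)
scores = {"the": 0.4, "#nbafinals": 0.3, "game": 0.2, "#oscars": 0.1}
suggestions = top_hashtags(scores, vocab, k=2)
```

Filtering after scoring keeps the model unchanged and only narrows what is shown to the user, which matches the description of limiting `[MASK]` candidates to existing topics.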