Corpus and a baseline neural network system for Named Entity Recognition in Hindi-English Code-Mixed social media text.
We have created a dataset of Hindi-English code-mixed social media text (tweets) for the task of Named Entity Recognition. The tweets are pre-processed and annotated with six NER tags and a seventh `Other` tag.
Example:
#Word | #Tag |
---|---|
Bharat | B-Loc |
ke | Other |
2016 | Other |
ke | Other |
Demonetization | Other |
mein | Other |
kitna | Other |
kala | Other |
dhan | Other |
real | Other |
mein | Other |
aaya | Other |
??? | Other |
Accha | Other |
hua | Other |
ye | Other |
prashna | Other |
Miss | B-Per |
Word | I-Per |
Chillar | I-Per |
ko | Other |
nahi | Other |
puccha | Other |
gaya | Other |
0 | Other |
#misschillar | B-Per |
#missworld | Other |
#Demonetisation | Other |
#notebandi | Other |
#modi | B-Per |
#bjp | B-Org |
#gujrat | B-Loc |
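
The tags in the example follow a begin/inside convention: a `B-` tag opens an entity and any following `I-` tags of the same type continue it. As a minimal sketch of how such word/tag pairs could be grouped into entity spans (the function name `extract_entities` is hypothetical and not part of this repository, and treating a stray `I-` tag as `Other` is an assumption):

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group (token, tag) pairs into (entity text, entity type) spans."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type = None

    for word, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words = [word]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the currently open entity.
            current_words.append(word)
        else:
            # "Other", or an I- tag with no matching open entity
            # (assumption: such a stray tag is treated like "Other").
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words = []
            current_type = None

    # Close an entity that runs to the end of the tweet.
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# A few rows taken from the example table above.
tokens = ["Bharat", "ke", "Miss", "Word", "Chillar", "ko", "#modi", "#bjp"]
tags = ["B-Loc", "Other", "B-Per", "I-Per", "I-Per", "Other", "B-Per", "B-Org"]
print(extract_entities(tokens, tags))
```

Two adjacent `B-` tags, such as `#modi` and `#bjp`, produce two separate single-token entities.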
- The `TwitterData` folder contains the IDs of the scraped tweets inside the `Scrapped` folder, together with the processed and annotated data, named accordingly.
- `score` reports all the required statistics.
- `Decision Tree` model with an F1-score of 0.94.
- `Conditional Random Field (CRF)` model with an F1-score of 0.95.
- `LSTM` model with an F1-score of 0.95.

LTRC, IIIT-Hyderabad
Named Entity Recognition for Hindi-English Code-Mixed Social Media Text
Proceedings of the Seventh Named Entities Workshop, 2018, pp. 27-35
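
The F1-scores reported for the three models can be illustrated with a small token-level computation. This is only a sketch: the helper name `f1_per_tag` is hypothetical, and whether the repository scores at token level, or how it averages across tags, is an assumption that is not confirmed here.

```python
from collections import Counter
from typing import Dict, List


def f1_per_tag(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token-level F1 for each tag, given aligned gold and predicted tags."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it does not belong
            fn[g] += 1  # missed an occurrence of gold tag g

    scores: Dict[str, float] = {}
    for tag in set(gold) | set(pred):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        total = precision + recall
        scores[tag] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy tags for illustration only; these are not results from the dataset.
gold = ["B-Per", "I-Per", "Other", "B-Loc"]
pred = ["B-Per", "Other", "Other", "B-Loc"]
for tag, f1 in sorted(f1_per_tag(gold, pred).items()):
    print(f"{tag}: {f1:.2f}")
```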