A siamese LSTM to detect similar sentence/question pairs.
A siamese LSTM based approach to identifying similar sentence/question pairs. Built for the Quora Question Pairs challenge on Kaggle, but it can be trained on any similar-sentence dataset.
Files

generate_stemmed_vectors.py: processes the word vectors. Stems the word associated with each vector and generates the final word list.

vectorize_input.py: contains the bulk of the preprocessing, including cleaning the data and generating the sentence vectors.

model.py: contains the actual model and the batching function.

Usage

First, generate the stemmed word vectors:

python generate_stemmed_vectors.py "[path_to_numberbatch_embeddings]"

Then vectorize the input, passing the train.csv dataset as an argument (this will take quite some time):

python vectorize_input.py "[path to train dataset]"

Finally, run model.py to train the model.
The word embeddings I used were the ConceptNet Numberbatch embeddings. I chose them mainly for their small size, but they can be switched out for any other word embeddings; I would personally recommend Google's pretrained Word2Vec vectors.
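The stemming step keys each embedding by the stem of its word, so inflected forms share a vector. A minimal sketch of that idea is below; the function name, the first-occurrence-wins rule, and the toy stemmer are my assumptions, not the repository's exact code (generate_stemmed_vectors.py would use a real stemmer such as NLTK's SnowballStemmer and read the Numberbatch text file).

```python
import numpy as np

def load_and_stem_vectors(lines, stem):
    """Parse embedding lines of the form "word v1 v2 ..." and key them by stem.

    Keeps the first vector seen for each stem; the actual tie-breaking
    rule in generate_stemmed_vectors.py may differ.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], np.array(parts[1:], dtype=np.float32)
        key = stem(word)
        if key not in vectors:  # first occurrence wins (assumption)
            vectors[key] = values
    return vectors

# Toy stemmer for illustration only: strips a "ning" suffix.
toy_stem = lambda w: w[:-4] if w.endswith("ning") else w
```

With real Numberbatch data you would stream the file line by line instead of holding a list in memory.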
I cleaned the dataset to expand sentence contractions, using the list of contractions given by lystdo on Kaggle. Standard preprocessing steps, such as removing outlier sentences by length and stemming, were also applied.
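The cleaning step can be sketched as follows. The contraction map here is a small illustrative subset, not lystdo's full Kaggle list, and the exact regex and ordering are my assumptions.

```python
import re

# Illustrative subset of a contraction map like the one lystdo shared on Kaggle.
CONTRACTIONS = {
    "can't": "cannot", "won't": "will not", "i'm": "i am",
    "it's": "it is", "don't": "do not",
}

def clean_text(text):
    """Lower-case, expand contractions, and strip punctuation (sketch)."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only letters, digits, and spaces; collapse repeated whitespace.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())
```

Stemming and length-based outlier removal would then run on the cleaned tokens.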
I used stacked LSTM cells with tied weights and dropout. The two sentences/questions were fed into the two parallel networks, which share their weights, and the loss was calculated from their outputs using Yann LeCun's contrastive loss function. The optimizer used is Adam.
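The contrastive loss (Hadsell, Chopra and LeCun, 2006) operates on the distance between the two networks' outputs: similar pairs are pulled together, dissimilar pairs pushed apart beyond a margin. A framework-agnostic NumPy sketch is below; the label convention (1 = similar pair) and the default margin are my assumptions and may differ from the repository's model.py.

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss on a distance d between two sentence encodings.

    d:      Euclidean distance between the two LSTM outputs.
    y:      1 for a similar (duplicate) pair, 0 for a dissimilar pair
            (label convention is an assumption).
    margin: dissimilar pairs incur no loss once d exceeds this value.
    """
    similar_term = y * 0.5 * d ** 2
    dissimilar_term = (1 - y) * 0.5 * np.maximum(0.0, margin - d) ** 2
    return similar_term + dissimilar_term
```

In training, d would be the distance between the final hidden states of the two tied-weight LSTM stacks, and the loss would be minimized with Adam.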