A siamese LSTM to detect similar sentence/question pairs.
A siamese LSTM based approach to identifying similar sentence/question pairs. Built for the Quora Question Pairs challenge on Kaggle, but it can be trained on any similar-sentence dataset.
Files

generate_stemmed_vectors.py: processes the word vectors. Stems the word associated with each vector and generates the final word list.

vectorize_input.py: contains the bulk of the preprocessing, including cleaning the data and generating the sentence vectors.

model.py: contains the actual model and the batching function.

Usage

First, generate the stemmed word vectors:

python generate_stemmed_vectors.py "[path_to_numberbatch_embeddings]"

Then vectorize the input, passing the train.csv dataset as an argument (this will take quite some time):

python vectorize_input.py "[path to train dataset]"

Finally, run model.py to train the model.
The word embeddings I used were the ConceptNet Numberbatch embeddings. I chose them mainly for their small size, but they can be switched out for any other word embeddings; I would personally recommend Google's pretrained Word2Vec vectors.
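The stemming step keys each embedding by the stem of its word, so inflected forms share a vector. A minimal sketch of that idea is below; the function name, the first-occurrence-wins rule, and the toy stemmer are my assumptions, not the repository's exact code (generate_stemmed_vectors.py would use a real stemmer such as NLTK's SnowballStemmer and read the Numberbatch text file).

```python
import numpy as np

def load_and_stem_vectors(lines, stem):
    """Parse embedding lines of the form "word v1 v2 ..." and key them by stem.

    Keeps the first vector seen for each stem; the actual tie-breaking
    rule in generate_stemmed_vectors.py may differ.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], np.array(parts[1:], dtype=np.float32)
        key = stem(word)
        if key not in vectors:  # first occurrence wins (assumption)
            vectors[key] = values
    return vectors

# Toy stemmer for illustration only: strips a "ning" suffix.
toy_stem = lambda w: w[:-4] if w.endswith("ning") else w
```

With real Numberbatch data you would stream the file line by line instead of holding a list in memory.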
I cleaned the dataset to expand sentence contractions, using the list of contractions given by lystdo on Kaggle. Standard preprocessing steps, such as removing outlier sentences by length and stemming, were also applied.
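The cleaning step can be sketched as follows. The contraction map here is a small illustrative subset, not lystdo's full Kaggle list, and the exact regex and ordering are my assumptions.

```python
import re

# Illustrative subset of a contraction map like the one lystdo shared on Kaggle.
CONTRACTIONS = {
    "can't": "cannot", "won't": "will not", "i'm": "i am",
    "it's": "it is", "don't": "do not",
}

def clean_text(text):
    """Lower-case, expand contractions, and strip punctuation (sketch)."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only letters, digits, and spaces; collapse repeated whitespace.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())
```

Stemming and length-based outlier removal would then run on the cleaned tokens.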
I used stacked LSTM cells with tied weights and dropout. The two sentences/questions were fed into the two parallel networks, which share their weights, and the loss was calculated from their outputs using Yann LeCun's contrastive loss function. The optimizer used is Adam.
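The contrastive loss (Hadsell, Chopra and LeCun, 2006) operates on the distance between the two networks' outputs: similar pairs are pulled together, dissimilar pairs pushed apart beyond a margin. A framework-agnostic NumPy sketch is below; the label convention (1 = similar pair) and the default margin are my assumptions and may differ from the repository's model.py.

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss on a distance d between two sentence encodings.

    d:      Euclidean distance between the two LSTM outputs.
    y:      1 for a similar (duplicate) pair, 0 for a dissimilar pair
            (label convention is an assumption).
    margin: dissimilar pairs incur no loss once d exceeds this value.
    """
    similar_term = y * 0.5 * d ** 2
    dissimilar_term = (1 - y) * 0.5 * np.maximum(0.0, margin - d) ** 2
    return similar_term + dissimilar_term
```

In training, d would be the distance between the final hidden states of the two tied-weight LSTM stacks, and the loss would be minimized with Adam.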