An enhanced RCNN model for sentence similarity classification (mainly based on the Enhanced-RCNN model, along with other baselines).
To clone this project, make sure git-lfs is installed, then use the following command:
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/daviddwlee84/SentenceSimilarity.git
Quick Execute All
# Data preprocessing
./all_data_preprocess.sh
# Train & Evaluate
./train_all_data_at_once.sh [model name]
# Test Ant Submission functionality
bash run.sh raw_data/competition_train.csv ant_test_pred.csv
# pack the Ant Submission files
zip -r AntSubmit.zip . -i \*.py \*.sh -i data/stopwords.txt
Usage
# Data preprocessing
## Ant
python3 ant_preprocess.py [word/char] train
## CCKS
python3 ccks_preprocess.py
## PiPiDai
python3 pipidai_preprocess.py
# Train & Evaluate
## Chinese
python3 run.py --dataset [Ant/CCKS/PiPiDai] --model [model name] --word-segment [word/char]
# train all models at once with ./train_all_data_at_once.sh
## English
python3 run.py --dataset Quora --model [model name]
# Use Tensorboard
tensorboard --logdir log/same_as_model_log_dir
## remote connection (forward a local port to the remote port); run this on the local machine
## you should then be able to access TensorBoard at http://localhost:$LOCAL_PORT
ssh -NfL $LOCAL_PORT:localhost:$REMOTE_PORT $REMOTE_USER@$REMOTE_IP > /dev/null 2>&1
### to close the connection (kill the ssh command running in the background)
ps aux | grep "ssh -NfL" | grep -v grep | awk '{print $2}' | xargs kill
Model
ERCNN (default)
Transformer
SiameseCNN
SiameseRNN
SiameseLSTM
SiameseRCNN
SiameseAttentionRNN
MPCNN
MPLSTM: skip
BiMPM
ESIM
Dataset
Ant - Chinese
CCKS - Chinese
PiPiDai - Chinese (encoded)
Quora - English
Mode
train
test
both (include train and test)
predict
Sampling
random (Original): data is skewed (the ratio is listed below)
balance: positive vs. negative data will be the same (see the sketch after this list)
generate-train
generate-test
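For intuition, here is a minimal sketch of what balance sampling amounts to. This is a hypothetical helper, not the repo's BalanceDataHelper; the --generate-train/--generate-test flags suggest the real helper may instead generate extra negative samples rather than down-sample.

import random

def balanced_sample(pairs, labels, seed=16):
    # hypothetical sketch: equalize positives and negatives by down-sampling the majority class
    random.seed(seed)
    pos = [(p, l) for p, l in zip(pairs, labels) if l == 1]
    neg = [(p, l) for p, l in zip(pairs, labels) if l == 0]
    n = min(len(pos), len(neg))
    sampled = random.sample(pos, n) + random.sample(neg, n)
    random.shuffle(sampled)
    return [p for p, _ in sampled], [l for _, l in sampled]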
$ python3 run.py --help
usage: run.py [-h] [--dataset dataset] [--mode mode] [--sampling mode]
[--generate-train] [--generate-test] [--model model]
[--word-segment WS] [--batch-size N] [--test-batch-size N]
[--k-fold N] [--lr N] [--beta1 N] [--beta2 N] [--epsilon N]
[--no-cuda] [--seed N] [--test-split N] [--log-interval N]
[--test-interval N] [--not-save-model]
Enhanced RCNN on Sentence Similarity
optional arguments:
-h, --help show this help message and exit
--dataset dataset Chinese: Ant, CCKS; English: Quora (default: Ant)
--mode mode script mode [train/test/both/predict/submit(Ant)]
(default: both)
--sampling mode sampling mode during training (default: random)
--generate-train use generated negative samples when training (used in
balance sampling)
--generate-test use generated negative samples when testing (used in
balance sampling)
--model model model to use [ERCNN/Transformer/Siamese(CNN/RNN/LSTM/R
CNN/AttentionRNN)] (default: ERCNN)
--word-segment WS chinese word split mode [word/char] (default: char)
--chinese-embed embed chinese embedding (default: cw2vec)
--not-train-embed whether to freeze the embedding parameters
--batch-size N input batch size for training (default: 256)
--test-batch-size N input batch size for testing (default: 1000)
--k-fold N k-fold cross validation i.e. number of epochs to train
(default: 10)
--lr N learning rate (default: 0.001)
--beta1 N beta 1 for Adam optimizer (default: 0.9)
--beta2 N beta 2 for Adam optimizer (default: 0.999)
--epsilon N epsilon for Adam optimizer (default: 1e-08)
--no-cuda disables CUDA training
--seed N random seed (default: 16)
--test-split N test data split (default: 0.3)
--logdir path set log directory (default: ./log)
--log-interval N how many batches to wait before logging training
status
--test-interval N how many batches to test during training
--not-save-model for not saving the current model
--load-model name load the specific model checkpoint file
--submit-path path submission file path (currently for Ant dataset)
Related Additional Datasets
Original
raw_data/competition_train.csv - Ant Financial
raw_data/train.csv - Quora Question Pairs
word2vec/substoke_char.vec.avg - Ant Financial
word2vec/substoke_word.vec.avg - Ant Financial
data/stopwords.txt - Ant Financial
word2vec/glove.word2vec.txt - Quora Question Pairs
raw_data/task3_train.txt - CCKS 2018
raw_data/task3_dev.txt - CCKS 2018
wget http://nlp.stanford.edu/data/glove.840B.300d.zip
unzip glove.840B.300d.zip
from gensim.scripts.glove2word2vec import glove2word2vec
_ = glove2word2vec('glove.840B.300d.txt', 'word2vec/glove.word2vec.txt')
rm glove.840B*
Generated
data/sentence_char_train.csv - Ant Financial
data/sentence_word_train.csv - Ant Financial
word2vec/Ant_char_tokenizer.pickle - Ant Financial
word2vec/Ant_char_embed_matrix.pickle - Ant Financial
word2vec/Ant_word_tokenizer.pickle - Ant Financial
word2vec/Ant_word_embed_matrix.pickle - Ant Financial
word2vec/Quora_tokenizer.pickle - Quora Question Pairs
word2vec/Quora_embed_matrix.pickle - Quora Question Pairs
model/*
log/*
jupyter notebook DataAnalysis.ipynb
Goal: classify whether two question sentences are asking the same thing => predict true or false
Evaluation: f1-score
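As a quick illustration (using scikit-learn, which is an assumption here and not necessarily the official scorer), the F1-score can be computed as:

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(f1_score(y_true, y_pred))  # 0.8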
Data
kaggle competitions download -c quora-question-pairs
unzip test.csv -d raw_data
unzip train.csv -d raw_data
rm *.zip
Goal: classify whether question pairs are duplicates or not => predict the probability that the questions are duplicates (a number between 0 and 1)
Evaluation: log loss between the predicted values and the ground truth
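Similarly, an illustrative log-loss computation with scikit-learn (again an assumption, not the competition's own scorer):

from sklearn.metrics import log_loss

y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.7]  # predicted probability that each pair is a duplicate
print(log_loss(y_true, y_prob))  # ~0.228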
Data
CCKS: China Conference on Knowledge Graph and Semantic Computing
Data
The data can only be obtained by contacting the competition organizers.
The original download link is no longer valid.
The balance sampling is implemented by the class BalanceDataHelper in data_prepare.py.
Dice loss (an alternative approach):
import numpy as np
import torch
import torch.nn as nn


# NOTE: the original snippet is only the body of the loss computation; the class name,
# the forward() signature, and the final return are assumed here to make it runnable.
class GoldenWeightedCrossEntropy(nn.Module):
    def __init__(self, epsilon=1e-8):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, y_pred, golden, seq_mask, weight=None, mode=True):
        # y_pred: (B, T, C) logits, golden: (B, T) labels, seq_mask: (B, T) float mask
        if weight is None:
            weight = torch.ones(
                y_pred.shape[-1], dtype=torch.float).to(device=y_pred.device)  # (C)
        if not mode:
            # plain masked cross entropy; defined elsewhere in the original code
            return self.simple_cross_entry(y_pred, golden, seq_mask, weight)
        probs = nn.functional.softmax(y_pred, dim=2)  # (B, T, C)
        B, T, C = probs.shape
        golden_index = golden.unsqueeze(dim=2)  # (B, T, 1)
        golden_probs = torch.gather(
            probs, dim=2, index=golden_index)  # (B, T, 1)
        probs_in_package = golden_probs.expand(B, T, T).transpose(1, 2)
        packages = np.array([np.eye(T)] * B)  # (B, T, T)
        probs_in_package = probs_in_package * \
            torch.tensor(packages, dtype=torch.float).to(device=probs.device)
        max_probs_in_package, _ = torch.max(probs_in_package, dim=2)
        golden_probs = golden_probs.squeeze(dim=2)
        # per-token weight: confidence on the gold label relative to the package maximum
        golden_weight = golden_probs / (max_probs_in_package)  # (B, T)
        golden_weight = golden_weight.view(-1)
        golden_weight = golden_weight.detach()
        y_pred = y_pred.view(-1, C)
        golden = golden.view(-1)
        seq_mask = seq_mask.view(-1)
        negative_label = torch.tensor(
            [0] * (B * T), dtype=torch.long, device=y_pred.device)
        golden_loss = nn.functional.cross_entropy(
            y_pred, golden, weight=weight, reduction='none')
        negative_loss = nn.functional.cross_entropy(
            y_pred, negative_label, weight=weight, reduction='none')
        # interpolate between the gold-label loss and the negative-class loss per token
        loss = golden_weight * golden_loss + \
            (1 - golden_weight) * negative_loss  # (B * T)
        loss = torch.dot(loss, seq_mask) / (torch.sum(seq_mask) + self.epsilon)
        return loss
Triplet-Loss
N-pair Loss
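As a reference for the triplet formulation, here is a sketch using PyTorch's built-in nn.TripletMarginLoss; this is not code from this repo, only an illustration of the idea.

import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.randn(32, 128, requires_grad=True)    # e.g. embeddings of the anchor sentence
positive = torch.randn(32, 128, requires_grad=True)  # embeddings of a similar sentence
negative = torch.randn(32, 128, requires_grad=True)  # embeddings of a dissimilar sentence
loss = triplet_loss(anchor, positive, negative)
loss.backward()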
# this will create an env_name folder in the current directory
virtualenv --python=/path/to/python3.x env_name
# activate the environment
source ./env_name/bin/activate
Add aliases in .bashrc
alias davidlee="cd /home/username/working_dir; source env_name/bin/activate"
alias pipp="pip install -i https://pypi.tuna.tsinghua.edu.cn/simple"
Install Jupyter Notebook and use the virtualenv kernel
pip3 install jupyterlab
python3 -m ipykernel install --user --name=python3.6virtualenv
jupyter notebook
Then choose the python3.6virtualenv kernel in the notebook.
Tensor dtype: torch.LongTensor(var) == torch.tensor(var, dtype=torch.long) (Tensor vs. LongTensor)
A custom nn.Module defines __init__() with super() and the forward() method
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
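Putting those notes together, a minimal hypothetical module (not one of this repo's models) with __init__()/super(), forward(), and the Adam defaults:

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):  # hypothetical example
    def __init__(self, in_dim=300, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, x):
        return self.fc(x)

model = TinyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999),
                             eps=1e-08, weight_decay=0, amsgrad=False)
x = torch.randn(4, 300)                            # float input features
y = torch.tensor([0, 1, 1, 0], dtype=torch.long)   # same as torch.LongTensor([0, 1, 1, 0])
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()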
Summary
Model Source Code
Siamese-CNN, Siamese-RNN, Siamese-LSTM, Siamese-RCNN, Siamese-Attention-RCNN - Contrastive Loss
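PyTorch has no built-in contrastive loss; a common formulation is sketched below. This is an illustration under that assumption, not necessarily the exact loss used by the Siamese models here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    # hinge-style contrastive loss: pull similar pairs together, push dissimilar pairs apart
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin

    def forward(self, out1, out2, label):
        # label == 1 for similar pairs, 0 for dissimilar pairs
        dist = F.pairwise_distance(out1, out2)
        loss = label * dist.pow(2) + \
            (1 - label) * torch.clamp(self.margin - dist, min=0).pow(2)
        return loss.mean()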
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
This happens because nn.Module objects stored in a plain Python list are not registered as submodules, so .to(device) does not move them automatically.
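A likely fix, shown as a sketch rather than the repo's actual code: submodules kept in a plain list are invisible to .to(device) and .parameters(), while registering them with nn.ModuleList solves both.

import torch
import torch.nn as nn

class Stacked(nn.Module):
    def __init__(self):
        super().__init__()
        # self.layers = [nn.Linear(10, 10) for _ in range(3)]  # NOT registered -> stays on CPU
        self.layers = nn.ModuleList(nn.Linear(10, 10) for _ in range(3))  # registered

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Stacked().to(device)  # all three Linear layers move with the model
out = model(torch.randn(2, 10, device=device))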
Due to the Git LFS bandwidth quota limit, you may have trouble cloning this project.
git lfs clone --depth=1 https://github.com/daviddwlee84/SentenceSimilarity.git
git config -f .lfsconfig lfs.url https://gitlab.com/daviddwlee84/SentenceSimilarity.git/info/lfs
# https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class Attention(nn.Module):
def __init__(self,
enc_hid_dim: int,
dec_hid_dim: int,
attn_dim: int):
super().__init__()
self.enc_hid_dim = enc_hid_dim
self.dec_hid_dim = dec_hid_dim
self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
self.attn = nn.Linear(self.attn_in, attn_dim)
def forward(self,
decoder_hidden: Tensor,
encoder_outputs: Tensor) -> Tensor:
src_len = encoder_outputs.shape[0]
repeated_decoder_hidden = decoder_hidden.unsqueeze(
1).repeat(1, src_len, 1)
encoder_outputs = encoder_outputs.permute(1, 0, 2)
energy = torch.tanh(self.attn(torch.cat((
repeated_decoder_hidden,
encoder_outputs),
dim=2)))
attention = torch.sum(energy, dim=2)
return F.softmax(attention, dim=1)
# https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
# MAX_LENGTH and device are defined earlier in that tutorial
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnDecoderRNN(nn.Module):
def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
super(AttnDecoderRNN, self).__init__()
self.hidden_size = hidden_size
self.output_size = output_size
self.dropout_p = dropout_p
self.max_length = max_length
self.embedding = nn.Embedding(self.output_size, self.hidden_size)
self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
self.dropout = nn.Dropout(self.dropout_p)
self.gru = nn.GRU(self.hidden_size, self.hidden_size)
self.out = nn.Linear(self.hidden_size, self.output_size)
def forward(self, input, hidden, encoder_outputs):
embedded = self.embedding(input).view(1, 1, -1)
embedded = self.dropout(embedded)
attn_weights = F.softmax(
self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0),
encoder_outputs.unsqueeze(0))
output = torch.cat((embedded[0], attn_applied[0]), 1)
output = self.attn_combine(output).unsqueeze(0)
output = F.relu(output)
output, hidden = self.gru(output, hidden)
output = F.log_softmax(self.out(output[0]), dim=1)
return output, hidden, attn_weights
def initHidden(self):
return torch.zeros(1, 1, self.hidden_size, device=device)
# https://www.kaggle.com/mlwhiz/attention-pytorch-and-keras
import torch
import torch.nn as nn


class Attention(nn.Module):
def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
super(Attention, self).__init__(**kwargs)
self.supports_masking = True
self.bias = bias
self.feature_dim = feature_dim
self.step_dim = step_dim
self.features_dim = 0
weight = torch.zeros(feature_dim, 1)
nn.init.kaiming_uniform_(weight)
self.weight = nn.Parameter(weight)
if bias:
self.b = nn.Parameter(torch.zeros(step_dim))
def forward(self, x, mask=None):
feature_dim = self.feature_dim
step_dim = self.step_dim
eij = torch.mm(
x.contiguous().view(-1, feature_dim),
self.weight
).view(-1, step_dim)
if self.bias:
eij = eij + self.b
eij = torch.tanh(eij)
a = torch.exp(eij)
if mask is not None:
a = a * mask
a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)
weighted_input = x * torch.unsqueeze(a, -1)
return torch.sum(weighted_input, 1)