NLP & Bahasa Resources
A collection of datasets, theses, papers, and articles on Natural Language Processing (NLP) for Bahasa Indonesia. Inspired by those who came before.
TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced.
The corpus is free for academic and non-commercial use, but interested parties should make a formal request by email to the institution. The procedure is listed here.
frankydotid/Indonesian-Speech-Recognition. A small corpus of 50 utterances by a single male speaker.
Frequent Term based Text Summarization for Bahasa Indonesia
M. Fachrurrozi, Novi Yusliani, and Rizky Utami Yoanita. International Conference on Innovations in Engineering and Technology (ICIET'2013), Dec. 25-26, 2013, Bangkok (Thailand).
Shavitri, Shelly. Undergraduate thesis in Computer Science, University of Indonesia, 1999.
INAGP : Pengurai Kalimat Bahasa Indonesia Sebagai Alat Bantu Untuk Pengembangan Aplikasi PBA
Rosalina Paramita N., Dwi H. Widyantoro, Ayu Purwarianti. Undergraduate thesis from JBPTITBPP, Institute of Technology Bandung, 2007.
Penguraian Bahasa Indonesia dengan Menggunakan Pengurai Collins
Sukamto, Rosa Ariani. Master's thesis, Institute of Technology Bandung, 2009.
HMM Based Part-of-Speech Tagger for Bahasa Indonesia
Wicaksono, A. Farizki and Purwarianti, Ayu. Proceedings of the 4th International MALINDO (Malay and Indonesian Language) Workshop (2010).
Penggunaan Hidden Markov Model untuk Kompresi Kalimat
Yudi Wibisono. Graduate Thesis. Institute of Technology Bandung. 2008.
Probabilistic Part Of Speech Tagging for Bahasa Indonesia
Femphy Pisceldo, Mirna Adriani, Ruli Manurung. Third International MALINDO Workshop, co-located with ACL-IJCNLP 2009, Singapore, August 1, 2009.
Effective Techniques for Indonesian Text Retrieval
Asian, J. (2007). PhD thesis, School of Computer Science and Information Technology, RMIT University, Australia.
Arifin, A.Z., I.P.A.K. Mahendra, and H.T. Ciptaningtyas. 2009. Proceedings of the International Conference on Information & Communication Technology and Systems (ICTS).
A. D. Tahitoe, D. Purwitasari. Institut Teknologi Sepuluh Nopember (ITS) – Surabaya.
Building an Indonesian WordNet
Desmond Darma Putra, Abdul Arfan and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
English-to-Indonesian Lexical Mapping using Latent Semantic Analysis
Eliza Margaretha, Franky, and Ruli Manurung. In Proceedings of the 2nd International MALINDO Workshop. 2008.
A survey of bahasa Indonesia NLP research conducted at the University of Indonesia
Mirna Adriani and Ruli Manurung. Faculty of Computer Science, University of Indonesia.
Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus
Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman. Charles University in Prague.
Adriani, Mirna. Riza, Hammam. 2008.
Towards a Semantic Analysis of Bahasa Indonesia for Question Answering
Septina Dian Larasati and Ruli Manurung. Faculty of Computer Science. University of Indonesia. 2007.
If you would like to contribute to this GitHub repository, submitting a Pull Request is strongly recommended,
but only with Indonesian-language resources.
The FAQ answers common questions about this repository, from naming conventions and basic questions to more advanced ones.
This is a collection/reading-list of awesome Natural Language Processing papers sorted by date.
Unsupervised Machine Translation Using Monolingual Corpora Only, Lample et al.
Paper
On the Dimensionality of Word Embeddings, Yin et al.
Paper
An efficient framework for learning sentence representations, Logeswaran et al.
Paper
Refining Pretrained Word Embeddings Using Layer-wise Relevance Propagation, Akira Utsumi
Paper
Domain Adapted Word Embeddings for Improved Sentiment Classification, Sarma et al.
Paper
In-domain Context-aware Token Embeddings Improve Biomedical Named Entity Recognition, Sheikhshab et al.
Paper
Generalizing Word Embeddings using Bag of Subwords, Zhao et al.
Paper
What's in Your Embedding, And How It Predicts Task Performance, Rogers et al.
Paper
On Learning Better Word Embeddings from Chinese Clinical Records: Study on Combining In-Domain and Out-Domain Data, Wang et al.
Paper
Predicting and interpreting embeddings for out of vocabulary words in downstream tasks, Garneau et al.
Paper
Addressing Low-Resource Scenarios with Character-aware Embeddings, Papay et al.
Paper
Domain Adaptation for Disease Phrase Matching with Adversarial Networks, Liu et al.
Paper
Investigating Effective Parameters for Fine-tuning of Word Embeddings Using Only a Small Corpus, Komiya et al.
Paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al.
Paper
Adapting Word Embeddings from Multiple Domains to Symptom Recognition from Psychiatric Notes, Zhang et al.
Paper
Evaluation of sentence embeddings in downstream and linguistic probing tasks, Perone et al.
Paper
Universal Sentence Encoder, Cer et al.
Paper
Deep Contextualized Word Representations, Peters et al.
Paper
Learned in Translation: Contextualized Word Vectors, McCann et al.
Paper
Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations, Rücklé et al.
Paper
A Compressed Sensing View of Unsupervised Text Embeddings, Bag-Of-n-Grams, and LSTMs, Arora et al.
Paper
Attention Is All You Need, Vaswani et al.
Paper
Skip-Gram – Zipf + Uniform = Vector Additivity, Gittens et al.
Paper
A Simple but Tough-to-beat Baseline for Sentence Embeddings, Arora et al.
Paper
Fast and Accurate Entity Recognition with Iterated Dilated Convolutions, Strubell et al.
Paper
Advances in Pre-Training Distributed Word Representations, Mikolov et al.
Paper
Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets, Dror et al.
Paper
Towards Universal Paraphrastic Sentence Embeddings, Wieting et al.
Paper
Bag of Tricks for Efficient Text Classification, Joulin et al.
Paper
Enriching Word Vectors with Subword Information, Bojanowski et al.
Paper
Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP, Kirk Roberts
Paper
How to Train Good Word Embeddings for Biomedical NLP, Chiu et al.
Paper
Log-Linear Models, MEMMs, and CRFs, Michael Collins
Paper
Counter-fitting Word Vectors to Linguistic Constraints, Mrkšić et al.
Paper
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al.
Paper
Semi-supervised Sequence Learning, Dai et al.
Paper
Evaluating distributed word representations for capturing semantics of biomedical concepts, Th et al.
Paper
GloVe: Global Vectors for Word Representation, Pennington et al.
Paper
Linguistic Regularities in Sparse and Explicit Word Representations, Levy and Goldberg.
Paper
Neural Word Embedding as Implicit Matrix Factorization, Levy and Goldberg.
Paper
word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method, Goldberg and Levy.
Paper
What’s in a p-value in NLP?, Søgaard et al.
Paper
How transferable are features in deep neural networks?, Yosinski et al.
Paper
Improving lexical embeddings with semantic knowledge, Yu et al.
Paper
Retrofitting word vectors to semantic lexicons, Faruqui et al.
Paper
Efficient Estimation of Word Representations in Vector Space, Mikolov et al.
Paper
Linguistic Regularities in Continuous Space Word Representations, Mikolov et al.
Paper
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al.
Paper