word2vec 2013
project page: https://code.google.com/p/word2vec/
http://www.i-programmer.info/news/105-artificial-intelligence/6264-machine-learning-applied-to-natural-language.html
core research:
Tomas Mikolov
Efficient Estimation of Word Representations in Vector Space
(http://goo.gl/ZvBp8F)
http://arxiv.org/pdf/1301.3781.pdf
follow up -
Distributed Representations of Words and Phrases and their Compositionality
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
Yoav Goldberg and Omer Levy
{yoav.goldberg,omerlevy}@gmail.com
February 14, 2014
Distributed Representations of Sentences and Documents
Quoc Le, Tomas Mikolov
http://cs.stanford.edu/~quocle/paragraph_vector.pdf
blogs and tutorials:
Representing words as high dimensional vectors
https://plus.google.com/+ResearchatGoogle/posts/VwBUvQ7PvnZ
http://radimrehurek.com/2014/02/word2vec-tutorial/
Deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling [1] [2].
http://radimrehurek.com/gensim/models/word2vec.html
http://mfcabrera.com/research/2013/11/14/word2vec-german.blog.org/
DEEPLEARNING4J
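A minimal sketch of the training examples behind the skip-gram and CBOW models named above: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The function names and the fixed `window` parameter are illustrative assumptions, not the API of gensim or the word2vec tool, and the sketch stops before the actual training step (hierarchical softmax or negative sampling).

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the context is the input, the center the target."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Toy sentence, made up for illustration only.
sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
print(pairs)
print(examples)
```

Both functions walk the same sliding window; only the direction of prediction differs, which is why the two models can share most of their preprocessing.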
GloVe 2014
project page: http://nlp.stanford.edu/projects/glove/
core research:
GloVe: Global Vectors for Word Representation
http://stanford.edu/~jpennin/papers/glove.pdf
We provide the source code for the model as well as trained word vectors at
http://nlp.stanford.edu/projects/glove/
Best word vectors so far? http://stanford.edu/~jpennin/papers/glove.pdf … 11% more accurate than word2vec, fast to train, statistically efficient, good task accuracy
RNN trained word vectors 2012
http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces
Semantic Compositionality Through Recursive Matrix-Vector Spaces
Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state of the art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.
Download Paper
SocherHuvalManningNg_EMNLP2012.pdf
Download Code
Relation Classification
relationClassification.zip (525MB) - All training code and testing code with trained models for new data. External packages (parser, tagger) included and the whole pipeline should run with one script. This is the package if you just want to use the best model to classify your relations.
relationClassification-No-MVRNN-models.zip (103MB) - All training code and testing code but WITHOUT trained models. External packages (parser, tagger) included. Here you need to first run the full training script, which will take a few hours to run.
relationClassification-Only-code.zip (170kB) - All training code and testing code but WITHOUT trained models, external packages, word vectors or anything else. This package includes only the code so you can study the algorithm.
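A rough sketch of the matrix-vector composition described in the abstract above, where every node carries a vector (its meaning) and a matrix (how it modifies its neighbour). This follows my reading of the paper's composition step, p = g(W [Ba; Ab]) and P = W_M [A; B]; the choice of tanh for g, the helper names, and the toy weights are assumptions for illustration and are not taken from the released code.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def compose(a, A, b, B, W, W_M):
    """Combine children (a, A) and (b, B) into a parent (p, P).

    p = tanh(W [B a ; A b])   with W of shape n x 2n
    P = W_M [A ; B]           with W_M of shape n x 2n and [A ; B] of shape 2n x n
    """
    # Each child's matrix transforms the other child's vector.
    stacked_vec = matvec(B, a) + matvec(A, b)
    p = [math.tanh(x) for x in matvec(W, stacked_vec)]
    # Stack the rows of A on top of the rows of B, then project back to n x n.
    P = matmul(W_M, A + B)
    return p, P


# Toy 2-dimensional example with made-up values.
identity = [[1.0, 0.0], [0.0, 1.0]]
a, b = [1.0, 0.0], [0.0, 1.0]
W = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]    # hypothetical weights
W_M = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # hypothetical weights
p, P = compose(a, identity, b, identity, W, W_M)
print(p, P)
```

Because the parent is again a (vector, matrix) pair, the same function can be applied bottom-up along a parse tree, which is what lets the model handle phrases of arbitrary length.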
How to measure quality of the word vectors
Several factors influence the quality of the word vectors:
- amount and quality of the training data
- size of the vectors
- training algorithm
The quality of the vectors is crucial for any application. However, exploration of different hyper-parameter settings for complex tasks might be too time demanding. Thus, we designed simple test sets that can be used to quickly evaluate the word vector quality.
For the word relation test set described in [1], see ./demo-word-accuracy.sh, for the phrase relation test set described in [2], see ./demo-phrase-accuracy.sh. Note that the accuracy depends heavily on the amount of the training data; our best results for both test sets are above 70% accuracy with coverage close to 100%.
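A minimal sketch of how a word relation test of this kind can be scored, assuming each question is a four-word analogy "a is to b as c is to d" answered by the nearest neighbour of vec(b) - vec(a) + vec(c). The vectors and questions below are made up for illustration, the function names are hypothetical, and "coverage" is taken here to mean the share of questions whose words are all in the vocabulary; the demo scripts may differ in details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def answer_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def evaluate(vectors, questions):
    """Return (accuracy over answered questions, coverage of the question set)."""
    answered = correct = 0
    for a, b, c, expected in questions:
        if not all(w in vectors for w in (a, b, c, expected)):
            continue  # skip questions with out-of-vocabulary words
        answered += 1
        if answer_analogy(vectors, a, b, c) == expected:
            correct += 1
    accuracy = correct / answered if answered else 0.0
    coverage = answered / len(questions) if questions else 0.0
    return accuracy, coverage


# Toy vectors and questions, made up for illustration only.
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}
questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
    ("man", "woman", "prince", "princess"),  # out of vocabulary, so skipped
]
accuracy, coverage = evaluate(toy_vectors, questions)
print(accuracy, coverage)
```

Reporting coverage next to accuracy matters because a small vocabulary can look accurate simply by skipping the hard questions.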
test metric, tests GloVe vs Word2Vec
On the importance of comparing apples to apples: a case study using the GloVe model
Yoav Goldberg, 10 August 2014
links from word2vec paper http://arxiv.org/pdf/1301.3781.pdf:
The test set is available at http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
http://ronan.collobert.com/senna/
http://metaoptimize.com/projects/wordreprs/
http://www.fit.vutbr.cz/~imikolov/rnnlm/
http://ai.stanford.edu/~ehhuang/
Microsoft Research Sentence Completion Challenge
http://research.microsoft.com/en-us/um/people/cburges/tech_reports/MSR-TR-2011-129.pdf
G. Zweig, C.J.C. Burges. The Microsoft Research Sentence Completion Challenge, Microsoft Research Technical Report MSR-TR-2011-129, 2011.
Appendix: Full List of Training Data
The Microsoft Sentence Completion Challenge has been recently introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing in each sentence and the goal is to select word that is the most coherent with the rest of the sentence, given a list of five reasonable choices.
C.J.C. Burges papers:
http://research.microsoft.com/en-us/um/people/cburges/pubs.htm