Wednesday, December 10, 2014

NLP USE CASES AND TESTING

word2vec 2013

project page: https://code.google.com/p/word2vec/

core research:
Tomas Mikolov
Efficient Estimation of Word Representations in Vector Space 
(http://goo.gl/ZvBp8F)
http://arxiv.org/pdf/1301.3781.pdf

follow up -
Distributed Representations of Words and Phrases and their Compositionality
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf


word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
Yoav Goldberg and Omer Levy, February 14, 2014
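For reference, the per-pair training objective that the Goldberg & Levy note derives for skip-gram with negative sampling is, up to notation,

    \log\sigma(\vec{c}\cdot\vec{w}) \;+\; k\,\mathbb{E}_{c_N\sim P_D}\big[\log\sigma(-\vec{c}_N\cdot\vec{w})\big], \qquad \sigma(x)=\frac{1}{1+e^{-x}},

where w is the target word, c an observed context word, and the k negative contexts c_N are drawn from a (smoothed) unigram distribution over the corpus.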



Distributed Representations of Sentences and Documents

Quoc Le, Tomas Mikolov
http://cs.stanford.edu/~quocle/paragraph_vector.pdf


blogs and tutorials:
http://www.i-programmer.info/news/105-artificial-intelligence/6264-machine-learning-applied-to-natural-language.html
Representing words as high dimensional vectors
https://plus.google.com/+ResearchatGoogle/posts/VwBUvQ7PvnZ
http://radimrehurek.com/2014/02/word2vec-tutorial/
Deep learning via word2vec’s skip-gram and CBOW models, using either hierarchical softmax or negative sampling [1] [2] (a minimal usage sketch follows after these links).
http://radimrehurek.com/gensim/models/word2vec.html
http://mfcabrera.com/research/2013/11/14/word2vec-german.blog.org/
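
A minimal usage sketch of the gensim model linked above. Hedged: "corpus.txt" is a placeholder path to a file with one tokenized sentence per line, the parameter values are illustrative, and 2014-era gensim uses size / model.most_similar where newer releases use vector_size / model.wv.most_similar.

# Train skip-gram word vectors with negative sampling using gensim,
# then query nearest neighbours and an analogy.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")   # placeholder: one tokenized sentence per line
model = Word2Vec(sentences,
                 size=100,       # dimensionality of the word vectors
                 window=5,       # context window
                 min_count=5,    # drop rare words
                 sg=1,           # 1 = skip-gram, 0 = CBOW
                 negative=5)     # negative sampling (hs=1 would use hierarchical softmax)

print(model.most_similar("king"))
print(model.most_similar(positive=["king", "woman"], negative=["man"]))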




DEEPLEARNING4J 

GloVe 2014

project page: http://nlp.stanford.edu/projects/glove/
core research:
GloVe: Global Vectors for Word Representation
http://stanford.edu/~jpennin/papers/glove.pdf
"We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/"

Best word vectors so far? http://stanford.edu/~jpennin/papers/glove.pdf … 11% more accurate than word2vec, fast to train, statistically efficient, good task accuracy
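
The trained vectors on the project page are distributed as plain text, one word per line followed by its vector components. A rough loading sketch (the file name is a placeholder for one of the distributed files; this is not the authors' evaluation code):

# Load pre-trained GloVe vectors and find nearest neighbours by cosine similarity.
import numpy as np

vectors = {}
with open("glove.6B.100d.txt") as f:          # placeholder file name
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def nearest(word, topn=5):
    v = vectors[word] / np.linalg.norm(vectors[word])
    sims = {w: float(np.dot(v, u / np.linalg.norm(u)))
            for w, u in vectors.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:topn]

print(nearest("frog"))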

RNN trained word vectors 2012

http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces
Semantic Compositionality Through Recursive Matrix-Vector Spaces
Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state of the art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.
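
A toy numpy sketch of the composition step described in the abstract (this is not the released code; W, W_M and the dimensionality are illustrative): each child's matrix transforms its sibling's vector, a learned W combines the results into the parent vector, and a second mapping stacks the child matrices into the parent matrix.

import numpy as np

n = 4                                    # toy vector dimensionality
rng = np.random.RandomState(0)
W = rng.randn(n, 2 * n) * 0.1            # combines the two transformed child vectors
W_M = rng.randn(n, 2 * n) * 0.1          # combines the two child matrices

def compose(a, A, b, B):
    """Children (a, A) and (b, B) -> parent (p, P) at a parse-tree node."""
    p = np.tanh(W.dot(np.concatenate([B.dot(a), A.dot(b)])))
    P = W_M.dot(np.vstack([A, B]))       # (n x 2n) . (2n x n) -> (n x n)
    return p, P

a, b = rng.randn(n), rng.randn(n)
A, B = np.eye(n), np.eye(n)              # the paper initializes the matrices near the identity
p, P = compose(a, A, b, B)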

Download Paper
SocherHuvalManningNg_EMNLP2012.pdf

Download Code
Relation Classification
relationClassification.zip (525MB) - All training code and testing code with trained models for new data. External packages (parser, tagger) included and the whole pipeline should run with one script. This is the package if you just want to use the best model to classify your relations.
relationClassification-No-MVRNN-models.zip (103MB) - All training code and testing code but WITHOUT trained models. External packages (parser, tagger) included. Here you need to first run the full training script, which will take a few hours to run.
relationClassification-Only-code.zip (170kB) - All training code and testing code but WITHOUT trained models, external packages, word vectors or anything else. This package includes only the code so you can study the algorithm.

How to measure quality of the word vectors

Several factors influence the quality of the word vectors:
  • amount and quality of the training data
  • size of the vectors
  • training algorithm
The quality of the vectors is crucial for any application. However, exploring different hyper-parameter settings on complex tasks can be too time-consuming. Thus, we designed simple test sets that can be used to quickly evaluate word vector quality.
For the word relation test set described in [1], see ./demo-word-accuracy.sh, for the phrase relation test set described in [2], see ./demo-phrase-accuracy.sh. Note that the accuracy depends heavily on the amount of the training data; our best results for both test sets are above 70% accuracy with coverage close to 100%.
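
A sketch of what the word relation (analogy) test does, assuming the questions are already parsed into (a, b, c, d) tuples (e.g. Athens, Greece, Berlin, Germany) and `vectors` maps words to unit-length numpy arrays; this is not the C code behind demo-word-accuracy.sh:

import numpy as np

def analogy_accuracy(questions, vectors):
    words = list(vectors)
    mat = np.vstack([vectors[w] for w in words])       # (V, dim), rows unit-length
    correct = total = 0
    for a, b, c, d in questions:
        if not all(w in vectors for w in (a, b, c, d)):
            continue                                    # skipped questions lower coverage
        target = vectors[b] - vectors[a] + vectors[c]
        target /= np.linalg.norm(target)
        sims = mat.dot(target)
        for w in (a, b, c):                             # never predict a query word
            sims[words.index(w)] = -np.inf
        total += 1
        correct += words[int(np.argmax(sims))] == d
    return correct / float(total) if total else 0.0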

test methodology, comparing GloVe vs. word2vec:

On the importance of comparing apples to apples: a case study using the GloVe model
Yoav Goldberg, 10 August 2014


links from word2vec paper http://arxiv.org/pdf/1301.3781.pdf:

The test set is available at http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
http://ronan.collobert.com/senna/
http://metaoptimize.com/projects/wordreprs/
http://www.fit.vutbr.cz/~imikolov/rnnlm/
http://ai.stanford.edu/~ehhuang/
Microsoft Research Sentence Completion Challenge
http://research.microsoft.com/en-us/um/people/cburges/tech_reports/MSR-TR-2011-129.pdf

G. Zweig, C.J.C. Burges. The Microsoft Research Sentence Completion Challenge, Microsoft Research Technical Report MSR-TR-2011-129, 2011 (the report's appendix gives the full list of training data).
The Microsoft Sentence Completion Challenge has recently been introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing in each sentence and the goal is to select the word that is the most coherent with the rest of the sentence, given a list of five reasonable choices.
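
A crude illustration of the task only (this is NOT the scoring used in the papers, which rank candidates with a trained language / skip-gram model): pick the candidate whose vector is most similar to the mean of the remaining context words. Names and inputs below are placeholders.

import numpy as np

def pick_candidate(context_words, candidates, vectors):
    ctx = [vectors[w] for w in context_words if w in vectors]
    if not ctx:
        return candidates[0]                 # no usable context: fall back arbitrarily
    ctx_mean = np.mean(ctx, axis=0)
    ctx_mean /= np.linalg.norm(ctx_mean)
    scores = {c: float(np.dot(vectors[c], ctx_mean))
              for c in candidates if c in vectors}
    return max(scores, key=scores.get)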

C.J.C. Burges papers:
http://research.microsoft.com/en-us/um/people/cburges/pubs.htm

CORTICAL
