Wednesday, December 10, 2014

NLP USE CASES AND TESTING

word2vec 2013

project page: https://code.google.com/p/word2vec/

core research:
Tomas Mikolov
Efficient Estimation of Word Representations in Vector Space 
(http://goo.gl/ZvBp8F)
http://arxiv.org/pdf/1301.3781.pdf

follow-up:
Distributed Representations of Words and Phrases and their Compositionality
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf


word2vec Explained: Deriving Mikolov et al.’s
Negative-Sampling Word-Embedding Method
Yoav Goldberg and Omer Levy
{yoav.goldberg,omerlevy}@gmail.com
February 14, 2014



Distributed Representations of Sentences and Documents

Quoc Le, Tomas Mikolov
http://cs.stanford.edu/~quocle/paragraph_vector.pdf


blogs and tutorials:
http://www.i-programmer.info/news/105-artificial-intelligence/6264-machine-learning-applied-to-natural-language.html
Representing words as high dimensional vectors
https://plus.google.com/+ResearchatGoogle/posts/VwBUvQ7PvnZ
http://radimrehurek.com/2014/02/word2vec-tutorial/
Deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling [1] [2].
http://radimrehurek.com/gensim/models/word2vec.html
http://mfcabrera.com/research/2013/11/14/word2vec-german.blog.org/
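The gensim tutorial above trains skip-gram/CBOW models with hierarchical softmax or negative sampling. As a rough illustration of what one skip-gram negative-sampling update does (a toy sketch with a made-up vocabulary, random vectors, and an arbitrary learning rate — not gensim's actual implementation):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_step(vec_in, vec_out, center, context, negatives, lr=0.1):
    """One skip-gram negative-sampling update for a (center, context) pair:
    push the context word's output vector toward the center word's input
    vector, and push the sampled negative words away."""
    v = vec_in[center]
    grad_v = [0.0] * len(v)
    # positive sample gets label 1, negative samples get label 0
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = vec_out[word]
        g = lr * (label - sigmoid(dot(v, u)))
        for i in range(len(v)):
            grad_v[i] += g * u[i]   # accumulate gradient w.r.t. v (old u)
            u[i] += g * v[i]        # update output vector in place
    for i in range(len(v)):
        v[i] += grad_v[i]

random.seed(0)
vocab = ["king", "queen", "man", "woman", "the"]
dim = 8
vec_in = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
vec_out = {w: [0.0] * dim for w in vocab}
# repeatedly observe "queen" in the context of "king"
for _ in range(200):
    sgns_step(vec_in, vec_out, "king", "queen", ["the", "man"])
```

After training, the model assigns high probability to the observed pair (sigmoid of the dot product between "king" and "queen" approaches 1) and low probability to the negative samples.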




DEEPLEARNING4J 

GloVe 2014

project page: http://nlp.stanford.edu/projects/glove/
core research:
GloVe: Global Vectors for Word Representation
http://stanford.edu/~jpennin/papers/glove.pdf
We provide the source code for the model as
well as trained word vectors at
http://nlp.stanford.edu/projects/glove/

Best word vectors so far? http://stanford.edu/~jpennin/papers/glove.pdf … 11% more accurate than word2vec, fast to train, statistically efficient, good task accuracy

RNN-trained word vectors 2012

http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces
Semantic Compositionality Through Recursive Matrix-Vector Spaces
Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state of the art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.
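The abstract's composition rule — each node gets a vector and a matrix, and each word's matrix transforms its neighbor's vector before a shared nonlinear combination — can be sketched numerically. This is a toy illustration assuming the parent-vector form p = g(W[Ba; Ab]) with g = tanh and 2-dimensional vectors, not the authors' released code:

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mvrnn_compose(a, A, b, B, W):
    """MV-RNN composition: word b's matrix B transforms word a's vector
    (and vice versa), then shared matrix W mixes the concatenation."""
    z = matvec(B, a) + matvec(A, b)              # [B·a ; A·b], length 2n
    return [math.tanh(x) for x in matvec(W, z)]  # parent vector, length n

I = [[1.0, 0.0], [0.0, 1.0]]       # identity word matrices (no modification)
a, b = [0.5, -0.2], [0.1, 0.9]     # toy vectors, e.g. for "very" and "good"
W = [[0.3, 0.1, 0.0, 0.2],         # n x 2n shared composition matrix
     [0.0, 0.4, 0.1, 0.1]]
p = mvrnn_compose(a, I, b, I, W)   # phrase vector for the parent node
```

With non-identity word matrices, an operator word like "very" can scale or rotate its neighbor's meaning, which is the point of the matrix-vector representation.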

Download Paper
SocherHuvalManningNg_EMNLP2012.pdf

Download Code
Relation Classification
relationClassification.zip (525MB) - All training code and testing code with trained models for new data. External packages (parser, tagger) included and the whole pipeline should run with one script. This is the package if you just want to use the best model to classify your relations.
relationClassification-No-MVRNN-models.zip (103MB) - All training code and testing code but WITHOUT trained models. External packages (parser, tagger) included. Here you need to first run the full training script, which will take a few hours to run.
relationClassification-Only-code.zip (170kB) - All training code and testing code but WITHOUT trained models, external packages, word vectors or anything else. This package includes only the code so you can study the algorithm.

How to measure quality of the word vectors

Several factors influence the quality of the word vectors:
  • amount and quality of the training data
  • size of the vectors
  • training algorithm
The quality of the vectors is crucial for any application. However, exploration of different hyper-parameter settings for complex tasks might be too time demanding. Thus, we designed simple test sets that can be used to quickly evaluate the word vector quality.
For the word relation test set described in [1], see ./demo-word-accuracy.sh, for the phrase relation test set described in [2], see ./demo-phrase-accuracy.sh. Note that the accuracy depends heavily on the amount of the training data; our best results for both test sets are above 70% accuracy with coverage close to 100%.
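The analogy questions behind ./demo-word-accuracy.sh are scored by vector arithmetic: for a question "a is to b as c is to ?", return the vocabulary word closest (by cosine) to b - a + c. A self-contained sketch of that scoring rule with hand-built toy vectors (not the real test set):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def analogy(vecs, a, b, c):
    """Answer "a : b :: c : ?" with the word closest to b - a + c,
    excluding the three question words themselves."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = (w for w in vecs if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vecs[w]))

# toy vectors constructed so that king - man + woman lands on queen
vecs = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "the":   [0.3, 0.3, 0.1],
}
answer = analogy(vecs, "man", "king", "woman")
```

Accuracy on the test set is just the fraction of questions where this nearest-neighbor answer matches the expected word, which is why coverage (how many question words are in the vocabulary) matters alongside accuracy.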

test metrics: comparing GloVe vs. word2vec

On the importance of comparing apples to apples: a case study using the GloVe model
Yoav Goldberg, 10 August 2014


links from word2vec paper http://arxiv.org/pdf/1301.3781.pdf:

The test set is available at http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt
http://ronan.collobert.com/senna/
http://metaoptimize.com/projects/wordreprs/
http://www.fit.vutbr.cz/~imikolov/rnnlm/
http://ai.stanford.edu/~ehhuang/
Microsoft Research Sentence Completion Challenge
http://research.microsoft.com/en-us/um/people/cburges/tech_reports/MSR-TR-2011-129.pdf

G. Zweig, C.J.C. Burges. The Microsoft Research Sentence Completion Challenge, Microsoft
Research Technical Report MSR-TR-2011-129, 2011.
Appendix: Full List of Training Data
The Microsoft Sentence Completion Challenge was recently introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing from each sentence and the goal is to select the word that is most coherent with the rest of the sentence, given a list of five reasonable choices.
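One simple word-vector baseline for this task (a hypothetical sketch, not one of the systems evaluated in the report): score each of the five candidate words by its average cosine similarity to the other words in the sentence and pick the highest-scoring one. With toy vectors:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def complete(context, choices, vecs):
    """Pick the candidate most similar, on average, to the context words."""
    known = [c for c in context if c in vecs]
    def score(w):
        return sum(cosine(vecs[w], vecs[c]) for c in known) / len(known)
    return max(choices, key=score)

# toy vectors: "barked" should fit a dog context better than "meowed"
vecs = {
    "dog":    [1.0, 0.1],
    "barked": [0.9, 0.2],
    "cat":    [0.1, 1.0],
    "meowed": [0.2, 0.9],
}
best = complete(["dog"], ["barked", "meowed"], vecs)
```

This ignores word order, which is exactly where language models outperform bag-of-vectors baselines on this benchmark.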

C.J.C. Burges papers:
http://research.microsoft.com/en-us/um/people/cburges/pubs.htm

CORTICAL

Potential Projects in Fashion AI

Fashion Marketing and Digital Media Group on LinkedIn. 

https://alicemitchellx.wordpress.com/2014/12/09/the-perspective-media-interview/ 
http://box-of-style.blogspot.com/2014/12/recommended-read-imagine-world-where.html 

If you would like to submit predictions for a project related to the future of fashion, please let me know. I welcome your feedback and questions.

Paul


https://www.stylewe.com/

StyleWe is an online fashion shopping platform featuring independent fashion designers. We are committed to providing shoppers with original, high quality, and exclusive fashion products from independent designers.

By working with cutting edge independent fashion designers from around the world, and combining them with our high quality production and digital marketing capabilities, we will turn the fashion designers’ dreams into reality by providing high fashion to customers worldwide. 

Rather than just an online shopping store, we would like to create a community which will be shared by both designers and customers. The community will enable all parties to communicate, share ideas, and recognize each other. It would not only provide instant feedback for fashion designers when launching new concepts or products, but would also allow customers to share their shopping experiences and fashion dreams.

We bring together designers and fashion covering many different styles. We hope that every one of our customers will find their own unique and exclusive designer fashions at StyleWe.

We believe the fashion trend should not be controlled by the few, but rather be guided by the collective actions of every designer and fashion consumer. At StyleWe, our goal is to empower designers so that they no longer feel hidden behind the brand, but are able to proactively communicate directly with their customers throughout the entire fashion life cycle.

We believe fashion should be personal and diversified. Fashion designers should not cater exclusively to the rich and famous. We have dedicated ourselves to enabling talented fashion designers to build their own brands and achieve their dreams. Together with our designers, we will deliver high quality designer fashions to everyone.

Monday, December 8, 2014

Cortical.io



http://www.crunchbase.com/organization/cept-systems

list of suggestions:

1. use cases and implementations for each Cortical API call, similar to word2vec use cases and more...
2. webinars:
  • end-to-end examples of cortical word representation combined with deep learning
  • end-to-end examples of cortical word representation combined with deep learning, deployed on a Spark cluster
3. schools, similar to http://www.next.ml/
4. meetup presentations - new tech, hackers and founders, etc.
5. participation in summits - Spark, Solr, etc. ... ML, NLP, NLU
6. participation in SemEval
7. Bloomberg
8. CNBC
9. tests:

How to measure quality of the word vectors

simple test sets that can be used to quickly evaluate word vector quality:
  • the word relation test set
  • the phrase relation test set

best result, average

example:
comparing apples to apples: a case study using the GloVe model

https://docs.google.com/document/d/1ydIujJ7ETSZ688RGfU5IMJJsbxAi-kRl8czSwpti15s/mobilebasic?pli=1






Language processing in the brain


http://en.wikipedia.org/wiki/Language_processing_in_the_brain

Computational Linguistics


http://plato.stanford.edu/entries/computational-linguistics/

SEMEVAL NIST COMPETITIONS


http://alt.qcri.org/semeval2014/
http://alt.qcri.org/semeval2014/task1/index.php?id=results
http://alt.qcri.org/semeval2014/task3/index.php?id=results
http://alt.qcri.org/semeval2014/task5/?id=data-and-tools

word2vec


https://groups.google.com/forum/#!msg/word2vec-toolkit/ZcOst7kEjaI/rv_A6LaE9vkJ
word2vec-toolkit

Tim Finin

The SemEval workshop (http://en.wikipedia.org/wiki/SemEval) ran tasks in
2012, 2013 and 2014 where the goal was to compute the semantic
similarity of two sentences on a scale from 0 to 5. Each year they
provided training and test datasets with human judgments. These could
easily be used to evaluate and compare the performance of this and other
ideas using word2vec data. Papers on the participating systems can be
found in the ACL repository
(http://aclanthology.info/events/semeval-201+X for X in range(12, 15)).
For an overview of the most recent task, see
http://aclanthology.info/papers/semeval-2014-task-10-multilingual-semantic-textual-similarity.

What are some standard ways of computing the distance between documents?

There are a number of different ways of going about this depending on exactly how much semantic information you want to retain and how easy your documents are to tokenize (HTML documents would probably be pretty difficult to tokenize, but you could conceivably do something with tags and context).
Some of them have been mentioned by ffriend, and the paragraph vectors suggested by user1133029 are a really solid option, but I just figured I would go into some more depth about the pluses and minuses of different approaches.
  • Cosine Distance - Tried and true, cosine distance is probably the most common distance metric used generically across multiple domains. With that said, there's very little information in cosine distance that can actually be mapped back to anything semantic, which seems to be non-ideal for this situation.
  • Levenshtein Distance - Also known as edit distance, this is usually just used on the individual token level (words, bigrams, etc.). In general I wouldn't recommend this metric, as it not only discards any semantic information but also tends to treat very different word alterations very similarly; it is, however, an extremely common metric for this kind of thing.
  • LSA - Part of a large arsenal of techniques for evaluating document similarity called topic modeling. LSA has gone out of fashion pretty recently, and in my experience it's not quite the strongest topic modeling approach, but it is relatively straightforward to implement and has a few open source implementations.
  • LDA - Also a technique used for topic modeling, but different from LSA in that it actually learns internal representations that tend to be more smooth and intuitive. In general, the results you get from LDA are better for modeling document similarity than LSA, but not quite as good for learning how to discriminate strongly between topics.
  • Pachinko Allocation - A really neat extension on top of LDA. In general, this is just a significantly improved version of LDA, with the only downsides being that it takes a bit longer to train and open-source implementations are a little harder to come by.
  • word2vec - Google has been working on a series of techniques for intelligently reducing words and documents to more reasonable vectors than the sparse vectors yielded by techniques such as Count Vectorizers and TF-IDF. Word2vec is great because it has a number of open source implementations. Once you have the vector, any other similarity metric (like cosine distance) can be used on top of it with significantly more efficacy.
  • doc2vec - Also known as paragraph vectors, this is the latest and greatest in a series of papers by Google, looking into dense vector representations of documents. The gensim library in python has an implementation of word2vec that is straightforward enough that it can pretty reasonably be leveraged to build doc2vec, but make sure to keep the license in mind if you want to go down this route
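Tying the cosine-distance and word2vec bullets together, a common cheap baseline is to average the word vectors of each document and compare the averages with cosine similarity. A self-contained sketch with toy vectors and whitespace tokenization (not a tuned system):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def doc_vector(tokens, vecs):
    """Average the vectors of in-vocabulary tokens, skipping unknown words."""
    known = [vecs[t] for t in tokens if t in vecs]
    return [sum(col) / len(known) for col in zip(*known)]

def doc_similarity(doc_a, doc_b, vecs):
    return cosine(doc_vector(doc_a.split(), vecs),
                  doc_vector(doc_b.split(), vecs))

vecs = {
    "dog": [1.0, 0.1], "puppy": [0.9, 0.2],
    "stock": [0.1, 1.0], "market": [0.2, 0.9],
}
pets = doc_similarity("the dog chased a puppy", "puppy and dog", vecs)
mixed = doc_similarity("the dog chased a puppy", "stock market news", vecs)
```

Averaging discards word order, which is precisely the weakness the paragraph-vector (doc2vec) approach above is designed to address.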


http://www.fi.muni.cz/usr/sojka/papers/pakray-sojka-raslan2014.pdf
An Architecture for Scientific Document Retrieval
Using Textual and Math Entailment Modules
Partha Pakray and Petr Sojka
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Rep

plain Word2vec with pretrained Google News data by LSA gave better result ...Technology (NIST), Evaluation Exercises on Semantic Evaluation (SemEval)

word2vec entity relationship resolution - as a search key

competing frameworks: Stanford NLP coreference vs. alternatives

 http://www.ark.cs.cmu.edu/ARKref/ 
submissions to the CoNLL 2011 / 2012 shared task on coreference modeling:

http://conll.cemantix.org/2011/
http://conll.cemantix.org/2012/

2011 was English only; 2012 involved English, Chinese, and Arabic.

The Stanford system (Lee et al.'s submission) was the top performing system in 2011, but a few other submissions reported slightly better performance on English in 2012. I'm not sure if any other substantial work has been done on coreference resolution since then.

In my experience, Stanford's system is the winner in usability. Getting a hold of the code for the other submissions can be difficult - your best bet might be to try contacting the authors directly.


Poesio's BART
http://www.bart-coref.org/