Wednesday, December 10, 2014


word2vec 2013

project page:

core research:
Tomas Mikolov
Efficient Estimation of Word Representations in Vector Space 

follow up - 
Distributed Representations of Words and Phrases and their Compositionality

word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
Yoav Goldberg and Omer Levy
February 14, 2014

Distributed Representations of Sentences and Documents

Quoc Le, Tomas Mikolov

blogs and tutorials:
Representing words as high dimensional vectors
Deep learning via word2vec’s “skip-gram and CBOW models”, using either hierarchical softmax or negative sampling [1] [2].
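
To make the skip-gram / CBOW distinction concrete, here is a minimal sketch of how the two models slice a sentence into training examples. The function names and the fixed `window` are my own illustration, not word2vec's code; the real implementations also add tricks such as subsampling of frequent words, and the hierarchical softmax or negative sampling output layer is left out entirely.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window=2):
    """Yield (context_words, center): CBOW predicts the center word from its neighbours."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center


sentence = "the quick brown fox".split()
pairs = list(skipgram_pairs(sentence, window=1))
cbow = list(cbow_examples(sentence, window=1))
print(pairs)
print(cbow)
```

Skip-gram produces one example per (center, neighbour) pair, while CBOW produces one example per position with all neighbours pooled together.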


GloVe 2014

project page:
core research:
GloVe: Global Vectors for Word Representation
We provide the source code for the model as
well as trained word vectors at

Best word vectors so far? … 11% more accurate than word2vec, fast to train, statistically efficient, good task accuracy

RNN trained word vectors 2012
Semantic Compositionality Through Recursive Matrix-Vector Spaces
Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state of the art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.

Download Paper

Download Code
Relation Classification (525MB) - All training code and testing code with trained models for new data. External packages (parser, tagger) included and the whole pipeline should run with one script.

How to measure quality of the word vectors

Several factors influence the quality of the word vectors:
  • amount and quality of the training data
  • size of the vectors
  • training algorithm
The quality of the vectors is crucial for any application. However, exploration of different hyper-parameter settings for complex tasks might be too time demanding. Thus, we designed simple test sets that can be used to quickly evaluate the word vector quality.
For the word relation test set described in [1], see ./, for the phrase relation test set described in [2], see ./ Note that the accuracy depends heavily on the amount of the training data; our best results for both test sets are above 70% accuracy with coverage close to 100%.
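
A rough sketch of how such a relation test can be scored, assuming each question is four words "a b c d" and the answer is predicted as the vocabulary word closest (by cosine) to vec(b) - vec(a) + vec(c). The toy two-dimensional vectors and the question list below are made up for illustration; accuracy is counted over answered questions and coverage is the share of questions whose words are all in the vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def evaluate_analogies(vectors, questions):
    """Return (accuracy, coverage) for questions given as (a, b, c, expected) tuples."""
    answered = correct = 0
    for a, b, c, expected in questions:
        if not all(w in vectors for w in (a, b, c, expected)):
            continue  # out-of-vocabulary question: lowers coverage, not accuracy
        answered += 1
        target = [vb - va + vc
                  for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        # The three question words are excluded from the candidate answers.
        candidates = [w for w in vectors if w not in (a, b, c)]
        best = max(candidates, key=lambda w: cosine(vectors[w], target))
        correct += best == expected
    accuracy = correct / answered if answered else 0.0
    coverage = answered / len(questions) if questions else 0.0
    return accuracy, coverage


# Hypothetical toy embedding: axis 0 is "royalty", axis 1 is "gender".
toy_vectors = {
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "man": [0.2, 1.0], "woman": [0.2, -1.0],
    "apple": [-1.0, 0.1],
}
toy_questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
    ("man", "woman", "uncle", "aunt"),  # "uncle" and "aunt" are out of vocabulary
]
accuracy, coverage = evaluate_analogies(toy_vectors, toy_questions)
print(accuracy, coverage)
```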

test metric, tests GloVe vs Word2Vec

On the importance of comparing apples to apples: a case study using the GloVe model
Yoav Goldberg, 10 August 2014

links from word2vec paper

The test set is available at ~imikolov/rnnlm/word-test.v1.txt
Microsoft Research Sentence Completion Challenge

G. Zweig, C.J.C. Burges. The Microsoft Research Sentence Completion Challenge, Microsoft
Research Technical Report MSR-TR-2011-129, 2011.
Appendix: Full List of Training Data
The Microsoft Sentence Completion Challenge has been recently introduced as a task for advancing language modeling and other NLP techniques [32]. This task consists of 1040 sentences, where one word is missing in each sentence and the goal is to select the word that is the most coherent with the rest of the sentence, given a list of five reasonable choices.

C.J.C. Burges papers:


Language processing in the brain

Computational Linguistics



Tim Finin

The SemEval workshop ran tasks in
2012, 2013 and 2014 where the goal was to compute the semantic
similarity of two sentences on a scale from 0 to 5. Each year they
provided training and test datasets with human judgments. These could
easily be used to evaluate and compare the performance of this and other
ideas using word2vec data. Papers on the participating systems can be
found in the ACL repository for each of those years.
For an overview of the most recent task, see
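
A sketch of what such an evaluation could look like, under my own assumptions: sentences are represented by averaging their word vectors, similarity is the cosine rescaled to 0 to 5, and the system is scored by Pearson correlation against the human judgments (as far as I know, the metric these similarity tasks report). The vectors, sentence pairs and gold scores below are invented toy data, not SemEval data.

```python
import math


def sentence_vector(sentence, vectors):
    """Average the vectors of known words; returns None if no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_similarity(s1, s2, vectors):
    """Map cosine similarity onto the 0 to 5 scale (a simple assumed rescaling)."""
    u, v = sentence_vector(s1, vectors), sentence_vector(s2, vectors)
    if u is None or v is None:
        return 0.0
    return 5.0 * max(0.0, cosine(u, v))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical toy vectors and human scores, for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "car": [0.0, 1.0], "truck": [0.1, 0.9],
    "sleeps": [0.5, 0.5],
}
toy_pairs = [
    ("cat sleeps", "dog sleeps", 4.5),
    ("cat sleeps", "truck", 1.0),
    ("car", "truck", 4.0),
    ("cat", "car", 0.5),
]
predicted = [predict_similarity(a, b, toy_vectors) for a, b, _ in toy_pairs]
gold = [score for _, _, score in toy_pairs]
correlation = pearson(predicted, gold)
print(predicted, correlation)
```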

What are some standard ways of computing the distance between documents?

There are a number of different ways of going about this, depending on exactly how much semantic information you want to retain and how easy your documents are to tokenize (HTML documents would probably be pretty difficult to tokenize, but you could conceivably do something with tags and context).
Some of them have been mentioned by ffriend, and the paragraph vectors approach mentioned by user1133029 is a really solid one, but I figured I would go into more depth about the pluses and minuses of the different approaches.
  • Cosine Distance - Tried and true, cosine distance is probably the most common distance metric used generically across multiple domains. With that said, there's very little information in cosine distance that can actually be mapped back to anything semantic, which seems non-ideal for this situation.
  • Levenshtein Distance - Also known as edit distance, this is usually just used on the individual token level (words, bigrams, etc.). In general I wouldn't recommend this metric, as it not only discards any semantic information but also tends to treat very different word alterations very similarly. It is, however, an extremely common metric for this kind of thing.
  • LSA - Is part of a large arsenal of techniques for evaluating document similarity called topic modeling. LSA has gone out of fashion pretty recently, and in my experience it's not quite the strongest topic modeling approach, but it is relatively straightforward to implement and has a few open source implementations.
  • LDA - Is also a technique used for topic modeling, but it's different from LSA in that it actually learns internal representations that tend to be smoother and more intuitive. In general, the results you get from LDA are better for modeling document similarity than LSA, but not quite as good for learning how to discriminate strongly between topics.
  • Pachinko Allocation - Is a really neat extension on top of LDA. In general, this is just a significantly improved version of LDA, with the only downside being that it takes a bit longer to train and open-source implementations are a little harder to come by.
  • word2vec - Google has been working on a series of techniques for intelligently reducing words and documents to more reasonable vectors than the sparse vectors yielded by techniques such as Count Vectorizers and TF-IDF. Word2vec is great because it has a number of open source implementations. Once you have the vector, any other similarity metric (like cosine distance) can be used on top of it with significantly more efficacy.
  • doc2vec - Also known as paragraph vectors, this is the latest and greatest in a series of papers by Google, looking into dense vector representations of documents. The gensim library in Python has an implementation of word2vec that is straightforward enough that it can pretty reasonably be leveraged to build doc2vec, but make sure to keep the license in mind if you want to go down this route.
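
The first two metrics in the list are simple enough to sketch in a few lines. This is a minimal illustration, assuming whitespace tokenization and raw term counts as the document vectors (the same cosine function would apply unchanged to dense word2vec or doc2vec vectors); the function names are mine.

```python
import math
from collections import Counter


def cosine_distance(doc_a, doc_b):
    """1 - cosine similarity between bag-of-words count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def levenshtein(s, t):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(cosine_distance("the cat sat", "the cat ran"))
print(levenshtein("kitten", "sitting"))
# Unrelated meanings, same edit distance: the weakness noted in the list above.
print(levenshtein("cat", "car"), levenshtein("cat", "cut"))
```

Note that bag-of-words cosine gives two documents with no shared tokens the maximum distance of 1 even when they are about the same thing, which is the gap the dense-vector approaches try to close.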
An Architecture for Scientific Document Retrieval
Using Textual and Math Entailment Modules
Partha Pakray and Petr Sojka
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Rep

plain Word2vec with pretrained Google news data by LSA gave better result ... Technology (NIST), Evaluation Exercises on Semantic Evaluation (SemEval)

word2vec entity relationship resolution - as a search key

competing frameworks: stanford nlp vs coreference 
submissions to the CoNLL 2011 / 2012 shared task on coreference modeling:

2011 was English only; 2012 involved English, Chinese and Arabic.

The Stanford system (Lee et al.'s submission) was the top performing system in 2011, but a few other submissions reported slightly better performance on English in 2012. I'm not sure if any other substantial work has been done on coreference resolution since then.

In my experience, Stanford's system is the winner in usability. Getting a hold of the code for the other submissions can be difficult - your best bet might be to try contacting the authors directly.

 Poesio's BART

Tuesday, November 4, 2014


Generalized Additive Model residuals
Extract Model Fitted Values

Fitting Generalized Linear Models
deviance: up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.
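
The definition above can be checked numerically. A small sketch for the Poisson family, assuming the usual convention that the constant is the log-likelihood of the saturated model (fitted mean equal to each observation), so that deviance = 2 * (loglik_saturated - loglik_model). The function names and the toy counts are mine, not from the R documentation.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y under fitted means mu (all mu > 0)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))


def poisson_deviance(y, mu):
    """Twice the log-likelihood gap between the saturated model and the fitted one."""
    def saturated_term(yi):
        # The saturated model sets mu_i = y_i; y * log(y) is taken as 0 when y == 0.
        return (yi * math.log(yi) if yi > 0 else 0.0) - yi - math.lgamma(yi + 1)

    loglik_saturated = sum(saturated_term(yi) for yi in y)
    return 2.0 * (loglik_saturated - poisson_loglik(y, mu))


counts = [2, 3, 5]
mean_only = [sum(counts) / len(counts)] * len(counts)  # intercept-only fit

print(poisson_deviance(counts, counts))     # saturated fit: deviance zero
print(poisson_deviance(counts, mean_only))  # worse fit: positive deviance
```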

Wednesday, October 8, 2014



PCA papers

Attack Resistant Collaborative Filtering

Tomas Mikolov
Research scientist, Facebook

A curated list of resources dedicated to recurrent neural networks

Maintainers - Myungsub Choi, Jiwon Kim
pages for other topics: awesome-deep-vision, awesome-random-forest

Adversarial Attacks on AI APIs DNN online

Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples
Feb 19, 2016

Nicolas Papernot - The Pennsylvania State University
Patrick McDaniel - The Pennsylvania State University
Ian Goodfellow - Google Inc.
Somesh Jha - University of Wisconsin-Madison
Z. Berkay Celik - The Pennsylvania State University
Ananthram Swami - US Army Research Laboratory
Abstract - Advances in deep learning have led to the broad adoption of Deep Neural Networks (DNNs) to a range of important machine learning problems, e.g., guiding autonomous vehicles, speech recognition, malware detection. Yet, machine learning models, including DNNs, were shown to be vulnerable to adversarial samples—subtly (and often humanly indistinguishably) modified malicious inputs crafted to compromise the integrity of their outputs. Adversarial examples thus enable adversaries to manipulate system behaviors. Potential attacks include attempts to control the behavior of vehicles, have spam content identified as legitimate content, or have malware identified as legitimate software. Adversarial examples are known to transfer from one model to another, even if the second model has a different architecture or was trained on a different set. We introduce the first practical demonstration that this cross-model transfer phenomenon enables attackers to control a remotely hosted DNN with no access to the model, its parameters, or its training data. In our demonstration, we only assume that the adversary can observe outputs from the target DNN given inputs chosen by the adversary. We introduce the attack strategy of fitting a substitute model to the input-output pairs in this manner, then crafting adversarial examples based on this auxiliary model. We evaluate the approach on existing DNN datasets and real-world settings. In one experiment, we force a DNN supported by MetaMind (one of the online APIs for DNN classifiers) to mis-classify inputs at a rate of 84.24%. We conclude with experiments exploring why adversarial samples transfer between DNNs, and a discussion on the applicability of our attack when targeting machine learning algorithms distinct from DNNs.
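
A toy sketch of the substitute-model idea described in the abstract, heavily simplified and not the authors' implementation: the remote DNN is replaced by a hypothetical linear `oracle` that only returns labels, the substitute is a two-feature logistic regression, and adversarial inputs are crafted with a single signed-gradient step on the substitute. The paper's procedure for growing the substitute's training set is omitted, and every name and constant here is an assumption for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def oracle(x):
    """Hypothetical black-box target: the attacker only ever sees this label."""
    return 1 if 2.0 * x[0] - 1.0 * x[1] + 0.3 > 0 else 0


def train_substitute(inputs, labels, steps=500, lr=0.5):
    """Fit a logistic-regression substitute to oracle-labelled inputs by gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(inputs)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in zip(inputs, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def sign(g):
    return (g > 0) - (g < 0)


def craft(x, y, w, b, eps):
    """Move x by eps per feature in the direction that raises the substitute's loss."""
    p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
    grad = [(p - y) * wi for wi in w]  # d(cross-entropy)/dx for logistic regression
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


def run_demo(seed=0, n=200, eps=0.5):
    """Return (substitute agreement with oracle, transfer rate, largest per-feature shift)."""
    rng = random.Random(seed)
    inputs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    labels = [oracle(x) for x in inputs]          # queries to the black box
    w, b = train_substitute(inputs, labels)
    agree = sum((sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5) == bool(y)
                for x, y in zip(inputs, labels))
    adversarial = [craft(x, y, w, b, eps) for x, y in zip(inputs, labels)]
    flipped = sum(oracle(a) != y for a, y in zip(adversarial, labels))
    max_shift = max(abs(ai - xi)
                    for a, x in zip(adversarial, inputs)
                    for ai, xi in zip(a, x))
    return agree / n, flipped / n, max_shift


if __name__ == "__main__":
    print(run_demo())
```

The transfer rate is the fraction of crafted inputs that change the oracle's label, even though the oracle's parameters were never used to craft them.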

Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion

Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043
{lunadong|gabr|geremy|wilko|nlao|kpmurphy|tstrohmann|sunsh|weizh}

 Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.

Knowledge Vault Slides
Published on Aug 25, 2014
Antoine Bordes (Facebook)
Evgeniy Gabrilovich (Google)
A Review of “Knowledge Vault: A Web-Scale Approach to a Probabilistic Knowledge Fusion”

Deep Learning with TensorFlow
This Deep Learning with TensorFlow course focuses on TensorFlow. If you are new to the subject of deep learning, consider taking our Deep Learning 101 course first.
Traditional neural networks rely on shallow nets, composed of one input layer, one hidden layer and one output layer. Deep-learning networks are distinguished from these ordinary neural networks by having more hidden layers, or so-called greater depth. These kinds of nets are capable of discovering hidden structures within unlabeled and unstructured data (e.g. images, sound, and text), which constitutes the vast majority of data in the world.
TensorFlow is one of the best libraries to implement deep learning. TensorFlow is a software library for numerical computation of mathematical expressions using data flow graphs. Nodes in the graph represent mathematical operations, while the edges represent the multidimensional data arrays (tensors) that flow between them. It was created by Google and tailored for Machine Learning. In fact, it is being widely used to develop solutions with Deep Learning.
In this TensorFlow course, you will learn the basic concepts of TensorFlow, the main functions, operations and the execution pipeline. Starting with a simple “Hello World” example, throughout the course you will see how TensorFlow can be used in curve fitting, regression, classification and minimization of error functions. This concept is then explored in the Deep Learning world. You will learn how to apply TensorFlow for backpropagation to tune the weights and biases while the Neural Networks are being trained. Finally, the course covers different types of Deep Architectures, such as Convolutional Networks, Recurrent Networks and Autoencoders.

Course Syllabus
Module 1 – Introduction to TensorFlow
  • HelloWorld with TensorFlow
  • Linear Regression
  • Nonlinear Regression
  • Logistic Regression
  • Activation Functions
Module 2 – Convolutional Neural Networks (CNN)
  • CNN History
  • Understanding CNNs
  • CNN Application
Module 3 – Recurrent Neural Networks (RNN)
  • Intro to RNN Model
  • Long Short-Term Memory (LSTM)
  • Recursive Neural Tensor Network Theory
  • Recurrent Neural Network Model
Module 4 – Unsupervised Learning
  • Applications of Unsupervised Learning
  • Restricted Boltzmann Machine
  • Collaborative Filtering with RBM
Module 5 – Autoencoders
  • Introduction to Autoencoders and Applications
  • Autoencoders
  • Deep Belief Network


