Monday, December 8, 2014

SEMEVAL NIST COMPETITIONS


http://alt.qcri.org/semeval2014/
http://alt.qcri.org/semeval2014/task1/index.php?id=results
http://alt.qcri.org/semeval2014/task3/index.php?id=results
http://alt.qcri.org/semeval2014/task5/?id=data-and-tools

word2vec


https://groups.google.com/forum/#!msg/word2vec-toolkit/ZcOst7kEjaI/rv_A6LaE9vkJ
From the word2vec-toolkit group, a post by Tim Finin:

The SemEval workshop (http://en.wikipedia.org/wiki/SemEval) ran tasks in
2012, 2013 and 2014 where the goal was to compute the semantic
similarity of two sentences on a scale from 0 to 5. Each year they
provided training and test datasets with human judgments. These could
easily be used to evaluate and compare the performance of this and other
ideas using word2vec data. Papers on the participating systems can be
found in the ACL repository
(http://aclanthology.info/events/semeval-201X for X in range(2, 5)).
For an overview of the most recent task, see
http://aclanthology.info/papers/semeval-2014-task-10-multilingual-semantic-textual-similarity.
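A minimal sketch of how such an evaluation might look: average word2vec vectors per sentence, compare with cosine similarity scaled to 0-5, and report Pearson correlation (the STS ranking metric). The file name "sts-pairs.tsv" and its tab-separated layout are assumptions, not the actual SemEval distribution format, and this uses gensim's current API:

    # Hedged sketch: score sentence pairs with averaged word2vec vectors and
    # report Pearson correlation against gold 0-5 judgments.
    # "sts-pairs.tsv" and its s1<TAB>s2<TAB>score layout are assumptions.
    import numpy as np
    from scipy.stats import pearsonr
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    def sentence_vector(sentence):
        # Average the vectors of the in-vocabulary tokens.
        tokens = [t for t in sentence.lower().split() if t in vectors]
        if not tokens:
            return np.zeros(vectors.vector_size)
        return np.mean([vectors[t] for t in tokens], axis=0)

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    predicted, gold = [], []
    with open("sts-pairs.tsv") as f:
        for line in f:
            s1, s2, score = line.rstrip("\n").split("\t")
            predicted.append(5 * cosine(sentence_vector(s1), sentence_vector(s2)))
            gold.append(float(score))

    print("Pearson r = %.3f" % pearsonr(predicted, gold)[0])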

What are some standard ways of computing the distance between documents?

There are a number of different ways of going about this, depending on exactly how much semantic information you want to retain and how easy your documents are to tokenize (HTML documents would probably be pretty difficult to tokenize, but you could conceivably do something with tags and context).
Some of them have already been mentioned by ffriend, and the paragraph vectors mentioned by user1133029 are a really solid option, but I figured I would go into some more depth about the pluses and minuses of the different approaches.
  • Cosine Distance - Tried and true, cosine distance is probably the most common distance metric used generically across multiple domains. That said, cosine distance carries very little information that can actually be mapped back to anything semantic, which seems non-ideal for this situation. (A minimal TF-IDF example appears after this list.)
  • Levenshtein Distance - Also known as edit distance, this is usually applied at the individual token level (words, bigrams, etc.). In general I wouldn't recommend this metric, since it not only discards all semantic information but also tends to treat very different word alterations very similarly; still, it is an extremely common metric for this kind of thing. (A minimal implementation is sketched below.)
  • LSA - Part of a large arsenal of techniques for evaluating document similarity known as topic modeling. LSA has fallen out of fashion fairly recently, and in my experience it's not quite the strongest topic-modeling approach, but it is relatively straightforward to implement and has a few open-source implementations. (See the gensim sketch after this list.)
  • LDA - Also a topic-modeling technique, but it differs from LSA in that it actually learns internal representations that tend to be smoother and more intuitive. In general, the results you get from LDA are better for modeling document similarity than LSA's, but not quite as good for learning to discriminate strongly between topics.
  • Pachinko Allocation - A really neat extension on top of LDA. In general this is just a significantly improved version of LDA, with the only downsides being that it takes a bit longer to train and open-source implementations are a little harder to come by.
  • word2vec - Google has been working on a series of techniques for intelligently reducing words and documents to denser, more reasonable vectors than the sparse ones yielded by techniques such as count vectorizers and TF-IDF. word2vec is great because it has a number of open-source implementations. Once you have the vectors, any similarity metric (like cosine similarity) can be used on top of them with significantly more efficacy. (See the word2vec/doc2vec sketch after this list.)
  • doc2vec - Also known as paragraph vectors, this is the latest and greatest in a series of papers from Google, looking into dense vector representations of documents. The gensim library in Python has an implementation of word2vec that is straightforward enough that it can pretty reasonably be leveraged to build doc2vec, but keep the license in mind if you want to go down this route.
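For the cosine-distance bullet above, a minimal sketch with scikit-learn (the toy documents are made up):

    # TF-IDF bag-of-words vectors compared with cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",
        "Stock prices fell sharply on Monday.",
    ]
    tfidf = TfidfVectorizer().fit_transform(docs)  # sparse doc-term matrix
    print(cosine_similarity(tfidf).round(2))       # docs 0 and 1 pair up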
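The Levenshtein bullet as a self-contained sketch; it works on character strings or on token lists:

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1,               # deletion
                                curr[j - 1] + 1,           # insertion
                                prev[j - 1] + (x != y)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))                         # 3
    print(levenshtein("the cat sat".split(), "a cat sat".split()))  # 1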
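A hedged sketch of the LSA and LDA bullets with gensim (the three-document corpus is a toy; real topic models need far more data):

    from gensim import corpora, models, similarities

    texts = [doc.lower().split() for doc in [
        "graph minors and tree width",
        "human computer interaction survey",
        "graph theory and random trees",
    ]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lsa = models.LsiModel(corpus, id2word=dictionary, num_topics=2)  # LSA/LSI
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # LDA
    print(lda.print_topics())

    # Document similarity in the LSA topic space.
    index = similarities.MatrixSimilarity(lsa[corpus])
    query = lsa[dictionary.doc2bow("random graph trees".split())]
    print(list(index[query]))  # similarity of the query to each document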
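And the word2vec / doc2vec bullets, again with gensim's current API (parameter names changed in gensim 4; the corpus is a toy):

    from gensim.models import Word2Vec
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sentences = [s.split() for s in [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock prices fell on monday",
    ]]

    # word2vec: one dense vector per word; a common document vector is the
    # average of its word vectors, compared with cosine similarity as above.
    w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)
    print(w2v.wv.similarity("cat", "cats"))

    # doc2vec (paragraph vectors): learns a dense vector per document.
    tagged = [TaggedDocument(words, [i]) for i, words in enumerate(sentences)]
    d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=50)
    print(d2v.dv.most_similar([d2v.infer_vector("a cat on a mat".split())]))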


http://www.fi.muni.cz/usr/sojka/papers/pakray-sojka-raslan2014.pdf
An Architecture for Scientific Document Retrieval
Using Textual and Math Entailment Modules
Partha Pakray and Petr Sojka
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic

From the paper: "plain Word2vec with pretrained Google News data by LSA gave better results ... Technology (NIST), Evaluation Exercises on Semantic Evaluation (SemEval)"

word2vec entity relationship resolution - as a search key

competing frameworks: Stanford NLP vs. other coreference systems

 http://www.ark.cs.cmu.edu/ARKref/ 
submissions to the CoNLL 2011 / 2012 shared task on coreference modeling:

http://conll.cemantix.org/2011/
http://conll.cemantix.org/2012/

2011 was English only; 2012 involved English, Chinese, and Arabic.

The Stanford system (Lee et al.'s submission) was the top-performing system in 2011, but a few other submissions reported slightly better performance on English in 2012. I'm not sure whether any other substantial work has been done on coreference resolution since then.

In my experience, Stanford's system is the winner in usability. Getting hold of the code for the other submissions can be difficult; your best bet might be to contact the authors directly.


 Poesio's BART  
http://www.bart-coref.org/
