Monday, November 7, 2016

ML Text Generation Problems and Solutions



LSTM generating text


TensorFlow using LSTMs for generating text
http://stackoverflow.com/questions/36609920/tensorflow-using-lstms-for-generating-text

text generation with RNN

A: Transforming text with a neural network
Implementing seq2seq with sampled decoder outputs
http://stackoverflow.com/questions/36228723/implementing-seq2seq-with-sampled-decoder-outputs/36246038#36246038


Q: RNN for End-End Speech Recognition using TensorFlow
http://stackoverflow.com/questions/38385292/rnn-for-end-end-speech-recognition-using-tensorflow


Q: Tensorflow Android demo: load a custom graph in?
http://stackoverflow.com/questions/39318586/tensorflow-android-demo-load-a-custom-graph-in


Building a stacked LSTM model for text classification in TensorFlow
http://stackoverflow.com/questions/34790159/stacked-rnn-model-setup-in-tensorflow

A Practical Guide for Debugging Tensorflow Codes
Jongwook Choi
June 18th, 2016
Latest Update: Dec 9th, 2016
https://github.com/wookayin/TensorflowKR-2016-talk-debugging

Generative Adversarial Networks


NIPS 2016 Tutorial: Generative Adversarial Networks

https://arxiv.org/abs/1701.00160
Ian Goodfellow
(Submitted on 31 Dec 2016 (v1), last revised 5 Jan 2017 (this version, v2))
This report summarizes the tutorial presented by the author at NIPS 2016 on generative adversarial networks (GANs). The tutorial describes: (1) Why generative modeling is a topic worth studying, (2) how generative models work, and how GANs compare to other generative models, (3) the details of how GANs work, (4) research frontiers in GANs, and (5) state-of-the-art image models that combine GANs with other methods. Finally, the tutorial contains three exercises for readers to complete, and the solutions to these exercises.

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
https://arxiv.org/abs/1612.03242
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, Dimitris Metaxas
(Submitted on 10 Dec 2016)
Synthesizing photo-realistic images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose stacked Generative Adversarial Networks (StackGAN) to generate photo-realistic images conditioned on text descriptions. The Stage-I GAN sketches the primitive shape and basic colors of the object based on the given text description, yielding Stage-I low resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high resolution images with photo-realistic details. The Stage-II GAN is able to rectify defects and add compelling details with the refinement process. Samples generated by StackGAN are more plausible than those generated by existing approaches. Importantly, our StackGAN for the first time generates realistic 256 x 256 images conditioned on only text descriptions, while state-of-the-art methods can generate at most 128 x 128 images. To demonstrate the effectiveness of the proposed StackGAN, extensive experiments are conducted on CUB and Oxford-102 datasets, which contain enough object appearance variations and are widely-used for text-to-image generation analysis.

Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning.
(arXiv:1702.07464v1 [cs.CR]) 
In recent years, a branch of machine learning called Deep Learning has become incredibly popular thanks to the ability of a new class of algorithms to model and interpret a large quantity of data in a similar way to humans. Properly training deep learning models involves collecting a vast amount of users' private data, including habits, geographical positions, interests, and much more. Another major issue is that it is possible to extract from trained models useful information about the training set and this hinders collaboration among distrustful participants or parties that deal with sensitive information.

To tackle this problem, collaborative deep learning models have recently been proposed where parties share only a subset of the parameters in the attempt to keep their respective training sets private. Parameters can also be obfuscated via differential privacy to make information extraction even more challenging, as shown by Shokri and Shmatikov at CCS'15. Unfortunately, we show that any privacy-preserving collaborative deep learning is susceptible to a powerful attack that we devise in this paper. In particular, we show that a distributed or decentralized deep learning approach is fundamentally broken and does not protect the training sets of honest participants. The attack we developed exploits the real-time nature of the learning process that allows the adversary to train a Generative Adversarial Network (GAN) that generates valid samples of the targeted training set that was meant to be private. Interestingly, we show that differential privacy applied to shared parameters of the model as suggested at CCS'15 and CCS'16 is utterly futile. In our generative model attack, all techniques adopted to scramble or obfuscate shared parameters in collaborative deep learning are rendered ineffective with no possibility of a remedy under the threat model considered.


Sequence Modeling via Segmentations
https://arxiv.org/abs/1702.07463
Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, Li Deng
(Submitted on 24 Feb 2017)
Segmental structure is a common pattern in many types of sequences such as phrases in human languages. In this paper, we present a probabilistic model for sequences via their segmentations. The probability of a segmented sequence is calculated as the product of the probabilities of all its segments, where each segment is modeled using existing tools such as recurrent neural networks. Since the segmentation of a sequence is usually unknown in advance, we sum over all valid segmentations to obtain the final probability for the sequence. An efficient dynamic programming algorithm is developed for forward and backward computations without resorting to any approximation. We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts.


Hidden Community Detection in Social Networks

We introduce a new paradigm that is important for community detection in the realm of network analysis. Networks contain a set of strong, dominant communities, which interfere with the detection of weak, natural community structure. When most of the members of the weak communities also belong to stronger communities, they are extremely hard to uncover. We call the weak communities the hidden community structure.
We present a novel approach called HICODE (HIdden COmmunity DEtection) that identifies the hidden community structure as well as the dominant community structure. By weakening the strength of the dominant structure, one can uncover the hidden structure beneath. Likewise, by reducing the strength of the hidden structure, one can more accurately identify the dominant structure. In this way, HICODE tackles both tasks simultaneously.
Extensive experiments on real-world networks demonstrate that HICODE outperforms several state-of-the-art community detection methods in uncovering both the dominant and the hidden structure. In the Facebook university social networks, we find multiple non-redundant sets of communities that are strongly associated with residential hall, year of registration or career position of the faculties or students, while the state-of-the-art algorithms mainly locate the dominant ground truth category. Due to the difficulty of labeling all ground truth communities in real-world datasets, HICODE provides a promising approach to pinpoint the existing latent communities and uncover communities for which there is no ground truth. Finding this unknown structure is an extremely important community detection problem.


Important for NLP: rare words should play a larger role in parameter updates and frequent words a smaller one.
This is implemented in AdaGrad.
AdaGrad: an adaptive learning rate for each parameter.
Related paper:
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Duchi et al. 2010
The learning rate adapts separately for each parameter, so rarely updated parameters receive larger updates than frequently updated ones, which is exactly what we want for word vectors!
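A minimal NumPy sketch of the per-parameter AdaGrad update described above (illustrative only, not code from the lecture or the paper):

import numpy as np

def adagrad_update(theta, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    # Accumulate squared gradients per parameter; rarely updated parameters
    # keep a small accumulator and therefore get a larger effective step.
    grad_sq_sum = grad_sq_sum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum

# Example: one step on a 3-dimensional word-vector parameter
theta = np.zeros(3)
grad_sq_sum = np.zeros(3)
theta, grad_sq_sum = adagrad_update(theta, np.array([0.1, -0.2, 0.05]), grad_sq_sum)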

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

http://www.jmlr.org/papers/v12/duchi11a.html
John Duchi, Elad Hazan, Yoram Singer; 12(Jul):2121−2159, 2011.
Abstract
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
Keywords: subgradient methods, adaptivity, online learning, stochastic convex optimization

http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
John Duchi, Computer Science Division, University of California, Berkeley, CA, USA
Elad Hazan, Technion - Israel Institute of Technology, Haifa, Israel
Yoram Singer, Google, Mountain View, CA, USA

Use the rectified linear unit (ReLU) instead of tanh or sigmoid. ReLU is zero when x is negative and equal to x when x is positive: ReLU(x) = max(0, x).
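A quick NumPy sketch comparing the three activations (illustrative):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # 0 for x < 0, identity for x >= 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]  no saturation for large positive inputs
print(np.tanh(x))    # saturates toward -1 / 1 for large |x|
print(sigmoid(x))    # saturates toward  0 / 1 for large |x|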


Deep Learning Tricks of the Trade

Prevent feature co-adaptation with dropout (Hinton et al. 2012):
during training, randomly set 50% of the inputs to each neuron to 0 (a small code sketch follows the paper abstract below)
paper -
Improving neural networks by preventing co-adaptation of feature detectors
https://arxiv.org/abs/1207.0580
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov
(Submitted on 3 Jul 2012)
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
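A minimal NumPy sketch of the dropout idea above (illustrative; this uses the common "inverted dropout" formulation rather than the paper's test-time weight halving):

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    # During training, zero each unit independently with probability 1 - keep_prob
    # and rescale the survivors so the expected activation is unchanged.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < keep_prob).astype(activations.dtype)
    return activations * mask / keep_prob

h = np.random.randn(4, 8)                 # a batch of hidden-layer activations
h_train = dropout(h, keep_prob=0.5)       # roughly half the units are zeroed
h_test = dropout(h, training=False)       # identity at test time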



Random hyperparameter search!

From the paper: Y. Bengio (2012), Practical Recommendations for Gradient-Based Training of Deep Architectures

1. Unsupervised pre-training
2. Stochastic gradient descent and setting learning rates
3. Main hyper-parameters:
- learning rate schedule & early stopping
- mini-batch size
- parameter initialization
- number of hidden units
- regularization (= weight decay)
4. How to efficiently search for hyper-parameter configurations
Short answer: random hyperparameter search! (sketched below)
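A rough Python sketch of random hyper-parameter search (illustrative; train_and_evaluate is a hypothetical stand-in for whatever training and evaluation routine is used):

import random

def sample_config():
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),         # sample on a log scale
        "hidden_units": random.choice([128, 256, 512, 1024]),
        "keep_prob": random.uniform(0.5, 1.0),                  # dropout keep probability
        "batch_size": random.choice([32, 64, 128]),
    }

best_score, best_config = float("-inf"), None
for _ in range(50):                        # 50 random trials instead of a full grid
    config = sample_config()
    score = train_and_evaluate(config)     # hypothetical helper: trains a model, returns dev accuracy
    if score > best_score:
        best_score, best_config = score, config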

Practical Recommendations for Gradient-Based Training of Deep Architectures
https://arxiv.org/pdf/1206.5533.pdf

Yoshua Bengio Version 2, Sept. 16th, 2012
 Abstract
Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with certain classes of networks, in particular deep architectures.

 Some more advanced and recent tricks in later lectures.

Language Models:

A language model computes a probability for a sequence of words.
The probability is usually conditioned on a window of the n previous words.
Very useful for many tasks:
it can be used to judge whether a candidate translation or speech-recognition hypothesis is a good, grammatical word sequence.
Example: "going home" vs. "going house"
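As a toy illustration (not from the lecture), a count-based bigram language model that conditions each word on one previous word; trained on text where "going home" occurs, it scores "going home" above "going house":

from collections import Counter

def train_bigram(tokens):
    # P(w_1 ... w_T) is approximated as the product over t of P(w_t | w_{t-1}),
    # with conditional probabilities estimated from corpus counts (add-one smoothing).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def prob(w_prev, w):
        return (bigrams[(w_prev, w)] + 1.0) / (unigrams[w_prev] + vocab)
    return prob

tokens = "i am going home tonight i am going home now".split()
prob = train_bigram(tokens)
print(prob("going", "home"))    # 0.375: seen continuation scores higher
print(prob("going", "house"))   # 0.125: unseen continuation scores lower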

Recurrent Neural Networks

Solution: condition the neural network on all previous words and tie the weights at each time step.
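A minimal NumPy sketch of that idea (illustrative): the same weight matrices are reused at every time step, so the hidden state can condition on all previous words:

import numpy as np

vocab_size, hidden_size = 10, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden, shared across time
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden, shared across time
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden-to-output, shared across time

def step(h_prev, word_id):
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                              # one-hot encoding of the current word
    h = np.tanh(W_xh @ x + W_hh @ h_prev)         # hidden state summarizes all words seen so far
    logits = W_hy @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()                 # softmax distribution over the next word

h = np.zeros(hidden_size)
for word_id in [3, 1, 7]:                         # a toy sequence of word ids
    h, next_word_probs = step(h, word_id)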