Databricks
http://databricks.com/
MLbase
ML Optimizer: This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI and MLlib. The ML Optimizer is currently under active development.
MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions. A prototype of MLI has been implemented against Spark, and serves as a testbed for MLlib.
MLlib: Apache Spark's distributed ML library. MLlib was initially developed as part of the MLbase project, and the library is currently supported by the Spark community. Many features in MLlib have been borrowed from ML Optimizer and MLI, e.g., the model and algorithm APIs, multimodel training, sparse data support, design of local / distributed matrices, etc.
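The multi-model training idea mentioned above can be sketched in plain Python. This is a conceptual illustration only, not the MLlib API: it fits one tiny ridge-regression model per hyperparameter setting and keeps the setting with the lowest held-out error.

```python
# Conceptual sketch of multi-model (grid) training; not the MLlib API.
# We fit y = w*x by 1-D ridge regression under several penalties lam
# and keep the penalty that minimizes held-out error.

def fit_ridge(data, lam):
    """Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(data, w):
    """Mean squared error of the model y ≈ w*x on a dataset."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
heldout = [(4.0, 8.0), (5.0, 9.9)]

# "Multi-model training": evaluate a whole grid of candidate models.
grid = [0.0, 0.1, 1.0, 10.0]
best_lam, best_w = min(
    ((lam, fit_ridge(train, lam)) for lam in grid),
    key=lambda lw: mse(heldout, lw[1]),
)
print(best_lam, round(best_w, 3))
```

In a distributed setting the point is that all grid models can share the same passes over the training data; the sketch above only captures the selection logic.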
MLI: An API for Distributed Machine Learning
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
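One data-centric pattern this kind of interface targets is that many ML objectives have gradients that decompose into sums over data partitions, so each worker can reduce its own shard locally. A minimal plain-Python sketch (illustrative only; not the MLI API, and `map` stands in for remote execution):

```python
# Illustrative data-parallel gradient descent; not the actual MLI API.
# The gradient of a squared loss decomposes into a sum over shards,
# so each "worker" computes a partial gradient on its own data.

def partial_gradient(shard, w):
    """Gradient of 0.5*(w*x - y)^2 summed over one shard."""
    return sum((w * x - y) * x for x, y in shard)

def distributed_gradient_step(shards, w, lr=0.01):
    # In a real system each shard lives on a different worker;
    # here "map" stands in for remote execution and "sum" for reduce.
    grads = map(lambda s: partial_gradient(s, w), shards)
    return w - lr * sum(grads)

shards = [
    [(1.0, 2.0), (2.0, 4.0)],   # worker 0's data (y = 2x)
    [(3.0, 6.0), (4.0, 8.0)],   # worker 1's data
]
w = 0.0
for _ in range(200):
    w = distributed_gradient_step(shards, w)
print(round(w, 3))
```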
From the database community, projects like MADLib [12] and Hazy [13]
have tried to expose ML algorithms in the context of well
established systems. Alternatively, projects like Weka [14],
scikit-learn [15] and Google Predict [16] have sought to expose
a library of ML tools in an intuitive interface. However, none
of these systems focus on the challenges of scaling ML to the
emerging distributed data setting.
GitHub
Status of MLbase/MLI
Evan R. Sparks
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so!
- Evan Apr 01, 2014
Email list thread on MLI between Matei Zaharia and others:
http://mail-archives.apache.org/mod_mbox/spark-dev/201307.mbox/%3C1374785796360.b9575b2a@Nodemailer%3E
Evan Sparks github https://github.com/etrain
http://etrain.github.io/about.html
SPARKS at cs dot berkeley dot edu
Machine Learning Library (MLlib)
deeplearning4j
http://deeplearning4j.org/
A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the [awesome-awesomeness](https://github.com/bayandin/awesome-awesomeness) list.
https://raw.githubusercontent.com/josephmisiti/awesome-machine-learning/master/README.md
http://deeplearning4j.org/word2vec.html
http://deeplearning4j.org/deepautoencoder.html
http://deeplearning4j.org/recursiveneuraltensornetwork.html
Caffe
Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.
http://caffe.berkeleyvision.org/
Scala Akka
http://akka.io/
ADDITIONAL TOOLS
FLUME
http://flume.apache.org/
Machine Learning stacks
FACTORIE
http://factorie.cs.umass.edu/
https://github.com/factorie/factorie
ScalaNLP
http://www.scalanlp.org/
https://github.com/scalanlp
Numerical Libraries
ScalaNLP Breeze
https://github.com/scalanlp/breeze
https://code.google.com/p/scalalab/wiki/BreezeAsScalaLabToolbox
Spire
https://github.com/non/spire
http://typelevel.org/
Saddle
https://github.com/saddle/saddle
JAVA NUMERIC COMPUTING
JBLAS
http://mikiobraun.github.io/jblas/javadoc/org/jblas/package-summary.html
Data mining with WEKA, Part 1: Introduction and regression
http://www.ibm.com/developerworks/library/os-weka1/
Data mining with WEKA, Part 2: Classification and clustering
http://www.ibm.com/developerworks/library/os-weka2/
- jBLAS: An alpha-stage project with JNI wrappers for Atlas: http://www.jblas.org.
- Author's blog post: http://mikiobraun.blogspot.com/2008/10/matrices-jni-directbuffers-and-number.html.
- MTJ: Another such project: http://code.google.com/p/matrix-toolkits-java/
By popular demand, NVIDIA has built a powerful new programming library, NVIDIA® cuDNN.
NVIDIA® cuDNN is a GPU-accelerated library of primitives for deep neural networks. It emphasizes performance, ease of use, and low memory overhead. NVIDIA cuDNN is designed to be integrated into higher-level machine learning frameworks, such as UC Berkeley's popular Caffe software. The simple, drop-in design allows developers to focus on designing and implementing neural net models rather than tuning for performance, while still achieving the high performance modern parallel computing hardware affords.
cuDNN is free for anyone to use for any purpose: academic, research or commercial. Just sign up for a registered CUDA developer account. Once your account is activated, log in and visit the cuDNN page at developer.nvidia.com/cuDNN. The included User Guide will help you use the library.
For any additional questions or to provide feedback, please contact us at cuDNN@nvidia.com.
http://code.google.com/p/word2vec/
Where to obtain the training data
The quality of the word vectors increases significantly with the amount of training data. For research purposes, you can consider using data sets that are available online:
- First billion characters from Wikipedia (use the pre-processing Perl script from the bottom of Matt Mahoney's page)
- Latest Wikipedia dump: use the same script as above to obtain clean text; should be more than 3 billion words
- WMT11 site: text data for several languages (duplicate sentences should be removed before training the models)
- Dataset from the "One Billion Word Language Modeling Benchmark": almost 1B words, already pre-processed text
- UMBC webbase corpus: around 3 billion words; needs further processing (mainly tokenization)
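The tokenization step mentioned for the UMBC corpus amounts to lowercasing and splitting raw text into word tokens. A minimal pure-Python approximation (word2vec's own pre-processing scripts, such as Matt Mahoney's, do considerably more clean-up) might look like:

```python
import re

def tokenize(text):
    """Lowercase and keep only runs of alphabetic characters; a rough
    approximation of the clean-up word2vec training corpora need."""
    return re.findall(r"[a-z]+", text.lower())

line = "The quality of the word vectors increases with data!"
tokens = tokenize(line)
print(tokens)
```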
H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data.
http://0xdata.com/h2o/
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
BMRM
(Bundle Methods for Regularized Risk Minimization)
version 2.1
19 February 2009
http://users.cecs.anu.edu.au/~chteo/BMRM.html
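The objective BMRM targets, regularized risk minimization, is min over w of (λ/2)·||w||² plus the average loss over the data. The toy sketch below illustrates that objective with plain subgradient descent on the hinge loss; note that BMRM itself uses a bundle (cutting-plane) method, not subgradient descent, and all data here is made up.

```python
# Subgradient descent on the regularized risk
#   J(w) = (lam/2)*w^2 + (1/n) * sum_i max(0, 1 - y_i * w * x_i)
# This illustrates the objective BMRM minimizes; BMRM itself solves it
# with a bundle (cutting-plane) method rather than subgradients.

def subgradient(data, w, lam):
    g = lam * w                       # gradient of the regularizer
    for x, y in data:
        if y * w * x < 1.0:           # hinge loss is active here
            g += -y * x / len(data)
    return g

data = [(1.0, 1), (2.0, 1), (-1.0, -1), (-1.5, -1)]  # labels y in {+1, -1}
w, lam = 0.0, 0.1
for t in range(1, 501):
    w -= (1.0 / (lam * t)) * subgradient(data, w, lam)  # 1/(lam*t) step size
print(round(w, 3))
```

For this separable toy data the minimizer sits at w = 1, where the hardest examples lie exactly on the margin.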
ML Inside Presto Distributed SQL Query Engine
http://www.meetup.com/sfmachinelearning/events/218160592/
Presto is an open source distributed SQL query engine used by Facebook in its Hadoop warehouse. It is typically about 10x faster than Hive and can be extended to a number of other use cases. One of these extensions adds SQL functions to create, and make predictions with, machine learning models. The aim is to significantly reduce the time it takes to prototype a model by moving the construction and testing of the model into the database.
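The "move the model into the database" idea can be illustrated with SQLite standing in for Presto. Everything below is hypothetical: the table names and the hand-written logistic scorer are made up, and Presto's actual ML SQL functions look different. The point is only that once coefficients live in a table, prediction is just a query.

```python
# Illustrative only: SQLite stands in for Presto, and the logistic
# scorer is hand-written; Presto's real ML functions differ.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO examples VALUES (?, ?)",
                 [(0.5, 1.0), (-2.0, 0.3), (3.0, -1.0)])

# "Deploy" a trained linear model as rows in a coefficients table.
conn.execute("CREATE TABLE model (feature TEXT, weight REAL)")
conn.executemany("INSERT INTO model VALUES (?, ?)",
                 [("x1", 1.5), ("x2", -0.7)])

# Expose the sigmoid to SQL so scoring happens inside the database.
conn.create_function("sigmoid", 1, lambda z: 1.0 / (1.0 + math.exp(-z)))

rows = conn.execute("""
    SELECT sigmoid(e.x1 * w1.weight + e.x2 * w2.weight)
    FROM examples e,
         (SELECT weight FROM model WHERE feature = 'x1') w1,
         (SELECT weight FROM model WHERE feature = 'x2') w2
""").fetchall()
preds = [p for (p,) in rows]
print([round(p, 3) for p in preds])
```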
Shiny
by RStudio
A web application framework for R
Turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required
http://shiny.rstudio.com/
Applied Deep Learning for Vision and Natural Language with Torch7
OCTOBER 8
THURSDAY
9:00am PDT /12:00pm EDT
Presenter: Nicholas Léonard
Element Inc., Research Engineer
This webinar is targeted at machine learning enthusiasts and researchers, and covers applying deep learning techniques to classifying images and building language models, including convolutional and recurrent neural networks. The session is driven in Torch: a scientific computing platform with strong toolboxes for deep learning and optimization, among others, and fast CUDA backends with multi-GPU support.
Presenter Bio:
Nicholas graduated from the Royal Military College of Canada in 2008 with a bachelor's degree in Computer Science. He retired from the Canadian Army Officer Corps in 2012 to complete a Master's degree in deep learning at the University of Montreal. He currently applies deep learning to biometric authentication using smartphones.
cuDNN
https://developer.nvidia.com/cudnn
Key Features
cuDNN provides high performance building blocks for deep neural network applications, including:
- Forward and backward convolution routines, including cross-correlation, designed for convolutional neural nets
- Arbitrary dimension ordering, striding, and sub-regions for 4d tensors means easy integration into any neural net implementation
- Forward and backward paths for many common layer types such as pooling, ReLU, Sigmoid, softmax and Tanh
- Tensor transformation functions
- Context-based API allows for easy multithreading
- Optimized for the latest NVIDIA GPU architectures
- Supported on Windows, Linux and MacOS systems with Kepler, Maxwell or Tegra K1 GPUs.
Watch the GPU-Accelerated Deep Learning with cuDNN webinar to learn more about cuDNN.
The convolution routines in cuDNN provide best-in-class performance while using almost no extra memory. cuDNN features customizable data layouts, flexible dimension ordering, striding, and sub-regions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility avoids transposition steps to or from other internal representations. cuDNN also offers a context-based API that allows for easy multithreading and optional interoperability with CUDA streams.
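At its core, the forward cross-correlation path that cuDNN accelerates is a sliding dot product. The pure-Python reference below shows just the 2D math on a single channel; cuDNN operates on batched 4D tensors with channels, striding, and padding, and is heavily optimized.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the
    image and take a dot product at each position. This is the core
    of the forward convolution path cuDNN accelerates (cuDNN also
    handles batching, channels, striding, and padding)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
edge = [[1, -1]]          # horizontal difference filter
print(correlate2d(image, edge))
```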
References
- Learn more about GPU-accelerated machine learning and deep learning technologies in these blog posts:
- cuDNN v2: Higher Performance for Deep Learning on GPUs
- Accelerate Machine Learning with the cuDNN Deep Neural Network Library
- Deep Learning for Computer Vision with Caffe and cuDNN
- Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1
- Deep Learning for Image Understanding in Planetary Science
- Review the CUDA 7 Performance Report and webinar recording for more performance data on cuDNN and other GPU-accelerated libraries.
- Additional GPU-Accelerated libraries
- For questions or to provide feedback, please contact cuDNN@nvidia.com
- Find other cuDNN developers on NVIDIA Developer Forums