Monday, August 11, 2014

ML PROJECTS and FRAMEWORKS


Databricks

http://databricks.com/


MLbase


ML Optimizer: This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over the feature extractors and ML algorithms included in MLI and MLlib. The ML Optimizer is currently under active development.

MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions. A prototype of MLI has been implemented against Spark, and serves as a testbed for MLlib.

MLlib: Apache Spark's distributed ML library. MLlib was initially developed as part of the MLbase project, and the library is currently supported by the Spark community. Many features in MLlib have been borrowed from ML Optimizer and MLI, e.g., the model and algorithm APIs, multimodel training, sparse data support, design of local / distributed matrices, etc.
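
A minimal sketch of what calling MLlib looks like from Scala, assuming the Spark 1.x RDD-based API (the input path, split ratios, and iteration count below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    object MLlibSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch"))

        // Load labeled points in LIBSVM format and split into train/test sets.
        val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
        val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

        // Train a logistic regression model with 100 iterations of SGD.
        val model = LogisticRegressionWithSGD.train(train, 100)

        // Fraction of correctly classified test points.
        val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
        println(s"test accuracy = $accuracy")

        sc.stop()
      }
    }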


MLI: An API for Distributed Machine Learning


MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

From the database community, projects like MADLib [12] and Hazy [13] have tried to expose ML algorithms in the context of well established systems. Alternatively, projects like Weka [14], scikit-learn [15] and Google Predict [16] have sought to expose a library of ML tools in an intuitive interface. However, none of these systems focus on the challenges of scaling ML to the emerging distributed data setting.



GitHub

status of mlbase/mli

Evan R. Sparks

Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so!
- Evan 
Apr 01, 2014
Evan Sparks github https://github.com/etrain
http://etrain.github.io/about.html
SPARKS at cs dot berkeley dot edu.

Patrick Wendell https://github.com/pwendell

Machine Learning Library (MLlib)


BIDMach - an interactive, general machine learning toolkit for Big Data

http://bid2.berkeley.edu/bid-data-project/

http://www.meetup.com/Silicon-Valley-Machine-Learning/events/197169132/
It's a data-centered world now, and machine learning is the key to getting value from data. But we believe much of the value from Big Data is untapped and requires better tools that are much faster, more agile and more tunable (allowing tailoring of models). The current wave of tools relies primarily on cluster computing for scale-up. The BID Data project focuses on *single-node performance first* and fully taps the latest hardware developments in graphics processors. It turns out this approach is faster in absolute terms for most problems (i.e., our tool on a graphics processor outperforms all cluster implementations on up to several hundred nodes), is fully interactive, and supports direct prototype-to-production migration (no recoding).

Some problems (e.g., training large deep learning networks) still benefit from scale-up on a cluster. We have developed a new family of communication primitives for large-scale ML which are provably close to optimal for a broad range of problems; for example, they hold the current record for distributed PageRank. Our most recent work is on live tuning and tailoring of models during optimization, and we have developed a new approach to optimization, parameter-cooled Gibbs sampling, to support this.

John Canny
http://en.wikipedia.org/wiki/John_Canny


deeplearning4j
http://deeplearning4j.org/

http://deeplearning4j.org/word2vec.html
http://deeplearning4j.org/deepautoencoder.html
http://deeplearning4j.org/recursiveneuraltensornetwork.html

awesome-machine-learning

A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list (https://github.com/bayandin/awesome-awesomeness).

https://raw.githubusercontent.com/josephmisiti/awesome-machine-learning/master/README.md

Caffe
Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.
http://caffe.berkeleyvision.org/

Scala Akka
http://akka.io/

ADDITIONAL TOOLS
FLUME

http://flume.apache.org/




Machine Learning stacks


FACTORIE

http://factorie.cs.umass.edu/
https://github.com/factorie/factorie

 ScalaNLP

http://www.scalanlp.org/
https://github.com/scalanlp

Numerical Libraries

 ScalaNLP Breeze

https://github.com/scalanlp/breeze

https://code.google.com/p/scalalab/wiki/BreezeAsScalaLabToolbox
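
For reference, a minimal Breeze sketch (assuming the breeze.linalg dense types and breeze.stats.mean), just to show the flavor of the API:

    import breeze.linalg.{DenseMatrix, DenseVector}
    import breeze.stats.mean

    // Basic dense linear algebra.
    val x = DenseVector(1.0, 2.0, 3.0)
    val a = DenseMatrix((1.0, 0.0, 2.0),
                        (0.0, 1.0, 1.0))

    val y = a * x          // matrix-vector product: DenseVector(7.0, 5.0)
    val m = mean(x)        // 2.0
    val scaled = x * 2.0   // element-wise scaling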

 Spire
https://github.com/non/spire
http://typelevel.org/
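
Spire's focus is generic numeric programming via type classes. A minimal sketch, assuming spire.implicits for operator syntax and the Field type class; the mean helper here is illustrative, not part of Spire's API:

    import spire.algebra.Field
    import spire.implicits._
    import spire.math.Rational

    // A mean that works for any Field (Double, BigDecimal, Rational, ...).
    def mean[A: Field](xs: Seq[A]): A =
      xs.reduce(_ + _) / Field[A].fromInt(xs.size)

    val m1 = mean(Seq(1.0, 2.0, 4.0))                  // Double: 2.333...
    val m2 = mean(Seq(Rational(1), Rational(1, 2)))    // exact Rational: 3/4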


 Saddle

https://github.com/saddle/saddle
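
Saddle provides index-labeled vectors, matrices, and data frames, roughly in the spirit of pandas. A minimal sketch, assuming Saddle's Series constructor taking key/value pairs:

    import org.saddle._

    // A 1-D Series keyed by string labels.
    val s = Series("a" -> 1.0, "b" -> 2.0, "c" -> 4.0)

    val m  = s.mean    // arithmetic mean of the values
    val bs = s("b")    // select by index label (returns a Series)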

Data mining with WEKA, Part 1: Introduction and regression

http://www.ibm.com/developerworks/library/os-weka1/

Data mining with WEKA, Part 2: Classification and clustering

http://www.ibm.com/developerworks/library/os-weka2/

JAVA NUMERIC COMPUTING
JBLAS
http://mikiobraun.github.io/jblas/javadoc/org/jblas/package-summary.html
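
jblas exposes a plain Java API backed by native BLAS/LAPACK, and early MLlib releases used it for local linear algebra. A minimal sketch of calling it from Scala (the matrix values are arbitrary):

    import org.jblas.{DoubleMatrix, Solve}

    val a = new DoubleMatrix(Array(Array(4.0, 1.0), Array(1.0, 3.0)))  // 2x2 matrix
    val b = new DoubleMatrix(Array(1.0, 2.0))                          // column vector

    val c = a.mmul(b)          // matrix-vector product
    val x = Solve.solve(a, b)  // solve a * x = b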


By popular demand, NVIDIA has built a powerful new programming library, NVIDIA® cuDNN.
NVIDIA® cuDNN is a GPU-accelerated library of primitives for deep neural networks. It emphasizes performance, ease of use, and low memory overhead. NVIDIA cuDNN is designed to be integrated into higher-level machine learning frameworks, such as UC Berkeley's popular Caffe software. The simple, drop-in design allows developers to focus on designing and implementing neural net models rather than tuning for performance, while still achieving the high performance that modern parallel computing hardware affords.
cuDNN is free for anyone to use for any purpose: academic, research or commercial. Just sign up for a registered CUDA developer account. Once your account is activated, log in and visit the cuDNN page at developer.nvidia.com/cuDNN. The included User Guide will help you use the library.
For any additional questions or to provide feedback, please contact us at cuDNN@nvidia.com.

word2vec

http://code.google.com/p/word2vec/

Where to obtain the training data

The quality of the word vectors increases significantly with the amount of training data. For research purposes, you can consider using data sets that are available online.

H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data.
http://0xdata.com/h2o/
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html


BMRM (Bundle Methods for Regularized Risk Minimization)

version 2.1, 19 February 2009

http://users.cecs.anu.edu.au/~chteo/BMRM.html

PRESTO


ML Inside Presto Distributed SQL Query Engine
http://www.meetup.com/sfmachinelearning/events/218160592/

Presto is an open source distributed SQL query engine used by Facebook in its Hadoop warehouse. It's typically about 10x faster than Hive and can be extended to a number of other use cases. One of these extensions adds SQL functions to create and make predictions with machine learning models. The aim is to significantly reduce the time it takes to prototype a model by moving the construction and testing of the model into the database.

Shiny

by RStudio
A web application framework for R
Turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required

http://shiny.rstudio.com/

Applied Deep Learning for Vision and Natural Language with Torch7

TO UPLOAD SLIDES

OCTOBER 8, THURSDAY
9:00am PDT / 12:00pm EDT

TORCH7: APPLIED DEEP LEARNING FOR VISION AND NATURAL LANGUAGE

Presenter: Nicholas Léonard
Element Inc., Research Engineer

This webinar is targeted at machine learning enthusiasts and researchers and covers applying deep learning techniques to classifying images and building language models, including convolutional and recurrent neural networks. The session is driven in Torch: a scientific computing platform with great toolboxes for deep learning and optimization, among others, and fast CUDA backends with multi-GPU support.

Presenter:
Nicholas Léonard, Research Engineer, Element Inc.
Presenter Bio:
Nicholas graduated from the Royal Military College of Canada in 2008 with a bachelor's degree in Computer Science. He retired from the Canadian Army Officer Corps in 2012 to complete a Master's degree in deep learning at the University of Montreal. He currently applies deep learning to biometric authentication using smartphones.

cuDNN

https://developer.nvidia.com/cudnn

Key Features

cuDNN provides high performance building blocks for deep neural network applications, including:
  • Forward and backward convolution routines, including cross-correlation, designed for convolutional neural nets
  • Arbitrary dimension ordering, striding, and sub-regions for 4D tensors, allowing easy integration into any neural net implementation
  • Forward and backward paths for many common layer types, such as pooling, ReLU, sigmoid, softmax and tanh
  • Tensor transformation functions
  • Context-based API allows for easy multithreading
  • Optimized for the latest NVIDIA GPU architectures
  • Supported on Windows, Linux and MacOS systems with Kepler, Maxwell or Tegra K1 GPUs.
Watch the GPU-Accelerated Deep Learning with cuDNN webinar to learn more about cuDNN.
The convolution routines in cuDNN provide best-in-class performance while using almost no extra memory. cuDNN features customizable data layouts, flexible dimension ordering, striding, and sub-regions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility avoids transposition steps to or from other internal representations. cuDNN also offers a context-based API that allows for easy multithreading and optional interoperability with CUDA streams.

References


15 Deep Learning Libraries