Databricks
http://databricks.com/
MLbase
ML Optimizer: This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI and MLlib. The ML Optimizer is currently under active development.
MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions. A prototype of MLI has been implemented against Spark, and serves as a testbed for MLlib.
MLlib: Apache Spark's distributed ML library. MLlib was initially developed as part of the MLbase project, and the library is currently supported by the Spark community. Many features in MLlib have been borrowed from ML Optimizer and MLI, e.g., the model and algorithm APIs, multimodel training, sparse data support, design of local / distributed matrices, etc.
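The multi-model training idea mentioned above can be sketched in plain Python. This is a conceptual illustration only, not the MLlib API: it fits one tiny ridge-regression model per hyperparameter setting and keeps the setting with the lowest held-out error.

```python
# Conceptual sketch of multi-model (grid) training; not the MLlib API.
# We fit y = w*x by 1-D ridge regression under several penalties lam
# and keep the penalty that minimizes held-out error.

def fit_ridge(data, lam):
    """Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(data, w):
    """Mean squared error of the model y ≈ w*x on a dataset."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
heldout = [(4.0, 8.0), (5.0, 9.9)]

# "Multi-model training": evaluate a whole grid of candidate models.
grid = [0.0, 0.1, 1.0, 10.0]
best_lam, best_w = min(
    ((lam, fit_ridge(train, lam)) for lam in grid),
    key=lambda lw: mse(heldout, lw[1]),
)
print(best_lam, round(best_w, 3))
```

In a distributed setting the point is that all grid models can share the same passes over the training data; the sketch above only captures the selection logic.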
MLI: An API for Distributed Machine Learning
MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.
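One data-centric pattern this kind of interface targets is that many ML objectives have gradients that decompose into sums over data partitions, so each worker can reduce its own shard locally. A minimal plain-Python sketch (illustrative only; not the MLI API, and `map` stands in for remote execution):

```python
# Illustrative data-parallel gradient descent; not the actual MLI API.
# The gradient of a squared loss decomposes into a sum over shards,
# so each "worker" computes a partial gradient on its own data.

def partial_gradient(shard, w):
    """Gradient of 0.5*(w*x - y)^2 summed over one shard."""
    return sum((w * x - y) * x for x, y in shard)

def distributed_gradient_step(shards, w, lr=0.01):
    # In a real system each shard lives on a different worker;
    # here "map" stands in for remote execution and "sum" for reduce.
    grads = map(lambda s: partial_gradient(s, w), shards)
    return w - lr * sum(grads)

shards = [
    [(1.0, 2.0), (2.0, 4.0)],   # worker 0's data (y = 2x)
    [(3.0, 6.0), (4.0, 8.0)],   # worker 1's data
]
w = 0.0
for _ in range(200):
    w = distributed_gradient_step(shards, w)
print(round(w, 3))
```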
From the database community, projects like MADLib [12] and Hazy [13]
have tried to expose ML algorithms in the context of well
established systems. Alternatively, projects like Weka [14],
scikit-learn [15] and Google Predict [16] have sought to expose
a library of ML tools in an intuitive interface. However, none
of these systems focus on the challenges of scaling ML to the
emerging distributed data setting.
GitHub
Status of MLbase/MLI
Evan R. Sparks
Hi there,
MLlib is the first component of MLbase - MLI and the higher levels of the stack are still being developed. Look for updates in terms of our progress on the hyperparameter tuning/model selection problem in the next month or so!
- Evan Apr 01, 2014
Email list thread on MLI between Matei Zaharia and others:
http://mail-archives.apache.org/mod_mbox/spark-dev/201307.mbox/%3C1374785796360.b9575b2a@Nodemailer%3E
Evan Sparks github https://github.com/etrain
http://etrain.github.io/about.html
SPARKS at cs dot berkeley dot edu
Machine Learning Library (MLlib)
deeplearning4j
http://deeplearning4j.org/
A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the [awesome-awesomeness](https://github.com/bayandin/awesome-awesomeness) list.
https://raw.githubusercontent.com/josephmisiti/awesome-machine-learning/master/README.md
http://deeplearning4j.org/word2vec.html
http://deeplearning4j.org/deepautoencoder.html
http://deeplearning4j.org/recursiveneuraltensornetwork.html
Caffe
Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Yangqing Jia during his PhD at UC Berkeley, and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license.
http://caffe.berkeleyvision.org/
Scala Akka
http://akka.io/
ADDITIONAL TOOLS
FLUME
http://flume.apache.org/
Machine Learning stacks
FACTORIE
http://factorie.cs.umass.edu/
https://github.com/factorie/factorie
ScalaNLP
http://www.scalanlp.org/
https://github.com/scalanlp
Numerical Libraries
ScalaNLP Breeze
https://github.com/scalanlp/breeze
https://code.google.com/p/scalalab/wiki/BreezeAsScalaLabToolbox
Spire
https://github.com/non/spire
http://typelevel.org/
Saddle
https://github.com/saddle/saddle
JAVA NUMERIC COMPUTING
JBLAS
http://mikiobraun.github.io/jblas/javadoc/org/jblas/package-summary.html
Data mining with WEKA, Part 1: Introduction and regression
http://www.ibm.com/developerworks/library/os-weka1/
Data mining with WEKA, Part 2: Classification and clustering
http://www.ibm.com/developerworks/library/os-weka2/
- jBLAS: An alpha-stage project with JNI wrappers for Atlas: http://www.jblas.org.
- Author's blog post: http://mikiobraun.blogspot.com/2008/10/matrices-jni-directbuffers-and-number.html.
- MTJ: Another such project: http://code.google.com/p/matrix-toolkits-java/
By popular demand, NVIDIA has built a powerful new programming library, NVIDIA® cuDNN.
NVIDIA® cuDNN is a GPU-accelerated library of primitives for deep neural networks. It emphasizes performance, ease of use, and low memory overhead. NVIDIA cuDNN is designed to be integrated into higher-level machine learning frameworks, such as UC Berkeley's popular Caffe software. The simple, drop-in design allows developers to focus on designing and implementing neural net models rather than tuning for performance, while still achieving the high performance modern parallel computing hardware affords.
cuDNN is free for anyone to use for any purpose: academic, research or commercial. Just sign up for a registered CUDA developer account. Once your account is activated, log in and visit the cuDNN page at developer.nvidia.com/cuDNN. The included User Guide will help you use the library.
For any additional questions or to provide feedback, please contact us at cuDNN@nvidia.com.
http://code.google.com/p/word2vec/
Where to obtain the training data
The quality of the word vectors increases significantly with the amount of training data. For research purposes, you can consider using data sets that are available online:
- First billion characters from Wikipedia (use the pre-processing Perl script from the bottom of Matt Mahoney's page)
- Latest Wikipedia dump: use the same script as above to obtain clean text; should be more than 3 billion words
- WMT11 site: text data for several languages (duplicate sentences should be removed before training the models)
- Dataset from the "One Billion Word Language Modeling Benchmark": almost 1B words, already pre-processed text
- UMBC webbase corpus: around 3 billion words; needs further processing (mainly tokenization)
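The tokenization step mentioned for the UMBC corpus amounts to lowercasing and splitting raw text into word tokens. A minimal pure-Python approximation (word2vec's own pre-processing scripts, such as Matt Mahoney's, do considerably more clean-up) might look like:

```python
import re

def tokenize(text):
    """Lowercase and keep only runs of alphabetic characters; a rough
    approximation of the clean-up word2vec training corpora need."""
    return re.findall(r"[a-z]+", text.lower())

line = "The quality of the word vectors increases with data!"
tokens = tokenize(line)
print(tokens)
```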
H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data.
http://0xdata.com/h2o/
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
BMRM
(Bundle Methods for Regularized Risk Minimization)
version 2.1
19 February 2009
http://users.cecs.anu.edu.au/~chteo/BMRM.html
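The objective BMRM targets, regularized risk minimization, is min over w of (λ/2)·||w||² plus the average loss over the data. The toy sketch below illustrates that objective with plain subgradient descent on the hinge loss; note that BMRM itself uses a bundle (cutting-plane) method, not subgradient descent, and all data here is made up.

```python
# Subgradient descent on the regularized risk
#   J(w) = (lam/2)*w^2 + (1/n) * sum_i max(0, 1 - y_i * w * x_i)
# This illustrates the objective BMRM minimizes; BMRM itself solves it
# with a bundle (cutting-plane) method rather than subgradients.

def subgradient(data, w, lam):
    g = lam * w                       # gradient of the regularizer
    for x, y in data:
        if y * w * x < 1.0:           # hinge loss is active here
            g += -y * x / len(data)
    return g

data = [(1.0, 1), (2.0, 1), (-1.0, -1), (-1.5, -1)]  # labels y in {+1, -1}
w, lam = 0.0, 0.1
for t in range(1, 501):
    w -= (1.0 / (lam * t)) * subgradient(data, w, lam)  # 1/(lam*t) step size
print(round(w, 3))
```

For this separable toy data the minimizer sits at w = 1, where the hardest examples lie exactly on the margin.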
ML Inside Presto Distributed SQL Query Engine
http://www.meetup.com/sfmachinelearning/events/218160592/
Presto is an open source distributed SQL query engine used by Facebook in its Hadoop warehouse. It is typically about 10x faster than Hive and can be extended to a number of other use cases. One of these extensions adds SQL functions to create, and make predictions with, machine learning models. The aim is to significantly reduce the time it takes to prototype a model by moving the construction and testing of the model into the database.
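The "move the model into the database" idea can be illustrated with SQLite standing in for Presto. Everything below is hypothetical: the table names and the hand-written logistic scorer are made up, and Presto's actual ML SQL functions look different. The point is only that once coefficients live in a table, prediction is just a query.

```python
# Illustrative only: SQLite stands in for Presto, and the logistic
# scorer is hand-written; Presto's real ML functions differ.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (x1 REAL, x2 REAL)")
conn.executemany("INSERT INTO examples VALUES (?, ?)",
                 [(0.5, 1.0), (-2.0, 0.3), (3.0, -1.0)])

# "Deploy" a trained linear model as rows in a coefficients table.
conn.execute("CREATE TABLE model (feature TEXT, weight REAL)")
conn.executemany("INSERT INTO model VALUES (?, ?)",
                 [("x1", 1.5), ("x2", -0.7)])

# Expose the sigmoid to SQL so scoring happens inside the database.
conn.create_function("sigmoid", 1, lambda z: 1.0 / (1.0 + math.exp(-z)))

rows = conn.execute("""
    SELECT sigmoid(e.x1 * w1.weight + e.x2 * w2.weight)
    FROM examples e,
         (SELECT weight FROM model WHERE feature = 'x1') w1,
         (SELECT weight FROM model WHERE feature = 'x2') w2
""").fetchall()
preds = [p for (p,) in rows]
print([round(p, 3) for p in preds])
```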
Shiny
by RStudio
A web application framework for R
Turn your analyses into interactive web applications
No HTML, CSS, or JavaScript knowledge required
http://shiny.rstudio.com/
Applied Deep Learning for Vision and Natural Language with Torch7
OCTOBER 8
THURSDAY
9:00am PDT /12:00pm EDT
Presenter: Nicholas Léonard
Element Inc., Research Engineer
This webinar is targeted at machine learning enthusiasts and researchers, and covers applying deep learning techniques to classifying images and building language models, including convolutional and recurrent neural networks. The session is driven in Torch: a scientific computing platform with strong toolboxes for deep learning and optimization, among others, and fast CUDA backends with multi-GPU support.
Presenter Bio:
Nicholas graduated from the Royal Military College of Canada in 2008 with a bachelor's degree in Computer Science. He retired from the Canadian Army Officer Corps in 2012 to complete a Master's degree in deep learning at the University of Montreal. He currently applies deep learning to biometric authentication using smartphones.
cuDNN
https://developer.nvidia.com/cudnn
Key Features
cuDNN provides high performance building blocks for deep neural network applications, including:
- Forward and backward convolution routines, including cross-correlation, designed for convolutional neural nets
- Arbitrary dimension ordering, striding, and sub-regions for 4d tensors means easy integration into any neural net implementation
- Forward and backward paths for many common layer types such as pooling, ReLU, Sigmoid, softmax and Tanh
- Tensor transformation functions
- Context-based API allows for easy multithreading
- Optimized for the latest NVIDIA GPU architectures
- Supported on Windows, Linux and MacOS systems with Kepler, Maxwell or Tegra K1 GPUs.
Watch the GPU-Accelerated Deep Learning with cuDNN webinar to learn more about cuDNN.
The convolution routines in cuDNN provide best-in-class performance while using almost no extra memory. cuDNN features customizable data layouts, flexible dimension ordering, striding, and sub-regions for the 4D tensors used as inputs and outputs to all of its routines. This flexibility avoids transposition steps to or from other internal representations. cuDNN also offers a context-based API that allows for easy multithreading and optional interoperability with CUDA streams.
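At its core, the forward cross-correlation path that cuDNN accelerates is a sliding dot product. The pure-Python reference below shows just the 2D math on a single channel; cuDNN operates on batched 4D tensors with channels, striding, and padding, and is heavily optimized.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the
    image and take a dot product at each position. This is the core
    of the forward convolution path cuDNN accelerates (cuDNN also
    handles batching, channels, striding, and padding)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
edge = [[1, -1]]          # horizontal difference filter
print(correlate2d(image, edge))
```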
References
- Learn more about GPU-accelerated machine learning and deep learning technologies in these blog posts:
- cuDNN v2: Higher Performance for Deep Learning on GPUs
- Accelerate Machine Learning with the cuDNN Deep Neural Network Library
- Deep Learning for Computer Vision with Caffe and cuDNN
- Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1
- Deep Learning for Image Understanding in Planetary Science
- Review the CUDA 7 Performance Report and webinar recording for more performance data on cuDNN and other GPU-accelerated libraries.
- Additional GPU-Accelerated libraries
- For questions or to provide feedback, please contact cuDNN@nvidia.com
- Find other cuDNN developers on NVIDIA Developer Forums