Intro to Spark's Standard Libraries at Stanford Spark Class


Working through the basics of Scala, followed by the Spark framework. The goal is to be writing machine learning algorithms in Spark.

An Introduction to Tachyon - The Next Evolution in Fast Big Data Processing

Yann LeCun
Director of AI Research, Facebook
Founding Director of the NYU Center for Data Science
Silver Professor of Computer Science, Neural Science, and Electrical and Computer Engineering,
The Courant Institute of Mathematical Sciences,
Center for Neural Science, and
Electrical and Computer Engineering Department, NYU School of Engineering
New York University.

Richard Socher
CS224d: Deep Learning for Natural Language Processing

April 23, 2015 Mountain View

• M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, J. Daume. A Neural Network for Factoid Question Answering over Paragraphs
• R. Socher, J. Bauer, C.D. Manning, A.Y. Ng. Parsing with Compositional Vector Grammars
April 15 2015 San Francisco

The papers for the meeting are:

March 3, 2015

• R. Socher, B. Huval, C.D. Manning, A.Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces 

ML Samples

Innovation and Commercialization course from EDX 

course from EDX

Machine Learning

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion. 
Create and explain a template that contains:
• Elastic load balancer 
• Auto scaling group 
• Launch config 
• EC2 instance with user data to turn on a simple webserver to show functionality 
At the end of this template you should be able to go to the Elastic Load Balancer's address (its DNS name) and view a webpage. This will all be created with one simple CloudFormation template.
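A minimal sketch of what such a template could look like, built as a Python dictionary and serialized to JSON. The logical names (WebSecurityGroup, LoadBalancer, LaunchConfig, WebServerGroup), the ImageId parameter, the t2.micro instance type, and the user-data script are assumptions made for illustration, and the template has not been validated against a live AWS account. Networking details (VPC, subnets, security groups) vary by account and will likely need adapting.

```python
import json


def build_template():
    """Return a CloudFormation template as a plain dict (illustrative sketch)."""
    # Hypothetical boot script: assumes a yum-based AMI such as Amazon Linux.
    user_data = "\n".join([
        "#!/bin/bash",
        "yum install -y httpd",
        "echo 'Hello from CloudFormation' > /var/www/html/index.html",
        "service httpd start",
    ])
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "ELB + Auto Scaling group + launch config + simple web server",
        "Parameters": {
            # The AMI is passed in rather than hard-coded; IDs differ per region.
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            "WebSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow inbound HTTP",
                    "SecurityGroupIngress": [{
                        "IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
                        "CidrIp": "0.0.0.0/0",
                    }],
                },
            },
            "LoadBalancer": {
                "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "Listeners": [{
                        "LoadBalancerPort": "80", "InstancePort": "80",
                        "Protocol": "HTTP",
                    }],
                    "HealthCheck": {
                        "Target": "HTTP:80/", "HealthyThreshold": "2",
                        "UnhealthyThreshold": "5", "Interval": "10", "Timeout": "5",
                    },
                },
            },
            "LaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": "t2.micro",
                    "SecurityGroups": [{"Ref": "WebSecurityGroup"}],
                    # User data must be base64-encoded; Fn::Base64 does that.
                    "UserData": {"Fn::Base64": user_data},
                },
            },
            "WebServerGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "LaunchConfig"},
                    "MinSize": "1",
                    "MaxSize": "2",
                    # Instances launched by the group register with the ELB.
                    "LoadBalancerNames": [{"Ref": "LoadBalancer"}],
                },
            },
        },
        "Outputs": {
            "URL": {
                "Value": {"Fn::Join": ["", [
                    "http://", {"Fn::GetAtt": ["LoadBalancer", "DNSName"]},
                ]]},
            },
        },
    }


template = build_template()
template_json = json.dumps(template, indent=2)
```

The EC2 instance is not declared directly: the Auto Scaling group launches it from the launch configuration, which is why the user data lives there. The URL output exposes the load balancer's DNS name once the stack is created.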

Normal Equations, Gradient Descent and Linear Regression

University of Toronto CSC 411: Machine Learning and Data Mining (Sept-Dec 2006)

one of the greatest programming languages ever
Bernd Ulmann
Vintage Computer Festival Europe 2007
A geek with a hat

by Swizec Teller

First steps with Octave and machine learning

Stanford Machine Learning

The following notes represent a complete, standalone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the website during the fall 2011 semester. The topics covered are shown below; for a more detailed summary, see lecture 19. The only content not covered here is the Octave/MATLAB programming.
All diagrams are my own or are taken directly from the lectures; full credit to Professor Ng for a truly exceptional lecture course.

CS 229
Machine Learning
Course Materials


Robert Sedgewick
Kevin Wayne
Princeton University

Machine Learning: Linear Regression With Multiple Variables


Spark Machine Learning Library (MLlib)

Spark Summit 2014

Analyzing endurance-sports activity data with Spark

William Benton (Red Hat, Inc.)

some of the academic and open-source background to what Alpine does:
Alpine Labs Blog

full of references to Spark...

Alpine plus Spark on KDnuggets By Joel Horwitz, Alpine Data Labs, Apr 16, 2014.

Spark solves problems similar to those Hadoop MapReduce does, but with a fast in-memory approach and a clean functional-style API. Leveraging Hadoop YARN, Alpine has made it very simple to get started with Spark.
Two years ago I was having coffee with a friend of mine, and now colleague, Dr. Will Ford in a cafe in San Mateo. We were talking about data science and analytics when he leaned in real close to say, “Have you heard of Spark? This is going to change everything, again.” I had not heard of Spark and started researching the technology the moment I got back to my desk. I quickly realized what all of the fuss was about when I landed on the Berkeley AMPLab site.

Spark is a new technology that sits on top of the Hadoop Distributed File System (HDFS) and is characterized as “a fast and general engine for large-scale data processing.” Spark has three key features that make it the most interesting up-and-coming technology to rock the big data world since Apache Hadoop in 2005.
  1. For iterative analysis such as logistic regression, Random Forests, or other advanced algorithms, Spark has demonstrated a 100X increase in speed that scales to hundreds of millions of rows.
  2. Spark has native support for the latest and greatest programming languages: Java, Scala, and of course Python.
  3. Spark has generality, or platform compatibility in both directions: it integrates nicely with SQL engines (Shark), machine learning (MLlib), and streaming (Spark Streaming), and, using Hadoop’s new YARN cluster manager, requires no new software installed on the cluster.

At Alpine, we have made it dead simple to get started with Spark by including the technology in our latest build out of the box. We require no additional software or hardware to leverage our extensive list of operators for data transformation, exploration, and building advanced analytic models. We leverage Hadoop YARN (Hadoop NextGen) to launch Spark jobs without any pre-installation of Spark or modification of the cluster configuration. This gives our customers seamless integration between our Spark implementation and their Hadoop stack. For example, we analyzed 50 million rows of account data in 50 seconds on a 20-node cluster at last month's GigaOM conference.

The screenshot below shows how Spark does a quick in-memory iteration. It uses a standard way to do the gradient aggregation, as implemented by Databricks, a company which commercializes the Apache Spark framework.

Spark In-memory Iteration

Also, see a demo at 
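To make the idea of gradient aggregation concrete, here is a minimal pure-Python sketch. It is not Spark or Databricks code. Plain lists stand in for data partitions held in memory, and the function names, toy data, step size, and iteration count are all assumptions chosen for illustration. Each iteration computes a partial logistic-regression gradient per partition (the map step), sums the partials (the reduce step), and updates the weights.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partition_gradient(weights, partition):
    """Sum the logistic-loss gradients over the rows of one partition."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        margin = sum(w * x for w, x in zip(weights, features))
        error = sigmoid(margin) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def add_vectors(a, b):
    """Element-wise sum, used to combine partial gradients."""
    return [x + y for x, y in zip(a, b)]


def train(partitions, num_features, iterations=200, step=0.5):
    """Batch gradient descent; the partitions stay in memory across iterations."""
    n = sum(len(p) for p in partitions)
    weights = [0.0] * num_features
    for _ in range(iterations):
        # Map: one partial gradient per partition.
        partials = [partition_gradient(weights, p) for p in partitions]
        # Reduce: aggregate the partial gradients into one vector.
        total = reduce(add_vectors, partials)
        # Update with the average gradient.
        weights = [w - step * g / n for w, g in zip(weights, total)]
    return weights


# Toy data: the first feature is a constant bias term; the label is 1
# when the second feature is positive.
partitions = [
    [([1.0, -2.0], 0), ([1.0, -1.0], 0)],
    [([1.0, 1.0], 1), ([1.0, 2.0], 1)],
]
weights = train(partitions, num_features=2)
```

Because the data is reused on every pass, keeping it in memory rather than rereading it from disk each iteration is where an engine like Spark would be expected to gain its speed on this kind of workload.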

Interested in learning more about Alpine Chorus and Spark? Head over to to get started. 
"How to Become a Data Scientist" Slides and Video

International Conference on Quantum Simulation 2014 
SETI Institute, Mountain View, California
 July 9 and 10, 2014