Monday, August 4, 2014

SPARK


http://spark.apache.org/

Ion Stoica's Home Page
http://www.cs.berkeley.edu/~istoica/

Scott Shenker
http://www.eecs.berkeley.edu/Faculty/Homepages/shenker.html

SLIDESHARE TUTORIALS

1. By Sameer Farooqui, customer solutions architect, @blueplastic

1.1. http://www.slideshare.net/blueplastic/spark-cassandra-at-datastax-meetup-on-jan-29-2015

1.2. C* Summit 2013: Comparing Architectures: Cassandra vs the Field by Sameer Farooqui
http://www.slideshare.net/planetcassandra/cassandra-vsh-base

1.3. Sameer Farooqui (Databricks) led a superb tutorial/hands-on lab on Hadoop Fundamentals. He covered a number of technologies in the Hadoop ecosystem. The lab is available under the Creative Commons license at
http://tinyurl.com/bigdatatechcon
Lab created: Dec 2014


2. Oct 23, 2014
Patrick McFadin and Helena Edelson from DataStax taught a tutorial at NYC Strata last week in which they built a prototype Spark Streaming + Kafka application for time series data. The code is here: https://github.com/killrweather/killrweather
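
The KillrWeather repo is the place to look for the real thing; below is only a rough sketch of the Spark Streaming + Kafka pattern it uses, based on the receiver-based KafkaUtils.createStream from the Spark 1.x spark-streaming-kafka artifact. The ZooKeeper address, consumer group, topic name, and CSV layout are placeholders, not the project's actual configuration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

object WeatherStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("weather-stream-sketch"), Seconds(5))

    // Receiver-based Kafka stream; all connection details below are placeholders.
    val lines = KafkaUtils.createStream(
      ssc, "localhost:2181", "weather-consumers", Map("raw_weather" -> 1)
    ).map(_._2)

    // Toy time-series rollup: assume "stationId,ts,temperature" CSV events
    // and keep the max temperature per station for each 5-second batch.
    val maxTempByStation = lines.map(_.split(","))
      .map(cols => (cols(0), cols(2).toDouble))
      .reduceByKey((a, b) => math.max(a, b))

    maxTempByStation.print()
    ssc.start()
    ssc.awaitTermination()
  }
}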

3. from http://tinyurl.com/bigdatatechcon

Labs: Intro to HDFS & Apache Spark on CDH 5.2

https://docs.google.com/document/d/1X-VxSa99bPfwk_pQcvcb3q7RDKBzi7hzqvfLOPApV7A/edit

IntroToMLUsingSparkatSVCC.pdf
https://www.scribd.com/doc/243546790/IntroToMLUsingSparkatSVCC-pdf


Learning Spark
Lightning-Fast Big Data Analysis
By Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
http://shop.oreilly.com/product/0636920028512.do
++ Spark Training Videos ++
From Spark Summit 2013:  http://spark-summit.org/2013
From Spark Summit 2014:  http://spark-summit.org/2014


++ Databricks Resources for Spark ++
Databricks will be releasing free videos, docs, and labs for learning Spark here:


++ Spark Certification ++
Note, when you’re ready to get certified as a Spark Developer, check out the joint Spark certification program between Databricks & O’Reilly:



http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification

SPARK DOCUMENTATION

API: https://spark.apache.org/docs/1.2.0/api/scala/index.html
Docs: https://spark.apache.org/docs/1.2.0


Spark SQL Programming Guide

SPARK CLUSTER PAPERS

SparkNet: Training Deep Networks in Spark

Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work. However, widely-popular batch-processing computational frameworks like MapReduce and Spark were not designed to support the asynchronous and communication-intensive workloads of existing distributed deep learning systems. We introduce SparkNet, a framework for training deep networks in Spark. Our implementation includes a convenient interface for reading data from Spark RDDs, a Scala interface to the Caffe deep learning framework, and a lightweight multi-dimensional tensor library. Using a simple parallelization scheme for stochastic gradient descent, SparkNet scales well with the cluster size and tolerates very high-latency communication. Furthermore, it is easy to deploy and use with no parameter tuning, and it is compatible with existing Caffe models. We quantify the dependence of the speedup obtained by SparkNet on the number of machines, the communication frequency, and the cluster's communication overhead, and we benchmark our system's performance on the ImageNet dataset.
Comments: 11 pages, 6 figures
Subjects: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
Cite as: arXiv:1511.06051 [stat.ML]
(or arXiv:1511.06051v1 [stat.ML] for this version)

Submission history

From: Robert Nishihara
[v1] Thu, 19 Nov 2015 03:29:56 GMT
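
SparkNet's own Scala/Caffe interface isn't shown here; the following is only a plain-Spark sketch of the data-parallel SGD pattern the abstract describes (cache the data, compute gradients per task, average on the driver, broadcast the new weights). All names and the toy dataset are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// NOT SparkNet's actual API -- a plain-Spark sketch of data-parallel SGD.
object ParallelSgdSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-sgd-sketch"))

    val d = 10
    // Toy dataset of (features, label); a real job would load this from HDFS.
    val data = sc.parallelize(Seq.fill(10000) {
      (Array.fill(d)(util.Random.nextDouble()), util.Random.nextInt(2).toDouble)
    }).cache()

    var w = Array.fill(d)(0.0)
    val lr = 0.1
    for (_ <- 1 to 20) {
      val wB = sc.broadcast(w)
      // Squared-loss gradient of a linear model, summed across the cluster.
      val (gradSum, n) = data.map { case (x, y) =>
        val pred = x.zip(wB.value).map { case (xi, wi) => xi * wi }.sum
        (x.map(_ * (pred - y)), 1L)
      }.reduce { case ((g1, n1), (g2, n2)) =>
        (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
      }
      w = w.zip(gradSum).map { case (wi, gi) => wi - lr * gi / n }
    }
    println("learned weights: " + w.mkString(", "))
  }
}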

SPARK MEETUP PRESENTATIONS


http://www.meetup.com/spark-users/files/




Distributed Computing with Spark: Reza Zadeh, ICME program at Stanford

As computer clusters scale up, data flow models such as MapReduce have emerged as a way to run fault-tolerant computations on commodity hardware. Unfortunately, MapReduce is limited in efficiency for many numerical algorithms. We show how new data flow engines, such as Apache Spark, enable much faster iterative and numerical computations, while keeping the scalability and fault-tolerance properties of MapReduce. In this tutorial, we will begin with an overview of data flow computing models and the commodity cluster environment in comparison with traditional HPC and message-passing environments. We will then introduce Spark and show how common numerical and machine learning algorithms have been implemented on it. We will cover both algorithmic ideas and a practical introduction to programming with Spark. 
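
To give a feel for the "practical introduction to programming with Spark" part, here is a minimal MLlib sketch of the kind of iterative algorithm the talk covers: k-means over a cached RDD. The input path and parameter values are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch"))

    // Each input line is a whitespace-separated feature vector; the path is a placeholder.
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache() // caching is what makes the repeated passes over the data cheap

    val model = KMeans.train(points, 10, 20) // k = 10 clusters, 20 iterations
    println("Within-set sum of squared errors: " + model.computeCost(points))
  }
}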

Reza Zadeh is a Consulting Professor of Computational Mathematics at Stanford and a Technical Advisor at Databricks. He focuses on Discrete Applied Mathematics, Machine Learning Theory and Applications, and Large-Scale Distributed Computing. More information is available on his website: stanford.edu/~rezab/

https://www.youtube.com/watch?v=LfHJPVpZNao&list=PL87GtQd0bfJx73ibrce-Hl_kUhX2MBXGn

http://www.meetup.com/Spark-NYC/events/209271842/

As it is mentioned, Apache Spark follows a DAG (Directed Acyclic Graph) execution engine for execution. What is the whole concept about it and the overall architecture of the Spark?

http://www.quora.com/As-it-is-mentioned-Apache-Spark-follows-a-DAG-Directed-Acyclic-Graph-execution-engine-for-execution-What-is-the-whole-concept-about-it-and-the-overall-architecture-of-the-Spark
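
Short version of the answer: transformations declare a lineage graph (the DAG) lazily, and an action triggers the scheduler to cut that DAG into stages and run tasks. A minimal sketch follows; the paths and log format are invented for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch"))

    val lines  = sc.textFile("hdfs:///logs/app.log")        // DAG node, file not read yet
    val errors = lines.filter(_.contains("ERROR"))          // narrow dependency
    val counts = errors.map(line => (line.split(" ")(1), 1))
                       .reduceByKey(_ + _)                   // wide dependency -> stage boundary

    counts.collect().foreach(println)  // the action that actually triggers execution

    // Prints the lineage graph (the DAG) Spark built for this RDD.
    println(counts.toDebugString)
  }
}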

Advanced Spark meetup presentation
http://www.meetup.com/Advanced-Apache-Spark-Meetup/events/225715756/
[SF] Deep Dive: Spark SQL + DataFrames + Data Sources API + Parquet + Cassandra Connector
Overview 
Come join us for a deep dive into the details of the spark-cassandra-connector.
This implementation of the Spark SQL Data Sources API is one of the most advanced and performance-tunable connectors available.
Highlights of the spark-cassandra-connector (a short read sketch follows this list):
1) Token-ring aware data locality for co-location with Spark Worker nodes
2) Pushdown filter support for optimal performance and participation in the advanced Spark SQL Catalyst Query Optimizations
3) Spark 1.4, Spark 1.5 DataFrame support
4) Enables a single Cassandra data store to serve both your transactional and analytics needs (there are pros and cons to this)
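
For context before the meetup, a minimal read through the Data Sources API with the connector might look like the sketch below (connector 1.4/1.5 era; the keyspace, table, and column names are invented for illustration).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CassandraSourceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-source-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load a Cassandra table as a DataFrame through the Data Sources API.
    val readings = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "sensors", "table" -> "raw_readings"))
      .load()

    // Simple predicates like this can be pushed down to Cassandra by the
    // connector instead of being applied after a full scan.
    readings.filter("station_id = 'KSFO'")
      .groupBy("day")
      .count()
      .show()
  }
}
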
Rough Agenda
7-7:15pm:  Introductions and Announcements
7:15-7:30pm:  Highlights from Strata NYC 
7:30-8:00pm:  Spark SQL Data Sources API Overview
8:00-8:30pm:  Details of the spark-cassandra-connector Data Sources API implementation
Related Links
0) Overview of the Data Sources API
1) The spark-cassandra-connector is an implementation of the Spark SQL DataSources API similar to the following: 
2) Examples of the spark-cassandra-connector in action:
3) Spark SQL Data Sources API

Advanced Spark meetup
Details
A code-level deep dive into the optimizations that allowed Spark to win the Daytona GraySort Challenge.
We'll discuss the following at a code level (a configuration sketch follows this list):
1) Sort-based Shuffle (less OS resources)
2) Netty-based Network module (epoll, async, ByteBuffer reuse)
3) External Shuffle Service (also allows for auto-scaling of Worker nodes)
4) AlphaSort-style cache-locality optimizations
http://www.slideshare.net/SparkSummit/deep-dive-into-project-tungsten-josh-rosen (slide 22)
https://issues.apache.org/jira/browse/SPARK-7082
5) https://issues.apache.org/jira/browse/SPARK-9850
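
For reference, here is a rough sketch of the configuration knobs behind these optimizations, using settings that existed around Spark 1.4/1.5; defaults differ by version, so treat this as illustrative rather than a tuning guide.

import org.apache.spark.SparkConf

// Shuffle-related settings touched on above.
val conf = new SparkConf()
  .setAppName("graysort-style-shuffle")
  .set("spark.shuffle.manager", "sort")               // sort-based shuffle (the default since 1.2)
  .set("spark.shuffle.blockTransferService", "netty") // Netty-based network module
  .set("spark.shuffle.service.enabled", "true")       // external shuffle service on each node
  .set("spark.dynamicAllocation.enabled", "true")     // executors can scale while shuffle files survive
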
Relevant Links

Spark SQL: Relational Data Processing in Spark
https://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Michael Armbrust†, Reynold S. Xin†, Cheng Lian†, Yin Huai†, Davies Liu†, Joseph K. Bradley†, Xiangrui Meng†, Tomer Kaftan‡, Michael J. Franklin†‡, Ali Ghodsi†, Matei Zaharia†* (†Databricks Inc., *MIT CSAIL, ‡AMPLab, UC Berkeley)

Abstract: Spark SQL is a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
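
Two of the ideas the abstract highlights, JSON schema inference and mixing declarative DataFrame operators with procedural Spark code, look roughly like this in the 1.3+ API; the file paths and column names are placeholders, not examples from the paper.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlPaperSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparksql-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Catalyst infers the schema by sampling the JSON records.
    val users = sqlContext.read.json("hdfs:///data/users.json")
    users.printSchema()

    // Declarative, optimizable query...
    val active = users.filter($"active" === true).groupBy($"country").count()

    // ...interleaved with ordinary procedural Spark code on the result.
    active.rdd
      .map(row => row.getString(0) + ": " + row.getLong(1))
      .saveAsTextFile("hdfs:///out/active_by_country")
  }
}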











