http://mlbase.org/
From
"Nick Pentreath" <nick.pentre...@gmail.com>
Subject
Re: Machine Learning on Spark [long rambling discussion email]
Date
Thu, 25 Jul 2013 20:56:36 GMT
Cool, I totally understand the constraints you're under, and it's not really a criticism at all
- the AMPLab projects are all awesome!
If I can find ways to help, then all the better.
On Thu, Jul 25, 2013 at 10:04 PM, Matei Zaharia <matei.zaharia@gmail.com>
wrote:
> I fully agree that we need to be clearer with the timelines in AMP Lab. One thing is
> that many of these are still research projects, so it's hard to predict when they will be
> ready for prime-time. Usually with all the things we officially announce (e.g. MLlib, GraphX),
> and especially the things we put in the Spark codebase, the team behind them really wants
> to make them widely available and has committed to spend the engineering to make them usable
> in real applications (as opposed to prototyping and moving on). But even then it can take
> some time to get the first release out. Hopefully we'll improve our communication about this
> through more careful tracking in JIRA.
> Matei
> On Jul 25, 2013, at 11:41 AM, Ameet Talwalkar <ameet@eecs.berkeley.edu> wrote:
>> Hi Nick,
>>
>> I can understand your 'frustration' -- my hope is that having discussions
>> (like the one we're having now) via this mailing list will help mitigate
>> duplicate work moving forward.
>>
>> Regarding your detailed comments, we are aiming to include various
>> components that you mentioned in our release (basic evaluation for
>> collaborative filtering, linear model additions, and basic support for
>> sparse vectors/features). One particularly interesting avenue that is not
>> on our immediate roadmap is adding implicit feedback for matrix
>> factorization. Algorithms like SVD++ are often used in practice, and it
>> would be great to add them to the MLI library (and perhaps also MLlib).
>>
>> -Ameet
>>
>>
>> On Thu, Jul 25, 2013 at 6:44 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Ok, that all makes sense. I can see the benefit of good standard libraries
>>> definitely, and I guess the pieces that felt "missing" to me were what you
>>> are describing as MLI and MLOptimizer.
>>>
>>> It seems like the aims of MLI are very much in line with what I have/had in
>>> mind for a ML library/framework. It seems the goals overlap quite a lot.
>>>
>>> I guess one "frustration" I have had is that there are all these great BDAS
>>> projects, but we never really know when they will be released and what they
>>> will look like until they are. In this particular case I couldn't wait for
>>> MLlib so ended up doing some work myself to port Mahout's ALS and of course
>>> have ended up duplicating effort (which is not a problem as it was
>>> necessary at the time and has been a great learning experience).
>>>
>>> Similarly for GraphX, I would like to develop a project for a Spark-based
>>> version of Faunus (https://github.com/thinkaurelius/faunus) for batch
>>> processing of data in our Titan graph DB. For now I am working with
>>> Bagel-based primitives and Spark RDDs directly. I would love to use
>>> GraphX, but I have no idea when it will be released and have little
>>> involvement until it is.
>>>
>>> (I use "frustration" in the nicest way here - I love the BDAS concepts and
>>> all the projects coming out, I just want them all to be released NOW!! :)
>>>
>>> So yes I would love to be involved in MLlib and MLI work to the extent I
>>> can assist and the work is aligned with what I need currently in my
>>> projects (this is just from a time allocation viewpoint - I'm sure much of
>>> it will be complementary).
>>>
>>> Anyway, it seems to me the best course of action is as follows:
>>>
>>> - I'll get involved in MLlib and see how I can contribute there. Some
>>>   things that jump out:
>>>
>>>   - implicit preference capability for the ALS model, since as far as I
>>>     can see it currently handles explicit prefs only? (Implicit prefs
>>>     here: http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf which is
>>>     typically better if we don't have actual rating data but instead
>>>     "view", "click", "play" or whatever)
>>>
>>>   - RMSE and other evaluation metrics for ALS, as well as test/train
>>>     split / cross-val stuff?
>>>
>>>   - linear model additions, like new loss functions for hinge loss,
>>>     least squares etc for SGD, as well as learning rate stuff
>>>     (http://arxiv.org/pdf/1305.6646) and regularisers (L1/L2/Elastic
>>>     Net), i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn
>>>     (if that's desirable, my view is yes)
>>>
>>>   - what about sparse weight and feature vectors for linear models/SGD?
>>>     Together with hashing this allows very large models while still
>>>     being efficient, and with L1 reg it is particularly useful.
>>>
>>>   - finally, what about online models? i.e. SGD models currently are
>>>     "static", i.e. once trained they can only predict, whereas SGD can
>>>     of course keep learning. Or does one simply re-train using the
>>>     previous weight vector as the initial weights (I guess that can
>>>     work just as well)... Also on this topic: training / predicting on
>>>     Streams as well as RDDs
>>>
>>> - I can put up what I have done to a BitBucket account and grant access
>>>   to whichever devs would like to take a look. The only reason I don't
>>>   just throw it up on GitHub is that frankly it is not really ready and
>>>   is not a fully-fledged project yet (I think anyway). Possibly some of
>>>   this can be useful (not that there's all that much there apart from
>>>   the ALS (but it does solve for both explicit and implicit preference
>>>   data as per Mahout's implementation), KMeans (simpler than the one in
>>>   MLlib as I didn't yet get around to doing KMeans++ init) and the
>>>   arg-parsing / jobrunner (which may or may not be interesting both for
>>>   ML and for Spark jobs in general)).
>>>
>>> Let me know your thoughts
>>> Nick
>>>
>>>
>>> On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar
>>> <ameet@eecs.berkeley.edu> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Thanks for your email, and it's great to see such excitement around this
>>>> work! Matei and Reynold already addressed the motivation behind MLlib as
>>>> well as our reasons for not using Breeze, and I'd like to give you some
>>>> background about MLbase, and discuss how it may fit with your interests.
>>>>
>>>> There are three components of MLbase:
>>>>
>>>> 1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML
>>>> kernels and solid implementations of common algorithms that can be used
>>>> easily by Java/Python and also called into by higher-level systems (e.g.
>>>> MLI, Shark, PySpark).
>>>>
>>>> 2) MLI: this is an ML API that provides a common interface for ML
>>>> algorithms (the same interface used in MLlib), and introduces high-level
>>>> abstractions to simplify feature extraction / exploration and ML
>>>> algorithm development. These abstractions leverage the kernels in MLlib
>>>> when possible, and also introduce additional kernels. This work also
>>>> includes a library written against the MLI. The MLI is currently written
>>>> against Spark, but is designed to be platform independent, so that code
>>>> written against MLI could be run on different engines (e.g., Hadoop,
>>>> GraphX, etc.).
>>>>
>>>>
>>>> 3) ML Optimizer: This piece automates the task of model selection. The
>>>> optimizer can be viewed as a search problem over feature extraction /
>>>> algorithms included in the MLI library, and is in part based on efficient
>>>> cross validation. This work is under active development but is in an
>>>> earlier stage of development than MLlib and MLI.
>>>>
>>>> (note: MLlib will be included with the Spark codebase, while the MLI and
>>>> ML Optimizer will live in separate repositories.)
>>>>
>>>> As far as I can tell (though please correct me if I've misunderstood)
>>>> your main goals include:
>>>>
>>>> i) "consistency in the API"
>>>> ii) "some level of abstraction but to keep things as simple as possible"
>>>> iii) "execute models on Spark ... while providing workflows for
>>>> pipelining transformations, feature extraction, testing and
>>>> cross-validation, and data viz."
>>>>
>>>> The MLI (and to some extent the ML Optimizer) is very much in line with
>>>> these goals, and it would be great if you were interested in contributing
>>>> to it. MLI is a private repository right now, but we'll make it public
>>>> soon, and Evan Sparks or I will let you know when we do so.
>>>>
>>>> Thanks again for getting in touch with us!
>>>>
>>>> -Ameet
>>>>
>>>>
>>>> On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <rxin@cs.berkeley.edu>
>>>> wrote:
>>>>
>>>>> On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>>
>>>>>> I also found Breeze to be very nice to work with and like the DSL -
>>>>>> hence my question about why not use that? (Especially now that Breeze
>>>>>> is actually just breeze-math and breeze-viz).
>>>>>>
>>>>>
>>>>>
>>>>> Matei addressed this from a higher level. I want to provide a little bit
>>>>> more context. A common property of a lot of high-level Scala DSL
>>>>> libraries is that simple operators tend to have high virtual function
>>>>> overheads and also create a lot of temporary objects. And because the
>>>>> level of abstraction is so high, it is fairly hard to debug / optimize
>>>>> performance.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Reynold Xin, AMPLab, UC Berkeley
>>>>> http://rxin.org
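
For readers following the thread: Nick's list mentions hinge loss for SGD, L2 regularisation, sparse feature vectors combined with hashing, and resuming training from a previous weight vector. The sketch below shows how those pieces could fit together in plain Python. It is only an illustration under stated assumptions, not MLlib's or Nick's code; every name in it (hash_features, sgd_hinge, NUM_BUCKETS, the step and l2 defaults) is hypothetical.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 2 ** 18  # hypothetical size of the hashed feature space


def hash_features(tokens, num_buckets=NUM_BUCKETS):
    """Map raw string features to a sparse {index: count} dict (hashing trick)."""
    vec = defaultdict(float)
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1.0
    return dict(vec)


def sparse_dot(weights, x):
    """Dot product that only touches the non-zero entries of x."""
    return sum(weights.get(i, 0.0) * v for i, v in x.items())


def sgd_hinge(data, weights=None, step=0.1, l2=1e-4, epochs=5):
    """SGD with hinge loss and L2 regularisation over sparse examples.

    data:    list of (sparse_vector, label) pairs, label in {-1, +1}
    weights: optional dict from an earlier run, so training can resume
    """
    w = dict(weights) if weights else {}  # copy, so the caller's dict is untouched
    for _ in range(epochs):
        for x, y in data:
            margin = y * sparse_dot(w, x)
            for i, v in x.items():
                wi = w.get(i, 0.0)
                grad = l2 * wi        # L2 term, applied only to the active features
                if margin < 1.0:
                    grad -= y * v     # hinge subgradient when the margin is violated
                w[i] = wi - step * grad
    return w


def predict(weights, x):
    """Return +1 or -1 from the sign of the linear score."""
    return 1 if sparse_dot(weights, x) >= 0.0 else -1
```

Passing the returned weights back in through the weights argument is the "re-train from the previous weight vector" option raised in the thread; a streaming variant would call the same update once per incoming example.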
http://mail-archives.apache.org/mod_mbox/spark-dev/201307.mbox/%3C1374785796360.b9575b2a@Nodemailer%3E
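
A note for readers on two items from Nick's list in the thread above: implicit preferences for ALS, and RMSE with a train/test split. In the Hu, Koren and Volinsky paper he links, a raw interaction count r is turned into a binary preference (1 if r > 0, else 0) and a confidence 1 + alpha * r. The sketch below shows that transform alongside RMSE and a simple random split. It is plain Python for illustration only; the function names, the alpha default, and the split defaults are assumptions, not MLlib's API.

```python
import math
import random


def implicit_to_preference(count, alpha=1.0):
    """Turn a raw interaction count into (preference, confidence).

    preference is 1.0 if any interaction was observed, else 0.0;
    confidence grows linearly with the count: 1 + alpha * count.
    alpha is a tunable constant (the default here is arbitrary).
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


def train_test_split(ratings, test_fraction=0.2, seed=42):
    """Shuffle a copy of ratings and split it into (train, test) lists."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With explicit ratings, rmse on the held-out test list is the usual check. With implicit data the paper uses a ranking-based measure instead, so rmse is shown here only for the explicit case.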
========
Hi Lochana,
This post is also referring to the MLbase project I mentioned in my
previous email. We have not open-sourced this work, but plan to do so.
Moreover, you might want to check out the following JIRA ticket
<https://issues.apache.org/jira/browse/SPARK-3530> that includes the design
doc for ML pipelines and parameters in MLlib. This design will include
many of the ideas from our MLbase work.
-Ameet
==================
Status of MLI?