Efficient Performance Modeling:
Configuring and deploying large scale analytics in the cloud is challenging
as it is often unclear what the appropriate configuration is for a given workload.
Ernest is a performance modeling framework that can be used to predict the optimal
Ernest minimizes the resources used to build a performance model by training on small
samples of data and then predicts
performance on larger datasets and cluster sizes. We also studied how this can be used
to model algorithm convergence rates in Hemingway.
Ernest: NSDI 2016 - Source Code | Hemingway: Learning Systems Workshop, NIPS 2016
Low Latency Scheduling
Schedulers used in analytics frameworks aim to minimize the amount of time spent in accessing data
while ensuring coordination overheads are not high. While centralized batch systems provide
optimal scheduling decisions and fault tolerance, they impose a high overhead for low latency workloads.
On the other hand streaming systems provide low latency during normal execution but incur high
latency while recovering from faults. To address this we built Drizzle, a scheduling framework that combines the benefits of
batch processing and streaming systems by using coarse-grained scheduling with fine-grained
Further, to improve data locality for ML algorithms my work has also studied scheduling techniques (KMN) that
can leverage the fact that algorithms operate on a sample of the input data.
Drizzle: SOSP 2017 - Source Code | KMN: OSDI 2014
A number of real-world machine learning applications
require the combination of multiple algorithms. For example a text
classification program might featurize data using TF-IDF scores, then perform
dimension reduction using PCA and finally learn a model using logistic
regression. We proposed machine learning
pipelines as an abstraction that allows users to compose simple operators and form end-to-end pipelines.
In the KeystoneML project we further studied a number of optimizations enabled by our high
KeystoneML: ICDE 2017 - Source Code | SparkML: Blog Post
Scaling R Programs
R is a widely statistical programming language, but data analysis using R is limited by the
memory available on a single machine. In DistributedR, we proposed a distributed array based
abstraction and developed techniques to efficiently share data across multiple-cores and
mitigate load imbalance for sparse matrix based algorithms. Further, to enable large scale structured data
processing, we developed SparkR, an R package for Apache Spark. SparkR uses distributed data
frames as a unifying abstraction to provide support for SQL queries and machine learning
algorithms from R.
DistributedR: Eurosys 2013 - HotCloud 2012 - Source Code | SparkR: SIGMOD 2016 - Source Code