I am a PhD candidate in Computer Science at UC Berkeley. I am part of the AMPLab, where I am advised by Ion Stoica and Mike Franklin. My research interests are in designing systems and algorithms for large-scale data analysis. Before coming to Berkeley, I completed my Master's at the University of Illinois at Urbana-Champaign, where I worked in the Systems Research Group with Prof. Roy Campbell.

I am graduating in May 2017 and am looking for academic jobs.

CV -
Research Statement -
Teaching Statement

### Selected Projects

**Efficient Performance Modeling**:
Configuring and deploying large scale analytics in the cloud is challenging
as it is often unclear what the appropriate configuration is for a given workload.
Ernest is a performance modeling framework that can be used to predict the optimal
cluster configuration.
Ernest minimizes the resources used to build a performance model by training on small
samples of data and then predicting performance on larger datasets and cluster sizes.
In Hemingway, we also studied how this approach can be used to model algorithm
convergence rates.
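
As a rough illustration of training on small configurations and extrapolating, the Python sketch below fits a linear model on timings from small inputs and small clusters, then predicts running time for a larger input across candidate cluster sizes. The feature set, the synthetic timings, the plain least-squares fit, and all function names are assumptions made for this example; they are not taken from the Ernest source code.

```python
import math

def features(size_gb, machines):
    # Assumed terms for illustration: fixed cost, parallelizable work,
    # aggregation cost growing with log(machines), per-machine overhead.
    return [1.0, size_gb / machines, math.log(machines), float(machines)]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit(samples):
    """Ordinary least squares via the normal equations (A^T A) theta = A^T t."""
    rows = [features(size, machines) for size, machines, _ in samples]
    times = [t for _, _, t in samples]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve(ata, atb)

def predict(theta, size_gb, machines):
    return sum(w * x for w, x in zip(theta, features(size_gb, machines)))

# Synthetic "measurements" generated from made-up coefficients, standing in
# for timings collected on small data samples and small clusters.
TRUE_THETA = [5.0, 12.0, 1.5, 0.2]
small_configs = [(1, 1), (2, 1), (2, 2), (4, 2), (4, 4), (8, 4), (8, 8), (6, 8)]
samples = [
    (size, machines, sum(w * x for w, x in zip(TRUE_THETA, features(size, machines))))
    for size, machines in small_configs
]

theta = fit(samples)

# Extrapolate to a 100 GB input and pick the candidate with the lowest predicted time.
candidates = [4, 8, 16, 32, 64, 128]
best = min(candidates, key=lambda machines: predict(theta, 100.0, machines))
print(best, round(predict(theta, 100.0, best), 2))
```

Picking the minimum predicted time is only one possible objective; the same fitted model could equally be used to trade running time against cost.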

Ernest: NSDI 2016 -
Source Code |
Hemingway: Learning Systems Workshop, NIPS 2016

**Low Latency Scheduling**:
Schedulers used in analytics frameworks aim to minimize the amount of time spent in accessing data
while ensuring coordination overheads are not high. While centralized batch systems provide
optimal scheduling decisions and fault tolerance, they impose a high overhead for low latency workloads.
On the other hand, streaming systems provide low latency during normal execution but incur high
latency while recovering from faults. To address this, we built Drizzle, a scheduling framework that combines the benefits of
batch processing and streaming systems by using coarse-grained scheduling with fine-grained
execution.
Further, to improve data locality for ML algorithms, my work has also studied scheduling techniques (KMN) that
leverage the fact that these algorithms operate on a sample of the input data.

Drizzle: Technical Report -
Source Code |
KMN: OSDI 2014

**ML Pipelines**:
A number of real-world machine learning applications
require the combination of multiple algorithms. For example, a text
classification program might featurize data using TF-IDF scores, then perform
dimensionality reduction using PCA, and finally learn a model using logistic
regression. We proposed machine learning
pipelines as an abstraction that allows users to compose simple operators into end-to-end pipelines.
In the KeystoneML project we further studied a number of optimizations enabled by our high
level API.
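
To make the idea of composing simple operators concrete, here is a minimal Python sketch of a pipeline abstraction. The `Transformer` and `IdfEstimator` classes and the `and_then` method are hypothetical names chosen for this example and are not the KeystoneML or Spark ML API; the PCA and logistic regression stages from the example above are omitted to keep the sketch short.

```python
import math
from collections import Counter

class Transformer:
    """A pipeline stage that maps one input item to one output item."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, item):
        return self.fn(item)

    def and_then(self, nxt):
        # Chaining two stages yields another Transformer, so pipelines compose.
        return Transformer(lambda item: nxt(self(item)))

class IdfEstimator:
    """A stage that must see training data before it can transform items."""

    def fit(self, tokenized_docs):
        n = len(tokenized_docs)
        doc_freq = Counter(tok for tokens in tokenized_docs for tok in set(tokens))
        idf = {tok: math.log(n / count) for tok, count in doc_freq.items()}

        def score(tokens):
            tf = Counter(tokens)
            # Tokens never seen during fit get a weight of 0.0.
            return {tok: tf[tok] * idf.get(tok, 0.0) for tok in tf}

        return Transformer(score)

lowercase = Transformer(str.lower)
tokenize = Transformer(str.split)
preprocess = lowercase.and_then(tokenize)

corpus = ["Spark runs jobs", "Spark schedules tasks", "R runs models"]
tfidf = IdfEstimator().fit([preprocess(doc) for doc in corpus])

# The end-to-end featurizer is itself a single Transformer.
featurize = preprocess.and_then(tfidf)
scores = featurize("Spark runs Spark")
print(scores)
```

Because the whole pipeline is expressed through one interface, a system sitting behind such an API can inspect the chain of stages as a unit, which is the kind of structure that pipeline-level optimizations rely on.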

KeystoneML: arxiv -
Source Code |
SparkML: Blog Post

**Scaling R Programs**:
R is a widely used statistical programming language, but data analysis using R is limited by the
memory available on a single machine. In DistributedR, we proposed a distributed array based
abstraction and developed techniques to efficiently share data across multiple cores and
mitigate load imbalance for sparse matrix based algorithms. Further, to enable large scale structured data
processing, we developed SparkR, an R package for Apache Spark. SparkR uses distributed data
frames as a unifying abstraction to provide support for SQL queries and machine learning
algorithms from R.

DistributedR: Eurosys 2013 -
HotCloud 2012 -
Source Code |
SparkR: SIGMOD 2016 -
Source Code

### Publications

#### 2017

Omid Alipourfard, Jianshu Chen, Hongqiang Liu, Shivaram Venkataraman, Minlan Yu, Ming Zhang CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics - To Appear NSDI 2017

#### 2016

Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, Ion Stoica Drizzle: Fast and Adaptable Stream Processing at Scale - preprint.

Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, Benjamin Recht KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics - arxiv preprint

Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, Joseph Gonzalez Hemingway: Modeling Distributed Optimization Algorithms - Learning Systems Workshop, NIPS 2016

Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica Apache Spark: A Unified Engine for Big Data Processing - CACM Contributed Article, Nov 2016

Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Ben Recht, Ion Stoica Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics - NSDI 2016

Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia SparkR: Scaling R Programs with Spark - SIGMOD 2016

Reza Zadeh, Xiangrui Meng, Alexander Ulanov, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan Sparks, Aaron Staple, Matei Zaharia Matrix Computations and Optimization in Apache Spark - KDD 2016. Best Paper runner-up, Applied Data Science Track.

Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, Ben Recht Large Scale Kernel Learning using Block Coordinate Descent - arxiv preprint

#### 2015

Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar MLlib: Machine Learning in Apache Spark - JMLR 17(34):1–7, 2016

#### 2014

Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael Franklin, Ion Stoica The Power of Choice in Data-Aware Cluster Scheduling - OSDI 2014

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Quantifying Eventual Consistency with PBS - CACM Research Highlight August 2014

#### 2013

Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, Ion Stoica The Case for Tiny Tasks in Compute Clusters - HotOS 2013

Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices - Eurosys 2013

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica PBS at Work: Advancing Data Management with Consistency Metrics - Demo at SIGMOD 2013

#### 2012

Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz Cake: Enabling High-level SLOs on Shared Storage Systems - SoCC 2012

Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz Sweet Storage SLOs with Frosting - HotCloud 2012

Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber Using R for Iterative and Incremental Processing - HotCloud 2012

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Quantifying Eventual Consistency with PBS - VLDB Journal Special Edition - Best of VLDB 2012

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Probabilistically Bounded Staleness for Practical Partial Quorums - VLDB 2012

#### Earlier Work

Storage system design for non-volatile byte-addressable memory using consistent and durable data structures - Master's Thesis, University of Illinois, Urbana-Champaign 2011

Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy Campbell Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory - FAST 2011

Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy Campbell Redesigning Data Structures for Non-Volatile Byte-Addressable Memory - Non-Volatile Memories Workshop 2011

Reza Farivar, Harshit Kharbanda, Shivaram Venkataraman, Roy Campbell An Algorithm for Fast Edit Distance Computation on GPUs - IEEE Innovative Parallel Computing (InPar) 2012

Abhishek Verma, Shivaram Venkataraman, Matthew Caesar, and Roy H. Campbell Scalable Storage for Data-intensive Computing - Handbook of Data-Intensive Computing, Springer Science, 2011.

Ellick Chan, Shivaram Venkataraman, Nadia Tkach, Kevin Larson, Alejandro Gutierrez and Roy H. Campbell Characterizing Data Structures for Volatile Forensics - Workshop on Systematic Approaches to Digital Forensic Engineering (SADFE), 2011

Ellick Chan, Shivaram Venkataraman, Francis David, Amey Chaugule, Roy Campbell Forenscope: A Framework for Live Forensics - ACSAC 2010

Abhishek Verma, Xavier Llora, Shivaram Venkataraman, David Goldberg and Roy Campbell Scaling eCGA Model Building via Data Intensive Computing - IEEE Congress on Evolutionary Computation, CEC 2010

### Selected Talks

*Low Latency Execution for Apache Spark* at
Spark Summit 2016

*Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics* at
NSDI 2016

*SparkR: Scaling R Programs with Spark* at
SIGMOD 2016, Spark Summit 2015

*The Power of Choice in Data-Aware Cluster Scheduling* at
OSDI 2014

*Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices* at
Eurosys 2013

*Probabilistically Bounded Staleness for Practical Partial Quorums*
joint talk with Peter Bailis, at VLDB 2012

*Using R for Iterative and Incremental Processing* at
HotCloud 2012

*Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory* at
FAST 2011

### Contact

Email: shivaram dot venkataraman at gmail.com or shivaram at cs.berkeley.edu

GitHub: @shivaram