I am a PhD candidate in Computer Science at UC Berkeley. I am a part of the AMPLab and I am advised by Ion Stoica and Mike Franklin. My research interests are in designing systems and algorithms for large scale data analysis. Before coming to Berkeley, I completed my Masters at University of Illinois at Urbana-Champaign and worked in the Systems Research Group, with Prof. Roy Campbell.
Efficient Performance Modeling
Configuring and deploying large scale analytics in the cloud is challenging
as it is often unclear what the appropriate configuration is for a given workload.
Ernest is a performance modeling framework that can be used to predict the optimal
configuration for a given workload.
Ernest minimizes the resources used to build a performance model by training on small
samples of data and then predicts
performance on larger datasets and cluster sizes. We also studied how this can be used
to model algorithm convergence rates in Hemingway.
Ernest: NSDI 2016 - Source Code | Hemingway: Learning Systems Workshop, NIPS 2016
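The idea of training on small samples and extrapolating to larger runs can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-feature linear model (time as a fixed cost plus a term proportional to the data processed per machine), the function names, and the made-up timings. It is not Ernest's actual model or API.

```python
# Hypothetical sketch: fit time ~ a + b * (data_fraction / machines) on a few
# cheap runs, then extrapolate to the full dataset on larger clusters.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x, in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_time(model, data_fraction, machines):
    """Predicted running time (seconds) for a given input fraction and cluster size."""
    a, b = model
    return a + b * (data_fraction / machines)


# Made-up training runs: (fraction of the input, number of machines, seconds).
training_runs = [
    (0.01, 1, 12.0),
    (0.02, 1, 14.0),
    (0.04, 2, 14.0),
    (0.08, 2, 18.0),
]

model = fit_line(
    [fraction / machines for fraction, machines, _ in training_runs],
    [seconds for _, _, seconds in training_runs],
)

# Extrapolate to the full dataset (fraction 1.0) on candidate cluster sizes.
candidates = [4, 8, 16, 32]
predictions = {m: predict_time(model, 1.0, m) for m in candidates}
```

With predictions like these, one could pick the smallest cluster whose predicted time meets a deadline. A realistic model would likely need extra terms, for example for communication costs that grow with the number of machines.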
Low Latency Scheduling
Schedulers used in analytics frameworks aim to minimize the amount of time spent in accessing data
while ensuring coordination overheads are not high. While centralized batch systems provide
optimal scheduling decisions and fault tolerance, they impose a high overhead for low latency workloads.
On the other hand, streaming systems provide low latency during normal execution but incur high
latency while recovering from faults. To address this, we built Drizzle, a scheduling framework that combines the benefits of
batch processing and streaming systems by using coarse-grained scheduling with fine-grained
Further, to improve data locality for ML algorithms, my work has also studied scheduling techniques (KMN) that
can leverage the fact that algorithms operate on a sample of the input data.
Drizzle: Technical Report - Source Code | KMN: OSDI 2014
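The observation that an algorithm only needs a sample of its input gives the scheduler freedom in choosing which blocks to read. The sketch below is a simplified illustration under stated assumptions: the function name, the block and machine identifiers, and the greedy policy are all hypothetical, and it ignores how many task slots each machine has. It is not the KMN implementation.

```python
# Hypothetical sketch: when any k of n input blocks will do, prefer the blocks
# that already have a replica on a machine with free capacity.

def choose_blocks(block_locations, free_machines, k):
    """Pick k blocks, listing locally readable ones first.

    block_locations maps a block id to the set of machines holding a replica.
    free_machines is the set of machines that can run a task right now.
    """
    local = [b for b, hosts in block_locations.items() if hosts & free_machines]
    local_set = set(local)
    remote = [b for b in block_locations if b not in local_set]
    return (local + remote)[:k]


block_locations = {
    "b0": {"m1"},
    "b1": {"m2"},
    "b2": {"m3"},
    "b3": {"m1", "m4"},
}
free_machines = {"m3", "m4"}

# Any 2 of the 4 blocks form a valid sample, so both tasks can read locally.
chosen = choose_blocks(block_locations, free_machines, k=2)
```

If the job needed all four blocks, two of its tasks would have to read remotely or wait; needing only a sample removes that constraint.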
Machine Learning Pipelines
A number of real-world machine learning applications
require the combination of multiple algorithms. For example, a text
classification program might featurize data using TF-IDF scores, then perform
dimensionality reduction using PCA and finally learn a model using logistic
regression. We proposed machine learning
pipelines as an abstraction that allows users to compose simple operators and form end-to-end pipelines.
In the KeystoneML project we further studied a number of optimizations enabled by our high
KeystoneML: ICDE 2017 - Source Code | SparkML: Blog Post
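Composing simple operators into an end-to-end pipeline can be sketched in a few lines. This is an illustration only: the `Pipeline` class, the `and_then` method, and the two toy operators are hypothetical stand-ins for the TF-IDF, PCA, and logistic regression stages mentioned above, and none of it reflects the KeystoneML or Spark ML APIs.

```python
# Hypothetical sketch of a pipeline abstraction: each stage is a function from
# one dataset to the next, and a pipeline applies its stages in order.
from collections import Counter


class Pipeline:
    """An immutable chain of single-argument stages applied left to right."""

    def __init__(self, stages=()):
        self.stages = list(stages)

    def and_then(self, stage):
        # Return a new pipeline so shared prefixes can be reused safely.
        return Pipeline(self.stages + [stage])

    def __call__(self, data):
        for stage in self.stages:
            data = stage(data)
        return data


def tokenize(docs):
    """Lowercase each document and split it on whitespace."""
    return [doc.lower().split() for doc in docs]


def term_frequency(token_lists):
    """Count how often each term occurs in each document."""
    return [Counter(tokens) for tokens in token_lists]


featurize = Pipeline().and_then(tokenize).and_then(term_frequency)
features = featurize(["Big data big clusters"])
```

Because the whole chain is described up front rather than run step by step, a system has the opportunity to inspect and optimize it before execution.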
Scaling R Programs
R is a widely used statistical programming language, but data analysis using R is limited by the
memory available on a single machine. In DistributedR, we proposed a distributed array based
abstraction and developed techniques to efficiently share data across multiple cores and
mitigate load imbalance for sparse matrix based algorithms. Further, to enable large scale structured data
processing, we developed SparkR, an R package for Apache Spark. SparkR uses distributed data
frames as a unifying abstraction to provide support for SQL queries and machine learning
algorithms from R.
DistributedR: EuroSys 2013 - HotCloud 2012 - Source Code | SparkR: SIGMOD 2016 - Source Code
Eric Jonas, Shivaram Venkataraman, Ion Stoica, Benjamin Recht Occupy the Cloud: Distributed Computing for the 99% arxiv preprint
Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson, Alex Gittens, Michael I. Jordan, Benjamin Recht Breaking Locality Accelerates Block Gauss-Seidel arxiv preprint
Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, Benjamin Recht KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics - ICDE 2017 arxiv version
Omid Alipourfard, Jianshu Chen, Hongqiang Liu, Shivaram Venkataraman, Minlan Yu, Ming Zhang CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics - NSDI 2017
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, Ion Stoica Drizzle: Fast and Adaptable Stream Processing at Scale - preprint.
Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, Joseph Gonzalez Hemingway: Modeling Distributed Optimization Algorithms - Learning Systems Workshop, NIPS 2016
Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica Apache Spark: A Unified Engine for Big Data Processing - CACM Contributed Article, Nov 2016
Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Ben Recht, Ion Stoica Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics - NSDI 2016
Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia SparkR: Scaling R Programs with Spark - SIGMOD 2016
Reza Zadeh, Xiangrui Meng, Alexander Ulanov, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan Sparks, Aaron Staple, Matei Zaharia Matrix Computations and Optimization in Apache Spark - KDD 2016. Best Paper runner-up, Applied Data Science Track.
Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, Ben Recht Large Scale Kernel Learning using Block Coordinate Descent - arxiv preprint
Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar MLlib: Machine Learning in Apache Spark - JMLR 17(34):1–7, 2016
Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael Franklin, Ion Stoica The Power of Choice in Data-Aware Cluster Scheduling - OSDI 2014
Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Quantifying eventual consistency with PBS - CACM Research Highlight August 2014
Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, Ion Stoica The Case for Tiny Tasks in Compute Clusters - HotOS 2013
Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices - EuroSys 2013
Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica PBS at Work: Advancing Data Management with Consistency Metrics. - Demo at SIGMOD 2013
Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz Cake: Enabling High-level SLOs on Shared Storage Systems - SoCC 2012
Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz Sweet Storage SLOs with Frosting - HotCloud 2012
Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber Using R for Iterative and Incremental Processing - HotCloud 2012
Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Quantifying Eventual Consistency with PBS - VLDB Journal Special Edition - Best of VLDB 2012
Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Probabilistically Bounded Staleness for Practical Partial Quorums - VLDB 2012
Storage system design for non-volatile byte-addressable memory using consistent and durable data structures - Masters Thesis, University of Illinois, Urbana-Champaign 2011
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy Campbell Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory - FAST 2011
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy Campbell Redesigning Data Structures for Non-Volatile Byte-Addressable Memory - Non-Volatile Memories Workshop 2011
Reza Farivar, Harshit Kharbanda, Shivaram Venkataraman, Roy Campbell An Algorithm for Fast Edit Distance Computation on GPUs - IEEE Innovative Parallel Computing (InPar) 2012
Abhishek Verma, Shivaram Venkataraman, Matthew Caesar, and Roy H. Campbell Scalable Storage for Data-intensive Computing - Handbook of Data-Intensive Computing, Springer Science, 2011.
Ellick Chan, Shivaram Venkataraman, Nadia Tkach, Kevin Larson, Alejandro Gutierrez and Roy H. Campbell Characterizing Data Structures for Volatile Forensics - Workshop on Systematic Approaches to Digital Forensic Engineering (SADFE), 2011
Ellick Chan, Shivaram Venkataraman, Francis David, Amey Chaugule, Roy Campbell Forenscope: A Framework for Live Forensics - ACSAC 2010
Abhishek Verma, Xavier Llora, Shivaram Venkataraman, David Goldberg and Roy Campbell Scaling eCGA Model Building via Data Intensive Computing - IEEE Congress on Evolutionary Computation, CEC 2010
Low Latency Execution for Apache Spark at Spark Summit 2016
Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics at NSDI 2016
The Power of Choice in Data-Aware Cluster Scheduling at OSDI 2014
Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices at EuroSys 2013
Probabilistically Bounded Staleness for Practical Partial Quorums joint talk with Peter Bailis, at VLDB 2012
Using R for Iterative and Incremental Processing at HotCloud 2012
Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory at FAST 2011
Email: shivaram dot venkataraman at gmail.com or shivaram at cs.berkeley.edu