Shivaram Venkataraman


I am a PhD candidate in Computer Science at UC Berkeley. I am part of the AMPLab and am advised by Ion Stoica and Mike Franklin. My research interests are in designing systems and algorithms for large-scale data analysis. Before coming to Berkeley, I completed my Masters at the University of Illinois at Urbana-Champaign, where I worked in the Systems Research Group with Prof. Roy Campbell.

I am graduating in May 2017 and am looking for academic jobs.
CV - Research Statement - Teaching Statement


Selected Projects

Efficient Performance Modeling: Configuring and deploying large scale analytics in the cloud is challenging as it is often unclear what the appropriate configuration is for a given workload. Ernest is a performance modeling framework that can be used to predict the optimal cluster configuration. Ernest minimizes the resources used to build a performance model by training on small samples of data and then predicts performance on larger datasets and cluster sizes. We also studied how this can be used to model algorithm convergence rates in Hemingway.
Ernest: NSDI 2016 - Source Code | Hemingway: Learning Systems Workshop, NIPS 2016
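
The idea of training on small samples and extrapolating to larger runs can be illustrated with a minimal sketch. This is not Ernest's actual model or code: the single feature (data per machine), the training numbers, and the names fit_line, predict_seconds and smallest_cluster_within are all assumptions made for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical training runs on small samples:
# (fraction of the input data, number of machines, measured seconds).
training_runs = [
    (0.01, 1, 12.0),
    (0.02, 1, 14.0),
    (0.04, 2, 14.0),
    (0.08, 2, 18.0),
]

# Assumed feature: amount of data each machine has to process.
xs = [scale / machines for scale, machines, _ in training_runs]
ys = [seconds for _, _, seconds in training_runs]
intercept, slope = fit_line(xs, ys)

def predict_seconds(scale, machines):
    """Extrapolate the fitted model to a larger dataset or cluster."""
    return intercept + slope * (scale / machines)

def smallest_cluster_within(deadline, scale, candidates):
    """Return the smallest candidate cluster size predicted to meet the deadline."""
    for machines in sorted(candidates):
        if predict_seconds(scale, machines) <= deadline:
            return machines
    return None

# Predict for the full dataset (scale = 1.0), which was never run during training.
best = smallest_cluster_within(30.0, 1.0, [4, 8, 16, 32])
```

A one-feature linear model keeps the sketch short; a realistic model would also need terms for costs that grow with cluster size, such as communication.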

Low Latency Scheduling: Schedulers used in analytics frameworks aim to minimize the time spent accessing data while keeping coordination overheads low. While centralized batch systems provide optimal scheduling decisions and fault tolerance, they impose a high overhead for low latency workloads. On the other hand, streaming systems provide low latency during normal execution but incur high latency while recovering from faults. To address this we built Drizzle, a scheduling framework that combines the benefits of batch processing and streaming systems by using coarse-grained scheduling with fine-grained execution. Further, to improve data locality for ML algorithms, my work has also studied scheduling techniques (KMN) that leverage the fact that these algorithms operate on a sample of the input data.
Drizzle: Technical Report - Source Code | KMN: OSDI 2014

ML Pipelines: A number of real-world machine learning applications require the combination of multiple algorithms. For example, a text classification program might featurize data using TF-IDF scores, then perform dimension reduction using PCA, and finally learn a model using logistic regression. We proposed machine learning pipelines as an abstraction that allows users to compose simple operators and form end-to-end pipelines. In the KeystoneML project we further studied a number of optimizations enabled by our high level API.
KeystoneML: arxiv - Source Code | SparkML: Blog Post
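
The pipeline abstraction, composing simple operators into an end-to-end program, can be sketched in a few lines. This is a hypothetical illustration and not the KeystoneML or SparkML API: the Pipeline class, its and_then method, and the toy operators (stand-ins for stages such as TF-IDF, PCA and logistic regression) are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pipeline:
    """An immutable chain of single-argument operators (hypothetical API)."""
    stages: tuple = ()

    def and_then(self, stage):
        # Return a new pipeline with one more operator appended.
        return Pipeline(self.stages + (stage,))

    def __call__(self, value):
        # Feed the output of each stage into the next one.
        for stage in self.stages:
            value = stage(value)
        return value

# Toy operators standing in for real featurizers and models.
def tokenize(text):
    return text.lower().split()

def term_counts(tokens):
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

def top_terms(k):
    # Sort by descending count, breaking ties alphabetically.
    return lambda counts: sorted(counts, key=lambda t: (-counts[t], t))[:k]

pipeline = (
    Pipeline()
    .and_then(tokenize)
    .and_then(term_counts)
    .and_then(top_terms(2))
)

result = pipeline("Spark makes spark pipelines simple")
```

Because the whole chain is visible as data (the stages tuple) before anything runs, a system built on this kind of abstraction has room to inspect and optimize the pipeline as a whole.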

Scaling R Programs: R is a widely used statistical programming language, but data analysis using R is limited by the memory available on a single machine. In DistributedR, we proposed a distributed array based abstraction and developed techniques to efficiently share data across multiple cores and mitigate load imbalance for sparse matrix based algorithms. Further, to enable large scale structured data processing, we developed SparkR, an R package for Apache Spark. SparkR uses distributed data frames as a unifying abstraction to provide support for SQL queries and machine learning algorithms from R.
DistributedR: Eurosys 2013 - HotCloud 2012 - Source Code | SparkR: SIGMOD 2016 - Source Code

Publications

2017

Omid Alipourfard, Jianshu Chen, Hongqiang Liu, Shivaram Venkataraman, Minlan Yu, Ming Zhang CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics - To Appear NSDI 2017

2016

Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, Ion Stoica Drizzle: Fast and Adaptable Stream Processing at Scale - preprint.

Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, Benjamin Recht KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics - arxiv preprint

Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, Joseph Gonzalez Hemingway: Modeling Distributed Optimization Algorithms - Learning Systems Workshop, NIPS 2016

Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica Apache Spark: A Unified Engine for Big Data Processing - CACM Contributed Article, Nov 2016

Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Ben Recht, Ion Stoica Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics - NSDI 2016

Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia SparkR: Scaling R Programs with Spark - SIGMOD 2016

Reza Zadeh, Xiangrui Meng, Alexander Ulanov, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan Sparks, Aaron Staple, Matei Zaharia Matrix Computations and Optimization in Apache Spark - KDD 2016. Best Paper runner-up, Applied Data Science Track.

Stephen Tu, Rebecca Roelofs, Shivaram Venkataraman, Ben Recht Large Scale Kernel Learning using Block Coordinate Descent - arxiv preprint

2015

Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar MLlib: Machine Learning in Apache Spark - JMLR 17(34):1–7, 2016

2014

Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael Franklin, Ion Stoica The Power of Choice in Data-Aware Cluster Scheduling - OSDI 2014

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Quantifying Eventual Consistency with PBS - CACM Research Highlight August 2014

2013

Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, Ion Stoica The Case for Tiny Tasks in Compute Clusters - HotOS 2013

Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices - Eurosys 2013

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica PBS at Work: Advancing Data Management with Consistency Metrics - Demo at SIGMOD 2013

2012

Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz Cake: Enabling High-level SLOs on Shared Storage Systems - SoCC 2012

Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Ion Stoica, and Randy Katz Sweet Storage SLOs with Frosting - HotCloud 2012

Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber Using R for Iterative and Incremental Processing - HotCloud 2012

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Quantifying Eventual Consistency with PBS - VLDB Journal Special Edition - Best of VLDB 2012

Peter Bailis, Shivaram Venkataraman, Michael Franklin, Joseph M. Hellerstein, and Ion Stoica Probabilistically Bounded Staleness for Practical Partial Quorums - VLDB 2012

Earlier Work

Storage system design for non-volatile byte-addressable memory using consistent and durable data structures - Masters Thesis, University of Illinois, Urbana-Champaign 2011

Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy Campbell Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory - FAST 2011

Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, Roy Campbell Redesigning Data Structures for Non-Volatile Byte-Addressable Memory - Non-Volatile Memories Workshop 2011

Reza Farivar, Harshit Kharbanda, Shivaram Venkataraman, Roy Campbell An Algorithm for Fast Edit Distance Computation on GPUs - IEEE Innovative Parallel Computing (InPar) 2012

Abhishek Verma, Shivaram Venkataraman, Matthew Caesar, and Roy H. Campbell Scalable Storage for Data-intensive Computing - Handbook of Data-Intensive Computing, Springer Science, 2011.

Ellick Chan, Shivaram Venkataraman, Nadia Tkach, Kevin Larson, Alejandro Gutierrez and Roy H. Campbell Characterizing Data Structures for Volatile Forensics - Workshop on Systematic Approaches to Digital Forensic Engineering (SADFE), 2011

Ellick Chan, Shivaram Venkataraman, Francis David, Amey Chaugule, Roy Campbell Forenscope: A Framework for Live Forensics - ACSAC 2010

Abhishek Verma, Xavier Llora, Shivaram Venkataraman, David Goldberg and Roy Campbell Scaling eCGA Model Building via Data Intensive Computing - IEEE Congress on Evolutionary Computation, CEC 2010


Selected Talks

Low Latency Execution for Apache Spark at Spark Summit 2016

Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics at NSDI 2016

SparkR: Scaling R Programs with Spark at SIGMOD 2016, Spark Summit 2015

The Power of Choice in Data-Aware Cluster Scheduling at OSDI 2014

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices at Eurosys 2013

Probabilistically Bounded Staleness for Practical Partial Quorums joint talk with Peter Bailis, at VLDB 2012

Using R for Iterative and Incremental Processing at HotCloud 2012

Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory at FAST 2011


Contact

Email: shivaram dot venkataraman at gmail.com or shivaram at cs.berkeley.edu

GitHub: @shivaram