Shivaram Venkataraman

Assistant Professor, Computer Science, University of Wisconsin-Madison

Office: 7367 CS. Email: shivaram at

Past Projects

Efficient Performance Modeling: Configuring and deploying large-scale analytics in the cloud is challenging because it is often unclear what the appropriate configuration is for a given workload. Ernest is a performance modeling framework that can be used to predict the optimal cluster configuration. Ernest minimizes the resources used to build a performance model by training on small samples of data and then predicting performance on larger datasets and cluster sizes. We also studied how this approach can be used to model algorithm convergence rates in Hemingway.
Ernest: NSDI 2016 - Source Code | Hemingway: Learning Systems Workshop, NIPS 2016
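
As a rough illustration of the idea of training on small samples and predicting at larger scale, the sketch below fits a linear model of running time to a few small runs and then extrapolates. The feature terms (a fixed cost, a per-machine share of the data, and two terms that grow with cluster size), the function names, and the synthetic timings are all assumptions made for this example and are not taken from this page. It uses plain least squares for brevity; constraining the coefficients to be non-negative would be a natural refinement.

```python
import math

def features(scale, machines):
    """Hypothetical features: fixed cost, parallel work, two cluster-size terms."""
    return [1.0, scale / machines, math.log(machines), float(machines)]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit(samples):
    """Least-squares fit via the normal equations.

    Each sample is a (scale, machines, seconds) tuple from a small training run.
    """
    rows = [features(s, m) for s, m, _ in samples]
    times = [t for _, _, t in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve_linear(xtx, xty)

def predict(theta, scale, machines):
    """Predicted running time for a configuration not seen during training."""
    return sum(w * f for w, f in zip(theta, features(scale, machines)))
```

In use, one would call `fit` on timings collected from small inputs and few machines, then call `predict` for each candidate cluster size at full input scale and pick the cheapest configuration that meets the deadline.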

Low Latency Scheduling: Schedulers used in analytics frameworks aim to minimize the amount of time spent in accessing data while ensuring coordination overheads are not high. While centralized batch systems provide optimal scheduling decisions and fault tolerance, they impose a high overhead for low latency workloads. On the other hand, streaming systems provide low latency during normal execution but incur high latency while recovering from faults. To address this we built Drizzle, a scheduling framework that combines the benefits of batch processing and streaming systems by using coarse-grained scheduling with fine-grained execution. Further, to improve data locality for ML algorithms, my work has also studied scheduling techniques (KMN) that can leverage the fact that algorithms operate on a sample of the input data.
Drizzle: SOSP 2017 - Source Code | KMN: OSDI 2014
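
To make the benefit of coarse-grained scheduling concrete, the sketch below counts how much coordination time is spent when scheduling decisions are made once per group of micro-batches rather than once per micro-batch. The function names and the per-round cost are hypothetical and chosen only for illustration; this is a back-of-the-envelope model, not Drizzle's implementation.

```python
import math

def scheduling_rounds(num_batches, group_size):
    """Number of times the central scheduler is invoked.

    With group_size == 1 every micro-batch is scheduled individually;
    larger groups amortize one scheduling decision over many micro-batches.
    """
    return math.ceil(num_batches / group_size)

def coordination_overhead_ms(num_batches, group_size, cost_per_round_ms):
    """Total coordination time, assuming a fixed (hypothetical) cost per round."""
    return scheduling_rounds(num_batches, group_size) * cost_per_round_ms
```

The trade-off in this simple model is that a larger group lowers coordination overhead, while the group boundary is also the point at which the scheduler can next react to changes such as failures.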

ML Pipelines: A number of real-world machine learning applications require the combination of multiple algorithms. For example, a text classification program might featurize data using TF-IDF scores, then perform dimension reduction using PCA, and finally learn a model using logistic regression. We proposed machine learning pipelines as an abstraction that allows users to compose simple operators and form end-to-end pipelines. In the KeystoneML project we further studied a number of optimizations enabled by our high-level API.
KeystoneML: ICDE 2017 - Source Code | SparkML: Blog Post
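
The composition idea can be sketched in a few lines. The `Operator` class and its `and_then` method below are hypothetical names invented for this example, not the KeystoneML or SparkML API, and the three stages are deliberately simple stand-ins for real featurization, dimension reduction, and model fitting steps.

```python
from collections import Counter

class Operator:
    """A pipeline stage wrapping a function from one input to one output."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, value):
        return self.fn(value)

    def and_then(self, nxt):
        """Compose two stages: run this one, then feed its output to `nxt`."""
        return Operator(lambda value: nxt(self(value)))

# Stand-in stages: split text into lowercase tokens, count terms,
# and keep the two most frequent terms as a tiny feature list.
tokenize = Operator(lambda text: text.lower().split())
term_counts = Operator(lambda tokens: Counter(tokens))
top_terms = Operator(lambda counts: [term for term, _ in counts.most_common(2)])

# The end-to-end pipeline is itself an Operator, so it can be composed further.
featurize = tokenize.and_then(term_counts).and_then(top_terms)
```

Because the whole pipeline is expressed as a chain of operators rather than hand-written glue code, a system can inspect the chain before running it, which is the kind of structure that makes pipeline-level optimizations possible.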

Scaling R Programs: R is a widely used statistical programming language, but data analysis using R is limited by the memory available on a single machine. In DistributedR, we proposed a distributed array based abstraction and developed techniques to efficiently share data across multiple cores and mitigate load imbalance for sparse matrix based algorithms. Further, to enable large-scale structured data processing, we developed SparkR, an R package for Apache Spark. SparkR uses distributed data frames as a unifying abstraction to provide support for SQL queries and machine learning algorithms from R.
DistributedR: EuroSys 2013 - HotCloud 2012 - Source Code | SparkR: SIGMOD 2016 - Source Code