Learned indexing and sampling for improving query performance in Big Data Analytics
Speaker: Kexin Rong
Abstract:
Traditional data analytics systems improve query efficiency through fine-grained indexing and row-level sampling techniques. To keep up with growing data volumes, more and more systems store and process datasets in large partitions with hundreds of thousands of rows. These analytics systems therefore need to adapt traditional techniques to work with coarse-grained data partitions as the basic unit in order to process queries efficiently. In this talk, I will discuss two related ideas that combine learning techniques with partition designs to improve query efficiency in analytics systems. First, I'll describe PS3, the first approximate query processing system that supports non-uniform sampling at the partition level. PS3 reduces the number of partitions it must access by up to 70x to achieve the same error as a uniform sample of the partitions. Next, I'll present OLO, an online learning framework that dynamically adjusts data organization based on changes in query workloads to minimize overall data access and movement. We show that dynamic reorganization outperforms a single, optimized partition scheme by up to 30% in end-to-end runtime. I will conclude by discussing outstanding issues in this area.
Bio:
Kexin Rong is a postdoctoral researcher at the VMware Research Group. Her research focuses on improving the efficiency and usability of large-scale data analysis. She received her Ph.D. in computer science from Stanford, advised by Peter Bailis and Philip Levis. She will join Georgia Tech in the fall as an assistant professor in the School of Computer Science.
—
0:00 Presentation
32:20 Discussion
Stanford MLSys Seminar hosts: Dan Fu, Karan Goel, Fiodar Kazhamiaka and Piero Molino
Executive producers: Matei Zaharia, Chris Ré
Twitter:
https://twitter.com/realDanFu
https://twitter.com/krandiash
https://twitter.com/w4nderlus7
—
Check our website for the class schedule: http://mlsys.stanford.edu
Join our mailing list to receive weekly updates: https://groups.google.com/forum/#!forum/stanford-mlsys-seminars/join
#machinelearning #ai #artificialintelligence #systems #mlsys #computerscience #stanford #vmware #georgiatech #bigdata