Over the past year, Snorkel AI and Ponder have been working together to bring lightning-fast data labeling to Snorkel Flow users. In this post, we’ll walk you through a few of the highlights from our collaboration.
Problem
Snorkel AI is the pioneer in data-centric AI—a novel approach to AI development wherein data scientists and developers are equipped with the tools and workflows to scalably improve training data quality. Snorkel AI’s enterprise AI development platform, Snorkel Flow, has workflows for programmatically labeling structured and unstructured data including raw text, PDF documents, web pages, conversational data, etc.
As users found more and more success with Snorkel Flow, they started leveraging the platform to label larger and increasingly complex data, such as 100+ page PDF documents. For these use cases, the interactive development experience became slower and more resource-intensive.
Before the collaboration with Ponder, the platform’s architecture depended on a popular parallel processing framework, Dask, and a custom in-memory cache mechanism. This posed several scalability-related challenges for the Snorkel AI team, including:
User Experience
Increasingly complex and compute-intensive data labeling workflows had slower performance, creating a less interactive experience.
System Reliability
Order-of-magnitude increases in data volumes began to cause out-of-memory errors, negatively impacting the reliability of the system.
Maintainability
As a short-term workaround, Snorkel AI had to tune and optimize configurations, making it difficult to maintain the platform’s codebase.
“We built Snorkel Flow to help data science teams accelerate AI development through programmatic labeling. Working with Ponder, not only have we been able to speed up key labeling operations by more than 2x, but we’ve been able to do it with less code. We are now able to deliver a better development experience while making the codebase easier to maintain.”
Solution
As Snorkel AI looked for a way to offer their users a reliable, scalable interactive experience, they began to consider Modin, Ponder’s core technology, as the in-memory dataframe service to power Snorkel Flow. Modin is an open-source project that serves as a fast, scalable drop-in replacement for Pandas. By changing just a single line of code, the import statement, users can speed up their Pandas workflows with Modin, whether they’re on a laptop or in a cluster.
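To make the single-line change concrete, here is a minimal sketch. It assumes Modin is installed along with one of its execution engines; if it is not, the sketch falls back to plain Pandas so it still runs. The toy data and column names are hypothetical and are not taken from Snorkel Flow.

```python
# The one-line change: import Modin's Pandas-compatible module instead of Pandas.
# Assumption: Modin and an execution engine are installed (for example via
# `pip install "modin[ray]"`). If not, fall back to plain Pandas.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical toy data standing in for labeled documents.
df = pd.DataFrame(
    {
        "doc_id": [1, 1, 2, 2, 3],
        "label": ["spam", "spam", "ham", "spam", "ham"],
    }
)

# Everything below is ordinary Pandas code and is unchanged by the swap.
label_counts = df.groupby("label")["doc_id"].count().to_dict()
print(label_counts)
```

The rest of the workflow keeps using the familiar `pd.` calls, which is why the swap can be tried on an existing codebase with little rewriting.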
Snorkel AI engineers chose Modin over alternative solutions due to its support for:
Interactive Performance
Grounded in cutting-edge research from UC Berkeley, Ponder’s technology was designed to speed up and scale interactive dataframe workloads while improving memory management.
Efficient Data Processing
Ponder’s technology helps increase resource utilization of whatever hardware it’s running on while its out-of-core capabilities allow it to spill dataframe computations to disk without facing out-of-memory errors.
Usability
Ponder’s easy-to-use Pandas API, combined with its ability to automatically handle partitioning and scheduling under the hood, makes it easy to maintain the codebase over time.
Snorkel AI developers worked with Ponder to integrate Modin into the Snorkel Flow platform to power data labeling workflows at scale.
Results
The Snorkel AI team benchmarked Modin against their existing architecture and several alternatives they were considering. Performance and maintainability were the key evaluation criteria. Here are a few highlights from the comparison:
10x Efficiency
Reduced peak memory of benchmark workloads.
2x Faster Processing
Out-of-the-box improvements in data labeling and processing speed.
25% Less Code
No more manual tuning of data pipelines.
Production-ready
Work directly with the native Pandas API from prototype to production.
With Ponder and Modin, Snorkel Flow is now able to power labeling pipelines at much greater volumes, while maintaining interactive performance on compute-intensive workloads.
Want to learn more about how Ponder can help you scale your Pandas workflows? Check out what we’re building here!
About Snorkel AI
Snorkel AI is the pioneer in data-centric AI, a novel approach to AI development wherein data scientists and developers are equipped with the tools and workflows to scalably improve training data quality, the greatest determinant of model quality and AI success. Central to this is programmatic labeling, wherein diverse sources of knowledge are encoded, intelligently combined, and applied with a software-like approach. Some of the world’s largest enterprises across financial services, insurance, healthcare, and more use Snorkel Flow, Snorkel AI’s enterprise platform, to overcome the data labeling bottleneck in their development workflows. Since its founding in 2019, the company has raised over $100 million from top investors, including BlackRock, Addition, Lightspeed, Greylock, GV, and more.
About Ponder
Ponder provides enterprise-ready tools for rapid, flexible experimentation with data at scale in Python. Ponder makes data teams more productive by enabling them to get insights faster with tools they know and love. Founded by UC Berkeley researchers, Ponder commercializes the open-source scalable Pandas tool, Modin, which has been downloaded millions of times. Leading data science teams—including Fortune 100 companies—leverage Ponder’s technology to seamlessly accelerate their data science workloads.