Snorkel AI Delivers Supercharged Data Labeling with Ponder

Alejandro Herrera

Dec 7, 2022

Over the past year, Snorkel AI and Ponder have been working together to bring lightning-fast data labeling to Snorkel Flow users. In this post, we’ll walk you through a few of the highlights from our collaboration.

Problem

Snorkel AI is the pioneer in data-centric AI, a novel approach to AI development wherein data scientists and developers are equipped with the tools and workflows to scalably improve training data quality. Snorkel AI’s enterprise AI development platform, Snorkel Flow, provides workflows for programmatically labeling structured and unstructured data such as raw text, PDF documents, web pages, and conversational data.

As users found more and more success with Snorkel Flow, they started leveraging the platform to label larger and increasingly complex data, such as 100+ page PDF documents. For these use cases, the interactive development experience became slower and more resource-intensive.

Before Ponder, the platform’s architecture depended on Dask, a popular parallel processing framework, and on a custom in-memory caching mechanism. This posed several scalability-related challenges for the Snorkel AI team, including:

User Experience
Increasingly complex and compute-intensive data labeling workflows ran more slowly, making the experience less interactive.

System Reliability
Order-of-magnitude increases in data volumes began to cause out-of-memory errors, negatively impacting the reliability of the system.

Maintainability
As a short-term workaround, Snorkel AI had to tune and optimize configurations, making it difficult to maintain the platform’s codebase.

“We built Snorkel Flow to help data science teams accelerate AI development through programmatic labeling. Working with Ponder, not only have we been able to speed up key labeling operations by more than 2x, but we’ve been able to do it with less code. We are now able to deliver a better development experience while making the codebase easier to maintain.”

Henry Ehrenberg, Co-founder of Snorkel AI

Solution

As Snorkel AI looked for a solution to reliably and scalably offer their users an interactive experience, they began to consider Modin, Ponder’s core technology, as the in-memory dataframe service to power Snorkel Flow. Modin is an open-source project that serves as a fast, scalable drop-in replacement for Pandas. By changing just a single line of code, Modin seamlessly speeds up Pandas workflows, whether you’re on a laptop or in a cluster.
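As a rough illustration of the single-line change described above, the sketch below swaps the pandas import for `modin.pandas` and leaves the rest of the code untouched. It is not Snorkel Flow’s actual code: the labeling function, the label values, and the sample texts are hypothetical, and the fallback to plain pandas is added only so the example still runs where Modin or an execution engine is not installed.

```python
# The one-line change: import Modin's pandas-compatible module instead of pandas.
# The fallback to plain pandas is an addition for portability, not part of the swap.
try:
    import modin.pandas as pd  # instead of: import pandas as pd
    pd.DataFrame({"probe": [0]})  # touch the API once so engine problems surface here
except Exception:
    import pandas as pd

ABSTAIN, SPAM = -1, 1  # hypothetical label values


def label_contains_offer(text: str) -> int:
    """Hypothetical keyword-based labeling function."""
    return SPAM if "offer" in text.lower() else ABSTAIN


def label_documents(texts: list[str]):
    """Apply the labeling function using ordinary pandas-style calls."""
    df = pd.DataFrame({"text": texts})
    df["label"] = df["text"].apply(label_contains_offer)
    return df


labeled = label_documents(["Limited OFFER inside", "Meeting moved to 3pm"])
print(labeled["label"].tolist())
```

Everything after the import uses the regular pandas API, which is why the same code can move between pandas and Modin without further edits.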

Snorkel AI engineers chose Modin over alternative solutions due to its support for:

Interactive Performance
Grounded in cutting-edge research from UC Berkeley, Ponder’s technology was designed to speed up and scale interactive dataframe workloads while improving memory management.

Efficient Data Processing
Ponder’s technology increases resource utilization on whatever hardware it runs on, and its out-of-core capabilities let it spill dataframe computations to disk rather than fail with out-of-memory errors.

Usability
Ponder’s easy-to-use Pandas API and its ability to automatically handle partitioning and scheduling under the hood make the codebase easy to maintain over time.

Snorkel AI developers worked with Ponder to integrate Modin into the Snorkel Flow platform to power data labeling workflows at scale.

Results

The Snorkel AI team benchmarked Modin against their existing architecture and several alternatives they were considering. Performance and maintainability were the key evaluation criteria. Here are a few highlights from the comparison:

10x Efficiency
Reduced peak memory of benchmark workloads.

2x Faster Processing
Out-of-the-box improvements in data labeling and processing speed.

25% Less Code
No more manual tuning of data pipelines.

Production-ready
Work directly with the native Pandas API from prototype to production.

With Ponder and Modin, Snorkel Flow is now able to power labeling pipelines at much greater volumes, while maintaining interactive performance on compute-intensive workloads.

Want to learn more about how Ponder can help you scale your Pandas workflows? Check out what we’re building here!

About Snorkel AI
Snorkel AI is the pioneer in data-centric AI, a novel approach to AI development wherein data scientists and developers are equipped with the tools and workflows to scalably improve training data quality, the greatest determinant of model quality and AI success. Central to this is programmatic labeling, wherein diverse sources of knowledge are encoded, intelligently combined, and applied with a software-like approach. Some of the world’s largest enterprises across financial services, insurance, healthcare, and more use Snorkel Flow, Snorkel AI’s enterprise platform, to overcome the data labeling bottleneck in their development workflows. Since its founding in 2019, the company has raised over $100 million from top investors, including BlackRock, Addition, Lightspeed, Greylock, GV, and more.

About Ponder
Ponder provides enterprise-ready tools for rapid, flexible experimentation with data at scale in Python. Ponder makes data teams more productive by enabling them to get insights faster with tools they know and love. Founded by UC Berkeley researchers, Ponder commercializes the open-source scalable Pandas tool, Modin, which has been downloaded millions of times. Leading data science teams, including those at Fortune 100 companies, leverage Ponder’s technology to seamlessly accelerate their data science workloads.
