We’re excited to announce the launch of Ponder and our $7M seed financing round!
We founded Ponder in mid-2021 to address the usability challenges with data science tools at scale. In particular, pandas is one of the most popular data science tools, used daily by millions to clean, transform, explore, and featurize data, but pandas breaks down and becomes unusable on even moderately large datasets. Our Enterprise-Ready Pandas lets users continue to employ pandas on datasets at scale without changing a single line of code, with seamless improvements in speed, robustness, and efficiency.
Enterprise-Ready Pandas builds on our many years of research and development in the UC Berkeley RISE lab, where two of us did our PhDs and one of us is a professor. Our efforts led to two open-source tools, Modin and Lux, both targeting the usability challenges with pandas at scale. These tools have had extensive open-source adoption, with over 2.6M downloads, 10k+ GitHub stars, and 100+ contributors. Our users include 10 of the Fortune 100 companies and span sectors ranging from technology companies like Intel, VMware, and Microsoft, to pharmaceutical companies like Bristol-Myers Squibb and GSK, to automotive companies like Ford and Tesla.
So how did we get here?
Our journey begins with pandas: the flexible, expressive, and ubiquitous library found in every data scientist’s toolbox. Every data-driven organization we’ve interacted with uses pandas extensively, from data cleaning and transformation, to data exploration and summarization, to feature engineering and machine learning. Pandas’ success can be attributed to its rich, expressive API of over 600 functions (that go well beyond SQL!), and the ease with which users can flexibly work with messy data. With a whopping 5-10M users, pandas is, in fact, largely responsible for Python’s growth as a programming language. Many believe pandas is now the most important tool for data science.
Pandas Breaks Down at Scale ⇒ Modin
But every data scientist, analyst, or engineer we’ve ever spoken to also acknowledges that pandas simply doesn’t work at scale, leading to sluggish performance and out-of-memory errors on even moderately sized datasets. We realized this problem back in 2016 when one of us was talking to a friend who works with large genomic datasets. This friend was using Apache Spark to sample their 100TB dataset so that they could use pandas to do exploratory data analysis. The inability to quickly experiment and iterate on large-scale data often led to long development cycles and delayed insights. This felt fundamentally broken to us, so we pondered the question: what if there were a scalable version of pandas that preserves the ease-of-use and flexibility of its API?
As we began talking with other data practitioners, we realized this was actually one of the most pervasive problems in data science today. Data practitioners often have to work on small fragments of their data to work around pandas’ limitations, leading to incorrect or missing insights. Or they end up having to rewrite their pandas code in a different, less familiar “big data” framework, such as Spark or SQL, to run it at scale.
To address these issues, we developed Modin — a scalable, drop-in replacement for pandas. Modin empowers practitioners to use pandas on data at scale, without requiring them to change a single line of code. Modin leverages our cutting-edge academic research on dataframes—the abstraction underlying pandas. As part of our research, we adapted techniques from databases and distributed systems to dataframes, including a new dataframe algebra drawing from relational, linear, and spreadsheet algebra. You can read more about how we did it here.
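As a rough illustration of what “drop-in” means here, the sketch below swaps only the import line and leaves the rest of the pandas code untouched. The `import modin.pandas as pd` line is Modin's documented entry point; the fallback to plain pandas, the `BACKEND` name, and the toy data are assumptions added so that the snippet also runs where Modin is not installed. Modin additionally needs an execution engine such as Ray or Dask, which this sketch does not configure.

```python
# Minimal sketch: the only line that differs from ordinary pandas code is the import.
try:
    import modin.pandas as pd  # Modin's pandas-compatible API
    BACKEND = "modin"          # hypothetical label, for illustration only
except ImportError:
    import pandas as pd        # assumption: fall back so the sketch still runs
    BACKEND = "pandas"

# Toy data standing in for a real dataset (made up for this example).
df = pd.DataFrame(
    {
        "team": ["a", "b", "a", "c"],
        "score": [1, 2, 3, 4],
    }
)

# Everything below is plain pandas syntax and is identical under either backend.
totals = df.groupby("team")["score"].sum()
high_scores = df[df["score"] >= 3]

print(f"backend: {BACKEND}")
print(totals)
print(len(high_scores), "rows with score >= 3")
```

The point of the design is that existing scripts and notebooks keep their pandas code as-is; only the import changes.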
Modin’s success has been unprecedented—over 2.5M downloads to date, with more than half of those downloads happening since we founded Ponder last year. In fact, beyond our traditional open source user base, large technology platform providers, including Intel and Microsoft, power their services with Modin. Intel has a team of engineers contributing to Modin and provides Modin as part of their oneAPI AI analytics platform for their customers. Likewise, Microsoft provides Modin as part of their Azure Machine Learning service.
Pandas is Unusable at Scale ⇒ Lux
Beyond the performance issues addressed by Modin, another challenge with pandas at scale is that data scientists often have trouble knowing what parts of a large and complex dataset to look at. Pandas provides limited feedback to users along the way, so users are often unable to determine what to do next. Studies have shown that data scientists often opt not to visually inspect their data, waiting until the end of their analysis to do so. By that point it is too late to be useful, and several insights have been missed along the way.
To address these issues, we developed Lux, a zero-effort visualization tool for pandas. Like Modin, Lux requires no changes to your pandas code. Lux automatically generates and recommends rich visualizations tailored to help users identify anomalies in their data, determine typical patterns and new directions of analysis, and verify correctness of preceding steps. To do this efficiently, Lux leverages state-of-the-art scalable data processing techniques to quickly discover potentially interesting patterns. Lux provides these visualization recommendations for free to users at every point in their analysis, without them having to write a single line of code. Lux builds on nearly a decade of academic research on visualization recommendation tools, with over 20 published papers, including these [1,2,3,4,5]. You can learn more about Lux here.
Like Modin, Lux has been a runaway success among open-source adopters. As examples, a telecommunications company used Lux to detect anomalies in mobile networks; a pharmaceutical company used Lux to make sense of drug discovery experimental data; and an insurance company used Lux to identify key predictors for machine learning model building. Lux’s downloads have quadrupled since we started Ponder, to 80K downloads. Many Lux “fans” in the open-source community have written blogs [1,2,3,4,5] and recorded YouTube videos [1,2,3,4,5] all about Lux.
Modin + Lux ⇒ Enterprise-Ready Pandas
At Ponder, we’re building on Modin and Lux to develop Enterprise-Ready Pandas: a version of pandas that is more scalable, intelligent, usable, and efficient, while preserving the rich pandas API and behavior that millions of users already depend on. Our vision with Enterprise-Ready Pandas is to empower anyone in the organization to flexibly experiment with data at scale.
The modern data science ecosystem is in dire need of Enterprise-Ready Pandas for the following reasons:
- The rapid influx of users into Python and pandas. We’re seeing an increasing number of data science and coding bootcamps teaching people how to program in Python and operate on data using pandas. These bootcamps are impacting entire industries: for example, financial analysts are increasingly migrating from Excel to Python and pandas, due to their expressiveness and flexibility. It’s no surprise that pandas users number in the millions, with the number rising rapidly.
- Supporting flexible, agile data experimentation. Pandas, unlike many other big data frameworks, is tolerant of messy data and doesn’t require a predefined structure up-front. This tolerant behavior increases agility, empowering organizations to experiment with their data faster. For certain industries, this agility and lack of friction can help provide a significant competitive advantage.
- A key ingredient of ML/AI. Pandas, with its origins in open-source data science tooling, has a large fraction of its 600+ API functions dedicated to data preparation and preprocessing, making it a perfect precursor to ML/AI. Modern data-centric organizations use pandas in conjunction with other ML libraries, such as scikit-learn, XGBoost, and PyTorch.
- Eliminating costly, time-consuming retranslation. Even in organizations with mature data infrastructure, we often find most data scientists operating on a sample of data in pandas, followed by an expensive and time-consuming retranslation into a different data framework (if at all possible) to run it at scale. With Enterprise-Ready Pandas, we aim to eliminate this retranslation entirely, allowing users to operate on small and large datasets alike using pandas.
Ponder: Looking Ahead
In our journey towards Enterprise-Ready Pandas, we’re excited to announce our $7M seed financing round, led by Lightspeed Venture Partners (read about their investment in Ponder here), with participation from Intel Capital (read about their investment in Ponder here), 8VC, and the House Fund. We’re also delighted to rely on the support and sage counsel of a number of pioneering advisors and angels, such as Amy Chang, Dev Ittycheria, Spencer Kimball, Jana Messerschmidt, Mike Olson, Christopher Re, and John Thompson. We’d like to thank all the folks who helped us at various stages along the way, with a special shout out to Farooq Adam, Joey Gonzalez, Marti Hearst, Joe Hellerstein, Anthony Joseph, Stephen Macke, Areg Melik-Adamyan, Arnab Nandi, Fernando Perez, and Vidya Setlur.
We’re thrilled to be working with an outstanding set of early customers that span the financial, AI, and healthcare sectors. We’re also continuing to expand our open-source development and adoption. In all likelihood, you, your teams, or your customers are experiencing the same problems with pandas that we described above, and we’d love to work with you! Book a time to chat with us here!
Finally, if you are excited about our mission to supercharge data science productivity, we’re hiring and would love to hear from you! Please check out our job postings here!
If you’d like to learn more about Ponder, our mission, and us, read an article from VentureBeat here!