Scalable Pandas use cases with Ponder
Ponder’s open-source tools are deployed by data science and machine learning teams across various industries to reduce friction in working with Pandas at scale and to accelerate time to insight.
Pandas is increasingly the tool of choice among financial analysts and modelers, who prefer its power and flexibility over traditional tools like Excel and SQL. Often, the complex, fast-moving nature of the market demands experimentation and tuning of financial models at scale. However, scalability remains the number one bottleneck in applying Pandas to large-scale datasets. For example, when working with time series, modelers are often limited to data over a short time horizon, instead of the entire historical record necessary to be sure of their findings.
Ponder technology is used by top US and global financial institutions to scale up analysis pipelines and eliminate hand-optimization of Pandas performance.
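As a rough illustration of the kind of time-series work described above, here is a minimal sketch in plain Pandas. The price values, the 3-day window, and the variable names are assumptions made up for this example; they do not come from any customer workload.

```python
import pandas as pd

# Hypothetical daily closing prices (illustrative values only)
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Time-based rolling window: mean of all observations in the trailing 3 days
rolling_mean = prices.rolling("3D").mean()

# Day-over-day percentage change (the first value has no predecessor, so it is NaN)
daily_return = prices.pct_change()
```

On a handful of rows this runs instantly. The scalability concern arises when the same operations are applied to the full historical record rather than a short horizon.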
Internet & Software
Pandas is heavily used by applied ML and data science teams across a variety of tech sectors. Business units often need to keep up with rapidly changing market demands, yet long engineering cycles are spent translating data science workloads to production scale.
Ponder keeps data teams agile by helping data scientists quickly debug and uncover trends that affect their core metrics and KPIs. Our technology is used across 10+ companies in both consumer and enterprise tech to bridge the gap between data science and data engineering teams when working with Pandas.
Healthcare, Biotech, and Pharma
Pandas is used heavily by computational scientists in biotech, healthcare, and pharmaceutical companies to explore data and discover insights. These analyses must often be performed at scale to ensure that the findings are accurate and robust. However, Pandas becomes far less interactive at the scale and complexity of real-world datasets, which discourages exploration.
Ponder improves interactive data science by reducing the friction and roadblocks that come with working with Pandas at a larger scale. Ponder technology has been used by scientists across healthcare and pharmaceutical companies, as well as for research in universities and national labs. Scientists have reported that Ponder’s technology leads to more hypotheses and insights, and makes for a more productive and enjoyable experience.
Genomic data analysis
AI/ML Platforms and Services
Pandas is a key component in machine learning (ML) workflows. It often serves as a precursor to training by supporting common preprocessing tasks such as data preparation, data cleaning, and feature generation, and it is also used for downstream analysis and inspection of ML results. Pandas makes it easy for ML practitioners to experiment on both data and modeling parameters in the highly iterative, ad hoc process of ML pipeline development. However, scalability is a key bottleneck when using Pandas in production-ready ML solutions.
Ponder natively supports integrations with popular ML libraries in Python and accelerates Pandas workflows critical to ML in production. Ponder’s technology serves as a scalable backend for AI/ML platforms to support agile and rapid development of ML pipelines.
Data Prep & cleaning
Feature selection & engineering
Analysis & visualization
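To make the preprocessing steps above concrete, here is a minimal sketch of data cleaning and feature generation in plain Pandas. The column names (`age`, `plan`, `spend`) and the fill strategies are hypothetical choices for illustration, not a prescribed pipeline.

```python
import pandas as pd

# Hypothetical raw records with a duplicate row and missing values
raw = pd.DataFrame({
    "age": [34, None, 51, 34],
    "plan": ["basic", "pro", None, "basic"],
    "spend": [20.0, 95.5, 40.0, 20.0],
})

# Data cleaning: drop exact duplicate rows, then fill missing values
clean = raw.drop_duplicates().assign(
    age=lambda d: d["age"].fillna(d["age"].median()),
    plan=lambda d: d["plan"].fillna("unknown"),
)

# Feature generation: one-hot encode the categorical column
features = pd.get_dummies(clean, columns=["plan"])
```

The resulting `features` frame has one row per unique record and one indicator column per plan value, which is a common input shape for ML libraries in Python.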
Did you know?
Ponder's open-source technology is used by 10% of Fortune 100 companies
Building Scalable Data Science Platforms with Modin
Modin is actively developed and used by several major AI platform companies to deliver the convenience and agility of Pandas with enterprise-grade scalability and performance.
Case Study Highlight
ML engineers at an e-commerce company develop recommendation models to deliver personalized shopping experiences. Daily site traffic leads to hundreds of millions of new user events being captured each day. To develop their models, the team must be able to analyze and process large volumes of event data.
A key step in the recommendation data pipeline is to identify a user session, which involves grouping a set of user events over a contiguous period of time. To perform this task, the data science team uses Pandas window functions because of the flexibility and ease with which they can modify aggregations over a sliding time range. While Pandas makes it easy for the team to quickly experiment and tweak preprocessing steps, it runs out of memory and crashes on datasets with as few as tens of thousands of rows. As a result, prototyping can only be performed on data across a limited time range.
By using Modin out of the box, the e-commerce data team was able to operate on 1,000x more data, up to 50M rows, without having to rewrite their Pandas-based workloads in another framework or library. Modin reduced the runtime of the compute-intensive operations in Pandas, yielding substantial performance improvements even when tested on a laptop.
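The session-identification step in this case study can be sketched roughly as follows. This is not the team's actual code: the column names (`user_id`, `ts`), the 30-minute inactivity threshold, and the gap-based approach are all assumptions chosen to illustrate the idea in plain Pandas.

```python
import pandas as pd

def assign_sessions(events: pd.DataFrame, gap: str = "30min") -> pd.DataFrame:
    """Label each event with a per-user session number based on inactivity gaps."""
    out = events.sort_values(["user_id", "ts"]).reset_index(drop=True)
    # Time since the same user's previous event (NaT for each user's first event)
    since_prev = out.groupby("user_id")["ts"].diff()
    # A new session starts at a user's first event or after a long pause
    new_session = since_prev.isna() | (since_prev > pd.Timedelta(gap))
    # Running count of session starts within each user gives the session number
    out["session_id"] = new_session.astype(int).groupby(out["user_id"]).cumsum()
    return out

# Hypothetical event log (illustrative values only)
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:10", "2024-01-01 11:30",
        "2024-01-01 09:00", "2024-01-01 09:05",
    ]),
})
sessions = assign_sessions(events)
```

With Modin, the documented way to adopt it is to change the import line to `import modin.pandas as pd` and leave the rest of the workload as is. Whether every operation in a given pipeline is accelerated depends on the Modin version and execution engine in use.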