From Prototype to Production with Modin: Six Highlights from a Data Exchange Interview

Sep 14, 2022 6 min read

Ponder cofounder and Modin creator Devin Petersohn discussed Modin and Ponder with data expert Ben Lorica on the Data Exchange Podcast. Below are six highlights from the interview, along with links to the YouTube video. They covered the purpose and current state of Modin, how frequently companies run into pandas problems in production, how Ponder aims to build out the vision of “pandas on everything,” and more.

Why does Modin exist?

Devin:

I started my PhD working with genomics and working with domain scientists, so I got a good feeling for how scientists think, and I built large-scale software on Spark — Big data stores for these genomic-type workflows. What I noticed is we had a hard time getting scientists to adopt our big-data tools because they were so different and so cumbersome compared to what they were used to.

So the Modin work around creating a drop-in replacement for pandas, and making it easy for data scientists to scale up their work from what they’re currently doing — That came from the learnings that I had in the scientific side of things…. Data scientists… prioritize their own productivity, and rightfully so because they’re under a lot of pressure in a lot of cases. By prioritizing their own productivity, a lot of times they don’t have time to really learn these new tools and learn the distributed computing in order to scale things up, because all these tools that allow you to scale, they had all these extra requirements around scalability, learning about how to parallelize your code, working with some weird parallelization issues. And so the whole motivation of Modin came from this idea of me talking to one of my data scientist friends who is like: “Pandas is an issue for me.” And I was like: “Hmm, okay. Let’s see if we can scale this up.”

Video clip available here.
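
To make the "drop-in replacement" idea concrete, here is a minimal sketch, not taken from the interview. It assumes Modin is installed with an execution engine (for example via `pip install "modin[ray]"`) and falls back to plain pandas otherwise. The column names and values are made up for illustration.

```python
# Assumption: Modin and one of its engines are installed.
# If they are not, fall back to plain pandas so the sketch still runs.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical data: read counts per sample.
df = pd.DataFrame({
    "sample": ["a", "a", "b", "b"],
    "reads": [10, 20, 30, 40],
})

# The rest of the code is ordinary pandas; only the import line changed.
totals = df.groupby("sample")["reads"].sum()
print(totals)
```

The point of the sketch is that the only Modin-specific line is the import. Everything after it is code a pandas user would already write.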

So if you’re a fairly reasonable data scientist doing only reasonable things, Modin will parallelize almost everything you do at this point?

Devin:

It’s about 90%. And I wouldn’t say only a reasonable person doing only reasonable things — There is a lot of really messy pandas code out there, and I’ve seen a lot of that. I don’t want people to be punished for writing that kind of code because pandas is used in such a prototype way, right? It’s like: I want to go and build something really quickly, and I’m not thinking about how I’m structuring my code. That’s the way pandas is often used. And in the same way that SQL doesn’t punish you for writing things in the wrong order, Modin — what we’re working on right now with query optimization underneath the hood is that Modin will not punish you for writing this type of quick iteration, prototype type of code. It’s really intended to allow you to get the best performance just out of the box.

Video clip available here.

Pandas in production:

Devin:

It may surprise you, but every company that we talk to, every company that we have met, has the same problems with pandas and still runs pandas in production, or they prototype things in pandas and then move to production. So this is a common theme and Modin is here to help, if you will.

Ben:

Why is that? Is it because a lot of that initial work is in the data science team?

Devin:

Exactly. So there’s a difference between what the data scientist knows and what the data engineer knows in a typical company, and the engineer is more knowledgeable about how to productionize things, and the kind of production environment…. The data scientist is more familiar with the domain. They’re more familiar with how the data should look, what to look out for in the data. And of course statistics, and they know basically generally how to run and build machine learning models.

And so you have this disconnect where the data scientists don’t know how to write these distributed computing pipelines and these big distributed computing programs, and the data engineers don’t know the data well enough, and don’t know how to build the models well enough to actually help on that front. All of this kind of stems from the fact that pandas can’t scale and a lot of the tools that data scientists use don’t scale. If they were able to scale, then a lot of the things that data engineers spend their time doing [would go away]: a lot of boilerplate stuff that isn’t fun, like translating pandas to Spark, or translating pandas to Hive, or whatever you’re using, some kind of SQL system. A lot of this translation work is very boring, and I think it stems from this need to have a system that can do both small and large scale, and bridge that gap. Even the big companies have these same problems with using pandas and running into bottlenecks with pandas.

Video clip available here.
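
As an illustration of the translation work described above, the sketch below writes one small aggregation twice: once as prototype pandas code and once hand-translated to SQL. It is not from the interview. SQLite (from the Python standard library) stands in for "some kind of SQL system," and the table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical order data shared by both versions.
rows = [("east", 100.0), ("east", 50.0), ("west", 75.0)]

# Version 1: the data scientist's prototype in pandas.
orders = pd.DataFrame(rows, columns=["region", "amount"])
pandas_result = orders.groupby("region")["amount"].sum().to_dict()

# Version 2: the same logic rewritten by hand for a SQL system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
conn.close()

# Both versions have to be kept in agreement manually.
print(pandas_result == sql_result)
```

Even for a one-line aggregation, the second version needs its own schema, loading step, and query, and someone has to check that the two stay in sync as the prototype changes.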

What’s on the near-term roadmap for Modin?

Devin:

In terms of the pandas API, we’re at about 91 or 92% of the full pandas API. We’re planning on wrapping up the rest of that tail end of the 8% later this year. Maybe early next year. And then we’re really focused on the query optimization and optimizing for that new workload which is the interactive use case. And continuing to build out a lot of the stuff that we’ve done on the research side and that we’ve kind of prototyped, but… research and production — there’s orders of magnitude difference in how long each one takes. Building that out and actually making it robust, and testing it with our very friendly and wonderful community. That’s kind of what’s on the near-term roadmap.

Video clip available here.

What is the mission of Ponder?

Devin:

Our goal is basically to bring a product that focuses on data scientist productivity and helps them move their workflows to production. So that whole motion I mentioned about going from prototype to production — that’s a common theme and a common pain point among every institution that does data science. It’s just universal. It’s kind of scary how problematic it is, because all of these companies that we’re talking to have this issue of moving from single-workflow notebook prototype to production. It just takes a lot of human time. Months. Many, many months in some cases….

Ben:

So then that means that Modin is going to be central.

Devin:

Modin is central, yes, to what we’re building. And the open source is a very heavy focus of the company right now. Continuing to build the open source and grow the open source and make people successful using the open source.

Video clip available here.

What is the difference between what Ponder is trying to do and what other data science platforms are trying to do?

Devin:

Modin is the real differentiator here in terms of allowing the same API to operate at different scales. And so a lot of times, we have current customers who are basically operating on a single node and moving to production, and there are kind of two different data stores that Modin is directly pushing computation to. This is the kind of “pandas on everything” idea: translating Modin into a “pandas on everything,” being able to execute Modin on top of your database, on top of your data warehouse, on top of whatever you have set up.

Video clip available here.

For more pandas, Python, and data science content, follow Ponder on Twitter or LinkedIn. And give Modin a try!