How Fast can Python Parse 1 Billion Rows of Data?

Published 2024-04-13
To try everything Brilliant has to offer—free—for a full 30 days, visit brilliant.org/DougMercer.
You’ll also get 20% off an annual premium subscription.

———————————————————————————————
Sign up for 1-on-1 coaching at dougmercer.dev/
———————————————————————————————

The 1 billion row challenge is a fun challenge exploring how quickly we can parse a large text file and compute some summary statistics. The coding community created some amazingly clever solutions.

In this video, I walk through some of the top strategies for writing highly performant code in Python. I start with the simplest possible approach, and work my way through JIT compilation, multiprocessing, and memory mapping. By the end, I have a pure Python implementation that is only one order of magnitude slower than the highly optimized Java challenge winner.
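The chunked multiprocessing idea described above can be sketched in pure Python as follows. This is not the exact code from the video (and it uses plain seeks rather than memory mapping); the file name `measurements.txt` and the `station;temperature` line format follow the challenge spec, and each worker keeps a per-station `[min, max, sum, count]` that the parent process merges at the end.

```python
import multiprocessing as mp
import os

def find_chunks(path, n):
    """Split the file into n byte ranges aligned to newline boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()                # advance to the next full line
            bounds.append(f.tell())
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def process_chunk(path, start, end):
    """Aggregate [min, max, sum, count] per station over one byte range."""
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            name, _, value = line.partition(b";")
            temp = float(value)
            if name in stats:
                s = stats[name]
                if temp < s[0]: s[0] = temp
                if temp > s[1]: s[1] = temp
                s[2] += temp
                s[3] += 1
            else:
                stats[name] = [temp, temp, temp, 1]
    return stats

if __name__ == "__main__":
    path = "measurements.txt"           # hypothetical input file
    # Tiny sample so the sketch runs end to end; the real file has 1B rows.
    with open(path, "wb") as f:
        f.write(b"Oslo;-3.4\nParis;10.2\nOslo;5.0\nParis;8.8\n")

    chunks = find_chunks(path, mp.cpu_count())
    with mp.Pool() as pool:
        parts = pool.starmap(process_chunk, [(path, s, e) for s, e in chunks])

    # Merge the per-process partial results into one table.
    merged = {}
    for part in parts:
        for name, (mn, mx, total, count) in part.items():
            if name in merged:
                m = merged[name]
                m[0] = min(m[0], mn)
                m[1] = max(m[1], mx)
                m[2] += total
                m[3] += count
            else:
                merged[name] = [mn, mx, total, count]

    for name in sorted(merged):
        mn, mx, total, count = merged[name]
        print(f"{name.decode()}={mn:.1f}/{total / count:.1f}/{mx:.1f}")
```

The newline-aligned chunking is what makes the parallelism safe: each worker starts at the beginning of a line, so no record is split across two processes.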

On top of that, I show two much simpler but equally performant solutions that use the Polars DataFrame library and DuckDB (an in-memory SQL database). In practice, you should use these, because they are incredibly fast and easy to use.

If you want to take a stab at speeding things up further, you can find the code here github.com/dougmercer-yt/1brc.

References
------------------
Main challenge - github.com/gunnarmorling/1brc
Ifnesi - github.com/ifnesi/1brc/tree/main
Booty - github.com/booty/ruby-1-billion/
Danny van Kooten C solution blog post - www.dannyvankooten.com/blog/2024/1brc/
Awesome duckdb blog post - rmoff.net/2024/01/03/1%EF%B8%8F%E2%83%A3%EF%B8%8F-…
pypy vs Cpython duel blog post - jszafran.dev/posts/how-pypy-impacts-the-performanc…

Chapters
----------------
0:00 Intro
1:09 Let's start simple
2:55 Let's make it fast
10:48 Third party libraries
13:17 But what about Java or C?
14:17 Sponsor
16:04 Outro

Music
----------
"4" by HOME, released under CC BY 3.0 DEED, home96.bandcamp.com/album/resting-state

Go buy their music!

Disclosure
-----------------
This video was sponsored by Brilliant.

#python #datascience #pypy #polars #duckdb #1brc

All Comments (21)
  • @dougmercer
    To try everything Brilliant has to offer—free—for a full 30 days, visit brilliant.org/DougMercer . You’ll also get 20% off an annual premium subscription.
  • @eddie_dane
    Are mustaches the new hoodies for programmers now?
  • @guinea_horn
    C can't be slower than Java, can it? The slowest C implementation would be to implement the entire JVM and then write bad Java code
  • I had a project last year where I had to automate a manual process using Python to extract data from an Excel file and auto-fill an XML file. After I finished the project, I reduced the process from 3 months of human work to a 20-minute code run, which made me and my boss very happy. I wish I had seen this video last year; we could have been even happier. Nevertheless, it's great to know that I can achieve such high levels of Python performance. I will ensure better time management for my future projects. Thanks.
  • @joker345172
    8:24 Amazing trick! It reminds me of computer graphics class where we had to find a way to improve the DDA Line algorithm... No one could do it. Then, the professor showed us the Bresenham algorithm. It's such a simple concept - instead of working with floats, work with integers! - but it saves soooo much time. It goes to show that sometimes the data type you're working with can have a huge effect on how fast your code is. Drawing a parallel to Machine Learning, this is also why new GPUs have FP8 and FP16 as big selling points. Training with FP32, which is still the standard for a lot of applications, is just dog slow compared to using FP16 or even FP8.
  • @smol.bird42
    your editing has so much taste, great video bro
  • I'm impressed you did not do any profiling, nor any statistical test to rule out measurement fluctuations
  • @FirroLP
    Dude, your production quality is so good it's criminal. Had to tell you
  • @otty4000
    wow this was a really great video. Its impressive to explain code/libraries differences that quickly and clearly.
  • @BosonCollider
    The actual lessons from this is: 1: use duckdb 2: otherwise, use polars 3: use pypy more, and push back against libraries that are incompatible with it
  • @50shmekles
    This is one of the most well-done, detailed and thorough yet clear, concise and to the point videos ever. Thank you for introducing me to new concepts and libraries!
  • @nullzeon
    how am I just finding out about this channel, editing, knowledge, this video was fantastic!
  • @fatcats7727
    Just wanted to say, all of your videos are incredibly clean and well edited, and although the algorithm isn’t picking it up rn, your efforts will not go unnoticed!
  • Practically speaking, I prefer the polars implementation over the duckdb because I'd rather chain function calls instead of manipulating text when doing data analysis in Python. But maybe a library like pypika would solve this?
  • @rgndn_bhat
    Nice one, Doug. My Cpython implementation finished in 64 seconds on M2 MacBook air, almost the same approach - memory mapped, multi processing and chunks
  • @thahrimdon
    This is amazing! I was in it with you for the long haul. Had me smiling and frowning the whole way! Great video!
  • @tzacks_
    in other words, getting performance out of python means rewriting the code in C or using a library written in C :)