How Fast can Python Parse 1 Billion Rows of Data?

Published 2024-04-13
To try everything Brilliant has to offer—free—for a full 30 days, visit brilliant.org/DougMercer.
You’ll also get 20% off an annual premium subscription.

———————————————————————————————
Sign up for 1-on-1 coaching at dougmercer.dev/
———————————————————————————————

The 1 billion row challenge is a fun challenge exploring how quickly we can parse a large text file and compute some summary statistics. The coding community created some amazingly clever solutions.

In this video, I walk through some of the top strategies for writing highly performant code in Python. I start with the simplest possible approach, and work my way through JIT compilation, multiprocessing, and memory mapping. By the end, I have a pure Python implementation that is only one order of magnitude slower than the highly optimized Java challenge winner.
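The chunked multiprocessing idea described above can be sketched in pure Python as follows. This is not the exact code from the video (and it uses plain seeks rather than memory mapping); the file name `measurements.txt` and the `station;temperature` line format follow the challenge spec, and each worker keeps a per-station `[min, max, sum, count]` that the parent process merges at the end.

```python
import multiprocessing as mp
import os

def find_chunks(path, n):
    """Split the file into n byte ranges aligned to newline boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()                # advance to the next full line
            bounds.append(f.tell())
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def process_chunk(path, start, end):
    """Aggregate [min, max, sum, count] per station over one byte range."""
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            name, _, value = line.partition(b";")
            temp = float(value)
            if name in stats:
                s = stats[name]
                if temp < s[0]: s[0] = temp
                if temp > s[1]: s[1] = temp
                s[2] += temp
                s[3] += 1
            else:
                stats[name] = [temp, temp, temp, 1]
    return stats

if __name__ == "__main__":
    path = "measurements.txt"           # hypothetical input file
    # Tiny sample so the sketch runs end to end; the real file has 1B rows.
    with open(path, "wb") as f:
        f.write(b"Oslo;-3.4\nParis;10.2\nOslo;5.0\nParis;8.8\n")

    chunks = find_chunks(path, mp.cpu_count())
    with mp.Pool() as pool:
        parts = pool.starmap(process_chunk, [(path, s, e) for s, e in chunks])

    # Merge the per-process partial results into one table.
    merged = {}
    for part in parts:
        for name, (mn, mx, total, count) in part.items():
            if name in merged:
                m = merged[name]
                m[0] = min(m[0], mn)
                m[1] = max(m[1], mx)
                m[2] += total
                m[3] += count
            else:
                merged[name] = [mn, mx, total, count]

    for name in sorted(merged):
        mn, mx, total, count = merged[name]
        print(f"{name.decode()}={mn:.1f}/{total / count:.1f}/{mx:.1f}")
```

The newline-aligned chunking is what makes the parallelism safe: each worker starts at the beginning of a line, so no record is split across two processes.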

On top of that, I show two much simpler but equally performant solutions that use the Polars DataFrame library and DuckDB (an in-memory SQL database). In practice, you should use these, because they are incredibly fast and easy to use.

If you want to take a stab at speeding things up further, you can find the code here github.com/dougmercer-yt/1brc.

References
------------------
Main challenge - github.com/gunnarmorling/1brc
Ifnesi - github.com/ifnesi/1brc/tree/main
Booty - github.com/booty/ruby-1-billion/
Danny van Kooten C solution blog post - www.dannyvankooten.com/blog/2024/1brc/
Awesome duckdb blog post - rmoff.net/2024/01/03/1%EF%B8%8F%E2%83%A3%EF%B8%8F-…
pypy vs Cpython duel blog post - jszafran.dev/posts/how-pypy-impacts-the-performanc…

Chapters
----------------
0:00 Intro
1:09 Let's start simple
2:55 Let's make it fast
10:48 Third party libraries
13:17 But what about Java or C?
14:17 Sponsor
16:04 Outro

Music
----------
"4" by HOME, released under CC BY 3.0 DEED, home96.bandcamp.com/album/resting-state

Go buy their music!

Disclosure
-----------------
This video was sponsored by Brilliant.

#python #datascience #pypy #polars #duckdb #1brc

All Comments (21)
  • @dougmercer
    To try everything Brilliant has to offer—free—for a full 30 days, visit brilliant.org/DougMercer . You’ll also get 20% off an annual premium subscription.
  • @eddie_dane
    Are mustaches the new hoodies for programmers now?
  • @guinea_horn
    C can't be slower than Java, can it? The slowest C implementation would be to implement the entire JVM and then write bad Java code
  • I had a project last year where I had to automate a manual process using Python to extract data from an Excel file and auto-fill an XML file. After I finished the project, I reduced the process from 3 months of human work to a 20-minute code run, which made me and my boss very happy. I wish I had seen this video last year; we could have been even happier. Nevertheless, it's great to know that I can achieve such high levels of Python performance. I will ensure better time management for my future projects. Thanks.
  • @joker345172
    8:24 Amazing trick! It reminds me of computer graphics class where we had to find a way to improve the DDA Line algorithm... No one could do it. Then, the professor showed us the Bresenham algorithm. It's such a simple concept - instead of working with floats, work with integers! - but it saves soooo much time. It goes to show that sometimes the data type you're working with can have a huge effect on how fast your code is. Drawing a parallel to Machine Learning, this is also why new GPUs have FP8 and FP16 as big selling points. Training with FP32, which is still the standard for a lot of applications, is just dog slow compared to using FP16 or even FP8.
  • @smol.bird42
    your editing has so much taste, great video bro
  • I'm impressed you did not do any profiling, nor any statistical test to rule out measurement fluctuations
  • @FirroLP
    Dude, your production quality is so good it's criminal. Had to tell you
  • @otty4000
    wow this was a really great video. Its impressive to explain code/libraries differences that quickly and clearly.
  • @BosonCollider
    The actual lessons from this is: 1: use duckdb 2: otherwise, use polars 3: use pypy more, and push back against libraries that are incompatible with it
  • @50shmekles
    This is one of the most well-done, detailed and thorough yet clear, concise and to the point videos ever. Thank you for introducing me to new concepts and libraries!
  • @nullzeon
    how am I just finding out about this channel, editing, knowledge, this video was fantastic!
  • @fatcats7727
    Just wanted to say, all of your videos are incredibly clean and well edited, and although the algorithm isn’t picking it up rn, your efforts will not go unnoticed!
  • Practically speaking, I prefer the polars implementation over the duckdb because I'd rather chain function calls instead of manipulating text when doing data analysis in Python. But maybe a library like pypika would solve this?
  • @rgndn_bhat
    Nice one, Doug. My Cpython implementation finished in 64 seconds on M2 MacBook air, almost the same approach - memory mapped, multi processing and chunks
  • @thahrimdon
    This is amazing! I was in it with you for the long haul. Had me smiling and frowning the whole way! Great video!
  • @tzacks_
    in other words, getting performance out of python means rewriting the code in C or using a library written in C :)