Build An Airflow Data Pipeline To Download Podcasts [Beginner Data Engineer Tutorial]

Published 2022-06-06
We'll build a data pipeline that can download and store podcast episodes using Apache Airflow, a powerful and widely used data engineering tool. This is a beginner tutorial, so we'll start off by installing Airflow and covering key Airflow concepts.

Along the way, we'll learn how to create our first data pipeline (DAG) in Airflow, how to write tasks using Operators and the TaskFlow API, how to interface with databases using Hooks, and how to run the pipeline efficiently.

By the end of the tutorial, you'll have a good understanding of how to use Airflow, as well as a project that you can extend and build on. Some extensions to this project include automatically transcribing the podcasts and summarizing them.

You can find the full code for the project, along with an overview, here: github.com/dataquestio/project-walkthroughs/tree/m…
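
As a rough illustration of the "storing data in a SQL database" step described above, here is a minimal sketch in plain Python. It uses the standard-library sqlite3 module rather than Airflow's Hooks, so it runs without an Airflow install. The function name store_new_episodes, the episodes table, and its columns are assumptions made for this example and are not taken from the project code.

```python
import sqlite3


def store_new_episodes(conn, episodes):
    """Insert episodes whose link is not stored yet and return the new ones."""
    # Hypothetical schema for illustration; the real project may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    # Links already in the table, so reruns do not insert duplicates.
    known = {row[0] for row in conn.execute("SELECT link FROM episodes")}
    new = []
    for ep in episodes:
        if ep["link"] not in known:
            known.add(ep["link"])
            new.append(ep)
    conn.executemany(
        "INSERT INTO episodes (link, title, published) VALUES (?, ?, ?)",
        [(ep["link"], ep["title"], ep["published"]) for ep in new],
    )
    conn.commit()
    return new


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [
        {"link": "https://example.com/ep1", "title": "Episode 1", "published": "2022-06-01"},
        {"link": "https://example.com/ep2", "title": "Episode 2", "published": "2022-06-02"},
    ]
    print(len(store_new_episodes(conn, sample)))  # prints 2
    print(len(store_new_episodes(conn, sample)))  # prints 0
```

Running it twice with the same list inserts nothing the second time, which is the behavior you would want when a scheduled pipeline re-reads the same feed.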

Chapters

00:00 - Introduction
01:44 - Installing Airflow
07:17 - Creating the first task in our data pipeline with Airflow
17:11 - Using a SQL database with Airflow
25:30 - Storing data in a SQL database with Airflow
34:36 - Downloading podcast episodes with Airflow
38:17 - Looking at our complete data pipeline and next steps


All Comments (21)
  • @manyes7577
    I think you’re the best data science lecturer so far. Keep going, and thanks for your hard work.
  • @HieuLe-tw7qm
    Thank you very much for this amazing tutorial :D
  • @demohub
    This video was a great resource. Thanks for the tutelage and your take on it.
  • Beautiful explanation and a great project to get me started! Many thanks, Vik! One thing to add from my experience: I installed Airflow on my Mac M1 and it started fine, but I couldn't run any of the tasks we performed here (not even the get_episodes task). To solve that, I set up an EC2 instance, and with some tweaks everything ran :D
  • It was very useful. Thank you. It would be really helpful if you covered Apache Hadoop, Spark, MLflow, Flink, Flume, Pig, Hive, etc. Thank you.
  • Thank you for your tutorials. Which Airflow version did you use in this tutorial? I'd like to practice with the same one.
  • @kiish8571
    This is very educational, thanks a lot. I was wondering if you will be making a video on the automatic transcriptions.
  • Hello Vikas, thanks for such a great tutorial; you made everything smooth like butter. Just one question: whenever we make a new DAG, do we have to add docker-compose-CeleryExecutor, docker-compose-LocalExecutor, and a config for that particular DAG?
  • Can you make more advanced Apache Airflow tutorials too?
  • @Funkykeyzman
    Debug tip #1: If you run into the error "conn_id isn't defined", use the Airflow browser interface to create the connection instead: select Admin --> Connections --> +.
    Debug tip #2: If your Airflow runs fail, try logging out of the Airflow UI and restarting the Airflow server by pressing Ctrl + C and then running airflow standalone.
  • If you are having trouble creating the database, where your DAG keeps running and never completes, put this line after the package imports: os.environ['NO_PROXY'] = '*'. It should work after that.
  • @vish949
    Whenever I run airflow standalone (or even airflow webserver), I get a ModuleNotFoundError for pwd. I'm running it on Windows. How do I solve this?
  • @OBGynKenobi
    So where is the dependency chain where you set the actual task flow? I would have expected something like task1 >> task2, etc., at the bottom of the DAG.
  • @user-vy9in2xs6c
    At 33:48, how did we get 'Done loading. Loaded a total of 0 rows'? We haven't used this text anywhere in our code. Is this the work of hook.insert_rows?
  • @rohitpandey9920
    I am stuck at 14:50, where you try to run the task in Airflow. You simply switched the screen from the PyCharm terminal to the git master terminal without any explanation. I am unable to connect SQLite to the PyCharm terminal, and I couldn't establish a connection with Airflow either. Please guide me through this.
  • @yousufm.n2515
    When I change the 'dags_folder' path, everything breaks in Airflow. What could be the reason?
  • My download_episodes task succeeds but I cannot see the mp3 files
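
On the last comment above: one possible cause, and it is only a guess since nothing on this page confirms it, is that a relative download folder resolves against the working directory of the Airflow process rather than the project folder. Below is a small sketch for checking where a path really points. The function find_mp3_files and the "episodes" folder name are hypothetical.

```python
import os


def find_mp3_files(download_dir):
    """Return the absolute folder a path resolves to and the mp3 files in it."""
    # A relative path is resolved against the current working directory of
    # the process, which may differ from the folder that holds the DAG file.
    folder = os.path.abspath(download_dir)
    if not os.path.isdir(folder):
        return folder, []
    files = sorted(f for f in os.listdir(folder) if f.lower().endswith(".mp3"))
    return folder, files


if __name__ == "__main__":
    # "episodes" is a hypothetical folder name used only for this example.
    folder, files = find_mp3_files("episodes")
    print("Resolved folder:", folder)
    print("mp3 files found:", files)
```

Logging the resolved folder from inside the download task, or using an absolute path for the download folder, would show whether the files were written somewhere unexpected.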