Don't Use Apache Airflow

Published 2022-01-18
Apache Airflow is touted as the answer to all your data movement and transformation problems, but is it? In this video, I explain what Airflow is, why it is not the answer for most data movement and transformation needs, and provide some better options.

Join my Patreon Community and Watch this Video without Ads!
www.patreon.com/bePatron?u=63260756

Slides
github.com/bcafferky/shared/blob/master/ApacheAirf…

Follow me on Twitter
@BryanCafferky

Follow Me on LinkedIn
www.linkedin.com/in/bryancafferky/

All Comments (21)
  • @wexwexexort
    I wasn't using it but after this video I just changed my mind. I'm gonna schedule some jobs using Airflow next sprint.
  • @Seatek_Ark
I was recently brought onto a team to convert our ETLs from Apache NiFi over to Airflow, and while your assessment is fine, I think there are a few areas where I would have structured this differently. Airflow is not an ETL tool; you're right in calling it a job scheduler, though it's technically referred to as a task scheduler. In your ETL processes there are really four things you're trying to do: (a) trigger when an event happens (an email is received, x amount of time has passed, someone put a file in your file share or S3 bucket, some notification prompts you to start); (b) extract your data from one location; (c) transform your data, which is where the bulk of your coding comes into play; (d) put your data into its appropriate database or storage; and (e) make sure a-d goes off without an issue. The reason Airflow is a great ETL tool is that it does (a) and (e) by itself really well, and it facilitates (b) and (d): hooks and sensors are built into Airflow and are fully customizable. If your project is reliant on programs like Glue, then you can do all of this in the AWS suite (or Azure or GCP), but Airflow very cleanly packages up your connection points and your custom ETL and runs that sequence of tasks beautifully. Should you default to Airflow? If your data engineers are already experts, it's fine; if not, then no. Is it the magic tool for ETL? No; watch for AWS and the other tech giants to come out with something like that in the next 5-10 years. Is it the best task scheduler? Due to its support, it's miles ahead of its competitors, so yes.
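    For readers who want to see what that (a)-(e) split looks like in code, here is a minimal, hypothetical sketch using the TaskFlow API and the Amazon provider's S3 sensor; the bucket, key, and task bodies are invented, and a recent Airflow 2.x is assumed.

    ```python
    # Hypothetical sketch of the a-e pattern: a sensor as the trigger (a),
    # plain tasks for extract/transform/load (b-d), and Airflow's scheduling,
    # retries, and alerting covering (e). Bucket, key, and paths are made up.
    from datetime import datetime

    from airflow.decorators import dag, task
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


    @dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
    def file_driven_etl():
        # (a) trigger: wait for a file to land in the bucket
        wait_for_file = S3KeySensor(
            task_id="wait_for_file",
            bucket_name="example-landing-bucket",   # hypothetical bucket
            bucket_key="incoming/orders.csv",       # hypothetical key
            poke_interval=300,
        )

        @task
        def extract() -> str:
            # (b) pull the raw data; here we just hand a path to the next task
            return "s3://example-landing-bucket/incoming/orders.csv"

        @task
        def transform(path: str) -> str:
            # (c) the bulk of your own code lives here
            return path.replace("incoming", "clean")

        @task
        def load(path: str) -> None:
            # (d) write to the target database or storage
            print(f"loading {path}")

        # (e) ordering, retries, and failure alerting are Airflow's job
        wait_for_file >> load(transform(extract()))


    file_driven_etl()
    ```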
  • @gudata1
Airflow is a scheduler, and it doesn't care what code you run. The easiest approach is to pack all your Golang/Rust/Python code into Docker containers and scale with that.
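    A rough illustration of that container-first approach, assuming the Docker provider is installed; the image name and command are placeholders.

    ```python
    # Hypothetical example: Airflow only schedules the container and passes the
    # run date; the actual Go/Rust/Python logic lives inside the image.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.docker.operators.docker import DockerOperator

    with DAG(
        dag_id="containerized_job",
        schedule="@hourly",
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        run_job = DockerOperator(
            task_id="run_job",
            image="myorg/etl-job:latest",    # hypothetical image
            command="etl --date {{ ds }}",   # templated logical date
        )
    ```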
  • @sanjaybhatikar
    Beautifully explained! I love how you dive into the code without getting lost in the weeds. Very helpful, thank you :)
  • @tomhas4442
Been using Airflow a little over a year now and I totally agree with most of your points. I appreciate it for the logging, monitoring of pipelines, and the visualizations, plus the good K8s integrations and active community. I would recommend it if most of the code you need to orchestrate is Python or dockerized. It does come with some downsides, like the lack of pipeline version management and the complex setup. There are managed versions, though, e.g. Cloud Composer.
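    The K8s integration mentioned here usually means running each task as its own pod. A hedged sketch follows; the image and namespace are placeholders, and the import path can differ between versions of the cncf.kubernetes provider.

    ```python
    # Hypothetical example of one-pod-per-task scheduling; the orchestrated code
    # only needs to exist as a container image. get_logs streams the pod's
    # stdout into Airflow's task logs.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    with DAG(
        dag_id="pod_per_task",
        schedule="@daily",
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        score_model = KubernetesPodOperator(
            task_id="score_model",
            name="score-model",
            namespace="data-jobs",              # hypothetical namespace
            image="myorg/scoring-job:latest",   # hypothetical image
            cmds=["python", "score.py"],
            get_logs=True,
        )
    ```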
  • Dear Bryan, thank you for your informative video! For me personally it is actually great news that Airflow IS NOT a full-fledged ETL tool; that is exactly what I need. I honestly don't see the mentioned limitations (no ETL functionality) as a disadvantage. ETL as a concept is also becoming outdated in the wake of new approaches such as data mesh and service mesh solutions. What is definitely a no-no is the amount of code overhead and the strong coupling. I will definitely look into the suggested tools.
  • @yevgenym9204
As someone coming from SSIS, which I literally hate for being far too much graphical interface, I have to say you did a good job of describing the problems with Airflow.
  • @mirmir1918
Very good explanation! It's good that other options (products from AWS or MS, etc.) are mentioned.
  • @igoryurchenko559
The main issue with defining a function inside another function is that it's impossible to unit test, and testing is vital for data processing. It looks like all tasks should be written and tested as standalone functions and adapted to Airflow by an additional abstraction layer.
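    One way to get the separation this comment asks for is to keep the business logic as a plain, importable function and let the DAG file act only as the adapter layer. A hypothetical sketch, with module, function, and data all invented:

    ```python
    # transforms.py -- plain Python, no Airflow imports, trivially unit-testable
    def normalize_amounts(rows):
        """Convert the 'amount' field from cents to dollars."""
        return [{**row, "amount": row["amount"] / 100} for row in rows]


    # test_transforms.py -- runs under pytest with no Airflow context at all
    def test_normalize_amounts():
        assert normalize_amounts([{"id": 1, "amount": 199}]) == [{"id": 1, "amount": 1.99}]


    # dags/testable_pipeline.py -- thin adapter that only wires the function in
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="testable_pipeline",
        schedule="@daily",
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        normalize = PythonOperator(
            task_id="normalize_amounts",
            python_callable=normalize_amounts,
            op_kwargs={"rows": [{"id": 1, "amount": 199}]},  # hypothetical input
        )
    ```

    Because the plain function has no Airflow dependency, it can be tested and reused anywhere; only the operator wiring changes if the scheduler does.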
  • @DodaGarcia
I've been using Airflow for a little over a year, and your video really confirmed that a lot of the things that have been bugging me about it are not really a me problem. I really love how powerful it is, but having used it mostly for ETL, I've often found myself overwhelmed by all the coupling and the little "gotchas" in the form of how specifically things have to be set up. It adds a lot of overhead from the get-go, and importantly, it means that no matter how well designed the business code is, whenever something breaks or needs to be changed I always have to re-learn all of the Airflow-specific code. I can see why it's a favorite for specialized data teams whose main job is maintaining data pipelines, but not for use cases like mine in which the data flow management is just a small part of the job. So there's not really anything wrong with Airflow, just that it might be overkill for users like myself. I'm going to look into some of the ETL tools you mentioned, and one thing I'm very interested in using Airflow for soon is managing 3D rendering pipelines. I think it's going to be fantastic for coordinating render jobs and their individual frames, which are often in the thousands.
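    For the render use case mentioned at the end, dynamic task mapping (Airflow 2.3+) is one plausible fit: fan out one mapped task per frame. A hypothetical sketch, with frame counts and paths invented:

    ```python
    # Hypothetical render fan-out: one mapped task instance per frame.
    from datetime import datetime

    from airflow.decorators import dag, task


    @dag(schedule=None, start_date=datetime(2022, 1, 1), catchup=False)
    def render_pipeline():
        @task
        def list_frames():
            # one entry per frame in the shot; often thousands
            return list(range(1, 2401))

        @task
        def render_frame(frame: int) -> str:
            # placeholder for the real render call (CLI, farm API, etc.)
            return f"/renders/shot_010/frame_{frame:04d}.exr"

        # expand() creates one task instance per frame at run time
        render_frame.expand(frame=list_frames())


    render_pipeline()
    ```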
  • Thank you, Bryan, for your videos. They are really useful. It would be very kind of you to make lessons about Apache NiFi, especially how to choose processors for the needed actions.
  • @Theoboeguy
Whether I end up using Airflow or not, this is a great video that clearly explains how to use the tool and your perspective on it. Thank you!
  • @evgeny_web
Hi, thank you very much for this video. The project where I work plans to replace Apache Oozie with Airflow, so I think it is pretty useful to watch videos like this one. I don't have any prior knowledge of Airflow, yet it was very easy to understand the main ideas behind this framework.
  • @MichaelCizmar
Thanks for this. It is sometimes easier to understand things in the context of when you should not use something rather than what it's for.
  • Dear Bryan, thank you very much for this video! Very valuable and straight to the point content. Congrats!
  • @enesteymir
Thanks for the clear explanations. I haven't used Airflow yet, but it is in nearly all the job posts :) Companies really do like to use it.
  • @supernova5839
That was a good introduction to Apache Airflow and its use cases. If it has such complex code and very limited use cases, then why do most companies look for Airflow skills when hiring a big data engineer?
  • @bnmeier
Although I agree with most of what was said in this video, I do have some comments that would likely change someone's mind as it pertains to using Airflow in a real-world business scenario. I agree Airflow is not an ETL/ELT tool, and I would agree that it is a scheduler. I disagree that code is not reusable; that's one of the reasons why providers and operators exist. If you want to use the same set of tasks multiple times inside the current project or across multiple projects, create a custom operator and use it where you wish. If you are running a medium to large business and the company/IT philosophy is to adopt products that have vendor support, then NiFi and Kettle are not going to be for you; there is no one to call for support when your production instance of either of those goes down. With Airflow, a business has the ability to go with Astronomer for a fully vendor-supported and highly automated solution that doesn't require the heavy lift of setup. Anyone saying they use AWS Glue and love it has either not used it or is lying to you; simply put, it's got a long way to go to catch up with most orchestrator-type tools like Azure Data Factory. If you are in a situation where your company has chosen AWS as its cloud provider and Snowflake as its cloud data warehouse, your options are limited for workflow orchestration, which is a major part of a complete data pipeline strategy. Products like Matillion are great for drag-and-drop functionality but are expensive and have a huge deficiency in deployment pipelines and CI/CD implementation. If you are living in the cloud data space and don't know Python at least at a basic level, there is a good chance you are either entry level and will need to learn it at some point, or not very effective at putting together data pipelines. One of the most powerful libraries available to someone in the data space is the Pandas Python module; it becomes a very powerful tool in Airflow or any other orchestration engine dealing with data movement. Just my two cents. Again, I don't disagree with what was said; I just think there are far more valid use cases and reasons to use Airflow than insinuated.
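    To make the reuse point concrete, a custom operator packages a repeated task once and can then be dropped into any DAG. A hypothetical sketch that also picks up the Pandas point; the class name and paths are invented, and a production version would use hooks for connections plus validation.

    ```python
    # Hypothetical reusable operator: read a CSV with pandas, write Parquet.
    import pandas as pd
    from airflow.models.baseoperator import BaseOperator


    class CsvToParquetOperator(BaseOperator):
        """Reusable task: convert a CSV file to Parquet."""

        template_fields = ("src_path", "dest_path")  # allows {{ ds }}-style templating

        def __init__(self, src_path, dest_path, **kwargs):
            super().__init__(**kwargs)
            self.src_path = src_path
            self.dest_path = dest_path

        def execute(self, context):
            df = pd.read_csv(self.src_path)
            df.to_parquet(self.dest_path, index=False)
            return self.dest_path


    # Usage in any DAG, across any number of projects:
    # CsvToParquetOperator(
    #     task_id="orders_to_parquet",
    #     src_path="s3://example-bucket/orders_{{ ds }}.csv",
    #     dest_path="s3://example-bucket/orders_{{ ds }}.parquet",
    # )
    ```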
  • @-MaCkRage-
I'm a developer on a data analytics team, and I'm now setting up Apache Airflow for my team. They will create DAGs using JupyterLab, and it will be very comfortable.
  • This is amazing. Rarely is anyone so fair in evaluating a popular tool like Airflow.