8: Design a Web Crawler | Systems Design Interview Questions With Ex-Google SWE

Published 2024-01-13
They call me spiderman the way the ladies (euphemism for my friends) crawl into my web

All Comments (21)
  • @bchandra72
    The videos were highly beneficial for my FAANG system design interview. I purchased several paid system design courses, yet your videos surpassed them all.
  • @panhejia
    Best discussion of a crawler that I have seen. You abstract the whole HLD into the frontier, content deduper, and URL deduper, and talk about each piece in detail: frontier strategies, content deduper DB/memcache selection, URL deduping and partitioning. You even touched on the failure modes at the end. Nice job!
  • These are great breakdowns. I much prefer your explanations to what Grokking has.
  • @KENTOSI
    Hey Jordan, this was excellent coverage of this interview question. I was asked this once and it didn't even occur to me to think about the robots.txt file. Nice work!
  • @LeoLeo-nx5gi
    Hi Jordan, awesome in-depth explanation, thanks!!
  • @anuj9538
    Hi @jordanhasnolife5163, your content is some of the best out there for system design. I have a few doubts regarding the architecture. Load balancer: is it just the Kafka cluster, where the partition key is a hash of the host and each Flink instance always reads from that particular partition? Won't that cause issues once that particular Flink instance dies? (See the partitioning sketch after the comments.)
  • @reedomrahman8799
    Yo, been watching your System Design 2.0 vids. They have been a great supplement to DDIA. Have you made a video on how to do estimations in system design?
  • @erictsai1592
    Hey, great videos as always! Just wondering if you will cover another popular design question some time: ads aggregation in near real-time plus batch processing?
  • @0xhhhhff
    Chaddest system designer I know
  • @meenalgoyal8933
    Thank you for the video! :D For document content dedup, I was thinking of another option: maybe have the Redis cluster partitioned by document location and still use a single leader. That way the cached checksums for document content from a given geographic location stay close to that location, so you get more and faster cache hits, and in the background the cache partitions can sync up. What are your thoughts? (See the checksum-dedup sketch after the comments.)
  • @user-bq1iu1sl7h
    Hey Jordan, great and in-depth content. Just one quick suggestion: can you please apply a framework to every problem? That is, solve problems using the same framework so that we can see a pattern, something like functional req -> non-functional req -> capacity estimation -> data model -> high level -> deep dive -> tradeoffs/future, etc. Also, if you can explain using images and design flows more, it sticks better. I know this is extra work, but it would really help us. Your content is really useful but a little verbose. Anyhow, it's really helpful and free, so no complaints :)
  • @IiAzureI
    Is Kafka your load balancer? I don't see how Kafka is doing anything if it's completely local to each node. Doesn't the LB have to do the partitioning of the URLs? (Yeah, I don't understand Kafka; reading more now.)
  • @Hollowpoint321
    Hi Jordan - just wondering how you would modify this system to ensure pages are re-crawled with approximate accuracy? E.g. in your intro you said you'd be aiming to complete a crawl within 1 week; if your crawler is running persistently, how would you enqueue sites to be re-crawled a week later, given that there is no message-visibility equivalent in Kafka, versus a traditional message broker? (See the re-crawl scheduling sketch after the comments.)
  • @ahmedkhaled7960
    If the content of the web pages we are crawling is text, I think we could just use a relational database, put a UNIQUE index on the content column, and call it a day.
  • @reddy5095
    Shouldn't we use Cassandra to store the 1 MB files? HDFS is usually for large files, ~128 MB or more.
  • @bogdax
    If we partition data by host, how do we deal with hot partitions? Some nodes could run forever, while others may sit idle.
  • On your final design, can you explain the path for writes to S3? The diagram shows writes going from Flink -> LB -> S3, but shouldn't Flink be able to write to S3 directly? Why would it go through the LB?
  • @time_experiment
    Why can't we just cache the domain name mappings? Earlier in the problem we used the assumption that we would index about 1B webpages. If we assume all of those are uniquely hosted, then we can also map all of the IPs to domain names with about 8 GB of memory. (See the back-of-envelope sketch after the comments.)
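
A few comments above (@anuj9538, @IiAzureI, @bogdax) ask how load balancing actually happens here. In this style of design the "load balancer" role is effectively played by Kafka's key-based partitioning: each discovered URL is keyed by a hash of its host, so all URLs for one host land on the same partition and are consumed by whichever Flink instance currently owns that partition; if that instance dies, the consumer group rebalances and a surviving instance resumes from the last committed offset. Below is a minimal, illustrative Python sketch of that idea; the partition count, instance names, and hashing choice are assumptions, not the video's exact configuration.

```python
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 64  # assumed partition count for the frontier topic


def partition_for(url: str) -> int:
    """Key-based partitioning: hash the host so every URL from the same
    host lands on the same partition (and thus the same consumer), which
    keeps per-host politeness and URL dedup local to one worker."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def assign_partitions(live_consumers: list[str]) -> dict[str, list[int]]:
    """Toy view of a consumer-group rebalance: partitions owned by a dead
    instance are handed to the survivors, which resume from committed offsets."""
    assignment = {c: [] for c in live_consumers}
    for p in range(NUM_PARTITIONS):
        assignment[live_consumers[p % len(live_consumers)]].append(p)
    return assignment


if __name__ == "__main__":
    # Same host -> same partition, regardless of path.
    print(partition_for("https://example.com/a"), partition_for("https://example.com/b"))
    before = assign_partitions(["flink-0", "flink-1", "flink-2"])
    after = assign_partitions(["flink-0", "flink-2"])  # flink-1 died
    print(len(before["flink-1"]), "partitions move to the survivors")
```

On the hot-partition concern: one common mitigation is key salting, i.e. splitting a single very hot host across a few sub-keys, at the cost of having to coordinate per-host politeness across those shards.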
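On the content-dedup suggestions (@meenalgoyal8933, @ahmedkhaled7960): rather than a UNIQUE index over raw page text, the usual trick is to dedupe on a fixed-size checksum of the body, which keeps the dedup store small whether it is Redis or a relational table. Here is a hedged sketch using redis-py's atomic SET ... NX; the host, port, and key naming are assumptions.

```python
import hashlib

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # assumed single node for the sketch


def is_new_content(body: bytes) -> bool:
    """Return True the first time a given page body is seen.
    SET key value NX is atomic, so concurrent crawlers racing on the same
    checksum agree on exactly one winner."""
    checksum = hashlib.sha256(body).hexdigest()
    return r.set(f"content:{checksum}", 1, nx=True) is not None


if __name__ == "__main__":
    page = b"<html>hello</html>"
    print(is_new_content(page))  # True: first sighting
    print(is_new_content(page))  # False: duplicate content
```

Note that if geo-local caches only sync up asynchronously, two regions can each believe they saw a page first, so dedup becomes approximate; a single partitioned but globally authoritative cluster avoids that at the cost of some cross-region latency.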
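On @Hollowpoint321's re-crawl question: Kafka has no delayed-delivery or visibility-timeout feature, so one common workaround is to track a next-crawl time per URL and have a scheduler publish due URLs back onto the frontier topic. The sketch below is an in-memory stand-in for that scheduler; the one-week interval comes from the video's stated goal, while the class and method names are made up for illustration.

```python
import heapq
import time

RECRAWL_INTERVAL = 7 * 24 * 3600  # one week, matching the stated crawl target


class RecrawlScheduler:
    """Min-heap of (next_crawl_at, url). A real system would keep this in a
    persistent store indexed on next_crawl_at and have a cron-like job
    publish due URLs back onto the frontier topic."""

    def __init__(self):
        self._heap = []

    def mark_crawled(self, url: str, crawled_at: float | None = None) -> None:
        crawled_at = time.time() if crawled_at is None else crawled_at
        heapq.heappush(self._heap, (crawled_at + RECRAWL_INTERVAL, url))

    def due_urls(self, now: float | None = None) -> list[str]:
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, url = heapq.heappop(self._heap)
            due.append(url)
        return due  # publish these back to the frontier


if __name__ == "__main__":
    s = RecrawlScheduler()
    s.mark_crawled("https://example.com", crawled_at=0)
    print(s.due_urls(now=RECRAWL_INTERVAL + 1))  # ['https://example.com']
```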
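On @time_experiment's DNS-caching estimate: caching resolutions is reasonable, and the arithmetic mostly depends on what you key the cache by. A back-of-envelope sketch under assumed sizes (the average hostname length and the one-host-per-page worst case are assumptions):

```python
# Rough memory estimates for a DNS cache, under assumed sizes.
NUM_HOSTS = 1_000_000_000      # worst case: every one of the 1B pages on its own host
IPV4_BYTES = 4
HOST_HASH_BYTES = 8            # if we key by a 64-bit hash of the hostname
AVG_HOSTNAME_BYTES = 30        # assumed average if we store raw hostnames

hashed_key_cache = NUM_HOSTS * (HOST_HASH_BYTES + IPV4_BYTES)
raw_key_cache = NUM_HOSTS * (AVG_HOSTNAME_BYTES + IPV4_BYTES)

print(f"hashed keys:   ~{hashed_key_cache / 1e9:.0f} GB")  # ~12 GB
print(f"raw hostnames: ~{raw_key_cache / 1e9:.0f} GB")     # ~34 GB
```

Either way the cache fits on a handful of machines, and since the number of distinct hosts is usually far smaller than the number of pages, keeping resolved IPs alongside other per-host state is cheap.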