8: Design a Web Crawler | Systems Design Interview Questions With Ex-Google SWE

Published 2024-01-13
They call me spiderman the way the ladies (euphemism for my friends) crawl into my web

All Comments (21)
  • @bchandra72
    The videos were highly beneficial for my FAANG system design interview. I purchased several paid system design courses, yet your videos surpassed them all.
  • @panhejia
    Best discussion of a crawler that I have seen. You abstract the whole HLD into the frontier, content deduper, and URL deduper, and talk about each piece in detail: frontier strategies, content deduper DB/memcache selection, URL deduping and partitioning. You even touched on the failure modes at the end. Nice job!
  • These are great breakdowns. I much prefer your explanations to what Grokking has.
  • @KENTOSI
    Hey Jordan, this was excellent coverage of this interview question. I was asked this once and it didn't even occur to me to think about the robots.txt file. Nice work!
  • @LeoLeo-nx5gi
    Hi Jordan, awesome in-depth explanation, thanks!!
  • @anuj9538
    Hi @jordanhasnolife5163, your content is some of the best out there for system design. I have a few doubts regarding the architecture. Load balancer: is it just the Kafka cluster, where the partition key is a hash of the host and each Flink instance always reads from that particular partition? Won't that cause issues once that particular Flink instance dies? (See the partitioning sketch after the comments.)
  • @reedomrahman8799
    Yo, been watching your System Design 2.0 vids. They have been a great supplement to DDIA. Have you made a video on how to do estimations in system design?
  • @erictsai1592
    Hey, great videos as always! Just wondering if you will cover another popular design question some time: ads aggregation in near real-time plus batch processing?
  • @0xhhhhff
    Chaddest system designer I know
  • @meenalgoyal8933
    Thank you for the video! :D For document content dedup, I was thinking of another option: maybe have the Redis cluster partitioned by document location and still use a single leader. That way the cached checksums for document content from a given geographic location stay close to that location, so you get more and faster cache hits, and in the background the cache partitions can sync up. What are your thoughts? (See the checksum-dedup sketch after the comments.)
  • @user-bq1iu1sl7h
    Hey Jordan, great and in-depth content. Just one quick suggestion: can you please apply a framework to every problem? That is, solve problems using the same framework so that we can see a pattern, something like functional req -> non-functional req -> capacity estimation -> data model -> high level -> deep dive -> tradeoffs/future, etc. Also, if you can explain using images and design flows more, it sticks better. I know this is extra work, but it would really help us. Your content is really useful but a little verbose. Anyhow, it's really helpful and free, so no complaints :)
  • @IiAzureI
    Is Kafka your load balancer? I don't see how Kafka is doing anything if it's completely local to each node. Doesn't the LB have to do the partitioning of the URLs? (Yeah, I don't understand Kafka; reading more now.)
  • @Hollowpoint321
    Hi Jordan - just wondering how you would modify this system to ensure pages are re-crawled with approximate accuracy? E.g. in your intro you said you'd be aiming to complete a crawl within 1 week; if your crawler is running persistently, how would you enqueue sites to be re-crawled a week later, given that there is no message-visibility equivalent in Kafka, versus a traditional message broker? (See the re-crawl scheduling sketch after the comments.)
  • @ahmedkhaled7960
    If the content of the web pages we are crawling is text, I think we could just use a relational database, put a UNIQUE index on the content column, and call it a day.
  • @reddy5095
    Shouldn't we use Cassandra to store the 1 MB files? HDFS is usually for large files, ~128 MB or more.
  • @bogdax
    If we partition data by host, how do we deal with hot partitions? Some nodes could run forever, while others may sit idle.
  • On your final design, can you explain the path for writes to S3? The diagram shows writes going from Flink -> LB -> S3, but shouldn't Flink be able to write to S3 directly? Why would it go through the LB?
  • @time_experiment
    Why can't we just cache the domain name mappings? Earlier in the problem we used the assumption that we would index about 1B webpages. If we assume all of those are uniquely hosted, then we can also map all of the IPs to domain names with about 8 GB of memory. (See the back-of-envelope sketch after the comments.)
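
A few comments above (@anuj9538, @IiAzureI, @bogdax) ask how load balancing actually happens here. In this style of design the "load balancer" role is effectively played by Kafka's key-based partitioning: each discovered URL is keyed by a hash of its host, so all URLs for one host land on the same partition and are consumed by whichever Flink instance currently owns that partition; if that instance dies, the consumer group rebalances and a surviving instance resumes from the last committed offset. Below is a minimal, illustrative Python sketch of that idea; the partition count, instance names, and hashing choice are assumptions, not the video's exact configuration.

```python
import hashlib
from urllib.parse import urlparse

NUM_PARTITIONS = 64  # assumed partition count for the frontier topic


def partition_for(url: str) -> int:
    """Key-based partitioning: hash the host so every URL from the same
    host lands on the same partition (and thus the same consumer), which
    keeps per-host politeness and URL dedup local to one worker."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def assign_partitions(live_consumers: list[str]) -> dict[str, list[int]]:
    """Toy view of a consumer-group rebalance: partitions owned by a dead
    instance are handed to the survivors, which resume from committed offsets."""
    assignment = {c: [] for c in live_consumers}
    for p in range(NUM_PARTITIONS):
        assignment[live_consumers[p % len(live_consumers)]].append(p)
    return assignment


if __name__ == "__main__":
    # Same host -> same partition, regardless of path.
    print(partition_for("https://example.com/a"), partition_for("https://example.com/b"))
    before = assign_partitions(["flink-0", "flink-1", "flink-2"])
    after = assign_partitions(["flink-0", "flink-2"])  # flink-1 died
    print(len(before["flink-1"]), "partitions move to the survivors")
```

On the hot-partition concern: one common mitigation is key salting, i.e. splitting a single very hot host across a few sub-keys, at the cost of having to coordinate per-host politeness across those shards.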
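On the content-dedup suggestions (@meenalgoyal8933, @ahmedkhaled7960): rather than a UNIQUE index over raw page text, the usual trick is to dedupe on a fixed-size checksum of the body, which keeps the dedup store small whether it is Redis or a relational table. Here is a hedged sketch using redis-py's atomic SET ... NX; the host, port, and key naming are assumptions.

```python
import hashlib

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)  # assumed single node for the sketch


def is_new_content(body: bytes) -> bool:
    """Return True the first time a given page body is seen.
    SET key value NX is atomic, so concurrent crawlers racing on the same
    checksum agree on exactly one winner."""
    checksum = hashlib.sha256(body).hexdigest()
    return r.set(f"content:{checksum}", 1, nx=True) is not None


if __name__ == "__main__":
    page = b"<html>hello</html>"
    print(is_new_content(page))  # True: first sighting
    print(is_new_content(page))  # False: duplicate content
```

Note that if geo-local caches only sync up asynchronously, two regions can each believe they saw a page first, so dedup becomes approximate; a single partitioned but globally authoritative cluster avoids that at the cost of some cross-region latency.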
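On @Hollowpoint321's re-crawl question: Kafka has no delayed-delivery or visibility-timeout feature, so one common workaround is to track a next-crawl time per URL and have a scheduler publish due URLs back onto the frontier topic. The sketch below is an in-memory stand-in for that scheduler; the one-week interval comes from the video's stated goal, while the class and method names are made up for illustration.

```python
import heapq
import time

RECRAWL_INTERVAL = 7 * 24 * 3600  # one week, matching the stated crawl target


class RecrawlScheduler:
    """Min-heap of (next_crawl_at, url). A real system would keep this in a
    persistent store indexed on next_crawl_at and have a cron-like job
    publish due URLs back onto the frontier topic."""

    def __init__(self):
        self._heap = []

    def mark_crawled(self, url: str, crawled_at: float | None = None) -> None:
        crawled_at = time.time() if crawled_at is None else crawled_at
        heapq.heappush(self._heap, (crawled_at + RECRAWL_INTERVAL, url))

    def due_urls(self, now: float | None = None) -> list[str]:
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, url = heapq.heappop(self._heap)
            due.append(url)
        return due  # publish these back to the frontier


if __name__ == "__main__":
    s = RecrawlScheduler()
    s.mark_crawled("https://example.com", crawled_at=0)
    print(s.due_urls(now=RECRAWL_INTERVAL + 1))  # ['https://example.com']
```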
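On @time_experiment's DNS-caching estimate: caching resolutions is reasonable, and the arithmetic mostly depends on what you key the cache by. A back-of-envelope sketch under assumed sizes (the average hostname length and the one-host-per-page worst case are assumptions):

```python
# Rough memory estimates for a DNS cache, under assumed sizes.
NUM_HOSTS = 1_000_000_000      # worst case: every one of the 1B pages on its own host
IPV4_BYTES = 4
HOST_HASH_BYTES = 8            # if we key by a 64-bit hash of the hostname
AVG_HOSTNAME_BYTES = 30        # assumed average if we store raw hostnames

hashed_key_cache = NUM_HOSTS * (HOST_HASH_BYTES + IPV4_BYTES)
raw_key_cache = NUM_HOSTS * (AVG_HOSTNAME_BYTES + IPV4_BYTES)

print(f"hashed keys:   ~{hashed_key_cache / 1e9:.0f} GB")  # ~12 GB
print(f"raw hostnames: ~{raw_key_cache / 1e9:.0f} GB")     # ~34 GB
```

Either way the cache fits on a handful of machines, and since the number of distinct hosts is usually far smaller than the number of pages, keeping resolved IPs alongside other per-host state is cheap.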