Yelp Reviews Web Scraper using Python

I created a web scraper that collects data on the reviews written by users on Yelp, including the rating, the review content, and the review date. The data also includes restaurant metadata such as addresses, phone numbers, websites, and store hours. It also collects the gallery of popular dishes, along with the photos and the number of reviews associated with each dish.

The code is available here: https://github.com/mdane117/yelp/tree/main

I created two scripts. The first collects the restaurant metadata and takes a list of inputs that I generated. The idea is to target specific restaurants or groups of restaurants instead of collecting every restaurant on the site, which would take too much time and processing power and would likely lead to a ban. The second script collects the individual reviews left by customers, using the output of the first script as its input.
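The hand-off between the two scripts can be sketched roughly as follows. This is only an illustration, not the code from the repository: it assumes the first script writes a CSV file, and the function name load_review_targets, the url column, and the example.com addresses are all hypothetical placeholders. No network requests are made.

```python
import csv
import io

# Hypothetical sample of the first script's output.
# The real file format and column names may differ.
METADATA_CSV = """restuarant_id,name,url
r1,Example Diner,https://example.com/biz/example-diner
r2,Sample Cafe,https://example.com/biz/sample-cafe
r1,Example Diner,https://example.com/biz/example-diner
"""

def load_review_targets(metadata_file):
    """Return one (restuarant_id, url) pair per restaurant, skipping duplicates."""
    seen = set()
    targets = []
    for row in csv.DictReader(metadata_file):
        restaurant_id = row["restuarant_id"]
        if restaurant_id in seen:
            continue  # already queued this restaurant
        seen.add(restaurant_id)
        targets.append((restaurant_id, row["url"]))
    return targets

targets = load_review_targets(io.StringIO(METADATA_CSV))
for restaurant_id, url in targets:
    # The review script would fetch and parse the reviews for this URL here.
    print(restaurant_id, url)
```

Keeping the target list small and free of duplicates fits the goal described above: fewer requests means less processing time and a lower chance of being banned.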

Note that each script produces one output file, and the two files can be joined on restuarant_id. I used two output files to reduce redundancy in the data: for example, I do not want to repeat the restaurant metadata for every review. This also improves data integrity and makes the field naming convention easier to follow.
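As a minimal sketch of that join, assuming both outputs are CSV files that share a restuarant_id column: the sample rows, the other column names, and the helper join_reviews_to_restaurants are made up for illustration and are not taken from the repository.

```python
import csv
import io

# Hypothetical sample outputs; real columns and formats may differ.
RESTAURANTS_CSV = """restuarant_id,name,phone
r1,Example Diner,555-0100
r2,Sample Cafe,555-0101
"""

REVIEWS_CSV = """restuarant_id,rating,review_date,review_text
r1,5,2023-01-15,Great pancakes
r1,3,2023-02-02,Slow service
r2,4,2023-03-10,Good coffee
"""

def join_reviews_to_restaurants(restaurants_file, reviews_file):
    """Attach restaurant metadata to each review row via restuarant_id."""
    # Index the metadata once so each review lookup is a dictionary access.
    restaurants = {
        row["restuarant_id"]: row for row in csv.DictReader(restaurants_file)
    }
    joined = []
    for review in csv.DictReader(reviews_file):
        # Reviews without matching metadata keep only their own fields.
        metadata = restaurants.get(review["restuarant_id"], {})
        joined.append({**metadata, **review})
    return joined

joined = join_reviews_to_restaurants(
    io.StringIO(RESTAURANTS_CSV), io.StringIO(REVIEWS_CSV)
)
for row in joined:
    print(row.get("name"), row["rating"], row["review_date"])
```

The metadata for each restaurant is stored once and only combined with the reviews at analysis time, which is the redundancy saving described above.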