This map plots the locations of the restaurants used in this analysis. I analyzed 3000 restaurants in the Metro Vancouver area, across 132 categories.
This graph plots the distribution of restaurants on a 0.5 scale. The x-axis is the restaurant rating, with 5 being the highest and 1 being the lowest. The y-axis represents the numbers of restaurants in that rating. This graph suggests that the ratings are normally distributed, with a rating of 4 being the most common.
The violin plot shows the distribution of average income at specfiic ratings. It looks like the median point of average income trends lower (around $40k) for low ratings (2.5, 3.0, and 3.5), while it trends relatively higher (around $60-80k) for high ratings (4.0 and 4.5).
The data on restaurants was divided by socioeconomic status, based on average income. Average income is defined as total income divided by total respondents in the corresponding federal electoral district. This breakdown was conducted relative to Vancouver income standards, with the cut offs listed below.
There are 42 electoral districts in BC. The restaurants in our dataset are concentrated in 11. In the image above, the majority of restaurants are located downtown (Vancouver Centre). Other electoral districts include Vancouver East, Vancouver Granville, and Vancouver Kingsway.
This graph plots the distribution of restaurants by review, grouped by socioeconomic level. The x-axis shows the average price of the restaurant, with 1 indicating a cheap restaurant ($) and 4 indicating an expensive restaurant ($ $ $ $).
This shows that the majority of restaurants with a 1 rating are from a low socioeconomic neighbourhood.
The majority of restaurants are medium priced (2).
Expensive price levels (3 and 4) are dominated by restaurants in high socioeconomic neighbourhoods.
To investigate restaurants in various socioeconomic neighbourhoods, I trained a regression model using several features. To do this, I looked at the correlation between variables to see which ones had a strong correlation with rating.
None of them have a strong correlation with rating individually. This makes sense because it is inherently the combination of variables the dictates a restaurant’s quality (a low price does not necessarily suggest a low rating).
Rating (integer): the restaurant’s average rating, as determined by Yelp users. Possible scores range from from 1 to 5
Review Count (integer): the restaurant’s total reviews given by Yelp users
Price (integer): the restaurant’s average price, as determined by Yelp users. Possible scored range from $ to \(\)
Distance (integer): the restaurant’s distance in meters from the city centre, around Olympic Village. Note that this was not used as a training variable as there is a high variation in distances between low and high socioeconomic areas.
My Support Vector Machine model predicts for two classes, 0 and 1.
This was done arbitrarily, but generally most restaurants are deemed “good” if they have a rating of 4 or higher.
To test for a bias, I trained two regression models:
For prediction, I put the restaurants in the opposing socioeconomic neighbourhood
Next, I built an SVM graph for each model. By looking at the results, I can evaluate how a restaurant would perform in a different socioeconoic environment.
This model is trained on features from restaurants in a low socioeconomic neighbourhood. When testing it on high socioeconomic restaurants, it classifies half of them as “bad” restaurants. Generally, the model defines this as low review count and high price.
n=613 | Predicted: 0 | Predicted: 1 |
---|---|---|
Actual: 0 | 143 | 51 |
Actual: 1 | 274 | 145 |
Approximately 68% were rated “bad” and 32% were rated “good”.
This model is trained on features from restaurants in a high socioeconomic neighbourhood. When testing it on low socioeconomic restaurants, it classifies every single one of the restaurants as “good”.
n=343 | Predicted: 0 | Predicted: 1 |
---|---|---|
Actual: 0 | 0 | 176 |
Actual: 1 | 0 | 167 |
Ratings were “good” 100% of the time.
To further investigate the bias, I ran the same models on all Starbucks in Metro Vancouver to hold the restaurant type constant. There are 271 Starbucks in the dataset. Starbucks was chosen because of the high sample size. For this model, I categorized a “good” Starbucks as one with a rating >= 3.5, as Starbucks restaurants have an average of 3.0 or 3.5 in rating.
Low Socioeconomic model
n=109 | Predicted: 0 | Predicted: 1 |
---|---|---|
Actual: 0 | 0 | 20 |
Actual: 1 | 0 | 89 |
High Socioeconomic model
n=63 | Predicted: 0 | Predicted: 1 |
---|---|---|
Actual: 0 | 0 | 12 |
Actual: 1 | 0 | 51 |
Both models predicted a “good” rating 100% of the time. In the high socioeconomic model, you can visually observe that several ratings are being overestimated (upper left hand corner, those with low reviews and high price).
From the regression model training, the high socioeconomic model tends to underestimate the actual data, while the low socioeconomic model tends to overestimate the actual data. This happens 45% of the time for the former, and 51% of the time for the latter. Although this analysis is inconclusive, it implies that there is a bias in restaurant rating based on socioeconomic status in Vancouver.
This analysis has some limitations.
Electoral districts are large, and many not account for region-specific nuances in socioeconomic status. Although this analysis can be improved using census tracts, there is no available data matching postal code Forward Sortation Area (FSA) boundaries to census areas. Unfortunately, there is limited regional specific data available on StatsCanada.
Yelp’s API has limited useful parameters in its endpoint, so only two features were considered in the regression. In the future, I would want to consider additional metrics, such as Health Inspection records or years of operation. Other external datapoints I can consider are crime rates or average house rental price.
There are many qualitative factors that can also influence a restaurant’s rating, such as food freshness, location, customer service, and ambience. These factors have not been considered in the analysis