Chocolate Bar Ratings
1 Introduction
This project performs exploratory data analysis (EDA) on the Chocolate Rating Dataset to answer the following questions:
- Which bean type has the largest market share?
- What is the distribution of chocolate bar manufacturers across the world?
- What is the relationship between bean type and chocolate ratings?
- What is the relationship between cocoa percentage and chocolate ratings?
The source data is accessible via the links below.
#clickable link
df <- data.frame(
  Name = c("Kaggle", "Github"),
  Link = c('<a href ="https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings/data" target ="_blank">Click</a>',
           '<a href ="https://github.com/jianyuan941/Chocolate-Bar-Ratings" target= "_blank">Click</a>')
)
kable(df, escape = F)
2 Exploratory Data Analysis
2.0.1 Data Cleansing
Issues identified in the dataset:
- Column names are inconsistent, containing a mix of capital letters and dots (.).
- The skim() function incorrectly reports the dataset as free from missing values (NA).
- The companylocation column contains misspelled entries.
- Bias at the data collection stage: some companies are evaluated multiple times in a year.
- Rating collection for each manufacturer is not consistent across the years.
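The first issue can be handled with a short base R sketch. The raw header strings below are illustrative assumptions in the style described (capitals and dots); the cleaned names match the identifiers used in the rest of this report.

```r
# Hypothetical raw headers, mixing capital letters and dots.
raw_names <- c("Company...Maker.if.known.", "Review.Date", "Cocoa.Percent",
               "Company.Location", "Rating", "Bean.Type")

# Drop every non-letter character, then lower-case what remains.
clean_names <- tolower(gsub("[^A-Za-z]", "", raw_names))
print(clean_names)
```

Applied to a data frame, the same expression would be assigned back with `names(df) <- ...`.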
Data cleaning outcome:
After removing rows with NA, the sample size fell from 1795 to 884, which remains acceptable for further analysis. Dropping every row that contains NA ensures that all remaining observations are complete.
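A minimal sketch of why a missing-value count can report zero and how the rows can still be dropped. It assumes the missing entries are stored as empty or whitespace-like strings (for example a non-breaking space), which this report does not state explicitly; the toy data frame is invented for illustration.

```r
# Toy data: the second bean type is a non-breaking space, the third is empty.
toy <- data.frame(
  beantype = c("Criollo", "\u00a0", "", "Trinitario"),
  rating   = c(3.75, 3.00, 2.75, 3.50),
  stringsAsFactors = FALSE
)

# None of the entries is NA yet, so an NA-based check sees nothing missing.
na_before <- sum(is.na(toy$beantype))

# Treat empty strings and a lone non-breaking space as missing.
is_blank <- trimws(toy$beantype) %in% c("", "\u00a0")
toy$beantype[is_blank] <- NA

# Keep only complete rows.
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```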
2.1 Data Manipulation
flavors_of_cacao_without_na <-
  flavors_of_cacao_without_na %>%
  mutate(
    companylocation = case_when(
      companylocation == "U.S.A." ~ "United States of America",
      companylocation == "U.K." ~ "United Kingdom",
      companylocation == "Scotland" ~ "United Kingdom",
      companylocation == "Amsterdam" ~ "Netherlands",
      companylocation == "Sao Tome" ~ "São Tomé and Principe",
      companylocation == "Czech Republic" ~ "Czechia",
      companylocation == "Eucador" ~ "Ecuador",
      T ~ companylocation
    ),
    beantype = case_when(
      str_detect(beantype, regex("Criollo, Trinitario|Trinitario, Criollo|Trinitario, Forastero|Criollo, +|Amazon, ICS|Blend-Forastero,Criollo|Criollo, Forastero|Amazon mix|Forastero, Trinitario|Trinitario, TCGA|Trinitario, Nacional", ignore_case = T)) ~ "Blend",
      str_detect(beantype, regex("Forastero|CCN51|Matina", ignore_case = T)) ~ "Forastero",
      str_detect(beantype, regex("criollo", ignore_case = T)) ~ "Criollo",
      str_detect(beantype, regex("Nacional|EET", ignore_case = T)) ~ "Nacional",
      str_detect(beantype, regex("Trinitario", ignore_case = T)) ~ "Trinitario",
      TRUE ~ beantype
    ),
    cocoapercent = as.double(gsub("\\%", "", cocoapercent)) / 100
  )
In this step, country names are standardized, bean types are grouped into broader categories, and noise that could distort the final results is removed.
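The order of the rules matters when grouping bean types: a two-variety label has to be caught as a blend before the single-variety patterns see it. The following base R sketch illustrates this with invented labels and a shortened rule set; `classify_bean()` is a hypothetical helper, not part of the project code.

```r
# Toy labels, including a two-variety entry.
beantype <- c("Trinitario, Criollo", "Criollo (Porcelana)",
              "Forastero (Nacional)", "Trinitario")

classify_bean <- function(x) {
  has <- function(pattern) grepl(pattern, x, ignore.case = TRUE)
  # Most specific rule first: a later rule only applies to entries that
  # no earlier rule has claimed, which mirrors how case_when() behaves.
  ifelse(has("Trinitario, Criollo|Criollo, Trinitario"), "Blend",
  ifelse(has("Forastero"), "Forastero",
  ifelse(has("Criollo"), "Criollo",
  ifelse(has("Nacional"), "Nacional",
  ifelse(has("Trinitario"), "Trinitario", x)))))
}

bean_class <- classify_bean(beantype)
print(bean_class)

# Percent strings become fractions between 0 and 1.
cocoapercent <- as.double(gsub("%", "", c("70%", "63.5%"), fixed = TRUE)) / 100
print(cocoapercent)
```

If the "Criollo" rule ran first, the first label would be classed as Criollo rather than Blend.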
2.2 First Insight: Overall Rating Distribution
break_generator(
breaks_name = "score_break",
labels_name = "score_label",
start = 0,
end = 5,
gap = 1
)
#overall review on score
plot_for_rating_distribution <-
  flavors_of_cacao_without_na %>%
  select(companymakerifknown, rating) %>%
  group_by(companymakerifknown) %>%
  #remove duplicated company to avoid flooding specific rating
  summarise(avg_rating = mean(rating), .groups = "drop") %>%
  mutate(score_group = cut(avg_rating,
                           breaks = score_break,
                           labels = score_label,
                           right = F,
                           ordered_result = T),
         score_group = case_when(
           avg_rating == "5" ~ "4-5",
           T ~ score_group
         )) %>%
  #evaluate only average rating for every company
  group_by(score_group) %>%
  summarise(total_count = n(), .groups = "drop")

barchart_plotly(
  database = plot_for_rating_distribution,
  x_col = score_group,
  y_col = total_count,
  fill_col = score_group
) %>% layout(
  title = "Average Rating Distribution For Chocolate",
  xaxis = list(title = "Score Group"),
  yaxis = list(title = "Count")
)
score_group | 2-3 | 3-4 | 4-5 |
---|---|---|---|
total_count | 79 | 202 | 1 |
percentage | 28.01% | 71.63% | 0.35% |
The bar chart above shows the distribution of each company's average rating across all years. Most averages fall in the 3-4 score group (202 companies, 71.63% of the sample), followed by 2-3 (79, 28.01%) and 4-5 (1, 0.35%). This suggests that:
- Chocolate manufacturers generally need an average rating in the 3-4 score group to remain competitive in the market.
- The market tends to award favourable ratings.
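The score grouping and the percentages above can be sketched in base R. `break_generator()` is a project helper whose definition is not shown in this report, so the breaks and labels below are an assumption about what it produces; the company averages are invented, while the counts are the ones reported above.

```r
# Assumed equivalent of the generated breaks and labels: 0-1, 1-2, ..., 4-5.
score_break <- seq(0, 5, by = 1)
score_label <- paste(head(score_break, -1), tail(score_break, -1), sep = "-")

# Toy company averages, including a perfect 5.
avg_rating <- c(2.5, 3.0, 3.75, 4.0, 5.0)

# right = FALSE builds intervals [a, b), so exactly 5 falls outside and is NA.
strict <- cut(avg_rating, breaks = score_break, labels = score_label,
              right = FALSE, ordered_result = TRUE)

# include.lowest = TRUE closes the last interval to [4, 5], which is an
# alternative to patching the value 5 afterwards.
closed <- cut(avg_rating, breaks = score_break, labels = score_label,
              right = FALSE, include.lowest = TRUE, ordered_result = TRUE)

# Percentage share of each score group, from the reported counts.
total_count <- c("2-3" = 79, "3-4" = 202, "4-5" = 1)
share <- round(100 * total_count / sum(total_count), 2)
print(share)
```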
changes_of_rating_over_years <-
  flavors_of_cacao_without_na %>%
  #find average rating per company per year
  select(companymakerifknown, reviewdate, rating) %>%
  group_by(companymakerifknown, reviewdate) %>%
  #to prevent same company evaluate twice within that year
  summarise(avg_rating = mean(rating), .groups = "drop") %>%
  #to group score into factors by years
  mutate(score_group = cut(avg_rating,
                           breaks = score_break,
                           labels = score_label,
                           right = F,
                           ordered_result = T),
         score_group = case_when(
           avg_rating == "5" ~ "4-5",
           T ~ score_group
         )) %>%
  select(reviewdate, score_group) %>%
  group_by(reviewdate, score_group) %>%
  summarise(total_count = n(), .groups = "drop")

#generate group for years
year_group <- unique(changes_of_rating_over_years$reviewdate)

barchart_plotly(
  database = changes_of_rating_over_years,
  x_col = score_group,
  y_col = total_count,
  text_col = total_count,
  matched_group_col = reviewdate,
  ordered_group = year_group
) %>% button_generator(
  ordered_group = year_group,
  num_of_plot = 1
) %>%
  layout(
    title = "Distribution of Rating Over Years",
    xaxis = list(title = "Score Group"),
    yaxis = list(title = "Count")
  )
Ratings are reorganized to examine changes from 2006 to 2017. The bar chart above shows a marked rise in the 3–4 score group over time, followed by the 2–3 score group, which displays a fluctuating trend. The 1–2 and 4–5 score groups appear as extremes: companies in the 1–2 score group improve to remain competitive in the market, while the 4–5 score group is reserved for only the very best chocolate manufacturers.
2.3 Second Insight: Distribution by Country
flavors_of_cacao_without_na_group_by_country <-
  flavors_of_cacao_without_na %>%
  # find average rating per company per year
  select(companymakerifknown, rating, companylocation) %>%
  # calculate average rating for company
  group_by(companymakerifknown, companylocation) %>%
  summarise(avg_rating = mean(rating), .groups = "drop") %>%
  select(-companymakerifknown) %>%
  # Distribution Rating and Count to Country
  group_by(companylocation) %>%
  summarise(
    avg_rating = mean(avg_rating),
    total_count = n(), .groups = "drop")

ggplotly(
  map_function(
    database = flavors_of_cacao_without_na_group_by_country,
    assign_name = worldmap_group_by_number_of_company,
    joinby = companylocation,
    fill = total_count) +
    labs(title = "global map") +
    theme_function()
)
Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat = stat,
: Ignoring unknown aesthetics: text
The map above shows the distribution of chocolate manufacturers across countries. Lighter colors represent larger numbers, while darker colors represent smaller numbers.
country_group <- unique(flavors_of_cacao_without_na_group_by_country$companylocation)

flavors_of_cacao_without_na_group_by_year_and_country <-
  flavors_of_cacao_without_na %>%
  select(companylocation, rating, reviewdate) %>%
  group_by(companylocation, reviewdate) %>%
  summarise(avg_rating = mean(rating), .groups = "drop")

f1 <-
  flavors_of_cacao_without_na_group_by_country %>%
  arrange(avg_rating) %>%
  mutate(companylocation = factor(companylocation, levels = companylocation)) %>%
  barchart_plotly(
    database = .,
    x_col = total_count,
    y_col = companylocation,
    text_col = avg_rating,
    matched_group_col = companylocation,
    ordered_group = country_group
  )

f2 <-
  scatter_plotlyv2(
    database = flavors_of_cacao_without_na_group_by_year_and_country,
    x_col = reviewdate,
    y_col = avg_rating,
    matched_group_col = companylocation,
    ordered_group = country_group,
    display = F
    #checking = T
  )

subplot(f1, f2, shareX = F, shareY = F, nrows = 2, heights = c(0.7, 0.3)) %>%
  button_generator(
    ordered_group = country_group,
    sec_graph_display = F
  ) %>% layout(
    height = 500,
    margin = list(l = 0),
    annotations = list(
      list(text = "Distribution of Chocolate Manufacturers, Ranked in Ascending Order",
           x = 0.5,
           y = 1.05,
           xref = "paper",
           yref = "paper",
           showarrow = FALSE,
           font = list(size = 14)),
      list(text = "Average Score by Years",
           x = 0.5,
           y = 0.20,
           xref = "paper",
           yref = "paper",
           showarrow = FALSE,
           font = list(size = 14))
    )
  )
Warning: Specifying width/height in layout() is now deprecated.
Please specify in ggplotly() or plot_ly()
companylocation | United States of America | United Kingdom | France | Canada | Ecuador |
---|---|---|---|---|---|
total_count | 118 | 19 | 16 | 11 | 11 |
market_share | 41.84% | 6.74% | 5.67% | 3.9% | 3.9% |
The figure above combines two plots. The bar chart shows the number of chocolate manufacturers by country, ordered by average rating, while the line chart shows how the average rating changes across the years for each country.
The USA is the main chocolate bar manufacturing country, with 118 companies, equivalent to 41.84% of all manufacturers. Its average rating remains stable between 3.0 and 3.4 across the years. The United Kingdom, the second-largest market player, holds a far smaller share: 6.74% (19 manufacturers).
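The country figures rest on a two-stage aggregation: ratings are first averaged per company, and only then summarised per country, so a heavily reviewed company does not dominate its country's average. A minimal base R sketch with invented data (companies A to C, countries X and Y are placeholders):

```r
# Toy reviews: company A is reviewed four times, B and C once each.
toy <- data.frame(
  companymakerifknown = c("A", "A", "A", "A", "B", "C"),
  companylocation     = c("X", "X", "X", "X", "X", "Y"),
  rating              = c(4, 4, 4, 4, 2, 3),
  stringsAsFactors = FALSE
)

# Stage 1: one average per company.
per_company <- aggregate(rating ~ companymakerifknown + companylocation,
                         data = toy, FUN = mean)

# Stage 2: average of company averages and company count per country.
avg_rating  <- tapply(per_company$rating, per_company$companylocation, mean)
total_count <- tapply(per_company$rating, per_company$companylocation, length)
market_share <- round(100 * total_count / sum(total_count), 2)

# For comparison: pooling every review lets company A outweigh company B.
pooled_rating <- tapply(toy$rating, toy$companylocation, mean)

print(rbind(avg_rating, pooled_rating, total_count, market_share))
```

Here "market share" means the share of manufacturers, as in the table above, not sales volume.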
2.4 Third Insight: Relationship of Cacao Bean and Rating
flavors_of_cacao_without_na_group_by_beantype <-
  flavors_of_cacao_without_na %>%
  # calculate average rating based on beantype
  select(companymakerifknown, beantype, rating) %>%
  group_by(beantype) %>%
  mutate(average_rating = round(mean(rating), 2)) %>%
  ungroup() %>%
  # find only unique company with different beantype
  select(companymakerifknown, beantype, average_rating) %>%
  distinct() %>%
  # summarize total number of company with unique beantype
  group_by(beantype, average_rating) %>%
  summarise(total_count = n(), .groups = "drop") %>%
  # sort from lowest to highest rating
  arrange(average_rating) %>%
  mutate(beantype = factor(beantype, levels = beantype))

bean_group <- as.character(unique(flavors_of_cacao_without_na_group_by_beantype$beantype))

review_for_bean_per_year <-
  flavors_of_cacao_without_na %>%
  # group by year and bean type
  group_by(reviewdate, beantype) %>%
  summarise(average_rating = mean(rating), .groups = "drop") %>%
  arrange(reviewdate, beantype)

f <- barchart_plotly(
  database = flavors_of_cacao_without_na_group_by_beantype,
  x_col = beantype,
  y_col = total_count,
  text_col = average_rating,
  matched_group_col = beantype,
  ordered_group = bean_group
)

f2 <- scatter_plotlyv2(
  database = review_for_bean_per_year,
  x_col = reviewdate,
  y_col = average_rating,
  #text_col = average_rating,
  matched_group_col = beantype,
  ordered_group = bean_group,
  display = F
  #checking = T
)

subplot(f, f2, shareX = F, shareY = F, nrows = 2) %>%
  button_generator(
    ordered_group = bean_group,
    sec_graph_display = F
  ) %>%
  layout(
    title = "Supply of Bean Type with Average Rating"
  )
beantype | Trinitario | Forastero | Criollo | Blend | Nacional | Beniano | Amazon |
---|---|---|---|---|---|---|---|
average_rating | 3.25 | 3.12 | 3.27 | 3.28 | 3.31 | 3.58 | 3.25 |
total_count | 185 | 108 | 89 | 46 | 8 | 3 | 1 |
percentage | 42.05% | 24.55% | 20.23% | 10.45% | 1.82% | 0.68% | 0.23% |
The figure above combines a bar chart and a line chart. The bar chart displays the distribution of bean types ordered by average rating, while the line chart shows changes in ratings over the years.
Trinitario, a hybrid of Forastero and Criollo that inherits disease resistance from Forastero and superior flavor from Criollo (source), is the most widely used cocoa in the market, accounting for 42.05% of the total with an average rating of 3.25 across the years.
Forastero, the second most commonly used cocoa (24.55% of the market), is favored for its ease of cultivation and high yield (source). However, its average rating (3.12) is lower than Trinitario's (3.25).
Criollo ranks third in usage; although it produces smaller yields, it is valued for its stronger flavor compared to other varieties. Blend (mixed cocoa), Nacional, Amazon, and Beniano varieties serve more niche market segments.
flavors_of_cacao_without_na %>%
  select(beantype, cocoapercent, rating) %>%
  tsne_function(
    database = .,
    point_to_consider = 100,
    blur_rate = 0.1,
    group_col = beantype,
    show_iteration_msg = F)
The t-SNE plot above illustrates the relationships among cocoa percentage, rating, and bean type. The scatter points for different bean types overlap heavily, suggesting that there are no clear or direct differences between them.
2.5 Fourth Insight: Relationship of Cocoa Percentage and Rating
break_generator(
breaks_name = "cacao_percentage_break",
labels_name = "cacao_percentage_label",
start = 0,
end = 1,
gap = 0.1
)
summary_cocoa_percentage <-
  flavors_of_cacao_without_na %>%
  select(companymakerifknown, rating, cocoapercent, beantype) %>%
  group_by(companymakerifknown, beantype) %>%
  summarise(average_rating = mean(rating),
            average_cocoapercent = mean(cocoapercent),
            .groups = "drop") %>%
  select(-companymakerifknown) %>%
  mutate(cocoa_percent_group = cut(average_cocoapercent,
                                   breaks = cacao_percentage_break,
                                   labels = cacao_percentage_label,
                                   ordered_result = T,
                                   right = F)) %>%
  select(-average_cocoapercent) %>%
  group_by(beantype, cocoa_percent_group) %>%
  summarise(total_count = n(),
            average_rating = mean(average_rating), .groups = "drop")

barchart_plotly(
  database = summary_cocoa_percentage,
  x_col = cocoa_percent_group,
  y_col = total_count,
  matched_group_col = beantype,
  ordered_group = bean_group,
  text_col = average_rating
) %>%
  button_generator(
    num_of_plot = 1,
    ordered_group = bean_group
  ) %>% layout(
    xaxis = list(title = "Cocoa Percentage Group"),
    yaxis = list(title = "Total Count")
  )
The figure above shows, for each bean type, the number of manufacturers in each cocoa percentage group, labelled with the group's average rating. It suggests that a high cocoa percentage does not result in a high rating: across bean types, the highest average ratings usually fall between 60% and 80% cocoa.
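The grouping behind this figure can be sketched in base R as follows. The data are invented, and working in percentage points rather than the 0 to 1 fraction is an assumption made here to keep the bin edges exact, since decimal steps such as 0.1 are not always exactly representable in floating point; the original pipeline bins the fraction directly.

```r
# Toy company-level averages (invented values).
toy <- data.frame(
  cocoapercent = c(0.55, 0.70, 0.72, 0.75, 0.85, 1.00),
  rating       = c(2.75, 3.50, 3.25, 3.75, 3.00, 2.50)
)

# Convert to percentage points; rounding to one decimal removes float noise.
pct <- round(toy$cocoapercent * 100, 1)

breaks <- seq(0, 100, by = 10)
labels <- paste0(head(breaks, -1), "-", tail(breaks, -1), "%")

# Intervals are [a, b); include.lowest = TRUE also keeps 100% in the last bin.
toy$cocoa_percent_group <- cut(pct, breaks = breaks, labels = labels,
                               right = FALSE, include.lowest = TRUE,
                               ordered_result = TRUE)

# Average rating per cocoa percentage group (empty groups are omitted).
summary_by_group <- aggregate(rating ~ cocoa_percent_group, data = toy, FUN = mean)
print(summary_by_group)
```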
2.6 Preliminary Screening for ANOVA Analysis
flavors_of_cacao_without_na %>%
  group_by(companymakerifknown, beantype) %>%
  summarise(average_cocoapercent = mean(cocoapercent), .groups = "drop") %>%
  group_by(beantype) %>%
  summarise(num_of_user = n(), .groups = "drop") %>%
  arrange(num_of_user) %>%
  save_function(name = "sample_size_assessment_result", type = "NULL", return = T) %>%
  kbl(caption = "sample size for beantype") %>%
  kable_styling()
beantype | num_of_user |
---|---|
Amazon | 1 |
Beniano | 3 |
Nacional | 8 |
Blend | 46 |
Criollo | 89 |
Forastero | 108 |
Trinitario | 185 |
Sample size assessment indicates that Amazon (1), Beniano (3), and Nacional (8) have small sample sizes, making them unsuitable for ANOVA testing.
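One way to make this screening explicit is to select groups by a minimum size rather than by position in the table. The sketch below reuses the counts reported above; the cutoff of 30 is a hypothetical threshold chosen for illustration, not a value taken from this analysis.

```r
# Sample sizes as reported in the table above.
sample_size <- data.frame(
  beantype    = c("Amazon", "Beniano", "Nacional", "Blend",
                  "Criollo", "Forastero", "Trinitario"),
  num_of_user = c(1, 3, 8, 46, 89, 108, 185),
  stringsAsFactors = FALSE
)

# Hypothetical minimum group size for further testing.
min_n <- 30

# Keep only the bean types that meet the cutoff.
bean_type <- sample_size$beantype[sample_size$num_of_user >= min_n]
print(bean_type)
```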
bean_type <- sample_size_assessment_result$beantype[4:7]

flavors_of_cacao_without_na %>%
  group_by(companymakerifknown, beantype) %>%
  summarise(average_rating = mean(rating), .groups = "drop") %>%
  save_function(name = "sample", type = "NULL", return = T) %>%
  hist_bell_plotly(
    database = .,
    x_col = average_rating,
    bin_size = 0.05,
    matched_group_col = beantype,
    ordered_group = bean_type
  ) %>%
  bell_curve_selection_generator(
    number_of_curve = 8,
    hist_bell = T
  ) %>%
  layout(
    xaxis = list(title = "Average Rating"),
    yaxis = list(title = "Density")
  )

shapiro_group_testing(
  database = sample,
  matched_group_col = beantype,
  ordered_group = bean_type,
  x_col = average_rating
) %>%
  kbl(caption = "Result from shapiro test") %>%
  kable_styling()
group | p_value |
---|---|
Blend | 0.0082189 |
Criollo | 0.0100423 |
Forastero | 0.0027237 |
Trinitario | 0.0044680 |
The Shapiro-Wilk test shows that all four remaining groups deviate from normality, as every p-value is below 0.05 (the largest is about 0.010). Visual inspection of the histograms suggests that larger sample sizes could improve the approximation to normality and increase p-values.
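`shapiro_group_testing()` is a project helper whose definition is not shown here. A minimal sketch of a per-group Shapiro-Wilk test using base R's `shapiro.test()` is given below; the data are synthetic and the group names are placeholders, so the p-values it prints are unrelated to the table above.

```r
set.seed(42)

# Synthetic averages: one roughly normal group and one clearly skewed group.
sample_toy <- data.frame(
  group = rep(c("group_normal", "group_skewed"), times = c(46, 89)),
  average_rating = c(rnorm(46, mean = 3.3, sd = 0.3), rexp(89, rate = 1)),
  stringsAsFactors = FALSE
)

# Run the test once per group and collect the p-values.
p_values <- sapply(split(sample_toy$average_rating, sample_toy$group),
                   function(x) shapiro.test(x)$p.value)

result <- data.frame(group = names(p_values), p_value = unname(p_values))

# A p-value at or below 0.05 is read as a departure from normality.
result$deviates_from_normal <- result$p_value <= 0.05
print(result)
```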
3 Conclusion
Trinitario has the largest market share, which is consistent with its characteristics: high yield and disease resistance inherited from Forastero, combined with superior taste from Criollo. Based on visual inspection, there is no clear relationship between bean type and rating, and the ANOVA test could not be performed because the average ratings within each bean type do not follow a normal distribution. Additionally, a higher cocoa percentage does not necessarily result in higher ratings; the optimal cocoa range appears to be between 60% and 80%, suggesting that the best flavors arise from a balanced combination of ingredients rather than solely from cocoa content.
In terms of manufacturers, 41.84% of companies (118 manufacturers) are from the USA, making it the largest chocolate bar supplier by a significant margin.