Chocolate Bar Ratings

Author

JY Choo

Published

August 28, 2025

1 Introduction

This project performs exploratory data analysis (EDA) on the Chocolate Rating Dataset to answer the following questions:

  1. Which bean type has the largest market share?
  2. What is the distribution of chocolate bar manufacturers across the world?
  3. What is the relationship between bean type and chocolate ratings?
  4. What is the relationship between cocoa percentage and chocolate ratings?

The source data is accessible via the links below.

library(knitr)  # provides kable()

#clickable link
df <- data.frame(
  Name = c("Kaggle", "Github"),
  Link = c('<a href ="https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings/data" target ="_blank">Click</a>',
           '<a href ="https://github.com/jianyuan941/Chocolate-Bar-Ratings" target= "_blank">Click</a>')
)
kable(df, escape = FALSE)
Name     Link
Kaggle   Click
Github   Click

2 Exploratory Data Analysis

2.0.1 Data Cleansing

Issues identified in the dataset:

  1. Column names are inconsistent, containing a mix of capital letters and dots (.).
  2. The skim() function incorrectly reports the dataset as free from missing values (NA).
  3. Misspelled entries in the companylocation column.
  4. Bias at the data collection stage: some companies are evaluated multiple times in a year.
  5. Ratings are not collected consistently for each manufacturer across the years.

Data cleaning outcome:

After rows containing NA were removed, the sample size fell from 1795 to 884, which remains acceptable for further analysis. Dropping these rows ensures that every remaining observation is complete.
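
A minimal sketch of this cleaning step in base R follows. It assumes, purely for illustration, that the missing entries are stored as empty or whitespace-only strings, which is one way a summary tool could fail to count them as NA. The toy data frame and its column names are hypothetical and are not the project's data.

```r
# Toy data (hypothetical): blanks stand in for missing bean types
raw <- data.frame(
  company  = c("A", "B", "C", "D"),
  beantype = c("Criollo", "  ", "", "Trinitario"),
  rating   = c(3.5, 3.0, 2.75, 4.0),
  stringsAsFactors = FALSE
)

clean <- raw

# Convert empty or whitespace-only strings to NA in every character column
char_cols <- vapply(clean, is.character, logical(1))
clean[char_cols] <- lapply(clean[char_cols], function(x) {
  x[!is.na(x) & trimws(x) == ""] <- NA
  x
})

# Keep only rows with no missing value in any column
n_before <- nrow(clean)
clean <- clean[complete.cases(clean), ]
n_after <- nrow(clean)

print(c(before = n_before, after = n_after))
```

Comparing the row count before and after the filter, as the last line does, gives the same kind of check as the 1795 to 884 figure reported above.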

2.1 Data Manipulation

flavors_of_cacao_without_na <- 
flavors_of_cacao_without_na %>% 
  mutate( companylocation = case_when(
          companylocation == "U.S.A." ~ "United States of America",
          companylocation == "U.K." ~ "United Kingdom",
          companylocation == "Scotland" ~ "United Kingdom",
          companylocation == "Amsterdam" ~ "Netherlands",
          companylocation == "Sao Tome" ~ "São Tomé and Principe",
          companylocation == "Czech Republic" ~ "Czechia",
          companylocation == "Eucador" ~ "Ecuador",
          T ~ companylocation
        ),
        beantype = case_when(
          # "\\+" matches the literal "+" in the value "Criollo, +" (an unescaped + is a regex quantifier)
          str_detect(beantype, regex("Criollo, Trinitario|Trinitario, Criollo|Trinitario, Forastero|Criollo, \\+|Amazon, ICS|Blend-Forastero,Criollo|Criollo, Forastero|Amazon mix|Forastero, Trinitario|Trinitario, TCGA|Trinitario, Nacional" ,ignore_case = T)) ~ "Blend",
          str_detect(beantype, regex("Forastero|CCN51|Matina", ignore_case = T)) ~"Forastero",
          str_detect(beantype, regex("criollo", ignore_case = T)) ~ "Criollo",
          str_detect(beantype, regex("Nacional|EET", ignore_case = T)) ~ "Nacional",
          str_detect(beantype, regex("Trinitario", ignore_case = T)) ~ "Trinitario",
          TRUE ~ beantype 
        ),
        cocoapercent = as.double(gsub("\\%","",cocoapercent))/100
         )

The data manipulation step standardizes country names, groups the detailed bean type labels into broader categories, converts cocoa percentage to a proportion, and removes noise that could otherwise mislead the final results.
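
The order of the bean type rules matters: blends must be matched before single varieties, otherwise a label such as "Trinitario, Criollo" would be assigned to one of its components. The sketch below illustrates that first-match-wins logic in base R. The function name classify_bean, the shortened pattern list, and the example labels are illustrative assumptions, not the project's code.

```r
# Rules in priority order: the first matching pattern wins (shortened, illustrative list)
rules <- c(
  Blend      = "Criollo, Trinitario|Trinitario, Criollo|Criollo, \\+|Amazon mix",
  Forastero  = "Forastero|CCN51|Matina",
  Criollo    = "Criollo",
  Nacional   = "Nacional|EET",
  Trinitario = "Trinitario"
)

# Hypothetical helper: returns the first matching category, or the input unchanged
classify_bean <- function(x) {
  out <- x
  assigned <- rep(FALSE, length(x))
  for (label in names(rules)) {
    hit <- !assigned & grepl(rules[[label]], x, ignore.case = TRUE)
    out[hit] <- label
    assigned <- assigned | hit
  }
  out
}

examples <- c("Criollo (Porcelana)", "Trinitario, Criollo",
              "Forastero (Arriba)", "Criollo, +", "Beniano")
result <- classify_bean(examples)
print(data.frame(original = examples, category = result))
```

Labels that match no rule, such as "Beniano" here, are left unchanged, which mirrors the fallback branch of the case_when() call.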

2.2 First Insight: Overall Rating Distribution

break_generator(
  breaks_name = "score_break",
  labels_name = "score_label",
  start = 0,
  end = 5,
  gap = 1
)


#overall review on score
plot_for_rating_distribution<-
flavors_of_cacao_without_na %>% 
  select(companymakerifknown, rating) %>% 
  group_by(companymakerifknown) %>% 
  
  #average each company's ratings so that frequently reviewed companies do not flood a specific rating
  summarise(avg_rating = mean(rating), .groups = "drop") %>% 
  mutate(score_group = cut(avg_rating,
                           breaks = score_break,
                           labels = score_label,
                           right = FALSE,
                           ordered_result = TRUE),
         #with right = FALSE an average of exactly 5 falls outside the last bin, so assign it to "4-5"
         score_group = case_when(
           avg_rating == 5 ~ "4-5",
           TRUE ~ as.character(score_group)
         )) %>% 
  
  #count companies in each score group
  group_by(score_group) %>% 
  summarise(total_count = n(), .groups = "drop")

barchart_plotly(
  database = plot_for_rating_distribution,
  x_col = score_group,
  y_col = total_count,
  fill_col = score_group
) %>% layout(
  title = "Average Rating Distribution For Chocolate",
  xaxis = list(title = "Score Group"),
  yaxis = list(title = "Count")
)
Rating Distribution in Percentage
score_group   2-3      3-4      4-5
total_count   79       202      1
percentage    28.01%   71.63%   0.35%

The bar chart above shows each company's average rating across all years. Average ratings are concentrated in the 3-4 score group (202 companies, 71.63% of the sample), followed by 2-3 (79 companies, 28.01%) and 4-5 (1 company, 0.35%). This suggests that:

  1. Chocolate manufacturers appear to need an average rating of at least 3-4 to remain competitive in the market.
  2. The market tends to award favourable ratings.
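
The binning and the percentages in the table can be reproduced with a short base R sketch. It assumes that break_generator() yields integer breaks from 0 to 5 with labels such as "3-4"; that helper is not shown in the source, so the two vectors below are a guess at its output. Using include.lowest = TRUE is one way to place an average of exactly 5 in the top bin without a separate correction.

```r
# Assumed equivalents of the breaks and labels created by break_generator()
score_break <- 0:5
score_label <- paste0(head(score_break, -1), "-", tail(score_break, -1))

# Illustrative average ratings, including the edge case of exactly 5
avg_rating <- c(2.5, 3, 3.99, 5)

# Left-closed bins [a, b); include.lowest = TRUE closes the last bin at 5
score_group <- cut(avg_rating,
                   breaks = score_break,
                   labels = score_label,
                   right = FALSE,
                   include.lowest = TRUE,
                   ordered_result = TRUE)

# Percentages recomputed from the counts reported in the table above
counts <- c("2-3" = 79, "3-4" = 202, "4-5" = 1)
pct <- round(100 * counts / sum(counts), 2)

print(data.frame(avg_rating, score_group))
print(pct)
```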
changes_of_rating_over_years <-
flavors_of_cacao_without_na %>% 
  #find average rating per company per year
  select(companymakerifknown, reviewdate, rating) %>% 
  group_by(companymakerifknown, reviewdate) %>% 
  #average per company per year so a company reviewed several times in a year counts once
  summarise(avg_rating = mean(rating), .groups = "drop") %>% 
  #bin the yearly averages into score groups
  mutate(score_group = cut(avg_rating,
                           breaks = score_break,
                           labels = score_label,
                           right = FALSE,
                           ordered_result = TRUE),
         #with right = FALSE an average of exactly 5 falls outside the last bin, so assign it to "4-5"
         score_group = case_when(
           avg_rating == 5 ~ "4-5",
           TRUE ~ as.character(score_group)
         )) %>% 
  select(reviewdate, score_group) %>% 
  group_by(reviewdate, score_group) %>% 
  summarise(total_count = n(), .groups = "drop") 


#generate group for years
year_group <- unique(changes_of_rating_over_years$reviewdate)

barchart_plotly(
  database = changes_of_rating_over_years,
  x_col = score_group,
  y_col = total_count,
  text_col = total_count,
  matched_group_col = reviewdate,
  ordered_group = year_group
) %>% button_generator(
  ordered_group = year_group,
  num_of_plot = 1
) %>% 
  layout(
  title = "Distribution of Rating Over Years",
  xaxis = list(title = "Score Group"),
  yaxis = list(title = "Count")
)

Ratings are regrouped by year to examine changes from 2006 to 2017. The bar chart above shows a marked rise in the 3–4 score group over time, followed by the 2–3 score group, which fluctuates. The 1–2 and 4–5 score groups appear as extremes: companies in the 1–2 score group appear to improve in order to remain competitive in the market, while the 4–5 score group is reached by only the very best chocolate manufacturers.

2.3 Second Insight: Distribution by Country

flavors_of_cacao_without_na_group_by_country <- 
flavors_of_cacao_without_na %>% 
  # keep only the columns needed for the country summary
  select(companymakerifknown, rating, companylocation) %>% 
  # calculate average rating for company 
  group_by(companymakerifknown, companylocation) %>% 
  summarise(avg_rating = mean(rating), .groups = "drop") %>% 
  select(-companymakerifknown) %>% 
  # Distribution Rating and Count to Country
  group_by(companylocation) %>% 
  summarise(
    avg_rating = mean(avg_rating),
    total_count = n(), .groups = "drop")

ggplotly(
  map_function(
    database = flavors_of_cacao_without_na_group_by_country,
    assign_name = worldmap_group_by_number_of_company,
    joinby = companylocation,
    fill = total_count)+
    labs(title = "global map")+
    theme_function()
)

The map above shows the distribution of chocolate manufacturers across countries. Lighter colors represent larger numbers of manufacturers, while darker colors represent smaller numbers.

country_group <- unique(flavors_of_cacao_without_na_group_by_country$companylocation)
flavors_of_cacao_without_na_group_by_year_and_country<-
flavors_of_cacao_without_na %>% 
  select(companylocation, rating, reviewdate) %>% 
  group_by(companylocation, reviewdate) %>% 
  summarise(avg_rating = mean(rating), .groups = "drop")

f1<- 
flavors_of_cacao_without_na_group_by_country %>% 
  arrange(avg_rating) %>% 
  mutate(companylocation = factor(companylocation, levels = companylocation)) %>% 
barchart_plotly(
  database = .,
  x_col = total_count,
  y_col = companylocation,
  text_col = avg_rating,
  matched_group_col = companylocation,
  ordered_group = country_group
) 

f2<-
scatter_plotlyv2(
  database = flavors_of_cacao_without_na_group_by_year_and_country,
  x_col = reviewdate,
  y_col = avg_rating,
  matched_group_col = companylocation,
  ordered_group = country_group,
  display = F
  #checking = T
)

subplot(f1, f2, shareX = F, shareY = F, nrows = 2, heights = c(0.7,0.3))  %>% 
  button_generator(
    ordered_group = country_group,
    sec_graph_display = F
  ) %>% layout(
    height = 500,
    margin = list(l = 0),
    annotations = list(
      list(text = "Distribution of Chocolate Manufacturers, Ranked by Average Rating (Ascending)",
           x = 0.5,
           y =1.05,
          xref = "paper", 
          yref = "paper", 
          showarrow = FALSE,
          font = list(size = 14)),
      list(text = "Average Score by Years",
           x = 0.5,
           y =0.20,
          xref = "paper", 
          yref = "paper", 
          showarrow = FALSE,
          font = list(size = 14))
      )
    )
Top 5 Market Players
companylocation   United States of America   United Kingdom   France   Canada   Ecuador
total_count       118                        19               16       11       11
market_share      41.84%                     6.74%            5.67%    3.9%     3.9%

The figure above combines two types of plot. The bar chart shows the number of chocolate manufacturers in each country, ordered by average rating, while the line chart shows how the average rating changes across the years for each country.

The USA is the main chocolate bar manufacturing country, with 118 companies, equivalent to 41.84% of all manufacturers. Its average rating remains stable between 3.0 and 3.4 across the years. The United Kingdom, the second-largest market player, holds a much smaller share of 6.74% (19 manufacturers).
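
The country figures rest on a two-stage aggregation: ratings are averaged per company first and per country second, so that a company with many reviews does not dominate its country's average. The sketch below shows the idea in base R on made-up data; the column names and values are hypothetical.

```r
# Toy reviews (hypothetical): company A is reviewed three times, company B once
reviews <- data.frame(
  company = c("A", "A", "A", "B", "C"),
  country = c("X", "X", "X", "X", "Y"),
  rating  = c(4, 4, 4, 2, 3),
  stringsAsFactors = FALSE
)

# Stage 1: one average rating per company
company_avg <- aggregate(rating ~ company + country, data = reviews, FUN = mean)

# Stage 2: average of company averages, plus the number of companies per country
by_country <- aggregate(rating ~ country, data = company_avg, FUN = mean)
names(by_country)[2] <- "avg_rating"
company_counts <- table(company_avg$country)
by_country$total_count <- as.vector(company_counts[as.character(by_country$country)])

# Share of manufacturers located in each country, in percent
by_country$market_share <- round(100 * by_country$total_count / sum(by_country$total_count), 2)

# For comparison: a naive average over all reviews in country X
naive_x <- mean(reviews$rating[reviews$country == "X"])

print(by_country)
print(naive_x)
```

In this toy case the two-stage average for country X is 3, whereas the naive average over all reviews is 3.5, because company A's three reviews are counted three times.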

2.4 Third Insight: Relationship of Cacao Bean and Rating

flavors_of_cacao_without_na_group_by_beantype <- 
flavors_of_cacao_without_na %>% 
  # calculate average rating based on beantype
  select(companymakerifknown, beantype, rating) %>% 
  group_by(beantype) %>% 
  mutate(average_rating = round(mean(rating),2)) %>% 
  ungroup() %>%
  # find only unique company with different beantype
  select(companymakerifknown, beantype, average_rating) %>% 
  distinct() %>% 
  # summarize total number of company with unique beantype 
  group_by(beantype, average_rating) %>% 
  summarise(total_count = n(), .groups = "drop") %>% 
  # sort from lowest to highest rating
  arrange(average_rating) %>% 
  mutate(beantype = factor(beantype, levels = beantype))

bean_group <- as.character(unique(flavors_of_cacao_without_na_group_by_beantype$beantype))

review_for_bean_per_year <-
flavors_of_cacao_without_na %>% 
  # group by year and bean type
  group_by(reviewdate, beantype) %>% 
  summarise(average_rating = mean(rating), .groups = "drop") %>% 
  arrange(reviewdate, beantype)

f<- barchart_plotly(
  database = flavors_of_cacao_without_na_group_by_beantype,
  x_col = beantype,
  y_col = total_count,
  text_col = average_rating,
  matched_group_col = beantype,
  ordered_group = bean_group
)

f2<- scatter_plotlyv2(
  database = review_for_bean_per_year,
  x_col = reviewdate,
  y_col = average_rating,
  #text_col = average_rating,
  matched_group_col = beantype,
  ordered_group = bean_group,
  display =F
  #checking = T
) 

 
subplot(f, f2, shareX = F, shareY = F, nrows =2) %>% 
button_generator(
    ordered_group = bean_group,
    sec_graph_display = F
  ) %>%   
  layout(
    title = "Supply of Bean Type with Average Rating"
  )
Bean Type with Associated Average Rating
beantype         Trinitario   Forastero   Criollo   Blend    Nacional   Beniano   Amazon
average_rating   3.25         3.12        3.27      3.28     3.31       3.58      3.25
total_count      185          108         89        46       8          3         1
percentage       42.05%       24.55%      20.23%    10.45%   1.82%      0.68%     0.23%

The figure above combines a bar chart and a line chart. The bar chart displays the distribution of bean types ordered by average rating, while the line chart shows changes in ratings over the years.

Trinitario, a hybrid of Forastero and Criollo that inherits disease resistance from Forastero and superior flavor from Criollo (source), is the most widely used cocoa in the market, accounting for 42.05% of the total with an average rating of 3.25 across the years.

Forastero, the second most commonly used cocoa (24.55% of the market), is favored for its ease of cultivation and high yield (source). However, its average rating (3.12) is lower than Trinitario's (3.25).

Criollo ranks third in usage; although it produces smaller yields, it is valued for its stronger flavor compared to other varieties. Blend (mixed cocoa), Nacional, Amazon, and Beniano varieties serve more niche market segments.

flavors_of_cacao_without_na %>% 
  select(beantype, cocoapercent, rating) %>% 
tsne_function(
  database = ., 
  point_to_consider = 100,
  blur_rate = 0.1,
  group_col = beantype,
  show_iteration_msg = F)

The t-SNE plot above illustrates the relationships among cocoa percentage, rating, and bean type. The scatter points for different bean types overlap heavily, suggesting that there are no clear or direct differences between them.

2.5 Fourth Insight: Relationship of Cocoa Percentage and Rating

break_generator(
  breaks_name = "cacao_percentage_break",
  labels_name = "cacao_percentage_label",
  start = 0,
  end = 1,
  gap = 0.1
)

summary_cocoa_percentage <- 
flavors_of_cacao_without_na %>% 
  select(companymakerifknown,rating, cocoapercent, beantype) %>% 
  group_by(companymakerifknown,beantype) %>% 
  summarise(average_rating = mean(rating),
            average_cocoapercent = mean(cocoapercent),
            .groups = "drop") %>% 
  select(-companymakerifknown) %>% 
  mutate(cocoa_percent_group = cut(average_cocoapercent,
                                   breaks = cacao_percentage_break,
                                   labels = cacao_percentage_label,
                                   ordered_result = T,
                                   right = F)) %>% 
  select(-average_cocoapercent) %>% 
  group_by(beantype, cocoa_percent_group) %>% 
  summarise(total_count = n(), 
            average_rating = mean(average_rating),.groups = "drop") 
  
  barchart_plotly(
    database = summary_cocoa_percentage, 
    x_col = cocoa_percent_group,
    y_col = total_count,
    matched_group_col = beantype,
    ordered_group = bean_group,
    text_col = average_rating
    ) %>% 
    button_generator(
      num_of_plot = 1,
      ordered_group = bean_group
    ) %>% layout(
      xaxis = list(title = "Cocoa Percentage Group"),
      yaxis = list(title = "Total Count")
    )

The figure above shows the distribution of average ratings by cocoa percentage group for each bean type. It suggests that a high cocoa percentage does not result in high ratings: across bean types, the highest average ratings generally fall between 60% and 80% cocoa.
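
A minimal sketch of the cocoa percentage grouping is given below, using made-up ratings. It bins on the whole-number percent scale rather than on proportions, an assumed variation that sidesteps floating-point comparisons against breaks such as those from seq(0, 1, by = 0.1). The data and column names are hypothetical.

```r
# Toy bars (hypothetical values), with cocoa content stored as text like "70%"
bars <- data.frame(
  cocoa_percent = c("55%", "60%", "70%", "72%", "85%", "100%"),
  rating        = c(2.75, 3.25, 3.5, 3.75, 3.0, 2.0),
  stringsAsFactors = FALSE
)

# Strip the percent sign and convert to a number on the 0-100 scale
bars$cocoa_num <- as.numeric(sub("%", "", bars$cocoa_percent, fixed = TRUE))

# Ten-point bands: [0,10), [10,20), ..., [90,100]
breaks <- seq(0, 100, by = 10)
labels <- paste0(head(breaks, -1), "-", tail(breaks, -1))
bars$cocoa_group <- cut(bars$cocoa_num,
                        breaks = breaks,
                        labels = labels,
                        right = FALSE,
                        include.lowest = TRUE,
                        ordered_result = TRUE)

# Mean rating per band (bands with no bars are omitted)
band_summary <- aggregate(rating ~ cocoa_group, data = bars, FUN = mean)
print(band_summary)
```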

2.6 Preliminary Screening for ANOVA Analysis

flavors_of_cacao_without_na %>% 
  group_by(companymakerifknown, beantype) %>% 
  summarise(average_cocoapercent = mean(cocoapercent), .groups = "drop") %>% 
  group_by(beantype) %>% 
  summarise(num_of_user = n(), .groups = "drop") %>%
  arrange(num_of_user) %>%
  save_function(name = "sample_size_assessment_result", type = "NULL", return = T) %>% 
  kbl(caption = "sample size for beantype") %>%
  kable_styling()
sample size for beantype
beantype     num_of_user
Amazon       1
Beniano      3
Nacional     8
Blend        46
Criollo      89
Forastero    108
Trinitario   185

Sample size assessment indicates that Amazon, Beniano, and Nacional have small sample sizes, making them unsuitable for ANOVA testing.
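
The same screening can be written as a threshold rule instead of selecting rows by position. The sketch below reuses the counts from the table above; the cut-off of 30 is an assumed rule of thumb, not a value taken from the analysis.

```r
# Company counts per bean type, copied from the table above
num_of_user <- c(Amazon = 1, Beniano = 3, Nacional = 8, Blend = 46,
                 Criollo = 89, Forastero = 108, Trinitario = 185)

# Assumed minimum group size (hypothetical threshold)
min_group_size <- 30

# Bean types that pass or fail the screening
bean_type_kept <- names(num_of_user)[num_of_user >= min_group_size]
bean_type_dropped <- names(num_of_user)[num_of_user < min_group_size]

print(bean_type_kept)
print(bean_type_dropped)
```

A rule of this kind keeps working if the table is re-sorted or a new bean type appears, which a fixed row range would not.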

bean_type <-sample_size_assessment_result$beantype[4:7]

flavors_of_cacao_without_na %>% 
  group_by(companymakerifknown, beantype) %>% 
  summarise(average_rating = mean(rating), .groups = "drop") %>% 
  save_function(name ="sample", type = "NULL", return = T) %>% 
hist_bell_plotly(
  database = .,
  x_col = average_rating,
  bin_size = 0.05,
  matched_group_col = beantype,
  ordered_group = bean_type
) %>% 
  bell_curve_selection_generator(
    number_of_curve = 8,
    hist_bell = T
  ) %>% 
  layout(
    xaxis = list(title = "Average Rating"),
    yaxis = list(title = "Density")
    
  )
shapiro_group_testing(
  database = sample,
  matched_group_col = beantype,
  ordered_group = bean_type,
  x_col = average_rating
) %>% 
  kbl(caption = "Result from shapiro test") %>% 
  kable_styling()
Result from shapiro test
group        p_value
Blend        0.0082189
Criollo      0.0100423
Forastero    0.0027237
Trinitario   0.0044680

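The Shapiro-Wilk test shows that all four remaining groups deviate from normality, as their p-values are below 0.05. Visual inspection of the histograms suggests that the group averages might approximate a normal distribution more closely with more data, although larger samples would not necessarily raise the p-values, because the Shapiro-Wilk test becomes more sensitive to small departures as the sample grows.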
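
For readers without access to the shapiro_group_testing() helper, a per-group Shapiro-Wilk test can be sketched with base R's shapiro.test(). The data below are simulated and the data frame name is hypothetical, so the resulting p-values will not match the table above; this is a stand-in for the idea, not the author's implementation.

```r
set.seed(42)

# Simulated company-level average ratings for two bean types (not real data)
rating_sample <- data.frame(
  beantype = rep(c("Blend", "Criollo"), times = c(40, 60)),
  average_rating = c(rnorm(40, mean = 3.3, sd = 0.3),
                     rnorm(60, mean = 3.2, sd = 0.4)),
  stringsAsFactors = FALSE
)

# Run the Shapiro-Wilk test separately within each bean type
p_values <- vapply(
  split(rating_sample$average_rating, rating_sample$beantype),
  function(x) shapiro.test(x)$p.value,
  numeric(1)
)

# Collect the results and flag groups where normality is rejected at the 5% level
result <- data.frame(group = names(p_values), p_value = unname(p_values),
                     stringsAsFactors = FALSE)
result$reject_normality <- result$p_value <= 0.05

print(result)
```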

3 Conclusion

Trinitario has the largest market share, likely because of its characteristics: a high yield and disease resistance inherited from Forastero, combined with superior taste from Criollo. Based on visual inspection, ratings do not differ clearly across bean types, and the ANOVA test could not be performed because the average ratings within each bean type do not follow a normal distribution. Additionally, a higher cocoa percentage does not necessarily result in higher ratings; the optimal cocoa range appears to be between 60% and 80%, suggesting that the best flavors arise from a balanced combination of ingredients rather than solely from cocoa content.

In terms of manufacturers, 41.84% of companies (118 manufacturers) are from the USA, making it home to the largest number of chocolate bar manufacturers by a wide margin.