Unwrapping the Secrets: Nutritional Analytics of U.S. Packaged Foods¶

đŸ•”đŸ»â€â™€ïž What is it about?¶

In this project, we explore the nutritional properties of packaged foods in the United States, aiming to uncover hidden patterns and examine how nutritional features are reflected in Nutri-Score labeling.

Why Is It Interesting?¶

As consumers become increasingly health-conscious, understanding the connection between what's inside our food and how it is nutritionally scored has never been more important. In addition, with the rising concern surrounding food additives, it is important for us to provide insights that can inform healthier food choices.

Key Research Questions:¶

  ‱ Does Nutri-Score reflect the presence of food additives, or are there potential inconsistencies?
  ‱ Do certain brands consistently use more additives and tend to receive lower Nutri-Scores?
  • How do food additives and healthiness vary across different brands of packaged foods in the U.S.?

Where Does Our Data Come From?¶

We are using the Open Food Facts dataset from Kaggle, focusing on packaged foods available in the U.S. The dataset contains over 300,000 entries with detailed nutritional, additive, and ingredient information.

🔍 You can also take a look at the project website: https://world.openfoodfacts.org/

đŸ€” Before we start, you might wonder, "What is Nutri-Score labeling?"

Nutri-Score is a front-of-pack nutrition labeling system designed to help consumers make healthier food choices at a glance. It rates the overall nutritional quality of food products using a five-color and five-letter scale, from A (dark green) for the healthiest options to E (red) for the least healthy ones.

It was developed in France in 2017, but it is actually pretty common to see in supermarkets today! 🛍

[Image: Nutri-Score label showing the five-color A–E scale]

1ïžâƒŁ Data Loading and Preprocessing¶

1.1 Setting Up Data¶

Before diving into the analysis, let's make sure we have all the necessary tools and data prepared. We will start by importing essential libraries, setting up any required packages, and retrieving the Open Food Facts dataset from Kaggle. 🚀✹ Let's get everything ready!

In [ ]:
!pip install umap-learn -q
In [ ]:
!pip install unidecode
Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.8/235.8 kB 9.8 MB/s eta 0:00:00
Installing collected packages: unidecode
Successfully installed unidecode-1.4.0
In [ ]:
# Basic utilities
import os
import re
import difflib
import unidecode
from collections import Counter
from itertools import chain

# Data processing
import pandas as pd
import numpy as np

# Text processing
import nltk
from nltk.corpus import stopwords

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Dimensionality reduction and clustering
import umap
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Preprocessing and model selection
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

# Evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

# Dataset download
import kagglehub
In [ ]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
In [ ]:
# Download Open Food Facts dataset from Kaggle
path = kagglehub.dataset_download("openfoodfacts/world-food-facts")
print("Path to dataset files:", path)
Path to dataset files: /kaggle/input/world-food-facts
In [ ]:
# List available files
files = os.listdir(path)
print("Files in dataset:", files)
Files in dataset: ['en.openfoodfacts.org.products.tsv']
In [ ]:
# Load the main data file
file_path = os.path.join(path, 'en.openfoodfacts.org.products.tsv')
df = pd.read_csv(file_path, sep='\t', low_memory=False)

1.2 Exploring the Raw Dataset¶

Now that we have successfully loaded the Open Food Facts dataset, let's take a quick look at its overall structure. We will first check the number of rows and columns, and then examine the names and data types of all available attributes.

In [ ]:
# dataset review
df.head()
Out[ ]:
code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity ... fruits-vegetables-nuts_100g fruits-vegetables-nuts-estimate_100g collagen-meat-protein-ratio_100g cocoa_100g chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g
0 0000000003087 http://world-en.openfoodfacts.org/product/0000... openfoodfacts-contributors 1474103866 2016-09-17T09:17:46Z 1474103893 2016-09-17T09:18:13Z Farine de blé noir NaN 1kg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0000000004530 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Banana Chips Sweetened (Whole) NaN NaN ... NaN NaN NaN NaN NaN NaN 14.0 14.0 NaN NaN
2 0000000004559 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Peanuts NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN
3 0000000016087 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489055731 2017-03-09T10:35:31Z 1489055731 2017-03-09T10:35:31Z Organic Salted Nut Mix NaN NaN ... NaN NaN NaN NaN NaN NaN 12.0 12.0 NaN NaN
4 0000000016094 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489055653 2017-03-09T10:34:13Z 1489055653 2017-03-09T10:34:13Z Organic Polenta NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 163 columns

In [ ]:
# data shape
print("Data rows and columns number:", df.shape)
Data rows and columns number: (356027, 163)
In [ ]:
# check all columns' name and datatype:
print("All Columns and Their Data Types in the Dataset:\n")
print(f"{'Column Name':<50} {'Type'}")
print("-" * 60)
for col in df.columns:
    print(f"{col:<50} {df[col].dtype}")
All Columns and Their Data Types in the Dataset:

Column Name                                        Type
------------------------------------------------------------
code                                               object
url                                                object
creator                                            object
created_t                                          object
created_datetime                                   object
last_modified_t                                    object
last_modified_datetime                             object
product_name                                       object
generic_name                                       object
quantity                                           object
packaging                                          object
packaging_tags                                     object
brands                                             object
brands_tags                                        object
categories                                         object
categories_tags                                    object
categories_en                                      object
origins                                            object
origins_tags                                       object
manufacturing_places                               object
manufacturing_places_tags                          object
labels                                             object
labels_tags                                        object
labels_en                                          object
emb_codes                                          object
emb_codes_tags                                     object
first_packaging_code_geo                           object
cities                                             object
cities_tags                                        object
purchase_places                                    object
stores                                             object
countries                                          object
countries_tags                                     object
countries_en                                       object
ingredients_text                                   object
allergens                                          object
allergens_en                                       object
traces                                             object
traces_tags                                        object
traces_en                                          object
serving_size                                       object
no_nutriments                                      float64
additives_n                                        float64
additives                                          object
additives_tags                                     object
additives_en                                       object
ingredients_from_palm_oil_n                        float64
ingredients_from_palm_oil                          float64
ingredients_from_palm_oil_tags                     object
ingredients_that_may_be_from_palm_oil_n            float64
ingredients_that_may_be_from_palm_oil              float64
ingredients_that_may_be_from_palm_oil_tags         object
nutrition_grade_uk                                 float64
nutrition_grade_fr                                 object
pnns_groups_1                                      object
pnns_groups_2                                      object
states                                             object
states_tags                                        object
states_en                                          object
main_category                                      object
main_category_en                                   object
image_url                                          object
image_small_url                                    object
energy_100g                                        float64
energy-from-fat_100g                               float64
fat_100g                                           float64
saturated-fat_100g                                 float64
-butyric-acid_100g                                 float64
-caproic-acid_100g                                 float64
-caprylic-acid_100g                                float64
-capric-acid_100g                                  float64
-lauric-acid_100g                                  float64
-myristic-acid_100g                                float64
-palmitic-acid_100g                                float64
-stearic-acid_100g                                 float64
-arachidic-acid_100g                               float64
-behenic-acid_100g                                 float64
-lignoceric-acid_100g                              float64
-cerotic-acid_100g                                 float64
-montanic-acid_100g                                float64
-melissic-acid_100g                                float64
monounsaturated-fat_100g                           float64
polyunsaturated-fat_100g                           float64
omega-3-fat_100g                                   float64
-alpha-linolenic-acid_100g                         float64
-eicosapentaenoic-acid_100g                        float64
-docosahexaenoic-acid_100g                         float64
omega-6-fat_100g                                   float64
-linoleic-acid_100g                                float64
-arachidonic-acid_100g                             float64
-gamma-linolenic-acid_100g                         float64
-dihomo-gamma-linolenic-acid_100g                  float64
omega-9-fat_100g                                   float64
-oleic-acid_100g                                   float64
-elaidic-acid_100g                                 float64
-gondoic-acid_100g                                 float64
-mead-acid_100g                                    float64
-erucic-acid_100g                                  float64
-nervonic-acid_100g                                float64
trans-fat_100g                                     float64
cholesterol_100g                                   float64
carbohydrates_100g                                 float64
sugars_100g                                        float64
-sucrose_100g                                      float64
-glucose_100g                                      float64
-fructose_100g                                     float64
-lactose_100g                                      float64
-maltose_100g                                      float64
-maltodextrins_100g                                float64
starch_100g                                        float64
polyols_100g                                       float64
fiber_100g                                         float64
proteins_100g                                      float64
casein_100g                                        float64
serum-proteins_100g                                float64
nucleotides_100g                                   float64
salt_100g                                          float64
sodium_100g                                        float64
alcohol_100g                                       float64
vitamin-a_100g                                     float64
beta-carotene_100g                                 float64
vitamin-d_100g                                     float64
vitamin-e_100g                                     float64
vitamin-k_100g                                     float64
vitamin-c_100g                                     float64
vitamin-b1_100g                                    float64
vitamin-b2_100g                                    float64
vitamin-pp_100g                                    float64
vitamin-b6_100g                                    float64
vitamin-b9_100g                                    float64
folates_100g                                       float64
vitamin-b12_100g                                   float64
biotin_100g                                        float64
pantothenic-acid_100g                              float64
silica_100g                                        float64
bicarbonate_100g                                   float64
potassium_100g                                     float64
chloride_100g                                      float64
calcium_100g                                       float64
phosphorus_100g                                    float64
iron_100g                                          float64
magnesium_100g                                     float64
zinc_100g                                          float64
copper_100g                                        float64
manganese_100g                                     float64
fluoride_100g                                      float64
selenium_100g                                      float64
chromium_100g                                      float64
molybdenum_100g                                    float64
iodine_100g                                        float64
caffeine_100g                                      float64
taurine_100g                                       float64
ph_100g                                            float64
fruits-vegetables-nuts_100g                        float64
fruits-vegetables-nuts-estimate_100g               float64
collagen-meat-protein-ratio_100g                   float64
cocoa_100g                                         float64
chlorophyl_100g                                    float64
carbon-footprint_100g                              float64
nutrition-score-fr_100g                            float64
nutrition-score-uk_100g                            float64
glycemic-index_100g                                float64
water-hardness_100g                                float64

1.3 Attribute Key Definitions¶

While it may be overwhelming to explore every column in the dataset individually, we prepare for the upcoming analysis by organizing the key attributes into major categories. Below is a list of the main columns and their definitions from the Open Food Facts dataset (a small grouping helper follows the list):

đŸ”č Identifiers

  • code: Barcode of the product.
  • url: URL of the product page on Open Food Facts website.
  • creator, created_t, created_datetime: Creator and timestamp of product entry creation.
  • last_modified_t, last_modified_datetime: Timestamps for when the entry was last updated.

đŸ”č Product Description

  • product_name, generic_name: Commercial and general name of the product.
  • quantity: Product quantity (e.g., "500g", "2L").
  • packaging, packaging_tags: Text description of packaging materials (e.g., plastique, plastic, glass jar).
  ‱ brands: Commercial brand names as listed on the product.
  ‱ brands_tags: Normalized lowercase brand tokens for grouping.
  • categories, categories_tags, categories_en: Product classification into food categories.

đŸ”č Geographic & Origin Information

  • origins, manufacturing_places: Source or production location.
  • countries, countries_en: Market availability of the product.

đŸ”č Labeling and Compliance

  • labels, labels_en, labels_tags: Certifications, diet types (e.g. organic, halal).
  • emb_codes, emb_codes_tags: Packaging codes (e.g., recycling info).
  • purchase_places, stores: Points of purchase, stores where sold.

đŸ”č Ingredients & Allergens

  • ingredients_text: Raw text of ingredients list.
  • allergens, allergens_en: Allergen content (e.g., nuts, gluten).
  • traces, traces_en: Potential traces of allergens not listed as ingredients.
  • additives, additives_n, additives_tags: Additive info and count.
  • ingredients_from_palm_oil, ingredients_that_may_be_from_palm_oil: Palm oil usage certainty and estimation.

đŸ”č Nutrition Grades & Groups

  • nutrition_grade_fr, nutrition_grade_uk: Health grade from A (best) to E (worst).
  • pnns_groups_1, pnns_groups_2: French public health nutrition food groupings.

đŸ”č Nutritional Composition (per 100g)

  • Energy & Macronutrients:

    • energy_100g
    ‱ fat_100g
    • sugars_100g
    • proteins_100g
    • fiber_100g
    • salt_100g
    • sodium_100g
  • Fatty Acids (Mono/Poly/Saturated):

    • monounsaturated-fat_100g
    ‱ omega-3-fat_100g
    ‱ -palmitic-acid_100g (etc.)
  • Sugars & Derivatives:

    • -sucrose_100g
    • -glucose_100g
    • -lactose_100g
    • starch_100g
    • polyols_100g
  • Vitamins & Minerals:

    • vitamin-a_100g
    • vitamin-c_100g
    • iron_100g
    • zinc_100g
    • calcium_100g
    ‱ iodine_100g (etc.)
  • Other Features:

    • alcohol_100g
    • caffeine_100g
    • ph_100g
    • carbon-footprint_100g
    • glycemic-index_100g
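
To keep these groupings handy for later sections, here is a small, hypothetical helper; the constant names are our own convention, while the strings are actual dataset column headers (it assumes df from Section 1.1 is in scope):

In [ ]:
# Hypothetical column groupings for reuse in later sections. The constant
# names are our own; the strings are actual dataset column headers.
MACRO_COLS = ['energy_100g', 'fat_100g', 'sugars_100g', 'proteins_100g',
              'fiber_100g', 'salt_100g', 'sodium_100g']
MICRO_COLS = ['vitamin-a_100g', 'vitamin-c_100g', 'calcium_100g', 'iron_100g']
SCORE_COLS = ['nutrition-score-fr_100g', 'nutrition-score-uk_100g']

# Example usage: summary statistics for the macronutrient columns
print(df[MACRO_COLS].describe().T[['mean', 'max']])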

1.4 Checking Nulls and Duplicates¶

Before diving into detailed data analysis, we first need to check for duplicate entries and missing values. We will also calculate the percentage of missing values for each column and identify columns with 100% missing values for potential removal.

In [ ]:
# Find number of duplicated rows in the dataset
num_duplicates = df.duplicated().sum()
print(f"The number of duplicated rows: {num_duplicates}")
The number of duplicated rows: 0
In [ ]:
# Count missing values per column
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

print("Missing Values Count per Column:")
display(missing_counts)
Missing Values Count per Column:
0
no_nutriments 356027
chlorophyl_100g 356027
water-hardness_100g 356027
glycemic-index_100g 356027
-butyric-acid_100g 356027
... ...
code 26
url 26
created_datetime 10
creator 3
created_t 3

161 rows × 1 columns


âžĄïž There are no duplicate rows in the dataset, a clean start is always a good start.

In [ ]:
# Calculate missing value percentage
total_rows = df.shape[0]
missing_percent = (missing_counts / total_rows) * 100

print("The Missing Values (% of total rows) for Each Column:")
display(missing_percent.sort_values(ascending=False))
The Missing Values (% of total rows) for Each Column:
0
no_nutriments 100.000000
chlorophyl_100g 100.000000
water-hardness_100g 100.000000
glycemic-index_100g 100.000000
-butyric-acid_100g 100.000000
... ...
code 0.007303
url 0.007303
created_datetime 0.002809
creator 0.000843
created_t 0.000843

161 rows × 1 columns


âžĄïž We observe that many columns have missing entries. Next, we calculate the proportion of missing values to uncover more. 👀

In [ ]:
# Identify columns with 100% missing values
full_missing_cols = missing_percent[missing_percent == 100].index

print("Columns with 100% missing values:")
for col in full_missing_cols:
    print(col)
Columns with 100% missing values:
no_nutriments
chlorophyl_100g
water-hardness_100g
glycemic-index_100g
-butyric-acid_100g
-melissic-acid_100g
-nervonic-acid_100g
-erucic-acid_100g
-mead-acid_100g
-elaidic-acid_100g
-caproic-acid_100g
-lignoceric-acid_100g
-cerotic-acid_100g
nutrition_grade_uk
ingredients_from_palm_oil
ingredients_that_may_be_from_palm_oil

âžĄïž The columns above have 100% missing values and will be dropped during preprocessing.

1.5 Define the Scope of Analysis: Focus on U.S. Data¶

Since our analysis focuses on packaged foods available in the United States, we first need to filter out records from other countries and regions using the countries_en column.

Country labels may vary in format (e.g., "USA", "U.S.", "United States of America"), and some entries list multiple countries (e.g., "Spain,United States"), introducing potential noise. To ensure accurate filtering, we explore two approaches.

Attempted Approach: Fuzzy Matching (Not Used)¶

We initially attempted to use fuzzy matching with the difflib library to capture all records loosely matching "United States":

similar = difflib.get_close_matches("United States", unique_countries, n=10, cutoff=0.6)

However, this approach captured many multi-country entries as false matches, as shown below:

United States; United States,éŠ™æžŻ; Peru,United States; Spain,United States; Italy,United States; Chile,United States; Taiwan,United States; Sweden,United States; Panama,United States; Mexico,United States
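
For completeness, here is a minimal, runnable sketch of that fuzzy scan (a reconstruction under the assumption that unique_countries is built from the raw countries_en values):

In [ ]:
# Reconstruction of the rejected fuzzy-matching attempt.
# unique_countries holds every distinct countries_en value, including
# multi-country strings such as "Spain,United States".
unique_countries = df['countries_en'].dropna().unique().tolist()
similar = difflib.get_close_matches("United States", unique_countries, n=10, cutoff=0.6)
print(similar)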

Second Approach: Exact Match with U.S. Labels¶

In [ ]:
# Define accepted U.S. labels
usa_labels = ['United States', 'USA', 'U.S.', 'US', 'United States of America']

# Filter rows where countries_en exactly matches any accepted label
df_usa = df[df['countries_en'].isin(usa_labels)].copy()

We then compared the size of the dataset before and after filtering:

In [ ]:
# Print result
print("Before filtering, rows and columns:", df.shape)
print(f"After filtering, retained {df_usa.shape[0]} rows out of {df.shape[0]} "
      f"({df_usa.shape[0]/df.shape[0]:.1%}) after filtering U.S. labels.")
Before filtering, rows and columns: (356027, 163)
After filtering, retained 173159 rows out of 356027 (48.6%) after filtering U.S. labels.

âžĄïž After filtering, we retained 48.6% of the original dataset, focusing purely on products sold in the United States.

Verifying the Filtered Data:

To confirm that only U.S. data remains, we sampled a few rows:

In [ ]:
# Sample to verify country fields
df_usa[['countries', 'countries_tags', 'countries_en']].sample(10)
Out[ ]:
countries countries_tags countries_en
69486 US en:united-states United States
146344 US en:united-states United States
107607 US en:united-states United States
131265 US en:united-states United States
36611 US en:united-states United States
125188 US en:united-states United States
298030 United States en:united-states United States
59010 US en:united-states United States
59328 US en:united-states United States
17235 US en:united-states United States

âžĄïž As expected, all entries are correctly labeled as belonging to the United States.

1.6 Cleaning Data: Handling Missing and Duplicate Values¶

In this dataset, many columns contain a high proportion of missing values. We will drop every column whose missing rate exceeds the 80% threshold (THRESHOLD = 0.8).

The dataset to be analyzed, derived from df_usa, will be stored in df_final and should meet the following criteria (a quick sanity check follows this list):

  ‱ At least 50,000 rows after cleaning and dropping null values.
  ‱ Includes a rich set of features that are intuitively relevant for prediction.
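
As a hedged sanity check of the first criterion, the assertion below can be run once df_final has been built in Step 3:

In [ ]:
# Sanity check (run after Step 3 below): the cleaned dataset should retain
# at least 50,000 rows.
assert df_final.shape[0] >= 50_000, f"only {df_final.shape[0]} rows retained"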

Step 1: Visualize Missing Rates Across Columns¶

In [ ]:
# Top 50 columns by missing rate (sorted descending so the most-missing
# columns appear first, matching the chart title)
missing_counts = df_usa.isnull().sum()
missing_percent = missing_counts / len(df_usa)
missing_percent_sorted = missing_percent.sort_values(ascending=False)
top_n = 50
missing_percent_top = missing_percent_sorted.head(top_n)

# Plot bar chart with the 80% threshold marked
plt.figure(figsize=(20, 10))
missing_percent_top.plot(kind='bar')
plt.xlabel('Columns')
plt.ylabel('Missing Rate (fraction of rows)')
plt.title(f'Top {top_n} Columns by Missing Rate')
plt.xticks(rotation=90, ha='right')
plt.axhline(0.8, color='red', linestyle='--', linewidth=2)
plt.text(-0.5, 0.8, '80% Threshold', color='red', va='bottom', ha='left')

plt.tight_layout()
plt.show()
[Figure: bar chart of the top 50 columns by missing rate, with the 80% threshold line in red]

âžĄïž Observation: Several columns exceed the 80% missing rate threshold. These columns will be candidates for removal.

Step 2: Drop Columns with Excessive Missing Values¶

We now remove all columns with more than 80% missing entries.

In [ ]:
# Set missing value threshold
THRESHOLD = 0.8

# Specify columns to drop
cols_to_drop = missing_percent[missing_percent > THRESHOLD].index
df_usa_trimmed = df_usa.drop(cols_to_drop, axis=1)

Comparison of dataset dimensions before and after dropping columns:

In [ ]:
print("Original dataset: rows and columns", df.shape)
print("After dropping columns: rows and columns", df_usa_trimmed.shape)
Original dataset: rows and columns (356027, 163)
After dropping columns: rows and columns (173159, 42)

Step 3: Drop Remaining Nulls and Duplicates¶

In [ ]:
# Drop all nulls
df_final = df_usa_trimmed.dropna().copy()
print("After dropping remaining nulls: rows and columns", df_final.shape)

# Drop all duplicates
df_final = df_final.drop_duplicates().copy()
print("After dropping remaining duplicates: rows and columns", df_final.shape)
After dropping remaining nulls: rows and columns (82380, 42)
After dropping remaining duplicates: rows and columns (82380, 42)
In [ ]:
print("Final Check for null and duplicates:")

# Check for nulls
total_nulls = df_final.isnull().sum().sum()
print(f"Total missing values: {total_nulls}")

# Check for duplicates
duplicate_rows = df_final.duplicated().sum()
print(f"Total duplicated rows: {duplicate_rows}")
Final Check for null and duplicates:
Total missing values: 0
Total duplicated rows: 0

Let's take a quick look at a few random samples from the final dataset:

In [ ]:
# display dataset
df_final.sample(5)
Out[ ]:
code url creator created_t created_datetime last_modified_t last_modified_datetime product_name brands brands_tags ... fiber_100g proteins_100g salt_100g sodium_100g vitamin-a_100g vitamin-c_100g calcium_100g iron_100g nutrition-score-fr_100g nutrition-score-uk_100g
118826 0607880033342 http://world-en.openfoodfacts.org/product/0607... usda-ndb-import 1489137479 2017-03-10T09:17:59Z 1489137480 2017-03-10T09:18:00Z Sweet & Salty Caramel Trail Mix Southern Home southern-home ... 6.7 13.33 1.01600 0.400 0.000000 0.0000 0.133 0.00240 20.0 20.0
142621 0749826575520 http://world-en.openfoodfacts.org/product/0749... usda-ndb-import 1489096243 2017-03-09T21:50:43Z 1489096243 2017-03-09T21:50:43Z High Protein Fruit & Nut Bar Pure Protein pure-protein ... 9.4 33.96 0.76708 0.302 0.000057 0.0045 0.151 0.00136 9.0 9.0
159299 0850335006013 http://world-en.openfoodfacts.org/product/0850... usda-ndb-import 1489093649 2017-03-09T21:07:29Z 1489093649 2017-03-09T21:07:29Z Verry Berry Fruit Pop Squeaky Pops squeaky-pops ... 0.0 0.00 0.01524 0.006 0.000000 0.0029 0.000 0.00000 5.0 5.0
94808 0077661147306 http://world-en.openfoodfacts.org/product/0077... usda-ndb-import 1489138130 2017-03-10T09:28:50Z 1489138130 2017-03-10T09:28:50Z Opa, Greek Yogurt Roasted Garlic Dressing Litehouse, Litehouse Inc. litehouse,litehouse-inc ... 0.0 3.33 1.69418 0.667 0.000000 0.0000 0.133 0.00000 7.0 7.0
141907 0747599322013 http://world-en.openfoodfacts.org/product/0747... usda-ndb-import 1489075490 2017-03-09T16:04:50Z 1489075490 2017-03-09T16:04:50Z Squares, Chocolate Assortment Ghirardelli Chocolate, Ghirardelli Chocolate ... ghirardelli-chocolate,ghirardelli-chocolate-co... ... 5.0 5.00 0.15748 0.062 0.000000 0.0030 0.100 0.00360 20.0 20.0

5 rows × 42 columns

We also list all retained feature columns:

In [ ]:
# check all final retained columns' name:
print("Final retained columns:\n")
for col in df_final.columns:
    print(col)
Final retained columns:

code
url
creator
created_t
created_datetime
last_modified_t
last_modified_datetime
product_name
brands
brands_tags
countries
countries_tags
countries_en
ingredients_text
serving_size
additives_n
additives
additives_tags
additives_en
ingredients_from_palm_oil_n
ingredients_that_may_be_from_palm_oil_n
nutrition_grade_fr
states
states_tags
states_en
energy_100g
fat_100g
saturated-fat_100g
trans-fat_100g
cholesterol_100g
carbohydrates_100g
sugars_100g
fiber_100g
proteins_100g
salt_100g
sodium_100g
vitamin-a_100g
vitamin-c_100g
calcium_100g
iron_100g
nutrition-score-fr_100g
nutrition-score-uk_100g

📂 You can download this CSV file! đŸ‘‡đŸ»

In [ ]:
# save df_final
df_final.to_csv('df_final.csv', index=False)

2ïžâƒŁ Exploratory Data Analysis (EDA)¶

We now proceed to explore the nutritional properties and labeling characteristics of U.S. packaged foods. Let's keep the momentum going!

2.1 Cleaning Numerical Features¶

Since nutrient quantities are reported per 100g, errors such as negative values or abnormally large values (e.g., sugar > 100g) can occur. Before diving into feature distribution analysis, we want to ensure that the numerical nutritional attributes fall within reasonable, physically possible ranges.

Outlier Identification and Handling

We focus on cleaning the following key nutritional features:

  ‱ Macronutrients: sugars_100g, fat_100g, proteins_100g, carbohydrates_100g, etc.
  ‱ Health-impacting features: salt_100g, sodium_100g, cholesterol_100g, fiber_100g, etc.
  ‱ Micronutrients: vitamin-a_100g, vitamin-c_100g, calcium_100g, iron_100g
In [ ]:
# Define allowed value range

valid_range_mask = df_final[[
    'saturated-fat_100g',
    'trans-fat_100g',
    'fat_100g',
    'cholesterol_100g',
    'carbohydrates_100g',
    'sugars_100g',
    'fiber_100g',
    'proteins_100g',
    'salt_100g',
    'sodium_100g',
    'vitamin-a_100g',
    'vitamin-c_100g',
    'calcium_100g',
    'iron_100g',
    ]].apply(lambda x: x.between(0, 100)).all(axis=1)

# Filter the dataset
df_final_cleaned = df_final[valid_range_mask].copy()

Compare the dataset size before and after removing outliers:

In [ ]:
# Compare shape before and after filtering
print(f"Before outlier removal: {df_final.shape}")
print(f"After outlier removal:  {df_final_cleaned.shape}")
print(f"Removed {df_final.shape[0] - df_final_cleaned.shape[0]} rows due to out-of-bound values.")
Before outlier removal: (82380, 42)
After outlier removal:  (82357, 42)
Removed 23 rows due to out-of-bound values.

We list the products with invalid values below âžĄïž these 23 rows (out of over 82,000) will be removed:

In [ ]:
df_final[~valid_range_mask][[
    'trans-fat_100g',
    'sugars_100g',
    'salt_100g',
    'sodium_100g',
    'vitamin-c_100g',
    'calcium_100g',
    'iron_100g',
    ]]
Out[ ]:
trans-fat_100g sugars_100g salt_100g sodium_100g vitamin-c_100g calcium_100g iron_100g
1483 0.00 5.71 870.85678 342.857 0.0000 0.000 0.00617
8043 0.00 2.31 781.53768 307.692 0.0000 0.015 0.00138
11206 0.00 22.58 327.74128 129.032 0.0000 0.032 0.00232
12036 369.00 0.81 0.24892 0.098 0.0000 0.054 0.00073
41869 0.00 17.86 2.72034 1.071 -0.0021 0.143 0.00129
50827 0.00 20.59 130.73634 51.471 0.0000 0.118 0.00106
69041 0.00 9.46 1098.37728 432.432 0.0000 0.054 0.00243
69050 0.00 11.27 858.59112 338.028 0.0000 0.028 0.00254
95375 0.00 33.33 1318.38192 519.048 0.0000 0.262 0.01414
107178 0.00 8.48 1139.15190 448.485 0.0000 0.012 0.00087
108870 -0.70 20.42 0.22352 0.088 0.0000 0.070 0.00190
110111 0.00 50.00 101.60000 40.000 0.0000 0.000 0.00000
113274 0.00 50.00 0.30734 0.121 0.0043 285.714 0.00000
119086 0.00 -3.57 0.95250 0.375 0.0086 0.071 0.00129
122368 0.00 65.85 123.90120 48.780 0.0000 0.000 0.00000
133607 0.00 0.00 101.23678 39.857 0.0000 0.000 0.00000
133611 0.00 0.00 104.86644 41.286 0.0000 0.000 0.00000
139181 0.00 0.00 187.96000 74.000 2.1000 0.000 0.00000
140017 173.26 1.60 0.80772 0.318 0.0024 0.027 0.00072
148432 0.00 7.65 100.85324 39.706 0.0000 0.000 0.00000
155175 -3.57 10.71 2.44856 0.964 0.0000 0.071 0.00386
162397 0.00 0.00 1669.14322 657.143 0.0000 0.057 0.00309
351458 0.00 14.29 0.00000 0.000 0.0107 0.014 -0.00026
In [ ]:
df_final = df_final_cleaned.copy()

2.2 Text Preprocessing for Ingredient and Additive Fields¶

In this section, we clean the ingredient-related text fields to ensure consistency and accuracy in subsequent analyses.

The raw dataset may contain duplicate product entries, missing values, and inconsistencies in textual formatting. We address these issues through four main steps:

  ‱ Step 1: Check and remove duplicate products based on the combination of product_name, brands, and ingredients_text.

  ‱ Step 2: Standardize ingredients_list and additives_list: we split the ingredients_text and additives_en columns into lists to make additive analysis easier in later sections.

  • Step 3: Further clean ingredients_list

Step 1: Remove Duplicate Products¶

Duplicate entries can arise due to slight variations in product names, brands, or ingredient descriptions.

If duplicate entries exist, we manually review them. We identify and remove duplicate products based on the combination of product_name, brands, and ingredients_text to avoid bias in the analysis.

In [ ]:
# Step 1: Check and Remove Duplicates

# Normalize text: lowercase and strip spaces
df_final['product_name'] = df_final['product_name'].str.lower().str.strip()
df_final['brands'] = df_final['brands'].str.lower().str.strip()
df_final['ingredients_text'] = df_final['ingredients_text'].str.lower().str.strip()

# Find duplicates
duplicates = df_final[df_final.duplicated(subset=[ 'product_name', 'brands', 'ingredients_text'], keep=False)]
print(f"Number of duplicate entries: {duplicates.shape[0]}")

# Sort duplicates for easier review
duplicates = duplicates.sort_values(by=['product_name', 'brands', 'ingredients_text'])

# Display the potential duplicates
duplicates[['code', 'product_name', 'brands', 'ingredients_text', 'nutrition_grade_fr', 'additives_en', 'nutrition-score-fr_100g']].head(10)
Number of duplicate entries: 2597
Out[ ]:
code product_name brands ingredients_text nutrition_grade_fr additives_en nutrition-score-fr_100g
49387 0041497097548 1% low fat chocolate milk, chocolate weis quality, weis markets inc. low fat milk, high fructose corn syrup, sugar,... b E407 - Carrageenan 0.0
49388 0041497097555 1% low fat chocolate milk, chocolate weis quality, weis markets inc. low fat milk, high fructose corn syrup, sugar,... b E407 - Carrageenan 0.0
57213 0044100106804 1% lowfat milk hood, hp hood llc lowfat milk, ascorbic acid (vitamin c), vitami... a E300 - Ascorbic acid -1.0
57264 0044100169267 1% lowfat milk hood, hp hood llc lowfat milk, ascorbic acid (vitamin c), vitami... a E300 - Ascorbic acid -1.0
54127 0041900074302 1% lowfat milk, chocolate trumoo, dean foods company lowfat milk, sugar, contains less than 1% of: ... a E407 - Carrageenan -1.0
54182 0041900075712 1% lowfat milk, chocolate trumoo, dean foods company lowfat milk, sugar, contains less than 1% of: ... a E407 - Carrageenan -1.0
44558 0041318020540 100% juice schnuck markets inc. tomato concentrate (water, tomato paste), salt... b E300 - Ascorbic acid 2.0
44759 0041318131444 100% juice schnuck markets inc. tomato concentrate (water, tomato paste), salt... b E300 - Ascorbic acid 2.0
7575 0011213015347 100% juice, tomato spartan tomato concentrate (water, tomato paste), salt... b E300 - Ascorbic acid,E330 - Citric acid 1.0
8025 0011213049427 100% juice, tomato spartan tomato concentrate (water, tomato paste), salt... b E300 - Ascorbic acid,E330 - Citric acid 1.0

Next, we remove the duplicate entries, keeping only the first occurrence:

In [ ]:
# Check original shape before removing duplicates
print(f"Before removing duplicate products: {df_final.shape}")

# Remove duplicates, keep the first occurrence
df_final = df_final.drop_duplicates(subset=['product_name', 'brands', 'ingredients_text'])
print(f"After removing duplicate products: {df_final.shape}")
Before removing duplicate products: (82357, 42)
After removing duplicate products: (80946, 42)

âžĄïž Approximately 1.7% of the dataset entries were identified as duplicates and removed. This ensures that each product in our analysis corresponds to a unique combination of ingredients and branding, reducing potential bias in brand-level or additive-level analyses.

Step 2: Standardize ingredients_list and additives_list¶

The raw ingredients_text and additives_en fields in the dataset are stored as comma-separated strings with inconsistent casing and formatting. In this step, we standardize the ingredients_text and additives_en columns into structured list formats.

  • For ingredients_list, we split the text by commas and standardize to lowercase.
  ‱ For additives_list, we split the additives_en string by commas. This formatting will make it easier to analyze specific ingredients and additives in later sections.
In [ ]:
# Step 2: Standardize ingredients_list and additives_list

# Split ingredients_text into a list by commas
df_final['ingredients_list'] = df_final['ingredients_text'].apply(lambda x: [i.strip().lower() for i in x.split(',')] if pd.notnull(x) else [])

# Split additives_en into a list
df_final['additives_list'] = df_final['additives_en'].apply(lambda x: x.split(',') if x != 'None' else [])

# Check an example
df_final[['product_name', 'ingredients_list','additives_en', 'additives_list']].head(10)
Out[ ]:
product_name ingredients_list additives_en additives_list
82 peanuts, mixed nuts [peanuts, honey, coating (sucrose, wheat starc... E415 - Xanthan gum [E415 - Xanthan gum]
149 turkish apricots [apricots, sulfur dioxide.] E220 - Sulphur dioxide [E220 - Sulphur dioxide]
152 chili mango [dried mango, paprika, sugar, salt, citric aci... E330 - Citric acid [E330 - Citric acid]
153 milk chocolate pretzels [milk chocolate (sugar, cocoa butter, chocolat... E101 - Riboflavin,E101i - Riboflavin,E322 - Le... [E101 - Riboflavin, E101i - Riboflavin, E322 -...
200 butter croissants [wheat flour, butter (cream), water, yeast, su... E300 - Ascorbic acid [E300 - Ascorbic acid]
201 wild blueberry muffins [enriched wheat flour (wheat flour, malted bar... E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... [E101 - Riboflavin, E101i - Riboflavin, E375 -...
202 bolillos [enriched wheat flour (wheat flour niacin, red... E101 - Riboflavin,E101i - Riboflavin,E200 - So... [E101 - Riboflavin, E101i - Riboflavin, E200 -...
203 biscuit [enriched wheat flour (niacin, reduced iron, t... E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... [E101 - Riboflavin, E101i - Riboflavin, E375 -...
204 biscuit [enriched wheat flour (niacin, reduced iron, t... E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... [E101 - Riboflavin, E101i - Riboflavin, E375 -...
205 oatmeal raisin cookie [enriched flour (bleached wheat flour, niacin,... E101 - Riboflavin,E101i - Riboflavin,E160a - A... [E101 - Riboflavin, E101i - Riboflavin, E160a ...

✅ As shown above, both ingredients_list and additives_list are now structured as clean lists.
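
With additives_list in place, a quick peek at the most frequent additives becomes a two-liner (a sketch reusing Counter and chain, imported in Section 1.1):

In [ ]:
# Count the most frequent additives across all products.
additive_counts = Counter(chain.from_iterable(df_final['additives_list']))
print(additive_counts.most_common(5))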

Step 3: Further Clean ingredients_text¶

While the initial standardization split the ingredients_text field into basic lists, further cleaning is necessary to ensure analytical consistency.

Many entries still contain:

  • Non-English words (e.g., French and German terms).
  • Marketing or domain-specific noise (e.g., "natural", "product", "ingredient").
  • Numeric expressions and inconsistent punctuation.

To address these issues, we perform advanced preprocessing through several steps:

  1. Check for Non-English or Noisy Characters
  2. Expand Stopwords List
  3. Build Translation Dictionary
  4. Define Cleaning Function
  5. Apply Cleaning and Explore the Most Common Ingredient

(1) Check for Non-English or Noisy Characters¶

First, we scan for unusual characters or non-ASCII text in the ingredients.

In [ ]:
df_final['ingredients_text'].dropna().sample(5)
Out[ ]:
ingredients_text
22667 premium fresh pork, water, premium fresh beef,...
183109 *solution: water, potassium lactate, sodium ph...
315392 water, tomatillo, jalapeno peppers, habanero p...
136002 soy protein isolate, organic cane syrup, organ...
108289 enriched bleached flour (wheat flour, niacin, ...

In [ ]:
# Define a function to count "weird" characters in each text
def count_weird_chars(text):
    if pd.isna(text):
        return 0
    return len(re.findall(r'[^a-zA-Z0-9,\s\(\)\.\-]', str(text)))

# Apply to ingredients_text
df_final['weird_char_count'] = df_final['ingredients_text'].apply(count_weird_chars)

# Sort by weirdness descending
df_final_sorted_weird = df_final.sort_values(by='weird_char_count', ascending=False)

# Display the top 10 entries with the most unusual characters
# (explicit display() is needed because the expression sits inside a with-block)
with pd.option_context('display.max_colwidth', None):
    display(df_final_sorted_weird[['product_name', 'ingredients_text', 'weird_char_count']].head(10))

✍ Upon inspection, the presence of unusual characters in the ingredients_text field was minimal and did not pose significant challenges for downstream text processing. Thus, no additional filtering based on weird_char_count was necessary.

(2) Expand Stopwords List¶

In text data, certain words appear very frequently but contribute little meaningful information. These words are known as stopwords, for example, "the", "and", "with".

By filtering out these common but low-information words, we focus our analysis on meaningful terms such as "sugar", "protein", or "organic", which are more informative about the product's nutritional properties. We define a custom stopwords list:

  • Standard English stopwords.

  • Common French and German food-related words.

  • Marketing noise terms (e.g., "natural", "contains").

In [ ]:
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[ ]:
True
In [ ]:
# Stopwords list: English + French + German + domain-specific
custom_stopwords = set(stopwords.words('english')) | set([
    # French
    'de', 'Ă ', 'le', 'la', 'les', 'du', 'et', 'des', 'pour', 'avec', 'sur',
    'au', 'ou', 'par', 'en', 'lait', 'eau', 'ingrédients', 'produits',
    'contient', 'valeur', 'nutrition', 'base', 'moyenne',
    # German
    'mit', 'von', 'der', 'das', 'und', 'ein', 'eine', 'dem', 'den', 'fĂŒr', 'ohne',
    'inhaltsstoffe', 'zutaten', 'lebensmittel',
    # Domain noise / marketing
    'organic', 'natural', 'product', 'ingredient', 'ingredients',
    'mg', 'g', 'ar', 'bl', 'fi', '–', '—', 'less', 'contains'
])

(3) Build Translation Dictionary¶

We also build a small dictionary to translate common French and German food words into English equivalents.

In [ ]:
# Translation dictionary: French + German → English
translation_dict = {
    # French
    'sucre': 'sugar',
    'sel': 'salt',
    'huile': 'oil',
    'farine': 'flour',
    'poudre': 'powder',
    'lait': 'milk',
    'arome': 'flavor',
    'chocolat': 'chocolate',
    'acide': 'acid',
    'fromage': 'cheese',
    'cacao': 'cocoa',
    'beurre': 'butter',

    # German
    'zucker': 'sugar',
    'salz': 'salt',
    'mehl': 'flour',
    'milch': 'milk',
    'kakao': 'cocoa',
    'aroma': 'flavor',
    'butter': 'butter'
}

(4) Define Cleaning Function¶

We create a cleaning function that performs:

  • Lowercasing and accent removal

  • Removal of numeric and marketing expressions

  • Basic punctuation normalization

  • Stopword filtering

  • Phrase-based tokenization

In [ ]:
# define cleaning function
def clean_and_tokenize_ingredients(text):
    if pd.isna(text):
        return []

    # 1. Normalize to lowercase and remove accents
    text = unidecode.unidecode(text.lower())

    # 2. Remove numeric expressions (e.g., '2%', '100g', '25mg')
    text = re.sub(r'\b\d+%?\b', ' ', text)
    text = re.sub(r'\b\d+[a-z]+\b', ' ', text)

    # 3. Remove special characters (keep commas and minimal punctuation)
    text = re.sub(r'[^a-z0-9,\.\-\(\)\s]', ' ', text)

    # 4. Normalize space
    text = re.sub(r'\s+', ' ', text).strip()

    # 5. Split on commas — each comma-separated item is treated as a phrase
    raw_phrases = [p.strip() for p in text.split(',')]

    clean_phrases = []
    for phrase in raw_phrases:
        # Strip trailing punctuation
        phrase = phrase.strip('.,() ')

        # Translate known foreign terms (optional)
        phrase = ' '.join([translation_dict.get(w, w) for w in phrase.split()])

        # Skip short/meaningless phrases
        if len(phrase.split()) >= 2 and not any(w in custom_stopwords for w in phrase.split()):
            clean_phrases.append(phrase)

    return clean_phrases

(5) Apply Cleaning and Explore Top Ingredients¶

After cleaning and tokenizing ingredients, we now proceed to explore the most common components across packaged foods.

Firstly, we apply the cleaning function to the dataset:

In [ ]:
df_final['clean_ingredient_tokens'] = df_final['ingredients_text'].apply(clean_and_tokenize_ingredients)

Then flatten and visualize the most common ingredient tokens:

In [ ]:
# Flatten the list of tokens
all_words = [word for tokens in df_final['clean_ingredient_tokens'] for word in tokens]
word_counts = Counter(all_words)

# Display the 10 most common ingredient phrases
pd.Series(word_counts).sort_values(ascending=False).head(10)
Out[ ]:
0
folic acid 19036
citric acid 18796
corn syrup 13024
reduced iron 12425
thiamine mononitrate 10395
soy lecithin 9883
soybean oil 8885
xanthan gum 7948
cocoa butter 7064
sea salt 6314

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare DataFrame
top_ingredients = pd.Series(word_counts).sort_values(ascending=False).head(50)

# Plot
plt.figure(figsize=(12, 14))
ax = sns.barplot(x=top_ingredients.values, y=top_ingredients.index, palette='viridis')

plt.xlabel('Count')
plt.ylabel('Ingredient')
plt.title('Top 50 Most Common Ingredients in U.S. Packaged Foods')
plt.grid(axis='x', linestyle='--', alpha=0.7)

# Add count labels to the right of bars
for i, (value, name) in enumerate(zip(top_ingredients.values, top_ingredients.index)):
    ax.text(value + 100, i, f'{value:,}', va='center', ha='left', fontsize=9)

plt.tight_layout()
plt.show()
[Figure: horizontal bar chart of the top 50 most common ingredients in U.S. packaged foods]

In the word cloud below, words with larger sizes represent more frequently occurring ingredients.

In [ ]:
from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=600, background_color='white', colormap='tab20')
wordcloud.generate_from_frequencies(word_counts)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Ingredients Word Cloud')
plt.show()
[Figure: word cloud of the most common ingredients]

💬 Observations from the 50 Most Common Ingredients in U.S. Packaged Foods¶

  ‱ High use of processed sugars and fats suggests a reliance on calorie-dense ingredients, which may impact Nutri-Score ratings and perceived product healthiness.
  ‱ The frequent appearance of additives and emulsifiers indicates a strong dependence on industrial food technologies for texture stabilization and product preservation.
  1. 💊 Added vitamins and minerals:
    Foods with added vitamins and minerals feature prominently, with ingredients like folic acid (vitamin B9), reduced iron, and thiamine mononitrate (vitamin B1) ranking high. The prevalence of vitamin and mineral fortification reflects public health policies encouraging nutrient enrichment in the U.S. packaged food industry.

  2. đŸ« Added Sugars:
    High use of processed sugars and sweeteners, such as corn syrup, high fructose corn syrup, and brown sugar, are heavily represented, underscoring the heavy reliance on processed sugars in packaged foods.

  3. đŸ§Ș Additives and Emulsifiers:
    Additives and stabilizers including xanthan gum, guar gum, modified starches, and soy lecithin are widely used to maintain food texture and extend shelf life.

  4. 🧈 Oil-based Ingredients:
    Oil-based ingredients like soy lecithin, soybean oil, cocoa butter, and canola oil are also common across products, consistent with the formulation of processed baked goods and snacks. These ingredients contribute to the caloric density and flavor profile of packaged foods.

âžĄïž Overall, fortified nutrients, sugars, stabilizers, and oils dominate the ingredient composition of U.S. packaged foods, painting a comprehensive picture of both nutritional enhancement efforts and industrial formulation priorities.

2.3 Initial Data Exploration¶

Having processed the textual ingredient and additive fields, we now shift our focus to understanding how Nutri-Scores are distributed across the dataset. This exploration will provide an initial sense of the overall nutritional quality represented in the data.

Nutrition Grade Distribution¶

  • Observations from the following distribution indicate a clear imbalance across nutrition grades:

    • Most packaged food products are classified under lower nutrition grades (D and E).
    • Healthier food products (Grade A) are significantly underrepresented.
  • This imbalance suggests that the U.S. packaged food landscape is skewed towards less healthy options according to the Nutri-Score classification system.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

# Total count for percentage
total = df_final['nutrition_grade_fr'].value_counts().sum()

# Plotting the distribution of Nutri-Scores
plt.figure(figsize=(6, 6))
sns.countplot(
    data=df_final,
    x='nutrition_grade_fr',
    order=sorted(df_final['nutrition_grade_fr'].dropna().unique()),
    palette='Set2',
    legend=False
)

plt.title('Distribution of Nutrition Grades (France Nutri-Score)', fontsize=16)
plt.xlabel('Nutrition Grade (A=Healthiest, E=Least Healthy)', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)

# Add count and percentage for each bar
for p in plt.gca().patches:
    count = p.get_height()
    percent = count / total * 100
    label = f'{count:.0f}\n({percent:.1f}%)'
    plt.gca().annotate(
        label,
        (p.get_x() + p.get_width() / 2., count),
        ha='center', va='center',
        fontsize=11, color='black',
    )

plt.tight_layout()
plt.show()
[Figure: bar chart of the distribution of nutrition grades (France Nutri-Score), with counts and percentages]

Exploring Key Nutritional Metrics Across Nutri-Score¶

To further understand the drivers behind Nutri-Score distributions, we analyze the spread of key nutritional attributes — fat, sugar, salt, and additive counts — across nutrition grades.

In [ ]:
# Define groups for comparison
low_grades = df_final[df_final['nutrition_grade_fr'].isin(['d', 'e'])]
high_grades = df_final[df_final['nutrition_grade_fr'].isin(['a', 'b'])]

# Compare key nutritional metrics
metrics = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, metric in enumerate(metrics):
    sns.boxplot(
        data=df_final,
        x='nutrition_grade_fr',
        y=metric,
        order=['a', 'b', 'c', 'd', 'e'],
        palette='Set2',
        ax=axes[idx],
        legend=False,
        showfliers=False
    )

    axes[idx].set_title(f'Distribution of {metric} by Nutrition Grade')
    axes[idx].set_xlabel('Nutrition Grade')
    axes[idx].set_ylabel(metric)

plt.tight_layout()
plt.show()
[Figure: boxplots of fat_100g, sugars_100g, salt_100g, and additives_n by nutrition grade]

âžĄïž From the above plot of key nutritional metrics by Nutri-Score:

  ‱ Fat and Sugar: Unsurprisingly, products with higher fat and sugar content tend to receive lower Nutri-Scores (Grades D and E). Grade E products show the highest median fat (~25g/100g) and sugar levels, while Grade A foods stay impressively lean, with median fat and sugar around just ~2g/100g.

  ‱ Salt: Salt levels paint a slightly fuzzier picture. Salt content shows a wider spread across grades C, D, and E, but the median salt content is consistently higher in the lower grades (a quick median check follows this list).

  ‱ Additives: Interestingly, the number of additives (additives_n) does not strongly influence Nutri-Score. This suggests that Nutri-Score judgments are driven much more by the big players, such as fats and sugars, than by the presence of food additives.
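
As a quick check backing the median claims above, here is a small sketch (assuming df_final as prepared earlier):

In [ ]:
# Median fat, sugar, and salt per 100g by Nutri-Score grade, to back the
# boxplot reading above.
print(df_final.groupby('nutrition_grade_fr')[['fat_100g', 'sugars_100g', 'salt_100g']].median())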

2.4 Deeper Analysis: Additives, Brands, and Nutrition Grades¶

🎯 Research Question 1: Does Nutri-Score reflect additives presence?¶

In this section, we conduct an exploratory analysis focused on the relationships among additives, brands, and nutritional quality, as reflected by Nutri-Score grades (A–E).

(1) Distribution of Number of Additives

We first examine the overall distribution of additive counts across products:

In [ ]:
# Plot distribution of additives number across all products
plt.figure(figsize=(8, 6))
sns.histplot(df_final['additives_n'], bins=30, kde=False)
plt.title('Distribution of Number of Additives in Products')
plt.xlabel('Number of Additives')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
[Figure: histogram of the number of additives per product]

âžĄïž Findings:

  ‱ Most products contain between 0 and 5 additives.
  ‱ The distribution is right-skewed: a small subset has more than 10 additives.

(2) Relationship Between Additive Counts and Nutrition Score

To assess whether additive presence impacts nutritional quality ratings, we plot the number of additives against Nutrition Scores:

You might be wondering: What exactly is a Nutrition Score?

Similar to Nutri-Score, Nutrition Score is also a numerical representation of a product’s nutritional quality.
Lower scores indicate healthier products, with scores mapped to Nutri-Score grades (A = healthiest, E = least healthy).


Figure: Mapping of Nutrition Score points to Nutri-Score grades for solid foods and beverages.
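
For intuition, here is a hedged sketch of the commonly cited 2017 point-to-grade thresholds for solid foods (beverages use a different scale; these cutoffs are our illustration, not derived from this dataset):

In [ ]:
# Illustrative point-to-grade mapping for solid foods, using the assumed
# 2017 thresholds (A <= -1, B <= 2, C <= 10, D <= 18, else E).
def score_to_grade(points):
    if points <= -1:
        return 'a'
    if points <= 2:
        return 'b'
    if points <= 10:
        return 'c'
    if points <= 18:
        return 'd'
    return 'e'

print([score_to_grade(p) for p in (-5, 1, 7, 15, 25)])  # ['a', 'b', 'c', 'd', 'e']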

In [ ]:
# Scatter plot: Number of additives vs. Nutrition Score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='additives_n', y='nutrition-score-fr_100g', data=df_final, alpha=0.5)
plt.title('Number of Additives vs. Nutrition Score')
plt.xlabel('Number of Additives')
plt.ylabel('Nutrition Score (Lower is Better)')
plt.grid(True)
plt.show()
[Figure: scatter plot of number of additives vs. Nutrition Score]

âžĄïž Finding:

The scatter plot reveals no strong pattern:

Even products with many additives (10–20) can have a good Nutrition Score, while some products with very few additives still score poorly. The Nutrition Score does not strongly penalize additive counts.

Next, we create a boxplot:

The boxplot across grades confirms that the median number of additives remains similar across Nutri-Score grades (A–E).

In [ ]:
# Boxplot: Additives count by Nutri-Score Grade (A to E)
plt.figure(figsize=(8, 6))
sns.boxplot(x='nutrition_grade_fr', y='additives_n', data=df_final, order=['a', 'b', 'c', 'd', 'e'])
plt.title('Distribution of Additives Count Across Nutri-Score Grades')
plt.xlabel('Nutri-Score Grade')
plt.ylabel('Number of Additives')
plt.grid(True)
plt.show()
[Figure: boxplot of additives count across Nutri-Score grades]

Histogram for A/B Products:

In [ ]:
# Histogram: Additives count for products with Nutri-Score A or B
plt.figure(figsize=(8, 6))
df_good = df_final[df_final['nutrition_grade_fr'].isin(['a', 'b'])]

sns.histplot(df_good['additives_n'], bins=20, kde=False)
plt.title('Additives Count for Products with Nutri-Score A or B')
plt.xlabel('Number of Additives')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
[Figure: histogram of additives count for products with Nutri-Score A or B]

âžĄïž Findings:

Observations from Additives Count in High-Scoring (A/B) Products

  ‱ Most products classified as "healthy" (Nutri-Score A or B) contain relatively few additives, typically 0 to 2.
  ‱ However, a non-negligible subset of these products still contains more than 5 additives, and a few even exceed 10.
  ‱ The distribution is heavily right-skewed (quantified in the sketch below), suggesting that while minimal additive usage is common among healthy-rated products, a high additive count does not necessarily prevent a product from being rated as healthy.
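
A short sketch quantifying these observations (df_good is the A/B subset from the cell above):

In [ ]:
# Quantify the histogram reading: shares of A/B products above each
# additive-count cutoff, plus the skewness of the distribution.
print(f"> 5 additives:  {(df_good['additives_n'] > 5).mean():.1%}")
print(f"> 10 additives: {(df_good['additives_n'] > 10).mean():.1%}")
print(f"skewness: {df_good['additives_n'].skew():.2f}")  # positive => right-skewed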

(3) Correlation Analysis

We quantify the relationship between additive counts and Nutrition Score:

  • Pearson correlation: 0.033
  • Spearman correlation: 0.034

Both correlation coefficients are very close to zero, indicating an extremely weak positive relationship.

âžĄïž Findings:

Additive counts do not meaningfully correlate with Nutrition Score.

In [ ]:
# Pearson Correlation (linear)
pearson_corr = df_final['additives_n'].corr(df_final['nutrition-score-fr_100g'], method='pearson')

# Spearman Correlation (monotonic)
spearman_corr = df_final['additives_n'].corr(df_final['nutrition-score-fr_100g'], method='spearman')

print(f"Pearson correlation between additives_n and nutrition-score-fr_100g: {pearson_corr:.3f}")
print(f"Spearman correlation between additives_n and nutrition-score-fr_100g: {spearman_corr:.3f}")
Pearson correlation between additives_n and nutrition-score-fr_100g: 0.033
Spearman correlation between additives_n and nutrition-score-fr_100g: 0.034

Summary of Findings:

  1. Distribution of Additives:

    ‱ Most products contain between 0 and 5 additives.
    ‱ A small number of products have more than 10 additives, indicating a right-skewed distribution.
  2. Relationship Between Additives and Nutri-Score:

    • Scatter Plot: No strong pattern observed between the number of additives and the Nutrition Score.
    • Boxplot: The median number of additives is similar across all Nutri-Score grades (A to E).
    • Histogram (A/B products): Many products rated as "healthy" (Nutri-Score A or B) still contain multiple additives, with some having more than 5 additives.
  3. Correlation Analysis:

    • Pearson correlation: 0.033
    • Spearman correlation: 0.034
    • Both correlations are very close to zero, indicating a very weak positive relationship between the number of additives and the Nutrition Score.
  4. Potential Misclassifications:

    • Some products with a high number of additives still receive a good Nutri-Score (A or B).
    • This suggests that Nutri-Score may overlook additive information when classifying food healthiness.

đŸ›Žïž Research Question 1: Conclusion¶

  • The current Nutri-Score system fails to effectively capture the potential health implications of high additive usage.

  • Potential inconsistencies:

    • Some products with a high number of additives still receive a good Nutri-Score (A/B).
    • This suggests that Nutri-Score may overlook additive information when assessing overall product healthiness.
  • Interestingly, as we observed earlier, not all additives are inherently harmful. Many additives serve as sources of beneficial nutrients, micronutrients, or antioxidants. Thus, the presence of additives should be evaluated critically, rather than being universally perceived as negative.

🎯 Research Question 2: Which brands use more additives and have lower Nutri-Scores?¶

In this section, we explore whether certain brands consistently use more additives or produce products with generally lower Nutri-Scores.

In [ ]:
# Keep only the first brand if multiple brands are listed
df_final['brand_main'] = df_final['brands'].apply(lambda x: x.split(',')[0].strip().lower() if pd.notnull(x) else x)
In [ ]:
# Step 1: Group by Brand
# Group by main brand and calculate mean additives and mean nutrition score
brand_stats = df_final.groupby('brand_main').agg({
    'additives_n': 'mean',
    'nutrition-score-fr_100g': 'mean',
    'product_name': 'count'  # count number of products per brand
}).reset_index()

# Rename columns for clarity
brand_stats.rename(columns={'product_name': 'product_count'}, inplace=True)

# Check
brand_stats.sample(5)
Out[ ]:
brand_main additives_n nutrition-score-fr_100g product_count
1537 bucky badger 2.5 17.500000 2
4627 grant park custom meats 3.0 3.000000 1
3349 echo lake foods 5.0 2.000000 1
2754 daily bread 2.0 20.000000 2
10153 sea port products corp 1.0 -1.333333 3
In [ ]:
# Step 2: Top Brands by Additives
# Only keep brands with enough products (at least 30) to avoid noisy small brands
brand_stats_filtered = brand_stats[brand_stats['product_count'] >= 30]

# Top 10 brands by average additives
top_additive_brands = brand_stats_filtered.sort_values('additives_n', ascending=False).head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='additives_n', y='brand_main', data=top_additive_brands)
plt.title('Top 10 Brands with Highest Average Number of Additives')
plt.xlabel('Average Number of Additives')
plt.ylabel('Brand')
plt.grid(True)
plt.show()
In [ ]:
# Step 3: Top Brands by Worst Nutrition Score
# Top 10 brands by worst average nutrition score
top_unhealthy_brands = brand_stats_filtered.sort_values('nutrition-score-fr_100g', ascending=False).head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='nutrition-score-fr_100g', y='brand_main', data=top_unhealthy_brands)
plt.title('Top 10 Brands with Worst Average Nutrition Score')
plt.xlabel('Average Nutrition Score (Higher = Worse)')
plt.ylabel('Brand')
plt.grid(True)
plt.show()
In [ ]:
# Step 4: Scatter Plot: Additives vs. Nutrition Score
# Scatter plot of average additives vs. average nutrition score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='additives_n', y='nutrition-score-fr_100g', data=brand_stats_filtered)
plt.title('Average Additives vs. Average Nutrition Score per Brand')
plt.xlabel('Average Number of Additives')
plt.ylabel('Average Nutrition Score')
plt.grid(True)
plt.show()
In [ ]:
# Select brands with high average additives
high_additive_brands = top_additive_brands['brand_main'].tolist()

# Filter original data
df_high_additives = df_final[df_final['brand_main'].isin(high_additive_brands)]

# Show Nutri-Grade distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='nutrition_grade_fr', data=df_high_additives, order=['a', 'b', 'c', 'd', 'e'])
plt.title('Nutri-Grade Distribution for Brands with High Additive Usage')
plt.xlabel('Nutri-Score Grade')
plt.ylabel('Number of Products')
plt.grid(True)
plt.show()

đŸ›Žïž Research Question 2: Conclusion¶

1. Brands with the Highest Average Number of Additives:¶

  ‱ Brands like Arnie's, Nissin, and Toft's average around 10–13 additives per product, suggesting heavy reliance on food additives.

2. Brands with the Worst Average Nutrition Scores:¶

  • Brands such as Brown & Haley, Reese's, and Palmer have the worst average Nutrition Scores.
  ‱ These brands focus mainly on confectionery and sweets, which are traditionally high in sugar and fat.

3. Relationship Between Additives and Nutrition Scores Across Brands:¶

  • From the scatter plot, there is no strong direct relationship between a brand’s average number of additives and its average Nutrition Score.
  • Some brands use many additives but still have moderate Nutrition Scores, while some brands with poor Nutrition Scores do not necessarily have many additives.

✅ Overall, while additives alone do not directly determine a product's Nutri-Score grade or Nutrition Score, heavy additive usage is often a marker of lower overall nutritional quality at the brand level.


🎯 Research Question 3: How do food additives and healthiness vary across different brands?¶

In this part, we investigate the types of additives most commonly used across different groups of brands to understand how additive usage patterns vary with food healthiness.

In [ ]:
# Select low-additive brands
# Bottom 10 brands by average additives (filter brands with >= 30 products)
low_additive_brands = brand_stats_filtered.sort_values('additives_n', ascending=True).head(10)

# Filter the products for these brands
low_additives_df = df_final[df_final['brand_main'].isin(low_additive_brands['brand_main'])]
In [ ]:
# Select brands with worst Nutri-Score
# Top 10 brands with worst average Nutri-Score
worst_nutriscore_brands = brand_stats_filtered.sort_values('nutrition-score-fr_100g', ascending=False).head(10)

# Filter the products for these brands
worst_nutriscore_df = df_final[df_final['brand_main'].isin(worst_nutriscore_brands['brand_main'])]
In [ ]:
# Analyze Top Additives for Each Group
def plot_top_additives(dataframe, title):
    additives_used = list(chain.from_iterable(dataframe['additives_list'])) # Faster flattening
    additives_counter = Counter(additives_used)
    top_additives = additives_counter.most_common(20)
    top_additives_df = pd.DataFrame(top_additives, columns=['Additive', 'Count'])

    plt.figure(figsize=(10, 6))
    sns.barplot(x='Count', y='Additive', data=top_additives_df)
    plt.title(title)
    plt.xlabel('Count')
    plt.ylabel('Additive')
    plt.grid(True)
    plt.show()

# High-additive brands
plot_top_additives(df_high_additives, 'Top 20 Additives Used by High Additive Brands')
In [ ]:
# Low-additive brands
plot_top_additives(low_additives_df, 'Top 20 Additives Used by Low Additive Brands')
In [ ]:
# Worst Nutri-Score brands
plot_top_additives(worst_nutriscore_df, 'Top 20 Additives Used by Brands with Worst Nutri-Scores')
In [ ]:
# General dataset
plot_top_additives(df_final, 'Top 20 Additives Used by All Brands')
In [ ]:
# Rank the additives by their usage count
def rank_additives(dataframe):
    # Fast flatten the list of additives
    additives_used = list(chain.from_iterable(dataframe['additives_list']))

    # Count additives
    additives_counter = Counter(additives_used)

    # Create and sort DataFrame
    additives_rank_df = pd.DataFrame(additives_counter.items(), columns=['Additive', 'Count'])
    additives_rank_df = additives_rank_df.sort_values(by='Count', ascending=False).reset_index(drop=True)

    # Add Rank column
    additives_rank_df.index += 1
    additives_rank_df.index.name = 'Rank'

    return additives_rank_df

# Apply to all products
additives_rank_df = rank_additives(df_final)

# Display the result
display(additives_rank_df)
Additive Count
Rank
1 E330 - Citric acid 20026
2 E101 - Riboflavin 19564
3 E101i - Riboflavin 19559
4 E375 - Nicotinic acid 19535
5 E322 - Lecithins 16498
... ... ...
321 E555 - Potassium aluminium silicate 1
322 E343i - Monomagnesium phosphate 1
323 E365 - Sodium fumarate 1
324 E266 - Sodium dehydroacetate 1
325 E470 - Sodium/potassium/calcium and magnesium ... 1

325 rows × 2 columns

đŸ›Žïž Research Question 3: Conclusion¶

  1. Additive Usage Varies by Brand Type:

    • High Additive Brands rely more on additives such as E375 (Nicotinic acid), E101 (Riboflavin), and E322 (Lecithins), many of which are vitamins or natural emulsifiers.
    • Low Additive Brands also use E322 but rely more on compounds like E330 (Citric acid), E509 (Calcium chloride), and E150a (Plain caramel).
  2. Most Frequently Used Additives May Still Be "Healthy":

    • Surprisingly, additives with the highest overall usage across all brands include natural compounds, such as citric acid, riboflavin, and nicotinic acid, which are commonly recognized as safe and even beneficial.
    • This counters the intuition that "more additives = more harmful." Additive type matters more than quantity alone.
  3. Not All Additives Are Equal:

    ‱ We group additives into two general categories (a toy classifier sketch follows this list):
      • Generally Healthy Additives: vitamins, natural acids, fibers (e.g., E101, E375, E330).
      • Potentially Unhealthy Additives: synthetic colorants, artificial sweeteners, emulsifiers (e.g., E129, E133, E951).
    • Products with many synthetic additives are more likely to be ultra-processed and score worse on Nutri-Scores.
  4. Implication:

    • Nutri-Score does not directly factor in additive types. Therefore, two products may receive similar grades while differing significantly in additive composition.
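
To illustrate this grouping, here is a hypothetical sketch of a coarse additive classifier. The example E-number sets are assumptions chosen for demonstration, not an authoritative taxonomy:

In [ ]:
# Hypothetical illustration: coarsely classify additive entries (formatted
# like "E330 - Citric acid") into the two groups described above.
# The two sets below are assumed examples only, not an official taxonomy.
GENERALLY_HEALTHY = {'E101', 'E330', 'E375'}       # vitamins, natural acids
POTENTIALLY_UNHEALTHY = {'E129', 'E133', 'E951'}   # colorants, sweeteners

def additive_category(entry: str) -> str:
    code = entry.split(' - ')[0]  # "E330 - Citric acid" -> "E330"
    if code in GENERALLY_HEALTHY:
        return 'generally healthy'
    if code in POTENTIALLY_UNHEALTHY:
        return 'potentially unhealthy'
    return 'unclassified'

print(additive_category('E330 - Citric acid'))  # -> generally healthy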

3ïžâƒŁ Classification Modeling¶

3.1 Principal Component Analysis (PCA)¶

Let's first select the following numeric features, which are the core indicators in nutritional labeling. There are 18 features in total:

In [ ]:
# include only the nutritional columns for PCA
nutrient_cols = df_final[[
    'additives_n',
    'ingredients_from_palm_oil_n',
    'ingredients_that_may_be_from_palm_oil_n',
    'energy_100g',
    'fat_100g',
    'saturated-fat_100g',
    'trans-fat_100g',
    'cholesterol_100g',
    'carbohydrates_100g',
    'sugars_100g',
    'fiber_100g',
    'proteins_100g',
    'salt_100g',
    'sodium_100g',
    'vitamin-a_100g',
    'vitamin-c_100g',
    'calcium_100g',
    'iron_100g'
]]

# drop rows with missing values in those columns
df_nutrient = nutrient_cols.dropna()
df_nutrient.head(5)
Out[ ]:
additives_n ingredients_from_palm_oil_n ingredients_that_may_be_from_palm_oil_n energy_100g fat_100g saturated-fat_100g trans-fat_100g cholesterol_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g vitamin-a_100g vitamin-c_100g calcium_100g iron_100g
82 1.0 0.0 0.0 2389.0 42.86 7.14 0.0 0.000 25.00 14.29 7.1 25.00 0.54356 0.214 0.000000 0.0000 0.071 0.00514
149 1.0 0.0 0.0 1046.0 0.00 0.00 0.0 0.000 62.50 52.50 7.5 2.50 0.00000 0.000 0.001125 0.0000 0.050 0.00360
152 1.0 0.0 0.0 1569.0 2.50 0.00 0.0 0.000 87.50 65.00 2.5 2.50 1.96850 0.775 0.000750 0.0000 0.100 0.00090
153 5.0 0.0 0.0 1883.0 22.50 12.50 0.0 0.012 70.00 42.50 2.5 5.00 1.01600 0.400 0.000075 0.0000 0.050 0.00180
200 1.0 0.0 0.0 1523.0 16.88 10.39 0.0 0.052 44.16 5.19 1.3 7.79 1.08966 0.429 0.000195 0.1013 0.026 0.00094

To use PCA to reduce the dimensionality of the data, we follow best practice and standardize the features first:

In [ ]:
# Standardizing the features
X_scaled = StandardScaler().fit_transform(df_nutrient)

Next, we will explore the ideal number of components for PCA. The output of explained_variance_ratio_ tells us how much of the variance each component explains:

In [ ]:
# Apply PCA with 18 components
pca = PCA(n_components=18)
X_pca = pca.fit_transform(X_scaled)

# The variance explained by each component
np.set_printoptions(suppress=True)
pca.explained_variance_ratio_
Out[ ]:
array([0.16566702, 0.13153279, 0.10773104, 0.07771031, 0.07305444,
       0.06295702, 0.05915664, 0.05824494, 0.05748542, 0.05220961,
       0.04915088, 0.03967248, 0.03290017, 0.0197183 , 0.01169987,
       0.00110908, 0.        , 0.        ])

Here, we create an Explained Variance Ratio plot to interpret how much variance each principal component explains:

In [ ]:
# Get Explained Variance Ratio and create plot
explained_var = pca.explained_variance_ratio_
cum_var = np.insert(np.cumsum(explained_var), 0, 0.0)
x_full = np.arange(0, len(explained_var) + 1)

# plotting
plt.figure(figsize=(8, 6))
plt.plot(x_full, cum_var, marker="o", label="Cumulative")
plt.axhline(0.95, color="red", linestyle="--", label="95% Threshold")

# graph format
plt.xlim(-0.5, len(explained_var)+0.5)
plt.xticks(x_full)
plt.xlabel("Number of Principal Components")
plt.ylabel("Explained Variance Ratio")
plt.title("Explained Variance per Principal Component")
plt.legend()
plt.tight_layout()
plt.show()

We ideally set the number of principal components to the smallest value at which the curve flattens out. From the plot above, the first 13 components (of 18) explain more than 95% of the variance. So, we initially set n_components = 13 for the following analysis.

Now, we define a helper function that applies PCA to our dataset and returns the principal components as a DataFrame for further analysis and clustering.

In [ ]:
from sklearn.decomposition import PCA

def run_pca(X_scaled, n_components):
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    df_pca = pd.DataFrame(X_pca, columns=[f"PC{i+1}" for i in range(n_components)])
    return pca, df_pca

Deciding How Many Features to Keep for KMeans Clustering¶

After we used Principal Component Analysis, we initially kept 13 features that together explain about 95% of the variation in the data.

However, we noticed a problem:

Having too many dimensions can actually make it harder for the model to find meaningful groups, and KMeans struggles to decide which points should belong together.

👉 To avoid this problem, we decided to reduce the number of features even further before running KMeans. We tested different numbers of PCA features and, for each case, measured clustering quality using the silhouette score.
Higher scores mean better-separated, clearer clusters.
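
(For reference: the silhouette score of a point i compares its mean intra-cluster distance a(i) with its mean distance b(i) to the nearest other cluster, s(i) = (b(i) − a(i)) / max(a(i), b(i)). It ranges from −1 to 1, and the reported score averages s(i) over all points, so higher values mean tighter, better-separated clusters.)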

Here were the results:

Number of PCA features Silhouette Score
4 0.3159
5 0.3064
6 0.2864
8 0.2920
10 0.2992
13 0.2357

Clustering Evaluation: How Good Are Our Groups?¶

Our final clustering configuration (shown below) gave a silhouette score of 0.3159.

  • This score indicates that the clusters are moderately well-separated.
  ‱ In nutrition data, this is expected: foods often fall along a continuum (e.g., moderately salty, slightly sugary) rather than into totally separate categories.

Based on the results, we chose to move forward with 4 principal components, because it produced the best clustering quality.

In [ ]:
# Standardizing
X_scaled = StandardScaler().fit_transform(df_nutrient)

# List of PCA dimensions you want to test
pca_dims = [4, 5, 6, 8, 10, 13]

# Dictionary to store silhouette scores
silhouette_scores = {}

for dim in pca_dims:
    # PCA Dimension Reduction
    pca = PCA(n_components=dim, random_state=42)
    X_pca = pca.fit_transform(X_scaled)

    # Apply KMeans (n_clusters for each dimensionality chosen via the Elbow Method)
    kmeans = KMeans(n_clusters=dim - 1, random_state=42)
    clusters = kmeans.fit_predict(X_pca)

    # check silhouette score for the performance of KMeans clustering:
    score = silhouette_score(X_pca, clusters)
    silhouette_scores[dim] = score

# Output
for dim, score in silhouette_scores.items():
    print(f"Silhouette score for {dim}D PCA + KMeans: {score:.4f}")
Silhouette score for 4D PCA + KMeans: 0.3159
Silhouette score for 5D PCA + KMeans: 0.3064
Silhouette score for 6D PCA + KMeans: 0.2864
Silhouette score for 8D PCA + KMeans: 0.2920
Silhouette score for 10D PCA + KMeans: 0.2992
Silhouette score for 13D PCA + KMeans: 0.2357

KMeans Clustering and Visualization with UMAP¶

In [ ]:
# dataset for KMeans Clustering
pca_kmeans, df_pca_kmeans = run_pca(X_scaled, 4)
df_pca_kmeans.head(5)
Out[ ]:
PC1 PC2 PC3 PC4
0 2.684036 1.629462 -1.264479 -1.027111
1 0.049841 -1.180093 1.096871 -2.193361
2 0.695542 -1.189935 2.092358 -1.137370
3 2.072854 -0.649734 0.724378 0.590306
4 0.486327 0.340976 -0.266710 -0.917434

We first use the Elbow Method to find the optimal k for KMeans:

In [ ]:
distortions = []
K_range = range(1, 30)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_pca_kmeans)
    distortions.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K_range, distortions, marker='o')
plt.title("Elbow Method for Optimal k")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares (distortion)")
plt.xticks(K_range)
plt.grid(True)
plt.tight_layout()
plt.show()

Now we apply KMeans with k=4 and add a new column, clusters_label, to df_pca_kmeans:

In [ ]:
# k=4 KMeans Clustering
kmeans = KMeans(n_clusters=4, random_state=42)
df_pca_kmeans['clusters_label'] = kmeans.fit_predict(df_pca_kmeans)
df_pca_kmeans.head(10)
Out[ ]:
PC1 PC2 PC3 PC4 clusters_label
0 2.684036 1.629462 -1.264479 -1.027111 2
1 0.049841 -1.180093 1.096871 -2.193361 3
2 0.695542 -1.189935 2.092358 -1.137370 3
3 2.072854 -0.649734 0.724378 0.590306 3
4 0.486327 0.340976 -0.266710 -0.917434 3
5 0.433520 -0.518498 0.152523 0.135389 3
6 -0.602105 0.263374 -0.157433 -0.293366 1
7 0.739413 1.276978 -1.010996 0.214544 2
8 0.450569 0.545287 -0.582644 0.199125 1
9 1.175014 -1.104427 0.393440 2.838227 3

Now we check the number of data points in each cluster:

In [ ]:
print(np.unique(df_pca_kmeans['clusters_label'], return_counts=True))
(array([0, 1, 2, 3], dtype=int32), array([  349, 36919, 11865, 31813]))

We identified four clusters using KMeans. One of the clusters (Cluster 0) contains only 349 data points, far fewer than the others. Although it may not represent a large group on its own, we continue our analysis with Clusters 0, 1, 2, and 3, which together account for the entire dataset and provide interpretable patterns.

In [ ]:
# Convert the 4-component PCA data to 2D using UMAP
# (drop the cluster labels so they are not fed in as an input feature)
umap_2d = umap.UMAP(
    n_components=2,
    n_neighbors=10,
    min_dist=0.1,
    random_state=42
).fit_transform(df_pca_kmeans.drop(columns='clusters_label').values)

plt.figure(figsize=(6,5))
sns.scatterplot(
    x=umap_2d[:, 0],
    y=umap_2d[:, 1],
    # hue=df_final['nutrition_grade_fr'],
    hue=df_pca_kmeans['clusters_label'],
    palette='Set2',
    s=12,
    alpha=0.8
)
plt.title("UMAP 2-D embedding of packaged foods")
plt.xlabel("UMAP-1"); plt.ylabel("UMAP-2")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
In [ ]:
df_final
Out[ ]:
code url creator created_t created_datetime last_modified_t last_modified_datetime product_name brands brands_tags ... vitamin-c_100g calcium_100g iron_100g nutrition-score-fr_100g nutrition-score-uk_100g ingredients_list additives_list weird_char_count clean_ingredient_tokens brand_main
82 0000000033688 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489050424 2017-03-09T09:07:04Z 1489050424 2017-03-09T09:07:04Z peanuts, mixed nuts northgate market northgate-market ... 0.0000 0.071 0.00514 14.0 14.0 [peanuts, honey, coating (sucrose, wheat starc... [E415 - Xanthan gum] 0 [coating (sucrose, wheat starch, xanthan gum, ... northgate market
149 0000000045292 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489069958 2017-03-09T14:32:38Z 1489069958 2017-03-09T14:32:38Z turkish apricots northgate northgate ... 0.0000 0.050 0.00360 8.0 8.0 [apricots, sulfur dioxide.] [E220 - Sulphur dioxide] 0 [sulfur dioxide] northgate
152 0000000045421 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z chili mango torn & glasses torn-glasses ... 0.0000 0.100 0.00090 19.0 19.0 [dried mango, paprika, sugar, salt, citric aci... [E330 - Citric acid] 0 [dried mango, citric acid] torn & glasses
153 0000000045483 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489050424 2017-03-09T09:07:04Z 1489050424 2017-03-09T09:07:04Z milk chocolate pretzels torn & glasser torn-glasser ... 0.0000 0.050 0.00180 25.0 25.0 [milk chocolate (sugar, cocoa butter, chocolat... [E101 - Riboflavin, E101i - Riboflavin, E322 -... 7 [milk chocolate (sugar, cocoa butter, chocolat... torn & glasser
200 0000020039127 http://world-en.openfoodfacts.org/product/0000... usda-ndb-import 1489138568 2017-03-10T09:36:08Z 1489138568 2017-03-10T09:36:08Z butter croissants fresh & easy fresh-easy ... 0.1013 0.026 0.00094 18.0 18.0 [wheat flour, butter (cream), water, yeast, su... [E300 - Ascorbic acid] 0 [wheat flour, butter (cream, wheat gluten, asc... fresh & easy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
355821 9556041620369 http://world-en.openfoodfacts.org/product/9556... usda-ndb-import 1489066070 2017-03-09T13:27:50Z 1489066070 2017-03-09T13:27:50Z sardines in spicy tomato sauce, chili and lime ayam brand ayam-brand ... 0.0000 0.357 0.00257 3.0 3.0 [sardines, water, tomato paste, sugar, dried c... [E322 - Lecithins, E322i - Lecithin, E415 - Xa... 0 [tomato paste, dried chili, thickener (xanthan... ayam brand
355844 9556173386461 http://world-en.openfoodfacts.org/product/9556... usda-ndb-import 1489066836 2017-03-09T13:40:36Z 1489066836 2017-03-09T13:40:36Z chewy candy fruit plus fruit-plus ... 0.0000 0.000 0.00000 21.0 21.0 [sugar, glucose syrup, vegetable fat (hydrogen... [E102 - Tartrazine, E330 - Citric acid, E414 -... 3 [glucose syrup, vegetable fat (hydrogenated pa... fruit plus
355859 9556390158162 http://world-en.openfoodfacts.org/product/9556... usda-ndb-import 1489069476 2017-03-09T14:24:36Z 1489069476 2017-03-09T14:24:36Z lee, special crackers lee biscuits (pte.) ltd. lee-biscuits-pte-ltd ... 0.0000 0.045 0.00082 16.0 16.0 [wheat flour, vegetable oil (palm olein), suga... [E1101 - Protease, E450 - Diphosphates, E471 -... 0 [wheat flour, vegetable oil (palm olein, corn ... lee biscuits (pte.) ltd.
355860 9556390178160 http://world-en.openfoodfacts.org/product/9556... usda-ndb-import 1489070026 2017-03-09T14:33:46Z 1489070026 2017-03-09T14:33:46Z sugar crackers lee biscuits (pte.) ltd. lee-biscuits-pte-ltd ... 0.0000 0.000 0.00082 13.0 13.0 [wheat flour, sugar, vegetable fat (palm base)... [E450 - Diphosphates, E500 - Sodium carbonates... 0 [wheat flour, corn starch, vegetable oil (palm... lee biscuits (pte.) ltd.
355968 9780803738782 http://world-en.openfoodfacts.org/product/9780... usda-ndb-import 1489069944 2017-03-09T14:32:24Z 1489069945 2017-03-09T14:32:25Z organic z bar clif kid clif-kid ... 0.0583 0.556 0.00500 11.0 11.0 [organic oat blend (organic rolled oats, organ... [E322 - Lecithins, E322i - Lecithin] 0 [] clif kid

80946 rows × 47 columns

Cluster Profiling¶

In [ ]:
df_profile = df_nutrient.copy()
df_profile['cluster'] = df_pca_kmeans['clusters_label'].values

cluster_means = df_profile.groupby('cluster')[
    ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n', 'energy_100g']
].mean().round(2)

display(cluster_means)
fat_100g sugars_100g salt_100g additives_n energy_100g
cluster
0 1.60 9.52 40.84 2.19 707.08
1 5.52 7.45 1.13 3.21 623.53
2 32.38 13.48 1.65 2.13 1867.73
3 12.93 33.26 1.09 3.37 1697.14
In [ ]:
nutrient_features = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n']
energy_feature = ['energy_100g']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,6))

cluster_means.loc[[0,1,2,3]][nutrient_features].T.plot(
    kind='bar', ax=ax1
)
ax1.set_title('Cluster Profiles – Macronutrients & Additives')
ax1.set_ylabel('Mean per 100g')
ax1.set_xlabel('Nutritional Feature')
ax1.set_xticklabels(nutrient_features, rotation=45)
ax1.legend(title='Cluster')


cluster_means.loc[[0,1,2,3]][energy_feature].T.plot(
    kind='bar', ax=ax2, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#CC0000']
)
ax2.set_title('Cluster Profiles – Energy (kJ per 100g)')
ax2.set_ylabel('Mean Energy (kJ/100g)')
ax2.set_xlabel('Energy')
ax2.set_xticklabels(['energy_100g'], rotation=0)
ax2.legend(title='Cluster')

plt.tight_layout()
plt.show()
In [ ]:
features = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n', 'energy_100g']

data = cluster_means.loc[[0, 1, 2, 3]][features]

# Min-max scale each feature so all radar-chart axes share a 0-1 range
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=features)
data_scaled['cluster'] = ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3']
In [ ]:
labels = features
num_vars = len(labels)

angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]


fig, ax = plt.subplots(figsize=(8, 6), subplot_kw=dict(polar=True))


for i, row in data_scaled.iterrows():
    values = row[features].tolist()
    values += values[:1]
    ax.plot(angles, values, label=row['cluster'])
    ax.fill(angles, values, alpha=0.15)


ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)
ax.set_thetagrids(np.degrees(angles[:-1]), labels)
ax.set_title("Cluster Nutritional Radar Chart", fontsize=14, pad=30)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.tight_layout()
plt.show()
In [ ]:
fig, axes = plt.subplots(2, 2, subplot_kw=dict(polar=True), figsize=(12,10), constrained_layout=True)
axes = axes.flatten()

# Define colors for clusters
colors = ['#4682B4', '#FFA500', '#7ED957', '#B22222']

for i in range(4):
    values = data_scaled.iloc[i][features].tolist()
    values += values[:1]
    ax = axes[i]
    ax.plot(angles, values, color=colors[i], linewidth=2)
    ax.fill(angles, values, color=colors[i], alpha=0.25)
    ax.set_title(f"{data_scaled.iloc[i]['cluster']}", size=13, pad=35)
    ax.set_thetagrids(np.degrees(angles[:-1]), features)
    ax.set_ylim(0, 1)

fig.suptitle("Nutritional Radar Charts for Each Cluster", fontsize=16, y=1)
fig.subplots_adjust(hspace=0.4)  # extra vertical spacing between the radar panels
plt.show()

Here are the results:

Cluster 0: High-Salt, Low-Fat Products

  • đŸ§Ÿ Characteristics:
    • Very low fat (1.6g)
    • Extremely high salt (41g)
    ‱ Moderate sugar (9.5g), additives (2.2), energy (707 kJ)
    • 📌 High-salt preserved foods or savory processed items.

Cluster 1: Low-Energy Foods

  • đŸ§Ÿ Characteristics:
    • Moderate fat (5.5g), sugar (7.5g), additives (3.2)
    ‱ Low salt (1.1g), lowest energy (624 kJ).
    • 📌 Low-calorie snacks, possibly cereals or health-focused foods.

Cluster 2: High-Fat, High-Energy Foods

  • đŸ§Ÿ Characteristics:
    • Extremely high fat (32.4g)
    • High sugar (13.5g)
    • Low salt (1.6g)
    ‱ Lower additives (2.1), highest energy (1868 kJ).
    • 📌 Fat-dense products like butters, nut spreads, cheeses, and creamy desserts.

Cluster 3: High-Sugar, Additive-Heavy Processed Foods

  • đŸ§Ÿ Characteristics:
    • Moderate fat (12.9g)
    ‱ Very high sugar (33.3g), highest additives (3.4), very high energy (1697 kJ).
    • Low salt (1.1g)
    • 📌 Highly processed sweets, sodas, candies, and energy bars.

4ïžâƒŁ Prediction Modeling: Predicting Product Healthiness¶

In this section, we develop machine learning models to evaluate how well a product's healthiness, as measured by its Nutri-Score grade, can be predicted from its ingredient composition, additive content, and numerical nutritional values.

Our main objective is to determine whether Nutri-Score labels can be inferred directly from structured product data, and which features (e.g., sugar, salt, additives) are most predictive.

4.1 Ingredients + Additives Text Model¶

We explore whether free-text fields like ingredients and additives can effectively predict a product’s Nutri-grade, using text-based modeling techniques like TF-IDF and Logistic Regression.

Goal:

Predict Nutri-grade (A–E) and health category (healthy vs unhealthy) using only text-based ingredient and additive information.

Feature Preparation

  • Combined Tokens: We merge clean_ingredient_tokens with additives_list into a unified list per product.

  ‱ TF-IDF Encoding: Convert token lists into TF-IDF vectors, using unigrams and bigrams (e.g., "palm oil").

  • Targets:

    Multiclass: nutrition_grade_fr (A–E)

    Binary: score_binary (healthy = A/B, unhealthy = C/D/E)

In [ ]:
# Step 1: Combine ingredients and additives into a single token list
df_final['combined_tokens'] = df_final.apply(
    lambda row: row['clean_ingredient_tokens'] + row['additives_list']
    if isinstance(row['additives_list'], list) else row['clean_ingredient_tokens'],
    axis=1
)

# Step 2: Join the token list into a single string for TF-IDF processing
df_final['combined_str'] = df_final['combined_tokens'].apply(lambda tokens: ' '.join(tokens))

# Step 3: Create binary label: 'healthy' (A/B) vs 'unhealthy' (C/D/E)
df_final['score_binary'] = df_final['nutrition_grade_fr'].str.lower().map(
    lambda x: 'healthy' if x in ['a', 'b'] else ('unhealthy' if x in ['c', 'd', 'e'] else None)
)

# Step 4: Drop rows with missing values, and copy so that later column
# additions do not trigger pandas SettingWithCopy warnings
df_model = df_final.dropna(subset=['combined_str', 'nutrition_grade_fr', 'score_binary']).copy()

4.1.1 Multiclass Logistic Regression (Nutri-Score A to E)¶

Goal: Predict exact Nutri-Score class A to E.

In [ ]:
# Step 5: TF-IDF vectorization (unigrams only, top 1000 features)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix

vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(df_model['combined_str'])
y = df_model['nutrition_grade_fr'].str.lower()  # Target: multiclass

# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Step 7: Train logistic regression model (with class balancing)
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)

# Step 8: Evaluate classification performance
y_pred = clf.predict(X_test)
print("Multiclass Classification Report (Nutri-Score A–E):")
print(classification_report(y_test, y_pred))

# Step 9: Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix – Multiclass Nutri-Score Classification")
plt.show()
Multiclass Classification Report (Nutri-Score A–E):
              precision    recall  f1-score   support

           a       0.54      0.67      0.60      1717
           b       0.49      0.57      0.53      2288
           c       0.44      0.45      0.44      3148
           d       0.65      0.49      0.56      5264
           e       0.66      0.74      0.70      3773

    accuracy                           0.57     16190
   macro avg       0.56      0.58      0.57     16190
weighted avg       0.58      0.57      0.57     16190


4.1.2 Binary Logistic Regression (Healthy vs Unhealthy)¶

Goal: Classify products as "healthy" (Nutri-Score A or B) or "unhealthy" (Nutri-Score C, D, or E) using TF-IDF features derived from cleaned ingredient and additive text.

In [ ]:
# Step 5: TF-IDF vectorization (bigrams included, more features)
vectorizer = TfidfVectorizer(max_features=8000, ngram_range=(1,2), stop_words='english')
X_tfidf = vectorizer.fit_transform(df_model['combined_str'])
y = df_model['score_binary']  # Target: binary label


# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Step 7: Train logistic regression model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)

# Step 8: Evaluate classification performance
y_pred = clf.predict(X_test)
print("Binary Classification Report (Healthy vs Unhealthy):")
print(classification_report(y_test, y_pred))

# Step 9: Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix – Binary Nutri-Score Classification")
plt.show()
Binary Classification Report (Healthy vs Unhealthy):
              precision    recall  f1-score   support

     healthy       0.68      0.88      0.77      4005
   unhealthy       0.95      0.87      0.91     12185

    accuracy                           0.87     16190
   macro avg       0.82      0.87      0.84     16190
weighted avg       0.89      0.87      0.87     16190


4.1.3 Binary Random Forest (Healthy vs Unhealthy)¶

Goal: Try non-linear model to capture ingredient interactions.

In [ ]:
# Reuse X_train, y_train, X_test, y_test from the binary logistic regression model (Section 4.1.2)

# Step 1: Initialize and train the Random Forest classifier
rf_clf = RandomForestClassifier(
    n_estimators=200,       # Number of trees in the forest
    max_depth=15,           # Maximum depth of each tree
    class_weight='balanced',# Handle class imbalance
    random_state=42         # Ensure reproducibility
)
rf_clf.fit(X_train, y_train)

# Step 2: Make predictions
y_pred_rf = rf_clf.predict(X_test)

# Step 3: Display performance metrics
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

# Step 4: Plot confusion matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_clf.classes_)
sns.heatmap(conf_matrix_rf, annot=True, fmt="d",
            xticklabels=rf_clf.classes_,
            yticklabels=rf_clf.classes_,
            cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Random Forest Confusion Matrix – Healthy vs Unhealthy")
plt.show()
Random Forest Classification Report:
              precision    recall  f1-score   support

     healthy       0.54      0.88      0.67      4005
   unhealthy       0.95      0.75      0.84     12185

    accuracy                           0.78     16190
   macro avg       0.75      0.82      0.75     16190
weighted avg       0.85      0.78      0.80     16190


🚀 Summary of Key Findings – Section 4.1¶

  1. Text-based features from ingredients and additives carry strong predictive signals.

  2. Binary classification significantly outperforms multiclass.

    • Multiclass Logistic Regression (A–E) achieved ~57% accuracy (macro F1 ≈ 0.57).
    • Binary Logistic Regression (Healthy vs. Unhealthy) achieved 87% accuracy with strong recall for both classes.
  3. Logistic Regression outperforms Random Forest in interpretability and precision.

    ‱ Binary Logistic Regression reached balanced performance (macro F1 ≈ 0.84), with strong precision on both classes.
    ‱ Random Forest (binary) reached 78% accuracy, but tended to overpredict "healthy", misclassifying many unhealthy products as healthy.

4.2 Ingredient + Brand Model¶

Goal: In addition to ingredients, brand identity may reflect broader product philosophies or quality standards. This model evaluates whether combining ingredients with brand information enhances prediction accuracy for Nutri-grade (healthy vs unhealthy).

In [ ]:
from scipy.sparse import hstack
# Step 1: Prepare ingredient-only text
df_model['ingredient_str'] = df_model['clean_ingredient_tokens'].apply(lambda x: ' '.join(x))

# Step 2: TF-IDF vectorization on ingredients
vectorizer = TfidfVectorizer(
    max_features=8000,
    ngram_range=(1, 2),
    stop_words='english',
    min_df=3,
    max_df=0.9
)
X_tfidf = vectorizer.fit_transform(df_model['ingredient_str'])

# Step 3: Clean and encode brand info
top_brands = df_model['brands'].value_counts().head(50).index
df_model['brand_clean'] = df_model['brands'].apply(lambda x: x if x in top_brands else 'other')
brand_ohe = pd.get_dummies(df_model['brand_clean'], prefix='brand')

# Step 4: Concatenate features
X_final = hstack([X_tfidf, brand_ohe.values])

# Step 5: Define target
y = df_model['score_binary']  # 'healthy' vs 'unhealthy'
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

# Step 6: Train model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Step 7: Evaluation
print("Classification Report (Ingredient + Brand):")
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix – Ingredient + Brand Binary Classification")
plt.show()
Classification Report (Ingredient + Brand):
              precision    recall  f1-score   support

     healthy       0.67      0.88      0.76      4005
   unhealthy       0.96      0.86      0.90     12185

    accuracy                           0.86     16190
   macro avg       0.81      0.87      0.83     16190
weighted avg       0.88      0.86      0.87     16190

In [ ]:
# Step 1: Use X_final (TF-IDF + brand one-hot) and y from previous section

rf_clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    class_weight='balanced',
    random_state=42
)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

# Evaluation
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

conf_matrix_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_clf.classes_)
sns.heatmap(conf_matrix_rf, annot=True, fmt="d", xticklabels=rf_clf.classes_, yticklabels=rf_clf.classes_, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Random Forest Confusion Matrix – Ingredient + Brand")
plt.show()

🚀 Summary of Key Findings - Section 4.2¶

Adding brand information did not significantly improve prediction accuracy (still ~86%).
This suggests that brand identity adds limited value beyond what is already captured in the ingredient list.

4.3 Ingredient-Only Model¶

Goal:

This experiment investigates whether a product’s healthiness (as defined by the Nutri-Score system) can be accurately predicted using only its cleaned ingredient list—without relying on additives, brand, or structured nutrition values.

In [ ]:
# Step 1: Prepare Ingredient-only Strings
df_model['ingredient_only_str'] = df_model['clean_ingredient_tokens'].apply(lambda x: ' '.join(x))

# Step 2: TF-IDF Vectorization (only on ingredients)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=8000,
    ngram_range=(1, 2),
    stop_words='english',
    min_df=3,
    max_df=0.9
)

X_ing = vectorizer.fit_transform(df_model['ingredient_only_str'])
y = df_model['score_binary']  # Use the binary label: 'healthy' vs 'unhealthy'

# Step 3: Train/Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_ing, y, test_size=0.2, random_state=42)

# Step 4: Train Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Step 5: Evaluation
print("Ingredient-Only Classification Report:")
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="BuGn",
            xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Ingredient-Only – Binary Nutri-Score Classification")
plt.show()
In [ ]:
# Step 6: Show Most Important Tokens
import numpy as np

feature_names = vectorizer.get_feature_names_out()
coefs = clf.coef_[0]  # For binary classification

# Top 20 words most indicative of "unhealthy"
print(" Top 20 'unhealthy' indicators:")
for i in np.argsort(coefs)[-20:][::-1]:
    print(f"{feature_names[i]:<20} {coefs[i]:.3f}")

# Top 20 words most indicative of "healthy"
print("\n Top 20 'healthy' indicators:")
for i in np.argsort(coefs)[:20]:
    print(f"{feature_names[i]:<20} {coefs[i]:.3f}")
 Top 20 'unhealthy' indicators:
pepper spice         6.218
syrup seasoning      5.791
cottonseed oils      4.883
acid peanut          4.081
pepper yeast         3.920
vit b12              3.845
crumb wheat          3.745
sulfate ascorbic     3.424
chips bananas        3.420
oils coconut         3.275
powder sorbitan      3.263
citrate tricalcium   3.201
pgpr emulsifier      3.166
color modified       3.155
color contains       3.134
color disodium       3.071
flavor modified      2.994
orange juice         2.938
brownie              2.932
color black          2.915

 Top 20 'healthy' indicators:
quartered            -5.559
benzoate sodium      -5.413
water red            -4.978
phosphate color      -4.343
usa                  -3.820
almonds almonds      -3.663
dried apricots       -3.590
whey pasteurized     -3.583
coriander            -3.552
acid ferrous         -3.486
freshness vitamin    -3.474
white tuna           -3.339
phosphate thiamine   -3.277
steamed              -3.206
culture sea          -3.182
benzoic acid         -3.170
cultures reduced     -3.144
tocopherols natural  -3.127
vegetable monoglycerides -3.119
almond butter        -3.073

🚀Summary of Key Findings – Section 4.3: Ingredient-Only Model¶

  1. Ingredients alone are highly predictive of product healthiness.
    The ingredient-only model achieved 86% accuracy—comparable to models that also included additives or brand—showing that additives offer limited added value and ingredient composition already captures key health signals.

  2. Predictive ingredients align with nutritional intuition.

    • "Palm oil", "syrup", and "sugar" were strong indicators of unhealthy products.
    • "Beans", "lettuce", "fiber", and "water" were strong indicators of healthier items.
  3. Implication:
    Even without full nutrition facts or additive disclosures, consumers can make smarter food choices simply by reading the ingredient list. Avoiding a few key red-flag ingredients can reliably steer purchases toward healthier products (see the toy sketch below).
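
As a hypothetical illustration of that idea, here is a minimal red-flag checker. The token list is an assumption chosen for demonstration, not the model's learned vocabulary:

In [ ]:
# Hypothetical sketch: surface red-flag tokens in an ingredient string.
# RED_FLAGS is an assumed example list, not output from the trained model.
RED_FLAGS = ['palm oil', 'syrup', 'hydrogenated']

def find_red_flags(ingredient_text: str) -> list:
    text = ingredient_text.lower()
    return [flag for flag in RED_FLAGS if flag in text]

print(find_red_flags('sugar, glucose syrup, vegetable fat (hydrogenated palm oil)'))
# -> ['palm oil', 'syrup', 'hydrogenated']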

4.4 Nutri Feature Importance Analysis (Random Forest)¶

To better understand which nutritional features most strongly influence the prediction of whether a product is healthy or unhealthy, we trained a Random Forest classifier and analyzed the resulting feature importances.

The plot below shows the relative importance of each feature in the classification task.

In [ ]:
# Suppose your input data includes structured features like sugars_100g, salt_100g, etc.
# Step 1: Define your feature columns
numeric_features = [
    'additives_n',
    'ingredients_from_palm_oil_n',
    'ingredients_that_may_be_from_palm_oil_n',
    'energy_100g',
    'fat_100g',
    'saturated-fat_100g',
    'trans-fat_100g',
    'cholesterol_100g',
    'carbohydrates_100g',
    'sugars_100g',
    'fiber_100g',
    'proteins_100g',
    'salt_100g',
    'sodium_100g',
    'vitamin-a_100g',
    'vitamin-c_100g',
    'calcium_100g',
    'iron_100g'
]

# Step 2: Prepare X and y
X = df_final[numeric_features].fillna(0)  # Fill NaNs with 0 or use better imputation
y = df_final['score_binary']  # Target: 'healthy' or 'unhealthy'

# Step 3: Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=10, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

# Step 5: Feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': numeric_features,
    'importance': importances
}).sort_values(by='importance', ascending=False)

# Step 6: Print feature importance
print("Feature Importance:")
print(feature_importance_df)

# Step 7: Plot feature importance
plt.figure(figsize=(10,6))
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.title('Feature Importance – Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
Feature Importance:
                                    feature  importance
5                        saturated-fat_100g    0.203925
9                               sugars_100g    0.182795
4                                  fat_100g    0.130943
12                                salt_100g    0.118942
13                              sodium_100g    0.101605
3                               energy_100g    0.098835
8                        carbohydrates_100g    0.047869
10                               fiber_100g    0.036596
11                            proteins_100g    0.031020
7                          cholesterol_100g    0.013527
17                                iron_100g    0.012762
16                             calcium_100g    0.008211
15                           vitamin-c_100g    0.005458
14                           vitamin-a_100g    0.003753
0                               additives_n    0.003343
2   ingredients_that_may_be_from_palm_oil_n    0.000210
6                            trans-fat_100g    0.000206
1               ingredients_from_palm_oil_n    0.000000

🚀 Summary of Key Findings – Section 4.4¶

  1. Sugars and Saturated Fat Are the Top Predictors
    sugars_100g and saturated-fat_100g are the two most influential features in predicting whether a product is healthy or unhealthy. This confirms their central role in the Nutri-Score algorithm.

  2. Salt, Total Fat, Sodium, and Energy Also Matter
    These features, especially salt_100g and sodium_100g, rank just behind sugar and saturated fat, reinforcing their impact on health classification.

  3. This analysis speaks directly to a practical follow-up question: if brands reduce sugar or salt, will their Nutri-Scores improve?
    Yes, because sugar and salt are among the strongest predictors, brands can meaningfully improve health scores by lowering these ingredients.

Implication:¶

Our model highlights that products low in sugar, saturated fat, and salt are far more likely to be classified as healthy. Consumers can use this insight to prioritize items with reduced sugar/salt content — even before checking the official Nutri-Score.

5ïžâƒŁ Regression Results and Interpretation¶

In this section, we analyze the OLS estimates linking a product’s nutritional composition, additive usage, non-linear interactions, and top-brand affiliation to its French Nutrition Score (nutrition-score-fr_100g). We focus on coefficient signs, statistical significance, and overall model diagnostics to understand which factors most strongly drive a food’s health rating.

In [ ]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

# 5.1 Continuous predictors (drop sodium_100g due to collinearity)
cont_vars = [
    'fat_100g','saturated-fat_100g','trans-fat_100g',
    'cholesterol_100g','carbohydrates_100g','sugars_100g','fiber_100g',
    'proteins_100g','salt_100g',
    'vitamin-a_100g','vitamin-c_100g',
    'calcium_100g','iron_100g',
    'ingredients_from_palm_oil_n','ingredients_that_may_be_from_palm_oil_n'
]
outcome = 'nutrition-score-fr_100g'

# 5.2 Clean & cast
df_reg = df_final.dropna(subset=cont_vars + ['additives_n', outcome, 'brands']).copy()
for c in cont_vars + ['additives_n', outcome]:
    df_reg[c] = pd.to_numeric(df_reg[c], errors='coerce')
df_reg.dropna(subset=cont_vars + ['additives_n', outcome, 'brands'], inplace=True)

# 5.3 Feature engineering
df_reg['log_additives'] = np.log1p(df_reg['additives_n'])
X = df_reg[cont_vars + ['log_additives']].copy()

# non-linear terms
X['sugars_sq']  = X['sugars_100g'] ** 2
X['fat_sq']     = X['fat_100g'] ** 2

# interaction terms
X['sugar_fat']  = X['sugars_100g'] * X['fat_100g']
X['sugar_salt'] = X['sugars_100g'] * X['salt_100g']
X['fat_salt']   = X['fat_100g']    * X['salt_100g']

# 5.4 Brand fixed effects (top 10 + “other”)
top_brands = df_reg['brands'].value_counts().nlargest(10).index
df_reg['brand_top10'] = df_reg['brands'].where(
    df_reg['brands'].isin(top_brands), 'other'
)
brand_dummies = pd.get_dummies(df_reg['brand_top10'],
                               prefix='brand', drop_first=True)
X = pd.concat([X, brand_dummies], axis=1)

y = df_reg[outcome].astype(float)

# 5.5 Standardize continuous regressors
to_scale = cont_vars + ['log_additives','sugars_sq','fat_sq',
                        'sugar_fat','sugar_salt','fat_salt']
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X[to_scale]),
    columns=to_scale,
    index=X.index
)

# rebuild design matrix
X_design = pd.concat([X_scaled, X.drop(columns=to_scale)], axis=1)
X_design = sm.add_constant(X_design)

# 5.6 VIF check (numeric only)
X_vif = X_design.drop(columns=['const']).select_dtypes(include=[np.number])
vif = pd.DataFrame({
    'variable': X_vif.columns,
    'VIF': [variance_inflation_factor(X_vif.values, i)
            for i in range(X_vif.shape[1])]
})
print("Top 10 VIFs:\n", vif.sort_values('VIF', ascending=False).head(10))

# 5.7 Fit OLS
model = sm.OLS(y, X_design.astype(float)).fit()
print(model.summary())
/usr/local/lib/python3.11/dist-packages/statsmodels/regression/linear_model.py:1784: RuntimeWarning: invalid value encountered in scalar divide
  return 1 - self.ssr/self.uncentered_tss
Top 10 VIFs:
               variable        VIF
5          sugars_100g  15.340163
0             fat_100g  12.341447
16           sugars_sq  10.255979
17              fat_sq   6.509564
18           sugar_fat   4.042129
1   saturated-fat_100g   3.478562
4   carbohydrates_100g   2.254807
20            fat_salt   2.193900
8            salt_100g   1.436269
7        proteins_100g   1.431740
                               OLS Regression Results                              
===================================================================================
Dep. Variable:     nutrition-score-fr_100g   R-squared:                       0.829
Model:                                 OLS   Adj. R-squared:                  0.829
Method:                      Least Squares   F-statistic:                 1.310e+04
Date:                     Sun, 27 Apr 2025   Prob (F-statistic):               0.00
Time:                             03:25:10   Log-Likelihood:            -2.1933e+05
No. Observations:                    80946   AIC:                         4.387e+05
Df Residuals:                        80915   BIC:                         4.390e+05
Df Model:                               30                                         
Covariance Type:                 nonrobust                                         
===========================================================================================================
                                              coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------
const                                      10.4594      0.138     75.540      0.000      10.188      10.731
fat_100g                                    6.3398      0.045    141.074      0.000       6.252       6.428
saturated-fat_100g                          2.9176      0.024    122.380      0.000       2.871       2.964
trans-fat_100g                              0.0690      0.013      5.372      0.000       0.044       0.094
cholesterol_100g                            0.0041      0.013      0.320      0.749      -0.021       0.029
carbohydrates_100g                          1.0818      0.019     56.326      0.000       1.044       1.119
sugars_100g                                 7.0679      0.050    141.073      0.000       6.970       7.166
fiber_100g                                 -1.7139      0.015   -116.426      0.000      -1.743      -1.685
proteins_100g                               0.4292      0.015     28.065      0.000       0.399       0.459
salt_100g                                   1.1982      0.015     78.219      0.000       1.168       1.228
vitamin-a_100g                              0.1131      0.013      8.547      0.000       0.087       0.139
vitamin-c_100g                             -0.0012      0.013     -0.094      0.925      -0.027       0.024
calcium_100g                                0.0039      0.015      0.269      0.788      -0.025       0.033
iron_100g                                   0.0067      0.013      0.524      0.600      -0.018       0.032
ingredients_from_palm_oil_n              1.861e-15   1.08e-16     17.158      0.000    1.65e-15    2.07e-15
ingredients_that_may_be_from_palm_oil_n    -0.0596      0.013     -4.471      0.000      -0.086      -0.033
log_additives                               0.3502      0.014     25.118      0.000       0.323       0.378
sugars_sq                                  -3.0089      0.041    -73.492      0.000      -3.089      -2.929
fat_sq                                     -3.0546      0.033    -93.592      0.000      -3.119      -2.991
sugar_fat                                  -1.7688      0.026    -68.835      0.000      -1.819      -1.718
sugar_salt                                  0.3834      0.015     25.229      0.000       0.354       0.413
fat_salt                                    1.2758      0.019     67.397      0.000       1.239       1.313
brand_food club                             1.4857      0.217      6.834      0.000       1.060       1.912
brand_great value                           1.0948      0.190      5.762      0.000       0.722       1.467
brand_kroger                                0.6213      0.181      3.434      0.001       0.267       0.976
brand_meijer                                0.7338      0.180      4.066      0.000       0.380       1.088
brand_other                                 0.4821      0.139      3.466      0.001       0.209       0.755
brand_roundy's                              0.5039      0.198      2.543      0.011       0.116       0.892
brand_shoprite                             -0.4257      0.219     -1.941      0.052      -0.856       0.004
brand_spartan                               0.8256      0.189      4.362      0.000       0.455       1.197
brand_target stores                         0.8649      0.216      4.002      0.000       0.441       1.288
brand_weis                                  0.5189      0.203      2.561      0.010       0.122       0.916
==============================================================================
Omnibus:                    12667.619   Durbin-Watson:                   1.142
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           205474.026
Skew:                          -0.207   Prob(JB):                         0.00
Kurtosis:                      10.794   Cond. No.                     1.35e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.53e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

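Note [2] above is worth taking seriously: the near-zero smallest eigenvalue (and the ≈10⁻Âč⁔ coefficient on ingredients_from_palm_oil_n) point to a degenerate, near-collinear column. A minimal diagnostic sketch using variance inflation factors, assuming the design matrix is available as a DataFrame X (hypothetical name):

In [ ]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is assumed to hold one column per regressor (without the constant).
def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance inflation factor for every regressor, worst first."""
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return (pd.DataFrame({"feature": X.columns, "VIF": vifs})
              .sort_values("VIF", ascending=False))

# Regressors with VIF >> 10 (or inf) are candidates to drop before refitting.
# print(vif_table(X).head(10))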
Summary of Key Findings¶

Overall Fit:

  • RÂČ = 0.831, Adj RÂČ = 0.831 (F = 1.28 × 10⁎, p < 0.001)
  • Model explains over 83% of the variance in nutrition scores

Primary Macronutrient Effects:

  ‱ Sugar is the strongest score-raising (health-worsening) driver (coef = +7.068, p < 0.001)
  ‱ Total fat (+6.340, p < 0.001) and saturated fat (+2.918, p < 0.001) both worsen the score
  ‱ Fiber has the largest health-improving effect (coef = –1.714, p < 0.001)
  ‱ Carbohydrates also worsen the score, though far more mildly than sugar (+1.082, p < 0.001)

Non-Linear & Interaction Dynamics:

  ‱ Diminishing Returns: sugarsÂČ (–3.009, p < 0.001) and fatÂČ (–3.055, p < 0.001) show that incremental sugar/fat matters less at already-high levels

Interactions:

  ‱ sugar×salt (+0.383, p < 0.001) and fat×salt (+1.276, p < 0.001) amplify the combined negative health impact
  ‱ sugar×fat (–1.769, p < 0.001) slightly offsets the combined sugar/fat harm

Additives & Palm-Oil Flags:

  ‱ Log additive count raises (worsens) the score (+0.350, p < 0.001); the log form implies each extra additive matters less than the last
  ‱ The “may-be-from-palm-oil” flag slightly lowers the score (–0.060, p < 0.001); the “from-palm-oil” coefficient is numerically zero (≈1.9 × 10⁻Âč⁔), an artifact of the collinearity flagged in the model notes rather than a real effect

Micronutrients:

  • Vitamin A small positive effect (+0.113, p < 0.001)
  • Vitamin C, calcium, iron all non-significant (p > 0.05)

Brand Fixed Effects:

  ‱ “Food Club” (+1.486, p < 0.001), “Great Value” (+1.095, p < 0.001), and “Kroger” (+0.621, p = 0.001) indicate systematic brand differences
  ‱ “Shoprite” is marginally lower (–0.426, p ≈ 0.05)

Implications for Product Reformulation¶

  1. Targeting sugar, fat, and salt reductions will yield the largest score improvements, but the negative squared terms mean gains taper at very high starting levels (see the fitting sketch below).
  2. Boosting fiber remains an effective way to improve health ratings.
  3. Additive count matters, but its effect tapers (log form), so focus on removing the most harmful additives first.
  4. Brand-level practices significantly shift scores; comparing peers can reveal best-practice benchmarks.

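To make the non-linear and interaction terms concrete, here is a minimal sketch of the kind of feature engineering and OLS fit that produces a summary like the one above. The DataFrame name df and the nutrition_score and brand column names are hypothetical; the regressor names follow the printed table.

In [ ]:
import pandas as pd
import statsmodels.api as sm

# df is assumed to be the cleaned, standardized modeling table (hypothetical name).
df["sugars_sq"]  = df["sugars_100g"] ** 2               # diminishing-returns terms
df["fat_sq"]     = df["fat_100g"] ** 2
df["sugar_fat"]  = df["sugars_100g"] * df["fat_100g"]   # pairwise interactions
df["sugar_salt"] = df["sugars_100g"] * df["salt_100g"]
df["fat_salt"]   = df["fat_100g"]   * df["salt_100g"]

# One-hot encode brand fixed effects (drop_first avoids the dummy-variable trap).
X = pd.get_dummies(df.drop(columns=["nutrition_score"]), columns=["brand"], drop_first=True)
y = df["nutrition_score"]

model = sm.OLS(y, sm.add_constant(X.astype(float))).fit()
print(model.summary())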
đŸ’Ș Challenges¶

Throughout the project, we faced several challenges that influenced our analysis:

  • Large Initial Dataset and High Dimensionality:

    At the beginning, the dataset had over 100 columns and hundreds of thousands of rows, many of which were irrelevant or severely incomplete. Handling such a wide, messy dataset required extensive feature selection and missing-value filtering before any meaningful analysis could begin.

  • Exploratory Data Analysis (EDA) Overload:

    Due to the high dimensionality of the data, EDA itself became a major challenge: a large number of features had to be carefully plotted and analyzed to understand their distributions, relationships, and usefulness before any modeling could begin. Exploration took far more effort than initially expected.

  • Ingredient Text Cleaning Complexity:

    Ingredient lists were highly unstructured, varying by brand, language (English, French, German), formatting, and punctuation. Designing cleaning rules robust enough to generalize across these variations took many trial-and-error iterations (a small normalization sketch follows this list).

  • Difficulty in Identifying Clustering Features:

    Clustering products was harder than initially expected. Given the wide variety of packaged foods and the high number of features, it was not obvious which dimensions mattered most for meaningful clusters; without proper feature reduction, clusters risked becoming blurred and uninterpretable (see the reduction-and-clustering sketch after this list).
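
As a flavor of the cleaning rules involved, here is a minimal normalization sketch built on the re and unidecode utilities imported in the setup cell; the rules used in the project were considerably more extensive.

In [ ]:
import re
from unidecode import unidecode

def normalize_ingredients(text: str) -> list[str]:
    """Roughly normalize a raw ingredient string into a token list."""
    text = unidecode(text).lower()                 # strip accents (French/German entries)
    text = re.sub(r"\([^)]*\)", " ", text)         # drop parenthetical qualifiers like (23%)
    text = re.sub(r"[^a-z, ]+", " ", text)         # keep only letters, commas, spaces
    tokens = [t.strip() for t in text.split(",")]  # split on ingredient separators
    return [t for t in tokens if t]

print(normalize_ingredients("CRÈME, huile de palme (23%), noisettes"))
# -> ['creme', 'huile de palme', 'noisettes']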

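And a minimal sketch of one reduce-then-cluster recipe, using the PCA, KMeans, and silhouette_score imports from the setup cell; X_scaled is assumed to be the standardized nutrient feature matrix (hypothetical name):

In [ ]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X_scaled is assumed to be the standardized nutrient feature matrix (hypothetical).
X_reduced = PCA(n_components=5, random_state=42).fit_transform(X_scaled)

# Scan a small range of k and compare silhouette scores; higher is better.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_reduced)
    print(f"k={k}: silhouette={silhouette_score(X_reduced, labels):.3f}")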
📊 Next Steps¶

Building on our findings, there are several promising directions for future improvement and research:

  • Incorporate Additive Risk Scoring:

    Develop a supplementary risk score based on the type and number of additives (e.g., using EFSA or FDA categorizations) to add predictive power beyond Nutri-Score labels (a toy sketch follows this list).

  • Enhance Text Modeling Approaches:

    Explore advanced NLP techniques (e.g., Word2Vec embeddings, BERT fine-tuning) to better capture subtle relationships between ingredient wording and product healthiness.

  • Experiment with Ensemble Models:

    Combine predictions from structured features (nutrition facts) and unstructured text (ingredients) using ensemble learning to maximize model robustness.

  • Cluster Interpretation Enhancement:

    Apply deeper interpretation methods such as SHAP values to cluster profiles to better explain which features drive each cluster’s distinct nutritional signature.

  • Interactive Consumer Tool Prototype:

    Build a simple web-based tool that allows users to input an ingredient list and receive a predicted Nutri-Score or healthiness rating, empowering consumers to make healthier food choices.

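As a concrete starting point for the first item above, a toy sketch of such a risk score; the category weights and E-number mapping below are purely hypothetical placeholders, not EFSA or FDA values.

In [ ]:
# Hypothetical risk weights per additive category -- placeholders to be
# replaced with EFSA/FDA-derived categorizations.
RISK_WEIGHTS = {"colorant": 2.0, "preservative": 1.5, "emulsifier": 1.0, "other": 0.5}

# Hypothetical mapping from E-number tag to category (illustrative entries only).
ADDITIVE_CATEGORY = {"e102": "colorant", "e211": "preservative", "e322": "emulsifier"}

def additive_risk_score(additive_tags: list[str]) -> float:
    """Sum category risk weights over a product's additive tags."""
    return sum(RISK_WEIGHTS[ADDITIVE_CATEGORY.get(tag, "other")]
               for tag in additive_tags)

print(additive_risk_score(["e102", "e322", "e999"]))  # 2.0 + 1.0 + 0.5 = 3.5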
✹ Final Unwrapping¶

At the outset of this project, we set out to "unwrap the secrets" behind the nutritional makeup of U.S. packaged foods.
What began as a messy ocean of ingredient lists, additives, and nutrition grades gradually took shape—through careful cleaning, exploration, modeling, and critical interpretation.

Along the way, we uncovered striking patterns:

  • Sugar and fat remain the dominant drivers of Nutri-Score health ratings.
  • Additives, while often overlooked, quietly shape product profiles and consumer perceptions.
  • Brand practices leave a distinct fingerprint on nutritional quality, revealing hidden structures beneath the labels.

Yet, our analysis also highlighted the blind spots:

  • Nutri-Scores, while powerful, don't fully capture the complexity of food processing and additive risks.
  • Ingredient lists, with their chaotic variability, pose an ongoing challenge for clean modeling.

Looking forward, the journey doesn't end here.
Our findings open promising paths: smarter risk scoring for additives, enhanced text-based modeling, and interactive consumer tools to bridge the gap between data and healthier choices.

In a world awash with packaged options, peeling back the layers of nutrition is more critical—and more possible—than ever.

Nutri-Score Label