Unwrapping the Secrets: Nutritional Analytics of U.S. Packaged Foods¶
🕵🏻‍♀️ What is it about?¶
In this project, we explore the nutritional properties of packaged foods in the United States, aiming to uncover hidden patterns and examine how nutritional features are reflected in Nutri-Score labeling.
Why Is It Interesting?¶
As consumers become increasingly health-conscious, understanding the connection between what's inside our food and how it is nutritionally scored has never been more important. In addition, with the rising concern surrounding food additives, it is important for us to provide insights that can inform healthier food choices.
Key Research Questions:¶
- Does Nutri-Score reflect the presence of food additives, or are there potential inconsistencies?
- Do certain brands consistently use more additives and tend to receive lower Nutri-Scores?
- How do food additives and healthiness vary across different brands of packaged foods in the U.S.?
Where Does Our Data Come From?¶
We use the Open Food Facts dataset from Kaggle, focusing on packaged foods available in the U.S. The dataset contains over 300,000 entries with detailed nutritional, additive, and ingredient information.
👉 You can also take a look at the project website: https://world.openfoodfacts.org/
🤔 Before we start, you might wonder, "What is Nutri-Score labeling?"
Nutri-Score is a front-of-pack nutrition labeling system designed to help consumers make healthier food choices at a glance. It rates the overall nutritional quality of food products using a five-color and five-letter scale, from A (dark green) for the healthiest options to E (red) for the least healthy ones.
It was developed in France in 2017, but it is actually pretty common to see in supermarkets today! 😊

1️⃣ Data Loading and Preprocessing¶
1.1 Setting Up Data¶
Before diving into the analysis, let's make sure we have all the necessary tools and data prepared. We will start by importing essential libraries, setting up any required packages, and retrieving the Open Food Facts dataset from Kaggle. ✨ Let's get everything ready!
!pip install umap-learn -q
!pip install unidecode
Collecting unidecode Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB) Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB) Installing collected packages: unidecode Successfully installed unidecode-1.4.0
# Basic utilities
import os
import re
import difflib
import unidecode
from collections import Counter
from itertools import chain
# Data processing
import pandas as pd
import numpy as np
# Text processing
import nltk
from nltk.corpus import stopwords
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
# Dimensionality reduction and clustering
import umap
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# Preprocessing and model selection
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
# Evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
# Dataset download
import kagglehub
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
# Download Open Food Facts dataset from Kaggle
path = kagglehub.dataset_download("openfoodfacts/world-food-facts")
print("Path to dataset files:", path)
Path to dataset files: /kaggle/input/world-food-facts
# List available files
files = os.listdir(path)
print("Files in dataset:", files)
Files in dataset: ['en.openfoodfacts.org.products.tsv']
# Load the main data file
file_path = os.path.join(path, 'en.openfoodfacts.org.products.tsv')
df = pd.read_csv(file_path, sep='\t', low_memory=False)
1.2 Exploring the Raw Dataset¶
Now that we have successfully loaded the Open Food Facts dataset, let's take a quick look at its overall structure. We will first check the number of rows and columns, and then examine the names and data types of all available attributes.
# dataset review
df.head()
| code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | generic_name | quantity | ... | fruits-vegetables-nuts_100g | fruits-vegetables-nuts-estimate_100g | collagen-meat-protein-ratio_100g | cocoa_100g | chlorophyl_100g | carbon-footprint_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | glycemic-index_100g | water-hardness_100g | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000000003087 | http://world-en.openfoodfacts.org/product/0000... | openfoodfacts-contributors | 1474103866 | 2016-09-17T09:17:46Z | 1474103893 | 2016-09-17T09:18:13Z | Farine de blé noir | NaN | 1kg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 0000000004530 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Banana Chips Sweetened (Whole) | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 | 14.0 | NaN | NaN |
| 2 | 0000000004559 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Peanuts | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN |
| 3 | 0000000016087 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055731 | 2017-03-09T10:35:31Z | 1489055731 | 2017-03-09T10:35:31Z | Organic Salted Nut Mix | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | 12.0 | NaN | NaN |
| 4 | 0000000016094 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055653 | 2017-03-09T10:34:13Z | 1489055653 | 2017-03-09T10:34:13Z | Organic Polenta | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 163 columns
# data shape
print("Data rows and columns number:", df.shape)
Data rows and columns number: (356027, 163)
# check all columns' name and datatype:
print("All Columns and Their Data Types in the Dataset:\n")
print(f"{'Column Name':<50} {'Type'}")
print("-" * 60)
for col in df.columns:
print(f"{col:<50} {df[col].dtype}")
All Columns and Their Data Types in the Dataset: Column Name Type ------------------------------------------------------------ code object url object creator object created_t object created_datetime object last_modified_t object last_modified_datetime object product_name object generic_name object quantity object packaging object packaging_tags object brands object brands_tags object categories object categories_tags object categories_en object origins object origins_tags object manufacturing_places object manufacturing_places_tags object labels object labels_tags object labels_en object emb_codes object emb_codes_tags object first_packaging_code_geo object cities object cities_tags object purchase_places object stores object countries object countries_tags object countries_en object ingredients_text object allergens object allergens_en object traces object traces_tags object traces_en object serving_size object no_nutriments float64 additives_n float64 additives object additives_tags object additives_en object ingredients_from_palm_oil_n float64 ingredients_from_palm_oil float64 ingredients_from_palm_oil_tags object ingredients_that_may_be_from_palm_oil_n float64 ingredients_that_may_be_from_palm_oil float64 ingredients_that_may_be_from_palm_oil_tags object nutrition_grade_uk float64 nutrition_grade_fr object pnns_groups_1 object pnns_groups_2 object states object states_tags object states_en object main_category object main_category_en object image_url object image_small_url object energy_100g float64 energy-from-fat_100g float64 fat_100g float64 saturated-fat_100g float64 -butyric-acid_100g float64 -caproic-acid_100g float64 -caprylic-acid_100g float64 -capric-acid_100g float64 -lauric-acid_100g float64 -myristic-acid_100g float64 -palmitic-acid_100g float64 -stearic-acid_100g float64 -arachidic-acid_100g float64 -behenic-acid_100g float64 -lignoceric-acid_100g float64 -cerotic-acid_100g float64 -montanic-acid_100g float64 -melissic-acid_100g float64 monounsaturated-fat_100g float64 polyunsaturated-fat_100g float64 omega-3-fat_100g float64 -alpha-linolenic-acid_100g float64 -eicosapentaenoic-acid_100g float64 -docosahexaenoic-acid_100g float64 omega-6-fat_100g float64 -linoleic-acid_100g float64 -arachidonic-acid_100g float64 -gamma-linolenic-acid_100g float64 -dihomo-gamma-linolenic-acid_100g float64 omega-9-fat_100g float64 -oleic-acid_100g float64 -elaidic-acid_100g float64 -gondoic-acid_100g float64 -mead-acid_100g float64 -erucic-acid_100g float64 -nervonic-acid_100g float64 trans-fat_100g float64 cholesterol_100g float64 carbohydrates_100g float64 sugars_100g float64 -sucrose_100g float64 -glucose_100g float64 -fructose_100g float64 -lactose_100g float64 -maltose_100g float64 -maltodextrins_100g float64 starch_100g float64 polyols_100g float64 fiber_100g float64 proteins_100g float64 casein_100g float64 serum-proteins_100g float64 nucleotides_100g float64 salt_100g float64 sodium_100g float64 alcohol_100g float64 vitamin-a_100g float64 beta-carotene_100g float64 vitamin-d_100g float64 vitamin-e_100g float64 vitamin-k_100g float64 vitamin-c_100g float64 vitamin-b1_100g float64 vitamin-b2_100g float64 vitamin-pp_100g float64 vitamin-b6_100g float64 vitamin-b9_100g float64 folates_100g float64 vitamin-b12_100g float64 biotin_100g float64 pantothenic-acid_100g float64 silica_100g float64 bicarbonate_100g float64 potassium_100g float64 chloride_100g float64 calcium_100g float64 phosphorus_100g float64 iron_100g float64 magnesium_100g float64 zinc_100g float64 copper_100g float64 
manganese_100g float64 fluoride_100g float64 selenium_100g float64 chromium_100g float64 molybdenum_100g float64 iodine_100g float64 caffeine_100g float64 taurine_100g float64 ph_100g float64 fruits-vegetables-nuts_100g float64 fruits-vegetables-nuts-estimate_100g float64 collagen-meat-protein-ratio_100g float64 cocoa_100g float64 chlorophyl_100g float64 carbon-footprint_100g float64 nutrition-score-fr_100g float64 nutrition-score-uk_100g float64 glycemic-index_100g float64 water-hardness_100g float64
1.3 Attribute Key Definitions¶
While it may be overwhelming to explore every column in the dataset individually, we prepare for the upcoming analysis by organizing key attributes into major categories. Below is a list of the main columns and their definitions from the Open Food Facts dataset:
🔹 Identifiers
- code: Barcode of the product.
- url: URL of the product page on the Open Food Facts website.
- creator, created_t, created_datetime: Creator and timestamp of product entry creation.
- last_modified_t, last_modified_datetime: Timestamps for when the entry was last updated.
🔹 Product Description
- product_name, generic_name: Commercial and general name of the product.
- quantity: Product quantity (e.g., "500g", "2L").
- packaging, packaging_tags: Text description of packaging materials (e.g., plastique, plastic, glass jar).
- brands: Commercial brand names and standardized brand identifiers.
- brands_tags: Provides normalized lowercase brand tokens for grouping.
- categories, categories_tags, categories_en: Product classification into food categories.
🔹 Geographic & Origin Information
- origins, manufacturing_places: Source or production location.
- countries, countries_en: Market availability of the product.
🔹 Labeling and Compliance
- labels, labels_en, labels_tags: Certifications and diet types (e.g., organic, halal).
- emb_codes, emb_codes_tags: Packaging codes (e.g., recycling info).
- purchase_places, stores: Points of purchase and stores where sold.
🔹 Ingredients & Allergens
- ingredients_text: Raw text of the ingredients list.
- allergens, allergens_en: Allergen content (e.g., nuts, gluten).
- traces, traces_en: Potential traces of allergens not listed as ingredients.
- additives, additives_n, additives_tags: Additive info and count.
- ingredients_from_palm_oil, ingredients_that_may_be_from_palm_oil: Palm oil usage certainty and estimation.
🔹 Nutrition Grades & Groups
- nutrition_grade_fr, nutrition_grade_uk: Health grade from A (best) to E (worst).
- pnns_groups_1, pnns_groups_2: French public health nutrition food groupings.
🔹 Nutritional Composition (per 100g)
- Energy & Macronutrients: energy_100g, fat_100g, sugars_100g, proteins_100g, fiber_100g, salt_100g, sodium_100g
- Fatty Acids (Mono/Poly/Saturated): monounsaturated-fat_100g, omega-3-fat_100g, -palmitic-acid_100g, etc.
- Sugars & Derivatives: -sucrose_100g, -glucose_100g, -lactose_100g, starch_100g, polyols_100g
- Vitamins & Minerals: vitamin-a_100g, vitamin-c_100g, iron_100g, zinc_100g, calcium_100g, iodine_100g, etc.
- Other Features: alcohol_100g, caffeine_100g, ph_100g, carbon-footprint_100g, glycemic-index_100g
1.4 Checking Nulls and Duplicates¶
Before diving into detailed data analysis, we first check for duplicate entries and missing values. We also calculate the percentage of missing values for each column and identify columns with 100% missing values for potential removal.
# Find number of duplicated rows in the dataset
num_duplicates = df.duplicated().sum()
print(f"The number of duplicated rows: {num_duplicates}")
The number of duplicated rows: 0
# Count missing values per column
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print("Missing Values Count per Column:")
display(missing_counts)
Missing Values Count per Column:
| Column | Missing Count |
|---|---|
| no_nutriments | 356027 |
| chlorophyl_100g | 356027 |
| water-hardness_100g | 356027 |
| glycemic-index_100g | 356027 |
| -butyric-acid_100g | 356027 |
| ... | ... |
| code | 26 |
| url | 26 |
| created_datetime | 10 |
| creator | 3 |
| created_t | 3 |
161 rows × 1 columns
➡️ There are no duplicate rows in the dataset; a clean start is always a good start.
# Calculate missing value percentage
total_rows = df.shape[0]
missing_percent = (missing_counts / total_rows) * 100
print("The Missing Values (% of total rows) for Each Column:")
display(missing_percent.sort_values(ascending=False))
The Missing Values (% of total rows) for Each Column:
| Column | Missing (%) |
|---|---|
| no_nutriments | 100.000000 |
| chlorophyl_100g | 100.000000 |
| water-hardness_100g | 100.000000 |
| glycemic-index_100g | 100.000000 |
| -butyric-acid_100g | 100.000000 |
| ... | ... |
| code | 0.007303 |
| url | 0.007303 |
| created_datetime | 0.002809 |
| creator | 0.000843 |
| created_t | 0.000843 |
161 rows × 1 columns
➡️ We observe that many columns have missing entries. Next, we identify the columns that are entirely empty. 🔍
# Identify columns with 100% missing values
full_missing_cols = missing_percent[missing_percent == 100].index
print("Columns with 100% missing values:")
for col in full_missing_cols:
print(col)
Columns with 100% missing values: no_nutriments chlorophyl_100g water-hardness_100g glycemic-index_100g -butyric-acid_100g -melissic-acid_100g -nervonic-acid_100g -erucic-acid_100g -mead-acid_100g -elaidic-acid_100g -caproic-acid_100g -lignoceric-acid_100g -cerotic-acid_100g nutrition_grade_uk ingredients_from_palm_oil ingredients_that_may_be_from_palm_oil
➡️ The columns above have 100% missing values and will be dropped during preprocessing.
1.5 Define the Scope of Analysis: Focus on U.S. Data¶
Since our analysis focuses on packaged foods available in the United States, we first need to filter out records from other countries and regions using the countries_en column.
Country labels may vary in format (e.g., "USA", "U.S.", "United States of America"), and some entries list multiple countries (e.g., "Spain,United States"), introducing potential noise. To ensure accurate filtering, we explore two approaches.
Attempted Approach: Fuzzy Matching (Not Used)¶
We initially attempted to use fuzzy matching with the difflib library to capture all records loosely matching "United States":
unique_countries = df['countries_en'].dropna().unique()
similar = difflib.get_close_matches("United States", unique_countries, n=10, cutoff=0.6)
However, this approach produced many false matches that mix multiple countries, as shown:
United States; United States,香港; Peru,United States; Spain,United States; Italy,United States; Chile,United States; Taiwan,United States; Sweden,United States; Panama,United States; Mexico,United States
Second Approach: Exact Match with U.S. Labels¶
# Define accepted U.S. labels
usa_labels = ['United States', 'USA', 'U.S.', 'US', 'United States of America']
# Filter rows where countries_en exactly matches any accepted label
df_usa = df[df['countries_en'].isin(usa_labels)].copy()
We then compared the size of the dataset before and after filtering:
# Print result
print("Before filtering, rows and columns:", df.shape)
print(f"After filtering, retained {df_usa.shape[0]} rows out of {df.shape[0]} "
f"({df_usa.shape[0]/df.shape[0]:.1%}) after filtering U.S. labels.")
Before filtering, rows and columns: (356027, 163) After filtering, retained 173159 rows out of 356027 (48.6%) after filtering U.S. labels.
➡️ After filtering, we retained 48.6% of the original dataset, focusing purely on products sold in the United States.
Verifying the Filtered Data:
To confirm that only U.S. data remains, we sampled a few rows:
# Sample to verify country fields
df_usa[['countries', 'countries_tags', 'countries_en']].sample(10)
| countries | countries_tags | countries_en | |
|---|---|---|---|
| 69486 | US | en:united-states | United States |
| 146344 | US | en:united-states | United States |
| 107607 | US | en:united-states | United States |
| 131265 | US | en:united-states | United States |
| 36611 | US | en:united-states | United States |
| 125188 | US | en:united-states | United States |
| 298030 | United States | en:united-states | United States |
| 59010 | US | en:united-states | United States |
| 59328 | US | en:united-states | United States |
| 17235 | US | en:united-states | United States |
➡️ As expected, all entries are correctly labeled as belonging to the United States.
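Beyond eyeballing a random sample, a one-line assertion gives a programmatic guarantee. This is a quick sketch re-using the usa_labels list defined above:
# Programmatic check: every remaining row must carry one of the accepted U.S. labels
assert df_usa['countries_en'].isin(usa_labels).all(), "Non-U.S. rows slipped through!"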
1.6 Cleaning Data: Handling Missing and Duplicate Values¶
In this dataset, many columns contain a high proportion of missing values. We will drop any column whose missing rate exceeds the 80% threshold (THRESHOLD = 0.8).
The dataset df_usa to be analyzed will be stored in df_final, and it must meet the following criteria:
- At least 50,000 rows remain after cleaning and dropping null values.
- It includes a rich set of features that are intuitively relevant for prediction.
Step 1: Visualize Missing Rates Across Columns¶
# 50 columns with the lowest missing rates
missing_counts = df_usa.isnull().sum()
missing_percent = missing_counts / len(df_usa)
missing_percent_sorted = missing_percent.sort_values(ascending=True)
top_n = 50
missing_percent_top = missing_percent_sorted.head(top_n)
# Bar chart of missing rates with the 80% drop threshold
plt.figure(figsize=(20, 10))
missing_percent_top.plot(kind='bar')
plt.xlabel('Columns')
plt.ylabel('Missing Rate (fraction)')
plt.title(f'{top_n} Columns with the Lowest Missing Rates')
plt.xticks(rotation=90, ha='right')
plt.axhline(0.8, color='red', linestyle='--', linewidth=2)
plt.text(-0.5, 0.8, '80% Threshold', color='red', va='bottom', ha='left')
plt.tight_layout()
plt.show()
➡️ Observation: Several columns exceed the 80% missing rate threshold. These columns will be candidates for removal.
Step 2: Drop Columns with Excessive Missing Values¶
We now remove all columns with more than 80% missing entries.
# Set missing value threshold
THRESHOLD = 0.8
# Specify columns to drop
cols_to_drop = missing_percent[missing_percent > THRESHOLD].index
df_usa_trimmed = df_usa.drop(cols_to_drop, axis=1)
Comparison of dataset dimensions before and after dropping columns:
print("Original dataset: rows and columns", df.shape)
print("After dropping columns: rows and columns", df_usa_trimmed.shape)
Original dataset: rows and columns (356027, 163) After dropping columns: rows and columns (173159, 42)
Step 3: Drop Remaining Nulls and Duplicates¶
# Drop all nulls
df_final = df_usa_trimmed.dropna().copy()
print("After dropping remaining nulls: rows and columns", df_final.shape)
# Drop all duplicates
df_final = df_final.drop_duplicates().copy()
print("After dropping remaining duplicates: rows and columns", df_final.shape)
After dropping remaining nulls: rows and columns (82380, 42) After dropping remaining duplicates: rows and columns (82380, 42)
print("Final Check for null and duplicates:")
# Check for nulls
total_nulls = df_final.isnull().sum().sum()
print(f"Total missing values: {total_nulls}")
# Check for duplicates
duplicate_rows = df_final.duplicated().sum()
print(f"Total duplicated rows: {duplicate_rows}")
Final Check for null and duplicates: Total missing values: 0 Total duplicated rows: 0
Let's take a quick look at a few random samples from the final dataset:
# display dataset
df_final.sample(5)
| code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | brands | brands_tags | ... | fiber_100g | proteins_100g | salt_100g | sodium_100g | vitamin-a_100g | vitamin-c_100g | calcium_100g | iron_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 118826 | 0607880033342 | http://world-en.openfoodfacts.org/product/0607... | usda-ndb-import | 1489137479 | 2017-03-10T09:17:59Z | 1489137480 | 2017-03-10T09:18:00Z | Sweet & Salty Caramel Trail Mix | Southern Home | southern-home | ... | 6.7 | 13.33 | 1.01600 | 0.400 | 0.000000 | 0.0000 | 0.133 | 0.00240 | 20.0 | 20.0 |
| 142621 | 0749826575520 | http://world-en.openfoodfacts.org/product/0749... | usda-ndb-import | 1489096243 | 2017-03-09T21:50:43Z | 1489096243 | 2017-03-09T21:50:43Z | High Protein Fruit & Nut Bar | Pure Protein | pure-protein | ... | 9.4 | 33.96 | 0.76708 | 0.302 | 0.000057 | 0.0045 | 0.151 | 0.00136 | 9.0 | 9.0 |
| 159299 | 0850335006013 | http://world-en.openfoodfacts.org/product/0850... | usda-ndb-import | 1489093649 | 2017-03-09T21:07:29Z | 1489093649 | 2017-03-09T21:07:29Z | Verry Berry Fruit Pop | Squeaky Pops | squeaky-pops | ... | 0.0 | 0.00 | 0.01524 | 0.006 | 0.000000 | 0.0029 | 0.000 | 0.00000 | 5.0 | 5.0 |
| 94808 | 0077661147306 | http://world-en.openfoodfacts.org/product/0077... | usda-ndb-import | 1489138130 | 2017-03-10T09:28:50Z | 1489138130 | 2017-03-10T09:28:50Z | Opa, Greek Yogurt Roasted Garlic Dressing | Litehouse, Litehouse Inc. | litehouse,litehouse-inc | ... | 0.0 | 3.33 | 1.69418 | 0.667 | 0.000000 | 0.0000 | 0.133 | 0.00000 | 7.0 | 7.0 |
| 141907 | 0747599322013 | http://world-en.openfoodfacts.org/product/0747... | usda-ndb-import | 1489075490 | 2017-03-09T16:04:50Z | 1489075490 | 2017-03-09T16:04:50Z | Squares, Chocolate Assortment | Ghirardelli Chocolate, Ghirardelli Chocolate ... | ghirardelli-chocolate,ghirardelli-chocolate-co... | ... | 5.0 | 5.00 | 0.15748 | 0.062 | 0.000000 | 0.0030 | 0.100 | 0.00360 | 20.0 | 20.0 |
5 rows × 42 columns
We also list all retained feature columns:
# check all final retained columns' name:
print("Final retained columns:\n")
for col in df_final.columns:
print(col)
Final retained columns: code url creator created_t created_datetime last_modified_t last_modified_datetime product_name brands brands_tags countries countries_tags countries_en ingredients_text serving_size additives_n additives additives_tags additives_en ingredients_from_palm_oil_n ingredients_that_may_be_from_palm_oil_n nutrition_grade_fr states states_tags states_en energy_100g fat_100g saturated-fat_100g trans-fat_100g cholesterol_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g vitamin-a_100g vitamin-c_100g calcium_100g iron_100g nutrition-score-fr_100g nutrition-score-uk_100g
📥 You can download this CSV file! 👇🏻
# save df_final
df_final.to_csv('df_final.csv', index=False)
2️⃣ Exploratory Data Analysis (EDA)¶
We now proceed to explore the nutritional properties and labeling characteristics of U.S. packaged foods. Let's keep the momentum going!
2.1 Cleaning Numerical Features¶
Since nutrient quantities are reported per 100g, errors such as negative values or abnormally large values (e.g., sugar > 100g) can occur. Before diving into feature distribution analysis, we want to ensure that the numerical nutritional attributes fall within reasonable, physically possible ranges.
Outlier Identification and Handling
We focus on cleaning the following key nutritional features:
- Macronutrients: sugars_100g, fat_100g, proteins_100g, carbohydrates_100g, etc.
- Health-impacting features: salt_100g, sodium_100g, cholesterol_100g, fiber_100g, etc.
- Micronutrients: vitamin-a_100g, vitamin-c_100g, calcium_100g, iron_100g
# Define allowed value range
valid_range_mask = df_final[[
'saturated-fat_100g',
'trans-fat_100g',
'fat_100g',
'cholesterol_100g',
'carbohydrates_100g',
'sugars_100g',
'fiber_100g',
'proteins_100g',
'salt_100g',
'sodium_100g',
'vitamin-a_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g',
]].apply(lambda x: x.between(0, 100)).all(axis=1)
# Filter the dataset
df_final_cleaned = df_final[valid_range_mask].copy()
Compare the dataset size before and after removing outliers:
# Compare shape before and after filtering
print(f"Before outlier removal: {df_final.shape}")
print(f"After outlier removal: {df_final_cleaned.shape}")
print(f"Removed {df_final.shape[0] - df_final_cleaned.shape[0]} rows due to out-of-bound values.")
Before outlier removal: (82380, 42) After outlier removal: (82357, 42) Removed 23 rows due to out-of-bound values.
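As a quick diagnostic sketch (checked_cols and violations are illustrative names, re-using the same column list as the mask above), we can count how many out-of-range values each column contributes:
# Count out-of-range values per column to see where the violations come from
checked_cols = ['saturated-fat_100g', 'trans-fat_100g', 'fat_100g', 'cholesterol_100g',
                'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g',
                'salt_100g', 'sodium_100g', 'vitamin-a_100g', 'vitamin-c_100g',
                'calcium_100g', 'iron_100g']
violations = (~df_final[checked_cols].apply(lambda x: x.between(0, 100))).sum()
print(violations[violations > 0].sort_values(ascending=False))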
We list the products with invalid values below. ➡️ 23 out of over 82,000 rows will be removed.
df_final[~valid_range_mask][[
'trans-fat_100g',
'sugars_100g',
'salt_100g',
'sodium_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g',
]]
| trans-fat_100g | sugars_100g | salt_100g | sodium_100g | vitamin-c_100g | calcium_100g | iron_100g | |
|---|---|---|---|---|---|---|---|
| 1483 | 0.00 | 5.71 | 870.85678 | 342.857 | 0.0000 | 0.000 | 0.00617 |
| 8043 | 0.00 | 2.31 | 781.53768 | 307.692 | 0.0000 | 0.015 | 0.00138 |
| 11206 | 0.00 | 22.58 | 327.74128 | 129.032 | 0.0000 | 0.032 | 0.00232 |
| 12036 | 369.00 | 0.81 | 0.24892 | 0.098 | 0.0000 | 0.054 | 0.00073 |
| 41869 | 0.00 | 17.86 | 2.72034 | 1.071 | -0.0021 | 0.143 | 0.00129 |
| 50827 | 0.00 | 20.59 | 130.73634 | 51.471 | 0.0000 | 0.118 | 0.00106 |
| 69041 | 0.00 | 9.46 | 1098.37728 | 432.432 | 0.0000 | 0.054 | 0.00243 |
| 69050 | 0.00 | 11.27 | 858.59112 | 338.028 | 0.0000 | 0.028 | 0.00254 |
| 95375 | 0.00 | 33.33 | 1318.38192 | 519.048 | 0.0000 | 0.262 | 0.01414 |
| 107178 | 0.00 | 8.48 | 1139.15190 | 448.485 | 0.0000 | 0.012 | 0.00087 |
| 108870 | -0.70 | 20.42 | 0.22352 | 0.088 | 0.0000 | 0.070 | 0.00190 |
| 110111 | 0.00 | 50.00 | 101.60000 | 40.000 | 0.0000 | 0.000 | 0.00000 |
| 113274 | 0.00 | 50.00 | 0.30734 | 0.121 | 0.0043 | 285.714 | 0.00000 |
| 119086 | 0.00 | -3.57 | 0.95250 | 0.375 | 0.0086 | 0.071 | 0.00129 |
| 122368 | 0.00 | 65.85 | 123.90120 | 48.780 | 0.0000 | 0.000 | 0.00000 |
| 133607 | 0.00 | 0.00 | 101.23678 | 39.857 | 0.0000 | 0.000 | 0.00000 |
| 133611 | 0.00 | 0.00 | 104.86644 | 41.286 | 0.0000 | 0.000 | 0.00000 |
| 139181 | 0.00 | 0.00 | 187.96000 | 74.000 | 2.1000 | 0.000 | 0.00000 |
| 140017 | 173.26 | 1.60 | 0.80772 | 0.318 | 0.0024 | 0.027 | 0.00072 |
| 148432 | 0.00 | 7.65 | 100.85324 | 39.706 | 0.0000 | 0.000 | 0.00000 |
| 155175 | -3.57 | 10.71 | 2.44856 | 0.964 | 0.0000 | 0.071 | 0.00386 |
| 162397 | 0.00 | 0.00 | 1669.14322 | 657.143 | 0.0000 | 0.057 | 0.00309 |
| 351458 | 0.00 | 14.29 | 0.00000 | 0.000 | 0.0107 | 0.014 | -0.00026 |
df_final = df_final_cleaned.copy()
2.2 Text Preprocessing for Ingredient and Additive Fields¶
In this section, we clean the ingredient-related text fields to ensure consistency and accuracy in subsequent analyses.
The raw dataset may contain duplicate product entries, missing values, and inconsistencies in textual formatting. We address these issues through four main steps:
- Step 1: Check for and remove duplicate products based on the combination of product_name, brands, and ingredients_text.
- Step 2: Standardize ingredients_list and additives_list: we split the ingredients_text and additives_en columns into lists to make additive analysis easier in later sections.
- Step 3: Further clean ingredients_list.
Step 1: Remove Duplicate Products¶
Duplicate entries can arise due to slight variations in product names, brands, or ingredients descriptions.
If duplicate entries exist, we manually review them. We identify and remove duplicate products based on the combination of product_name, brands, and ingredients_text to avoid bias in the analysis.
# Step 1: Check and Remove Duplicates
# Normalize text: lowercase and strip spaces
df_final['product_name'] = df_final['product_name'].str.lower().str.strip()
df_final['brands'] = df_final['brands'].str.lower().str.strip()
df_final['ingredients_text'] = df_final['ingredients_text'].str.lower().str.strip()
# Find duplicates
duplicates = df_final[df_final.duplicated(subset=[ 'product_name', 'brands', 'ingredients_text'], keep=False)]
print(f"Number of duplicate entries: {duplicates.shape[0]}")
# Sort duplicates for easier review
duplicates = duplicates.sort_values(by=['product_name', 'brands', 'ingredients_text'])
# Display the potential duplicates
duplicates[['code', 'product_name', 'brands', 'ingredients_text', 'nutrition_grade_fr', 'additives_en', 'nutrition-score-fr_100g']].head(10)
Number of duplicate entries: 2597
| code | product_name | brands | ingredients_text | nutrition_grade_fr | additives_en | nutrition-score-fr_100g | |
|---|---|---|---|---|---|---|---|
| 49387 | 0041497097548 | 1% low fat chocolate milk, chocolate | weis quality, weis markets inc. | low fat milk, high fructose corn syrup, sugar,... | b | E407 - Carrageenan | 0.0 |
| 49388 | 0041497097555 | 1% low fat chocolate milk, chocolate | weis quality, weis markets inc. | low fat milk, high fructose corn syrup, sugar,... | b | E407 - Carrageenan | 0.0 |
| 57213 | 0044100106804 | 1% lowfat milk | hood, hp hood llc | lowfat milk, ascorbic acid (vitamin c), vitami... | a | E300 - Ascorbic acid | -1.0 |
| 57264 | 0044100169267 | 1% lowfat milk | hood, hp hood llc | lowfat milk, ascorbic acid (vitamin c), vitami... | a | E300 - Ascorbic acid | -1.0 |
| 54127 | 0041900074302 | 1% lowfat milk, chocolate | trumoo, dean foods company | lowfat milk, sugar, contains less than 1% of: ... | a | E407 - Carrageenan | -1.0 |
| 54182 | 0041900075712 | 1% lowfat milk, chocolate | trumoo, dean foods company | lowfat milk, sugar, contains less than 1% of: ... | a | E407 - Carrageenan | -1.0 |
| 44558 | 0041318020540 | 100% juice | schnuck markets inc. | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid | 2.0 |
| 44759 | 0041318131444 | 100% juice | schnuck markets inc. | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid | 2.0 |
| 7575 | 0011213015347 | 100% juice, tomato | spartan | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid,E330 - Citric acid | 1.0 |
| 8025 | 0011213049427 | 100% juice, tomato | spartan | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid,E330 - Citric acid | 1.0 |
Next, we remove duplicate entries, keeping only the first occurrence:
# Check original shape before removing duplicates
print(f"Before removing duplicate products: {df_final.shape}")
# Remove duplicates, keep the first occurrence
df_final = df_final.drop_duplicates(subset=['product_name', 'brands', 'ingredients_text'])
print(f"After removing duplicate products: {df_final.shape}")
Before removing duplicate products: (82357, 42) After removing duplicate products: (80946, 42)
➡️ Approximately 1.7% of the dataset entries were identified as duplicates and removed. This ensures that each product in our analysis corresponds to a unique combination of ingredients and branding, reducing potential bias in brand-level or additive-level analyses.
Step 2: Standardize ingredients_list and additives_list¶
The raw ingredients_text and additives_en fields in the dataset are stored as comma-separated strings with inconsistent casing and formatting. In this step, we standardize the ingredients_text and additives_en columns into structured list formats.
- For ingredients_list, we split the text by commas and standardize to lowercase.
- For additives_list, we split the additives string by commas and standardize to lowercase.
This formatting will make it easier to analyze specific ingredients and additives in later sections.
# Step 2: Standardize ingredients_list and additives_list
# Split ingredients_text into a list by commas
df_final['ingredients_list'] = df_final['ingredients_text'].apply(lambda x: [i.strip().lower() for i in x.split(',')] if pd.notnull(x) else [])
# Split additives_en into a list
df_final['additives_list'] = df_final['additives_en'].apply(lambda x: x.split(',') if x != 'None' else [])
# Check an example
df_final[['product_name', 'ingredients_list','additives_en', 'additives_list']].head(10)
| product_name | ingredients_list | additives_en | additives_list | |
|---|---|---|---|---|
| 82 | peanuts, mixed nuts | [peanuts, honey, coating (sucrose, wheat starc... | E415 - Xanthan gum | [E415 - Xanthan gum] |
| 149 | turkish apricots | [apricots, sulfur dioxide.] | E220 - Sulphur dioxide | [E220 - Sulphur dioxide] |
| 152 | chili mango | [dried mango, paprika, sugar, salt, citric aci... | E330 - Citric acid | [E330 - Citric acid] |
| 153 | milk chocolate pretzels | [milk chocolate (sugar, cocoa butter, chocolat... | E101 - Riboflavin,E101i - Riboflavin,E322 - Le... | [E101 - Riboflavin, E101i - Riboflavin, E322 -... |
| 200 | butter croissants | [wheat flour, butter (cream), water, yeast, su... | E300 - Ascorbic acid | [E300 - Ascorbic acid] |
| 201 | wild blueberry muffins | [enriched wheat flour (wheat flour, malted bar... | E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... | [E101 - Riboflavin, E101i - Riboflavin, E375 -... |
| 202 | bolillos | [enriched wheat flour (wheat flour niacin, red... | E101 - Riboflavin,E101i - Riboflavin,E200 - So... | [E101 - Riboflavin, E101i - Riboflavin, E200 -... |
| 203 | biscuit | [enriched wheat flour (niacin, reduced iron, t... | E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... | [E101 - Riboflavin, E101i - Riboflavin, E375 -... |
| 204 | biscuit | [enriched wheat flour (niacin, reduced iron, t... | E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... | [E101 - Riboflavin, E101i - Riboflavin, E375 -... |
| 205 | oatmeal raisin cookie | [enriched flour (bleached wheat flour, niacin,... | E101 - Riboflavin,E101i - Riboflavin,E160a - A... | [E101 - Riboflavin, E101i - Riboflavin, E160a ... |
✅ As shown above, both ingredients_list and additives_list are now structured as clean lists.
Step 3: Further Clean ingredients_text¶
While the initial standardization split the ingredients_text field into basic lists, further cleaning is necessary to ensure analytical consistency.
Many entries still contain:
- Non-English words (e.g., French and German terms).
- Marketing or domain-specific noise (e.g., "natural", "product", "ingredient").
- Numeric expressions and inconsistent punctuation.
To address these issues, we perform advanced preprocessing through several steps:
- Check for Non-English or Noisy Characters
- Expand Stopwords List
- Build Translation Dictionary
- Define Cleaning Function
- Apply Cleaning and Explore the Most Common Ingredient
(1) Check for Non-English or Noisy Characters¶
First, we scan for unusual characters or non-ASCII text in the ingredients.
df_final['ingredients_text'].dropna().sample(5)
| | ingredients_text |
|---|---|
| 22667 | premium fresh pork, water, premium fresh beef,... |
| 183109 | *solution: water, potassium lactate, sodium ph... |
| 315392 | water, tomatillo, jalapeno peppers, habanero p... |
| 136002 | soy protein isolate, organic cane syrup, organ... |
| 108289 | enriched bleached flour (wheat flour, niacin, ... |
# Define a function to count "weird" characters in each text
def count_weird_chars(text):
if pd.isna(text):
return 0
return len(re.findall(r'[^a-zA-Z0-9,\s\(\)\.\-]', str(text)))
# Apply to ingredients_text
df_final['weird_char_count'] = df_final['ingredients_text'].apply(count_weird_chars)
# Sort by weirdness descending
df_final_sorted_weird = df_final.sort_values(by='weird_char_count', ascending=False)
# Display the top 10 weirdest entries
with pd.option_context('display.max_colwidth', None):
    display(df_final_sorted_weird[['product_name', 'ingredients_text', 'weird_char_count']].head(10))
✔️ Upon inspection, the presence of unusual characters in the ingredients_text field was minimal and did not pose significant challenges for downstream text processing. Thus, no additional filtering based on weird_char_count was necessary.
(2) Expand Stopwords List¶
In text data, certain words appear frequently but contribute little meaningful information. These words are known as stopwords, for example, "the", "and", "with".
By filtering out these common but low-information words, we focus our analysis on meaningful terms such as "sugar" or "protein", which are more informative about the product's nutritional properties. We define a custom stopwords list that includes:
- Standard English stopwords.
- Common French and German food-related words.
- Marketing noise terms (e.g., "natural", "contains").
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
True
# Stopwords list: English + French + German + domain-specific
custom_stopwords = set(stopwords.words('english')) | set([
# French
'de', 'Ă ', 'le', 'la', 'les', 'du', 'et', 'des', 'pour', 'avec', 'sur',
'au', 'ou', 'par', 'en', 'lait', 'eau', 'ingrédients', 'produits',
'contient', 'valeur', 'nutrition', 'base', 'moyenne',
# German
'mit', 'von', 'der', 'das', 'und', 'ein', 'eine', 'dem', 'den', 'fĂŒr', 'ohne',
'inhaltsstoffe', 'zutaten', 'lebensmittel',
# Domain noise / marketing
'organic', 'natural', 'product', 'ingredient', 'ingredients',
'mg', 'g', 'ar', 'bl', 'fi', '’', '“', 'less', 'contains'
])
(3) Build Translation Dictionary¶
We also build a small dictionary to translate common French and German food words into English equivalents.
# Translation dictionary: French + German → English
translation_dict = {
# French
'sucre': 'sugar',
'sel': 'salt',
'huile': 'oil',
'farine': 'flour',
'poudre': 'powder',
'lait': 'milk',
'arome': 'flavor',
'chocolat': 'chocolate',
'acide': 'acid',
'fromage': 'cheese',
'cacao': 'cocoa',
'beurre': 'butter',
# German
'zucker': 'sugar',
'salz': 'salt',
'mehl': 'flour',
'milch': 'milk',
'kakao': 'cocoa',
'aroma': 'flavor',
'butter': 'butter'
}
(4) Define Cleaning Function¶
We create a cleaning function that performs:
- Lowercasing and accent removal
- Removal of numeric and marketing expressions
- Basic punctuation normalization
- Stopword filtering
- Phrase-based tokenization
# define cleaning function
def clean_and_tokenize_ingredients(text):
if pd.isna(text):
return []
# 1. Normalize to lowercase and remove accents
text = unidecode.unidecode(text.lower())
# 2. Remove numeric expressions (e.g., '2%', '100g', '25mg')
text = re.sub(r'\b\d+%?\b', ' ', text)
text = re.sub(r'\b\d+[a-z]+\b', ' ', text)
# 3. Remove special characters (keep commas and minimal punctuation)
text = re.sub(r'[^a-z0-9,\.\-\(\)\s]', ' ', text)
# 4. Normalize space
text = re.sub(r'\s+', ' ', text).strip()
# 5. Split on commas â each comma-separated item is treated as a phrase
raw_phrases = [p.strip() for p in text.split(',')]
clean_phrases = []
for phrase in raw_phrases:
# Strip trailing punctuation
phrase = phrase.strip('.,() ')
# Translate known foreign terms (optional)
phrase = ' '.join([translation_dict.get(w, w) for w in phrase.split()])
# Skip short/meaningless phrases
if len(phrase.split()) >= 2 and not any(w in custom_stopwords for w in phrase.split()):
clean_phrases.append(phrase)
return clean_phrases
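As a quick sanity check, we can run the function on a made-up ingredients string (the input below is purely illustrative):
# Demo on a synthetic ingredients string
sample = "Sucre roux, cocoa butter, soy lecithin, sea salt, 2% milk"
print(clean_and_tokenize_ingredients(sample))
# Expected: ['sugar roux', 'cocoa butter', 'soy lecithin', 'sea salt']
# Note: single-word items such as 'milk' are dropped by the >= 2-word phrase filter.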
(5) Apply Cleaning and Explore Top Ingredients¶
After cleaning and tokenizing ingredients, we now proceed to explore the most common components across packaged foods.
Firstly, we apply the cleaning function to the dataset:
df_final['clean_ingredient_tokens'] = df_final['ingredients_text'].apply(clean_and_tokenize_ingredients)
Then flatten and visualize the most common ingredient tokens:
# Flatten the list of tokens
all_words = [word for tokens in df_final['clean_ingredient_tokens'] for word in tokens]
word_counts = Counter(all_words)
# Display top 50 words
pd.Series(word_counts).sort_values(ascending=False).head(10)
| Ingredient | Count |
|---|---|
| folic acid | 19036 |
| citric acid | 18796 |
| corn syrup | 13024 |
| reduced iron | 12425 |
| thiamine mononitrate | 10395 |
| soy lecithin | 9883 |
| soybean oil | 8885 |
| xanthan gum | 7948 |
| cocoa butter | 7064 |
| sea salt | 6314 |
import matplotlib.pyplot as plt
import seaborn as sns
# Prepare DataFrame
top_ingredients = pd.Series(word_counts).sort_values(ascending=False).head(50)
# Plot
plt.figure(figsize=(12, 14))
ax = sns.barplot(x=top_ingredients.values, y=top_ingredients.index, palette='viridis')
plt.xlabel('Count')
plt.ylabel('Ingredient')
plt.title('Top 50 Most Common Ingredients in U.S. Packaged Foods')
plt.grid(axis='x', linestyle='--', alpha=0.7)
# Add count labels to the right of bars
for i, (value, name) in enumerate(zip(top_ingredients.values, top_ingredients.index)):
ax.text(value + 100, i, f'{value:,}', va='center', ha='left', fontsize=9)
plt.tight_layout()
plt.show()
In the word cloud below, words with larger sizes represent more frequently occurring ingredients.
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=600, background_color='white', colormap='tab20')
wordcloud.generate_from_frequencies(word_counts)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Ingredients Word Cloud')
plt.show()
🔬 Observations from the 50 Most Common Ingredients in U.S. Packaged Foods¶
- High use of processed sugars and fats suggests a reliance on calorie-dense ingredients, which may impact Nutri-Score ratings and perceived product healthiness.
- The frequent appearance of additives and emulsifiers indicates a strong dependence on industrial food technologies for texture stabilization and product preservation.
💊 Added Vitamins and Minerals:
Foods with added vitamins and minerals feature prominently, with ingredients like folic acid (vitamin B9), reduced iron, and thiamine mononitrate (vitamin B1) ranking high. The prevalence of vitamin and mineral fortification reflects public health policies encouraging nutrient enrichment in the U.S. packaged food industry.
🍫 Added Sugars:
Processed sugars and sweeteners, such as corn syrup, high fructose corn syrup, and brown sugar, are heavily represented, underscoring the heavy reliance on processed sugars in packaged foods.
🧪 Additives and Emulsifiers:
Additives and stabilizers including xanthan gum, guar gum, modified starches, and soy lecithin are widely used to maintain food texture and extend shelf life.
🧈 Oil-based Ingredients:
Oil-based ingredients like soy lecithin, soybean oil, cocoa butter, and canola oil are also common across products, consistent with the formulation of processed baked goods and snacks. These ingredients contribute to the caloric density and flavor profile of packaged foods.
➡️ Overall, fortified nutrients, sugars, stabilizers, and oils dominate the ingredient composition of U.S. packaged foods, painting a comprehensive picture of both nutritional enhancement efforts and industrial formulation priorities.
2.3 Initial Data Exploration¶
Having processed the textual information on additives, we now shift our focus to understanding how Nutri-Scores are distributed across the dataset. This exploration will provide an initial sense of the overall nutritional quality represented in the data.
Nutrition Grade Distribution¶
Observations from the following distribution indicate a clear imbalance across nutrition grades:
- Most packaged food products are classified under lower nutrition grades (D and E).
- Healthier food products (Grade A) are significantly underrepresented.
This imbalance suggests that the U.S. packaged food landscape is skewed towards less healthy options according to the Nutri-Score classification system.
import matplotlib.pyplot as plt
import seaborn as sns
# Total count for percentage
total = df_final['nutrition_grade_fr'].value_counts().sum()
# Plotting the distribution of Nutri-Scores
plt.figure(figsize=(6, 6))
sns.countplot(
data=df_final,
x='nutrition_grade_fr',
order=sorted(df_final['nutrition_grade_fr'].dropna().unique()),
palette='Set2',
legend=False
)
plt.title('Distribution of Nutrition Grades (France Nutri-Score)', fontsize=16)
plt.xlabel('Nutrition Grade (A=Healthiest, E=Least Healthy)', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)
# Add count and percentage for each bar
for p in plt.gca().patches:
count = p.get_height()
percent = count / total * 100
label = f'{count:.0f}\n({percent:.1f}%)'
plt.gca().annotate(
label,
(p.get_x() + p.get_width() / 2., count),
ha='center', va='center',
fontsize=11, color='black',
)
plt.tight_layout()
plt.show()
Exploring Key Nutritional Metrics Across Nutri-Score¶
To further understand the drivers behind Nutri-Score distributions, we analyze the spread of key nutritional attributes (fat, sugar, salt, and additive counts) across nutrition grades.
# Define groups for comparison
low_grades = df_final[df_final['nutrition_grade_fr'].isin(['d', 'e'])]
high_grades = df_final[df_final['nutrition_grade_fr'].isin(['a', 'b'])]
# Compare key nutritional metrics
metrics = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for idx, metric in enumerate(metrics):
sns.boxplot(
data=df_final,
x='nutrition_grade_fr',
y=metric,
order=['a', 'b', 'c', 'd', 'e'],
palette='Set2',
ax=axes[idx],
legend=False,
showfliers=False
)
axes[idx].set_title(f'Distribution of {metric} by Nutrition Grade')
axes[idx].set_xlabel('Nutrition Grade')
axes[idx].set_ylabel(metric)
plt.tight_layout()
plt.show()
➡️ From the above plot of key nutritional metrics by Nutri-Score:
Fat and Sugar: It's no surprise that products with higher fat and sugar content tend to receive lower Nutri-Scores (Grades D and E). Grade E products, in particular, show the highest median fat (~25g/100g) and sugar levels, while Grade A foods stay impressively lean, with fat and sugar around just ~2g/100g.
Salt: Salt levels paint a slightly fuzzier picture. Salt content shows a wider spread across grades C, D, and E, but the median salt content is consistently higher in the lower grades.
Additives: Interestingly, the number of additives (additives_n) does not vary strongly with Nutri-Score. This suggests that Nutri-Score judgments are driven by the big players, fats and sugars, rather than by the presence of food additives.
2.4 Deeper Analysis: Additives, Brands, and Nutrition Grades¶
🎯 Research Question 1: Does Nutri-Score reflect the presence of additives?¶
In this section, we conduct an exploratory analysis focused on the relationships among additives, brands, and nutritional quality, as reflected by Nutri-Score grades (A–E).
(1) Distribution of Number of Additives
We first examine the overall distribution of additive counts across products:
# Plot distribution of additives number across all products
plt.figure(figsize=(8, 6))
sns.histplot(df_final['additives_n'], bins=30, kde=False)
plt.title('Distribution of Number of Additives in Products')
plt.xlabel('Number of Additives')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
➡️ Findings:
- Most products contain between 0 and 5 additives.
- The distribution is right-skewed: a small subset has more than 10 additives (quantified below).
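To put a number on that skew, pandas' built-in sample-skewness estimator serves as a quick check (a small sketch; positive values indicate a right-skewed distribution):
# Positive skewness confirms the long right tail of additive counts
print(f"Skewness of additives_n: {df_final['additives_n'].skew():.2f}")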
(2) Relationship Between Additive Counts and Nutrition Score
To assess whether additive presence impacts nutritional quality ratings, we plot the number of additives against Nutrition Scores:
You might be wondering: What exactly is a Nutrition Score?
Similar to Nutri-Score, the Nutrition Score is a numerical representation of a product's nutritional quality.
Lower scores indicate healthier products, with scores mapped to Nutri-Score grades (A = healthiest, E = least healthy).

Figure: Mapping of Nutrition Score points to Nutri-Score grades for solid foods and beverages.
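To make this mapping concrete, below is a minimal sketch of the score-to-grade conversion for solid foods, assuming the commonly published 2017 Nutri-Score cut-offs (beverages use a different scale, and score_to_grade is an illustrative helper, not a dataset field):
# Illustrative mapping from the FR nutrition score to a Nutri-Score grade (solid foods)
def score_to_grade(score: float) -> str:
    if score <= -1:
        return 'a'  # healthiest
    elif score <= 2:
        return 'b'
    elif score <= 10:
        return 'c'
    elif score <= 18:
        return 'd'
    return 'e'      # least healthy
print(score_to_grade(5))  # a score of 5 maps to grade 'c'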
# Scatter plot: Number of additives vs. Nutrition Score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='additives_n', y='nutrition-score-fr_100g', data=df_final, alpha=0.5)
plt.title('Number of Additives vs. Nutrition Score')
plt.xlabel('Number of Additives')
plt.ylabel('Nutrition Score (Lower is Better)')
plt.grid(True)
plt.show()
➡️ Finding:
The scatter plot reveals no strong pattern:
Even products with many additives (10–20) can have good Nutrition Scores, while some products with very few additives still score poorly. The Nutrition Score does not strongly penalize additive counts.
Next, we create a boxplot:
The boxplot across grades confirms that the median number of additives remains similar across Nutri-Score grades (A–E).
# Boxplot: Additives count by Nutri-Score Grade (A to E)
plt.figure(figsize=(8, 6))
sns.boxplot(x='nutrition_grade_fr', y='additives_n', data=df_final, order=['a', 'b', 'c', 'd', 'e'])
plt.title('Distribution of Additives Count Across Nutri-Score Grades')
plt.xlabel('Nutri-Score Grade')
plt.ylabel('Number of Additives')
plt.grid(True)
plt.show()
Histogram for A/B Products:
# Histogram: Additives count for products with Nutri-Score A or B
plt.figure(figsize=(8, 6))
df_good = df_final[df_final['nutrition_grade_fr'].isin(['a', 'b'])]
sns.histplot(df_good['additives_n'], bins=20, kde=False)
plt.title('Additives Count for Products with Nutri-Score A or B')
plt.xlabel('Number of Additives')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
➡️ Findings:
Observations from Additives Count in High-Scoring (A/B) Products
- Most products classified as "healthy" (Nutri-Score A or B) contain relatively few additives, typically 0 to 2.
- However, a non-negligible subset of these products still contains more than 5 additives, and a few even exceed 10 additives.
- The distribution is heavily right-skewed, suggesting that while minimal additive usage is common among healthy-rated products, high additive counts do not necessarily prevent a product from being rated as healthy.
(3) Correlation Analysis
We quantify the relationship between additive counts and Nutrition Score:
- Pearson correlation: 0.033
- Spearman correlation: 0.034
Both correlation coefficients are very close to zero, indicating an extremely weak positive relationship.
➡️ Findings:
Additive counts do not meaningfully correlate with Nutrition Score.
# Pearson Correlation (linear)
pearson_corr = df_final['additives_n'].corr(df_final['nutrition-score-fr_100g'], method='pearson')
# Spearman Correlation (monotonic)
spearman_corr = df_final['additives_n'].corr(df_final['nutrition-score-fr_100g'], method='spearman')
print(f"Pearson correlation between additives_n and nutrition-score-fr_100g: {pearson_corr:.3f}")
print(f"Spearman correlation between additives_n and nutrition-score-fr_100g: {spearman_corr:.3f}")
Pearson correlation between additives_n and nutrition-score-fr_100g: 0.033 Spearman correlation between additives_n and nutrition-score-fr_100g: 0.034
Summary of Findings:
Distribution of Additives:
- Most products contain between 0 to 5 additives.
- A small number of products have more than 10 additives, indicating a right-skewed distribution.
Relationship Between Additives and Nutri-Score:
- Scatter Plot: No strong pattern observed between the number of additives and the Nutrition Score.
- Boxplot: The median number of additives is similar across all Nutri-Score grades (A to E).
- Histogram (A/B products): Many products rated as "healthy" (Nutri-Score A or B) still contain multiple additives, with some having more than 5 additives.
Correlation Analysis:
- Pearson correlation: 0.033
- Spearman correlation: 0.034
- Both correlations are very close to zero, indicating a very weak positive relationship between the number of additives and the Nutrition Score.
Potential Misclassifications:
- Some products with a high number of additives still receive a good Nutri-Score (A or B).
- This suggests that Nutri-Score may overlook additive information when classifying food healthiness.
🗝️ Research Question 1: Conclusion¶
The current Nutri-Score system fails to effectively capture the potential health implications of high additive usage.
Potential inconsistencies:
- Some products with a high number of additives still receive a good Nutri-Score (A/B).
- This suggests that Nutri-Score may overlook additive information when assessing overall product healthiness.
Interestingly, as we observed earlier, not all additives are inherently harmful. Many additives serve as sources of beneficial nutrients, micronutrients, or antioxidants. Thus, the presence of additives should be evaluated critically, rather than being universally perceived as negative.
🎯 Research Question 2: Which brands use more additives and have lower Nutri-Scores?¶
In this section, we explore whether certain brands consistently use more additives or produce products with generally lower Nutri-Scores.
# Keep only the first brand if multiple brands are listed
df_final['brand_main'] = df_final['brands'].apply(lambda x: x.split(',')[0].strip().lower() if pd.notnull(x) else x)
# Step 1: Group by Brand
# Group by main brand and calculate mean additives and mean nutrition score
brand_stats = df_final.groupby('brand_main').agg({
'additives_n': 'mean',
'nutrition-score-fr_100g': 'mean',
'product_name': 'count' # count number of products per brand
}).reset_index()
# Rename columns for clarity
brand_stats.rename(columns={'product_name': 'product_count'}, inplace=True)
# Check
brand_stats.sample(5)
| brand_main | additives_n | nutrition-score-fr_100g | product_count | |
|---|---|---|---|---|
| 1537 | bucky badger | 2.5 | 17.500000 | 2 |
| 4627 | grant park custom meats | 3.0 | 3.000000 | 1 |
| 3349 | echo lake foods | 5.0 | 2.000000 | 1 |
| 2754 | daily bread | 2.0 | 20.000000 | 2 |
| 10153 | sea port products corp | 1.0 | -1.333333 | 3 |
# Step 2: Top Brands by Additives
# Only keep brands with enough products (e.g., more than 30 products) to avoid noisy small brands.
# Filter brands with at least 30 products
brand_stats_filtered = brand_stats[brand_stats['product_count'] >= 30]
# Top 10 brands by average additives
top_additive_brands = brand_stats_filtered.sort_values('additives_n', ascending=False).head(10)
# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='additives_n', y='brand_main', data=top_additive_brands)
plt.title('Top 10 Brands with Highest Average Number of Additives')
plt.xlabel('Average Number of Additives')
plt.ylabel('Brand')
plt.grid(True)
plt.show()
# Step 3: Top Brands by Worst Nutrition Score
# Top 10 brands by worst average nutrition score
top_unhealthy_brands = brand_stats_filtered.sort_values('nutrition-score-fr_100g', ascending=False).head(10)
# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='nutrition-score-fr_100g', y='brand_main', data=top_unhealthy_brands)
plt.title('Top 10 Brands with Worst Average Nutrition Score')
plt.xlabel('Average Nutrition Score (Higher = Worse)')
plt.ylabel('Brand')
plt.grid(True)
plt.show()
# Step 4: Scatter Plot: Additives vs. Nutrition Score
# Scatter plot of average additives vs. average nutrition score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='additives_n', y='nutrition-score-fr_100g', data=brand_stats_filtered)
plt.title('Average Additives vs. Average Nutrition Score per Brand')
plt.xlabel('Average Number of Additives')
plt.ylabel('Average Nutrition Score')
plt.grid(True)
plt.show()
# Select brands with high average additives
high_additive_brands = top_additive_brands['brand_main'].tolist()
# Filter original data
df_high_additives = df_final[df_final['brand_main'].isin(high_additive_brands)]
# Show Nutri-Grade distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='nutrition_grade_fr', data=df_high_additives, order=['a', 'b', 'c', 'd', 'e'])
plt.title('Nutri-Grade Distribution for Brands with High Additive Usage')
plt.xlabel('Nutri-Score Grade')
plt.ylabel('Number of Products')
plt.grid(True)
plt.show()
🗝️ Research Question 2: Conclusion¶
1. Brands with the Highest Average Number of Additives:¶
- Brands like Arnie's, Nissin, and Toft's average around 10–13 additives per product.
- Such consistently high averages suggest heavy use of food additives across these brands' product lines.
2. Brands with the Worst Average Nutrition Scores:¶
- Brands such as Brown & Haley, Reese's, and Palmer have the worst average Nutrition Scores.
- These brands mainly focus on confectionery and sweets products, traditionally high in sugar and fat.
3. Relationship Between Additives and Nutrition Scores Across Brands:¶
- From the scatter plot, there is no strong direct relationship between a brand's average number of additives and its average Nutrition Score.
- Some brands use many additives but still have moderate Nutrition Scores, while some brands with poor Nutrition Scores do not necessarily have many additives.
â Overall, while additives alone do not directly determine a product's Nutri-Score or Nutrition Scores, heavy additive usage is often a marker of lower overall nutritional quality at the brand level.


đŻ Research Question 3: How do food additives and healthiness vary across different brands?¶
In this part, we investigate the types of additives most commonly used across different groups of brands to understand how additive usage patterns vary with food healthiness.
# Select low-additive brands
# Bottom 10 brands by average additives (filter brands with >= 30 products)
low_additive_brands = brand_stats_filtered.sort_values('additives_n', ascending=True).head(10)
# Filter the products for these brands
low_additives_df = df_final[df_final['brand_main'].isin(low_additive_brands['brand_main'])]
# Select brands with worst Nutri-Score
# Top 10 brands with worst average Nutri-Score
worst_nutriscore_brands = brand_stats_filtered.sort_values('nutrition-score-fr_100g', ascending=False).head(10)
# Filter the products for these brands
worst_nutriscore_df = df_final[df_final['brand_main'].isin(worst_nutriscore_brands['brand_main'])]
# Analyze Top Additives for Each Group
def plot_top_additives(dataframe, title):
additives_used = list(chain.from_iterable(dataframe['additives_list'])) # Faster flattening
additives_counter = Counter(additives_used)
top_additives = additives_counter.most_common(20)
top_additives_df = pd.DataFrame(top_additives, columns=['Additive', 'Count'])
plt.figure(figsize=(10, 6))
sns.barplot(x='Count', y='Additive', data=top_additives_df)
plt.title(title)
plt.xlabel('Count')
plt.ylabel('Additive')
plt.grid(True)
plt.show()
# High-additive brands
plot_top_additives(df_high_additives, 'Top 20 Additives Used by High Additive Brands')
# Low-additive brands
plot_top_additives(low_additives_df, 'Top 20 Additives Used by Low Additive Brands')
# Worst Nutri-Score brands
plot_top_additives(worst_nutriscore_df, 'Top 20 Additives Used by Brands with Worst Nutri-Scores')
# General dataset
plot_top_additives(df_final, 'Top 20 Additives Used by All Brands')
# Rank the additives by their usage count
def rank_additives(dataframe):
# Fast flatten the list of additives
additives_used = list(chain.from_iterable(dataframe['additives_list']))
# Count additives
additives_counter = Counter(additives_used)
# Create and sort DataFrame
additives_rank_df = pd.DataFrame(additives_counter.items(), columns=['Additive', 'Count'])
additives_rank_df = additives_rank_df.sort_values(by='Count', ascending=False).reset_index(drop=True)
# Add Rank column
additives_rank_df.index += 1
additives_rank_df.index.name = 'Rank'
return additives_rank_df
# Apply to all products
additives_rank_df = rank_additives(df_final)
# Display the result
display(additives_rank_df)
| Rank | Additive | Count |
|---|---|---|
| 1 | E330 - Citric acid | 20026 |
| 2 | E101 - Riboflavin | 19564 |
| 3 | E101i - Riboflavin | 19559 |
| 4 | E375 - Nicotinic acid | 19535 |
| 5 | E322 - Lecithins | 16498 |
| ... | ... | ... |
| 321 | E555 - Potassium aluminium silicate | 1 |
| 322 | E343i - Monomagnesium phosphate | 1 |
| 323 | E365 - Sodium fumarate | 1 |
| 324 | E266 - Sodium dehydroacetate | 1 |
| 325 | E470 - Sodium/potassium/calcium and magnesium ... | 1 |
325 rows × 2 columns
đïž Research Question 3: Conclusion¶
Additive Usage Varies by Brand Type:
- High Additive Brands rely more on additives such as E375 (Nicotinic acid), E101 (Riboflavin), and E322 (Lecithins), many of which are vitamins or natural emulsifiers.
- Low Additive Brands also use E322 but rely more on compounds like E330 (Citric acid), E509 (Calcium chloride), and E150a (Plain caramel).
Most Frequently Used Additives May Still Be "Healthy":
- Surprisingly, additives with the highest overall usage across all brands include natural compounds, such as citric acid, riboflavin, and nicotinic acid, which are commonly recognized as safe and even beneficial.
- This counters the intuition that "more additives = more harmful." Additive type matters more than quantity alone.
Not All Additives Are Equal:
- We group additives into two general categories (a small tagging sketch follows at the end of this conclusion):
- Generally Healthy Additives: vitamins, natural acids, fibers (e.g., E101, E375, E330).
- Potentially Unhealthy Additives: synthetic colorants, artificial sweeteners, emulsifiers (e.g., E129, E133, E951).
- Products with many synthetic additives are more likely to be ultra-processed and score worse on Nutri-Scores.
Implication:
- Nutri-Score does not directly factor in additive types. Therefore, two products may receive similar grades while differing significantly in additive composition.
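To make the two-way grouping above reproducible, a small tagging helper can be sketched. This is a minimal sketch: the two E-number sets are the illustrative examples named above (not an official or exhaustive classification), and categorize_additives is a hypothetical helper name; it assumes additives_list entries follow the "E330 - Citric acid" format seen in this dataset.
# Illustrative E-number sets taken from the examples above (not an official list)
HEALTHY_E = {'E101', 'E330', 'E375'}     # vitamins / natural acids
UNHEALTHY_E = {'E129', 'E133', 'E951'}   # synthetic colorants / sweeteners
def categorize_additives(additives_list):
    # "E330 - Citric acid" -> "E330"
    codes = [a.split(' - ')[0] for a in additives_list]
    return pd.Series({
        'healthy_additives': sum(c in HEALTHY_E for c in codes),
        'unhealthy_additives': sum(c in UNHEALTHY_E for c in codes),
    })
# Tag every product, then compare category counts across Nutri-Score grades
additive_flags = df_final['additives_list'].apply(categorize_additives)
display(pd.concat([df_final['nutrition_grade_fr'], additive_flags], axis=1)
          .groupby('nutrition_grade_fr').mean().round(3))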
3ïžâŁ Classification Modeling¶
3.1 Principal Component Analysis (PCA)¶
Let's first select the following numeric features, which are the core indicators in nutritional labeling. There are 18 features in total:
# include only the nutritional columns for PCA
nutrient_cols = df_final[[
'additives_n',
'ingredients_from_palm_oil_n',
'ingredients_that_may_be_from_palm_oil_n',
'energy_100g',
'fat_100g',
'saturated-fat_100g',
'trans-fat_100g',
'cholesterol_100g',
'carbohydrates_100g',
'sugars_100g',
'fiber_100g',
'proteins_100g',
'salt_100g',
'sodium_100g',
'vitamin-a_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g'
]]
# drop rows with missing values in those columns
df_nutrient = nutrient_cols.dropna()
df_nutrient.head(5)
| | additives_n | ingredients_from_palm_oil_n | ingredients_that_may_be_from_palm_oil_n | energy_100g | fat_100g | saturated-fat_100g | trans-fat_100g | cholesterol_100g | carbohydrates_100g | sugars_100g | fiber_100g | proteins_100g | salt_100g | sodium_100g | vitamin-a_100g | vitamin-c_100g | calcium_100g | iron_100g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 1.0 | 0.0 | 0.0 | 2389.0 | 42.86 | 7.14 | 0.0 | 0.000 | 25.00 | 14.29 | 7.1 | 25.00 | 0.54356 | 0.214 | 0.000000 | 0.0000 | 0.071 | 0.00514 |
| 149 | 1.0 | 0.0 | 0.0 | 1046.0 | 0.00 | 0.00 | 0.0 | 0.000 | 62.50 | 52.50 | 7.5 | 2.50 | 0.00000 | 0.000 | 0.001125 | 0.0000 | 0.050 | 0.00360 |
| 152 | 1.0 | 0.0 | 0.0 | 1569.0 | 2.50 | 0.00 | 0.0 | 0.000 | 87.50 | 65.00 | 2.5 | 2.50 | 1.96850 | 0.775 | 0.000750 | 0.0000 | 0.100 | 0.00090 |
| 153 | 5.0 | 0.0 | 0.0 | 1883.0 | 22.50 | 12.50 | 0.0 | 0.012 | 70.00 | 42.50 | 2.5 | 5.00 | 1.01600 | 0.400 | 0.000075 | 0.0000 | 0.050 | 0.00180 |
| 200 | 1.0 | 0.0 | 0.0 | 1523.0 | 16.88 | 10.39 | 0.0 | 0.052 | 44.16 | 5.19 | 1.3 | 7.79 | 1.08966 | 0.429 | 0.000195 | 0.1013 | 0.026 | 0.00094 |
To use PCA to reduce the dimensionality of the data, we follow best practice and standardize the features first:
# Standardizing the features
X_scaled = StandardScaler().fit_transform(df_nutrient)
Next, we explore the ideal number of components for PCA. The output of explained_variance_ratio_ tells us how much of the variance each component explains:
# Apply PCA with 18 components
pca = PCA(n_components=18)
X_pca = pca.fit_transform(X_scaled)
# The variance explained by each component
np.set_printoptions(suppress=True)
pca.explained_variance_ratio_
array([0.16566702, 0.13153279, 0.10773104, 0.07771031, 0.07305444,
0.06295702, 0.05915664, 0.05824494, 0.05748542, 0.05220961,
0.04915088, 0.03967248, 0.03290017, 0.0197183 , 0.01169987,
0.00110908, 0. , 0. ])
Here, we create an explained variance ratio plot to interpret how much variance the principal components explain cumulatively:
# Get Explained Variance Ratio and create plot
explained_var = pca.explained_variance_ratio_
cum_var = np.insert(np.cumsum(explained_var), 0, 0.0)
x_full = np.arange(0, len(explained_var) + 1)
# plotting
plt.figure(figsize=(8, 6))
plt.plot(x_full, cum_var, marker="o", label="Cumulative")
plt.axhline(0.95, color="red", linestyle="--", label="95% Threshold")
# graph format
plt.xlim(-0.5, len(explained_var)+0.5)
plt.xticks(x_full)
plt.xlabel("Number of Principal Components")
plt.ylabel("Explained Variance Ratio")
plt.title("Explained Variance per Principal Component")
plt.legend()
plt.tight_layout()
plt.show()
Ideally, we set the number of principal components to the smallest value at which the curve flattens out. From the plot above, the first 13 components (of 18) explain more than 95% of the variance, so we initially set n_components = 13 for the following analysis.
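The same cutoff can be confirmed programmatically instead of reading it off the plot; a small sketch using the ratios already computed above:
# Smallest number of components whose cumulative explained variance reaches 95%
n_components_95 = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_components_95}")  # expected: 13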
Now, we wrap PCA in a helper function that projects the dataset onto the principal components and returns a DataFrame for further analysis and clustering.
from sklearn.decomposition import PCA
def run_pca(X_scaled, n_components):
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)
df_pca = pd.DataFrame(X_pca, columns=[f"PC{i+1}" for i in range(n_components)])
return pca, df_pca
Deciding How Many Features to Keep for KMeans Clustering¶
After running Principal Component Analysis, we initially kept 13 components that together explain about 95% of the variation in the data.
However, we noticed a problem:
Having too many dimensions can actually make it harder for the model to find meaningful groups, and KMeans struggles to decide which points should belong together.
đ To avoid this problem, we decided to reduce the number of features even further before running KMeans. We tested different numbers of PCA features and, for each case, measured clustering quality using the silhouette score.
Higher scores mean better, clearer clusters.
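For reference, the silhouette score of a point i compares its mean distance to points in its own cluster, a(i), with its mean distance to points in the nearest other cluster, b(i):

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

Scores near 1 indicate points that sit firmly inside their own cluster; scores near 0 indicate points lying on a boundary between clusters.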
Here were the results:
| Number of PCA features | Silhouette Score |
|---|---|
| 4 | 0.3159 |
| 5 | 0.3064 |
| 6 | 0.2864 |
| 8 | 0.2920 |
| 10 | 0.2992 |
| 13 | 0.2357 |
Clustering Evaluation: How Good Are Our Groups?¶
The best configuration in the experiment below gave a silhouette score of 0.3159.
- This score indicates that the clusters are moderately well-separated.
- In nutrition data, this is expected: foods often fall along a spectrum of taste, like moderately salty or slightly sugary, rather than into totally separate categories.
Based on the results, we chose to move forward with 4 principal components, because this setting produced the best clustering quality.
# Standardizing
X_scaled = StandardScaler().fit_transform(df_nutrient)
# List of PCA dimensions you want to test
pca_dims = [4, 5, 6, 8, 10, 13]
# Dictionary to store silhouette scores
silhouette_scores = {}
for dim in pca_dims:
# PCA Dimension Reduction
pca = PCA(n_components=dim, random_state=42)
X_pca = pca.fit_transform(X_scaled)
# Apply KMeans; n_clusters for each dimensionality was chosen via the Elbow Method
kmeans = KMeans(n_clusters=dim - 1, random_state=42)
clusters = kmeans.fit_predict(X_pca)
# check silhouette score for the performance of KMeans clustering:
score = silhouette_score(X_pca, clusters)
silhouette_scores[dim] = score
# Output
for dim, score in silhouette_scores.items():
print(f"Silhouette score for {dim}D PCA + KMeans: {score:.4f}")
Silhouette score for 4D PCA + KMeans: 0.3159 Silhouette score for 5D PCA + KMeans: 0.3064 Silhouette score for 6D PCA + KMeans: 0.2864 Silhouette score for 8D PCA + KMeans: 0.2920 Silhouette score for 10D PCA + KMeans: 0.2992 Silhouette score for 13D PCA + KMeans: 0.2357
KMeans Clustering and Visualization with UMAP¶
# dataset for KMeans Clustering
pca_kmeans, df_pca_kmeans = run_pca(X_scaled, 4)
df_pca_kmeans.head(5)
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| 0 | 2.684036 | 1.629462 | -1.264479 | -1.027111 |
| 1 | 0.049841 | -1.180093 | 1.096871 | -2.193361 |
| 2 | 0.695542 | -1.189935 | 2.092358 | -1.137370 |
| 3 | 2.072854 | -0.649734 | 0.724378 | 0.590306 |
| 4 | 0.486327 | 0.340976 | -0.266710 | -0.917434 |
We first use the Elbow Method to find the optimal k for KMeans:
distortions = []
K_range = range(1, 30)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(df_pca_kmeans)
distortions.append(kmeans.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K_range, distortions, marker='o')
plt.title("Elbow Method for Optimal k")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares (distortion)")
plt.xticks(K_range)
plt.grid(True)
plt.tight_layout()
plt.show()
Now we apply KMeans with k=4 and add a new column, clusters_label, to df_pca_kmeans:
# k=4 KMeans Clustering
kmeans = KMeans(n_clusters=4, random_state=42)
df_pca_kmeans['clusters_label'] = kmeans.fit_predict(df_pca_kmeans)
df_pca_kmeans.head(10)
| | PC1 | PC2 | PC3 | PC4 | clusters_label |
|---|---|---|---|---|---|
| 0 | 2.684036 | 1.629462 | -1.264479 | -1.027111 | 2 |
| 1 | 0.049841 | -1.180093 | 1.096871 | -2.193361 | 3 |
| 2 | 0.695542 | -1.189935 | 2.092358 | -1.137370 | 3 |
| 3 | 2.072854 | -0.649734 | 0.724378 | 0.590306 | 3 |
| 4 | 0.486327 | 0.340976 | -0.266710 | -0.917434 | 3 |
| 5 | 0.433520 | -0.518498 | 0.152523 | 0.135389 | 3 |
| 6 | -0.602105 | 0.263374 | -0.157433 | -0.293366 | 1 |
| 7 | 0.739413 | 1.276978 | -1.010996 | 0.214544 | 2 |
| 8 | 0.450569 | 0.545287 | -0.582644 | 0.199125 | 1 |
| 9 | 1.175014 | -1.104427 | 0.393440 | 2.838227 | 3 |
Now we check the number of data points in each cluster:
print(np.unique(df_pca_kmeans['clusters_label'], return_counts=True))
(array([0, 1, 2, 3], dtype=int32), array([ 349, 36919, 11865, 31813]))
We identified four clusters using KMeans. One of them (Cluster 0) contains only 349 data points, far fewer than the others, so it may not represent a broad product group. We nevertheless continue our analysis on Clusters 0, 1, 2, and 3, which together cover the entire dataset and provide interpretable patterns.
# Convert the 4-component PCA data to 2D using UMAP
# (use only the PC columns so the cluster labels do not leak into the embedding)
umap_2d = umap.UMAP(
    n_components=2,
    n_neighbors=10,
    min_dist=0.1,
    random_state=42
).fit_transform(df_pca_kmeans[['PC1', 'PC2', 'PC3', 'PC4']].values)
plt.figure(figsize=(6,5))
sns.scatterplot(
x=umap_2d[:, 0],
y=umap_2d[:, 1],
# hue=df_final['nutrition_grade_fr'],
hue=df_pca_kmeans['clusters_label'],
palette='Set2',
s=12,
alpha=0.8
)
plt.title("UMAP 2-D embedding of packaged foods")
plt.xlabel("UMAP-1"); plt.ylabel("UMAP-2")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
df_final
| | code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | brands | brands_tags | ... | vitamin-c_100g | calcium_100g | iron_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | ingredients_list | additives_list | weird_char_count | clean_ingredient_tokens | brand_main |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 0000000033688 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489050424 | 2017-03-09T09:07:04Z | 1489050424 | 2017-03-09T09:07:04Z | peanuts, mixed nuts | northgate market | northgate-market | ... | 0.0000 | 0.071 | 0.00514 | 14.0 | 14.0 | [peanuts, honey, coating (sucrose, wheat starc... | [E415 - Xanthan gum] | 0 | [coating (sucrose, wheat starch, xanthan gum, ... | northgate market |
| 149 | 0000000045292 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069958 | 2017-03-09T14:32:38Z | 1489069958 | 2017-03-09T14:32:38Z | turkish apricots | northgate | northgate | ... | 0.0000 | 0.050 | 0.00360 | 8.0 | 8.0 | [apricots, sulfur dioxide.] | [E220 - Sulphur dioxide] | 0 | [sulfur dioxide] | northgate |
| 152 | 0000000045421 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | chili mango | torn & glasses | torn-glasses | ... | 0.0000 | 0.100 | 0.00090 | 19.0 | 19.0 | [dried mango, paprika, sugar, salt, citric aci... | [E330 - Citric acid] | 0 | [dried mango, citric acid] | torn & glasses |
| 153 | 0000000045483 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489050424 | 2017-03-09T09:07:04Z | 1489050424 | 2017-03-09T09:07:04Z | milk chocolate pretzels | torn & glasser | torn-glasser | ... | 0.0000 | 0.050 | 0.00180 | 25.0 | 25.0 | [milk chocolate (sugar, cocoa butter, chocolat... | [E101 - Riboflavin, E101i - Riboflavin, E322 -... | 7 | [milk chocolate (sugar, cocoa butter, chocolat... | torn & glasser |
| 200 | 0000020039127 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489138568 | 2017-03-10T09:36:08Z | 1489138568 | 2017-03-10T09:36:08Z | butter croissants | fresh & easy | fresh-easy | ... | 0.1013 | 0.026 | 0.00094 | 18.0 | 18.0 | [wheat flour, butter (cream), water, yeast, su... | [E300 - Ascorbic acid] | 0 | [wheat flour, butter (cream, wheat gluten, asc... | fresh & easy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 355821 | 9556041620369 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489066070 | 2017-03-09T13:27:50Z | 1489066070 | 2017-03-09T13:27:50Z | sardines in spicy tomato sauce, chili and lime | ayam brand | ayam-brand | ... | 0.0000 | 0.357 | 0.00257 | 3.0 | 3.0 | [sardines, water, tomato paste, sugar, dried c... | [E322 - Lecithins, E322i - Lecithin, E415 - Xa... | 0 | [tomato paste, dried chili, thickener (xanthan... | ayam brand |
| 355844 | 9556173386461 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489066836 | 2017-03-09T13:40:36Z | 1489066836 | 2017-03-09T13:40:36Z | chewy candy | fruit plus | fruit-plus | ... | 0.0000 | 0.000 | 0.00000 | 21.0 | 21.0 | [sugar, glucose syrup, vegetable fat (hydrogen... | [E102 - Tartrazine, E330 - Citric acid, E414 -... | 3 | [glucose syrup, vegetable fat (hydrogenated pa... | fruit plus |
| 355859 | 9556390158162 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489069476 | 2017-03-09T14:24:36Z | 1489069476 | 2017-03-09T14:24:36Z | lee, special crackers | lee biscuits (pte.) ltd. | lee-biscuits-pte-ltd | ... | 0.0000 | 0.045 | 0.00082 | 16.0 | 16.0 | [wheat flour, vegetable oil (palm olein), suga... | [E1101 - Protease, E450 - Diphosphates, E471 -... | 0 | [wheat flour, vegetable oil (palm olein, corn ... | lee biscuits (pte.) ltd. |
| 355860 | 9556390178160 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489070026 | 2017-03-09T14:33:46Z | 1489070026 | 2017-03-09T14:33:46Z | sugar crackers | lee biscuits (pte.) ltd. | lee-biscuits-pte-ltd | ... | 0.0000 | 0.000 | 0.00082 | 13.0 | 13.0 | [wheat flour, sugar, vegetable fat (palm base)... | [E450 - Diphosphates, E500 - Sodium carbonates... | 0 | [wheat flour, corn starch, vegetable oil (palm... | lee biscuits (pte.) ltd. |
| 355968 | 9780803738782 | http://world-en.openfoodfacts.org/product/9780... | usda-ndb-import | 1489069944 | 2017-03-09T14:32:24Z | 1489069945 | 2017-03-09T14:32:25Z | organic z bar | clif kid | clif-kid | ... | 0.0583 | 0.556 | 0.00500 | 11.0 | 11.0 | [organic oat blend (organic rolled oats, organ... | [E322 - Lecithins, E322i - Lecithin] | 0 | [] | clif kid |
80946 rows × 47 columns
Cluster Profiling¶
df_profile = df_nutrient.copy()
df_profile['cluster'] = df_pca_kmeans['clusters_label'].values
cluster_means = df_profile.groupby('cluster')[
['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n', 'energy_100g']
].mean().round(2)
display(cluster_means)
| cluster | fat_100g | sugars_100g | salt_100g | additives_n | energy_100g |
|---|---|---|---|---|---|
| 0 | 1.60 | 9.52 | 40.84 | 2.19 | 707.08 |
| 1 | 5.52 | 7.45 | 1.13 | 3.21 | 623.53 |
| 2 | 32.38 | 13.48 | 1.65 | 2.13 | 1867.73 |
| 3 | 12.93 | 33.26 | 1.09 | 3.37 | 1697.14 |
nutrient_features = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n']
energy_feature = ['energy_100g']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,6))
cluster_means.loc[[0,1,2,3]][nutrient_features].T.plot(
kind='bar', ax=ax1
)
ax1.set_title('Cluster Profiles - Macronutrients & Additives')
ax1.set_ylabel('Mean per 100g')
ax1.set_xlabel('Nutritional Feature')
ax1.set_xticklabels(nutrient_features, rotation=45)
ax1.legend(title='Cluster')
cluster_means.loc[[0,1,2,3]][energy_feature].T.plot(
kind='bar', ax=ax2, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#CC0000']
)
ax2.set_title('Cluster Profiles - Energy (kJ per 100g)')
ax2.set_ylabel('Mean Energy')
ax2.set_xlabel('Energy')
ax2.set_xticklabels(['energy_100g'], rotation=0)
ax2.legend(title='Cluster')
plt.tight_layout()
plt.show()
features = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n', 'energy_100g']
data = cluster_means.loc[[0,1,2,3]][features]
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=features)
data_scaled['cluster'] = ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3']
labels = features
num_vars = len(labels)
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]
fig, ax = plt.subplots(figsize=(8, 6), subplot_kw=dict(polar=True))
for i, row in data_scaled.iterrows():
values = row[features].tolist()
values += values[:1]
ax.plot(angles, values, label=row['cluster'])
ax.fill(angles, values, alpha=0.15)
ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)
ax.set_thetagrids(np.degrees(angles[:-1]), labels)
ax.set_title("Cluster Nutritional Radar Chart", fontsize=14, pad=30)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(2, 2, subplot_kw=dict(polar=True), figsize=(12,10), constrained_layout=True)
axes = axes.flatten()
# Define colors for clusters
colors = ['#4682B4', '#FFA500', '#7ED957', '#B22222']
for i in range(4):
values = data_scaled.iloc[i][features].tolist()
values += values[:1]
ax = axes[i]
ax.plot(angles, values, color=colors[i], linewidth=2)
ax.fill(angles, values, color=colors[i], alpha=0.25)
ax.set_title(f"{data_scaled.iloc[i]['cluster']}", size=13, pad=35)
ax.set_thetagrids(np.degrees(angles[:-1]), features)
ax.set_ylim(0, 1)
fig.suptitle("Nutritional Radar Charts for Each Cluster", fontsize=16, y=1)
# constrained_layout (set in plt.subplots above) already manages spacing,
# so tight_layout / subplots_adjust are not needed here
plt.show()
Here are the results:
Cluster 0: High-Salt, Low-Fat Products
- đ§Ÿ Characteristics:
- Very low fat (1.6g)
- Extremely high salt (41g)
- Moderate sugar (9.5g), additives (2.2), energy (707 kJ)
- đ High-salt preserved foods or savory processed items.
Cluster 1: Low-Energy Foods
- đ§Ÿ Characteristics:
- Moderate fat (5.5g), sugar (7.5g), additives (3.2)
- Low salt (1.1g), lowest energy (624 kJ).
- đ Low-calorie snacks, possibly cereals or health-focused foods.
Cluster 2: High-Fat, High-Energy Foods
- đ§Ÿ Characteristics:
- Extremely high fat (32.4g)
- High sugar (13.5g)
- Low salt (1.6g)
- Lower additives (2.1), highest energy (1868 kJ).
- đ Fat-dense products like butters, nut spreads, cheeses, and creamy desserts.
Cluster 3: High-Sugar, Additive-Heavy Processed Foods
- đ§Ÿ Characteristics:
- Moderate fat (12.9g)
- Very high sugar (33.3g), highest additives (3.4), very high energy (1697 kJ).
- Low salt (1.1g)
- đ Highly processed sweets, sodas, candies, and energy bars.
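To tie these profiles back to healthiness, the clusters can be cross-tabulated against Nutri-Score grades. A minimal sketch, relying on df_nutrient having kept df_final's row index (it was derived from it with dropna), and reusing df_profile from the cluster profiling cell above:
# Share of each Nutri-Score grade within each cluster
grades = df_final.loc[df_profile.index, 'nutrition_grade_fr']
display(pd.crosstab(df_profile['cluster'], grades, normalize='index').round(2))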
4ïžâŁ Prediction Modeling: Predicting Product Healthiness¶
In this section, we develop machine learning models to evaluate how well a product's healthiness, measured by its Nutri-Score / Nutri-grade, can be predicted from its ingredient composition, additive content, and numerical nutritional values.
Our main objective is to determine whether Nutri-Score labels can be inferred directly from structured product data, and which features (e.g., sugar, salt, additives) are most predictive.
4.1 Ingredients + Additives Text Model¶
We explore whether free-text fields like ingredients and additives can effectively predict a product's Nutri-grade, using text-based modeling techniques like TF-IDF and Logistic Regression.
Goal:
Predict Nutri-grade (A-E) and health category (healthy vs unhealthy) using only text-based ingredient and additive information.
Feature Preparation
Combined Tokens: We merge clean_ingredient_tokens with additives_list into a unified list per product.
TF-IDF Encoding: Convert token lists into TF-IDF vectors, using unigrams and bigrams (e.g., "palm", "palm oil").
Targets:
Multiclass: nutrition_grade_fr (A-E)
Binary: score_binary (healthy = A/B, unhealthy = C/D/E)
# Step 1: Combine ingredients and additives into a single token list
df_final['combined_tokens'] = df_final.apply(
lambda row: row['clean_ingredient_tokens'] + row['additives_list']
if isinstance(row['additives_list'], list) else row['clean_ingredient_tokens'],
axis=1
)
# Step 2: Join the token list into a single string for TF-IDF processing
df_final['combined_str'] = df_final['combined_tokens'].apply(lambda tokens: ' '.join(tokens))
# Step 3: Create binary label: 'healthy' (A/B) vs 'unhealthy' (C/D/E)
df_final['score_binary'] = df_final['nutrition_grade_fr'].str.lower().map(
lambda x: 'healthy' if x in ['a', 'b'] else ('unhealthy' if x in ['c', 'd', 'e'] else None)
)
# Step 4: Drop rows with missing values to ensure clean modeling
df_model = df_final.dropna(subset=['combined_str', 'nutrition_grade_fr', 'score_binary'])
4.1.1 Multiclass Logistic Regression (Nutri-Score A to E)¶
Goal: Predict exact Nutri-Score class A to E.
# Step 5: TF-IDF vectorization (unigrams only, top 1000 features)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(df_model['combined_str'])
y = df_model['nutrition_grade_fr'].str.lower()  # Target: multiclass
# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Step 7: Train logistic regression model (with class balancing)
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
# Step 8: Evaluate classification performance
y_pred = clf.predict(X_test)
print("Multiclass Classification Report (Nutri-Score AâE):")
print(classification_report(y_test, y_pred))
# Step 9: Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix â Multiclass Nutri-Score Classification")
plt.show()
Multiclass Classification Report (Nutri-Score AâE):
precision recall f1-score support
a 0.54 0.67 0.60 1717
b 0.49 0.57 0.53 2288
c 0.44 0.45 0.44 3148
d 0.65 0.49 0.56 5264
e 0.66 0.74 0.70 3773
accuracy 0.57 16190
macro avg 0.56 0.58 0.57 16190
weighted avg 0.58 0.57 0.57 16190
4.1.2 Binary Logistic Regression (Healthy vs Unhealthy)¶
Goal: Classify products as "healthy" (Nutri-Score A or B) or "unhealthy" (Nutri-Score C, D, or E) using TF-IDF features derived from cleaned ingredient and additive text.
# Step 5: TF-IDF vectorization (bigrams included, more features)
vectorizer = TfidfVectorizer(max_features=8000, ngram_range=(1,2), stop_words='english')
X_tfidf = vectorizer.fit_transform(df_model['combined_str'])
y = df_model['score_binary'] # Target: binary label
# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Step 7: Train logistic regression model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
# Step 8: Evaluate classification performance
y_pred = clf.predict(X_test)
print("Binary Classification Report (Healthy vs Unhealthy):")
print(classification_report(y_test, y_pred))
# Step 9: Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix â Binary Nutri-Score Classification")
plt.show()
Binary Classification Report (Healthy vs Unhealthy):
precision recall f1-score support
healthy 0.68 0.88 0.77 4005
unhealthy 0.95 0.87 0.91 12185
accuracy 0.87 16190
macro avg 0.82 0.87 0.84 16190
weighted avg 0.89 0.87 0.87 16190
4.1.3 Binary Random Forest (Healthy vs Unhealthy)¶
Goal: Try non-linear model to capture ingredient interactions.
# Reuse X_train, y_train, X_test, y_test from the binary logistic regression model (Section 4.1.2)
# Step 1: Initialize and train the Random Forest classifier
rf_clf = RandomForestClassifier(
n_estimators=200, # Number of trees in the forest
max_depth=15, # Maximum depth of each tree
class_weight='balanced',# Handle class imbalance
random_state=42 # Ensure reproducibility
)
rf_clf.fit(X_train, y_train)
# Step 2: Make predictions
y_pred_rf = rf_clf.predict(X_test)
# Step 3: Display performance metrics
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
# Step 4: Plot confusion matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_clf.classes_)
sns.heatmap(conf_matrix_rf, annot=True, fmt="d",
xticklabels=rf_clf.classes_,
yticklabels=rf_clf.classes_,
cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Random Forest Confusion Matrix â Healthy vs Unhealthy")
plt.show()
Random Forest Classification Report:
precision recall f1-score support
healthy 0.54 0.88 0.67 4005
unhealthy 0.95 0.75 0.84 12185
accuracy 0.78 16190
macro avg 0.75 0.82 0.75 16190
weighted avg 0.85 0.78 0.80 16190
đ Summary of Key Findings - Section 4.1¶
Text-based features from ingredients and additives carry strong predictive signals.
Binary classification significantly outperforms multiclass.
- Multiclass Logistic Regression (A-E) achieved ~57% accuracy (macro F1 ≈ 0.57).
- Binary Logistic Regression (Healthy vs. Unhealthy) achieved 87% accuracy with strong recall for both classes.
Logistic Regression outperforms Random Forest in interpretability and precision.
- Binary Logistic Regression reached balanced performance (macro F1 = 0.84), with strong precision on both classes.
- Random Forest (binary) reached 78% accuracy, but tended to overpredict "healthy", leading to higher false negatives.
4.2 Ingredient + Brand Model¶
Goal: In addition to ingredients, brand identity may reflect broader product philosophies or quality standards. This model evaluates whether combining ingredients with brand information enhances prediction accuracy for Nutri-grade (healthy vs unhealthy).
from scipy.sparse import hstack
# Step 1: Prepare ingredient-only text
df_model['ingredient_str'] = df_model['clean_ingredient_tokens'].apply(lambda x: ' '.join(x))
# Step 2: TF-IDF vectorization on ingredients
vectorizer = TfidfVectorizer(
max_features=8000,
ngram_range=(1, 2),
stop_words='english',
min_df=3,
max_df=0.9
)
X_tfidf = vectorizer.fit_transform(df_model['ingredient_str'])
# Step 3: Clean and encode brand info
top_brands = df_model['brands'].value_counts().head(50).index
df_model['brand_clean'] = df_model['brands'].apply(lambda x: x if x in top_brands else 'other')
brand_ohe = pd.get_dummies(df_model['brand_clean'], prefix='brand')
# Step 4: Concatenate features
X_final = hstack([X_tfidf, brand_ohe.values])
# Step 5: Define target
y = df_model['score_binary'] # 'healthy' vs 'unhealthy'
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)
# Step 6: Train model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Step 7: Evaluation
print("Classification Report (Ingredient + Brand):")
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix â Ingredient + Brand Binary Classification")
plt.show()
Classification Report (Ingredient + Brand):
precision recall f1-score support
healthy 0.67 0.88 0.76 4005
unhealthy 0.96 0.86 0.90 12185
accuracy 0.86 16190
macro avg 0.81 0.87 0.83 16190
weighted avg 0.88 0.86 0.87 16190
# Step 1: Use X_final (TF-IDF + brand one-hot) and y from previous section
rf_clf = RandomForestClassifier(
n_estimators=200,
max_depth=15,
class_weight='balanced',
random_state=42
)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
# Evaluation
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_clf.classes_)
sns.heatmap(conf_matrix_rf, annot=True, fmt="d", xticklabels=rf_clf.classes_, yticklabels=rf_clf.classes_, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Random Forest Confusion Matrix â Ingredient + Brand")
plt.show()
đ Summary of Key Findings - Section 4.2¶
Adding brand information did not significantly improve prediction accuracy (still ~86%).
This suggests that brand identity adds limited value beyond what is already captured in the ingredient list.
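One way to sanity-check this is to look at the weights the logistic model assigned to the brand one-hot columns, which occupy the last positions of X_final after the TF-IDF block. A minimal sketch reusing clf, vectorizer, and brand_ohe from this section (coefficient magnitudes are only roughly comparable across feature types, but near-zero brand weights would support the conclusion):
# Brand coefficients sit after the TF-IDF features in the stacked design matrix
n_text_features = len(vectorizer.get_feature_names_out())
brand_coefs = pd.Series(clf.coef_[0][n_text_features:], index=brand_ohe.columns)
print("Largest-magnitude brand weights:")
print(brand_coefs.reindex(brand_coefs.abs().sort_values(ascending=False).index).head(10))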
4.3 Ingredient-Only Model¶
Goal:
This experiment investigates whether a product's healthiness (as defined by the Nutri-Score system) can be accurately predicted using only its cleaned ingredient list, without relying on additives, brand, or structured nutrition values.
# Step 1: Prepare Ingredient-only Strings
df_model['ingredient_only_str'] = df_model['clean_ingredient_tokens'].apply(lambda x: ' '.join(x))
# Step 2: TF-IDF Vectorization (only on ingredients)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
max_features=8000,
ngram_range=(1, 2),
stop_words='english',
min_df=3,
max_df=0.9
)
X_ing = vectorizer.fit_transform(df_model['ingredient_only_str'])
y = df_model['score_binary'] # Use the binary label: 'healthy' vs 'unhealthy'
# Step 3: Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_ing, y, test_size=0.2, random_state=42)
# Step 4: Train Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Step 5: Evaluation
print("Ingredient-Only Classification Report:")
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="BuGn",
xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Ingredient-Only â Binary Nutri-Score Classification")
plt.show()
# Step 6: Show Most Important Tokens
import numpy as np
feature_names = vectorizer.get_feature_names_out()
coefs = clf.coef_[0] # For binary classification
# Top 20 words most indicative of "unhealthy"
print(" Top 20 'unhealthy' indicators:")
for i in np.argsort(coefs)[-20:][::-1]:
print(f"{feature_names[i]:<20} {coefs[i]:.3f}")
# Top 20 words most indicative of "healthy"
print("\n Top 20 'healthy' indicators:")
for i in np.argsort(coefs)[:20]:
print(f"{feature_names[i]:<20} {coefs[i]:.3f}")
Top 20 'unhealthy' indicators:
pepper spice         6.218
syrup seasoning      5.791
cottonseed oils      4.883
acid peanut          4.081
pepper yeast         3.920
vit b12              3.845
crumb wheat          3.745
sulfate ascorbic     3.424
chips bananas        3.420
oils coconut         3.275
powder sorbitan      3.263
citrate tricalcium   3.201
pgpr emulsifier      3.166
color modified       3.155
color contains       3.134
color disodium       3.071
flavor modified      2.994
orange juice         2.938
brownie              2.932
color black          2.915
Top 20 'healthy' indicators:
quartered            -5.559
benzoate sodium      -5.413
water red            -4.978
phosphate color      -4.343
usa                  -3.820
almonds almonds      -3.663
dried apricots       -3.590
whey pasteurized     -3.583
coriander            -3.552
acid ferrous         -3.486
freshness vitamin    -3.474
white tuna           -3.339
phosphate thiamine   -3.277
steamed              -3.206
culture sea          -3.182
benzoic acid         -3.170
cultures reduced     -3.144
tocopherols natural  -3.127
vegetable monoglycerides -3.119
almond butter        -3.073
đ Summary of Key Findings - Section 4.3: Ingredient-Only Model¶
Ingredients alone are highly predictive of product healthiness.
The ingredient-only model achieved 86% accuracy, comparable to models that also included additives or brand, showing that additives offer limited added value and that ingredient composition already captures the key health signals.
Predictive ingredients align with nutritional intuition.
- "Palm oil", "syrup", and "sugar" were strong indicators of unhealthy products.
- "Beans", "lettuce", "fiber", and "water" were strong indicators of healthier items.
Implication:
Even without full nutrition facts or additive disclosures, consumers can make smarter food choices simply by reading the ingredient list. Avoiding a few key red-flag ingredients can reliably steer purchases toward healthier products.
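As a quick usage sketch of that idea, the ingredient-only model from this section can score an arbitrary ingredient string; the sample string below is invented for illustration, and vectorizer and clf are the fitted objects from the cell above:
# Score a made-up ingredient list with the fitted ingredient-only model
sample = "sugar glucose syrup palm oil artificial flavor"
sample_vec = vectorizer.transform([sample])    # reuse the fitted TF-IDF vocabulary
print(clf.predict(sample_vec)[0])              # predicted label, e.g. 'unhealthy'
print(dict(zip(clf.classes_, clf.predict_proba(sample_vec)[0].round(3))))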
4.4 Nutri Feature Importance Analysis (Random Forest)¶
To better understand which nutritional features most strongly influence the prediction of whether a product is healthy or unhealthy, we trained a Random Forest classifier and analyzed the resulting feature importances.
The plot below shows the relative importance of each feature in the classification task.
# Suppose your input data includes structured features like sugars_100g, salt_100g, etc.
# Step 1: Define your feature columns
numeric_features = [
'additives_n',
'ingredients_from_palm_oil_n',
'ingredients_that_may_be_from_palm_oil_n',
'energy_100g',
'fat_100g',
'saturated-fat_100g',
'trans-fat_100g',
'cholesterol_100g',
'carbohydrates_100g',
'sugars_100g',
'fiber_100g',
'proteins_100g',
'salt_100g',
'sodium_100g',
'vitamin-a_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g'
]
# Step 2: Prepare X and y (keep only rows with a defined binary label)
mask = df_final['score_binary'].notna()
X = df_final.loc[mask, numeric_features].fillna(0)  # Fill NaNs with 0 or use better imputation
y = df_final.loc[mask, 'score_binary']  # Target: 'healthy' or 'unhealthy'
# Step 3: Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=10, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)
# Step 5: Feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
'feature': numeric_features,
'importance': importances
}).sort_values(by='importance', ascending=False)
# Step 6: Print feature importance
print("Feature Importance:")
print(feature_importance_df)
# Step 7: Plot feature importance
plt.figure(figsize=(10,6))
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.title('Feature Importance - Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
Feature Importance:
feature importance
5 saturated-fat_100g 0.203925
9 sugars_100g 0.182795
4 fat_100g 0.130943
12 salt_100g 0.118942
13 sodium_100g 0.101605
3 energy_100g 0.098835
8 carbohydrates_100g 0.047869
10 fiber_100g 0.036596
11 proteins_100g 0.031020
7 cholesterol_100g 0.013527
17 iron_100g 0.012762
16 calcium_100g 0.008211
15 vitamin-c_100g 0.005458
14 vitamin-a_100g 0.003753
0 additives_n 0.003343
2 ingredients_that_may_be_from_palm_oil_n 0.000210
6 trans-fat_100g 0.000206
1 ingredients_from_palm_oil_n 0.000000
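Impurity-based importances from Random Forests can overstate continuous, high-cardinality features, so permutation importance on the held-out split could serve as a cross-check. A minimal sketch using scikit-learn's inspection module, reusing rf, X_test, and y_test from above:
from sklearn.inspection import permutation_importance
# Mean drop in accuracy when each feature's values are shuffled on the test set
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42, n_jobs=-1)
perm_df = pd.Series(perm.importances_mean, index=numeric_features).sort_values(ascending=False)
print(perm_df.head(10))
If the same features top both rankings, the conclusions below are robust to the choice of importance measure.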
đ Summary of Key Findings - Section 4.4¶
Sugars and Saturated Fat Are the Top Predictors
- sugars_100g and saturated-fat_100g are the two most influential features in predicting whether a product is healthy or unhealthy. This confirms their central role in the Nutri-Score algorithm.
Salt, Total Fat, Sodium, and Energy Also Matter
- These features, especially salt_100g and sodium_100g, rank just behind sugar and saturated fat, reinforcing their impact on health classification.
This analysis directly supports our research question: if brands reduce sugar or salt, will their Nutri-Scores improve?
Yes: because sugar and salt are among the strongest predictors, brands can meaningfully improve health scores by lowering these ingredients.
Implication:¶
Our model highlights that products low in sugar, saturated fat, and salt are far more likely to be classified as healthy. Consumers can use this insight to prioritize items with reduced sugar/salt content, even before checking the official Nutri-Score.
5ïžâŁ Regression Results and Interpretation¶
In this section, we analyze the OLS estimates linking a product's nutritional composition, additive usage, non-linear interactions, and top-brand affiliation to its French Nutrition Score (nutrition-score-fr_100g). We focus on coefficient signs, statistical significance, and overall model diagnostics to understand which factors most strongly drive a food's health rating.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
# 5.1 Continuous predictors (drop sodium_100g due to collinearity)
cont_vars = [
'fat_100g','saturated-fat_100g','trans-fat_100g',
'cholesterol_100g','carbohydrates_100g','sugars_100g','fiber_100g',
'proteins_100g','salt_100g',
'vitamin-a_100g','vitamin-c_100g',
'calcium_100g','iron_100g',
'ingredients_from_palm_oil_n','ingredients_that_may_be_from_palm_oil_n'
]
outcome = 'nutrition-score-fr_100g'
# 5.2 Clean & cast
df_reg = df_final.dropna(subset=cont_vars + ['additives_n', outcome, 'brands']).copy()
for c in cont_vars + ['additives_n', outcome]:
df_reg[c] = pd.to_numeric(df_reg[c], errors='coerce')
df_reg.dropna(subset=cont_vars + ['additives_n', outcome, 'brands'], inplace=True)
# 5.3 Feature engineering
df_reg['log_additives'] = np.log1p(df_reg['additives_n'])
X = df_reg[cont_vars + ['log_additives']].copy()
# non-linear terms
X['sugars_sq'] = X['sugars_100g'] ** 2
X['fat_sq'] = X['fat_100g'] ** 2
# interaction terms
X['sugar_fat'] = X['sugars_100g'] * X['fat_100g']
X['sugar_salt'] = X['sugars_100g'] * X['salt_100g']
X['fat_salt'] = X['fat_100g'] * X['salt_100g']
# 5.4 Brand fixed effects (top 10 + 'other')
top_brands = df_reg['brands'].value_counts().nlargest(10).index
df_reg['brand_top10'] = df_reg['brands'].where(
df_reg['brands'].isin(top_brands), 'other'
)
brand_dummies = pd.get_dummies(df_reg['brand_top10'],
prefix='brand', drop_first=True)
X = pd.concat([X, brand_dummies], axis=1)
y = df_reg[outcome].astype(float)
# 5.5 Standardize continuous regressors
to_scale = cont_vars + ['log_additives','sugars_sq','fat_sq',
'sugar_fat','sugar_salt','fat_salt']
scaler = StandardScaler()
X_scaled = pd.DataFrame(
scaler.fit_transform(X[to_scale]),
columns=to_scale,
index=X.index
)
# rebuild design matrix
X_design = pd.concat([X_scaled, X.drop(columns=to_scale)], axis=1)
X_design = sm.add_constant(X_design)
# 5.6 VIF check (numeric only)
X_vif = X_design.drop(columns=['const']).select_dtypes(include=[np.number])
vif = pd.DataFrame({
'variable': X_vif.columns,
'VIF': [variance_inflation_factor(X_vif.values, i)
for i in range(X_vif.shape[1])]
})
print("Top 10 VIFs:\n", vif.sort_values('VIF', ascending=False).head(10))
# 5.7 Fit OLS
model = sm.OLS(y, X_design.astype(float)).fit()
print(model.summary())
/usr/local/lib/python3.11/dist-packages/statsmodels/regression/linear_model.py:1784: RuntimeWarning: invalid value encountered in scalar divide return 1 - self.ssr/self.uncentered_tss
Top 10 VIFs:
variable VIF
5 sugars_100g 15.340163
0 fat_100g 12.341447
16 sugars_sq 10.255979
17 fat_sq 6.509564
18 sugar_fat 4.042129
1 saturated-fat_100g 3.478562
4 carbohydrates_100g 2.254807
20 fat_salt 2.193900
8 salt_100g 1.436269
7 proteins_100g 1.431740
OLS Regression Results
===================================================================================
Dep. Variable: nutrition-score-fr_100g R-squared: 0.829
Model: OLS Adj. R-squared: 0.829
Method: Least Squares F-statistic: 1.310e+04
Date: Sun, 27 Apr 2025 Prob (F-statistic): 0.00
Time: 03:25:10 Log-Likelihood: -2.1933e+05
No. Observations: 80946 AIC: 4.387e+05
Df Residuals: 80915 BIC: 4.390e+05
Df Model: 30
Covariance Type: nonrobust
===========================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
const 10.4594 0.138 75.540 0.000 10.188 10.731
fat_100g 6.3398 0.045 141.074 0.000 6.252 6.428
saturated-fat_100g 2.9176 0.024 122.380 0.000 2.871 2.964
trans-fat_100g 0.0690 0.013 5.372 0.000 0.044 0.094
cholesterol_100g 0.0041 0.013 0.320 0.749 -0.021 0.029
carbohydrates_100g 1.0818 0.019 56.326 0.000 1.044 1.119
sugars_100g 7.0679 0.050 141.073 0.000 6.970 7.166
fiber_100g -1.7139 0.015 -116.426 0.000 -1.743 -1.685
proteins_100g 0.4292 0.015 28.065 0.000 0.399 0.459
salt_100g 1.1982 0.015 78.219 0.000 1.168 1.228
vitamin-a_100g 0.1131 0.013 8.547 0.000 0.087 0.139
vitamin-c_100g -0.0012 0.013 -0.094 0.925 -0.027 0.024
calcium_100g 0.0039 0.015 0.269 0.788 -0.025 0.033
iron_100g 0.0067 0.013 0.524 0.600 -0.018 0.032
ingredients_from_palm_oil_n 1.861e-15 1.08e-16 17.158 0.000 1.65e-15 2.07e-15
ingredients_that_may_be_from_palm_oil_n -0.0596 0.013 -4.471 0.000 -0.086 -0.033
log_additives 0.3502 0.014 25.118 0.000 0.323 0.378
sugars_sq -3.0089 0.041 -73.492 0.000 -3.089 -2.929
fat_sq -3.0546 0.033 -93.592 0.000 -3.119 -2.991
sugar_fat -1.7688 0.026 -68.835 0.000 -1.819 -1.718
sugar_salt 0.3834 0.015 25.229 0.000 0.354 0.413
fat_salt 1.2758 0.019 67.397 0.000 1.239 1.313
brand_food club 1.4857 0.217 6.834 0.000 1.060 1.912
brand_great value 1.0948 0.190 5.762 0.000 0.722 1.467
brand_kroger 0.6213 0.181 3.434 0.001 0.267 0.976
brand_meijer 0.7338 0.180 4.066 0.000 0.380 1.088
brand_other 0.4821 0.139 3.466 0.001 0.209 0.755
brand_roundy's 0.5039 0.198 2.543 0.011 0.116 0.892
brand_shoprite -0.4257 0.219 -1.941 0.052 -0.856 0.004
brand_spartan 0.8256 0.189 4.362 0.000 0.455 1.197
brand_target stores 0.8649 0.216 4.002 0.000 0.441 1.288
brand_weis 0.5189 0.203 2.561 0.010 0.122 0.916
==============================================================================
Omnibus: 12667.619 Durbin-Watson: 1.142
Prob(Omnibus): 0.000 Jarque-Bera (JB): 205474.026
Skew: -0.207 Prob(JB): 0.00
Kurtosis: 10.794 Cond. No. 1.35e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.53e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
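Note [1] above flags that the reported standard errors assume a correctly specified error covariance; one cheap robustness check would be to re-fit with heteroskedasticity-robust (HC3) errors. A minimal sketch, reusing y and X_design from above:
# Same OLS fit, but with heteroskedasticity-robust (HC3) standard errors
model_hc3 = sm.OLS(y, X_design.astype(float)).fit(cov_type='HC3')
print(model_hc3.summary().tables[1])  # coefficient table with robust SEs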
Summary of Key Findings¶
Overall Fit:
- R² = 0.829, Adj. R² = 0.829 (F = 1.31 × 10⁴, p < 0.001)
- The model explains about 83% of the variance in nutrition scores, although the condition-number warning indicates a near-singular design matrix, so individual coefficients should be read with some care
Primary Macronutrient Effects (standardized coefficients; a higher score means less healthy):
- Sugar is the strongest score-worsening driver (coef = +7.068, p < 0.001)
- Total fat (+6.340, p < 0.001) and saturated fat (+2.918, p < 0.001) both worsen the score
- Fiber has the largest health-improving effect (coef = −1.714, p < 0.001)
- Carbohydrates modestly worsen the score (+1.082, p < 0.001)
Non-Linear & Interaction Dynamics:
- Diminishing Returns: sugars² (−3.009, p < 0.001) and fat² (−3.055, p < 0.001) show that incremental sugar/fat matters less at high levels; see the marginal-effect expression after the interaction list below
Interactions:
- sugar×salt (+0.383, p < 0.001) and fat×salt (+1.276, p < 0.001) amplify the negative health impact
- sugar×fat (−1.769, p < 0.001) slightly offsets the combined sugar/fat harm
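To make the diminishing-returns claim concrete: in raw-feature terms (setting aside the per-column standardization), the fitted functional form implies a marginal effect of sugar on the score of

$$\frac{\partial\,\text{score}}{\partial\,\text{sugars}} = \beta_{\text{sugars}} + 2\,\beta_{\text{sugars}^2}\,\text{sugars} + \beta_{\text{sugar}\times\text{fat}}\,\text{fat} + \beta_{\text{sugar}\times\text{salt}}\,\text{salt}$$

With $\beta_{\text{sugars}^2} < 0$, each additional gram of sugar worsens the score by less at already-high sugar levels, which is exactly the pattern the squared terms capture.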
Additives & Palm-Oil Flags:
- Log-additives increases score (0.353, p < 0.001) but with diminishing marginal impact
- âMay-beâ palm-oil flag lowers score (â0.057, p < 0.001); âfromâpalmâ flag not significant
Micronutrients:
- Vitamin A small positive effect (+0.113, p < 0.001)
- Vitamin C, calcium, iron all non-significant (p > 0.05)
Brand Fixed Effects:
- âFood Clubâ (+1.500, p < 0.001), âGreat Valueâ (+0.999, p < 0.001), âKrogerâ (+0.636, p < 0.001) indicate systematic brand differences
- âShopriteâ marginally lower (â0.408, p â 0.06)
Implications for Product Reformulation
- Targeting sugar, fat, and salt reductions will yield the highest score improvements, but non-linear effects mean large cuts have diminishing returns.
- Boosting fiber remains an effective way to improve health ratings.
- Additive count matters, but its marginal impact tapers; focus on removing the most harmful additives first.
- Brand-level practices significantly shift scores; comparing peers can reveal best-practice benchmarks.
đȘ Challenges¶
Throughout the project, we faced several challenges that influenced our analysis:
Large Initial Dataset and High Dimensionality:
At the beginning, the dataset had over 100 columns and a huge number of rows, many of which were either irrelevant or severely incomplete. Handling such a wide and messy dataset required extensive feature selection and missing value filtering before any meaningful analysis could begin.
Exploratory Data Analysis (EDA) Overload:
Due to the high dimensionality of the data, EDA itself became a major challenge. A large number of features had to be carefully visualized, plotted, and analyzed to understand their distributions, relationships, and usefulness before any modeling work could be performed. Dataset exploration took much more effort than initially expected.
Ingredient Text Cleaning Complexity:
Ingredient lists were highly unstructured, varying by brand, language (English, French, German), formatting, and punctuation. Designing robust cleaning rules that generalized across these variations was a major challenge and required many trial-and-error iterations.
Difficulty in Identifying Clustering Features:
Clustering products was more difficult than initially expected. Given the wide variety of packaged foods and the high number of features, it was not obvious which dimensions were most critical for meaningful clusters. Without proper feature reduction, clusters risked becoming blurred and uninterpretable.
đ Next Steps¶
Building on our findings, there are several promising directions for future improvement and research:
Incorporate Additive Risk Scoring:
Develop a supplementary risk score based on the type and number of additives (e.g., using EFSA or FDA categorizations) to enhance the predictive power beyond Nutri-Score labels.
Enhance Text Modeling Approaches:
Explore advanced NLP techniques (e.g., Word2Vec embeddings, BERT fine-tuning) to better capture subtle relationships between ingredient wording and product healthiness.
Experiment with Ensemble Models:
Combine predictions from structured features (nutrition facts) and unstructured text (ingredients) using ensemble learning to maximize model robustness.
Cluster Interpretation Enhancement:
Apply deeper interpretation methods like SHAP values to cluster profiles, to better explain what features drive each cluster's distinct nutritional signature.
Interactive Consumer Tool Prototype:
Build a simple web-based tool that allows users to input an ingredient list and receive a predicted Nutri-Score or healthiness rating, empowering consumers to make healthier food choices.
⚠Final Unwrapping¶
At the outset of this project, we set out to "unwrap the secrets" behind the nutritional makeup of U.S. packaged foods.
What began as a messy ocean of ingredient lists, additives, and nutrition grades gradually took shape through careful cleaning, exploration, modeling, and critical interpretation.
Along the way, we uncovered striking patterns:
- Sugar and fat remain the dominant drivers of Nutri-Score health ratings.
- Additives, while often overlooked, quietly shape product profiles and consumer perceptions.
- Brand practices leave a distinct fingerprint on nutritional quality, revealing hidden structures beneath the labels.
Yet, our analysis also highlighted the blind spots:
- Nutri-Scores, while powerful, don't fully capture the complexity of food processing and additive risks.
- Ingredient lists, with their chaotic variability, pose an ongoing challenge for clean modeling.
Looking forward, the journey doesn't end here.
Our findings open promising paths: smarter risk scoring for additives, enhanced text-based modeling, and interactive consumer tools to bridge the gap between data and healthier choices.
In a world awash with packaged options, peeling back the layers of nutrition is more critical, and more possible, than ever.
