Unwrapping the Secrets: Nutritional Analytics of U.S. Packaged Foods¶
🕵🏻‍♀️ What is it about?¶
In this project, we explore the nutritional properties of packaged foods in the United States, aiming to uncover hidden patterns and examine how nutritional features are reflected in Nutri-Score labeling.
Why Is It Interesting?¶
As consumers become increasingly health-conscious, understanding the connection between what's inside our food and how it is nutritionally scored has never been more important. In addition, with the rising concern surrounding food additives, it is important for us to provide insights that can inform healthier food choices.
Key Research Questions:¶
- Does Nutri-Score reflect the presence of food additives, or are there potential inconsistencies?
- Do certain brands consistently use more additives and tend to receive lower Nutri-Scores?
- How do food additives and healthiness vary across different brands of packaged foods in the U.S.?
Where Does Our Data Come From?¶
We use the Open Food Facts dataset from Kaggle, focusing on packaged foods available in the U.S. The dataset contains over 300,000 entries with detailed nutritional, additive, and ingredient information.
👉 You can also take a look at the project website: https://world.openfoodfacts.org/
🤔 Before we start, you might wonder, "What is Nutri-Score labeling?"
Nutri-Score is a front-of-pack nutrition labeling system designed to help consumers make healthier food choices at a glance. It rates the overall nutritional quality of food products using a five-color and five-letter scale, from A (dark green) for the healthiest options to E (red) for the least healthy ones.
It was developed in France in 2017, but it is actually pretty common to see in supermarkets today! 😊

1️⃣ Data Loading and Preprocessing¶
1.1 Setting Up Data¶
Before diving into the analysis, let's make sure we have all the necessary tools and data prepared. We will start by importing essential libraries, setting up any required packages, and retrieving the Open Food Facts dataset from Kaggle. ✨ Let's get everything ready!
!pip install umap-learn -q
!pip install unidecode
Collecting unidecode Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB) Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB) Installing collected packages: unidecode Successfully installed unidecode-1.4.0
# Basic utilities
import os
import re
import difflib
import unidecode
from collections import Counter
from itertools import chain
# Data processing
import pandas as pd
import numpy as np
# Text processing
import nltk
from nltk.corpus import stopwords
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
# Dimensionality reduction and clustering
import umap
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# Preprocessing and model selection
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
# Evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
# Dataset download
import kagglehub
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)
# Download Open Food Facts dataset from Kaggle
path = kagglehub.dataset_download("openfoodfacts/world-food-facts")
print("Path to dataset files:", path)
Path to dataset files: /kaggle/input/world-food-facts
# List available files
files = os.listdir(path)
print("Files in dataset:", files)
Files in dataset: ['en.openfoodfacts.org.products.tsv']
# Load the main data file
file_path = os.path.join(path, 'en.openfoodfacts.org.products.tsv')
df = pd.read_csv(file_path, sep='\t', low_memory=False)
1.2 Exploring the Raw Dataset¶
Now that we have successfully loaded the Open Food Facts dataset, let's take a quick look at its overall structure. We will first check the number of rows and columns, and then examine the names and data types of all available attributes.
# dataset review
df.head()
| code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | generic_name | quantity | ... | fruits-vegetables-nuts_100g | fruits-vegetables-nuts-estimate_100g | collagen-meat-protein-ratio_100g | cocoa_100g | chlorophyl_100g | carbon-footprint_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | glycemic-index_100g | water-hardness_100g | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000000003087 | http://world-en.openfoodfacts.org/product/0000... | openfoodfacts-contributors | 1474103866 | 2016-09-17T09:17:46Z | 1474103893 | 2016-09-17T09:18:13Z | Farine de blé noir | NaN | 1kg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 0000000004530 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Banana Chips Sweetened (Whole) | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 | 14.0 | NaN | NaN |
| 2 | 0000000004559 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | Peanuts | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN |
| 3 | 0000000016087 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055731 | 2017-03-09T10:35:31Z | 1489055731 | 2017-03-09T10:35:31Z | Organic Salted Nut Mix | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | 12.0 | NaN | NaN |
| 4 | 0000000016094 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489055653 | 2017-03-09T10:34:13Z | 1489055653 | 2017-03-09T10:34:13Z | Organic Polenta | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 163 columns
# data shape
print("Data rows and columns number:", df.shape)
Data rows and columns number: (356027, 163)
# check all columns' name and datatype:
print("All Columns and Their Data Types in the Dataset:\n")
print(f"{'Column Name':<50} {'Type'}")
print("-" * 60)
for col in df.columns:
print(f"{col:<50} {df[col].dtype}")
All Columns and Their Data Types in the Dataset: Column Name Type ------------------------------------------------------------ code object url object creator object created_t object created_datetime object last_modified_t object last_modified_datetime object product_name object generic_name object quantity object packaging object packaging_tags object brands object brands_tags object categories object categories_tags object categories_en object origins object origins_tags object manufacturing_places object manufacturing_places_tags object labels object labels_tags object labels_en object emb_codes object emb_codes_tags object first_packaging_code_geo object cities object cities_tags object purchase_places object stores object countries object countries_tags object countries_en object ingredients_text object allergens object allergens_en object traces object traces_tags object traces_en object serving_size object no_nutriments float64 additives_n float64 additives object additives_tags object additives_en object ingredients_from_palm_oil_n float64 ingredients_from_palm_oil float64 ingredients_from_palm_oil_tags object ingredients_that_may_be_from_palm_oil_n float64 ingredients_that_may_be_from_palm_oil float64 ingredients_that_may_be_from_palm_oil_tags object nutrition_grade_uk float64 nutrition_grade_fr object pnns_groups_1 object pnns_groups_2 object states object states_tags object states_en object main_category object main_category_en object image_url object image_small_url object energy_100g float64 energy-from-fat_100g float64 fat_100g float64 saturated-fat_100g float64 -butyric-acid_100g float64 -caproic-acid_100g float64 -caprylic-acid_100g float64 -capric-acid_100g float64 -lauric-acid_100g float64 -myristic-acid_100g float64 -palmitic-acid_100g float64 -stearic-acid_100g float64 -arachidic-acid_100g float64 -behenic-acid_100g float64 -lignoceric-acid_100g float64 -cerotic-acid_100g float64 -montanic-acid_100g float64 -melissic-acid_100g float64 monounsaturated-fat_100g float64 polyunsaturated-fat_100g float64 omega-3-fat_100g float64 -alpha-linolenic-acid_100g float64 -eicosapentaenoic-acid_100g float64 -docosahexaenoic-acid_100g float64 omega-6-fat_100g float64 -linoleic-acid_100g float64 -arachidonic-acid_100g float64 -gamma-linolenic-acid_100g float64 -dihomo-gamma-linolenic-acid_100g float64 omega-9-fat_100g float64 -oleic-acid_100g float64 -elaidic-acid_100g float64 -gondoic-acid_100g float64 -mead-acid_100g float64 -erucic-acid_100g float64 -nervonic-acid_100g float64 trans-fat_100g float64 cholesterol_100g float64 carbohydrates_100g float64 sugars_100g float64 -sucrose_100g float64 -glucose_100g float64 -fructose_100g float64 -lactose_100g float64 -maltose_100g float64 -maltodextrins_100g float64 starch_100g float64 polyols_100g float64 fiber_100g float64 proteins_100g float64 casein_100g float64 serum-proteins_100g float64 nucleotides_100g float64 salt_100g float64 sodium_100g float64 alcohol_100g float64 vitamin-a_100g float64 beta-carotene_100g float64 vitamin-d_100g float64 vitamin-e_100g float64 vitamin-k_100g float64 vitamin-c_100g float64 vitamin-b1_100g float64 vitamin-b2_100g float64 vitamin-pp_100g float64 vitamin-b6_100g float64 vitamin-b9_100g float64 folates_100g float64 vitamin-b12_100g float64 biotin_100g float64 pantothenic-acid_100g float64 silica_100g float64 bicarbonate_100g float64 potassium_100g float64 chloride_100g float64 calcium_100g float64 phosphorus_100g float64 iron_100g float64 magnesium_100g float64 zinc_100g float64 copper_100g float64 
manganese_100g float64 fluoride_100g float64 selenium_100g float64 chromium_100g float64 molybdenum_100g float64 iodine_100g float64 caffeine_100g float64 taurine_100g float64 ph_100g float64 fruits-vegetables-nuts_100g float64 fruits-vegetables-nuts-estimate_100g float64 collagen-meat-protein-ratio_100g float64 cocoa_100g float64 chlorophyl_100g float64 carbon-footprint_100g float64 nutrition-score-fr_100g float64 nutrition-score-uk_100g float64 glycemic-index_100g float64 water-hardness_100g float64
1.3 Attribute Key Definitions¶
While it may be overwhelming to explore every column in the dataset individually, we prepare for the upcoming analysis by organizing key attributes into major categories. Below is a list of the main columns and their definitions from the Open Food Facts dataset:
🔹 Identifiers
- code: Barcode of the product.
- url: URL of the product page on the Open Food Facts website.
- creator, created_t, created_datetime: Creator and timestamp of product entry creation.
- last_modified_t, last_modified_datetime: Timestamps for when the entry was last updated.
🔹 Product Description
- product_name, generic_name: Commercial and general name of the product.
- quantity: Product quantity (e.g., "500g", "2L").
- packaging, packaging_tags: Text description of packaging materials (e.g., plastique, plastic, glass jar).
- brands: Commercial brand names and standardized brand identifiers.
- brands_tags: Provides normalized lowercase brand tokens for grouping.
- categories, categories_tags, categories_en: Product classification into food categories.
🔹 Geographic & Origin Information
- origins, manufacturing_places: Source or production location.
- countries, countries_en: Market availability of the product.
🔹 Labeling and Compliance
- labels, labels_en, labels_tags: Certifications and diet types (e.g., organic, halal).
- emb_codes, emb_codes_tags: Packaging codes (e.g., recycling info).
- purchase_places, stores: Points of purchase and stores where sold.
🔹 Ingredients & Allergens
- ingredients_text: Raw text of the ingredients list.
- allergens, allergens_en: Allergen content (e.g., nuts, gluten).
- traces, traces_en: Potential traces of allergens not listed as ingredients.
- additives, additives_n, additives_tags: Additive info and count.
- ingredients_from_palm_oil, ingredients_that_may_be_from_palm_oil: Palm oil usage certainty and estimation.
🔹 Nutrition Grades & Groups
- nutrition_grade_fr, nutrition_grade_uk: Health grade from A (best) to E (worst).
- pnns_groups_1, pnns_groups_2: French public health nutrition food groupings.
🔹 Nutritional Composition (per 100g)
- Energy & Macronutrients: energy_100g, fat_100g, sugars_100g, proteins_100g, fiber_100g, salt_100g, sodium_100g
- Fatty Acids (Mono/Poly/Saturated): monounsaturated-fat_100g, omega-3-fat_100g, -palmitic-acid_100g, etc.
- Sugars & Derivatives: -sucrose_100g, -glucose_100g, -lactose_100g, starch_100g, polyols_100g
- Vitamins & Minerals: vitamin-a_100g, vitamin-c_100g, iron_100g, zinc_100g, calcium_100g, iodine_100g, etc.
- Other Features: alcohol_100g, caffeine_100g, ph_100g, carbon-footprint_100g, glycemic-index_100g
1.4 Checking Nulls and Duplicates¶
Before diving into detailed data analysis, we first check for duplicate entries and missing values. We also calculate the percentage of missing values for each column and identify columns with 100% missing values for potential removal.
# Find number of duplicated rows in the dataset
num_duplicates = df.duplicated().sum()
print(f"The number of duplicated rows: {num_duplicates}")
The number of duplicated rows: 0
# Count missing values per column
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print("Missing Values Count per Column:")
display(missing_counts)
Missing Values Count per Column:
| Column | Missing Count |
|---|---|
| no_nutriments | 356027 |
| chlorophyl_100g | 356027 |
| water-hardness_100g | 356027 |
| glycemic-index_100g | 356027 |
| -butyric-acid_100g | 356027 |
| ... | ... |
| code | 26 |
| url | 26 |
| created_datetime | 10 |
| creator | 3 |
| created_t | 3 |
161 rows × 1 columns
➡️ There are no duplicate rows in the dataset; a clean start is always a good start.
# Calculate missing value percentage
total_rows = df.shape[0]
missing_percent = (missing_counts / total_rows) * 100
print("The Missing Values (% of total rows) for Each Column:")
display(missing_percent.sort_values(ascending=False))
The Missing Values (% of total rows) for Each Column:
| Column | Missing (%) |
|---|---|
| no_nutriments | 100.000000 |
| chlorophyl_100g | 100.000000 |
| water-hardness_100g | 100.000000 |
| glycemic-index_100g | 100.000000 |
| -butyric-acid_100g | 100.000000 |
| ... | ... |
| code | 0.007303 |
| url | 0.007303 |
| created_datetime | 0.002809 |
| creator | 0.000843 |
| created_t | 0.000843 |
161 rows × 1 columns
➡️ We observe that many columns have missing entries. Next, we identify the columns that are entirely empty. 🔍
# Identify columns with 100% missing values
full_missing_cols = missing_percent[missing_percent == 100].index
print("Columns with 100% missing values:")
for col in full_missing_cols:
print(col)
Columns with 100% missing values: no_nutriments chlorophyl_100g water-hardness_100g glycemic-index_100g -butyric-acid_100g -melissic-acid_100g -nervonic-acid_100g -erucic-acid_100g -mead-acid_100g -elaidic-acid_100g -caproic-acid_100g -lignoceric-acid_100g -cerotic-acid_100g nutrition_grade_uk ingredients_from_palm_oil ingredients_that_may_be_from_palm_oil
➡️ The columns above have 100% missing values and will be dropped during preprocessing.
1.5 Define the Scope of Analysis: Focus on U.S. Data¶
Since our analysis focuses on packaged foods available in the United States, we first need to filter out records from other countries and regions using the countries_en column.
Country labels may vary in format (e.g., "USA", "U.S.", "United States of America"), and some entries list multiple countries (e.g., "Spain,United States"), introducing potential noise. To ensure accurate filtering, we explore two approaches.
Attempted Approach: Fuzzy Matching (Not Used)¶
We initially attempted to use fuzzy matching with the difflib library to capture all records loosely matching "United States":
unique_countries = df['countries_en'].dropna().unique()
similar = difflib.get_close_matches("United States", unique_countries, n=10, cutoff=0.6)
However, this approach produced many false matches that mix multiple countries, as shown:
United States; United States,香港; Peru,United States; Spain,United States; Italy,United States; Chile,United States; Taiwan,United States; Sweden,United States; Panama,United States; Mexico,United States
Second Approach: Exact Match with U.S. Labels¶
# Define accepted U.S. labels
usa_labels = ['United States', 'USA', 'U.S.', 'US', 'United States of America']
# Filter rows where countries_en exactly matches any accepted label
df_usa = df[df['countries_en'].isin(usa_labels)].copy()
We then compared the size of the dataset before and after filtering:
# Print result
print("Before filtering, rows and columns:", df.shape)
print(f"After filtering, retained {df_usa.shape[0]} rows out of {df.shape[0]} "
f"({df_usa.shape[0]/df.shape[0]:.1%}) after filtering U.S. labels.")
Before filtering, rows and columns: (356027, 163) After filtering, retained 173159 rows out of 356027 (48.6%) after filtering U.S. labels.
➡️ After filtering, we retained 48.6% of the original dataset, focusing purely on products sold in the United States.
Verifying the Filtered Data:
To confirm that only U.S. data remains, we sampled a few rows:
# Sample to verify country fields
df_usa[['countries', 'countries_tags', 'countries_en']].sample(10)
| countries | countries_tags | countries_en | |
|---|---|---|---|
| 69486 | US | en:united-states | United States |
| 146344 | US | en:united-states | United States |
| 107607 | US | en:united-states | United States |
| 131265 | US | en:united-states | United States |
| 36611 | US | en:united-states | United States |
| 125188 | US | en:united-states | United States |
| 298030 | United States | en:united-states | United States |
| 59010 | US | en:united-states | United States |
| 59328 | US | en:united-states | United States |
| 17235 | US | en:united-states | United States |
➡️ As expected, all entries are correctly labeled as belonging to the United States.
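Beyond eyeballing a random sample, a one-line assertion gives a programmatic guarantee. This is a quick sketch re-using the usa_labels list defined above:
# Programmatic check: every remaining row must carry one of the accepted U.S. labels
assert df_usa['countries_en'].isin(usa_labels).all(), "Non-U.S. rows slipped through!"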
1.6 Cleaning Data: Handling Missing and Duplicate Values¶
In this dataset, many columns contain a high proportion of missing values. We will drop any column whose missing rate exceeds the 80% threshold (THRESHOLD = 0.8).
The dataset df_usa to be analyzed will be stored in df_final, and it must meet the following criteria:
- At least 50,000 rows remain after cleaning and dropping null values.
- It includes a rich set of features that are intuitively relevant for prediction.
Step 1: Visualize Missing Rates Across Columns¶
# 50 columns with the lowest missing rates
missing_counts = df_usa.isnull().sum()
missing_percent = missing_counts / len(df_usa)
missing_percent_sorted = missing_percent.sort_values(ascending=True)
top_n = 50
missing_percent_top = missing_percent_sorted.head(top_n)
# Bar chart of missing rates with the 80% drop threshold
plt.figure(figsize=(20, 10))
missing_percent_top.plot(kind='bar')
plt.xlabel('Columns')
plt.ylabel('Missing Rate (fraction)')
plt.title(f'{top_n} Columns with the Lowest Missing Rates')
plt.xticks(rotation=90, ha='right')
plt.axhline(0.8, color='red', linestyle='--', linewidth=2)
plt.text(-0.5, 0.8, '80% Threshold', color='red', va='bottom', ha='left')
plt.tight_layout()
plt.show()
➡️ Observation: Several columns exceed the 80% missing rate threshold. These columns will be candidates for removal.
Step 2: Drop Columns with Excessive Missing Values¶
We now remove all columns with more than 80% missing entries.
# Set missing value threshold
THRESHOLD = 0.8
# Specify columns to drop
cols_to_drop = missing_percent[missing_percent > THRESHOLD].index
df_usa_trimmed = df_usa.drop(cols_to_drop, axis=1)
Comparison of dataset dimensions before and after dropping columns:
print("Original dataset: rows and columns", df.shape)
print("After dropping columns: rows and columns", df_usa_trimmed.shape)
Original dataset: rows and columns (356027, 163) After dropping columns: rows and columns (173159, 42)
Step 3: Drop Remaining Nulls and Duplicates¶
# Drop all nulls
df_final = df_usa_trimmed.dropna().copy()
print("After dropping remaining nulls: rows and columns", df_final.shape)
# Drop all duplicates
df_final = df_final.drop_duplicates().copy()
print("After dropping remaining duplicates: rows and columns", df_final.shape)
After dropping remaining nulls: rows and columns (82380, 42) After dropping remaining duplicates: rows and columns (82380, 42)
print("Final Check for null and duplicates:")
# Check for nulls
total_nulls = df_final.isnull().sum().sum()
print(f"Total missing values: {total_nulls}")
# Check for duplicates
duplicate_rows = df_final.duplicated().sum()
print(f"Total duplicated rows: {duplicate_rows}")
Final Check for null and duplicates: Total missing values: 0 Total duplicated rows: 0
Let's take a quick look at a few random samples from the final dataset:
# display dataset
df_final.sample(5)
| code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | brands | brands_tags | ... | fiber_100g | proteins_100g | salt_100g | sodium_100g | vitamin-a_100g | vitamin-c_100g | calcium_100g | iron_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 118826 | 0607880033342 | http://world-en.openfoodfacts.org/product/0607... | usda-ndb-import | 1489137479 | 2017-03-10T09:17:59Z | 1489137480 | 2017-03-10T09:18:00Z | Sweet & Salty Caramel Trail Mix | Southern Home | southern-home | ... | 6.7 | 13.33 | 1.01600 | 0.400 | 0.000000 | 0.0000 | 0.133 | 0.00240 | 20.0 | 20.0 |
| 142621 | 0749826575520 | http://world-en.openfoodfacts.org/product/0749... | usda-ndb-import | 1489096243 | 2017-03-09T21:50:43Z | 1489096243 | 2017-03-09T21:50:43Z | High Protein Fruit & Nut Bar | Pure Protein | pure-protein | ... | 9.4 | 33.96 | 0.76708 | 0.302 | 0.000057 | 0.0045 | 0.151 | 0.00136 | 9.0 | 9.0 |
| 159299 | 0850335006013 | http://world-en.openfoodfacts.org/product/0850... | usda-ndb-import | 1489093649 | 2017-03-09T21:07:29Z | 1489093649 | 2017-03-09T21:07:29Z | Verry Berry Fruit Pop | Squeaky Pops | squeaky-pops | ... | 0.0 | 0.00 | 0.01524 | 0.006 | 0.000000 | 0.0029 | 0.000 | 0.00000 | 5.0 | 5.0 |
| 94808 | 0077661147306 | http://world-en.openfoodfacts.org/product/0077... | usda-ndb-import | 1489138130 | 2017-03-10T09:28:50Z | 1489138130 | 2017-03-10T09:28:50Z | Opa, Greek Yogurt Roasted Garlic Dressing | Litehouse, Litehouse Inc. | litehouse,litehouse-inc | ... | 0.0 | 3.33 | 1.69418 | 0.667 | 0.000000 | 0.0000 | 0.133 | 0.00000 | 7.0 | 7.0 |
| 141907 | 0747599322013 | http://world-en.openfoodfacts.org/product/0747... | usda-ndb-import | 1489075490 | 2017-03-09T16:04:50Z | 1489075490 | 2017-03-09T16:04:50Z | Squares, Chocolate Assortment | Ghirardelli Chocolate, Ghirardelli Chocolate ... | ghirardelli-chocolate,ghirardelli-chocolate-co... | ... | 5.0 | 5.00 | 0.15748 | 0.062 | 0.000000 | 0.0030 | 0.100 | 0.00360 | 20.0 | 20.0 |
5 rows × 42 columns
We also list all retained feature columns:
# check all final retained columns' name:
print("Final retained columns:\n")
for col in df_final.columns:
print(col)
Final retained columns: code url creator created_t created_datetime last_modified_t last_modified_datetime product_name brands brands_tags countries countries_tags countries_en ingredients_text serving_size additives_n additives additives_tags additives_en ingredients_from_palm_oil_n ingredients_that_may_be_from_palm_oil_n nutrition_grade_fr states states_tags states_en energy_100g fat_100g saturated-fat_100g trans-fat_100g cholesterol_100g carbohydrates_100g sugars_100g fiber_100g proteins_100g salt_100g sodium_100g vitamin-a_100g vitamin-c_100g calcium_100g iron_100g nutrition-score-fr_100g nutrition-score-uk_100g
📥 You can download this CSV file! 👇🏻
# save df_final
df_final.to_csv('df_final.csv', index=False)
2️⃣ Exploratory Data Analysis (EDA)¶
We now proceed to explore the nutritional properties and labeling characteristics of U.S. packaged foods. Let's keep the momentum going!
2.1 Cleaning Numerical Features¶
Since nutrient quantities are reported per 100g, errors such as negative values or abnormally large values (e.g., sugar > 100g) can occur. Before diving into feature distribution analysis, we want to ensure that the numerical nutritional attributes fall within reasonable, physically possible ranges.
Outlier Identification and Handling
We focus on cleaning the following key nutritional features:
- Macronutrients: sugars_100g, fat_100g, proteins_100g, carbohydrates_100g, etc.
- Health-impacting features: salt_100g, sodium_100g, cholesterol_100g, fiber_100g, etc.
- Micronutrients: vitamin-a_100g, vitamin-c_100g, calcium_100g, iron_100g
# Define allowed value range
valid_range_mask = df_final[[
'saturated-fat_100g',
'trans-fat_100g',
'fat_100g',
'cholesterol_100g',
'carbohydrates_100g',
'sugars_100g',
'fiber_100g',
'proteins_100g',
'salt_100g',
'sodium_100g',
'vitamin-a_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g',
]].apply(lambda x: x.between(0, 100)).all(axis=1)
# Filter the dataset
df_final_cleaned = df_final[valid_range_mask].copy()
Compare the dataset size before and after removing outliers:
# Compare shape before and after filtering
print(f"Before outlier removal: {df_final.shape}")
print(f"After outlier removal: {df_final_cleaned.shape}")
print(f"Removed {df_final.shape[0] - df_final_cleaned.shape[0]} rows due to out-of-bound values.")
Before outlier removal: (82380, 42) After outlier removal: (82357, 42) Removed 23 rows due to out-of-bound values.
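As a quick diagnostic sketch (checked_cols and violations are illustrative names, re-using the same column list as the mask above), we can count how many out-of-range values each column contributes:
# Count out-of-range values per column to see where the violations come from
checked_cols = ['saturated-fat_100g', 'trans-fat_100g', 'fat_100g', 'cholesterol_100g',
                'carbohydrates_100g', 'sugars_100g', 'fiber_100g', 'proteins_100g',
                'salt_100g', 'sodium_100g', 'vitamin-a_100g', 'vitamin-c_100g',
                'calcium_100g', 'iron_100g']
violations = (~df_final[checked_cols].apply(lambda x: x.between(0, 100))).sum()
print(violations[violations > 0].sort_values(ascending=False))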
We list the products with invalid values below. ➡️ 23 out of over 82,000 rows will be removed.
df_final[~valid_range_mask][[
'trans-fat_100g',
'sugars_100g',
'salt_100g',
'sodium_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g',
]]
| trans-fat_100g | sugars_100g | salt_100g | sodium_100g | vitamin-c_100g | calcium_100g | iron_100g | |
|---|---|---|---|---|---|---|---|
| 1483 | 0.00 | 5.71 | 870.85678 | 342.857 | 0.0000 | 0.000 | 0.00617 |
| 8043 | 0.00 | 2.31 | 781.53768 | 307.692 | 0.0000 | 0.015 | 0.00138 |
| 11206 | 0.00 | 22.58 | 327.74128 | 129.032 | 0.0000 | 0.032 | 0.00232 |
| 12036 | 369.00 | 0.81 | 0.24892 | 0.098 | 0.0000 | 0.054 | 0.00073 |
| 41869 | 0.00 | 17.86 | 2.72034 | 1.071 | -0.0021 | 0.143 | 0.00129 |
| 50827 | 0.00 | 20.59 | 130.73634 | 51.471 | 0.0000 | 0.118 | 0.00106 |
| 69041 | 0.00 | 9.46 | 1098.37728 | 432.432 | 0.0000 | 0.054 | 0.00243 |
| 69050 | 0.00 | 11.27 | 858.59112 | 338.028 | 0.0000 | 0.028 | 0.00254 |
| 95375 | 0.00 | 33.33 | 1318.38192 | 519.048 | 0.0000 | 0.262 | 0.01414 |
| 107178 | 0.00 | 8.48 | 1139.15190 | 448.485 | 0.0000 | 0.012 | 0.00087 |
| 108870 | -0.70 | 20.42 | 0.22352 | 0.088 | 0.0000 | 0.070 | 0.00190 |
| 110111 | 0.00 | 50.00 | 101.60000 | 40.000 | 0.0000 | 0.000 | 0.00000 |
| 113274 | 0.00 | 50.00 | 0.30734 | 0.121 | 0.0043 | 285.714 | 0.00000 |
| 119086 | 0.00 | -3.57 | 0.95250 | 0.375 | 0.0086 | 0.071 | 0.00129 |
| 122368 | 0.00 | 65.85 | 123.90120 | 48.780 | 0.0000 | 0.000 | 0.00000 |
| 133607 | 0.00 | 0.00 | 101.23678 | 39.857 | 0.0000 | 0.000 | 0.00000 |
| 133611 | 0.00 | 0.00 | 104.86644 | 41.286 | 0.0000 | 0.000 | 0.00000 |
| 139181 | 0.00 | 0.00 | 187.96000 | 74.000 | 2.1000 | 0.000 | 0.00000 |
| 140017 | 173.26 | 1.60 | 0.80772 | 0.318 | 0.0024 | 0.027 | 0.00072 |
| 148432 | 0.00 | 7.65 | 100.85324 | 39.706 | 0.0000 | 0.000 | 0.00000 |
| 155175 | -3.57 | 10.71 | 2.44856 | 0.964 | 0.0000 | 0.071 | 0.00386 |
| 162397 | 0.00 | 0.00 | 1669.14322 | 657.143 | 0.0000 | 0.057 | 0.00309 |
| 351458 | 0.00 | 14.29 | 0.00000 | 0.000 | 0.0107 | 0.014 | -0.00026 |
df_final = df_final_cleaned.copy()
2.2 Text Preprocessing for Ingredient and Additive Fields¶
In this section, we clean the ingredient-related text fields to ensure consistency and accuracy in subsequent analyses.
The raw dataset may contain duplicate product entries, missing values, and inconsistencies in textual formatting. We address these issues through four main steps:
- Step 1: Check for and remove duplicate products based on the combination of product_name, brands, and ingredients_text.
- Step 2: Standardize ingredients_list and additives_list: we split the ingredients_text and additives_en columns into lists to make additive analysis easier in later sections.
- Step 3: Further clean ingredients_list.
Step 1: Remove Duplicate Products¶
Duplicate entries can arise due to slight variations in product names, brands, or ingredients descriptions.
If duplicate entries exist, we manually review them. We identify and remove duplicate products based on the combination of product_name, brands, and ingredients_text to avoid bias in the analysis.
# Step 1: Check and Remove Duplicates
# Normalize text: lowercase and strip spaces
df_final['product_name'] = df_final['product_name'].str.lower().str.strip()
df_final['brands'] = df_final['brands'].str.lower().str.strip()
df_final['ingredients_text'] = df_final['ingredients_text'].str.lower().str.strip()
# Find duplicates
duplicates = df_final[df_final.duplicated(subset=[ 'product_name', 'brands', 'ingredients_text'], keep=False)]
print(f"Number of duplicate entries: {duplicates.shape[0]}")
# Sort duplicates for easier review
duplicates = duplicates.sort_values(by=['product_name', 'brands', 'ingredients_text'])
# Display the potential duplicates
duplicates[['code', 'product_name', 'brands', 'ingredients_text', 'nutrition_grade_fr', 'additives_en', 'nutrition-score-fr_100g']].head(10)
Number of duplicate entries: 2597
| code | product_name | brands | ingredients_text | nutrition_grade_fr | additives_en | nutrition-score-fr_100g | |
|---|---|---|---|---|---|---|---|
| 49387 | 0041497097548 | 1% low fat chocolate milk, chocolate | weis quality, weis markets inc. | low fat milk, high fructose corn syrup, sugar,... | b | E407 - Carrageenan | 0.0 |
| 49388 | 0041497097555 | 1% low fat chocolate milk, chocolate | weis quality, weis markets inc. | low fat milk, high fructose corn syrup, sugar,... | b | E407 - Carrageenan | 0.0 |
| 57213 | 0044100106804 | 1% lowfat milk | hood, hp hood llc | lowfat milk, ascorbic acid (vitamin c), vitami... | a | E300 - Ascorbic acid | -1.0 |
| 57264 | 0044100169267 | 1% lowfat milk | hood, hp hood llc | lowfat milk, ascorbic acid (vitamin c), vitami... | a | E300 - Ascorbic acid | -1.0 |
| 54127 | 0041900074302 | 1% lowfat milk, chocolate | trumoo, dean foods company | lowfat milk, sugar, contains less than 1% of: ... | a | E407 - Carrageenan | -1.0 |
| 54182 | 0041900075712 | 1% lowfat milk, chocolate | trumoo, dean foods company | lowfat milk, sugar, contains less than 1% of: ... | a | E407 - Carrageenan | -1.0 |
| 44558 | 0041318020540 | 100% juice | schnuck markets inc. | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid | 2.0 |
| 44759 | 0041318131444 | 100% juice | schnuck markets inc. | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid | 2.0 |
| 7575 | 0011213015347 | 100% juice, tomato | spartan | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid,E330 - Citric acid | 1.0 |
| 8025 | 0011213049427 | 100% juice, tomato | spartan | tomato concentrate (water, tomato paste), salt... | b | E300 - Ascorbic acid,E330 - Citric acid | 1.0 |
Next, we remove duplicate entries, keeping only the first occurrence:
# Check original shape before removing duplicates
print(f"Before removing duplicate products: {df_final.shape}")
# Remove duplicates, keep the first occurrence
df_final = df_final.drop_duplicates(subset=['product_name', 'brands', 'ingredients_text'])
print(f"After removing duplicate products: {df_final.shape}")
Before removing duplicate products: (82357, 42) After removing duplicate products: (80946, 42)
➡️ Approximately 1.7% of the dataset entries were identified as duplicates and removed. This ensures that each product in our analysis corresponds to a unique combination of ingredients and branding, reducing potential bias in brand-level or additive-level analyses.
Step 2: Standardize ingredients_list and additives_list¶
The raw ingredients_text and additives_en fields in the dataset are stored as comma-separated strings with inconsistent casing and formatting. In this step, we standardize the ingredients_text and additives_en columns into structured list formats.
- For ingredients_list, we split the text by commas and standardize to lowercase.
- For additives_list, we split the additives string by commas and standardize to lowercase.
This formatting will make it easier to analyze specific ingredients and additives in later sections.
# Step 2: Standardize ingredients_list and additives_list
# Split ingredients_text into a list by commas
df_final['ingredients_list'] = df_final['ingredients_text'].apply(lambda x: [i.strip().lower() for i in x.split(',')] if pd.notnull(x) else [])
# Split additives_en into a list
df_final['additives_list'] = df_final['additives_en'].apply(lambda x: x.split(',') if x != 'None' else [])
# Check an example
df_final[['product_name', 'ingredients_list','additives_en', 'additives_list']].head(10)
| product_name | ingredients_list | additives_en | additives_list | |
|---|---|---|---|---|
| 82 | peanuts, mixed nuts | [peanuts, honey, coating (sucrose, wheat starc... | E415 - Xanthan gum | [E415 - Xanthan gum] |
| 149 | turkish apricots | [apricots, sulfur dioxide.] | E220 - Sulphur dioxide | [E220 - Sulphur dioxide] |
| 152 | chili mango | [dried mango, paprika, sugar, salt, citric aci... | E330 - Citric acid | [E330 - Citric acid] |
| 153 | milk chocolate pretzels | [milk chocolate (sugar, cocoa butter, chocolat... | E101 - Riboflavin,E101i - Riboflavin,E322 - Le... | [E101 - Riboflavin, E101i - Riboflavin, E322 -... |
| 200 | butter croissants | [wheat flour, butter (cream), water, yeast, su... | E300 - Ascorbic acid | [E300 - Ascorbic acid] |
| 201 | wild blueberry muffins | [enriched wheat flour (wheat flour, malted bar... | E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... | [E101 - Riboflavin, E101i - Riboflavin, E375 -... |
| 202 | bolillos | [enriched wheat flour (wheat flour niacin, red... | E101 - Riboflavin,E101i - Riboflavin,E200 - So... | [E101 - Riboflavin, E101i - Riboflavin, E200 -... |
| 203 | biscuit | [enriched wheat flour (niacin, reduced iron, t... | E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... | [E101 - Riboflavin, E101i - Riboflavin, E375 -... |
| 204 | biscuit | [enriched wheat flour (niacin, reduced iron, t... | E101 - Riboflavin,E101i - Riboflavin,E375 - Ni... | [E101 - Riboflavin, E101i - Riboflavin, E375 -... |
| 205 | oatmeal raisin cookie | [enriched flour (bleached wheat flour, niacin,... | E101 - Riboflavin,E101i - Riboflavin,E160a - A... | [E101 - Riboflavin, E101i - Riboflavin, E160a ... |
✅ As shown above, both ingredients_list and additives_list are now structured as clean lists.
Step 3: Further Clean ingredients_text¶
While the initial standardization split the ingredients_text field into basic lists, further cleaning is necessary to ensure analytical consistency.
Many entries still contain:
- Non-English words (e.g., French and German terms).
- Marketing or domain-specific noise (e.g., "natural", "product", "ingredient").
- Numeric expressions and inconsistent punctuation.
To address these issues, we perform advanced preprocessing through several steps:
- Check for Non-English or Noisy Characters
- Expand Stopwords List
- Build Translation Dictionary
- Define Cleaning Function
- Apply Cleaning and Explore the Most Common Ingredient
(1) Check for Non-English or Noisy Characters¶
First, we scan for unusual characters or non-ASCII text in the ingredients.
df_final['ingredients_text'].dropna().sample(5)
| | ingredients_text |
|---|---|
| 22667 | premium fresh pork, water, premium fresh beef,... |
| 183109 | *solution: water, potassium lactate, sodium ph... |
| 315392 | water, tomatillo, jalapeno peppers, habanero p... |
| 136002 | soy protein isolate, organic cane syrup, organ... |
| 108289 | enriched bleached flour (wheat flour, niacin, ... |
# Define a function to count "weird" characters in each text
def count_weird_chars(text):
if pd.isna(text):
return 0
return len(re.findall(r'[^a-zA-Z0-9,\s\(\)\.\-]', str(text)))
# Apply to ingredients_text
df_final['weird_char_count'] = df_final['ingredients_text'].apply(count_weird_chars)
# Sort by weirdness descending
df_final_sorted_weird = df_final.sort_values(by='weird_char_count', ascending=False)
# Display the top 10 weirdest entries
with pd.option_context('display.max_colwidth', None):
    display(df_final_sorted_weird[['product_name', 'ingredients_text', 'weird_char_count']].head(10))
✔️ Upon inspection, the presence of unusual characters in the ingredients_text field was minimal and did not pose significant challenges for downstream text processing. Thus, no additional filtering based on weird_char_count was necessary.
(2) Expand Stopwords List¶
In text data, certain words appear frequently but contribute little meaningful information. These words are known as stopwords, for example, "the", "and", "with".
By filtering out these common but low-information words, we focus our analysis on meaningful terms such as "sugar" or "protein", which are more informative about the product's nutritional properties. We define a custom stopwords list that includes:
- Standard English stopwords.
- Common French and German food-related words.
- Marketing noise terms (e.g., "natural", "contains").
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
True
# Stopwords list: English + French + German + domain-specific
custom_stopwords = set(stopwords.words('english')) | set([
# French
'de', 'Ă ', 'le', 'la', 'les', 'du', 'et', 'des', 'pour', 'avec', 'sur',
'au', 'ou', 'par', 'en', 'lait', 'eau', 'ingrédients', 'produits',
'contient', 'valeur', 'nutrition', 'base', 'moyenne',
# German
'mit', 'von', 'der', 'das', 'und', 'ein', 'eine', 'dem', 'den', 'fĂŒr', 'ohne',
'inhaltsstoffe', 'zutaten', 'lebensmittel',
# Domain noise / marketing
'organic', 'natural', 'product', 'ingredient', 'ingredients',
'mg', 'g', 'ar', 'bl', 'fi', '’', '“', 'less', 'contains'
])
(3) Build Translation Dictionary¶
We also build a small dictionary to translate common French and German food words into English equivalents.
# Translation dictionary: French + German → English
translation_dict = {
# French
'sucre': 'sugar',
'sel': 'salt',
'huile': 'oil',
'farine': 'flour',
'poudre': 'powder',
'lait': 'milk',
'arome': 'flavor',
'chocolat': 'chocolate',
'acide': 'acid',
'fromage': 'cheese',
'cacao': 'cocoa',
'beurre': 'butter',
# German
'zucker': 'sugar',
'salz': 'salt',
'mehl': 'flour',
'milch': 'milk',
'kakao': 'cocoa',
'aroma': 'flavor',
'butter': 'butter'
}
(4) Define Cleaning Function¶
We create a cleaning function that performs:
- Lowercasing and accent removal
- Removal of numeric and marketing expressions
- Basic punctuation normalization
- Stopword filtering
- Phrase-based tokenization
# define cleaning function
def clean_and_tokenize_ingredients(text):
if pd.isna(text):
return []
# 1. Normalize to lowercase and remove accents
text = unidecode.unidecode(text.lower())
# 2. Remove numeric expressions (e.g., '2%', '100g', '25mg')
text = re.sub(r'\b\d+%?\b', ' ', text)
text = re.sub(r'\b\d+[a-z]+\b', ' ', text)
# 3. Remove special characters (keep commas and minimal punctuation)
text = re.sub(r'[^a-z0-9,\.\-\(\)\s]', ' ', text)
# 4. Normalize space
text = re.sub(r'\s+', ' ', text).strip()
# 5. Split on commas â each comma-separated item is treated as a phrase
raw_phrases = [p.strip() for p in text.split(',')]
clean_phrases = []
for phrase in raw_phrases:
# Strip trailing punctuation
phrase = phrase.strip('.,() ')
# Translate known foreign terms (optional)
phrase = ' '.join([translation_dict.get(w, w) for w in phrase.split()])
# Skip short/meaningless phrases
if len(phrase.split()) >= 2 and not any(w in custom_stopwords for w in phrase.split()):
clean_phrases.append(phrase)
return clean_phrases
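As a quick sanity check, we can run the function on a made-up ingredients string (the input below is purely illustrative):
# Demo on a synthetic ingredients string
sample = "Sucre roux, cocoa butter, soy lecithin, sea salt, 2% milk"
print(clean_and_tokenize_ingredients(sample))
# Expected: ['sugar roux', 'cocoa butter', 'soy lecithin', 'sea salt']
# Note: single-word items such as 'milk' are dropped by the >= 2-word phrase filter.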
(5) Apply Cleaning and Explore Top Ingredients¶
After cleaning and tokenizing ingredients, we now proceed to explore the most common components across packaged foods.
Firstly, we apply the cleaning function to the dataset:
df_final['clean_ingredient_tokens'] = df_final['ingredients_text'].apply(clean_and_tokenize_ingredients)
Then flatten and visualize the most common ingredient tokens:
# Flatten the list of tokens
all_words = [word for tokens in df_final['clean_ingredient_tokens'] for word in tokens]
word_counts = Counter(all_words)
# Display top 50 words
pd.Series(word_counts).sort_values(ascending=False).head(10)
| Ingredient | Count |
|---|---|
| folic acid | 19036 |
| citric acid | 18796 |
| corn syrup | 13024 |
| reduced iron | 12425 |
| thiamine mononitrate | 10395 |
| soy lecithin | 9883 |
| soybean oil | 8885 |
| xanthan gum | 7948 |
| cocoa butter | 7064 |
| sea salt | 6314 |
import matplotlib.pyplot as plt
import seaborn as sns
# Prepare DataFrame
top_ingredients = pd.Series(word_counts).sort_values(ascending=False).head(50)
# Plot
plt.figure(figsize=(12, 14))
ax = sns.barplot(x=top_ingredients.values, y=top_ingredients.index, palette='viridis')
plt.xlabel('Count')
plt.ylabel('Ingredient')
plt.title('Top 50 Most Common Ingredients in U.S. Packaged Foods')
plt.grid(axis='x', linestyle='--', alpha=0.7)
# Add count labels to the right of bars
for i, (value, name) in enumerate(zip(top_ingredients.values, top_ingredients.index)):
ax.text(value + 100, i, f'{value:,}', va='center', ha='left', fontsize=9)
plt.tight_layout()
plt.show()
In the word cloud below, words with larger sizes represent more frequently occurring ingredients.
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=600, background_color='white', colormap='tab20')
wordcloud.generate_from_frequencies(word_counts)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Ingredients Word Cloud')
plt.show()
🔬 Observations from the 50 Most Common Ingredients in U.S. Packaged Foods¶
- High use of processed sugars and fats suggests a reliance on calorie-dense ingredients, which may impact Nutri-Score ratings and perceived product healthiness.
- The frequent appearance of additives and emulsifiers indicates a strong dependence on industrial food technologies for texture stabilization and product preservation.
💊 Added Vitamins and Minerals:
Foods with added vitamins and minerals feature prominently, with ingredients like folic acid (vitamin B9), reduced iron, and thiamine mononitrate (vitamin B1) ranking high. The prevalence of vitamin and mineral fortification reflects public health policies encouraging nutrient enrichment in the U.S. packaged food industry.
🍫 Added Sugars:
Processed sugars and sweeteners, such as corn syrup, high fructose corn syrup, and brown sugar, are heavily represented, underscoring the heavy reliance on processed sugars in packaged foods.
🧪 Additives and Emulsifiers:
Additives and stabilizers including xanthan gum, guar gum, modified starches, and soy lecithin are widely used to maintain food texture and extend shelf life.
🧈 Oil-based Ingredients:
Oil-based ingredients like soy lecithin, soybean oil, cocoa butter, and canola oil are also common across products, consistent with the formulation of processed baked goods and snacks. These ingredients contribute to the caloric density and flavor profile of packaged foods.
➡️ Overall, fortified nutrients, sugars, stabilizers, and oils dominate the ingredient composition of U.S. packaged foods, painting a comprehensive picture of both nutritional enhancement efforts and industrial formulation priorities.
2.3 Initial Data Exploration¶
Having processed the textual information on additives, we now shift our focus to understanding how Nutri-Scores are distributed across the dataset. This exploration will provide an initial sense of the overall nutritional quality represented in the data.
Nutrition Grade Distribution¶
Observations from the following distribution indicate a clear imbalance across nutrition grades:
- Most packaged food products are classified under lower nutrition grades (D and E).
- Healthier food products (Grade A) are significantly underrepresented.
This imbalance suggests that the U.S. packaged food landscape is skewed towards less healthy options according to the Nutri-Score classification system.
import matplotlib.pyplot as plt
import seaborn as sns
# Total count for percentage
total = df_final['nutrition_grade_fr'].value_counts().sum()
# Plotting the distribution of Nutri-Scores
plt.figure(figsize=(6, 6))
sns.countplot(
data=df_final,
x='nutrition_grade_fr',
order=sorted(df_final['nutrition_grade_fr'].dropna().unique()),
palette='Set2',
legend=False
)
plt.title('Distribution of Nutrition Grades (France Nutri-Score)', fontsize=16)
plt.xlabel('Nutrition Grade (A=Healthiest, E=Least Healthy)', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)
# Add count and percentage for each bar
for p in plt.gca().patches:
count = p.get_height()
percent = count / total * 100
label = f'{count:.0f}\n({percent:.1f}%)'
plt.gca().annotate(
label,
(p.get_x() + p.get_width() / 2., count),
ha='center', va='center',
fontsize=11, color='black',
)
plt.tight_layout()
plt.show()
Exploring Key Nutritional Metrics Across Nutri-Score¶
To further understand the drivers behind Nutri-Score distributions, we analyze the spread of key nutritional attributes (fat, sugar, salt, and additive counts) across nutrition grades.
# Define groups for comparison
low_grades = df_final[df_final['nutrition_grade_fr'].isin(['d', 'e'])]
high_grades = df_final[df_final['nutrition_grade_fr'].isin(['a', 'b'])]
# Compare key nutritional metrics
metrics = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for idx, metric in enumerate(metrics):
sns.boxplot(
data=df_final,
x='nutrition_grade_fr',
y=metric,
order=['a', 'b', 'c', 'd', 'e'],
palette='Set2',
ax=axes[idx],
legend=False,
showfliers=False
)
axes[idx].set_title(f'Distribution of {metric} by Nutrition Grade')
axes[idx].set_xlabel('Nutrition Grade')
axes[idx].set_ylabel(metric)
plt.tight_layout()
plt.show()
➡️ From the above plot of key nutritional metrics by Nutri-Score:
Fat and Sugar: It's no surprise that products with higher fat and sugar content tend to receive lower Nutri-Scores (Grades D and E). Grade E products, in particular, show the highest median fat (~25g/100g) and sugar levels, while Grade A foods stay impressively lean, with fat and sugar around just ~2g/100g.
Salt: Salt levels paint a slightly fuzzier picture. Salt content shows a wider spread across grades C, D, and E, but the median salt content is consistently higher in the lower grades.
Additives: Interestingly, the number of additives (additives_n) does not vary strongly with Nutri-Score. This suggests that Nutri-Score judgments are driven by the big players, fats and sugars, rather than by the presence of food additives.
2.4 Deeper Analysis: Additives, Brands, and Nutrition Grades¶
🎯 Research Question 1: Does Nutri-Score reflect the presence of additives?¶
In this section, we conduct an exploratory analysis focused on the relationships among additives, brands, and nutritional quality, as reflected by Nutri-Score grades (A–E).
(1) Distribution of Number of Additives
We first examine the overall distribution of additive counts across products:
# Plot distribution of additives number across all products
plt.figure(figsize=(8, 6))
sns.histplot(df_final['additives_n'], bins=30, kde=False)
plt.title('Distribution of Number of Additives in Products')
plt.xlabel('Number of Additives')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
➡️ Findings:
- Most products contain between 0 and 5 additives.
- The distribution is right-skewed: a small subset has more than 10 additives (quantified below).
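To put a number on that skew, pandas' built-in sample-skewness estimator serves as a quick check (a small sketch; positive values indicate a right-skewed distribution):
# Positive skewness confirms the long right tail of additive counts
print(f"Skewness of additives_n: {df_final['additives_n'].skew():.2f}")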
(2) Relationship Between Additive Counts and Nutrition Score
To assess whether additive presence impacts nutritional quality ratings, we plot the number of additives against Nutrition Scores:
You might be wondering: What exactly is a Nutrition Score?
Similar to Nutri-Score, the Nutrition Score is a numerical representation of a product's nutritional quality.
Lower scores indicate healthier products, with scores mapped to Nutri-Score grades (A = healthiest, E = least healthy).

Figure: Mapping of Nutrition Score points to Nutri-Score grades for solid foods and beverages.
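To make this mapping concrete, below is a minimal sketch of the score-to-grade conversion for solid foods, assuming the commonly published 2017 Nutri-Score cut-offs (beverages use a different scale, and score_to_grade is an illustrative helper, not a dataset field):
# Illustrative mapping from the FR nutrition score to a Nutri-Score grade (solid foods)
def score_to_grade(score: float) -> str:
    if score <= -1:
        return 'a'  # healthiest
    elif score <= 2:
        return 'b'
    elif score <= 10:
        return 'c'
    elif score <= 18:
        return 'd'
    return 'e'      # least healthy
print(score_to_grade(5))  # a score of 5 maps to grade 'c'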
# Scatter plot: Number of additives vs. Nutrition Score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='additives_n', y='nutrition-score-fr_100g', data=df_final, alpha=0.5)
plt.title('Number of Additives vs. Nutrition Score')
plt.xlabel('Number of Additives')
plt.ylabel('Nutrition Score (Lower is Better)')
plt.grid(True)
plt.show()
➡️ Finding:
The scatter plot reveals no strong pattern:
Even products with many additives (10–20) can have good Nutrition Scores, while some products with very few additives still score poorly. The Nutrition Score does not strongly penalize additive counts.
Next, we create a boxplot:
The boxplot across grades confirms that the median number of additives remains similar across Nutri-Score grades (A–E).
# Boxplot: Additives count by Nutri-Score Grade (A to E)
plt.figure(figsize=(8, 6))
sns.boxplot(x='nutrition_grade_fr', y='additives_n', data=df_final, order=['a', 'b', 'c', 'd', 'e'])
plt.title('Distribution of Additives Count Across Nutri-Score Grades')
plt.xlabel('Nutri-Score Grade')
plt.ylabel('Number of Additives')
plt.grid(True)
plt.show()
Histogram for A/B Products:
# Histogram: Additives count for products with Nutri-Score A or B
plt.figure(figsize=(8, 6))
df_good = df_final[df_final['nutrition_grade_fr'].isin(['a', 'b'])]
sns.histplot(df_good['additives_n'], bins=20, kde=False)
plt.title('Additives Count for Products with Nutri-Score A or B')
plt.xlabel('Number of Additives')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
➡️ Findings:
Observations from Additives Count in High-Scoring (A/B) Products
- Most products classified as "healthy" (Nutri-Score A or B) contain relatively few additives, typically 0 to 2.
- However, a non-negligible subset of these products still contains more than 5 additives, and a few even exceed 10 additives.
- The distribution is heavily right-skewed, suggesting that while minimal additive usage is common among healthy-rated products, high additive counts do not necessarily prevent a product from being rated as healthy.
(3) Correlation Analysis
We quantify the relationship between additive counts and Nutrition Score:
- Pearson correlation: 0.033
- Spearman correlation: 0.034
Both correlation coefficients are very close to zero, indicating an extremely weak positive relationship.
➡️ Findings:
Additive counts do not meaningfully correlate with Nutrition Score.
# Pearson Correlation (linear)
pearson_corr = df_final['additives_n'].corr(df_final['nutrition-score-fr_100g'], method='pearson')
# Spearman Correlation (monotonic)
spearman_corr = df_final['additives_n'].corr(df_final['nutrition-score-fr_100g'], method='spearman')
print(f"Pearson correlation between additives_n and nutrition-score-fr_100g: {pearson_corr:.3f}")
print(f"Spearman correlation between additives_n and nutrition-score-fr_100g: {spearman_corr:.3f}")
Pearson correlation between additives_n and nutrition-score-fr_100g: 0.033 Spearman correlation between additives_n and nutrition-score-fr_100g: 0.034
Summary of Findings:
Distribution of Additives:
- Most products contain between 0 to 5 additives.
- A small number of products have more than 10 additives, indicating a right-skewed distribution.
Relationship Between Additives and Nutri-Score:
- Scatter Plot: No strong pattern observed between the number of additives and the Nutrition Score.
- Boxplot: The median number of additives is similar across all Nutri-Score grades (A to E).
- Histogram (A/B products): Many products rated as "healthy" (Nutri-Score A or B) still contain multiple additives, with some having more than 5 additives.
Correlation Analysis:
- Pearson correlation: 0.033
- Spearman correlation: 0.034
- Both correlations are very close to zero, indicating a very weak positive relationship between the number of additives and the Nutrition Score.
Potential Misclassifications:
- Some products with a high number of additives still receive a good Nutri-Score (A or B).
- This suggests that Nutri-Score may overlook additive information when classifying food healthiness.
🗝️ Research Question 1: Conclusion¶
The current Nutri-Score system fails to effectively capture the potential health implications of high additive usage.
Potential inconsistencies:
- Some products with a high number of additives still receive a good Nutri-Score (A/B).
- This suggests that Nutri-Score may overlook additive information when assessing overall product healthiness.
Interestingly, as we observed earlier, not all additives are inherently harmful. Many additives serve as sources of beneficial nutrients, micronutrients, or antioxidants. Thus, the presence of additives should be evaluated critically, rather than being universally perceived as negative.
🎯 Research Question 2: Which brands use more additives and have lower Nutri-Scores?¶
In this section, we explore whether certain brands consistently use more additives or produce products with generally lower Nutri-Scores.
# Keep only the first brand if multiple brands are listed
df_final['brand_main'] = df_final['brands'].apply(lambda x: x.split(',')[0].strip().lower() if pd.notnull(x) else x)
# Step 1: Group by Brand
# Group by main brand and calculate mean additives and mean nutrition score
brand_stats = df_final.groupby('brand_main').agg({
'additives_n': 'mean',
'nutrition-score-fr_100g': 'mean',
'product_name': 'count' # count number of products per brand
}).reset_index()
# Rename columns for clarity
brand_stats.rename(columns={'product_name': 'product_count'}, inplace=True)
# Check
brand_stats.sample(5)
| brand_main | additives_n | nutrition-score-fr_100g | product_count | |
|---|---|---|---|---|
| 1537 | bucky badger | 2.5 | 17.500000 | 2 |
| 4627 | grant park custom meats | 3.0 | 3.000000 | 1 |
| 3349 | echo lake foods | 5.0 | 2.000000 | 1 |
| 2754 | daily bread | 2.0 | 20.000000 | 2 |
| 10153 | sea port products corp | 1.0 | -1.333333 | 3 |
# Step 2: Top Brands by Additives
# Only keep brands with enough products (e.g., more than 30 products) to avoid noisy small brands.
# Filter brands with at least 30 products
brand_stats_filtered = brand_stats[brand_stats['product_count'] >= 30]
# Top 10 brands by average additives
top_additive_brands = brand_stats_filtered.sort_values('additives_n', ascending=False).head(10)
# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='additives_n', y='brand_main', data=top_additive_brands)
plt.title('Top 10 Brands with Highest Average Number of Additives')
plt.xlabel('Average Number of Additives')
plt.ylabel('Brand')
plt.grid(True)
plt.show()
# Step 3: Top Brands by Worst Nutrition Score
# Top 10 brands by worst average nutrition score
top_unhealthy_brands = brand_stats_filtered.sort_values('nutrition-score-fr_100g', ascending=False).head(10)
# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='nutrition-score-fr_100g', y='brand_main', data=top_unhealthy_brands)
plt.title('Top 10 Brands with Worst Average Nutrition Score')
plt.xlabel('Average Nutrition Score (Higher = Worse)')
plt.ylabel('Brand')
plt.grid(True)
plt.show()
# Step 4: Scatter Plot: Additives vs. Nutrition Score
# Scatter plot of average additives vs. average nutrition score
plt.figure(figsize=(8, 6))
sns.scatterplot(x='additives_n', y='nutrition-score-fr_100g', data=brand_stats_filtered)
plt.title('Average Additives vs. Average Nutrition Score per Brand')
plt.xlabel('Average Number of Additives')
plt.ylabel('Average Nutrition Score')
plt.grid(True)
plt.show()
# Select brands with high average additives
high_additive_brands = top_additive_brands['brand_main'].tolist()
# Filter original data
df_high_additives = df_final[df_final['brand_main'].isin(high_additive_brands)]
# Show Nutri-Grade distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='nutrition_grade_fr', data=df_high_additives, order=['a', 'b', 'c', 'd', 'e'])
plt.title('Nutri-Grade Distribution for Brands with High Additive Usage')
plt.xlabel('Nutri-Score Grade')
plt.ylabel('Number of Products')
plt.grid(True)
plt.show()
🗝️ Research Question 2: Conclusion¶
1. Brands with the Highest Average Number of Additives:¶
- Brands like Arnie's, Nissin, and Toft's average around 10–13 additives per product.
- Such consistently high averages suggest heavy use of food additives across these brands' product lines.
2. Brands with the Worst Average Nutrition Scores:¶
- Brands such as Brown & Haley, Reese's, and Palmer have the worst average Nutrition Scores.
- These brands mainly focus on confectionery and sweets products, traditionally high in sugar and fat.
3. Relationship Between Additives and Nutrition Scores Across Brands:¶
- From the scatter plot, there is no strong direct relationship between a brand's average number of additives and its average Nutrition Score.
- Some brands use many additives but still have moderate Nutrition Scores, while some brands with poor Nutrition Scores do not necessarily have many additives.
â Overall, while additives alone do not directly determine a product's Nutri-Score or Nutrition Scores, heavy additive usage is often a marker of lower overall nutritional quality at the brand level.


đŻ Research Question 3: How do food additives and healthiness vary across different brands?¶
In this part, we investigate the types of additives most commonly used across different groups of brands to understand how additive usage patterns vary with food healthiness.
# Select low-additive brands
# Bottom 10 brands by average additives (filter brands with >= 30 products)
low_additive_brands = brand_stats_filtered.sort_values('additives_n', ascending=True).head(10)
# Filter the products for these brands
low_additives_df = df_final[df_final['brand_main'].isin(low_additive_brands['brand_main'])]
# Select brands with worst Nutri-Score
# Top 10 brands with worst average Nutri-Score
worst_nutriscore_brands = brand_stats_filtered.sort_values('nutrition-score-fr_100g', ascending=False).head(10)
# Filter the products for these brands
worst_nutriscore_df = df_final[df_final['brand_main'].isin(worst_nutriscore_brands['brand_main'])]
# Analyze Top Additives for Each Group
def plot_top_additives(dataframe, title):
additives_used = list(chain.from_iterable(dataframe['additives_list'])) # Faster flattening
additives_counter = Counter(additives_used)
top_additives = additives_counter.most_common(20)
top_additives_df = pd.DataFrame(top_additives, columns=['Additive', 'Count'])
plt.figure(figsize=(10, 6))
sns.barplot(x='Count', y='Additive', data=top_additives_df)
plt.title(title)
plt.xlabel('Count')
plt.ylabel('Additive')
plt.grid(True)
plt.show()
# High-additive brands
plot_top_additives(df_high_additives, 'Top 20 Additives Used by High Additive Brands')
# Low-additive brands
plot_top_additives(low_additives_df, 'Top 20 Additives Used by Low Additive Brands')
# Worst Nutri-Score brands
plot_top_additives(worst_nutriscore_df, 'Top 20 Additives Used by Brands with Worst Nutri-Scores')
# General dataset
plot_top_additives(df_final, 'Top 20 Additives Used by All Brands')
# Rank the additives by their usage count
def rank_additives(dataframe):
# Fast flatten the list of additives
additives_used = list(chain.from_iterable(dataframe['additives_list']))
# Count additives
additives_counter = Counter(additives_used)
# Create and sort DataFrame
additives_rank_df = pd.DataFrame(additives_counter.items(), columns=['Additive', 'Count'])
additives_rank_df = additives_rank_df.sort_values(by='Count', ascending=False).reset_index(drop=True)
# Add Rank column
additives_rank_df.index += 1
additives_rank_df.index.name = 'Rank'
return additives_rank_df
# Apply to all products
additives_rank_df = rank_additives(df_final)
# Display the result
display(additives_rank_df)
| Rank | Additive | Count |
|---|---|---|
| 1 | E330 - Citric acid | 20026 |
| 2 | E101 - Riboflavin | 19564 |
| 3 | E101i - Riboflavin | 19559 |
| 4 | E375 - Nicotinic acid | 19535 |
| 5 | E322 - Lecithins | 16498 |
| ... | ... | ... |
| 321 | E555 - Potassium aluminium silicate | 1 |
| 322 | E343i - Monomagnesium phosphate | 1 |
| 323 | E365 - Sodium fumarate | 1 |
| 324 | E266 - Sodium dehydroacetate | 1 |
| 325 | E470 - Sodium/potassium/calcium and magnesium ... | 1 |
325 rows × 2 columns
đïž Research Question 3: Conclusion¶
Additive Usage Varies by Brand Type:
- High Additive Brands rely more on additives such as E375 (Nicotinic acid), E101 (Riboflavin), and E322 (Lecithins), many of which are vitamins or natural emulsifiers.
- Low Additive Brands also use E322 but rely more on compounds like E330 (Citric acid), E509 (Calcium chloride), and E150a (Plain caramel).
Most Frequently Used Additives May Still Be "Healthy":
- Surprisingly, additives with the highest overall usage across all brands include natural compounds, such as citric acid, riboflavin, and nicotinic acid, which are commonly recognized as safe and even beneficial.
- This counters the intuition that "more additives = more harmful." Additive type matters more than quantity alone.
Not All Additives Are Equal:
- We group additives into two general categories (a small tagging sketch follows at the end of this conclusion):
- Generally Healthy Additives: vitamins, natural acids, fibers (e.g., E101, E375, E330).
- Potentially Unhealthy Additives: synthetic colorants, artificial sweeteners, emulsifiers (e.g., E129, E133, E951).
- Products with many synthetic additives are more likely to be ultra-processed and score worse on Nutri-Scores.
Implication:
- Nutri-Score does not directly factor in additive types. Therefore, two products may receive similar grades while differing significantly in additive composition.
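To make the two-way grouping above reproducible, a small tagging helper can be sketched. This is a minimal sketch: the two E-number sets are the illustrative examples named above (not an official or exhaustive classification), and categorize_additives is a hypothetical helper name; it assumes additives_list entries follow the "E330 - Citric acid" format seen in this dataset.
# Illustrative E-number sets taken from the examples above (not an official list)
HEALTHY_E = {'E101', 'E330', 'E375'}     # vitamins / natural acids
UNHEALTHY_E = {'E129', 'E133', 'E951'}   # synthetic colorants / sweeteners
def categorize_additives(additives_list):
    # "E330 - Citric acid" -> "E330"
    codes = [a.split(' - ')[0] for a in additives_list]
    return pd.Series({
        'healthy_additives': sum(c in HEALTHY_E for c in codes),
        'unhealthy_additives': sum(c in UNHEALTHY_E for c in codes),
    })
# Tag every product, then compare category counts across Nutri-Score grades
additive_flags = df_final['additives_list'].apply(categorize_additives)
display(pd.concat([df_final['nutrition_grade_fr'], additive_flags], axis=1)
          .groupby('nutrition_grade_fr').mean().round(3))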
3ïžâŁ Classification Modeling¶
3.1 Principal Component Analysis (PCA)¶
Let's first select the following numeric features, which are the core indicators in nutritional labeling. There are 18 features in total:
# include only the nutritional columns for PCA
nutrient_cols = df_final[[
'additives_n',
'ingredients_from_palm_oil_n',
'ingredients_that_may_be_from_palm_oil_n',
'energy_100g',
'fat_100g',
'saturated-fat_100g',
'trans-fat_100g',
'cholesterol_100g',
'carbohydrates_100g',
'sugars_100g',
'fiber_100g',
'proteins_100g',
'salt_100g',
'sodium_100g',
'vitamin-a_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g'
]]
# drop rows with missing values in those columns
df_nutrient = nutrient_cols.dropna()
df_nutrient.head(5)
| | additives_n | ingredients_from_palm_oil_n | ingredients_that_may_be_from_palm_oil_n | energy_100g | fat_100g | saturated-fat_100g | trans-fat_100g | cholesterol_100g | carbohydrates_100g | sugars_100g | fiber_100g | proteins_100g | salt_100g | sodium_100g | vitamin-a_100g | vitamin-c_100g | calcium_100g | iron_100g |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 1.0 | 0.0 | 0.0 | 2389.0 | 42.86 | 7.14 | 0.0 | 0.000 | 25.00 | 14.29 | 7.1 | 25.00 | 0.54356 | 0.214 | 0.000000 | 0.0000 | 0.071 | 0.00514 |
| 149 | 1.0 | 0.0 | 0.0 | 1046.0 | 0.00 | 0.00 | 0.0 | 0.000 | 62.50 | 52.50 | 7.5 | 2.50 | 0.00000 | 0.000 | 0.001125 | 0.0000 | 0.050 | 0.00360 |
| 152 | 1.0 | 0.0 | 0.0 | 1569.0 | 2.50 | 0.00 | 0.0 | 0.000 | 87.50 | 65.00 | 2.5 | 2.50 | 1.96850 | 0.775 | 0.000750 | 0.0000 | 0.100 | 0.00090 |
| 153 | 5.0 | 0.0 | 0.0 | 1883.0 | 22.50 | 12.50 | 0.0 | 0.012 | 70.00 | 42.50 | 2.5 | 5.00 | 1.01600 | 0.400 | 0.000075 | 0.0000 | 0.050 | 0.00180 |
| 200 | 1.0 | 0.0 | 0.0 | 1523.0 | 16.88 | 10.39 | 0.0 | 0.052 | 44.16 | 5.19 | 1.3 | 7.79 | 1.08966 | 0.429 | 0.000195 | 0.1013 | 0.026 | 0.00094 |
To use PCA to reduce the dimensionality of the data, we follow best practice and standardize the features first:
# Standardizing the features
X_scaled = StandardScaler().fit_transform(df_nutrient)
Next, we explore the ideal number of components for PCA. The output of explained_variance_ratio_ tells us how much of the variance each component explains:
# Apply PCA with 18 components
pca = PCA(n_components=18)
X_pca = pca.fit_transform(X_scaled)
# The variance explained by each component
np.set_printoptions(suppress=True)
pca.explained_variance_ratio_
array([0.16566702, 0.13153279, 0.10773104, 0.07771031, 0.07305444,
0.06295702, 0.05915664, 0.05824494, 0.05748542, 0.05220961,
0.04915088, 0.03967248, 0.03290017, 0.0197183 , 0.01169987,
0.00110908, 0. , 0. ])
Here, we create an explained variance ratio plot to interpret how much variance the principal components explain cumulatively:
# Get Explained Variance Ratio and create plot
explained_var = pca.explained_variance_ratio_
cum_var = np.insert(np.cumsum(explained_var), 0, 0.0)
x_full = np.arange(0, len(explained_var) + 1)
# plotting
plt.figure(figsize=(8, 6))
plt.plot(x_full, cum_var, marker="o", label="Cumulative")
plt.axhline(0.95, color="red", linestyle="--", label="95% Threshold")
# graph format
plt.xlim(-0.5, len(explained_var)+0.5)
plt.xticks(x_full)
plt.xlabel("Number of Principal Components")
plt.ylabel("Explained Variance Ratio")
plt.title("Explained Variance per Principal Component")
plt.legend()
plt.tight_layout()
plt.show()
Ideally, we set the number of principal components to the smallest value at which the curve flattens out. From the plot above, the first 13 components (of 18) explain more than 95% of the variance, so we initially set n_components = 13 for the following analysis.
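The same cutoff can be confirmed programmatically instead of reading it off the plot; a small sketch using the ratios already computed above:
# Smallest number of components whose cumulative explained variance reaches 95%
n_components_95 = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_components_95}")  # expected: 13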
Now, we wrap PCA in a helper function that projects the dataset onto the principal components and returns a DataFrame for further analysis and clustering.
from sklearn.decomposition import PCA
def run_pca(X_scaled, n_components):
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)
df_pca = pd.DataFrame(X_pca, columns=[f"PC{i+1}" for i in range(n_components)])
return pca, df_pca
Deciding How Many Features to Keep for KMeans Clustering¶
After running Principal Component Analysis, we initially kept 13 components that together explain about 95% of the variation in the data.
However, we noticed a problem:
Having too many dimensions can actually make it harder for the model to find meaningful groups, and KMeans struggles to decide which points should belong together.
đ To avoid this problem, we decided to reduce the number of features even further before running KMeans. We tested different numbers of PCA features and, for each case, measured clustering quality using the silhouette score.
Higher scores mean better, clearer clusters.
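For reference, the silhouette score of a point i compares its mean distance to points in its own cluster, a(i), with its mean distance to points in the nearest other cluster, b(i):

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

Scores near 1 indicate points that sit firmly inside their own cluster; scores near 0 indicate points lying on a boundary between clusters.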
Here were the results:
| Number of PCA features | Silhouette Score |
|---|---|
| 4 | 0.3159 |
| 5 | 0.3064 |
| 6 | 0.2864 |
| 8 | 0.2920 |
| 10 | 0.2992 |
| 13 | 0.2357 |
Clustering Evaluation: How Good Are Our Groups?¶
The best configuration in the experiment below gave a silhouette score of 0.3159.
- This score indicates that the clusters are moderately well-separated.
- In nutrition data, this is expected: foods often fall along a spectrum of taste, like moderately salty or slightly sugary, rather than into totally separate categories.
Based on the results, we chose to move forward with 4 principal components, because this setting produced the best clustering quality.
# Standardizing
X_scaled = StandardScaler().fit_transform(df_nutrient)
# List of PCA dimensions you want to test
pca_dims = [4, 5, 6, 8, 10, 13]
# Dictionary to store silhouette scores
silhouette_scores = {}
for dim in pca_dims:
# PCA Dimension Reduction
pca = PCA(n_components=dim, random_state=42)
X_pca = pca.fit_transform(X_scaled)
# Apply KMeans; n_clusters for each dimensionality was chosen via the Elbow Method
kmeans = KMeans(n_clusters=dim - 1, random_state=42)
clusters = kmeans.fit_predict(X_pca)
# check silhouette score for the performance of KMeans clustering:
score = silhouette_score(X_pca, clusters)
silhouette_scores[dim] = score
# Output
for dim, score in silhouette_scores.items():
print(f"Silhouette score for {dim}D PCA + KMeans: {score:.4f}")
Silhouette score for 4D PCA + KMeans: 0.3159 Silhouette score for 5D PCA + KMeans: 0.3064 Silhouette score for 6D PCA + KMeans: 0.2864 Silhouette score for 8D PCA + KMeans: 0.2920 Silhouette score for 10D PCA + KMeans: 0.2992 Silhouette score for 13D PCA + KMeans: 0.2357
KMeans Clustering and Visualization with UMAP¶
# dataset for KMeans Clustering
pca_kmeans, df_pca_kmeans = run_pca(X_scaled, 4)
df_pca_kmeans.head(5)
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| 0 | 2.684036 | 1.629462 | -1.264479 | -1.027111 |
| 1 | 0.049841 | -1.180093 | 1.096871 | -2.193361 |
| 2 | 0.695542 | -1.189935 | 2.092358 | -1.137370 |
| 3 | 2.072854 | -0.649734 | 0.724378 | 0.590306 |
| 4 | 0.486327 | 0.340976 | -0.266710 | -0.917434 |
We first use the Elbow Method to find the optimal k for KMeans:
distortions = []
K_range = range(1, 30)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(df_pca_kmeans)
distortions.append(kmeans.inertia_)
# Plot elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K_range, distortions, marker='o')
plt.title("Elbow Method for Optimal k")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Within-cluster sum of squares (distortion)")
plt.xticks(K_range)
plt.grid(True)
plt.tight_layout()
plt.show()
Now we apply KMeans with k=4 and add a new column, clusters_label, to df_pca_kmeans:
# k=4 KMeans Clustering
kmeans = KMeans(n_clusters=4, random_state=42)
df_pca_kmeans['clusters_label'] = kmeans.fit_predict(df_pca_kmeans)
df_pca_kmeans.head(10)
| | PC1 | PC2 | PC3 | PC4 | clusters_label |
|---|---|---|---|---|---|
| 0 | 2.684036 | 1.629462 | -1.264479 | -1.027111 | 2 |
| 1 | 0.049841 | -1.180093 | 1.096871 | -2.193361 | 3 |
| 2 | 0.695542 | -1.189935 | 2.092358 | -1.137370 | 3 |
| 3 | 2.072854 | -0.649734 | 0.724378 | 0.590306 | 3 |
| 4 | 0.486327 | 0.340976 | -0.266710 | -0.917434 | 3 |
| 5 | 0.433520 | -0.518498 | 0.152523 | 0.135389 | 3 |
| 6 | -0.602105 | 0.263374 | -0.157433 | -0.293366 | 1 |
| 7 | 0.739413 | 1.276978 | -1.010996 | 0.214544 | 2 |
| 8 | 0.450569 | 0.545287 | -0.582644 | 0.199125 | 1 |
| 9 | 1.175014 | -1.104427 | 0.393440 | 2.838227 | 3 |
Now we check the number of data points in each cluster:
print(np.unique(df_pca_kmeans['clusters_label'], return_counts=True))
(array([0, 1, 2, 3], dtype=int32), array([ 349, 36919, 11865, 31813]))
We identified four clusters using KMeans. One of them (Cluster 0) contains only 349 data points, far fewer than the others, so it may not represent a broad product group. We nevertheless continue our analysis on Clusters 0, 1, 2, and 3, which together cover the entire dataset and provide interpretable patterns.
# Convert the 4-component PCA data to 2D using UMAP
# (use only the PC columns so the cluster labels do not leak into the embedding)
umap_2d = umap.UMAP(
    n_components=2,
    n_neighbors=10,
    min_dist=0.1,
    random_state=42
).fit_transform(df_pca_kmeans[['PC1', 'PC2', 'PC3', 'PC4']].values)
plt.figure(figsize=(6,5))
sns.scatterplot(
x=umap_2d[:, 0],
y=umap_2d[:, 1],
# hue=df_final['nutrition_grade_fr'],
hue=df_pca_kmeans['clusters_label'],
palette='Set2',
s=12,
alpha=0.8
)
plt.title("UMAP 2-D embedding of packaged foods")
plt.xlabel("UMAP-1"); plt.ylabel("UMAP-2")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
df_final
| | code | url | creator | created_t | created_datetime | last_modified_t | last_modified_datetime | product_name | brands | brands_tags | ... | vitamin-c_100g | calcium_100g | iron_100g | nutrition-score-fr_100g | nutrition-score-uk_100g | ingredients_list | additives_list | weird_char_count | clean_ingredient_tokens | brand_main |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 0000000033688 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489050424 | 2017-03-09T09:07:04Z | 1489050424 | 2017-03-09T09:07:04Z | peanuts, mixed nuts | northgate market | northgate-market | ... | 0.0000 | 0.071 | 0.00514 | 14.0 | 14.0 | [peanuts, honey, coating (sucrose, wheat starc... | [E415 - Xanthan gum] | 0 | [coating (sucrose, wheat starch, xanthan gum, ... | northgate market |
| 149 | 0000000045292 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069958 | 2017-03-09T14:32:38Z | 1489069958 | 2017-03-09T14:32:38Z | turkish apricots | northgate | northgate | ... | 0.0000 | 0.050 | 0.00360 | 8.0 | 8.0 | [apricots, sulfur dioxide.] | [E220 - Sulphur dioxide] | 0 | [sulfur dioxide] | northgate |
| 152 | 0000000045421 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489069957 | 2017-03-09T14:32:37Z | 1489069957 | 2017-03-09T14:32:37Z | chili mango | torn & glasses | torn-glasses | ... | 0.0000 | 0.100 | 0.00090 | 19.0 | 19.0 | [dried mango, paprika, sugar, salt, citric aci... | [E330 - Citric acid] | 0 | [dried mango, citric acid] | torn & glasses |
| 153 | 0000000045483 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489050424 | 2017-03-09T09:07:04Z | 1489050424 | 2017-03-09T09:07:04Z | milk chocolate pretzels | torn & glasser | torn-glasser | ... | 0.0000 | 0.050 | 0.00180 | 25.0 | 25.0 | [milk chocolate (sugar, cocoa butter, chocolat... | [E101 - Riboflavin, E101i - Riboflavin, E322 -... | 7 | [milk chocolate (sugar, cocoa butter, chocolat... | torn & glasser |
| 200 | 0000020039127 | http://world-en.openfoodfacts.org/product/0000... | usda-ndb-import | 1489138568 | 2017-03-10T09:36:08Z | 1489138568 | 2017-03-10T09:36:08Z | butter croissants | fresh & easy | fresh-easy | ... | 0.1013 | 0.026 | 0.00094 | 18.0 | 18.0 | [wheat flour, butter (cream), water, yeast, su... | [E300 - Ascorbic acid] | 0 | [wheat flour, butter (cream, wheat gluten, asc... | fresh & easy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 355821 | 9556041620369 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489066070 | 2017-03-09T13:27:50Z | 1489066070 | 2017-03-09T13:27:50Z | sardines in spicy tomato sauce, chili and lime | ayam brand | ayam-brand | ... | 0.0000 | 0.357 | 0.00257 | 3.0 | 3.0 | [sardines, water, tomato paste, sugar, dried c... | [E322 - Lecithins, E322i - Lecithin, E415 - Xa... | 0 | [tomato paste, dried chili, thickener (xanthan... | ayam brand |
| 355844 | 9556173386461 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489066836 | 2017-03-09T13:40:36Z | 1489066836 | 2017-03-09T13:40:36Z | chewy candy | fruit plus | fruit-plus | ... | 0.0000 | 0.000 | 0.00000 | 21.0 | 21.0 | [sugar, glucose syrup, vegetable fat (hydrogen... | [E102 - Tartrazine, E330 - Citric acid, E414 -... | 3 | [glucose syrup, vegetable fat (hydrogenated pa... | fruit plus |
| 355859 | 9556390158162 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489069476 | 2017-03-09T14:24:36Z | 1489069476 | 2017-03-09T14:24:36Z | lee, special crackers | lee biscuits (pte.) ltd. | lee-biscuits-pte-ltd | ... | 0.0000 | 0.045 | 0.00082 | 16.0 | 16.0 | [wheat flour, vegetable oil (palm olein), suga... | [E1101 - Protease, E450 - Diphosphates, E471 -... | 0 | [wheat flour, vegetable oil (palm olein, corn ... | lee biscuits (pte.) ltd. |
| 355860 | 9556390178160 | http://world-en.openfoodfacts.org/product/9556... | usda-ndb-import | 1489070026 | 2017-03-09T14:33:46Z | 1489070026 | 2017-03-09T14:33:46Z | sugar crackers | lee biscuits (pte.) ltd. | lee-biscuits-pte-ltd | ... | 0.0000 | 0.000 | 0.00082 | 13.0 | 13.0 | [wheat flour, sugar, vegetable fat (palm base)... | [E450 - Diphosphates, E500 - Sodium carbonates... | 0 | [wheat flour, corn starch, vegetable oil (palm... | lee biscuits (pte.) ltd. |
| 355968 | 9780803738782 | http://world-en.openfoodfacts.org/product/9780... | usda-ndb-import | 1489069944 | 2017-03-09T14:32:24Z | 1489069945 | 2017-03-09T14:32:25Z | organic z bar | clif kid | clif-kid | ... | 0.0583 | 0.556 | 0.00500 | 11.0 | 11.0 | [organic oat blend (organic rolled oats, organ... | [E322 - Lecithins, E322i - Lecithin] | 0 | [] | clif kid |
80946 rows × 47 columns
Cluster Profiling¶
df_profile = df_nutrient.copy()
df_profile['cluster'] = df_pca_kmeans['clusters_label'].values
cluster_means = df_profile.groupby('cluster')[
['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n', 'energy_100g']
].mean().round(2)
display(cluster_means)
| cluster | fat_100g | sugars_100g | salt_100g | additives_n | energy_100g |
|---|---|---|---|---|---|
| 0 | 1.60 | 9.52 | 40.84 | 2.19 | 707.08 |
| 1 | 5.52 | 7.45 | 1.13 | 3.21 | 623.53 |
| 2 | 32.38 | 13.48 | 1.65 | 2.13 | 1867.73 |
| 3 | 12.93 | 33.26 | 1.09 | 3.37 | 1697.14 |
nutrient_features = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n']
energy_feature = ['energy_100g']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,6))
cluster_means.loc[[0,1,2,3]][nutrient_features].T.plot(
kind='bar', ax=ax1
)
ax1.set_title('Cluster Profiles - Macronutrients & Additives')
ax1.set_ylabel('Mean per 100g')
ax1.set_xlabel('Nutritional Feature')
ax1.set_xticklabels(nutrient_features, rotation=45)
ax1.legend(title='Cluster')
cluster_means.loc[[0,1,2,3]][energy_feature].T.plot(
kind='bar', ax=ax2, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#CC0000']
)
ax2.set_title('Cluster Profiles - Energy (kJ per 100g)')
ax2.set_ylabel('Mean Energy')
ax2.set_xlabel('Energy')
ax2.set_xticklabels(['energy_100g'], rotation=0)
ax2.legend(title='Cluster')
plt.tight_layout()
plt.show()
features = ['fat_100g', 'sugars_100g', 'salt_100g', 'additives_n', 'energy_100g']
data = cluster_means.loc[[0,1,2,3]][features]
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=features)
data_scaled['cluster'] = ['Cluster 0', 'Cluster 1', 'Cluster 2', 'Cluster 3']
labels = features
num_vars = len(labels)
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]
fig, ax = plt.subplots(figsize=(8, 6), subplot_kw=dict(polar=True))
for i, row in data_scaled.iterrows():
values = row[features].tolist()
values += values[:1]
ax.plot(angles, values, label=row['cluster'])
ax.fill(angles, values, alpha=0.15)
ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)
ax.set_thetagrids(np.degrees(angles[:-1]), labels)
ax.set_title("Cluster Nutritional Radar Chart", fontsize=14, pad=30)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(2, 2, subplot_kw=dict(polar=True), figsize=(12,10), constrained_layout=True)
axes = axes.flatten()
# Define colors for clusters
colors = ['#4682B4', '#FFA500', '#7ED957', '#B22222']
for i in range(4):
values = data_scaled.iloc[i][features].tolist()
values += values[:1]
ax = axes[i]
ax.plot(angles, values, color=colors[i], linewidth=2)
ax.fill(angles, values, color=colors[i], alpha=0.25)
ax.set_title(f"{data_scaled.iloc[i]['cluster']}", size=13, pad=35)
ax.set_thetagrids(np.degrees(angles[:-1]), features)
ax.set_ylim(0, 1)
fig.suptitle("Nutritional Radar Charts for Each Cluster", fontsize=16, y=1)
# constrained_layout (set in plt.subplots above) already manages spacing,
# so tight_layout / subplots_adjust are not needed here
plt.show()
Here are the results:
Cluster 0: High-Salt, Low-Fat Products
- đ§Ÿ Characteristics:
- Very low fat (1.6g)
- Extremely high salt (41g)
- Moderate sugar (9.5g), additives (2.2), energy (707 kJ)
- đ High-salt preserved foods or savory processed items.
Cluster 1: Low-Energy Foods
- đ§Ÿ Characteristics:
- Moderate fat (5.5g), sugar (7.5g), additives (3.2)
- Low salt (1.1g), lowest energy (624 kJ).
- đ Low-calorie snacks, possibly cereals or health-focused foods.
Cluster 2: High-Fat, High-Energy Foods
- đ§Ÿ Characteristics:
- Extremely high fat (32.4g)
- High sugar (13.5g)
- Low salt (1.6g)
- Lower additives (2.1), highest energy (1868 kJ).
- đ Fat-dense products like butters, nut spreads, cheeses, and creamy desserts.
Cluster 3: High-Sugar, Additive-Heavy Processed Foods
- đ§Ÿ Characteristics:
- Moderate fat (12.9g)
- Very high sugar (33.3g), highest additives (3.4), very high energy (1697 kJ).
- Low salt (1.1g)
- đ Highly processed sweets, sodas, candies, and energy bars.
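To tie these profiles back to healthiness, the clusters can be cross-tabulated against Nutri-Score grades. A minimal sketch, relying on df_nutrient having kept df_final's row index (it was derived from it with dropna), and reusing df_profile from the cluster profiling cell above:
# Share of each Nutri-Score grade within each cluster
grades = df_final.loc[df_profile.index, 'nutrition_grade_fr']
display(pd.crosstab(df_profile['cluster'], grades, normalize='index').round(2))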
4ïžâŁ Prediction Modeling: Predicting Product Healthiness¶
In this section, we develop machine learning models to evaluate how well a product's healthiness, measured by its Nutri-Score / Nutri-grade, can be predicted from its ingredient composition, additive content, and numerical nutritional values.
Our main objective is to determine whether Nutri-Score labels can be inferred directly from structured product data, and which features (e.g., sugar, salt, additives) are most predictive.
4.1 Ingredients + Additives Text Model¶
We explore whether free-text fields like ingredients and additives can effectively predict a product's Nutri-grade, using text-based modeling techniques like TF-IDF and Logistic Regression.
Goal:
Predict Nutri-grade (A-E) and health category (healthy vs unhealthy) using only text-based ingredient and additive information.
Feature Preparation
Combined Tokens: We merge clean_ingredient_tokens with additives_list into a unified list per product.
TF-IDF Encoding: Convert token lists into TF-IDF vectors, using unigrams and bigrams (e.g., "palm", "palm oil").
Targets:
Multiclass: nutrition_grade_fr (A-E)
Binary: score_binary (healthy = A/B, unhealthy = C/D/E)
# Step 1: Combine ingredients and additives into a single token list
df_final['combined_tokens'] = df_final.apply(
lambda row: row['clean_ingredient_tokens'] + row['additives_list']
if isinstance(row['additives_list'], list) else row['clean_ingredient_tokens'],
axis=1
)
# Step 2: Join the token list into a single string for TF-IDF processing
df_final['combined_str'] = df_final['combined_tokens'].apply(lambda tokens: ' '.join(tokens))
# Step 3: Create binary label: 'healthy' (A/B) vs 'unhealthy' (C/D/E)
df_final['score_binary'] = df_final['nutrition_grade_fr'].str.lower().map(
lambda x: 'healthy' if x in ['a', 'b'] else ('unhealthy' if x in ['c', 'd', 'e'] else None)
)
# Step 4: Drop rows with missing values to ensure clean modeling
df_model = df_final.dropna(subset=['combined_str', 'nutrition_grade_fr', 'score_binary'])
4.1.1 Multiclass Logistic Regression (Nutri-Score A to E)¶
Goal: Predict exact Nutri-Score class A to E.
# Step 5: TF-IDF vectorization (unigrams only, top 1000 features)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(df_model['combined_str'])
y = df_model['nutrition_grade_fr'].str.lower()  # Target: multiclass
# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Step 7: Train logistic regression model (with class balancing)
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
# Step 8: Evaluate classification performance
y_pred = clf.predict(X_test)
print("Multiclass Classification Report (Nutri-Score AâE):")
print(classification_report(y_test, y_pred))
# Step 9: Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix â Multiclass Nutri-Score Classification")
plt.show()
Multiclass Classification Report (Nutri-Score AâE):
precision recall f1-score support
a 0.54 0.67 0.60 1717
b 0.49 0.57 0.53 2288
c 0.44 0.45 0.44 3148
d 0.65 0.49 0.56 5264
e 0.66 0.74 0.70 3773
accuracy 0.57 16190
macro avg 0.56 0.58 0.57 16190
weighted avg 0.58 0.57 0.57 16190
4.1.2 Binary Logistic Regression (Healthy vs Unhealthy)¶
Goal: Classify products as "healthy" (Nutri-Score A or B) or "unhealthy" (Nutri-Score C, D, or E) using TF-IDF features derived from cleaned ingredient and additive text.
# Step 5: TF-IDF vectorization (bigrams included, more features)
vectorizer = TfidfVectorizer(max_features=8000, ngram_range=(1,2), stop_words='english')
X_tfidf = vectorizer.fit_transform(df_model['combined_str'])
y = df_model['score_binary'] # Target: binary label
# Step 6: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
# Step 7: Train logistic regression model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
# Step 8: Evaluate classification performance
y_pred = clf.predict(X_test)
print("Binary Classification Report (Healthy vs Unhealthy):")
print(classification_report(y_test, y_pred))
# Step 9: Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix â Binary Nutri-Score Classification")
plt.show()
Binary Classification Report (Healthy vs Unhealthy):
precision recall f1-score support
healthy 0.68 0.88 0.77 4005
unhealthy 0.95 0.87 0.91 12185
accuracy 0.87 16190
macro avg 0.82 0.87 0.84 16190
weighted avg 0.89 0.87 0.87 16190
4.1.3 Binary Random Forest (Healthy vs Unhealthy)¶
Goal: Try non-linear model to capture ingredient interactions.
# Reuse X_train, y_train, X_test, y_test from the binary logistic regression model (Section 4.1.2)
# Step 1: Initialize and train the Random Forest classifier
rf_clf = RandomForestClassifier(
n_estimators=200, # Number of trees in the forest
max_depth=15, # Maximum depth of each tree
class_weight='balanced',# Handle class imbalance
random_state=42 # Ensure reproducibility
)
rf_clf.fit(X_train, y_train)
# Step 2: Make predictions
y_pred_rf = rf_clf.predict(X_test)
# Step 3: Display performance metrics
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
# Step 4: Plot confusion matrix
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_clf.classes_)
sns.heatmap(conf_matrix_rf, annot=True, fmt="d",
xticklabels=rf_clf.classes_,
yticklabels=rf_clf.classes_,
cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Random Forest Confusion Matrix â Healthy vs Unhealthy")
plt.show()
Random Forest Classification Report:
precision recall f1-score support
healthy 0.54 0.88 0.67 4005
unhealthy 0.95 0.75 0.84 12185
accuracy 0.78 16190
macro avg 0.75 0.82 0.75 16190
weighted avg 0.85 0.78 0.80 16190
đ Summary of Key Findings - Section 4.1¶
Text-based features from ingredients and additives carry strong predictive signals.
Binary classification significantly outperforms multiclass.
- Multiclass Logistic Regression (A-E) achieved ~57% accuracy (macro F1 ≈ 0.57).
- Binary Logistic Regression (Healthy vs. Unhealthy) achieved 87% accuracy with strong recall for both classes.
Logistic Regression outperforms Random Forest in interpretability and precision.
- Binary Logistic Regression reached balanced performance (macro F1 = 0.84), with strong precision on both classes.
- Random Forest (binary) reached 78% accuracy, but tended to overpredict "healthy", leading to higher false negatives.
4.2 Ingredient + Brand Model¶
Goal: In addition to ingredients, brand identity may reflect broader product philosophies or quality standards. This model evaluates whether combining ingredients with brand information enhances prediction accuracy for Nutri-grade (healthy vs unhealthy).
from scipy.sparse import hstack
# Step 1: Prepare ingredient-only text
df_model['ingredient_str'] = df_model['clean_ingredient_tokens'].apply(lambda x: ' '.join(x))
# Step 2: TF-IDF vectorization on ingredients
vectorizer = TfidfVectorizer(
max_features=8000,
ngram_range=(1, 2),
stop_words='english',
min_df=3,
max_df=0.9
)
X_tfidf = vectorizer.fit_transform(df_model['ingredient_str'])
# Step 3: Clean and encode brand info
top_brands = df_model['brands'].value_counts().head(50).index
df_model['brand_clean'] = df_model['brands'].apply(lambda x: x if x in top_brands else 'other')
brand_ohe = pd.get_dummies(df_model['brand_clean'], prefix='brand')
# Step 4: Concatenate features
X_final = hstack([X_tfidf, brand_ohe.values])
# Step 5: Define target
y = df_model['score_binary'] # 'healthy' vs 'unhealthy'
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)
# Step 6: Train model
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Step 7: Evaluation
print("Classification Report (Ingredient + Brand):")
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", xticklabels=clf.classes_, yticklabels=clf.classes_, cmap="Greens")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix â Ingredient + Brand Binary Classification")
plt.show()
Classification Report (Ingredient + Brand):
precision recall f1-score support
healthy 0.67 0.88 0.76 4005
unhealthy 0.96 0.86 0.90 12185
accuracy 0.86 16190
macro avg 0.81 0.87 0.83 16190
weighted avg 0.88 0.86 0.87 16190
# Step 1: Use X_final (TF-IDF + brand one-hot) and y from previous section
rf_clf = RandomForestClassifier(
n_estimators=200,
max_depth=15,
class_weight='balanced',
random_state=42
)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
# Evaluation
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf, labels=rf_clf.classes_)
sns.heatmap(conf_matrix_rf, annot=True, fmt="d", xticklabels=rf_clf.classes_, yticklabels=rf_clf.classes_, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Random Forest Confusion Matrix â Ingredient + Brand")
plt.show()
đ Summary of Key Findings - Section 4.2¶
Adding brand information did not significantly improve prediction accuracy (still ~86%).
This suggests that brand identity adds limited value beyond what is already captured in the ingredient list.
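One way to sanity-check this is to look at the weights the logistic model assigned to the brand one-hot columns, which occupy the last positions of X_final after the TF-IDF block. A minimal sketch reusing clf, vectorizer, and brand_ohe from this section (coefficient magnitudes are only roughly comparable across feature types, but near-zero brand weights would support the conclusion):
# Brand coefficients sit after the TF-IDF features in the stacked design matrix
n_text_features = len(vectorizer.get_feature_names_out())
brand_coefs = pd.Series(clf.coef_[0][n_text_features:], index=brand_ohe.columns)
print("Largest-magnitude brand weights:")
print(brand_coefs.reindex(brand_coefs.abs().sort_values(ascending=False).index).head(10))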
4.3 Ingredient-Only Model¶
Goal:
This experiment investigates whether a product's healthiness (as defined by the Nutri-Score system) can be accurately predicted using only its cleaned ingredient list, without relying on additives, brand, or structured nutrition values.
# Step 1: Prepare Ingredient-only Strings
df_model['ingredient_only_str'] = df_model['clean_ingredient_tokens'].apply(lambda x: ' '.join(x))
# Step 2: TF-IDF Vectorization (only on ingredients)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
max_features=8000,
ngram_range=(1, 2),
stop_words='english',
min_df=3,
max_df=0.9
)
X_ing = vectorizer.fit_transform(df_model['ingredient_only_str'])
y = df_model['score_binary'] # Use the binary label: 'healthy' vs 'unhealthy'
# Step 3: Train/Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_ing, y, test_size=0.2, random_state=42)
# Step 4: Train Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Step 5: Evaluation
print("Ingredient-Only Classification Report:")
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred, labels=clf.classes_)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="BuGn",
xticklabels=clf.classes_, yticklabels=clf.classes_)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Ingredient-Only â Binary Nutri-Score Classification")
plt.show()
# Step 6: Show Most Important Tokens
import numpy as np
feature_names = vectorizer.get_feature_names_out()
coefs = clf.coef_[0] # For binary classification
# Top 20 words most indicative of "unhealthy"
print(" Top 20 'unhealthy' indicators:")
for i in np.argsort(coefs)[-20:][::-1]:
print(f"{feature_names[i]:<20} {coefs[i]:.3f}")
# Top 20 words most indicative of "healthy"
print("\n Top 20 'healthy' indicators:")
for i in np.argsort(coefs)[:20]:
print(f"{feature_names[i]:<20} {coefs[i]:.3f}")
Top 20 'unhealthy' indicators:
pepper spice         6.218
syrup seasoning      5.791
cottonseed oils      4.883
acid peanut          4.081
pepper yeast         3.920
vit b12              3.845
crumb wheat          3.745
sulfate ascorbic     3.424
chips bananas        3.420
oils coconut         3.275
powder sorbitan      3.263
citrate tricalcium   3.201
pgpr emulsifier      3.166
color modified       3.155
color contains       3.134
color disodium       3.071
flavor modified      2.994
orange juice         2.938
brownie              2.932
color black          2.915
Top 20 'healthy' indicators:
quartered            -5.559
benzoate sodium      -5.413
water red            -4.978
phosphate color      -4.343
usa                  -3.820
almonds almonds      -3.663
dried apricots       -3.590
whey pasteurized     -3.583
coriander            -3.552
acid ferrous         -3.486
freshness vitamin    -3.474
white tuna           -3.339
phosphate thiamine   -3.277
steamed              -3.206
culture sea          -3.182
benzoic acid         -3.170
cultures reduced     -3.144
tocopherols natural  -3.127
vegetable monoglycerides -3.119
almond butter        -3.073
đ Summary of Key Findings - Section 4.3: Ingredient-Only Model¶
Ingredients alone are highly predictive of product healthiness.
The ingredient-only model achieved 86% accuracy, comparable to models that also included additives or brand, showing that additives offer limited added value and that ingredient composition already captures the key health signals.
Predictive ingredients align with nutritional intuition.
- "Palm oil", "syrup", and "sugar" were strong indicators of unhealthy products.
- "Beans", "lettuce", "fiber", and "water" were strong indicators of healthier items.
Implication:
Even without full nutrition facts or additive disclosures, consumers can make smarter food choices simply by reading the ingredient list. Avoiding a few key red-flag ingredients can reliably steer purchases toward healthier products.
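As a quick usage sketch of that idea, the ingredient-only model from this section can score an arbitrary ingredient string; the sample string below is invented for illustration, and vectorizer and clf are the fitted objects from the cell above:
# Score a made-up ingredient list with the fitted ingredient-only model
sample = "sugar glucose syrup palm oil artificial flavor"
sample_vec = vectorizer.transform([sample])    # reuse the fitted TF-IDF vocabulary
print(clf.predict(sample_vec)[0])              # predicted label, e.g. 'unhealthy'
print(dict(zip(clf.classes_, clf.predict_proba(sample_vec)[0].round(3))))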
4.4 Nutri Feature Importance Analysis (Random Forest)¶
To better understand which nutritional features most strongly influence the prediction of whether a product is healthy or unhealthy, we trained a Random Forest classifier and analyzed the resulting feature importances.
The plot below shows the relative importance of each feature in the classification task.
# Suppose your input data includes structured features like sugars_100g, salt_100g, etc.
# Step 1: Define your feature columns
numeric_features = [
'additives_n',
'ingredients_from_palm_oil_n',
'ingredients_that_may_be_from_palm_oil_n',
'energy_100g',
'fat_100g',
'saturated-fat_100g',
'trans-fat_100g',
'cholesterol_100g',
'carbohydrates_100g',
'sugars_100g',
'fiber_100g',
'proteins_100g',
'salt_100g',
'sodium_100g',
'vitamin-a_100g',
'vitamin-c_100g',
'calcium_100g',
'iron_100g'
]
# Step 2: Prepare X and y (keep only rows with a defined binary label)
mask = df_final['score_binary'].notna()
X = df_final.loc[mask, numeric_features].fillna(0)  # Fill NaNs with 0 or use better imputation
y = df_final.loc[mask, 'score_binary']  # Target: 'healthy' or 'unhealthy'
# Step 3: Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=10, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)
# Step 5: Feature importance
importances = rf.feature_importances_
feature_importance_df = pd.DataFrame({
'feature': numeric_features,
'importance': importances
}).sort_values(by='importance', ascending=False)
# Step 6: Print feature importance
print("Feature Importance:")
print(feature_importance_df)
# Step 7: Plot feature importance
plt.figure(figsize=(10,6))
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.title('Feature Importance - Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
Feature Importance:
feature importance
5 saturated-fat_100g 0.203925
9 sugars_100g 0.182795
4 fat_100g 0.130943
12 salt_100g 0.118942
13 sodium_100g 0.101605
3 energy_100g 0.098835
8 carbohydrates_100g 0.047869
10 fiber_100g 0.036596
11 proteins_100g 0.031020
7 cholesterol_100g 0.013527
17 iron_100g 0.012762
16 calcium_100g 0.008211
15 vitamin-c_100g 0.005458
14 vitamin-a_100g 0.003753
0 additives_n 0.003343
2 ingredients_that_may_be_from_palm_oil_n 0.000210
6 trans-fat_100g 0.000206
1 ingredients_from_palm_oil_n 0.000000
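Impurity-based importances from Random Forests can overstate continuous, high-cardinality features, so permutation importance on the held-out split could serve as a cross-check. A minimal sketch using scikit-learn's inspection module, reusing rf, X_test, and y_test from above:
from sklearn.inspection import permutation_importance
# Mean drop in accuracy when each feature's values are shuffled on the test set
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42, n_jobs=-1)
perm_df = pd.Series(perm.importances_mean, index=numeric_features).sort_values(ascending=False)
print(perm_df.head(10))
If the same features top both rankings, the conclusions below are robust to the choice of importance measure.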
đ Summary of Key Findings - Section 4.4¶
Sugars and Saturated Fat Are the Top Predictors
- sugars_100g and saturated-fat_100g are the two most influential features in predicting whether a product is healthy or unhealthy. This confirms their central role in the Nutri-Score algorithm.
Salt, Total Fat, Sodium, and Energy Also Matter
- These features, especially salt_100g and sodium_100g, rank just behind sugar and saturated fat, reinforcing their impact on health classification.
This analysis directly supports our research question: if brands reduce sugar or salt, will their Nutri-Scores improve?
Yes: because sugar and salt are among the strongest predictors, brands can meaningfully improve health scores by lowering these ingredients.
Implication:¶
Our model highlights that products low in sugar, saturated fat, and salt are far more likely to be classified as healthy. Consumers can use this insight to prioritize items with reduced sugar/salt content, even before checking the official Nutri-Score.
5ïžâŁ Regression Results and Interpretation¶
In this section, we analyze the OLS estimates linking a product's nutritional composition, additive usage, non-linear interactions, and top-brand affiliation to its French Nutrition Score (nutrition-score-fr_100g). We focus on coefficient signs, statistical significance, and overall model diagnostics to understand which factors most strongly drive a food's health rating.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
# 5.1 Continuous predictors (drop sodium_100g due to collinearity)
cont_vars = [
'fat_100g','saturated-fat_100g','trans-fat_100g',
'cholesterol_100g','carbohydrates_100g','sugars_100g','fiber_100g',
'proteins_100g','salt_100g',
'vitamin-a_100g','vitamin-c_100g',
'calcium_100g','iron_100g',
'ingredients_from_palm_oil_n','ingredients_that_may_be_from_palm_oil_n'
]
outcome = 'nutrition-score-fr_100g'
# 5.2 Clean & cast
df_reg = df_final.dropna(subset=cont_vars + ['additives_n', outcome, 'brands']).copy()
for c in cont_vars + ['additives_n', outcome]:
df_reg[c] = pd.to_numeric(df_reg[c], errors='coerce')
df_reg.dropna(subset=cont_vars + ['additives_n', outcome, 'brands'], inplace=True)
# 5.3 Feature engineering
df_reg['log_additives'] = np.log1p(df_reg['additives_n'])
X = df_reg[cont_vars + ['log_additives']].copy()
# non-linear terms
X['sugars_sq'] = X['sugars_100g'] ** 2
X['fat_sq'] = X['fat_100g'] ** 2
# interaction terms
X['sugar_fat'] = X['sugars_100g'] * X['fat_100g']
X['sugar_salt'] = X['sugars_100g'] * X['salt_100g']
X['fat_salt'] = X['fat_100g'] * X['salt_100g']
# 5.4 Brand fixed effects (top 10 + 'other')
top_brands = df_reg['brands'].value_counts().nlargest(10).index
df_reg['brand_top10'] = df_reg['brands'].where(
df_reg['brands'].isin(top_brands), 'other'
)
brand_dummies = pd.get_dummies(df_reg['brand_top10'],
prefix='brand', drop_first=True)
X = pd.concat([X, brand_dummies], axis=1)
y = df_reg[outcome].astype(float)
# 5.5 Standardize continuous regressors
to_scale = cont_vars + ['log_additives','sugars_sq','fat_sq',
'sugar_fat','sugar_salt','fat_salt']
scaler = StandardScaler()
X_scaled = pd.DataFrame(
scaler.fit_transform(X[to_scale]),
columns=to_scale,
index=X.index
)
# rebuild design matrix
X_design = pd.concat([X_scaled, X.drop(columns=to_scale)], axis=1)
X_design = sm.add_constant(X_design)
# 5.6 VIF check (numeric only)
X_vif = X_design.drop(columns=['const']).select_dtypes(include=[np.number])
vif = pd.DataFrame({
'variable': X_vif.columns,
'VIF': [variance_inflation_factor(X_vif.values, i)
for i in range(X_vif.shape[1])]
})
print("Top 10 VIFs:\n", vif.sort_values('VIF', ascending=False).head(10))
# 5.7 Fit OLS
model = sm.OLS(y, X_design.astype(float)).fit()
print(model.summary())
/usr/local/lib/python3.11/dist-packages/statsmodels/regression/linear_model.py:1784: RuntimeWarning: invalid value encountered in scalar divide return 1 - self.ssr/self.uncentered_tss
Top 10 VIFs:
variable VIF
5 sugars_100g 15.340163
0 fat_100g 12.341447
16 sugars_sq 10.255979
17 fat_sq 6.509564
18 sugar_fat 4.042129
1 saturated-fat_100g 3.478562
4 carbohydrates_100g 2.254807
20 fat_salt 2.193900
8 salt_100g 1.436269
7 proteins_100g 1.431740
OLS Regression Results
===================================================================================
Dep. Variable: nutrition-score-fr_100g R-squared: 0.829
Model: OLS Adj. R-squared: 0.829
Method: Least Squares F-statistic: 1.310e+04
Date: Sun, 27 Apr 2025 Prob (F-statistic): 0.00
Time: 03:25:10 Log-Likelihood: -2.1933e+05
No. Observations: 80946 AIC: 4.387e+05
Df Residuals: 80915 BIC: 4.390e+05
Df Model: 30
Covariance Type: nonrobust
===========================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
const 10.4594 0.138 75.540 0.000 10.188 10.731
fat_100g 6.3398 0.045 141.074 0.000 6.252 6.428
saturated-fat_100g 2.9176 0.024 122.380 0.000 2.871 2.964
trans-fat_100g 0.0690 0.013 5.372 0.000 0.044 0.094
cholesterol_100g 0.0041 0.013 0.320 0.749 -0.021 0.029
carbohydrates_100g 1.0818 0.019 56.326 0.000 1.044 1.119
sugars_100g 7.0679 0.050 141.073 0.000 6.970 7.166
fiber_100g -1.7139 0.015 -116.426 0.000 -1.743 -1.685
proteins_100g 0.4292 0.015 28.065 0.000 0.399 0.459
salt_100g 1.1982 0.015 78.219 0.000 1.168 1.228
vitamin-a_100g 0.1131 0.013 8.547 0.000 0.087 0.139
vitamin-c_100g -0.0012 0.013 -0.094 0.925 -0.027 0.024
calcium_100g 0.0039 0.015 0.269 0.788 -0.025 0.033
iron_100g 0.0067 0.013 0.524 0.600 -0.018 0.032
ingredients_from_palm_oil_n 1.861e-15 1.08e-16 17.158 0.000 1.65e-15 2.07e-15
ingredients_that_may_be_from_palm_oil_n -0.0596 0.013 -4.471 0.000 -0.086 -0.033
log_additives 0.3502 0.014 25.118 0.000 0.323 0.378
sugars_sq -3.0089 0.041 -73.492 0.000 -3.089 -2.929
fat_sq -3.0546 0.033 -93.592 0.000 -3.119 -2.991
sugar_fat -1.7688 0.026 -68.835 0.000 -1.819 -1.718
sugar_salt 0.3834 0.015 25.229 0.000 0.354 0.413
fat_salt 1.2758 0.019 67.397 0.000 1.239 1.313
brand_food club 1.4857 0.217 6.834 0.000 1.060 1.912
brand_great value 1.0948 0.190 5.762 0.000 0.722 1.467
brand_kroger 0.6213 0.181 3.434 0.001 0.267 0.976
brand_meijer 0.7338 0.180 4.066 0.000 0.380 1.088
brand_other 0.4821 0.139 3.466 0.001 0.209 0.755
brand_roundy's 0.5039 0.198 2.543 0.011 0.116 0.892
brand_shoprite -0.4257 0.219 -1.941 0.052 -0.856 0.004
brand_spartan 0.8256 0.189 4.362 0.000 0.455 1.197
brand_target stores 0.8649 0.216 4.002 0.000 0.441 1.288
brand_weis 0.5189 0.203 2.561 0.010 0.122 0.916
==============================================================================
Omnibus: 12667.619 Durbin-Watson: 1.142
Prob(Omnibus): 0.000 Jarque-Bera (JB): 205474.026
Skew: -0.207 Prob(JB): 0.00
Kurtosis: 10.794 Cond. No. 1.35e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.53e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
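Note [1] above flags that the reported standard errors assume a correctly specified error covariance; one cheap robustness check would be to re-fit with heteroskedasticity-robust (HC3) errors. A minimal sketch, reusing y and X_design from above:
# Same OLS fit, but with heteroskedasticity-robust (HC3) standard errors
model_hc3 = sm.OLS(y, X_design.astype(float)).fit(cov_type='HC3')
print(model_hc3.summary().tables[1])  # coefficient table with robust SEs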
Summary of Key Findings¶
Overall Fit:
- R² = 0.829, Adj. R² = 0.829 (F = 1.31 × 10⁴, p < 0.001)
- The model explains about 83% of the variance in nutrition scores, although the condition-number warning indicates a near-singular design matrix, so individual coefficients should be read with some care
Primary Macronutrient Effects (standardized coefficients; a higher score means less healthy):
- Sugar is the strongest score-worsening driver (coef = +7.068, p < 0.001)
- Total fat (+6.340, p < 0.001) and saturated fat (+2.918, p < 0.001) both worsen the score
- Fiber has the largest health-improving effect (coef = −1.714, p < 0.001)
- Carbohydrates modestly worsen the score (+1.082, p < 0.001)
Non-Linear & Interaction Dynamics:
- Diminishing Returns: sugars² (−3.009, p < 0.001) and fat² (−3.055, p < 0.001) show that incremental sugar/fat matters less at high levels; see the marginal-effect expression after the interaction list below
Interactions:
- sugar×salt (+0.383, p < 0.001) and fat×salt (+1.276, p < 0.001) amplify the negative health impact
- sugar×fat (−1.769, p < 0.001) slightly offsets the combined sugar/fat harm
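To make the diminishing-returns claim concrete: in raw-feature terms (setting aside the per-column standardization), the fitted functional form implies a marginal effect of sugar on the score of

$$\frac{\partial\,\text{score}}{\partial\,\text{sugars}} = \beta_{\text{sugars}} + 2\,\beta_{\text{sugars}^2}\,\text{sugars} + \beta_{\text{sugar}\times\text{fat}}\,\text{fat} + \beta_{\text{sugar}\times\text{salt}}\,\text{salt}$$

With $\beta_{\text{sugars}^2} < 0$, each additional gram of sugar worsens the score by less at already-high sugar levels, which is exactly the pattern the squared terms capture.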
Additives & Palm-Oil Flags:
- Log-additives increases score (0.353, p < 0.001) but with diminishing marginal impact
- âMay-beâ palm-oil flag lowers score (â0.057, p < 0.001); âfromâpalmâ flag not significant
Micronutrients:
- Vitamin A small positive effect (+0.113, p < 0.001)
- Vitamin C, calcium, iron all non-significant (p > 0.05)
Brand Fixed Effects:
- âFood Clubâ (+1.500, p < 0.001), âGreat Valueâ (+0.999, p < 0.001), âKrogerâ (+0.636, p < 0.001) indicate systematic brand differences
- âShopriteâ marginally lower (â0.408, p â 0.06)
Implications for Product Reformulation
- Targeting sugar, fat, and salt reductions will yield the highest score improvements, but non-linear effects mean large cuts have diminishing returns.
- Boosting fiber remains an effective way to improve health ratings.
- Additive count matters, but its marginal impact tapers; focus on removing the most harmful additives first.
- Brand-level practices significantly shift scores; comparing peers can reveal best-practice benchmarks.
đȘ Challenges¶
Throughout the project, we faced several challenges that influenced our analysis:
Large Initial Dataset and High Dimensionality:
At the beginning, the dataset had over 100 columns and a huge number of rows, many of which were either irrelevant or severely incomplete. Handling such a wide and messy dataset required extensive feature selection and missing value filtering before any meaningful analysis could begin.
Exploratory Data Analysis (EDA) Overload:
Due to the high dimensionality of the data, EDA itself became a major challenge. A large number of features had to be carefully visualized, plotted, and analyzed to understand their distributions, relationships, and usefulness before any modeling work could be performed. Dataset exploration took much more effort than initially expected.
Ingredient Text Cleaning Complexity:
Ingredient lists were highly unstructured, varying by brand, language (English, French, German), formatting, and punctuation. Designing robust cleaning rules that generalized across these variations was a major challenge and required many trial-and-error iterations.
Difficulty in Identifying Clustering Features:
Clustering products was more difficult than initially expected. Given the wide variety of packaged foods and the high number of features, it was not obvious which dimensions were most critical for meaningful clusters. Without proper feature reduction, clusters risked becoming blurred and uninterpretable.
đ Next Steps¶
Building on our findings, there are several promising directions for future improvement and research:
Incorporate Additive Risk Scoring:
Develop a supplementary risk score based on the type and number of additives (e.g., using EFSA or FDA categorizations) to enhance the predictive power beyond Nutri-Score labels.
Enhance Text Modeling Approaches:
Explore advanced NLP techniques (e.g., Word2Vec embeddings, BERT fine-tuning) to better capture subtle relationships between ingredient wording and product healthiness.
Experiment with Ensemble Models:
Combine predictions from structured features (nutrition facts) and unstructured text (ingredients) using ensemble learning to maximize model robustness.
Cluster Interpretation Enhancement:
Apply deeper interpretation methods like SHAP values to cluster profiles, to better explain what features drive each cluster's distinct nutritional signature.
Interactive Consumer Tool Prototype:
Build a simple web-based tool that allows users to input an ingredient list and receive a predicted Nutri-Score or healthiness rating, empowering consumers to make healthier food choices.
⚠Final Unwrapping¶
At the outset of this project, we set out to "unwrap the secrets" behind the nutritional makeup of U.S. packaged foods.
What began as a messy ocean of ingredient lists, additives, and nutrition grades gradually took shape through careful cleaning, exploration, modeling, and critical interpretation.
Along the way, we uncovered striking patterns:
- Sugar and fat remain the dominant drivers of Nutri-Score health ratings.
- Additives, while often overlooked, quietly shape product profiles and consumer perceptions.
- Brand practices leave a distinct fingerprint on nutritional quality, revealing hidden structures beneath the labels.
Yet, our analysis also highlighted the blind spots:
- Nutri-Scores, while powerful, don't fully capture the complexity of food processing and additive risks.
- Ingredient lists, with their chaotic variability, pose an ongoing challenge for clean modeling.
Looking forward, the journey doesn't end here.
Our findings open promising paths: smarter risk scoring for additives, enhanced text-based modeling, and interactive consumer tools to bridge the gap between data and healthier choices.
In a world awash with packaged options, peeling back the layers of nutrition is more critical, and more possible, than ever.
