L'Oréal
BotiqueAI
🎨 L'Oréal · ☁️ GCP · NLP · Embeddings · Clustering
Case Study

Consumer Loop

Global review intelligence — unified sentiment across every site, language and product variant.

A multi-pipeline ML system that scrapes product reviews from e-commerce sites in multiple countries, harmonises product names using embeddings and clustering, and delivers structured sentiment signals at scale on Google Cloud.

Multi-country
coverage
FR · UK · US · DE · IT …
3
Vertex AI pipelines
Data · Training · Inference
XLM-R
sentiment model
Fine-tuned multilingual transformer
Bigtable
reviews store
Low-latency at scale
Client
L'Oréal
Global beauty leader
Project
Consumer Loop
Review intelligence
Cloud
GCP
Google Cloud Platform
Storage
Bigtable
Low-latency reviews store
Pipelines
3 × Vertex AI
Data · Training · Inference
Scope
Multi-country
Multiple languages & sites
The Challenge

Same product.
Dozens of names, sites and languages.

L'Oréal sells the same product under slightly different names across Sephora, Amazon, Douglas, Boots and many other retailers, in different countries and at different volumes. No two sites describe a product identically.

Without a unified product identity, aggregating consumer sentiment across sites was impossible. Reviews were fragmented, untapped and not actionable at scale.

Fragmented product identities
"Elvive Total Repair 5" on Sephora.fr becomes "Total Repair Shampoo" on Amazon.co.uk and "Elvive Reparatur Shampoo" on Douglas.de.
Multi-language review noise
Reviews arrive in French, English, German, Italian, Spanish and more — each with different slang, idioms and beauty vocabulary.
No unified data infrastructure
Review data was siloed per site, per country — no single pipeline to collect, normalise and analyse at scale.
Volume and velocity
Thousands of new reviews per day across hundreds of product variants — manual analysis was not feasible.
The Solution

Three coordinated
Vertex AI pipelines.

Consumer Loop is built as three independent but interconnected ML pipelines on Vertex AI, each handling a distinct stage: data collection, model training and live inference. All share a central Bigtable store.

The key innovation is the product harmonisation layer: multilingual embeddings + clustering that links product variants across sites before any sentiment analysis is run.

Data Pipeline
Scrapes reviews from all target sites and countries. Normalises raw text, detects language and stores the result in Bigtable.
Training Pipeline
Reads from Bigtable, generates embeddings, clusters products by name similarity, fine-tunes XLM-RoBERTa for beauty sentiment and registers the model.
Inference Pipeline
Scheduled batch job that matches new reviews to product clusters, runs sentiment inference and writes results to BigQuery for reporting.
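The matching step inside the inference pipeline can be sketched as nearest-centroid assignment by cosine similarity. The centroid vectors, cluster ids and `match_to_cluster` helper below are illustrative stand-ins; the real pipeline reads centroids produced by the training pipeline and writes results to BigQuery rather than returning them.

```python
import numpy as np

def match_to_cluster(review_embedding: np.ndarray, centroids: dict) -> str:
    """Return the id of the closest product cluster by cosine similarity."""
    ids = list(centroids)
    mat = np.stack([centroids[i] for i in ids])
    sims = mat @ review_embedding / (
        np.linalg.norm(mat, axis=1) * np.linalg.norm(review_embedding)
    )
    return ids[int(np.argmax(sims))]

# Toy centroids standing in for real product-name embedding means.
centroids = {
    "ELVIVE_REPAIR_5_SHAMPOO": np.array([0.9, 0.1, 0.0]),
    "REVITALIFT_DAY_CREAM":    np.array([0.0, 0.2, 0.9]),
}
review_vec = np.array([0.8, 0.2, 0.1])  # hypothetical review embedding
cluster_id = match_to_cluster(review_vec, centroids)
```

Because both centroids and review vectors live in the same embedding space, a single dot-product pass per batch is enough; no per-site matching rules are needed.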
3
Vertex pipelines
Data · Training · Inference
XLM-R
sentiment model
100+ languages natively
cosine
product similarity
Embedding-based matching
live
BigQuery output
Query-ready sentiment data
Architecture

3-pipeline
diagram

Dashed edges are cross-pipeline data flows. Bigtable is the shared store: it feeds both the training and inference pipelines. The trained model artifact flows from the registry into batch inference.

① Data Pipeline
② Training Pipeline
③ Inference Pipeline
Cross-pipeline flow

Solid edges = pipeline flow · Dashed edges = cross-pipeline data sharing

The Algorithm

From fragmented reviews
to unified sentiment signals

The core challenge is not sentiment analysis — it's knowing that two reviews from different sites are talking about the same product. Consumer Loop solves this with a two-stage approach: product harmonisation via embeddings, then multilingual sentiment classification.

1
Product harmonisation — embeddings + clustering

Product names from different sites are encoded into a shared vector space using a multilingual embedding model. Cosine similarity + clustering groups product variants that represent the same item, regardless of language or naming convention.

Sephora.fr · FR
"Elvive Total Repair 5 Shampooing"
Amazon.co.uk · EN
"Total Repair 5 Restoring Shampoo"
Douglas.de · DE
"Elvive Reparatur Shampoo 5"
All 3 resolve to the same product cluster · cluster_id: ELVIVE_REPAIR_5_SHAMPOO
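The grouping above can be sketched with pairwise cosine similarity over text vectors. Character n-gram TF-IDF stands in here for the multilingual-e5-large embeddings used in the real pipeline, and the 0.2 similarity threshold is illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "Elvive Total Repair 5 Shampooing",   # Sephora.fr
    "Total Repair 5 Restoring Shampoo",   # Amazon.co.uk
    "Elvive Reparatur Shampoo 5",         # Douglas.de
    "Revitalift Laser X3 Day Cream",      # unrelated product
]

# Character n-grams tolerate cross-language spelling variants
# ("Repair" / "Reparatur", "Shampooing" / "Shampoo").
vecs = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)).fit_transform(names)
sims = cosine_similarity(vecs)

# Variants of the first product: everything above the similarity threshold.
same_product = [n for n, s in zip(names, sims[0]) if s > 0.2]
```

With real multilingual embeddings the same thresholding carries over to names that share no surface characters at all, which character n-grams cannot handle.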
2
Multilingual sentiment — fine-tuned XLM-RoBERTa

A base XLM-RoBERTa model is fine-tuned on annotated beauty product reviews. The model classifies sentiment at the review level (positive / negative / neutral) and extracts aspect-level signals — texture, scent, effectiveness, packaging.

Positive
"Leaves hair silky smooth, love the scent"
Negative
"Dried out my hair after 2 uses, disappointed"
Neutral
"Good product, same as the previous version"
🏷️
Aspect tags
texture · scent · effectiveness · packaging
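A minimal sketch of the aspect-tag output shape, assuming a hypothetical keyword lexicon; the production system derives aspects from the fine-tuned XLM-R model, not a fixed word list:

```python
# Hypothetical keyword lexicon, for illustration only.
ASPECT_LEXICON = {
    "texture": {"silky", "smooth", "greasy", "sticky"},
    "scent": {"scent", "smell", "fragrance", "perfume"},
    "effectiveness": {"repair", "dried", "hydrating", "results"},
    "packaging": {"bottle", "pump", "cap", "packaging"},
}

def tag_aspects(review: str) -> list[str]:
    """Return the aspect tags whose keywords appear in the review."""
    words = set(review.lower().replace(",", " ").split())
    return sorted(a for a, kw in ASPECT_LEXICON.items() if words & kw)

tags = tag_aspects("Leaves hair silky smooth, love the scent")
```

Each review thus carries both a sentiment label and zero or more aspect tags, which is what makes per-aspect sentiment roll-ups possible downstream.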
3
Clustering strategies

Two levels of clustering are used in the pipeline — one for product identity resolution, one for review theme discovery.

🧩
K-Means — product name clustering

Product title embeddings are clustered with K-Means to group variants of the same product. The number of clusters K is estimated per product line. Silhouette score is used to validate the grouping quality.

multilingual-e5-large embeddings
cosine distance metric
silhouette coefficient tuning
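Silhouette-based selection of K can be sketched as follows. The random vectors stand in for product-title embeddings, and L2-normalising them makes Euclidean K-Means approximate cosine-distance clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Synthetic stand-ins for title embeddings: three tight groups in 8-D.
emb = np.vstack([
    rng.normal(loc=c, scale=0.05, size=(20, 8))
    for c in (np.eye(8)[0], np.eye(8)[3], np.eye(8)[6])
])
emb = normalize(emb)  # unit vectors: Euclidean distance tracks cosine

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    scores[k] = silhouette_score(emb, labels)
best_k = max(scores, key=scores.get)
```

Scanning a small K range per product line keeps the search cheap while still letting the silhouette coefficient flag over- and under-segmented groupings.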
💬
DBSCAN — review theme discovery

Review text embeddings are clustered with DBSCAN to surface recurring themes within a product cluster — without needing to pre-define the number of topics. Noise points (outlier reviews) are filtered out automatically.

Sentence-BERT review embeddings
DBSCAN epsilon auto-tuning
BERTopic for topic labelling
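A minimal sketch of the theme-discovery step on synthetic review embeddings; the `eps` value is hand-picked here, whereas the pipeline auto-tunes it:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.default_rng(1)
# Synthetic review embeddings: two dense themes plus scattered outliers.
theme_a = rng.normal([1, 0, 0], 0.03, size=(15, 3))
theme_b = rng.normal([0, 1, 0], 0.03, size=(15, 3))
noise = rng.uniform(-1, 1, size=(5, 3))
emb = normalize(np.vstack([theme_a, theme_b, noise]))

# Cosine metric matches the embedding geometry; label -1 marks noise.
labels = DBSCAN(eps=0.2, min_samples=5, metric="cosine").fit_predict(emb)
n_themes = len(set(labels) - {-1})
```

Unlike K-Means, no topic count is specified up front, and the outlier reviews simply stay labelled `-1` instead of polluting a theme.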
Tech Stack

GCP-native
ML stack

Everything runs on Google Cloud. Vertex AI handles pipeline orchestration, model training and serving. Bigtable provides the low-latency backbone. BigQuery exposes outputs to downstream analytics.

Cloud & Compute
⚙️
Vertex AI Pipelines
KFP-based orchestration for all 3 pipelines on GCP
📦
Vertex AI Model Registry
Versioned model storage and deployment management
Storage
🗄️
Cloud Bigtable
Low-latency key-value store for raw and processed reviews
📊
BigQuery
Aggregated sentiment outputs — query-ready analytics tables
ML & NLP
🧬
multilingual-e5-large
State-of-the-art multilingual sentence embeddings (HuggingFace)
🤖
XLM-RoBERTa
Fine-tuned for beauty-domain multilingual sentiment analysis
Clustering
🧩
Scikit-learn K-Means
Product name harmonisation across sites and languages
💬
DBSCAN + BERTopic
Review theme discovery without predefined topic count
References

Further reading

Vertex AI Pipelines
Kubeflow-based ML pipeline orchestration on GCP
Cloud Bigtable
Fully managed NoSQL for large-scale low-latency workloads
multilingual-e5-large
State-of-the-art multilingual sentence embeddings — HuggingFace
XLM-RoBERTa
Unsupervised cross-lingual representation learning at scale
BERTopic
Leveraging BERT for topic modelling in NLP — GitHub
Scikit-learn — DBSCAN
Density-based spatial clustering for noise-robust grouping
Sentence Transformers
Producing semantically meaningful sentence embeddings (SBERT)
BigQuery ML
Running ML models directly in BigQuery — Google Cloud
BotiqueAI
Custom AI and MLOps for enterprise clients