Global review intelligence — unified sentiment across every site, language and product variant.
A multi-pipeline ML system that scrapes product reviews from e-commerce sites in multiple countries, harmonises product names using embeddings and clustering, and delivers structured sentiment signals at scale on Google Cloud.
L'Oréal sells the same product under slightly different names across Sephora, Amazon, Douglas, Boots and many other retailers — in different countries and volumes. No two sites describe a product identically.
Without a unified product identity, aggregating consumer sentiment across sites was impossible. Reviews were fragmented, untapped and not actionable at scale.
Consumer Loop is built as three independent but interconnected ML pipelines on Vertex AI — each handling a distinct stage: data collection, model training and live inference. All three share a central Bigtable store.
The key innovation is the product harmonisation layer: multilingual embeddings + clustering that links product variants across sites before any sentiment analysis is run.
Dashed edges are cross-pipeline data flows. Bigtable is the shared store — it feeds both the training and inference pipelines. The trained model artifact flows from the model registry into batch inference.
Solid edges = pipeline flow · Dashed edges = cross-pipeline data sharing
The core challenge is not sentiment analysis — it's knowing that two reviews from different sites are talking about the same product. Consumer Loop solves this with a two-stage approach: product harmonisation via embeddings, then multilingual sentiment classification.
Product names from different sites are encoded into a shared vector space using a multilingual embedding model. Cosine similarity + clustering groups product variants that represent the same item, regardless of language or naming convention.
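The grouping step can be sketched as similarity-thresholded union-find over embedding vectors. This is an illustrative sketch only: the toy 3-d vectors stand in for real multilingual embedding model output, and the `harmonise` helper and the 0.8 threshold are assumptions, not the production code.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def harmonise(names, vecs, threshold=0.8):
    """Group product names whose embedding cosine similarity exceeds the threshold."""
    parent = list(range(len(names)))

    def find(i):                      # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

# Toy vectors standing in for real multilingual embeddings
names = ["Revitalift Laser X3 30 ml", "Sérum Revitalift Laser X3", "True Match Foundation"]
vecs = [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0]), np.array([0.0, 1.0, 0.0])]
groups = harmonise(names, vecs)
```

The two Revitalift variants land in one group because their vectors are nearly parallel; the foundation stays in its own group. In production, the threshold would be tuned per product line rather than fixed.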
A base XLM-RoBERTa model is fine-tuned on annotated beauty product reviews. The model classifies sentiment at the review level (positive / negative / neutral) and extracts aspect-level signals — texture, scent, effectiveness, packaging.
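The shape of the structured output can be sketched as post-processing over the model's logits. The head layout, label order and `decode` helper below are assumptions for illustration — not the actual fine-tuned XLM-RoBERTa heads.

```python
import numpy as np

LABELS = ["negative", "neutral", "positive"]
ASPECTS = ["texture", "scent", "effectiveness", "packaging"]

def decode(review_logits, aspect_logits):
    """Turn raw logits into a structured sentiment signal.

    review_logits: shape (3,)    - overall review sentiment
    aspect_logits: shape (4, 3)  - one sentiment distribution per aspect
    """
    overall = LABELS[int(np.argmax(review_logits))]
    aspects = {a: LABELS[int(np.argmax(row))]
               for a, row in zip(ASPECTS, aspect_logits)}
    return {"overall": overall, "aspects": aspects}

signal = decode(
    np.array([0.1, 0.2, 2.5]),                   # strongly positive overall
    np.array([[0.1, 0.1, 2.0],                   # texture: positive
              [2.0, 0.1, 0.1],                   # scent: negative
              [0.1, 0.1, 2.0],                   # effectiveness: positive
              [0.1, 2.0, 0.1]]),                 # packaging: neutral
)
```

A single review can thus be positive overall while flagging a negative scent signal — exactly the aspect-level granularity the downstream dashboards need.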
Two levels of clustering are used in the pipeline — one for product identity resolution, one for review theme discovery.
Product title embeddings are clustered with K-Means to group variants of the same product. The number of clusters K is estimated per product line. Silhouette score is used to validate the grouping quality.
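A minimal sketch of the K estimation loop, assuming scikit-learn is available; the 2-d toy points stand in for real product-title embeddings, and the candidate range of K values is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy stand-ins for product-title embeddings: two tight groups of variants
emb = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.05, size=(10, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.05, size=(10, 2)),
])

best_k, best_score = None, -1.0
for k in range(2, 6):                      # candidate cluster counts for this product line
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    score = silhouette_score(emb, labels)  # higher = cleaner separation
    if score > best_score:
        best_k, best_score = k, score
```

With two well-separated groups, the silhouette score peaks sharply at K = 2; splitting a tight group further only lowers it.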
Review text embeddings are clustered with DBSCAN to surface recurring themes within a product cluster — without needing to pre-define the number of topics. Noise points (outlier reviews) are filtered out automatically.
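The theme-discovery step can be sketched with scikit-learn's DBSCAN; the 2-d toy points stand in for real review-text embeddings, and the `eps` and `min_samples` values are illustrative, not tuned production settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Toy review-embedding space: two dense themes plus scattered outlier reviews
themes = np.vstack([
    rng.normal([0.0, 0.0], 0.1, (15, 2)),
    rng.normal([3.0, 3.0], 0.1, (15, 2)),
])
outliers = np.array([[-8.0, 7.0], [6.0, -9.0], [9.0, 9.0], [-7.0, -6.0]])
X = np.vstack([themes, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_themes = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
kept = X[labels != -1]                                     # outlier reviews dropped automatically
```

Unlike K-Means, no topic count is supplied up front: DBSCAN finds two dense themes on its own and labels the four isolated points as noise (-1), which is exactly the automatic outlier filtering described above.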
Everything runs on Google Cloud. Vertex AI handles pipeline orchestration, model training and serving. Bigtable provides the low-latency backbone. BigQuery exposes outputs to downstream analytics.