The LLM Tagging Paradox: Why Multi-Label Classification is Harder Than It Looks

Bad tags are a silent killer. In a recommendation engine, a search index, or a content pipeline, a wrong tag sends users to the wrong place, surfaces irrelevant results, and erodes trust over time. The business cost is real, and it compounds. So when LLMs promised to automate tagging at scale, it felt like a solved problem. It isn't.

This challenge goes well beyond restaurants. Automated categorization of product catalogs, support ticket routing, contract document classification, and industrial data annotation all run into the exact same obstacles. Zero-shot multi-label classification with an LLM is one of the most common NLP use cases in enterprise, and one of the least understood.

If you've ever tried using an LLM to categorize data, you probably thought it would be simple. Give the AI a piece of text, provide a list of tags, ask it to pick the relevant ones. Done.

Recently, I had to build a system to classify restaurants using multiple cuisine tags simultaneously: French, Italian, Greek, and so on. What seemed like a straightforward prompt quickly turned into a fascinating lesson in AI psychology.

Here's what happens when you ask an LLM to multi-task, and why it swings from forcing wrong answers to refusing to answer at all.

Phase 1: The AI Forces a Fit (The "Least Bad" Bias)

In my first iteration, I gave the AI a restaurant description and a list of 7 cuisine tags, asking it to apply all that fit.
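That first prompt looked roughly like this. The sketch below is illustrative, not the exact production prompt; the tag list and wording are stand-ins:

```python
# Illustrative tag list for the first, naive iteration.
CUISINE_TAGS = ["French", "Italian", "Greek", "Turkish",
                "Lebanese", "Mediterranean", "Spanish"]

def build_naive_prompt(description: str) -> str:
    """Build a zero-shot multi-label prompt with no escape hatch.

    Note what is missing: no 'None of the above' option, no per-tag
    definitions, no threshold for inclusion.
    """
    return (
        "Given the restaurant description below, apply ALL cuisine tags that fit.\n"
        f"Tags: {', '.join(CUISINE_TAGS)}\n\n"
        f"Description: {description}\n"
        "Answer with a comma-separated list of tags."
    )
```

Nothing in this prompt tells the model that "none" is a legal answer, which is exactly what triggers the failure mode below.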

The immediate issue? The AI suffered from what I call the "Least Bad" Bias. If a restaurant is a modern American burger joint with absolutely no relation to the tags provided, the AI would still try to force a fit. It might tag it as "British" because they serve chips, or "German" because there's a sausage on the menu.

Why does this happen? LLMs are inherently completion engines designed to please.[¹,²] When you hand an AI a multiple-choice list, its default assumption is that the answer must be in the list.[²] Unless you explicitly build an "escape hatch," the AI will look at your tags and think:

"None of these are great, but 'Italian' is the least wrong. I'll pick that."

Phase 2: The AI Freezes Up (The Over-Strictness Problem)

To fix the hallucinated tags, I updated the prompt. I made it strict. I added instructions like: "ONLY apply a tag if you are 100% certain. If none apply, return NOTHING."

It worked... a little too well. Suddenly, the AI became incredibly stingy with its tags.[³] This failure mode is not unique to food data: it is documented in industrial classification settings — categorizing spare parts, quality defects, or sensor signals — where label precision is critical and zero-shot LLMs fall into the same over-strictness trap. In fact, it became much stricter than when I asked it to evaluate tags one by one in isolated prompts (e.g., "Is this French? Yes/No").

A restaurant clearly serving Mediterranean food was left completely untagged.

The Root Cause: Overthinking Shared Traits

When you ask an AI to evaluate one tag at a time, it works in a vacuum. It looks at a menu with feta cheese and olives and confidently says, "Yes, this is Greek."

But when you show the AI all the tags at once (Greek, Turkish, Lebanese, Mediterranean), it starts overthinking the boundaries between them.[⁴]

It sees the ingredients and hesitates:

"Wait, tzatziki is Greek, but a similar recipe is also used in Lebanese salads. I also see olive oil, which can be found in Italian dishes. Plus, there is no feta, which is common in Greek cuisine."

By seeing all the options side-by-side, the AI realizes that features are shared across categories. Since it can't definitively isolate the cuisine to just one without overlapping the others, and the prompt told it to be strict, it chooses paralysis: it selects nothing.

How to Fix It: 4 Rules for Multi-Label Prompting

If you're building an automated tagging system, here's how you balance the scale so the AI is accurate but not paralyzed.

1. Provide an Explicit "None" Category

Don't just tell the AI to return nothing. Actually include tags like No_Match or Other. This gives the LLM a definitive "bucket" to put the data in when it doesn't fit, satisfying its urge to answer without forcing a bad match.
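In code, that means the opt-out buckets are real entries in the tag list, and the parser treats them as "return nothing." A minimal sketch (tag names like `No_Match` are conventions I chose, not a standard):

```python
def with_escape_hatch(tags):
    # Append explicit opt-out buckets so the model always has a valid answer.
    return list(tags) + ["No_Match", "Other"]

def parse_tags(response, allowed):
    # Keep only known tags; an explicit No_Match means "return nothing".
    picked = [t.strip() for t in response.split(",") if t.strip() in allowed]
    return [] if "No_Match" in picked else picked
```

Filtering against the allowed list also discards any hallucinated tags the model invents on its own.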

2. Define the Threshold for Inclusion


Instead of just saying "be strict," provide clear definitions for each tag and explain what constitutes a match.[⁵]

"Apply the 'Italian' tag if the restaurant's primary identity, name, or main dishes are Italian. Do not apply it just because they serve a single pasta side dish."
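One way to keep these thresholds maintainable is to store them as data and render them into the prompt. The rule wording below is illustrative:

```python
# Hypothetical per-tag inclusion rules; the wording is illustrative.
TAG_DEFINITIONS = {
    "Italian": (
        "Apply only if the restaurant's primary identity, name, or main "
        "dishes are Italian. Do NOT apply for a single pasta side dish."
    ),
    "Greek": (
        "Apply only if core Greek dishes (e.g. gyros, moussaka) dominate "
        "the menu, not merely because feta or olives appear."
    ),
}

def definitions_block(defs: dict) -> str:
    """Render the inclusion thresholds as a bullet list for the prompt."""
    return "\n".join(f"- {tag}: {rule}" for tag, rule in defs.items())
```

Keeping the rules in one dictionary means you can tighten a single tag's threshold without rewriting the whole prompt.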

3. Use Chain-of-Thought (CoT)

Force the AI to explain its reasoning before outputting the final tags.[⁶] Ask it to list evidence for and against a tag before labeling it Yes/No.[⁸] When the AI has to write:

"This menu features tacos and margaritas. Therefore, Mexican."

...it anchors its final decision to logic rather than guesswork.
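To make the reasoning usable downstream, ask for free-form evidence first and a machine-readable verdict last. A sketch, assuming you instruct the model to put its JSON on the final line (the instruction wording and JSON shape are my choices, not a standard):

```python
import json

# Instruction appended to the classification prompt: reason first,
# then emit a JSON verdict on the last line.
COT_SUFFIX = (
    "For each candidate tag, first list evidence FOR and AGAINST it. "
    "Then answer with JSON on the last line: "
    '{"reasoning": "...", "tags": ["..."]}'
)

def extract_tags(response: str) -> list:
    """Pull the final tag list from the JSON on the model's last line."""
    last_line = response.strip().splitlines()[-1]
    return json.loads(last_line).get("tags", [])
```

The reasoning text is discarded by the parser but still does its job: the model commits to evidence before it commits to labels.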

4. Group and Conquer

If you have 50 tags, don't throw them all at the AI in one flat list. Group them.[⁷,⁹,¹⁰] First ask:

"Is this Asian, European, or Latin American?"

Then, in a follow-up prompt once the first answer is in:

"You classified this as European. Is it French, Italian, or Spanish?"

This dramatically reduces the surface area of ambiguity at each step.
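The two steps above can be sketched as follows. The taxonomy is a hypothetical example, and `ask` stands in for whatever LLM call you use:

```python
# Hypothetical two-level taxonomy; `ask(prompt) -> str` is any LLM call.
HIERARCHY = {
    "European": ["French", "Italian", "Spanish"],
    "Asian": ["Japanese", "Thai", "Indian"],
    "Latin American": ["Mexican", "Peruvian", "Brazilian"],
}

def classify_hierarchical(description, ask):
    """Step 1: pick a region (or None). Step 2: pick cuisines within it."""
    top = ask(
        "Is this restaurant Asian, European, or Latin American? "
        f"Answer with the region name, or 'None'.\n{description}"
    ).strip()
    if top not in HIERARCHY:
        return []  # escape hatch: no forced fit at the top level
    subtags = HIERARCHY[top]
    second = ask(
        f"You classified this as {top}. Which of these apply: "
        f"{', '.join(subtags)}? Comma-separated list, or 'None'.\n{description}"
    )
    return [t.strip() for t in second.split(",") if t.strip() in subtags]
```

Each call now sees at most a handful of sibling tags, so the shared-trait overthinking from Phase 2 has far less room to operate. The trade-off, noted in the sources below, is that a wrong top-level pick poisons everything downstream.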

The Takeaway

Multi-label classification with LLMs is a balancing act. You have to save the AI from its own desire to be helpful, while preventing it from overthinking the overlaps. Whether you're working on text classification, content tagging, product categorization, or automated annotation for enterprise data pipelines, the same four rules apply.

Give it clear boundaries, an explicit way to opt-out, and room to "think" out loud. The result is a tagging system that's both accurate and robust: one that knows when to commit and when to say "none of the above."

At BotiqueAI, we build AI systems that go beyond simple prompting: robust classification pipelines, multi-agent workflows, and production-ready LLM integrations.

If you're wrestling with a classification or tagging problem, let's talk.

Book a free slot →

Sources

Phase 1 — The "Least Bad" Bias & LLM Sycophancy

¹ Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., ... & Perez, E. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548. arxiv.org/abs/2310.13548 — Foundational study characterising sycophantic behaviour across LLMs, showing models systematically favour responses that align with perceived user preferences over truthful ones.

² Zheng, C. et al. (2023). Large Language Models Are Not Robust Multiple Choice Selectors. ICLR 2024. openreview.net — Empirically demonstrates that LLMs exhibit an inherent "selection bias" in multiple-choice tasks, preferring specific option positions or IDs regardless of content.

Phase 2 — Overthinking & Paralysis with Overlapping Categories

³ Senger, M. et al. (2025). Language Models to Support Multi-Label Classification of Industrial Data. arxiv.org/abs/2504.15922 — Documents the same over-strictness failure mode in a real industrial zero-shot multi-label classification setting.

⁴ Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. arxiv.org/abs/2307.03172 — Shows that LLMs struggle to use information presented in the middle of long contexts, relevant to why presenting many tags at once degrades classification quality.

⁵ Heseltine, M. & Luzardo, A. (2026). Improving LLM Classification of Social Science Texts Through Prompt Engineering. arxiv.org/abs/2603.25422 — Systematically tests label descriptions and instructional nudges on classification accuracy, directly relevant to the "define the threshold" fix.

The Chain-of-Thought Fix

⁶ Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022, 35, 24824–24837. arxiv.org/abs/2201.11903 — The foundational CoT paper showing that prompting models to reason step-by-step significantly improves performance on complex tasks.

⁷ Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., ... & Chi, E. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625. arxiv.org/abs/2205.10625 — Introduces decomposing a hard problem into simpler subproblems solved sequentially, directly underpinning the "Group and Conquer" strategy.

⁸ Sprague, Z. et al. (2024). To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. arxiv.org/abs/2409.12183 — Nuanced follow-up showing CoT benefits are strongest on tasks requiring logic and disambiguation: exactly the multi-label overlap problem.

The Group and Conquer Fix

⁹ Lim, J. et al. (2025). Hierarchical Text Classification Using Black Box Large Language Models. arxiv.org/abs/2508.04219 — Empirically shows that a hierarchical "divide and conquer" strategy outperforms flat classification with LLMs, especially on deeper taxonomies.

¹⁰ Schindler, A. et al. (2026). Automated coding of communication data using large language models: a comparison of hierarchical and direct prompting strategies. Frontiers in Education. frontiersin.org — Compares hierarchical vs. flat prompting on a real classification task; confirms hierarchical is better overall but more sensitive to top-level errors.