So I wrote a clear prompt. Gave Claude a detailed set of instructions. Fed it the dataset. It gave me an output. I delivered it.
But when the stakeholder and I reviewed the deliverable in depth, we noticed some increasingly unsettling things.
Claude was confidently wrong.
Not wrong wrong, like hallucinating facts from nowhere. More like… overconfident wrong. It would generate a quarterly insight report and say something like:
“Negative sentiment in the Dresses department increased 23% this quarter, indicating a significant shift in customer satisfaction that warrants immediate attention from the product team.”
Sounds great. Except that spike was driven almost entirely by a single popular item that launched mid-quarter with a known sizing defect. One product. Not the whole department.
Claude had no idea. And my prompt didn’t tell it to care.

A Quarterly Customer Review Report Skill
I’m going to walk you through a Claude skill I built that generates a quarterly customer sentiment report from unstructured product review text, delivered as a PDF to stakeholders.
Obviously, I won’t be sharing the actual dataset I analyzed at work. The dataset I’m using is the Women’s E-Commerce Clothing Reviews dataset from Kaggle (CC0 license). It contains 23,000 real, anonymized customer reviews across clothing departments (Tops, Dresses, Bottoms, Jackets, and more) with text, star ratings, and product metadata. References to the company in the reviews have been replaced with “retailer.”
The skill should:
- Read a filtered slice of reviews for the current quarter
- Group them by department
- Identify trends & concerns
- Write a professional summary PDF for the product leadership team
Here’s the original prompt:
You are a data analyst generating a quarterly customer sentiment report for a women’s clothing e-commerce retailer. Given this quarter’s customer reviews (including review text, star ratings, and department), write a professional stakeholder report that includes:
– An overall sentiment summary for the quarter
– Key themes by department (Tops, Dresses, Bottoms, Jackets)
– 2-3 standout insights from the review text
– A brief recommendation for the product team
Be professional and clear.
When you’re done with this task, please create a skill titled reviews-analysis and save your instructions in there.
What “Confidently Wrong” Actually Looks Like
Here’s an example of what Claude produced with the naive skill above, on a quarter where the Dresses department had an influx of negative reviews:
“Negative sentiment in the Dresses department increased significantly this quarter, with customers frequently citing fit and sizing issues. This suggests the retailer’s sizing standards may be drifting from customer expectations — a trend that, if unaddressed, could erode brand loyalty in this key category.”
The real explanation? One dress (a single SKU) launched in Week 7 with a batch quality issue. The reviews were almost entirely about that one item. The rest of the Dresses department was performing fine.
Claude didn’t necessarily invent anything. It just had no context for why the pattern existed. And without that context, it did what LLMs do: it filled the gap with the most plausible-sounding narrative.

The Fix: 4 Lines You MUST Include
Line 1: Tell Claude What Context It’s Missing
You do NOT have access to product launch calendars, inventory records, promotional campaigns, or individual SKU-level history. Do NOT attribute department-level trends to brand-wide causes. Report patterns you observe in the text; do not explain why they exist unless the reviews themselves make it unambiguous.
This single instruction eliminates a huge category of confident wrongness. Without it, Claude will always reach for a strategic narrative because that’s what a good analyst does, and Claude is trying to be a good analyst.
The problem is that a good analyst also knows what they don’t know. They say “We’re seeing elevated sizing complaints in Dresses this quarter. This may be isolated to a recent launch but we’d need SKU-level data to confirm.” Claude won’t say that unless you tell it to.
Line 2: Define What “Significant” Actually Means
Claude loves the word significant. It uses it all the time. And it almost never defines it.
Only flag a sentiment shift as “significant” if it represents a change of more than 15 percentage points in positive/negative ratio compared to the prior quarter, OR if a theme appears in more than 20% of reviews in a given department. For smaller signals, use language like “slight uptick” or “minor increase.” Do not use the word “notable” or “significant” for anything below these thresholds. Always report the actual number value for the shift along with your claim.
You can adjust the 15% and 20% thresholds to whatever makes sense for your data. The point is to anchor Claude’s language to something real.
Without this, Claude will call both a 3-review spike in complaints and a genuine 30-point sentiment drop “significant”. Your stakeholders will start to tune out. And when something actually significant happens, they won’t know it.
Line 3: Force a Confidence Qualifier on Every Insight
Before each insight, include a confidence label in brackets: [Data-Supported], [Possible], or [Speculative].
Use [Data-Supported] only when the insight follows directly from the review text provided. Use [Possible] when the insight is a reasonable inference from the text. Use [Speculative] when you are making assumptions about causes or context that are not present in the reviews themselves.
When I first added this line, I was expecting mostly [Data-Supported] tags. What I actually got was a mix of all three, which told me exactly how much Claude had been filling in gaps in my previous reports without me realizing it.
An example of what the output looks like after adding this line:

Now your stakeholders can see exactly what’s solid and what’s a guess. That’s a much more honest report.
Line 4: Require Claude to State the Limits of the Analysis
At the end of the report, include a section called “What This Report Cannot Tell You.” List 2-3 things that would be needed to draw stronger conclusions, for example, SKU-level review breakdowns, return rates, or repeat purchase data.
This line forces Claude to acknowledge the edges of its own analysis. And it gives your stakeholders a clear roadmap for what questions to investigate further, which is actually the most valuable thing an analyst can do.
Here’s the output:

How to Use Claude to Refine the Skill
Writing a skill once isn’t enough. You need to test it and improve it the same way you’d iterate on a model.
Step 1: Run the skill on known examples.
Filter the dataset to a time window where you already know what happened. (A quarter with a product recall, a seasonal promotion, a period with unusually high return rates, etc.) See what Claude says. Does it use the word “significant” correctly? Does it state facts/statistics where it should?
Step 2: Feed Claude its own output and ask it to audit.
Claude is good at catching its own overconfidence when you explicitly ask it to look for it.
Here is a quarterly customer sentiment report generated by an AI analyst. Review every insight in this report and flag any that:
– Make causal claims without direct evidence in the review text
– Use words like “significant” or “notable” without justification
– Attribute individual product issues to brand-wide trends
– Assume context not present in the dataset (launch calendars,
inventory, purchase history)
For each flagged item, suggest a revised version that is more appropriately hedged.
Step 3: Add a clause for each failure you find.
Every time Claude produces a report with a clearly wrong or overconfident insight, ask it to add a new constraint to your skill. Over time, your skill pretty much becomes a record of everything Claude gets wrong.
A Word of Caution
Adding constraints to your skill can sometimes make Claude produce an output where every single sentence ends with “…though additional data would be needed to confirm this.”
That’s not useful either.
The goal is calibrated confidence where the strength of Claude’s language matches the strength of the evidence. If you find Claude becoming overly wishy-washy, you can add a counterbalancing constraint:
Do not over-qualify every statement. If a pattern appears clearly and consistently across many reviews, state it plainly and include references to the data behind the pattern. Reserve qualifiers for genuinely uncertain or speculative claims.
Conclusion
Claude is impressive at generating professional-looking reports, which can sometimes be the problem.
The polish hides the overconfidence. Your stakeholders see clean formatting and authoritative language, and they assume the insights are solid even when they’re not.
The four lines I’ve walked through here don’t make Claude less capable. They make it more honest. And in a reporting context, honest is more valuable than impressive.
Read more about what other use cases Claude is good for here, including building dashboards, debugging, and writing documentation:
→ 3 Claude Skills Every Data Scientist Needs in 2026
Thanks for Reading
Connect with me on LinkedIn
Buy me a coffee to support my work!







