An agentic customer service team built for SMBs that can’t afford to get it wrong.
Hanifa is a luxury fashion brand that has served women in sizes 0–24 for over 14 years. After a customer service crisis shook the community's trust in the brand, I saw a product problem worth solving: how does a small business maintain quality customer communication when the team is small, the stakes are high, and one careless response can undo years of trust?
I built a team of six AI agents to handle that workflow. Each one has a defined role, scoped knowledge, and clean handoffs. Not a chatbot. Not a template engine. A system that thinks before it writes.
The brand runs a yearly pre-order event that had always gone smoothly. In 2025, production delays pushed orders back two months. No updates. No timelines. No acknowledgment from a brand customers had trusted for over a decade. The community took to social media. With an agentic system like this in place, timely, on-brand communication could have gone out before the frustration became a crisis.
The crisis broke trust. Now every single interaction is a chance to rebuild it. These agents don't just handle customer service; they protect the brand's reputation while restoring the relationship. Body positivity, empathy, and community respect aren't just guidelines; they're built into every agent's instructions.
No new collections. No restock timelines. And customers are asking about items that aren't coming back. Every agent knows this and communicates with the sensitivity that a brand in recovery demands. No false promises. No deflection. Honest, warm responses that acknowledge the disappointment and keep the door open.
Independent build. I am not affiliated with Hanifa; I identified a public CX challenge and built a solution.
Six specialized agents organized by function: sense, orchestrate, research internally, research externally, write, and evaluate. No agent crosses its lane. I can test, iterate, and improve each one independently without breaking the pipeline.
One agent doing everything is fast to build and impossible to debug. Six agents means when something breaks, I know exactly where.
RECEPTION runs Haiku: fast and consistent for classification. ATELIER runs Sonnet because it needs voice and personality. MAÎTRE runs Sonnet with extended thinking at temperature 0.0, because the final quality gate gets zero randomness.
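The model-to-agent mapping can be sketched as a simple configuration table. This is illustrative only: the model IDs, the `AGENT_CONFIG` shape, and every temperature except MAÎTRE's 0.0 are my assumptions, not the actual build.

```python
# Hypothetical per-agent configuration; model IDs and most values are
# illustrative assumptions (only MAÎTRE's temperature 0.0 is stated above).
AGENT_CONFIG = {
    "RECEPTION": {"model": "claude-haiku",  "temperature": 0.2},  # fast, consistent classification
    "CONCIERGE": {"model": "claude-sonnet", "temperature": 0.4},  # orchestration and coaching
    "SOMMELIER": {"model": "claude-sonnet", "temperature": 0.2},  # internal KB retrieval only
    "CURATOR":   {"model": "claude-sonnet", "temperature": 0.2},  # external web research only
    "ATELIER":   {"model": "claude-sonnet", "temperature": 0.7},  # brand-voice drafting
    "MAITRE":    {"model": "claude-sonnet", "temperature": 0.0,   # deterministic final gate
                  "extended_thinking": True},
}
```

Keeping this in one place is what makes per-agent iteration cheap: swapping RECEPTION's model or tightening ATELIER's temperature touches one entry, not the pipeline.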
RECEPTION senses but never drafts. CONCIERGE orchestrates but never writes. MAÎTRE evaluates but never rewrites. Every boundary is explicit. Same discipline you’d enforce across a cross-functional team where expertise is critical.
SOMMELIER searches the internal knowledge base. CURATOR searches the web. They never cross. Mixing sources is how AI gives confident wrong answers. In any domain where accuracy matters, you need to know exactly where your agent’s knowledge comes from.
Reads the emotional temperature of every incoming email. Classifies tone and urgency so every downstream agent knows exactly what they’re walking into before they respond.
The boss. Classifies, coaches downstream agents, manages routing and revision loops.
Searches the KB to ground every response in verified, accurate information. No guessing, no hallucinating. If it’s not in the KB, it says so.
Conditionally deployed. Fills external gaps: shipping status, industry context, real-time information the knowledge base doesn’t carry.
The MVP of this team. Shoulders the responsibility of rebuilding trust through every email it drafts. Writes in the brand’s luxury voice using industry communication best practices.
Final quality gate. Every draft scored against a 7-dimension rubric before a human ever sees it. Nothing ships without passing.
CONCIERGE reads the incoming email plus RECEPTION’s sentiment analysis, then coaches every downstream agent with situation-specific direction. Not generic instructions. Tailored guidance based on what this particular customer needs.
CURATOR only activates when CONCIERGE identifies a gap the internal KB can’t fill. Routine emails skip it. Cost and latency stay low for the simple ones; capability is there for the complex ones.
When a draft fails eval, CONCIERGE doesn’t start over. It routes to the specific fix: tone adjustment, deeper KB search, external research, or full rework. The system diagnoses and prescribes.
Maximum 3 revision loops, then it stops and escalates. The system never loops indefinitely. If AI can’t get it right in three passes, the right answer is a person.
Every email goes through quality evaluation, a routing decision, and human approval before it reaches the customer. Whether it’s a body image misstep in fashion or a tax calculation error in payroll, the only pattern that works is AI proposes, human decides.
Every word either rebuilds trust or erodes it. The writing agent was trained on a detailed voice analysis so the output sounds like the brand, every time.
| Version | Author | Date |
|---|---|---|
| v0.1 | Maurisa Westbury | March 2026 |
Every draft gets scored against 7 dimensions before a human sees it. A critical failure doesn’t get flagged. It gets blocked.
Did we address the person, not the ticket? Did we solve AND elevate? Did we anticipate the follow-up? Did we leave the door open? Would we be comfortable if this were posted publicly? 4 out of 5 to pass.
Five test scenarios: crisis-adjacent complaint, body positivity sizing question, new customer celebration, press inquiry where the correct output is NO draft at all, and a standard return. If it handles these five, it handles the middle.
Not every email should get an AI response. Press inquiries, legal issues, active crises: the system recognizes these and stops. No draft generated. Human only. Knowing when NOT to respond is just as important as knowing how.
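The human-only rule reduces to a check that runs before any drafting starts. The category labels here are hypothetical stand-ins for whatever RECEPTION and CONCIERGE actually emit.

```python
# Categories that bypass drafting entirely; labels are illustrative.
HUMAN_ONLY = {"press_inquiry", "legal_issue", "active_crisis"}


def should_draft(category: str) -> bool:
    """False means the pipeline stops here: no draft is generated at all."""
    return category not in HUMAN_ONLY
```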
Scoring: EXCELLENT / GOOD / ACCEPTABLE / NEEDS WORK / CRITICAL FAILURE. Any critical failure blocks entirely.
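The blocking behavior can be sketched as a small gate function. Two assumptions to flag: the seven dimension names are not enumerated here, and treating NEEDS WORK as "revise" rather than an outright failure is my reading, not a documented rule.

```python
# A sketch of the quality gate; the pass threshold for NEEDS WORK is assumed.
PASSING = {"ACCEPTABLE", "GOOD", "EXCELLENT"}


def quality_gate(scores: dict[str, str]) -> str:
    """scores maps each rubric dimension to one of the five grades."""
    if "CRITICAL FAILURE" in scores.values():
        return "BLOCKED"   # blocked entirely, never merely flagged
    if all(grade in PASSING for grade in scores.values()):
        return "PASS"
    return "REVISE"        # routed back through the revision loop
```

The order matters: the critical-failure check runs first, so a draft that scores EXCELLENT on six dimensions and CRITICAL FAILURE on one still never ships.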
Discovery to delivery. Engineering-ready specs. The right conversations with the right teams at the right time.
What’s the problem, who owns it, and what does solved look like? Get stakeholders aligned on what we’re building and why.
The knowledge base is part of the product. I go deep on domain research, drawing on existing documentation and conversations with the people who do the work, so I walk into engineering with context, not assumptions.
Engineering-ready specs. Agent roles, boundaries, guardrails, edge cases, acceptance criteria. Clear enough that engineering can execute.
I work cross-functionally, speaking the language well enough to have trade-off conversations and drive informed decisions about architecture, model selection, and system behavior.
Define and run the scenarios. Observe the patterns. Adjust. Test again. Eval criteria documented upfront so QA isn’t a vibe check, it’s a measured loop.
Human checkpoints active. Feedback collection in place. Shipping is the beginning, not the end. Iterate based on what the data says.