CLAUDE CERTIFIED ARCHITECT – FOUNDATIONS EXAM
================================================================================
CLAUDE CERTIFIED ARCHITECT – FOUNDATIONS EXAM
QUESTION DUMP - FYI these questions are from actual exam
================================================================================
TITLE: Returning customers and stale tool results (session design)
A customer returns 4 hours after their initial session about the same billing dispute. The previous 32-turn session contains lookup_order results showing "Status: Pending. Expected resolution: 24-48 hours." In testing, when resuming sessions with stale tool results, the agent often references outdated data in responses (for example, saying a refund is still being processed) even after subsequent fresh tool calls return different information.
QUESTION: What approach most reliably handles returning customers?
OPTIONS:
A) Start a new session, inject a structured summary of the previous interaction (issue type, actions taken, resolution status), then make fresh tool calls before engaging.
B) Resume with full history and add a system prompt instruction telling the agent to always prefer the most recent tool results when multiple calls to the same tool exist in context.
C) Resume with full history and configure the agent to automatically re-call all previously-used tools at session start to ensure data freshness.
D) Resume with full history but filter out previous tool_result messages before resuming, keeping only the human/assistant turns so the agent must re-fetch needed data.
TITLE: MCP tool errors — communicating failures to the agent (lookup_order)
When implementing your lookup_order MCP tool, the backend sometimes returns errors (for example, "Order not found" or temporary database failures).
QUESTION: What is the correct pattern for communicating these errors back to the agent?
OPTIONS:
A) Throw an exception from the tool handler so the agent framework can catch and log it.
B) Return a success response with a "status" field indicating the error type.
C) Return the error message in the tool result content with the isError flag set to true.
D) Log the error server-side and return an empty result to avoid confusing the model.
TITLE: Uniform MCP errors and inconsistent agent behavior
Production logs reveal inconsistent error handling: when lookup_order fails, the agent sometimes retries 5+ times (wasteful when the order ID doesn't exist), escalates immediately (premature for temporary network issues), and sometimes asks users for clarification (inappropriate when the issue is a backend permission error). Investigation shows your MCP tool returns uniform error responses:
{"isError": true, "content": [{"type": "text", "text": "Operation failed"}]}
The agent cannot distinguish between error types.
QUESTION: What's the most effective improvement?
OPTIONS:
A) Implement retry logic with exponential backoff in your MCP server for all errors, returning to the agent only after retries are exhausted.
B) Create an analyze_error MCP tool the agent calls after any failure to determine the error category and recommended action.
C) Add few-shot examples to the system prompt demonstrating how to interpret error message patterns and select appropriate responses for each.
D) Enhance error responses with structured metadata: include errorCategory (transient/validation/permission), isRetryable boolean, and a description of what caused the failure.
TITLE: Agentic loops — choosing the next tool after lookup_order
SCENARIO: Customer Support Agent
When the agent calls lookup_order and receives order details showing the item was purchased 45 days ago, how does the agentic loop determine whether to call process_refund or escalate_to_human next?
OPTIONS:
A) The agent follows a pre-configured decision tree mapping order attributes to specific tool calls.
B) The orchestration layer automatically routes to the next tool based on the order's status field.
C) The agent executes the remaining steps in a tool sequence planned at the start of the request.
D) The order details are added to the conversation and the model reasons about which action to take.
TITLE: Multi-issue chats approaching context limits
A customer raises three separate issues during one session: a refund inquiry (turns 1-15), a subscription question (turns 16-30), and a payment method update (turns 31-45). At turn 48, the customer asks "What happened with my refund?" The conversation is approaching context limits.
QUESTION: What strategy best maintains the agent's ability to address all issues throughout the session?
OPTIONS:
A) Implement sliding window context that retains the most recent 30 turns.
B) Extract and persist structured issue data (order IDs, amounts, statuses) into a separate context layer.
C) Summarize earlier turns into a narrative description, preserving full message history only for the active issue.
D) Rely on MCP tools to re-fetch relevant information on demand when the customer references earlier issues.
TITLE: Customer wants a human — refund is straightforward
A customer writes: "I've been going back and forth on this return for days. I just want to speak to someone who can actually help me." The agent has confirmed via lookup_order that the return is straightforward—within policy and eligible for immediate processing.
QUESTION: What should the agent do?
OPTIONS:
A) Call escalate_to_human immediately to honor the customer's request.
B) Process the refund via process_refund to resolve the underlying issue, then inform them it's complete.
C) Acknowledge frustration, inform them this is resolvable now, and offer to complete it or escalate.
D) Ask what specifically hasn't worked in previous attempts before deciding whether to escalate or resolve automatically.
TITLE: Escalation handoff when the human cannot see the transcript
After investigating a billing dispute over 25+ turns, you've identified that duplicate charges occurred due to a payment gateway timeout triggering retry logic. The required refund ($847) exceeds your $500 authorization limit. You need to call escalate_to_human, and the human agent won't have access to your conversation transcript.
QUESTION: What context should you pass to enable effective resolution?
OPTIONS:
A) The complete conversation transcript with all tool results.
B) The customer's original complaint verbatim plus the tool result excerpts showing duplicate transactions.
C) A structured summary: customer ID, root cause, refund amount, and recommended action.
D) Your diagnosis and the refund amount only.
TITLE: lookup_order outputs crowding context
Your agent has called lookup_order multiple times while investigating a customer's return requests. Each response includes 40+ fields (items, shipping details, payment info, status history). Tool outputs now represent the majority of the conversation's context. The customer mentions two more orders they want to discuss.
QUESTION: What's the most effective approach before making additional lookups?
OPTIONS:
A) Move all tool responses to a vector database with semantic indexing, retrieving relevant portions as the conversation continues.
B) Extract only return-relevant fields (items, purchase date, return window, status) from each existing order response, removing verbose details.
C) Have the model generate a natural language summary of each order's key details, replacing structured responses with prose descriptions.
D) Proceed with additional lookups without modifying the existing tool output context.
TITLE: process_refund timeouts — balancing resolution and honesty
During a billing dispute resolution, your agent successfully retrieves customer info via get_customer and order details via lookup_order, but when attempting to call process_refund, the tool returns a timeout error. The agent has enough information to explain the charges and verify refund eligibility, but cannot actually process the refund due to the backend failure.
QUESTION: What approach best balances first-contact resolution with appropriate error handling?
OPTIONS:
A) Confirm the refund will be processed and close the conversation, since the system has all necessary information to complete it automatically.
B) Implement automatic retries with exponential backoff for process_refund, keeping the conversation open until the refund is successfully processed.
C) Escalate immediately to a human agent since the refund action cannot be completed.
D) Explain the billing, confirm refund eligibility, acknowledge the system issue preventing immediate processing, and offer escalation or retry later.
TITLE: Identity verification — the agent forgets earlier answers
The agent verifies customer identity through a multi-step process before resetting passwords. During testing, you notice that after the customer answers the third verification question, the agent asks them to provide their name again, as if the earlier exchange never happened.
QUESTION: What's the most likely cause of this behavior?
OPTIONS:
A) Claude's memory retention is limited to two conversational turns by default, requiring explicit configuration to extend it.
B) The verification tool is clearing the agent's internal state after each successful validation step.
C) The conversation history isn't being passed in subsequent API requests.
D) The prompt lacks instructions telling Claude to remember information across multiple exchanges.
TITLE: "I want a human NOW" — before any tools run
A customer sends: "This is frustrating. I've explained my issue twice and nothing is being resolved. I want to talk to a real person NOW." The agent has not yet called any tools to investigate their account.
QUESTION: What should the agent do?
OPTIONS:
A) First call get_customer and lookup_order to gather account context, then escalate to a human agent.
B) Immediately call escalate_to_human with the conversation history.
C) Briefly explain what the agent can help with and offer to resolve the issue quickly, escalating only if the customer repeats their request.
D) Acknowledge the frustration and ask one targeted question to understand the specific issue before escalating.
TITLE: Context management for a long dinner-party planning session
After a 40-minute session helping plan a dinner party, the conversation has grown to 78,000 tokens. The history includes: (1) a guest has a severe shellfish allergy, (2) specific measurements for scaling recipes to 8 servings, (3) clarification that "room temperature butter" means 68°F in their kitchen, and (4) general back-and-forth about meal timing and presentation. You need to implement context management before the window limit is reached.
QUESTION: What approach best balances information preservation with token reduction?
OPTIONS:
A) Store the full conversation externally and use semantic search to retrieve relevant portions for each turn, loading only matching segments into context.
B) Summarize the entire conversation history into a concise summary capturing main topics discussed, then append new messages going forward.
C) Implement a sliding window retaining only the most recent 20,000 tokens, relying on users to re-state important information when relevant.
D) Extract critical structured data (allergies, serving counts, user-defined terms) into a compact reference section, summarize general discussion, and retain recent exchanges verbatim.
TITLE: Home renovation assistant — system prompt seems to stop working mid-chat
Your home renovation planning assistant uses a system prompt defining an expert contractor persona with specific guidelines: always ask about budget, suggest alternatives at multiple price points, and confirm timeline requirements. During testing, responses follow these guidelines for turns 1-4, but by turn 7, the assistant gives generic advice without asking about budget or timeline. The conversation totals only 2,500 tokens.
QUESTION: What is the most likely cause?
OPTIONS:
A) The model's attention on system prompt instructions naturally weakens as turns accumulate.
B) System prompts only establish initial behavior and don't persist across all turns.
C) The assistant's accumulated responses are diluting the system prompt's influence.
D) The system prompt is only sent with the first API request.
TITLE: Playlist preferences — Claude doesn't remember earlier messages
You're implementing a feature where users refine playlist preferences through multiple conversation turns. After deploying, you notice Claude's responses don't reflect what users said earlier in the same conversation—for example, a user says they love jazz, but two messages later Claude asks what genres they enjoy.
QUESTION: What is the most likely cause?
OPTIONS:
A) Your application isn't including prior messages in the messages array.
B) The model's context window has been exceeded by the conversation length.
C) Claude requires a vector database connection to maintain conversation memory.
D) The Claude API requires a session_id parameter that you haven't configured.
TITLE: Long chats — behavior drifts even within a huge context window
During QA testing, you notice that Claude follows your system prompt guidelines consistently in the first 10-15 turns, but by turn 25-30, responses begin deviating—using informal tone when formality was specified, occasionally skipping required formatting, or providing information types the guidelines restrict. Conversation length is well within context limits (typically 30,000 tokens out of 200,000 available).
QUESTION: What's the most effective approach to maintain consistent behavior throughout extended conversations?
OPTIONS:
A) Move behavioral guidelines from the system prompt into the first user message.
B) Automatically start a new conversation after 20 turns, passing a summary of the prior context to maintain continuity.
C) Implement post-response validation that regenerates each response until it conforms to the specified guidelines.
D) Insert user-role messages that reinforce critical guidelines at natural conversation breakpoints, especially before complex requests.
TITLE: Sliding window memory — users lose track of earlier topics
Users report that during extended conversations, the AI loses track of specific topics, examples, and preferences they mentioned earlier in the session. Your current implementation uses a sliding window that keeps only the most recent 25 message pairs to stay within context limits.
QUESTION: What's the most effective approach to maintain awareness of earlier conversation content while managing context size?
OPTIONS:
A) Implement vector similarity search over the full conversation history, retrieving relevant past messages for each user query.
B) Replace the sliding window with a hybrid approach: summarize older messages while keeping recent messages verbatim.
C) Add a separate API call each turn to summarize messages being dropped, prepending this running summary to the conversation.
D) Increase the window size to 50 message pairs to retain more conversation history before truncation.
TITLE: Fitness coaching — implicit expertise signals
Your fitness coaching assistant uses a system prompt with detailed conditional logic: "If the user mentions being a beginner, provide step-by-step form instructions. If they use terms like 'progressive overload' or 'superset', respond concisely. If they ask about injury history, always recommend consulting a physician." During evaluation, you find the assistant correctly adapts to explicit expertise declarations but struggles when users don't clearly state their level—often defaulting to overly detailed responses regardless of contextual cues like technical terminology.
QUESTION: Which change to the system prompt would most directly address this failure to pick up on implicit expertise signals?
OPTIONS:
A) Replace most conditionals with a general principle: "Adapt explanation depth to match user expertise, mirroring their terminology." Keep only the safety-critical conditional about injury consultations.
B) Implement a pre-conversation intake that asks users to rate their experience level, then inject that rating into the system prompt as context for all subsequent responses.
C) Add more conditional branches to cover additional expertise signals, such as "If user mentions specific rep ranges or asks about periodization, treat as advanced."
D) Add an explicit instruction for the model to ask a clarifying question about experience level whenever the user's expertise isn't immediately clear from their first message.
TITLE: Repetitive assistant openers ("Certainly!", "I'd be happy to help!")
Users report that responses feel repetitive across turns—each message begins with phrases like "Certainly!" or "I'd be happy to help!" even deep into conversations. You want responses to flow naturally without these repetitive openers.
QUESTION: What's the most effective approach?
OPTIONS:
A) Append a partial assistant message with a direct response opening that the model will continue from.
B) Implement post-processing to detect and strip common greeting phrases from response beginnings.
C) Lower the temperature parameter to make response openings more deterministic and less variable.
D) Add system prompt instructions specifying phrases to avoid, such as "Never begin responses with 'Certainly' or similar affirmations."
TITLE: API latency and cost vs conversation length
Users report that API latency increases noticeably and costs rise as practice conversations extend beyond 50+ turns.
QUESTION: What is the PRIMARY cause of this behavior?
OPTIONS:
A) The model builds an internal profile of the user's conversation patterns, requiring more processing as the profile grows.
B) The entire conversation history is included with each API request, so input tokens grow with every turn.
C) The model generates progressively longer responses as it accumulates more context to reference.
D) Database operations for retrieving and storing conversation history slow down as the table grows larger.
TITLE: Investment advice — conflicting risk preferences
Over several turns discussing investment strategy, a user has stated: "I have a very low risk tolerance—I can't handle seeing my portfolio drop" and later, "I want to maximize my returns—I've seen friends do well in crypto and tech stocks." They now ask "What should I invest in?"
QUESTION: Which approach best ensures the resulting recommendation aligns with the user's actual priority?
OPTIONS:
A) Provide separate recommendations for both scenarios (conservative and growth-oriented) without clarifying which the user prefers, and let them choose.
B) Surface the contradiction and ask the user to clarify which matters more—capital preservation or growth potential—before recommending.
C) Recommend a balanced portfolio that moderates both preferences without explicitly addressing the conflict.
D) Proceed with the most recently stated preference.
TITLE: RAG results crowding conversation history
Performance analysis reveals your context is composed of accumulated RAG results from all previous queries, which is crowding out conversation history and causing coherence degradation after 15+ turns.
QUESTION: Which approach best addresses this issue?
OPTIONS:
A) Shift context budget to favor RAG results while reducing conversation history allocation.
B) Compress all RAG results into a consolidated summary document that updates incrementally after each retrieval.
C) Implement a sliding window for RAG results from the last 2-3 queries while preserving conversation history.
D) Implement semantic deduplication to identify and remove redundant information across the accumulated RAG results and conversation turns.
TITLE: Long conversations — summarization vs sliding window only
After 30+ turns, your conversational assistant shows noticeably slower responses and occasionally produces less coherent outputs. Investigation reveals: (1) average conversations reach 50,000 tokens by turn 35, (2) production logs show 94% of user messages only reference the previous 3-5 exchanges, (3) the 6% of queries referencing earlier context typically ask about information the user could easily re-state. Your goal is to improve response speed and quality while maintaining good user experience.
QUESTION: What's the most effective approach?
OPTIONS:
A) Enable prompt caching and continue sending the complete conversation history; using cached prefixes to reduce per-request costs while preserving all context.
B) Implement a sliding window keeping only the system prompt and last 8-10 turns. When users reference earlier context, acknowledge the limitation and ask them to re-state the relevant information.
C) Implement a summarization layer that progressively compresses older conversation turns into a running summary while keeping the most recent 5-6 turns verbatim, maintaining full historical context in condensed form.
D) Build a retrieval system that stores all conversation turns and uses semantic search to pull in relevant historical context only when the current query appears to reference it.
TITLE: Vocabulary tutoring — Claude "forgets" words from earlier in the chat
During initial testing, you notice that Claude doesn't seem to remember vocabulary words from earlier in the conversation. When a student asks "Can you quiz me on those words?", Claude responds as if no words have been discussed.
QUESTION: What is the most likely explanation?
OPTIONS:
A) The model's context window has filled up, causing earlier conversation content to be dropped.
B) You need to enable conversation persistence by passing a session ID parameter with each API call.
C) Your system prompt needs explicit instructions telling Claude to remember information from earlier turns.
D) You're not including prior messages in each API request—the stateless API doesn't retain conversation history.
TITLE: Updated system prompt breaks long-running threads
After deploying an updated system prompt that improves response quality, users with multi-session conversations spanning several weeks report that the assistant now contradicts its earlier statements and has a noticeably different communication style. New users don't experience these issues.
QUESTION: What's the best approach to resolve this?
OPTIONS:
A) Add instructions to the new system prompt directing the assistant to maintain consistency with any prior statements in the conversation history.
B) Add a transition message when sessions resume explaining that the assistant has been updated and behavior may differ.
C) Regenerate summaries of existing conversations using the new prompt and replace the stored histories to align past context with current behavior.
D) Version system prompts and associate each conversation with the prompt version under which it started, applying updates only to new conversations.
TITLE: Where to define persistent behavioral guidelines (music assistant)
Your music discovery assistant should consistently maintain an enthusiastic tone, explain its reasoning for each recommendation, and ask clarifying questions to better understand user preferences. You want this behavior to persist reliably across all user interactions.
QUESTION: Where should you define these behavioral guidelines?
OPTIONS:
A) In the system prompt.
B) Prepended to each user message before sending to the API.
C) In the first assistant message, instructing Claude to follow these guidelines going forward.
D) In environmental variables that your application passes to the API client.
TITLE: Webhook mid-conversation — incorporate shipping status naturally
During a conversation about order tracking, your external system receives a webhook indicating the user's package has shipped. The user is actively chatting and will likely send a follow-up message soon. You want the assistant to naturally incorporate this status change in its next response.
QUESTION: What's the most effective approach?
OPTIONS:
A) Immediately send an API request with the update as a synthetic user message, generating an unsolicited assistant response.
B) Append the status update as a prefix to the next user message before calling the API.
C) Configure the assistant to call a get_order_status tool at the start of every response.
D) Add the current shipping status to the system prompt before the next API call.
TITLE: MCP tool errors — malformed params vs downstream API failures
Your MCP server implements a check_availability tool that queries an external calendar API. During testing, you encounter three error conditions: (1) the tool is called with a malformed request missing the required user_email parameter, (2) the calendar API returns a 404 because the specified user doesn't exist in the calendar system, and (3) the calendar API returns a 503 because the service is temporarily unavailable.
QUESTION: How should each error be reported according to MCP's error handling design?
OPTIONS:
A) Report all three as tool results with isError: true.
B) Report all three as JSON-RPC protocol errors.
C) Report errors 1 and 2 as JSON-RPC protocol errors; report error 3 as a tool result with isError: true.
D) Report error 1 as a JSON-RPC protocol error; report errors 2 and 3 as tool results with isError: true.
TITLE: Multi-carrier shipment tracking — normalize tool outputs
Your shipment tracking tool queries multiple carriers (FedEx, UPS, DHL) that each return status information in different formats—FedEx uses numeric codes, UPS uses descriptive phrases, and DHL returns timestamped event arrays. The agent uses these results to determine delivery status and escalate delayed shipments.
QUESTION: How should you structure the tool's return value to best enable the agent's reasoning?
OPTIONS:
A) Return a normalized schema with status, estimated_delivery, delay_reason, and requires_action fields, converting carrier formats internally.
B) Return both a normalized summary and the complete raw carrier response in each tool call.
C) Design separate tool endpoints for each carrier (track_fedex, track_ups, track_dhl) with carrier-specific response schemas.
D) Return raw carrier responses with source metadata, encoding carrier-specific interpretation logic in the system prompt.
TITLE: Structured JSON vs formatted text (get_portfolio_value)
Your get_portfolio_value tool returns the total value of a user's investment portfolio. You're deciding between returning a structured JSON object with explicit fields versus returning the information as a formatted text string.
QUESTION: What is the primary advantage of using structured output with defined fields?
OPTIONS:
A) Structured JSON is processed deterministically by the model, significantly improving accuracy when extracting values.
B) Structured JSON consumes significantly fewer tokens than natural language, substantially reducing API costs.
C) JSON schemas automatically validate that the underlying API returned correct data before the agent processes it.
D) The agent can reliably extract specific values without parsing free-form text, reducing errors in subsequent operations.
TITLE: track_shipment errors — exceptions vs helpful agent behavior
Your track_shipment tool queries an external logistics API that sometimes fails—the API may be temporarily unavailable, the tracking ID may be malformed, or the shipment may not exist. Currently, your tool raises a Python exception when errors occur. Users report the agent gives unhelpful responses like "I'm having trouble with that request" instead of suggesting alternatives such as verifying the tracking number format or checking by order number.
QUESTION: How should you handle errors in tool results?
OPTIONS:
A) Return a generic error response (for example, {"success": false, "error": "lookup_failed"}) for all failure cases to maintain a consistent schema and avoid exposing internal error details.
B) Implement retry logic with exponential backoff inside the tool implementation so transient errors are automatically handled and only return a result after all retry attempts are exhausted.
C) Create dedicated error-recovery tools (retry_tracking_lookup, search_by_order_number) that the model can invoke after the primary tracking tool returns a failure indicator.
D) Return structured error information as normal tool output including error type, recoverability status, and actionable context for the user.
TITLE: Choosing a document database parameter from natural language
Your search_documents tool needs a parameter to specify which database to search. Your organization has three document databases: research_papers, internal_reports, and technical_specs. Users express this naturally in conversation ("search the research database", "check technical documents").
QUESTION: How should you design the database selection parameter?
OPTIONS:
A) No explicit parameter—search all three databases by default, then have the model filter results by source.
B) A freeform string parameter where the backend uses semantic matching to determine which database(s) to search.
C) A freeform string parameter with runtime validation that returns an error if the value doesn't match a known database.
D) An enum parameter with values ["research_papers", "internal_reports", "technical_specs"], requiring the model to map natural language to the appropriate value.
TITLE: MCP weather integration vs custom in-app tools
You're deciding whether to expose your weather API integration as an MCP server or implement it as a custom tool directly in your agent application. Your team plans to build several different AI applications that will all need weather data.
QUESTION: What is the primary advantage of the MCP approach?
OPTIONS:
A) MCP servers process tool calls faster than custom implementations because they use an optimized binary protocol.
B) MCP provides built-in retry logic that automatically handles failed API calls with exponential backoff.
C) MCP automatically handles authentication and rate limiting with the weather API, reducing backend implementation work.
D) Any MCP-compatible client can connect to your weather server without writing custom integration code.
TITLE: send_notification timeouts — ambiguous delivery and duplicate sends
Your send_notification tool calls third-party messaging APIs. When these services time out during delivery, you cannot determine whether the message was actually sent. Currently, the tool returns is_error: true with a generic "Notification failed" message for all timeouts. Production monitoring reveals agents automatically retry these failures, frequently causing users to receive duplicate notifications.
QUESTION: How should you modify the error response?
OPTIONS:
A) Return is_error: true with a message communicating uncertainty: "Timeout—status unknown. Message may have been sent. Avoid retry."
B) Return is_error: false with the original message content echoed back.
C) Return is_error: true with a structured field retry_safe: true for timeouts, distinguishing them from permanent failures that should not be retried.
D) Return is_error: true with a message encouraging retry: "Delivery service temporarily unavailable. Please retry the notification."
TITLE: Scheduling race — slot disappears between availability check and booking
Your scheduling agent uses get_available_slots(date, provider_id) to retrieve open appointment times, then book_appointment(provider_id, slot_time, patient_id) to reserve a slot. Support tickets show that 15% of booking attempts fail with "slot no longer available" because another user booked the slot between the availability check and the booking call.
QUESTION: How should you redesign these tools?
OPTIONS:
A) Add a hold_slot(provider_id, slot_time) tool that creates a 60-second temporary reservation, requiring the agent to call it between checking availability and booking.
B) Modify book_appointment to return detailed failure information including currently available alternative slots when the requested slot is unavailable, enabling the agent to retry with a different time.
C) Combine both tools into a single find_and_book_appointment that atomically checks availability and books, returning either the confirmed booking or available alternatives.
D) Keep both tools but add retry logic to the agent's system prompt, instructing it to call get_available_slots again and select a different time if booking fails.
TITLE: search_products pagination — slow auto-fetch of all pages
Your search_products tool queries an external catalog API that returns paginated results (50 items per request). Production logs show queries frequently match 200+ products, and the current design that auto-fetches all pages causes 15-20 second delays.
QUESTION: How should you redesign the pagination handling?
OPTIONS:
A) Create separate search_products and fetch_more_results tools for pagination.
B) Implement server-side relevance ranking and return only the top 50 most relevant items.
C) Return the first page with total match count and cursor for additional pages.
D) Add a max_pages parameter (default: 2) that controls how many pages are fetched internally.
TITLE: Invoice extraction — confidence scores and human review routing
Your document extraction tool uses ML models to extract invoice fields (vendor, amount, date). The models return confidence scores (0.0-1.0) for each extracted field. In production, you observe: (1) the agent proceeds with low-confidence extractions that are incorrect 23% of the time, and (2) the agent requests unnecessary human review for 31% of extractions that were actually correct.
QUESTION: How should you restructure the tool's output?
OPTIONS:
A) Compute an aggregate extraction_quality score across all fields and return it alongside the extracted values. Include a text summary describing the overall extraction reliability.
B) Return fields organized into verified and needs_verification objects based on confidence thresholds.
C) Return fields with their raw confidence scores and add detailed few-shot examples to your system prompt demonstrating how to interpret different confidence ranges and when to request human review.
D) Return fields with confidence scores, plus a requires_review boolean computed using your tested confidence thresholds, along with a review_reasons array explaining which fields triggered review.
TITLE: update_user_profile — Claude omits user_id / mis-shapes fields
Your update_user_profile tool accepts a user_id (required) and an optional fields_to_update object. In testing, Claude frequently omits user_id or passes incorrectly structured data.
QUESTION: What is most critical for helping Claude understand what parameter values to provide?
OPTIONS:
A) Strict JSON Schema type constraints marking user_id as required and defining fields_to_update as an object type.
B) Verbose parameter names encoding format hints, such as user_id_string_uuid_format.
C) Detailed error responses explaining why invalid parameter values were rejected.
D) Clear parameter descriptions explaining expected format, such as "user_id: UUID of the user to update (required)."
TITLE: control_device timeouts — user-facing next steps
Your control_device tool manages smart home devices through external APIs. When a device doesn't respond within the timeout period, the tool returns an error. Production logs show that the agent simply tells users "the device is not responding" without offering helpful next steps.
QUESTION: Which error response structure would best enable the agent to provide useful follow-up?
OPTIONS:
A) Set is_error: true with a brief "Device offline" message and provide a separate tool the agent can call to retrieve context-specific troubleshooting suggestions.
B) Set is_error: true with a structured technical error containing the device ID, timeout duration, and raw API response code for debugging purposes.
C) Set is_error: false with an optimistic message indicating the command was dispatched successfully but device acknowledgment is still pending.
D) Set is_error: true with a message explaining the likely cause and suggesting troubleshooting steps the agent can offer the user.
TITLE: post_content — dangerous "quick confirm" approvals
Your post_content tool requires user confirmation before publishing. The current workflow displays "Ready to post to social media. Confirm?" and analytics show users approve 98% of requests within 2 seconds. Post-mortems reveal incidents where posts went to wrong accounts, were scheduled for wrong times, or contained errors—all confirmed by users without catching the mistakes.
QUESTION: How should you redesign the confirmation workflow?
OPTIONS:
A) Auto-approve routine posts and only require explicit confirmation for unusual patterns like posting to new accounts or large audiences.
B) Include the complete post text, target account, scheduled time, and platform in the confirmation request.
C) Require users to type a confirmation phrase instead of clicking a button.
D) Add a mandatory waiting period before the confirm option becomes available.
TITLE: search_products — empty results mistaken as failures
Your product search tool queries an external catalog API and returns matching items. In production, you observe the agent frequently retries searches immediately after receiving zero results, treating "no matches found" as a failure requiring retry. The external API returns HTTP 200 with an empty results array—a valid response.
QUESTION: How should you restructure the tool's result to help the agent correctly interpret empty result sets?
OPTIONS:
A) Return a natural language string describing the outcome, allowing the agent to interpret the result contextually based on the message content.
B) Add a suggestions field containing alternative search strategies when results are empty, helping guide the agent toward more productive follow-up queries.
C) Return a result object with isError: true and a message explaining no products matched.
D) Return a structured result with a success boolean and results array, reserving isError: true for actual execution failures only.
TITLE: update_game_score — nicknames, date formats, rematches
Your agent includes an update_game_score tool that accepts game_date (string), home_team (string), and away_team (string) parameters. Production logs reveal recurring issues: the agent uses team nicknames instead of official names, applies inconsistent date formats, and selects the wrong game when teams have rematches in the same season.
QUESTION: What tool interface change would most effectively prevent these errors?
OPTIONS:
A) Add enum constraints listing valid team names for both team parameters, and add a regex pattern enforcing ISO 8601 format for the date parameter.
B) Replace the three parameters with a single game_id parameter and a separate search_games lookup tool that returns matching game IDs.
C) Add a season parameter to disambiguate rematches, and add a confirm_before_update flag that returns the resolved game details for the agent to verify before the score is committed.
D) Add detailed examples to the tool description showing the required date format and complete list of official team names.
TITLE: E-commerce extraction — inconsistent "materials" field
Your extraction system parses e-commerce product descriptions to extract specifications like dimensions, weight, and materials into JSON. Despite having a well-defined schema, the model inconsistently extracts the materials field—sometimes returning "cotton blend", other times "Cotton/Polyester mix", and occasionally omitting the field when material information is clearly present in the source.
QUESTION: What's the most effective way to improve extraction consistency?
OPTIONS:
A) Set temperature to 0 to eliminate randomness and ensure deterministic outputs.
B) Make the "materials" field required instead of optional in the schema to force the model to always extract a value.
C) Add few-shot examples showing 2-3 complete input-output pairs with standardized material description formats.
D) Switch to a more capable model tier since inconsistent extraction indicates insufficient model capability.
TITLE: Citations / methodology missing despite strict JSON schema
After implementing tool use with strict schema definitions, JSON syntax errors are eliminated, but 5% of extractions still have valid JSON with empty arrays or null values for required fields like citations and methodology. Spot-checking reveals that source documents contain this information, but in varied formats—inline citations vs bibliographies, methodology sections vs details embedded in introductions.
QUESTION: What's the most effective way to address these failures?
OPTIONS:
A) Add few-shot examples demonstrating extractions from documents with varied structures—showing how to identify citations in different formats and locate methodology details across section types.
B) Modify your schema to make citations and methodology optional, and flag incomplete records for manual review rather than failing validation.
C) Build a regex-based post-processing layer that scans source documents for citation patterns and methodology keywords, populating empty fields when the model fails to extract.
D) Implement retry logic that re-sends requests when validation detects empty required fields.
TITLE: Calendar invites — strict schema compliance
Your system must extract event details from calendar invitations and output JSON that strictly conforms to a schema with fields for title, date, time, location, and attendees. Downstream systems reject any malformed or non-conformant JSON.
QUESTION: What approach provides the most reliable schema compliance?
OPTIONS:
A) Append instructions like "Output only valid JSON matching the schema exactly" and implement retry logic to re-prompt when JSON parsing fails.
B) Define a tool with your target schema as input parameters and have Claude call it with the extracted data.
C) Pre-fill Claude's response with an opening brace to force JSON output, then complete and parse the response.
D) Include detailed JSON formatting instructions and the target schema in your prompt, then parse Claude's text response as JSON.
TITLE: Large tool definitions + long documents + missed end-of-doc facts
Your extraction system uses tool_use with a JSON schema containing 12 fields and detailed descriptions, totaling approximately 2,500 tokens for the complete tool definition. Processing documents under 150K tokens yields 98% accuracy. For documents between 175-190K tokens, accuracy drops to 71%, with information from the final third consistently missed. The model's context window is 200K tokens.
QUESTION: What is the most likely cause?
OPTIONS:
A) The model distributes attention proportionally across input length, causing fields mentioned only once near the document's end to receive insufficient processing focus.
B) Schemas exceeding 8-10 fields increase decision complexity during parameter generation, reducing extraction accuracy independent of document length.
C) Very long documents exceed the model's effective attention span regardless of context limits, causing accuracy degradation for content farther from the prompt instructions.
D) Tool definitions consume input context tokens. Combined with system prompts and document content, the total approaches the context limit, degrading end-of-document processing.
TITLE: Message Batches API — SLA vs 24-hour batch processing window
Documents arrive continuously throughout business hours and need structured data extracted. To reduce costs, you want to use the Message Batches API (50% discount, up-to-24-hour processing window). Your SLA specifies that extraction results must be available within 30 hours of document arrival with 99.9% reliability.
QUESTION: Which batching strategy is most appropriate?
OPTIONS:
A) Submit batches every 6 hours containing documents from that window.
B) Submit a single batch at end of day containing all documents from that day.
C) Use the real-time API for all documents instead of batch processing.
D) Submit batches every 4 hours containing documents from that window.
TITLE: Automating high-confidence extractions — validation before reducing reviewers
Your system has been operating with 100% human review for 3 months. Analysis shows that extractions with model confidence >=90% have 97% accuracy overall. To reduce reviewer workload, you plan to automate high-confidence extractions. Before deploying, what validation step is most critical?
OPTIONS:
A) Run a two-week pilot routing 25% of high-confidence extractions directly to downstream systems and monitor error reports.
B) Verify that 97% accuracy meets requirements for all downstream systems that consume the extracted data.
C) Analyze accuracy by document type and field to verify high-confidence extractions perform consistently across all segments, not just in aggregate.
D) Compare accuracy at different confidence thresholds (85%, 90%, 95%) to find the optimal cutoff that maximizes automation while minimizing errors.
TITLE: Semantic errors that pass schema validation — allocating 20% review capacity
After deployment, you find that 12% of extractions contain semantic errors that pass JSON schema validation (for example, a duration like "30 minutes" incorrectly placed in an ingredient quantity field). Human reviewers have capacity to check only 20% of extractions.
QUESTION: Which approach most effectively allocates reviewer attention?
OPTIONS:
A) Have the model output field-level confidence scores, then calibrate review thresholds using a labeled validation set.
B) Review all extractions from documents with formatting anomalies such as unusual layouts or mixed content types.
C) Prioritize review of all extractions where required fields are empty or explicitly marked as not found.
D) Randomly sample 20% of extractions for review, using corrections to track accuracy and identify error patterns.
TITLE: Invoice line items vs totals mismatch
Your extraction pipeline processes invoices and extracts line items, subtotals, tax amounts, and grand totals. During evaluation, you discover that in 18% of extractions, the sum of extracted line item amounts doesn't match the extracted grand total—sometimes due to OCR errors in the source document, sometimes due to extraction mistakes by the model. Downstream accounting systems reject records with mismatched totals.
QUESTION: What's the most effective approach to improve extraction reliability?
OPTIONS:
A) Extract line items and totals independently, then use a separate validation model to reconcile discrepancies by determining which extracted values are most likely correct.
B) Implement post-processing that automatically adjusts line item amounts proportionally when their sum doesn't match the stated total.
C) Add few-shot examples demonstrating invoices where extracted line items sum correctly to the stated total, encouraging the model to produce mathematically consistent extractions.
D) Add a calculated_total field where the model sums extracted line items alongside a stated_total field. Flag records for human review when values differ.
TITLE: Pydantic validation failures like "expected float, got '2 to 3'"
Monitoring shows 12% of extractions fail Pydantic validation with specific errors like "expected float for quantity, got '2 to 3'". Retrying these requests without modification produces identical failures.
QUESTION: What's the most effective approach to recover from these validation failures?
OPTIONS:
A) Pre-process source documents to standardize problematic formats before sending them for extraction.
B) Implement a secondary pipeline using a larger model tier to reprocess documents that fail validation.
C) Set temperature to 0 to eliminate output variability and ensure consistent formatting.
D) Send a follow-up request including the validation error, asking the model to correct its output.
TITLE: Product reviews extraction — fabrication + sarcasm (schema design)
The system processes product reviews using tool use with a defined schema: rating (integer 1-5), pros (string array), cons (string array), and overall_sentiment (enum: positive, negative, mixed). Testing reveals two issues with brief or ambiguous reviews (~20% of the dataset): (1) for reviews like "Great product", Claude fabricates specific pros and cons rather than indicating this information isn't explicitly stated, and (2) for sarcastic reviews like "Well that was... interesting", Claude picks sentiment arbitrarily since there's no option for ambiguous cases.
QUESTION: What schema modification best addresses both issues?
OPTIONS:
A) Allow null values for pros/cons, and add "unclear" to the sentiment enum.
B) Make pros and cons optional fields, and add "neutral" and "unclear" to the sentiment enum.
C) Allow empty arrays for pros/cons as valid output, and add "unclear" to the sentiment enum.
D) Add an extraction_confidence field (0.0-1.0) for each value, and filter outputs where any confidence falls below a threshold.
TITLE: Long meeting transcripts — scattered information extraction
Evaluation shows 94% extraction accuracy on short meeting transcripts (<30 minutes) but only 68% on longer transcripts (>60 minutes) where discussions meander and key information is scattered throughout. Transcripts of both lengths fit within the model's context window.
QUESTION: What pattern most effectively improves accuracy on complex, lengthy documents?
OPTIONS:
A) Split lengthy transcripts into chunks, extract from each chunk separately, then merge and deduplicate the results.
B) Add few-shot examples demonstrating correct extraction from lengthy meetings with scattered information.
C) Upgrade to a more capable model tier for the extraction task.
D) Add a pre-extraction step where the model summarizes key discussions and conclusions before performing structured extraction.
TITLE: Multiple document types + tool_choice:auto sometimes returns plain text
The extraction pipeline receives documents of varying types—some are invoices, others are contracts, and some are receipts. You've defined separate extraction tools, each with its own JSON schema tailored to the document type. During testing, you observe that with tool_choice: "auto", Claude sometimes returns conversational text instead of calling an extraction tool, causing downstream parsing failures. You need guaranteed structured output without knowing the document type in advance.
QUESTION: What's the most effective approach?
OPTIONS:
A) Consolidate all document types into a single unified-schema extraction tool and force that tool.
B) Add a preliminary classification call, then make a second call with tool_choice forced to the identified extraction tool.
C) Set tool_choice: "any" with all extraction tools defined.
D) Keep tool_choice: "auto" with system prompt instructions requiring tool use.
TITLE: extract_metadata must run before citation enrichment tools
Your pipeline uses a tool called extract_metadata with a JSON schema for paper details. You've also defined lookup_citations and verify_doi tools for enrichment. During testing, you notice that when users include requests like "extract the metadata and tell me how cited it is," Claude sometimes calls lookup_citations first, which fails because it needs the DOI that extract_metadata would provide.
QUESTION: What's the most effective way to ensure structured metadata extraction happens first?
OPTIONS:
A) Set tool_choice to "auto" and reorder the tool definitions so extract_metadata appears first in the tools array, since Claude prioritizes earlier-listed tools.
B) Set tool_choice to {"type": "tool", "name": "extract_metadata"} and process the enrichment requests in subsequent turns after receiving the extracted metadata.
C) Set tool_choice to {"type": "tool", "name": "extract_metadata"} for every API call in the pipeline, ensuring Claude always extracts metadata before any enrichment can occur.
D) Set tool_choice to "any" so Claude must use a tool, combined with system prompt instructions prioritizing extract_metadata.
TITLE: Downstream rejects malformed JSON — move beyond "JSON in text"
The extraction pipeline occasionally receives responses that don't parse as valid JSON, causing downstream processing failures. The current implementation prompts Claude to return JSON in the response text and then parses it.
QUESTION: What is the most reliable approach to ensure Claude returns valid, schema-compliant structured data?
OPTIONS:
A) Define a tool with a JSON schema specifying the expected structure, using tool_use to constrain Claude's output to schema-compliant JSON.
B) Implement a retry loop that catches JSON parse errors and re-prompts Claude with the error details, asking it to correct the malformed output.
C) Add explicit formatting instructions to the prompt with JSON examples, emphasizing that Claude must return only valid JSON with no surrounding text.
D) Use regular expressions to locate and extract JSON from the response text, handling cases where Claude includes explanatory text around the JSON block.
TITLE: Contracts with amendments — conflicting clauses over time
The extraction pipeline processes contracts that frequently include amendments. When a contract contains both original terms and later amendments (for example, original clause specifies "30-day payment terms" while Amendment 1 changes this to "45 days"), the model inconsistently extracts one value or the other with no indication of which applies.
QUESTION: What's the most effective approach to improve extraction accuracy for documents with amendments?
OPTIONS:
A) Redesign the schema so amended fields capture multiple values, each with source location and effective date.
B) Add prompt instructions to always extract the most recent amendment value and ignore superseded original terms.
C) Implement post-extraction validation using pattern matching to detect amendments and flag those extractions for manual review.
D) Preprocess documents with a classifier that identifies and removes superseded sections before the main extraction step.
Comments
Post a Comment