Research Paper · May 2026 · Case Study 5.9.26

Voice Search: How Spoken Shopper Queries Become Accurate Results

An empirical study of the Voice Search pipeline behind InHouse America — how raw audio is captured, transcribed, repaired, disambiguated, and handed to the AI Brain so a shopper can simply speak what they need and land on the right product.

Authors: InHouse America Research · Published: May 9, 2026 · Version: v1.0 · 6 min read

Abstract

Voice Search is the spoken-input layer of InHouse America's search experience. It captures a shopper's voice, produces a faithful transcript, repairs the small errors that always come with speech (homophones, dropped words, spoken numerals, brand mispronunciations), and hands a clean query to the AI Brain for intent resolution. Across 12,400 synthetic voice sessions generated by our internal test harness, the pipeline delivered a 95.6% word-level transcription accuracy, a 97.1% intent-equivalence rate against the same query typed, and reduced voice-driven zero-result sessions by 72.5% versus a raw transcript baseline. Median end-to-end latency from end-of-utterance to first result was 610 ms.

95.6%
Word-level transcription accuracy
97.1%
Intent-equivalence vs. typed
−72.5%
Zero-result voice sessions

1. Why Voice Search exists

Typing on a phone is the worst part of mobile shopping. Thumbs are slow, autocorrect is hostile to brand names, and people on the move don't want to look at a screen at all. Voice removes that friction — but only if the system can survive the messiness of how people actually speak.

Three observations from our internal test scenarios (Nov 2025 – Apr 2026) drove the design:

  1. 34% of simulated mobile sessions activated the microphone at least once; 61% of those did so before any typing.
  2. 49% of generated voice transcripts contained at least one repair-worthy artifact: a spoken numeral ("twenty bucks"), a dropped article, or a homophone ("for" / "four").
  3. When a voice query resolved on the first attempt, scenarios projected a 2.3× conversion lift over voice queries that required a re-utterance.

Voice Search exists so that the fastest input on the phone is also the most accurate.

2. Architecture

The Voice Search system runs as a five-stage pipeline. Each stage is independently observable and replaceable, which lets the system improve without regressing prior behavior.

2.1 Pipeline

microphone ─► [1] Capture ─► [2] Transcribe ─► [3] Repair ─► [4] Validate ─► [5] Handoff ─► AI Brain
                  │              │                │              │                │
                  │              │                │              │                └─ clean query string + confidence
                  │              │                │              └─ confidence gate, clarification trigger
                  │              │                └─ homophones, numerals, brand fixups, profanity scrub
                  │              └─ streaming ASR with partial hypotheses
                  └─ VAD, noise gate, end-of-utterance detection
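
To make those stage boundaries concrete, the sketch below wires the five stages together as composable async functions in TypeScript. All type names, function names, and placeholder bodies are our own illustration of the diagram, not the production implementation; Repair and Validate are reduced to stubs that §3 and §4 flesh out.

// Illustrative composition of the five stages. Types and bodies are placeholders.
interface Handoff {
  query: string;
  source: "voice";
  confidence: number;
  alternates: string[];
  utterance_ms: number;
}

type AudioChunk = Float32Array;

// [1] Capture: buffer microphone audio until end-of-utterance closes the stream
// (VAD and the noise gate are assumed to live inside the stream producer).
async function capture(mic: AsyncIterable<AudioChunk>): Promise<AudioChunk[]> {
  const buffered: AudioChunk[] = [];
  for await (const chunk of mic) buffered.push(chunk);
  return buffered;
}

// [2] Transcribe: streaming ASR, reduced here to a single final hypothesis.
async function transcribe(audio: AudioChunk[]): Promise<{ text: string; confidence: number }> {
  return { text: "under twenty bucks men socks", confidence: 0.91 }; // placeholder hypothesis
}

// [3] Repair: transcript normalization (sketched in §3); identity placeholder here.
function repair(raw: string): string {
  return raw;
}

// [4] Validate + [5] Handoff: package the clean query and confidence for the AI Brain.
function validate(query: string, confidence: number, utterance_ms: number): Handoff {
  return { query, source: "voice", confidence, alternates: [], utterance_ms };
}

async function voiceSearch(mic: AsyncIterable<AudioChunk>): Promise<Handoff> {
  const started = Date.now();
  const audio = await capture(mic);
  const utterance_ms = Date.now() - started;
  const { text, confidence } = await transcribe(audio);
  return validate(repair(text), confidence, utterance_ms);
}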

2.2 Stage detail

  1. Capture: voice-activity detection (VAD), a noise gate, and end-of-utterance detection on the raw microphone stream.
  2. Transcribe: streaming ASR that emits partial hypotheses while the shopper is still speaking.
  3. Repair: homophone correction, spoken-numeral conversion, brand fixups, and a profanity scrub (§3).
  4. Validate: a confidence gate that can trigger a clarification prompt when the transcript is unreliable.
  5. Handoff: a clean query string plus confidence score, passed to the AI Brain (§4).

3. Transcript repair

Repair is where Voice Search earns its keep. The same utterance, raw versus repaired, often produces an entirely different result set.

Raw transcript               Repaired query                       Repair type
"under twenty bucks"         "under $20"                          Numeral + currency
"four men socks"             "for men socks" → "men's socks"      Homophone + possessive
"loreal mascara"             "L'Oréal mascara"                    Brand normalization
"uh show me cheap ones"      "show me cheap ones"                 Disfluency removal
"size eight running shoe"    "size 8 running shoe"                Numeral conversion
"navy blue crew neck"        "navy blue crewneck"                 Compound merge

4. Handoff to the AI Brain

Voice Search does not interpret intent. It produces a clean string and a confidence score, then hands both to the AI Brain. This separation is deliberate: the Brain is already responsible for intent resolution, conversational memory, and routing — duplicating that logic inside the voice pipeline would be expensive and risky.

handoff = {
  query: "under $20 men's socks",
  source: "voice",
  confidence: 0.91,
  alternates: [
    "under $20 men's stocks",
    "under $24 men's socks"
  ],
  utterance_ms: 1480
}

The Brain uses source: "voice" to slightly down-weight rare brand tokens (which are a common ASR failure mode) and to widen the comparator window for price phrases — both behaviors that would over-correct typed input.
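
A minimal sketch of that contract and the Brain-side adjustment is below. The field names mirror the handoff object above; the function name and the weighting constants are illustrative assumptions, not measured values.

interface Handoff {
  query: string;
  source: "voice" | "typed";
  confidence: number;
  alternates: string[];
  utterance_ms?: number;
}

interface RankingOptions {
  rareBrandTokenWeight: number; // multiplier applied to rare brand tokens at match time
  priceTolerancePct: number;    // how far a price comparator may widen (e.g. "$20" also admitting $21)
}

function rankingOptionsFor(handoff: Handoff): RankingOptions {
  if (handoff.source === "voice") {
    // Rare brand tokens are a common ASR failure mode, so trust them less;
    // spoken price phrases get a slightly wider comparator window.
    return { rareBrandTokenWeight: 0.6, priceTolerancePct: 10 };
  }
  // Typed input is taken at face value.
  return { rareBrandTokenWeight: 1.0, priceTolerancePct: 0 };
}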

5. Methodology

We evaluated the pipeline on 12,400 synthetic voice sessions generated by our internal test harness, covering scenarios constructed between January and April 2026. Sessions were stratified across quiet (52%), moderate-noise (33%), and high-noise (15%) acoustic environments. Four metrics were measured:

  1. Word error rate (WER) of the final transcript against the reference utterance.
  2. Intent equivalence: whether the voice query resolved to the same intent as the identical query typed.
  3. Zero-result rate for voice-initiated searches.
  4. Answer quality, scored 1–5 against the harness's reference outputs.

An automated regression suite re-validated 1,200 randomly drawn voice scenarios end-to-end against the harness's reference outputs to confirm machine scores.
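
For reference, the word error rate the harness reports can be sketched as word-level Levenshtein distance between the reference utterance and the final transcript, divided by the reference length. The function below is our own illustration, not the harness's API.

// Word error rate: edit distance over word tokens, normalized by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words.
  const dp: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,               // deletion
        dp[i][j - 1] + 1,               // insertion
        dp[i - 1][j - 1] + substitution // substitution or match
      );
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}

// wordErrorRate("under twenty bucks", "under twenty box") === 1 / 3 ≈ 0.33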

6. Results & accuracy

Figure 1. Word Error Rate by acoustic environment. Quiet and moderate environments sit comfortably under 5%; high-noise environments remain the open frontier and account for most clarification escalations.
Figure 2. Repair effectiveness by repair type. Numeral conversion and homophone correction produce the largest absolute gain in downstream intent equivalence, because they change the semantic meaning of the query rather than its surface form.
Figure 3. Zero-result rate before and after the Repair stage, by query category. Price-bearing voice queries see the largest drop because raw ASR rarely emits a clean currency token without normalization.
Figure 4. End-to-end latency distribution from end-of-utterance to first result. The median is 610 ms; the 95th percentile is 1,040 ms — within the 1.2 s threshold above which shoppers begin to repeat themselves.
Figure 5. Test-harness answer quality versus the pipeline's confidence score. Quality tracks confidence almost linearly, validating that the confidence signal is calibrated and safe to act on for the clarification escalation in §2.2.
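
Because quality tracks confidence, the Validate stage can act on the score directly. The sketch below shows one plausible shape for that gate; the thresholds are illustrative assumptions, not measured cut-offs.

type GateDecision =
  | { action: "handoff" }                  // confident enough to search immediately
  | { action: "clarify"; prompt: string }  // confirm the top alternate with the shopper
  | { action: "reprompt" };                // too unreliable; ask the shopper to repeat

function gate(confidence: number, query: string, alternates: string[]): GateDecision {
  if (confidence >= 0.85) return { action: "handoff" };
  if (confidence >= 0.6 && alternates.length > 0) {
    return { action: "clarify", prompt: `Did you mean "${query}" or "${alternates[0]}"?` };
  }
  return { action: "reprompt" };
}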

6.1 Summary table

Environment    Sessions   WER             Intent equiv.   Zero-result   Quality
Quiet          6,450      2.8% (strong)   98.4%           2.0%          4.7 / 5
Moderate       4,090      4.6% (strong)   97.0%           3.1%          4.5 / 5
High-noise     1,860      9.1% (watch)    93.2%           5.8%          4.1 / 5
Overall        12,400     4.4%            97.1%           2.9%          4.5 / 5

"The hardest problem in voice search isn't hearing the words. It's hearing what the shopper meant — and the only way to do that is to clean the transcript before anyone tries to act on it."

7. Limitations

8. Conclusion

Voice Search reframes the microphone as a serious shopping input rather than a novelty. By separating capture from transcription from repair from understanding, the pipeline can deliver a clean, confidence-scored query to the AI Brain in roughly 600 milliseconds — fast enough to feel instant, accurate enough to trust on the first try. The result is a search experience where the shopper can simply speak what they need, in the words they would use with a salesperson, and land on the right product. Future work extends the pipeline to noise-adaptive acoustic modeling, multilingual repair, and longer multi-turn voice dialogues that share memory with the AI Brain.


© 2026 InHouse America Research. Voice Search v5.9.26. For inquiries: legal@inhouseamerica.com.