Research Paper · May 2026 · Case Study 5.9.26

Voice Search: How Spoken Shopper Queries Become Accurate Results

An empirical study of the Voice Search pipeline behind InHouse America — how raw audio is captured, transcribed, repaired, disambiguated, and handed to the AI Brain so a shopper can simply speak what they need and land on the right product.

Authors: InHouse America Research · Published: May 9, 2026 · Version: v1.0 · 6 min read

Abstract

Voice Search is the spoken-input layer of InHouse America's search experience. It captures a shopper's voice, produces a faithful transcript, repairs the small errors that always come with speech (homophones, dropped words, spoken numerals, brand mispronunciations), and hands a clean query to the AI Brain for intent resolution. Across 12,400 synthetic voice sessions generated by our internal test harness, the pipeline delivered a 95.6% word-level transcription accuracy, a 97.1% intent-equivalence rate against the same query typed, and reduced voice-driven zero-result sessions by 72.5% versus a raw transcript baseline. Median end-to-end latency from end-of-utterance to first result was 610 ms.

95.6%
Word-level transcription accuracy
97.1%
Intent-equivalence vs. typed
−72.5%
Zero-result voice sessions

1. Why Voice Search exists

Typing on a phone is the worst part of mobile shopping. Thumbs are slow, autocorrect is hostile to brand names, and people on the move don't want to look at a screen at all. Voice removes that friction — but only if the system can survive the messiness of how people actually speak.

Three observations from our internal test scenarios (Nov 2025 – Apr 2026) drove the design:

  1. 34% of simulated mobile sessions activated the microphone at least once; 61% of those did so before any typing.
  2. 49% of generated voice transcripts contained at least one repair-worthy artifact: a spoken numeral ("twenty bucks"), a dropped article, or a homophone ("for" / "four").
  3. When a voice query resolved on the first attempt, scenarios projected a 2.3× conversion lift over voice queries that required a re-utterance.

Voice Search exists so that the fastest input on the phone is also the most accurate.

2. Architecture

The Voice Search system runs as a five-stage pipeline. Each stage is independently observable and replaceable, which lets the system improve without regressing prior behavior.

2.1 Pipeline

microphone ─► [1] Capture ─► [2] Transcribe ─► [3] Repair ─► [4] Validate ─► [5] Handoff ─► AI Brain
                  │              │                │              │                │
                  │              │                │              │                └─ clean query string + confidence
                  │              │                │              └─ confidence gate, clarification trigger
                  │              │                └─ homophones, numerals, brand fixups, profanity scrub
                  │              └─ streaming ASR with partial hypotheses
                  └─ VAD, noise gate, end-of-utterance detection
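
To make those stage boundaries concrete, the sketch below wires the five stages together as composable async functions in TypeScript. All type names, function names, and placeholder bodies are our own illustration of the diagram, not the production implementation; Repair and Validate are reduced to stubs that §3 and §4 flesh out.

// Illustrative composition of the five stages. Types and bodies are placeholders.
interface Handoff {
  query: string;
  source: "voice";
  confidence: number;
  alternates: string[];
  utterance_ms: number;
}

type AudioChunk = Float32Array;

// [1] Capture: buffer microphone audio until end-of-utterance closes the stream
// (VAD and the noise gate are assumed to live inside the stream producer).
async function capture(mic: AsyncIterable<AudioChunk>): Promise<AudioChunk[]> {
  const buffered: AudioChunk[] = [];
  for await (const chunk of mic) buffered.push(chunk);
  return buffered;
}

// [2] Transcribe: streaming ASR, reduced here to a single final hypothesis.
async function transcribe(audio: AudioChunk[]): Promise<{ text: string; confidence: number }> {
  return { text: "under twenty bucks men socks", confidence: 0.91 }; // placeholder hypothesis
}

// [3] Repair: transcript normalization (sketched in §3); identity placeholder here.
function repair(raw: string): string {
  return raw;
}

// [4] Validate + [5] Handoff: package the clean query and confidence for the AI Brain.
function validate(query: string, confidence: number, utterance_ms: number): Handoff {
  return { query, source: "voice", confidence, alternates: [], utterance_ms };
}

async function voiceSearch(mic: AsyncIterable<AudioChunk>): Promise<Handoff> {
  const started = Date.now();
  const audio = await capture(mic);
  const utterance_ms = Date.now() - started;
  const { text, confidence } = await transcribe(audio);
  return validate(repair(text), confidence, utterance_ms);
}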

2.2 Stage detail

  1. Capture: voice-activity detection (VAD), a noise gate, and end-of-utterance detection on the raw microphone stream.
  2. Transcribe: streaming ASR that emits partial hypotheses while the shopper is still speaking.
  3. Repair: homophone correction, spoken-numeral conversion, brand fixups, and a profanity scrub (§3).
  4. Validate: a confidence gate that can trigger a clarification prompt when the transcript is unreliable.
  5. Handoff: a clean query string plus confidence score, passed to the AI Brain (§4).

3. Transcript repair

Repair is where Voice Search earns its keep. The same utterance, raw versus repaired, often produces an entirely different result set.

Raw transcript               Repaired query                       Repair type
"under twenty bucks"         "under $20"                          Numeral + currency
"four men socks"             "for men socks" → "men's socks"      Homophone + possessive
"loreal mascara"             "L'Oréal mascara"                    Brand normalization
"uh show me cheap ones"      "show me cheap ones"                 Disfluency removal
"size eight running shoe"    "size 8 running shoe"                Numeral conversion
"navy blue crew neck"        "navy blue crewneck"                 Compound merge

4. Handoff to the AI Brain

Voice Search does not interpret intent. It produces a clean string and a confidence score, then hands both to the AI Brain. This separation is deliberate: the Brain is already responsible for intent resolution, conversational memory, and routing — duplicating that logic inside the voice pipeline would be expensive and risky.

handoff = {
  query: "under $20 men's socks",
  source: "voice",
  confidence: 0.91,
  alternates: [
    "under $20 men's stocks",
    "under $24 men's socks"
  ],
  utterance_ms: 1480
}

The Brain uses source: "voice" to slightly down-weight rare brand tokens (which are a common ASR failure mode) and to widen the comparator window for price phrases — both behaviors that would over-correct typed input.
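
A minimal sketch of that contract and the Brain-side adjustment is below. The field names mirror the handoff object above; the function name and the weighting constants are illustrative assumptions, not measured values.

interface Handoff {
  query: string;
  source: "voice" | "typed";
  confidence: number;
  alternates: string[];
  utterance_ms?: number;
}

interface RankingOptions {
  rareBrandTokenWeight: number; // multiplier applied to rare brand tokens at match time
  priceTolerancePct: number;    // how far a price comparator may widen (e.g. "$20" also admitting $21)
}

function rankingOptionsFor(handoff: Handoff): RankingOptions {
  if (handoff.source === "voice") {
    // Rare brand tokens are a common ASR failure mode, so trust them less;
    // spoken price phrases get a slightly wider comparator window.
    return { rareBrandTokenWeight: 0.6, priceTolerancePct: 10 };
  }
  // Typed input is taken at face value.
  return { rareBrandTokenWeight: 1.0, priceTolerancePct: 0 };
}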

5. Methodology

We evaluated the pipeline on 12,400 synthetic voice sessions generated by our internal test harness, covering scenarios constructed between January and April 2026. Sessions were stratified across quiet (52%), moderate-noise (33%), and high-noise (15%) acoustic environments. Four metrics were measured:

  1. Word error rate (WER) of the final transcript against the reference utterance.
  2. Intent equivalence: whether the voice query resolved to the same intent as the identical query typed.
  3. Zero-result rate for voice-initiated searches.
  4. Answer quality, scored 1–5 against the harness's reference outputs.

An automated regression suite re-validated 1,200 randomly drawn voice scenarios end-to-end against the harness's reference outputs to confirm machine scores.
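
For reference, the word error rate the harness reports can be sketched as word-level Levenshtein distance between the reference utterance and the final transcript, divided by the reference length. The function below is our own illustration, not the harness's API.

// Word error rate: edit distance over word tokens, normalized by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words.
  const dp: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,               // deletion
        dp[i][j - 1] + 1,               // insertion
        dp[i - 1][j - 1] + substitution // substitution or match
      );
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}

// wordErrorRate("under twenty bucks", "under twenty box") === 1 / 3 ≈ 0.33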

6. Results & accuracy

Figure 1. Word Error Rate by acoustic environment. Quiet and moderate environments sit comfortably under 5%; high-noise environments remain the open frontier and account for most clarification escalations.
Figure 2. Repair effectiveness by repair type. Numeral conversion and homophone correction produce the largest absolute gain in downstream intent equivalence, because they change the semantic meaning of the query rather than its surface form.
Figure 3. Zero-result rate before and after the Repair stage, by query category. Price-bearing voice queries see the largest drop because raw ASR rarely emits a clean currency token without normalization.
Figure 4. End-to-end latency distribution from end-of-utterance to first result. The median is 610 ms; the 95th percentile is 1,040 ms — within the 1.2 s threshold above which shoppers begin to repeat themselves.
Figure 5. Test-harness answer quality versus the pipeline's confidence score. Quality tracks confidence almost linearly, validating that the confidence signal is calibrated and safe to act on for the clarification escalation in §2.2.
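
Because quality tracks confidence, the Validate stage can act on the score directly. The sketch below shows one plausible shape for that gate; the thresholds are illustrative assumptions, not measured cut-offs.

type GateDecision =
  | { action: "handoff" }                  // confident enough to search immediately
  | { action: "clarify"; prompt: string }  // confirm the top alternate with the shopper
  | { action: "reprompt" };                // too unreliable; ask the shopper to repeat

function gate(confidence: number, query: string, alternates: string[]): GateDecision {
  if (confidence >= 0.85) return { action: "handoff" };
  if (confidence >= 0.6 && alternates.length > 0) {
    return { action: "clarify", prompt: `Did you mean "${query}" or "${alternates[0]}"?` };
  }
  return { action: "reprompt" };
}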

6.1 Summary table

Environment    Sessions   WER             Intent equiv.   Zero-result   Quality
Quiet          6,450      2.8% (strong)   98.4%           2.0%          4.7 / 5
Moderate       4,090      4.6% (strong)   97.0%           3.1%          4.5 / 5
High-noise     1,860      9.1% (watch)    93.2%           5.8%          4.1 / 5
Overall        12,400     4.4%            97.1%           2.9%          4.5 / 5

"The hardest problem in voice search isn't hearing the words. It's hearing what the shopper meant — and the only way to do that is to clean the transcript before anyone tries to act on it."

7. Limitations

8. Conclusion

Voice Search reframes the microphone as a serious shopping input rather than a novelty. By separating capture from transcription from repair from understanding, the pipeline can deliver a clean, confidence-scored query to the AI Brain in roughly 600 milliseconds — fast enough to feel instant, accurate enough to trust on the first try. The result is a search experience where the shopper can simply speak what they need, in the words they would use with a salesperson, and land on the right product. Future work extends the pipeline to noise-adaptive acoustic modeling, multilingual repair, and longer multi-turn voice dialogues that share memory with the AI Brain.


© 2026 InHouse America Research. Voice Search v5.9.26. For inquiries: legal@inhouseamerica.com.