An empirical study of the Voice Search pipeline behind InHouse America — how raw audio is captured, transcribed, repaired, disambiguated, and handed to the AI Brain so a shopper can simply speak what they need and land on the right product.
Voice Search is the spoken-input layer of InHouse America's search experience. It captures a shopper's voice, produces a faithful transcript, repairs the small errors that always come with speech (homophones, dropped words, spoken numerals, brand mispronunciations), and hands a clean query to the AI Brain for intent resolution. Across 12,400 synthetic voice sessions generated by our internal test harness, the pipeline delivered a 95.6% word-level transcription accuracy, a 97.1% intent-equivalence rate against the same query typed, and reduced voice-driven zero-result sessions by 72.5% versus a raw transcript baseline. Median end-to-end latency from end-of-utterance to first result was 610 ms.
Typing on a phone is the worst part of mobile shopping, and three observations from our internal test scenarios (Nov 2025 – Apr 2026) drove the design: thumbs are slow, autocorrect is hostile to brand names, and people on the move don't want to look at a screen at all. Voice removes that friction, but only if the system can survive the messiness of how people actually speak.
Voice Search exists so that the fastest input on the phone is also the most accurate.
The Voice Search system runs as a five-stage pipeline. Each stage is independently observable and replaceable, which lets the system improve without regressing prior behavior.
microphone ─► [1] Capture ─► [2] Transcribe ─► [3] Repair ─► [4] Validate ─► [5] Handoff ─► AI Brain
                  │              │                 │             │               │
                  │              │                 │             │               └─ clean query string + confidence
                  │              │                 │             └─ confidence gate, clarification trigger
                  │              │                 └─ homophones, numerals, brand fixups, profanity scrub
                  │              └─ streaming ASR with partial hypotheses
                  └─ VAD, noise gate, end-of-utterance detection
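The stage boundaries in the diagram map naturally onto narrow contracts. The TypeScript sketch below is our illustration of that idea, with made-up type names rather than the production API; the point is that each stage is a plain function that can be logged, replayed against recorded inputs, and replaced without touching its neighbors.

```typescript
// Hypothetical sketch of the five-stage contract. All type names here
// (AudioSegment, Transcript, VoiceHandoff, ...) are illustrative.

interface AudioSegment { samples: Float32Array; sampleRate: number }
interface Transcript   { text: string; confidence: number }

interface VoiceHandoff {
  query: string;        // clean query string
  source: "voice";
  confidence: number;   // 0..1, set by the validate stage
  alternates: string[]; // next-best hypotheses, in rank order
  utterance_ms: number;
}

// Each stage is independently observable and swappable.
type Capture    = (raw: AudioSegment) => AudioSegment;             // VAD, noise gate
type Transcribe = (audio: AudioSegment) => Transcript;             // streaming ASR
type Repair     = (t: Transcript) => Transcript;                   // text-level fixups
type Validate   = (t: Transcript) => { pass: boolean; score: number };
type Handoff    = (t: Transcript, score: number) => VoiceHandoff;  // to the AI Brain
```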
source: "voice" flag so downstream resolvers can weight ambiguous tokens accordingly.Repair is where Voice Search earns its keep. The same utterance, raw versus repaired, often produces an entirely different result set.
| Raw transcript | Repaired query | Repair type |
|---|---|---|
| "under twenty bucks" | "under $20" | Numeral + currency |
| "for men socks" | "four men socks" → "men's socks" | Homophone + possessive |
| "loreal mascara" | "L'Oréal mascara" | Brand normalization |
| "uh show me cheap ones" | "show me cheap ones" | Disfluency removal |
| "size eight running shoe" | "size 8 running shoe" | Numeral conversion |
| "navy blue crew neck" | "navy blue crewneck" | Compound merge |
Voice Search does not interpret intent. It produces a clean string and a confidence score, then hands both to the AI Brain. This separation is deliberate: the Brain is already responsible for intent resolution, conversational memory, and routing — duplicating that logic inside the voice pipeline would be expensive and risky.
const handoff = {
  query: "under $20 men's socks",  // repaired, normalized query string
  source: "voice",                 // lets the Brain apply voice-specific weighting
  confidence: 0.91,                // pipeline confidence after the validate stage
  alternates: [                    // next-best hypotheses, in rank order
    "under $20 men's stocks",
    "under $24 men's socks"
  ],
  utterance_ms: 1480               // duration of the spoken utterance
};
The Brain uses source: "voice" to slightly down-weight rare brand tokens (which are a common ASR failure mode) and to widen the comparator window for price phrases — both behaviors that would over-correct typed input.
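As one way to picture that asymmetry, the sketch below applies a brand-token discount and a wider price window only when the source is "voice". The constants and the adjustForSource helper are assumptions for illustration; the report does not specify the Brain's actual resolver logic.

```typescript
// Hypothetical sketch of source-aware query weighting. The constants and
// helper names are illustrative, not the AI Brain's real resolver.

interface ScoredToken { token: string; weight: number; isRareBrand: boolean }

const VOICE_BRAND_DISCOUNT = 0.8;  // assumed down-weight for rare brand tokens
const VOICE_PRICE_WIDEN    = 0.15; // assumed widening of the price comparator

function adjustForSource(
  tokens: ScoredToken[],
  priceLimit: number | null,
  source: "voice" | "typed",
): { tokens: ScoredToken[]; priceWindow: [number, number] | null } {
  if (source !== "voice") {
    // Typed input is taken at face value; widening here would over-correct.
    return { tokens, priceWindow: priceLimit ? [0, priceLimit] : null };
  }
  // Rare brand tokens are a common ASR failure mode, so trust them less.
  const reweighted = tokens.map((t) =>
    t.isRareBrand ? { ...t, weight: t.weight * VOICE_BRAND_DISCOUNT } : t,
  );
  // A spoken "under $20" may really mean "around $20"; widen the window.
  const priceWindow: [number, number] | null = priceLimit
    ? [0, priceLimit * (1 + VOICE_PRICE_WIDEN)]
    : null;
  return { tokens: reweighted, priceWindow };
}
```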
We evaluated the pipeline on 12,400 synthetic voice sessions generated by our internal test harness, covering scenarios constructed between January and April 2026. Sessions were stratified across quiet (52%), moderate-noise (33%), and high-noise (15%) acoustic environments. Four metrics were measured: word error rate (WER), intent equivalence against the same query typed, zero-result session rate, and a 1–5 result-quality score.
An automated regression suite re-validated 1,200 randomly drawn voice scenarios end-to-end against the harness's reference outputs to confirm the machine-scored metrics.
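For reference, WER here means the standard word-level edit distance normalized by the number of reference words. A minimal sketch, assuming nothing about the harness's real scorer:

```typescript
// Minimal word error rate: Levenshtein distance over words, divided by
// the reference word count. Illustrative; not the test harness's scorer.

function wer(reference: string, hypothesis: string): number {
  const ref = reference.split(/\s+/).filter(Boolean);
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  // dp[i][j] = edits to turn the first i ref words into the first j hyp words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                       // deletion
        dp[i][j - 1] + 1,                                       // insertion
        dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return ref.length === 0 ? 0 : dp[ref.length][hyp.length] / ref.length;
}

// wer("under $20 men's socks", "under $20 men's stocks") -> 0.25 (1 of 4 words)
```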
| Environment | Sessions | WER | Intent equiv. | Zero-result rate | Quality (1–5) | Status |
|---|---|---|---|---|---|---|
| Quiet | 6,450 | 2.8% | 98.4% | 2.0% | 4.7 | strong |
| Moderate | 4,090 | 4.6% | 97.0% | 3.1% | 4.5 | strong |
| High-noise | 1,860 | 9.1% | 93.2% | 5.8% | 4.1 | watch |
| Overall | 12,400 | 4.4% | 97.1% | 2.9% | 4.5 | |
"The hardest problem in voice search isn't hearing the words. It's hearing what the shopper meant — and the only way to do that is to clean the transcript before anyone tries to act on it."
Voice Search reframes the microphone as a serious shopping input rather than a novelty. By separating capture from transcription from repair from understanding, the pipeline can deliver a clean, confidence-scored query to the AI Brain in roughly 600 milliseconds — fast enough to feel instant, accurate enough to trust on the first try. The result is a search experience where the shopper can simply speak what they need, in the words they would use with a salesperson, and land on the right product. Future work extends the pipeline to noise-adaptive acoustic modeling, multilingual repair, and longer multi-turn voice dialogues that share memory with the AI Brain.
© 2026 InHouse America Research. Voice Search v5.9.26. For inquiries: legal@inhouseamerica.com.