Evaluations

A good evaluation answers one question: does this SDK improve the metric that matters for our product? That metric is different for Hermes and Orpheus, and confusing the two is the single most common source of misleading evaluations.

Choose the right metric

Hermes is evaluated by humans. It's tuned to make speech sound clearer, more natural, and more intelligible to a listener. The right evaluations are listener-based: MOS-style A/B comparisons, intelligibility tests, conversational quality scoring, or your own internal listening protocol.
Orpheus is evaluated by machines. It's tuned to reduce Word Error Rate (WER) or Character Error Rate (CER) on a downstream ASR. Subjective listening is not a valid evaluation method for Orpheus: enhanced audio can sound unchanged, or even slightly different, to a listener and still lower WER significantly.

Evaluating Orpheus by listening to it is the most common evaluation mistake we see. Always measure against the ASR pipeline you intend to ship.

Hermes evaluation protocol

Pick a representative dataset. Real audio from your product: noisy calls, fast speakers, accented speech.
Generate paired clips. For each source clip, produce one or more enhanced variants (different config combinations: noise cancellation only, noise cancellation + voice boost, full stack with speed control, etc.).
Score. Use listener A/B tests, MOS scoring, or your internal listening rubric to rate clarity and intelligibility relative to the original.
Iterate on config. Hermes has three independent features (noise cancellation, voice boost, and speed control), and the best configuration depends on your audio. Try noise cancellation alone, with voice boost, and with speed control to find the right combination.
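The scoring step can be summarized per variant once listener ratings are collected. Below is a minimal sketch, assuming ratings on a 1 to 5 MOS scale stored as (variant, score) pairs. The function name `summarize_mos` and the variant labels are hypothetical and not part of the SDK, and the 95% interval uses a normal approximation that is only rough for small listener panels.

```python
import math
import statistics
from collections import defaultdict


def summarize_mos(ratings):
    """Summarize listener ratings per variant.

    ratings: iterable of (variant_label, score) pairs, score on a 1-5 scale.
    Returns {variant_label: (mean, ci95_half_width, n_ratings)}.
    """
    by_variant = defaultdict(list)
    for variant, score in ratings:
        by_variant[variant].append(score)

    summary = {}
    for variant, scores in by_variant.items():
        n = len(scores)
        mean = statistics.fmean(scores)
        # Normal-approximation 95% interval; undefined for a single rating,
        # so report 0.0 in that case.
        half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n) if n > 1 else 0.0
        summary[variant] = (mean, half_width, n)
    return summary


# Hypothetical variant labels, for illustration only.
example_ratings = [
    ("original", 3), ("original", 4),
    ("nc_only", 4), ("nc_only", 5),
]
example_summary = summarize_mos(example_ratings)
```

Comparing each enhanced variant's mean and interval against the original's gives a quick read on which configuration is worth a larger listening test.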

Orpheus evaluation protocol

Pick a representative dataset. Real recordings from your product, not clean studio speech. Cover the noise profiles, languages, and speaker types you actually see.
Define ground truth. Hand-transcribed reference text per clip.
Run the baseline. Send each clip through your ASR pipeline without Orpheus. Compute WER (or CER) against the ground truth.
Run with Orpheus. Send the same clips through Orpheus, then through the same ASR pipeline. Compute WER again on the enhanced output.
Compare. Report WER before and after adding Orpheus, as both the absolute change and the relative percentage change. For example, a drop from 20% to 15% WER is 5 points absolute and 25% relative. Slice the result by noise condition and speaker type to find where Orpheus helps most.
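The steps above can be sketched in a few functions. This is an illustrative outline, not part of the SDK: it assumes you already have reference transcripts and ASR hypotheses as strings, and all function names (`corpus_wer`, `wer_change`, `wer_by_slice`) are hypothetical. Text normalization here is just lowercasing and whitespace splitting; whatever normalization you use, apply it identically to the baseline and Orpheus runs.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def tokenize(text):
    # Minimal normalization (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in pairs:
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        words += len(ref_tokens)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / words


def wer_change(baseline_wer, enhanced_wer):
    """Return (absolute change, relative change in percent); negative means improvement."""
    if baseline_wer == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    absolute = enhanced_wer - baseline_wer
    return absolute, absolute / baseline_wer * 100


def wer_by_slice(records):
    """records: iterable of (slice_label, reference, hypothesis). Returns {label: WER}."""
    grouped = defaultdict(list)
    for label, ref, hyp in records:
        grouped[label].append((ref, hyp))
    return {label: corpus_wer(pairs) for label, pairs in grouped.items()}
```

Run `corpus_wer` once on the baseline hypotheses and once on the Orpheus-enhanced hypotheses, pass both results to `wer_change`, and call `wer_by_slice` with labels such as noise condition or speaker type for each run. For CER, the same edit distance can be applied to characters instead of words.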

What we provide during evaluation

The SDK build and key for your platform.
Documentation for the models available under your license.
Direct contact with our team for integration questions, config tuning, and evaluation methodology.

If you'd like help designing an evaluation against your specific dataset, ask during onboarding.
