How we score confidence on a cross-reference

Every cross-reference we return carries a confidence score between 0 and 1. The score is not a vanity metric — it is the input a trade buyer uses to decide whether the platform can auto-route the order or whether a human should eyeball the match. This post walks through what goes into that number, what doesn’t, and where we draw the human-review line.

The four input families

The confidence model takes inputs from four families. Each one votes; the votes are weighted; the result is mapped through a calibration curve to keep the published score interpretable.

Source authority: Where the cross-reference came from. In descending order of trust: OEM technical bulletin, supplier-confirmed, scraped catalogue, inferred from co-occurrence.
Specification match: How completely the candidate part matches the queried part on physical and electrical specification. A fuel injector that matches on flow rate, body, connector and nozzle scores high. One that matches only on the application string scores low.
Fitment overlap: How much of the fitment fan is shared between the two numbers. Two parts that fit the same set of variants are more likely to be the same part than two parts whose fan only partially overlaps.
Independent corroboration: Whether more than one independent source agrees. A link that appears in a Bosch bulletin and in a Pierburg catalogue is more trustworthy than one that appears in either alone.

Weights, and why they aren’t fixed

The weight given to each family varies by part category. Source authority dominates for safety-critical parts like brake components: we will not publish a high-confidence match on a brake disc without an OEM-grade source. Specification match dominates for electronic parts, where a part either interchanges exactly or does not interchange at all. Fitment overlap carries more weight for body and trim parts, where the fan is the main signal.
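As a rough illustration of category-dependent weighting, here is a minimal sketch that assumes each family emits a vote in [0, 1] and that the raw score is a normalised weighted average. The weight values, the category keys and the function name raw_score are all hypothetical; the real config is not reproduced here.

```python
# Hypothetical weights for illustration only. The real values live in the
# jointly owned config and are not published in this post.
CATEGORY_WEIGHTS = {
    "brake":      {"source_authority": 0.5, "spec_match": 0.2, "fitment_overlap": 0.1, "corroboration": 0.2},
    "electronic": {"source_authority": 0.2, "spec_match": 0.5, "fitment_overlap": 0.1, "corroboration": 0.2},
    "body_trim":  {"source_authority": 0.2, "spec_match": 0.2, "fitment_overlap": 0.4, "corroboration": 0.2},
}


def raw_score(category, votes):
    """Weighted average of the four family votes, each assumed to lie in [0, 1]."""
    weights = CATEGORY_WEIGHTS[category]
    total = sum(weights.values())
    return sum(w * votes[family] for family, w in weights.items()) / total


# Same evidence, different category: strong source, no fitment overlap.
votes = {"source_authority": 1.0, "spec_match": 0.5, "fitment_overlap": 0.0, "corroboration": 0.5}
brake_score = raw_score("brake", votes)     # source authority carries most weight
trim_score = raw_score("body_trim", votes)  # fitment overlap carries most weight
```

With these made-up weights the same evidence scores higher as a brake part than as a trim part, because the missing fitment overlap costs the trim match more. The output is a raw score, not yet a probability.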

The category-specific weights live in a config that the engineering and parts teams own jointly. Every change ships with a back-test against a held-out validation set of known-good and known-bad matches.

Calibration: making 0.9 mean 0.9

A raw weighted score is not, by itself, a probability. If you pull every production match where the model output 0.9, you will not necessarily find that 90% of them are correct. The raw score has to be calibrated against ground truth.

We maintain a labelled set of ~14,000 cross-reference decisions, each one a "definitely correct" or "definitely incorrect" call from a senior parts specialist. The raw model output is mapped through an isotonic regression fitted on that set. After calibration, a published confidence of 0.9 means that, empirically, ~90% of matches at that score are correct. We re-fit the calibration monthly.
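To make the calibration step concrete, here is a small pure-Python sketch of isotonic regression using the pool-adjacent-violators algorithm. It is a simplified stand-in, not our production code: it returns a step function instead of interpolating, does not treat tied raw scores specially, and the six labelled points are toy data. The function names fit_isotonic and calibrate are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(raw_scores, labels):
    """Fit a non-decreasing step function with pool-adjacent-violators.

    labels are 1 (correct) or 0 (incorrect). Returns a list of
    (upper_raw_score, calibrated_value) pairs, one per pooled block.
    """
    blocks = []  # each block is [sum_of_labels, count, upper_raw_score]
    for score, label in sorted(zip(raw_scores, labels)):
        blocks.append([float(label), 1, score])
        # Pool while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, label_sum / count) for label_sum, count, upper in blocks]


def calibrate(model, raw):
    """Map a raw score to the value of the block that covers it."""
    uppers = [upper for upper, _ in model]
    index = min(bisect_left(uppers, raw), len(model) - 1)
    return model[index][1]


# Toy labelled set: raw model outputs and specialist verdicts.
raw = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
verdicts = [0, 0, 1, 0, 1, 1]
model = fit_isotonic(raw, verdicts)
published = calibrate(model, 0.35)
```

In the toy data the correct match at 0.3 and the incorrect one at 0.4 are out of order, so they are pooled into one block whose calibrated value is their mean. A raw 0.35 falls inside that block and is published as 0.5.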

The human-review band

Three bands, three behaviours:

Confidence ≥ 0.92: Auto-publish. Returned by the API as a normal match. Suitable for auto-routing.
0.70 ≤ Confidence < 0.92: Auto-publish, flagged as "review recommended" in the API response. Buyers who care can hold for human approval; buyers who don’t can proceed at their own risk.
Confidence < 0.70: Held. Does not surface in API responses until a human reviewer has looked at it. Goes into the internal review queue with the evidence attached.
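The three bands reduce to two comparisons. The sketch below uses the thresholds stated above; the status strings "review_recommended" and "held" are assumed names for illustration (only "auto_published" appears in the API sample in this post).

```python
AUTO_PUBLISH_MIN = 0.92  # at or above: auto-publish as a normal match
REVIEW_FLAG_MIN = 0.70   # at or above, but under 0.92: publish with a flag


def review_band(confidence):
    """Return the band for a calibrated confidence score."""
    if confidence >= AUTO_PUBLISH_MIN:
        return "auto_published"
    if confidence >= REVIEW_FLAG_MIN:
        return "review_recommended"  # assumed status string
    return "held"  # assumed status string; not surfaced until reviewed


bands = [review_band(c) for c in (0.97, 0.92, 0.80, 0.70, 0.69)]
```

Both boundaries are inclusive on the upper band, so exactly 0.92 auto-publishes and exactly 0.70 is flagged, not held.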

The thresholds are not arbitrary. 0.92 is the point at which our error rate on the held-out set drops below 1%, the level our heaviest API consumer told us they could absorb without a review step. 0.70 is the point below which manual review actually adds signal — above it, reviewers agreed with the model more than 95% of the time, and the review cost was no longer worth it.
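One way to derive a cut-off like this from a held-out set is sketched below. It assumes "error rate" means the share of incorrect matches among everything scoring at or above the candidate threshold, which is an interpretation for illustration and not a statement of our exact procedure. The function name and the five-row sample are hypothetical.

```python
def lowest_threshold_under(scored, max_error):
    """Lowest confidence t such that matches scoring >= t err less than max_error.

    scored is a list of (confidence, is_correct) pairs from a held-out set.
    Returns None if no threshold satisfies the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    errors = 0
    for seen, (confidence, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        # Evaluate only once a run of equal confidences has been fully counted.
        if seen < len(ranked) and ranked[seen][0] == confidence:
            continue
        if errors / seen < max_error:
            best = confidence
    return best


# Toy held-out sample: one wrong match at 0.80.
held_out = [(0.99, True), (0.95, True), (0.90, True), (0.80, False), (0.70, True)]
threshold = lowest_threshold_under(held_out, 0.2)
```

On the toy sample, including the 0.80 match pushes the cumulative error rate to 25%, and including 0.70 brings it only to 20%, which does not beat the 20% target. The lowest acceptable threshold is therefore 0.90.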

Edge cases we treat specially

Supersessions across major revisions

When a manufacturer issues a supersession that changes the physical specification (a redesigned bracket, a different connector), the new part replaces the old in production but is not always backwards-compatible. We never auto-publish a supersession edge that crosses a known compatibility break, regardless of source authority. Those always go to review.
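The rule amounts to an override that is checked before the score is consulted. A minimal sketch, assuming a hypothetical Edge record with two boolean flags; the field and function names are illustrative, not our schema.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    confidence: float
    is_supersession: bool = False
    crosses_compatibility_break: bool = False  # hypothetical flag


def must_go_to_review(edge):
    """True when the edge is forced to review regardless of its confidence."""
    return edge.is_supersession and edge.crosses_compatibility_break


redesigned_bracket = Edge(confidence=0.99, is_supersession=True, crosses_compatibility_break=True)
plain_supersession = Edge(confidence=0.99, is_supersession=True)
forced = must_go_to_review(redesigned_bracket)
not_forced = must_go_to_review(plain_supersession)
```

The check deliberately ignores confidence: a 0.99 edge that crosses a compatibility break is still forced to review.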

OE versus OES versus aftermarket

A part made by the original supplier to OE specification, sold in the supplier’s box, is not always identical to the OE-branded part. Sometimes the OE pack includes mounting hardware or a software-coded module that the OES box does not. We flag these as "equivalent with caveats" and surface the difference inline in the evidence.

Region-locked parts

A part listed in a US catalogue as equivalent to a European number can be the same shape, with a different software map. The cross-reference is technically correct and operationally wrong. We strip US-only sources from the European confidence calculation by default.
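The default can be pictured as a filter applied to the source list before scoring. This is a sketch under assumptions: the source records, the "regions" field, the "EU" market code and the include_us_only switch are all hypothetical names.

```python
def sources_for_market(sources, market, include_us_only=False):
    """Return the sources that may feed the confidence calculation for a market."""
    if market == "EU" and not include_us_only:
        # Drop sources whose only region is the US; multi-region sources stay.
        return [s for s in sources if set(s["regions"]) != {"US"}]
    return list(sources)


sources = [
    {"name": "us_catalogue", "regions": {"US"}},
    {"name": "eu_bulletin", "regions": {"EU"}},
    {"name": "global_catalogue", "regions": {"US", "EU"}},
]
eu_names = [s["name"] for s in sources_for_market(sources, "EU")]
us_names = [s["name"] for s in sources_for_market(sources, "US")]
```

Only the US-only catalogue is dropped for the European calculation. A source that covers both regions still counts.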

What we don’t use

A few signals look tempting, and we deliberately give them no weight. Price similarity: two parts at similar price points are not more likely to be the same part. Image similarity: two photos of similar-looking castings can be different parts at the connector. Description-string similarity: manufacturers describe the same part in five different ways across catalogues. All three mislead the model more often than they help.

How the score appears in the API

{
  "part_number": "0281002996",
  "brand": "Bosch",
  "confidence": 0.97,
  "review_status": "auto_published",
  "evidence": [ ... ]
}

Integrators decide their own auto-route threshold. A national distributor running overnight reconciliation may auto-accept everything ≥ 0.85; a trade counter doing instant fulfilment may set the bar at 0.95 and route everything below to a human. The API is opinionated about the score and unopinionated about the policy.
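On the integrator side, the policy can be as small as one comparison against the returned score. A sketch, assuming the response has already been parsed into a dict with the "confidence" field shown in the sample; should_auto_route is a hypothetical client-side helper, not part of the API.

```python
OVERNIGHT_RECONCILIATION = 0.85  # example threshold from the text
TRADE_COUNTER = 0.95             # example threshold from the text


def should_auto_route(match, threshold):
    """Client-side policy: auto-route only at or above the integrator's own bar."""
    return match["confidence"] >= threshold


match = {"part_number": "0281002996", "brand": "Bosch", "confidence": 0.90}
overnight_ok = should_auto_route(match, OVERNIGHT_RECONCILIATION)
counter_ok = should_auto_route(match, TRADE_COUNTER)
```

The same 0.90 match is auto-accepted by the overnight job and sent to a human at the trade counter.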