Meet PSM Pal, our pre-trained survival model
applied to the automotive industry
Here is our full methodology in the order it is actually applied: how data is collected and standardized, how raw probabilities are estimated, how known reporting biases are corrected against independent references, how probabilities are converted into cost distributions, and what our model deliberately does not claim to do.
Data Collection and Standardization
Our base layer is built from owner-reported failures drawn from open-data and permissively licensed sources that explicitly authorize reuse, analysis, and the training of statistical models. These include public-sector open data and content released under licenses such as Creative Commons and the Open Database License. This base layer is then calibrated and corrected (see Correcting for Reporting Biases) against independent reliability references that are representative of the broader vehicle population rather than the self-selected population of people who write reviews. Depending on the country, these references include:
- United States — NHTSA complaint and recall data, J.D. Power vehicle dependability studies, and aggregated consumer-reliability surveys.
- Germany — the ADAC breakdown statistics and the TÜV defect reports.
- United Kingdom — DVSA MOT test result data, an open dataset of real-world defect rates by model and vehicle age.
- France — aggregated consumer satisfaction and reliability indices.
Every reported failure is normalized into a common reference framework. Trim identity follows industry-standard automotive cataloguing conventions, and each failure is mapped to a single normalized component within one of the eight functional systems. This standardization is what makes statistics comparable across sources, languages, and markets. Without it, the same failure described three different ways would be counted as three different problems.
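To illustrate why normalization matters, here is a minimal sketch in which three differently worded reports collapse into one canonical component. The alias table, the system names, and the normalize function are hypothetical examples, not our actual reference framework, which is considerably larger.

```python
from collections import Counter

# Hypothetical alias table: maps free-text descriptions to a
# (functional system, normalized component) pair. Illustrative only.
ALIASES = {
    "turbo failure": ("engine", "turbocharger"),
    "turbocharger broke": ("engine", "turbocharger"),
    "blown turbo": ("engine", "turbocharger"),
    "gearbox slipping": ("transmission", "gearbox"),
}

def normalize(raw_description):
    """Return the canonical (system, component) pair, or None if unknown."""
    # Lowercase and collapse repeated whitespace before the lookup.
    key = " ".join(raw_description.lower().split())
    return ALIASES.get(key)

reports = ["Turbo failure", "blown  turbo", "Turbocharger broke", "gearbox slipping"]
normalized = [normalize(r) for r in reports]

# Count failures per canonical component, ignoring unmapped reports.
counts = Counter(n for n in normalized if n is not None)
print(counts)
```

Without the lookup step, the three turbocharger reports would be tallied as three separate problems with one occurrence each.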
Estimating Failure Probabilities
For each trim × mileage combination, we produce a probability estimate by combining two independent methods that compensate for each other's weaknesses.
Descriptive Statistics
The observed frequency of every failure type is computed within each 50,000 km interval, across the entire dataset. This frequency is our first estimate of the probability that the failure occurs within that mileage range. Descriptive statistics have one decisive advantage: they are grounded directly in observed data. Their weakness is variance: when only a handful of observations exist for a given combination, the observed frequency is unstable and can swing wildly. For this reason, descriptive statistics are only one of the two components of our estimate.
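The binning step can be sketched as follows. This is a simplified illustration, assuming the frequency is the number of reported failures in a 50,000 km interval divided by the number of vehicles observed in that interval. The denominator, the function names, and the sample figures are assumptions made for the example.

```python
from collections import defaultdict

BIN_KM = 50_000  # interval width stated in the text

def mileage_bin(km):
    """Return the lower bound of the 50,000 km interval containing km."""
    return (km // BIN_KM) * BIN_KM

def observed_frequencies(failures, vehicles_per_bin):
    """failures: iterable of (component, mileage_km) pairs.
    vehicles_per_bin: dict mapping bin lower bound -> vehicles observed
    (an assumed denominator, for illustration)."""
    counts = defaultdict(int)
    for component, km in failures:
        counts[(component, mileage_bin(km))] += 1
    return {
        key: count / vehicles_per_bin[key[1]]
        for key, count in counts.items()
    }

# Invented sample data.
failures = [("turbocharger", 62_000), ("turbocharger", 98_500), ("gearbox", 120_000)]
vehicles_per_bin = {50_000: 40, 100_000: 25}

freq = observed_frequencies(failures, vehicles_per_bin)
print(freq)
```

With only 25 or 40 vehicles in an interval, one additional report shifts the frequency by several percentage points, which is the instability described above.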
Random Forests
A random forest model is trained on the full standardized dataset. It is well suited to this problem for two reasons. First, it captures complex, non-linear interactions between trim characteristics, for example the way a specific engine and transmission pairing behaves differently from either component in isolation. Second, its ensemble structure naturally produces probability distributions rather than single-point predictions. Crucially, the model learns from all trims simultaneously. This lets it generalize to combinations where direct observations are too sparse for descriptive statistics to be reliable, by borrowing strength from mechanically similar vehicles.
Combining the Two Estimates by Dynamic Credibility Weighting
We weight the two estimates dynamically, according to how much real-world evidence supports the descriptive estimate. The more observations we have, the more we trust direct observation; the fewer we have, the more we lean on the model's generalization. We use the Bühlmann credibility factor — a standard method from actuarial theory:
w = n / (n + λ)
P_final = w · P_descriptive + (1 − w) · P_RandomForest
where: n is the number of observations available for that trim × mileage; and λ is a confidence constant, calibrated empirically, that sets how much data is required before observation dominates. This is the same principle insurers use to blend an individual's limited claims history with the behavior of a broader population, and it is well documented as a method for reducing overall estimation variance. Where we have data, we trust it; where we don't, we fall back on a model trained on everything we do know.
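The formula is easy to check by hand. The sketch below implements it directly; the value of λ (50) and the two input probabilities are hypothetical, chosen only to show how the weight shifts as n grows.

```python
def credibility_weight(n, lam):
    """Bühlmann-style credibility factor w = n / (n + lambda)."""
    if n < 0 or lam <= 0:
        raise ValueError("n must be >= 0 and lambda must be > 0")
    return n / (n + lam)

def blend(p_descriptive, p_random_forest, n, lam):
    """P_final = w * P_descriptive + (1 - w) * P_RandomForest."""
    w = credibility_weight(n, lam)
    return w * p_descriptive + (1 - w) * p_random_forest

LAMBDA = 50.0  # hypothetical value; the real constant is calibrated empirically

for n in (0, 50, 500):
    w = credibility_weight(n, LAMBDA)
    p = blend(0.10, 0.04, n, LAMBDA)
    print(f"n={n:>3}  w={w:.3f}  P_final={p:.4f}")
```

With no observations the result equals the model estimate, at n = λ the two estimates are weighted equally, and as n grows the result approaches the observed frequency.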
Correcting for Reporting Biases
Owner-reported data is powerful but systematically skewed. We identify four distinct biases and apply a specific correction to each. Because biased data cannot reliably correct itself, every correction is anchored to an independent external reference rather than to our own dataset.
Inter-Model Reporting Bias
Models with large, active owner communities or strong reputations attract disproportionately more reports than less-discussed models, even at equal true failure rates. Observed differences between models therefore partly reflect community activity and sentiment rather than reliability. For each model, we derive an overall reliability and satisfaction index from sources independent of our own data (see Data Collection and Standardization). These cover far broader populations and are not driven by the impulse to report a problem. We use this index as a weighting factor to recalibrate relative failure rates between models: a model over-represented negatively relative to its external reputation is adjusted downward, and vice versa.
Global Pessimism Bias
Even after inter-model correction, the dataset as a whole is skewed toward failure: owners who experience no problems rarely write about it. As a result, absolute failure rates are systematically inflated across every model. We compare our estimates against independent datasets built from representative populations not affected by reporting behavior. For each covered model, we compute a global calibration ratio and apply it as a corrective factor. This adjusts the overall level of estimated probabilities without disturbing their relative structure, which has already been handled by the previous correction.
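One simple way to picture this correction is a single multiplicative factor per model. The sketch below assumes the ratio is the overall failure rate from an independent reference divided by the overall rate implied by our own estimates. That definition, the function names, and all figures are assumptions for illustration, not the exact production formula.

```python
def calibration_ratio(reference_rate, estimated_rate):
    """Assumed definition: external overall rate / internally estimated overall rate."""
    if estimated_rate <= 0:
        raise ValueError("estimated_rate must be positive")
    return reference_rate / estimated_rate

def apply_calibration(probabilities, ratio):
    """Scale every component probability by the same ratio.
    Values are capped at 1.0; the cap only matters if scaling would
    push a probability above 1, which is not the case here."""
    return {comp: min(1.0, p * ratio) for comp, p in probabilities.items()}

# Invented figures for one model.
estimated = {"turbocharger": 0.08, "gearbox": 0.04}
ratio = calibration_ratio(reference_rate=0.06, estimated_rate=0.12)
calibrated = apply_calibration(estimated, ratio)
print(ratio, calibrated)
```

Because every probability is multiplied by the same factor, the turbocharger remains twice as likely to fail as the gearbox after calibration. Only the overall level changes.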
Community Amplification
A single high-profile defect can dominate a forum or social platform, with one underlying issue generating many near-duplicate reports. Left untreated, this inflates that defect's apparent probability. We apply clustering techniques to detect and merge highly similar reports before computing any statistics, so that one widely discussed issue counts as one issue.
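As a toy illustration of duplicate merging, the sketch below groups reports whose word sets overlap strongly, using Jaccard similarity and a greedy single pass. This is a deliberately simple stand-in, not the clustering technique used in production; the 0.6 threshold and the sample reports are invented.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words two reports have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def merge_near_duplicates(reports, threshold=0.6):
    """Greedy pass: attach each report to the first cluster whose
    representative (its first member) is similar enough."""
    clusters = []  # list of (representative_tokens, members)
    for report in reports:
        t = tokens(report)
        for rep_tokens, members in clusters:
            if jaccard(t, rep_tokens) >= threshold:
                members.append(report)
                break
        else:
            clusters.append((t, [report]))
    return [members for _, members in clusters]

reports = [
    "timing chain rattle on cold start",
    "timing chain rattle at cold start",
    "rattle from timing chain on cold start",
    "infotainment screen freezes randomly",
]
clusters = merge_near_duplicates(reports)
print(len(clusters), [len(c) for c in clusters])
```

Here the three timing-chain reports fall into one cluster and are counted as a single issue, while the infotainment report stays separate.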
Failure-Severity Skew
Major failures generate far more reports than minor ones, over-representing dramatic, highly visible defects relative to small recurring annoyances. We apply severity-specific correction factors to rebalance the contribution of failures across severity classes.
From Probabilities to Cost — Monte Carlo Simulation
Failure probabilities answer "how likely?". Owners and dealers also need "how much?". To convert individual probabilities into a distribution of total expected repair cost, we run a Monte Carlo simulation of 10,000 iterations for each trim × mileage combination. In each iteration, and for every potential failure, we make up to two random draws. First, does the failure occur? This is a draw against that component's calibrated probability. Second, if it occurs, what does it cost? This is a draw from that component's cost distribution.
The total cost for the iteration is the sum of the costs of all failures that occurred. After 10,000 iterations, we obtain a complete probability distribution of total repair cost, from which we report the expected value together with the 75% and 95% ranges of simulated outcomes. Repair costs are modeled with a log-normal distribution, fitted to pricing data collected per country from both independent repair shops and authorized dealer networks. The log-normal shape reflects observed reality: most repairs sit in a low-to-moderate range, while a smaller number are substantially more expensive. The same repair might cost US$300 at an independent garage and US$900 at a manufacturer's network, which is exactly why we report probabilistic ranges rather than a single quote.
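The simulation loop can be sketched in a few lines of standard-library Python. The component list, the probabilities, and the log-normal parameters below are invented for illustration, and the ranges are computed as central percentile intervals, which is an assumption about how such ranges could be derived rather than a statement of our exact reporting convention.

```python
import math
import random
import statistics

# Hypothetical inputs: calibrated failure probability p and log-normal
# cost parameters (mu, sigma) per component. Illustrative values only.
COMPONENTS = [
    {"name": "turbocharger", "p": 0.04, "mu": math.log(1200), "sigma": 0.5},
    {"name": "gearbox", "p": 0.02, "mu": math.log(2500), "sigma": 0.6},
]

def simulate_total_costs(components, iterations=10_000, seed=42):
    """Return one simulated total repair cost per iteration."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    totals = []
    for _ in range(iterations):
        total = 0.0
        for comp in components:
            # Draw 1: does the failure occur?
            if rng.random() < comp["p"]:
                # Draw 2: if it occurs, what does it cost?
                total += rng.lognormvariate(comp["mu"], comp["sigma"])
        totals.append(total)
    return totals

def central_range(values, coverage):
    """Central percentile interval covering the given share of outcomes."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return low, high

totals = simulate_total_costs(COMPONENTS)
expected_cost = statistics.fmean(totals)
range_75 = central_range(totals, 0.75)
range_95 = central_range(totals, 0.95)
print(f"expected={expected_cost:.0f}  75%={range_75}  95%={range_95}")
```

With failure probabilities this low, most simulated iterations incur no repair at all, so the lower bound of each range is zero and the expected value is driven by a small number of expensive outcomes.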
Limitations and What the Model Does Not Do
We are deliberately explicit about the boundaries of these estimates.
First, we do not predict breakdowns for a specific vehicle (VIN). Our objective is to quantify risk exposure, not to forecast a particular failure.
Second, individual maintenance history is not modeled. A meticulously maintained vehicle will generally outperform our estimate; a neglected one will underperform it. The report reflects the expected behavior of an average vehicle in its category.
Third, minor and silent failures are underrepresented. Small or non-disruptive issues are less likely to be reported and may be undercounted. We continuously expand our failure reference database to reduce this gap.
Fourth, residual bias may remain. Our corrections target the largest and best-understood biases in owner-reported data, calibrated against independent references. They reduce, but cannot fully eliminate, every source of distortion. Where independent reference data is thin for a given market or model, residual bias is correspondingly harder to remove.
PSM Pal is a living model. Its estimates improve as new data is collected, new markets are integrated, and calibration references are updated. Every report you explore is built on the same pipeline described here — transparent, reproducible, and continuously refined. We invite you to explore the data, compare trims, and form your own informed judgment.