As large amounts of patient data become increasingly available, apps for predicting clinical
outcomes will proliferate. Imagine that you have an app on your
device that seems to answer your clinical question: perhaps whether you
should place the patient before you on a statin or, if their disease is more
advanced, whether to offer PCI or CABG. You plug in the key variables for your patient and
come up with a predicted result. You have heard, correctly, that new AI-based
outcome modelling is usually better than traditional statistical modelling1. Should you trust the results?
What should you ask?
1. Are a lot of patients like mine in the dataset from which the model was derived?
It is well known that patients with substantial comorbidities,
those not at equipoise, and women and minorities are under-represented
in the U.S.-based randomized controlled trials (RCTs) that often drive clinical guidelines.
This is not the case in broad-based registries such as Epic Cosmos, the US
National Inpatient Registry or the ACC/NCDR PCI Registry, but these registries have
their own deficiencies (see below). To get data and models on Caucasian patients who
would have been excluded from RCTs, subset analyses from comprehensive datasets such as
SWEDEHEART can be valuable. However, if your patient is African-American,
Hispanic, Asian or of low socioeconomic status, you’ll need to look
elsewhere, as all of these factors influence outcomes, particularly long-term ones.
Even models derived from large datasets, such as the Pooled Cohort Equations (PCE),
perform less well in minority groups2.
Finally, AI-based methods such as XGBoost or Random Survival
Forests iteratively model overlapping subsets of patients until their
loss function (the difference between their prediction and reality) is stably
minimized. They can outperform traditional models on complex datasets, but to do so
they need truly large numbers of patients like yours.
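As a rough illustration of this first question, the sketch below counts how many patients in a derivation cohort resemble the one in front of you. Everything in it is hypothetical: the `Patient` fields, the five-year age window and the toy cohort are assumptions chosen for illustration, and a real check would use whatever variables the model’s developers actually report.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical matching variables; a real dataset will have many more.
    age: int
    sex: str
    ethnicity: str
    diabetic: bool


def count_similar(cohort, target, age_window=5):
    """Count cohort members matching the target on sex, ethnicity and
    diabetes status, with age within +/- age_window years (an assumption)."""
    return sum(
        1
        for p in cohort
        if p.sex == target.sex
        and p.ethnicity == target.ethnicity
        and p.diabetic == target.diabetic
        and abs(p.age - target.age) <= age_window
    )


# Invented toy cohort, for illustration only.
cohort = [
    Patient(62, "F", "Hispanic", True),
    Patient(58, "F", "Hispanic", True),
    Patient(71, "M", "White", False),
    Patient(64, "F", "White", True),
    Patient(45, "F", "Hispanic", True),
]
my_patient = Patient(60, "F", "Hispanic", True)

n_similar = count_similar(cohort, my_patient)
print(f"{n_similar} of {len(cohort)} cohort patients resemble mine")
```

If the count is a tiny fraction of the cohort, the model’s prediction for your patient rests on very little direct evidence, however large the overall dataset.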
2. Are the data unbiased?
One would certainly like to think large broad-based databases
should be unbiased, but what about predictive models that come from industry or
use industry-funded data? We all know that data can be cherry-picked and
conclusions “spun.” Risk of bias is common even in RCTs3. It is known that industry-funded studies that
are “positive” are more likely to be published than those that are neutral or
negative4. This can skew even well-intentioned
“neutral databases”. As an example, and without meaning to be judgmental,
industry-based models such as the QRISK3 model have been criticized for
overestimating patients’ risk of MI and CVA.
3. Is the nature of the dataset appropriate to the question you seek to answer?
There are many issues here. Many of the really big U.S. datasets are
based only on ICD-10 codes, medications prescribed and lab values. Such administrative
datasets often lack important details, have incomplete data and
incomplete patient follow-up, and are subject to miscoding and misclassification.
They also lack information on patients’ quality of
life and their desired outcomes. They may be reasonable, in concept, for assessing
the relationships between baseline characteristics and later MACE, as is done
in the PCE, but they lack the procedural nuance, found in the NCDR PCI, STS and
TVT Registries, that is needed to accurately predict procedural outcomes. On the other hand, the NCDR PCI
dataset, even if supplemented by survival data, may also be poorly suited to
predict long term general outcomes such as mortality. For instance, we have
developed models to predict long-term mortality after PCI using data from
ACC/NCDR supplemented by natural language processing (NLP)-extracted data from
our patients’ Epic EMR. Five of the top 10 predictive correlates were not in
the ACC/NCDR dataset (e.g. serum albumin, diuretic/dose and depression [all
more important than LVEF]). Additionally, coding of cause of death is notoriously
unreliable, so focusing on cardiac death is highly problematic. Digging deeper,
even in 2024, NLP-based data extraction from most databases is typically
accurate only 75-90% of the time (free text is especially challenging)5. That said, large datasets often have results
far more generalizable than those from single-center or more limited datasets.
4. Has the field evolved since the dataset was constructed?
The PCE and MESA models are good examples of tools developed from
large datasets to inform decisions about preventive treatments such as aspirin
and statins. As calcium scoring became available, it became clear that inclusion
of these data improved the models’ predictive capabilities6. Since
non-calcified plaque is less stable than calcified plaque, it stands to reason
that quantification of non-calcified coronary plaque, soon to be readily
available, will provide even more information. Genetic data are also becoming
increasingly available. Beyond this, the widespread use of GLP-1 receptor agonists will
likely reduce risk. There is no good solution to the problem of predicting
long-term outcomes when background treatment and available tests are evolving
quickly, but physicians need to be aware of the latest data.
5. Is the model good?
Currently, we use a number of models that really aren’t that good. For
example, the commonly used CHA2DS2-VASc score has validation c-statistics ranging
from 0.59 to 0.67 and the DAPT score from 0.49 to 0.71. One might wonder why thought
leaders and our societies haven’t stressed these limitations. Perhaps it’s
because these models are better than a “gut choice.” The thoughtful
cardiologist should at least know the basics of how to critique a model. Models
(AI-generated or not) should be evaluated on three properties:
discrimination (how well does the tool separate patients by risk, for example into low, medium and
high? This is typically measured by the c-statistic [0.50: no discrimination;
1.00: complete separation on the basis of risk; good: 0.7-0.8; very good:
0.8-0.9] or the statistically better F1 score [good: 0.7-0.8; very good: >0.8]),
calibration (does the predicted
risk match the actual risk? A tool isn’t helpful if it discriminates among
patients’ risks but under- or overestimates them; this is assessed by the calibration
plot, Brier score or Hosmer-Lemeshow test)
and generalizability (how does the
model perform in datasets other than the one in which it was developed?). Beyond this, and at a more nuanced level, models
need to minimize confounder effects and avoid overfitting (AI-based models have,
on average, three times more predictors than non-AI models, so they are
at particular risk of this problem)7. Guidelines
for quality modelling with AI have been published recently8.
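The headline metrics named above can be computed by hand on a small example. The following is a minimal sketch under stated assumptions, not a validated implementation: the outcomes and predicted risks are invented, the function names are illustrative, and the 0.5 cut-off used for the F1 score is an arbitrary choice.

```python
def c_statistic(y_true, y_prob):
    """Fraction of (event, non-event) pairs in which the patient with the
    event received the higher predicted risk; ties count as half."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and outcome (0 or 1).
    Lower is better."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """F1 score after labelling patients 'high risk' at or above threshold.
    The threshold is an assumption; F1 changes when it changes."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented data: 1 = patient had the event, 0 = did not.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9]

c_stat = c_statistic(y_true, y_prob)
brier = brier_score(y_true, y_prob)
f1 = f1_at_threshold(y_true, y_prob)

# Halving every predicted risk keeps the ranking of patients (same
# c-statistic) but makes the model systematically underestimate risk.
halved = [p / 2 for p in y_prob]
observed_rate = sum(y_true) / len(y_true)
mean_pred_halved = sum(halved) / len(halved)

print(f"c-statistic: {c_stat:.3f}, Brier: {brier:.3f}, F1: {f1:.3f}")
print(f"halved risks: c-statistic {c_statistic(y_true, halved):.3f}, "
      f"Brier {brier_score(y_true, halved):.3f}, "
      f"mean predicted {mean_pred_halved:.3f} vs observed {observed_rate:.3f}")
```

The halved example shows why discrimination alone is not enough: the c-statistic is unchanged, yet the predicted risks no longer match what actually happened, which the Brier score and the comparison of mean predicted with observed risk both pick up.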
6. Is there a good reason to think that the results at my practice/hospital should be different?
Predicting the outcomes of treatments that involve physician skill
(devices) is fundamentally more challenging than predicting those of treatments that don’t (drugs). Although
many of our current procedures are largely standardized (PCI, TAVR), data exist
that should make us question how well global results apply to your patient. For
instance, CABG-related mortality in the SYNTAX trial varied widely by hospital.
As another example, if we know that intra-coronary imaging (ICI) improves PCI
outcomes and yet it is only used in a small minority of cases nationwide,
should you trust national outcome data if your center uses ICI in 90% of its
cases?
There are always trade-offs in medicine. RCTs eliminate the treatment biases that
contaminate observational studies, but their results apply only to the typical
patient in the study. Patients want and will increasingly expect treatment
recommendations tailored specifically to their situation (think genetic risk
factor-based cancer therapy). Large-dataset, AI-based predictive apps have the
potential to meet this need, but they should not be followed without question.
Should you have to know the basics of what we just reviewed? Perhaps not. It would be
better if our cardiac societies would do the model evaluation for us, but from the
review above it seems that they can’t be fully relied upon. In the end, your
patients depend on you to know what’s best.
References