Abstract
Viral transmission among vaccinated people is a major problem in the ongoing
fight to contain seasonal influenza. Standard flu vaccines provide reasonable
protection, but waning immunity combined with the constant mutation of the
virus leads to breakthrough infections. This paper describes the development
and deployment of machine learning models that use large-scale, real-world
pharmacy testing data to predict influenza breakthrough infections. Using
deidentified testing records from major retail pharmacies in the United States
between 2022 and 2023, we trained and compared several supervised learning
algorithms, namely logistic regression, random forest and XGBoost, to estimate
the probability of influenza infection among vaccinated individuals given
demographic, clinical and temporal covariates. XGBoost was the most accurate
model, achieving a ROC-AUC of 0.90. Building on this result, real-time risk
scoring was deployed in 215 retail pharmacy locations. Feature importance
analysis showed that time since vaccination, age, comorbidities and other
factors were significant indicators of breakthrough infection. The model's
outputs closely tracked the regional influenza positivity trends reported by
the CDC (Pearson correlation coefficient = 0.81), showing that the model can be
used for near real-time monitoring. This work shows how pharmacy data can
complement other methods of public health surveillance and describes the data
protection measures and decision-making process used in a large retail pharmacy
setting to identify at-risk patients before and during flu seasons.
Keywords: Influenza, Breakthrough infections, Machine learning, Pharmacy data,
Predictive modeling
1. Introduction
1.1. Influenza and the role of vaccines
Seasonal influenza is a highly contagious disease caused by influenza viruses
that affects people’s health worldwide. Influenza is responsible for
considerable morbidity and mortality every year, particularly during the flu
season and among the elderly, children and those with chronic health
conditions. Annual vaccination is the primary preventive measure recommended by
health authorities such as the CDC and WHO [1-3]. These vaccines are updated
annually according to information on circulating strains collected worldwide.
Influenza vaccines are widely used and highly valuable; however, their
effectiveness is moderate, typically 40% to 60% against symptomatic illness,
depending on how closely the vaccine antigens match circulating strains and on
existing immunity in the population.
Vaccination effectively reduces the detrimental health outcomes of influenza;
nonetheless, breakthrough infections, that is, laboratory-confirmed influenza
infections among persons who have been vaccinated, still occur. These may be
due to a decline in vaccine-induced immunity, a change in circulating strains
or host factors such as age and comorbid health conditions. Identifying persons
at higher risk of breakthrough infection is important not only for assessing
vaccine effectiveness but also for the clinical management of infected persons
and for informing decisions about vaccination. Still, current surveillance
methods lack the detail needed to reveal breakthrough infections in real time,
at either the individual or the community level.
1.2. The emergence of pharmacy testing data
Over
the past few years, HDT has emerged as a key mode of service delivery for
infection testing in disease prevention and public health. Large pharmacy
service providers such as CVS, Walgreens and Walmart offer influenza testing
services, particularly during the flu season. These services generate a
massive volume of anonymized diagnostic information, which is becoming
increasingly instrumental to the work of public health. Compared to
conventional surveillance systems, data originating from pharmacy testing also
capture a larger share of asymptomatic and mildly symptomatic persons who may
not visit a hospital or clinic.
This broader diagnostic reach makes it possible to monitor changing influenza
activity and its dynamics at the population level in real time, including cases
that are not seen in doctors’ offices or hospitals. Moreover, when combined
with demographic, vaccination and comorbidity information, pharmacy test data
provide good material for predictive analytics. Models built on these data can
estimate the likelihood of infection and support vaccination and preventive
practices for those who remain vulnerable.
1.3. Objectives of the study
The purpose of this paper is to develop and implement models for predicting
influenza breakthrough infection using real-world pharmacy testing data. We
pursue the following aims: (i) to develop and train predictive models for the
probability of breakthrough infection using individual-level and external
factors; (ii) to compare and evaluate the accuracy of the models using the test
data from the pharmacies; (iii) to apply the developed models in real-life
pharmacy settings for risk assessment.
Using scalable data pipelines and a cloud-based setup, this work bridges the
gap between data analysis in public health and its application in clinical
settings. Overall, the findings of this research provide a foundation for
enhancing influenza monitoring, supporting individualized approaches to
prevention and creating effective vaccine distribution and marketing strategies
in the retail healthcare network of the United States.
2. Related Work
2.1. Overview of existing predictive models for infectious diseases
In recent years, predictive modeling has proved an important tool for tracking
and monitoring infectious diseases, especially given current advances in
machine learning and big data analysis. For decades, classical compartmental
models such as SIR and SEIR have been used to estimate disease transmission at
the population level [4-6]. More recently, there has been growing interest in
data-driven approaches that draw on richer and more realistic data sources,
such as EHRs, social media and mobility data, to anticipate outbreaks and
recognize those at risk.
For influenza in particular, a number of machine learning models have been
developed to predict the likelihood of infection, the chance of hospitalization
and trends in influenza activity. Methods such as logistic regression, decision
trees, support vector machines and deep learning networks have been applied in
several studies. For instance, models based on CDC influenza data and weather
information have given fairly good predictions of weekly flu incidence. Other
works have used symptom information from mobile applications, wearable devices
and web search trends to predict flu activity at regional and national levels.
However, these strategies provide general information about levels of risk
rather than identifying the risk for specific patients, especially those who
have been vaccinated.
2.2. Limitations in prior studies
Although substantial progress has been made in infectious disease modeling,
most existing works have limitations that reduce their usefulness for
deployment in outpatient settings. First, most predictive models depend on data
collected from centralized databases, such as hospital admissions or
state-reported cases, which are often incomplete and delayed. This delay slows
detection and the adjustment of responses to emerging trends and patterns.
Second, few models have been tested with more detailed retail or outpatient
testing data, even though such settings are increasingly a point of entry for
patients with flu symptoms.
More importantly, limited research has been carried out on breakthrough
infections, which represents a knowledge gap. Most studies model overall flu
incidence or transmission levels without distinguishing between vaccinated and
unvaccinated persons. Hence, they can seldom be used to evaluate vaccine
effectiveness or improve risk messaging for vaccinated persons. A further gap
in the literature is the lack of model deployment studies. While models have
been shown to succeed in offline assessments, they are rarely translated into
clinical practice, especially in settings outside the clinic such as
pharmacies.
2.3. Contribution of this paper
In this regard, the present work contributes to infectious disease modeling in
several respects. First, it targets the prediction of influenza breakthrough
infection, an underexplored area of research, even though the number of such
infections is increasing yearly. By training models on real-world pharmacy
testing data, we present a new and important data stream that includes
diagnostics, vaccination history and risk factors at the individual level and
at population scale. This data source offers both timeliness and granularity,
which are usually lacking in centralized surveillance systems.
Second, multiple models are evaluated, including logistic regression, random
forest and XGBoost. Model assessment is then performed using multi-fold
cross-validation and validation under real-life conditions. The XGBoost model
yielded the best result, a ROC-AUC of 0.90, and was implemented in a live
system installed at various retail pharmacies. Such an approach is more
practical than the majority of prior works, which are either theoretical or
emphasize retrospective data analysis.
Lastly, the approach illustrates that community data and an AI model can help
improve decision-making in the pharmacy. Our system updates the risk scores of
breakthrough infections and assists pharmacists and clinicians in identifying
high-risk patients and making recommendations for them. The contribution of
this paper lies not only in model development but, more importantly, in
translating a model into practice, with implications for health policy on
vaccine-preventable diseases.
2.4. Immunological mechanisms influencing breakthrough infections in the
context of bacterial, protozoan and viral vaccines
Figure 1: Immune Pathways Affecting Vaccine Breakthrough Infections.
2.4.1. Immune response and vaccine efficacy: The diagram depicts the
interaction between components of the immune system, most notably CD8+ T cells,
CD4+ T cells and B cells, and how they collectively work to produce specific
antibodies in response to vaccines or pathogens. These antibodies are essential
in the reduction of bacterial load and disease severity [7].
Nonetheless, the diagram highlights that drug resistance and vaccine
breakthrough infections are still possible even with these immune defenses.
This
is most striking in those with compromised immune responses or where the
pathogen has developed strategies to evade immune detection. Loss of vaccine
effectiveness can be due to lowered antibody levels, failure to activate T
cells or pathogen immune evasion. When designing and testing any predictive
system to identify at-risk individuals, these should be considered.
2.4.2. Role of coinfections and protozoa: The bottom half of the
figure highlights how coinfections with other pathogens and protozoa can
undermine the immune response to vaccines. Such concurrent infections can
prevent the activation of immune cells, thus lowering antibody levels and
T-cell-mediated immunity. This results in reduced vaccine efficacy and
increased susceptibility to breakthrough infections.
This is especially applicable to real pharmacy data, where comorbidities and
concurrent infections are usually present in patient histories. These factors
are essential to integrate into feature engineering in predictive modeling, as
demonstrated in our system architecture. Knowledge of these underlying
biological interactions improves the interpretability of the model and helps
ensure that predictions fit clinical and immunological realities.
2.4.3. Implications for predictive modeling: From a systems design
point of view, the biological processes shown in this figure reinforce
the necessity of including variables like time since the last vaccine dose,
comorbidities and demographic variables in machine learning models. These
immunological findings explain why some features predict breakthrough
infections and justify the inclusion of such attributes in feature engineering.
By capturing how immunologic complexity is translated into heterogeneous
responses to vaccines, our predictive system is strengthened and better linked
to biomedical evidence. This improves risk stratification accuracy and the
potential public health benefit of the deployed system.
3. Data and Methods
3.1. Data Sources
For this study, data were compiled from multiple commercial and public health
sources to give a broad view of influenza testing, vaccination and risk factors
at the individual level.
Pharmacy testing data were drawn from anonymous records from three large US
retail pharmacy companies, CVS Health, Walgreens and Walmart [8-11], covering
October 2022 to March 2023. These datasets contain over 1,200,000 records from
rapid antigen and RT-PCR influenza tests and include testing location, test
result, testing date and time and basic demographic data such as age, gender
and patient zip code. All patient information was stripped of any identifiers
in accordance with HIPAA guidelines and thus anonymized before any modeling
activities were conducted.
Demographic data from the CDC provided spatial context on population density,
age distribution and the historical burden of the flu at the county level.
These factors were integrated to improve spatial modeling and account for
community-level effects.
Vaccination data were collected from pharmacy immunization records and from
publicly available CDC immunization coverage data. Although detailed
information about individual vaccinations was not always available due to
anonymization procedures, the pharmacy datasets contained binary indicators of
patients’ self-reported influenza vaccination within the past year. Where
linked electronic vaccination records were available, further features such as
the date of the last vaccine dose and the type of vaccine were used.
3.2. Preprocessing
Before
model building, the raw data was processed through several preprocessing steps.
All Personally Identifiable Information (PII) was masked and a secure
tokenization scheme was implemented to enable consistent yet anonymous tracking
of patient encounters over repeated visits and test types. Duplicates,
inconsistent test result formats and incomplete records were removed.
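The tokenization and de-duplication steps can be illustrated with a minimal
sketch. The paper does not specify the tokenization scheme; a keyed hash
(HMAC-SHA256) is assumed here as one common way to obtain consistent yet
non-reversible tokens. The key value, the field names (`token`, `test_date`,
`test_type`) and the function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a key-management service.
SECRET_KEY = b"replace-with-a-managed-secret"


def tokenize_patient_id(raw_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a raw identifier to a stable, non-reversible token (HMAC-SHA256)."""
    normalized = raw_id.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first copy of each (token, test date, test type) encounter."""
    seen = set()
    unique = []
    for rec in records:
        encounter = (rec["token"], rec["test_date"], rec["test_type"])
        if encounter not in seen:
            seen.add(encounter)
            unique.append(rec)
    return unique
```

Because the same identifier always maps to the same token, repeated visits by
one patient can be linked without storing the identifier itself.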
Feature
engineering was at the core of the modeling process. We extracted variables
like:
· Vaccination status (binary:
vaccinated/unvaccinated)
· Time since the previous vaccine dose (in days,
binned as <90, 90–180, >180 days)
· Age ranges (e.g., 0–18, 19–49, 50–64, 65+)
· Intensity of influenza transmission at the
region level (estimated from CDC FluView regional estimates)
· Self-reported presence of comorbidities (e.g.,
diabetes, asthma, cardiovascular disease)
All
categorical features were one-hot encoded and continuous variables normalized
to aid model convergence.
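The binning and one-hot encoding described above can be sketched as follows.
This is an illustration under stated assumptions rather than the pipeline used
in the study: the record field names (`vaccinated`, `days_since_dose`, `age`,
`comorbidities`) and all function names are hypothetical, and normalization of
continuous variables is omitted for brevity.

```python
DOSE_BINS = ["<90", "90-180", ">180"]
AGE_BINS = ["0-18", "19-49", "50-64", "65+"]


def bin_days_since_dose(days: int) -> str:
    """Bin days since the last vaccine dose into <90, 90-180 or >180."""
    if days < 90:
        return "<90"
    if days <= 180:
        return "90-180"
    return ">180"


def bin_age(age: int) -> str:
    """Map an age in years to one of the four age ranges."""
    if age <= 18:
        return "0-18"
    if age <= 49:
        return "19-49"
    if age <= 64:
        return "50-64"
    return "65+"


def one_hot(prefix: str, value: str, categories: list[str]) -> dict[str, int]:
    """One-hot encode a categorical value against a fixed category list."""
    return {f"{prefix}_{cat}": int(cat == value) for cat in categories}


def encode_record(record: dict) -> dict[str, int]:
    """Turn one tokenized test record into a flat feature row."""
    row = {
        "vaccinated": int(record["vaccinated"]),
        "has_comorbidity": int(len(record["comorbidities"]) > 0),
    }
    row.update(one_hot("dose", bin_days_since_dose(record["days_since_dose"]), DOSE_BINS))
    row.update(one_hot("age", bin_age(record["age"]), AGE_BINS))
    return row


# Made-up record for illustration only.
example = encode_record(
    {"vaccinated": True, "days_since_dose": 200, "age": 70, "comorbidities": ["asthma"]}
)
```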
3.3. Predictive modeling approach
We applied and compared three supervised learning algorithms, Logistic
Regression, Random Forest and Extreme Gradient Boosting (XGBoost), using
Python's scikit-learn and XGBoost libraries. The goal was to predict the risk
of breakthrough infection, defined as a positive influenza test in a patient
who reported receiving an influenza vaccine in the same season.
The
dataset was randomly split into 70% training and 30% test sets. We applied
5-fold cross-validation on the training set for model selection and hyperparameter
tuning through grid search and Bayesian optimization (Optuna). The evaluation
metrics were:
· Precision, to estimate the ratio of true
positive breakthrough cases out of all predicted positives
· Recall, to estimate the sensitivity of the
model to correctly classify breakthrough cases
· F1-score, as a harmonic mean between precision
and recall
· ROC-AUC, to quantify the model's power to
discriminate between breakthrough and non-breakthrough infections across
thresholds
Feature importance scores were extracted for the Random Forest and XGBoost
models to determine the contribution of every variable to the predictive
output.
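To make the evaluation metrics concrete, the sketch below computes them from
first principles using only the Python standard library. It is illustrative
only; in practice the equivalent scikit-learn functions would normally be used.
The function names and the toy labels in the usage note are not taken from the
study. The ROC-AUC here is computed as the probability that a randomly chosen
positive case receives a higher score than a randomly chosen negative case,
which is quadratic in the number of cases and therefore suited only to small
examples.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, with true labels `[1, 1, 0, 0, 1, 0]` and predicted labels
`[1, 0, 0, 1, 1, 0]`, there are two true positives, one false positive and one
false negative, so precision, recall and F1 all equal 2/3.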
3.4. System deployment
To support real-time risk scoring integrated into the patient record, the
system was built as microservices running on the Google Cloud Platform (GCP)
with Kubernetes [12-16]. The predictive model was exposed as a RESTful API and
incorporated into the electronic health systems of the pharmacies.
Test data were received from the pharmacies in real time and the resulting
predictions were returned within milliseconds. The system flagged high-risk
individuals with a corresponding confidence score and a short reason, such as
“High risk due to time since last dose >180 days and age > 65”.
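A response of this kind could be assembled along the following lines. This is
a hedged sketch, not the deployed service: the cut-off value, the rule set used
to build the reason text and all names are hypothetical, and a production
system might derive the reasons from model explanations instead of fixed
rules.

```python
HIGH_RISK_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the study


def explain_risk(score: float, days_since_dose: int, age: int,
                 has_comorbidity: bool) -> dict:
    """Build a risk flag, a confidence score and a short human-readable reason."""
    reasons = []
    if days_since_dose > 180:
        reasons.append("time since last dose >180 days")
    if age > 65:
        reasons.append("age > 65")
    if has_comorbidity:
        reasons.append("reported comorbidities")

    high_risk = score >= HIGH_RISK_THRESHOLD
    if high_risk and reasons:
        message = "High risk due to " + " and ".join(reasons)
    elif high_risk:
        message = "High risk"
    else:
        message = "Not flagged"
    return {"high_risk": high_risk, "confidence": round(score, 2), "message": message}
```

Keeping the reason text separate from the score lets the pharmacy interface
display a short justification next to each flag.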
The deployment environment was designed to scale across regional pharmacies
and to remain secure, using encrypted connections along with access controls
that restricted critical data to authorized personnel. Moreover, a monitoring
dashboard was created to track system performance along with changes over time
in predictions and regional infection rates, which makes it possible to retrain
the system as virus behavior and seasonality change.
3.5. Predictive modeling system for influenza breakthrough infections
Figure 2: Predictive Modeling System for Influenza Breakthrough Infections.
3.5.1. Data sources: These are the system's main input streams:
· Pharmacy Test Data (e.g., CVS, Walgreens, Walmart): Diagnostic test results
for patients at pharmacy sites.
· CDC Demographic Data: Patient demographics, including age, gender and
regional data.
· Vaccination Records: Aggregated information on patients' influenza
vaccination records, e.g., date and vaccine type.
3.5.2. Data processing and modeling: This part covers the core data science
and machine learning process:
· Data Cleaning & Anonymization: Guarantees
that received data is organized, normalized and free from Personally
Identifiable Information (PII).
· Feature Engineering: Formulates
salient features like vaccination status, age, comorbidities and geography for
modeling.
· Model Training: Employs machine
learning algorithms such as Logistic Regression, Random Forest and XGBoost to
forecast the possibility of breakthrough infections.
· Model Evaluation: Evaluates model
performance on metrics like ROC-AUC, F1-Score, Precision and Recall.
3.5.3. Deployment infrastructure: This layer enables real-time
operationalization of the model:
· Cloud Infrastructure (AWS/GCP): Deploys the
model and manages scalability, data storage and computation.
· Real-Time Risk Scoring Engine: Applies the
trained model to incoming data to produce instant risk scores.
· Risk Stratification API: Streams the results to
end-user systems through APIs for actionable decision-making.
3.5.4. End users: These are the systems or staff that benefit from the
model's predictions:
· Pharmacy interface (Pharmacists): Enables
pharmacists to recognize patients who are at high risk and recommend care or
indicate physician follow-up.
· Public health dashboard (CDC, State Health): Compiles
the risk scores and trends to aid surveillance and public health interventions.
4. Results and Discussion
To test the performance of our predictive models, we used a hold-out sample of
120,000 patient encounters from the CVS, Walgreens and Walmart chains. These
cases were recorded between October 2022 and March 2023, when influenza
activity is typically highest. This helped include an ethnically and
geographically diverse population for assessing the generalizability of the
model.
4.1. Model performance
In this study, we compare three supervised learning approaches, namely
Logistic Regression, Random Forest and XGBoost (Figure 3), on their accuracy in
identifying breakthrough influenza infections among vaccinated individuals. The
performance of each model is presented in Table 1 below.
Table 1: Model Performance on Test Set.
| Model | Precision | Recall | F1-score | ROC-AUC |
| --- | --- | --- | --- | --- |
| Logistic Regression | 0.67 | 0.58 | 0.62 | 0.72 |
| Random Forest | 0.81 | 0.77 | 0.79 | 0.86 |
| XGBoost | 0.85 | 0.82 | 0.83 | 0.90 |
Figure 3: Graphical Representation of Model Performance on Test Set.
The XGBoost model performed markedly better than the other models: it had the
highest precision, recall and F1-score, and its ROC-AUC was 0.90. These results
indicate that the proposed method discriminates well between vaccinated
patients who contracted the flu and those who did not. XGBoost’s ability to
combine many trees, capture nonlinear relationships between the features and
the outcome and rank features made it well suited to this classification task.
4.2. Feature importance
To understand the model's decision-making, feature importance was examined
using SHAP (SHapley Additive exPlanations) values and the Gini importance of
the XGBoost classifier (Figure 4). The most important predictors of
breakthrough infections are listed below (Table 2):
Table 2: Feature Importance (Top 5 Predictors from XGBoost).
| Feature | Importance (%) |
| --- | --- |
| Time since last vaccine dose (>6 months) | 29% |
| Presence of comorbidities | 21% |
| Age over 65 | 17% |
| Geographic region | 12% |
| Prior influenza infection history | 9% |
Figure 4: Graphical Representation of Feature Importance (Top 5 Predictors
from XGBoost).
· Time since the last vaccine dose (> 6 months) made the greatest contribution
to model decisions, with 29% of the total feature importance. This is
consistent with other literature describing waning vaccine-induced immunity.
· The presence of comorbidities (such as diabetes, asthma, COPD and
cardiovascular disease) accounted for 21%, emphasizing the susceptibility of
immunocompromised patients.
· Age over 65 accounted for 17%, supporting CDC evidence that older
individuals are at high risk, even after vaccination.
· Geographic area (categorized according to regional influenza transmission
intensity) accounted for 12%, with high-incidence zip codes strongly correlated
with breakthrough risks.
· Past influenza infection history, derived from legacy test logs, contributed
9%, suggesting partial immunity or behavioral factors associated with
reinfection risk.
These results provide evidence for targeted interventions, e.g., booster advice
or clinical outreach, for those in the high-risk groups defined by these
characteristics.
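Percentage shares like those in Table 2 are typically obtained by normalizing
raw importance scores so that they sum to 100% across all features. The sketch
below shows this step under stated assumptions: the raw scores are invented so
that the top five shares reproduce Table 2, and the two `other_*` entries are
hypothetical placeholders for the remaining features, which the paper does not
list.

```python
def importance_shares(raw: dict[str, float], top_k: int = 5) -> list[tuple[str, float]]:
    """Normalize raw importance scores to percentages and return the top_k."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    ranked = sorted(raw.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100 * value / total, 1)) for name, value in ranked[:top_k]]


# Invented raw scores for illustration only (not model output).
raw_scores = {
    "time_since_dose_gt_6m": 58.0,
    "comorbidities": 42.0,
    "age_over_65": 34.0,
    "geographic_region": 24.0,
    "prior_infection": 18.0,
    "other_a": 14.0,
    "other_b": 10.0,
}
top5 = importance_shares(raw_scores)
```

With these inputs the five leading shares are 29%, 21%, 17%, 12% and 9%, which
together account for 88% of the total; the remaining 12% is spread over the
features outside the top five.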
4.3. Real-world deployment
The XGBoost model was implemented as a cloud-based microservice in a pilot
study at 215 pharmacies in California and Texas during the 2022–2023 flu
season. The model was embedded in the pharmacy testing process, providing
real-time risk scores for every tested patient, which were viewable only by
pharmacy clinicians.
During
the five-month duration, the system identified around 14,200 individuals at high
risk for breakthrough infection. Of these:
· 12.4% were positive for influenza after
vaccination.
· Pharmacists used these alerts to advise
patients to seek prompt medical consultation or antiviral therapy, particularly
in high-transmission areas.
This
intervention facilitated active case management, minimizing potential delays in
treatment and hindering local transmission chains. Anecdotal feedback from the
pharmacy workforce suggested that the alerts were straightforward to understand
and imposed little overhead on clinical workflows.
4.4. Comparison with CDC flu surveillance
We
correlated weekly numbers of flagged breakthrough cases against CDC FluView
regional influenza positivity rates to test how effectively our model accounted
for population-level flu dynamics. A close correlation appeared upon
comparison:
· A Pearson correlation coefficient (r) of 0.81,
p < 0.001, between CDC flu rates and model outputs, signifies a robust and
statistically significant relationship.
· Notably, our model identified increases in
breakthrough cases 1-2 weeks prior to corresponding CDC regional positivity
peaks, indicating that pharmacy-based predictive analytics can serve as an
early warning system.
This result highlights the value of incorporating retail diagnostics into wider
public health surveillance, especially where data are underreported or delayed.
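The correlation and lead-time comparison can be sketched as follows. The code
computes the Pearson coefficient directly from its definition and then checks
which lead (in weeks) gives the strongest correlation when the model series is
compared with later CDC values. The function names and the weekly series in
the usage note are made up for illustration and are not the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def best_lead_weeks(model_series, cdc_series, max_lead=3):
    """Return (lead, r) for the lead that maximizes correlation.

    A lead of k compares model week t with CDC week t + k.
    """
    results = {}
    for lead in range(max_lead + 1):
        model_part = model_series[: len(model_series) - lead]
        cdc_part = cdc_series[lead:]
        results[lead] = pearson_r(model_part, cdc_part)
    best = max(results, key=results.get)
    return best, results[best]
```

For instance, if weekly flagged cases are `[1, 3, 6, 10, 7, 4, 2, 1]` and the
CDC positivity series is `[1, 1, 1, 3, 6, 10, 7, 4]`, the CDC curve repeats the
model curve two weeks later, so the best lead is 2 weeks.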
4.5. Ethical and operational considerations
All procedures for handling data and model deployment were HIPAA-compliant,
and only anonymized and tokenized data were used across the pipeline. Of note,
no patient-level identifiable data were preserved post-inference. The system
produced risk scores without retaining identifiable outputs and results were
applied solely at the point of care to inform pharmacy-based decisions.
In
addition, we put in place governance practices around data usage, access
control and bias reduction. Model fairness audits revealed no meaningful
performance differences by race or gender strata, although repeated audits are
advisable for future growth. Community education sessions were conducted in
pilot areas to inform patients of the use of AI in medical decision-making.
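A fairness audit of this kind can be approximated by comparing a performance
metric across strata. The paper does not state which metric was audited;
recall is assumed in the sketch below, and the function names and group labels
are hypothetical.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Compute recall separately for each group label (positive class = 1)."""
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            if pred == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    labels = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in labels}


def max_recall_gap(recalls: dict) -> float:
    """Largest difference in recall between any two groups."""
    values = list(recalls.values())
    return max(values) - min(values)
```

A small gap between groups would be consistent with the audit finding reported
above, while a large gap would signal that the model misses breakthrough cases
more often in one stratum than in another.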
5. Conclusion
This paper described the development of a real-world predictive model for
breakthrough influenza infection built on community pharmacies’ diagnostic and
vaccination records. Our evaluation showed that the proposed XGBoost model
outperformed the less complex methods, with a ROC-AUC of 0.90 and high
precision, recall and F1-score. The study contributes to infectious disease
modeling through its ability to harness a high volume of near real-time
pharmacy test data from over 1.2 million patient encounters across major chains
throughout the United States.
The pilot's success across 215 pharmacy locations showed that machine-learning
models can be implemented at the point of care. Of the roughly 14,200
individuals identified as high-risk, 12.4% tested positive for influenza
despite vaccination, allowing pharmacists to initiate appropriate interventions
such as recommending further consultation or antiviral therapy.
Consequently, this model provides a feasible and cost-efficient way of
strengthening pharmacy-based surveillance and filling gaps in the existing
public health system. Whereas centralized reporting suffers from latency and
underreporting, pharmacy data offer near real-time information at the community
level that is often not easily obtainable from a centralized reporting system.
Likewise, the high correlation with the CDC FluView trends (r = 0.81) indicates
that retail-based prediction can act as a leading indicator of flu epidemics in
those regions.
5.1. Future work
There are several avenues for developing and further applying the predictive
system presented in this paper. One major direction would be linking the model
to state immunization registries, which would further enhance risk
stratification using the actual vaccination schedules and doses recorded in the
registry. This would help overcome a major weakness of the current model, which
often relies on self-reported data or only partially integrated vaccination
records.
Extending the model to other respiratory illnesses such as RSV or SARS-CoV-2
(COVID-19) requires further research. Given that these viruses can co-infect
with influenza and produce similar symptoms, a multi-pathogen model that
estimates the probability of co-infection could improve diagnostic support and
triage accuracy at the pharmacy level. Such a model would also support
syndromic surveillance and strategies for pandemic preparedness.
Lastly, we will explore self-learning methods that can be used to retrain the
existing models on incoming data streams. This would help prevent the loss of
model accuracy caused by changes in viruses, vaccine effectiveness and
population behavior. With these improvements in place, an intelligent
surveillance system could be integrated into the structure of retail pharmacy
and offer timely, targeted public health interventions to its users.
6. References