Research Article

Design and Deployment of Predictive Models for Influenza Breakthrough Infections Using Pharmacy Test Data

Authors: Vijitha Uppuluri*

Publication Date: June 20, 2023

DOI: https://doi.org/10.51219/JAIMLD/vijitha-uppuluri/626

Citation: Vijitha Uppuluri. Design and Deployment of Predictive Models for Influenza Breakthrough Infections Using Pharmacy Test Data. J Artif Intell Mach Learn & Data Sci, 2023, 1(2): 1-8.

Copyright: ©2023 Vijitha Uppuluri. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Abstract

Influenza infection among vaccinated people remains a major problem in the ongoing effort to contain seasonal flu. Standard flu vaccines provide moderate protection, but waning immunity combined with the constant mutation of the virus leads to breakthrough infections. This paper describes the design and deployment of machine learning models that use large-scale, real-world pharmacy testing data to predict influenza breakthrough infections. Using deidentified testing records from major retail pharmacies in the United States collected between 2022 and 2023, we trained and compared several supervised learning algorithms (logistic regression, random forest and XGBoost) to estimate the probability of influenza infection among vaccinated individuals given demographic, clinical and temporal covariates.

XGBoost was the most accurate model, achieving a ROC-AUC of 0.90. Following this result, real-time risk scoring was deployed in 215 retail pharmacy locations. Feature importance analysis showed that time since vaccination, age and comorbidities, among other factors, were significant indicators of breakthrough infection. The model's outputs closely tracked the regional flu positivity trends reported by the CDC (Pearson correlation coefficient = 0.81), suggesting that the model can be used for near real-time monitoring. This work shows how pharmacy data can complement other methods of public health surveillance, and describes the data protection measures and decision-support process used within a large retail pharmacy setting to identify at-risk patients before and during flu seasons.

Keywords: Influenza, Breakthrough infections, Machine learning, Pharmacy data, Predictive modeling

1. Introduction

1.1. Influenza and the role of vaccines

Seasonal influenza is a highly contagious disease caused by influenza viruses that affects people's health worldwide. Influenza causes considerable morbidity and mortality each year, particularly during the flu season and among the elderly, children and people with existing chronic health conditions. Annual flu vaccination is the primary preventive measure recommended by authorities such as the CDC and WHO^1-3. These vaccines are updated annually according to information on circulating strains worldwide. Influenza vaccines are widely used and highly valuable; however, their effectiveness in preventing symptomatic illness is moderate, typically 40% to 60%, depending on how well the vaccine antigens match circulating strains and on existing immunity in the population.

Vaccination effectively reduces the harmful health outcomes of influenza; nonetheless, breakthrough infections, defined as laboratory-confirmed influenza infections among persons who have been vaccinated, still occur. These may be due to a decline in vaccine-induced immunity, changes in circulating strains or host factors such as age and comorbid health conditions. Identifying persons at higher risk of breakthrough infection is important not just for assessing vaccine effectiveness but also for the clinical management of infected persons and for decision-makers setting vaccine policy. Still, current surveillance methods lack the detail needed to reveal breakthrough infections in real time, at either the individual or the community level.

1.2. The emergence of pharmacy testing data

Over the past few years, HDT has emerged as a key mode of service delivery for infection testing in disease prevention and public health. Large pharmacy service providers such as CVS, Walgreens and Walmart offer influenza testing services, especially during the flu season. These services generate a massive volume of anonymized diagnostic information, which is becoming increasingly instrumental to the work of public health. Compared to conventional surveillance systems, data originating from pharmacy testing also cover a larger share of asymptomatic and mildly symptomatic persons who may not otherwise visit a hospital or clinic.

This broader diagnostic coverage makes it possible to monitor changing influenza activity and its dynamics at the population level in near real time, including cases that are not captured in doctors' offices or hospitals. When combined with demographic, vaccination and comorbidity information, pharmacy test data provide good material for predictive analytics. Models built on these data can estimate the likelihood of infection and support vaccination and preventive practices for those who remain vulnerable.

1.3. Objectives of the study

The purpose of this paper is to develop and deploy models for predicting influenza breakthrough infections using real-world pharmacy testing data. We pursue the following aims: (i) to develop and train predictive models for the probability of breakthrough infection using individual-level and external factors; (ii) to compare the models and estimate their accuracy using test data from the pharmacies; (iii) to apply the developed models in real-life pharmacy settings for risk assessment.

Using a scalable data pipeline and a cloud-based setup, this work bridges the gap between data analysis in public health and its application in clinical settings. Overall, the findings provide a foundation for enhancing influenza monitoring, supporting individualized approaches to prevention and creating effective vaccine distribution and marketing strategies in the retail healthcare network of the United States.

2. Related Work

2.1. Overview of existing predictive models for infectious diseases

In recent years, predictive modeling has become an important tool for tracking and monitoring infectious diseases, especially given advances in machine learning and big data analysis. For decades, classical compartmental models such as SIR and SEIR have been used to estimate disease transmission at the population level^4-6. More recently, data-driven approaches have drawn on richer and more realistic data sources, such as EHRs, social media and mobility data, to anticipate outbreaks and recognize those at risk.

For influenza in particular, a number of machine learning models predict the likelihood of infection, the chance of hospitalization and the overall trend of influenza activity. Methods such as logistic regression, decision trees, support vector machines and deep learning networks have been applied in several studies. For instance, models built on CDC influenza data and weather information have given fairly good predictions of weekly flu incidence. Other works have used symptom information from mobile applications, wearable devices and web search trends to predict flu activity at regional and national levels. However, these strategies describe general levels of risk rather than the risk for specific patients, especially vaccinated ones.

2.2. Limitations in prior studies

Although substantial progress has been made in the modeling of infectious diseases, most existing works have limitations that reduce their usefulness for deployment in outpatient settings. First, most predictive models depend on data from centralized databases, such as hospital admissions or state-reported cases, which are often incomplete and delayed. This delay slows detection and the subsequent response to emerging trends and patterns. Second, few models have been tested with more detailed retail or outpatient testing data, even though such sites are becoming a point of entry for patients with flu symptoms.

More importantly, limited research has been carried out on breakthrough infections, which is a notable gap. Most studies model overall flu incidence or transmission levels without distinguishing between vaccinated and unvaccinated persons. Hence, they can seldom be used to evaluate vaccine effectiveness or to improve risk messaging for vaccinated persons. Another gap in the existing literature is the lack of model deployment studies. While models have been shown to succeed in offline assessments, they are rarely translated into clinical practice, especially in non-hospital settings such as pharmacies.

2.3. Contribution of this paper

In this regard, the present work contributes to infectious disease modeling in several respects. First, it targets the prediction of influenza breakthrough infections, an underexplored area of research. By training models on real-world pharmacy testing data, we present a new and important data stream that includes diagnostics, vaccination history and risk factors at the individual level and at population scale. This data source offers both timeliness and granularity, which centralized surveillance systems usually lack.

Second, multiple models are evaluated, including logistic regression, random forest and XGBoost. The models are then assessed with multi-fold cross-validation and validated under real-life conditions. XGBoost yielded the best result, a ROC-AUC of 0.90, and was implemented in a live system installed at various retail pharmacies. Such an approach is more practical than most prior works, which are either theoretical or emphasize retrospective data analysis.

Lastly, the approach illustrates that community data and an AI model can help improve decision-making in pharmacies. Our system updates the risk scores of breakthrough infections and assists pharmacists and clinicians in identifying high-risk patients and making recommendations for them. The contribution of this paper lies not only in model development but, more importantly, in translating a model into practice, with health policy implications for vaccine-preventable diseases.

2.4. Immunological mechanisms influencing breakthrough infections in the context of bacterial, protozoan and viral vaccines

Figure 1: Immune Pathways Affecting Vaccine Breakthrough Infections.

2.4.1. Immune response and vaccine efficacy: The diagram depicts the interaction between components of the immune system, most notably CD8+ T cells, CD4+ T cells and B cells, and how they collectively work to produce specific antibodies against vaccines or pathogens. These antibodies are essential in the reduction of bacterial load and disease severity^7. Nonetheless, the diagram highlights that drug resistance and vaccine breakthrough infections are still possible even with these immune defenses.

This is most striking in those with compromised immune responses or where the pathogen has developed strategies to evade immune detection. Loss of vaccine effectiveness can be due to lowered antibody levels, failure to activate T cells or pathogen immune evasion. When designing and testing any predictive system to identify at-risk individuals, these should be considered.

2.4.2. Role of coinfections and protozoa: The bottom half of the figure highlights how coinfections with other pathogens and protozoa can undermine the immune response to vaccines. Such concurrent infections can prevent the activation of immune cells, thus lowering antibody levels and T-cell-mediated immunity. This results in reduced vaccine efficacy and increased susceptibility to breakthrough infections.

This is especially applicable to real-world pharmacy data, where comorbidities and concurrent infections are usually present in patient histories. These factors are essential to integrate into feature engineering in predictive modeling, as demonstrated in our system architecture. Knowledge of these underlying biological interactions improves the interpretability of the model and helps ensure that predictions fit clinical and immunological realities.

2.4.3. Implications for predictive modeling: From a systems design point of view, the biological processes demonstrated in this figure reinforce the necessity of including variables like time since the last vaccine dose, comorbidities and demographic variables in machine learning models. These immunological findings justify why some features predict breakthrough infections and justify the inclusion of such attributes in feature engineering.

By capturing how immunologic complexity translates into heterogeneous responses to vaccines, our predictive system is strengthened and better linked to biomedical data. This improves risk stratification accuracy and the potential public health benefit of the deployed system.

3. Data and Methods

3.1. Data Sources

For this study, data were compiled from multiple commercial and public health sources to give a broad view of influenza testing, vaccination and individual risk factors.

Pharmacy testing data consisted of anonymized records from three large US retail pharmacy companies, CVS Health, Walgreens and Walmart^8-11, collected from October 2022 to March 2023. These datasets contain over 1,200,000 records from rapid antigen and RT-PCR influenza tests and include testing location, test result, test date and time, and basic demographic data such as age, gender and patient zip code. All identifiers that could be linked to individual patients were removed in accordance with HIPAA guidelines, so the data were anonymized before any modeling activities were conducted.

CDC demographic data provided spatial context on population density, age distribution and the historical burden of flu at the county level. These factors were integrated to improve spatial modeling and to account for community-level effects.

Vaccination data were collected from pharmacy immunization records and from immunization coverage data published by the CDC. Although detailed information about individual vaccinations was not always available due to anonymization procedures, the pharmacy datasets contained binary indicators of patients' self-reported influenza vaccination within the past year. Where linked electronic vaccination records were available, further features such as the date of the last vaccine and the type of vaccine were used.

3.2. Preprocessing

Before model building, the raw data was processed through several preprocessing steps. All Personally Identifiable Information (PII) was masked and a secure tokenization scheme was implemented to enable consistent yet anonymous tracking of patient encounters over repeated visits and test types. Duplicates, inconsistent test result formats and incomplete records were removed.
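
The tokenization step can be illustrated with a minimal sketch. The source does not specify the scheme that was used; the example below assumes a keyed hash (HMAC-SHA256), and the field names, the `SECRET_KEY` value and the helper names `tokenize_patient` and `strip_pii` are all hypothetical. The point it shows is that the same patient maps to the same token across visits while the raw identifier is never stored.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would be loaded from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def tokenize_patient(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable, keyed-hash token for a patient identifier."""
    normalized = patient_id.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def strip_pii(record: dict, pii_fields=("name", "phone", "patient_id")) -> dict:
    """Drop the listed PII fields and attach a token in their place."""
    cleaned = {k: v for k, v in record.items() if k not in pii_fields}
    cleaned["patient_token"] = tokenize_patient(record["patient_id"])
    return cleaned


visit_1 = strip_pii({"patient_id": "A123", "name": "X", "phone": "555", "age": 70})
visit_2 = strip_pii({"patient_id": " a123 ", "name": "X", "phone": "555", "age": 70})
# Both visits carry the same token, so encounters can be linked without PII.
print(visit_1["patient_token"] == visit_2["patient_token"])
```

A keyed hash is chosen here over a plain hash because, without the key, tokens cannot be recomputed from a guessed identifier; whether the deployed system made the same choice is not stated in the source.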

Feature engineering was at the core of the modeling process. We extracted variables like:

· Vaccination status (binary: vaccinated/unvaccinated)

· Time since the previous vaccine dose (in days, binned as <90, 90–180, >180 days)

· Age ranges (e.g., 0–18, 19–49, 50–64, 65+)

· Intensity of influenza transmission at the region level (estimated from CDC FluView regional estimates)

· Self-reported presence of comorbidities (e.g., diabetes, asthma, cardiovascular disease)

All categorical features were one-hot encoded and continuous variables normalized to aid model convergence.
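
The feature engineering described above can be sketched as follows. This is an illustrative example only: the record field names, the helper functions and the choice of min-max scaling for normalization are assumptions, since the source states only that categorical features were one-hot encoded and continuous variables normalized. The bin edges follow the ranges listed above.

```python
DOSE_BINS = ["<90", "90-180", ">180"]
AGE_BINS = ["0-18", "19-49", "50-64", "65+"]


def bin_days_since_dose(days):
    """Map days since the last dose to the three bins listed in the text."""
    if days < 90:
        return "<90"
    if days <= 180:
        return "90-180"
    return ">180"


def bin_age(age):
    """Map age in years to the age ranges listed in the text."""
    if age <= 18:
        return "0-18"
    if age <= 49:
        return "19-49"
    if age <= 64:
        return "50-64"
    return "65+"


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


def min_max(values):
    """Scale a list of numbers to [0, 1] (assumed normalization method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_features(record):
    """Turn one hypothetical patient record into a flat feature dict."""
    features = {
        "vaccinated": int(bool(record["vaccinated"])),
        "has_comorbidity": int(len(record.get("comorbidities", [])) > 0),
    }
    features.update(one_hot(bin_days_since_dose(record["days_since_dose"]), DOSE_BINS, "dose"))
    features.update(one_hot(bin_age(record["age"]), AGE_BINS, "age"))
    return features


example = {"vaccinated": True, "days_since_dose": 200, "age": 70, "comorbidities": ["diabetes"]}
print(build_features(example))
```

Regional transmission intensity, a continuous variable, would be scaled across all records with a function such as `min_max` before being added to each feature dict.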

3.3. Predictive modeling approach

We applied and compared three supervised learning algorithms: Logistic Regression, Random Forest and Extreme Gradient Boosting (XGBoost), using Python's scikit-learn and XGBoost libraries. The goal was to predict the risk of breakthrough infection, defined as a positive influenza test in a patient who reported receiving an influenza vaccine in the same season.

The dataset was randomly split into 70% training and 30% test sets. We applied 5-fold cross-validation on the training set for model selection and hyperparameter tuning through grid search and Bayesian optimization (Optuna). The evaluation metrics were:

· Precision, to estimate the ratio of true positive breakthrough cases out of all predicted positives

· Recall, to estimate the sensitivity of the model to correctly classify breakthrough cases

· F1-score, as a harmonic mean between precision and recall

· ROC-AUC, to quantify the model's power to discriminate between breakthrough and non-breakthrough infections across thresholds

Feature importance scores were extracted for the Random Forest and XGBoost models to determine the contribution of each variable to the predictive output.
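
To make the four evaluation metrics concrete, the sketch below computes them from scratch on a small, made-up set of labels and scores. It is not the evaluation code used in the study, which relied on scikit-learn; in practice `precision_score`, `recall_score`, `f1_score` and `roc_auc_score` from `sklearn.metrics` would be used instead. The 0.5 decision threshold and the toy data are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = breakthrough case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half. Quadratic time; fine for a demo."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (fabricated for illustration only).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={roc_auc(y_true, scores):.2f}")
```

Precision, recall and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the raw scores and summarizes discrimination across all thresholds.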

3.4. System deployment

To support real-time risk scoring integrated into the patient record, the system was built as a set of microservices running on the Google Cloud Platform (GCP) and orchestrated with Kubernetes^12-16. The predictive model was exposed as a RESTful API and integrated into the pharmacies' electronic health systems.

Test data from pharmacy sites were received in real time and predictions were returned within milliseconds. The system flagged high-risk individuals with a corresponding confidence score and a short reason such as “High risk due to time since last dose >180 days and age > 65”.

The deployment environment was designed to scale geographically across regional pharmacies, using encrypted connections and access controls that restricted critical data to authorized persons. In addition, a monitoring dashboard tracked system performance, along with changes over time in predictions and regional infection rates, making it possible to retrain the system as viral behavior and seasonality change.
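
The shape of such a scoring response can be sketched as below. This is a hypothetical illustration, not the deployed service: the response field names, the `HIGH_RISK_THRESHOLD` value and the rule-based way of generating reasons are assumptions. The source does not say how reasons were derived; a production system might instead derive them from per-patient SHAP values.

```python
# Hypothetical cut-off; the source does not state the threshold that was used.
HIGH_RISK_THRESHOLD = 0.5


def risk_reasons(patient):
    """Collect short, human-readable reasons from simple rules (assumed)."""
    reasons = []
    if patient.get("days_since_dose", 0) > 180:
        reasons.append("time since last dose >180 days")
    if patient.get("age", 0) > 65:
        reasons.append("age > 65")
    if patient.get("comorbidities"):
        reasons.append("reported comorbidities")
    return reasons


def format_risk_response(patient, score):
    """Build the payload a risk-scoring endpoint might return for one patient."""
    level = "High" if score >= HIGH_RISK_THRESHOLD else "Low"
    message = f"{level} risk"
    reasons = risk_reasons(patient)
    if level == "High" and reasons:
        message += " due to " + " and ".join(reasons)
    return {"risk_level": level, "confidence": round(score, 2), "explanation": message}


# The score would come from the trained model; 0.87 is a made-up value.
print(format_risk_response({"days_since_dose": 200, "age": 70}, 0.87))
```

Keeping the explanation as a short plain-language string matches the example message quoted above and keeps the payload easy to display in a pharmacy interface.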

3.5. Predictive modeling system for influenza breakthrough infections

Figure 2: Predictive Modeling System for Influenza Breakthrough Infections.

3.5.1. Data sources: These are the system's main input streams:

· Pharmacy Test Data (e.g., CVS, Walgreens, Walmart): Diagnostic test results for patients at pharmacy sites.

· CDC Demographic Data: Patient demographics include age, gender and regional data.

· Vaccination Records: Aggregated information on patients' influenza vaccination records, e.g., date and vaccine type.

3.5.2. Data processing and modeling: This part reflects the essence of data science and the machine learning process:

· Data Cleaning & Anonymization: Guarantees that received data is organized, normalized and free from Personally Identifiable Information (PII).

· Feature Engineering: Formulates salient features like vaccination status, age, comorbidities and geography for modeling.

· Model Training: Employs machine learning algorithms such as Logistic Regression, Random Forest and XGBoost to forecast the possibility of breakthrough infections.

· Model Evaluation: Evaluates model performance on metrics like ROC-AUC, F1-Score, Precision and Recall.

3.5.3. Deployment infrastructure: This layer enables real-time operationalization of the model:

· Cloud Infrastructure (AWS/GCP): Deploys the model and manages scalability, data storage and computation.

· Real-Time Risk Scoring Engine: Applies the trained model to incoming data to produce instant risk scores.

· Risk Stratification API: Streams the results to end-user systems through APIs for actionable decision-making.

3.5.4. End users: These are the systems or staff that gain value from the predictions of the model:

· Pharmacy interface (Pharmacists): Enables pharmacists to recognize patients who are at high risk and recommend care or indicate physician follow-up.

· Public health dashboard (CDC, State Health): Compiles the risk scores and trends to aid surveillance and public health interventions.

4. Results and Discussion

To test the performance of our predictive models, we used a hold-out sample of 120,000 patient encounters from the CVS, Walgreens and Walmart chains. These encounters occurred between October 2022 and March 2023, when influenza activity is typically highest. The sample covered an ethnically and geographically diverse population, which allowed us to assess the generalizability of the model.

4.1. Model performance

We compared three supervised learning approaches, namely Logistic Regression, Random Forest and XGBoost (Figure 3), on their ability to identify breakthrough influenza infections among vaccinated individuals. The performance of each model is presented in Table 1 below.

Table 1: Model Performance on Test Set.

Model	Precision	Recall	F1-score	ROC-AUC
Logistic Regression	0.67	0.58	0.62	0.72
Random Forest	0.81	0.77	0.79	0.86
XGBoost	0.85	0.82	0.83	0.90

Figure 3: Graphical Representation of Model Performance on Test Set.

XGBoost performed markedly better than the other models: it had the highest precision, recall and F1-score on the test set, with a ROC-AUC of 0.90. These results indicate that the method discriminates well between vaccinated patients who contracted the flu and those who did not. XGBoost's ability to combine many trees, to capture nonlinear relationships between the features and the outcome, and to rank features made it well suited to this classification task.
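
As a quick consistency check on Table 1, the F1-score of each model can be recomputed from its reported precision and recall, since F1 is their harmonic mean. The short sketch below does this; the dictionary simply restates the values from Table 1, and the variable and function names are illustrative.

```python
# (precision, recall, reported F1) as listed in Table 1.
table_1 = {
    "Logistic Regression": (0.67, 0.58, 0.62),
    "Random Forest": (0.81, 0.77, 0.79),
    "XGBoost": (0.85, 0.82, 0.83),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for model, (precision, recall, reported_f1) in table_1.items():
    computed = f1_from(precision, recall)
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {reported_f1:.2f}")
```

For all three models the recomputed value rounds to the reported F1-score, so the table is internally consistent.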

4.2. Feature importance

To understand the model's decision-making, feature importance was examined using SHAP (SHapley Additive exPlanations) values and the built-in feature importance of the XGBoost classifier (Figure 4). The most important predictors of breakthrough infections are listed below (Table 2):

Table 2: Feature Importance (Top 5 Predictors from XGBoost).

Feature	Importance (%)
Time since last vaccine dose (>6 months)	29%
Presence of comorbidities	21%
Age over 65	17%
Geographic region	12%
Prior influenza infection history	9%

Figure 4: Graphical Representation of Feature Importance (Top 5 Predictors from XGBoost).

The greatest contribution to model decisions was made by Time since the last vaccine dose (> 6 months), with 29% of the total feature importance. This is consistent with other literature describing waning vaccine-induced immunity.

· The presence of comorbidities (such as diabetes, asthma, COPD and cardiovascular disease) accounted for 21%, emphasizing the susceptibility of immunocompromised patients.

· Age over 65 accounted for 17%, supporting CDC evidence that older individuals are at high risk, even after vaccination.

· Geographic area (categorized according to regional influenza transmission intensity) accounted for 12%, with high-incidence zip codes strongly correlated with breakthrough risks.

· Prior influenza infection history, derived from legacy test logs, contributed 9%, suggesting partial immunity or behavioral factors associated with reinfection risk.

These results support targeted interventions, e.g., booster advice or clinical outreach, for the high-risk groups defined by these characteristics.

4.3. Real-world deployment

The XGBoost model was implemented as a cloud-based microservice in a pilot study at 215 pharmacies in California and Texas during the 2022–2023 flu season. The model was embedded in the pharmacy testing process and provided real-time risk scores for every tested patient; these scores were viewable only by pharmacy clinicians.

During the five-month duration, the system identified around 14,200 individuals at high risk for breakthrough infection. Of these:

· 12.4% were positive for influenza after vaccination.

· Pharmacists utilized these alerts to inform patients to get instant medical consultation or antiviral therapy, particularly in high-transmission areas.

This intervention facilitated active case management, minimizing potential delays in treatment and hindering local transmission chains. Anecdotal feedback from the pharmacy workforce suggested that the alerts were straightforward to understand and imposed little overhead on clinical workflows.

4.4. Comparison with CDC flu surveillance

We correlated weekly numbers of flagged breakthrough cases against CDC FluView regional influenza positivity rates to test how effectively our model accounted for population-level flu dynamics. A close correlation appeared upon comparison:

· A Pearson correlation coefficient (r) of 0.81, p < 0.001, between CDC flu rates and model outputs, signifies a robust and statistically significant relationship.

· Notably, our model identified increases in breakthrough cases 1-2 weeks prior to corresponding CDC regional positivity peaks, indicating that pharmacy-based predictive analytics can serve as an early warning system.
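
The correlation and lead-time analysis described above can be illustrated with the following sketch. The weekly series are synthetic and constructed so that the model series leads the CDC series by exactly one week; they are not the study's data, and the helper names `lagged_correlation` and `best_lead` are hypothetical. The sketch computes the Pearson coefficient and then finds the lag at which the two series align best.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(model_series, cdc_series, lag):
    """Correlate model flags in week t with CDC positivity in week t + lag."""
    if lag == 0:
        return pearson_r(model_series, cdc_series)
    return pearson_r(model_series[:-lag], cdc_series[lag:])


def best_lead(model_series, cdc_series, max_lag=3):
    """Return the lag (in weeks) with the highest correlation."""
    return max(range(max_lag + 1), key=lambda k: lagged_correlation(model_series, cdc_series, k))


# Synthetic weekly data: CDC positivity (%) and flagged cases one week ahead of it.
cdc_positivity = [2, 3, 5, 9, 14, 18, 15, 10, 6, 4, 3, 2]
flagged_cases = [30, 50, 90, 140, 180, 150, 100, 60, 40, 30, 20, 15]

print("same-week r:", round(lagged_correlation(flagged_cases, cdc_positivity, 0), 2))
print("best lead (weeks):", best_lead(flagged_cases, cdc_positivity))
```

A lead of one or two weeks in such an analysis is what would support the early-warning interpretation; the p-value reported above would additionally require a significance test, which this sketch omits.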

This result highlights the value of incorporating retail diagnostics into wider public health surveillance, especially where conventional data are underreported or delayed.

4.5. Ethical and operational considerations

All procedures for data handling and model deployment were HIPAA-compliant, and only anonymized and tokenized data were used across the pipeline. Of note, no patient-level identifiable data were preserved post-inference. The system produced risk scores without retaining identifiable outputs, and results were applied solely at the point of care to inform pharmacy-based decisions.

In addition, we put in place governance practices around data usage, access control and bias reduction. Model fairness audits revealed no meaningful performance differences by race or gender strata, although repeated audits are advisable for future growth. Community education sessions were conducted in pilot areas to inform patients of the use of AI in medical decision-making.

5. Conclusion

This paper described the development of a real-world predictive model for influenza breakthrough infection based on community pharmacies' diagnostic and vaccination records. Our evaluation showed that the proposed XGBoost model outperformed the less complex methods, with a ROC-AUC of 0.90 and high precision, recall and F1-score. The study contributes to infectious disease modeling by harnessing a high volume of near real-time pharmacy test data from over 1.2 million patient tests across major chains throughout the United States.

The pilot across 215 pharmacy locations showed that machine learning models can be implemented at the point of care. Of the roughly 14,200 people flagged as high risk, 12.4% tested positive for influenza despite vaccination (Section 4.3), allowing pharmacists to initiate appropriate interventions such as recommending further consultation or antiviral therapy.

Consequently, this model provides a feasible and cost-efficient way of improving pharmacy-based surveillance and filling gaps in the existing public health system. Because centralized reporting suffers from latency and underreporting, pharmacy data offer near real-time, community-level information that is otherwise hard to obtain. Likewise, the high correlation with the CDC FluView trends (r = 0.81) indicates that retail-based prediction can act as a leading indicator of flu epidemics in those regions.

5.1. Future work

There are several avenues for developing and further applying the predictive system presented in this paper. One major area would be linking the model to state immunization registries, which would further enhance risk stratification by using the actual vaccination dates and doses recorded in the registry. This would help to overcome a major weakness of the current model, which often relies on self-reported data or only partially integrated vaccination records.

Extending the model to other respiratory illnesses such as RSV or SARS-CoV-2 (COVID-19) requires further research. Given that these viruses can co-infect with influenza and produce similar symptoms, a multi-pathogen model that estimates the probability of co-infection could improve diagnostic support and triage accuracy at the pharmacy level. Such a model would also support syndromic surveillance and pandemic preparedness strategies.

Lastly, we will explore self-learning methods that can retrain the existing models on incoming data streams. This would help avoid the degradation of model accuracy caused by changes in the virus, in vaccine effectiveness and in population behavior. With these improvements in place, an intelligent surveillance system could be integrated into the structure of retail pharmacy and offer timely, targeted public health interventions to its users.

6. References
