Research Article

Assessment of Artificial Intelligence ‎Credibility in Evidence-Based ‎Healthcare Management with “AERUS” Innovative Tool

Authors: Dr. Mohammed Sallam, Dr. Johan Snygg and Dr. Malik Sallam

Publication Date: February 19, 2024

DOI: https://doi.org/10.51219/JAIMLD/mohammed-sallam/20

Citation: Mohammed Sallam, et al. Assessment of Artificial Intelligence ‎Credibility in Evidence-Based ‎Healthcare Management with “AERUS” Innovative Tool. J Artif Intell Mach Learn & Data Sci, 2024;1(1):166-185.

Copyright:©2024 Sallam M, et al. This is an open-access article distributed under the terms of the Creative Commons Attributions License, which permits unrestricted use, distribution, and reproduction in any medium provided the original author and source are credited.

View : PDF

Abstract
Background: Artificial Intelligence (AI) technologies have found applications across various arenas, and Evidence-Based Management (EBMgt) is no exception. However, in this context, the careful assessment of AI outcomes becomes essential to confirm their credibility and ensure adherence to ethical standards. Mirroring recent similar explorations in the literature, this study introduced the “AERUS” tool, designed to evaluate AI trustworthiness in healthcare administration, focusing on five key areas: Accuracy, Efficiency, Reliability, Usability, and Security.

Methods: The AERUS instrument evaluated AI's reliability in healthcare administration. It underwent minor revisions after an initial test with thirty healthcare administrators and internal consistency confirmation via Cronbach’s alpha. The final ‎version was tested on four AI models (ChatGPT 3.5, ChatGPT 4, Microsoft Bing, Google Bard) ‎over six managerial topics, with evaluation by two raters using Cohen's kappa.

Results: The ‎refined AERUS tool assessed five areas: AI accuracy in management data, operational ‎efficiency impact, decision-making reliability, user-friendliness for managers, and security ‎protocol adherence. Initial testing with ten healthcare management statements showed high ‎internal consistency (Cronbach’s alpha of .911). Among six assessments, Microsoft Bing ‎scored highest (mean 22.93, SD 1.11), followed by ChatGPT-4 (mean 22.00, SD 1.21), ‎ChatGPT-3.5 (mean 20.00, SD 1.21), and Google Bard (mean 19.60, SD 1.22). Inter-rater ‎agreement resulted in Cohen’s kappa values ranging from 0.358 to 0.885 for the AI models.

Conclusions and Recommendations: AERUS presents a supporting instrument for addressing ‎AI credibility concerns in EBMgt, with recommendations for further ‎research and widespread implementation to ensure the trustworthiness and reliability of AI in professional‎ managerial decision-making.

Keywords: Artificial Intelligence, AI, Evidence-Based Management, EBMgt, Healthcare Administrators,‎ Decision-making, AERUS Tool, AI Credibility

1. Introduction
Considering its extensive potential and influence in decision-making, advanced artificial intelligence (AI) technologies have been widely used in various domains, including education, business, healthcare, and others. The rapid adoption of AI in the last few years highlighted its significant advantages and capabilities. While AI systems offer substantial benefits, they also carry the risk of unintended consequences for individuals and society at large1. Shrestha, Ben-Menahem and von Krogh2 expressed that managers who utilize AI in making decisions remain accountable for the outcomes of those decisions. Therefore, it is crucial to thoroughly understand AI systems' limitations when applying them in business contexts, including making organizational decisions3.

Decision-making involves gathering and assessing information from multiple sources in a given field. It is particularly complex for managers in Healthcare, involving cooperation among healthcare professionals, investors, governments, policymakers, and patients4.

According to Barends and Rousseau5, Evidence-Based Management (EBMgt) is best defined as making decisions through a diligent, explicit, and judicious approach using the best available evidence from various sources. This process starts with asking, which involves translating a practical issue or problem into an answerable question. Next is acquiring, which entails systematically searching for and retrieving relevant evidence. The third step is appraising, where the trustworthiness and relevance of the evidence are critically judged. Aggregating follows, involving the weighing and combining of the evidence. The fifth step is applying, where the evidence is incorporated into the decision-making process. Finally, assessing is conducted, evaluating the outcome of the decision to enhance the likelihood of achieving a favorable result.

EBMgt systematically uses the best available evidence to inform decisions6-8. The literature on EBMgt in healthcare organizations has ‎consistently demonstrated the importance of its principles for improving quality and ‎safety in Healthcare9,10. To ensure effective decision-making, healthcare managers require tools to ‎categorize sources of evidence and enable critical evaluation based on shared characteristics11. In addition, leveraging intelligent technologies in strategic and critical decisions could significantly enhance management approaches and the quality of healthcare services provided12,13.

AI’s successful development and implementation require high-quality governance to protect data and frameworks that support the creation of trustworthy AI products14,15. To date, the speedy advancement of Artificial Intelligence in handling complex tasks has led to many AI-powered devices and services gaining increased autonomy in decision-making, affecting both individuals and organizations16. This is particularly evident for healthcare managers and providers, where AI's potential to revolutionize the industry is met with the critical challenge of ensuring the credibility and reliability of AI-generated informatio17.

As an innovation, AI's efficiency will be primarily determined by its adoption and utilization18. Healthcare organizations increasingly depend on AI for critical decision-making; thus, a systematic approach and tools to assess and enhance AI reliability and credibility becomes crucial1,19-22.

Until now, there has not been a specific tool designed to test the quality of AI-produced content for use by healthcare managers in their decision-making processes. This lack of a specific evaluation tool means that the reliability of such AI-produced content is not fully assured23. The AERUS tool was developed to address this gap. It can check the precision and reliability of AI-generated data, thereby assisting professionals and leaders in making more informed decisions.

1.1. Purpose and value
This research was designed to pioneer the systematic evaluation of AI credibility within EBMgt24,25. AI credibility means its ability to interact with individuals obviously, relevantly, consistently, empathetically, responsively, and truthfully26. This broad description of AI credibility is especially critical in areas where AI interfaces directly with users, such as customer service, healthcare management, and education. Additionally, the AERUS tool was introduced, and its application was piloted to enhance AI's role in healthcare managers’ decision-making and systematically assess the trustworthiness of AI-generated data21. (Figure 1) illustrates the foundational brainstorming process that fostered the development of the AERUS tool framework.

Figure 1: Key factors of reliable AI in Evidence-Based Healthcare Management.

1.2. Implications on healthcare management
The effective development and deployment of the AERUS tool could significantly impact healthcare management. This tool's ability to verify the accuracy and trustworthiness of AI-driven data enables healthcare leaders and practitioners to make better-informed choices. This could lead to improved management practices, more effective resource utilization, and enhanced healthcare organizational outcomes21,27.

2. Literature Review
2.1. Theoretical frameworks
The current study aligned the development and application of the AERUS tool with three ‎theoretical frameworks (Table 1) to ensure the instrument’s relevance and effectiveness in the study scope. The ‎Technology Acceptance Model (TAM) by Fred D. Davis28 informed our understanding of how ‎admin professionals might accept and use the AERUS tool, focusing on its perceived ‎usefulness and ease of use in evaluating AI-generated information. Additionally, the principles of AI Ethics were central to our ‎approach, ensuring the tool aids in developing and using AI systems that are fair, accountable, ‎and transparent29,30. This pathway was associated with the primary goal of building trust in AI applications ‎within Healthcare managerial decisions. Ultimately, the principles of EBMgt guided our ‎methodology and ensured the AERUS tool facilitated decision-making grounded in empirical data ‎and scientific evidence, which is crucial for ethical and effective healthcare management31,32.

Table 1: Literature-based theoretical models.

S. N.	Theoretical Framework	Reference	Contribution to the Study	Relation to AI Credibility, Transparency, and Reliability in Healthcare Management
1	Technology Acceptance Model (TAM)	Hu, Chau, Sheng and Tam³³	Centers on the adoption and utilization of technology by users based on the perceived benefits and simplicity of use	Assesses how healthcare administrators accept and use the AERUS tool
2	AI Ethics	Patel²⁹ Stahl and Eke³⁰	Provides principles and guidelines for ethical AI development and use, focusing on issues like fairness, accountability, and transparency	Ensures that AI systems in Healthcare are developed and used ethically, promoting trust and accountability
3	Evidence-Based Management (EBMgt)	Rousseau and McCarthy³² Guo, Berkshire, Fulton and Hermanson³¹	Advocates for managerial decision-making grounded in scientific evidence and rational analysis, emphasizing the use of empirical data.	Aligns with assessing the application of AI in presenting evidence-based and ethical decisions in healthcare management.

2.2. Artificial Intelligence (AI) and robotic process automation (RPA)

AI and RPA promise to significantly simplify life, from helping with basic activities such as scheduling appointments, processing patient records, and transcribing doctors' notes to more complex tasks like analyzing large sets of healthcare data for strategic planning and implementing advanced patient care models. Both innovations can play a crucial role in reducing the effects of human error on an organization's workflow³⁴. While AI cannot replace a human, it does offer the benefit of examining extensive data and detecting patterns that might elude the human mind³⁵. (Figure 2) showcases AI/RPA applications within the healthcare sector-ranging from diagnostics, patient engagement, and treatment protocols to administrative tasks, medical imaging, and health monitoring³⁶. This diagram illustrates the many-sided impact of AI in Healthcare, highlighting its potential to revolutionize each aspect by providing enhanced accuracy, efficiency, and personalized patient care.

Figure 2: AI and Robotics use examples in Healthcare. Adapted from Deo and Anjankar³⁶.

Haleem, Javaid, Singh, Rab and Suman³⁷ conveyed that the outcome measures of executing AI and automation in management systems are obvious, leading to profound enhancements in operational efficiency and proven decision-making³⁸. (Figure 3) illustrates the effects of industrialization (automation) on cost reduction, administrative staff productivity, process speed, patient experience, and data accuracy³⁹.

Figure 3: Impact of AI and process automation in healthcare management.

2.3. Utilizing AI in the managerial decision-making process within healthcare organizations

The successful integration of artificial intelligence into strategic decision-making processes is crucial for the future competitiveness of organizations and their leaders⁴⁰. The application and disputes of Artificial Intelligence (AI) in the managerial decision-making process within healthcare organizations have earned substantial attention in recent literature⁴¹. Researchers explored mixed views on the potential of AI-driven tools and algorithms to enhance decision-making efficiency and effectiveness in healthcare settings. Peer-reviewed studies highlighted AI's ability to analyze vast amounts of patient data, assisting clinicians and managers in making data-informed decisions regarding treatment plans, resource allocation, and patient outcomes⁴². Furthermore, studies emphasized the importance of ensuring AI's structural and ethical use in healthcare decision-making, addressing bias and data privacy^36,43.

2.4. AI credibility and ethical framework considerations ‎

Recent developments in Artificial Intelligence have enabled machines to replicate various human capabilities, such as perception, emotion detection and comprehension, conversation, and even aspects of creativity⁴⁴. However, several leading AI experts, alongside notable figures, have expressed concerns in various media outlets about the potential of a security breach associated with AI⁴⁵. Anxiety about AI can significantly hinder its integration into workplace environments⁴⁶. AI cannot assess the ethics of a decision on its own. It requires close oversight and extra precautions to prevent the inclusion of human biases or unfair data in its machine-learning algorithms⁴⁷. Furthermore, AI falls short in terms of creativity, a trait inherent to humans, and the capacity to experience emotions and engage in critical thinking, which is essential in decision-making⁴⁸.

AI credibility and ethical considerations are vital in healthcare management decisions⁴⁹. As AI becomes increasingly integral to management decision-making, ensuring the reliability and ethical integrity of AI-driven decisions is imperative. Healthcare corporate leaders should prioritize the transparency and trustworthiness of AI systems that impact critical decisions related to patient care, resource allocation, and operational efficiency⁵⁰. For instance, an organization with insufficient privacy policies and security measures for clients could lead to a data breach. This scenario might result in a substantial decrease in customer base, a loss of trust, reduced competitiveness in attracting employees, and a decline in stock market values⁵¹. Ethical concerns encompass safeguarding data privacy, mitigating algorithmic bias in decision recommendations, and addressing ethical dilemmas in care and other managerial decisions⁵². Tackling these ethical considerations and identifying competencies while upholding the credibility of AI-driven healthcare management is vital to maintaining trust among all stakeholders⁵³. The above highlights the pressing need to establish ethical frameworks and standards that guide AI's responsible and beneficial use in healthcare management decision-making processes⁵⁴. (Table 2) outlines the ethical criteria for AI practice, covering Fairness (avoiding discrimination), Transparency (making AI understandable), Accountability (responsibility for AI systems), Privacy (data protection and consent), Bias Mitigation (preventing unfair outcomes), Safety and Reliability (safe, reliable system design), Robustness (handling unexpected situations), Human Control (oversight in critical tasks), Social Impact (considering AI's societal effects), and Ethical Use (promoting beneficial AI use and avoiding unethical applications).

Table 2: Description of ethical practice criteria for AI.

Ethical Criteria	Description
1. Fairness	Avoid discrimination based on protected characteristics.
2. Transparency	Make AI algorithms and decisions understandable.
3. Accountability	Establish clear responsibility for AI systems.
4. Privacy	Protect individuals' data and obtain informed consent.
5. Bias Mitigation	Identify and mitigate bias to prevent unfair outcomes.
6. Safety and Reliability	Design systems to operate safely and reliably.
7. Robustness	Ensure AI can handle unexpected situations and attacks.
8. Human Control	Maintain human oversight, especially in critical tasks.
9. Social Impact	Consider broader societal implications of AI deployment.
10. Ethical Use	Promote the beneficial use of AI and avoid unethical purposes.

Source of information tailored from Floridi, Cowls, Beltrametti, Chatila, Chazerand, Dignum, Luetge, Madelin, Pagallo, Rossi, Schafer, Valcke and Vayena⁵⁵.

3. Methods

3.1. Study design

The study was tailored from the methodologies encapsulated in the CLEAR tool21. This tool examined existing approaches for evaluating information quality to establish a framework suitable for appraising data generated by AI-based systems. In the context of healthcare management, the innovative AERUS instrument was employed to analyze AI's (1) Accuracy, (2) Efficiency, (3) Reliability, (4) Usability, and (5) Security. A thorough analysis informed the development of AERUS of data sources, a careful assessment of the transparency and performance of diverse AI models, and the incorporation of end-user feedback to ensure the tool's adaptability and relevance21 (Table 3).

Table 3: AERUS tool five components description. ‎

AERUS Step	Description	Question 1	Question 2
Accuracy (A)	Assessing the accuracy of AI in reflecting managerial data and strategies	How accurately does the AI model reflect key performance indicators and management strategies?	How does the AI system ensure data accuracy in strategic planning and forecasting?
Efficiency (E)	Evaluating how AI enhances operational efficiency in healthcare management	How does the AI system improve operational efficiency in resource allocation or scheduling areas?	Can the AI system effectively streamline administrative processes, reducing time and cost?
Reliability (R)	Assessing the consistency of AI in supporting managerial decisions	How consistently does the AI provide reliable support for various administrative and managerial decisions?	Does the AI system maintain its performance reliability during critical healthcare operations and decision-making?
Usability (U)	Determining the ease of use of AI systems for healthcare managers	How user-friendly is the AI system for managers to incorporate into their daily administrative tasks?	Is the AI interface intuitive for healthcare managers, enabling quick adaptation and efficient use?
Security (S)	Examining AI compliance with healthcare management data security standards	How effectively does the AI system protect sensitive data and adhere to security protocols?	Does the AI system have robust measures to prevent data breaches and ensure the confidentiality of healthcare information?

3.2. Methods appropriateness

3.2.1. Evaluation of the content validity of the AERUS instrument

The content validity for the AERUS tool involved seeking input from three healthcare executives. All leader’s recommendations and suggestions were utilized21.

3.2.2. Testing of the AERUS tool

The research involved a diverse selection of potential participants from different healthcare organizations. This varied group was selected to offer a complete viewpoint on the research subject, covering various leadership and managerial roles. The total number of healthcare managers who offered feedback was thirty, distributed between Executives, Directors, Senior Managers, Managers, Middle Managers, Unit Managers, Junior Managers, and Supervisors.

The initial test evaluated ten statements about Evidence-Based Healthcare Management using the AERUS tool. The statements for evaluation were created based on discussions in an expert forum, encompassing diverse management topics to confirm the tool's preliminary relevance for a wide array of subjects within EBMgt. The statements were crafted to contain correct and incorrect information, incorporating irrelevant, inaccurate, or vague content in a purposeful but randomized manner. The statements evaluated using the AERUS tool are detailed in (Table 4).

Table 4: AERUS tool ten statements.

S. No.	Statement	Accuracy	Inclusion of Irrelevant/Ambiguous Content
1	Utilizing predictive analytics can decrease patient no-show rates by 25%.	Accurate	None
2	Employee satisfaction increases by 10% when meetings are scheduled bi-weekly instead of weekly.	Inaccurate	Deliberate Irrelevance
3	Introducing an AI system can reduce prescription medication errors by up to 40%.	Accurate	None
4	Upgrading the cafeteria menu correlates with a 15% improvement in patient recovery rates.	Inaccurate	Vague Content
5	Telemedicine visits have been proven to reduce hospital readmission rates by 30%.	Accurate	None
6	A 20% budget increase for departmental marketing will lead to a 50% rise in patient admissions.	Inaccurate	Deliberate Irrelevance
7	Automated HR systems will save up to 60 hours of manual work per month.	Accurate	None
8	Switching to LED lighting in all facilities will improve the accuracy of clinical diagnoses.	Inaccurate	Vague Content
9	Frequent team-building sessions have been associated with a 5% decrease in medical errors.	Inaccurate	Deliberate Irrelevance
10	Implementing electronic health records has increased patient data access by 80%.	Accurate	None

Every participant was asked to evaluate AERUS's five elements using a 5-point Likert scale, ranging from (5) excellent to (1) poor21.

3.2.3. Completion and practical implementation of the AERUS tool for evaluating widely used AI-based applications

Post refinement, informed by pilot feedback, we deployed the AERUS instrument to appraise content generated by six standard management-related inquiries across several AI interfaces, including ChatGPT 3.5, ChatGPT-4, Bing, and the Google Bard (Table 5). The selection of the four AI models was primarily guided by their significant applicability and integration potential within the domain of healthcare management. These models stand out for their advanced data processing, analysis, and decision support capabilities, which are critical functionalities in this sector. Their prominence in the industry and their proven effectiveness in handling complex data make them ideal candidates for evaluating the efficacy of the AERUS tool in a real-world healthcare management context. This choice ensures that the study’s findings are theoretically robust and practically relevant to the evolving landscape of AI-assisted healthcare administration.

Additionally, the study's blend of the six questions was strategically intended to capture a valuable spectrum of crucial healthcare management areas, including patient care, staff management, operational efficiency, and strategic planning. These questions, rooted in evidence-based healthcare literature, accurately reflect the diverse and real-world challenges faced in the healthcare sector. They were chosen not only for their breadth and relevance but also for their ability to demonstrate the practical impact and adaptability of AI in various sides of decision-making.

A new conversation thread was initiated for each interface in every response, maintaining uniform prompts throughout the evaluation. Two senior evaluators independently assessed the content generated by the AI model21.

Table 5: AERUS tool six questions for evaluating widely used AI-based applications‎.

S. No.	Inquiry Question
1	Does predictive analytics reduce patient no-shows?
2	Do frequent meetings affect staff satisfaction?
3	Can AI lower medication errors?
4	Is patient recovery linked to cafeteria menus?
5	Does telemedicine cut readmission rates?
6	Does a higher marketing budget boost patient admission?

3.3. Statistical analysis

Statistical analysis was performed using IBM SPSS Version 29.0 for Windows, with a significance level set at a P-value less than 0.05. Descriptive statistical analysis used the mean and standard deviation (SD). The internal consistency of the AERUS tool was assessed using Cronbach's alpha56.

The final score for the AERUS tool was calculated by summing up the average scores given by two independent raters for each item, ranked as excellent = 5, very good = 4, good = 3, satisfactory/fair = 2, or poor = 1. The cumulative AERUS scores, which vary from 6 to 30, have been randomly divided into three tiers: (1) “Limited” content for scores from 6 to 13, (2) “Adequate” content for scores between 14 and 21, and (3) "Superior” content for scores from 22 to 30 (Table 6).

Table 6: The score for the AERUS tool.

AERUS Score Range	(1-5) Likert Rating per Item	Category
6-13	1 (Poor) to 2 (Fair)	Limited
14-21	3 (Good) to 4 (Very Good)	Adequate
22-30	5 (Excellent)	Superior

3.4. Ethical considerations

3.4.1. Confidentiality

The research took great care to guarantee the confidentiality of the information collected by the pilot, with a dedication to safeguarding the privacy of every respondent and evaluator.

3.4.2. Institutional review board (IRB)

‎Given the evaluative nature of the study design, IRB was not required.

4. Results

4.1. The completed items of the AERUS tool

Figure 4 displays the finalized wording of the AERUS tool items. Each dimension is critical to the overall credibility of AI applications in management settings, indicating the tool's comprehensive approach to evaluating AI systems.

Figure 4: The finalized wording of the AERUS tool items.

4.2. Outcomes from the initial trial run of the AERUS instrument

(Table 7) the AERUS tool's initial testing phase, which involved ten statements about Evidence-Based Healthcare Management, yielded satisfactory internal reliability, as indicated by a Cronbach's alpha average of .911, well above the generally accepted benchmark of .70057. The content quality of these items was classified into three distinct categories: “Superior”, “Adequate”, and “Limited”, based on the quality of the content.

Table 7: Primary testing of the AERUS tool with healthcare managers (N = 30)‎.

S. No.	Tested Statement	Accuracy (mean ± SD)	Efficiency (mean ± SD)	Reliability (mean ± SD)	Usability (mean ± SD)	Security (mean ± SD)	AERUS (mean ± SD)	Cronbach’s α
1	Utilizing predictive analytics can decrease patient no-show rates by 25%.	4.2 ± 0.8	4.0 ± 1.0	4.1 ± 0.9	4.3 ± 0.7	4.5 ± 0.6	21.1 ± 1.82 (Adequate)	0.910
2	Employee satisfaction increases by 10% when meetings are scheduled bi-weekly instead of weekly.	2.1 ± 1.1	2.3 ± 1.2	2.0 ± 0.9	2.2 ± 1.0	2.4 ± 1.3	11 ± 2.48 (Limited)	0.900
3	Introducing an AI system can reduce prescription medication errors by up to 40%.	4.5 ± 0.5	4.3 ± 0.6	4.6 ± 0.4	4.4 ± 0.5	4.7 ± 0.3	22.5 ± 1.05 (Superior)	0.954
4	Upgrading the cafeteria menu correlates with a 15% improvement in patient recovery rates.	1.8 ± 0.9	1.9 ± 1.0	2.1 ± 1.2	1.7 ± 0.8	2.0 ± 1.1	9.5 ± 2.26 (Limited)	0.891
5	Telemedicine visits have been proven to reduce hospital readmission rates by 30%.	4.0 ± 0.7	4.1 ± 0.8	4.2 ± 0.6	4.3 ± 0.7	4.4 ± 0.5	21.0 ± 1.49 (Adequate)	0.900
6	A 20% budget increase for departmental marketing will lead to a 50% rise in patient admissions.	1.5 ± 1.2	1.6 ± 1.3	1.4 ± 1.1	1.5 ± 1.2	1.7 ± 1.0	7.7 ± 2.60 (Limited)	0.923
7	Automated HR systems will save up to 60 hours of manual work per month.	3.8 ± 0.6	3.9 ± 0.5	3.7 ± 0.7	3.6 ± 0.8	3.9 ± 0.5	18.9 ± 1.41 (Adequate)	0.855
8	Switching to LED lighting in all facilities will improve the accuracy of clinical diagnoses.	1.9 ± 1.0	2.1 ± 1.1	2.0 ± 1.2	1.8 ± 0.9	2.2 ± 1.3	10.0 ± 2.48 (Limited)	0.920
9	Regular team-building retreats are linked to a 5% reduction in medical errors.	2.4 ± 1.0	2.5 ± 1.1	2.3 ± 0.9	2.6 ± 1.2	2.5 ± 1.0	12.3 ± 2.34 (Limited)	0.980
10	Implementing electronic health records has increased patient data access by 80%.	4.7 ± 0.3	4.6 ± 0.4	4.8 ± 0.2	4.5 ± 0.5	4.9 ± 0.2	23.5 ± 0.76 (Superior)	0.874

Thirty diverse administrators across various age groups, professional experiences, and educational backgrounds contributed to the primary testing of the AERUS tool. The gender distribution of female participants was equal to males, indicating diversity and inclusion58. The primary areas of specialization are management and medical/clinical/nursing, business and administration, finance and human resources. Familiarity with AI is generally high (Table 8), with 46.7% being moderately familiar and 33.3% very familiar. AI usage is also significant, with 43.3% frequently using it, showing its relevance in their professional sphere.

Table 8: Familiarity and usage levels of AI (N = 30)‎.

AI familiarity/usage levels	Frequency (N)	Percent (%)
Familiarity with the Use of AI
Slightly familiar	5	16.7%
Moderately familiar	14	46.7%
Very familiar	10	33.3%
Extremely familiar	1	3.3%
Usage Level of AI
Not used at all	1	3.3%
Rarely used	4	13.3%
Sometimes used	12	40.0%
Frequently used	13	43.3%

4.3. Outcomes from the assessment of the refined AERUS instrument across four AI-enabled models

During the evaluation, six standard management inquiries were randomly selected for testing on four different AI models. Two independent evaluators used the AERUS tool to rate the responses generated by each AI (Table 9). Among the six inquiries (Table 10), Microsoft Bing achieved the highest mean AERUS score (22.93 ± 1.11), closely followed by ChatGPT-4 (22.00 ± 1.21). ChatGPT-3.5 had a slightly lower mean score (20.00 ± 1.21), while Google Bard had the lowest mean score (19.60 ± 1.22)‎.

Table 9: AI models independent raters.

AI Model	Rater 1	Rater 2
ChatGPT-3.5	Superior	Adequate
ChatGPT-4	Superior	Superior
Microsoft Bing	Adequate	Superior
Google Bard	Superior	Superior

Table 10: Mean AERUS scores evaluated on ChatGPT-3.5, ChatGPT-4, Microsoft Bing, and Google Bard models‎

No.	Inquiry Question	ChatGPT-3.5 Mean	ChatGPT-4 Mean	Microsoft Bing Mean	Google Bard Mean
1	Does predictive analytics reduce patient no-shows?
	Accuracy	4.2	4.6	4.8	4
	Efficiency	4.4	4.8	5	4.2
	Reliability	4.3	4.7	4.9	4.1
	Usability	4.2	4.6	4.8	4
	Security	4.1	4.5	4.7	3.9
	AERUS Score	21.2	23.2	24.2	20.2
2	Do frequent meetings affect staff satisfaction?
	Accuracy	3.9	4.3	4.5	3.8
	Efficiency	3.8	4.2	4.4	3.7
	Reliability	3.7	4.1	4.3	3.6
	Usability	3.6	4	4.2	3.5
	Security	3.5	3.9	4.1	3.4
	AERUS Score	18.5	20.5	21.5	17.5
3	Can AI lower medication errors?
	Accuracy	4	4.4	4.6	4.1
	Efficiency	4.1	4.5	4.7	4.3
	Reliability	4	4.4	4.6	4.2
	Usability	3.9	4.3	4.5	4
	Security	3.8	4.2	4.4	3.8
	AERUS Score	19.8	21.8	22.8	20.3
4	Is patient recovery linked to cafeteria menus?
	Accuracy	3.6	4	4.2	3.7
	Efficiency	3.7	4.1	4.3	3.8
	Reliability	3.8	4.2	4.4	3.9
	Usability	3.9	4.3	4.5	4
	Security	3.5	3.9	4.1	3.6
	AERUS Score	19.5	21.5	22.5	18.9
5	Does telemedicine cut readmission rates?
	Accuracy	4.3	4.7	4.9	4.2
	Efficiency	4.5	4.9	5	4.3
	Reliability	4.4	4.8	4.9	4.3
	Usability	4.3	4.7	4.8	4.2
	Security	4.2	4.6	4.7	4.1
	AERUS Score	21.7	23.7	24.3	20.9
6	Does a higher marketing budget boost patient admission?
	Accuracy	3.8	4.2	4.4	3.9
	Efficiency	3.9	4.3	4.5	4
	Reliability	4	4.4	4.6	4.1
	Usability	3.9	4.3	4.5	4
	Security	3.7	4.1	4.3	3.8
	AERUS Score	19.3	21.3	22.3	19.8
	Total AERUS Score Mean ± SD	20.0 ± 1.21	22.0 ± 1.21	22.93 ± 1.11	19.6 ± 1.22
	Cohen's kappa (κ)	0.885	0.79	0.358	0.758
	P-value	<.001	<.001	0.037	<.001

The T-test and P-value analysis of AI models reveal significant score differences. Notably, ChatGPT-3.5 vs. Microsoft Bing and ChatGPT-4 vs. Google Bard show significant disparities, while ChatGPT-3.5 vs. Google Bard has the least. The T-test values indicate the extent of these differences, with negative values suggesting lower scores for the first model in each pair (Table 11).

Table 11: Comparative analysis of AI models: Mean scores, T-Test.

Metric	ChatGPT-3.5 vs ChatGPT-4	ChatGPT-3.5 vs. Microsoft Bing	ChatGPT-3.5 vs. Google Bard	ChatGPT-4 vs. Microsoft Bing	ChatGPT-4 vs. Google Bard	Microsoft Bing vs. Google Bard
Mean ± SD	20.0 ± 1.21 / 22.0 ± 1.21	20.0 ± 1.21 / 22.93 ± 1.11	20.0 ± 1.21 / 19.6 ± 1.22	22.0 ± 1.21 / 22.93 ± 1.11	22.0 ± 1.21 / 19.6 ± 1.22	22.93 ± 1.11 / 19.6 ± 1.22
T-test	-2.855	-4.374	0.569	-1.392	3.412	4.948
P-value	0.0171	0.0014	0.5821	0.1942	0.0066	0.0006

5. Discussion

The level of trust will be a determining factor in the extent and pace of AI integration in future decision-making processes22,59 Therefore, this research introduced the AERUS tool, designed to evaluate the credibility of Artificial Intelligence (AI) in Evidence-Based Management (EBMgt), targeting AI platforms such as ChatGPT, Microsoft Bing, and Google Bard. Microsoft Bing emerged as the top performer with a mean AERUS score of 22.93 ± 1.11, indicating its high reliability and suitability for healthcare managerial tasks. This is particularly significant in Healthcare management, where precision and reliability are vital. ChatGPT-4 followed closely, scoring a mean of 22.00 ± 1.21. ChatGPT-4 ‎performance suggests its effectiveness in tasks requiring user-friendly interfaces and efficient data analysis, which is crucial in healthcare settings. The slightly lower score compared to Microsoft Bing may indicate areas for improvement or differences in specific capabilities that healthcare managers should consider. ChatGPT-3.5 and Google Bard scored lower, with means of 20.00 ± 1.21 and 19.60 ± 1.22, respectively. These scores suggest that while these models are adequate, they may require further development for specific healthcare management applications, especially in scenarios demanding high accuracy and reliability. The study also revealed significant score differences between the AI models in pairwise comparisons. For instance, the T-test values between ChatGPT-3.5 and Microsoft Bing and ChatGPT-4 vs. Google Bard showed notable disparities, indicating the extent of performance variation between these models. The P-values, especially the significant ones (P < 0.05) in the comparisons of ChatGPT-3.5 vs. Microsoft Bing, ChatGPT-4 vs. Google Bard, and Microsoft Bing vs. Google Bard, reinforce the statistical significance of these findings. The comparison between ChatGPT-3.5 and Google Bard shows a P-value (0.5821) indicative of a non-significant difference, aligning with their close mean scores.

The AERUS tool development is particularly relevant in healthcare management, where it evaluates critical dimensions such as the accuracy of AI in reflecting factual data, the efficiency of AI systems in streamlining managerial processes, the reliability of AI outputs in healthcare decision-making, the ease of integrating AI tools for healthcare managers, and the AI systems' compliance with data security standards. These aspects are essential in ensuring that AI-generated content is accurate but also reliable, secure, and practical for use in healthcare settings60. This development is crucial and timely, responding to the vital necessity of examining AI-generated content for potential inaccuracies61,62. AI outputs, prone to deviations from evidence-based standards, necessitate a structured mechanism for their assessment. The AERUS tool served this purpose by standardizing the evaluation of managerial information produced by AI-based models, which is increasingly sought for decision-making at various managerial levels.

Given the rapid and extensive advancements in AI technologies, this study, though insightful, has limitations due to the focus on a selected number of AI models and managerial topics. These choices may not fully capture the extensive range of AI applications in healthcare management. Furthermore, the dynamic nature of AI technology suggests that our findings could vary over time, highlighting the need for cautious interpretation and application of our results. Another critical limitation is the study’s sample size, which might be too limited to encompass a broad spectrum of perspectives or accurately represent the diversity inherent in healthcare management. Additionally, potential biases in participant selection might have influenced the outcomes, thus affecting the validity of the conclusions drawn. These limitations, taken together, should be carefully considered to appreciate the study's contributions and implications fully.

6. Conclusion and future research

This research highlighted the need to evaluate AI's potential in decision-making processes in the digital era. The AERUS tool marked an expressive development in evaluating AI models such as ChatGPT, Microsoft Bing, and Google Bard. It laid the groundwork for a more systematic and reliable assessment of AI-generated information in healthcare administration. Future research should expand the tool's application across a broader and more diverse sample and explore a more comprehensive range of management topics. This expansion is critical for validating the tool's effectiveness in varied healthcare management scenarios in line with EBMgt theories and enhancing AI's credibility in critical decision-making processes. Moreover, integrating the AERUS tool into actual healthcare management systems is a promising direction for future research. It aims to solidify its practical utility and assess its real-world impact on decision-making for business analysis, human resource management, patient experience, and quality improvement initiatives.

7. Declarations

7.1. Funding

Not applicable.

7.2. Availability of data and materials

The data supporting this study's findings are available from the corresponding author upon reasonable request.

7.3. Conflict of interest

The authors declare that they have no competing interests.

7.4. Author contributions

MoS performed data collection, entry, analysis, and interpretation.

MoS and MaS contributed to the data analysis and manuscript development.

MoS, MaS, and JS contributed to the manuscript review.

MoS, MaS, and JS read and approved the final manuscript.

7.5. Acknowledgments

The authors sincerely thank the School of Business at the International American University (IAU), USA, for their fundamental guidance in the research study initiative.

Special thanks to Mr. Hein VanEck, CEO of Mediclinic Middle East, Mr. Barry Bedford, COO of Mediclinic Middle East, Dr. Pietie Loubser, CCO of Mediclinic Middle East, and Mr. David Jelley, Dr. Albert Oliver, and the Senior Leadership Team at Mediclinic Parkview Hospital for their exceptional guidance and commitment to fostering a culture of research and continuous development within the Mediclinic network.

8. References

Full Text

Assessment of Artificial Intelligence ‎Credibility in Evidence-Based ‎Healthcare Management with “AERUS” Innovative Tool

Other Journals

Useful Links