1 Introduction

The digitalization of learning processes through the provision of online learning objects, such as interactive quizzes, videos, and texts, creates massive amounts of data on learners, learning processes, and learning results. AI technologies, especially machine learning, are increasingly used to provide data-based, individualized, and meaningful insights as well as adaptive learning paths or recommendations. However, the ethical implications of AI, including issues of fairness, non-discrimination, transparency, and robustness, are increasingly being discussed as part of the broader concept of trustworthy AI (see Sect. 3).

The quality of an AI system needs to be assessed in terms of its intended task, functionality and performance, robustness, and transparency [26, 70, 82]. All those aspects are linked to fairness. If an AI system does not accomplish its intended task (functionality and performance) or if it is designed to discriminate, it cannot be considered fair. If the AI system cannot cope with incomplete, noisy input data (robustness) or delivers completely different outcomes with small input changes, it will fail to perform accurately in uncommon or unexpected scenarios, or in scenarios with extreme values, which are often referred to as edge cases. If the system output remains opaque to the user, learners and instructors might draw the wrong conclusions.

This paper contributes to the research discussion in several ways:

  1. Providing a comprehensive framework for auditing AI applications in LA systems by proposing a four-phase audit process, which includes delimitation, the risk-based definition of audit criteria, auditing and assessment, and monitoring and re-assurance. This framework can be useful for practitioners and researchers who are interested in evaluating the ethical implications of AI systems in education.

  2. Focusing on the learner's perspective by emphasizing the importance of putting the learner at the center of the risk analysis, ensuring that any LA system benefits learners and does not put them at risk. This perspective is crucial, as ethical AI systems in education should prioritize the well-being and autonomy of the learners.

  3. Discussing auditing methodologies by clustering different methodologies for auditing AI systems into four categories: reviewing system objectives, interventions, and consequences; reviewing datasets; analyzing source code and model quality; and conducting technical black-box testing. This discussion can be useful for researchers and practitioners who are interested in selecting appropriate auditing methods for their AI systems.

  4. Emphasizing the need for reflection and engaging in joint discussions among providers, buyers, and users of these systems. This emphasis on reflection and dialogue can contribute to the development of more ethical AI systems in education.

Overall, the paper provides a comprehensive framework for auditing AI applications in LA systems from the perspective of the learner's well-being and autonomy. It also offers practical insights into different auditing methodologies and emphasizes the importance of reflection and dialogue. These contributions advance the research discussion on ethical AI systems in education and can guide practitioners and researchers in designing and evaluating these systems.

The ethical implications of AI in education are increasingly being discussed in the context of trustworthy AI. In heavily regulated domains such as finance and healthcare, AI systems are subject to rigorous auditing processes. While in the European Union a draft regulation for AI is being discussed [31], no federal regulation is currently being prepared in the USA. Nonetheless, recent initiatives document the importance of the ethical use of AI on the other side of the Atlantic: in October 2022, the White House issued a nonbinding Blueprint for an AI Bill of Rights [102]. The National Institute of Standards and Technology (NIST) issued an AI Risk Management Framework in January 2023 [75]. In China, a regulation on recommender systems was already enacted in 2022 [23]. Hence, the comprehensive framework for auditing AI applications in LA systems provided in this article will be useful for practitioners and researchers interested in evaluating the ethical implications of AI systems in education in all regions, while the most urgent need is in the European Union. This includes educational institutions, educational technology providers, and policymakers involved in regulating AI in education. By following the proposed audit process and considering the domain-specific audit criteria and methodologies presented in this article, stakeholders can ensure that AI systems in education prioritize the autonomy of learners and comply with regulations.

In this article, learning analytics is briefly introduced (Sect. 2), the discussion around the fairness, transparency, and robustness of AI systems is summarized (Sect. 3), and AI auditing requirements, processes, and methodologies are introduced (Sect. 4). Domain-specific audit criteria for ethical AI applications in LA are presented in Sect. 5. A process for auditing AI applications in LA is introduced in Sect. 6, where numerous complementary auditing methodologies are discussed.

2 Learning analytics

Over the past decade, LA research has steadily increased [89]. LA uses data from learners and learning platforms to analyze learning processes and contributes to enhancing learning [59], for example by implementing learning dashboards, predicting performance or drop-out and potentially triggering personalized interventions, or by providing personalized recommendations, feedback, or hints [4, 28, 60, 66, 74]. AI technologies such as machine learning or natural language processing are frequently used in LA [2, 3, 10, 17, 88, 91]. Fairness, transparency, and explainability are indispensable dimensions in LA, especially as learners' professional futures are at stake [35, 77, 86, 97].

3 Fairness, transparency, and robustness of AI systems

The use of AI technologies is rapidly expanding in all areas of society. However, AI technologies are associated with a number of risks, especially related to privacy, fairness, robustness, and transparency [39]. The privacy-related risks of using data-driven technologies have been widely discussed, also with regard to LA [27, 77]. Trustworthiness, including fairness, robustness, and transparency, poses additional challenges when AI technologies are used for learning analytics. In the following, we briefly summarize the discussion around the fairness, transparency, and robustness of AI systems.

3.1 Fairness of AI systems

The most widespread AI systems used today are based on machine learning technologies, where relationships are abstracted into models from data (“learned”). However, if the data used to train those models contains correlations that are considered unfair or biased, the AI system may produce unfair outcomes when used in sensitive environments such as the criminal justice system or for credit decisions. AI systems replicate biases from the training data or can even amplify them, especially for underrepresented groups [108], as has been shown also for LA applications [85, 90]. Further sources of bias in AI systems result from the data selection and measurement process, data preparation, or the interaction of the user with the system [67].

Unfair AI systems can lead to direct discrimination [34], indirect discrimination [11, 34, 108], or underestimation [52]. Biased educational systems can result in allocative harm, when they prevent groups from accessing education, or in representational harm, when they under-recognize or stereotype certain groups [9]. Fairness in LA applications of AI is increasingly discussed [25, 55, 86]. This highlights the need for auditing methodologies for AI-based LA systems.
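As a simple illustration of how such group-level unfairness can be made measurable (our sketch, not taken from the cited works), the snippet below computes the positive prediction rate per group together with the demographic parity gap and the disparate impact ratio; the column names `group` and `pred` and the example data are hypothetical.

```python
import pandas as pd

def group_fairness_report(df: pd.DataFrame, group_col: str = "group",
                          pred_col: str = "pred") -> pd.DataFrame:
    """Positive prediction rate per group, plus the demographic parity gap and
    the disparate impact ratio relative to the best-off group."""
    report = df.groupby(group_col)[pred_col].mean().rename("positive_rate").to_frame()
    best = report["positive_rate"].max()
    report["parity_gap"] = report["positive_rate"] - best
    report["disparate_impact"] = report["positive_rate"] / best
    return report

# Hypothetical binary predictions (e.g. "recommended for support") for two groups
data = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B"],
                     "pred":  [1,   0,   1,   0,   0,   1,   0]})
print(group_fairness_report(data))  # e.g. flag groups below the common 0.8 ("four-fifths") threshold
```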

3.2 Transparency of AI systems

Transparency with regard to the handling of student data has been demanded of LA systems for a long time [77]. Again, the use of massive amounts of data, including log data, and of machine learning technology poses new challenges to the transparency of LA systems, as AI systems are not easily comprehensible. The transparency of AI systems can be assessed in three categories: traceability, explainability, and communication [30].

Traceability refers to the documentation of the AI system, including the methods used for gathering, labeling, and pre-processing data, the splitting into training, test, and validation sets, the choice of ML model, model parameters and evaluation criteria, and the hardware and software setup [30, 72].

Explainable AI (XAI) is an active field of research. Explainability in the sense of an understanding of the system is required by AI developers to improve the system, by managers to ensure proper use of the system in organizations, by AI users to validate the match between input and output, by individuals affected by AI results, and by auditors to assess compliance with requirements [68]. The explanations provided need to be adjusted to the needs of those groups [68]. Broadly used approaches to improve explainability are the use of more transparent models (decision trees vs. neural networks), the assessment of feature importance [6, 63, 87], or the inspection through counterfactual examples [24, 62, 65]. However, technical model explanations were found to be of limited use to system users, as explanations need to consider the process within which the system is used, the recipient’s knowledge, and domain-specific approaches to explanations [62].

Communication refers to the need to inform recipients of the system’s output adequately about the functioning of the system, as well as its limitations [30].

3.3 Robustness of AI systems

If decisions are taken based on the results or recommendations provided by a system, users expect those results not to change massively with small fluctuations in the input data. Robustness [110] refers to an AI system's ability to handle incomplete or noisy input data and still provide reliable results. In educational applications, where human assessments are often used to measure competencies, progress, effort, or output quality, robustness is particularly important since the accuracy of the input data cannot always be guaranteed [29, 48].

Some machine learning systems, such as neural networks, have been found to lack robustness [100]. Lack of robustness is often demonstrated using deliberately designed edge cases, so-called adversarial examples [38]. A trade-off has been found to exist between accuracy and robustness [103].

4 Auditing AI systems

Auditing AI systems is complex, because AI systems are part of socio-technical systems that are under continuous development [56]. Fairness, transparency, and robustness of AI systems cannot be assured using a single practice or technology, but they can be created through conscious human-centered design choices about datasets, models, and validation processes [93]. Standardized certifications by independent auditors are considered to be important factors to achieve human-centered AI systems [95].

This section is divided into three parts: first, the need for AI audits will be explained; second, processes for auditing AI systems will be presented; and finally, AI auditing approaches and methodologies will be discussed.

4.1 Auditing requirements

There is an ongoing global discussion about the need to regulate and audit AI systems. The European Commission has proposed the Artificial Intelligence Act [31], which, once enacted, will regulate all AI systems used in the European Union and require conformity assessments for all high-risk AI systems, including those used in education. In 2022, the Chinese administration issued an administrative rule on algorithmic recommender systems which aims to protect national security and “social public interests” and for which providers have to document adequate audits to ensure compliance [18]. The German Federal State of Schleswig-Holstein has recently adopted a law on the requirements for AI systems used in the public sector, which includes education [49].

Finance and healthcare are heavily regulated and technology-driven sectors, as they are associated with substantial risks. As a result, regulation in these sectors is more advanced compared to other industries. In the finance sector, conformity assessments have been required for some time for algorithmic trading systems in Europe according to Art. 6, namely when the system is used for the first time, upon material changes, or prior to material updates [EU 31/589]. The regulation also requires financial institutions to separate production and testing environments for conformity assessment. For risk models in financial institutions, explainability is a challenge, as the relationship between input and output data can no longer be described verbally or mathematically, as historically required by supervisory authorities [7]. Explainability is, therefore, considered a major criterion for model selection [8]. As changes to important risk models have to be reported to and sometimes approved by the regulating authority, there is a need to define what constitutes a model change, for example re-training based on updated data or the adjustment of hyper-parameters, such as the number of layers in a neural network [7].

Auditing requirements for medical devices that use AI differ depending on the type of AI used: static medical AI systems that use a fixed, trained model are easier to certify than dynamic medical AI systems that regularly update their machine learning models based on new data [46, 106]. The inherently higher uncertainty of continuously learning medical devices needs to be justified by better functionality [106].

For other domains, such as educational technology, this means that the level of risk associated with the AI systems in use should dictate the level of explainability required. Additionally, it is important to differentiate between static models and more dynamic models that are continuously updated. Significant model updates will require re-auditing.

4.2 Auditing process

The process for auditing AI systems depends on the chosen audit methodology. In the following, we will describe processes for fairness audits [1, 83], external code audits [101], and life-cycle-based audits [36, 79]. However, as the harmonized standards referred to in the AI Act do not exist yet, none of the presented approaches can yet claim compliance with the future European regulation. The final harmonized standards for AI audits are not expected to be released until 2025. The European Commission's final standardization request was approved in February 2023, and the development of standards for “Safe and trustworthy artificial intelligence systems” is now included in the 2023 annual Union work program for European standardization [32].

Agarwal et al. [1] propose a standard operating procedure for fairness certification which relies on a training dataset to calculate a bias index for the protected attributes and a fairness score for the total system. The bias index takes into account several fairness metrics, and the fairness score is computed as the average bias index [1]. Raji et al. [83] created a process framework for internal audits of AI systems consisting of the stages (1) scoping, (2) mapping, (3) artifact collection, (4) testing, (5) reflection, and (6) post-audit. During the scoping stage, the audit scope is defined and system documentation as well as underlying principles are acquired. Auditors assess the system’s social impact and ethics within use cases [83]. In the mapping stage, interviews are conducted with stakeholders to perform a failure mode and effects analysis (FMEA) that is used throughout the subsequent stages. In the artifact collection stage, information on systems and models is obtained. The testing stage includes the main auditing procedures and the review of the obtained material with regard to the identified risks using appropriate methodologies. In the last two stages, the audit results are documented and reported, and possible mitigating measures are assessed [83]. An internal AI audit is also described in [71]. The authors describe contradictory requirements within the organization and difficulties in operationalizing the scope and claims of the audit [71].

Tagharobi and Simbeck [101] introduced a process for code-based fairness audits of LA systems that consists of the steps (a) definition of the scope of the audit, (b) artifact collection and confinement, (c) mapping, description, and prioritization of relevant functions, (d) fairness assessment, and (e) interpretation of results. They find that the assessment of a system based on code alone is only possible to a limited extent if neither training data nor test data are available [101].

Other auditing approaches acknowledge the difficulties in assessing production systems and propose following the AI development lifecycle to ensure compliance [36, 79]. CapAI [36] introduces a procedure for conducting internal conformity assessments that is supposed to meet EU AIA requirements and that follows the AI lifecycle stages (design, development, evaluation, operation, retirement). For every stage, review items with corresponding documentation are defined. The procedure describes data pre-processing, splitting of data into training, validation, and test data, and proposes to use standard measures for model quality (mean squared error, mean absolute error, accuracy, precision, recall, F1-score) or fairness (such as equal opportunity, disparate impact, equalized odds, demographic parity) [36]. Given the procedure’s broad target audience, it serves more as a best practice manual for AI development. It stands out from those best practices, however, by calling for the definition, documentation, and communication of the norms and values on which the design of the AI system shall be based. It does not explain, though, how the required ethical assessment against these values shall be conducted. The Fraunhofer AI assessment catalog [79] follows the AI development lifecycle as well but defines the stages of data, development, and production. Each of the dimensions of fairness, autonomy/control, transparency, reliability, security, and data protection shall be assessed based on risk analysis, adequate metrics, and measures [79]. Concerning fairness, the procedure calls for defining possibly disadvantaged groups and adequate fairness objectives in the framework of the risk analysis. Metrics are to be defined with regard to system output and data input. Measures include data pre-processing (such as preferential sampling or data reweighing [51]; see the sketch below), post-processing of results (calibration, thresholding, transformation), or the choice of adequate models (such as re-weighing or adversarial learning).
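To make the pre-processing measures named above more tangible, the following sketch illustrates reweighing-style instance weights in the spirit of [51]; it is our illustration rather than part of the cited catalogs, and the column names are assumptions. The resulting weights can typically be passed as sample weights when training a model.

```python
import pandas as pd

def reweighing_weights(df: pd.DataFrame, group_col: str, label_col: str) -> pd.Series:
    """Instance weights w(g, y) = P(g) * P(y) / P(g, y), so that the protected
    group and the label become statistically independent in the weighted data."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n
    return df.apply(
        lambda row: p_group[row[group_col]] * p_label[row[label_col]]
        / p_joint[(row[group_col], row[label_col])],
        axis=1,
    )

# Hypothetical usage with a scikit-learn style estimator:
# weights = reweighing_weights(train_df, group_col="gender", label_col="passed")
# model.fit(X_train, y_train, sample_weight=weights)
```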

In summary, there is agreement that the elements of scoping (delimiting the system to be audited) and the risk-based approach are required elements of an AI auditing process [1, 36, 79, 83, 101]. Many audit approaches recommend documenting the system development process [36, 71, 79, 83] which is in line with the AIA conformity assessment based on internal controls.

4.3 Auditing methodologies

In this section, several auditing methodologies will be discussed. This includes methodologies described in academic audits [21, 92], commissioned or voluntary audits [76, 109], user audits [53, 94], and non-binding proposals for standards from institutions [16, 26, 44, 46].

Ideally, complete system documentation, source code, and data are available to conduct an AI audit. According to DIN SPEC 92001, the quality of AI systems needs to be assessed on the levels of their model, data, platform, and environment [26]. Here, the “setting in which the AI module is situated and with which it can interact” is defined as its environment. The platform refers to the hardware and operating systems on which the AI is run, including its limitations induced by interfaces, processing power, or technical dependencies [26]. However, such a comprehensive approach is not always possible. Even if the source code is disclosed, the sheer size and complexity of the system might make it difficult to assess its fairness, transparency, and robustness [92]. If source code and machine learning model are not available, a black-box audit can be conducted. In black-box audits, system input data is systematically varied to identify the impact of input features on system output [73]. Sandvig [92] differentiated between four black-box AI auditing approaches: noninvasive user audits (questioning users about interactions with a system), scraping audits (systematically querying a system), sock puppet audits (using test user profiles), and crowdsourced audits (collaborating with system users to systematically collect real-world system input and output).
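As an illustration of a sock puppet audit (our sketch, not prescribed by [92]), the function below queries a black-box system with test profiles that are identical except for one protected attribute and records the output per variant; `predict` stands for whatever interface the audited system exposes, and the profile fields in the usage comment are hypothetical.

```python
from copy import deepcopy

def sock_puppet_probe(predict, base_profile: dict, attribute: str, values: list) -> dict:
    """Query a black-box system with matched test profiles that differ only in
    one (protected) attribute and collect the output for each variant."""
    results = {}
    for value in values:
        profile = deepcopy(base_profile)
        profile[attribute] = value
        results[value] = predict(profile)  # system under audit, accessed only via its interface
    return results

# Hypothetical usage against a drop-out prediction endpoint:
# outputs = sock_puppet_probe(lms_api.predict_dropout,
#                             {"age": 24, "logins_last_30d": 31, "gender": "f"},
#                             attribute="gender", values=["f", "m", "d"])
```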

Black-box audits have been used to assess the fairness of online delivery of job ads [47] and housing ads [5]. In both cases, discrimination was detected by running experiments based on assumed optimization algorithms used by platforms. In other cases, system output data are verified against publicly available third-party data to assess representativeness and identify sampling bias [21]. Black-box audits are sometimes initiated by accidental findings of system users and later crowdsourced to obtain larger datasets [94]. Examples include the investigation of the Twitter image cropping algorithm [94] or the TikTok recommendation algorithm [53].

Datasets that are unrepresentative or biased can lead to incorrect or biased machine-learning models; [86] provides an example of how bias from the data is transferred into the model. The audit approach of an “extended conversation” was used in the 2020 audit of the hiring assessment provider HireVue [76]. The audit comprised a review of documentation, interviews, and discussions with stakeholders, the establishment of an ethical matrix, and the planning of remediation steps [76]. The potential fairness issues identified through the stakeholder interviews (e.g., balanced training data, accents in voice data, dealing with short answers) were explained and categorized based on information provided by HireVue [76]. The audit thus serves more as an example of the first step of an audit process, the identification of domain-specific risk areas. It should be noted that the stakeholders included an association representing minority or neuro-atypical candidates as well as a client.

A similar approach is proposed by the German Institute of Public Auditors, which sets standards for public accountants in Germany, in its draft auditing standard for AI systems [44]. The audit will either assess appropriateness (documented measures meet minimum criteria, are appropriate, and are implemented) or, going further, effectiveness (documented measures meet minimum criteria, are appropriate, are implemented, and are effective) [44]. According to [44], auditors rely mainly on the description and documentation provided by the auditee. Important auditing procedures include the identification and interrogation of responsible persons and the assessment of the system documentation with regard to completeness, correctness, and currentness [44]. The draft standard describes four areas of requirements for the AI system: compliance with ethical and legal requirements, comprehensibility (transparency and explainability), IT security (confidentiality, integrity, availability, authorization, authentication, binding force), and performance [44]. In order to meet the requirements, the draft standard calls for the implementation of AI governance measures, AI compliance, AI monitoring, data management, model training, AI application, and AI infrastructure [44].

In their cooperative audit of the PlyMetrics hiring tool, [109] assess the risks identified in the initial scoping step: correctness, direct discrimination, de-biasing circumvention, sociotechnical safeguards, and sound assumptions (imputation of missing values). Their audit uses mixed methods: it is primarily a code audit (manual examination of source code) that was complemented by systematic experiments to explore the possibility of de-biasing circumvention, a review of documentation, and a data review [109]. They reported that the source code (Jupyter notebooks provided by PlyMetrics) correctly implements the adverse impact metric and that it did not use demographic features that could lead to direct discrimination [109]. They assessed the sociotechnical safeguards (human oversight) by understanding the data science process in the audited company [109]. To test the impact of the imputation of missing values, they analyzed the distribution of missing values among groups and tested the model for adverse impact [109].

For some application domains of AI systems, questionnaires have been proposed for auditing the systems. IG-NB [46] created a questionnaire for the certification of AI in medical products that covers the areas of the supplier’s AI competency, documentation, medical purpose and context of use, risk management process, functionality and performance, user interface, security risks, data acquisition/labeling/pre-processing, model creation and assessment, and post-market surveillance. While some requirements in the questionnaire refer specifically to medical applications (e.g., the product requirement definition should include indications, contraindications, and comorbidities), most of the items are generic and could be applied to assessing any AI system (e.g., the requirement to document feature selection and the split into training/validation/test data).

The German Federal Office for Information Security published an AI Cloud Service Compliance Criteria Catalogue (AIC4) that also relies on the AI service provider’s system description [16]. The provided compliance criteria are divided into eight areas: general cloud computing compliance, security and robustness, performance and functionality, reliability, data quality, data management, explainability, and bias. In order to assess security and robustness, the catalog calls for the documentation of continuous risk management procedures against malicious attacks, including scenario building, risk exposure assessment, robustness testing using for example white-box, black-box, or physical attacks, and the implementation of countermeasures [16]. With regard to performance assessment, specific criteria are recommended (e.g., the ROC curve, AUC, and Gini coefficient for scoring models), and AI system providers are also required to describe the capabilities and boundaries of the applied machine learning model [16]. The data quality criteria include, among many others, the “selection and aggregation of [data that] is statistically representative and free of unwanted bias” [16]. Specifically, equalized odds, equalized opportunity, demographic parity, and fairness through (un)awareness are listed as fairness metrics, and several methods for bias mitigation are also mentioned [16]. The level of explainability, on the other hand, shall depend on the purpose, potential damages, and decision-making implications [16].
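For the scoring-related performance criteria mentioned above, the following minimal sketch (our illustration, with made-up labels and scores) shows how AUC and the Gini coefficient are typically derived from a model's scores, using the common relation Gini = 2 · AUC − 1.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth (e.g. course completed) and model scores
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.65, 0.30]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # common relation between the Gini coefficient and AUC for scoring models
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```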

In summary, the different auditing methodologies are partly complementary, as they provide different insights into the system. On the other hand, not every methodology can be applied in every audit, depending on the availability of system access, extensive documentation, or source code (see Table 1).

Table 1 Overview of AI auditing methodologies

5 Audit criteria for ethical LA

In their seminal work, Slade and Prinsloo [97] describe six principles for ethical LA, which will serve as a basis in the following section to deduce domain-specific audit criteria for AI applications in Learning Analytics. Figure 1 gives an overview of the six principles and the derived audit criteria. The principles proposed by Slade and Prinsloo [97] are widely recognized in the field of LA and are often used as a framework for the ethical use of data in education, as they prioritize the purpose of education [43].

Fig. 1 Audit Criteria for Ethical AI Applications in LA Systems, derived from [97]

The first principle demands LA to be a “moral practice”, which means that it should not follow a technocratic, efficiency-focused approach [97]. Education requires value-based, normative judgments that cannot be replaced by data-driven decision-making [13]. Any decision-making in education needs to inquire first about what is educationally desirable [13]. The first principle thus leads to audit criteria such as:

  • What educational theories/values/didactic assumptions is the system/intervention based on?

  • Are the educational objectives of the system/intervention defined?

  • Does the system help to understand learning (process/effort/success/success factors) rather than just measuring it?

The second principle calls for seeing “students as agents” of their learning process, not only as recipients of learning interventions that generate data [97]. This implies the necessity of informed consent for the use of LA systems but also an inquiry into students’ educational priorities and challenges [57]. The need for explainability, which can be deduced from the second principle, is closely related to the fifth principle (transparency). The second principle leads to audit criteria such as:

  • Did learners consent to the specific use of data?

  • Are drivers of system results understandable to learners/instructors/institution?

  • Are implications/actions from system results understandable to learners?

  • Do learners get the chance to explain themselves/adapt/correct system output and/or implications?

The third principle regards “student identity and performance as temporal dynamic constructs” [97]. Even though some competencies can be frustratingly stable and slow to acquire, education mostly takes place in phases of life that are associated with rapid personal development and frequent changes of context and setting. This also implies that learners are in a critical phase of their life and must therefore be protected from discrimination and unfair treatment. For the use of data in LA, the following questions must be answered:

  • Is the timespan of data used for analysis or prediction adequate?

  • Is newer data weighted more strongly?

  • Are system results fair and unbiased with regard to minority groups?

  • Is the data used balanced with regard to data from minority groups?

  • Will data/system outputs be deleted or anonymized after an appropriate timespan?

  • Can learners request the deletion of data?

  • Will past data on learning performance or process permanently limit learning opportunities?

The fourth principle states that “student success is a complex and multidimensional phenomenon” [97]. Digital LA systems are biased towards using digitally available data and ignoring important success factors of learning, because these are either not easily available (e.g., socio-demographic background) or not digitally captured (offline learning activities such as reading books or discussing with friends). An ethical review of an LA system, therefore, needs to address the following questions:

  • How have the data points been chosen?

  • Which data is potentially missing because it is not (digitally) available?

  • Is it made transparent which dimensions of student effort/progress/success cannot be measured?

The fifth ethical principle in LA is the principle of transparency [97], also required by Pardo and Siemens [77]. It includes transparency about the purpose and conditions of data use as well as access to data and results [97]. The notion of transparency is closely related to explainability, which is covered by the second principle (learners as agents of their learning). Therefore, the following questions have to be answered to assess an LA system:

  • Are learners informed about which data is processed by the system?

  • Are learners informed about system objectives?

  • Is system output shared with learners, or only with teachers or the institution?

In contrast to the previous five principles, which set limits for LA, the sixth principle “higher education cannot afford to not use data” encourages the adoption of LA practices [97]. Data can and shall be used primarily for the benefit of learners, but also of instructors and educational institutions, to provide meaningful learning experiences and individualized learning support with limited resources. Data-based analytical approaches can also help to understand and mitigate biases that exist in society, e.g., by identifying underserved groups of students or discrimination in opportunities or results. Some authors argue that educational institutions should therefore strive to systematically collect data on minorities and analyze it to identify and prevent discrimination [9]. However, this contradicts the established principle of data minimization, which is one of the basic principles of data protection and is required by the European General Data Protection Regulation (GDPR).

This leads to the following audit question for an LA system:

Based on the educational objectives and approaches of the institution, is LA used where needed?

6 Auditing AI in LA systems

The creation and continuous development of educational AI systems is a complex multi-step process, where every step can potentially introduce or amplify bias [9]. The creation of a robust and transparent system requires dedicated effort at every step of this process. In the following, a process for auditing the fairness, transparency, and robustness of AI applications in the LA domain will be presented (Fig. 2). While the terms auditing, assurance, and certification are all used in the literature, in the following only the term audit will be used. The proposed process integrates the ISO standard for auditing management systems [48] with prior works by [83, 101].

Fig. 2 Process for auditing the fairness, transparency, and robustness of AI applications in the LA domain

6.1 Delimitation and definition of scope

The first step of the audit process consists of the delimitation and definition of the scope of the system to be audited [71]. In many cases, the AI technology may be part of a broader system that is not completely subject to the audit. This includes delimiting the scope of the relevant system components, the relevant input data, the output data of the system, and, if possible, the identification of the processing methodology of the data within the system. In ISO 19011 [48], this step is part of establishing and implementing an audit program.

Identification and documentation of system scope

Before beginning the audit, it is essential to define the scope of the system to be audited, especially if the system is embedded in a more comprehensive system or scope. This includes defining the start and the end of the relevant processes. For this purpose, it is necessary to obtain access to documentation, process descriptions, and possibly source code. Traceability in the broadest sense is not a prerequisite for all types of audits; audits with lower levels of assurance can be performed without full insight into data, code, and models. In many cases, it will be impossible to analyze such extensive information.

In the automotive industry, the scope of a system risk assessment is captured by the Operational Design Domain (ODD), which describes the conditions under which a system is expected to be reliable, for example in terms of geography or weather [61]. In their audit of the LA functionality of the Moodle learning management system, Tagharobi and Simbeck [101] delimited the scope to only 35,000 of a total of 2.6 million lines of code of the complete application.

Identification and documentation of input and output data

Software applications use databases to store and retrieve data. In complex applications, many variables of different types are stored in columns of interrelated tables in databases. Machine learning applications access those databases to feed data into the trained models and store the output back into the database. Not all data columns may be relevant to the scope of the audit. To assess the AI-based LA application, it is necessary to understand and document which variables are used as input to the system, which output is created by the system, and where it is stored. Usually, data needs to be pre-processed before it can be used for machine learning. The applied data pre-processing shall be documented, for example data cleaning or the creation of additional calculated or aggregated attributes. It is also important to understand and document when, where, and how the system output is displayed and to whom. A dropout prediction in a learning management system could be initiated by an admin, the aggregated result could be visible to a teacher, and the individual result could be visible to learners. Apart from the aggregated and individual results, learners and teachers might receive further information about the quality of the prediction or be able to compare themselves to other groups. A minimal data inventory of this kind is sketched below.
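The following sketch shows one possible way to record such an input/output inventory in machine-readable form; it is purely illustrative, and all table, feature, and role names are hypothetical placeholders for the audited system's actual data flow.

```python
# Illustrative data inventory for an AI-based drop-out prediction feature in an LMS.
# All names are hypothetical placeholders to be replaced during the audit.
data_inventory = {
    "input_features": {
        "logins_last_30d": {"source_table": "user_activity", "preprocessing": "aggregated per user"},
        "quiz_avg_score":  {"source_table": "quiz_results",  "preprocessing": "mean; missing -> median"},
        "forum_posts":     {"source_table": "forum",         "preprocessing": "count per course"},
    },
    "model_output": {
        "dropout_risk": {"stored_in": "analytics.predictions", "range": "[0, 1]"},
    },
    "output_visibility": {
        "admin":   "triggers the batch prediction run",
        "teacher": "sees aggregated course-level risk",
        "learner": "sees individual risk and the main drivers",
    },
}
```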

Identification and documentation of processing methodology

A machine-learning-based AI system applies a trained model to new data. In some cases, the trained model may be built into the system; in other cases, users may be able to train their own models. For training the model, numerous approaches can be used, from linear regression to deep neural networks. In some cases, system users may be able to choose a model-building approach. Auditors should identify and document how the data is processed in the system and how the model is trained. If a pre-trained model is used, the available quality measures of the model and the training data used need to be documented.

6.2 Risk-based definition of audit criteria

Once the system to be audited has been delimited and access to relevant stakeholders, documentation, data, system, and possibly source code has been acquired, the audit objectives, risks, and audit criteria need to be identified and documented. This step is also part of establishing and implementing an audit program in [48].

Identification of risks

To systematically assess the risks of the system, auditors have to consider the perspective of all stakeholders and identify what could go wrong, starting with the consequences of the system output. Interviews with diverse stakeholders can serve to identify domain-specific risk areas and describe cases of potential negative consequences from system failure [76]. For medical software, risk analysis focuses on reliability (of intended functionality) and safety (avoidance of unintended harm) [20]. Other classical risk categories are damage to property or financial assets, injury, disclosure of personal data, and negative consequences from the non-availability or failure of the system.

To identify and describe domain-specific risks, the six principles of ethical LA as described by [97] can be used as a starting point (Fig. 1).

Risk-based definition of testing scenarios/edge cases/corner cases

In the next step, the identified risks can be translated into testable scenarios. Scenarios are combinations of input factors that may occur and influence the system output [107]. Testing scenarios should include both usual and frequent scenarios and unusual or unexpected scenarios. The technique of creating personas [80] helps to identify and cover typical system usage. Often, edge cases or corner cases are used that describe scenarios with low probabilities but possibly severe consequences [14, 40, 83]. Kitto and Knight [54] promote the use of edge cases for assessing ethical implications of AI use in LA and discuss three example cases (prior consent, conflicting claims in collaborative learning, and profiling). Another approach to identifying relevant scenarios is the use of acceptance tests (compliance with requirements) and assurance cases (structured claims that can be proven based on evidence) [41, 84, 98]. A sketch of such scenario-based acceptance tests is given below.
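As an illustration of how identified edge cases can be turned into executable acceptance tests (our sketch, not taken from [41] or [54]), the snippet below encodes a few hypothetical drop-out prediction scenarios as parameterized pytest cases; `predict_dropout_risk` is a stand-in for the audited component and the feature names are assumptions.

```python
import pytest

def predict_dropout_risk(features: dict) -> float:
    # Stand-in for the audited system; an audit would call the real model or API here.
    quiz = features.get("quiz_avg_score")
    quiz = 0.5 if quiz is None else quiz
    return min(1.0, max(0.0, 1.0 - 0.5 * quiz - 0.001 * features.get("logins_last_30d", 0)))

# Edge-case scenarios derived from the risk analysis (hypothetical examples)
SCENARIOS = [
    pytest.param({"logins_last_30d": 0, "quiz_avg_score": None}, id="inactive-learner-missing-quiz-data"),
    pytest.param({"logins_last_30d": 500, "quiz_avg_score": 1.0}, id="extremely-active-high-performer"),
    pytest.param({"logins_last_30d": 3, "quiz_avg_score": 0.0}, id="low-activity-low-score"),
]

@pytest.mark.parametrize("features", SCENARIOS)
def test_risk_score_is_defined_and_bounded(features):
    risk = predict_dropout_risk(features)
    assert risk is not None
    assert 0.0 <= risk <= 1.0  # acceptance criterion: output is a valid probability for every edge case
```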

Mapping of risks to audit criteria

The risks identified in the prior steps need to be mapped to testable criteria. ISO 19011 [48] explains: “The audit criteria are used as a reference against which conformity is determined. These may include (…) applicable policies, processes, procedures, performance criteria including objectives, statutory and regulatory requirements, management system requirements, information regarding the context and the risks and opportunities as determined by the auditee (…), sector codes of conduct (…).”

6.3 Methodologies for auditing AI-based LA systems

In this section, an overview of auditing approaches for AI systems will be presented and discussed with regard to their applicability in an LA context. Not all audit approaches can and should be applied in all audits. Auditors need to select appropriate methodologies based on the identified risks, system complexity, access to system, code, and documentation, and their own competencies (Table 1).

Some of the auditing methodologies to be discussed require extensive access to confidential data, source code, and machine learning models ("review of datasets" and "code analysis and review of model quality"). These approaches allow for more detailed audits. Without such access, only the methodologies "review of system objectives, interventions, consequences" and "black box testing" are available. However, even with full access to system and data, the complexity of many AI systems often still renders them incomprehensible [81, 96]. Therefore, the risk analysis step is crucial to identify relevant risks and determine the appropriate audit methodology. Ultimately, it is up to the stakeholders to weigh the potential benefits of a system against the risks and decide whether or not to use it.

The auditing methodologies can be combined. In any case, auditors need to focus on the risks defined in the prior step during the audit and document their work. The documentation of the audit needs to include information on the auditing methodologies applied, why they were chosen, and which risks and benefits are associated with the choice of methodology. The audit documentation should also state under which conditions the audit would be reproducible.

Review of system objectives, interventions, and consequences

The first auditing methodology is to systematically review the system objectives, interventions, and consequences. This can be done using system/process documentation and websites, by trying out the system, and/or through interviews with the system provider and users. If process documentation does not exist yet, it should be created during the audit in order to document every step that is taken using the system. A similar approach is described in [76] and has yielded interesting results.

The objectives, interventions, and consequences of the system have to be identified and documented. Here, objectives refer to the purpose of the system: why is the system used and what is supposed to be achieved using the system? The system objectives should be evaluated with regard to the identified risk areas. Unless aspects of fairness, transparency, and robustness are formulated as objectives, they are unlikely to be achieved. Interviews with developers can help to understand whether different groups of learners (gender diversity, migration/education background, language, age, disabilities) were included in the development process.

Interventions are the activities performed (e.g., assigning learning resources, access to data about the learning process) or recommended (e.g., consultation sessions) by the system. The interventions of the system with regard to all stakeholders should be reviewed (learners, teachers, and the organization). Screenshots of the system output can help to assess transparency. Possible audit criteria are:

  • Are system results provided to learners or only to teachers?

  • Are system results appropriately explained?

  • Do learners/teachers receive the opportunity to overwrite results or opt out of the system?

  • Are autonomy and control of learners respected?

Those interventions then result in consequences outside of the AI application. Interventions and consequences for all stakeholders should be considered. Consequences for learners could be the motivational effect of feedback or the lack thereof, increased or decreased time spent learning, or a more efficient learning process. As a consequence of system recommendations about learners, teachers could be influenced in their grading decisions. Consequences should be discussed both with regard to the objectives of the system and with regard to unintended side effects. A system can only be fair if the intended objectives of the system are achieved through the interventions and result in the intended consequences.

The advantage of this auditing approach is that the holistic effect of the system is considered. However, how this effect is created technically is not considered. Using documentation and interviews to audit objectives, interventions, and consequences, the system can hardly be assessed with regard to robustness. Rare cases of dysfunctionality can probably not be identified.

Review of datasets

One important aspect of auditing these systems is to review the datasets used for training and applying machine learning models, because biased datasets can lead to biased machine-learning models, with negative consequences for the learners and the educational institutions that use these systems. Auditors therefore need to understand the importance of reviewing datasets and the criteria they should consider when conducting such reviews. In this section, we discuss this auditing methodology and explain why it is necessary for ensuring the fairness and accuracy of AI components in LA systems.

This approach requires data science capabilities and access to system developers and/or datasets.

The first area to be reviewed is data selection. There is a bias toward using data that is easily available, but datasets that are easily available are often not representative of the target population [93]. To assess how representative a dataset is, its properties need to be compared to the target group’s properties, for example in terms of demographics; a basic check of this kind is sketched below. [22] used administrative data (voter roll data) to assess the coverage of mobility data used for pandemic decision-making and found that vulnerable groups (elderly people, minorities) are underrepresented. Biased datasets will result in biased machine-learning models. Bias in the data used for training educational AI systems on the one hand reflects historical biases and norms in society (“girls underperform in science”) as well as population distributions [9]. On the other hand, systems themselves influence behaviors, and the data created by systems is biased towards digitally observable behaviors, thus missing offline learning or offline social networks. In the pre-processing step, data needs to be cleaned and features selected, added, and calculated. This step can also be used to improve the fairness of the dataset. If subgroups are underrepresented in the training data, subgroup accuracy can be systematically higher or lower [86].

This results in the following audit criteria with regard to representativeness:

  • Is the dataset used for training representative?

  • Is the dataset used for training suitable for the desired application?

  • Has the data been adjusted during pre-processing to reduce bias?

  • Are the properties of the training/validation/test data similar?

  • Are protected groups well represented in the data?

  • Are rare cases well represented in the data?

The used datasets should also be reviewed with regard to privacy and transparency:

  • Are learners/teachers/users informed that their data is used and for which purpose?

  • Has informed consent been given to the use of the data for this specific purpose?

  • Has data been anonymized?
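As announced above, a basic representativeness check can be sketched as follows (our illustration): group shares in the training data are compared with externally known shares in the target population of learners. The group names and population figures in the usage comment are hypothetical.

```python
import pandas as pd

def representativeness_gap(train_groups: pd.Series, population_shares: dict) -> pd.DataFrame:
    """Compare group shares in the training data with (externally known)
    shares in the target population of learners."""
    sample_shares = train_groups.value_counts(normalize=True)
    rows = []
    for group, pop_share in population_shares.items():
        sample_share = float(sample_shares.get(group, 0.0))
        rows.append({"group": group,
                     "population_share": pop_share,
                     "training_share": sample_share,
                     "gap": sample_share - pop_share})
    return pd.DataFrame(rows)

# Hypothetical usage with official enrolment statistics as the reference:
# print(representativeness_gap(train_df["first_language"],
#                              {"German": 0.62, "Turkish": 0.11, "Arabic": 0.08, "other": 0.19}))
```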

Especially in a learning context, measures of interest are not directly observable (e.g., competency levels, learning progress, quality of teaching) and must be operationalized through constructs based on other variables. The ability to objectively measure learning behaviors, outcomes, and competencies is limited, leading to measurement, annotation, and documentation biases [9]. The fairness of LA systems’ outcomes will be negatively affected if the constructs used do not meet the requirements of validity and reliability [50]. According to quantitative social science theory, validity means that a construct measures what it is intended to measure, whereas reliability refers to the notion of reproducibility and consistency. This leads to the audit criteria:

  • Which constructs are used as input and output variables?

  • How valid and reliable are those constructs?

The review of datasets can give helpful insights into the potential fairness of the system. However, the review of datasets alone does not allow for assessing the real-world consequences of unfair data. A more holistic picture will thus require this methodology to be combined with a review of system objectives, interventions, and consequences. The review of datasets will also not provide information about the transparency or robustness of the system.

Code audit and review of model quality

In this section, two interrelated technical approaches to system audits are discussed: code audits and reviews of model quality. These approaches provide auditors with essential tools and methods to assess the implementation, suitability, and quality of the software and machine learning models used in LA systems. A source code audit (also referred to as static code analysis) aims to verify the implementation and suitability of the software for its intended use, review conformance, and/or identify defects or vulnerabilities by systematically reading and understanding the source code [37, 45, 73]. Software products consist of hundreds of thousands of lines of code that are interrelated, created by teams of developers, and difficult to comprehend. According to [45], an auditor can go through up to 200 lines per hour. Often, a code audit can be supported by specialized tools [64]. Code analysis is not only very time-consuming, it is also difficult to assess complex system results that emerge from the interplay of system and data [96].

The static code analysis is complemented by dynamic testing, where the software is systematically executed to identify issues, usually based on designed test cases [73]. Other forms of software reviews are management reviews (reviews of the software development process), technical reviews, inspections (visual examination), and walk-throughs [45, 73].

Code analysis is used, for example, in the HR tool audit by [109], where the auditors manually review Jupyter notebooks provided by the company. A code audit for an LA application is performed by [101]. Their proposed framework consists of the steps definition of the scope of the audit, artifact collection and confinement of source code, mapping/description/prioritization of classes and functions, fairness assessment, and results interpretation [101].

The review of model quality is closely related to the code audit, as it may include reviewing the source code for generating the model. The review can include the data pre-processing, the split of data into training, validation, and test data, the selection of the machine learning approach and model (e.g., linear regression or neural networks), the optimization criteria, or the indicators selected for model performance evaluation.

Most fairness measures are based on the confusion matrix (true positives/false positives/true negatives/false negatives) that is used to assess model quality [105]. Commonly used measures are equality of opportunity, predictive equality, and predictive parity [105]. Those fairness measures are also relevant in an LA context [86].
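A minimal sketch of how such confusion-matrix-based measures can be computed per group is given below (our illustration, assuming array-like labels, predictions, and group memberships); the per-group true positive rate, false positive rate, and precision are the building blocks of equality of opportunity, predictive equality, and predictive parity, respectively.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def groupwise_rates(y_true, y_pred, groups) -> dict:
    """Per-group TPR, FPR, and PPV from the confusion matrix; comparing these
    across groups yields assessments of equality of opportunity, predictive
    equality, and predictive parity."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
        rates[g] = {
            "TPR": tp / (tp + fn) if (tp + fn) else float("nan"),
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
            "PPV": tp / (tp + fp) if (tp + fp) else float("nan"),
        }
    return rates

# e.g. groupwise_rates(y_test, model.predict(X_test), demographics["gender"])
```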

The review of the machine learning model helps to judge the potential transparency of the system, as some models (decision trees, regression) are easier to interpret than others because they yield interpretable weights or rules [62, 65]. For other models, feature importance measures can be calculated, or the models can be approximated by decision trees [62, 65]. Several toolkits are available to provide explainability, such as LIME [87], SHAP [63], or SAGE [22]. The application of those tools can, however, yield inconsistent results; therefore, the appropriate approach needs to be selected carefully [99]. Further, explainability can only be judged from the perspective of the user [19]. Selecting a per se explainable model only increases explainability if the decision parameters are shown at the user interface level.
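As one simple, model-agnostic illustration of a feature importance check (our sketch with synthetic data, not an endorsement of a specific toolkit), the snippet below uses scikit-learn's permutation importance to estimate how much each input feature contributes to the model's predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model; in an audit these would be the system's own.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# How much does randomly shuffling each feature degrade held-out performance?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```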

Black box testing

Several black box audit techniques can be used to systematically test the system using test input data. Those include equivalence partitioning, boundary value analysis, cause-effect graphing, error-guessing [73], assurance cases, acceptance tests [41], or the generation of adversarial examples [100].

Synthetic data can be created to test model properties [42], especially fairness and robustness. In their audit of the PlyMetrics hiring tool, [109] systematically generated training data to explore scenarios to circumvent the tool’s de-biasing mechanism. To test the impact of the imputation of missing values, they analyzed the distribution of missing values among groups and tested the model for adverse impact [109]. Synthetic data can be especially useful in data-protection-sensitive fields such as LA [12].

Independently of the data used and the technical implementation of the model and system, acceptance tests and assurance cases can be used to systematically compare system functionality to expectations [41]. In acceptance test-driven development (ATDD), concrete, unambiguous, observable acceptance criteria are specified to describe the wanted behavior of the system [41]. While ATDD is a software engineering paradigm, the concept of assurance cases is used in safety and security norms such as ISO 26262. An assurance case can be defined as “a structured argument, supported by evidence, intended to justify that a system is acceptably assured relative to a concern (such as safety or security) in the intended operating environment” [78]. To apply acceptance tests and assurance cases to audit fairness, the audit criteria identified in the prior steps need to be substantiated in acceptance tests and assurance cases [41].

Adversarial examples are data points (such as images or voice inputs) that are not correctly classified by different machine learning models [100]. Those data points differ only slightly, often imperceptibly to humans, from correctly classified examples [38, 100]. Adversarial examples can be used to attack machine learning models but also to assess the robustness of models [58]. Adversarial testing has also been used to generate test cases for autonomous driving [104] or for security tests [15].
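In the same spirit, a very simple black-box robustness probe can be sketched as follows (our illustration, not a full adversarial attack): numeric inputs are perturbed with small random noise and the average change in the model's scores is recorded. `predict` stands for the audited system's scoring interface.

```python
import numpy as np

def perturbation_sensitivity(predict, X: np.ndarray, scale: float = 0.01,
                             n_trials: int = 20, seed: int = 0) -> float:
    """Average absolute change in the model's scores when inputs are perturbed
    by Gaussian noise scaled to each feature's standard deviation."""
    rng = np.random.default_rng(seed)
    baseline = predict(X)
    feature_std = X.std(axis=0)
    diffs = []
    for _ in range(n_trials):
        noise = rng.normal(0.0, scale * feature_std, size=X.shape)
        diffs.append(np.abs(predict(X + noise) - baseline).mean())
    return float(np.mean(diffs))

# Hypothetical usage: a large value relative to the score range indicates low robustness.
# sensitivity = perturbation_sensitivity(lambda X: model.predict_proba(X)[:, 1], X_test)
```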

6.4 Monitoring and re-assurance after system changes

After deployment, machine learning models need to be monitored, as models may degrade over time and may need retraining based on new or current data. After such system changes, re-assurance may be required [20, 33]. The proposed European AI Act [31, 69] requires providers of AI systems to conduct post-market monitoring to ensure ongoing compliance.
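A minimal sketch of such monitoring is given below (our illustration): the accuracy on recent data is compared with the baseline accuracy documented at audit time, and the model is flagged for review or retraining if it drops by more than a chosen tolerance. The metric and threshold are assumptions to be set during the risk analysis.

```python
import numpy as np

def performance_drift_alert(y_true_recent, y_pred_recent,
                            baseline_accuracy: float, tolerance: float = 0.05) -> bool:
    """Flag the model for review/re-assurance if accuracy on recent data drops
    more than `tolerance` below the accuracy documented at audit time."""
    y_true_recent, y_pred_recent = np.asarray(y_true_recent), np.asarray(y_pred_recent)
    recent_accuracy = float((y_true_recent == y_pred_recent).mean())
    return recent_accuracy < baseline_accuracy - tolerance

# e.g. if performance_drift_alert(labels_q2, predictions_q2, baseline_accuracy=0.87):
#          trigger the re-audit / retraining workflow
```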

7 Conclusion

With the growing use of AI in education and the public discussion around ethical AI systems, educational institutions adopting AI technologies will feel the need to assure their fairness, transparency, and robustness. Legislators in several countries have started to regulate AI [18, 31, 49]. The possible AI regulation in Europe, with its conformity assessment requirement, might set standards well beyond Europe. Against this background, this article provides an overview of how AI can be audited, specifically in the context of LA.

It is proposed to derive domain-specific audit criteria for AI applications in LA systems from the six principles of ethical LA systems [97]. Using this approach, the learner is put at the center of the risk analysis, as any LA system shall benefit learners and not put them at risk. It is further proposed that the audit process of AI applications in LA systems shall comprise the four discussed phases of Delimitation, Risk-based definition of audit criteria, Auditing and assessment, and Monitoring and re-assurance.

Several methodologies can be applied for conducting the audit, depending on the risks identified, the access to stakeholders, source code, documentation, and data, and the capabilities of the reviewers. The auditing methodologies discussed are clustered into:

  • Review of system objectives, interventions, consequences,

  • Review of datasets,

  • Code analysis and review of model quality, and

  • Technical black box testing.

The auditing of AI systems is an emerging topic. There will be a need to qualify not only auditors but also instructors and educational administration/management to assess educational AI systems. However, apart from assessing systems, educational institutions may also choose to apply other risk mitigation measures, such as contractual guarantees with data/software suppliers, service level agreements, or the use of more transparent infrastructure (e.g., open-source systems).

Providers, buyers, and users of educational AI systems will have to start reflecting on the ethical dimensions and implications of their systems, especially concerning fairness, transparency, and robustness, and jointly discuss how these can be assured in the future.