Grand Challenge

Overview

AVI Challenge 2026 consists of two tracks: the true personality assessment track and the cognitive ability assessment track. Track 1 focuses on evaluating subjects' self-reported personality traits based on their responses to corresponding personality questions. Track 2 concentrates on using multimodal information from subjects' responses to all questions (i.e., including both generic and personality questions) to assess their cognitive ability.

Dataset

Subjects

The dataset consists of video interview data from 644 subjects (307 men, 309 women, 28 non-binary). Subjects (n = 793) were recruited through the online platform Prolific. We excluded subjects (a) with incomplete responses (n = 40), (b) who did not consent to their data being shared (n = 10), (c) who did not pass the attention checks (n = 7), (d) whose variation in personality (HEXACO) items was either too large or too small (n = 6), (e) who self-reported that they did not take the study seriously (n = 12), (f) whose videos contained corrupted audio (n = 51), and (g) who were flagged by personality raters as non-compliant (n = 23). After these exclusions, the final sample consisted of 644 subjects.
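As a quick sanity check, the exclusion counts above can be tallied against the final sample size. The sketch below only restates the numbers from this paragraph and assumes that each excluded subject is counted under exactly one criterion; the dictionary keys are shorthand labels, not official names.

```python
# Counts copied from the Subjects paragraph; keys are shorthand labels.
RECRUITED = 793
EXCLUDED = {
    "incomplete responses": 40,
    "no consent to data sharing": 10,
    "failed attention checks": 7,
    "HEXACO item variation too large or too small": 6,
    "did not take the study seriously": 12,
    "corrupted audio": 51,
    "flagged as non-compliant by raters": 23,
}
FINAL_BY_GENDER = {"men": 307, "women": 309, "non-binary": 28}

# Assumption: the exclusion criteria do not overlap, so counts simply add up.
total_excluded = sum(EXCLUDED.values())
final_sample = RECRUITED - total_excluded

print(f"excluded: {total_excluded}, final sample: {final_sample}")
assert final_sample == sum(FINAL_BY_GENDER.values())
```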

Procedure

Subjects applied to a fictitious management traineeship position. Part of the application procedure was to complete an AVI using a platform we developed for the purpose of the study. During the AVI, subjects responded to six interview questions. Two of them were generic questions that are frequently asked in selection interviews. The other four questions (i.e., the personality questions) were related to the personality traits of Honesty-Humility, Extraversion, Agreeableness, and Conscientiousness, as described by the HEXACO model of personality. Subjects were instructed to answer each interview question within 1-2 minutes.

Interview question development

The interview followed a structured format, since previous literature suggests that structured (vs. non- or semi-structured) interviews have stronger reliability and validity. Subjects always started with the generic questions and proceeded to the personality questions. Table 1 shows the content, order and type of the six questions and their corresponding personality traits.
- Generic questions. To develop the generic interview questions, we created an initial pool of 86 job interview questions taken from previous literature and from a list of frequently asked interview questions provided by a Dutch consultancy company. As a first step, we screened those questions for eligibility. Questions were included if they (a) were open-ended, (b) conveyed personality information to some extent, and (c) could apply to multiple jobs. Questions were excluded if they described specific behaviors, specific jobs, or knowledge, values, and motives. This screening retained 61 questions. We then asked 17 professional recruiters from a Dutch consultancy company to rate how frequently they use each of those 61 questions in practice. Responses were given on a 3-point scale. The inter-rater agreement between the recruiters was ICC(2,17) = 0.88. We also asked four personality experts to rate each interview question on a 7-point scale using three criteria: whether the question applied to limited or multiple jobs, whether it activated one or more personality traits, and a general assessment of the question (ICC(2,4) = 0.66). Next, we calculated the average score per criterion (professional recruiters, personality experts) and excluded all questions that scored below the average (per criterion). This process returned 16 questions that (a) were frequently used by practitioners, (b) were not specific to a particular job, and (c) activated more than one personality trait. Of those 16 questions, we selected the two that received the highest ratings from recruiters and personality experts and edited them slightly.
- Personality questions. To develop the personality interview questions, we created an initial pool of 25 past-behavior interview questions for the personality traits of Honesty-Humility (n = 6), Extraversion (n = 8), Agreeableness (n = 5), and Conscientiousness (n = 6). The questions were developed to target the core facets of each personality trait. They were written in a past-behavior format (e.g., “Think of situations when…”), since previous research suggests that this format is better suited to eliciting personality-relevant information. Four personality experts independently selected one question per personality trait and then discussed any disagreements until a consensus was reached, retaining one question per personality trait. After some further editing, we ended up with four personality-related questions (Table 1).
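The below-average screening step described for the generic questions can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `screen_questions`, the criterion names, and the toy ratings are all hypothetical, and the sketch assumes a question is kept only if it scores at or above the pool average on every criterion.

```python
from statistics import fmean

def screen_questions(ratings):
    """Keep questions that score at or above the pool average on every criterion.

    ratings: dict mapping question id -> dict of criterion -> mean rating.
    Assumes every question has a rating for every criterion.
    """
    criteria = sorted({c for scores in ratings.values() for c in scores})
    # Average score per criterion across the whole question pool.
    thresholds = {
        c: fmean(scores[c] for scores in ratings.values()) for c in criteria
    }
    return [
        question
        for question, scores in ratings.items()
        if all(scores[c] >= thresholds[c] for c in criteria)
    ]

# Toy ratings for illustration only; these are NOT values from the study.
toy_ratings = {
    "A": {"frequency": 3.0, "generality": 6.0},
    "B": {"frequency": 1.0, "generality": 6.0},
    "C": {"frequency": 2.0, "generality": 3.0},
}
kept = screen_questions(toy_ratings)
print(kept)
```

In the toy data, only question "A" is at or above the average on both criteria, so it is the only one retained.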

**Table 1.** The content, order and type of the interview questions and their corresponding personality traits.
Order	Interview question	Type	Personality Trait
1	What would you consider among your greatest strengths and weaknesses as an employee?	Generic	\
2	How would your best friend describe you?	Generic	\
3	Think of situations when you made professional decisions that could affect your status or how much money you make. How do you usually behave in such situations? Why do you think that is?	Personality	Honesty-Humility
4	Think of situations when you joined a new team of people. How do you usually behave when you enter a new team? Why do you think that is?	Personality	Extraversion
5	Think of situations when someone annoyed you. How do you usually react in such situations? Why do you think that is?	Personality	Agreeableness
6	Think of situations when your work or workspace were not very organized. How typical is that of you? Why do you think that is?	Personality	Conscientiousness

Annotations

True personality traits. Self-reported personality traits were collected directly from the participants using validated inventories. Prior to the interview, each subject completed a comprehensive self-assessment based on the HEXACO model. Responses were provided using a Behavioral Anchored Response Scale (BARS), which was adapted for self-perception. The BARS contained four items per personality domain (one item per facet), and participants recorded their responses on a 5-point scale (1 = Very low; 5 = Very high) with a precision of one decimal place. Each personality domain score was calculated by averaging its four constituent facet scores.
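The domain score computation described above amounts to a simple mean over four facet ratings. The sketch below is illustrative only: the function name `domain_score` and the example facet values are hypothetical, and the range check assumes the 1 to 5 scale stated in the text.

```python
from statistics import fmean

def domain_score(facet_scores):
    """Average four facet ratings (each on the 1-5 scale) into one domain score."""
    if len(facet_scores) != 4:
        raise ValueError("expected exactly four facet scores per domain")
    if any(not 1.0 <= score <= 5.0 for score in facet_scores):
        raise ValueError("facet scores must lie in [1, 5]")
    return fmean(facet_scores)

# Hypothetical facet ratings for one domain, for illustration only.
example_score = domain_score([3.0, 4.0, 3.5, 3.5])
print(example_score)
```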
Cognitive ability. Cognitive ability was assessed through a multi-dimensional psychometric test administered to all candidates. The assessment comprised 28 items categorized into four distinct tasks. Following the assessment, professional psychologists conducted a comprehensive evaluation of the candidates' responses. To ensure practical utility for recruitment modeling, scores were synthesized and categorized into a three-tier ordinal scale: Low, Medium, and High level. This expert-led classification ensures that the ground truth reflects not just raw accuracy, but a holistic psychometric interpretation of the candidate's intellectual potential.

Verbal reasoning (16 items): These items evaluate the ability to comprehend, analyze, and manipulate complex verbal and numerical information. For instance, questions such as "What number is one-fifth of one-fourth of one-ninth of 900?" assess deductive reasoning and the capacity to process sequential logical constraints.
Letter-number series (4 items): This task focuses on inductive reasoning and pattern recognition. Candidates are required to identify the underlying rule in an alphanumeric sequence, such as "K, N, P, S, U, ...," reflecting their ability to discern abstract relationships within structured data.
Matrix reasoning (4 items): Utilizing non-verbal, visuospatial stimuli, this section assesses fluid intelligence. Candidates must complete a logical pattern (e.g., "Select the best answer to complete the figure below"), measuring their ability to solve novel problems without relying on prior linguistic or cultural knowledge.
Three-dimensional rotation (4 items): This task specifically targets visuospatial processing and mental manipulation. By asking candidates to identify the correct rotation of a labeled cube (e.g., "Select the choice that could represent a rotation of the cube X"), the test evaluates the ability to form and transform mental representations of complex objects.

Data description

Video index: The dataset contains videos of participants answering 2 generic questions and 4 personality questions. Each personality question has a corresponding self-reported personality trait score. Video filenames follow the pattern:
"participant id_question index_question type"
For example, the filename "5484821efdf99b07b28f2300_q1_generic.mp4" means:
- participant id: 5484821efdf99b07b28f2300
- question index: q1
- question type: generic
The question index and the corresponding personality trait are listed as shown in Table 1. For example, the question “q3” is asked to activate the personality trait of “Honesty-Humility”.
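A filename following this convention can be split into its parts and mapped to the trait from Table 1. The sketch below is one possible way to do this; the names `VideoInfo`, `parse_video_filename`, and `QUESTION_TRAITS` are hypothetical helpers, not part of any official toolkit.

```python
from pathlib import Path
from typing import NamedTuple, Optional

# Trait targeted by each question index, as listed in Table 1.
# Generic questions (q1, q2) have no associated trait.
QUESTION_TRAITS = {
    "q1": None,
    "q2": None,
    "q3": "Honesty-Humility",
    "q4": "Extraversion",
    "q5": "Agreeableness",
    "q6": "Conscientiousness",
}

class VideoInfo(NamedTuple):
    participant_id: str
    question_index: str
    question_type: str
    trait: Optional[str]

def parse_video_filename(filename: str) -> VideoInfo:
    """Split '<participant id>_<question index>_<question type>.mp4' into fields."""
    stem = Path(filename).stem
    # Split from the right so the participant id is kept whole.
    participant_id, question_index, question_type = stem.rsplit("_", 2)
    if question_index not in QUESTION_TRAITS:
        raise ValueError(f"unknown question index: {question_index!r}")
    return VideoInfo(
        participant_id, question_index, question_type, QUESTION_TRAITS[question_index]
    )

info = parse_video_filename("5484821efdf99b07b28f2300_q1_generic.mp4")
print(info)
```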
Ground truth labels: The train_data.csv and val_data.csv files contain the ground truth labels for training and validating recognition models, respectively. The CSV files also include meta information about the interviewees. You may use this meta information as additional input for recognition in both tracks.
- age
- gender: 1 = Male; 2 = Female; 3 = Non-binary/third gender; 4 = Prefer not to say
- education: 1 = Less than high school; 2 = High school graduate; 3 = Some college; 4 = Bachelor's degree or equivalent; 5 = Master's degree or equivalent; 6 = Doctorate; 7 = Prefer not to say
- work experience: the answer to "How many years of work experience do you have?"
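The coded meta information can be turned into readable labels with a small lookup. The code below is a sketch under assumptions: the column names `gender` and `education` are guesses (check the header of train_data.csv for the actual names), and `decode_metadata` is a hypothetical helper.

```python
# Code-to-label mappings copied from the list above.
GENDER_LABELS = {
    1: "Male",
    2: "Female",
    3: "Non-binary/third gender",
    4: "Prefer not to say",
}
EDUCATION_LABELS = {
    1: "Less than high school",
    2: "High school graduate",
    3: "Some college",
    4: "Bachelor's degree or equivalent",
    5: "Master's degree or equivalent",
    6: "Doctorate",
    7: "Prefer not to say",
}

def decode_metadata(row):
    """Return a copy of one CSV row with gender/education codes replaced by labels.

    Assumption: the columns are named 'gender' and 'education'; the real
    column names in train_data.csv may differ.
    """
    decoded = dict(row)
    decoded["gender"] = GENDER_LABELS[int(row["gender"])]
    decoded["education"] = EDUCATION_LABELS[int(row["education"])]
    return decoded

# Made-up row for illustration, shaped like the output of csv.DictReader.
example_row = {"age": "34", "gender": "2", "education": "4"}
decoded_row = decode_metadata(example_row)
print(decoded_row)
```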

Dataset split

The dataset used for the challenge is split into training (70%, n = 450), validation (10%, n = 64), and testing (20%, n = 130) sets. The training and validation sets will be made available to participants for algorithm development. The dataset is divided at the subject level, ensuring that videos from a single subject are assigned exclusively to one of the training, validation, or testing sets. Although the input videos differ between the two tracks (i.e., Track 1 uses only the videos of answers to personality questions, while Track 2 uses the videos of answers to all questions), the split remains consistent across both tracks of the challenge. In splitting the dataset, we considered the distribution of the subjects' gender, age, and work experience. Specifically, we employed joint sampling so that the three sets maintain similar distributions of these demographic and experiential variables. The gender, age, and work experience distributions of the training, validation, and testing sets are shown in Figure 1.
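If you create your own sub-splits (for example, for cross-validation within the training set), it is worth keeping them subject-level as well. The sketch below checks that no participant appears in more than one split; `participant_ids` and `find_leaked_subjects` are hypothetical helper names, and the filenames in the example are shortened placeholders that follow the dataset's naming pattern.

```python
from pathlib import Path

def participant_ids(filenames):
    """Extract the set of participant ids from video filenames."""
    return {Path(name).stem.rsplit("_", 2)[0] for name in filenames}

def find_leaked_subjects(splits):
    """Return participant ids that occur in more than one split.

    splits: dict mapping split name -> iterable of video filenames.
    """
    first_seen = {}
    leaked = set()
    for split_name, filenames in splits.items():
        for pid in participant_ids(filenames):
            if pid in first_seen and first_seen[pid] != split_name:
                leaked.add(pid)
            first_seen.setdefault(pid, split_name)
    return leaked

# Placeholder filenames for illustration; "a" deliberately appears twice.
example_splits = {
    "train": ["a_q1_generic.mp4", "a_q3_personality.mp4"],
    "val": ["b_q1_generic.mp4"],
    "test": ["a_q2_generic.mp4", "c_q1_generic.mp4"],
}
leaked = find_leaked_subjects(example_splits)
print(leaked)

# The reported split sizes add up to the full sample of 644 subjects.
assert 450 + 64 + 130 == 644
```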

**Figure 1.** The gender, age, and work experience distributions of the training, validation, and testing sets.

Track 1: True personality assessment

In this track, you will develop models and algorithms to assess the self-reported personality traits based on subjects' responses to the corresponding personality questions. The task of this track is a single-input-single-label regression task. Below is an example for training models:

Input video: “5484821efdf99b07b28f2300_q3_personality.mp4”
Corresponding personality trait: q3->Honesty-Humility
Label (personality trait): 3.5

The personality trait ratings lie in the range [1, 5]. Please note that you may also use the videos of responses to the other personality questions when predicting a given trait. If you do so, state it clearly in the report you submit.

Evaluation metric: The primary evaluation metric is Mean Squared Error (MSE).
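For local validation, MSE can be computed as in the sketch below. This is not the official scoring script: the function name is hypothetical, and the sketch pools all samples into one average, whereas the text does not specify whether the organizers average per trait or over all predictions.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels and predictions on the [1, 5] rating scale.
example_mse = mean_squared_error([3.5, 2.0], [3.0, 2.0])
print(example_mse)
```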

Track 2: Cognitive ability assessment

In this track, participants are tasked with developing models and algorithms to infer a candidate's cognitive ability from their asynchronous video interview responses. This track is formulated as an ordinal classification task, where models must categorize each candidate into one of three levels: 1, 2, or 3. The task of this track is a multi-input-single-label classification task. Below is an example for training models:

Input video: “60fccc84440f8e8c82ca0288_q1_generic.mp4”; “60fccc84440f8e8c82ca0288_q2_generic.mp4”; “60fccc84440f8e8c82ca0288_q3_personality.mp4”; “60fccc84440f8e8c82ca0288_q4_personality.mp4”; “60fccc84440f8e8c82ca0288_q5_personality.mp4”; “60fccc84440f8e8c82ca0288_q6_personality.mp4”.
Label: 2

Evaluation metric: The primary evaluation metric is Accuracy (Acc).
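Because each Track 2 sample combines all six videos of one subject, a first step is usually to group files by participant id; accuracy is then the fraction of subjects whose predicted level matches the label. The sketch below illustrates both steps. The helper names are hypothetical, and this is not the official evaluation code.

```python
from collections import defaultdict
from pathlib import Path

def group_videos_by_participant(filenames):
    """Map each participant id to its videos, ordered by question index."""
    groups = defaultdict(list)
    for name in filenames:
        participant_id, question_index, _ = Path(name).stem.rsplit("_", 2)
        groups[participant_id].append((question_index, name))
    # Sorting the (question index, filename) pairs orders videos q1..q6.
    return {pid: [name for _, name in sorted(items)] for pid, items in groups.items()}

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted level equals the true level."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

videos = [
    "60fccc84440f8e8c82ca0288_q3_personality.mp4",
    "60fccc84440f8e8c82ca0288_q1_generic.mp4",
    "60fccc84440f8e8c82ca0288_q2_generic.mp4",
]
grouped = group_videos_by_participant(videos)
print(grouped)

# Made-up labels and predictions over the three levels (1, 2, 3).
example_acc = accuracy([1, 2, 3, 2], [1, 2, 2, 2])
print(example_acc)
```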