Week 4 : Two-Sample t-tests

This week we will explore how to use Jamovi to test hypotheses about a dataset using t-tests. We will touch on some revision from previous weeks so please do jump back to past computer practicals for revision if required

Learning objectives for this session
	Quantitative Methods
	Independent samples t-tests
	Paired samples t-tests
	Assumptions of parametric tests

	Data Skills
	Computing a new variable from existing data
	Computing checks for normality and homogeneity of variance

	Open Science
	Exploring and understanding new datasets

1. The Dataset

We’ll be working with the dataset we collected after the break during the lecture in Week 3. This is a partial replication of a study published in the journal Psychological Science (Miller et al. 2023). The original paper found the following results summarised in the abstract.

Recent evidence shows that AI-generated faces are now indistinguishable from human faces. However, algorithms are trained disproportionately on White faces, and thus White AI faces may appear especially realistic. In Experiment 1 (N = 124 adults), alongside our reanalysis of previously published data, we showed that White AI faces are judged as human more often than actual human faces—a phenomenon we term AI hyperrealism. Paradoxically, people who made the most errors in this task were the most confident (a Dunning-Kruger effect). In Experiment 2 (N = 610 adults), we used face-space theory and participant qualitative reports to identify key facial attributes that distinguish AI from human faces but were misinterpreted by participants, leading to AI hyperrealism. However, the attributes permitted high accuracy using machine learning. These findings illustrate how psychological theory can inform understanding of AI outputs and provide direction for debiasing AI algorithms, thereby promoting the ethical use of AI.

We ran a quiz that students in the lecture could join on their phones/laptops. There were two parts to the quiz:

The faces section included 12 faces that were either photographs or real people or AI generated images of people. The stimuli were taked from the materials released by (Miller et al. 2023) on the Open Science Framework. Responses were either ‘Real person’ or ‘AI generated’.

The confidence section asked participants to rate their confidence in their ability to do the following three things.

Distinguish real photos of people from AI generated images (Experimental condition)
Distinguish photos of happy people from photos of sad people (Emotional control condition)
Distinguish photos of people you used to know in primary school from strangers (Memory control condition)

Scores were recorded on a scale from 1 (Completely confidence) to 10 (Not at all confident). The confidence section was repeated before and after the faces section to see if participants confidence changed as a result of doing the AI faces task

2. The Challenge

This week we will use both one sample and two sample t-tests to explore the following hypotheses.

People are able to distinguish AI generated faces from real photos of humans.
Confident people are better at distinguishing AI faces from real faces.
People’s confidence in distinguishing AI generated faces will reduce after performing the task, but their confidence about emotion perception and memory will not change.

Key step

Take a moment to think about these hypotheses. Which statistical test is most appropriate for each? Do they call for a one-tailed or a two-tailed test?

2. Exploring the data

It is critical to take some time to understand the data we work with before running critical hypothesis tests. Here we’ll take a look through the dataset to understand what information is present and if we’re happy to proceed with the analysis. This is similar to what we did in week 1 - you can refer back to the week 1 materials for additional guidance if you need it.

Key step

Before going any further, the data file rmb-week-3_lecture-quiz-data_ai-faces.csv into a new Jamovi session.

Take a read through the data columns. We have 26 in total with the following information.

Column Names	Description
`First Name`	Participant ID - always ‘Anonymous’
`DataUse`	Participant response to data re-use question
`AIConfidenceBefore`	Confidence in distinguishing AI faces from real BEFORE the task : 1 (Completely confident) to 10 (Not at all confident)
`EmoConfidenceBefore`	Confidence in distinguishing happy from sad faces BEFORE the task (Emotional control) : 1 (Completely confidence) to 10 (Not at all confident)
`MemoryConfidenceBefore`	Confidence in recognising a face from a long time ago BEFORE the task (Memory control) : 1 (Completely confidence) to 10 (Not at all confident)
`Face1_Real`	Result for face (1 is correct response, 0 is incorrect)
`Face2_Real`	Result for face (1 is correct response, 0 is incorrect)
`Face3_AI`	Result for face (1 is correct response, 0 is incorrect)
`Face4_AI`	Result for face (1 is correct response, 0 is incorrect)
`Face5_AI`	Result for face (1 is correct response, 0 is incorrect)
`Face6_AI`	Result for face (1 is correct response, 0 is incorrect)
`Face7_Real`	Result for face (1 is correct response, 0 is incorrect)
`Face8_Real`	Result for face (1 is correct response, 0 is incorrect)
`Face9_Real`	Result for face (1 is correct response, 0 is incorrect)
`Face10_Real`	Result for face (1 is correct response, 0 is incorrect)
`Face11_AI`	Result for face (1 is correct response, 0 is incorrect)
`Face12_AI`	Result for face (1 is correct response, 0 is incorrect)
`Quiz1`	Response for revision quiz question
`Quiz2`	Response for revision quiz question
`Quiz3`	Response for revision quiz question
`AIConfidenceAfter`	Confidence in distinguishing AI faces from real AFTER the task
`EmoConfidenceAfter`	Confidence in distinguishing happy from sad faces AFTER the task (Emotional control)
`MemoryConfidenceAfter`	Confidence in recognising a face from a long time ago AFTER the task (Memory control)

Work through the following questions, try to get an answer yourself before clicking to see the result. Data exploration is a critical skill that you’ll need whenever looking a new data throughout your degree.

Data Skills - how many participants took part in the quiz?

We have 124 rows of data observations the spreadsheet.

Data Skills - did everyone consent to have their data included in this practical?

Yes, the responses in the DataUse column are always positive - “Yes, I’m happy for my data to be included”. We removed the data with negative responses before sharing the data here.

Data Skills - proportion of responsese would we expect to be accurate if participants responded randomly in the face questions.?

We would expect 50% accuracy if participants answered randomly. There were only two response options ‘Real person’ or ‘AI generated’.

You can use Descriptive Statistics to answer the following questions.

Data Skills - which face did participants identify most accurately?

Compute desriptive statistics for all 12 faces and look at the ‘Mean’.

Face 11 was most accurately identified as AI generated with 91.9% accuracy!

Face 4 (AI) and Face 1 (Real) were close behind.

Data Skills - which face did participants identify least accurately?

Compute desriptive statistics for all 12 faces and look at the ‘Mean’.

Face 9 was least accurately indified as a real human with 31.8% accuracy. Face 5 (AI was second least accurate)

Data Skills - did we get complete data from all participants in the face task?

Compute desriptive statistics for all 12 faces and look at the ‘Mean’.

No, quite a few participants dropped in and out during the task. We’re missing between 6 (Face 2 and 3) and 21 (Face 12) participants on each question.

Data Skills - were participants most confident in their AI discrimination, emotion recognition or memory before the face task?

Compute desriptive statistics for all AIConfidenceBefore, EmoConfidenceBefore and MemoryConfidenceBefore.

Participants were most confident in their ability to distinguish happy from sad faces in the Emotional face control condition with a score of 2.51. Participants were least confident in their ability to distinguish AI faces from real faces with a score of 4.54.

Data Skills - are the confidence scores before face task normally distributed?

Compute desriptive statistics for all AIConfidenceBefore, EmoConfidenceBefore and MemoryConfidenceBefore. Add the ‘Shapiro-Wilk’ statistic to the table.

It looks like none of these variabiles are normally distributied… The W statistics for AIConfidenceBefore is much higher than the other two, but the p-values indicate that all three show a departure from a normal distribution.

Add a Histogram to your descriptive plots - we can see that AI confidence looks close to normally distributed but there is a very large skew in both Emotional confidence and Memory confidence - some participants had very low confidence scores in these conditions!

3. Computing overall accuracy

The descriptive statistics gave us a good overview of the dataset and we can start working towards testing our hypotheses.

One critical piece of information is missing though! we have accuracy for each individual face but not an overall score for each participant. We’ll need to compute this new variable ourselves from the average accuracy of all twelve faces.

Data Skills - computing a variable from other columns

We can define our own variables in Jamovi using the ‘Compute’ function in the ‘Variables’ or the ‘Data’ tabs. Open a new Transformed variable.

This will open a menu with an option to give the new variable an name and description. Name the variable ‘PropFacesCorrect’ to indicate that it contatins the proportion of faces that the participant responded correctly on. You can add a description if you like though this is optional.

The variable is defined within the formula box below the name definitions. We want to compute the average accuracy across all 12 faces so we can add the formula to compute that into the box.

The formula should add all the columns together and divide the result by 12 (the total number of faces). Make sure that all the additions are grouped by parentheses! otherwise Jamovi will only divide the final value by 12 and add it to the others. This is an example of BODMAS - Brackets, Of, Division/Multiplication, Addition/Subtraction that you might have covered in maths in school. talk to your tutor to make sure that this step makes sense.

The formula should look something like this, I’ve removed some faces to simplify the visualisation. You should include them all.

(Face1_Real + Face2_Real + ... + Face11_AI + Face12_AI) / 12

Once this is complete, you should be able to find your new column of values.

Now, let’s take a look at our new variable. Compute some descritive statistics!

Data Skills - how many participants gave a response to all 12 faces?

We have an N of 86 with 38 participants not responding to one or more of the faces.

Data Skills - are the average accuracies normally distributed?

Take a look at the Shaprio-Wilk statistic and the histogram of PropFacesCorrect, this looks like a normally distributed data variable.

Data Skills - what proportion correct did the most and least accurate participants get?

Take a look at the maxiumum and minumim of hte descriptive statistics, and perhaps add the table of ‘Most extreme’ data values in the Outiers section.

Two individuals managed to get all 12 faces correct! Two individuals were at 25% accurate corresponding to 3 out of 12 faces correct.

4. Hypothesis 1 - People are able to distinguish AI generated faces from real photos of humans

Ok, we’re ready to test the first hypothesis. Use the information you know about the dataset and try to find an answer!

Check Your Understanding

Test the following hypothesis:

People are able to distinguish AI generated faces from real photos of humans

What sort of hypothesis is this and what is the most approprate statistical test?

Compute the statistics, do the data support the experimental or the null hypothesis?

We could write a statistical version of this hypothesis as something like this

People are able to distinguish AI generated faces from real photos of faces at an accuracy greater than chance level.

What sort of test do you need to run?

We need to run a one sample t-test that compares the PropCorrectFaces variable to a chance level of 0.5 (corresponding to 50%). The results could be reported as follows

A one sample t-test comparing the group average proportion of correctly identified faces (M = 0.634, SD=0.164) to chance level (proportion correct = 0.5) showed a significant effect, t(85) = 7.52, p<0.001. Participants were on average more accurate than chance at distinguishing AI generated face from photos of real faces.

Data Skills - do the results support hypothesis 1?

Yes, the statistical test allows us to reject the null hypothesis that participant performance was no different to chance level on this task and accept the experimental hypothesis that participants are able to distinguish AI generated faces from real photos of humans.

5. Hypothesis 2 - Confident people are better at distinguishing AI faces from real faces.

Now the second hypothesis. We don’t have everything we need to test this hypothesis yet. We’ll need some way to split our participants into two groups - one with high confidence in AI face detection and one with low confidence. Time to compute another variable.

Check Your Understanding

Compute a new variable named ConfidentBefore that separates the groups based on the median AI face detection confidence before the task.

You’ll need the median value for AIConfidenceBefore - you can compute this from descriptive statistics.

The computed variable will need some logical condition (using operators like ‘>’, ‘<’ or ‘==’) that separates participants with confidence above and below the median.

The median value for AIConfidenceBefore is 4, so our computed variable definition will look like this.

The values in ConfidentBefore will now be ‘True’ for people with high confidence and ‘False’ for people with low confidence. It doesn’t matter if you’ve done this the other way around - the tests will still work but the results will be flipped in the other direction (multiplied by -1)

With our new variable, we have what we need to run an independent samples t-test. This is very straightforward following the analyses we’ve run previously in the module.

Open the ‘Independent Sample t-test’ menu under ‘t-tests’. To run the analysis, drag PropFacesCorrect across as our dependent variable and our new ConfidentBefore variable as the grouping variable. The results should appear on the right automatically.

Once you have computed the core test - add the following options:

Descriptives
Homogeneity Test
Normality Test

Let’s think through the results

Data Skills - are the data normally distributed?

The Shairo-Wilk statistic is not significant, indicating that the data are likely to be normally distributed

Data Skills - do we have homogeneity of variance?

Levene’s statistic is not significant, indicating that the variance of the two groups is comparable. We can confirm this by looking at the standard deviations in the descriptives table. 0.171 and 0.158 are fairly similar so the test report makes intuitive sense.

Data Skills - which test should we report? Student’s, Welch’s or Mann-Whitney U?

Both assumptions of the standard Student’s t-test we looked at above are met by the data - we can proceed with Student’s test.

Data Skills - were confident partiticpants or unconfident participants more accurate?

Confident participants were more accurate! participants in this study are good at judging their ability.

We can see this by looking at the means in the descriptive statistics and the descriptives plot if you have run it.

Data Skills - do the results support hypothesis 2?

Yes, overall it seems like we can reject the null for hypothesis 2 and accept the experimental hypothesis that more confident participants are better at distinguishing AI faces from real human faces.

6. Hypothesis 3 - People’s confidence in their ability to distinguish AI generated faces will reduce after performing the task

Next we want to explore whether performing the face decision task changes peoples confidences in their abililty to detect AI generated faces. Remember that all participants categorised 12 faces with immediate correct/incorrect feedback and made confidence ratings at the start and end of the task.

Open the ‘Paired Sample t-test’ menu under ‘t-tests’. To run the analysis, drag AIConfidenceBefore and AIConfidenceAfter across as our pair of dependent variables. This is the format for running paired samples t-tests, the rest of the options should be familiar from our previous analyses.

Once you have computed the core test, do the same for the EmoConfidence and MemoryConfidence, and add the following options:

Descriptives
Normality Test

The results should appear on the right automatically.

Let’s think through the results…

Data Skills - does confidence in distinguishing AI faces from real faces change after completing the face task?

Yes, the results indicate that there is a significant difference. We could report the results like this:

A paired samples t-test showed a significant difference in participants confidence in their ability to distinguish AI faces from real faces before (M=4.48, SD=2.20) and after (M=5.66, SD=2.57) the face perception task. t(76) = -3.769, p<0.001. Confidence was significantly reduced after completing the task.

Data Skills - does confidence in distinguishing sad faces from happy faces change after completing the face task?

No, the results indicate that there is not a significant difference. We could report the results like this:

A paired samples t-test showed a significant difference in participants confidence in their ability to distinguish sad faces from happy faces before (M=2.57, SD=1.49) and after (M=2.74, SD=1.37) the face perception task. t(64) = -1.1017, p=0.313. Confidence was unchanged before and after the task.

Data Skills - does confidence in recognising a face you haven’t seen for a long time change after completing the face task?

No, the results indicate that there is not a significant difference. We could report the results like this:

A paired samples t-test showed a significant difference in participants confidence in their ability to recognise faces of people they haven’t seen for a long time before (M=3.67, SD=1.89) and after (M=3.70, SD=1.86) the face perception task. t(66) = -0.234, p=0.816. Confidence was unchanged before and after the task.

Data Skills - What test should we report for these comparisons? Are the parametric assumptions met?

Check the results of hte normality tests. The Shapiro-Wilk statistic is significant for all three tests! We should consider the non-parametric alternative test - add the ‘Wilcoxon Rank’ test to your analysis.

We should report the Wilcoxon Rank test along with the medians for non-parametric data. For example:

A Wilcoxon Rank test showed a significant difference in participants confidence in their ability to distinguish AI faces from real faces before (Median=4) and after (Median=5) the face perception task. W=680, p<0.001. Confidence was significantly reduced after completing the task.

Data Skills - do the results support hypothesis 3?

Yes, overall it seems like we can reject the null for hypothesis 3 and accept the experimental hypothesis. Participants were less confident in their ability to detect AI generated faces after completing the task but their confidence in detecting emotions and remembering faces from a long time ago remains unchanged.

7. Bonus Hypothesis! People are more accurate at identifying photos of real people compared to AI generated photos.

Let’s use our new skills from this practical to answer one last question.

Check Your Understanding

Use a t-test to test this hypothesis:

People are more accurate at identifying photos of real people compared to AI generated photos

You’ll several of the skills from this session to answer the question… think through what sort of variables you’ll need and what sort of test you’ll need.

We need 2 new variables to answer this question. We have already computed PropCorrectFaces in an earlier section, but we now need to make separate versions of this for AI faces and real faces…

The variable transforms will look something like this:

What sort of t-test will you need?

We need to compute a paired samples t-test to answer the question as each participant contributes to both the AI face and real face conditions. Compute the test along with some descriptive statistics, we can report the test as follows

A paired samples t-test showed a significant difference in the correct identification of AI faces (M=0.702, SD=0.198) compared to real faces (M=0.566, SD=0.229). t(85) = -4.60, p<0.001. AI faces were identified more accurately than real faces.

8. Summary

We’ve computed a range of tests to statistically assess our hypotheses today! One experiment can often yield enough data to run a wide range of analyses. It is always a good idea to start with your hypotheses and predictions to break the analysis down into manageable chunks.

References

Miller, Elizabeth J., Ben A. Steward, Zak Witkower, Clare A. M. Sutherland, Eva G. Krumhuber, and Amy Dawel. 2023. “AI Hyperrealism: Why AI Faces Are Perceived as More Real Than Human Ones.” Psychological Science 34 (12): 1390–1403. https://doi.org/10.1177/09567976231207095.