Are students with higher grades who score worse on retest cheaters?
The question received a substantial edit since the last of six answers.
The edited question contains an example of regression to the mean in the context
of student scores on a 100-question true-false test and a retest for the
top performers on an equivalent test. On the retest, the group of top performers from the first test scores substantially closer to the average.
What's going on? Were the students cheating the first time?
No, it is important to account for regression to the mean. Test performance for multiple-choice tests is a combination of luck in guessing and ability/knowledge.
Some portion of the top performers' scores was due to good luck, which was not necessarily
repeatable the second time.
Or should they just stay away from the roulette wheel?
Let's first assume that no skill at all was involved, that the students were just flipping (fair)
coins to determine their answers. What's the expected score? Well, each answer has independently a
50% chance of being the correct one, so we expect 50% of 100 or a score of 50.
But, that's an expected value. Some will do better merely by chance.
The probability of scoring at least 60% correct, according
to the binomial distribution, is approximately 2.8%. So, in a group of 3000 students, the
expected number of students to get a score of 60 or better is about 85.
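If you want to check these numbers yourself, you can sum the binomial tail directly. Here is a minimal sketch using only Python's standard library; the helper name `tail_prob` is my own choice, not something from a library.

```python
from math import comb

def tail_prob(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 100 true-false questions, each answered by flipping a fair coin
p_fair = tail_prob(100, 60, 0.5)
expected_high_scorers = 3000 * p_fair

print(f"P(score >= 60) = {p_fair:.3f}")
print(f"Expected students scoring >= 60 out of 3000: {expected_high_scorers:.0f}")
```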
Now let's assume indeed there were 85 students with a score of 60% or better and retest them.
What's the expected score on retest under the same coin-flipping method? It's still 50% of 100!
What's the probability that a student being retested in this manner will score at least 60%?
It's still 2.8%! So we should expect only about 2 of the 85 (2.8%⋅85 ≈ 2.4) to score at least 60% on retest.
Under this setup it is a fallacy to assume an expected score on retest different from the expected
score on the first test -- they are both 50% of 100. The gambler's fallacy would be to assume that
the good luck of the high scoring students is more likely to be balanced out by bad luck on retest.
Under this fallacy, you'd bet on the expected retest scores to be below 50. The hot-hand fallacy
(here) would be to assume that the good luck of the high scoring students is more likely to continue
and bet on the expected retest scores to be above 50.
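A small simulation can make this concrete. The sketch below assumes the pure coin-flipping model described above (3000 students, 100 questions, a cutoff of 60); the seed and the function name `take_test` are arbitrary choices of mine. The retest mean of the selected group should land near 50, neither below it (gambler's fallacy) nor near 60 (hot hand).

```python
import random

random.seed(0)  # arbitrary seed, only so that reruns give the same numbers

def take_test(p=0.5, n_questions=100):
    """Score = number of correct answers when each is right with probability p."""
    return sum(random.random() < p for _ in range(n_questions))

first = [take_test() for _ in range(3000)]
top = [s for s in first if s >= 60]   # the "top performers" on the first test
retest = [take_test() for _ in top]   # same coin-flipping method again

first_mean = sum(top) / len(top)
retest_mean = sum(retest) / len(retest)
print(f"{len(top)} top scorers: first-test mean {first_mean:.1f}, "
      f"retest mean {retest_mean:.1f}")
```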
Lucky coins and lucky flips
Reality is a bit more complicated. Let's update our model. First, it doesn't matter what the actual answers are
if we are just flipping coins, so let's just score by number of heads. So far, the model is equivalent.
Now let's assume 1000 coins are biased to come up heads with probability 55% (good coins, G),
1000 coins are biased to come up heads with probability 45% (bad coins, B), and 1000 have equal probability of
heads or tails (fair coins, F), and that these are randomly distributed among the 3000 students.
This is analogous to assuming higher and lower ability/knowledge under the test taking example, but it is easier to reason
correctly about inanimate objects.
The expected score is (55⋅1000+45⋅1000+50⋅1000)/3000=50 for any student given the random distribution. So,
the expected score for the first test has not changed.
Now, the probability of scoring at least 60% correct, again using the binomial distribution, is 18.3% for good coins,
0.2% for bad coins, and of course still 2.8% for the fair coins. Since an equal number of each type of coin was
randomly distributed, the overall probability of scoring at least 60% is the average of these, or 7.1%. The expected number of students
scoring at least 60% correct is 7.1%⋅3000 = 213.
Now, if we do indeed have 213 students scoring at least 60% correct under this setup of biased coins, what's the expected score on retest?
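The per-coin tail probabilities and their average can be recomputed the same way. This is a sketch under the assumptions stated above (three equal groups of 1000 coins with heads probabilities 55%, 45% and 50%); the names `tail_prob` and `coin_bias` are mine.

```python
from math import comb

def tail_prob(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

coin_bias = {"good": 0.55, "bad": 0.45, "fair": 0.50}

# P(score >= 60) for each type of coin
p_high = {kind: tail_prob(100, 60, p) for kind, p in coin_bias.items()}

# Equal-sized groups, so the overall probability is the plain average
p_high_overall = sum(p_high.values()) / len(p_high)
expected_count = 3000 * p_high_overall

for kind, prob in p_high.items():
    print(f"{kind}: P(score >= 60) = {prob:.3f}")
print(f"overall: {p_high_overall:.3f}, expected count out of 3000: {expected_count:.0f}")
```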
Not 50% of 100 anymore! You can work it out with Bayes' theorem, but since we used equal-size groups, the probability of having a
type of coin given an outcome is (here) proportional to the probability of the outcome given the type of coin. In other words, there is
an 86%=18.3%/(18.3%+0.2%+2.8%) chance that a student scoring at least 60% had a good coin, a 1%=0.2%/(18.3%+0.2%+2.8%) chance of a bad coin,
and a 13% chance of a fair coin. The expected score on retest is therefore 86%⋅55+1%⋅45+13%⋅50=54.25 out of 100. This is lower than
the actual scores of the first round, at least 60, but higher than the expected score before the first round, 50.
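The Bayes step can be spelled out in a few lines. This sketch assumes the same three equal-sized groups, so the prior is 1/3 for each coin type; the variable names are mine. Because it uses unrounded tail probabilities, the result may differ from the hand calculation in the last digit or so.

```python
from math import comb

def tail_prob(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

coin_bias = {"good": 0.55, "bad": 0.45, "fair": 0.50}
prior = {kind: 1 / 3 for kind in coin_bias}  # equal-sized groups
likelihood = {kind: tail_prob(100, 60, p) for kind, p in coin_bias.items()}

# Bayes' theorem: P(coin | score >= 60) = P(score >= 60 | coin) P(coin) / P(score >= 60)
evidence = sum(prior[k] * likelihood[k] for k in coin_bias)
posterior = {k: prior[k] * likelihood[k] / evidence for k in coin_bias}

# Expected retest score: posterior-weighted average of each coin's mean score
expected_retest = sum(posterior[k] * 100 * coin_bias[k] for k in coin_bias)

for kind, prob in posterior.items():
    print(f"P({kind} coin | score >= 60) = {prob:.2f}")
print(f"Expected retest score: {expected_retest:.2f}")
```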
So even when some coins are better than others, randomness in the coin flips means that the top performers selected from a test will still exhibit some regression to the mean in a retest.
In this modified model, hot-handedness is no longer an outright fallacy -- scoring better in the first round does mean a higher probability
of having a good coin! However, the gambler's fallacy is still a fallacy -- those who experienced good luck cannot be expected to be compensated
with bad luck on retest.
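To see both effects at once, here is a rough simulation of the biased-coin model, assuming each student keeps the same coin for the retest. The seed and all names are arbitrary choices of mine. The selected group should be mostly good coins, and its retest mean should sit between 50 and its first-test mean.

```python
import random

random.seed(1)  # arbitrary seed, only so that reruns give the same numbers

def take_test(p, n_questions=100):
    """Score = number of heads in n_questions flips of a coin with P(heads) = p."""
    return sum(random.random() < p for _ in range(n_questions))

# 1000 good, 1000 bad and 1000 fair coins, one per student
biases = [0.55] * 1000 + [0.45] * 1000 + [0.50] * 1000

first = [(p, take_test(p)) for p in biases]
top = [(p, score) for p, score in first if score >= 60]
retest = [take_test(p) for p, _ in top]  # each student keeps the same coin

first_mean = sum(score for _, score in top) / len(top)
retest_mean = sum(retest) / len(retest)
share_good = sum(p == 0.55 for p, _ in top) / len(top)

print(f"{len(top)} top scorers, {share_good:.0%} with a good coin")
print(f"first-test mean {first_mean:.1f}, retest mean {retest_mean:.1f}")
```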