The Puzzle of Greater Male Variance
Abstract
Greater male variability on tests of mental ability would explain why males predominate not only at the highest levels in mathematics and science, but in business, politics, and nearly all aspects of life. The literature on sex differences in mental test scores was reviewed to determine whether males in fact exhibit greater variability. Two methods were used to compare sex differences in variability: ratios of total test score variances, and ratios of the number of males and females scoring at or above extreme score cutoffs, or “tail ratios.” Most samples exceeded 10,000. The review also included studies of physical traits such as height, weight, and blood parameters; brain volume measures; and studies of variability in four taxa: mammals, birds, insects, and butterflies. In the vast majority of total test score ratios, males were more variable than females. Males were more likely to score in the extreme right tail, indicating higher aptitude, on tests of mathematics and spatial ability in which mean sex differences favor males. On tests of writing, vocabulary, and spelling, in which mean differences favor females, males were more extreme in the left tail, indicating lower ability. Males also exhibited greater variance in physical traits, blood parameters, and brain volume measures. Similar variance differences were also found in animals. Genetic theory proposes that greater variability depends on which sex is heterogametic, that is, has two different sex chromosomes. In mammals and insects, this is the male sex, with the XY sex chromosome pair. In birds and butterflies, this is the female sex, with the WZ sex chromosome pair. It is heterogamety and not sex that determines which sex is more variable.
This is because in the heterogametic sex, recessives on the single X or Z chromosome are fully expressed generating a binomial distribution with higher trait variance, while in the homogametic sex, recessive traits are averaged generating a normal distribution with lower trait variance. Females have improved their performance on mathematics tests for the gifted, with ratios falling from 13:1 favoring males in 1980 to 3:1 favoring males from 1990 onward, but the improvement has stopped. Heterogamety predicts that females, because they are homogametic, will never equal males not only in the right extreme of mathematics ability and other ability distributions but in the distribution of any trait influenced by the X chromosome.
In 2007, Psychological Science in the Public Interest published an issue devoted to sex differences in science and mathematics written by a group of contributors from fields such as neuroscience, gifted-studies, cognitive development, cognitive gender differences, and evolutionary psychology (Halpern, Benbow, Geary, Gur, Hyde, & Gernsbacher). Each author reviewed material from their specialty relating to the problem of female under-representation in mathematics- and science-based fields in academia, research, and industry. The group concluded that observed sex differences result from both genetic influences and sociocultural differences in the treatment of males and females, most certainly an anodyne conclusion of the kind expected from a committee of disparate experts.
The 2007 article was in part a response to the controversy following statements made two years earlier by then Harvard University president Lawrence Summers to the effect that there were fewer women in academia and industry because fewer women score at the highest levels on tests of aptitude predicting success in mathematics and the hard sciences. This was tantamount to saying that, at the highest ability levels, women are inferior to men. Summers’ statement about women was a major factor in his 2006 resignation as Harvard’s president. Regardless of campus politics, which also played a large role in his departure, the question remains, why do men dominate at the highest levels of mathematics, science, business, and industry?
Last year, the issue of sex differences arose again, this time in the citadels of technology, in particular the Silicon Valley offices of Google and the claim by one software engineer, James Damore (2017), that, among other things, “differences in distributions of traits between men and women may in part explain why we don't have 50% representation of women in tech and leadership.” This was Summers’ explanation exactly, and for offering it, Damore experienced a fate similar to Summers’: He was fired.
The topic of cognitive gender differences being much too broad for any one article, this paper considers only the issue of greater male variance on mental ability tests. This review updates the 2007 Public Interest article now that more than 10 years have passed and considers new findings, which include much larger samples of test data than were then available, and recent research published by geneticists and behavior geneticists shedding new light on this controversy.
The approach I’ve taken in this review was dictated by the issue under study, the variance of mental ability as reported in a wide range of aptitude and achievement tests of children and adolescents. While it might seem preferable to perform a meta-analysis of many study results, the variance statistic does not lend itself to this approach. Pooling effect sizes from different samples can give a better estimate of the true population mean difference than individual studies can. But it is difficult to know what group is represented in the result when a meta-analysis is performed on heterogeneous variances. Studies of variance should be based on large nationally representative samples, not on heterogeneous studies, some based on selected samples, thrown together and christened a meta-analysis, leaving the reader with the question, “What is the population the parameter of which is being described?” For a brief yet astute discussion of the perils of using meta-analysis to study sex differences in variance, see Hedges and Nowell (1995), pp. 41-42.
If meta-analysis is inappropriate, what other approach can one take? In the studies reviewed here, a wide variety of tests were given to children ranging in age from early childhood through late adolescence. I have chosen to review the 17 studies individually, presenting reported sex differences in means, variances, and tail ratios in separate tables for each study. After reviewing each study, I discuss its merits, for example, Did the study assess aptitude or achievement? Did subject age affect the results? What patterns exist across tests of domains such as verbal, mathematical, spatial, and science aptitude; change in variances over time, etc?
None of the data in the studies reviewed here were generated by the researchers for the purpose of comparing male and female variance, nor could they have been, because the sample sizes required are too large. In every instance, tests were given to meet an institutional requirement such as college entrance or by a government agency assessing student progress or teacher/school effectiveness. The sample sizes were large, some having tested or screened millions of students randomly sampled from large countries such as the U.S. and UK, and in one case, the PISA test of 276,165 15-year-old students from 41 developed and developing nations. The results were data sets that largely spoke for themselves with little data manipulation needed. In most cases, I have reproduced the key tables from each study to enable inspection of the supporting evidence for the issue under review.
Requirements to Show Sex Differences in Variability
Sources of Differences in the Right Tail of Distributions
Of interest here is the difference in the frequency of extreme scores in the tails of test-score distributions. More males will be found in the extreme right of a distribution if males have a higher mean and both sexes have the same variance, or if both sexes have the same mean but males have a greater variance, or if males have both a higher mean and greater variance. No studies with the exception of Nowell and Hedges (1998) using a method developed by Lewis and Willingham (1995) attempt to partition the relative frequencies of extreme scores into components due to mean differences and variance differences.
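The three scenarios above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that scores in both sexes are normally distributed and that equal numbers of males and females are tested. The function name and the parameter values are hypothetical and are not taken from any of the reviewed studies. The male mean advantage is expressed here in female standard deviation units, a simplification of the usual standardized mean difference.

```python
from statistics import NormalDist


def normal_tail_ratio(mean_gap, variance_ratio, cutoff):
    """Male:female ratio of the proportions scoring at or above a cutoff.

    Illustrative assumptions: female scores ~ Normal(0, 1), male scores
    ~ Normal(mean_gap, sqrt(variance_ratio)), cutoff in female SD units.
    """
    females = NormalDist(mu=0.0, sigma=1.0)
    males = NormalDist(mu=mean_gap, sigma=variance_ratio ** 0.5)
    share_male = 1.0 - males.cdf(cutoff)
    share_female = 1.0 - females.cdf(cutoff)
    return share_male / share_female


# Hypothetical parameter values chosen only to show the pattern.
scenarios = {
    "higher male mean, equal variance": (0.3, 1.0),
    "equal means, greater male variance": (0.0, 1.2),
    "higher male mean and greater male variance": (0.3, 1.2),
}

for label, (mean_gap, variance_ratio) in scenarios.items():
    ratios = [normal_tail_ratio(mean_gap, variance_ratio, z) for z in (1.0, 2.0, 3.0)]
    print(label, [round(r, 2) for r in ratios])
```

Under these assumptions every scenario yields ratios above 1.0 that grow as the cutoff moves further into the right tail, which is why an excess of males above a cutoff cannot by itself distinguish a mean difference from a variance difference.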
Measures of Sex Differences
Virtually all of the reviewed studies of sex differences report the mean sex difference and the total test score variance ratio or VR. They also report “tail ratios,” a proportion comparing the number of males and females scoring above a given cut off in the extremes of distributions.
Means are universally compared using Cohen’s d (1988), the mean difference standardized by the pooled within-groups standard deviation, which for groups of equal size is:
(MeanM – MeanF)/√[(VarM + VarF)/2]
Cohen’s (1988) criteria for assessing the importance of d are .20, small; .50, medium; and .80, large. In most of the studies reviewed below, d was calculated as male mean – female mean. In those few studies where it was calculated as female mean – male mean, I reversed the sign and noted the change in the footnotes to the affected tables.
The standard deviation is the most commonly used measure of test-score dispersion, but its square, the variance, is a better measure of variability because its ratio can be used to compare the variability of different groups. To do this for males and females, the simple ratio of male over female variance is computed:
VarM/VarF.
A ratio of 1.0 indicates equal male and female variances. Ratios larger (smaller) than 1.0 indicate greater (lesser) male variance. Feingold (1992) suggested that a difference of 10 percent in the variance, or a variance ratio of 1.10, is the minimal required for the difference to have substantive importance. In all of the studies, the variance ratios were calculated as male variance/female variance.
Tail ratios are a simple means of comparing performance at the extremes of test score distributions. The counts of males and females within the 10%, 5%, and .01% cutoffs provide the ratio of males to females performing at or above that level:
CountM/CountF.
Much like VR, a tail ratio of 1.0 indicates that the number of males and females scoring above a given cut off is equal. Ratios larger (smaller) than 1.0 indicate more (fewer) males scored at or above the cut off. An increase in tail ratios with successively higher cut offs indicates that a greater disproportion of males is scoring at higher levels and thus possesses to a greater degree whatever latent trait the test measures. In all of the studies, the tail ratios were calculated as male count/female count.
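The three measures described in this section can be sketched in a few lines. The function names and the example scores below are hypothetical and serve only to show the calculations. The pooled standard deviation is weighted by degrees of freedom, one common convention, and sample variances are used throughout.

```python
from statistics import mean, variance


def cohens_d(male, female):
    """Male mean minus female mean, divided by the pooled within-groups SD."""
    n_m, n_f = len(male), len(female)
    pooled_var = ((n_m - 1) * variance(male) + (n_f - 1) * variance(female)) / (n_m + n_f - 2)
    return (mean(male) - mean(female)) / pooled_var ** 0.5


def variance_ratio(male, female):
    """VR: male variance over female variance; 1.0 means equal variability."""
    return variance(male) / variance(female)


def tail_ratio_from_counts(male, female, cutoff):
    """Count of males at or above the cutoff over the count of females."""
    count_m = sum(score >= cutoff for score in male)
    count_f = sum(score >= cutoff for score in female)
    # Undefined (ZeroDivisionError) when no female score reaches the cutoff.
    return count_m / count_f


# Made-up scores, for illustration only.
male_scores = [35, 42, 50, 50, 58, 65, 71]
female_scores = [40, 45, 49, 51, 55, 60, 62]

print(cohens_d(male_scores, female_scores))
print(variance_ratio(male_scores, female_scores))
print(tail_ratio_from_counts(male_scores, female_scores, cutoff=60))
```

With real data the groups would number in the thousands or more; at the most extreme cutoffs the female count can be small, so the tail ratio becomes unstable or undefined.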
Some studies look only at the right tail of the distribution. Samples selected for high scorers will reduce the variance by truncating the left side of the distribution. For these studies, it is not useful to estimate the population variance but rather to compare the counts of males and females scoring at or above a given cut off score, that is, tail ratios. Lewis and Willingham (1995) found that the mean sex difference in restricted samples was correlated with the variance difference.
Volunteers are known to differ from the general population. Children who volunteer for enrichment programs, or whose parents do the volunteering, are likely to be more motivated and different from a nation-wide sample of non-volunteer same-age children in intelligence, motivation, SES, race-ethnicity, etc.
Factors Affecting Variance Differences
Age. Haworth, Wright, Luciano, Martin, de Geus, van Beijsterveldt, & Plomin (2014) found that the heritability of general intelligence increases with age, from .41 to .55 to .66 at ages 9, 11, and 17 respectively, in a study that pooled 10,689 MZ and DZ twin pairs from six studies done in four countries. Others have suggested that heritability is as high as .80 in late adulthood (Johnson, Carothers, & Deary, 2009; Plomin & Deary, 2015). At the very least, tests should be of young adults, although there is no large-scale testing of persons older than those taking graduate school admissions tests like the GRE.
Range Restriction. Because we are interested in the relative number of males and females scoring in the tails of the test distribution, there should be no ceiling or floor effects. An ideal test would have few or no zero or perfect scores to assure that the difficulty of the test matched the ability of the test takers.
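A simple screen for floor and ceiling effects is to compute the share of test takers at the minimum and maximum possible scores. The sketch below is one hypothetical way to do this. The function name and the example scores are assumptions made for illustration, not a procedure used in any of the reviewed studies.

```python
def floor_ceiling_shares(scores, min_score, max_score):
    """Return the proportions of scores at the floor and at the ceiling."""
    n = len(scores)
    at_floor = sum(score <= min_score for score in scores) / n
    at_ceiling = sum(score >= max_score for score in scores) / n
    return at_floor, at_ceiling


# Made-up scores on a 200-800 scale, for illustration only.
example_scores = [200, 350, 420, 500, 610, 800, 800, 470]
floor_share, ceiling_share = floor_ceiling_shares(example_scores, 200, 800)
print(floor_share, ceiling_share)
```

Shares near zero at both ends suggest that the test's difficulty suits the sample, so counts in the tails are not distorted by truncation.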
Unselected Samples. Ideally, samples should be unselected to ensure that the full distribution of ability within a population is tested. This can be achieved with a procedure such as national probability sampling, which is the best means to obtain a truly representative sample of the nation as a whole. There are large samples of selected populations such as the SAT and GRE, tests used to screen students for college and graduate school. But these samples are neither random nor representative despite samples numbering in the millions.
Aptitude Versus Achievement Tests. Aptitude tests measure student ability and achievement tests measure student learning and school effectiveness. It is better to study tests of aptitude rather than achievement if we are studying ability although it is impossible to study any ability divorced from previous learning experience. Unfortunately, there are few large-scale studies of “culture-free” tests of intelligence, such as the Raven Progressive Matrices. Societies economically developed enough to do large-scale testing also have compulsory schooling usually through the American equivalent of high school. Large-scale testing is done with school children and adolescents to monitor their progress and to screen for college and graduate school admissions. An example of using large-scale testing over a range of ages is the No Child Left Behind program, which required regular testing of elementary and high school children to determine whether they were achieving specific learning goals and whether teachers were performing up to standard (Zelizer, 2015). Some schools whose students failed to progress adequately were closed. The diversion of classroom time and school resources away from instruction to prepare students for these achievement tests has been a source of parental complaints (Strauss, 2015). To the extent that test preparation becomes “teaching to the test,” sex differences in the means and variances will be reduced.
Genetic theory
This paper will show that the sex difference in variance is due to the difference between the sexes in chromosome allotment, namely the difference in the sex chromosomes, XX for females and XY for males. There is no difference between males and females with regard to the 22 pairs of somatic chromosomes, the autosomes. Both males and females have the same chromosomes and the same coding regions and alleles on all 22 pairs of autosomes. The random process by which they are assigned those alleles is the same for both sexes. But with the sex chromosomes, X and Y, the genetic allotment is different. Because the Y chromosome that men receive is vestigial, it leaves the X chromosome unpaired, so that not only are dominant alleles fully expressed, but recessive alleles are fully expressed as well. In females, the pairing of two X chromosomes means that recessive traits are expressed only if there are recessive alleles on both X chromosomes, which reduces the probability that the recessive trait is expressed to the square of the probability for males, one source of lower female variability.
Johnson et al. (2009) demonstrated how this works in a simple model of a single gene with two alleles on the male single X chromosome,
A and a,
that will have the maximum population variance of
0.5 x 0.5 = 0.25
when the allele frequencies are equal. For both the dominant and recessive alleles, their probability of expression is equal to their proportion in the genome.
Matters are different for females, who have three genotypes arising from the same two alleles because each of their two X chromosomes can carry either A or a. With equal allele frequencies, the genotype frequencies are
AA = 0.25, Aa = 0.50, and aa = 0.25.
In a perfect world, this would lead to reduced variance in females because the phenotypic expression of AA and Aa are the same when there is complete dominance and thus a lower population variance of
(0.25 + 0.50) x (1.0 - 0.75) = 0.75 x 0.25 = 0.188.
But the fact that females have two X chromosomes while males have one complicates matters. This is an imbalance that nature corrects by silencing or “inactivating” one of the two female X chromosomes. Which X chromosome is silenced is randomly determined early in gestation when the embryo is between 8 and 16 cells (Craig et al., 2009). In a heterozygous female, half of these cells will have an active chromosome X1 with allele A and half will have an active chromosome X2 with allele a. This equal splitting of the two X chromosomes and their different alleles will lead to phenotypic expression that is the average of alleles A and a. Because half the females have the heterozygous genotype Aa, and one-quarter each have the homozygous AA and aa genotypes, their distribution is more nearly normal and has a smaller variance than the binomial distribution of A and a in males. In short, this is the source of greater male variance and is discussed in greater detail below.
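The single-gene arithmetic in this section can be checked with a short sketch. The numeric coding of phenotypes (1 for expression of A, 0 for expression of a, and 0.5 for a heterozygous female whose cells are split evenly by X inactivation) is an assumption made here for illustration and is not taken from Johnson et al. (2009). The function names are likewise hypothetical.

```python
def variance_of(distribution):
    """Variance of a discrete distribution given as (value, probability) pairs."""
    mean = sum(value * prob for value, prob in distribution)
    return sum(prob * (value - mean) ** 2 for value, prob in distribution)


def single_locus_variances(p):
    """Trait variance for one X-linked gene with allele frequencies p (A) and 1 - p (a)."""
    q = 1.0 - p
    # Males carry one X, so each allele is expressed in proportion to its frequency.
    male = [(1.0, p), (0.0, q)]
    # Females under complete dominance: AA and Aa both express the A phenotype.
    female_dominant = [(1.0, p * p), (1.0, 2 * p * q), (0.0, q * q)]
    # Females under random X inactivation: Aa is scored as the average of A and a.
    female_averaged = [(1.0, p * p), (0.5, 2 * p * q), (0.0, q * q)]
    return {
        "male": variance_of(male),
        "female_dominant": variance_of(female_dominant),
        "female_averaged": variance_of(female_averaged),
    }


print(single_locus_variances(0.5))
```

With equal allele frequencies the sketch reproduces the 0.25 and 0.188 figures given above and yields 0.125 for the averaged case, which is half the male value under this coding.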
Published Studies
The sections that follow present, in some detail, what I believe is the most comprehensive review extant of the literature on male and female variance in mental ability testing. The studies reviewed unmistakably make the case that males are more variable 1) on virtually all tests of mathematical, spatial, and science aptitude and achievement, at both the high and low ends of the respective test score distributions, and 2) on many tests of verbal aptitude and achievement, especially at the low end of the test score distributions.
Benbow and Stanley (1980, 1983)
The issue of differential variance was given its current prominence by Benbow and Stanley, who reported in two papers (1980 and 1983) on sex differences in mathematics based on large samples of mostly 7th grade children who were given both the verbal and mathematics sections of the Scholastic Aptitude Test as part of the Study of Mathematically Precocious Youth (SMPY).
The SAT-Mathematics test (SAT-M) is normally taken by college-bound high school seniors, who at age 17 or 18, are 5 to 6 years older than the SMPY 7th graders, all of whom were age 12 except for a small number of 13 year olds in the early years of the study. Few of the SMPY children had taken algebra or had any formal training in the skills needed for the SAT-M. Benbow and Stanley (1980) gave the SAT-M not to test mathematical aptitude, but because the test was so far above the skill level of 7th graders that the SAT-M would be a test of their “numerical judgment, relational thinking, and logical reasoning.” Spearman (1904) would recognize the ability Benbow and Stanley were testing as general intelligence or g.
The 1980 report was based on the scores of 9,927 students who were recruited from the greater Baltimore region between 1972 and 1979 after scoring in the top 2 to 5 percent on a mathematics screening test. Tables 1 and 2 show the results for both the verbal and mathematics scores. Removing the 8th grade scores for December 1976 because of the small N, the mean d value of -.03 for the verbal scores shows that males and females were about equal in verbal reasoning, and the mean variance ratio of 1.04 also suggests parity in variability. But the mean d value of .50 and the mean variance ratio of 1.58 for mathematics scores show that males scored a half standard deviation higher and were nearly 60 percent more variable than the females. The extreme mathematics scores were even more disparate, with 16.6% of the males scoring above 600 but only 2.1% of the females.
In 1983, Benbow and Stanley reported results on the SAT-M from nearly 40,000 students in the mid-Atlantic region and another large group from a nationwide talent search within and beyond the Johns Hopkins talent search area. All students were under age 13. The results were similar to those from the earlier study. No difference was found in the SAT-Verbal, with the male and female means 367 and 365 respectively (standard deviations were not reported). But there was a 30-point mean difference on the SAT-Mathematics, with the male and female means 416 and 386 respectively. The variance ratio was 1.38, roughly the same as in 1980 and showing again that males are more variable than females. More importantly, as shown in Table 3, the number of boys scoring above 700 on the SAT-M over both national search samples was 13 times the number of girls (260:20), despite equal numbers of boys and girls taking the screening test.
Comment. Benbow and Stanley’s findings gave enormous impetus to research on sex differences in cognitive ability generally and to sex differences in variability specifically.
The extreme 13 to 1 ratio has become ingrained in the literature on sex differences in variability even though it has been out of date since 1990, the ratio now being 2.8:1, or in round numbers 3:1. Use of the SAT allowed Benbow and Stanley to avoid ceiling effects: Few students scored above 700, and in many years, no one scored a perfect 800. Their samples were young and highly selected, making it possible to generalize only to the very brightest students rather than to the population of seventh graders as a whole. The students were volunteers, a special group that probably differs in many ways from students in general although it is unlikely that these factors substantially affected the differential pattern of scoring. Comparing means and total score variances between two groups all of whom are in the right tail of the score distribution is questionable.
Benbow and Stanley (1980, 1983) were among the earliest to note that the traditional arguments made to explain the lower numbers of extremely able females in mathematics, such as lack of opportunity to study math and social attitudes discouraging females from pursuing careers in math and science, would create mean differences between the sexes, not variance differences. Benbow and Stanley also noted that through 11th grade, boys and girls take the same math courses, obtain about the same grades, and rate similarly their liking for mathematics and their perception of mathematics as important. Summarizing their assessment of theories explaining male superiority at the highest score levels in their 1980 report, Benbow and Stanley stated that “boy-versus-girl socialization” as the only acceptable explanation of the sex difference is “premature,” and in 1983, said that the reasons boys “dominate the highest ranges of mathematical reasoning ability were unclear.”












