Quick question. When someone asks what age you were in a certain year do you give the age that you were before or after your birthday that year?
Do you give the younger or older of the two ages you were that year?
I give the younger (age before your birthday in a given calendar year)
I give the older (age after your birthday in a given calendar year)
Hi everyone! This poll got so many responses and I am very excited because 40,427 responses is not just great, it's a big enough sample to actually test for statistical significance! And I have nearly enough university credits to have a statistics degree on top of my other 3, so my academia got activated.
Because not only did I get a lot of responses, and a few interesting people born on New Year's, I got a lot of comments varying in politeness that stated outright that the deciding factor between the two poll options was birth month, specifically whether someone was born in the first or second half of the year. The obvious implication is that if someone is born early in the year they'll give the older number because they spend more of the year the age they "turned" that year, and people born later will give the younger number because they spent more of the year younger.
(There were also some people who said they couldn't imagine in their wildest dreams being asked this question… I really hope they didn't also respond to the poll)
Come along with me while I test this out!
Tumblr's userbase is largely USAmerican as of March 2024, so I went with American statistics for this, despite being Australian myself. I downloaded births in America by month from 1969-2019 as a csv file from the UN statistical database, made available through UNData.
The way to test this hypothesis will be the chi-squared goodness of fit test, which is a non-parametric statistical test that calculates a statistic to measure the difference between observed and expected counts. This will tell us whether the ratio of people born in the first half versus the second half of the year within the United States is an effective predictor of the variable "do you tell people the older or younger age when asked how old you were in a year?", or whether any discrepancy here is due to simple chance.
Null Hypothesis (H0): People self-report the age they were in a year based on the age they spend ≥50% of the calendar year being.
If the null is true, the percentage of people who choose "older" in the poll should equal the percentage of people born in the first half of the year (Jan–June).
Alternative Hypothesis (Ha): The amount of time spent younger or older within a year is independent of how people choose to self-report their age that year.
If the alternative hypothesis is true, then the poll results will significantly deviate from the actual US birth distribution stats.
I used pandas to clean the dataset (as each year had totals as well as the months), and matplotlib to make these graphs based on the UN dataset. I am happy to share my python code if anyone wants it for some reason.
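For anyone curious what that cleaning step looks like, here is a minimal sketch, not the actual script. The column names (Year, Month, Value), the "Total" row label, and every number in the toy table are assumptions made up for illustration, and it uses the standard library's csv module rather than pandas so it runs anywhere. It skips the yearly total rows and works out what share of births falls in each half of the year.

```python
import csv
import io

# Toy stand-in for the real file. Column names and all values are invented
# for illustration; they are NOT the UNData numbers.
TOY_CSV = """Year,Month,Value
2000,January,100
2000,June,110
2000,July,120
2000,December,105
2000,Total,435
"""

FIRST_HALF = {"January", "February", "March", "April", "May", "June"}
SECOND_HALF = {"July", "August", "September", "October", "November", "December"}


def half_year_shares(csv_file):
    """Return (first_half_share, second_half_share) of births.

    Rows whose Month is not a month name (e.g. yearly totals) are skipped.
    """
    first = second = 0
    for row in csv.DictReader(csv_file):
        if row["Month"] in FIRST_HALF:
            first += int(row["Value"])
        elif row["Month"] in SECOND_HALF:
            second += int(row["Value"])
    total = first + second
    return first / total, second / total


first_share, second_share = half_year_shares(io.StringIO(TOY_CSV))
print(f"first half: {first_share:.3f}, second half: {second_share:.3f}")
```

Skipping the total rows matters because otherwise every birth gets counted twice, once in its month and once in the yearly total.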
The test statistic for the chi-square goodness of fit test is Pearson’s chi-square. I know how to do this in python but I honestly miss doing this sort of thing long hand.
SO!
df = n - 1 = 2 - 1 = 1
α = 0.05 (This is the measure of how sure we wanna be and this is the value I'm most used to)
The critical value for df = 1 at α = 0.05 is 3.841
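That 3.841 can be double-checked without a lookup table. With 1 degree of freedom, a chi-square variable is a squared standard normal, so the cutoff is the square of the two-sided z value for the same significance level. A small sketch using only Python's standard library (the function name is made up here):

```python
from statistics import NormalDist


def chi2_critical_df1(alpha):
    """Critical value of the chi-square distribution with 1 degree of freedom.

    Uses the fact that chi-square(1) is the square of a standard normal,
    so the cutoff is the two-sided z value squared.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


critical = chi2_critical_df1(0.05)
print(round(critical, 3))  # 3.841
```

This shortcut only works for 1 degree of freedom. For more categories you would need a chi-square table or a stats library.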
Onto the poll results:
Final result from 40,427 votes (sorry to people who reblogged after the poll closed to give thoughts in tags. I read them but am not incorporating your responses into the dataset).
A) I give the younger (age before your birthday in a given calendar year) 29.8%
B) I give the older (age after your birthday in a given calendar year) 70.2%
Reject H0: If χ² stat > critical value (or if p-value < α).
Fail to Reject: If χ² stat ≤ critical value
The number of people who answered that they give the younger age is 12,047.2 (our first observed value), and the number who answered that they give the older age is 28,379.8 (our second observed value). These aren't whole numbers because they're worked back from the poll percentages (29.8% and 70.2% of 40,427).
Under the null, the younger answer should match the share of people born in the second half of the year (51.4%), and the older answer should match the share born in the first half of the year (48.6%). So the people who answered that they give the younger age should number in at 20,779.5 (first expected value), and the people who answered that they give the older age should number in at 19,647.5 (second expected value).
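The observed and expected counts are just percentages times the vote total, which is quick to reproduce. A minimal sketch (the variable names are made up here; the percentages and the vote total come from the poll and the birth data as quoted in this post):

```python
TOTAL_VOTES = 40_427

# Share of poll answers in each category (from the poll results).
poll_share = {"younger": 0.298, "older": 0.702}

# Share expected if birth half decides the answer:
# born in the second half -> "younger", born in the first half -> "older".
birth_share = {"younger": 0.514, "older": 0.486}

observed = {k: share * TOTAL_VOTES for k, share in poll_share.items()}
expected = {k: share * TOTAL_VOTES for k, share in birth_share.items()}

for k in ("younger", "older"):
    print(f"{k}: observed {observed[k]:.1f}, expected {expected[k]:.1f}")
```

Both sets of counts add up to the same 40,427 votes, which is what the goodness of fit test needs.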
Tumblr doesn't do formula notation so if you want to see the Chi-square goodness of fit formula please refer to ecosia. or google ig. It's the sum of observed minus expected squared over the expected.
Younger category: (12047.2 - 20779.5)^2/20779.5 = (-8732.3)^2/20779.5 = 3669.629
Older category: (28379.8 - 19647.5)^2/19647.5 = (8732.3)^2/19647.5 = 3881.057
Total χ² statistic is: 3669.629 + 3881.057 = 7550.686
Our χ² statistic: 7550.686
Critical Value: 3.841
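For anyone who would rather not do it longhand, the same sum of (observed minus expected) squared over expected fits in a few lines of Python. A minimal sketch using the rounded counts from this post (the function name is made up here):

```python
def chi_square_gof(observed, expected):
    """Pearson's chi-square goodness of fit statistic.

    Returns the total and the per-category terms (O - E)^2 / E.
    """
    terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return sum(terms), terms


# Order: [younger, older], using the rounded counts from the post.
observed = [12047.2, 28379.8]
expected = [20779.5, 19647.5]
CRITICAL_VALUE = 3.841  # df = 1, alpha = 0.05

stat, terms = chi_square_gof(observed, expected)
reject_null = stat > CRITICAL_VALUE

print([round(t, 3) for t in terms])
print(round(stat, 3), "reject H0" if reject_null else "fail to reject H0")
```

The two per-category terms should line up with the younger and older lines worked out by hand above.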
Because 7550.686 is larger than 3.841 (and astronomically so!!), we reject the null hypothesis (That people self-report the age they were in a year based on the age they spend ≥50% of the calendar year being), with a p-value that is effectively zero (p < 0.000001).
To put this in perspective: a critical value of 3.841 is the cutoff that a chi-square statistic with 1 degree of freedom would only exceed 5% of the time by pure coincidence if the null hypothesis were true. A score of over 7,500 means the odds of this poll split being produced by birth months alone are virtually non-existent.
This analysis assumes that the Tumblr user base reflects the general population's birth distribution. Unless something about Tumblr is drawing in a statistically improbable proportion of people born in the first half of the year (the people who, under the null, would give the older age), the deviation from the expected values is far too large to be explained by birth month alone.
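The "effectively zero" p-value can be sanity-checked too. With 1 degree of freedom, the chance of a chi-square statistic exceeding x equals erfc(sqrt(x / 2)), where erfc is the complementary error function in Python's math module. A small sketch (the function name is made up here):

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


p_at_cutoff = chi2_sf_df1(3.841)   # tail probability at the critical value
p_poll = chi2_sf_df1(7550.686)     # tail probability at the poll's statistic

print(f"p at critical value: {p_at_cutoff:.3f}")
print(f"p for the poll: {p_poll}")
```

At the critical value the tail probability comes out at about 0.05, as it should. For the poll's statistic the true value is far too small for a floating point number to hold, so Python reports it as 0.0.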
Anyways so this is why "it's obvious" and "it's definitely what I immediately assumed" aren't sources.













