My dad told me he ordered a book about Bayesian statistics, and I accidentally blurted out without thinking "wait, Count Basie wrote a book about statistics?! The jazz guy?!!!" and now he's not letting me live this shit down. He keeps mentioning his "Count Basie statistics book" offhand in like every convo I have with him just to mess with me.
I don't really think that's THAT ridiculous of a mistake to make... Jazz music is in fact very math-y. How was I supposed to know Count Basie wasn't a statistics expert on the side (and honestly who can for certain say he wasn't?)??? Jazz is like some of THE mathiest music. Bunch of nerds (affectionate).
There are some ideas we know are true, and others we know are false. However for the majority of ideas we do not know if they are true or fa
A Student's Guide to Bayesian Statistics
This is the course I'm "taking" on Bayesian Statistics to prepare for doing research; I haven't finished it, but it's rather easy to understand if you understand basic calculus (double integrals, basic 3D graphing, etc.) but have never taken a statistics course (unless you count a 6th grade 2-4 week unit on statistics).
i'm currently teaching myself a crash course on bayesian statistics to help me with phylogenetic tree modeling... i just cleaned my mug after a nice cup of tea. i like the look of a pristine mug much better than a tea stained mug...
Practical Bayesianism: The Sunrise Problem and Bernoulli Distributions
Hey guys. I'm going to try something a little different and I would really appreciate feedback. This series is going to attempt to cover a few simple formulas that Bayesian epistemologists can use to estimate probabilities in real life. This first post is going to start with the case of estimating Bernoulli random variables.
Disclaimer: I'm not a statistician and may not be able to answer all questions around this topic. I am a math major though and I will do my best. Also, while I write some slightly non-rigorous computations involving probability density functions, it's easy to make them rigorous with cumulative distribution functions.
Note of advice: This post uses LaTeX! In order to view it properly, you should click over to my blog
Prerequisites: This post assumes knowledge of Bayes' Theorem and integration.
TL;DR: If you notice an event with a fixed but unknown probability happen $k$ times over $n$ independent trials, and you start from a uniform prior on that probability, the probability that it will happen in the next trial is $\frac{k+1}{n+2}$.
The Problem:
Suppose a child wakes up and notices that every day the sun rises. Over the $N$ days he's been alive, the sun has risen $N$ times. A natural question for the child is "What is the probability that the sun rises tomorrow?"
The naïve answer, I'd expect, is $1$ - the sun will almost surely rise. Those who have experience with probability theory may reject this simple answer for one reason - $1$ and $0$ are not truly valid probabilities in a sense. See this LessWrong post for more details as to why.
A less naïve answer is that we can calculate it using our certainties in the laws of physics, probability distributions over the sun's place in the sky, etc. However, in an everyday sense, this is somewhat impractical. While the laws of physics may be well known and allow us to bound our probabilities, they are nevertheless complex computationally. In the long run, we'd like to generalize insights from this problem to other problems - perhaps regarding human psychology, a famously tricky subject. In these other problems, the patterns may not be so well known and what is at least computationally possible, if not feasible, becomes impossible with our current level of technology.
The Model:
Thus, let us try to calculate the probability that the sun will rise purely from the measurements we've taken - the sun has risen $N$ times on $N$ days. We will model the sun rising as a Bernoulli random variable: let the sun rise with probability $p$, so that the probability that the sun does not rise is $1-p$.
Formally, the Bernoulli distribution is defined on $\{ 0, 1\}$. $0$ has probability $p$ and $1$ has probability $1-p$. Note that $p$ characterizes the entire Bernoulli distribution - this is nearly the simplest probability distribution we can work with. Let us also consider the problem in slightly greater generality. Suppose $n$ samples are independently drawn from a Bernoulli distribution and $k$ of them are $0$ (which is defined to have probability $p$). What is the probability that the next sample is also $0$?
Let's start with what we know - the probability that exactly $k$ out of $n$ independent samples from a Bernoulli distribution with probability $p$ are $0$ is given by:
$$P(k \text{ samples out of } n = 0| p) = {n \choose k} p^k (1-p)^{n-k}$$
This can be seen as follows: every fixed sequence of $0$s and $1$s containing exactly $k$ zeros has the same likelihood of $p^k (1-p)^{n-k}$, and there are ${n \choose k}$ such sequences. The reader is invited to check the details. In order to invert this and find a posterior distribution for $p$ given the information we have, we can use Bayes' theorem. In order to use it, we must first choose a prior on $p$, the parameter we defined as part of the Bernoulli distribution.
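For readers who would rather check the details by machine than by hand, here is a minimal Python sketch. It compares the closed-form expression with a brute-force sum over every sequence of $0$s and $1$s, using this post's convention that $0$ has probability $p$. The function names are my own and purely illustrative.

```python
from itertools import product
from math import comb


def binomial_likelihood(k: int, n: int, p: float) -> float:
    """Closed form: probability that exactly k of n independent samples are 0."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def brute_force_likelihood(k: int, n: int, p: float) -> float:
    """Sum the probability of every length-n sequence containing exactly k zeros."""
    total = 0.0
    for sequence in product((0, 1), repeat=n):
        if sequence.count(0) == k:
            # Each such sequence has the same probability p^k (1-p)^(n-k).
            total += p**k * (1 - p) ** (n - k)
    return total


if __name__ == "__main__":
    n, p = 6, 0.3
    for k in range(n + 1):
        print(k, binomial_likelihood(k, n, p), brute_force_likelihood(k, n, p))
```

The two columns should agree up to floating-point rounding, and the closed-form values should sum to $1$ over all $k$.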
Here the post becomes a little complicated intuitively - $p$ should be thought of as a parameter that characterizes the Bernoulli distribution. We can't get at $p$ directly - we can only estimate it via the trials we've measured. So our prior is actually a probability distribution for $p$ which will eventually give us our Bernoulli distribution, which we will use to get our probability. This is a bit of a roundabout way to go and I've summarized it in the diagram below:
We'll talk more about prior selection in a later post. For now, because we know nothing about $p$ except that it is a probability in $(0,1)$, we pick a prior distribution for $p$ that assumes the least information - the uniform distribution on $(0,1)$ denoted $U(0,1)$.
The Math:
Consider now Bayes' theorem:
$$P(p | k \text{ samples out of } n = 0) = \left(\frac{ P(k \text{ samples out of } n = 0| p)}{ P(k \text{ samples out of } n = 0)}\right) P(p)$$
By $P(p)$ in this case, we mean the probability that the probability parameter which characterizes the BERNOULLI DISTRIBUTION takes the value $p$. To simplify notation, let the prior probability density function of $p$ be $\sigma(p)$ and the posterior probability density function of $p$ be $\sigma'(p)$. Using differentials, $P(p)$ is $\sigma(p) dp$. Likewise, $P(p | k \text{ samples out of } n = 0) = \sigma'(p) dp$.
So:
$$\sigma'(p) dp = \left(\frac{ P(k \text{ samples out of } n = 0| p)}{ P(k \text{ samples out of } n = 0)}\right) \sigma(p) dp$$
We have $P(k \text{ samples out of } n = 0| p)$ from above so we just need to find $ P(k \text{ samples out of } n = 0)$. This is going to depend on the prior distribution and will come out to:
$$ P(k \text{ samples out of } n = 0) = \int_{0}^{1} P(k \text{ samples out of } n = 0| p) \sigma(p) dp$$
$$ P(k \text{ samples out of } n = 0) = \int_{0}^{1} {n \choose k} p^k (1-p)^{n-k} \sigma(p) dp$$
For the uniform prior we have $\sigma(p) = 1$ and so:
$$ P(k \text{ samples out of } n = 0) = \int_{0}^{1} {n \choose k} p^k (1-p)^{n-k} dp$$
This is a tough integral to evaluate. There are a number of ways to go about it, but here is a probabilistic way. The integral corresponds to the probability that if we pick $n+1$ independent random variables $X_0, X_1, \ldots, X_n$ where each $X_i \sim U(0,1)$ (each variable is uniformly distributed), $X_0$ will be the $(k+1)$th smallest of them.
You can see this by considering how you might write a computer simulation of the situation described - first you pick your $p$ uniformly (this is $X_0$), then you generate $n$ numbers uniformly from $(0,1)$, mark $0$ if the number is less than $p$, mark $1$ if it's greater. Getting exactly $k$ zeros means exactly $k$ of the $n$ numbers fall below $p$, which is the same as $p$ being the $(k+1)$th smallest of all $n+1$ numbers. By symmetry every rank is equally likely, so the probability is $\frac{1}{n+1}$.
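The simulation described above is short enough to actually write down. This is a rough Python sketch of it (the function name, trial count, and seed are my own choices, nothing canonical): draw $p$ uniformly, draw $n$ more uniform numbers, count how many land below $p$, and record how often that count equals $k$.

```python
import random


def simulate_probability_of_k_zeros(n: int, k: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate P(exactly k of n samples are 0) when p itself is drawn from U(0,1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        p = rng.random()  # the unknown Bernoulli parameter, drawn from the uniform prior
        # A sample is marked 0 when its uniform draw falls below p.
        zeros = sum(1 for _ in range(n) if rng.random() < p)
        if zeros == k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n = 5
    for k in range(n + 1):
        estimate = simulate_probability_of_k_zeros(n, k)
        print(f"k={k}: simulated {estimate:.4f}, symmetry argument {1 / (n + 1):.4f}")
```

Every value of $k$ should come out near $\frac{1}{n+1}$, up to Monte Carlo noise.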
So in summary:
$$\sigma'(p) dp = \left(\frac{{n \choose k} p^k (1-p)^{n-k}}{\frac{1}{n+1}}\right) \sigma(p) dp$$
Cancelling out the differentials and rearranging we have:
$$\sigma'(p) = (k+1) {n+1 \choose k+1} p^k (1-p)^{n-k}$$
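As a sanity check on the posterior (which happens to be the density of a Beta distribution with parameters $k+1$ and $n-k+1$), here is a small Python sketch that integrates it numerically and confirms the area is $1$. The midpoint rule and the step count are just convenient choices on my part, and the values of $k$ and $n$ are arbitrary.

```python
from math import comb


def posterior_density(p: float, k: int, n: int) -> float:
    """Posterior density of p after seeing k zeros in n samples, uniform prior."""
    return (k + 1) * comb(n + 1, k + 1) * p**k * (1 - p) ** (n - k)


def integrate_unit_interval(f, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


k, n = 4, 5
area = integrate_unit_interval(lambda p: posterior_density(p, k, n))
print(f"area under the posterior: {area:.6f}")

# The normalizing constant can be written either way:
# (k+1) * C(n+1, k+1) equals (n+1) * C(n, k).
assert (k + 1) * comb(n + 1, k + 1) == (n + 1) * comb(n, k)
```

The second form of the constant, $(n+1){n \choose k}$, is what you get directly from dividing by $\frac{1}{n+1}$.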
But wait! This gives us our posterior distribution $\sigma'(p)$ for $p$. But what we actually wanted was the probability that the next number picked would be $0$. We can compute this from $\sigma'(p)$ by averaging $p$ over the posterior:
$$P(\text{next number is }0) = \int_{0}^{1} p \sigma'(p) dp$$
$$ P(\text{next number is }0) = \int_{0}^{1} (k+1) {n+1 \choose k+1} p^{k+1} (1-p)^{n-k} dp$$
$$ P(\text{next number is }0) = (k+1) \left(\int_{0}^{1} {n+1 \choose k+1} p^{k+1} (1-p)^{n-k} dp \right)$$
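Wait a second... we just computed this integral! It is the same integral as before with $n+1$ in place of $n$ and $k+1$ in place of $k$, so it equals $\frac{1}{(n+1)+1}$. So we have: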
$$P(\text{next number is }0) = (k+1)\left(\frac{1}{(n+1)+1}\right) = \frac{k+1}{n+2}$$
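If you don't trust the symmetry trick, the result can also be checked with exact rational arithmetic. The sketch below assumes the standard identity $\int_0^1 p^a (1-p)^b dp = \frac{a! \, b!}{(a+b+1)!}$ for non-negative integers $a$ and $b$ (a fact about the Beta function that this post did not derive), and the function names are mine.

```python
from fractions import Fraction
from math import comb, factorial


def beta_integral(a: int, b: int) -> Fraction:
    """Exact integral of p^a (1-p)^b over [0, 1] for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def next_zero_probability(k: int, n: int) -> Fraction:
    """Integral of p * posterior(p): the probability that the next sample is 0."""
    return (k + 1) * comb(n + 1, k + 1) * beta_integral(k + 1, n - k)


# Check the closed form (k+1)/(n+2) for every k <= n up to n = 20.
for n in range(0, 21):
    for k in range(0, n + 1):
        assert next_zero_probability(k, n) == Fraction(k + 1, n + 2)

print(next_zero_probability(4, 5))
```

Because everything is a `Fraction`, the comparison is exact rather than approximate.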
The Result:
So if we want to calculate the probability the sun will rise tomorrow, we set $k = n = N$ and get the neat probability:
$$P(\text{the sun will rise tomorrow}) = \frac{N+1}{N+2}$$
More generally we have $(k+1)/(n+2)$ - a neat formula that is easy to memorize. Keep this in mind whenever you come across a situation in real life where you want to estimate probabilities. For example:
My friend tends to lie when his super hot girlfriend from Canada is involved. Four out of the five times I've asked him about her, he deflected. What's the probability he'll deflect the next time I ask?
When I open up a conversation on OkCupid with a question, I've been ignored 17 out of 30 times and been given a date 4 out of those 30 times. What's the probability that the next time I lead with a question I'll get a date?
If I ask Bob to communicate a message to Alice without Eve finding out, he's done it correctly 9 out of the 15 times I've asked. If he can, the net utility change will be 6 utils according to my utility aggregation function. If he can't, or Eve finds out and I tell Bob my message, the net utility change will be -10 utils. What is the net expected utility of asking Bob to undergo this mission and is it worth it to benefit the world?
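Here is one way the three examples above could be worked through in Python. This is only a sketch under some simplifying assumptions of mine: each question is treated as a plain success/failure count (so for OkCupid only "date" versus "no date" matters, and the 17 ignored messages are just part of the "no date" group), and Bob's mission is treated as having exactly two outcomes, worth 6 and -10 utils.

```python
from fractions import Fraction


def rule_of_succession(k: int, n: int) -> Fraction:
    """Probability of success on the next trial after k successes in n trials (uniform prior)."""
    return Fraction(k + 1, n + 2)


deflect = rule_of_succession(4, 5)   # friend deflected 4 of 5 times
date = rule_of_succession(4, 30)     # 4 dates out of 30 opening questions
bob = rule_of_succession(9, 15)      # Bob succeeded 9 of 15 times

# Expected utility, assuming only two outcomes: success (+6) or failure (-10).
expected_utility = bob * 6 + (1 - bob) * (-10)

print(f"deflects next time: {deflect} (about {float(deflect):.3f})")
print(f"gets a date:        {date} (about {float(date):.3f})")
print(f"Bob succeeds:       {bob} (about {float(bob):.3f})")
print(f"expected utility:   {expected_utility} utils")
```

Under those assumptions the expected utility of sending Bob comes out slightly negative, so the mission would not be worth it unless you have other information about Bob that the counts don't capture.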
Credits: Laplace for coming up with the original problem. Bayes for making a kickass theorem. Alison and evolution-is-just-a-theorem for encouragement.
Tags since this is a sideblog: @sinesalvatorem, @evolution-is-just-a-theorem, @proofsaretalk
We're excited to release the latest research from our machine intelligence R&D team!
This report and prototype explore probabilistic programming, an emerging programming paradigm that makes it easier to construct and fit Bayesian inference models in code. It's advanced statistics, simplified for data scientists looking to build models fast.
Bayesian inference has been popular in scientific research for a long time. The statistical technique allows us to encode expert knowledge into a model by stating prior beliefs about what we think our data looks like. These prior beliefs are then updated in light of new data, providing not one prediction, but a full distribution of likely answers with baked-in confidence rates. This allows us to assess the risk of our decisions with more nuance.
Bayesian methods lack widespread commercial use because they're tough to implement. But probabilistic programming reduces what used to take months of thorny statistical sampling into an afternoon of work.
This will further expand the utility of machine learning. Bayesian models aren't black boxes, a criterion for regulated industries like healthcare. Unlike deep learning networks, they don't require large, clean data sets or large amounts of GPU processing power to deliver results. And they bridge human knowledge with data, which may lead to breakthroughs in areas as diverse as anomaly detection and music analysis.
Our work on probabilistic programming includes two prototypes and a report that teaches you:
How Bayesian inference works and where it's useful
Why probabilistic programming is becoming possible now
When to use probabilistic programming and what the code looks like
What tools and languages exist today and how they compare
Which vendors offer probabilistic programming products
Finally, as in all our research, we predict where this technology is going, and applications for which it will be useful in the next couple of years.
Probabilistic Real Estate Prototype
One powerful feature of probabilistic programming is the ability to build hierarchical models, which allow us to group observations together and learn from their similarities. This is practical in contexts like user segmentation: individual users often share tastes with other users of the same sex, age group, or location, and hierarchical models provide more accurate predictions about individuals by leveraging learnings from the group.
We explored using probabilistic programming for hierarchical models in our Probabilistic Real Estate prototype. This prototype predicts future real estate prices across the New York City boroughs. It enables you to input your budget (say $1.6 million) and shows you the probability of finding properties in that price range across different neighborhoods and future time periods.
Hierarchical models helped make predictions in neighborhoods with sparse pricing data. In our model, we declared that apartments are in neighborhoods and neighborhoods are in boroughs; on average, apartments in one neighborhood are more similar to others in the same location than elsewhere. By modeling this way, we could learn about the West Village not only from the West Village, but also from the East Village and Brooklyn. That means, with little data about the West Village, we could use data from the East Village to fill in the gaps!
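To give a rough feel for how borrowing from the group works, here is a deliberately simplified Python sketch of partial pooling. It is not the model behind the prototype: a real hierarchical model would be fit with a probabilistic programming tool, whereas this just blends a neighborhood's own average with the wider group's average. The function name, the `prior_strength` parameter, and all prices are hypothetical and made up for illustration.

```python
from statistics import mean


def partially_pooled_estimate(local_prices, group_mean, prior_strength=5.0):
    """Shrink a neighborhood's average price toward the group mean.

    prior_strength is a hypothetical tuning knob: roughly, how many local
    observations it takes before the neighborhood's own data gets equal weight.
    """
    n = len(local_prices)
    if n == 0:
        return group_mean  # no local data: fall back entirely on the group
    weight = n / (n + prior_strength)
    return weight * mean(local_prices) + (1 - weight) * group_mean


# Made-up prices in millions of dollars, for illustration only.
east_village = [1.2, 1.4, 1.1, 1.5, 1.3, 1.6]
west_village = [2.4]  # sparse data

group_mean = mean(east_village + west_village)
print(partially_pooled_estimate(west_village, group_mean))
print(partially_pooled_estimate(east_village, group_mean))
```

With a single observation, the West Village estimate is pulled strongly toward the group, while the better-sampled East Village stays closer to its own average.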
Many companies suffer from imperfect, incomplete data. These types of inferences can be invaluable to improve predictions based on real-world dependencies.
Play around with the prototype! You'll see how the color gradients give you an intuitive sense for what probability distributions look like in practice.
How to Access our Reports & Prototypes
We're offering our research on probabilistic programming in a few ways:
Single Report & Prototype (digital and physical copies)
Annual Research Subscription (access to all our research)
Subscription & Advising (research & time with our team)
Special Projects (dedicated help to build a great data product)