We're excited to release the latest research from our machine intelligence R&D team!
This report and prototype explore probabilistic programming, an emerging programming paradigm that makes it easier to construct and fit Bayesian inference models in code. It's advanced statistics, simplified for data scientists looking to build models fast.
Bayesian inference has been popular in scientific research for a long time. The statistical technique allows us to encode expert knowledge into a model by stating prior beliefs about what we think our data looks like. These prior beliefs are then updated in light of new data, providing not one prediction, but a full distribution of likely answers with baked-in confidence rates. This allows us to assess the risk of our decisions with more nuance.
Bayesian methods lack widespread commercial use because they're tough to implement. But probabilistic programming reduces what used to take months of thorny statistical sampling into an afternoon of work.
This will further expand the utility of machine learning. Bayesian models aren't black boxes, a criterion for regulated industries like healthcare. Unlike deep learning networks, they don't require large, clean data sets or large amounts of GPU processing power to deliver results. And they bridge human knowledge with data, which may lead to breakthroughs in areas as diverse as anomaly detection and music analysis.
Our work on probabilistic programming includes two prototypes and a report that teaches you:
How Bayesian inference works and where it's useful
Why probabilistic programming is becoming possible now
When to use probabilistic programming and what the code looks like
What tools and languages exist today and how they compare
Which vendors offer probabilistic programming products
Finally, as in all our research, we predict where this technology is going, and applications for which it will be useful in the next couple of years.
Probabilistic Real Estate Prototype
One powerful feature of probabilistic programming is the ability to build hierarchical models, which allow us to group observations together and learn from their similarities. This is practical in contexts like user segmentation: individual users often share tastes with other users of the same sex, age group, or location, and hierarchical models provide more accurate predictions about individuals by leveraging learnings from the group.
We explored using probabilistic programming for hierarchical models in our Probabilistic Real Estate prototype. This prototype predicts future real estate prices across the New York City boroughs. It enables you to input your budget (say $1.6 million) and shows you the probability of finding properties in that price range across different neighborhoods and future time periods.
Hierarchical models helped make predictions in neighborhoods with sparse pricing data. In our model, we declared that apartments are in neighborhoods and neighborhoods are in boroughs; on average, apartments in one neighborhood are more similar to others in the same location than elsewhere. By modeling this way, we could learn about the West Village not only from the West Village, but also from the East Village and Brooklyn. That means, with little data about the West Village, we could use data from the East Village to fill in the gaps!
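To make the "borrowing" idea concrete, here is a minimal sketch of partial pooling in plain Python. It is not the prototype's model: it assumes known within-neighborhood and between-neighborhood variances, uses the count-weighted average as the shared mean, and all numbers (and the function name `partial_pool`) are made up for illustration.

```python
def partial_pool(group_means, group_counts, within_var, between_var):
    """Shrink each group's sample mean toward the overall mean.

    Uses the normal-normal posterior mean with known variances:
    weight on the group's own data = between_var / (between_var + within_var / n).
    """
    total = sum(group_counts.values())
    # Simplifying assumption: the shared mean is the count-weighted average.
    grand_mean = sum(group_means[g] * group_counts[g] for g in group_means) / total
    pooled = {}
    for g, mean in group_means.items():
        n = group_counts[g]
        weight = between_var / (between_var + within_var / n)
        pooled[g] = weight * mean + (1 - weight) * grand_mean
    return pooled


# Hypothetical average prices per square foot and number of observed sales.
means = {"West Village": 2000.0, "East Village": 1500.0}
counts = {"West Village": 2, "East Village": 200}
pooled = partial_pool(means, counts, within_var=250000.0, between_var=40000.0)
print(pooled)
```

With only two sales, the West Village estimate is pulled strongly toward the shared mean, while the well-observed East Village estimate barely moves. A full hierarchical model would also learn the variances and the shared mean from the data rather than fixing them.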
Many companies suffer from imperfect, incomplete data. These types of inferences can be invaluable to improve predictions based on real-world dependencies.
Play around with the prototype! You'll see how the color gradients give you an intuitive sense for what probability distributions look like in practice.
How to Access our Reports & Prototypes
We're offering our research on probabilistic programming in a few ways:
Single Report & Prototype (digital and physical copies)
Annual Research Subscription (access to all our research)
Subscription & Advising (research & time with our team)
Special Projects (dedicated help to build a great data product)
Thomas Wiecki on Probabilistic Programming with PyMC3
A rolling regression with PyMC3: instead of the regression coefficients being constant over time (the points are daily stock prices of 2 stocks), this model assumes they follow a random-walk and can thus slowly adapt them over time to fit the data best.
Probabilistic programming is coming of age. While normal programming languages denote procedures, probabilistic programming languages denote models and perform inference on these models. Users write code to specify a model for their data, and the languages run sampling algorithms across probability distributions to output answers with confidence rates and levels of uncertainty across a full distribution. These languages, in turn, open up a whole range of analytical possibilities that have historically been too hard to implement in commercial products.
One sector where probabilistic programming will likely have significant impact is financial services. Be it when predicting future market behavior or loan defaults, when analyzing individual credit patterns or anomalies that might indicate fraud, financial services organizations live and breathe risk. In that world, a tool that makes it easy and fast to predict future scenarios while quantifying uncertainty could have tremendous impact. That’s why Thomas Wiecki, Director of Data Science for the crowdsourced investment management firm Quantopian, is so excited about probabilistic programming and the new release of PyMC3 3.0.
We interviewed Dr. Wiecki to get his thoughts on why probabilistic programming is taking off now and why he thinks it’s important. Check out his blog, and keep reading for highlights!
A key benefit of probabilistic programming is that it makes it easier to construct and fit Bayesian inference models. You have a history working with Bayesian methods in your doctoral work on cognition and psychiatry. How did you use them?
One of the main problems in psychiatry today is that disorders like depression or schizophrenia are diagnosed based purely on subjective reporting of symptoms, not biological traits you can measure. By way of comparison, imagine if a cardiologist were to prescribe heart medication based on answers you gave in a questionnaire! Even the categories used to diagnose depression aren’t that valid, as two patients may have completely different symptoms, caused by different underlying biological mechanisms, but both fall under the broad category “depressed.” My thesis tried to change that by identifying differences in cognitive function -- rather than reported symptoms -- to diagnose psychiatric diseases. Towards that goal, we used computational models of the brain, estimated in a Bayesian framework, to try to measure cognitive function. Once we had accurate measures of cognitive function, we used machine learning to train classifiers to predict whether individuals were suffering from certain psychiatric or neurological disorders. The ultimate goal was to replace disease categories based on subjective descriptions of symptoms with objectively measurable cognitive function. This new field of research is generally known as computational psychiatry, and is starting to take root in industries like pharmaceuticals to test the efficacy of new drugs.
What exactly was Bayesian about your approach?
We mainly used it to get accurate fits of our models to behavior. Bayesian methods are especially powerful when there is hierarchical structure in data. In computational psychiatry, individual subjects either belong to a healthy group or a group with psychiatric disease. In terms of cognitive function, individuals are likely to share similarities with other members of their group. Including these groupings into a hierarchical model gave more powerful and informed estimates about individual subjects so we could make better and more confident predictions with less data.
Bayesian inference provides a robust means to test hypotheses by estimating how different two groups are from one another.
How did you go from computational psychiatry to data science at Quantopian?
I started working part-time at Quantopian during my PhD and just loved the process of building an actual product and solving really difficult applied problems. After I finished my PhD, it was an easy decision to come on full-time and lead the data science efforts there. Quantopian is a community of over 100,000 scientists, developers, students, and finance professionals interested in algorithmic trading. We provide all the tools and data necessary to build state-of-the-art trading algorithms. As a company, we try to identify the most promising algorithms and work with the authors to license them for our upcoming fund, which will launch later this year. The authors retain the IP of their strategy and get a share of the net profits.
What’s one challenging data science problem you face at Quantopian?
Identifying the best strategies is a really interesting data science problem because people often overfit their strategies to historical data. A lot of strategies thus often look great historically but falter when actually used to trade with real money. As such, we let strategies bake in the oven a bit and accumulate out-of-sample data that the author of the strategy did not have access to, simply because it hadn’t happened yet when the strategy was conceived. We want to wait long enough to gain confidence, but not so long that strategies lose their edge. Probabilistic programming allows us to track uncertainty over time, informing us when we’ve waited long enough to have confidence that the strategy is actually viable and what level of risk we take on when investing in it.
It’s tricky to understand probabilistic programming when you first encounter it. How would you define it?
Probabilistic programming allows you to flexibly construct and fit Bayesian models in computer code. These models are generative: they relate unobservable causes to observable data, to simulate how we believe data is created in the real world. This is actually a very intuitive way to express how you think about a dataset and formulate specific questions. We start by specifying a model, something like “this data fits into a normal distribution”. Then, we run flexible estimation algorithms, like Markov Chain Monte Carlo (MCMC), to sample from the “posterior”, the distribution updated in light of our real-world data, which quantifies our belief about the most likely causes underlying the data. The key with probabilistic programming is that model construction and inference are almost completely independent. It used to be that those two were inherently tied together, so you had to do a lot of math in order to fit a given model. Probabilistic programming can estimate almost any model you dream up, which provides the data scientist with a lot of flexibility to iterate quickly on new models that might describe the data even better. Finally, because we operate in a Bayesian framework, the models rest on a very well thought out statistical foundation that handles uncertainty in a principled way.
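The "specify a model, then sample from the posterior" workflow can be sketched without any library. The following is a hand-rolled random-walk Metropolis sampler, one of the simplest MCMC algorithms. It is an illustration only, not how PyMC3 works internally. The model (a normal likelihood with known noise and a wide normal prior on the mean), the data, and the function names are all assumptions made for the example.

```python
import math
import random


def log_posterior(mu, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalized log posterior: Normal(0, prior_sd) prior, Normal(mu, noise_sd) likelihood."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * ((x - mu) / noise_sd) ** 2 for x in data)
    return log_prior + log_lik


def metropolis(data, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept it with
    probability min(1, posterior ratio), otherwise stay put."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 4.7, 5.2]  # made-up observations
samples = metropolis(data)[1000:]                # drop burn-in draws
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))
```

Note that `log_posterior` is the only model-specific part. Swapping in a different model leaves `metropolis` untouched, which is the separation of model construction from inference described above.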
Much of the math behind Bayesian inference and statistical sampling techniques like MCMC is not new, but probabilistic tooling is. Why is this taking off now?
There are mainly three reasons why probabilistic programming is more viable today than it was in the past. First is simply the increase in compute power, as these MCMC samplers are quite costly to run. Secondly, there have been theoretical advances in the sampling algorithms themselves, especially a new class called Hamiltonian Monte Carlo samplers. These are much more powerful and efficient in how they sample data, allowing us to fit highly complex models. Instead of sampling at random, Hamiltonian samplers use the gradient of the model to focus sampling on high probability areas. By contrast, older packages like BUGS could not compute gradients. Finally, the third required piece was software using automatic differentiation -- an automatic procedure to compute gradients on arbitrary models.
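To show what "using the gradient" means in practice, here is a minimal one-dimensional Hamiltonian Monte Carlo sketch. It makes several simplifying assumptions: the target is a toy normal distribution, the gradient is written by hand rather than obtained by automatic differentiation, and the step size and trajectory length are fixed (samplers such as NUTS tune these automatically). The function names are hypothetical.

```python
import math
import random


def log_prob(q):
    """Toy target: Normal(mean=3, sd=0.5), up to an additive constant."""
    return -0.5 * ((q - 3.0) / 0.5) ** 2


def grad_log_prob(q):
    """Hand-written gradient of log_prob (autodiff would supply this)."""
    return -(q - 3.0) / 0.25


def hmc(log_prob, grad_log_prob, start, n_samples=2000, step=0.1,
        n_leapfrog=20, seed=0):
    rng = random.Random(seed)
    q = start
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # draw a fresh momentum
        q_new, p_new = q, p
        # Leapfrog integration: the gradient steers the trajectory
        # toward regions of high probability.
        p_new += 0.5 * step * grad_log_prob(q_new)
        for i in range(n_leapfrog):
            q_new += step * p_new
            if i < n_leapfrog - 1:
                p_new += step * grad_log_prob(q_new)
        p_new += 0.5 * step * grad_log_prob(q_new)
        # Accept or reject based on the change in total energy.
        h_old = -log_prob(q) + 0.5 * p * p
        h_new = -log_prob(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


draws = hmc(log_prob, grad_log_prob, start=0.0)[200:]  # drop warm-up draws
mean = sum(draws) / len(draws)
print(round(mean, 2))
```

Each proposal follows a simulated trajectory instead of a random jump, so it can travel far across the distribution and still be accepted with high probability.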
What are the skills required to use probabilistic programming? Can any data scientist get started today or are there prerequisites?
Probabilistic programming is like statistics for hackers. It used to be that even basic statistical modeling required a lot of fancy math. We also used to have to sacrifice the ability to really map the complexity in data to make models that were tractable, but just too simple. For example, with probabilistic programming we don’t have to do something like assume our data is normally distributed just to make our model tractable. This assumption is everywhere because it’s mathematically convenient, but no real-world data looks like this! Probabilistic programming enables us to capture these complex distributions. The required skills are the ability to code in a language like Python and a basic knowledge of probability to be able to state your model. There are also a lot of great resources out there to get started, like Bayesian Analysis with Python, Bayesian Methods for Hackers, and of course the soon-to-be-released Fast Forward Labs report!
Congratulations on the new release of PyMC3! What differentiates PyMC3 from other probabilistic programming languages? What kinds of problems does it solve best? What are its limitations?
Thanks, we are really excited to finally release it, as PyMC3 has been under continuous development for the last 5 years! Stan and PyMC3 are among the current state-of-the-art probabilistic programming frameworks. The main difference is that Stan requires you to write models in a custom language, while PyMC3 models are pure Python code. This makes model specification, interaction, and deployment easier and more direct. In addition to advanced Hamiltonian Monte Carlo samplers, PyMC3 also features streaming variational inference, which allows for very fast model estimation on large data sets as we fit a distribution to the posterior, rather than trying to sample from it. In version 3.1, we plan to support more variational inference algorithms and GPUs, which will make things go even faster!
For which applications is probabilistic programming the right tool? For which is it the wrong tool?
If you only care about pure prediction accuracy, probabilistic programming is probably the wrong tool. However, if you want to gain insight into your data, probabilistic programming allows you to build causal models with high interpretability. This is especially relevant in the sciences and in regulated sectors like healthcare, where predictions have to be justified and can’t just come from a black-box. Another benefit is that because we are in a Bayesian framework, we get uncertainty in our parameters and in our predictions, which is important for areas where we make high-stakes decisions under very noisy conditions, like in finance. Also, if you have prior information about a domain you can very directly build this into the model. For example, let’s say you wanted to estimate the risk of diabetes from a dataset. There are many things we already know even without looking at the data, like that high blood sugar increases that risk dramatically -- we can build that into the model by using an informed prior, something that’s not possible with most machine learning algorithms.
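As a small worked example of an informed prior, consider estimating a single probability with a Beta prior and binomial data, where the Bayesian update has a closed form. The numbers are invented and are not taken from any diabetes dataset. The assumption is simply that domain knowledge says the event is likely (encoded as Beta(8, 2)), compared against a flat Beta(1, 1) prior.

```python
def beta_update(prior_a, prior_b, successes, failures):
    """Conjugate update: Beta prior + binomial data -> Beta posterior."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Made-up data: 3 of 4 observed patients developed the condition.
informed = beta_update(8, 2, successes=3, failures=1)  # (11, 3)
flat = beta_update(1, 1, successes=3, failures=1)      # (4, 2)

print(beta_mean(*informed), beta_mean(*flat))
```

With only four observations, the informed prior keeps the estimate close to what was already known, while the flat prior lets the small sample dominate. As more data arrives, the two posteriors converge.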
Finally, hierarchical models are very powerful, but often underappreciated. A lot of data sets have an inherent hierarchical structure. For example, take individual preferences of users on a fashion website. Each individual has unique tastes, but often shares tastes with similar users. For example, people are more likely to have similar taste if they have the same sex, or are in the same age group, or live in the same city, state, or country. Such a model can leverage what it has learned from other group members and apply it back to an individual, leading to much more accurate predictions, even in the case where we might only have few data points per individual (which can lead to cold start problems in collaborative filtering). These hierarchies exist everywhere but are all too rarely taken into account properly. Probabilistic programming is the perfect framework to construct and fit hierarchical models.
Interpretability is certainly an issue with deep neural nets, which also require far more data than Bayesian models to train. Do you think Bayesian methods will be important for the future of deep learning?
Yes, and it’s a very exciting area! As we’re able to specify and estimate deep nets or other machine learning methods in probabilistic programming, it could really become a lingua franca that removes the barrier between statistics and machine learning, giving a common tool to do both. One thing that’s great about PyMC3 is that the underlying library is Theano, which was originally developed for deep learning. Theano helps bridge these two areas, combining the power nets have to extract latent representations out of high-dimensional data with variational inference algorithms to estimate models in a Bayesian framework. Bayesian deep learning is hot right now, so much so that NIPS offered a day-long workshop. I’ve also written about the benefits in this post and this post, explaining how Bayesian methods provide more rigor around the uncertainty and estimations of deep net predictions and provide better simulations. Finally, Bayesian deep learning will also allow us to build exciting new architectures, like Hierarchical Bayesian Deep Networks that are useful for transfer learning. A bit like the work you did to get stronger results from Pictograph using the Wordnet hierarchy.
Bayesian deep nets provide greater insight into the uncertainty around predicted values at a given point. Read more here.
What books, papers, and people have had the greatest influence on you and your career?
I love Dan Simmons’ Hyperion Cantos series, which got me hooked on science fiction. Michael Frank (my PhD advisor) and EJ Wagenmakers first introduced me to Bayesian statistics. The Stan guys, who developed the NUTS sampler and black-box variational inference, have had a huge influence on PyMC3. They continue to push the boundaries of applied Bayesian statistics. I also really like the work coming out of the labs of David Blei and Max Welling. We hope that PyMC3 will also be an influential tool for the productivity and capabilities of data scientists across the world.
How do you think data and AI will change the financial services industry over the next few years? What should all hedge fund managers know?
I think it’s already had a big impact on finance! And as the mountains of data continue to grow, so will the advantage computers have over humans in their ability to combine and extract information out of that data. Data scientists, with their ability to pull that data together and build the predictive models, will be the center of attention. That is really at the core of what we’re doing at Quantopian. We believe that by giving people everywhere on earth a platform that’s state-of-the-art for free we can find that talent before anyone else can.
Why Should You Become Skilled in Probabilistic Programming Languages?
A probabilistic programming language (PPL) is a programming language designed to describe probabilistic models and to perform inference in those models. PPLs are closely related to graphical models and Bayesian networks, but are more expressive and flexible.
Probabilistic programming creates systems that help make decisions in the face of uncertainty. Probabilistic reasoning combines knowledge of a situation with the laws of probability. Probabilistic programming is a new approach that makes probabilistic reasoning systems easier to build and more widely applicable. One such program uses inverse graphics as the basis of its inference.
It is worth knowing that probabilistic reasoning is used for predicting stock prices, recommending movies, diagnosing computers, detecting cyber intrusions, and image detection. This style of programming is also considered necessary for AI/AGI.
For decades, scientists developed probabilistic models in various fields without the benefit of dedicated programming languages or deep neural networks. Because these models involve Bayesian inference, which often requires intractable integrals, they sap the productivity of experts and are beyond the reach of non-experts. In a probabilistic programming language, the compiler checks the program for type errors and translates it into a form suitable for an inference procedure, which uses observed output data to fit the latent distributions. Probabilistic models show great promise: they explicitly represent uncertainty and have been demonstrated to enable explainable machine learning even in the important but difficult case of small training data.
Probabilistic programming has been slowly gaining momentum over the past few years. Arguments have been made for "intuitive physics", "inverse graphics", and, more generally, for structured generative models. Traction in industry has grown as a result, with Uber releasing its own probabilistic programming library on top of PyTorch.
Students who want to learn probabilistic programming can attend an institute, but online learning has become the trend among both students and teachers. Teachers can schedule their time as they see fit and then give lectures, and students can learn from anywhere in the world. Online platforms are convenient and affordable, they save time and travel expenses, and you can revisit the videos and notes as often as you like.
Gain some Knowledge about the Effectiveness of Probabilistic Programming
Over the last few years, the machine learning and programming languages communities have developed a shared research interest under the banner of probabilistic programming. The idea is to export powerful PL concepts such as abstraction and reuse to statistical modeling, which is otherwise an arduous and arcane task.
What and Why
Probabilistic Programming
Probabilistic programming is not just about writing software that behaves probabilistically. For example, if your program calls rand(3) as part of the work it is meant to do (as in a cryptographic key generator, an ASLR implementation in an OS kernel, or a simulated annealing optimizer for circuit designs), that is all well and good, but it is not what this topic is about.
It is also best not to think of it as writing software at all. By way of analogy, traditional languages such as C++, Python, and Haskell are very different in philosophy, but you could use any of them to write, say, a cataloging system for cat pictures or an alternative to LaTeX. One may be better than another for a given domain, but they are all workable. That is not the case with probabilistic programming languages (PPLs). A PPL is more like Prolog: it is certainly a programming language, but not one you would reach for to write general-purpose software.
Why Choose Probabilistic Programming?
Probabilistic programming is, at heart, a tool for statistical modeling. The idea is to borrow lessons from programming languages and apply them to the problem of designing and using statistical models. Experts already construct statistical models by hand, in mathematical notation on paper, but it is an expert-only process that is hard to support with mechanical reasoning. The key insight in PP is that statistical modeling, when you do it often enough, starts to feel a lot like programming. Many new tools become feasible once we use a real language for modeling. We can start to automate tasks that previously had to be worked out by hand, on paper, for each instance.
A second definition: a probabilistic programming language is an ordinary programming language with rand and a big pile of related tools that help you understand the program's statistical behavior.
Both definitions are accurate. They emphasize different angles on the same core idea, and which one makes sense to you depends on what you want to use PP for. But don't get distracted by the fact that PPL programs look a lot like ordinary software implementations, where the job is to run the program and get some kind of output. The main focus of PP is analysis, not execution.
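The "ordinary language with rand, plus tools to understand statistical behavior" view can be illustrated with a toy sketch in plain Python. The program and helper below are hypothetical, and the analysis uses brute-force rejection sampling, which real PPLs replace with far more efficient inference.

```python
import random


def noisy_program(rng):
    """An ordinary program that happens to call a random number generator."""
    a = rng.random() < 0.5  # first coin flip
    b = rng.random() < 0.5  # second coin flip
    return a, b


def estimate(event, given, n=100_000, seed=0):
    """Estimate P(event | given) by running the program many times and
    keeping only the runs where the condition holds (rejection sampling)."""
    rng = random.Random(seed)
    hits = kept = 0
    for _ in range(n):
        outcome = noisy_program(rng)
        if given(outcome):
            kept += 1
            hits += event(outcome)
    if kept == 0:
        raise ValueError("condition never occurred; cannot estimate")
    return hits / kept


# Probability that both flips are heads, given that at least one is (exactly 1/3).
p = estimate(event=lambda o: o[0] and o[1], given=lambda o: o[0] or o[1])
print(round(p, 3))
```

The interesting output is not the result of any single run of `noisy_program` but a statement about its behavior across all runs.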
Modeling the Problem
The machine learning way of approaching this problem is to model the situation using random variables, some of which are latent. Latent random variables are the variables in the model that are not observed directly but help explain the situation. Allowing yourself latent variables makes it much easier to reason directly about the problem.
Under the Hood of the Variational Autoencoder (in Prose and Code)
The Variational Autoencoder (VAE) neatly synthesizes unsupervised deep learning and variational Bayesian methods into one sleek package. In Part I of this series, we introduced the theory and intuition behind the VAE, an exciting development in machine learning for combined generative modeling and inference—“machines that imagine and reason.”
To recap: VAEs put a probabilistic spin on the basic autoencoder paradigm—treating their inputs, hidden representations, and reconstructed outputs as probabilistic random variables within a directed graphical model. With this Bayesian perspective, the encoder becomes a variational inference network, mapping observed inputs to (approximate) posterior distributions over latent space, and the decoder becomes a generative network, capable of mapping arbitrary latent coordinates back to distributions over the original data space.
The beauty of this setup is that we can take a principled Bayesian approach toward building systems with a rich internal “mental model” of the observed world, all by training a single, cleverly-designed deep neural network.
These benefits derive from an enriched understanding of data as merely the tip of the iceberg—the observed result of an underlying causative probabilistic process.
The power of the resulting model is captured by Feynman’s famous chalkboard quote: “What I cannot create, I do not understand.” When trained on MNIST handwritten digits, our VAE model can parse the information spread thinly over the high-dimensional observed world of pixels, and condense the most meaningful features into a structured distribution over reduced latent dimensions.
Having recovered the latent manifold and assigned it a coordinate system, it becomes trivial to walk from one point to another along the manifold, creatively generating realistic digits all the while:
In this post, we’ll take a look under the hood at the math and technical details that allow us to optimize the VAE model we sketched in Part I.
Along the way, we’ll show how to implement a VAE in TensorFlow—a library for efficient numerical computation using data flow graphs, with key features like automatic differentiation and parallelizability (across clusters, CPUs, GPUs…and TPUs if you’re lucky). You can find (and tinker with!) the full implementation here, along with a couple pre-trained models.
Building the Model
Let’s dive into code (Python 3.4), starting with the necessary imports:
import functools

from functional import compose, partial
import numpy as np
import tensorflow as tf
One perk of these models is their modularity—VAEs are naturally amenable to swapping in whatever encoder/decoder architecture is most fitting for the task at hand: recurrent neural networks, convolutional and deconvolutional networks, etc.
For our purposes, we will model the relatively simple MNIST dataset using densely-connected layers, wired symmetrically around the hidden code.
We can initialize a Dense layer with our choice of nonlinearity for the layer nodes (i.e. neural network units that apply a nonlinear activation function to a linear combination of their inputs, as per line 18).
We’ll use ELUs (Exponential Linear Units), a recent advance in building nodes that learn quickly by avoiding the problem of vanishing gradients. We wrap up the class with a helper function (Dense.wbVars) for compatible random initialization of weights and biases, to further accelerate learning.
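For intuition, here is what a single densely-connected layer with an ELU nonlinearity computes, sketched in plain Python rather than TensorFlow. The function names are hypothetical, and the initialization shown (zero biases and normal weights scaled by sqrt(2 / fan_in)) is one common choice, not necessarily what Dense.wbVars does.

```python
import math
import random


def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def init_weights(fan_in, fan_out, seed=0):
    """Assumed initialization scheme: scaled normal weights, zero biases."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    weights = [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]
    biases = [0.0] * fan_out
    return weights, biases


def dense(x, weights, biases, nonlinearity=elu):
    """Apply the nonlinearity to a linear combination of the inputs."""
    return [
        nonlinearity(sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j])
        for j in range(len(biases))
    ]


weights, biases = init_weights(fan_in=3, fan_out=4)
print(dense([1.0, -2.0, 0.5], weights, biases))
```

Unlike a ReLU, the ELU keeps a nonzero slope for negative inputs, so units with negative pre-activations still receive gradient.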
In TensorFlow, neural networks are defined as numerical computation graphs. We will build the graph using partial function composition of sequential layers, which is amenable to an arbitrary number of hidden layers.
def composeAll(*args):
    """Util for multiple function composition

    i.e. composed = composeAll([f, g, h])
         composed(x) # == f(g(h(x)))
    """
    # adapted from https://docs.python.org/3.1/howto/functional.html
    return partial(functools.reduce, compose)(*args)
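composeAll relies on compose and partial from the third-party functional package. If you just want to see the composition order in action, a standard-library-only equivalent might look like the sketch below. The names `compose` and `compose_all` here are stand-ins written for this example, not part of the original code.

```python
import functools


def compose(f, g):
    """Return the function x -> f(g(x))."""
    return lambda x: f(g(x))


def compose_all(functions):
    """compose_all([f, g, h])(x) == f(g(h(x)))"""
    return functools.reduce(compose, functions)


def add_one(x):
    return x + 1


def double(x):
    return 2 * x


def square(x):
    return x * x


composed = compose_all([add_one, double, square])
print(composed(3))  # add_one(double(square(3))) == 19
```

The rightmost function in the list is applied first. That is why a stack of layers meant to run from input to output has to be listed outermost-first, i.e. in reverse order.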
Now that we’ve defined our model primitives, we can tackle the VAE itself.
Keep in mind: the TensorFlow computational graph is cleanly divorced from the numerical computations themselves. In other words, a tf.Graph wireframes the underlying skeleton of the model, upon which we may hang values only within the context of a tf.Session.
Below, we initialize class VAE and activate a session for future convenience (so we can initialize and evaluate tensors within a single session, e.g. to persist weights and biases across rounds of training).
Here are some relevant snippets, cobbled together from the full source code:
class VAE():
    """Variational Autoencoder

    see: Kingma & Welling - Auto-Encoding Variational Bayes
    (https://arxiv.org/abs/1312.6114)
    """
    DEFAULTS = {
        "batch_size": 128,
        "learning_rate": 1E-3,
        "dropout": 1., # keep_prob
        "lambda_l2_reg": 0.,
        "nonlinearity": tf.nn.elu,
        "squashing": tf.nn.sigmoid
    }
    RESTORE_KEY = "to_restore"

    def __init__(self, architecture, d_hyperparams={}, meta_graph=None,
                 save_graph_def=True, log_dir="./log"):
        """(Re)build a symmetric VAE model with given:

            * architecture (list of nodes per encoder layer); e.g.
               [1000, 500, 250, 10] specifies a VAE with 1000-D inputs, 10-D latents,
               & end-to-end architecture [1000, 500, 250, 10, 250, 500, 1000]

            * hyperparameters (optional dictionary of updates to `DEFAULTS`)
        """
        self.architecture = architecture
        self.__dict__.update(VAE.DEFAULTS, **d_hyperparams)
        self.sesh = tf.Session()

        if not meta_graph: # new model
            handles = self._buildGraph()
            ...
            self.sesh.run(tf.initialize_all_variables())
Assuming that we are building a model from scratch (rather than restoring a saved meta_graph), the key initialization step is the call to VAE._buildGraph (line 32). This internal method constructs nodes representing the placeholders and operations through which the data will flow—before any data is actually piped in.
Finally, we unpack the iterable handles (populated by _buildGraph) into convenient class attributes—pointers not to numerical values, but rather to nodes in the graph:
            ...
            # unpack handles for tensor ops to feed or fetch
            (self.x_in, self.dropout_, self.z_mean, self.z_log_sigma,
             self.x_reconstructed, self.z_, self.x_reconstructed_,
             self.cost, self.global_step, self.train_op) = handles
How are these nodes defined? The _buildGraph method encapsulates the core of the VAE model framework—starting with the encoder/inference network:
    def _buildGraph(self):
        x_in = tf.placeholder(tf.float32, shape=[None, # enables variable batch size
                                                 self.architecture[0]], name="x")
        dropout = tf.placeholder_with_default(1., shape=[], name="dropout")

        # encoding / "recognition": q(z|x)
        encoding = [Dense("encoding", hidden_size, dropout, self.nonlinearity)
                    # hidden layers reversed for function composition: outer -> inner
                    for hidden_size in reversed(self.architecture[1:-1])]
        h_encoded = composeAll(encoding)(x_in)

        # latent distribution parameterized by hidden encoding
        # z ~ N(z_mean, np.exp(z_log_sigma)**2)
        z_mean = Dense("z_mean", self.architecture[-1], dropout)(h_encoded)
        z_log_sigma = Dense("z_log_sigma", self.architecture[-1], dropout)(h_encoded)
Here, we build a pipe from x_in (an empty placeholder for input data \(x\)), through the sequential hidden encoding, to the corresponding distribution over latent space—the variational approximate posterior, or hidden representation, \(z \sim q_\phi(z|x)\).
As noted in the code comments above, latent \(z\) is distributed as a multivariate normal with mean \(\mu\) and diagonal covariance values \(\sigma^2\) (the square of the “sigma” in z_log_sigma) directly parameterized by the encoder: \(\mathcal{N}(\mu, \sigma^2I)\). In other words, we set out to “explain” highly complex observations as the consequence of an unobserved collection of simplified latent variables, i.e. independent Gaussians. (This is dictated by our choice of a conjugate spherical Gaussian prior over \(z\); see Part I.)
Next, we sample from this latent distribution (in practice, one draw is enough given sufficient minibatch size, i.e. >100). This method involves a trick—can you figure out why?—that we will explore in more detail later.
z = self.sampleGaussian(z_mean, z_log_sigma)
The sampled \(z\) is then passed to the decoder/generative network, which symmetrically builds back out to generate the conditional distribution over input space, reconstruction \(\tilde{x} \sim p_\theta(x|z)\).
    # decoding / "generative": p(x|z)
    decoding = [Dense("decoding", hidden_size, dropout, self.nonlinearity)
                for hidden_size in self.architecture[1:-1]]  # assumes symmetry
    # final reconstruction: restore original dims, squash outputs [0, 1]
    decoding.insert(0, Dense(  # prepend as outermost function
        "reconstruction", self.architecture[0], dropout, self.squashing))
    x_reconstructed = tf.identity(composeAll(decoding)(z), name="x_reconstructed")
Alternately, we add a placeholder to directly feed arbitrary values of \(z\) to the generative network (to fabricate realistic outputs—no input data necessary!):
    # ops to directly explore latent space
    # defaults to prior z ~ N(0, I)
    z_ = tf.placeholder_with_default(tf.random_normal([1, self.architecture[-1]]),
                                     shape=[None, self.architecture[-1]],
                                     name="latent_in")
    x_reconstructed_ = composeAll(decoding)(z_)
TensorFlow automatically flows data through the appropriate subgraph, based on the nodes that we fetch and feed with the tf.Session.run method. Defining the encoder, decoder, and end-to-end VAE is then trivial (see linked code).
We’ll finish the VAE._buildGraph method later in the post, as we walk through the nuances of the model.
The Reparameterization Trick
In order to estimate the latent representation \(z\) for a given observation \(x\), we want to sample from the approximate posterior \(q_\phi(z|x)\) according to the distribution defined by the encoder.
However, model training by gradient descent requires that our model be differentiable with respect to its learned parameters (which is how we propagate the gradients). This presupposes that the model is deterministic, i.e. a given input always returns the same output for a fixed set of parameters, so the only source of stochasticity is the inputs. Incorporating a probabilistic “sampling” node would make the model itself stochastic!
Instead, we inject randomness into the model by introducing input from an auxiliary random variable: \(\epsilon \sim p(\epsilon)\).
For our purposes, rather than sampling \(z\) directly from \(q_\phi(z|x) \sim \mathcal{N}(\mu, \sigma^2I)\), we generate Gaussian noise \(\epsilon \sim \mathcal{N}(0, I)\) and compute \[z = \mu + \sigma \odot \epsilon\] (where \(\odot\) is the element-wise product). In code:
def sampleGaussian(self, mu, log_sigma):
    """Draw sample from Gaussian with given shape, subject to random noise epsilon"""
    with tf.name_scope("sample_gaussian"):
        # reparameterization trick
        epsilon = tf.random_normal(tf.shape(log_sigma), name="epsilon")
        return mu + epsilon * tf.exp(log_sigma)  # N(mu, sigma**2)
By “reparameterizing” this step, inference and generation become entirely differentiable and hence, learnable.
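To see the trick in isolation, here is a minimal NumPy sketch (not the post's TensorFlow code; the function name `sample_gaussian` and the toy values are assumptions made for illustration). It checks that the reparameterized draws have the requested mean and standard deviation, and that, for fixed noise, the sample is a deterministic function of \(\mu\).

```python
import numpy as np

def sample_gaussian(mu, log_sigma, rng):
    """Reparameterized draw: z = mu + sigma * epsilon, with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(np.shape(log_sigma))  # all randomness lives here
    return mu + np.exp(log_sigma) * epsilon

# Toy parameters (illustrative only): mean 3, standard deviation 0.5
mu = np.full((100_000, 2), 3.0)
log_sigma = np.full((100_000, 2), np.log(0.5))

z = sample_gaussian(mu, log_sigma, np.random.default_rng(0))
z_mean_est = z.mean(axis=0)  # should be close to 3.0
z_std_est = z.std(axis=0)    # should be close to 0.5

# With the noise held fixed (same seed), shifting mu shifts z by the same amount,
# i.e. z is a smooth deterministic function of the parameters.
z_a = sample_gaussian(mu, log_sigma, np.random.default_rng(1))
z_b = sample_gaussian(mu + 0.1, log_sigma, np.random.default_rng(1))
shift = z_b - z_a
```

Because the noise is an input rather than an internal sampling node, gradients with respect to \(\mu\) and \(\sigma\) pass straight through the addition and multiplication.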
Cost Function
Now, in order to optimize the model, we need a metric for how well its parameters capture the true data-generating and latent distributions. That is, how likely is observation \(x\) under the joint distribution \(p(x, z)\)?
Recall that we represent the global encoder and decoder parameters (i.e. neural network weights and biases) as \(\phi\) and \(\theta\), respectively.
In other words, we want to simultaneously tune these complementary parameters such that we maximize \(log(p(x|\phi, \theta))\)—the log-likelihood across all datapoints \(x\) under the current model settings, after marginalizing out the latent variables \(z\). This term is also known as the model evidence.
We can express this marginal likelihood as the sum of what we’ll call the variational or evidence lower bound \(\mathcal{L}\) and the Kullback-Leibler (KL) divergence \(\mathcal{D}_{KL}\) between the approximate and true latent posteriors: \[ log(p(x)) = \mathcal{L}(\phi, \theta; x) + \mathcal{D}_{KL}(q_\phi(z|x) || p_\theta(z|x)) \]
Here, the KL divergence can be (fuzzily!) intuited as a metric for the misfit of the approximate posterior \(q_\phi\). We’ll delve into this further in a moment, but for now the important thing is that it is non-negative by definition; consequently, the first term acts as a lower bound on the total. So, we maximize the lower bound \(\mathcal{L}\) as a (computationally-tractable) proxy for the total marginal likelihood of the data under the model. (And the better our approximate posterior, the tighter the gap between the lower bound and the total model evidence.)
With some mathematical wrangling, we can decompose \(\mathcal{L}\) into the following objective function: \[ \mathcal{L}(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(z|x)}[log(p_\theta(x|z))] - \mathcal{D}_{KL}(q_\phi(z|x) || p_\theta(z)) \] (Phrased as a cost, we optimize the model by minimizing \({-\mathcal{L}}\).)
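These identities are easy to verify numerically on a toy model. The sketch below is an illustration only, assuming a made-up discrete latent variable with three states and a single fixed observation (all probabilities are invented). It checks that the log evidence equals the lower bound plus the KL gap, and that the bound becomes tight when \(q\) equals the true posterior.

```python
import numpy as np

def kl(q, p):
    """KL divergence D_KL(q || p) between two discrete distributions, in nats."""
    return float(np.sum(q * np.log(q / p)))

# Invented toy model: latent z takes one of three values
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.60, 0.30])  # p(x | z) for one fixed observation x
q = np.array([0.2, 0.5, 0.3])              # some approximate posterior q(z | x)

evidence = float(np.sum(prior * likelihood))  # p(x), marginalizing out z
posterior = prior * likelihood / evidence     # true posterior p(z | x), via Bayes
log_evidence = float(np.log(evidence))

# Lower bound: expected log-likelihood under q, minus KL from q to the prior
elbo = float(np.sum(q * np.log(likelihood))) - kl(q, prior)
# Gap: KL from q to the true posterior
gap = kl(q, posterior)

# When q is the true posterior, the gap vanishes and the bound is tight
elbo_tight = float(np.sum(posterior * np.log(likelihood))) - kl(posterior, prior)
```

In a real VAE the true posterior is intractable, so only the lower bound can be evaluated; the toy model is small enough to compute every term exactly.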
Here, the perhaps unfriendly-looking first term is, in fact, familiar! It’s the probability density of generated output \(\tilde{x}\) given the inferred latent distribution over \(z\)—i.e. the (negative) expected reconstruction error. This loss term is intrinsic to perhaps every autoencoder: how accurately does the output replicate the input?
Choosing an appropriate metric for image resemblance is hard (but that’s another story). We’ll use the binary cross-entropy, which is commonly used for data like MNIST that can be modeled as Bernoulli trials. Expressed as a static method of the VAE class:
@staticmethod
def crossEntropy(obs, actual, offset=1e-7):
    """Binary cross-entropy, per training example"""
    # (tf.Tensor, tf.Tensor, float) -> tf.Tensor
    with tf.name_scope("cross_entropy"):
        # bound by clipping to avoid nan
        obs_ = tf.clip_by_value(obs, offset, 1 - offset)
        return -tf.reduce_sum(actual * tf.log(obs_) +
                              (1 - actual) * tf.log(1 - obs_), 1)
The second term in the objective is the KL divergence of the prior \(p\) from the (approximate) posterior \(q\) over the latent space. We’ll approach this conceptually, then mathematically.
The KL divergence \(\mathcal{D}_{KL}(q||p)\) is defined as the relative entropy between probability density functions \(q\) and \(p\). In information theory, entropy represents information content (measured in nats), so \(\mathcal{D}_{KL}\) quantifies the information gained by revising the candidate prior \(p\) to match some “ground truth” \(q\).
In a related vein, the KL divergence between posterior and prior beliefs (i.e. distributions) can be conceived as a measure of “surprise”: the extent to which the model must update its “worldview” (parameters) to accommodate new observations.
(Note that the formula is asymmetric—i.e. \(\mathcal{D}_{KL}(q||p) \neq \mathcal{D}_{KL}(p||q)\)—with implications for its use in generative models. This is also why it is not a true metric.)
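A two-outcome example makes the asymmetry concrete. This is a small sketch with invented probabilities, using a hypothetical helper `kl` for discrete distributions:

```python
import numpy as np

def kl(q, p):
    """KL divergence D_KL(q || p) between two discrete distributions, in nats."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = [0.9, 0.1]  # a confident distribution
p = [0.5, 0.5]  # a uniform distribution

kl_qp = kl(q, p)    # divergence of p from q
kl_pq = kl(p, q)    # divergence of q from p: a different number
kl_self = kl(q, q)  # zero: no divergence between identical distributions
```

Swapping the arguments changes which distribution does the weighting inside the sum, which is why the two directions disagree.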
By inducing the learned approximation \(q_\phi(z|x)\) (the encoder) to match the continuous imposed prior \(p(z)\), the KL term encourages robustness to small perturbations along the latent manifold, enabling smooth interpolation within and between classes (e.g. MNIST digits). This reduces “spottiness” in the latent space that is often observed in autoencoders without such regularization.
Mathematical bonus: we can strategically choose certain conjugate priors over \(z\) that let us analytically integrate the KL divergence, yielding a closed-form equation. This is true of the spherical Gaussian we chose, such that \[ {-\mathcal{D}}_{KL}(q_\phi(z|x) || p_\theta(z)) = \frac{1} 2 \sum{(1 + log(\sigma^2) - \mu^2 - \sigma^2)} \] (summed over the latent dimensions). In TensorFlow, that looks like this:
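For readers who want to check the closed form outside of TensorFlow, here is a minimal NumPy sketch. The function name `kl_to_standard_normal` and the toy inputs are assumptions for illustration; it takes the encoder's mean and log standard deviation, as in sampleGaussian, and compares the closed form against a Monte Carlo estimate of the same quantity.

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, sigma^2 I) || N(0, I)) per example, summed over latent dims."""
    # log(sigma^2) = 2 * log_sigma and sigma^2 = exp(2 * log_sigma)
    return -0.5 * np.sum(1 + 2 * log_sigma - mu**2 - np.exp(2 * log_sigma), axis=1)

# Two toy "examples" with 2-D latents (values invented for illustration)
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_sigma = np.array([[0.0, 0.0], [0.0, np.log(0.5)]])
kl = kl_to_standard_normal(mu, log_sigma)  # first row matches the prior exactly

# Monte Carlo estimate for the second example: average of log q(z) - log p(z)
# over samples z ~ q (the log(2*pi) constants cancel between the two densities)
rng = np.random.default_rng(0)
eps = rng.standard_normal((200_000, 2))
z = mu[1] + np.exp(log_sigma[1]) * eps
log_q_minus_log_p = np.sum(-0.5 * eps**2 - log_sigma[1] + 0.5 * z**2, axis=1)
kl_monte_carlo = float(log_q_minus_log_p.mean())
```

The closed form avoids the sampling noise of the Monte Carlo estimate entirely, which is the practical payoff of the Gaussian prior.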
Together, these complementary loss terms capture the trade-off between expressivity and concision, between data complexity and simplicity of the prior. Reconstruction loss pushes the model toward perfectionist tendencies, while KL loss (along with the addition of auxiliary noise) encourages it to explore sensibly.
To elaborate (building on the VAE._buildGraph method started above):
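As a rough, framework-agnostic sketch of how the two loss terms might be combined into one scalar cost, the NumPy code below averages reconstruction plus KL loss over a minibatch and adds an optional \(\ell_2\) penalty. The function `vae_cost`, its argument names, and the factor of one half in the penalty are assumptions for illustration, not the post's graph code.

```python
import numpy as np

def cross_entropy(obs, actual, offset=1e-7):
    """Binary cross-entropy per example (mirrors crossEntropy above)."""
    obs_ = np.clip(obs, offset, 1 - offset)  # clip to avoid log(0)
    return -np.sum(actual * np.log(obs_) + (1 - actual) * np.log(1 - obs_), axis=1)

def kl_to_standard_normal(mu, log_sigma):
    """KL(N(mu, sigma^2 I) || N(0, I)) per example."""
    return -0.5 * np.sum(1 + 2 * log_sigma - mu**2 - np.exp(2 * log_sigma), axis=1)

def vae_cost(x_in, x_reconstructed, z_mean, z_log_sigma,
             weights=(), lambda_l2_reg=0.0):
    """Average (reconstruction + KL) loss over the batch, plus optional l2 penalty."""
    rec_loss = cross_entropy(x_reconstructed, x_in)
    kl_loss = kl_to_standard_normal(z_mean, z_log_sigma)
    # Assumed convention: half the sum of squared weights per weight matrix
    l2_reg = lambda_l2_reg * sum(0.5 * np.sum(w**2) for w in weights)
    return float(np.mean(rec_loss + kl_loss) + l2_reg)

# Toy batch of two 3-pixel "images" (values invented for illustration)
x = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
z_mean = np.zeros((2, 2))
z_log_sigma = np.zeros((2, 2))

cost_perfect = vae_cost(x, x, z_mean, z_log_sigma)                     # near zero
cost_blurry = vae_cost(x, np.full_like(x, 0.5), z_mean, z_log_sigma)   # worse reconstruction
cost_off_prior = vae_cost(x, x, z_mean + 2.0, z_log_sigma)             # posterior far from prior
```

A perfect reconstruction with a posterior equal to the prior costs almost nothing; either a blurrier output or a posterior that strays from the prior raises the cost.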
Beyond its concise elegance and solid grounding in Bayesian theory, the cost function lends itself well to intuitive metaphor:
Information theory-wise, the VAE is a terse game of Telephone, with the aim of finding the minimum description length to convey the input from end to end. Here, reconstruction loss is the information “lost in translation,” while KL loss captures how overly “wordy” the model must be to convey the message through an unpredictable medium (hidden code imperfectly optimized for the input data).
Or, framing the VAE as a lossy compression algorithm, reconstruction loss accounts for the fidelity of (de)compression while KL loss penalizes the model for using a sub-optimal compression scheme.
Training
At last, our VAE cost function in hand (after factoring in optional \(\ell_2\)-regularization), we finish VAE._buildGraph with optimization nodes to be evaluated at each step of SGD (with the Adam optimizer)…
    # optimization
    global_step = tf.Variable(0, trainable=False)
    with tf.name_scope("Adam_optimizer"):
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        tvars = tf.trainable_variables()
        grads_and_vars = optimizer.compute_gradients(cost, tvars)
        clipped = [(tf.clip_by_value(grad, -5, 5), tvar)  # gradient clipping
                   for grad, tvar in grads_and_vars]
        train_op = optimizer.apply_gradients(clipped, global_step=global_step,
                                             name="minimize_cost")  # back-prop
…and return all of the nodes we want to access in the future to the VAE.__init__ method where _buildGraph was called.
Using SGD to optimize the function parameters of the inference and generative networks simultaneously is called Stochastic Gradient Variational Bayes.
This is where TensorFlow really shines: all of the gradient backpropagation and parameter updates are performed via automatic differentiation, and abstracted away from the researcher in the (essentially) one-liner train_op defined above.
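To give a feel for what such an update involves, here is a NumPy sketch of one Adam step with gradient clipping by value, following the update rule from the Adam paper (Kingma & Ba). It is an illustration under stated assumptions: the function `adam_step`, the toy quadratic objective, and the hand-written gradient are all invented here, and TensorFlow's own implementation differs in details.

```python
import numpy as np

def adam_step(param, grad, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8, clip=5.0):
    """One Adam update on `param`, after clipping the gradient to [-clip, clip]."""
    grad = np.clip(grad, -clip, clip)     # gradient clipping by value
    m = beta1 * m + (1 - beta1) * grad    # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2 # running mean of squared gradients
    m_hat = m / (1 - beta1**t)            # bias correction (t counts from 1)
    v_hat = v / (1 - beta2**t)
    param = param - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# A single step from w = 10 with raw gradient 14 (clipped to 5)
first, _, _ = adam_step(np.array([10.0]), np.array([14.0]),
                        np.zeros(1), np.zeros(1), t=1, learning_rate=1e-2)

# Toy problem: minimize f(w) = sum((w - 3)^2), whose gradient is 2 * (w - 3)
w = np.array([10.0, -4.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 5001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, learning_rate=1e-2)
```

In the VAE, the gradient is of course not written by hand: automatic differentiation derives it from the cost node for every trainable variable.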
Model training (with optional cross-validation) is then as simple as feeding minibatches from dataset X to the x_in placeholder and evaluating (“fetching”) the train_op. Here are some relevant chunks, excerpted from the full class method:
def train(self, X, max_iter=np.inf, max_epochs=np.inf, cross_validate=True,
          verbose=True, save=False, outdir="./out", plots_outdir="./png"):
    try:
        err_train = 0
        now = datetime.now().isoformat()[11:]
        print("------- Training begin: {} -------\n".format(now))

        while True:
            x, _ = X.train.next_batch(self.batch_size)
            feed_dict = {self.x_in: x, self.dropout_: self.dropout}
            fetches = [self.x_reconstructed, self.cost, self.global_step, self.train_op]
            x_reconstructed, cost, i, _ = self.sesh.run(fetches, feed_dict)

            err_train += cost

            if i%1000 == 0 and verbose:
                print("round {} --> avg cost: ".format(i), err_train / i)

            if i >= max_iter or X.train.epochs_completed >= max_epochs:
                print("final avg cost (@ step {} = epoch {}): {}".format(
                    i, X.train.epochs_completed, err_train / i))
                now = datetime.now().isoformat()[11:]
                print("------- Training end: {} -------\n".format(now))
                break
Helpfully, TensorFlow comes with a built-in visualization dashboard. Here’s the computational graph for an end-to-end VAE with two hidden encoder/decoder layers (that’s what all the tf.name_scope-ing was for):
Wrapping Up
The future of deep latent models lies in models that can reason about the world—“understanding” complex observations, transforming them into meaningful internal representations, and even leveraging these representations to make decisions—all while coping with scarce data, and in semisupervised or unsupervised settings. VAEs are an important step toward this future, demonstrating the power of new ways of thinking that result from unifying variational Bayesian methods and deep learning.
We now understand how these fields come together to make the VAE possible, through a theoretically-sound objective function that balances accuracy (reconstruction loss) with variational regularization (KL loss), and efficient optimization of the fully differentiable model thanks to the reparameterization trick.
We’ll wrap up for now with one more way of visualizing the condensed information encapsulated in VAE latent space.
Previously, we showed the correspondence between the inference and generative networks by plotting the encoder and decoder perspectives of the latent space in the same 2-D coordinate system. For the decoder perspective, this meant feeding linearly spaced latent coordinates to the generative network and plotting their corresponding outputs.
To get an undistorted sense of the full latent manifold, we can sample and decode latent space coordinates proportionally to the model’s distribution over latent space. In other words (thanks to variational regularization provided by the KL loss!), we simply sample relative to our chosen prior distribution over \(z\). In our case, this means sampling linearly spaced percentiles from the inverse CDF of a spherical Gaussian. [1]
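A minimal sketch of this sampling scheme for a 2-D latent space follows, using the standard library's `statistics.NormalDist` for the inverse CDF. The function name `latent_grid`, the grid size, and the percentile range are assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

def latent_grid(n_per_axis=5, lo=0.01, hi=0.99):
    """Grid of 2-D latent coordinates at linearly spaced Gaussian percentiles."""
    percentiles = np.linspace(lo, hi, n_per_axis)
    # Inverse CDF of N(0, 1) maps each percentile to a latent coordinate
    coords = np.array([NormalDist().inv_cdf(p) for p in percentiles])
    # Rows run from top (largest second coordinate) to bottom, left to right
    zs = np.array([[z1, z2] for z2 in coords[::-1] for z1 in coords])
    return coords, zs

coords, zs = latent_grid()
```

The resulting coordinates crowd together near the origin, where the prior has most of its mass, and spread out toward the tails. Each row of `zs` could then be fed to the generative network through the latent placeholder described earlier.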
Once again, evolving over (logarithmic) time:
Interestingly, we can see that the slim tails of the distribution (edges of the frame) are not well-formed. Presumably, this results from few observed inputs being mapped to latent posteriors with significant density in these regions.
Here are a few resulting constellations (from a single model):
Theoretically, we could subdivide the latent space into infinitely many points (limited in practice only by the computer’s floating point precision), and let the generative network dream up infinite constellations of creative variations on MNIST.
That’s enough digits for now! Keep your eyes out for the next installment, where we’ll tinker with the vanilla VAE model in the context of a new dataset.
– Miriam
[1] Thanks to Kyle McDonald (@kcimc) and Tom White (@dribnet) for noting this!
Introducing Variational Autoencoders (in Prose and Code)
Effective machine learning means building expressive models that sift out signal from noise—that simplify the complexity of real-world data, yet accurately intuit and capture its subtle underlying patterns.
Whatever the downstream application, a primary challenge often boils down to this: How do we represent, or even synthesize, complex data in the context of a tractable model?
This challenge is compounded when working in a limited data setting—especially when samples are in the form of richly-structured, high-dimensional observations like natural images, audio waveforms, or gene expression data.
Cue the Variational Autoencoder, a fascinating development in unsupervised machine learning that marries probabilistic Bayesian inference with deep learning.
Benefiting from advances in both research communities, the Variational Autoencoder addresses these challenges by leveraging innovative deep learning techniques grounded in a solid Bayesian theoretical framework...and can be explained through mesmerizing GIFs:
(Read on, and all will become clear...)
Intro
Traditional autoencoders are models (usually multilayer artificial neural networks) designed to output a reconstruction of their input. Specifically, autoencoders sequentially deconstruct input data into hidden representations, then use these representations to sequentially reconstruct outputs that resemble the originals. Fittingly, this process of teasing out a mapping from input to hidden representation is called representation learning.
The appeal of this setup is that the model learns its own definition of a "meaningful" representation based only on the data—no human-derived heuristics or labels! This approach stands in contrast to the majority of deep learning systems in production today, which rely on expensive-to-obtain labeled data ("This image is a kitten; this image is a panda."). Alternatives to such supervised learning frameworks provide a way to benefit from a world brimming with valuable raw data.
Though trained holistically, autoencoders are often built for the part instead of the whole: researchers might exploit the data-to-representation mapping for semantic embeddings, or the representation-to-output mapping for extraordinarily complex generative modeling.
But an autoencoder with unlimited capacity is doomed to the role of a wonky, computationally-expensive Xerox machine. To ensure that the transformations to or from the hidden representation are useful, we impose some type of regularization or constraint. As a tradeoff for some loss in fidelity, such impositions push the model to distill the most salient features from a cacophonous real-world dataset.
Variational Autoencoders (VAEs) incorporate regularization by explicitly learning the joint distribution over data and a set of latent variables that is most compatible with observed datapoints and some designated prior distribution over latent space. The prior informs the model by shaping the corresponding posterior, conditioned on a given observation, into a regularized distribution over latent space (the coordinate system spanned by the hidden representation).
As a result, VAEs are an excellent tool for manifold learning—recovering the "true" manifold in lower-dimensional space along which the observed data lives with high probability mass—and generative modeling of complex datasets like images, text, and audio—conjuring up brand new examples, consistent with the observed training set, that do not exist in nature.
Building on other informative posts, this is the first installment of a guide to Variational Autoencoders: the lovechild of Bayesian inference and unsupervised deep learning.
In this post, we'll sketch out the model and provide an intuitive context for the math- and code-flavored follow-up. In Post II, we'll walk through a technical implementation of a VAE (in TensorFlow and Python 3). In Post III, we'll venture beyond the popular MNIST dataset using a twist on the vanilla VAE.
The Variational Autoencoder Setup
An end-to-end autoencoder (input to reconstructed input) can be split into two complementary networks: an encoder and a decoder. The encoder maps input \(x\) to a latent representation, or so-called hidden code, \(z\). The decoder maps the hidden code to reconstructed input value \(\tilde x\).
Whereas a vanilla autoencoder is deterministic, a Variational Autoencoder is stochastic—a mashup of:
a probabilistic encoder \(q_\phi(z|x)\), approximating the true (but intractable) posterior distribution \(p(z|x)\), and
a generative decoder \(p_\theta(x|z)\), which notably does not rely on any particular input \(x\).
Both the encoder and decoder are artificial neural networks (i.e. hierarchical, highly nonlinear functions) with tunable parameters \(\phi\) and \(\theta\), respectively.
Learning these conditional distributions is facilitated by enforcing a plausible mathematically-convenient prior over the latent variables, generally a standard spherical Gaussian: \(z \sim \mathcal{N}(0, I)\).
Given this conjugate prior, the encoder's job is to supply the mean and variance of the Gaussian posterior over each latent space dimension corresponding to a given input. Latent \(z\) is sampled from this distribution, then passed to the decoder to be transformed back into a distribution over the original data space.
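The flow of data through the two halves can be sketched in a few lines of NumPy. Everything here is a stand-in: the "networks" are single untrained linear layers with random weights, the encoder outputs a log standard deviation rather than a variance, and all names are hypothetical. The point is only the shape of the computation: encode to distribution parameters, sample, decode.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim, batch = 784, 2, 4

# Hypothetical, untrained stand-ins for the encoder and decoder networks
W_mean = rng.normal(0, 0.01, (input_dim, latent_dim))
W_log_sigma = rng.normal(0, 0.01, (input_dim, latent_dim))
W_dec = rng.normal(0, 0.01, (latent_dim, input_dim))

def encode(x):
    """Map inputs to the parameters of a Gaussian over latent space."""
    return x @ W_mean, x @ W_log_sigma

def decode(z):
    """Map latent codes to per-pixel values in (0, 1) via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

x = rng.random((batch, input_dim))       # fake "images" with pixels in [0, 1)
z_mean, z_log_sigma = encode(x)          # encoder: parameters of q(z|x)
z = z_mean + np.exp(z_log_sigma) * rng.standard_normal(z_mean.shape)  # sample z
x_tilde = decode(z)                      # decoder: reconstruction of x

# The decoder needs no input x: codes drawn from the prior N(0, I) work too
x_dream = decode(rng.standard_normal((batch, latent_dim)))
```

With trained weights in place of the random ones, `x_tilde` would resemble `x`, and `x_dream` would be a batch of newly generated examples.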
In other words, a VAE represents a directed probabilistic graphical model, in which approximate inference is performed by the encoder and optimized alongside an easy-to-sample generative decoder. For this reason, these complementary halves are also known as the inference (or recognition) network and the generative network. By reformulating this graphical model as a differentiable neural net with a single, pithy cost function (derived from the variational lower bound), the whole package can be trained by stochastic gradient descent (SGD) thanks to the "amusing" universe we live in.
Bayes, Meet Neural Networks
In fact, many developments in deep learning research can also be understood through a probabilistic, or Bayesian, lens. Some of these analogies are more theoretical, whereas others share a parallel mathematical interpretation. For example, \(\ell_2\)-regularization can be viewed as imposing a Gaussian prior over neural network weights, and reinforcement learning can be formalized through variational inference.
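The first of these analogies can be checked in a few lines. The sketch below is illustrative, with hypothetical function names and an arbitrary prior width: it shows that the negative log density of a zero-mean Gaussian prior over weights differs from an \(\ell_2\) penalty only by a constant that does not depend on the weights, provided the penalty strength is set to \(\lambda = 1/(2\sigma^2)\).

```python
import numpy as np

def neg_log_gaussian_prior(w, sigma):
    """-log N(w; 0, sigma^2 I), including the normalizing constant."""
    return 0.5 * np.sum(w**2) / sigma**2 + 0.5 * w.size * np.log(2 * np.pi * sigma**2)

def l2_penalty(w, lam):
    """Standard l2 regularization term: lam * ||w||^2."""
    return lam * np.sum(w**2)

sigma = 2.0                  # assumed prior standard deviation
lam = 1.0 / (2 * sigma**2)   # matching regularization strength

rng = np.random.default_rng(0)
w_a, w_b = rng.normal(size=5), rng.normal(size=5)

# The difference between the two quantities is the same for any weights,
# so both have identical gradients and identical minimizers.
offset_a = neg_log_gaussian_prior(w_a, sigma) - l2_penalty(w_a, lam)
offset_b = neg_log_gaussian_prior(w_b, sigma) - l2_penalty(w_b, lam)
```

A narrower prior (smaller \(\sigma\)) corresponds to a larger \(\lambda\), i.e. stronger regularization.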
VAEs exemplify a case where this relationship is made explicit and elegant, and variational Bayesian inference is the guiding principle shaping the model's cost function and intrinsic architecture.
Why does this setup make sense?
In the Bayesian worldview, datapoints are observations drawn from some data-generating distribution: (observed) variable \(x \sim p(x)\). So, the MNIST dataset of handwritten digits describes a random variable with an intricate set of dependencies among all 28*28 pixels. Each MNIST image offers a glimpse into one arrangement of 784 pixel values with high probability—whereas a 28*28 block of white noise, or the Jolly Roger, (theoretically) occupy low probability mass under the distribution.
It would be a headache to model the conditional dependencies in 784-dimensional pixel space. Instead, we make the simplifying assumption that the distribution over these observed variables is the consequence of a distribution over some set of hidden variables: \(z \sim p(z)\). Intuitively, this paradigm is analogous to how scientists study the natural world, by working backwards from observed phenomena to recover the unifying hidden laws that govern them. In the case of MNIST, these latent variables could represent concepts like number identity and tiltedness, whereas more complex natural images like the Frey faces could have latent dimensions for facial expression and azimuth.
Inference is the process of disentangling these rich real-world dependencies into simplified latent dependencies, by predicting \(p(z|x)\): the distribution over one set of variables (the latent variables) conditioned on another variable (the observed data). (This is where Bayes' theorem enters the picture.)
With this Bayesian frame-of-mind, training a generative model is the same as learning the joint distribution over the data and latent variables: \(p(x, z)\). This approach lends itself well to small datasets, since inference relies on the data-generating distribution rather than individual datapoints per se. It also lets us bake prior knowledge into the model by imposing simplifying a priori distributions over variables.
Classical (iterative, non-learned) approaches to inference are often inefficient and do not scale well to large datasets. With a few theoretical and mathematical tricks, we can train a neural network to do the dirty work of both variational inference and generative modeling...while reaping the additional benefits deep learning provides (universal approximating power, cheap test-time evaluation, minibatched SGD, advances like batch normalization and dropout, etc).
The next post in the series will delve into these theoretical and mathematical tricks and show how to implement them in TensorFlow (a toolbox for efficient numerical computation with data flow graphs).
MNIST
For now, we will take our VAE model for a spin using handwritten MNIST digits.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

import vae  # this is our model - to be explored in the next post

IMG_DIM = 28

ARCHITECTURE = [IMG_DIM**2,  # 784 pixels
                500, 500,  # intermediate encoding
                50]  # latent space dims
# (and symmetrically back out again)

HYPERPARAMS = {
    "batch_size": 128,
    "learning_rate": 1E-3,
    "dropout": 0.9,
    "lambda_l2_reg": 1E-5,
    "nonlinearity": tf.nn.elu,
    "squashing": tf.nn.sigmoid
}

mnist = input_data.read_data_sets("mnist_data")
v = vae.VAE(ARCHITECTURE, HYPERPARAMS)
v.train(mnist, max_iter=20000)
Let's verify the model by eye, by plotting how well it parses random MNIST inputs (top) and reconstructs them (bottom):
Note that these inputs are from the test set, so the model has never seen them before. Not bad!
For latent space visualizations, we can train a VAE with 2-D latent variables (though this space is generally too small for the intrinsic dimensionality of real-world data). Picturing this compressed latent space lets us see how the model has disentangled complex raw data into abstract higher-order features.
We'll visualize the latent manifold over the course of training in two ways, to see the complementary evolution of the encoder and decoder over (logarithmic) time.
This is how the encoder/inference network learns to map the training set from the input data space to the latent space...
...and this is how the decoder/generative network learns to map latent coordinates into reconstructions of the original data space:
Here we are sampling evenly-spaced percentiles along the latent manifold and plotting their corresponding output from the decoder, with the same axis labels as above.
Looking at both plots side-by-side clarifies how optimizing the encoder and decoder in tandem enables efficient pairing of inference and generation:
This tableau highlights the overall smoothness of the latent manifold—and how any "unrealistic" outputs from the generative decoder correspond to apparent discontinuities in the variational posterior of the encoder (e.g. between the "7-space" and the "1-space"). These gaps could probably be improved by experimenting with model hyperparameters.
Whereas the original data dotted a sparse landscape in 784 dimensions, where "realistic" images were few and far between, this 2-dimensional latent manifold is densely populated with such samples. Beyond its inherent visual coolness, latent space smoothness shows the model's ability to leverage its "understanding" of the underlying data-generating process to generalize beyond the training set.
Smooth interpolation within and between digits—in contrast to the spotty latent space characteristic of many autoencoders—is a direct result of the variational regularization intrinsic to VAEs.
Take-aways
Bayesian methods provide a framework for reasoning about uncertainty. Deep learning provides an efficient way to approximate arbitrarily complex functions, and ripe opportunities to probe uncertainty (over parameters, hyperparameters, data, model architectures...).
While differences in language can obscure overlapping ideas, recent research has revealed not just the power of cross-validating theories across fields (interesting in itself), but also a productive new methodology through a unified synthesis of the two.
This research becomes ever more relevant as we seek to leverage today's most interesting real-world data, which is often high-dimensional and rich in structure, yet limited in number and wholly or partially unlabeled.
(But don't take my word for it.)
Variational Autoencoders are:
A reminder that productive sparks fly when deep learning and Bayesian methods are not treated as alternatives, but combined.
Just the beginning of creative applications for deep learning.
Stay tuned for more technical details (math and code!) in Part II.
The Fast Forward Labs research team is developing our next prototype, which will demonstrate an application of probabilistic programming. Probabilistic programming languages are a set of high-level languages that lower the barrier to entry for Bayesian data analysis.
Bayesian data analysis is often seen as the best approach to machine learning. Models derived by this process are highly interpretable, in contrast to other modern models like neural networks and support vector machines. Transparency like this is crucial in industries - such as healthcare and financial services - that have a legal or ethical duty to ensure safety or fairness.
On top of that transparency, the results of Bayesian modeling are complete probability distributions, which means their predictions come with meaningful confidence intervals. Confidence is an important part of interpretability, but is also a key ingredient for deciding whether to act on a prediction immediately or incur the cost of obtaining more data (as in active learning).
Interpretability and confidence have made Bayesian inference very popular in experimental science, where the explicit goal is interpreting a model in the context of data and obtaining more data can be expensive. But Bayesian inference was little used outside academia until recently: as it turns out, the practical engineering challenges of applying it in businesses are enormous.
Probabilistic programming languages are changing the game. The algorithms used in Bayesian inference are baked into these languages as primitives, and the syntax is optimized to permit precise and concise specification of complex models. Thanks to recent algorithmic advances, users don’t even have to set tuning parameters: they simply state the structure of the model, feed in the data, and let the language take care of the rest.
To illustrate the power of probabilistic programming, we developed an iPython notebook that shows how it simplifies and improves anomaly detection. In it, we show a traditional approach to anomaly detection, notice where that approach starts to fail, and show how probabilistic programming provides a more rigorous and robust approach.