Probabilistic programming has become a hot buzzword lately, but if you’re like me, you’ll learn to pronounce all the syllables long before you understand what it’s good for.
This post gives a layman’s introduction. Because it is simplified, the explanation is incomplete. It is also in some senses inaccurate.* But it’s a good starting point, and I’ve tried not to be too confusing.
Everyone else: keep reading.
What it is
Probabilistic Programming Languages (PPLs for short) are tools for statistical modeling. This means they help programmers manage big piles of fluctuating numbers in a reasonable way. Just like you wouldn’t use a toaster to make soup or a bicycle to cross the Atlantic, probabilistic programming isn’t well-suited to every job. You would never want to code an operating system with it, for example.
Here’s one way to look at it: When it comes to Bayesian modeling, probabilistic programming is to compilers as compilers are to assembly language.
(Bayesian modeling, by the way, is not a scary term. It’s just a fancy way of saying we’re going to make educated guesses about the way the world works, and we’re going to represent those guesses as numbers. For example, I’m 90% certain that many of my readers will be confused at this point, and I’m 70% sure that you’re one of them. Those numbers may be right, or they may be wrong, but the point is they represent my current understanding of reality.)
Probabilistic programming lets researchers manage and manipulate those numbers in an abstract, high-level way. Let me illustrate my point:
Where it came from
Back in the ancient days of neolithic programming, coders wrote in assembly language. They used archaic three-letter commands to control which data got stored in which register, which mathematical operations the CPU executed, and which memory locations got copied to where. It was tedious. It was also awesome. It was the best tool ever invented at the time.
But eventually, programmers wanted more.
They didn’t want to worry about manually adding register values or copying numbers from here to there. They wanted to focus on high-level concepts like functions and subroutines, data flow and graphical output. And so more sophisticated languages were developed: compiled languages like C and Pascal or scripted languages like Perl and Python. These high-level languages allowed programmers to accomplish complex tasks with relatively little code. They also added a level of abstraction, helping coders to focus on what the program should do rather than on how it was implemented.
Fast-forward thirty years. Deep learning is here, and Big Data. (So is Netflix, but that’s not part of the story.) CPU speeds have increased by 27. Average RAM capacity has gone up by three orders of magnitude. In short, computers can crunch numbers faster and harder than ever before, and the glorious wilds of the internet ensure that we have plenty of numbers to crunch.
With so much data at our fingertips, machine learning techniques like Bayesian inference become more and more appealing. But there are drawbacks.
Bayesian inference is based on the idea that you can jiggle your world model based on observations. If you read this article and leave a confused comment, I can conclude that you were, in fact, a confused reader. If you leave an insightful and relevant comment, I can conclude that you were not confused. Those comments are observations, and if enough readers leave comments I will eventually be able to adjust the probabilities in my world model. Perhaps I will conclude that only 60% of readers are confused by the post. And perhaps I will conclude with 100% certainty that you are NOT one of them.
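To make that adjustment concrete, here is a minimal sketch in Python. It assumes something this post never specifies: that my belief about the fraction of confused readers is stored as a Beta distribution, which turns the update into simple counting. The function name `update_belief` and the comment counts are invented for illustration.

```python
def update_belief(prior_confused, prior_clear, confused_comments, clear_comments):
    """Conjugate Beta-Bernoulli update: add observed counts to the prior pseudo-counts."""
    post_confused = prior_confused + confused_comments
    post_clear = prior_clear + clear_comments
    # The mean of Beta(a, b) is a / (a + b).
    mean = post_confused / (post_confused + post_clear)
    return post_confused, post_clear, mean

# Assumed prior: Beta(9, 1), whose mean is 0.9 ("90% of readers are confused").
# Hypothetical observations: 3 confused comments and 7 insightful ones.
a, b, posterior_mean = update_belief(9, 1, confused_comments=3, clear_comments=7)
print(a, b, posterior_mean)  # 12 8 0.6
```

With these made-up numbers the belief slides from 90% down to 60%. Different comments would move it somewhere else.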
This sounds great in theory, but when researchers began to implement these methods programmatically, they quickly discovered how complicated it is. It’s not as simple as keeping a couple of probabilities in your head. You have to track the whole probability distribution – the behavior of ALL possible readers, not just one or two of them. And you have to assume that the probability distribution could look like anything: a Gaussian, a Bernoulli, an exponential… even a triple-headed monstrosity created by adding multiple distributions together. The data could be skewed in either direction. The standard deviation could be small or large.
Oh, and most of the distributions exist in n-dimensional space.
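For a taste of what coding one of these by hand looks like, here is a rough Python sketch of a "triple-headed" distribution, built as a weighted mixture of three Gaussians. The weights, means, and standard deviations are made up, and `sample_mixture` is a hypothetical helper, not part of any library. Note that it only draws samples in one dimension. Evaluating densities, fitting parameters, or moving to n dimensions would each need more hand-written code.

```python
import random

# Hypothetical mixture components: (weight, mean, standard deviation).
COMPONENTS = [(0.5, -2.0, 0.5), (0.3, 0.0, 1.0), (0.2, 3.0, 0.8)]

def sample_mixture(rng, components=COMPONENTS):
    """Pick one component according to its weight, then draw from that Gaussian."""
    weights = [w for w, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, std)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_mixture(rng) for _ in range(10_000)]

# The mixture's theoretical mean is the weighted sum of the component means.
expected_mean = sum(w * m for w, m, _ in COMPONENTS)
sample_mean = sum(samples) / len(samples)
print(round(expected_mean, 2), round(sample_mean, 2))
```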
Probabilistic programming arose from a need to manipulate these distributions in a way that didn’t require weeks of programming time and wasn’t prone to programmer error. If you code up each distribution individually for every machine learning application you write, you end up wasting a lot of time. And because these suckers are tricky to code, you also end up fixing a lot of bugs. It’s a bit like writing Microsoft Windows in assembly language.
Even more importantly, the entire data flow of your program would be wrong. In Bayesian learning, you want the probability distributions to stand at the heart of your program. They’re what’s being manipulated, after all. They should be the central element to which everything connects, but with traditional programming languages, they end up getting shunted off to obscure little subroutines.
Enter probabilistic programming.
What it does
Probabilistic programming says, “I’m not a linear program. I’m a probability distribution with unknown parameters.” And then it hands you a bunch of tools to infer those parameters from observations.
And here is the part that bewilders people: the point of running the program is not to find out what happens at the end.
Instead, the point is to figure out the correct values for the variables that define your probability distributions. This is done by taking real data (which was probably scraped from the web somewhere) and comparing it to the output generated on each pass through your probabilistic code. Awesomely amazing machine learning techniques then allow you to shift your distribution’s parameters in a direction that will better match the observed data.
In other words, instead of thinking in terms of what happens first, how the program branches, and which if-then structure to use, you think in terms of your Bayesian model. “What do I think the world is like?” you ask yourself. “What kind of distributions with what kinds of parameters have the expressive power to learn to mimic the real world?”
You carefully design your probabilistic model, and then you run your code. On each pass through the main program loop, your probability primitives call native random number generators that sample from their associated distributions. Run the loop 50 times, and you’ll get fifty different results.
Now you take those 50 results and compare them to your real-world data. Did some numbers appear too often? Did others appear too infrequently, or not at all? Depending on how your sampled numbers compare to the real numbers, you jiggle the parameters of your random number generator. Rinse and repeat. Eventually, the numbers generated by your program will begin to represent data in the real world. Once that happens, your probabilistic program can be said to have learned something about the world. In techspeak, it has inferred the correct parameters for the distribution.
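The sample, compare, and jiggle loop can be sketched in a few lines of Python. Treat this as a cartoon under stated assumptions: the model is a single coin-flip probability `p` (is a reader confused or not?), the observed data is invented, and the update rule simply nudges `p` toward the observed frequency. Real PPLs typically rely on more principled inference methods, such as Markov chain Monte Carlo or variational inference, and they usually return a whole distribution over `p` rather than one number. The names `simulate` and `fit` are hypothetical.

```python
import random

def simulate(p, n, rng):
    """Run the 'program' n times; each run yields 1 (confused) or 0 (not confused)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def fit(observed, p_start=0.9, passes=500, draws=50, step=0.1, seed=0):
    """Nudge p until simulated frequencies resemble the observed frequency."""
    rng = random.Random(seed)
    target = sum(observed) / len(observed)
    p = p_start
    for _ in range(passes):
        simulated = simulate(p, draws, rng)   # sample from the current model
        rate = sum(simulated) / draws         # how often did 'confused' come up?
        p += step * (target - rate)           # jiggle the parameter toward the data
        p = min(1.0, max(0.0, p))             # keep p a valid probability
    return p

# Invented data: 60 confused readers out of 100.
observed = [1] * 60 + [0] * 40
p_hat = fit(observed)
print(round(p_hat, 2))
```

Because each pass uses only 50 noisy samples, `p_hat` keeps wobbling around the observed frequency instead of landing on it exactly.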
What it’s good for
Ok, nifty. But what’s the point? What do people actually do with this stuff?
Awesome amazing things, it turns out. Bayesian modeling can be used in robotic control, stock market prediction, business marketing, simulated theory of mind, geological extrapolations, weather analysis… anything which requires making predictions in the face of uncertainty.
In other words, it’s like, ridiculously useful. Except a lot of people have never bothered with it because it was so dang hard to implement.
Until now. One of the missions of our research lab is to develop new Probabilistic Programming Languages designed to combine the power of Bayesian modeling with deep neural networks. We hope this will enable innovative AI systems that are able to learn from examples like a neural network while simultaneously using Bayesian models to reason about an uncertain world.
Why are they called probabilistic programming languages? Aren’t they really just library functions built on top of an existing language?
Right now, yes. Most PPLs are built on top of something else. But they’re distinct enough in the way they operate that they can be viewed as a new language. For example, in a PPL, complex probability distributions are represented as primitives. They can be added and subtracted just like integers, and they come complete with operators that let you sample from the distribution or perform Bayesian inference based on a set of observations.
So… it’s kind of like how Chinese doesn’t stop being its own language just because the words are transliterated into Latin characters instead of being written in Chinese characters. PPLs are often implemented using existing languages, but they’re really their own thing, and someday we expect to see more PPLs with their own compilers and scripting environments.
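As a rough illustration of "distributions as primitives," here is a toy Gaussian type in Python. It is a hypothetical class written for this post, not the API of any real PPL, and it makes simplifying assumptions: the variables being added or subtracted are independent, and `observe` only handles the textbook case of an unknown mean with known measurement noise.

```python
import math
import random

class Normal:
    """Toy Gaussian primitive (hypothetical; not any real PPL's API)."""

    def __init__(self, mean, var):
        self.mean, self.var = mean, var

    def __add__(self, other):
        # Sum of independent Gaussians: means add, variances add.
        return Normal(self.mean + other.mean, self.var + other.var)

    def __sub__(self, other):
        # Difference of independent Gaussians: means subtract, variances still add.
        return Normal(self.mean - other.mean, self.var + other.var)

    def sample(self, rng=random):
        """Draw one value from this distribution."""
        return rng.gauss(self.mean, math.sqrt(self.var))

    def observe(self, data, noise_var):
        """Posterior over an unknown mean, given noisy observations of known variance."""
        precision = 1.0 / self.var + len(data) / noise_var
        mean = (self.mean / self.var + sum(data) / noise_var) / precision
        return Normal(mean, 1.0 / precision)

total = Normal(1.0, 4.0) + Normal(2.0, 9.0)                      # arithmetic on distributions
posterior = Normal(0.0, 1.0).observe([1.0, 1.0], noise_var=1.0)  # inference from data
draw = posterior.sample(random.Random(0))                        # sampling
print(total.mean, total.var)  # 3.0 13.0
```

A real PPL goes much further, supporting many distribution families and models where no tidy formula exists, but the flavor is similar: the distribution is the object you pass around and operate on.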
You say PPLs are to compilers as compilers are to assembly. But a compiler is just a tool for rendering code into machine language – and many languages don’t even use them. Wouldn’t it be more accurate to say “PPLs are to C as C is to assembly” or “PPLs are to high-level programming languages as high-level programming languages are to system-dependent languages”?
Well, yes. But I was looking for a baseline analogy that most people would instinctively understand. (Remember, I warned you this post was slightly inaccurate.)
What are the “Awesomely amazing machine learning techniques” that allow you to shift your distribution’s parameters?
Where can I learn more?
*Before I get lynched by the purists, I shall point out that Bayesian learning, by definition, begins with an inaccurate model which is iteratively refined. In other words, “You gotta start somewhere.”