The Future of Game Animation
Recently Ninja Theory Senior Animator Chris Goodall posed a question on Twitter: What do people think the future of game animation is going to be?
This is one of my favorite topics to think about, and so I was eager to share some thoughts.
Short Term: Motion Matching
GDC 2016 was Motion Matching’s big coming out party. The core ideas had been floating around the world of academic research for years before that, but this was the first time that actual game studios were starting to show this tech in practical scenarios. Two presentations were made: one by Kristjan Zadziuk, about prototypes in development at Ubisoft Toronto, and another by Simon Clavet about his work on For Honor. The buzz at the conference was palpable, and since then there have been rumors circulating in the industry that a lot of other AAA teams are now starting to build their own Motion Matching technology.
For those who aren’t familiar, Motion Matching is a method of automatically picking which piece of animation should play next on a character. The system makes its own choices, as opposed to relying on State Machine logic, which is the current, manually crafted method of deciding which animations should play.
The Motion Matching system makes these choices based on high-level goals that you feed into the system. So one of these high-level goals might be “2 seconds from now, I want the character to be in this position, with this facing direction”, which the system gets by predicting the future position of the character based on player inputs. Another common high-level goal is “match the position and velocity of the feet and the hips as closely as possible to what was already happening in the previous frame”.
The end result is that Motion Matching has the potential to dramatically reduce the amount of work required when creating animation systems. It also tends to produce very high quality results: Since transitions from one move to the next take hip and foot position and velocity into account, you tend to get really smooth blending, which is sometimes not the case with a traditional State Machine approach.
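To make that a bit more concrete, here is a minimal sketch of the kind of search a Motion Matching system might run each time it picks a new frame. This is not taken from any shipped system: the `Frame` fields, the `match_cost` and `best_match` names, and the tiny two-number feature vectors are all assumptions made up for illustration. A real database would hold many thousands of frames with much richer features.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Frame:
    """One candidate frame in a (hypothetical) Motion Matching database."""
    clip: str          # name of the clip this frame came from
    index: int         # frame number inside that clip
    pose: tuple        # flattened pose features, e.g. foot/hip position and velocity
    trajectory: tuple  # flattened future root position/facing stored with this frame


def match_cost(frame, current_pose, desired_trajectory, trajectory_weight=1.0):
    """Lower is better: how badly this frame fits the two high-level goals."""
    pose_cost = dist(frame.pose, current_pose)                    # continuity with the previous frame
    trajectory_cost = dist(frame.trajectory, desired_trajectory)  # where the player wants to go
    return pose_cost + trajectory_weight * trajectory_cost


def best_match(database, current_pose, desired_trajectory, trajectory_weight=1.0):
    """Brute-force search for the cheapest frame in the database."""
    return min(
        database,
        key=lambda f: match_cost(f, current_pose, desired_trajectory, trajectory_weight),
    )


# Toy data: invented numbers, only meant to show the selection behaviour.
database = [
    Frame("idle", 0, (0.0, 0.0), (0.0, 0.0)),
    Frame("walk", 12, (0.5, 1.0), (1.0, 0.0)),
    Frame("run", 30, (1.0, 3.0), (3.0, 0.0)),
]

slow = best_match(database, current_pose=(0.6, 1.2), desired_trajectory=(1.2, 0.0))
fast = best_match(database, current_pose=(0.6, 1.2), desired_trajectory=(3.0, 0.0))
print(slow.clip, fast.clip)  # walk run
```

The `trajectory_weight` parameter is there to show the usual trade-off: raise it and the character becomes more responsive to input, lower it and the system favors smoother continuity with the current pose.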
I expect that in the next few years, we’ll start to see Motion Matching used more and more in games. Of course, it doesn’t have to be an all or nothing switch from traditional systems; you can embed a Motion Matching system into a traditional State Machine, so for a while, you’ll see a kind of hybrid approach, where some moves will be using Motion Matching (e.g. locomotion), and others might use a more traditional implementation (e.g. scripted events). But I think gradually Motion Matching will replace the majority of moves that we see in games.
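As a rough sketch of what that hybrid could look like, the example below wraps a Motion Matching style picker inside a very small state machine. Everything here is hypothetical: `HybridController`, `pick_locomotion_clip`, and the clip names are invented for illustration, and the picker is a one-line stand-in for a real Motion Matching search.

```python
def pick_locomotion_clip(database, desired_speed):
    """Stand-in for a Motion Matching search: pick the clip whose speed is closest."""
    return min(database, key=lambda entry: abs(entry[1] - desired_speed))[0]


class HybridController:
    """A traditional state machine whose locomotion state defers to Motion Matching."""

    def __init__(self, locomotion_database, scripted_clips):
        self.locomotion_database = locomotion_database  # list of (clip_name, speed)
        self.scripted_clips = scripted_clips            # event name -> hand-authored clip
        self.state = "locomotion"

    def trigger_event(self, name):
        """Manually authored transition into a scripted event."""
        if name not in self.scripted_clips:
            raise KeyError(f"unknown scripted event: {name}")
        self.state = name

    def finish_event(self):
        """Manually authored transition back to locomotion."""
        self.state = "locomotion"

    def update(self, desired_speed):
        """Return the clip to play this tick."""
        if self.state == "locomotion":
            # Inside this state the system makes its own choice.
            return pick_locomotion_clip(self.locomotion_database, desired_speed)
        # Everywhere else, the clip is whatever the designer scripted.
        return self.scripted_clips[self.state]


controller = HybridController(
    locomotion_database=[("walk", 1.5), ("run", 4.0)],
    scripted_clips={"open_door": "open_door_clip"},
)
print(controller.update(desired_speed=3.5))  # run
controller.trigger_event("open_door")
print(controller.update(desired_speed=3.5))  # open_door_clip
```

The point of structuring it this way is that moves can be migrated one state at a time: a scripted state can later be swapped for a Motion Matching one without touching the rest of the graph.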
The initial response from some animators towards Motion Matching was concern that the ease with which you can create systems might reduce the need for animators. From what I’ve experienced so far, this is absolutely not the case: Motion Matching systems typically still benefit from the usual clipping down of data (or otherwise tagging data), and of course, that data is still better if it is cleaned up animation, rather than raw mocap.
The initial vision for Motion Matching was that you would be able to just throw a bunch of unstructured mocap into a Motion Matching database and the system would do everything for you, but it turns out this kind of approach doesn’t produce good results. Technically it does still work, but the system often makes unwanted choices (e.g. sometimes deciding that rather than playing a run cycle, it’s going to play the last two footsteps of an Idle to Start over and over and considers that a run), and so a lot of teams are finding that curating your animation data can give better results.
So in short, there will still be plenty for animators to do in a Motion Matching world.
Short Term: Script Based Automation
At GDC 2016, I presented a new animation tool that Zach Hall and I had developed when I was working at Ubisoft Montreal. The tool automatically processed raw motion capture data into shippable quality animation. Before building this tool, we did an analysis of how our mocap animators were working, which showed that an estimated 50-80% of the tasks that they were doing were things that could be automated. So, we set about automating those things.
In a way, what we did wasn’t particularly revolutionary: Every studio writes scripts to automate repetitive tasks. The only difference in our case was the degree to which we were willing to do it. I’d also say that a key point was that we were looking closely at what the animators were actually doing. Technical animators can sometimes think they know the problems animators are facing, and end up building solutions for things that aren’t the most important.
I got a very positive response to the GDC talk, though I have yet to hear of other studios trying a similar approach.
I would hope that in future, more teams start to look seriously at automation and pipeline efficiency, because it really is a huge opportunity. A single technical animator can potentially save the work of many, many animators if they’re aimed towards the right things. It’s just unfortunate that, more often than not, people tend to rely on what they’re familiar with, and so a manager might prefer to hire 10 more animators to brute force the work, rather than assign a technical animator to focus purely on improving efficiency.
I’m hopeful though that things will happen in this area.
Short to Mid-Term: Neural Networks - Motion Generation at Runtime
If you haven’t seen Daniel Holden et al.’s paper on Phase-Functioned Neural Networks, drop what you’re doing and watch this now. This is the future of game animation, right here.
In my view, Neural Networks and Deep Learning are going to change everything (not just about game animation, not just about game development: everything). While we may not see Neural Network based animation systems shipping in games for a while, some developers are already doing experiments using something similar to Daniel’s approach.
Studios will begin to use animation data to train neural networks, and those networks will then be able to generate animation at runtime. Just like Motion Matching, the animation that the network generates is based on high-level goals, which makes this a natural successor to some of the work that’s being done with Motion Matching.
There are a number of benefits to Neural Network (NN) based animation systems over a Motion Matching approach…
They’re cheaper memory-wise: You only store the trained network weights, and not actual animation data.
Motion Matching is picking from a pre-existing set of animation data. NNs on the other hand can generate poses that weren’t in the original data, as long as they make sense in the context of the original data. This allows for far more adaptive characters. So for example, if you want your character to run past a table and pick up an object from that table, the position of the object doesn’t have to perfectly match what was in the training data; there just need to be enough examples of picking up objects from tables while moving, correlated with appropriate high-level goals, for the system to understand how that type of action works. Then when you’re generating animation at runtime, you can set goals that never existed exactly that way in the training data (like different object positions on the table, different speeds, etc.), and it should be able to deal with that.
NNs need to be fed lots of training data, but one approach to creating this data is to do offline procedural adjustments to your mocap (the kind of adjustments that might normally be inappropriate to use at runtime), and then use the result as training data for the NN. This essentially gives you something similar to a runtime version of that offline process. So for example, Adjustment Blending is a method of adjusting animation that produces high quality results, but is most suitable for offline processing. This is because it relies on knowledge of what the character is going to do in the future. However, you could use Adjustment Blending to create lots of examples of adjusted data, and then use that adjusted data to train the NN. This would essentially give you similar results to Adjustment Blending, but at runtime. Another example of this type of approach is the uneven terrain example used in Daniel’s PFNN paper.
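Here is a small sketch of that last idea: using an offline adjustment to manufacture training examples. To be clear about the assumptions, `adjust_to_target` below is NOT Adjustment Blending; it is a much simpler stand-in that spreads an offset linearly across the clip. It does share the one property that matters for this argument, which is that it needs to see the whole clip (the future) up front. The function names and the toy curve are invented for illustration.

```python
def adjust_to_target(curve, target_end):
    """Offline stand-in adjustment (not Adjustment Blending).

    Spreads the offset needed to end on target_end linearly over the clip.
    It reads curve[-1], i.e. the future, so it only makes sense offline.
    """
    if len(curve) < 2:
        raise ValueError("need at least two samples to spread an offset")
    offset = target_end - curve[-1]
    last = len(curve) - 1
    return [value + offset * i / last for i, value in enumerate(curve)]


def build_training_pairs(curve, targets):
    """Create (goal, adjusted motion) examples that a network could be trained on."""
    return [(target, adjust_to_target(curve, target)) for target in targets]


# Toy mocap curve: hand height over five frames of a reach (invented numbers).
reach = [0.0, 0.25, 0.5, 0.75, 1.0]

# One captured clip becomes several examples with different goals.
pairs = build_training_pairs(reach, targets=[0.8, 1.0, 1.2])
for goal, motion in pairs:
    print(goal, [round(v, 2) for v in motion])
```

A network trained on enough pairs like these only ever sees the goal as an input at runtime, which is how the offline, future-dependent process ends up with a runtime approximation.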
There are some challenges with NNs too, that the industry will need to work through…
NNs are currently slow to train. You can’t see the results of your changes until hours later. This will hopefully get faster as time goes on, but it’s currently an issue.
NNs are even more of a black box than Motion Matching. If the NN does something you don’t want it to do, it can be incredibly difficult to figure out why.
NNs rely on being fed a lot of example data. The more data, and the higher quality the data, the better. With this in mind, it’s likely only going to be appropriate for mocap, at least at first. You’ll also have “style transfer”, which will help us to produce more stylized animation, but it’ll be a long time before we’re able to generate high-quality, Pixar style animation, because there isn’t enough of that animation in the world to train the system.
Short to Mid-Term: Animation Capture - Quality and Volume
As mentioned, NNs need to be fed vast amounts of data, and you generally need this data to be consistent and high-quality. Part of the reason that Deep Learning has made such rapid advancements in the last few years is the vast amounts of data available on the Internet.
With this in mind, I see there being huge benefits to focussing on animation capture quality, and on methods for capturing large amounts of animation data very quickly. The amounts of data that we’re talking about here are so large that it would be too much for an animator to clean up manually, so ideally we’ll need to use the raw data that comes out of the capture system, or treat the data in some sort of automated way.
Improvements in synchronized body, finger, and facial motion capture will certainly help. Longer term I would expect to see far more full body 4D capture, and a focus on surfaces and muscles rather than bones and traditional skinning methods.
One area that I expect to get very good in the next few years is the ability for NNs to generate motion data from a single video source, rather than dedicated capture systems. Researcher Michael Black and his team are already working on this kind of thing, and I’m guessing that very soon, the results will start to be as good or better than optical systems.
If this happens, it’ll be an absolute game changer: Teams will be able to source their data from any video footage, so imagine the entire wealth of movies, TV, CCTV footage, people’s home videos, etc. all being sources for mocap data. Moreover, depending on the fidelity of video footage, and the quality of the NN system, you’ll likely be able to derive more than just skeletal data from this footage. You’ll eventually be able to estimate fingers, faces, muscles, subcutaneous layers, skin, etc., all of which are useful and usable.
Long Term - The Incredible and Scary Future - Semantics
Some NNs are already able to derive semantic information from photos and video footage, and this is where things really start to get crazy. These systems are able to make accurate guesses about who and what are in images, what the relationships are, and so on. These types of systems are continuing to improve at a super-fast rate.
So say for example, you build an NN that can look at video footage and not only generate the motion of the person in the footage, but also accurately guess whether the person in the footage is male or female, guess how old they are, guess their ethnicity, maybe even guess their personality traits, their level of education, how wealthy they are, what type of job they do, what their political stance is, etc. Imagine that you then associate all that information with the generated motion, and then use that as part of the motion generation training data for the NN that generates animation on-the-fly.
So now you can set character traits as high-level goals for the system. So maybe your game director can simply say: Create me a character that moves like a 50 year old, overweight man, who is shy, and is recovering from an injured ankle. The system sets those parameters as part of its goals, and so when it generates the motion, it generates it with those parameters in mind.
I’ve just been talking about animation so far, but the same advancement is happening in other game development disciplines, and so by this point, there will also be systems to generate faces, bodies, clothing, etc. and so these same parameters can be applied in those systems to generate character meshes that are also context appropriate.
So you’re now able to build any type of character just by asking.
What if you derive semantic understanding from scenes and places? For example: What if you use CCTV footage, along with the NN that analyzes people, to get a semantic understanding of city demographics? What type of people travel through what type of areas? What type of people live in what types of apartment buildings? What type of people drive vs use public transport? Which people go to Starbucks vs the artisanal local coffee chain? What type of people give money to the homeless, etc.?
Now you feed this information into an NN that generates city neighborhoods, or whole cities, or whole continents full of cities. First, it generates a set of demographics, then it uses the character and animation NNs, to populate each city in appropriate ways.
So maybe now all the game director has to say is “Make me a city like London circa 1975”, and as long as there are enough data sources for what a city like that should be like, the system will generate an appropriate city, with appropriate people, who have appropriate behaviour.
Want to get even crazier?
Maybe at this point the game director who’s asking for all of this, isn’t even a game director anymore; maybe it’s just the player, asking directly for what they want.
“I want to play a game in the style of James Bond, but set it in the 1800s.”
“I want to play a brand new Star Wars story, from the perspective of Chewbacca.”
Eventually, we tie this in to devices that track emotional responses as the player is playing, e.g. cameras that look at facial responses, or wearables like smart watches that track heart rate. Maybe you don’t even ask for a subject matter; maybe you say how you want to feel.
“I want to play an experience that makes me feel happy.”
“I want to have an experience that gives me a sense of family and belonging.”
“I want to experience a story that gives me the same sense of childish wonder as when I first read the Harry Potter books.”
Maybe in the next step you don’t even ask the system for anything. Maybe the system scans you as soon as you enter your door, understands what mood you’re in, and generates a complementary experience.
At this point you start to delve into philosophical questions about what it even is to be human and whether the human experience means anything, if you’re just having your every whim automatically appeased, so maybe I should leave things there.
So yeah, that’s my road-map for the crazy future of game animation and game development as a whole.
Oh also I guess at some point we’ll get good full body IK.