The Future of Game Animation
Recently Ninja Theory Senior Animator Chris Goodall posed a question on Twitter: What do people think the future of game animation is going to be?
This is one of my favorite topics to think about, and so I was eager to share some thoughts.
Short Term: Motion Matching
GDC 2016 was Motion Matching's big coming-out party. The core ideas had been floating around the world of academic research for years before that, but this was the first time that actual game studios were starting to show this tech in practical scenarios. Two presentations were made: one by Kristjan Zadziuk, about prototypes in development at Ubisoft Toronto, and another by Simon Clavet about his work on For Honor. The buzz at the conference was palpable, and since then there have been rumors circulating in the industry that a lot of other AAA teams are now starting to build their own Motion Matching technology.
For those who aren't familiar, Motion Matching is a method of automatically picking which piece of animation should play next on a character. The system makes its own choices, as opposed to relying on Stateflow logic, which is the current, manually crafted method of deciding which animations should play.
The Motion Matching system makes these choices based on high-level goals that you feed into the system. So one of these high-level goals might be "2 seconds from now, I want the character to be in this position, and this facing direction", which the system gets by predicting the future position of the character based on player inputs. Another common high-level goal is "match the position and velocity of the feet and the hips as closely as possible to what was already happening in the previous frame".
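To make those two goals a bit more concrete, here is a minimal sketch of a Motion Matching search in Python. It is an illustration under assumptions, not how any shipped system works: the `Frame` layout, the feature choices, the weights, and the three-frame database are all made up for the example. Each candidate frame is scored on how well it continues the current pose and how close its recorded future is to the predicted goal, and the lowest-cost frame wins.

```python
import math
from dataclasses import dataclass


@dataclass
class Frame:
    clip: str      # name of the source animation (hypothetical)
    index: int     # frame number inside that clip
    pose: tuple    # flattened foot/hip positions and velocities
    future: tuple  # (x, z, facing in radians) a short time after this frame


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)


def cost(frame, current_pose, goal, pose_weight=1.0, goal_weight=1.0):
    # Goal 1: continue smoothly from what the character is doing right now.
    pose_cost = sq_dist(frame.pose, current_pose)
    # Goal 2: end up near the predicted future position and facing.
    goal_cost = sq_dist(frame.future[:2], goal[:2]) + angle_diff(frame.future[2], goal[2]) ** 2
    return pose_weight * pose_cost + goal_weight * goal_cost


def best_match(database, current_pose, goal):
    """Brute-force search: return the frame with the lowest combined cost."""
    return min(database, key=lambda f: cost(f, current_pose, goal))


# Tiny made-up database: one representative frame per clip.
database = [
    Frame("walk_forward", 0, (0.0, 0.0, 1.0, 0.0), (0.0, 2.0, 0.0)),
    Frame("turn_left", 0, (0.1, 0.0, 0.8, 0.0), (-1.5, 1.0, math.pi / 2)),
    Frame("idle", 0, (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
]

# The player is steering left, so the predicted goal is off to the left.
choice = best_match(database, current_pose=(0.0, 0.0, 0.9, 0.0), goal=(-1.4, 1.1, math.pi / 2))
print(choice.clip)
```

A real database would hold every frame of every clip and would need a faster search than a linear scan, but the shape of the decision is the same: the weights decide how much smoothness is traded against responsiveness.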
The end result is that Motion Matching has the potential to dramatically reduce the amount of work required when creating animation systems. It also tends to produce very high quality results: since transitions from one move to the next take hip and foot position and velocity into account, you tend to get really smooth blending, which is sometimes not the case with a traditional State Machine approach.
I expect that in the next few years, we'll start to see Motion Matching used more and more in games. Of course, it doesn't have to be an all-or-nothing switch from traditional systems; you can embed a Motion Matching system into a traditional State Machine, so for a while, you'll see a kind of hybrid approach, where some moves will be using Motion Matching (e.g. locomotion), and others might use a more traditional implementation (e.g. scripted events). But I think gradually Motion Matching will replace the majority of moves that we see in games.
The initial response from some animators towards Motion Matching was concern that the ease with which you can create systems might reduce the need for animators. From what I've experienced so far, this is absolutely not the case: Motion Matching systems typically still benefit from the usual clipping down of data (or otherwise tagging data), and of course, that data is still better if it is cleaned-up animation rather than raw mocap.
The initial vision for Motion Matching was that you would be able to just throw a bunch of unstructured mocap into a Motion Matching database and the system would do everything for you, but it turns out this kind of approach doesn't produce good results. Technically it does still work, but the system often makes unwanted choices (e.g. sometimes deciding that rather than playing a run cycle, it's going to play the last two footsteps of an Idle to Start over and over and consider that a run), and so a lot of teams are finding that curating their animation data gives better results.
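One simple way to picture that kind of curation is tagging: frames carry labels, and the search is only allowed to pick from frames whose labels fit the current situation. The sketch below is a guess at the general shape, not a description of any particular studio's tooling; the `Candidate` type, the tag names, and the cost values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Candidate:
    clip: str
    frame: int
    cost: float  # hypothetical matching cost, lower is better
    tags: frozenset = field(default_factory=frozenset)


def pick(candidates, required=frozenset(), excluded=frozenset()):
    """Return the cheapest candidate that has every required tag and no excluded tag."""
    allowed = [
        c for c in candidates
        if required <= c.tags and not (excluded & c.tags)
    ]
    if not allowed:
        return None  # nothing in the database fits these constraints
    return min(allowed, key=lambda c: c.cost)


candidates = [
    # The tail end of an Idle to Start happens to score slightly better...
    Candidate("idle_to_run", 28, 0.10, frozenset({"transition"})),
    # ...than the run cycle the animator actually wants to see looping.
    Candidate("run_cycle", 4, 0.15, frozenset({"loop", "run"})),
]

unfiltered = pick(candidates)
curated = pick(candidates, excluded=frozenset({"transition"}))
print(unfiltered.clip, curated.clip)
```

Without the tag filter the search keeps re-entering the transition clip; with it, the run cycle is chosen even though its raw cost is a little higher. Deciding which tags exist and where they go is exactly the kind of judgment that stays with the animator.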
So in short, there will still be plenty for animators to do in a Motion Matching world.
Short Term: Script Based Automation
At GDC 2016, I presented a new animation tool that Zach Hall and I had developed when I was working at Ubisoft Montreal. The tool automatically processed raw motion capture data into shippable quality animation. Before building this tool, we did an analysis of how our mocap animators were working, which showed that an estimated 50-80% of the tasks they were doing were things that could be automated. So we set about automating those things.
In a way, what we did wasn't particularly revolutionary: every studio writes scripts to automate repetitive tasks, and the only difference in our case was the degree to which we were willing to do it. I'd also say that a key point was that we were really looking closely at what the animators were actually doing. Sometimes technical animators think they know the problems animators are facing, but they're actually building solutions for things that aren't necessarily the most important.
I got a very positive response to the GDC talk, though I have yet to hear of other studios trying a similar approach.
I would hope that in future, more teams start to look seriously at automation and pipeline efficiency, because it really is a huge opportunity. A single technical animator can potentially save the work of many, many animators, if they're aimed towards the right things. It's just unfortunate that, more often than not, people tend to rely on what they're familiar with, and so a manager might prefer to hire 10 more animators to brute-force the work rather than assign a technical animator to focus purely on improving efficiency.
I'm hopeful though that things will happen in this area.
Short to Mid-Term: Neural Networks - Motion Generation at Runtime
If you haven't seen Daniel Holden et al.'s paper on Phase-Function Neural Networks, drop what you're doing and watch this now. This is the future of game animation, right here.
In my view, Neural Networks and Deep Learning are going to change everything (not just about game animation, not just about game development: everything). While we may not see Neural Network based animation systems shipping in games for a while, some developers are already doing experiments using something similar to Daniel's approach.
Studios will begin to use animation data to train neural networks, and those networks will then be able to generate animation at runtime. Just like with Motion Matching, the data the network generates is based on high-level goals, which makes this a natural successor to some of the work that's being done with Motion Matching.
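For a flavour of what "phase-function" means in the PFNN paper: as I understand it, the network's weights are not fixed but are blended from a small number of control sets according to where the character is in its gait cycle, using a cyclic Catmull-Rom spline. The sketch below shows only that blending step, in plain Python. The control values are made up (a single weight per set, shaped roughly like a sine wave), and a real network would blend whole weight matrices and then run them through its layers.

```python
import math


def catmull_rom(p0, p1, p2, p3, t):
    """Cubic Catmull-Rom interpolation between p1 (t=0) and p2 (t=1)."""
    return (
        p1
        + t * (0.5 * p2 - 0.5 * p0)
        + t * t * (p0 - 2.5 * p1 + 2.0 * p2 - 0.5 * p3)
        + t ** 3 * (1.5 * p1 - 1.5 * p2 + 0.5 * p3 - 0.5 * p0)
    )


def phase_weights(control_sets, phase):
    """Blend flat weight lists cyclically; phase is in radians.

    control_sets is a list of equally sized weight lists spaced evenly
    around the cycle (four sets in this example).
    """
    n = len(control_sets)
    scaled = (phase % (2 * math.pi)) / (2 * math.pi) * n
    k = int(scaled) % n       # control set at or just before this phase
    t = scaled - int(scaled)  # how far we are towards the next control set
    a0, a1, a2, a3 = (control_sets[(k + i - 1) % n] for i in range(4))
    return [catmull_rom(w0, w1, w2, w3, t) for w0, w1, w2, w3 in zip(a0, a1, a2, a3)]


# Four made-up control sets, one weight each.
controls = [[0.0], [1.0], [0.0], [-1.0]]

at_start = phase_weights(controls, 0.0)              # exactly the first control set
at_quarter = phase_weights(controls, math.pi / 2)    # exactly the second control set
in_between = phase_weights(controls, math.pi / 4)    # smooth blend of neighbours
print(at_start, at_quarter, in_between)
```

Because the spline wraps around, the weights at the end of one stride line up with the start of the next, which is part of why the generated locomotion stays in sync with the gait.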
There are a number of benefits to Neural Network (NN) based animation systems over a Motion Matching approach...
They're cheaper memory-wise: You only store the trained network weights, and not actual animation data.
Motion Matching is picking from a pre-existing set of animation data. NNs on the other hand can generate poses that weren't in the original data, but that still make sense in the context of the original data. This allows for far more adaptive characters. So for example, if you want your character to run past a table and pick up an object from that table, the position of the object doesn't have to perfectly match what was in the training data; there just need to be enough examples of picking up objects from tables while moving, correlated with appropriate high-level goals, for the system to understand how that type of action works. Then when you're generating animation at runtime, you can set goals that never existed exactly that way in the training data (like different object positions on the table, different speeds, etc.), and it should be able to deal with that.
NNs need to be fed lots of training data, but one approach to creating this data is to do offline procedural adjustments to your mocap (the kind of adjustments that might normally be inappropriate to use at runtime), and then use the result as training data for the NN. This essentially gives you something similar to a runtime version of that offline process. So for example, Adjustment Blending is a method of adjusting animation that produces high quality results, but is most suitable for offline processing, because it relies on knowledge of what the character is going to do in the future. However, you could use Adjustment Blending to create lots of examples of adjusted data, and then use that adjusted data to train the NN. This would essentially give you similar results to Adjustment Blending, but at runtime. Another example of this type of approach is the uneven terrain example used in Daniel's PFNN paper.
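The offline-adjustment idea in the last point can be sketched in a few lines. To be clear about assumptions: the adjustment below is a deliberately simplified stand-in, not Adjustment Blending itself. It works on a single 1D position track, and `adjust_to_target` and `make_training_pairs` are hypothetical names. What it shares with the real thing is that it needs to see the whole clip, future frames included, which is why it suits offline processing.

```python
def adjust_to_target(positions, target):
    """Offline adjustment: make the last frame land on `target`.

    The correction is spread across the clip in proportion to how much
    the source motion moves on each frame, so frames where nothing moves
    stay put. This needs the entire clip up front.
    """
    offset = target - positions[-1]
    deltas = [abs(b - a) for a, b in zip(positions, positions[1:])]
    total = sum(deltas)
    if total == 0:
        # No movement to hide the correction in: shift the whole clip instead.
        return [p + offset for p in positions]
    adjusted, moved = [positions[0]], 0.0
    for p, d in zip(positions[1:], deltas):
        moved += d
        adjusted.append(p + offset * moved / total)
    return adjusted


def make_training_pairs(clip, targets):
    """Build (goal, adjusted clip) examples that a network could be trained on."""
    return [(t, adjust_to_target(clip, t)) for t in targets]


# A made-up track: planted, moves for two frames, then planted again.
clip = [0.0, 0.0, 1.0, 2.0, 2.0]

example = adjust_to_target(clip, 3.0)
pairs = make_training_pairs(clip, [1.0, 2.0, 3.0])
print(example)
print(len(pairs))
```

One captured clip becomes many training examples, each paired with the goal it was adjusted to hit. A network trained on pairs like these never sees the future at runtime, but it has seen enough adjusted results to approximate them from the goal alone.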
There are some challenges with NNs too, that the industry will need to work through...
NNs are currently slow to train. You can't see the results of your changes until hours later. This will hopefully get faster as time goes on, but it's currently an issue.
NNs are even more of a black box than Motion Matching. If the NN does something you don't want it to do, it can be incredibly difficult to figure out why.
NNs rely on being fed a lot of example data. The more data, and the higher quality the data, the better. With this in mind, it's likely only going to be appropriate for mocap, at least at first. You'll also have "style transfer", which will help us to produce more stylized animation, but it'll be a long time before we're able to generate high-quality, Pixar-style animation, because there isn't enough of that animation in the world to train the system.
Short to Mid-Term: Animation Capture - Quality and Volume
As mentioned, NNs need to be fed vast amounts of data, and you generally need this data to be consistent and high-quality. Part of the reason that Deep Learning has made such rapid advancements in the last few years is the vast amount of data available on the Internet.
With this in mind, I see there being huge benefits to focussing on animation capture quality, and on methods for capturing large amounts of animation data very quickly. The amounts of data that we're talking about here are so large that it would be too much for an animator to clean up manually, so ideally we'll need to use the raw data that comes out of the capture system, or treat the data in some sort of automated way.
Improvements in synchronized body, finger, and facial motion capture will certainly help. Longer term I would expect to see far more full body 4D capture, and a focus on surfaces and muscles rather than bones and traditional skinning methods.
One area that I expect to get very good in the next few years is the ability for NNs to generate motion data from a single video source, rather than dedicated capture systems. Researcher Michael Black and his team are already working on this kind of thing, and I'm guessing that very soon, the results will start to be as good as or better than optical systems.
If this happens, it'll be an absolute game changer: Teams will be able to source their data from any video footage, so imagine the entire wealth of movies, TV, CCTV footage, people's home videos, etc. all being sources for mocap data. Moreover, depending on the fidelity of video footage, and the quality of the NN system, you'll likely be able to derive more than just skeletal data from this footage. You'll eventually be able to estimate fingers, facial, muscle, subcutaneous layers, skin, etc.: all things that are useful, and usable.
Long Term - The Incredible and Scary Future - Semantics
Some NNs are already able to derive semantic information from photos and video footage, and this is where things really start to get crazy. These systems are able to make accurate guesses about who and what are in images, what the relationships are, and so on. These types of systems are continuing to improve at a super-fast rate.
So say for example, you build an NN that can look at video footage and not only generate the motion of the person in the footage, but also accurately guess whether the person in the footage is male or female, guess how old they are, guess their ethnicity, maybe even guess their personality traits, their level of education, how wealthy they are, what type of job they do, what their political stance is, etc. Imagine that you then associate all that information with the generated motion, and then use that as part of the motion generation training data for the NN that generates animation on-the-fly.
So now you can set character traits as high-level goals for the system. So maybe your game director can simply say: Create me a character that moves like a 50-year-old, overweight man, who is shy, and is recovering from an injured ankle. The system sets those parameters as part of its goals, and so when it generates the motion, it generates with those parameters in mind.
I've just been talking about animation so far, but the same advancement is happening in other game development disciplines, and so by this point, there will also be systems to generate faces, bodies, clothing, etc. and so these same parameters can be applied in those systems to generate character meshes that are also context appropriate.
So you're now able to build any type of character just by asking.
Let's get more crazy...
What if you derive semantic understanding from scenes and places? For example: What if you use CCTV footage, along with the NN that analyzes people, to get a semantic understanding of city demographics? What type of people travel through what type of areas? What type of people live in what types of apartment buildings? What type of people drive vs use public transport? Which people go to Starbucks vs the artisanal local coffee chain? What type of people give money to the homeless, etc.?
Now you feed this information into an NN that generates city neighborhoods, or whole cities, or whole continents full of cities. First, it generates a set of demographics, then it uses the character and animation NNs to populate each city in appropriate ways.
So maybe now all the game director has to say is "Make me a city like London circa 1975", and as long as there are enough data sources for what a city like that should be like, the system will generate an appropriate city, with appropriate people, who have appropriate behaviour.
Want to get even crazier...
Maybe at this point the game director who's asking for all of this isn't even a game director anymore; maybe it's just the player, asking directly for what they want.
"I want to play a game in the style of James Bond, but set it in the 1800s."
"I want to play a brand new Star Wars story, from the perspective of Chewbacca."
Eventually, we tie this in to devices that track emotional responses as the player is playing, e.g. cameras that look at facial responses, or wearables like smart watches that track heart rate. Maybe you don't even ask for a subject matter, maybe you say how you want to feel.
"I want to play an experience that makes me feel happy."
"I want to have an experience that gives me a sense of family and belonging."
"I want to experience a story that gives me the same sense of childish wonder as when I first read the Harry Potter books."
Maybe in the next step you don't even ask the system for anything. Maybe the system scans you as soon as you enter your door, understands what mood you're in, and generates a complementary experience.
At this point you start to delve into philosophical questions about what it even is to be human, and whether the human experience means anything if you're just having your every whim automatically appeased, so maybe I should leave things there.
So yeah, that's my road-map for the crazy future of game animation and game development as a whole.
Oh also I guess at some point we'll get good full body IK.