Case Study: AI in Wargames (DoD)
Using reinforcement learning to generate winning warfighting strategies that humans can't
Earlier this year, Derek Martin and John Nagl of the U.S. Army War College (USAWC) published Eagle against the Dragon: Wargaming at the Army War College. The article described the benefit of wargames as a form of experiential learning and the creation of two purpose-built games for the USAWC’s China Integrated Course: Pacific Posture and Pacific Overmatch. Wargaming has been a part of professional military education in the US since 1887, when McCarty Little conducted the first wargame at the US Naval War College. The long history of games in the military, starting with the Kriegsspiel in Prussia, may be familiar to some hobby gamers but it is usually news to modern gamers and game developers who are more familiar with online multiplayer games like Fortnite and Valorant.
We were in the group that didn't know about wargames, but as gaming and tech professionals, we were fascinated with the fact that the military was using games for education and analysis. We reached out to active duty and retired military service members and read wargaming articles and reports to find out how we could help. The area that caught our attention the most was the lack of capable red teams. According to the US Government Accountability Office DoD Wargaming Report, “DOD officials from multiple organizations reported difficulty in finding or developing sufficiently qualified experts who can portray the adversary in wargames. Experts who can portray an adversary are referred to as red cell players; are experts commonly drawn from the intelligence community; and are heavily tasked to participate in DOD’s many wargames.”
When we read this quote, it made us think about our work with AI opponents (bots) in video games. Creating AI bots happens to be something the video game industry does best, and we wanted to see if we could build one that could act as a red team / red cell player.
We reached out to the USAWC and asked if we could develop a reinforcement learning bot/policy for one of their China games and present a tech demo of it at the US Connections Wargaming Conference for research purposes. We’re glad that they said yes. This blog post details lessons learned from building this tech demo, the technical process of converting a wargame into a reinforcement learning (RL) problem, and next steps.
Lessons Learned
Here are our top two lessons learned:
We believe the most practical value that “AI” can deliver for wargames is through reinforcement learning. Most DoD wargame practitioners are wary of industry’s claims of adding AI to wargames. When we looked into this, we found that most companies, small or large, were simply slapping an AI sticker onto what they already had, whether it was related to AI or not. Furthermore, their “AI” usually referred to LLMs, specifically a wrapper built on top of ChatGPT. We aim to be the first company to explain in detail what we mean by “AI” in wargames: reinforcement learning.
We believe that reinforcement learning is more valuable for analytical wargames than for educational wargames. Pacific Posture is an educational wargame, and most educational wargames are designed to generate small group discussion with facilitators, not to have a model decide an outcome. That is why we had to add a victory point (VP) architecture to enable reinforcement learning training of our policy (more on this later in this post). Analytical wargames have reward structures already defined, which makes them well suited for reinforcement learning.
Pacific Posture Overview
It’s important to describe the context of why the game was made and its general mechanics before we describe the technical process of converting a wargame into an RL problem. We quote here the publicly available description of Pacific Posture from the USAWC blog:
“In 2019, the Secretary of the Army tasked the Army War College to examine competition within the U.S. Indo-Pacific Command (INDOPACOM) area of responsibility (AOR) and rethink the Army’s theater posture. The faculty partnered with Department of Strategic Wargaming members to design a wargame that could assist in identifying positional gaps that might hinder the United States in a crisis or conflict to determine whether the U.S. Army is correctly postured for the current threat.
The Pacific Posture wargame provided an opportunity for the research team to gain insights into strategies that used the instruments of national power to gain regional influence, hoping to provide a position of advantage during a crisis. Players wielded DIME (diplomacy, informational, military, and economic) actions to advance influence and ultimately gain access to regional countries, working within the context of actual recent events and a simulated (and infinitely variable) future set of international relations. Military activities were part of the teams’ actions, but there were no military engagements or events permitted that could escalate tensions past the threshold of crisis into conflict. This wargame was about competition, utilizing mechanisms reinforcing the players’ ability to apply strategic empathy and align a whole-of-government (WoG) approach to achieve their strategic goals. Although first designed as an analytic wargame, it became apparent that the game also had tremendous value as an educational tool.
The wargame was modified to span most of the INDOPACOM AOR, incorporating each country’s current relationships with China and the United States. Markers depict the strength of each relationship by indicating what level of basing rights each of the major powers would have in a crisis or conflict. Each team attempts to achieve their desired level of access, basing, and overflight (ABO) within the targeted countries by investing diplomatic, informational, or economic effort. Teams submit an investment strategy that allocates a defined amount of effort and detailed narrative. The level of detail that the teams include within their narrative partly determines the effect along with the priority that they give to that effort in relation to other activities. An adjudication cell takes these factors into account in determining the outcome, either a net gain or loss of influence for either China or the United States in that country.
Using the wargame Pacific Posture as an experiential learning event forces students to compete for operational access against a thinking adversary while using all the instruments of national power. To be successful they must grasp the fundamentals of strategic empathy, history, and geography and work to further advance the equities of their country in a way that also benefits allies and partners. The design and framework of the wargame permits players to address contemporary events and incorporate practical assumptions about international relations. The conflict-based wargame Pacific Overmatch incorporates the ABO results of Pacific Posture, holding the students accountable for the consequences of their strategy.”
Game architecture that enables RL
One does not simply add “AI” or reinforcement learning to a game, whether it’s a wargame or a commercial video game. There are multiple ways to do this; we used the following three-component architecture for this project:
The simulation library. This library, written in C++ with its functions exposed as C externs, houses the core game logic and offers the following functionality:
The ability to submit turns (both US and PRC)
The ability to adjudicate the results of the turn
The ability to get the state of the game at any point
The RL framework. In our case, RLLib was used as the framework. More detail on how Pacific Posture was converted to an RL problem is given below. The key thing about the RL framework is that it outputs a model serialized in the ONNX format.
The rendering layer. In our case, Unreal Engine was chosen. The C++ game code was imported as a library, and the Unreal Neural Network Engine (NNE) plugin was used to import the ONNX model.
These components together produce a game that includes a policy that can play the game in an approximately optimal way. The rest of this blog post is quite technical and is useful primarily for wargame practitioners or software engineers.
Converting Math to Code
All wargames have an underlying model that calculates what happens during a turn, and the first step in this project was to understand these formulas and convert them to code for use in the library. We’re only showing a few sample formulas here to illustrate how they feed into the reinforcement learning setup. If you’re in the DoD and interested in additional details, please reach out and we’ll share them with you.
Let E be the Effort, C the Corruption, and L the Corruption limit.
We define Net Effort as:
NetEffort = E - C
To normalize Net Effort (Effort minus Corruption) by the Corruption limit, we first divide Net Effort by 100:
n = NetEffort / 100
Next, we square the value and divide by the Corruption limit:
n^2 / L
Subtract this value from the original normalized Net Effort:
n - n^2 / L
Finally, we multiply the result by 100 and round it to the nearest integer:
round(100 * (n - n^2 / L))
Putting it all together in one formula, it is:
Result = round(100 * ((E - C) / 100 - ((E - C) / 100)^2 / L))
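As an illustration of the conversion to code, here is a minimal Python sketch of the formula above; the function and variable names are ours, not the library's actual interface:

def influence_result(effort: float, corruption: float, corruption_limit: float) -> int:
    """Adjudication value from the sample formula above (a sketch, not the library's code)."""
    net_effort = effort - corruption
    normalized = net_effort / 100.0                 # normalize Net Effort
    penalty = (normalized ** 2) / corruption_limit  # squared term scaled by the Corruption limit
    return round(100.0 * (normalized - penalty))    # scale back up and round to the nearest integer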
Reinforcement Learning introduction
RL is one of the core types of AI we use at Exia Labs. We’re only providing a brief overview here, but there are a lot of other fantastic resources online such as the Sutton book or the Roy class.
In RL, an agent interacts with an environment by taking actions. A policy decides which actions to take. That policy is trained by taking the state of the environment, converting it into observations, calculating the reward for the action that was previously taken, and then selecting the next actions.
In RL, there is a balance between exploitation and exploration: choosing between what the policy thinks is the best action and trying something new to explore other options, much like how a human brain might operate.
In most RL problems, one cannot optimally solve the entire problem due to the intractable state size of the environment. In those cases, function approximation is used to predict the value of each state, called the value function. In our case, and in most practical cases, a neural network is used as the approximator. The goal is to optimize the neural network by minimizing a loss function so that it approximates, as closely as it can, the value of each state (or state-action pair) in the entire environment.
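To make the loop concrete, here is a minimal Gymnasium interaction loop, using the standard CartPole environment as a stand-in and random action sampling in place of a trained policy:

import gymnasium as gym

env = gym.make("CartPole-v1")                   # stand-in environment for illustration
obs, info = env.reset(seed=0)
episode_return = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()          # a trained policy would map obs -> action here
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward                    # the reward signal is what training optimizes
print(f"Episode return: {episode_return}")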
Defining the game as an RL problem
We must next take the rules of the game and convert them into an RL problem. As mentioned before, we used RLLib as the framework, which means using Gymnasium, an open source Python library for defining RL problems. There are 2 main loops when using RLLib to train a model.
The policy evaluation loop, which is where most of the configuring and work is done. What we need to do here is:
Create a custom environment that interacts with the game simulation to collect observations and rewards
Create a custom action distribution to sample actions
The policy optimization loop, which is where you choose the algorithm to run and the various hyperparameters involved in the optimization, such as batch size and learning rate.
In both loops, there are configuration choices that need to be made when defining the problem. The important ones are described below.
Algorithm and Model choice
We did not end up needing anything more complex than the default options that RLLib provides for these. This means we used PPO as the optimization algorithm and a feed forward neural network as the model.
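For reference, a minimal sketch of that setup in RLlib, assuming a recent release with the PPOConfig builder API; the environment name is a placeholder (it would be registered via ray.tune.registry.register_env for the custom environment described in the next section) and the hyperparameter values are illustrative:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="pacific_posture")             # placeholder name for the custom env
    .framework("torch")
    .training(
        lr=1e-4,
        train_batch_size=1000,
        gamma=0.99,
        entropy_coeff=0.01,
        model={"fcnet_hiddens": [256, 256]},        # default-style feed-forward network
    )
)
algo = config.build()
result = algo.train()                               # one training iteration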
Creating a custom environment
The custom environment was essentially a Python class that used ctypes to load the C++ game code as a shared library (DLL) and call into its functions. This is why the C++ library must expose its functions as C externs; that is the only way to call into it using ctypes. This custom environment housed many of the most important elements of the policy evaluation loop (a simplified sketch follows the list below). In its step function it:
Gets the state of the game before the turn is submitted
Prepares the action and submits it to the library
Calls adjudicate on the library
Gets the state of the game after the turn is submitted
Calculates the reward function (much more on this later)
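Below is a simplified sketch of such an environment. The library path and exported function names (reset_game, get_state, submit_turn, adjudicate) are illustrative rather than the simulation library's actual interface, and the reward shown is a placeholder for the VP-based reward described later:

import ctypes
import numpy as np
import gymnasium as gym

NUM_COUNTRIES = 29  # matches the observation/action shapes discussed below

class PacificPostureEnv(gym.Env):
    """Sketch of the custom environment that wraps the C++ simulation library."""

    def __init__(self, lib_path="./libpacific_posture.so"):
        self.lib = ctypes.CDLL(lib_path)  # the C externs are callable through ctypes
        self.observation_space = gym.spaces.Box(low=-6, high=6, shape=(NUM_COUNTRIES,), dtype=float)
        self.action_space = gym.spaces.MultiDiscrete([4] * NUM_COUNTRIES)

    def _get_state(self):
        buf = (ctypes.c_double * NUM_COUNTRIES)()
        self.lib.get_state(buf)                     # hypothetical export: fills the state buffer
        return np.frombuffer(buf, dtype=np.float64).copy()

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.lib.reset_game()                       # hypothetical export: start a new game
        return self._get_state(), {}

    def step(self, action):
        before = self._get_state()                  # state of the game before the turn
        c_action = (ctypes.c_int * NUM_COUNTRIES)(*(int(a) for a in action))
        self.lib.submit_turn(c_action)              # submit the turn to the library
        self.lib.adjudicate()                       # adjudicate the results of the turn
        after = self._get_state()                   # state of the game after the turn
        reward = float(np.sum(after - before))      # placeholder; the real reward uses VP (see below)
        return after, reward, False, False, {}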
Policy structure
We saw two possible ways to structure the policy for this game: single-agent and multi-agent. Both had potential value, but in the end the multi-agent approach achieved better results. We believe the reason is that the single agent had unfair knowledge of each side's moves and thus ended up favoring a cooperative style of play, which does not hold up once knowledge of the other side's moves is hidden in real play against humans.
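A sketch of what the multi-agent setup looks like in RLlib, assuming the environment is wrapped as an RLlib MultiAgentEnv that returns per-agent observations and rewards keyed by "US" and "PRC"; the registered environment name here is an illustrative placeholder:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="pacific_posture_multi_agent")     # placeholder MultiAgentEnv name
    .framework("torch")
    .multi_agent(
        policies={"US", "PRC"},                          # one independent policy per side
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id,
    )
)
algo = config.build()                                    # trains two policies side by side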
Hyperparameters
These are configurations related to the policy optimization loop. These numbers can be different for each RL problem and tuning them is normally an important process in achieving good results. Ray has a feature called Ray Tune which allows you to spawn trials with different hyperparameter settings, and mutate those settings until you land on good hyperparameters for your RL problem. The specific parameters we tuned were:
hyperparam_mutations={
"lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
"train_batch_size": lambda: random.randint(200, 4000),
"entropy_coeff": [0.01, 0.1, 0.5],
"gamma": [0.95, 0.99, 0.999],
}
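The hyperparam_mutations dictionary above is the format used by Ray Tune's PopulationBasedTraining scheduler. Here is a minimal sketch of how it plugs into a tuning run; the trainable is RLlib's registered "PPO", while the environment, trial count, and perturbation interval are illustrative placeholders:

import random
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=10,                  # how often trials mutate their settings
    hyperparam_mutations={
        "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
        "train_batch_size": lambda: random.randint(200, 4000),
        "entropy_coeff": [0.01, 0.1, 0.5],
        "gamma": [0.95, 0.99, 0.999],
    },
)

tuner = tune.Tuner(
    "PPO",                                     # RLlib's registered PPO trainable
    tune_config=tune.TuneConfig(
        scheduler=pbt,
        num_samples=8,                         # number of parallel trials (illustrative)
        metric="episode_reward_mean",
        mode="max",
    ),
    param_space={
        "env": "CartPole-v1",                  # stand-in env; the real run uses the custom env
        "lr": 1e-4,
        "train_batch_size": 1000,
        "entropy_coeff": 0.01,
        "gamma": 0.99,
    },
)
results = tuner.fit()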
The charts for these runs are shown below in more detail, but we found a few interesting results:
The best run had, specifically: lr 1e-05, entropy 0.01, batch size 737, gamma 0.999
Large batch sizes above 1000 produced very poor results.
1e-05 lr produced the best results, while 5e-05 lr produced terrible results in general.
The gamma and entropy did not have a major impact overall on the training results.
Defining the observation and action space
The observation space for the game can be best defined by the “status” field from the rules of the game. These are things like “trade”, “mil-to-mil exercises”, etc. In the original formulas, these are defined using a number, ranging from -6 to 6, the negative numbers leaning towards PRC and the positive towards the US. Each whole number indicates one of these statuses. So for example, 1.2 would mean mil-to-mil exercises with the US while -4.4 would mean permanent basing with the PRC. To achieve this in RLLib, we define the observation space as a single number for each country that ranges from -6 to 6:
gym.spaces.Box(low=-6, high=6, shape=(NUM_COUNTRIES,), dtype=float)
This is sufficient to define the observations received from the game and capture its full state. The action space, however, is trickier. There were several possible options and iterations that we went through in the design phase, but we eventually landed on this:
gym.spaces.MultiDiscrete([4] * NUM_COUNTRIES)
This gives each country 4 discrete actions, corresponding to effort levels of 0, 5, 10, or 15.
Action Distribution
Let’s dig into that action space a bit more. While the observation space was easy to choose and posed no problems, the action space was much trickier. Recall that we chose:
gym.spaces.MultiDiscrete([4] * NUM_COUNTRIES)
For the action space. A naive implementation of this could have been:
gym.spaces.MultiDiscrete(NUM_COUNTRIES)
One number for each country, with no structure imposed. The problem with this approach is that the default distribution would produce a lot of (in fact, mostly) invalid actions, because the game requires the total effort allocated across all countries in a turn to sum to exactly 100. In the naive implementation, it would take millions of steps to even produce a single valid action. While it is possible to brute-force your way through this, there needs to be a more intelligent way. Thus, we introduce that constraint directly.
With this constraint, the number of possible combinations for each turn is reduced by several orders of magnitude, to about 6 trillion. To achieve this in the RLlib framework, we needed to introduce a custom action distribution that limits sampling to valid actions, while keeping the “distribution” property needed to run the RL algorithm effectively.
The custom action distribution is used before the “step” function is called in the custom environment. It takes the logits from our neural network and samples an action from the resulting distribution. More precisely, to create a custom action distribution with RLLib using the torch framework, you need to implement:
Init: takes the logits as input and stores them
Sample: Produces a sample action from the input that was created during init. This is normally done by converting the input logits into probabilities using a softmax and using that probability as a distribution.
Logp: Given an action, returns the log-probability of that action being chosen given the inputs. This is also normally done by converting the input logits into probabilities using a softmax and treating those probabilities as a distribution.
Entropy: The entropy of the input logits, converted to probabilities using softmax
KL: The KL divergence between this distribution and another, with the input logits converted to probabilities using softmax
That is a bit complex, but we can break it down into a simpler problem that we truly care about:
Given a set of logits from a neural network of size (batch_size, [4 * 29]), create an action of shape (batch_size, [29]) where the above constraints are met.
This was a challenging task given the complexity of the constraint. We attempted several versions, but in the end one specific algorithm worked best. The pseudocode is given below:
Initialize total_sum to 0
Repeat until total_sum reaches 100:
    Softmax the logits to get probabilities of shape (4 * num_countries)
    Choose an index by sampling from those softmax probabilities
    Zero out the 4 probabilities for the chosen country in the logits
    Convert the chosen index into a discrete action (an effort level)
    Add that effort level to total_sum
There are various edge cases that make this a less-than-clean solution. For example, when approaching the end of the sum (less than 15 remaining), we limit the options in the probabilities to only those that can still achieve the total sum of 100. This complexity adds some problems in the logp function, but the result is close enough to accurate to do the job.
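For concreteness, here is a simplified Python sketch of the masked-sampling idea, not the exact production sampler: each country is chosen at most once, only positive effort levels that still fit the remaining budget can be sampled, and any country never chosen stays at effort 0:

import numpy as np

EFFORT_LEVELS = np.array([0, 5, 10, 15])       # effort represented by each discrete choice
NUM_COUNTRIES = 29

def sample_budget_action(logits, budget=100, rng=None):
    """Sample a MultiDiscrete action whose total effort sums exactly to the budget."""
    if rng is None:
        rng = np.random.default_rng()
    scores = logits.reshape(NUM_COUNTRIES, len(EFFORT_LEVELS)).astype(np.float64)
    action = np.zeros(NUM_COUNTRIES, dtype=np.int64)
    available = np.ones(NUM_COUNTRIES, dtype=bool)
    remaining = budget
    while remaining > 0:
        # Mask out already-chosen countries and any effort level that would overshoot.
        mask = available[:, None] & (EFFORT_LEVELS > 0) & (EFFORT_LEVELS <= remaining)
        masked = np.where(mask, scores, -np.inf)
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        idx = int(rng.choice(probs.size, p=probs.ravel()))
        country, level = divmod(idx, len(EFFORT_LEVELS))
        action[country] = level                # store the discrete choice (0-3) for this country
        remaining -= int(EFFORT_LEVELS[level])
        available[country] = False             # zero out this country's options
    return action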
Reward Function Design
RL needs something to optimize against, and the reward is one of the most important things to design well to get good performance. These are the approaches we attempted for reward design:
Give reward based on the difference between a side's initial and final observation values. For example, if the US went from 2.2 to 3.2 in a country, that is an increase of 1 and therefore 1 point of reward.
The problem with this approach was that Pacific Posture is a symmetric game. That means that a positive movement on one side is a perfectly equal negative movement on the other side. This symmetry produces ineffective learning results due to lack of any useful signal.
Give reward only when a country's value moves across a whole number. This means a change of status, which is the true value signal the game cares about.
This reward structure suffered in a similar way to the above. While not as symmetric, the signal was still poor. It also encouraged micro changes to status and did not get at the heart of the game's design and goals.
Both these reward structures simply did not work due to the symmetrical nature of the observation space. This meant we needed an alternative reward system that can grade the observations in a non-symmetrical way.
Thus, we introduced the concept of Victory Points (VP).
Victory Points (VP)
Not all countries give the same value to a player in this game, and there needs to be some strategy behind why certain countries are chosen over others. Thus, we decided to introduce the concept of “victory points” to the game's design. VP is a single number that indicates how “good” we think a player is doing based on the observation space of the game. To be clear, VP does not exist in the original Pacific Posture game; it was added in our version to enable RL training.
We started by choosing an arbitrary VP design that we thought would provide a decent signal. What we noticed was that this changed the game completely and proved to be a very valuable signal. We also noticed that it was very easy to change the design of what the VP is, and changing that design led the bots to choosing different strategies. For example, you could encourage choosing countries with larger militaries or choosing countries with larger economies.
This was a critical discovery for us. VP as a proxy for reward is a critical tool to train a policy and control its strategic direction. Thus VP became the reward signal used to optimize against.
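As a toy illustration (not the VP design used in the demo), a VP function can be as simple as a weighted sum of how far each country leans toward your side; swapping the weight vector, say to favor countries with larger militaries or larger economies, changes what the trained policy values:

import numpy as np

NUM_COUNTRIES = 29
COUNTRY_WEIGHTS = np.ones(NUM_COUNTRIES)   # illustrative weights encoding each country's strategic value

def victory_points(statuses, weights=COUNTRY_WEIGHTS, side="US"):
    """Toy VP: credit a side only for countries leaning its way, scaled by weight.

    statuses: per-country values in [-6, 6]; positive leans US, negative leans PRC.
    """
    s = np.asarray(statuses, dtype=float)
    lean = np.clip(s if side == "US" else -s, 0.0, None)
    return float(np.dot(weights, lean))

# The reward over an episode is then the VP delta, as described in the next section:
# reward = victory_points(final_statuses) - victory_points(initial_statuses)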
RL Results
After several attempts, we ended up with a relatively simple approach for the reward structure: the reward was simply the delta between initial VP and final VP at the end of 4 turns.
Training metrics
The following are the results from the various trials that were run utilizing Ray Tune to tune the hyperparameters. As mentioned before, the best run amongst all trials is in the below graph, with the hyperparameters: lr 1e-05, entropy 0.01, batch size 737, gamma 0.999.
Wins against baseline
While the rewards look good, it is important to compare these policies to a baseline. Thus, all of these policies and versions were tested against a baseline “random policy,” which chooses actions at random. After 1000 simulations of the game with each policy, we recorded metrics on overall VP gain/loss.
US: 200% better than baseline
PRC: 250% better than baseline
Exact numbers:
US policy, PRC random:
US Victory Points - Mean: 1.484, Max: 6, Min: -15, Mode: 3
PRC Victory Points - Mean: 2.077, Max: 6, Min: -2, Mode: 2
US random, PRC policy:
US Victory Points - Mean: -0.574, Max: 5, Min: -7, Mode: 0
PRC Victory Points - Mean: 5.252, Max: 10, Min: -1, Mode: 6
-> % python pacific_posture_simulator.py --multi-agent --randomize-country US --model-path-us models/v1.6.0/policy_US/model.onnx --model-path-prc models/v1.6.0/policy_PRC/model.onnx
Playing policy vs policy with models US: models/v1.6.0/policy_US/model.onnx, PRC: models/v1.6.0/policy_PRC/model.onnx for 1000 runs
US Victory Points - Mean: 142.426, Max: 148, Min: 136, Mode: 143
PRC Victory Points - Mean: 98.252, Max: 103, Min: 92, Mode: 99
-> % python pacific_posture_simulator.py --multi-agent --randomize-country PRC --model-path-us models/v1.6.0/policy_US/model.onnx --model-path-prc models/v1.6.0/policy_PRC/model.onnx
Playing policy vs policy with models US: models/v1.6.0/policy_US/model.onnx, PRC: models/v1.6.0/policy_PRC/model.onnx for 1000 runs
US Victory Points - Mean: 144.484, Max: 149, Min: 128, Mode: 146
PRC Victory Points - Mean: 95.077, Max: 99, Min: 91, Mode: 95
Importing the model & library into Unreal
With the simulation library made and the model finalized, we need to tie it all together and render it. We chose Unreal to be our game engine and renderer. First, the simulation library was imported into Unreal and exposed in blueprints so that the core logic of the game would be exactly the same as what the model was trained on. Second, the model was serialized in ONNX format and imported into Unreal via the Neural Network Engine (NNE) plugin.
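For reference, a minimal sketch of the ONNX export step, using a stand-in feed-forward module with the observation and logit shapes discussed earlier; in practice the trained policy's torch model is what gets exported, not a freshly built network:

import torch
import torch.nn as nn

NUM_COUNTRIES = 29

# Stand-in policy network: observations of size NUM_COUNTRIES in, logits of size 4 * NUM_COUNTRIES out.
policy_net = nn.Sequential(
    nn.Linear(NUM_COUNTRIES, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4 * NUM_COUNTRIES),
)

dummy_obs = torch.zeros(1, NUM_COUNTRIES)       # batch of one observation as the tracing input
torch.onnx.export(
    policy_net,
    dummy_obs,
    "policy_US.onnx",                           # illustrative file name
    input_names=["obs"],
    output_names=["logits"],
    opset_version=17,
)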
Once imported, there is one last thing to do. Due to the nature of the chosen action space, the logits are still of the shape [4 * num_countries]. Essentially, we need to choose the actions from these logits one final time within Unreal itself. If you recall, this was originally done using the sample function of the custom action distribution. We do not have that code in Unreal, so we must write something new.
The optimal way to do this is to use integer programming: create a solver that chooses the combination of effort levels with the highest probability while maintaining the sum constraint. There are off-the-shelf libraries for this kind of optimization.
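As one illustration of that idea, in Python rather than the C++ that would run inside Unreal, the selection can be posed as a small integer program, sketched here with the PuLP library: pick exactly one effort level per country, maximize the total log-probability, and force the efforts to sum to the budget:

import numpy as np
import pulp

EFFORT_LEVELS = [0, 5, 10, 15]

def best_action_under_budget(log_probs, budget=100):
    """log_probs: array of shape (num_countries, 4); returns the chosen index (0-3) per country."""
    n = log_probs.shape[0]
    problem = pulp.LpProblem("action_selection", pulp.LpMaximize)
    x = [[pulp.LpVariable(f"x_{c}_{l}", cat="Binary") for l in range(4)] for c in range(n)]
    # Objective: maximize the total log-probability of the chosen effort levels.
    problem += pulp.lpSum(float(log_probs[c, l]) * x[c][l] for c in range(n) for l in range(4))
    # Exactly one effort level per country.
    for c in range(n):
        problem += pulp.lpSum(x[c]) == 1
    # Total effort across all countries must equal the budget.
    problem += pulp.lpSum(EFFORT_LEVELS[l] * x[c][l] for c in range(n) for l in range(4)) == budget
    problem.solve(pulp.PULP_CBC_CMD(msg=False))
    return [max(range(4), key=lambda l: pulp.value(x[c][l])) for c in range(n)]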
Once that is done, the model and solver can be exposed via blueprints and hooked into the rest of the game logic to finalize the use of this model as the AI of the game.
Next steps
We learned a lot from adding a reinforcement learning policy to Pacific Posture, an educational wargame. We will now turn our attention to adding reinforcement learning to a complex analytical wargame. If you are a DoD partner interested in learning more about how we can add reinforcement learning to your wargames, please visit our booth at the upcoming TIDE conference or reach out to us at contact@exialabs.com.