Rollout Heuristics: POMCP for Stochastic Contingent Planning

22 Apr 2024

This paper is available on arxiv under CC 4.0 license.


(1) Oded Blumenthal, Software and Information Systems Engineering, Ben Gurion University, Israel;

(2) Guy Shani, Software and Information Systems Engineering, Ben Gurion University, Israel.

4 POMCP for Stochastic Contingent Planning

We focus here on goal-oriented POMDP domains specified as stochastic contingent planning problems. We now define this concept formally.

This definition does not support noisy observations; however, this is not truly a limitation, because one can compile a noisy observation into a deterministic observation over an artificial fact whose value changes stochastically. Consider, for example, a sensor that noisily detects whether there is a wall in front of a robot. Instead of noisily observing whether there is a wall, we can deterministically detect a green light that is lit when the sensor (stochastically) detects a wall. That is, we can observe the green light without noise, but the green light is only noisily correlated with the existence of a wall.
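The compilation above can be illustrated with a small sketch. This is not the paper's code; the function names, the state representation, and the sensor accuracy parameter are all illustrative assumptions.

```python
import random

def noisy_wall_sensor(wall, accuracy=0.9):
    """Original model: the wall itself is observed noisily."""
    return wall if random.random() < accuracy else not wall

def step_green_light(state, accuracy=0.9):
    """Compiled model: the artificial fact green_light changes
    stochastically during the sensing action, correlated with
    the wall with the sensor's accuracy."""
    state["green_light"] = (state["wall"] if random.random() < accuracy
                            else not state["wall"])
    return state

def observe_green_light(state):
    """The green light itself is then observed deterministically."""
    return state["green_light"]
```

The noise has moved from the observation function into the (stochastic) transition of the artificial fact, so the observation model stays deterministic.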

Algorithm 1 describes the POMCP implementation for stochastic contingent planning problems. When the agent needs to act, it calls Search. Search repeatedly samples a state (lines 3-4) and simulates forward execution, assuming this state is the true underlying system state.
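A minimal sketch of this outer loop, assuming the root belief is represented by a list of particle states and `simulate` is the recursive procedure described below (names and signatures are assumptions, not the paper's API):

```python
import random

def search(root_particles, simulate, n_simulations=1000):
    """Sketch of Search: repeatedly sample a state from the root
    belief particles (lines 3-4) and run a forward simulation,
    treating the sampled state as the true underlying state."""
    for _ in range(n_simulations):
        s = random.choice(root_particles)  # sample a particle
        simulate(s, depth=0)               # simulate forward execution
```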

The Simulate procedure is recursive. We first check whether the current tree node is a goal belief. This is done by regressing the negation of the goal formula ¬G through the history of the current node. For goal beliefs, the value is 0, and we can stop.

Our implementation of POMCP also stops deepening the tree after a predefined threshold MaxTreeDepth. If that threshold is reached (line 12), we run a Rollout to compute an estimate of the cost of reaching the goal from this node (line 13).

In line 15 we check whether this node has already been expanded, and if not, we compute its children. We do so only for applicable actions whose preconditions are satisfied in the current belief (line 17). Again, this is computed using regression over the history.
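The expansion step can be sketched as follows, assuming a dictionary-based tree node and a caller-supplied predicate that checks, via regression over the history, whether an action's preconditions hold in the current belief (all names here are illustrative):

```python
def expand(node, all_actions, preconditions_hold):
    """Sketch of node expansion (lines 15-17): create children only
    for applicable actions, i.e. those whose preconditions are
    satisfied in the belief represented by the node's history."""
    if not node["children"]:
        for a in all_actions:
            if preconditions_hold(node["history"], a):
                node["children"][a] = {"children": {},
                                       "history": node["history"] + [a]}
    return list(node["children"])
```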

We now select an action a using the UCT exploration-exploitation criterion (line 21), and sample a next state and an observation (lines 22-27). We call Simulate recursively in line 28.
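Since this is a cost-minimizing setting, UCT picks the action with the lowest cost estimate minus an exploration bonus, trying unvisited actions first. A sketch, where `actions` maps each action to its visit count and mean cost estimate, and the exploration constant is an assumed value:

```python
import math

def uct_select(node_visits, actions):
    """UCT exploration-exploitation criterion (line 21), adapted to
    cost minimization: subtract the exploration bonus and take the
    minimum. `actions` maps action -> (visit_count, mean_cost)."""
    c = 1.4  # exploration constant (an assumption, tuned in practice)
    def score(a):
        n, q = actions[a]
        if n == 0:
            return float("-inf")  # unvisited actions are tried first
        return q - c * math.sqrt(math.log(node_visits) / n)
    return min(actions, key=score)
```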

Lines 29-32 update the value of the current node. Lines 29-30 update the counters for the executed action and received observation. Line 31 computes the value for the action as a weighted average over all observations. Line 32 computes the value of the node as the minimal cost among all actions. Our value update, which we empirically found to be more useful, differs from the original POMCP, which uses incremental updates, and is more similar to the value update in DESPOT [18].
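The non-incremental update of lines 31-32 can be sketched as below. The tree representation is an assumption: each action maps to its immediate cost and, per observation, a visit count and child value.

```python
def update_node_value(tree):
    """Sketch of the value update: each action's value is its cost
    plus the visit-weighted average of child values over observations
    (line 31); the node's value is the minimum over actions (line 32)."""
    for a, (cost, obs_children) in tree["actions"].items():
        total = sum(n for n, _ in obs_children.values())
        avg = sum(n * v for n, v in obs_children.values()) / total
        tree["q"][a] = cost + avg
    tree["value"] = min(tree["q"].values())
    return tree["value"]
```

Recomputing the full weighted average from the children, rather than updating the mean incrementally as in the original POMCP, is the DESPOT-style choice the text refers to.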

The Rollout procedure receives as input the current simulated state s, as well as a set B of states (particles) in the node from which the rollout begins. B is used by some of our rollout heuristics, as we explain below. The rollout executes actions following the heuristic rollout policy until the goal is reached or a maximal number of steps has been executed.
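This loop can be sketched as follows, with the rollout policy, transition sampler, and goal test supplied by the caller; the unit action cost and all names are assumptions for illustration (the particle set B is omitted since its use depends on the specific heuristic):

```python
def rollout(s, policy, step, is_goal, max_steps=50):
    """Sketch of Rollout: follow the heuristic rollout policy from
    state s, accumulating cost, until the goal or the step bound."""
    cost = 0
    for _ in range(max_steps):
        if is_goal(s):
            return cost
        a = policy(s)   # heuristic rollout policy picks an action
        s = step(s, a)  # sample a next state
        cost += 1       # unit action cost (an assumption)
    return cost         # step bound reached; truncated estimate
```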