Scott Worley's Blog

Conservation of Utility in Markov Decision Processes

Use utility-difference as reward.

by Scott Worley, 2018-04-19

The standard Markov Decision Process formalism is loose. It allows one to specify funhouse worlds of no economic or practical interest. Analogously, a robot or learning algorithm built to work well in all possible physics, including physics without conservation of energy, is necessarily less effective in our physics. I think it would be interesting to narrow our attention to MDPs with conservation of utility.

Instead of using any `(state,action,state')→ℝ` mapping as the reward function, these MDPs use a utility function that assigns a real value to every state, and take the difference of utilities as the reward:

`R_a(s,s') = U(s') - U(s)`
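
As a minimal sketch in Python -- the toy grid-world and the utility function `U` here are illustrative assumptions, not from the post -- the reward is just the change in utility between states:

```python
# Sketch of a conservation-of-utility reward (hypothetical example).

def U(state):
    """Hypothetical utility: negative Manhattan distance to a fixed goal cell."""
    goal = (3, 3)
    x, y = state
    return -(abs(x - goal[0]) + abs(y - goal[1]))

def reward(state, action, next_state):
    """R_a(s, s') = U(s') - U(s): the reward is the change in utility."""
    return U(next_state) - U(state)
```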

This makes rewards antisymmetric -- undoing an action undoes the reward -- and path independent. It becomes much harder to accidentally provide "have the boat go in circles forever" incentives, since any closed loop of states earns exactly zero net reward. Without this, imagine the sand constantly slipping around under your feet; with it, the agent has solid ground on which to walk and reason.
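
To see the path independence concretely, here is a continuation of the sketch above (still an illustrative assumption): per-step rewards telescope, so any two trajectories between the same endpoints earn the same total, and any loop earns nothing.

```python
# Continuing the sketch above: total reward telescopes to U(end) - U(start),
# so the route taken doesn't matter and closed loops earn zero.

def total_reward(trajectory):
    """Sum of per-step rewards along a list of states."""
    return sum(reward(s, None, s2) for s, s2 in zip(trajectory, trajectory[1:]))

straight  = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
wandering = [(0, 0), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (3, 3)]
loop      = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

assert total_reward(straight) == total_reward(wandering) == U((3, 3)) - U((0, 0))
assert total_reward(loop) == 0
```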

Homework: Prove that the optimal policy of an MDP is VNM rational if and only if the reward function conserves utility.