Invariance of the Vanilla Policy Gradient under a state-dependent baseline
My handwritten proof of why the Vanilla Policy Gradient is unchanged when a state-dependent baseline of our choice is subtracted from the return. OpenAI's Spinning Up in Deep RL was used as a starting point.
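
For orientation, below is a minimal sketch (not the handwritten proof itself) of the central identity the argument rests on, the expected grad-log-prob (EGLP) lemma from Spinning Up: the baseline term vanishes in expectation because the action distribution integrates to one. The notation here (policy \(\pi_\theta\), baseline \(b(s_t)\), action space \(\mathcal{A}\), finite-horizon return \(R(\tau)\)) is assumed for illustration.

```latex
% Minimal sketch, assuming the usual notation: pi_theta is the policy,
% b(s_t) a state-dependent baseline, A the action space, R(tau) the return.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
The baseline's contribution to the policy gradient vanishes in expectation
over actions, for every state $s_t$:
\begin{align*}
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \big]
  &= b(s_t) \int_{\mathcal{A}} \pi_\theta(a \mid s_t)\,
     \nabla_\theta \log \pi_\theta(a \mid s_t)\, \mathrm{d}a \\
  &= b(s_t) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s_t)\, \mathrm{d}a
   = b(s_t)\, \nabla_\theta \!\int_{\mathcal{A}} \pi_\theta(a \mid s_t)\, \mathrm{d}a
   = b(s_t)\, \nabla_\theta 1
   = 0 .
\end{align*}
Consequently, subtracting the baseline leaves the gradient unchanged:
\begin{align*}
\nabla_\theta J(\pi_\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \big( R(\tau) - b(s_t) \big) \Big] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big].
\end{align*}
\end{document}
```

The handwritten notes fill in the steps this sketch glosses over, in particular interchanging the gradient with the integral and taking the expectation over trajectories state by state.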