- 1 Reinforcement Learning in Central Banking
- 1.1 Modeling Framework
- 1.2 Comparison of Q-Learning Optimal Interest and the Taylor Rule
- 1.3 Implementation of Q-Learner to Find Optimal Policy $\zeta$
- 1.4 Code Implementation of Iterated Q-Learning to Find Optimum Interest Rate
- 1.5 Can the Q-Function Run the US Economy?
- 1.6 Possible Extensions, Final Thoughts

Reinforcement learning (RL) is a sub-discipline of machine learning concerned with solving problems of optimal control. Typical use cases are self-driving cars, robot vacuums, and game-playing AI, but the situation faced by a central bank also falls very neatly into the RL paradigm.

RL considers an agent, usually a robot or an AI, who interacts with an external environment, comprised of states $s_t \in \mathcal{S}$. The agent does two things: takes actions $a_t \in \mathcal{A}$ and receives rewards $R: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$. Rewards are thus a map of state-action pairs, converting these pairs to real numbers, thereby allowing an agent to evaluate his performance.

Feedback is introduced into the model by having agent actions determine transition probabilities between states. We consider 1st order Markov dynamics to match what we have seen in class, where the probability of any given series of outcomes ${y_t}_0^T$ is given by $p(y_0, y_1, ... , y_T) = \Pi_{t=1}^T p(y_{t+1} | y_{t}, a_t)$

The goal of the agent is to maximize expected rewards over a planning horizon. To ensure convergence, we add a discount factor $\beta < 1$, so total expected rewards over the planning horizon is expressed:

$$\mathbb{E}\left [ \sum_{t=0}^\infty \beta^t R(s_t, a_t) \; | \; t \right ]$$In the case of the central banker, actions $a_t$ are interest rates $i_t$, and rewards are the negative loss function (allowing for maximization rather than minimization). For a loss function, we can use a slightly modified form of the function shown in class:

$$ \mathcal{L} = (\mathbb{E}_t\,[\pi_{t+1}^2] + \alpha i_t^2)$$$\pi_t$ has been replace with the expected next step inflation, because we assume the usual functional form for expected inflation: $\mathbb{E}_t\,[\pi_{t+1}] = A\pi_t + Bi_t$. Note this has been converted from a "backwards looking" AR(1) model to a "forward-looking" one, by use of the expectations operator.

Loss is the squared values of the target and the instrument, with the interest rate scaled by a "cost parameter" $\alpha > 1$, to represent a bank's risk aversion to large changes in $i_t$. To change the problem from one of loss minimization to reward maximization, we define rewards as the negative of the one-step loss function. Plugging the value of $\mathbb{E}_t\,[\pi_{t+1}]$ and expanding, we obtain:

$$ R_{t}(\pi_{t}, i_{t}) = -(A^2\pi_{t}^2 + 2AB\pi_{t} i_{t} + B^2 i_{t}^2 + \alpha i_{t}^2)$$The banker will chose values of $i_t$ in order to maximize rewards over the entire planning horizon. To do this, he requires a policy function $\zeta: \mathcal{S} \mapsto \mathcal{A}$. The policy function $\zeta$ gives, for any observed value of $\pi$, a value of $i$. In class, we have imposed a linear functional form for the policy function: $i_t = \zeta(\pi_t) = F\pi_t$. Here, suppose that the functional form of $\zeta$ is unknown. The agent can follow any policy. Whatever policy he follows, we can describe the total *expected* trajectory of rewards obtained from that policy as a **value function** $V^\zeta_0(\pi_0, i_0 | i_0 = \zeta(\pi_0))$.

The optimization is solved using a Bellman equation:

$$V^\zeta_t(\pi_t, i_t) = R(\pi_t, i_t, | \zeta) + \beta \mathbb{E}_t\,[V^\zeta_{t+1} (\pi_{t+1}, i_{t+1} | \zeta)]$$The goal is to find the *optimal* $\zeta^\star$, such that:

The optimal policy could then produce an optimal value function, $V^\star_t(\pi_t)$, for which all actions are determined ahead of time. In an RL setting, however, we would like to be able to vary the action, in order to have the agent "learn" the optimal action by trial and error in the data, rather than solving for it analytically (as we could from this point). To do this, we define an **Action-Value Function**, the celebrated "Q-function" of Q-Learning:

In this equation, we propose to take whatever action we like now, and then promise to follow the policy $\zeta$ later. When the policy we follow later is the optimal policy $\zeta^\star$, we get an equation for the optimal Q-function:

$$Q_t^\star(\pi, i) = \mathbb{E}_t\,[R_t(\pi_t, i_t)] + \beta \mathbb{E}_t\,[V^\star_{t+1}(\pi_{t+1}, | \pi)]$$And note that when we use our "free action" in the first term of the Q-function to take the optimal action, we return to the optimal value function, so:

$$V_t^\star = \max_i Q_t^\star(\pi, i)$$By substitution we obtain the recursive Bellman optimality equation for the Q-function:

$$Q_t^\star(\pi, i) = \mathbb{E}_t [ R_t(\pi, i)] + \beta \max_{i_{t+1} \in \mathcal{A}} \mathbb{E}_t [Q_{t+1}^\star (\pi_{t+1}, i_{t+1}) | \pi, i]$$Explicitly substituting for rewards:

$$Q_t^\star(\pi, i) = \mathbb{E}_t \left[\beta Q_{t+1}^\star - A^2\pi_{t}^2 - 2AB\pi_{t} i_{t} - B^2 i_{t}^2 - \alpha i_{t}^2 \right]$$The Q-function is quadratic in "action" $i_t$, so we can solve for $i_t^\star$ by setting $\frac{\partial Q_t^\star(\pi, i)}{\partial i_t} = 0$ and solving for $i_t$. Doing so yields:

$$i_t^\star = -\frac{AB\pi_t}{B^2 + \alpha}$$A "sanity check" of the above RL model is to compare the solution found above and the Taylor Rule, written:

$$i_t = F\pi$$In fact, the Q-learning model has found a Taylor rule, with $F = -\frac{AB}{B^2 + \alpha}$. For $F > 0$, we require $B <0$ and $A > 0$, which are the usual numeric requirements of the SISO model with Taylor rule seen in class. The formula also displays the expected first derivatives. It is decreasing in both B and $\alpha$, meaning that if the transmission channel becomes stronger, lower values of the instrument are required to maximize total rewards. In addition, as the cost of using the instrument goes up, the instrument is used less. On the other hand, as the autocorrelation coefficient of inflation, A, goes up, the instrument will need to be used more aggressively to maximize total rewards.

To implement the model numerically, we will place an agent into an artifical economy goverened by parameters A, B, and $\alpha$, which will not be known to the agent. The goal is to learn an optimal policy function $\zeta^\star(\pi) = i_t^\star, \;\; \forall t\in T$. We will do this using iterated Q-Learning, a method of updated approximation based on the Robbins-Monro algorithm for approximation of an average from samples. Given N observations, we update a running approximation of the mean, $\hat{x}$ one observation at a time, following the following formula:

$$\hat{x}_{k+1} = (1 - \eta_k)\hat{x}_k + \eta_k x_k$$Where $\hat{x}_k$ is an approximation of the mean after seeing the k-th element of data, and $\eta_k \in (0, 1)$ is a "learning rate", which decreases as a function of k. Robbins and Monro demonstrate that, under certain conditions, the estimate $\hat{x}_k$ converges to the true mean with probability 1. We can apply this same principle to numeric updating of the Q-function:

$$Q_{t,k+1}^\star = (1 - \eta_k) Q^\star_{t,k} (\pi, i) + \eta_k \left [R_t(\pi_t, i_t) + \beta \max_{i \in \mathcal{A}} Q_{t+1, k}^\star(\pi_{t+1}, i) \right]$$This adds a new dimension to the data, k, which will be used to average across iterations of the economy. The experiment will proceed as follows:

- The agent enters the environment knowing nothing about it, takes purely random actions for 500 time periods, and records his rewards.
- After 500 actions, the economy is reset to t=0, and the agent takes 500 more actions, this time using all available data from previous economies to estimate the Q-function
- The process is repeated until convergence is detected as $Q_{t+1} - Q_t < \text{tolerance}$

In [67]:

```
import gym
from gym import spaces
from gym.utils import seeding
# Create an environment for the agent to act in. The class Artificial_Economy defines the parameters of action (range of
# admissible values for i and π), stores the hidden values of parameters A and B, and computes rewards and next
# step inflation based on the agent's action.
#Two modes are provided. In this first, the action space (raising and lowering the interest rate) are discretized.
#The banker has 7 choices. First, he can do nothing. The next 6 choices are to do a [large, medium, or small]
#[increase or decrease] of the interest rate. This allows for implementation of a Q-Learning solution
#to the problem. A second mode allows the banker to choose any continuous value for the interst rate,
#in which case a more sophisticated solution is required.
class Artificial_Economy(gym.Env):
def __init__(self, A, B, α, mode='discrete'):
self.A = A
self.B = B
self.α = α
self.π_0 = np.random.uniform(-1, 10)
self.i_0 = np.random.uniform(-2, 20)
self._max_episode_steps = None
self._current_step = 0
self.mode = mode
self.goal_π = 0 #The goal paramter defines when the simulation is "over". Here, the inflation target is 0.
self.min_π = -1.
self.max_π = 10.
self.min_i = -2.
self.max_i = 20.
self.low = np.array([self.min_π, self.min_i], dtype=np.float64)
self.high = np.array([self.max_π, self.max_i], dtype=np.float64)
if self.mode == 'discrete':
self.action_space = [-5, -3, -1, 0, 1, 3, 5]
self.force = 0.01
elif self.mode == 'continuous':
self.action_space = spaces.Box(low=self.min_i, high=self.max_i, shape=(1, ), dtype=np.float32)
self.observation_space = spaces.Box(low=self.min_π, high=self.max_π, shape=(1,), dtype=np.float64)
self.seed()
self.reset()
def seed(self, seed=None):
self.np_random, seed = seeding.np_random(seed)
return [seed]
def step(self, action):
self.π, self.i = self.state
if self.mode == 'discrete':
self.i += (action-1) * self.force
self.i = min(self.i, self.max_i)
self.i = max(self.i, self.min_i)
else:
self.i = action
self.π = self.A * self.π + self.B * self.i
if self._max_episode_steps is not None:
if self._current_step <= self._max_episode_steps:
done = bool(np.abs(self.π - self.goal_π) < 0.1)
self._current_step += 1
else:
done = True
else:
done = bool(np.abs(self.π - self.goal_π) < 0.1)
reward = -(self.A**2 * self.π**2 + 2 * self.A * self.B * self.π * self.i + self.B**2 * self.i ** 2 + self.α * self.i **2)
self.state = (self.π, self.i)
return self.state, reward, done, {}
def reset(self):
self.state = np.array([self.np_random.uniform(low=self.min_π, high=self.max_π),
self.np_random.uniform(low=self.min_i, high=self.max_i)])
self._current_step = 0
return self.state
def get_inflation(self):
if isinstance(self.π, np.ndarray):
inf = self.π.squeeze().item()
else:
inf = self.π
return inf
def get_interest(self):
if isinstance(self.i, np.ndarray):
intr = self.i.squeeze().item()
else:
intr = self.i
return intr
```

In [68]:

```
from IPython.display import clear_output
import pickle
n_states = 50
π_space = np.linspace(-1., 10, n_states)
i_space = np.linspace(-2., 20, n_states)
def get_state(observation):
π, i = observation
π_bin = np.digitize(π, π_space)
i_bin = np.digitize(i, i_space)
return (π_bin, i_bin)
def max_action(Q, state, actions=[-5, -3, -1, 0, 1, 3, 5]):
values = np.array([Q[state, a] for a in actions])
action = np.argmax(values)
return actions[action]
#Construct state space
states = []
for inflation in range(n_states+1):
for interest in range(n_states+1):
states.append((inflation, interest))
Q = {}
for state in states:
for action in [-5, -3, -1, 0, 1, 3, 5]:
Q[state, action] = 0
# Instead of initializing an empty Q-Function, a pre-trained model could be passed in to do additional training.
# with open('C:/Users/Jesse/Desktop/q_model_trained.p', 'rb') as file:
# Q = pickle.load(file)
def run_model(states, Q, env, n_games, beta, eta, eps, save_results=True):
score = 0
total_rewards = np.zeros(n_games)
interests = []
inflations = []
failed_episodes = 0
successful_episodes = 0
previous_inflation = 0
for i in range(n_games):
done = False
obs = env.reset()
state = get_state(obs)
score = 0
this_interest = []
this_inflation = []
while not done:
action = np.random.choice([-5, -3, -1, 0, 1, 3, 5]) if np.random.random() < eps\
else max_action(Q, state)
obs_, reward, done, info = env.step(action)
state_ = get_state(obs_)
score += reward
action_ = max_action(Q, state_)
Q[state, action] = (1 - eta) * Q[state, action] + \
eta*(reward + beta*Q[state_, action_])
state = state_
this_interest.append(env.get_interest())
this_inflation.append(env.get_inflation())
if step % 50000 == 0:
clear_output(wait=True)
print(f'Episode: {i}, Score: {score:0.0f}, Epsilon: {eps:0.2f}')
print(f'Step: {step}, Current inflation:{env.get_inflation():0.2f}, Current interest:{env.get_interest():0.2f}')
print(f'Successful Episodes: {successful_episodes}, Failed Episodes: {failed_episodes}, Previous Inflation: {previous_inflation}')
if done:
previous_inflation = env.get_inflation()
if step >= env._max_episode_steps:
failed_episodes += 1
else:
successful_episodes += 1
total_rewards[i] = score
interests.append(this_interest)
inflations.append(this_inflation)
eps = eps - 2/n_games if eps > 0.01 else 0.01
if save_results:
with open('C:/Users/Jesse/Desktop/q_model_trained.p', 'wb') as file:
pickle.dump(Q, file)
with open('C:/Users/Jesse/Desktop/q_model_trained_interests.p', 'wb') as file:
pickle.dump(interests, file)
with open('C:/Users/Jesse/Desktop/q_model_trained_inflations.p', 'wb') as file:
pickle.dump(inflations, file)
with open('C:/Users/Jesse/Desktop/q_model_trained_rewards.p', 'wb') as file:
pickle.dump(total_rewards, file)
return Q, total_rewards, interests, inflations
```

In [69]:

```
env = Artificial_Economy(1.1, -0.1, 1, mode='discrete')
env._max_episode_steps = 1000
n_games = 50000
eta = 0.1
beta = 0.99
eps = 1.0 #Explore-exploit parameter (Explore eps% of the time)
#Here is the code to actually run the Q-learner if you would like. Otherwise, the pickle files with results from my
#training of the model are loaded to create the graphs below.
# Q ,total_rewards , interests , inflations = run_model(states, Q, env, n_games, beta, eta, eps, save_results=True)
```

In [70]:

```
q_file = 'C:/Users/Jesse/Desktop/q_model_trained.p'
interest_file = 'C:/Users/Jesse/Desktop/q_model_trained_interests.p'
inflation_file = 'C:/Users/Jesse/Desktop/q_model_trained_inflations.p'
rewards_file = 'C:/Users/Jesse/Desktop/q_model_trained_rewards.p'
with open(q_file, 'rb') as file:
Q = pickle.load(file)
with open(interest_file, 'rb') as file:
interests = pickle.load(file)
with open(inflation_file, 'rb') as file:
inflations = pickle.load(file)
with open(rewards_file, 'rb') as file:
total_rewards = pickle.load(file)
```

50,000 economies were run, and the resulting 50-economy rolling average loss is plotted below. The plot shows a smooth logarithmic increase, indicating that learning took place, and the agent because better at managing the economy over time. The after total loss remained extremely volatile, however, and quite low, peaking at around -50,000. This is a because certain economies were unsolvable within the parameters of the discretized game. The inflation rate was capped at -1 and +10, representing hyper-inflation and deflation. At this points, the agent could not bring the system under control quickly enough to find the global maximum in the loss function, and instead learned to keep interest near 0 to find a local maximum in the loss function. This is clearly a limitation of the Q-learning approach, where actions and states must be discretized. Sometimes larger, more extreme measure are needed, but they are not available to the agent.

In [71]:

```
%matplotlib inline
fig, ax = plt.subplots()
mean_rewards = np.zeros(n_games)
for r in range(n_games):
mean_rewards[r] = np.mean(total_rewards[max(0, r-50): (r+1)])
ax.plot(np.arange(n_games), mean_rewards)
```

Out[71]:

In [72]:

```
q_df = pd.Series(Q)
q_df = q_df.to_frame().reset_index()
q_df.columns = ['State', 'Action', 'Q']
Q_plot = q_df.groupby('State')['Q'].max().to_frame()
Q_plot = Q_plot.reset_index()
Q_plot['π'] = Q_plot['State'].apply(lambda x: x[0])
Q_plot['i'] = Q_plot['State'].apply(lambda x: x[1])
Q_plot = Q_plot.drop(columns='State')
```

To get a sense of how the learner was doing, the negative value of Q (the expected total rewards for a given state-action pair) is plotted below. Only the Q-value associated with the optimal action is plotted (recall that for each state pair ($\pi$, i), 7 possible actions exist). The function is clearly converging to the saddle-path optimal solution described analytically by the standard optimal control approach. This gives confidence that, if we were to further discretized the state space, provide more actions to the agent, and run the simulation an arbitrarily large number of times, it would converge to the dynamic programming solution, without having any knowledge of parameters A and B.

The area of the saddle solution is a large plateau because much of this state-action space is unexplored. There are still many 0 values in the Q-function from initialization. To have a fully estimated function, many many more simulations are required.

In [73]:

```
from mpl_toolkits.mplot3d import Axes3D
fig, ax = plt.subplots(figsize=(15,8))
ax = Axes3D(fig)
ax.plot_trisurf(Q_plot.π, Q_plot.i, -Q_plot.Q, cmap=cm.viridis)
ax.set_xlabel('π', fontweight='bold', fontsize=16)
ax.set_ylabel('i', fontweight='bold', fontsize=16)
ax.set_zlabel('-Q', fontweight='bold', fontsize=16)
ax.view_init(30, 240)
ax.set_title('Value of Q-Function for All Interest, Inflation Pairs; Best Action Only', fontsize=21, y=1.05)
plt.show()
```

How the model is learning to behave can be seen by sampling interest and inflation paths from different simulated economies. In the early iterations, the interest rate (blue line) is jittery, as the model is taking many random actions to explore the state-action space and find optimal actions. As a result, the system is more likely to crash at 10 (hyperinflation) or -1 (deflationary spiral). We see the hyperinflation-type failure in economy 29477, and the deflationary spiral in economies 11 and 584.

In the later simulations, the trajectory of the interest rate is smooth and moves quickly in the correct direction to get the system into balance, then makes small adjustments to achieve near-zero inflation. In addition, the agent learns to act more decisively to get out of -1 and 10 traps. Comparing economies 11 and 15966, the slope of the interest rate decrease has become much much steep, as the agent now knows to rapidly decrease interest to get out of a deflationary situation.

Still, there are limitations. To bring economy 29477 to 0 inflation, the agent needed to rapidly increase the interest rate until it was significantly above the rate of inflation (as in economy 22200). Instead, he got stuck in a local minimum, and decreased the interest rate to try to minimize total losses rather than searching for the global solution.

In [74]:

```
%matplotlib inline
fig, ax = plt.subplots(2, 4, figsize=(24, 8), dpi=100)
to_plot = [11, 584, 43880, 7477, 15966, 22200, 29477, 35554, len(interests)]
for axis, index in zip(fig.axes, to_plot):
axis.plot(np.arange(len(interests[index-1])), interests[index-1], color='tab:blue', label='Interest')
axis.plot(np.arange(len(inflations[index-1])), inflations[index-1], color='tab:red', label='Inflation')
axis.set_title(index)
jesse_tools.remove_chart_ink(axis)
axis.grid(ls='--', lw=0.5)
lines = []
labels = []
for ax in fig.axes:
axLine, axLabel = ax.get_legend_handles_labels()
lines.extend(axLine)
labels.extend(axLabel)
fig.legend(lines[:2], labels[:2],
loc = 1, ncol=6, bbox_to_anchor=(.57,1.05), bbox_transform=fig.transFigure, fontsize=12)
fig.suptitle('Interest and Inflation from 8 Sample Q-Learning Simulations', fontsize=21, y=1.1)
fig.tight_layout()
plt.show()
```

As a final exercise, the fitted Q-Function is used to determine an interest rate by looking at US historical inflation rates. The exercise quite crude, because the inflation rate is taken as deterministic, in that the interest rate chosen by the Q-function at time t does not change inflation at time t+1. Nevertheless, comparing the real historical outcome with what the model would have done, faced with the same situation at every step, is perhaps interesting.

The first thing to note is that all interest rates chosen by the Q-function fall between 2.4 and 3.0. As noted above when discussing the model's failures in extreme situations, it was not given the chance to learn how to move the interest rate by more than 0.1 in a single time step.

Nevertheless, the relationship between the Q-determined interest rate and the historical interest rate (in faint orange) is quite close, especially after the mid-70s, at least in direction of change, if not magnitude. After the twin-peaks in the 70s, rates are repeatedly cut through the 80s and into the early 90s, after which the interest rate seems to settle into an equilibrium around 2.4%.

In [85]:

```
i_history = []
for count, π in enumerate(df['GDPDEF'].values):
if count == 0: i = df['Federal Funds Rate'].values[0]
state = get_state((π, i))
i += max_action(Q, state) * 0.01
i_history.append(i)
```

In [105]:

```
fig, ax = plt.subplots(figsize=(12,4), dpi=100)
ax2 = ax.twinx()
ax3 = ax.twinx()
l1, = ax.plot(df.index, i_history, color='tab:Blue', label='Interest Rate', lw=2)
ax.grid(ls='--', lw=0.5)
l2, = ax2.plot(df.index, df['GDPDEF'], color='tab:Red', label='Inflation Rate')
l3, = ax3.plot(df.index, df['Federal Funds Rate'], color='tab:Orange', alpha=0.2)
for axis in [ax, ax2, ax3]:
jesse_tools.remove_chart_ink(axis)
ax3.set_yticks([])
ax.set_ylabel('Interest Rate (Q-Determined)')
ax2.set_ylabel('% Change in Inflation (Historic)')
ax.set_title('Q-Function Interest Rate, US Historical Data')
plt.legend([l1, l2, l3], ['Interest Rate', 'Inflation Rate', 'Historical Interest'])
plt.show()
```

The model economy set up here, and the Q-function itself, are just toy examples of how an interest rate could be optimally determined from data without any model of the world. In principle, given infinite interest rate data, a similar exercise could be performed using observed inflation data. In practice, because macroeconomic data is collected only monthly or quarterly, no such infinite spring of data exists. That does not mean there is no solution.

The best place to begin to extend this exercise is in the Artificial_Economy environment. Rather than using an SISO model of the world, locally Gaussian samples of historical data could be used. That is, after selecting an interest rate in response to a randomly selected $\pi_0$, find in the data the most similar $(i, \pi)$ pair, and return it with some Gaussian noise, built from the mean and standard deviation in a fixed window around the observed point. This would, at the very least, allow for a simulation that more closely approximates the underlying data generating process for inflation.

In addition, a regime-switching model could be estimated from the data. By estimating a simple transition matrix between "low autocorrelation" and "high autocorrelation" regimes, the artificial economy could have some exogenous shifts in the parameters A, B, or both, that the model would need to learn. A more simple method could also be to simply randomly shock the system, forcing the model to make adjustments.

The Q-Function is also a very crude method, that simply tries to visit every state-action pair infinite times and record the results. More elegant solutions exist, which are also less data intensive. Policy gradient methods, for example, use linear regression to update a running estimate of the mean and standard deviation of the state-transition probability matrix, allowing the model to estimate expected rewards at an arbitrary time horizon. Regression-based methods can also incorporate additional signals, such as the instruments used in question 13. Such methods can also return continuous-valued actions, eliminating the restrictions on single-step changes in the interest rate.

Finally are inverse reinforcement learning methods, which take rewards, actions, and states as inputs, and attempt to model the underlying utility function of the agent being observed, building a model of the agent based on her actions. Such an approach could also be applied to the central bank, as all the necessary elements are observed (though the loss function, e.g. the reward function, needs to be assumed). Using a regression based approach, a utility function of arbitrary complexity could be parameterized using a small number of linear weights, and the trained parameters could be used to build predictions of future bank behavior.

While quite different from the other questions addressed in this assignment, direct estimation of parameters A, B, and F in different economic regimes, I hope this section has at least been interesting.

In [ ]:

```
```