In [1]:

```
import plotly.io as pio
pio.renderers.default = "notebook_connected"
```

Recently I started to explore AWS DeepRacer, the 1:18 scale autonomous racing car powered by Reinforcement Learning (RL). It is my first hands-on RL project. It is quite fun, partly because AWS has made it very easy for newcomers to get started with the DeepRacer Console.

After I used up all the free-tier credit (equivalent to about 10 hours of training and evaluation run time), and a little bit more, I decided to set up a local environment (using Matt Camp's deepracer-local) to continue my journey.

This write-up serves as a technical memo about all my learnings.

In [2]:

```
# Some paths.
MODEL_DIR = "/home/kylechung/deepracer-local/data/minio/bucket/"
TRACK_URL = "https://github.com/aws-samples/aws-deepracer-workshops/raw/master/log-analysis/tracks/reinvent_base.npy"
```

The reward function is the first playground, and also the first challenge, when we want to train a vehicle that can complete a given lap. The official documentation is a very good starting point for understanding the context. There are also lots of examples of how to implement a working reward function.

Here is the default reward function provided by AWS DeepRacer Console:

```
def reward_function(params):
    track_width = params["track_width"]
    distance_from_center = params["distance_from_center"]
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3
    return float(reward)
```

The reward function encourages the vehicle to stay close to the center line.
Along with a simple action space of `speed in [0.33, 0.67, 1]` and `steering_angle in [-30, 0, 30]`,
it is possible to train our first model that can stably finish the lap.
Here is the metric progress recorded with a discount factor of `0.9` and a learning rate of `0.001` (10X the default) for 600 episodes:

In [3]:

```
# Define utility function for plotting metrics.
import os
import json
import pandas as pd
import plotly.express as px


def parse_metrics(infile, starting_episode=0):
    with open(infile) as f:
        metrics = pd.DataFrame(json.load(f)["metrics"])
    metrics_t = metrics.query("phase == 'training'").copy()
    metrics_t["episode"] += starting_episode
    metrics_e = metrics.query("phase == 'evaluation'").copy()
    metrics_e["episode"] += starting_episode
    return metrics_t, metrics_e


def plot_metrics(train_metrics, eval_metrics, title=None):
    train_metrics.reset_index(drop=True, inplace=True)
    train_metrics["iter"] = (train_metrics.index / 20 + 1).astype(int)
    eval_metrics["iter"] = (eval_metrics["episode"] / 20).astype(int)
    train_progress = train_metrics.groupby("iter")["completion_percentage"].mean()
    train_progress = train_progress.to_frame().reset_index()
    train_progress["phase"] = "training"
    eval_progress = eval_metrics.groupby("iter")["completion_percentage"].mean()
    eval_progress = eval_progress.to_frame().reset_index()
    eval_progress["phase"] = "evaluation"
    progress = pd.concat([train_progress, eval_progress])
    fig = px.line(progress, x="iter", y="completion_percentage", color="phase",
                  labels={"iter": "Training Iteration",
                          "completion_percentage": "Mean Percentage of Completion"},
                  title=title)
    return fig
```

In [4]:

```
train_metrics_default, eval_metrics_default = parse_metrics(
os.path.join(MODEL_DIR, "default/TrainingMetrics.json"))
plot_metrics(train_metrics_default, eval_metrics_default)
```

With this slow action space (maximum speed of 1), the vehicle finishes the lap in around 36 seconds:

In [5]:

```
# Calculate the average elapsed time for lap completion in evaluation.
# Note that for each iteration (20 episodes) there are 6 evaluation runs.
# Here we take only the evaluations after training for at least 500 episodes.
eval_metrics_default.query(
"episode_status == 'Lap complete' and episode >= 500")[
"elapsed_time_in_milliseconds"].mean() / 1000
```

Out[5]:

It turns out that if the goal is just to finish a lap,
many reward functions can do the job.
It becomes challenging if we want the vehicle to finish a lap *faster AND stably*.

Among all the available state variables,
I found `waypoints` the most interesting one, with a lot of potential to shape how we want the vehicle to drive itself.
I've searched a lot on the Internet and there is one article I found particularly inspiring: AWS Deepracer - How to train a model in 15 minutes.
I shamelessly borrowed some of the code directly from the author's repository.
After a lot of experiments I've also confirmed that the approach is not only convincing but also works,
especially in physical racing.

Waypoints are lane markers along the track. They are distributed along the center line. We can get the waypoints for all the AWS DeepRacer tracks from the AWS DeepRacer Workshop repo.

If you are not sure whether that information is up to date,
you can just print the `waypoints` parameter in the reward function and extract it from the RoboMaker logs once model training has started.
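As a minimal sketch of that idea, the reward function below prints the waypoints only on the first step of an episode so the logs stay small. The `WAYPOINTS_DUMP` tag is an arbitrary marker made up here to make the line easy to grep, and the constant reward is a placeholder; treating `steps <= 1` as the first step is an assumption about how the step counter starts.

```python
def reward_function(params):
    # Dump the waypoints once per episode (first step only).
    # "WAYPOINTS_DUMP" is a hypothetical tag chosen for easy grepping.
    if params.get("steps", 0) <= 1:
        print("WAYPOINTS_DUMP", params["waypoints"])
    # Placeholder reward; replace with the real logic.
    return 1.0
```

The printed line can then be filtered out of the log by searching for the tag.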

Let's download the `waypoints` for the `re:Invent 2018` track:

In [6]:

```
import io
import os
import requests
from numpy import load as load_npy


def maybe_download_waypoints(url):
    # Use a local copy if one exists; otherwise fetch the file over HTTP.
    file = os.path.basename(url)
    if os.path.exists(file):
        wp = load_npy(file)
    else:
        response = requests.get(url)
        response.raise_for_status()
        wp = load_npy(io.BytesIO(response.content))
    # Keep only the first two columns (x, y).
    waypoints = wp[:, :2].tolist()
    return waypoints
```

In [7]:

```
waypoints = maybe_download_waypoints(TRACK_URL)
```

Since `waypoints` are no more than a series of x-y coordinates,
we can plot them on a coordinate system for visualization.

In [8]:

```
import plotly.graph_objects as go


def plot_waypoints(waypoints, annotate=True, title=None):
    if annotate:
        text = [str(i) for i in range(len(waypoints))]
    else:
        text = None
    x, y = zip(*waypoints)
    fig = go.Figure(data=go.Scatter(x=x, y=y, mode="markers+text",
                                    text=text, textposition="bottom center"))
    fig.update_layout(
        xaxis=dict(showgrid=False, zeroline=False),
        yaxis=dict(showgrid=False, zeroline=False, scaleanchor="x", scaleratio=1),
        title="AWS DeepRacer re:invent 2018 Track" if title is None else title
    )
    fig.show()
```

In [9]:

```
plot_waypoints(waypoints)
```

One immediate observation is that `waypoints` are NOT *evenly distributed* along the way!
This can be an issue if we use them to represent the center line, which is supposed to be (at least in theory) a continuous curve made of infinitely many points.
One way to remedy this is to up-sample the points to create a denser set:

In [10]:

```
def up_sample(waypoints, k):
    p = waypoints
    n = len(p)
    return [[i / k * p[(j+1) % n][0] + (1 - i / k) * p[j][0],
             i / k * p[(j+1) % n][1] + (1 - i / k) * p[j][1]]
            for j in range(n) for i in range(k)]
```

In [11]:

```
# Plot the same track but 10X denser.
plot_waypoints(up_sample(waypoints, 10), annotate=False,
title="re:invent 2018 Track Waypoints Up Sampled")
```
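To put a number on the uneven spacing, here is a small sketch that measures the gap between consecutive points before and after linear up-sampling. It runs on a hypothetical toy loop rather than the real track, and `segment_lengths` and `linear_up_sample` are illustrative names (the latter repeats the same interpolation as `up_sample`).

```python
import math


def segment_lengths(points):
    """Distance from each point to the next, treating the track as a closed loop."""
    n = len(points)
    return [math.hypot(points[(i + 1) % n][0] - points[i][0],
                       points[(i + 1) % n][1] - points[i][1]) for i in range(n)]


def linear_up_sample(points, k):
    """Linear interpolation producing k points per original segment."""
    n = len(points)
    return [[(1 - i / k) * points[j][0] + i / k * points[(j + 1) % n][0],
             (1 - i / k) * points[j][1] + i / k * points[(j + 1) % n][1]]
            for j in range(n) for i in range(k)]


# Hypothetical toy loop with uneven spacing (not the real track).
toy = [[0, 0], [1, 0], [4, 0], [4, 4], [0, 4]]
raw = segment_lengths(toy)
dense = segment_lengths(linear_up_sample(toy, 10))
print(min(raw), max(raw))      # smallest and largest gap before up-sampling
print(min(dense), max(dense))  # every gap shrinks by a factor of k
```

Note that linear up-sampling shrinks every gap by the same factor, so the set becomes denser but the ratio between the largest and smallest gap stays the same.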

The AWS DeepRacer environment is a simplified world. Everything can be characterized by a 2-D Euclidean coordinate system. At each step, the agent (the vehicle) is at a specific coordinate, and its objective and most of the state parameters can also be interpreted in that system.

Let's plot an even more simplified environment *snapshot*:
at a specific step, assume the vehicle is located at the origin (we can always translate the coordinates so that the car is at the origin) and there is a nearest waypoint along the way to go.

In [12]:

```
def plot_xy_base(points):
    # Use the points passed in rather than relying on global variables.
    point_0, point_a = points[0], points[1]
    x, y = zip(*points)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=[-6, 6], y=[0, 0], mode="lines",
                             line=dict(color="RoyalBlue"), showlegend=False))
    fig.add_trace(go.Scatter(x=[0, 0], y=[-6, 6], mode="lines",
                             line=dict(color="RoyalBlue"), showlegend=False))
    fig.add_trace(go.Scatter(
        x=x, y=y, mode="markers+text", marker=dict(color="black", size=10),
        text=["Current Position (0, 0)", "Next Waypoint (3, 4)"],
        textfont_size=14, textposition="bottom center", showlegend=False))
    fig.update_layout(height=600, width=600, title="Car in a Step to Make the Next Move")
    fig.update_xaxes(range=[-5, 5])
    fig.update_yaxes(range=[-5, 5])
    # Add direction with arrow.
    fig.add_annotation(dict(
        showarrow=True,
        x=point_a[0], y=point_a[1], ax=point_0[0], ay=point_0[1],
        xref="x", yref="y", axref="x", ayref="y",
        arrowhead=4, arrowsize=2, arrowcolor="red", arrowwidth=2
    ))
    # Add the theta symbol.
    fig.add_trace(go.Scatter(
        x=(.75,), y=(.5,), mode="text", text=r"$\theta$",
        textfont=dict(size=20, color="red"), showlegend=False
    ))
    return fig
```

In [13]:

```
point_0 = (0, 0) # Assume this is our current position.
point_a = (3, 4) # Assume this is the closest next waypoint on the track.
plot_xy_base([point_0, point_a])
```

We want the vehicle to drive in the correct direction.
One obvious candidate for the "correct direction" is given by the `waypoints` that outline the center line of the track.
So the problem can be simplified to the following:

Given my current position, what is the heading direction toward the next closest waypoint along the track?

There are two relevant parameters: `heading` and `steering_angle`.
Let's examine them one by one.

`heading` to Guide the Vehicle

The parameter `heading` is a real number in `[-180, +180]`,
an angle in degrees measured counterclockwise relative to the x-axis.

Given the current position and the next closest waypoint, we can use trigonometry to determine the angle. Taking the above coordinate system as an example, we have:

$$ \theta = \arctan{\frac{dy}{dx}}, $$

measured in radians. In code we use `atan2`, which also resolves the correct quadrant from the signs of $dy$ and $dx$.

So the angle in degrees in the above system can be calculated as follows:

In [14]:

```
import math


def angle(x, y):
    a = math.degrees(math.atan2(
        y[1] - x[1],
        y[0] - x[0]
    ))
    return a
```

In [15]:

```
angle(point_0, point_a) # Solve for theta in the plot.
```

Out[15]:

This is indeed the well-known 3-4-5 right triangle. I chose it on purpose, of course. :)

In the AWS DeepRacer environment $\theta$ corresponds to our `heading` parameter.
At each step we always know the `heading` of our vehicle.
So we can use this information to determine the next point ahead, should we keep the current direction unchanged AND drive for a distance `r`.

Again based on trigonometry we have:

$$ \begin{aligned} \sin \theta &= \frac{dy}{r}, \\ \cos \theta &= \frac{dx}{r}, \end{aligned} $$

where the radius $r$ equals

$$ r = \sqrt{dx^2 + dy^2}. $$

Now given that we already know our `heading` and also how far we'd like to drive,
we can solve for the next *heading point* $(x', y')$ from the current position $(x, y)$ as:

$$ \begin{aligned} x' &= x + r \cos \theta, \\ y' &= y + r \sin \theta. \end{aligned} $$

We can easily implement this:

In [16]:

```
def heading_point(p, heading, r):
    h = (
        p[0] + r * math.cos(math.radians(heading)),
        p[1] + r * math.sin(math.radians(heading))
    )
    return h
```

Assume that on the previous coordinate system our vehicle is at the origin, but its heading is toward the north-northwest, at an angle of, say, 110°. Then the heading point will be:

In [17]:

```
hp = heading_point(point_0, 110, 5)
hp
```

Out[17]:

This is illustrated in the following updated plot:

In [18]:

```
fig = plot_xy_base([point_0, point_a])
fig.add_trace(go.Scatter(
x=(hp[0],), y=(hp[1],), mode="markers+text", marker=dict(color="black", size=10),
text=["Heading Point (?, ?)"],
textfont_size=14, textposition="bottom center", showlegend=False))
fig.add_annotation(dict(
showarrow=True,
x=hp[0], y=hp[1], ax=point_0[0], ay=point_0[1],
xref="x", yref="y", axref="x", ayref="y",
arrowhead=4, arrowsize=2, arrowcolor="orange", arrowwidth=2
))
fig.show()
```

So essentially we can design a reward function that incentivizes the vehicle,
at each step,
to align its `heading` with the desired angle toward the next target point.
The smaller the difference,
the higher the reward.

Now we can use either the angle difference or the vector length difference to quantify the reward:

$$ \text{reward}(\cdot) = \text{reward}\big(\theta - \text{heading}\big). $$

The reward should decrease as the absolute difference |$\theta$ - `heading`| increases,
and the maximum possible difference (the worst case) is exactly 180°,
where the vehicle is heading in exactly the opposite direction to the desired one.

Using the above environment as an example,
the worst case is the car heading for point `(-3, -4)`:

In [19]:

```
angle(point_0, point_a) - angle(point_0, (-3, -4)) # Assuming heading for the opposite direction.
```

Out[19]:

A simple linear function to do the job can be something like:

In [20]:

```
def score_heading_delta(current_point, heading_point, desired_point):
    desired = angle(current_point, desired_point)
    heading = angle(current_point, heading_point)
    return 1 - abs((desired - heading) / 180)


# Possible rewards given the specific state illustrated in the above plot.
some_headings = list(range(-180, 180, 10))
heading_points = [heading_point(point_0, h, 5) for h in some_headings]
possible_rewards_1 = [score_heading_delta(point_0, h, point_a) for h in heading_points]
```

Let's plot the reward distribution given the above state:

In [21]:

```
import plotly.express as px # Let's use the higher-level API this time.
px.scatter(x=some_headings, y=possible_rewards_1,
labels={"x": "Possible Heading Delta (in Degrees)", "y": "Reward"})
```

It is symmetric and has no preference over deviation to the right or to the left of the desired direction.
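One caveat worth checking: the raw difference `desired - heading` in `score_heading_delta` can exceed 180° in magnitude (for example a desired angle of about 53° against a heading of -170°), which pushes the linear score below zero. The sketch below, with the hypothetical helper names `wrap_degrees` and `score_heading_delta_wrapped`, shows one way to wrap the difference first so the score stays within [0, 1].

```python
import math


def wrap_degrees(delta):
    """Map any angle difference in degrees onto the interval [-180, 180)."""
    return (delta + 180) % 360 - 180


def score_heading_delta_wrapped(desired, heading):
    """Linear score: 1 when aligned, 0 when pointing the opposite way."""
    return 1 - abs(wrap_degrees(desired - heading)) / 180


desired = math.degrees(math.atan2(4, 3))  # direction toward (3, 4)
unwrapped = 1 - abs((desired - (-170)) / 180)
wrapped = score_heading_delta_wrapped(desired, -170)
print(unwrapped, wrapped)  # the unwrapped score is negative, the wrapped one is not
```

With the wrap in place the score depends only on the smaller of the two angles between the directions, which is what makes it symmetric around the desired heading.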

Or if we use the distance between the heading point and the desired point:

In [22]:

```
def dist(p1, p2):
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])


def score_heading_vector_delta(current_point, heading_point, desired_point):
    heading_r = dist(current_point, heading_point)
    desired_r = dist(current_point, desired_point)
    delta_r = dist(heading_point, desired_point)
    return 1 - (delta_r / (desired_r * 2))


possible_rewards_2 = [score_heading_vector_delta(point_0, h, point_a) for h in heading_points]
px.scatter(x=some_headings, y=possible_rewards_2,
           labels={"x": "Possible Heading Delta (in Degrees)", "y": "Reward"})
```

To be a bit more concrete, the reward function can be written as:

```
import math


def reward_function(params):
    # Assumes the `heading_point` helper defined earlier is included in the same file.
    x, y = params["x"], params["y"]
    all_wheels_on_track = params["all_wheels_on_track"]
    waypoints = params["waypoints"]
    heading = params["heading"]
    next_waypoint = waypoints[params["closest_waypoints"][1]]
    reward = 1e-3
    if all_wheels_on_track:
        r = math.hypot(x - next_waypoint[0], y - next_waypoint[1])
        # Use a distinct local name so the `heading_point` function is not shadowed.
        hp = heading_point((x, y), heading, r)
        delta = math.hypot(hp[0] - next_waypoint[0],
                           hp[1] - next_waypoint[1])
        # Guard against division by zero when the car sits exactly on the waypoint.
        if r > 0:
            reward += (1 - (delta / (r * 2)))
    return float(reward)
```

There is a problem though.

The `heading` of the vehicle doesn't precisely indicate its next direction.
This is because even though the body may have a heading of 110°,
its wheels may have a `steering_angle` of, say, -18°.
In such a case,
even though the body is heading in a very wrong direction (as indicated by the orange arrow in the previous plot),
it is actually trying to turn right to get closer to the desired direction (the red arrow).
Such behavior deserves a little reward rather than a penalty.

`heading` + `steering_angle`

After we derive the difference between the `heading` and the desired direction,
we need to further take `steering_angle` into account in order to implement the correct reward.
Here is a final version of the reward function based on this concept:

In [23]:

```
import math


def dist(x, y):
    return math.sqrt((x[0]-y[0])**2 + (x[1]-y[1])**2)


def angle(x, y):
    a = math.degrees(math.atan2(
        y[1] - x[1],
        y[0] - x[0]
    ))
    return a


def up_sample(waypoints, k):
    p = waypoints
    n = len(p)
    return [[i / k * p[(j+1) % n][0] + (1 - i / k) * p[j][0],
             i / k * p[(j+1) % n][1] + (1 - i / k) * p[j][1]]
            for j in range(n) for i in range(k)]


def closest_waypoint_ind(p, waypoints):
    distances = [dist(wp, p) for wp in waypoints]
    min_dist = min(distances)
    return distances.index(min_dist)


def score_delta_steering(delta, worst=60):
    return max(1 - abs(delta / worst), 0)


def reward_function(params):
    reward = 1e-3
    # Read environment parameters.
    waypoints = params["waypoints"]
    track_width = params["track_width"]
    # Read states.
    x, y = params["x"], params["y"]
    heading = params["heading"]
    steering_angle = params["steering_angle"]
    # Up-sample waypoints to form a series of dense racing line points.
    waypoints = up_sample(waypoints, k=30)
    # Get the closest waypoint given current position (x, y).
    which_closest = closest_waypoint_ind((x, y), waypoints)
    # Re-order the waypoints starting from the closest for later lookup.
    following_waypoints = waypoints[which_closest:] + waypoints[:which_closest]
    # Determine the desired heading angle based on a target waypoint.
    # 1. Locate the target waypoint with a search radius.
    #    Target point should be the closest waypoint just outside the radius.
    search_radius = track_width * 0.9
    target_waypoint = waypoints[which_closest]
    for i, p in enumerate(following_waypoints):
        if dist(p, (x, y)) > search_radius:
            target_waypoint = following_waypoints[i]
            break
    # 2. Determine the desired steering angle.
    target_heading = angle((x, y), target_waypoint)
    # Wrap into [-180, 180) so the difference stays valid across the +/-180 seam.
    target_steering = (target_heading - heading + 180) % 360 - 180
    delta_steering = steering_angle - target_steering
    # Reward based on difference between current and desired steering_angle.
    reward += score_delta_steering(delta_steering, worst=45)
    return float(reward)
```

Remember that `waypoints` are not evenly distributed,
so it is better to up-sample (interpolate) the `waypoints` before we decide on the target waypoint.
That is, we ignore the built-in parameter `closest_waypoints`
and use our own up-sampled `waypoints` instead.

Let's visualize the reward function based on the `steering_angle` difference:

In [24]:

```
possible_steering_deltas = list(range(-180, 180, 5))
possible_rewards_3 = [score_delta_steering(d, worst=45) for d in possible_steering_deltas]
px.scatter(x=possible_steering_deltas, y=possible_rewards_3,
labels={"x": "Possible Steering Delta (in Degrees)", "y": "Reward"})
```

We set the parameter `worst = 45`, so whenever the absolute steering delta exceeds 45° the reward is 0.
And of course the reward is maximized when the steering difference is exactly 0, meaning we are steering in exactly the desired direction.
Theoretically the worst difference can be 180°, but that would make the incentive scheme too loose.

The concept above can be applied to any racing line other than the one given by `waypoints`.
For example, some racers calculate a track-specific optimal path and use that path to calculate the reward.
Ideally that makes the completion time shorter since the agent is heading for a shortcut.

But it won't generalize.

Be aware that we didn't use the *entire* `waypoints` to calculate the reward.
Instead, we use only those within a reasonable range (specifically, 90% of the track width) as the search area to target the desired next point.
So this is *realistic* in the sense that a driver always tries to look ahead along the road.

It is not totally realistic though.

Since the actual physical vehicle doesn't use the reward function to drive itself, but uses its trained policy function (a deep neural network model), using waypoints to train the policy network binds the vehicle to the specific track, and also to the specific direction (counterclockwise in our case).

The `speed` parameter is much trickier than it first seems.
I had nightmares trying to make it work.

First of all, `speed` is actually *throttle*.
When the vehicle takes an action of a certain `speed`, it will *accelerate* to that `speed` if its current `speed` is below the target `speed`.
In DeepRacer version 2020 a `speed` of `4` makes the vehicle accelerate so suddenly that it usually loses control in the following steps.
(The agent takes about 15 steps per second.)

A naive reward function trying to incentivize the `speed`,
such as this one:

```
def reward_function(params):
    reward = 1e-3
    reward += params["speed"]
    return reward
```

in general will NOT work.

This is because what the vehicle perceives from this reward is that `higher speed = higher reward`.
However, higher speed doesn't mean a shorter time to complete the lap.
Imagine a vehicle circling around infinitely (provided that there is enough space for it to exploit) to gain infinite reward,
without even trying to complete the lap.
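A rough back-of-the-envelope sketch makes the mismatch concrete. It sums the naive speed reward over an episode driven at constant speed, using the roughly 15 steps per second and the 36-second slow lap mentioned earlier; the 10-second lap at speed 3 is a hypothetical comparison, and discounting is ignored for simplicity.

```python
STEPS_PER_SECOND = 15  # approximate step rate quoted above


def total_speed_reward(speed, seconds, base=1e-3):
    """Undiscounted sum of (base + speed) over an episode at constant speed."""
    steps = int(seconds * STEPS_PER_SECOND)
    return steps * (base + speed)


slow_lap = total_speed_reward(speed=1, seconds=36)  # the slow lap from earlier
fast_lap = total_speed_reward(speed=3, seconds=10)  # hypothetical fast lap
print(slow_lap, fast_lap)
```

Under these assumptions the slow lap collects more total reward than the fast one, simply because the episode lasts longer. The per-step reward favors speed, but nothing in it favors finishing sooner.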

When combining such a speed reward with other forms of reward, the problem is mitigated but not solved, and indeed becomes even more complicated.

A balance between the reward gained from `speed` and the other rewards must be maintained, or the vehicle will still exploit the reward without trying to complete the lap faster.
It could take a considerable amount of time to experiment properly.
I haven't figured out a good way to use `speed` effectively in my reward at all.

Here is one of my most promising (yet still failing) implementations:

```
def is_near_straight(waypoints, k=120):
    angles = []
    for i in range(k):
        angles.append(angle(waypoints[i], waypoints[i + 1]))
    mean = sum(angles) / len(angles)
    sd = math.sqrt(sum([(x - mean)**2 for x in angles]) / len(angles))
    return sd <= 0.01
```

The idea is to encourage speeding up only when there is a long enough straight line ahead.
We use the standard deviation of the consecutive pair-wise angles to determine if a given segment is nearly straight.
In the `re:invent 2018` track there are only two segments safe enough for such an action.
Yet I still cannot make it work with the other parts of my reward function to shorten the completion time.
I've also tried adding a `steering_angle` restriction along with the straight-line speed-up,
but to no avail.
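One thing that may be worth ruling out with this kind of check is the ±180° seam: on a segment heading due west, tiny noise makes the raw `atan2` angles flip between values near +180 and near -180, which inflates the standard deviation even though the segment is straight. The sketch below is only an illustration on made-up points, not a tested fix for the reward function. `direction_spread` is a hypothetical variant that measures each direction relative to the first segment before computing the spread.

```python
import math


def angle(p, q):
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))


def wrap_degrees(d):
    """Map an angle difference in degrees onto [-180, 180)."""
    return (d + 180) % 360 - 180


def direction_spread(waypoints, k):
    """Standard deviation of the first k segment directions, measured
    relative to the first segment so the +/-180 seam does not matter."""
    angles = [angle(waypoints[i], waypoints[i + 1]) for i in range(k)]
    rel = [wrap_degrees(a - angles[0]) for a in angles]
    mean = sum(rel) / len(rel)
    return math.sqrt(sum((r - mean) ** 2 for r in rel) / len(rel))


# Hypothetical segment heading due west with tiny alternating jitter in y.
west = [(-0.1 * i, 1e-9 * (-1) ** i) for i in range(21)]
raw_angles = [angle(west[i], west[i + 1]) for i in range(20)]
print(min(raw_angles), max(raw_angles))  # close to -180 and +180
print(direction_spread(west, 20))        # close to 0
```

This relative measure stays valid as long as the directions within the window spread over less than 180°.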

Action space is another important aspect of the RL framework. A pre-trained model can switch its reward function at any time, but the action space is embedded in the model's network architecture, so once it is determined for a model it cannot be changed anymore.

To be more specific,
the *policy network* acts like a classifier over the pre-defined possible actions.
It computes a logit for each action, and a softmax operation gives the probability of each action being taken given the input observation (or state).
This is why the number of actions cannot be changed once a model is trained.
It doesn't mean, however, that the content of the actions cannot change.
For example, we can replace the actions of a pre-trained model with the same number of actions at higher speeds.
But the resulting impact is hardly predictable or even justifiable.
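As a small illustration of the classifier view, here is a plain-Python softmax over a made-up logit vector. The numbers are hypothetical and the real network output is not shown here; the point is only that the output has one entry per action, so its length is fixed by the action space.

```python
import math


def softmax(logits):
    """Convert one logit per action into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-action space.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
print(probs)
```

Changing what action index 0 means (say, a higher speed) leaves this computation untouched, while adding a fourth action would require a new output layer.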

There are two dimensions to the action space of an AWS DeepRacer agent: `speed` and `steering_angle`.

To start training DeepRacer it is recommended to use a lower `speed` (< 1), since the progress will be more observable
in terms of average reward and completion percentage over the training iterations.
This is especially useful when we are testing out a new `reward_function` and do not yet know whether it will work.

Once a `reward_function` is proven to work,
we can gradually increase the `speed` in the action space to see whether it just takes longer to converge or ends up losing control.
Unfortunately we need to re-train the model with the new action space, so each iteration can take quite long to complete.

Let's assume we have the following action space specified:

```
[
  {"steering_angle": -30, "speed": 1, "index": 0},
  {"steering_angle": -30, "speed": 2, "index": 1},
  {"steering_angle": -30, "speed": 3, "index": 2},
  {"steering_angle": -15, "speed": 1, "index": 3},
  {"steering_angle": -15, "speed": 2, "index": 4},
  {"steering_angle": -15, "speed": 3, "index": 5},
  {"steering_angle": 0, "speed": 1, "index": 6},
  {"steering_angle": 0, "speed": 2, "index": 7},
  {"steering_angle": 0, "speed": 3, "index": 8},
  {"steering_angle": 15, "speed": 1, "index": 9},
  {"steering_angle": 15, "speed": 2, "index": 10},
  {"steering_angle": 15, "speed": 3, "index": 11},
  {"steering_angle": 30, "speed": 1, "index": 12},
  {"steering_angle": 30, "speed": 2, "index": 13},
  {"steering_angle": 30, "speed": 3, "index": 14}
]
```
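Since the action space is just the Cartesian product of the steering angles and the speeds, it can be generated instead of written by hand. The sketch below reproduces the 15 actions listed above; the variable names are illustrative.

```python
import itertools
import json

steering_angles = [-30, -15, 0, 15, 30]
speeds = [1, 2, 3]

# Steering angle varies slowest and speed fastest, matching the index order above.
action_space = [
    {"steering_angle": a, "speed": s, "index": i}
    for i, (a, s) in enumerate(itertools.product(steering_angles, speeds))
]
print(json.dumps(action_space, indent=2))
```

Changing either list is then enough to produce a finer or coarser action space with consistent indices.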

For a successful model we can examine the evaluation simulation log from RoboMaker:

In [25]:

```
%%bash
# The local robomaker container doesn't seem to output simulation log for evaluation phase.
# So this log is downloaded from an evaluation run on DeepRacer Console.
cat ~/Downloads/robo.log | grep SIM_TRACE_LOG > /tmp/robo.log
```

In [26]:

```
import pandas as pd
# Be aware that the reward number is calculated by a default function when the simulation is for evaluation run.
sim_logs = pd.read_csv("/tmp/robo.log", header=None)
sim_logs.columns = [
"episode",
"step",
"x",
"y",
"heading",
"steering_angle",
"speed",
"action_taken",
"reward",
"job_completed",
"all_wheels_on_track",
"progress",
"closest_waypoint_index",
"track_length",
"time",
"status"
]
sim_logs.head()
```

Out[26]:

Then we can count how many times each action has been taken given a successful lap completion:

In [27]:

```
def action_count(df):
    cnt = df.groupby("action_taken").size().to_frame(name="frequency").reset_index()
    cnt["pct"] = cnt["frequency"] / cnt["frequency"].sum()
    return cnt


act_cnt = action_count(sim_logs[sim_logs["episode"].str.endswith("0")])
fig = px.bar(act_cnt, x="action_taken", y="frequency", text="pct",
             title="Action Distribution for a Successful Lap")
fig.update_traces(texttemplate="%{text:.2%}", textposition="outside")
fig
```

We can see that some actions are rarely taken. For this particular trial, action 0 (a sharp right turn at slow speed) was never used. It is also true that for this specific track turning left is more important than turning right.

Can we reduce the action space based on this finding, in order to speed up training while keeping the same lap performance? After several experiments the answer seems to be NO. You shouldn't remove actions simply because they are not used a lot, because they are still in use. In this case I guess both quantity and quality matter.

On the other hand,
should we increase the action space by adding more granularity?
If the track is complicated the answer should be yes.
Even for a track as simple as the `re:invent 2018` track,
I've tried adding more granularity in `steering_angle` and the model is much more stable in actual physical racing.

Here is the training progress of a model with 21 actions (7 `steering_angle` values with 3 `speed` values),
using the steering angle reward function,
with the first 60 iterations (20 episodes per iteration) using the default learning rate and a discount factor of `0.9`:

In [28]:

```
train_metrics_0, eval_metrics_0 = parse_metrics(
os.path.join(MODEL_DIR, "a21-base/TrainingMetrics.json"))
plot_metrics(train_metrics_0, eval_metrics_0,
title="Training Progress on Model with 21 Actions: First 60 Iterations")
```

The steady (though slow) upward improvement is a good sign that the model can learn things episode over episode.

We train the model for 180 iterations in total (3600 episodes), and every 60 iterations we lower the learning rate by 1e-4. Here is the entire metric history:

In [29]:

```
train_metrics_1, eval_metrics_1 = parse_metrics(
os.path.join(MODEL_DIR, "a21-120/TrainingMetrics.json"),
starting_episode=1200)
train_metrics_2, eval_metrics_2 = parse_metrics(
os.path.join(MODEL_DIR, "a21-180/TrainingMetrics.json"),
starting_episode=2400)
train_metrics = pd.concat([train_metrics_0, train_metrics_1, train_metrics_2])
eval_metrics = pd.concat([eval_metrics_0, eval_metrics_1, eval_metrics_2])
plot_metrics(train_metrics, eval_metrics,
title="Training Progress on Model with 21 Actions")
```

Though the evaluation suggests that the model above needs about 15 seconds to finish the lap,
in actual physical racing it is possible to speed up the vehicle to finish the lap *within 9 seconds.*
There are also cases where the vehicle can run very fast in virtual simulation but is not able to do so on a physical track.

So here is another learning:

*Simulation run is VERY DIFFERENT from physical racing.*

This is indeed documented in the AWS DeepRacer Developer Guide as the *Simulated-to-Real Performance Gaps* problem.

For physical racing, stability is probably more important than time of completion since it is possible to manually speed up the vehicle. That is, a vehicle that is more stable can benefit more from the manual throttle.

For the reward function we are using, it is possible to reduce the discount factor to speed up the training, but only up to a certain extent.

My experiments show that if we use action speed <= 1,
we can set the discount factor as low as `0.5` and the model will learn very fast, within an hour.
But if we are to use a faster base and max speed,
a low discount factor may trap the model in a local optimum after several iterations.
(It will start to spin endlessly.)
A factor of `0.8` or `0.9` will still be safer.
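One way to build intuition for these numbers: a reward received `n` steps in the future is weighted by `gamma ** n`, so the discount factor sets how far ahead the agent effectively looks. The sketch below counts how many steps it takes for that weight to fall below 1%; the 1% threshold and the function name are arbitrary choices for illustration.

```python
def steps_until_weight_below(gamma, threshold=0.01):
    """Smallest n such that gamma ** n < threshold."""
    n, weight = 0, 1.0
    while weight >= threshold:
        weight *= gamma
        n += 1
    return n


for gamma in (0.5, 0.8, 0.9):
    print(gamma, steps_until_weight_below(gamma))
```

With about 15 steps per second, a factor of `0.5` looks ahead roughly half a second while `0.9` looks ahead about three seconds, which may be part of why a low factor struggles once the vehicle moves faster.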

In addition, using a lower discount factor to speed up training is not universally suitable for all reward functions. For example, if we use a discount factor of 0.5 with the default reward function, these are the resulting metrics for the first 20 iterations:

In [30]:

```
plot_metrics(*parse_metrics(
os.path.join(MODEL_DIR, "DeepRacer-Metrics/TrainingMetrics-default-05.json")),
title="A Model Failed to Learn Fast with a Low Discount Factor")
```

Compared to our first ride with a discount factor of `0.9`,
it clearly shows that the agent struggles more to learn from the beginning.

My local environment for running AWS DeepRacer:

```
OS : Ubuntu 20.04
GPU: GTX 1060 Max-Q
```

Yes, I'm using a gaming laptop to train the model. It is not as fast as the DeepRacer console but it definitely saves me A LOT of money while I'm doing lots of experiments.

For the slim model (the default architecture, with 3 CNN layers) the GPU memory usage peaks at around 4.3 GiB, which is manageable for most graphics cards released in the past 2 years.

One thing to note is the Nvidia Docker runtime.
Nvidia has updated their Docker toolkit with native GPU device support for running Docker.
But it doesn't support the older syntax of the Docker GPU runtime,
which is still in use by `docker-compose`.
Since `docker-compose` is required for the local setup,
we need to install `nvidia-docker2` *as well* to make sure the environment works, even though it may seem redundant.

A lot of resources on the Internet are based on version 2019. To establish a local environment for that version one can use DeepRacer for Dummies. But be aware that the resulting model cannot be used with the current DeepRacer Console, since the console is not backward-compatible.

In the older version the underlying neural network model is based on `rl-coach==0.11` while the new version uses `rl-coach==1.0`.
This is the main reason why it is not compatible.

The RoboMaker is also very different.

In addition, the `speed` parameter is very different.
A unit of `speed` in version 2020 is equal to 3.5 in the older version.
So a `speed = 4` in version 2020 is as fast as `speed = 14` in version 2019.

I found OpenAI Spinning Up to be very good material as an entry point to the RL world for data science practitioners.