The Old Guy Likes Desktops

I write about PC building, machine learning, and bicycles, with no particular coherence.

Four AI players learned bike racing with Q-learning

This article is a translation of the Japanese version.
The original is here*1.

 Hi, this is chang. In this article, four AI racers (M*cEwan, P*tacchi, C*vendish, C*nchellara) competed against each other and learned their tactics.

0. Bicycle road race

 Bicycle road racing is spectacular. It has special factors, like the slipstream, that make races complex and full of human drama. For the background knowledge needed to watch bike races, please see my previously published articles*2*3.

 This time I used four real riders as models. If you know them, I think you can enjoy this article much more. I picked YouTube videos featuring them*4*5*6*7; I hope the videos help. Sorry, but I could not find a source that compares the characteristics of the four.

1. Program

(1) Field

 Races are run on a 32 × 12 image, starting from the upper edge. The first agent to reach the bottom edge wins. The bar at the right side of the image shows each rider's remaining energy.

 This time I increased the width from 8 to 12 to accommodate the additional players.
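
 For concreteness, here is a minimal sketch of the field constants described above (the names are my own illustration, not the actual source):

FIELD_LENGTH = 32  # riders start at row 0 and race toward the bottom edge (row 31)
FIELD_WIDTH = 12   # widened from 8 to make room for four riders

def has_finished(row):
    # the first agent to reach the bottom edge wins
    return row >= FIELD_LENGTH - 1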

f:id:changlikesdesktop:20210304064842p:plain:w400
Field

(2) Players

 This time I did not distinguish between the player and its competitors. Instead, four players competed equally against each other. Their profiles are listed below, followed by a code sketch that summarizes them.

M*cEwan

f:id:changlikesdesktop:20210304065929p:plain:w300
M*cEwan

  • 5 action patterns
  • Dashes of 1 and 2 pixels consume 1 and 2 energy, respectively
  • A super sprint with insufficient energy advances only by the remaining energy value
  • If remaining energy is 0, sprint(0) and super sprint(4) are identical to going straight(2)
  • Initial energy is 5
  • A large lateral motion of 2 pixels is provided for catching rivals' wheels
  • M*cEwan type: weak in long sprints(=small initial energy) but good at quick moves(=large lateral motion)

P*tacchi

f:id:changlikesdesktop:20210304100359p:plain:w300
P*tacchi

  • 5 action patterns
  • Dashes of 1 and 2 pixels consume 1 and 2 energy, respectively
  • A super sprint with insufficient energy advances only by the remaining energy value
  • If remaining energy is 0, sprint(0) and super sprint(4) are identical to going straight(2)
  • Initial energy is 6
  • P*tacchi type: large body(=small lateral motion) but good at long sprints(=large initial energy)

C*vendish

f:id:changlikesdesktop:20210304100723p:plain:w300
C*vendish

  • 5 action patterns
  • Dashes of 1 and 3 pixels consume 1 and 3 energy, respectively
  • A super sprint with insufficient energy advances only by the remaining energy value
  • If remaining energy is 0, sprint(0) and super sprint(4) are identical to going straight(2)
  • Initial energy is 5
  • C*vendish type: superior explosive power(=super sprint of 3 pixels)

C*nchellara

f:id:changlikesdesktop:20210304100749p:plain:w300
C*nchellara

  • 5 action patterns
  • A dash of 1 pixel consumes 1 energy
  • If remaining energy is 0, sprinting(0 or 4) is identical to going straight(2)
  • Initial energy is 10
  • C*nchellara type: poor sprint(=no super sprint) but exceptional endurance(=large initial energy)
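
 To gather the four profiles above in one place, here is a minimal sketch of how the parameters could be encoded. The names are my own illustration, and the lateral steps of C*vendish and C*nchellara are assumptions (the lists above do not state them); this is not the actual source.

from dataclasses import dataclass

@dataclass
class RiderProfile:
    name: str
    initial_energy: int  # energy at the start line
    sprint: int          # extra pixels (= energy cost) of sprint(0)
    super_sprint: int    # extra pixels (= energy cost) of super sprint(4); 0 = unavailable
    lateral_step: int    # pixels moved by one lateral action

def dash_distance(requested, energy):
    # a dash with insufficient energy advances only by the remaining energy
    return min(requested, energy)

PROFILES = [
    RiderProfile("M*cEwan", 5, 1, 2, 2),       # quick, big lateral move
    RiderProfile("P*tacchi", 6, 1, 2, 1),      # long sprinter
    RiderProfile("C*vendish", 5, 1, 3, 1),     # explosive super sprint
    RiderProfile("C*nchellara", 10, 1, 0, 1),  # no super sprint, big engine
]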

(3) Slipstream

 I imitated the slipstream with simple rules: players can recover energy by taking a rival's wheel (a code sketch follows the rules below).

f:id:changlikesdesktop:20210304071517p:plain:w300
Imitated slipstream

  • If a player is 1 pixel behind another rider, the player recovers 2 energy
  • If a player is 2 pixels behind another rider, the player recovers 1 energy
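
 In code, this rule could look like the following minimal sketch (my illustration of the two bullets above, not the actual source):

def slipstream_recovery(gap):
    # energy recovered per step when a rider sits `gap` pixels behind a rival
    if gap == 1:
        return 2
    elif gap == 2:
        return 1
    return 0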

(4) Reward

  • Win: 1.0
  • Lose: -1.0

 I used the same all-or-nothing concept as before. In a draw, the reward is the same as for a loss(=-1.0).
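
 As a one-line sketch (illustrative, not the actual source), the reward rule is simply:

def reward(won):
    # all-or-nothing: draws and losses both score -1.0
    return 1.0 if won else -1.0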

Note: Please also see my published articles for details of the program.

2. Results

 There was much more variation in the results because of the increased number of players. Although I wanted to test as many cases as possible, training took time, so I stopped the calculation at 10 cases.

(1) Winning percentage

Case    M*cEwan    P*tacchi    C*vendish    C*nchellara    Draw
0       0.073      0.157       0.503        0.000          0.267
1       0.497      0.107       0.177        0.000          0.219
2       0.037      0.010       0.057        0.010          0.886
3       0.390      0.013       0.123        0.003          0.597
4       0.133      0.320       0.130        0.000          0.417
5       0.033      0.017       0.087        0.003          0.860
6       0.043      0.817       0.063        0.000          0.077
7       0.650      0.120       0.040        0.020          0.170
8       0.023      0.330       0.247        0.200          0.200
9       0.220      0.250       0.247        0.033          0.250
Mean    0.210      0.210       0.170        0.030          0.390

 The table above shows the winning percentages over 300 races run with the trained neural networks. It shows that:

  • M*cEwan and P*tacchi dominated the races
  • C*vendish ranked third
  • C*nchellara was the only loser

Note: I would like to say that I have nothing against C*nchellara.

(2) Race tactics

 I made gif animations of each racer's winning pattern. Gif animations of all the cases are here*8.

f:id:changlikesdesktop:20210303175056g:plain:w500
M*cEwan (red) sprinted from behind his rivals. Extract from Case 1
f:id:changlikesdesktop:20210303175206g:plain:w500
P*tacchi (orange) sprinted from behind his rivals. Extract from Case 6
f:id:changlikesdesktop:20210303175237g:plain:w500
C*vendish (yellow) super-sprinted from behind his rivals. Extract from Case 9
f:id:changlikesdesktop:20210303175310g:plain:w500
C*nchellara (sky blue) overpowered his rivals. No training: he was the strongest under random motion, at the early stage of learning

Case    M*cEwan    P*tacchi    C*vendish    C*nchellara
0       /          /           break        /
1       sprint     /           /            /
2       break      /           break        /
3       break      /           break        /
4       break      sprint      break        /
5       break      /           break        /
6       /          sprint      /            /
7       break      /           /            /
8       break      sprint      sprint       /
9       break      sprint      sprint       /

 The table above shows the tactics learned in each case. I felt the tactics were naive compared to the previous tests with two players*9*10. Although the riders learned goal sprints, they often went too fast and lost their legs before the finish. In addition, the four did not all battle for the race: in many cases, only two sprinted while the rest looked on from behind.

3. Discussion

(1) Training epochs

 One of the causes of the poor tactics was simply a lack of training epochs.

f:id:changlikesdesktop:20210304090532p:plain:w400
Winning percentage during learning of Case 0

 Above is the transition of the winning percentage during the training of case 0. The wins of P*tacchi and C*vendish increased at around epoch 25,000 and kept rising until the end (epoch 50,000). More skilled sprinting might have been acquired with additional training. Although I increased the epochs from 20,000 to 50,000 this time, it did not seem to be enough. I think this is because the variation in the field increased dramatically with more players.

(2) Negative brain

 In many cases, not all four players raced aggressively. Let me try to understand this phenomenon.

 Let's look again at the transition of the winning percentage in case 0. When M*cEwan and C*vendish improved at around epoch 25,000, C*nchellara's wins decreased. Only these four players exist in this program, so if one wins, another necessarily loses. It was difficult to train all the players at the same time.

 In many cases, C*nchellara dominated races at the early stage of training (epoch 25,000 or earlier), because his high initial energy is quite advantageous under random motion. But as shown in the gif animations above, C*nchellara tended to lose with energy remaining. If he had used his energy, he might have won, but he did not. I think his neural network, which gained rewards through random motion early in training, was denied once his rivals became strong. The winning pattern of the past was destroyed, and as a result the network fell into a state of negative brain.

 Real riders sometimes lose their rhythm after a series of losses. I guess this is a similar situation.

train.py

if terminal:
    # experience replay: train only the agent that won this race,
    # so each network learns from winning experiences only
    if reward_t[0] > 0:
        agent1.experience_replay()
    elif reward_t[1] > 0:
        agent2.experience_replay()
    elif reward_t[2] > 0:
        agent3.experience_replay()
    elif reward_t[3] > 0:
        agent4.experience_replay()

 By the way, we now have four players. Each player experiences more losses than wins (losing 3/4 of races and winning 1/4) and easily falls into the negative brain. So I tried using only the experience of wins, as in the code above. Sadly, it did not work well.

dqn_agent.py

class DQNAgent:
    ...
    def __init__(self, enable_actions, input_shape, environment_name, model_name):
        ...
        self.replay_memory_size = 10  # keep only the last 10 races in memory

 In addition, to forget past losses, I set the replay memory size to 10 races, as in the snippet above. That did not work well either. I guess the experience of losses gradually increased and dominated the memory in the end.
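
 For reference, here is a minimal sketch of how such a bounded memory can be implemented with a deque (the class and method names are my own, not the repository's):

from collections import deque
import random

class BoundedReplayMemory:
    def __init__(self, size=10):
        # oldest races are dropped automatically as new ones arrive
        self.buffer = deque(maxlen=size)

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # draw at most batch_size experiences for a training step
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))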

 In reality, riders do not always run the same races. In many cases, riders train in their home countries and then meet in the big races. Training in different races may help riders avoid the negative brain. Like Fabian Cancellara in 2010 and Philippe Gilbert in 2011, a rider sometimes dominates the big races. I think it is possible that many of their rivals were in a state of negative brain.

 As a trial, I gathered the networks in which each rider was upbeat (M*cEwan: case 1, P*tacchi: case 6, C*vendish: case 0, C*nchellara: random) and made them race. So aggressive!!! I never tire of watching these close battles.

f:id:changlikesdesktop:20210303184209g:plain:w200
Race among positive brains. Example 1 (6 races). Red = M*cEwan, Orange = P*tacchi, Yellow = C*vendish, Sky blue = C*nchellara
f:id:changlikesdesktop:20210303184308g:plain:w200
Race among positive brains. Example 2 (6 races). Red = M*cEwan, Orange = P*tacchi, Yellow = C*vendish, Sky blue = C*nchellara

(3) C*nchellara was a training rival

 I am afraid of bashing from C*nchellara's fans... I say this again: I have nothing against him. I am writing an objective analysis of the phenomenon.

 In fact, C*nchellara was the only loser today, but he played an important role in the competition. Looking at the learning transition of case 0 again, P*tacchi and C*vendish likely became strong through their fights with C*nchellara.

f:id:changlikesdesktop:20210304093025p:plain:w400
Win rate during learning of Case 2

 This is obvious in case 2, shown above. Case 2 had many drawn games and tended to be boring.

 In the winning transition of case 2, C*nchellara's dominance at the early stage of training did not happen. No player raised their winning rate. As a result, the many drawn games made all the players defensive. This shows that all the players go numb without rivals who take risks. I would say that elegant tactics emerge from battles against strong, physically superior(=high initial energy, in this game) rivals.

4. Afterword

 It may be possible to discuss not only tactical aspects but also training plans based on brain science.

 By the way, it was quite hard to write this article because the training took a lot of time. My computer lit up the night and kept me from sleeping. And to be honest, the translation is also bothersome, ha ha ha... I have to pay attention to my health.

 Next, I will try racing with teams. I wonder why C*vendish did not do well today. His characteristics should be strong if he saves his power until the end of a race. Individual racing, which requires riders to chase breaks by themselves, did not suit him. I do want to see C*vendish's super sprint launched from a team train.

 I updated the source code*11.