Foreword

This chapter shows how to use PaddlePaddle to implement reinforcement learning and play a classic control game. For an introduction to the games themselves, please refer to the official Gym website. We will play CartPole-v1, where the goal is to move a cart left and right so that the pole standing on it does not fall over. With reinforcement learning, the agent keeps interacting with the game, receiving rewards or penalties, and gradually learns a model from this feedback. The AI used for the super-hard opponents in Honor of Kings is based on a similar approach.

PaddlePaddle Program

Create a Python file named DQN.py and import the required libraries. If gym is not installed, install it with the command pip3 install gym.

import numpy as np
import paddle.fluid as fluid
import random
import gym
from collections import deque
from paddle.fluid.param_attr import ParamAttr

Define a simple network consisting of 3 fully connected layers. Parameter names are specified explicitly so that parameters can be updated by name later: two copies of this network will be created below, and their parameters must be distinguishable.

# Define a deep neural network with specified parameter names for later parameter updates
def DQNetWork(ipt, variable_field):
    fc1 = fluid.layers.fc(input=ipt,
                          size=24,
                          act='relu',
                          param_attr=ParamAttr(name='{}_fc1'.format(variable_field)),
                          bias_attr=ParamAttr(name='{}_fc1_b'.format(variable_field)))
    fc2 = fluid.layers.fc(input=fc1,
                          size=24,
                          act='relu',
                          param_attr=ParamAttr(name='{}_fc2'.format(variable_field)),
                          bias_attr=ParamAttr(name='{}_fc2_b'.format(variable_field)))
    out = fluid.layers.fc(input=fc2,
                          size=2,
                          param_attr=ParamAttr(name='{}_fc3'.format(variable_field)),
                          bias_attr=ParamAttr(name='{}_fc3_b'.format(variable_field)))
    return out

Define a function that synchronizes the target network's parameters with the policy network's. It filters out the parameters of the two networks by name, clones the main program, adds assign operations that copy each policy parameter to the corresponding target parameter, and finally prunes the cloned program down to just these assign operations.

# Define parameter update program
def _build_sync_target_network():
    # Get all parameters
    vars = list(fluid.default_main_program().list_vars())
    # Filter parameters for both networks
    policy_vars = list(filter(lambda x: 'GRAD' not in x.name and 'policy' in x.name, vars))
    target_vars = list(filter(lambda x: 'GRAD' not in x.name and 'target' in x.name, vars))
    policy_vars.sort(key=lambda x: x.name)
    target_vars.sort(key=lambda x: x.name)

    # Clone the main program to create a program for parameter update
    sync_program = fluid.default_main_program().clone()
    with fluid.program_guard(sync_program):
        sync_ops = []
        for i, var in enumerate(policy_vars):
            sync_op = fluid.layers.assign(policy_vars[i], target_vars[i])
            sync_ops.append(sync_op)
    # Prune the second program to complete parameter update
    sync_program = sync_program._prune(sync_ops)
    return sync_program

Define 5 input data layers:
- state_data: Input layer for current game state data
- action_data: Input layer for game action data (two actions: 0 and 1)
- reward_data: Input layer for game reward data
- next_state_data: Input layer for next game state data
- done_data: Input layer for game termination status

# Define input data
state_data = fluid.layers.data(name='state', shape=[4], dtype='float32')
action_data = fluid.layers.data(name='action', shape=[1], dtype='int64')
reward_data = fluid.layers.data(name='reward', shape=[], dtype='float32')
next_state_data = fluid.layers.data(name='next_state', shape=[4], dtype='float32')
done_data = fluid.layers.data(name='done', shape=[], dtype='float32') 

Define training parameters, including the epsilon-greedy exploration strategy parameters.

# Define training parameters
batch_size = 32
num_episodes = 300
num_exploration_episodes = 100
max_len_episode = 1000
learning_rate = 1e-3
gamma = 1.0
initial_epsilon = 1.0
final_epsilon = 0.01
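
To see how these exploration parameters interact before reaching the main loop: the exploration rate epsilon is annealed linearly from initial_epsilon down to final_epsilon over the first num_exploration_episodes episodes, then held constant. A small helper expressing this schedule (for illustration only; the main loop below computes the same expression inline):

# Illustration only: the linear epsilon annealing schedule used by the training loop below
def annealed_epsilon(episode_id):
    return max(initial_epsilon * (num_exploration_episodes - episode_id) /
               num_exploration_episodes, final_epsilon)

For example, annealed_epsilon(0) is 1.0 (fully random actions), annealed_epsilon(70) is 0.3, and every episode from 100 onwards uses 0.01.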

Instantiate the CartPole-v1 game environment. Other games can be created in the same way by passing their official environment names.

# Instantiate the game environment with the specified game name
env = gym.make("CartPole-v1")
replay_buffer = deque(maxlen=10000)

Build the first network, whose parameter names contain the string policy. This is the policy network that will be used to choose actions.

# Get the network
state_model = DQNetWork(state_data, 'policy')

Clone the main program at this point to obtain a prediction program, which will later be used to predict game actions. Cloning before the loss and optimizer are added keeps the prediction program limited to the policy network's forward pass.

# Clone the prediction program
predict_program = fluid.default_main_program().clone()

Define the loss function. Unlike a typical supervised loss, the label here is not fixed training data: a target Q-value is computed from the reward and the target network's estimate of the next state, and the squared error between this target and the policy network's predicted Q-value for the chosen action is minimized.

# Define the loss function
action_onehot = fluid.layers.one_hot(action_data, 2)
action_value = fluid.layers.elementwise_mul(action_onehot, state_model)
pred_action_value = fluid.layers.reduce_sum(action_value, dim=1)

targetQ_predict_value = DQNetWork(next_state_data, 'target')
best_v = fluid.layers.reduce_max(targetQ_predict_value, dim=1)
best_v.stop_gradient = True
target = reward_data + gamma * best_v * (1.0 - done_data)

cost = fluid.layers.square_error_cost(pred_action_value, target)
avg_cost = fluid.layers.reduce_mean(cost)
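
In equation form, the code above minimizes the standard DQN temporal-difference error averaged over the batch (this formula is just a restatement of the code, to make it easier to follow):

$$L = \frac{1}{N}\sum_{i=1}^{N}\Big(r_i + \gamma\,(1-d_i)\max_{a'}Q_{\text{target}}(s'_i,a') - Q_{\text{policy}}(s_i,a_i)\Big)^2$$

Here Q_policy is the network built from state_data, Q_target is the copy built from next_state_data, and setting stop_gradient on best_v ensures that gradients flow only into the policy network's parameters; the target network is updated separately by the sync program.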

Obtain the parameter update program for later execution.

# Get the parameter update program
_sync_program = _build_sync_target_network()

Define the optimization method using AdamOptimizer, which is preferred by the author.

# Define the optimization method
optimizer = fluid.optimizer.AdamOptimizer(learning_rate=learning_rate, epsilon=1e-3)
opt = optimizer.minimize(avg_cost)

Initialize the executor.

# Create and initialize the executor
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
epsilon = initial_epsilon

The main training loop works as follows:

  • At the start of each episode, reset the environment to obtain the initial game state.
  • Use an epsilon-greedy exploration strategy whose exploration rate decreases as training progresses.
  • Play through the episode step by step, recording state transitions, rewards, and termination.
  • Store each experience in the replay buffer.
  • Once the buffer holds at least one batch of samples, start training with experience replay.
  • Periodically copy the policy network's parameters into the target network.

update_num = 0
# Start playing the game
for epsilon_id in range(num_episodes):
    # Initialize environment and get initial state
    state = env.reset()
    epsilon = max(initial_epsilon * (num_exploration_episodes - epsilon_id) /
                  num_exploration_episodes, final_epsilon)
    for t in range(max_len_episode):
        # Render the game (optional)
        # env.render()
        state = np.expand_dims(state, axis=0)
        # Epsilon-greedy exploration strategy
        if random.random() < epsilon:
            # Random action with probability epsilon
            action = env.action_space.sample()
        else:
            # Model-predicted action
            action = exe.run(predict_program,
                             feed={'state': state.astype('float32')},
                             fetch_list=[state_model])[0]
            action = np.squeeze(action, axis=0)
            action = np.argmax(action)

        # Execute action and get next state, reward, and termination info
        next_state, reward, done, info = env.step(action)

        # Penalize terminal states
        reward = -10 if done else reward
        # Store experience in replay buffer
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state

        # Print progress when episode ends
        if done:
            print('Pass:%d, epsilon:%f, score:%d' % (epsilon_id, epsilon, t))
            break

        # Train when buffer has enough samples
        if len(replay_buffer) >= batch_size:
            batch_state, batch_action, batch_reward, batch_next_state, batch_done = \
                [np.array(a, np.float32) for a in zip(*random.sample(replay_buffer, batch_size))]

            # Update target network periodically
            if update_num % 200 == 0:
                exe.run(program=_sync_program)
            update_num += 1

            # Reshape data for training
            batch_action = np.expand_dims(batch_action, axis=-1)
            batch_next_state = np.expand_dims(batch_next_state, axis=1)

            # Execute training
            exe.run(program=fluid.default_main_program(),
                    feed={'state': batch_state,
                          'action': batch_action.astype('int64'),
                          'reward': batch_reward,
                          'next_state': batch_next_state,
                          'done': batch_done})

Sample training output:

......
Pass:70, epsilon:0.300000, score:234
Pass:71, epsilon:0.290000, score:272
Pass:72, epsilon:0.280000, score:254
Pass:73, epsilon:0.270000, score:148
Pass:74, epsilon:0.260000, score:147
Pass:75, epsilon:0.250000, score:342
Pass:76, epsilon:0.240000, score:295
Pass:77, epsilon:0.230000, score:290
Pass:78, epsilon:0.220000, score:276
Pass:79, epsilon:0.210000, score:279
......
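
After training finishes, you may also want to watch the learned policy play without exploration. The snippet below is only an optional sketch that is not part of the original script; it reuses predict_program and the variables defined above to act greedily (epsilon = 0).

# Optional sketch: evaluate the trained policy greedily (no exploration)
for episode in range(5):
    state = env.reset()
    total_reward = 0
    for t in range(max_len_episode):
        state = np.expand_dims(state, axis=0)
        # Always pick the action with the highest predicted Q-value
        q_values = exe.run(predict_program,
                           feed={'state': state.astype('float32')},
                           fetch_list=[state_model])[0]
        action = int(np.argmax(np.squeeze(q_values, axis=0)))
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            print('Eval episode:%d, total reward:%d' % (episode, total_reward))
            break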

This completes the implementation of reinforcement learning with PaddlePaddle to play the game. Reinforcement learning has many applications, such as robot obstacle avoidance and other intelligent control tasks.

The project is also available on Baidu AI Studio: http://aistudio.baidu.com/aistudio/projectdetail/31310
The project is also available on Kesci K-Lab: https://www.kesci.com/home/project/5c3eaac54223d9002bfef5ae
Project code on GitHub: https://github.com/yeyupiaoling/LearnPaddle2/tree/master/note7

Note: The latest code is on GitHub.


Previous Chapter: 《PaddlePaddle From Beginner to Expert》Six - Generative Adversarial Networks
Next Chapter: 《PaddlePaddle From Beginner to Expert》Eight - Model Saving and Loading


References

  1. https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md
  2. https://github.com/snowkylin/TensorFlow-cn