### CDS NYU
### DS-GA 3001 | Reinforcement Learning
### Lab 05
### March 01, 2023


# Deep Q-learning algorithm (from scratch...)

<br>

---

## Professor


Jeremy Curuksu, PhD -- jeremy.cur@nyu.edu

## Section Leader


Anudeep Tubati -- at5373@nyu.edu


## Goal of Today's Lab 

In this lab, we will implement a deep reinforcement learning method called Deep Q-Network (DQN). We will again use a simple OpenAI Gym environment, and you will see how much is gained by switching from tabular Q-learning to deep Q-learning.

## Resources

* https://gymnasium.farama.org/


#### New Env Installation
We need to create a new environment (just 3 packages) for this lab as it has breaking dependencies compared to our usual environment.

1. `conda create --name py39_fl_bird python=3.9 -y`
2. `conda activate py39_fl_bird`
3. `python -m pip install --upgrade pip`
4. `pip install flappy-bird-gym`
5. `pip install tensorflow==2.7`
6. `pip install ipykernel`
7. `python -m ipykernel install --user --name=py39_fl_bird`

#### Make sure to execute this notebook in the newly installed kernel.
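
A quick way to confirm that the notebook is running in the new kernel is to check that the lab's packages can be found. The snippet below is a minimal sketch: the helper name `missing_modules` is hypothetical, and the module list is taken from this notebook's import cells.

```python
import importlib.util
import sys

# Top-level modules this lab imports (names taken from the import cells below).
REQUIRED_MODULES = ["flappy_bird_gym", "tensorflow", "numpy", "ipykernel"]

def missing_modules(names):
    """Return the subset of `names` that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The interpreter path should point inside the py39_fl_bird conda environment.
print("Interpreter:", sys.executable)

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing in this kernel:", ", ".join(missing))
else:
    print("All lab packages are importable.")
```

If anything is reported missing, the notebook is most likely still attached to the usual kernel rather than `py39_fl_bird`.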

# 2. Solve *Flappy Bird* with DQN

In this use case, we will implement DQN to play the video game `FlappyBird`, using a different, "*Object Oriented Programming*" style.

Link to doc of the game: https://pypi.org/project/flappy-bird-gym/


## Imports

In [None]:
import random
import time 
import numpy as np
import flappy_bird_gym
from collections import deque
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import load_model, save_model, Sequential
from tensorflow.keras.optimizers import RMSprop


In [None]:
# Load the Flappy Bird Gym environment (rendering lets us visualize it)
env = flappy_bird_gym.make("FlappyBird-v0")

env.action_space


## Execute a few scripted actions just to get familiar with the environment

In [None]:
# Set to initial state
env.reset()  

# Loop over 1000 steps 
for i in range(1000):
    env.render()                   # Render on the screen
    action = 0                     # Action "Don't jump"
    if i%19 == 0: action = 1       # Action "Jump" (1/19 lets the bird live a few sec)
    new_state, reward, done, info = env.step(action)   # Carry out the action
    
    if done:
        env.reset()
            
    time.sleep(0.05)  
            
env.close()


## Define a function that implements an Artificial Neural Network

In [None]:
def NeuralNetwork(input_shape, output_shape):
    model = Sequential()
    model.add(Dense(512, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(256, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(output_shape, activation='linear', kernel_initializer='he_uniform'))
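    # Q-value regression: MSE loss only (accuracy is not meaningful here); `lr` is deprecated in favor of `learning_rate`
    model.compile(loss='mse', optimizer=RMSprop(learning_rate=0.0001, rho=0.95, epsilon=0.01))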
    model.summary()
    return model


## Define a class that implements the DQN agent

In this OOP style of implementation, we will define an agent class which contains:
* Gym environment parameters
* DQN Hyperparameters 
* Method to take actions (actor)
* Method to learn and update DQN parameters
* Method to train the DQN by interacting with the environment
* Method to visualize the trained agent behavior in test simulations
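
The core of the learning method is the Q-learning target: for a terminal transition the target is the reward alone, otherwise it is the reward plus the discounted maximum Q-value of the next state. The NumPy sketch below illustrates this on a toy batch. The function name `dqn_targets` and the toy values are hypothetical, and the vectorized form is only one possible way to write the update.

```python
import numpy as np

def dqn_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    """Copy q_current and overwrite the entries of the taken actions with the TD target."""
    targets = q_current.copy()
    rows = np.arange(len(actions))
    # Terminal transitions (dones == 1) do not bootstrap from the next state.
    bootstrap = gamma * q_next.max(axis=1) * (1.0 - dones)
    targets[rows, actions] = rewards + bootstrap
    return targets

# Toy batch: 2 transitions, 2 actions (values chosen for illustration only)
q_current = np.array([[1.0, 2.0], [0.5, 0.0]])   # Q(s, .) predicted by the network
q_next = np.array([[3.0, 4.0], [10.0, 20.0]])    # Q(s', .) predicted by the network
actions = np.array([0, 1])                       # actions actually taken
rewards = np.array([1.0, -100.0])
dones = np.array([0.0, 1.0])                     # second transition ends the episode

print(dqn_targets(q_current, q_next, actions, rewards, dones))
```

Only the entry of the action actually taken changes in each row, so fitting the network on these targets leaves the other action values untouched.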


In [None]:
class DQNAgent:
    
    def __init__(self):    
        # Gym. environment variables
        self.env = flappy_bird_gym.make("FlappyBird-v0")
        self.episodes = 1000
        self.state_space = self.env.observation_space.shape[0]
        self.action_space = self.env.action_space.n
        self.memory = deque(maxlen=2000)

        # Hyperparameters
        self.gamma = 0.95
        self.epsilon = 1
        self.epsilon_decay = 0.999
        self.epsilon_min = 0.01
        self.batch_number = 64 #16, 32, 128, 256
        
        self.train_start = 1000
        self.jump_prob = 0.01
        self.model = NeuralNetwork(input_shape=(self.state_space,), output_shape=self.action_space)

    def act(self, state):
        if np.random.random() > self.epsilon:
            return np.argmax(self.model.predict(state))
        return 1 if np.random.random() < self.jump_prob else 0

    def learn(self):
        # Make sure we have enough data
        if len(self.memory) < self.train_start:
            return

        # Create minibatch
        minibatch = random.sample(self.memory, min(len(self.memory), self.batch_number))
        
        # Variables to store minibatch info
        state = np.zeros((self.batch_number, self.state_space))
        next_state = np.zeros((self.batch_number, self.state_space))
        action, reward, done = [], [], []

        # Store data in variables
        for i in range(self.batch_number):
            state[i] = minibatch[i][0]
            action.append(minibatch[i][1])
            reward.append(minibatch[i][2])
            next_state[i] = minibatch[i][3]
            done.append(minibatch[i][4])

        # Predict y label
        target = self.model.predict(state)
        target_next = self.model.predict(next_state)

        for i in range(self.batch_number):
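            if done[i]:
                # Terminal transition: no future return, so the target is the reward alone
                target[i][action[i]] = reward[i]
            else:
                # Non-terminal transition: reward plus discounted best next-state Q-value
                target[i][action[i]] = reward[i] + self.gamma * np.amax(target_next[i])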

        self.model.fit(state, target, batch_size=self.batch_number, verbose=0)

        
    def train(self, vizualization=True):
        
        best_score = float('-inf')
        # n episode Iterations for training
        for i in range(self.episodes):
            # Environment variables for training 
            state = self.env.reset()
            state = np.reshape(state, [1, self.state_space])
            done = False
            score = 0
            self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

            while not done:
                if vizualization: self.env.render()
                action = self.act(state)
                next_state, reward, done, info = self.env.step(action)

                # Reshape next state
                next_state = np.reshape(next_state, [1, self.state_space])
                score += 1

                if done:
                    reward -= 100

                self.memory.append((state, action, reward, next_state, done))
                state = next_state

                if done:
                    
                    if i%5 == 0: print('Episode: {}\nScore: {}\nEpsilon: {:.2}'.format(i, score, self.epsilon))
                    
                    # Save model if it beats current high score
                    if score > best_score: 
                        self.model.save('flappybrain.h5')
                        print(f"Great score! {score}")
                        best_score = score
                        
                        if score > 200:  # stop learning if we score high enough
                            return

                self.learn()

    # Visualize a pre-trained model in test simulations
    def perform(self):
        self.model = load_model('flappybrain.h5')
        while 1:
            state = self.env.reset()
            state = np.reshape(state, [1, self.state_space])
            done = False
            score = 0

            while not done:
                self.env.render()
                action = np.argmax(self.model.predict(state))
                next_state, reward, done, info = self.env.step(action)
                state = np.reshape(next_state, [1, self.state_space])
                score += 1

                print("Current Score: {}".format(score))

                if done: 
                    print('DEAD')
                    break


## Train the agent, or visualize a pre-trained agent in test simulations

With this OOP style implementation, we can invoke either the `train` method, or the `perform` method on a pre-trained agent. 
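
Before launching `train`, it can help to see how exploration evolves under the hyperparameters set in `DQNAgent.__init__` (`epsilon = 1`, `epsilon_decay = 0.999`, `epsilon_min = 0.01`, one decay per episode). The helper names below are hypothetical; this is a small sketch of the schedule, not part of the agent.

```python
import math

def epsilon_after(episodes, start=1.0, decay=0.999, floor=0.01):
    """Epsilon after `episodes` multiplicative decays, clipped at `floor`."""
    return max(floor, start * decay ** episodes)

def episodes_to_floor(start=1.0, decay=0.999, floor=0.01):
    """Smallest number of decays after which epsilon sits at the floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))

print(f"epsilon after 1000 episodes: {epsilon_after(1000):.3f}")
print(f"episodes needed to reach the floor: {episodes_to_floor()}")
```

With these values, epsilon is still roughly 0.37 after the 1000 training episodes, so the agent never reaches `epsilon_min` within one call to `train`. A smaller `epsilon_decay` or more episodes would be needed for a mostly greedy policy by the end of training.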

In [None]:
if __name__ == '__main__':
    agent = DQNAgent()
    agent.train(vizualization=False)

In [None]:
if __name__ == '__main__':
    agent = DQNAgent()
    agent.perform()

# Exercise: 

## Implement a similar DQN algorithm to play the `SuperMarioBros-v0` game.

Link to doc of the game: https://pypi.org/project/gym-super-mario-bros/

### Super Mario Bros setup

In [None]:
# Imports
import random
import time
import gym
import numpy as np
from collections import deque
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model

# Packages needed for Super Mario Bros
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY
from nes_py.wrappers import JoypadSpace
from IPython.display import clear_output

# Set up Super Mario Bros 
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, RIGHT_ONLY)

### Execute random actions just to get familiar with the environment

In [None]:
total_reward = 0
done = True

for step in range(1000):
    env.render()
    
    if done:
        state = env.reset()
    state, reward, done, info = env.step(env.action_space.sample())
    print(info)
    total_reward += reward
    clear_output(wait=True)
    
    time.sleep(0.01) 
    
env.close()

In [None]:
# Implement your custom DQN class here!

## Example solution:

### Define a class that implements a DQN agent, with a CNN to process pixels
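
To get a feel for the size of the CNN, the sketch below tracks the spatial dimensions of an 80x88 grayscale frame through the three convolutional layers used in `build_network`. It assumes the standard behavior of `padding='same'`, where the output size is the input size divided by the stride, rounded up. The helper names are hypothetical.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a conv layer with padding='same': ceil(size / stride)."""
    return math.ceil(size / stride)

def flattened_features(height, width, layers):
    """Track (height, width) through conv layers given as (filters, stride) pairs."""
    filters = None
    for filters, stride in layers:
        height = same_padding_output(height, stride)
        width = same_padding_output(width, stride)
    # Number of inputs seen by the first Dense layer after Flatten
    return height, width, height * width * filters

# (filters, stride) for the three Conv2D layers in build_network
conv_layers = [(64, 4), (64, 2), (64, 1)]
h, w, n = flattened_features(80, 88, conv_layers)
print(f"feature map: {h}x{w}, flattened size: {n}")
```

The flattened size is the input width of the first `Dense(512)` layer, which is where most of the network's weights sit.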

In [None]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        # Create variables for our agent
        self.state_space = state_size
        self.action_space = action_size
        self.memory = deque(maxlen=5000)
        self.gamma = 0.8
        self.chosenAction = 0
        
        # Exploration vs exploitation
        self.epsilon = 0.1
        self.max_epsilon = 1
        self.min_epsilon = 0.01
        self.decay_epsilon = 0.0001
        
        # Build Neural Networks for Agent
        self.main_network = self.build_network()
        self.target_network = self.build_network()
        self.update_target_network()
        
        
    def build_network(self):
        model = Sequential()
        model.add(Conv2D(64, (4,4), strides=4, padding='same', input_shape=self.state_space))
        model.add(Activation('relu'))
        
        model.add(Conv2D(64, (4,4), strides=2, padding='same'))
        model.add(Activation('relu'))
        
        model.add(Conv2D(64, (3,3), strides=1, padding='same'))
        model.add(Activation('relu'))
        model.add(Flatten())
        
        model.add(Dense(512, activation='relu'))
        model.add(Dense(256, activation='relu'))
        model.add(Dense(self.action_space, activation='linear'))
        
        model.compile(loss='mse', optimizer=Adam())
        
        return model
        
    def update_target_network(self):
        self.target_network.set_weights(self.main_network.get_weights())
        
    def act(self, state, onGround):
        if onGround < 83:
            # print("On Ground")
            if random.uniform(0,1) < self.epsilon:
                self.chosenAction = np.random.randint(self.action_space)
                return self.chosenAction
            Q_value = self.main_network.predict(state)
            self.chosenAction = np.argmax(Q_value[0])
            return self.chosenAction
        else: 
            # print("Not on Ground")
            return self.chosenAction
        
    def update_epsilon(self, episode):
        self.epsilon = self.min_epsilon + (self.max_epsilon - self.min_epsilon) * np.exp(-self.decay_epsilon * episode)
        
    # Train the network
    def train(self, batch_size):
        # Sample a minibatch of stored transitions from replay memory
        minibatch = random.sample(self.memory, batch_size)
        
        # For each transition, build the Q-value target and fit the main network on it
        for state, action, reward, next_state, done in minibatch:
            target = self.main_network.predict(state)
            print(target)
            
            if done:
                target[0][action] = reward
            else:
                target[0][action] = (reward + self.gamma * np.amax(self.target_network.predict(next_state)))
                
            self.main_network.fit(state, target, epochs=1, verbose=0)
        
    def store_transition(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        
    def get_pred_act(self, state):
        Q_values = self.main_network.predict(state)
        print(Q_values)
        return np.argmax(Q_values[0])
        
    def load(self, name):
        self.main_network = load_model(name)
        self.target_network = load_model(name)
    
    def save(self, name):
        self.main_network.save(name)
        

### Train the agent

In [None]:
num_episodes = 1000000
num_timesteps = 400000
batch_size = 64
DEBUG_LENGTH = 300

In [None]:
action_space = env.action_space.n
state_space = (80, 88, 1)

from PIL import Image

def preprocess_state(state):
    image = Image.fromarray(state)
    image = image.resize((88, 80))
    image = image.convert('L')
    image = np.array(image)
    
    return image


In [None]:
dqn = DQNAgent(state_space, action_space)

In [None]:
print('STARTING TRAINING')

stuck_buffer = deque(maxlen=DEBUG_LENGTH)

for i in range(num_episodes):
    Return = 0
    done = False
    time_step = 0
    onGround = 79
    
    state = preprocess_state(env.reset())
    state = state.reshape(-1, 80, 88, 1)
    
    for t in range(num_timesteps):
        env.render()
        time_step += 1
        
        if t > 1 and stuck_buffer.count(stuck_buffer[-1]) > DEBUG_LENGTH - 50:
            action = dqn.act(state, onGround=79)
        else:
            action = dqn.act(state, onGround)
        
        print("ACTION IS " + str(action))
        
        next_state, reward, done, info = env.step(action)
        
        onGround = info['y_pos']
        stuck_buffer.append(info['x_pos'])
        
        next_state = preprocess_state(next_state)
        next_state = next_state.reshape(-1, 80, 88, 1)
        
        dqn.store_transition(state, action, reward, next_state, done)
        state = next_state
        
        Return += reward
        print("Episode is: {}\nTotal Time Step: {}\nCurrent Reward: {}\nEpsilon is: {}".format(str(i), str(time_step), str(Return), str(dqn.epsilon)))
        
        clear_output(wait=True)
        
        if done:
            break
        
        if len(dqn.memory) > batch_size and i > 0:
            dqn.train(batch_size)
            
    dqn.update_epsilon(i)
    clear_output(wait=True)
    dqn.update_target_network()

    dqn.save('marioRL.h5')
    
env.close()
            

### Visualize the trained agent in test simulations

In [None]:
dqn.load('MarioRLmediumtrain.h5')

while 1: 
    done = False
    state = preprocess_state(env.reset())
    state = state.reshape(-1, 80, 88, 1)
    total_reward = 0
    onGround = 79
    
    while not done:
        env.render()
        action = dqn.act(state, onGround)
        next_state, reward, done, info = env.step(action)
        
        onGround = info['y_pos']
        
        next_state = preprocess_state(next_state)
        next_state = next_state.reshape(-1, 80, 88, 1)
        state = next_state
        
env.close()

As a next step, one could 1) train for longer, 2) test different hyperparameters, or 3) check out different or more sophisticated implementations (there exist several papers and blog posts on "RL for Super Mario Bros").

## Thank you everyone!