Machine learning technique: Introduction to reinforcement learning

Emmanuel Ohiri

Reinforcement learning (RL) is one of the most dynamic and captivating machine learning techniques. Unlike other machine learning techniques, such as supervised learning and unsupervised learning, reinforcement learning takes a unique approach inspired by how humans and animals learn through trial and error.

Reinforcement learning revolves around the concept of agents interacting with an environment, learning to make decisions that maximize cumulative rewards over time. In this article, we will explore reinforcement learning, breaking down its components, mechanisms, and applications.

Reinforcement learning requires powerful computing resources to train agents effectively. CUDO Compute offers access to a scalable, cost-effective global network of high-performance GPUs for AI development. Accelerate your reinforcement learning development and deploy your agents faster with CUDO Compute. Get started today!

Table of Contents

  1. Basics of machine learning
  2. What is reinforcement learning?
  3. Key concepts in reinforcement learning
  4. How reinforcement learning works
  5. Types of reinforcement learning
  6. Core algorithms in reinforcement learning
  7. Practical applications of reinforcement learning
  8. Conclusion

Basics of machine learning

To understand reinforcement learning effectively, it is crucial to understand how it fits within the broader spectrum of machine learning. Machine learning is a branch of artificial intelligence (AI) that enables systems to learn and adapt without being explicitly programmed. The main types of machine learning include:

  1. Supervised Learning: Involves learning from labeled data. The model is trained on a dataset where both inputs and outputs are known, and its goal is to learn the mapping between them.

Read more about supervised learning here: Introduction to supervised learning

  2. Unsupervised Learning: Deals with unlabeled data. The model identifies patterns and structures from the data without explicit output labels.

Read more about unsupervised learning here: Introduction to unsupervised learning

  3. Reinforcement Learning: Focuses on decision-making and is modeled as an interaction between an agent and an environment, where the agent aims to maximize cumulative rewards through its actions.

What is reinforcement learning?

Reinforcement learning is a type of machine learning where an agent learns how to behave in an environment by performing actions and receiving feedback in the form of rewards or penalties. The agent aims to find the optimal policy—a strategy for choosing actions—that maximizes its long-term reward.

[Image omitted. Source: Towards Data Science]

At its core, RL is influenced by behavioral psychology, mimicking the way living beings learn through interaction with their surroundings. For example, imagine a robot learning to navigate a maze. Through trial and error, it discovers that certain paths lead to dead ends (penalties) while others lead it closer to the exit (rewards).

The process is iterative, with the agent continuously improving its decision-making based on accumulated experience.

Key Concepts in Reinforcement Learning

Understanding reinforcement learning requires familiarity with several foundational concepts:

  • Agent: This is the learner or decision-maker in the RL system. Think of it as the brain of the operation. The agent observes the environment, takes actions, and learns from the consequences.
    • Example: In a chess game, the agent would be the AI player deciding which moves to make.
  • Environment: This is the external world or system that the agent interacts with. It can be anything from a physical environment (like a robot navigating a room) to a virtual environment (like a game simulator). The environment responds to the agent's actions and provides feedback.
    • Example: In a self-driving car scenario, the environment would include the roads, other vehicles, pedestrians, traffic signals, and weather conditions.
  • State (S): A state is a snapshot of the environment at a particular point in time. It provides all the necessary information the agent needs to make an informed decision.
    • Example: In a video game, the state might include the player's position, the enemies' positions, the available weapons, and the remaining health points.
  • Action (A): An action is anything the agent can do to interact with the environment. The action space is the set of all possible actions the agent can take.
    • Example: In a robotics application, actions could include moving forward, turning left or right, grasping an object, or releasing an object.
  • Reward (R): This is the feedback signal the agent receives from the environment after taking an action. It indicates how good or bad the action was in terms of achieving the agent's goal. Rewards can be positive, negative, or zero.
    • Example: In a game, reaching a higher level might give a positive reward, while losing a life might result in a negative reward.

[Image omitted. Source: Paper]

  • Policy (π): A policy is a mapping from states to actions. It essentially defines the agent's strategy or behavior. It tells the agent which action to take in each possible state.
    • Example: A simple policy for a self-driving car might be "If the traffic light is red, then stop; if the traffic light is green, then go."
  • Value Function (V): This function estimates the long-term value of being in a particular state. It predicts the expected cumulative reward the agent will receive if it starts in that state and follows a specific policy.
    • Example: In a maze, states closer to the goal would generally have higher values, as they are more likely to lead to the reward of reaching the exit.
  • Q-Value or Action-Value (Q): This function is similar to the value function, but it also considers the action taken. It estimates the expected cumulative reward for taking a specific action in a given state and then following a policy.
    • Example: In a game, a high Q-value for a particular action in a given state suggests that taking that action will likely lead to a good outcome.

These concepts are interconnected and work together to enable the agent to learn and adapt in its environment. Here is how it works.
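
To see how these pieces fit together in code, here is a tiny sketch using a hypothetical Q-table (a NumPy array with one row per state and one column per action): the greedy policy simply picks the highest-valued action in a state, and the value of a state under that policy is its best Q-value.

import numpy as np

# Hypothetical Q-table: 5 states x 4 actions, filled with example numbers
q_table = np.random.rand(5, 4)

def greedy_policy(state):
    # pi(s): choose the action with the highest Q-value in this state
    return int(np.argmax(q_table[state]))

def state_value(state):
    # V(s) under the greedy policy: the best achievable Q-value in this state
    return float(np.max(q_table[state]))

def action_value(state, action):
    # Q(s, a): the estimated long-term return for taking `action` in `state`
    return float(q_table[state, action])

print(greedy_policy(0), state_value(0), action_value(0, 2))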

How reinforcement learning works

The reinforcement learning process is a continuous cycle of interaction and learning. It's like a conversation between the agent and the environment, where the agent learns to make better decisions based on the feedback received.

To illustrate this process, let's consider a simple Q-learning implementation for a grid-based environment. The code demonstrates how an agent can learn to navigate a grid to reach a specific goal.

Simple Reinforcement Learning Code (Q-Learning) for Beginners

# Simple Reinforcement Learning Code (Q-Learning) for Beginners
import numpy as np
import random

# Environment Setup
class Environment:
    # ... (code for Environment class) ...

# Agent Setup
class Agent:
    # ... (code for Agent class) ...

# Training Process
def train_agent(episodes, grid_size, goal_position):
    # ... (code for train_agent function) ...

# Running the Training
grid_size = 4
goal_position = (3, 3)
episodes = 1000

trained_agent = train_agent(episodes, grid_size, goal_position)

# Testing the Trained Agent
# ... (code for testing the agent) ...

The code implements a simple reinforcement learning algorithm using Q-learning to train an agent in a grid-based environment. Here's a detailed breakdown of what each part does:

  • Environment Setup: The Environment class represents a simple grid world. It has a specified size (e.g., a 4x4 grid), and there is a goal position that the agent tries to reach. The environment can reset to start a new episode, and it responds to the agent's actions by updating its position and providing rewards. Each step that does not reach the goal gives a small negative reward (-0.01) to encourage finding the shortest path, and reaching the goal gives a reward of +1.
  • Agent Setup: The Agent class uses a Q-table to learn the best action for each position (state) in the grid. The Q-table stores values for each (state, action) pair, which helps the agent decide which action is best for each state. The agent chooses actions using an epsilon-greedy strategy: it either explores a random action or exploits the best-known action based on the Q-table. Using the Q-learning update rule, the Q-value is updated based on the reward and the best possible future value.
  • Training Process: The train_agent function trains the agent for a given number of episodes. In each episode, the agent starts at the top-left corner (0, 0) and interacts with the environment until it reaches the goal. The agent learns by updating the Q-table based on its experience.
  • Testing the Trained Agent: After training, the code tests the agent to see if it can consistently reach the goal. The agent starts from the initial state and moves according to the learned Q-table, aiming to reach the goal at (3, 3).

Now, let's delve into the step-by-step process of how this reinforcement learning algorithm works:

Step 1 (Initialization): This is where the learning journey begins. The agent is placed in the environment in a starting state. In our code, this is done by calling env.reset(), which places the agent at the top-left corner of the grid.

state = env.reset()  # Initialize state

Step 2 (Action Selection): Now, the agent needs to decide what to do. It uses its current policy to select an action. In this case, the agent uses an epsilon-greedy strategy, choosing between exploration (random action) and exploitation (best action based on the Q-table).

action = agent.choose_action(state)

Step 3 (Environment Response): Once the agent takes an action, the environment responds by transitioning to a new state and providing a reward. The env.step(action) function in the code handles this, updating the agent's position and returning the new state, reward, and a flag indicating if the episode is done.

next_state, reward, done = env.step(action)

Step 4 (Policy Update): The agent uses the reward signal to update its policy, learning to favor actions that lead to higher rewards. In this Q-learning implementation, the agent.update_q_value function updates the Q-table based on the observed reward and the estimated future value.

agent.update_q_value(state, action, reward, next_state)

Step 5 (Iteration): Steps 2 to 4 are repeated in a loop, allowing the agent to learn and improve its policy continuously. The for loop in the train_agent function iterates through a specified number of episodes, and the while loop within each episode continues until the agent reaches the goal.

for episode in range(episodes):
    state = env.reset()
    done = False

    while not done:
        # ... (action selection, environment response, policy update) ...
        state = next_state

This feedback loop is the essence of reinforcement learning. It allows the agent to learn from experiences and adapt its behavior to achieve its objectives in a complex and dynamic environment.

Here is the entire code:

# Simple Reinforcement Learning Code (Q-Learning) for Beginners
import numpy as np
import random

# Environment Setup
class Environment:
    def __init__(self, grid_size, goal_position):
        self.grid_size = grid_size  # Size of the grid (e.g., 4 means a 4x4 grid)
        self.goal_position = goal_position  # Goal position (row, col)

    def reset(self):
        # Start the agent in the top-left corner
        self.agent_position = (0, 0)
        return self.agent_position

    def step(self, action):
        # Possible actions: 0 = Up, 1 = Down, 2 = Left, 3 = Right
        row, col = self.agent_position

        if action == 0 and row > 0:  # Up
            row -= 1
        elif action == 1 and row < self.grid_size - 1:  # Down
            row += 1
        elif action == 2 and col > 0:  # Left
            col -= 1
        elif action == 3 and col < self.grid_size - 1:  # Right
            col += 1

        self.agent_position = (row, col)

        # Reward for reaching the goal
        if self.agent_position == self.goal_position:
            reward = 1
            done = True
        else:
            reward = -0.01  # Small negative reward for each step to encourage finding the shortest path
            done = False

        return self.agent_position, reward, done

# Agent Setup
class Agent:
    def __init__(self, grid_size, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.grid_size = grid_size
        self.q_table = np.zeros((grid_size, grid_size, 4))  # Q-value for each state-action pair
        self.epsilon = epsilon  # Exploration rate
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor

    def choose_action(self, state):
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < self.epsilon:
            return random.randint(0, 3)  # Explore: Choose a random action
        else:
            row, col = state
            return np.argmax(self.q_table[row, col])  # Exploit: Choose the best action based on Q-table

    def update_q_value(self, state, action, reward, next_state):
        # Q-value update using the Q-learning formula
        row, col = state
        next_row, next_col = next_state
        best_next_action = np.max(self.q_table[next_row, next_col])
        current_q_value = self.q_table[row, col, action]

        # Update the Q-table value for the current state and action
        self.q_table[row, col, action] = current_q_value + self.alpha * (
            reward + self.gamma * best_next_action - current_q_value
        )

# Training Process
def train_agent(episodes, grid_size, goal_position):
    env = Environment(grid_size, goal_position)
    agent = Agent(grid_size)

    for episode in range(episodes):
        state = env.reset()  # Initialize state
        done = False

        while not done:
            # Agent selects an action
            action = agent.choose_action(state)

            # Environment responds to the action
            next_state, reward, done = env.step(action)

            # Update the agent's policy using the reward and next state
            agent.update_q_value(state, action, reward, next_state)

            # Move to the next state
            state = next_state

        # Print progress every 100 episodes
        if (episode + 1) % 100 == 0:
            print(f"Episode {episode + 1}/{episodes} completed")

    return agent

# Running the Training
grid_size = 4
goal_position = (3, 3)  # The goal is in the bottom-right corner
episodes = 1000

trained_agent = train_agent(episodes, grid_size, goal_position)

# Testing the Trained Agent
print("\nTesting the trained agent...")
env = Environment(grid_size, goal_position)
state = env.reset()
done = False
steps = 0
trained_agent.epsilon = 0  # Act greedily during testing (no random exploration)

while not done:
    action = trained_agent.choose_action(state)
    state, reward, done = env.step(action)
    steps += 1
    print(f"Step {steps}: Moved to {state}")

print("Goal reached!")

Types of reinforcement learning

Reinforcement learning algorithms can be broadly classified into two main types based on how they interact with the environment:

1. Model-free reinforcement learning

In model-free RL, the agent learns directly from its experiences interacting with the environment without building an explicit model of how the environment works. It's like learning to ride a bike by trial and error without understanding the physics of balance and motion.

Characteristics of model-free reinforcement learning:

  • Direct Learning: The agent learns through direct interaction, observing states, taking actions, and receiving rewards.
  • No Environment Model: There's no internal representation or model of the environment's dynamics.
  • Trial-and-Error Focus: Learning is primarily driven by trial and error, with the agent exploring different actions and learning from the consequences.

2. Model-based reinforcement learning

In model-based RL, the agent first tries to learn a model of the environment. This model captures the relationships between states, actions, and rewards. Once the model is learned, the agent can use it to plan its actions and predict future outcomes. It's like learning to ride a bike by first studying the physics of balance and then using that knowledge to guide your actions.

[Image omitted. Source: Paper]

Characteristics of model-based reinforcement learning:

  • Environment Model: The agent builds an internal representation of the environment's dynamics.
  • Planning: The agent can use the model to simulate different actions and their potential consequences, allowing it to plan ahead and make more informed decisions.
  • Sample Efficiency: Model-based methods can often learn with fewer interactions with the environment because they can use the model to generate simulated experience.

Comparing the two approaches feature by feature:

  • Environment knowledge: model-free RL uses no explicit model; model-based RL learns a model.
  • Learning approach: model-free RL relies on direct interaction and trial and error; model-based RL relies on planning and prediction.
  • Sample efficiency: model-free RL can require a lot of data; model-based RL can be more sample-efficient.
  • Complexity: model-free RL is generally simpler to implement; model-based RL is more complex.
  • Adaptability: model-free RL can adapt to changes in the environment through experience; model-based RL may need to re-learn the model if the environment changes significantly.

The choice between model-free and model-based RL depends on the specific application and the characteristics of the environment. In some cases, a combination of both approaches might be most effective. Before we discuss the use cases of reinforcement learning, let’s explore its core algorithms.

Core algorithms in reinforcement learning

Reinforcement learning encompasses a variety of algorithms. Some of the most notable include:

  1. Q-Learning:

Q-learning is a model-free, value-based algorithm that learns an action-value function (Q-function) to estimate the expected cumulative reward for taking a given action in a given state.

It updates these Q-values using the Bellman equation, which essentially links the value of the current state-action pair to the expected value of the next state and the immediate reward received.

To balance exploration (trying new actions) and exploitation (choosing actions known to be good), Q-learning is often combined with techniques like epsilon-greedy exploration.
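
The update itself fits in a few lines. Here is a minimal sketch of the rule, assuming a tabular setup where q_table maps each (hashable) state to a NumPy array of per-action values; alpha is the learning rate and gamma is the discount factor.

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    # Bellman-style target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * np.max(q_table[next_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error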

  2. Deep Q-Networks (DQN):

Deep Q-Networks (DQN) extend Q-learning by using deep neural networks to approximate the Q-function. This allows them to handle high-dimensional state spaces effectively, such as those found in games with visual input.

To enhance stability and convergence during the learning process, DQNs often incorporate techniques like experience replay, where past experiences are stored and reused for training, and target networks, which provide more stable learning targets.
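
The sketch below illustrates those two ideas (experience replay and a target network), assuming PyTorch is available; the network sizes, state dimension, and hyperparameters are placeholder values, and the surrounding environment loop is omitted.

import random
from collections import deque

import torch
import torch.nn as nn

# Q-network: maps a state vector to one Q-value per action (sizes are illustrative)
def make_q_net(state_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_net()
target_net = make_q_net()
target_net.load_state_dict(q_net.state_dict())  # Target network starts as a copy

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)  # Experience replay: stores (s, a, r, s', done) tuples
gamma = 0.99

def dqn_update(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(list(replay_buffer), batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.tensor(next_states, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Current estimates Q(s, a) from the online network
    q_values = q_net(states).gather(1, actions).squeeze(1)
    # Targets use the frozen target network for more stable learning
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones)

    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically sync the target network with the online network, e.g. every N updates:
# target_net.load_state_dict(q_net.state_dict())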

  3. State-Action-Reward-State-Action:

State-Action-Reward-State-Action (SARSA) is a model-free, value-based algorithm similar to Q-learning. However, it distinguishes itself by being an on-policy algorithm, meaning it learns the value of the policy it is currently following. This is achieved by updating Q-values based on the action taken in the next state according to the current policy rather than the greedy action.

[Image omitted. Source: Paper]

This characteristic makes SARSA useful when it's important to learn a safe policy (a policy that avoids actions that could lead to undesirable outcomes). Since SARSA considers the actual action it will take in the next state, it's less likely to explore potentially dangerous actions, making it a safer and more cautious learner compared to some off-policy methods.
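
The difference from Q-learning comes down to one line of the update. A minimal sketch, assuming the same kind of tabular q_table as in the grid-world example above:

def sarsa_update(q_table, state, action, reward, next_state, next_action, alpha=0.5, gamma=0.9):
    # On-policy target: bootstrap from the action the current policy actually takes next
    td_target = reward + gamma * q_table[next_state][next_action]
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    # Q-learning would instead bootstrap from the greedy maximum over q_table[next_state],
    # regardless of which action the (possibly exploratory) policy ends up choosing.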

  4. Policy Gradient Methods:

Policy gradient methods directly learn a policy with the goal of finding the optimal policy that maximizes cumulative reward. They achieve this by updating the policy parameters through estimations of the gradient of the expected reward.
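
As a concrete illustration, here is a minimal sketch of REINFORCE, the simplest policy gradient method, assuming PyTorch; the network sizes, the 4-dimensional state, the 2 actions, and the way the episode is collected are all placeholder assumptions.

import torch
import torch.nn as nn

# Policy network: maps a state vector to action probabilities (sizes are illustrative)
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    # One update over a single collected episode
    # Discounted return G_t for every timestep, computed backwards through the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    states = torch.tensor(states, dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)

    # Gradient ascent on expected return = gradient descent on -log pi(a|s) * G_t
    probs = policy(states)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    loss = -(log_probs * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()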

  5. Actor-Critic Methods:

Actor-critic methods combine value-based and policy-based approaches: an actor (the policy) selects actions, while a critic (a value function) estimates the value of those actions or the states they lead to. The critic's estimates provide feedback to the actor, guiding it toward more effective learning and ultimately better decision-making.
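
A minimal one-step actor-critic sketch, again assuming PyTorch and placeholder network sizes: the critic's temporal-difference (TD) error both trains the critic and tells the actor whether the chosen action turned out better or worse than expected.

import torch
import torch.nn as nn

# Actor outputs action probabilities; critic outputs a scalar state value (sizes illustrative)
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

def actor_critic_update(state, action, reward, next_state, done):
    # One-step update on a single transition
    state = torch.tensor(state, dtype=torch.float32)
    next_state = torch.tensor(next_state, dtype=torch.float32)

    value = critic(state).squeeze()
    next_value = critic(next_state).squeeze().detach()
    # TD error: how much better (or worse) things went than the critic expected
    td_error = reward + gamma * next_value * (1 - done) - value

    critic_loss = td_error.pow(2)                # Critic: reduce its prediction error
    log_prob = torch.log(actor(state)[action])
    actor_loss = -log_prob * td_error.detach()   # Actor: reinforce actions with positive TD error

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()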

  6. Proximal Policy Optimization (PPO):

Proximal Policy Optimization (PPO) is a popular policy gradient method that enhances stability by constraining policy updates, preventing drastic changes that could hinder learning. This approach has proven effective across various tasks, achieving good performance while remaining relatively easy to implement.
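
The constraint is usually implemented as a clipped surrogate loss. Here is a minimal sketch of just that term, assuming PyTorch tensors of log-probabilities (from the old and updated policies) and advantage estimates are already available:

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Clipped surrogate objective from PPO (to be minimized)
    # Probability ratio between the updated policy and the policy that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum keeps the update conservative: large policy changes gain nothing extra
    return -torch.min(unclipped, clipped).mean()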

  7. Dyna-Q:

Dyna-Q bridges the gap between model-free and model-based learning by learning both a Q-function (similar to Q-learning) and a model of the environment. The learned model is then used to generate simulated experiences, which are used to further refine and update the Q-function, leading to more efficient and effective learning.
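
A minimal sketch of one Dyna-Q step, assuming a tabular q_learning_update like the one sketched in the Q-learning section above and hashable states:

import random

model = {}  # Learned model: (state, action) -> (reward, next_state)

def dyna_q_step(q_table, state, action, reward, next_state, planning_steps=10):
    # 1. Direct RL: learn from the real transition
    q_learning_update(q_table, state, action, reward, next_state)
    # 2. Model learning: remember what the environment did for this state-action pair
    model[(state, action)] = (reward, next_state)
    # 3. Planning: replay randomly chosen remembered transitions as simulated experience
    for _ in range(planning_steps):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        q_learning_update(q_table, s, a, r, s_next)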

  8. Monte Carlo Tree Search (MCTS):

Monte Carlo Tree Search (MCTS) is a tree search algorithm frequently used in model-based reinforcement learning. It uses a model of the environment to simulate potential outcomes, effectively guiding the search for optimal actions.

MCTS can be used in games with large branching factors, where the sheer number of possible actions at each step makes it computationally impractical to explore the entire game tree.
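
MCTS repeatedly selects a path through the tree, expands a node, simulates a rollout, and backs up the result. The selection step commonly uses the UCT score sketched below to balance exploiting high-value children against exploring rarely visited ones; this is only the scoring rule, not a full search implementation.

import math

def uct_score(child_total_value, child_visits, parent_visits, c=1.41):
    # Upper Confidence bound for Trees: average value plus an exploration bonus
    if child_visits == 0:
        return float("inf")  # Always try unvisited children first
    exploit = child_total_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children_stats, parent_visits):
    # children_stats: list of (action, total_value, visits) tuples
    return max(children_stats, key=lambda s: uct_score(s[1], s[2], parent_visits))[0]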

This list covers some of the most common and influential algorithms in reinforcement learning. The choice of algorithm depends on the specific problem, the nature of the environment, and the desired balance between exploration and exploitation.

Practical applications of reinforcement learning

Reinforcement learning is used in various real-world applications. Here are some notable examples:

  1. Gaming

In gaming AI, RL enables the creation of agents that can master complex games at superhuman levels. A prime example is DeepMind's AlphaGo, which defeated a world champion Go player, showcasing the potential of RL in mastering games with enormous search spaces.

Beyond board games, RL is being applied to video games across various genres, from first-person shooters to strategy games, resulting in more challenging and dynamic opponents. Furthermore, RL can be utilized to generate game content like levels and quests, leading to richer and more unpredictable gaming experiences.

  2. Robotics

Using RL, robots can learn complex tasks through trial and error, much like humans do. This includes tasks like grasping objects, navigating cluttered environments, assembling products, and even performing delicate surgical procedures.

RL enhances their robustness and versatility by enabling robots to adapt to the uncertainties and dynamic nature of real-world environments. Furthermore, RL facilitates safer and more effective human-robot collaboration by allowing robots to learn optimal interaction strategies.

  3. Finance

RL algorithms can be trained to analyze market data, identify patterns, and make trading decisions in real-time, potentially leading to improved returns and reduced risk.

Furthermore, RL can optimize investment portfolios by dynamically allocating assets based on evolving market conditions and investor preferences. Its ability to learn patterns of normal and abnormal behavior also makes RL a valuable tool for identifying fraudulent transactions.

  4. Healthcare

Reinforcement learning holds great promise for advancing healthcare by personalizing treatment plans based on individual patient characteristics and medical history. It can also accelerate drug discovery by simulating the effects of different drug candidates on biological systems, enabling researchers to identify promising leads more efficiently.

Furthermore, RL can be applied to control prosthetic limbs, allowing for more natural and intuitive movements, and to personalize rehabilitation programs for patients recovering from injuries.

These are just a few examples of how reinforcement learning is being applied in the real world.

Conclusion

Reinforcement learning offers a powerful and dynamic approach to machine learning, enabling agents to learn through interaction and feedback. By understanding its core components, algorithms, and applications, we can create intelligent systems capable of solving complex challenges in various domains.

CUDO Compute is the ideal platform for your reinforcement learning development. Access top-tier NVIDIA GPUs, like the NVIDIA H200, at affordable pricing to train your agents, simulate complex environments, and easily deploy your agents. Click here to get started, or contact us.
