OpenAI-Gym-Projects

OpenAI Gym environment solutions using Deep Reinforcement Learning.

Gym

Gym is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, Gym's API has become the field standard for doing this.
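
In practice that API is a small reset/step loop. A minimal sketch of the loop with a random policy on CartPole-v1, using the classic pre-0.26 Gym interface (newer Gym/Gymnasium versions return a 5-tuple from step):

```python
import gym

# Create an environment and run one episode with a random policy.
env = gym.make("CartPole-v1")
state = env.reset()
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()             # replace with the agent's policy
    state, reward, done, info = env.step(action)   # classic 4-tuple Gym API
    episode_return += reward
env.close()
print(f"Episode return: {episode_return}")
```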

Gym environments solved:

Classic Control

Control theory problems from the classic RL literature.

CartPole-v1

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity. A reward of +1 is provided for every timestep that the pole remains upright, and the maximum number of steps per episode is 500; hence, a perfect agent achieves a return of +500 every episode.

Solution using Proximal Policy Optimization (PPO)
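
The repository's training code is not reproduced here; as a hedged illustration, the heart of PPO is the clipped surrogate policy loss, which could be computed roughly as follows (PyTorch is assumed, and `log_probs`, `old_log_probs` and `advantages` are hypothetical rollout tensors):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss from the PPO paper (to be minimized)."""
    ratio = torch.exp(log_probs - old_log_probs)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```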


Agent after 1250 episodes

MountainCar-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. A reward of -1 is provided for every timestep until the goal is reached or 200 timesteps have passed.

Solution using Double Dueling Deep Q Learning (DQN) with Prioritized Experience Replay
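
The exact buffer used in this project is not shown here; a minimal sketch of proportional prioritized sampling with importance-sampling weights (NumPy assumed; a production version would use a sum-tree) looks roughly like this:

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized experience replay sketch."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def add(self, transition):
        max_p = max(self.priorities, default=1.0)   # new samples get max priority
        self.buffer.append(transition)
        self.priorities.append(max_p)
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size, beta=0.4):
        scaled = np.array(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        weights = (len(self.buffer) * probs[idx]) ** (-beta)     # importance-sampling weights
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors, eps=1e-5):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + eps
```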


Agent after 4000 episodes

MountainCarContinuous-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. Here, the reward is greater if you spend less energy to reach the goal.

Solution using Soft Actor-Critic (SAC) with Prioritized Experience Replay
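
As a hedged sketch of the idea behind SAC (not this repository's code): the critic targets are entropy-regularized, mixing the minimum of two target critics with a log-probability bonus. Assuming PyTorch and hypothetical tensors `target_q1`, `target_q2` and `next_log_prob`:

```python
import torch

def soft_q_target(rewards, dones, target_q1, target_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """Entropy-regularized TD target: r + gamma * (min Q' - alpha * log pi)."""
    min_q = torch.min(target_q1, target_q2)
    soft_value = min_q - alpha * next_log_prob
    return rewards + gamma * (1.0 - dones) * soft_value
```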


Agent after 700 episodes

Acrobot-v1

The acrobot system includes two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height. A reward of -1 is provided for every timestep until the goal is reached or 500 timesteps have passed.

Solution using Double Dueling Deep Q Learning (DQN) with Prioritized Experience Replay
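
For reference, the "Double" part of the algorithm decouples action selection from evaluation: the online network picks the next action and the target network scores it. A minimal sketch, assuming PyTorch and hypothetical `online_net`/`target_net` modules:

```python
import torch

def double_dqn_target(rewards, dones, next_states, online_net, target_net, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal transitions."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # select with online net
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluate with target net
    return rewards + gamma * (1.0 - dones) * next_q
```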


Agent after 10000 episodes

Pendulum-v0

The inverted pendulum swing-up problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so that it stays upright. The reward is a negative cost that penalizes deviation from the upright position, high angular velocity, and large applied torque.

Solution using Deep Deterministic Policy Gradient (DDPG) with Prioritized Experience Replay
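
A hedged sketch of DDPG's actor update (not this repository's code): the deterministic actor is trained to maximize the critic's value of its own actions, so its loss is the negative mean Q. PyTorch and a hypothetical `critic(states, actions)` interface are assumed:

```python
import torch

def ddpg_actor_loss(states, actor, critic):
    """Deterministic policy gradient: maximize Q(s, pi(s)) by minimizing its negative."""
    actions = actor(states)                  # deterministic (e.g. tanh-squashed) actions
    return -critic(states, actions).mean()   # hypothetical critic interface
```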


Agent after 750 episodes

Box2D

Continuous control tasks in the Box2D simulator.

LunarLander-v2

Navigate the lander to its landing pad. The landing pad is always at coordinates (0,0), and those coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg with ground contact is worth +10 points, and firing the main engine costs -0.3 points each frame. The environment is considered solved at 200 points. Landing outside the landing pad is possible, and fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine.

Solution using Double Dueling Deep Q Learning (DQN)
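
The "Dueling" part of the architecture splits the Q-network into a state-value stream and an advantage stream and recombines them. A minimal PyTorch-style sketch with hypothetical layer sizes:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream

    def forward(self, state):
        x = self.feature(state)
        value = self.value(x)
        advantage = self.advantage(x)
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```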


Agent after 1500 episodes

LunarLanderContinuous-v2

Navigate the lander to its landing pad. The landing pad is always at coordinates (0,0), and those coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg with ground contact is worth +10 points, and firing the main engine costs -0.3 points each frame. The environment is considered solved at 200 points. Landing outside the landing pad is possible, and fuel is infinite, so an agent can learn to fly and then land on its first attempt. The action is a vector of two real values in [-1, +1]. The first value controls the main engine: -1..0 is off, and 0..+1 maps to throttle from 50% to 100% power (the engine cannot run at less than 50% power). The second value fires the left engine in -1.0..-0.5, the right engine in +0.5..+1.0, and is off in -0.5..0.5.
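
Purely as an illustration of the action mapping described above (the environment performs this decoding internally; `decode_action` is a hypothetical helper):

```python
def decode_action(action):
    """Interpret a LunarLanderContinuous-v2 action [main, lateral], each in [-1, 1]."""
    main, lateral = action
    if main <= 0.0:
        main_power = 0.0                  # main engine off
    else:
        main_power = 0.5 + 0.5 * main     # throttle from 50% to 100%
    if lateral <= -0.5:
        side = "left engine"
    elif lateral >= 0.5:
        side = "right engine"
    else:
        side = "off"
    return main_power, side

# decode_action([0.5, 0.7]) -> (0.75, "right engine")
```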

Solution using Soft Actor-Critic (SAC)


Agent after 1000 episodes

BipedalWalker-v3

Train a bipedal robot to walk. Reward is given for moving forward, totalling 300+ points up to the far end of the course. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more efficient agent gets a better score. The state consists of the hull angle and angular velocity, horizontal and vertical speed, joint positions and joint angular speeds, leg contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.

Solution using Soft Actor-Critic (SAC)


Agent after 1250 episodes

BipedalWalkerHardcore-v3

Train a bipedal robot to run through an obstacle course. The hardcore version of the course contains ladders, stumps, and pitfalls, and the time limit is increased to account for the obstacles. Reward is given for moving forward, totalling 300+ points up to the far end of the course. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more efficient agent gets a better score. The state consists of the hull angle and angular velocity, horizontal and vertical speed, joint positions and joint angular speeds, leg contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.

Solution using Soft Actor-Critic (SAC)


Agent after 15500 episodes

100 episode performance evaluation

Reward: 305.40 ± 21.35
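
The evaluation figures in this README appear to be the mean and standard deviation of the undiscounted episode return over 100 episodes. A hedged sketch of such an evaluation loop, assuming the classic Gym step API and a hypothetical `agent.act` interface:

```python
import gym
import numpy as np

def evaluate(env_id, agent, episodes=100):
    """Return mean and std of the undiscounted episode return over `episodes` runs."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(state)                   # greedy / deterministic action
            state, reward, done, _ = env.step(action)   # classic 4-tuple Gym API
            total += reward
        returns.append(total)
    env.close()
    return np.mean(returns), np.std(returns)

# e.g. mean, std = evaluate("BipedalWalkerHardcore-v3", trained_agent)
```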

MuJoCo

Continuous control tasks, running in a fast physics simulator.

InvertedPendulum-v2

An inverted pendulum that must be balanced by a cart. The agent gets a reward of +1 for every timestep that the pendulum remains upright, up to a maximum return of +1000 per episode.

Solution using Proximal Policy Optimization (PPO)


Agent after 1400 episodes

100 episode performance evaluation

Reward: 920.71 ± 224.04

InvertedDoublePendulum-v2

Balance a pole on top of another pole on a cart. The agent gets a reward for every timestep that the double pendulum has not fallen over.

Solution using Soft Actor-Critic (SAC)


Agent after 1000 episodes

100 episode performance evaluation

Reward: 9359.88 ± 0.08

Reacher-v2

A 2D robot trying to reach a randomly located target. The further the robot is from the target location, the more negative the reward it receives.

Solution using Soft Actor-Critic (SAC)


Agent after 650 episodes

100 episode performance evaluation

Reward: -4.75 ± 1.67

HalfCheetah-v2

Make a 2D cheetah robot run. The further the robot travels forward, the more positive reward it receives.

Solution using Deep Deterministic Policy Gradient (DDPG)


Agent after 600 episodes

100 episode performance evaluation

Reward: 10374.65 ± 202.81

Hopper-v2

A 2D robot that learns to hop. The further it travels forward, the more positive reward it receives.

Solution using Soft Actor-Critic (SAC)


Agent after 1450 episodes

100 episode performance evaluation

Reward: 3625.53 ± 9.00

Walker2d-v2

A 2D robot that learns to walk. The further it travels forward, the more positive reward it receives.

Solution using Soft Actor-Critic (SAC)


Agent after 3250 episodes

100 episode performance evaluation

Reward: 5317.38 ± 15.86

Robotics

Simulated goal-based tasks for the Fetch and ShadowHand robots.

FetchReach-v1

Move Fetch's end effector to the goal position. A goal position is randomly chosen in 3D space, and the agent must control Fetch's end effector to reach it as quickly as possible. A negative reward is given at every timestep that the end effector has not reached the goal position.
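
Goal-based Gym robotics environments return a dict observation with observation, achieved_goal and desired_goal keys, which an off-policy agent such as SAC typically flattens before feeding its networks. A minimal sketch (whether to include achieved_goal is a design choice, not something specified here):

```python
import numpy as np

def flatten_goal_obs(obs_dict):
    """Concatenate a goal-based observation dict into a single flat state vector."""
    return np.concatenate([obs_dict["observation"], obs_dict["desired_goal"]])

# obs_dict = env.reset()           # FetchReach-v1 returns such a dict
# state = flatten_goal_obs(obs_dict)
```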

Solution using Soft Actor-Critic (SAC) with Prioritized Experience Replay


Agent after 2000 episodes

100 episode performance evaluation

Reward: -1.78 ± 0.88

Atari

Retro Atari video game environments.

Pong-v5

Pong is a table tennis–themed arcade sports video game. You control the right paddle and compete against the left paddle, which is controlled by the computer. Each side tries to deflect the ball away from its own goal and into the opponent's goal. You score a point when the ball passes the opponent's paddle and lose a point when the ball passes yours. An episode ends when one side reaches 21 points, so the maximum episode reward is +21.
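
Atari agents usually learn from preprocessed frames rather than raw RGB screens. A hedged sketch using Gym's standard wrappers (grayscale 84x84 frames, frame skipping and frame stacking); the exact preprocessing used in this repository is not shown, and the ALE environment id can differ between Gym versions:

```python
import gym

# Assumes a Gym version with ale-py installed; "ALE/Pong-v5" is the v5 ALE id.
env = gym.make("ALE/Pong-v5", frameskip=1)                 # disable built-in frame skip
env = gym.wrappers.AtariPreprocessing(env, frame_skip=4)   # grayscale, resize to 84x84
env = gym.wrappers.FrameStack(env, num_stack=4)            # stack the last 4 frames
print(env.observation_space.shape)                         # (4, 84, 84)
```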

Solution using Double Dueling Deep Q Learning (DQN) with Prioritized Experience Replay


Agent after 2000 episodes

100 episode performance evaluation

Reward: 21.00 ± 0.00
