Models

A model encapsulates two PyTorch networks (a body and a head).
It defines how actions are sampled from the network and how the network is trained.
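As a rough, illustrative sketch (plain PyTorch, not the torchrl API), the body extracts features from the state and the head maps those features to the quantity the model needs:

import torch
import torch.nn as nn

# Illustrative only: a feature-extracting body followed by a policy head.
body = nn.Sequential(nn.Linear(4, 64), nn.ReLU())
head = nn.Linear(64, 2)
network = nn.Sequential(body, head)

state = torch.randn(1, 4)       # a batch containing a single environment state
action_logits = network(state)  # shape: (1, 2), one logit per action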

BaseModel

class torchrl.models.BaseModel(model, batcher, *, cuda_default=True)[source]

Bases: torchrl.nn.container.ModuleExtended, abc.ABC

Basic TorchRL model. Takes a PyTorch network and a batcher, and defines the training machinery shared by all models.

Parameters:
  • model (nn.Module) – A PyTorch model.
  • batcher (torchrl.batcher) – A torchrl batcher.
  • num_epochs (int) – How many times to train over the entire dataset (Default is 1).
  • num_mini_batches (int) – How many mini-batches to subset the batch (Default is 1, so all the batch is used at once).
  • opt_fn (torch.optim) – The optimizer reference function (the constructor, not the instance) (Default is Adam).
  • opt_params (dict) – Parameters for the optimizer (Default is empty dict).
  • clip_grad_norm (float) – Max norm of the gradients; if float('inf'), no clipping is done (Default is float('inf')).
  • loss_coef (float) – Used when sharing networks, should balance the contribution of the grads of each model.
  • cuda_default (bool) – If True and cuda is supported, use it (Default is True).
batch_keys

The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, since it ensures only the requested keys are sampled.

register_losses

Append losses to self.losses; these losses are used at optimizer_step() to calculate the gradients.

Parameters:batch (dict) – The batch should contain all the information necessary to compute the gradients.
static output_layer(input_shape, action_info)[source]

The final layer of the model; it will be appended to the model head.

Parameters:
  • input_shape (int or tuple) – The shape of the input to this layer.
  • action_info (dict) – Dictionary containing information about the action space.

Examples

The output of most PG models has the same dimension as the action, but the output of value models is rank 1. This method is where that output shape is defined.
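As a hedged sketch (not the torchrl implementation; the action_info keys below are assumed), such a layer could look like:

import torch.nn as nn

def output_layer(input_shape, action_info):
    # Illustrative: a policy head sized to the action space for discrete
    # actions, a single output unit for a value head.
    if action_info.get("space") == "discrete":     # assumed key names
        return nn.Linear(input_shape, action_info["n_actions"])
    return nn.Linear(input_shape, 1)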

forward(x)[source]

Defines the computation performed at every call.

Parameters:x (numpy.ndarray) – The environment state.
attach_logger(logger)[source]

Register a logger with this model.

Parameters:logger (torchrl.utils.logger) – The logger to register.
write_logs(batch)[source]

Write logs to the terminal and to a tf log file.

Parameters:batch (Batch) – Some logs might need the batch for calculation.
classmethod from_config(config, batcher=None, body=None, head=None, **kwargs)[source]

Creates a model from a configuration file.

Parameters:
  • config (Config) – Should contain at least a network definition (nn_config section).
  • env (torchrl.envs) – A torchrl environment (Default is None, in which case the environment must be defined in the config).
  • kwargs (key-word arguments) – Extra arguments that will be passed to the class constructor.
Returns: A TorchRL model.

Return type: torchrl.models
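A hedged usage sketch; the Config construction and the nn_config layout below are assumed placeholders, not the documented API:

from torchrl.models import ValueModel
from torchrl.utils import Config   # assumed import path for the Config class

# Hypothetical nn_config layout; the real section format is defined by torchrl.
config = Config(nn_config=dict(body=[dict(func="Linear", out_features=64)]))
model = ValueModel.from_config(config, batcher=batcher)  # batcher built elsewhere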

ValueModel

class torchrl.models.ValueModel(model, batcher, **kwargs)[source]

Bases: torchrl.models.base_model.BaseModel

A standard regression model; it can be used to estimate state values or Q-values.

Parameters:clip_range (float) – Similar to PPOClip, limits the change between the new and old value function.
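As a hedged sketch of what a clip_range-style value loss can look like (illustrative, not the library's exact loss):

import torch
import torch.nn.functional as F

def clipped_value_loss(new_values, old_values, returns, clip_range=0.2):
    # Illustrative PPO-style clipping: limit how far the new value estimate
    # may move from the old one, then take the pessimistic (larger) error.
    clipped = old_values + (new_values - old_values).clamp(-clip_range, clip_range)
    unclipped_loss = F.mse_loss(new_values, returns, reduction="none")
    clipped_loss = F.mse_loss(clipped, returns, reduction="none")
    return torch.max(unclipped_loss, clipped_loss).mean()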
batch_keys

The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, since it ensures only the requested keys are sampled.

register_losses()[source]

Append losses to self.losses; these losses are used at optimizer_step() to calculate the gradients.

Parameters:batch (dict) – The batch should contain all the information necessary to compute the gradients.
write_logs(batch)[source]

Write logs to the terminal and to a tf log file.

Parameters:batch (Batch) – Some logs might need the batch for calculation.
static output_layer(input_shape, action_info)[source]

The final layer of the model; it will be appended to the model head.

Parameters:
  • input_shape (int or tuple) – The shape of the input to this layer.
  • action_info (dict) – Dictionary containing information about the action space.

Examples

The output of most PG models has the same dimension as the action, but the output of value models is rank 1. This method is where that output shape is defined.

BasePGModel

class torchrl.models.BasePGModel(model, batcher, *, entropy_coef=0, **kwargs)[source]

Bases: torchrl.models.base_model.BaseModel

Base class for all Policy Gradient Models.

entropy_loss(batch)[source]

Adds an entropy cost to the loss function, with the intent of encouraging exploration.

Parameters:batch (Batch) – The batch should contain all the information necessary to compute the gradients.
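A minimal sketch of such an entropy term, assuming a torch.distributions policy distribution (illustrative):

def entropy_loss(dist, entropy_coef=0.01):
    # Illustrative: negate the entropy bonus so that minimizing the loss
    # pushes the policy toward higher entropy (more exploration).
    return -entropy_coef * dist.entropy().mean()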
create_dist(parameters)[source]

Specify how the policy distributions should be created. The type of the distribution depends on the environment.

Parameters:
  • parameters (np.array) – The parameters used to create a distribution (continuous or discrete, depending on the type of the environment).
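A hedged sketch of how such a method might build the distribution (illustrative; how torchrl encodes the action space is assumed):

from torch import distributions

def create_dist(parameters, discrete=True):
    # Illustrative: discrete action spaces get a Categorical over logits,
    # continuous ones get a diagonal Normal parameterized by mean and log-std.
    if discrete:
        return distributions.Categorical(logits=parameters)
    mean, log_std = parameters.chunk(2, dim=-1)
    return distributions.Normal(mean, log_std.exp())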
write_logs(batch)[source]

Write logs to the terminal and to a tf log file.

Parameters:batch (Batch) – Some logs might need the batch for calculation.
static output_layer(input_shape, action_info)[source]

The final layer of the model; it will be appended to the model head.

Parameters:
  • input_shape (int or tuple) – The shape of the input to this layer.
  • action_info (dict) – Dictionary containing information about the action space.

Examples

The output of most PG models has the same dimension as the action, but the output of value models is rank 1. This method is where that output shape is defined.

static select_action(model, state, step)[source]

Define how the actions are selected; in this case, actions are sampled from a distribution whose parameters are given by a neural network.

Parameters:state (np.array) – The state of the environment (can be a batch of states).
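A minimal sketch of that sampling step (illustrative; the step argument is omitted and create_dist is assumed to behave as above):

import torch

def select_action(model, state):
    # Illustrative: run the state through the network, build the policy
    # distribution from its output, and sample an action.
    state = torch.as_tensor(state, dtype=torch.float32)
    parameters = model(state)
    dist = model.create_dist(parameters)
    return dist.sample()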

VanillaPGModel

class torchrl.models.VanillaPGModel(model, batcher, *, entropy_coef=0, **kwargs)[source]

Bases: torchrl.models.base_pg_model.BasePGModel

The classical Policy Gradient algorithm.

batch_keys

The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, since it ensures only the requested keys are sampled.

register_losses()[source]

Append losses to self.losses; these losses are used at optimizer_step() to calculate the gradients.

Parameters:batch (dict) – The batch should contain all the information necessary to compute the gradients.
pg_loss(batch)[source]

Compute loss based on the policy gradient theorem.

Parameters:batch (Batch) – The batch should contain all the information necessary to compute the gradients.
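A minimal sketch of that loss (illustrative):

def pg_loss(log_probs, advantages):
    # Illustrative REINFORCE-style objective: weight each action's
    # log-probability by its advantage, negated so gradient descent maximizes it.
    return -(log_probs * advantages).mean()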

A2CModel

class torchrl.models.A2CModel(model, batcher, *, entropy_coef=0, **kwargs)[source]

Bases: torchrl.models.vanilla_pg_model.VanillaPGModel

A2C is a parallel implementation of the actor-critic algorithm.

To reproduce A2C, create a list of environments and pass it to torchrl.envs.ParallelEnv, as in the sketch below.
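A hedged usage sketch (the environment constructor name is assumed; ParallelEnv is the documented entry point):

from torchrl.envs import ParallelEnv, GymEnv   # GymEnv is an assumed name

# Illustrative: A2C is the actor-critic model run over a batch of
# environments that are stepped in parallel.
envs = [GymEnv("CartPole-v1") for _ in range(16)]
env = ParallelEnv(envs)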

SurrogatePGModel

class torchrl.models.SurrogatePGModel(model, batcher, *, entropy_coef=0, **kwargs)[source]

Bases: torchrl.models.base_pg_model.BasePGModel

Instead of the vanilla policy-gradient objective, the Surrogate Policy Gradient algorithm maximizes a “surrogate” objective, given by:

\[L^{CPI}({\theta}) = \hat{E}_t \left[\frac{\pi_{\theta}(a|s)} {\pi_{\theta_{old}}(a|s)} \hat{A} \right ]\]
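In code, the probability ratio is usually computed from stored log-probabilities; a minimal sketch (illustrative):

import torch

def surrogate_pg_loss(new_log_probs, old_log_probs, advantages):
    # Illustrative: pi_new / pi_old, computed in log space for stability,
    # then weighted by the advantage estimate and negated to form a loss.
    prob_ratio = torch.exp(new_log_probs - old_log_probs)
    return -(prob_ratio * advantages).mean()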
batch_keys

The batch keys needed for computing all losses. This reduces overhead when sampling from a dataloader, since it ensures only the requested keys are sampled.

register_losses()[source]

Append losses to self.losses; these losses are used at optimizer_step() to calculate the gradients.

Parameters:batch (dict) – The batch should contain all the information necessary to compute the gradients.
surrogate_pg_loss(batch)[source]

The surrogate policy-gradient loss, as described above.

Parameters:batch (Batch) – The batch should contain all the information necessary to compute the gradients.
calculate_prob_ratio(new_log_probs, old_log_probs)[source]

Calculates the probability ratio between two policies.

Parameters:
  • new_log_probs – Log probabilities of the selected actions under the current policy.
  • old_log_probs – Log probabilities of the same actions under the old policy.
write_logs(batch)[source]

Write logs to the terminal and to a tf log file.

Parameters:batch (Batch) – Some logs might need the batch for calculation.

PPOClipModel

class torchrl.models.PPOClipModel(model, batcher, ppo_clip_range=0.2, **kwargs)[source]

Bases: torchrl.models.surrogate_pg_model.SurrogatePGModel

Proximal Policy Optimization with a clipped surrogate objective, as described in https://arxiv.org/pdf/1707.06347.pdf.

Parameters:
  • ppo_clip_range (float) – Clipping value for the probability ratio (Default is 0.2).
  • num_epochs (int) – How many times to train over the entire dataset (Default is 10).
register_losses()[source]

Append losses to self.losses; these losses are used at optimizer_step() to calculate the gradients.

Parameters:batch (dict) – The batch should contain all the information necessary to compute the gradients.
ppo_clip_loss(batch)[source]

Calculate the PPO Clip loss as described in the paper.

Parameters:batch (Batch) – The batch should contain all the information necessary to compute the gradients.
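A minimal sketch of the clipped surrogate objective from the paper (illustrative):

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    # Illustrative: take the minimum of the unclipped and clipped surrogate
    # objectives, then negate so it can be minimized.
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped_ratio = ratio.clamp(1 - clip_range, 1 + clip_range)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()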
write_logs(batch)[source]

Write logs to the terminal and to a tf log file.

Parameters:batch (Batch) – Some logs might need the batch for calculation.

PPOAdaptiveModel

class torchrl.models.PPOAdaptiveModel(model, batcher, *, kl_target=0.01, kl_penalty=1.0, **kwargs)[source]

Bases: torchrl.models.surrogate_pg_model.SurrogatePGModel

Proximal Policy Optimization with an adaptive KL penalty, as described in https://arxiv.org/pdf/1707.06347.pdf.

Parameters:
  • kl_target (float) – Target KL divergence between the old and new policies (Default is 0.01).
  • kl_penalty (float) – Initial coefficient of the KL penalty term (Default is 1.0).
  • num_epochs (int) – How many times to train over the entire dataset (Default is 10).
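A hedged sketch of the KL-penalized objective and the penalty adaptation rule from the paper (illustrative, not the library's implementation):

def kl_penalized_loss(surrogate, kl_divergence, kl_penalty):
    # Illustrative: subtract a KL penalty from the surrogate objective,
    # negated so it can be minimized.
    return -(surrogate - kl_penalty * kl_divergence)

def adapt_kl_penalty(kl_divergence, kl_target, kl_penalty):
    # Illustrative rule from the PPO paper: double the penalty when the KL
    # overshoots 1.5x the target, halve it when it undershoots target/1.5.
    if kl_divergence > 1.5 * kl_target:
        return kl_penalty * 2.0
    if kl_divergence < kl_target / 1.5:
        return kl_penalty / 2.0
    return kl_penalty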
register_losses()[source]

Append losses to self.losses; these losses are used at optimizer_step() to calculate the gradients.

Parameters:batch (dict) – The batch should contain all the information necessary to compute the gradients.
write_logs(batch)[source]

Write logs to the terminal and to a tf log file.

Parameters:batch (Batch) – Some logs might need the batch for calculation.