Models
BaseModel
-
class torchrl.models.BaseModel(model, batcher, *, cuda_default=True)
Bases: torchrl.nn.container.ModuleExtended, abc.ABC
Basic TorchRL model. Takes two Config objects that identify the body(ies) and head(s) of the model.
Parameters:
- model (nn.Module) – A PyTorch model.
- batcher (torchrl.batcher) – A torchrl batcher.
- num_epochs (int) – How many times to train over the entire dataset (Default is 1).
- num_mini_batches (int) – How many mini-batches the batch is split into (Default is 1, so the whole batch is used at once).
- opt_fn (torch.optim) – The optimizer constructor (the class, not an instance) (Default is Adam).
- opt_params (dict) – Parameters for the optimizer (Default is empty dict).
- clip_grad_norm (float) – Max norm of the gradients; if float('inf'), no clipping is done (Default is float('inf')).
- loss_coef (float) – Used when sharing networks; should balance the contribution of the gradients of each model.
- cuda_default (bool) – If True and CUDA is supported, use it (Default is True).
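As a rough illustration of what the clip_grad_norm parameter describes, the sketch below rescales a flat list of gradient values so that their L2 norm does not exceed a maximum. It is written in plain Python rather than PyTorch, and clip_by_global_norm is a hypothetical helper, not part of torchrl.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their L2 norm is at most max_norm.

    Hypothetical helper for illustration only; not the torchrl implementation.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # With max_norm = float('inf') this condition is never true,
    # so the gradients pass through unchanged.
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
unclipped, _ = clip_by_global_norm([3.0, 4.0], max_norm=float("inf"))
```

The infinite default acts as a switch: no finite norm exceeds it, so clipping is effectively disabled without a separate flag.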
-
batch_keys
The batch keys needed for computing all losses. This reduces overhead when sampling a dataloader, since only the requested keys are sampled.
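The idea of sampling only the requested keys can be sketched with a plain dictionary. The helper select_keys and the key names used here are assumptions made for illustration; the actual keys depend on the losses registered in the model.

```python
def select_keys(batch, batch_keys):
    """Return a new dict holding only the requested keys.

    Hypothetical helper for illustration only; not part of torchrl.
    """
    missing = [k for k in batch_keys if k not in batch]
    if missing:
        raise KeyError(f"batch is missing keys: {missing}")
    return {k: batch[k] for k in batch_keys}


# Key names below are made up for the example.
batch = {
    "state_t": [0.1, 0.2],
    "action": [1, 0],
    "advantage": [0.5, -0.3],
    "info": ["a", "b"],
}
needed = ["action", "advantage"]
small = select_keys(batch, needed)
```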
-
register_losses
Append losses to self.losses. The losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
-
static output_layer(input_shape, action_info)
The final layer of the model; it will be appended to the model head.
Parameters:
Examples
The output of most PG models has the same dimension as the action, but the output of the Value models is rank 1. This method is where that is defined.
-
forward(x)
Defines the computation performed at every call.
Parameters: x (numpy.ndarray) – The environment state.
-
attach_logger(logger)
Register a logger to this model.
Parameters: logger (torchrl.utils.logger) –
-
write_logs(batch)
Write logs to the terminal and to a tf log file.
Parameters: batch (Batch) – Some logs might need the batch for calculation.
-
classmethod from_config(config, batcher=None, body=None, head=None, **kwargs)
Creates a model from a configuration file.
Parameters:
- config (Config) – Should contain at least a network definition (nn_config section).
- env (torchrl.envs) – A torchrl environment (Default is None and must be present in the config).
- kwargs (keyword arguments) – Extra arguments that will be passed to the class constructor.
Returns: A TorchRL model.
Return type: torchrl.models
ValueModel
-
class torchrl.models.ValueModel(model, batcher, **kwargs)
Bases: torchrl.models.base_model.BaseModel
A standard regression model that can be used to estimate the value of states or Q values.
Parameters: clip_range (float) – Similar to PPOClip, limits the change between the new and old value function.
-
batch_keys
The batch keys needed for computing all losses. This reduces overhead when sampling a dataloader, since only the requested keys are sampled.
-
register_losses()
Append losses to self.losses. The losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
-
write_logs(batch)
Write logs to the terminal and to a tf log file.
Parameters: batch (Batch) – Some logs might need the batch for calculation.
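The clip_range parameter of ValueModel is described only as limiting the change between the new and old value function. One common way to do this in PPO-style code is sketched below in plain Python. Whether torchrl uses exactly this formulation is an assumption, and clipped_value_loss is a hypothetical name.

```python
def clipped_value_loss(new_values, old_values, targets, clip_range):
    """Mean squared error where the new value may move at most clip_range
    away from the old value.

    Hypothetical sketch of value clipping; not the torchrl implementation.
    """
    losses = []
    for v_new, v_old, target in zip(new_values, old_values, targets):
        # Limit how far the new prediction can move from the old one.
        delta = min(max(v_new - v_old, -clip_range), clip_range)
        v_clipped = v_old + delta
        # Take the larger (more pessimistic) of the two squared errors.
        losses.append(max((v_new - target) ** 2, (v_clipped - target) ** 2))
    return sum(losses) / len(losses)


tight = clipped_value_loss([2.0], [1.0], [2.0], clip_range=0.5)
loose = clipped_value_loss([2.0], [1.0], [2.0], clip_range=10.0)
```

With a tight range the clipped prediction (1.5) is penalized even though the unclipped one hits the target, which discourages large jumps in the value estimate within a single update.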
-
BasePGModel
-
class torchrl.models.BasePGModel(model, batcher, *, entropy_coef=0, **kwargs)
Bases: torchrl.models.base_model.BaseModel
Base class for all Policy Gradient Models.
-
entropy_loss(batch)
Adds an entropy cost to the loss function, with the intent of encouraging exploration.
Parameters: batch (Batch) – The batch should contain all the information necessary to compute the gradients.
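A minimal sketch of an entropy cost for discrete policies follows, in plain Python. The sign convention (negative entropy scaled by entropy_coef, so that minimizing the loss raises entropy) is an assumption about how such a term is usually written, and both function names are hypothetical stand-ins rather than the torchrl methods.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_cost(batch_probs, entropy_coef):
    """Negative mean entropy scaled by entropy_coef.

    Hypothetical sketch; minimizing this value pushes the policy
    toward higher entropy, i.e. more exploration.
    """
    mean_entropy = sum(categorical_entropy(p) for p in batch_probs) / len(batch_probs)
    return -entropy_coef * mean_entropy


uniform_cost = entropy_cost([[0.5, 0.5]], entropy_coef=0.01)
peaked_cost = entropy_cost([[0.99, 0.01]], entropy_coef=0.01)
```

A uniform policy gets a lower cost than a peaked one, and entropy_coef=0 (the default in the signatures above) switches the term off.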
-
create_dist(parameters)
Specify how the policy distributions should be created. The type of the distribution depends on the environment.
Parameters: parameters (np.array) – The parameters are used to create a distribution (continuous or discrete depending on the type of the environment).
-
write_logs(batch)
Write logs to the terminal and to a tf log file.
Parameters: batch (Batch) – Some logs might need the batch for calculation.
-
VanillaPGModel
-
class torchrl.models.VanillaPGModel(model, batcher, *, entropy_coef=0, **kwargs)
Bases: torchrl.models.base_pg_model.BasePGModel
The classical Policy Gradient algorithm.
-
batch_keys
The batch keys needed for computing all losses. This reduces overhead when sampling a dataloader, since only the requested keys are sampled.
-
A2CModel
-
class torchrl.models.A2CModel(model, batcher, *, entropy_coef=0, **kwargs)
Bases: torchrl.models.vanilla_pg_model.VanillaPGModel
A2C is just a parallel implementation of the actor-critic algorithm. To reproduce A2C, create a list of envs and pass it to torchrl.envs.ParallelEnv.
SurrogatePGModel
-
class torchrl.models.SurrogatePGModel(model, batcher, *, entropy_coef=0, **kwargs)
Bases: torchrl.models.base_pg_model.BasePGModel
The Surrogate Policy Gradient algorithm instead maximizes a "surrogate" objective, given by:
\[L^{CPI}(\theta) = \hat{E}_t \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} \hat{A} \right]\]
-
batch_keys
The batch keys needed for computing all losses. This reduces overhead when sampling a dataloader, since only the requested keys are sampled.
-
register_losses()
Append losses to self.losses. The losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
-
surrogate_pg_loss(batch)
The surrogate policy gradient loss, as given in the class description.
Parameters: batch (Batch) –
-
calculate_prob_ratio(new_log_probs, old_log_probs)
Calculates the probability ratio between two policies.
Parameters:
- new_log_probs (torch.Tensor) –
- old_log_probs (torch.Tensor) –
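Since the inputs are log-probabilities, the probability ratio in the surrogate objective can be obtained by exponentiating their difference. The sketch below uses plain Python lists in place of torch.Tensor; prob_ratio and surrogate_objective are hypothetical stand-ins, not the library methods.

```python
import math


def prob_ratio(new_log_probs, old_log_probs):
    """pi_new(a|s) / pi_old(a|s), computed as exp(log pi_new - log pi_old)."""
    return [math.exp(n - o) for n, o in zip(new_log_probs, old_log_probs)]


def surrogate_objective(new_log_probs, old_log_probs, advantages):
    """Sample mean of ratio * advantage, i.e. an estimate of L^CPI.

    Hypothetical sketch; a loss to minimize would be the negative of this.
    """
    ratios = prob_ratio(new_log_probs, old_log_probs)
    return sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)


new_lp = [math.log(0.5), math.log(0.2)]
old_lp = [math.log(0.25), math.log(0.4)]
ratios = prob_ratio(new_lp, old_lp)
objective = surrogate_objective(new_lp, old_lp, advantages=[1.0, -1.0])
```

Working in log space avoids dividing two small probabilities directly, which is the usual reason for passing log-probabilities.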
-
PPOClipModel
-
class torchrl.models.PPOClipModel(model, batcher, ppo_clip_range=0.2, **kwargs)
Bases: torchrl.models.surrogate_pg_model.SurrogatePGModel
Proximal Policy Optimization as described in https://arxiv.org/pdf/1707.06347.pdf.
Parameters:
-
register_losses()
Append losses to self.losses. The losses are used at optimizer_step() for calculating the gradients.
Parameters: batch (dict) – The batch should contain all the information necessary to compute the gradients.
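The clipped objective from the PPO paper linked in the PPOClipModel description takes, for each sample, the minimum of the unclipped and clipped surrogate terms. A plain-Python sketch follows, assuming ppo_clip_range plays the role of the paper's epsilon; ppo_clip_objective is a hypothetical name and not the torchrl method.

```python
def ppo_clip_objective(ratios, advantages, ppo_clip_range=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Hypothetical sketch of the clipped surrogate objective;
    a loss to minimize would be the negative of this value.
    """
    low, high = 1.0 - ppo_clip_range, 1.0 + ppo_clip_range
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, low), high)
        # The minimum keeps the more pessimistic of the two estimates.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


gain_capped = ppo_clip_objective([1.5], [1.0])       # ratio above 1 + eps, positive advantage
loss_kept = ppo_clip_objective([1.5], [-1.0])        # ratio above 1 + eps, negative advantage
low_ratio_capped = ppo_clip_objective([0.5], [-1.0]) # ratio below 1 - eps, negative advantage
```

The clip only removes the incentive to move the ratio further in the direction that improves the objective; when the change makes things worse, the unclipped term is kept.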
-
PPOAdaptiveModel
-
class torchrl.models.PPOAdaptiveModel(model, batcher, *, kl_target=0.01, kl_penalty=1.0, **kwargs)
Bases: torchrl.models.surrogate_pg_model.SurrogatePGModel
Proximal Policy Optimization as described in https://arxiv.org/pdf/1707.06347.pdf.
Parameters: num_epochs (int) – How many times to train over the entire dataset (Default is 10).
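The adaptive-KL variant in the linked paper subtracts a KL penalty from the surrogate objective and adjusts the penalty coefficient after each policy update, depending on how the measured KL compares to a target. The sketch below follows the paper's rule (halve or double at factors of 1.5). Treating kl_penalty as the paper's beta and kl_target as its target divergence is an assumption, and the library may use different thresholds or factors; both function names are hypothetical.

```python
def kl_penalized_objective(surrogate, kl, kl_penalty):
    """Surrogate objective minus the KL penalty term (beta * KL)."""
    return surrogate - kl_penalty * kl


def update_kl_penalty(kl_penalty, measured_kl, kl_target=0.01):
    """Adapt the penalty coefficient as in the PPO paper.

    Hypothetical sketch; not the torchrl implementation.
    """
    if measured_kl < kl_target / 1.5:
        # Policy moved too little: relax the penalty.
        return kl_penalty / 2.0
    if measured_kl > kl_target * 1.5:
        # Policy moved too much: strengthen the penalty.
        return kl_penalty * 2.0
    return kl_penalty


raised = update_kl_penalty(1.0, measured_kl=0.05)
lowered = update_kl_penalty(1.0, measured_kl=0.001)
unchanged = update_kl_penalty(1.0, measured_kl=0.01)
```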