Protocol for Evaluating Reinforcement Learning Environments in Real Time
draft-perlert-wg-00

Abstract

This document defines a simple UDP protocol for communication between a server simulating a reinforcement learning environment and a client that observes it and responds with actions.

Reinforcement learning problems are usually defined within the framework of a Markov Decision Process (MDP), in which an agent sends an action belonging to an action space to an environment. The environment acts as a black box that returns an observation and a reward to the agent, whose goal is to maximize the total reward obtained.

Although the problem statement is easy to understand, there is no convention for how a reinforcement learning simulation should communicate with a client agent, either on a local network or over the Internet. Establishing one can be especially useful for multiagent support and analysis.

The PERLERT protocol defined in this document assumes that the server and the client have already shared certain information through another channel, such as a web page served over HTTP. For example, the client must know a port number and an instance number before it can participate in a simulation run on a server.

Although receiving the complete feedback from the environment is often desirable, PERLERT prioritizes real-time interaction, in which human agents can interact with AI agents, even at the cost of losing some information to network packet loss.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 14, 2021.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

1.1. Requirements Language

2. Communication Phases
3. Messages Specification

3.1. Terms
3.2. Client Message Types
3.3. Server Message Types

4. UDP/IP Ports
5. Example Case
6. Additional Considerations
7. IANA Considerations
8. Security Considerations
9. Normative References
Author's Address

1. Introduction

This document specifies PERLERT (Protocol for Evaluation of Reinforcement Learning Environments in Real Time).

It is intended to be used in the analysis of reinforcement learning problems. In a reinforcement learning problem, an agent sends an action to an environment. The environment acts as a black box that returns an observation and a reward to the agent, whose goal is to maximize the total reward obtained.

The main purpose of PERLERT is to make it easier to test and integrate agents with different implementations and to run simulation servers separately from those agents.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2. Communication Phases

Client and server exchange PERLERT messages in two separate phases.

lobby: This phase lets agent clients register in the available slots announced by the server. It is especially useful for environments with multiagent support.
rollout: This is the main phase. The term "rollout" here acts as a synonym of "simulation". In this phase, the loop:

(action) -> (observation, reward)

...takes place until clients are notified by the server that the simulation has finished.
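The two phases can be pictured as a small client-side state machine. The following Python sketch is illustrative only: the `Phase` enum and the `next_phase` helper are hypothetical names, not part of PERLERT, and the sketch assumes the "lobby start" and "rollout step" formats specified in Section 3.3.

```python
from enum import Enum


class Phase(Enum):
    LOBBY = "lobby"
    ROLLOUT = "rollout"
    FINISHED = "finished"


def next_phase(phase: Phase, datagram: str) -> Phase:
    """Return the phase a client is in after receiving one server datagram."""
    fields = datagram.split(";")[1:]  # drop the leading header part
    # A "lobby start" message moves the client from the lobby to the rollout.
    if phase is Phase.LOBBY and any(f.startswith("start=port:") for f in fields):
        return Phase.ROLLOUT
    # A "rollout step" message with done=true ends the simulation.
    if phase is Phase.ROLLOUT and "done=true" in fields:
        return Phase.FINISHED
    return phase
```

Any other datagram leaves the phase unchanged, so lost or repeated packets do not move the client backwards.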

3. Messages Specification

Messages defined in the following sections MUST be implemented as UDP/IP datagrams [RFC768].

Also, all messages SHOULD use the same text encoding. It is RECOMMENDED that both server and client encode messages using UTF-8 [RFC3629].

3.1. Terms

In order of appearance:

SERVER_INSTANCE_NAME: Tag used to distinguish the different environments hosted by the same server, e.g.: "cartpole".
SERVER_INSTANCE_NUMBER: Non-negative integer used to distinguish different instances of the same environment hosted by the same server, e.g.: "0".
HEADER: Shorthand for SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER, e.g.: "cartpole:0".
SERVER_LOBBY_PORT: UDP/IP port on which the server listens for incoming messages related to the lobby phase. Clients need to know the SERVER_LOBBY_PORT beforehand.
SERVER_ROLLOUT_PORT: UDP/IP port on which the server listens for incoming messages related to the rollout phase. The server notifies clients of it right before the simulation starts.
CLIENT_PORT: UDP/IP port of agent clients. Servers SHOULD NOT send datagrams to clients that have not first registered through the process explained in the next section.
AGENT_KEY: Key used to identify one available agent slot, e.g.: "agent0".
AGENT_TAG: Tag used to identify one agent filling one available slot. Specific clients can use a custom tag to identify themselves within the scope of the lobby phase, e.g.: "john_doe_q_learning".
BOOL_VALUE: The literal "true" or "false", without the quotation marks.
ACTION: Action chosen by an agent. It MUST NOT contain the colon character (:), semicolon (;), or equal sign (=). There are no other restrictions on how this field is formed as long as it is well understood by both client and server, e.g.: "move_left" or "5,6.78".
SLOT_STATUS: The literal "open" or "close", without the quotation marks.
AGENT_KIND: Freeform field used to differentiate aspects of agents relevant during the lobby phase, e.g.: "citizen" or "zombie". It MUST NOT contain the colon character (:), semicolon (;), comma (,) or equal sign (=). There are no other restrictions on how this field is formed as long as it is well understood by both client and server.
READY_STATUS: The literal "ready" or "not_ready", without the quotation marks.
AGENT_SLOT: Shorthand for AGENT_KEY=SLOT_STATUS,AGENT_KIND,AGENT_TAG,READY_STATUS;
[AGENT_SLOT]: One or more consecutive AGENT_SLOT entries.
MESSAGE: Informative message sent by server instances during the lobby phase.
TIMESTAMP: Number of milliseconds since the UNIX Epoch (January 1, 1970, 00:00:00 UTC) according to server time.
STEP_NUMBER: Non-negative integer indicating the step number for a running simulation.
OBSERVATION: Observation for an agent received upon a simulation step run on the server. It MUST NOT contain the semicolon character (;) or the equal sign (=). There are no other restrictions on how this field is formed as long as it is well understood by both client and server, e.g.: "x:0.54,y:0.95".
REWARD: Reward for an agent received upon a simulation step run on the server, usually modeled as a single floating point value. It MUST NOT contain the semicolon character (;) or the equal sign (=).
EXTRA: Additional information for an agent received upon a simulation step run on the server. It MUST NOT contain the semicolon character (;) or the equal sign (=). There are no other restrictions on how this field is formed as long as it is well understood by both client and server, e.g.: "did_jump:true,jump_length:6.84".
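As an illustration of how the HEADER and [AGENT_SLOT] terms compose, the following Python sketch parses both. The names `AgentSlot`, `parse_header`, and `parse_agent_slots` are hypothetical, and the sketch assumes that SERVER_INSTANCE_NAME, AGENT_KEY, and AGENT_TAG contain no semicolon, which this document does not state explicitly.

```python
from typing import NamedTuple


class AgentSlot(NamedTuple):
    key: str     # AGENT_KEY
    status: str  # SLOT_STATUS
    kind: str    # AGENT_KIND
    tag: str     # AGENT_TAG
    ready: str   # READY_STATUS


def parse_header(header: str) -> tuple[str, int]:
    """Split a HEADER such as "cartpole:0" into its name and number."""
    name, number = header.rsplit(":", 1)
    return name, int(number)


def parse_agent_slots(text: str) -> list[AgentSlot]:
    """Parse one or more AGENT_SLOT entries; a trailing ';' is tolerated."""
    slots = []
    for entry in text.split(";"):
        if not entry:
            continue  # empty piece left by a trailing semicolon
        key, _, values = entry.partition("=")
        # AGENT_KIND cannot contain commas, so split it off from the left;
        # READY_STATUS is the last field, so split it off from the right.
        status, kind, rest = values.split(",", 2)
        tag, ready = rest.rsplit(",", 1)
        if status not in ("open", "close") or ready not in ("ready", "not_ready"):
            raise ValueError(f"malformed AGENT_SLOT: {entry!r}")
        slots.append(AgentSlot(key, status, kind, tag, ready))
    return slots
```

Splitting from both ends is a defensive choice for the case where an AGENT_TAG happens to contain a comma.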

3.2. Client Message Types

This section specifies the content format for the message types that shall be implemented by PERLERT clients.

lobby information request: Message sent by clients to request lobby information associated with a given server instance.

HEADER;lobby

lobby registration request: Message sent by clients to request to participate in a simulation server instance.

HEADER;register=AGENT_KEY,AGENT_TAG

Clients are allowed to issue multiple lobby registration requests, but only the last one correctly received by the server will take effect.

lobby ready request: Message sent by clients to inform the server whether they are ready to participate in the simulation or not.

HEADER;ready=AGENT_KEY,BOOL_VALUE

rollout action: Message sent by clients to tell the server which action to run in the simulation. Clients need not send a "rollout action" message for every simulation timestep. Instead, the server uses the last action received from each client and keeps feeding it into the environment until a new action arrives. Server instances can choose which action to feed into the environment simulation until an agent client provides its first valid action.

HEADER;action=ACTION
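The four client message formats above could be produced with helpers like the ones in this Python sketch. The function names are hypothetical, and the sketch assumes UTF-8 encoding as recommended in Section 3. Only the ACTION character restrictions from Section 3.1 are checked.

```python
FORBIDDEN_IN_ACTION = ":;="  # characters an ACTION MUST NOT contain


def _encode(name: str, number: int, body: str) -> bytes:
    """Prefix the body with HEADER and encode the datagram payload."""
    return f"{name}:{number};{body}".encode("utf-8")


def lobby_request(name: str, number: int) -> bytes:
    return _encode(name, number, "lobby")


def register_request(name: str, number: int, agent_key: str, agent_tag: str) -> bytes:
    return _encode(name, number, f"register={agent_key},{agent_tag}")


def ready_request(name: str, number: int, agent_key: str, ready: bool) -> bytes:
    bool_value = "true" if ready else "false"
    return _encode(name, number, f"ready={agent_key},{bool_value}")


def action_message(name: str, number: int, action: str) -> bytes:
    if any(ch in FORBIDDEN_IN_ACTION for ch in action):
        raise ValueError(f"ACTION contains a forbidden character: {action!r}")
    return _encode(name, number, f"action={action}")
```

For instance, `action_message("city", 7, "walk")` yields the payload `city:7;action=walk`, ready to be passed to a UDP socket.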

3.3. Server Message Types

This section specifies the content format for the message types that shall be implemented by PERLERT servers.

lobby information: Message sent by servers to inform clients about the lobby agent slots. This datagram MUST be sent to a client upon receiving a "lobby information request", and to all clients whenever the lobby is altered due to a "lobby registration request" or a "lobby ready request".

HEADER;[AGENT_SLOT]

The message format MAY omit the trailing semicolon character (;).

lobby registration response: Message sent by servers upon a successful registration request.

HEADER;registered=AGENT_KEY

Servers MUST NOT allow a single client to be registered in multiple slots. Before registering a client in an agent slot, the server must first remove that client from any slot in which it is already registered.

Servers MUST register clients with a default "not_ready" status.

lobby message: Message sent by servers to registered clients containing relevant general information.

HEADER;message=MESSAGE

lobby start: Message sent by servers to all registered clients, once the simulation is about to start, announcing the UDP/IP port for the rollout. The server can choose to start the simulation at any time, but it MUST NOT do so while any client has a "not_ready" status.

HEADER;start=port:SERVER_ROLLOUT_PORT

rollout step: Message sent by servers to all registered clients containing the information provided by the environment for a single step. "rollout step" messages should be sent as a regular stream carrying enough data per unit of time for clients to render the environment properly, without exceeding a reasonable number of UDP packets. It is RECOMMENDED to limit the rate to a maximum of 30 "rollout step" packets per second.

HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE

Servers MAY send additional information by appending an extra field like this:

HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE;extra=EXTRA

Because many messages of this type will be sent over the network, it is recommended that they be as condensed as possible. For example, it is RECOMMENDED that floating point values belonging to either the OBSERVATION or the REWARD be rounded to the minimum number of decimal places needed.
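A "rollout step" datagram could be produced and consumed as in the following Python sketch. All names are hypothetical. The comma-separated OBSERVATION encoding and the default of two decimal places are assumptions chosen for illustration, since this document leaves both to the implementation. The parser also assumes that TIMESTAMP and STEP_NUMBER are the last two colon-separated parts before the first semicolon.

```python
from typing import NamedTuple, Optional, Sequence


class RolloutStep(NamedTuple):
    header: str
    timestamp: int
    step: int
    obs: str
    reward: str
    done: bool
    extra: Optional[str]


def compact(value: float, decimals: int = 2) -> str:
    """Round a float and drop trailing zeros to keep datagrams small."""
    text = f"{round(value, decimals):.{decimals}f}"
    return text.rstrip("0").rstrip(".") if "." in text else text


def format_step(header: str, timestamp: int, step: int, obs: Sequence[float],
                reward: float, done: bool, extra: Optional[str] = None) -> str:
    """Server side: build the text of a "rollout step" message."""
    parts = [
        f"{header}:{timestamp}:{step}",
        "obs=" + ",".join(compact(v) for v in obs),
        "reward=" + compact(reward),
        "done=" + ("true" if done else "false"),
    ]
    if extra is not None:
        parts.append("extra=" + extra)
    return ";".join(parts)


def parse_step(datagram: str) -> RolloutStep:
    """Client side: split a "rollout step" message into its fields."""
    head, *pairs = datagram.split(";")
    header, timestamp, step = head.rsplit(":", 2)
    # Field values may contain ':' and ',' but never ';' or '='.
    fields = dict(pair.split("=", 1) for pair in pairs if pair)
    return RolloutStep(header, int(timestamp), int(step), fields["obs"],
                       fields["reward"], fields["done"] == "true",
                       fields.get("extra"))
```

The OBSERVATION and REWARD are kept as strings on the client side because their interpretation is agreed between client and server outside the protocol.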

4. UDP/IP Ports

All messages sent by one client MUST use the same UDP/IP source CLIENT_PORT during the whole information exchange process, that is, from the moment the agent sends a "lobby registration request" to the server until it receives a "rollout step" message with the "done" flag set to "true".
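One way for a client to satisfy this rule is to bind a single UDP socket and reuse it for both phases. The Python sketch below demonstrates the idea on the loopback interface. The two server sockets are stand-ins for a real server, the intermediate "lobby registration response" is skipped for brevity, and `BUFFER_SIZE` is a hypothetical value for the small fixed receive buffer suggested in Section 8.

```python
import socket

BUFFER_SIZE = 512  # hypothetical fixed receive buffer size, in bytes


def open_udp(port: int = 0) -> socket.socket:
    """Bind a UDP socket on loopback; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.settimeout(2.0)  # never block forever if a datagram is lost
    return sock


def demo():
    lobby, rollout, client = open_udp(), open_udp(), open_udp()
    try:
        lobby_addr, rollout_addr = lobby.getsockname(), rollout.getsockname()
        # Lobby phase: the client registers from its one and only socket.
        client.sendto(b"city:7;register=agent0,patrick", lobby_addr)
        _, seen_in_lobby = lobby.recvfrom(BUFFER_SIZE)
        # The stand-in server announces the rollout port from its lobby port.
        start = f"city:7;start=port:{rollout_addr[1]}".encode("utf-8")
        lobby.sendto(start, seen_in_lobby)
        payload, start_source = client.recvfrom(BUFFER_SIZE)
        rollout_port = int(payload.decode("utf-8").rsplit(":", 1)[1])
        # Rollout phase: same client socket, hence same source CLIENT_PORT.
        client.sendto(b"city:7;action=walk", ("127.0.0.1", rollout_port))
        _, seen_in_rollout = rollout.recvfrom(BUFFER_SIZE)
        return seen_in_lobby, seen_in_rollout, start_source, lobby_addr
    finally:
        for sock in (lobby, rollout, client):
            sock.close()
```

Because the client never opens a second socket, the address the server sees in the lobby phase equals the one it sees in the rollout phase, which lets the server match actions to registered agents.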

"lobby information", "lobby registration response", "lobby message", and "lobby start" datagrams MUST use the same UDP/IP source SERVER_LOBBY_PORT for a given server instance.

"rollout step" datagrams MUST use the same UDP/IP source SERVER_ROLLOUT_PORT for a given server instance.

5. Example Case

This section provides a brief example of datagrams exchanged by one client and one server during a PERLERT session.

  CLIENT                                           SERVER

  ==================== LOBBY PHASE ======================
  UDP port: 55555                         UDP port: 32322

  city:7;lobby -------------------------------------->

     <-------------- city:7;agent0=open,citizen,cpu,ready

  city:7;register=agent0,patrick -------------------->
  
     <-------------------------- city:7;registered=agent0

  city:7;ready=agent0,true -------------------------->

     <--------- city:7;agent0=close,citizen,patrick,ready
     <----------- city:7;message=Simulation will start...

     <--------------------------- city:7;start=port:32323

  ==================== ROLLOUT PHASE =====================
  UDP port: 55555                         UDP port: 32323

  city:7;action=walk -------------------------------->
  
     <-- city:7:1590853116323:0;obs=45;reward=0;done=false
     <-- city:7:1590853121058:1;obs=47;reward=0;done=false
     <-- city:7:1590853126423:2;obs=48;reward=1;done=false
     <-- city:7:1590853130429:3;obs=49;reward=0;done=false
     <--- city:7:1590853134833:4;obs=51;reward=1;done=true

Figure 1
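The server side of the lobby exchange in Figure 1 can be approximated with a small in-memory model. The Python sketch below is illustrative only. The `Lobby` class is hypothetical, and so is the convention that an unclaimed slot is reported as "open" with the tag "cpu" and the "ready" status, which is inferred from Figure 1 rather than specified. Rejecting a request for a slot held by another client is likewise an assumption.

```python
class Lobby:
    """In-memory lobby for one server instance (illustrative only)."""

    DEFAULT_TAG = "cpu"  # hypothetical tag reported for unclaimed slots

    def __init__(self, header, slot_kinds):
        self.header = header
        self.slots = {
            key: {"kind": kind, "tag": self.DEFAULT_TAG, "client": None, "ready": True}
            for key, kind in slot_kinds.items()
        }

    def register(self, client, agent_key, agent_tag):
        target = self.slots[agent_key]
        if target["client"] not in (None, client):
            raise ValueError(f"{agent_key} is held by another client")
        # A client may hold one slot only: release any slot it already has.
        for slot in self.slots.values():
            if slot["client"] == client:
                slot.update(tag=self.DEFAULT_TAG, client=None, ready=True)
        # New registrations always start as "not_ready".
        target.update(tag=agent_tag, client=client, ready=False)
        return f"{self.header};registered={agent_key}"

    def set_ready(self, client, agent_key, ready):
        slot = self.slots[agent_key]
        if slot["client"] != client:
            raise ValueError(f"{agent_key} is not held by this client")
        slot["ready"] = ready

    def can_start(self):
        """The simulation must not start while any client is "not_ready"."""
        return all(s["ready"] for s in self.slots.values() if s["client"] is not None)

    def info(self):
        """Build a "lobby information" message without trailing semicolon."""
        entries = []
        for key, s in self.slots.items():
            status = "open" if s["client"] is None else "close"
            ready = "ready" if s["ready"] else "not_ready"
            entries.append(f"{key}={status},{s['kind']},{s['tag']},{ready}")
        return f"{self.header};" + ";".join(entries)
```

Here `client` stands for whatever the server uses to identify a sender, for example the (address, port) pair of the received datagram.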

6. Additional Considerations

Because packet loss might prevent some PERLERT information from reaching the other end, the following considerations are to be taken into account:

After sending the "lobby start" message, the server instance SHOULD keep the SERVER_LOBBY_PORT open for five (5) seconds and resend the "lobby start" message to any client that contacts that port after the simulation has started.

After the simulation has finished for a given client, that is, once a "rollout step" message carries the "done" flag set to "true", the server instance SHOULD keep the SERVER_ROLLOUT_PORT open for ten (10) seconds, listening for datagrams from that client. The server instance SHOULD resend the appropriate "rollout step" datagram upon receiving a client message within that period.
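Both grace periods follow the same pattern: remember the last datagram and repeat it to late callers until a deadline passes. A minimal Python sketch follows, in which the `ResendWindow` class and the constant names are hypothetical and the caller supplies the clock readings (for example from `time.monotonic()`).

```python
from typing import Optional

LOBBY_GRACE_SECONDS = 5.0     # after "lobby start" has been sent
ROLLOUT_GRACE_SECONDS = 10.0  # after a "rollout step" with done=true


class ResendWindow:
    """Keeps the datagram to repeat while a grace period lasts."""

    def __init__(self, datagram: bytes, grace_seconds: float, started_at: float):
        self.datagram = datagram
        self.deadline = started_at + grace_seconds

    def reply_for(self, now: float) -> Optional[bytes]:
        """Return the datagram to resend, or None once the window is closed."""
        return self.datagram if now <= self.deadline else None
```

A server would create one window when it sends "lobby start" and one per client when that client's simulation finishes, then consult `reply_for` whenever a datagram arrives on the corresponding port.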

7. IANA Considerations

This memo includes no request to IANA.

8. Security Considerations

Both client and server implementations SHOULD use a fixed buffer size as small as possible for receiving the UDP/IP packets.

Both client and server MAY encrypt the content of the messages. Although the use of asymmetric public/private key pairs is recommended, symmetric encryption with a pre-shared key is also encouraged.

PERLERT is especially vulnerable to IP spoofing attacks, because actions received during the rollout phase are identified only by the IP address of the sender. Using a VPN to tunnel the information exchange is RECOMMENDED.

9. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC3629]	Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2003.
[RFC768]	Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI 10.17487/RFC0768, August 1980.

Author's Address

Ruben Montero
University of A Coruna
Rua San Roque 9
A Coruna, Galicia 15002
ES
Phone: +34 692 983 851
EMail: ruben.montero@udc.es