Internet Engineering Task Force R. Montero
Internet-Draft University of A Coruna
Intended status: Informational August 13, 2020
Expires: February 14, 2021
Protocol for Evaluating Reinforcement Learning Environments in Real Time
draft-perlert-wg-00
Abstract
This document defines a simple UDP protocol for communication between
a server simulating a reinforcement learning environment and a client
that observes it and responds with actions.
Reinforcement learning problems are usually defined within the scope
of a Markov Decision Process (MDP) where an agent sends an action
belonging to an action space to an environment. The environment acts
as a black box returning an observation and a reward for the agent,
whose goal is to maximize the total obtained rewards.
Although the problem statement is easy to understand, there is no
convention for how a reinforcement learning simulation should
communicate with a client agent, either in a local network or over
the Internet. Establishing one is especially useful for multiagent
support and analysis.
The PERLERT protocol defined in this document assumes that server and
client have shared certain information beforehand through another
channel, such as a web page served over HTTP. For example, the
client must know a port number and an instance number before
participating in a simulation run on a server.
Also, although full feedback from the environment is often desirable,
PERLERT focuses on real-time interaction in which human agents can
interact with AI agents, even if that means some information is lost
due to network packet loss.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Montero Expires February 14, 2021 [Page 1]
Internet-Draft PERLERT August 2020
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 14, 2021.
Copyright Notice
Copyright (c) 2020 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
1.1. Requirements Language
2. Communication Phases
3. Messages Specification
3.1. Terms
3.2. Client Message Types
3.3. Server Message Types
4. UDP/IP Ports
5. Example Case
6. Additional Considerations
7. IANA Considerations
8. Security Considerations
9. Normative References
Author's Address
1. Introduction
This document specifies PERLERT (Protocol for Evaluating
Reinforcement Learning Environments in Real Time).
It is intended for the analysis of reinforcement learning problems.
In a reinforcement learning problem, an agent sends
an action to an environment. The environment acts as a black box
returning an observation and a reward for the agent, whose goal is to
maximize the total obtained rewards.
The main purpose of PERLERT is to make it easier to test and
integrate agents with different implementations and to run simulation
servers separately from those agents.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. Communication Phases
There are two separate phases in which client and server shall
exchange PERLERT messages.
lobby
This phase lets agent clients register themselves in the
available slots advertised by the server. It is especially
useful for environments with multiagent support.
rollout
This is the main phase. The term "rollout" here acts as a
synonym of "simulation". In this phase, the loop:
(action) -> (observation, reward)
...takes place until clients are notified by the server that the
simulation has finished.
3. Messages Specification
Messages defined in the following sections MUST be implemented as
UDP/IP datagrams [RFC768].
Also, all messages SHOULD use the same text encoding. It is
RECOMMENDED that both server and client encode messages using UTF-8
[RFC3629].
3.1. Terms
In order of appearance:
SERVER_INSTANCE_NAME Tag used to distinguish different environments
hosted by the same server, e.g.: "cartpole".
SERVER_INSTANCE_NUMBER Non-negative integer used to distinguish
different instances of the same environment hosted by the same
server, e.g.: "0".
HEADER Shorthand for SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER,
e.g.: "cartpole:0".
SERVER_LOBBY_PORT UDP/IP port on which the server listens for
incoming messages related to the lobby phase. Clients must know
the SERVER_LOBBY_PORT beforehand.
SERVER_ROLLOUT_PORT UDP/IP port on which the server listens for
incoming messages related to the rollout phase. The server
notifies clients of this port right before the simulation
starts.
CLIENT_PORT UDP/IP port of agent clients. The server SHOULD NOT send
datagrams to clients that have not been registered first,
following the process explained in the following sections.
AGENT_KEY Key used to identify one available agent slot, e.g.:
"agent0".
AGENT_TAG Tag used to identify one agent filling one available slot.
Specific clients can use a custom tag to identify themselves
within the scope of the lobby phase, e.g.: "john_doe_q_learning".
BOOL_VALUE The literal "true" or "false", without the quotation marks.
ACTION Action chosen by an agent. It MUST NOT contain the colon
character (:), semicolon (;), or equal sign (=). There are no
other restrictions on how this field is formed as long as it is
well understood by both client and server, e.g.: "move_left" or
"5,6.78".
SLOT_STATUS The literal "open" or "close", without the quotation marks.
AGENT_KIND Freeform field used to differentiate aspects of agents
relevant during the lobby phase, e.g.: "citizen" or "zombie". It
MUST NOT contain the colon character (:), semicolon (;), comma
(,) or equal sign (=). There are no other restrictions on how
this field is formed as long as it is well understood by both
client and server.
READY_STATUS The literal "ready" or "not_ready", without the quotation
marks.
AGENT_SLOT Shorthand for
AGENT_KEY=SLOT_STATUS,AGENT_KIND,AGENT_TAG,READY_STATUS;
[AGENT_SLOT] One or more occurrences of AGENT_SLOT.
MESSAGE Informative message sent by server instances during lobby
phase.
TIMESTAMP Number of milliseconds since UNIX Epoch (Jan 1, 1970)
according to server time.
STEP_NUMBER Non-negative integer indicating the step number for a
running simulation.
OBSERVATION Observation for an agent received upon a simulation step
run on the server. It MUST NOT contain the semicolon character
(;), or equal sign (=). There are no other restrictions on how
this field is formed as long as it is well understood by both
client and server, e.g.: "x:0.54,y:0.95".
REWARD Reward for an agent received upon a simulation step run on
the server, usually modeled as a single floating point value. It
MUST NOT contain the semicolon character (;), or equal sign (=).
EXTRA Additional information for an agent received upon a simulation
step run on the server. It MUST NOT contain the semicolon
character (;), or equal sign (=). There are no other
restrictions on how this field is formed as long as it is well
understood by both client and server, e.g.:
"did_jump:true,jump_length:6.84".
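The character restrictions above can be sketched in Python as follows. This is a minimal illustration, not part of the protocol: FORBIDDEN_CHARS, is_valid_field and make_header are hypothetical names introduced here, and the table only lists the terms for which this section forbids specific characters.

```python
# Hypothetical helpers; these names are not defined by PERLERT.
# Characters each term MUST NOT contain, as listed in Section 3.1.
FORBIDDEN_CHARS = {
    "ACTION": ":;=",
    "AGENT_KIND": ":;,=",
    "OBSERVATION": ";=",
    "REWARD": ";=",
    "EXTRA": ";=",
}


def is_valid_field(term: str, value: str) -> bool:
    """Return True if value has none of the characters forbidden for term."""
    return not any(ch in value for ch in FORBIDDEN_CHARS[term])


def make_header(instance_name: str, instance_number: int) -> str:
    """Build HEADER as SERVER_INSTANCE_NAME:SERVER_INSTANCE_NUMBER."""
    return f"{instance_name}:{instance_number}"
```

For instance, under these assumptions "x:0.54,y:0.95" is accepted as an OBSERVATION because colons are only forbidden in ACTION and AGENT_KIND.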
3.2. Client Message Types
This section specifies the content format for the message types that
shall be implemented by PERLERT clients.
lobby information request
Message sent by clients to request lobby information associated
with a given server instance.
HEADER;lobby
lobby registration request
Message sent by clients to request participation in a simulation
on a server instance.
HEADER;register=AGENT_KEY,AGENT_TAG
Clients are allowed to issue multiple lobby registration
requests, but only the last one correctly received by the server
will take effect.
lobby ready request
Message sent by clients to inform the server whether they are
ready to participate in the simulation or not.
HEADER;ready=AGENT_KEY,BOOL_VALUE
rollout action
Message sent by clients to inform the server of the action to be
run in the simulation. Clients need not send a "rollout action"
message for every simulation timestep. Instead, the server uses
the last action received from each client and feeds it into the
environment until it receives a new one. Until a client provides
its first valid action, the server instance chooses which action
to feed to the environment simulation.
HEADER;action=ACTION
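The four client messages above could be built as in the following Python sketch, which assumes the UTF-8 encoding recommended in Section 3. All function names are hypothetical. The LastActionStore class is likewise an assumption: it shows one possible way for a server to keep the last action per client, keyed here by the sender address, with a server-chosen default until the first action arrives.

```python
# Hypothetical helpers; none of these names are defined by PERLERT.

def lobby_request(header: str) -> bytes:
    """Build a "lobby information request" datagram payload."""
    return f"{header};lobby".encode("utf-8")


def register_request(header: str, agent_key: str, agent_tag: str) -> bytes:
    """Build a "lobby registration request" datagram payload."""
    return f"{header};register={agent_key},{agent_tag}".encode("utf-8")


def ready_request(header: str, agent_key: str, ready: bool) -> bytes:
    """Build a "lobby ready request" datagram payload."""
    flag = "true" if ready else "false"
    return f"{header};ready={agent_key},{flag}".encode("utf-8")


def action_message(header: str, action: str) -> bytes:
    """Build a "rollout action" datagram payload."""
    if any(ch in action for ch in ":;="):
        raise ValueError("ACTION must not contain ':', ';' or '='")
    return f"{header};action={action}".encode("utf-8")


class LastActionStore:
    """Server-side memory of the most recent action per client (assumed design)."""

    def __init__(self, default_action: str):
        # Action fed to the environment before a client sends its first one.
        self.default_action = default_action
        self._last = {}

    def update(self, client_addr, action: str) -> None:
        """Record the newest action received from client_addr."""
        self._last[client_addr] = action

    def action_for(self, client_addr) -> str:
        """Return the action to feed into the environment for this step."""
        return self._last.get(client_addr, self.default_action)
```

With this sketch, a server step loop would call action_for once per client per step, so a client that stays silent keeps repeating its previous action.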
3.3. Server Message Types
This section specifies the content format for the message types that
shall be implemented by PERLERT servers.
lobby information
Message sent by servers to inform clients about lobby agent
slots. This datagram MUST be sent to a client upon receiving a
"lobby information request", and to all clients whenever the
lobby is altered due to a "lobby registration request" or a
"lobby ready request".
HEADER;[AGENT_SLOT]
The message format MAY omit the trailing semicolon character (;).
lobby registration response
Message sent by servers upon a successful registration request.
HEADER;registered=AGENT_KEY
Servers MUST NOT allow a single client to be registered in
multiple slots. Before registering a client in an agent slot,
the server must remove that client from any slot where it was
previously registered.
Servers MUST register clients with a default "not_ready" status.
lobby message
Message sent by servers to registered clients containing relevant
general information.
HEADER;message=MESSAGE
lobby start
Message sent by servers to all registered clients, once the
simulation is about to start, to announce the UDP/IP port for the
rollout. The server can choose to start the simulation at any time,
but it MUST NOT do so while any client has a "not_ready" status.
HEADER;start=port:SERVER_ROLLOUT_PORT
rollout step
Message sent by servers to all registered clients containing the
information provided by the environment for a single step. Note
that "rollout step" messages should be sent as a regular stream
carrying enough data per unit of time for clients to render the
environment properly, without exceeding a reasonable number of
UDP packets. It is RECOMMENDED to send at most 30 "rollout step"
packets per second.
HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE
The server MAY send additional information by appending an extra
particle like this:
HEADER:TIMESTAMP:STEP_NUMBER;obs=OBSERVATION;reward=REWARD;done=BOOL_VALUE;extra=EXTRA
Because many messages of this type are sent over the network, it
is recommended that they be as compact as possible. For example,
it is RECOMMENDED that floating point values in the OBSERVATION or
the REWARD be rounded to the minimum number of decimals needed.
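On the client side, the "lobby information" and "rollout step" formats above could be parsed as in the following Python sketch. The dataclasses and function names are hypothetical. The sketch assumes that AGENT_TAG contains no comma and keeps OBSERVATION, REWARD and EXTRA as raw strings, since their internal format is left to client and server.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AgentSlot:
    key: str
    slot_status: str
    kind: str
    tag: str
    ready_status: str


@dataclass
class RolloutStep:
    header: str
    timestamp_ms: int
    step_number: int
    observation: str
    reward: str
    done: bool
    extra: Optional[str] = None


def parse_lobby_information(text: str) -> Tuple[str, List[AgentSlot]]:
    """Parse HEADER;[AGENT_SLOT], tolerating the optional trailing ';'."""
    header, _, rest = text.partition(";")
    slots = []
    for part in rest.split(";"):
        if not part:
            continue  # empty piece left by a trailing semicolon
        key, _, fields = part.partition("=")
        # Assumes exactly four comma-separated fields (no comma in AGENT_TAG).
        slot_status, kind, tag, ready_status = fields.split(",")
        slots.append(AgentSlot(key, slot_status, kind, tag, ready_status))
    return header, slots


def parse_rollout_step(text: str) -> RolloutStep:
    """Parse a "rollout step" message, with or without the extra particle."""
    prefix, *particles = text.split(";")
    # The last two colon-separated items are TIMESTAMP and STEP_NUMBER.
    header, timestamp, step = prefix.rsplit(":", 2)
    fields = dict(p.split("=", 1) for p in particles if p)
    return RolloutStep(
        header=header,
        timestamp_ms=int(timestamp),
        step_number=int(step),
        observation=fields["obs"],
        reward=fields["reward"],
        done=fields["done"] == "true",
        extra=fields.get("extra"),
    )
```

Splitting on the semicolon first and on the equal sign second relies on the rule that OBSERVATION, REWARD and EXTRA never contain either character.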
4. UDP/IP Ports
All messages sent by one client MUST use the same UDP/IP source
CLIENT_PORT during the whole information exchange process, from the
moment the agent sends a "lobby registration request" to the server
until it receives a "rollout step" message with the "done" flag set
to "true".
"lobby information", "lobby registration response", "lobby message",
and "lobby start" datagrams MUST use the same UDP/IP source
SERVER_LOBBY_PORT for a given server instance.
"rollout step" datagrams MUST use the same UDP/IP source
SERVER_ROLLOUT_PORT for a given server instance.
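One way for a client to satisfy the source port rule is to bind a single UDP socket and reuse it for both phases. The following Python sketch uses only the standard socket module; open_client_socket and send_message are hypothetical names, and letting the operating system pick CLIENT_PORT (port 0) is an assumption of this example rather than a protocol requirement.

```python
import socket


def open_client_socket(bind_host: str = "0.0.0.0", client_port: int = 0) -> socket.socket:
    """Create the one UDP socket a client reuses for lobby and rollout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Port 0 asks the operating system to choose a free CLIENT_PORT.
    sock.bind((bind_host, client_port))
    return sock


def send_message(sock: socket.socket, payload: bytes,
                 server_host: str, server_port: int) -> None:
    """Send one datagram; the source port is always the bound CLIENT_PORT."""
    sock.sendto(payload, (server_host, server_port))
```

Because the socket is bound once, datagrams sent to SERVER_LOBBY_PORT and later to SERVER_ROLLOUT_PORT leave from the same CLIENT_PORT.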
5. Example Case
This section provides a brief example of datagrams exchanged by one
client and one server during a PERLERT session.
CLIENT SERVER
==================== LOBBY PHASE ======================
UDP port: 55555 UDP port: 32322
city:7;lobby -------------------------------------->
<-------------- city:7;agent0=open,citizen,cpu,ready
city:7;register=agent0,patrick -------------------->
<-------------------------- city:7;registered=agent0
city:7;ready=agent0,true -------------------------->
<--------- city:7;agent0=close,citizen,patrick,ready
<----------- city:7;message=Simulation will start...
<--------------------------- city:7;start=port:32323
==================== ROLLOUT PHASE =====================
UDP port: 55555 UDP port: 32323
city:7;action=walk -------------------------------->
<-- city:7:1590853116323:0;obs=45;reward=0;done=false
<-- city:7:1590853121058:0;obs=47;reward=0;done=false
<-- city:7:1590853126423:0;obs=48;reward=1;done=false
<-- city:7:1590853130429:0;obs=49;reward=0;done=false
<--- city:7:1590853134833:0;obs=51;reward=1;done=true
Figure 1
6. Additional Considerations
Because packet loss might prevent some PERLERT information from
reaching the other end, the following considerations are to be
taken into account:
After sending the "lobby start" message, the server instance SHOULD
keep the SERVER_LOBBY_PORT open for five (5) seconds and resend the
"lobby start" message to any client that contacts that port after
the simulation has started.
After the simulation is finished for a given client, that is, once a
"rollout step" message carries the "done" flag as "true", the server
instance SHOULD keep the SERVER_ROLLOUT_PORT open for ten (10)
seconds and keep listening for datagrams from that client. The server
instance SHOULD resend the appropriate "rollout step" datagram upon
receiving a client message within that period.
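The two grace periods above can be modelled with a small helper that remembers the datagram to resend and the time until which resending is allowed. This is only a sketch: ResendWindow and the two constants are hypothetical names, and the current time is passed in explicitly so the logic does not depend on any particular clock.

```python
# Hypothetical constants mirroring the periods suggested in this section.
LOBBY_GRACE_SECONDS = 5.0
ROLLOUT_GRACE_SECONDS = 10.0


class ResendWindow:
    """Holds a final datagram and allows resending it for a limited time."""

    def __init__(self, payload: bytes, grace_seconds: float, started_at: float):
        self.payload = payload
        # Resending is allowed up to and including this instant.
        self.deadline = started_at + grace_seconds

    def reply_for(self, now: float):
        """Return the datagram to resend, or None once the window has closed."""
        return self.payload if now <= self.deadline else None
```

A server could create one window when it sends "lobby start" and one per client when it sends the final "rollout step", for example with time.monotonic() as the clock, and call reply_for whenever a datagram arrives on the corresponding port.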
7. IANA Considerations
This memo includes no request to IANA.
8. Security Considerations
Both client and server implementations SHOULD use a fixed buffer size
as small as possible for receiving the UDP/IP packets.
Both client and server MAY encrypt the content of the messages.
Although the use of asymmetric public/private key pairs is
recommended, symmetric encryption with a pre-shared key is also
encouraged.
PERLERT is especially vulnerable to IP spoofing attacks, because
actions received during the rollout phase are only identified by the
IP address of the sender. Using a VPN is RECOMMENDED in order to
tunnel the information exchange.
9. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
2003, <https://www.rfc-editor.org/info/rfc3629>.
[RFC768] Postel, J., "User Datagram Protocol", STD 6, RFC 768,
DOI 10.17487/RFC0768, August 1980,
<https://tools.ietf.org/html/rfc768>.
Author's Address
Ruben Montero
University of A Coruna
Rua San Roque 9
A Coruna, Galicia 15002
ES
Phone: +34 692 983 851
Email: ruben.montero@udc.es