Internet Engineering Task Force | R. Montero |
Internet-Draft | University of A Coruna |
Intended status: Informational | August 13, 2020 |
Expires: February 14, 2021 |
Protocol for Evaluating Reinforcement Learning Environments in Real Time
draft-perlert-wg-00
This document defines a simple UDP-based protocol for communication between a server that simulates a reinforcement learning environment and a client that observes it and responds with actions.
Reinforcement learning problems are usually defined within the scope of a Markov Decision Process (MDP), where an agent sends an action belonging to an action space to an environment. The environment acts as a black box that returns an observation and a reward to the agent, whose goal is to maximize the total reward obtained.
Although the problem statement is easy to understand, there are no conventions on how a reinforcement learning simulation should communicate with a client agent, either on a local network or over the Internet. Such a convention would be especially useful for multiagent support and analysis.
PERLERT, the protocol defined in this document, assumes that server and client have shared certain information beforehand through another channel, such as a web page served over HTTP. For example, the client must know a port number and an instance number before it can participate in a simulation run on a server.
Also, although full feedback from the environment is often desirable, PERLERT focuses on real-time interaction in which human agents can interact with AI agents, even if that means some information is lost to network packet loss.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 14, 2021.
Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document specifies PERLERT (Protocol for Evaluation of Reinforcement Learning Environments in Real Time).
It is intended to be used in the analysis of reinforcement learning problems. In such problems, an agent sends an action to an environment. The environment acts as a black box that returns an observation and a reward to the agent, whose goal is to maximize the total reward obtained.
The main purpose of PERLERT is to make it easier to test and integrate agents with different implementations, and to run simulation servers separately from those agents.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Client and server exchange PERLERT messages in two separate phases, referred to in this document as the lobby phase and the rollout phase.
Messages defined in the following sections MUST be implemented as UDP/IP datagrams [RFC768].
Also, all messages SHOULD use the same text encoding. It is RECOMMENDED that both server and client encode messages using UTF-8 [RFC3629].
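As an illustration of the two requirements above, the following Python sketch encodes a message as UTF-8 and sends it as a single UDP datagram. The function names are hypothetical and are not defined by PERLERT, and the choice to replace malformed bytes on decoding, rather than reject the datagram, is an assumption of this sketch.

```python
import socket
from typing import Tuple

ENCODING = "utf-8"  # the encoding RECOMMENDED for both client and server


def encode_message(text: str) -> bytes:
    """Encode one PERLERT message as the payload of a single datagram."""
    return text.encode(ENCODING)


def decode_message(payload: bytes) -> str:
    """Decode a received payload. Malformed bytes are replaced with U+FFFD
    instead of raising (an assumption of this sketch, not a PERLERT rule)."""
    return payload.decode(ENCODING, errors="replace")


def open_udp_socket() -> socket.socket:
    """Create an IPv4 UDP socket (SOCK_DGRAM)."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def send_message(sock: socket.socket, text: str, address: Tuple[str, int]) -> int:
    """Send one message as one UDP datagram; returns the number of bytes sent."""
    return sock.sendto(encode_message(text), address)
```

A caller would create the socket once with `open_udp_socket()` and pass it to `send_message()` for every message of the session.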
In order of appearance:
This section specifies the content format for the message types that shall be implemented by PERLERT clients.
This section specifies the content format for the message types that shall be implemented by PERLERT servers.
All messages sent by one client MUST use the same UDP/IP source port, CLIENT_PORT, during the whole information exchange process, that is, from the moment the agent sends a "lobby registration request" to the server until it receives a "rollout step" response whose "done" flag is "true".
"lobby information", "lobby registration response", "lobby message", and "lobby start" datagrams MUST use the same UDP/IP source port, SERVER_LOBBY_PORT, for a given server instance.
"rollout step" datagrams MUST use the same UDP/IP source port, SERVER_ROLLOUT_PORT, for a given server instance.
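One way to satisfy these port rules is to bind a single socket per role and reuse it for every datagram, since the bound port becomes the source port of everything sent through that socket. The Python sketch below assumes this approach; `ClientEndpoint` and `open_server_sockets` are hypothetical names, and the default bind address is an assumption.

```python
import socket
from typing import Tuple


class ClientEndpoint:
    """One UDP socket, bound once, reused for the lobby and rollout phases.

    Because every datagram leaves through the same bound socket, all of them
    carry the same source port (CLIENT_PORT).
    """

    def __init__(self, client_port: int = 0, host: str = "0.0.0.0"):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Port 0 asks the operating system to pick a free port.
        self.sock.bind((host, client_port))
        self.client_port = self.sock.getsockname()[1]

    def send(self, text: str, server_host: str, server_port: int) -> int:
        """Send one UTF-8 message to either server port from CLIENT_PORT."""
        return self.sock.sendto(text.encode("utf-8"), (server_host, server_port))

    def close(self) -> None:
        self.sock.close()


def open_server_sockets(
    lobby_port: int, rollout_port: int, host: str = "0.0.0.0"
) -> Tuple[socket.socket, socket.socket]:
    """Bind one socket per server port. Replies sent through the first socket
    have SERVER_LOBBY_PORT as source; replies sent through the second have
    SERVER_ROLLOUT_PORT."""
    lobby = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    lobby.bind((host, lobby_port))
    rollout = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rollout.bind((host, rollout_port))
    return lobby, rollout
```

Opening a new socket for the rollout phase on the client side would normally change the source port, which is why this sketch keeps one socket for the whole session.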
This section provides a brief example of datagrams exchanged by one client and one server during a PERLERT session.
CLIENT                                         SERVER
==================== LOBBY PHASE ======================
UDP port: 55555                       UDP port: 32322
city:7;lobby -------------------------------------->
<-------------- city:7;agent0=open,citizen,cpu,ready
city:7;register=agent0,patrick -------------------->
<-------------------------- city:7;registered=agent0
city:7;ready=agent0,true -------------------------->
<--------- city:7;agent0=close,citizen,patrick,ready
<----------- city:7;message=Simulation will start...
<--------------------------- city:7;start=port:32323
==================== ROLLOUT PHASE =====================
UDP port: 55555                       UDP port: 32323
city:7;action=walk -------------------------------->
<-- city:7:1590853116323:0;obs=45;reward=0;done=false
<-- city:7:1590853121058:0;obs=47;reward=0;done=false
<-- city:7:1590853126423:0;obs=48;reward=1;done=false
<-- city:7:1590853130429:0;obs=49;reward=0;done=false
<--- city:7:1590853134833:0;obs=51;reward=1;done=true
Figure 1
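The datagrams in Figure 1 can be split mechanically into a header and a set of fields. The Python sketch below is based only on the layout visible in that figure: a colon-separated header, then semicolon-separated parts of the form `key=value` or a bare keyword. It is an assumption about the format rather than a normative grammar, `Message`, `parse_message`, and `is_done` are hypothetical names, and the meaning of the individual header fields is not interpreted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:
    header: List[str]        # colon-separated fields before the first ';'
    fields: Dict[str, str]   # 'key=value' parts; bare keywords map to ''


def parse_message(text: str) -> Message:
    """Split a datagram such as 'city:7;start=port:32323' into its parts."""
    header, _, body = text.partition(";")
    fields: Dict[str, str] = {}
    for part in body.split(";"):
        if not part:
            continue
        # Split on the first '=' only, so values may contain ':' or ','.
        key, _, value = part.partition("=")
        fields[key] = value
    return Message(header=header.split(":"), fields=fields)


def is_done(message: Message) -> bool:
    """True when a rollout step carries the 'done' flag set to 'true'."""
    return message.fields.get("done") == "true"
```

For example, `parse_message("city:7:1590853134833:0;obs=51;reward=1;done=true")` yields the header `["city", "7", "1590853134833", "0"]` and the fields `obs`, `reward`, and `done`, and `is_done()` returns `True` for it.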
Because packet loss might prevent some PERLERT information from reaching the other end, the following considerations apply:
After sending the "lobby start" message, the server instance SHOULD keep SERVER_LOBBY_PORT open for five (5) seconds and resend the "lobby start" message to any client that contacts that port after the simulation has started.
After the simulation has finished for a given client, that is, once a "rollout step" message carries the "done" flag set to "true", the server instance SHOULD keep SERVER_ROLLOUT_PORT open for ten (10) seconds and keep listening for datagrams from that client. The server instance SHOULD resend the appropriate "rollout step" datagram upon receiving a message from that client within that period.
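The two retransmission windows above could be tracked on the server with a small per-client cache, as in the following Python sketch. `FinalMessageCache` and its methods are hypothetical names, and keying the cache by the client's address is an assumption of this sketch. The caller remains responsible for actually sending the returned payload and for closing the port once the window has passed.

```python
import time
from typing import Dict, Optional, Tuple

LOBBY_GRACE_SECONDS = 5.0     # window for resending "lobby start"
ROLLOUT_GRACE_SECONDS = 10.0  # window for resending the final "rollout step"

Address = Tuple[str, int]


class FinalMessageCache:
    """Remembers the last datagram sent to each client for a grace period."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self._entries: Dict[Address, Tuple[bytes, float]] = {}

    def remember(self, client: Address, payload: bytes,
                 now: Optional[float] = None) -> None:
        """Store the final payload for a client and start its grace period."""
        now = time.monotonic() if now is None else now
        self._entries[client] = (payload, now + self.grace_seconds)

    def resend_payload(self, client: Address,
                       now: Optional[float] = None) -> Optional[bytes]:
        """Return the payload to resend, or None if the client is unknown
        or its grace period has expired."""
        now = time.monotonic() if now is None else now
        entry = self._entries.get(client)
        if entry is None:
            return None
        payload, deadline = entry
        if now > deadline:
            del self._entries[client]  # window closed; forget the client
            return None
        return payload
```

A server loop would call `remember()` when it sends "lobby start" or a "rollout step" with "done" set to "true", and `resend_payload()` whenever a datagram arrives from that client afterwards.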
This memo includes no request to IANA.
Both client and server implementations SHOULD use a fixed buffer size as small as possible for receiving the UDP/IP packets.
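A minimal Python sketch of receiving with a fixed buffer follows. The value of `BUFFER_SIZE` is a hypothetical choice, since this document does not prescribe one, and `receive_message` is a hypothetical helper. A datagram larger than the buffer may be truncated or rejected, depending on the platform.

```python
import socket
from typing import Optional, Tuple

BUFFER_SIZE = 512  # hypothetical; pick the smallest size that fits all messages


def receive_message(
    sock: socket.socket, timeout_seconds: float = 1.0
) -> Optional[Tuple[str, Tuple[str, int]]]:
    """Receive one datagram into a fixed-size buffer.

    Returns (text, sender_address), or None if nothing arrives in time.
    """
    sock.settimeout(timeout_seconds)
    try:
        payload, sender = sock.recvfrom(BUFFER_SIZE)
    except socket.timeout:
        return None
    return payload.decode("utf-8", errors="replace"), sender
```

Using a timeout keeps a real-time agent from blocking indefinitely when a datagram is lost.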
Both client and server MAY encrypt the content of the messages. Although the use of asymmetric public/private key pairs is recommended, symmetric encryption with a pre-shared key is also encouraged.
PERLERT is especially vulnerable to IP spoofing attacks, because actions received during the rollout phase are identified only by the IP address of the sender. Using a VPN is RECOMMENDED in order to tunnel the information exchange.
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997. |
[RFC3629] | Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2003. |
[RFC768] | Postel, J., "User Datagram Protocol", August 1980. |