NFSv4                                                    D. Noveck, Ed.
Internet-Draft                                                P. Shivam
Updates: 3530 (if approved)                                     C. Lever
Intended status: Standards Track                                B. Baker
Expires: October 01, 2014                                         ORACLE
                                                          March 30, 2014
NFSv4.0 migration: Specification Update
draft-ietf-nfsv4-rfc3530-migration-update-04
The migration feature of NFSv4 allows responsibility for a single filesystem to move from one server to another, without disruption to clients. Recent implementation experience has shown problems in the existing specification of this feature in NFSv4.0. This document clarifies and corrects the NFSv4.0 specification (RFC3530 and possible successors) to address these problems.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on October 01, 2014.
Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document is a standards-track document that corrects the existing definitive specification of the NFSv4.0 protocol, in [RFC3530], and the one expected to become definitive (now in [cur-rfc3530-bis]). Given this, the current document should be taken into account when learning about NFSv4.0, particularly by those concerned with issues that relate to:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
Implementation experience with transparent state migration has exposed a number of problems with the existing specification of this feature, in [RFC3530] and in RFC3530bis (see the draft at [cur-rfc3530-bis]). The symptoms were:
An analysis of these symptoms leads to the conclusion that the existing specifications have erred. They assume that locking state, including both stateids and clientid4's, is to be transferred as part of transparent state migration, but the troubling symptoms arise from their failure to describe how migrating state is to be integrated with existing client definition structures on the destination server.
Specification of requirements for the server to appropriately merge stateids associated with a common client boot instance encounters a difficult problem. The issue is that the common client practice with regard to the presentation of unique strings specifying client identity makes it essentially impossible for the client to determine whether or not two stateids, originally generated on different servers, are referable to the same client. This practice is allowed and endorsed, although not "RECOMMENDED", by existing NFSv4.0 specifications ([RFC3530] and RFC3530bis, whose current draft is at [cur-rfc3530-bis]).
To further complicate matters, upon prototyping of clients implementing an alternative approach, it has been found that some servers do not work well with these new clients. It appears that current circumstances, in which a particular client implementation pattern had been adopted universally, have resulted in servers that are unable to interoperate with alternate client implementation patterns. As a result, we have a situation that requires careful attention to compatibility issues to untangle.
This document updates the existing NFSv4.0 specifications ([RFC3530] and RFC3530bis, whose current draft is at [cur-rfc3530-bis]) as follows:
For a more complete explanation of the choices made in addressing these issues, see [info-migr].
This chapter is a replacement for sections 8.1.1 and 8.1.2 in [RFC3530] and for sections 9.1.1 and 9.1.2 in RFC3530bis (see the draft at [cur-rfc3530-bis]). The replaced sections are named "client ID" and "Server Release of Clientid."
It supersedes the replaced sections.
Because of the need for greater attention to and careful description of this area, this chapter is much larger than the sections it replaces. The principal changes/additions made by this chapter are:
The NFSv4 protocol contains a number of protocol entities to identify clients and client-based entities, for locking-related purposes:
The basis of the client identification infrastructure is encapsulated in the following data structure:
   struct nfs_client_id4 {
           verifier4     verifier;
           opaque        id<NFS4_OPAQUE_LIMIT>;
   };
The nfs_client_id4 structure uniquely defines a client boot instance as follows:
There are several considerations for how the client generates the id string:
Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully completed, the client uses the shorthand client identifier, of type clientid4, instead of the longer and less compact nfs_client_id4 structure. This shorthand client identifier (a client ID) is assigned by the server and should be chosen so that it will not conflict with a client ID previously assigned by the same server. This applies across server restarts or reboots.
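One common way of satisfying this uniqueness requirement, sketched below, is for the server to embed a per-boot instance value in the high-order bits of the clientid4 and a counter in the low-order bits. This is an illustrative implementation choice, not something mandated by the protocol; the class and field names are assumptions.

```python
import itertools


class ClientIdAllocator:
    """Sketch of server-side clientid4 assignment.

    The high 32 bits hold a per-boot instance value and the low 32
    bits a monotonically increasing counter, so a clientid4 handed
    out after a restart never collides with one handed out before,
    as long as the boot instance value differs across restarts.
    """

    def __init__(self, boot_instance):
        self.boot_instance = boot_instance & 0xFFFFFFFF
        self._counter = itertools.count(1)

    def next_clientid4(self):
        # 64-bit value: boot instance in the upper half, counter below
        return (self.boot_instance << 32) | (next(self._counter) & 0xFFFFFFFF)
```

A server taking this approach would derive the boot instance value from, for example, its boot time, so that values assigned in different server incarnations remain distinct.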
Note that the SETCLIENTID and SETCLIENTID_CONFIRM operations have a secondary purpose of establishing the information the server needs to make callbacks to the client for the purpose of supporting delegations. The client is able to change this information via SETCLIENTID and SETCLIENTID_CONFIRM within the same incarnation of the client without causing removal of the client's leased state.
Distinct servers MAY assign clientid4's independently, and will generally do so. Therefore, a client has to be prepared to deal with multiple instances of the same clientid4 value received on distinct IP addresses, denoting separate entities. When trunking of server IP addresses is not a consideration, a client should keep track of (IP-address, clientid4) pairs, so that each pair is distinct. For a discussion of how to address the issue in the face of possible trunking of server IP addresses, see Section 4.4.
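The bookkeeping just described can be sketched as a table keyed by the (IP-address, clientid4) pair, so that equal clientid4 values received from distinct server addresses remain distinct entries. This is illustrative only; the class and method names are not from the specification.

```python
class ClientIdTable:
    """Track per-server lease state keyed by (server-IP, clientid4).

    For a client not considering trunking, equal clientid4 values
    received from different server IP addresses denote separate
    entities and therefore occupy separate entries.
    """

    def __init__(self):
        self._leases = {}

    def record(self, server_ip, clientid4, lease):
        # each (IP-address, clientid4) pair is a distinct key
        self._leases[(server_ip, clientid4)] = lease

    def lookup(self, server_ip, clientid4):
        return self._leases.get((server_ip, clientid4))
```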
Owners of opens and owners of byte-range locks are separate entities and remain separate even if the same opaque arrays are used to designate owners of each. The protocol distinguishes between open-owners (represented by open_owner4 structures) and lock-owners (represented by lock_owner4 structures).
Both sorts of owners consist of a clientid4 and an opaque owner string. For each client, the set of distinct owner values used with that client constitutes the set of owners of that type, for the given client.
Each open is associated with a specific open-owner while each byte-range lock is associated with a lock-owner and an open-owner, the latter being the open-owner associated with the open file under which the LOCK operation was done.
When a clientid4 is presented to a server and that clientid4 is not valid, the server will reject the request with an error that depends on the reason for clientid4 invalidity. The error NFS4ERR_ADMIN_REVOKED is returned when the invalidation is the result of administrative action. When the clientid4 is unrecognizable, the error NFS4ERR_STALE_CLIENTID or NFS4ERR_EXPIRED may be returned. An unrecognizable clientid4 can occur for a number of reasons:
In the event of a server reboot, loss of lease state due to lease expiration, or administrative revocation of a clientid4, the client must obtain a new clientid4 by use of the SETCLIENTID operation and then proceed to any other necessary recovery for the server reboot case (See the section entitled "Server Failure and Recovery"). In cases of server or client error resulting in this error, use of SETCLIENTID to establish a new lease is desirable as well.
In the last two cases, different recovery procedures are required. See Section 5.3 for details. Note that in cases in which there is any uncertainty about which sort of handling is applicable, the distinguishing characteristic is that in reboot-like cases, the clientid4 and all associated stateids cease to exist while in migration-related cases, the clientid4 ceases to exist while the stateids are still valid.
The client must also employ the SETCLIENTID operation when it receives an NFS4ERR_STALE_STATEID error using a stateid derived from its current clientid4, since this indicates a situation, such as a server reboot, which has invalidated the existing clientid4 and associated stateids (see the section entitled "lock-owner" for details).
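The client-side handling described in the preceding paragraphs can be summarized in a dispatch sketch. The error names are from the protocol; the action strings and the stateids_still_valid flag are illustrative assumptions standing in for the distinguishing test described above.

```python
def recovery_action(error, stateids_still_valid=False):
    """Illustrative mapping from clientid4-invalidity errors to
    client recovery actions.

    In reboot-like cases the clientid4 and all associated stateids
    cease to exist, so the client does a new SETCLIENTID followed by
    reboot recovery.  In migration-related cases the stateids remain
    valid and the handling of Section 5.3 applies instead.
    """
    if error == "NFS4ERR_ADMIN_REVOKED":
        # administrative revocation: establish a new lease
        return "new SETCLIENTID, then re-establish lease"
    if error in ("NFS4ERR_STALE_CLIENTID", "NFS4ERR_EXPIRED",
                 "NFS4ERR_STALE_STATEID"):
        if stateids_still_valid:
            # clientid4 gone but stateids valid: migration-related
            return "migration recovery (Section 5.3)"
        # clientid4 and stateids all gone: reboot-like
        return "new SETCLIENTID, then reboot recovery"
    raise ValueError("not a clientid4-invalidity error: " + error)
```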
See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM for a complete specification of these operations.
If the server determines that the client holds no associated state for its clientid4, the server may choose to release that clientid4. The server may make this choice for an inactive client so that resources are not consumed by intermittently active clients. If the client contacts the server after this release, the server must ensure that the client receives the appropriate error, so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. The server must be very hesitant to release a client ID, since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically, a server would not release a client ID unless there had been no activity from that client for many minutes.
Note that if the id string in a SETCLIENTID request is properly constructed, and if the client takes care to use the same principal for each successive use of SETCLIENTID, then, barring an active denial of service attack, NFS4ERR_CLID_INUSE should never be returned.
However, client bugs, server bugs, or perhaps a deliberate change of the principal owner of the id string (such as may occur in the case in which a client changes security flavors, and under the new flavor, there is no mapping to the previous owner) will in rare cases result in NFS4ERR_CLID_INUSE.
In that event, when the server gets a SETCLIENTID specifying a client id string for which the server has a clientid4 that currently has no state, or for which it has state, but where the lease has expired, the server MUST allow the SETCLIENTID, rather than returning NFS4ERR_CLID_INUSE. The server MUST then confirm the new client ID if followed by the appropriate SETCLIENTID_CONFIRM.
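The normative server behavior above might be sketched as follows. The record fields and helper name are illustrative assumptions; only the error name and the MUST are from the text.

```python
def setclientid_check(server_record, request_principal):
    """Sketch of the server's NFS4ERR_CLID_INUSE decision.

    Per the text: when the id string maps to a clientid4 that
    currently has no state, or has state only under an expired
    lease, the server MUST allow the SETCLIENTID even though the
    principal differs, rather than returning NFS4ERR_CLID_INUSE.
    """
    if server_record is None:
        return "OK"                      # no existing record at all
    if server_record["principal"] == request_principal:
        return "OK"                      # same principal: no conflict
    if not server_record["has_state"] or server_record["lease_expired"]:
        return "OK"                      # MUST allow the SETCLIENTID
    return "NFS4ERR_CLID_INUSE"          # live state under another principal
```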
One particular aspect of the construction of the nfs_client_id4 string has proved recurrently troublesome. The client has a choice of:
Note that implementation considerations, including compatibility with existing servers, may make it desirable for a client to use both approaches, based on configuration information, such as mount options. This issue will be discussed in Section 4.7.
Construction of the client id string has arisen as a difficult issue because of the way in which the NFS protocols have evolved.
NFSv4.0 is unfortunately halfway between these two. The two client id string approaches have arisen in attempts to deal with the changing requirements of the protocol as implementation proceeded and as features that were not very substantial in early implementations of [RFC3530] became more substantial.
Both approaches have to deal with the asymmetry in client and server identity information between client and server. Each seeks to make the client's and the server's views match. In the process, each encounters some combination of inelegant protocol features and/or implementation difficulties. The choice of which to use is up to the client implementer and the sections below try to give some useful guidance.
The non-uniform client id string approach is an attempt to handle these matters in NFSv4.0 client implementations in as NFSv3-like a way as possible.
For a client using the non-uniform approach, all internal recording of clientid4 values is to include, whether explicitly or implicitly, the server IP address so that one always has an (IP-address, clientid4) pair. Two such pairs from different servers are always distinct even when the clientid4 values are the same, as they may occasionally be. In this approach, such equality is always treated as simple happenstance.
Making the client id string different on different server IP addresses results in a situation in which a server has no way of tying together information from the same client, when the client accesses multiple server IP addresses. As a result, it will treat a single client as multiple clients with separate leases for each server network address. Since there is no way in the protocol for the client to determine if two network addresses are connected to the same server, the resulting lack of knowledge is symmetrical and can result in simpler client implementations, in which there is a single clientid/lease per server network address.
Support for migration, particularly with transparent state migration, is more complex in the case of non-uniform client id strings. For example, migration of a lease can result in multiple leases for the same client accessing the same server addresses, vitiating many of the advantages of this approach. Therefore, client implementations that support migration with transparent state migration SHOULD NOT use the non-uniform client id string approach, except where it is necessary for compatibility with existing server implementations (For details of arranging use of multiple client id string approaches, see Section 4.7).
When the client id string is kept uniform, the server has the basis to have a single clientid4/lease for each distinct client. The problem that has to be addressed is the lack of explicit server identity information, which was made available in NFSv4.1.
When the same client id string is given to multiple IP addresses, the client can determine whether two IP addresses correspond to a single server, based on the server's behavior. This is the inverse of the strategy adopted for the non-uniform approach in which different server IP addresses are told about different clients, simply to prevent a server from manifesting behavior that is inconsistent with there being a single server for each IP address, in line with the traditions of NFS. So, to compare:
The uniform client id string approach makes it necessary to exercise more care in the definition of the nfs_client_id4 boot verifier:
The following are advantages for the implementation of using the uniform client id string approach:
The following implementation considerations might cause issues for client implementations.
How to balance these considerations depends on implementation goals.
As noted above, a client that needs to use the uniform client id string approach (e.g., to support migration) may also need to support existing servers with implementations that do not work properly in this case.
Some examples of such server issues include:
In order to support use of these sorts of servers, the client can use different client id string approaches for different mounts, as long as:
One effective way for clients to handle this is to support the uniform client id string approach as the default, but allow a mount option to specify use of the non-uniform client id string approach for particular mount points, as long as such mount points are not used when migration is to be supported.
In the case in which the same server has multiple mounts, and both approaches are specified for the same server, the client could have multiple clientid's corresponding to the same server, one for each approach and would then have to keep these separate.
This section provides an example of how trunking determination could be done by a client following the uniform client id string approach (whether this is used for all mounts or not). Clients need not follow this procedure but implementers should make sure that the issues dealt with by this procedure are all properly addressed.
We need to clarify the various possible purposes of trunking determination and the corresponding requirements as to server behavior. The following points should be noted:
For a client using the uniform approach, clientid4 values are treated as important information in determining server trunking patterns. For two different IP addresses to return the same clientid4 value is a necessary, though not a sufficient condition for them to be considered as connected to the same server. As a result, when two different IP addresses return the same clientid4, the client needs to determine, using the procedure given below or otherwise, whether the IP addresses are connected to the same server. For such clients, all internal recording of clientid4 values needs to include, whether explicitly or implicitly, identification of the server from which the clientid4 was received so that one always has a (server, clientid4) pair. Two such pairs from different servers are always considered distinct even when the clientid4 values are the same, as they may occasionally be.
In order to make this approach work, the client must have accessible, for each nfs_client_id4 used by the uniform approach (only one, in general), a list of all server IP addresses, together with the associated clientid4 values, SETCLIENTID principals, and authentication flavors. As a part of the associated data structures, there should be the ability to mark a server IP structure as having the same server as another and to mark an IP address as currently unresolved. One way to do this is to allow each such entry to point to another, with the pointer value being one of:
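One possible shape for this bookkeeping is sketched below: an entry per server IP address whose resolution pointer is either unresolved, absent (marking a lead IP address), or a reference to the lead entry for the same server. The names are illustrative, not from the specification.

```python
UNRESOLVED = "unresolved"


class AddrEntry:
    """Per-server-IP record kept by a client using the uniform
    client id string approach.

    `same_as` is None for a lead IP address, UNRESOLVED while
    trunking determination is pending, or a reference to the lead
    AddrEntry for the same server.
    """

    def __init__(self, ip, clientid4=None, principal=None, flavor=None):
        self.ip = ip
        self.clientid4 = clientid4
        self.principal = principal
        self.flavor = flavor
        self.same_as = UNRESOLVED

    def lead(self):
        """Follow the pointer chain to the lead entry for this server."""
        entry = self
        while isinstance(entry.same_as, AddrEntry):
            entry = entry.same_as
        return entry


def candidates(entries, clientid4):
    """Lead IP addresses whose clientid4 matches: a necessary but
    not sufficient condition for being the same server."""
    return [e for e in entries
            if e.same_as is None and e.clientid4 == clientid4]
```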
In order to keep the above information current, in the interests of the most effective trunking determination, RENEWs should be periodically done on each server. However, even if this is not done, the primary purpose of the trunking determination algorithm, to prevent confusion due to trunking hidden from the client, will be achieved.
Given this apparatus, when a SETCLIENTID is done and a clientid4 returned, the data structure can be searched for a matching clientid4 and if such is found, further processing can be done to determine whether the clientid4 match is accidental, or the result of trunking.
In this algorithm, when SETCLIENTID is done it will use the common nfs_client_id4 and specify the current target IP address as part of the callback parameters. We call the clientid4 and SETCLIENTID verifier returned by this operation XC and XV.
Note that when the client has done previous SETCLIENTID's, to any IP addresses, with more than one principal or authentication flavor, we have the possibility of receiving NFS4ERR_CLID_INUSE, since we do not yet know which of our connections with existing IP addresses might be trunked with our current one. In the event that the SETCLIENTID fails with NFS4ERR_CLID_INUSE, one must try all other combinations of principals and authentication flavors currently in use and eventually one will be correct and not return NFS4ERR_CLID_INUSE.
Note that at this point, no SETCLIENTID_CONFIRM has yet been done. This is because our SETCLIENTID has either established a new clientid4 on a previously unknown server or changed the callback parameters on a clientid4 associated with some already known server. Given that we don't want to confirm something that we are not sure we want to happen, what is to be done next depends on information about existing clientid4's.
For each lead IP address IPn with a clientid4 matching XC, the following steps are done. Because the RPC to do a SETCLIENTID could take considerable time, it is desirable for the client to perform these operations in parallel. Note that because the clientid4 is a 64-bit value, the number of such IP addresses that would need to be tested is expected to be quite small, even when the client is interacting with many NFSv4.0 servers. Thus, while parallel processing is desirable, it is not necessary.
Once the SCn values are gathered up by the procedure above, they are each tested by being used as the verifier for a SETCLIENTID_CONFIRM operation directed to the original IP address X, whose trunking relationships are to be determined. These RPC operations may be done in parallel.
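The testing step just described can be sketched as follows, with `confirm` standing in for the SETCLIENTID_CONFIRM RPC (an assumption; the real operation succeeds only when the verifier matches a pending confirmation on that server, which is what makes it usable as a trunking probe).

```python
def determine_trunking(x_addr, xc, gathered, confirm):
    """Sketch of the SETCLIENTID_CONFIRM probing step.

    `gathered` maps each lead IP address IPn whose clientid4 matched
    XC to the verifier SCn gathered for it.  `confirm(addr,
    clientid4, verifier)` stands in for a SETCLIENTID_CONFIRM RPC
    directed at `addr` and returns True on NFS4_OK.  Returns the
    lead address trunked with x_addr, or None when every clientid4
    match was accidental.  (In practice these probes may be issued
    in parallel.)
    """
    for ipn, scn in gathered.items():
        if confirm(x_addr, xc, scn):
            return ipn          # x_addr and ipn reach the same server
    return None                 # no trunking relationship found
```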
There are a number of things that should be noted at this point.
Further processing depends on the success or failure of the various SETCLIENTID_CONFIRM operations done in the step above.
In either of the cases, the entry is considered resolved and processing can be restarted for IP addresses whose clientid4 matched XC but whose resolution had been deferred.
The procedure described above must be performed so as to exclude the possibility that multiple SETCLIENTID's, done to different server IP addresses and returning the same clientid4 might "race" in such a fashion that there is no explicit determination of whether they correspond to the same server. The following possibilities for serialization are all valid and implementers may choose among them based on a tradeoff between performance and complexity. They are listed in order of increasing parallelism:
The procedure above has made no explicit mention of the possibility that server reboot can occur at any time. To address this possibility the client should make sure the following steps are taken:
Another situation not discussed explicitly above is the possibility that a SETCLIENTID done to one of the IPn addresses might take so long that it is necessary to time out the operation, to prevent unacceptably delaying the MOUNT operation. One simple possibility is to fail the MOUNT at this point. Because the average number of IP addresses that might have to be tested is quite small, this will not greatly increase the probability of MOUNT failure. Other possible approaches are:
This section gives more detailed guidance on client id construction. Note that among the items suggested for inclusion, there are many that may conceivably change. In order for the client id string to remain valid in such circumstances, the client should either:
A file is not always a valid choice to store such information, given the existence of diskless clients. In such situations, whatever facilities exist for a client to store configuration information such as boot arguments should be used.
Given the considerations listed in Section 4.2, an example of a well generated id string is one that includes:
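As a sketch of how such items, once chosen, might be combined and kept stable across the changes discussed above, the following generates a uniquifier once and persists it, so that the resulting id string does not change when items such as addresses or hostnames do. The file location, format, and prefix are illustrative assumptions; a diskless client would use boot arguments or similar configuration storage instead.

```python
import os
import uuid


def client_id_string(statefile):
    """Sketch: build an id string around a persisted uniquifier.

    The uniquifier is generated once and stored, so repeated calls
    (including across client reboots) yield the same id string, as
    the specification requires.
    """
    if os.path.exists(statefile):
        with open(statefile) as f:
            uniquifier = f.read().strip()
    else:
        uniquifier = uuid.uuid4().hex
        with open(statefile, "w") as f:
            f.write(uniquifier)
    # prefix identifying the implementation is illustrative
    return "NFSv4.0-client/" + uniquifier
```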
This chapter is a replacement for section 7.7.6, "Lock State and File System transitions", in RFC3530bis (see the draft at [cur-rfc3530-bis]).
With respect to [RFC3530], it serves as a replacement for section 8.14, "Migration, Replication, and State".
It supersedes the replaced sections.
These changes can be briefly summarized as follows:
When responsibility for handling a given filesystem is transferred to a new server (migration) or the client chooses to use an alternate server (e.g., in response to server unresponsiveness) in the context of filesystem replication, the appropriate handling of state shared between the client and server (i.e., locks, leases, stateids, and client IDs) is as described below. The handling differs between migration and replication.
If a server replica or a server immigrating a filesystem agrees to, or is expected to, accept opaque values from the client that originated from another server, then it is a wise implementation practice for the servers to encode the "opaque" values in network byte order. When doing so, servers acting as replicas or immigrating filesystems will be able to parse values like stateids, directory cookies, filehandles, etc. even if their native byte order is different from that of other servers cooperating in the replication and migration of the filesystem.
In the case of migration, the servers involved in the migration of a filesystem SHOULD transfer all server state associated with the migrating filesystem from source to the destination server. This must be done in a way that is transparent to the client. This state transfer will ease the client's transition when a filesystem migration occurs. If the servers are successful in transferring all state, the client will continue to use stateids assigned by the original server. Therefore the new server must recognize these stateids as valid and treat them as representing the same locks as they did on the source server.
In this context, the phrase "the same locks" means:
If transferring stateids from server to server would result in a conflict between a transferred stateid and a stateid already present on the destination server for an existing client, transparent state migration MUST NOT happen for that client. Servers participating in transparent state migration should coordinate their stateid assignment policies to make this situation unlikely or impossible. The means by which this might be done, like all of the inter-server interactions for migration, are not specified by the NFS version 4.0 protocol.
A client may determine the disposition of migrated state by using a stateid associated with the migrated state on the new server.
Since responsibility for an entire filesystem is transferred with a migration event, there is no possibility that conflicts will arise on the destination server as a result of the transfer of locks.
The servers may choose not to transfer the state information upon migration. However, this choice is discouraged, except where specific issues such as stateid conflicts make it necessary. When a server implements migration and it does not transfer state information, it SHOULD provide a filesystem-specific grace period, to allow clients to reclaim locks associated with files in the migrated filesystem. If it did not do so, clients would have to re-obtain locks, with no assurance that a conflicting lock was not granted after the filesystem was migrated and before the lock was re-obtained.
In the case of migration without state transfer, when the client presents state information from the original server (e.g. in a RENEW op or a READ op of zero length), the client must be prepared to receive either NFS4ERR_STALE_CLIENTID or NFS4ERR_BAD_STATEID from the new server. The client should then recover its state information as it normally would in response to a server failure. The new server must take care to allow for the recovery of state information as it would in the event of server restart.
In those situations in which state has not been transferred, as shown by a return of NFS4ERR_BAD_STATEID, the client may attempt to reclaim locks in order to take advantage of cases in which the destination server has set up a file-system-specific grace period in support of the migration.
Handling of clientid values is similar to that for stateids. However, there are some differences that derive from the fact that a clientid is an object which spans multiple filesystems while a stateid is inherently limited to a single filesystem.
The clientid4 and nfs_client_id4 information (id string and boot verifier) will be transferred with the rest of the state information and the destination server should use that information to determine appropriate clientid4 handling. Although the destination server may make state stored under an existing lease available under the clientid4 used on the source server, the client should not assume that this is always so. In particular,
When leases are not merged, the transfer of state should result in creation of a confirmed client record with empty callback information but matching the {v, x, c} with v and x derived from the transferred client information and c chosen by the destination server.
In such cases, the client SHOULD re-establish new callback information with the new server as soon as possible, according to sequences described in sections "Operation 35: SETCLIENTID - Negotiate Client ID" and "Operation 36: SETCLIENTID_CONFIRM - Confirm Client ID". This ensures that server operations are not delayed due to an inability to recall delegations. The client can determine the new clientid (the value c) from the response to SETCLIENTID.
The client can use its own information about leases with the destination server to see if lease merger should have happened. When there is any ambiguity, the client MAY use the above procedure to set the proper callback information and find out, as part of the process, the correct value of its clientid with respect to the server in question.
In addition to stateids, the locks they represent, and clientid information, servers also need to transfer information related to the current status of openowners and lockowners.
This information includes:
When clients are implemented to isolate each openowner and lockowner to a particular filesystem, the server SHOULD transfer this information together with the lock state. The owner ceases to exist on the source server and is reconstituted on the destination server.
Note that when servers take this approach for all owners whose state is limited to the particular filesystem being migrated, doing so will not cause difficulties for clients not adhering to an approach in which owners are isolated to particular filesystems. As long as the client recognizes the loss of transferred state, the protocol allows the owner in question to disappear and the client may have to deal with an owner confirmation request that would not have occurred in the absence of the migration.
When migration occurs and the source server discovers an owner whose state includes the migrated filesystem as well as other filesystems, it cannot transfer the associated owner state. Instead, the existing owner state stays in place, but propagation of owner state is done as specified below.
Note that a server may obey all of the conditions above without the overhead of keeping track of the set of filesystems that any particular owner has been associated with. Consider a situation in which the source server has decided to keep lock-related state associated with a filesystem fixed, preparatory to propagating it to the destination filesystem. If a client is free to create new locks associated with existing owners on other filesystems, the owner information may be propagated to the destination filesystem, even though, at the time the filesystem migration is recognized by the client to have occurred, the last operation associated with the owner may not be associated with the migrating filesystem.
When the source server propagates owner-related state associated with owners that span multiple filesystems, it will propagate the owner sequence value to the destination server, while retaining it on the source server, as long as there exists state associated with the owner. When owner information is propagated in this way, the source and destination servers start with the same owner sequence value, which is then updated independently as the client makes owner-related requests to the servers. Note that each server will have some period in which the associated sequence value for an owner is identical to the one transferred as part of migration. At those times, when a server receives a request with a matching owner sequence value, it MUST NOT respond with the associated stored response if the associated filesystem is not, when the reissued request is received, part of the set of filesystems handled by that server.
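The MUST NOT in the paragraph above can be sketched as a guard on the server's per-owner reply cache. The names are illustrative; `handled_filesystems` stands for the set of filesystems the responding server currently handles.

```python
def replay_lookup(cache_entry, request_seq, handled_filesystems):
    """Sketch of retransmission handling for an owner whose state
    was propagated by migration.

    The stored reply may be returned only when the request's owner
    sequence matches the cached one AND the filesystem the original
    request acted on is still handled by this server; otherwise the
    stored response MUST NOT be replayed.
    """
    if request_seq != cache_entry["seq"]:
        return None          # not a retransmission of the cached request
    if cache_entry["filesystem"] not in handled_filesystems:
        return None          # filesystem migrated away: do not replay
    return cache_entry["reply"]
```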
One sort of case may require more complex handling. When multiple filesystems are migrated, in sequence, to a specific destination server, an owner may be migrated to a destination server on which it was already present, raising the issue of how the resident owner information and the newly migrated information are to be reconciled.
If a server performing filesystem migration encounters a situation in which owner information needs to be merged, it MAY decline to transfer such state, even if it chooses to handle other cases in which locks for a given owner are spread among multiple filesystems.
As a way of understanding the situations which need to be addressed when owner information needs to be merged, consider the following scenario:
As a result of such situations, the server needs to provide support for dealing with retransmission of owner-sequenced requests that diverges from the typical model, in which replies are retained only for the single request whose sequence value exactly matches the last one sent. Such support need only be provided for requests issued before the migration event whose status as the last in sequence is invalidated by that event.
When servers do support such merger of owner information on the destination server, the following rules are to be adhered to:
As a result of the operation of these rules, there are three ways in which we can have more reply data than is typically present, i.e., data for a single request per owner, whose sequence is the last one received and where the next sequence to be used is one beyond that.
Here are some guidelines as to when servers can drop such additional reply data which is created as part of owner information migration.
Since client switch-over in the case of replication is not under server control, the handling of state is different. In this case, leases, stateids, and client IDs do not have validity across a transition from one server to another. The client must re-establish its locks on the new server. This can be compared to the re-establishment of locks by means of reclaim-type requests after a server reboot. The difference is that the server has no provision to distinguish requests reclaiming locks from those obtaining new locks, or to defer the latter. Thus, a client re-establishing a lock on the new server (by means of a LOCK or OPEN request) may have the request denied due to a conflicting lock. Since replication is intended for read-only use of filesystems, such denial of locks should not pose large difficulties in practice. When an attempt to re-establish a lock on a new server is denied, the client should treat the situation as if its original lock had been revoked.
A filesystem can be migrated to another server while a client that has state related to that filesystem is not actively submitting requests to it. In this case, the migration is reported to the client during lease renewal. Lease renewal can occur either explicitly via a RENEW operation, or implicitly when the client performs a lease-renewing operation on another filesystem on that server.
In order for the client to schedule renewal of leases that may have been relocated to the new server, the client must find out about lease relocation before those leases expire. Similarly, when migration occurs but there has not been transparent state migration, the client needs to find out about the change soon enough to be able to reclaim the lock within the destination server's grace period. To accomplish this, all operations which implicitly renew leases for a client (such as OPEN, CLOSE, READ, WRITE, RENEW, LOCK, and others) will return the error NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be renewed has been transferred to a new server. Note that when the transfer of responsibility leaves remaining state for that lease on the source server, the lease is renewed just as it would have been in the NFS4ERR_OK case, despite the return of the error. The server stops returning NFS4ERR_LEASE_MOVED once it receives a GETATTR(fs_locations) from the client for each filesystem for which a lease has been moved to a new server. Normally the client sends this after receiving an NFS4ERR_MOVED for an access to the filesystem, but the server is not required to verify that this happens in order to terminate the return of NFS4ERR_LEASE_MOVED. By convention, the compounds containing GETATTR(fs_locations) SHOULD include an appended RENEW operation to permit the server to identify the client getting the information.
Note that the NFS4ERR_LEASE_MOVED error is only required when responsibility for at least one stateid has been affected. In the case of a null lease, where the only associated state is a clientid, an NFS4ERR_LEASE_MOVED error SHOULD NOT be generated.
Upon receiving the NFS4ERR_LEASE_MOVED error, a client that supports filesystem migration MUST perform the necessary GETATTR operation for each of the filesystems containing state that have been migrated and so give the server evidence that it is aware of the migration of the filesystem. Once the client has done this for all migrated filesystems on which the client holds state, the server MUST resume normal handling of stateful requests from that client.
One way in which clients can do this efficiently in the presence of large numbers of filesystems is described below. This approach divides the process into two phases, one devoted to finding the migrated filesystems and the second devoted to doing the necessary GETATTRs.
The client can find the migrated filesystems by building and issuing one or more COMPOUND requests, each consisting of a set of PUTFH/GETFH pairs, each pair using an fh in one of the filesystems in question. All such COMPOUND requests can be done in parallel. The successful completion of such a request indicates that none of the filesystems interrogated have been migrated while termination with NFS4ERR_MOVED indicates that the filesystem getting the error has migrated while those interrogated before it in the same COMPOUND have not. Those whose interrogation follows the error remain in an uncertain state and can be interrogated by restarting the requests from after the point at which NFS4ERR_MOVED was returned or by issuing a new set of COMPOUND requests for the filesystems which remain in an uncertain state.
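The first phase described above can be sketched non-normatively as follows. The compound_probe() callable stands in for an actual COMPOUND of PUTFH/GETFH pairs sent over RPC; all names here are illustrative, not part of the protocol.

```python
# Illustrative sketch of phase one: partition candidate filesystems into
# COMPOUND probes and classify the outcome. compound_probe(chunk) models a
# COMPOUND of PUTFH/GETFH pairs: it returns the index of the first
# filesystem that failed with NFS4ERR_MOVED, or None if all succeeded.

def find_migrated(filesystems, compound_probe, batch=16):
    """Return the set of filesystems found to have migrated."""
    migrated, uncertain = set(), list(filesystems)
    while uncertain:
        chunk, uncertain = uncertain[:batch], uncertain[batch:]
        moved_at = compound_probe(chunk)
        if moved_at is None:
            continue                      # none of these have migrated
        migrated.add(chunk[moved_at])
        # Filesystems after the error point remain in an uncertain state
        # and are re-probed on a later pass, per the text above.
        uncertain.extend(chunk[moved_at + 1:])
    return migrated

MOVED = {"c", "e"}                        # simulated migrated filesystems
def fake_probe(chunk):
    for i, fs in enumerate(chunk):
        if fs in MOVED:
            return i
    return None

assert find_migrated(list("abcdef"), fake_probe, batch=3) == {"c", "e"}
```

In practice the per-COMPOUND batches could be issued in parallel, as the text notes; the sequential loop here is only for clarity.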
Once the migrated filesystems have been found, all that is needed is for the client to give evidence to the server that it is aware of the migrated status of filesystems found by this process, by interrogating the fs_locations attribute for an fh within each of the migrated filesystems. The client can do this by building and issuing one or more COMPOUND requests, each of which consists of a set of PUTFH operations, each followed by a GETATTR of the fs_locations attribute. A RENEW is necessary to enable the operations to be associated with the lease returning NFS4ERR_LEASE_MOVED. Once the client has done this for all migrated filesystems on which the client holds state, the server will resume normal handling of stateful requests from that client.
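The assembly of the second-phase compounds might look like the following sketch. The tuple encoding of operations is purely illustrative (nothing here is a real NFSv4.0 XDR encoding); it only shows the op ordering: PUTFH/GETATTR pairs with a trailing RENEW.

```python
# Hypothetical sketch of phase two: build COMPOUND op lists, each a set of
# PUTFH operations followed by GETATTR of fs_locations, with a RENEW
# appended so the server can associate the compound with the lease that
# returned NFS4ERR_LEASE_MOVED.

def build_notify_compounds(migrated_fhs, clientid, per_compound=8):
    """Yield op lists covering all migrated filehandles."""
    for i in range(0, len(migrated_fhs), per_compound):
        ops = []
        for fh in migrated_fhs[i:i + per_compound]:
            ops.append(("PUTFH", fh))
            ops.append(("GETATTR", "fs_locations"))
        ops.append(("RENEW", clientid))   # identifies the client's lease
        yield ops

compounds = list(build_notify_compounds(["fh1", "fh2", "fh3"], 0x42,
                                        per_compound=2))
assert compounds[0][-1] == ("RENEW", 0x42)
```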
In order to support legacy clients that do not handle the NFS4ERR_LEASE_MOVED error correctly, the server SHOULD time out after a wait of at least two lease periods, at which time it will resume normal handling of stateful requests from all clients. If a client then attempts to access the migrated filesystem, the server MUST reply NFS4ERR_MOVED. In this situation, it is likely that the client will find its lease expired, although a server may use "courtesy" locks to mitigate the issue.
When the client receives an NFS4ERR_MOVED error, the client can follow the normal process to obtain the destination server information (through the fs_locations attribute) and perform renewal of those leases on the new server. If the server has not had state transferred to it transparently, the client will receive either NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server, as described above. The client can then recover state information as it does in the event of server failure.
Aside from recovering from a migration, there are other reasons a client may wish to retrieve fs_locations information from a server. When a server becomes unresponsive, for example, a client may use cached fs_locations data to discover an alternate server hosting the same filesystem data. A client may periodically request fs_locations data from a server in order to keep its cache of fs_locations data fresh.
Since a GETATTR(fs_locations) operation would be used for refreshing cached fs_locations data, a server could mistake such a request as indicating recognition of an NFS4ERR_LEASE_MOVED condition. Therefore a compound which is not intended to signal that a client has recognized a migrated lease SHOULD be prefixed with a guard operation which fails with NFS4ERR_MOVED if the file handle being queried is no longer present on the server. The guard can be as simple as a GETFH operation.
Though unlikely, it is possible that the target of such a compound could be migrated in the time after the guard operation is executed on the server but before the GETATTR(fs_locations) operation is encountered. When a client issues a GETATTR(fs_locations) operation as part of a compound not intended to signal recognition of a migrated lease, it SHOULD be prepared to process fs_locations data in the reply that shows the current location of the filesystem is gone.
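A toy model of the guarded compound described above is sketched below. The evaluator, the tuple op encoding, and the present_fhs set are all hypothetical; the point illustrated is that the GETFH guard fails with NFS4ERR_MOVED before GETATTR(fs_locations) is reached, so a cache refresh cannot be mistaken for lease-moved recognition.

```python
# Non-normative model of a guarded fs_locations refresh. The GETFH guard
# precedes GETATTR(fs_locations); if the filehandle is no longer present,
# compound processing stops at the guard with NFS4ERR_MOVED.

def refresh_compound(fh):
    return [("PUTFH", fh), ("GETFH", None), ("GETATTR", "fs_locations")]

def run_compound(ops, present_fhs):
    """Toy evaluator: stop at the first op failing with NFS4ERR_MOVED."""
    results, current = [], None
    for op, arg in ops:
        if op == "PUTFH":
            current = arg
        elif current not in present_fhs:
            results.append((op, "NFS4ERR_MOVED"))
            return results              # later ops are never executed
        results.append((op, "NFS4_OK"))
    return results

# Filesystem still present: the whole compound succeeds.
assert run_compound(refresh_compound("fhX"), {"fhX"}) == [
    ("PUTFH", "NFS4_OK"), ("GETFH", "NFS4_OK"), ("GETATTR", "NFS4_OK")]
# Filesystem migrated: the guard fails and GETATTR is never reached.
assert run_compound(refresh_compound("fhX"), set()) == [
    ("PUTFH", "NFS4_OK"), ("GETFH", "NFS4ERR_MOVED")]
```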
In order that the client may appropriately manage its leases in the case of migration, the destination server must establish proper values for the lease_time attribute.
When state is transferred transparently, that state should include the correct value of the lease_time attribute. The lease_time attribute on the destination server must never be less than that on the source since this would result in premature expiration of leases granted by the source server. Upon migration in which state is transferred transparently, the client is under no obligation to re-fetch the lease_time attribute and may continue to use the value previously fetched (on the source server).
In the case in which lease merger occurs as part of state transfer, the lease_time attribute of the destination lease remains in effect. The client can simply renew that lease with its existing lease_time attribute. State in the source lease is renewed at the time of transfer so that it cannot expire, as long as the destination lease is appropriately renewed.
If state has not been transferred transparently (i.e., the client needs to reclaim or re-obtain its locks), the client should fetch the value of lease_time on the new (i.e., destination) server, and use it for subsequent locking requests. However the server must respect a grace period at least as long as the lease_time on the source server, in order to ensure that clients have ample time to reclaim their locks before potentially conflicting non-reclaimed locks are granted. The means by which the new server obtains the value of lease_time on the old server is left to the server implementations. It is not specified by the NFS version 4.0 protocol.
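The grace-period constraint above amounts to a simple lower bound, sketched here under the assumption that the destination server has somehow learned the source server's lease_time (the means of which, as noted, is implementation-specific). The function name is illustrative.

```python
# Hypothetical sketch: the destination server's grace period after a
# non-transparent migration must be at least the source server's
# lease_time, so reclaims cannot be beaten by conflicting new locks.

def destination_grace_period(source_lease_time, local_grace):
    """Both values in seconds; returns the grace period to enforce."""
    return max(source_lease_time, local_grace)

assert destination_grace_period(90, 60) == 90   # source lease dominates
assert destination_grace_period(45, 60) == 60   # local policy dominates
```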
This chapter provides suggestions to help server implementers deal with issues involved in the transparent transfer of filesystem-related data between servers. Servers are not obliged to follow these suggestions, but should be sure that their approach to the issues handles all of the potential problems addressed below.
In many cases, state transfer will be part of a larger function wherein the contents of a filesystem are transferred from server to server. Although specifics will vary with the implementation, the relation between the transfer of persistent file data and metadata and the transfer of state will typically be described by one of the cases below.
When transferring locking state from the source to a destination server, there will be occasions when the source server will need to prevent operations that modify the state being transferred. For example, if the locking state at time T is sent to the destination server, any state change that occurs on the source server after that time but before the filesystem transfer is made effective will mean that the state on the destination server differs from that on the source server, which is the state that matches what the client expects to see.
In general, a server can prevent some set of server-maintained data from changing by returning NFS4ERR_DELAY on operations which attempt to change that data. In the case of locking state for NFSv4.0, there are two specific issues that might interfere:
Note that the first problem and many instances of the second can be addressed by returning NFS4ERR_DELAY on the operations that establish a filehandle within the target as one of the filehandles associated with the request, i.e. as either the current or saved filehandle. This would require returning NFS4ERR_DELAY under the following circumstances:
Note that if the server establishes and maintains a situation in which no request has, as either the current or saved filehandle, a filehandle within the target filesystem, no special handling of SAVEFH or RESTOREFH is required. Thus the fact that these operations cannot return NFS4ERR_DELAY is not a problem since neither will establish a filehandle in the target filesystem as the current filehandle.
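The gating discussed above can be sketched non-normatively as follows. The FROZEN set, the fs_of() mapping, and set_current_fh() are illustrative placeholders for server-internal machinery, not protocol elements.

```python
# Hypothetical sketch: once state migration for a filesystem has begun,
# any operation that would make a filehandle within that filesystem the
# current (or saved) filehandle gets NFS4ERR_DELAY, so no request can go
# on to modify that filesystem's locking state.

FROZEN = {"fsA"}                       # filesystems with migration underway

def fs_of(fh):
    """Toy filehandle-to-filesystem mapping for illustration."""
    return fh.split("/", 1)[0]

def set_current_fh(fh):
    """Called when PUTFH, OPEN, LOOKUP, etc. would set the current fh."""
    if fs_of(fh) in FROZEN:
        return "NFS4ERR_DELAY"         # client retries after the transition
    return "NFS4_OK"

assert set_current_fh("fsA/file1") == "NFS4ERR_DELAY"
assert set_current_fh("fsB/file2") == "NFS4_OK"
```

Because SAVEFH and RESTOREFH only copy between the current and saved filehandles, checking at the point where a filehandle is first established suffices, as the text above observes.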
If the server is to establish the situation described above, it may have to take special note of long-running requests that started before state migration. Part of any solution to this issue will involve distinguishing two separate points in time at which handling for the target filesystem will change. Let us distinguish:
For a server to decide on T', it must ensure that requests started before T cannot change target filesystem locking state, given that all those started after T are dealt with by returning NFS4ERR_DELAY upon setting filehandles within the target filesystem. Among the ways of doing this are:
The set of operations that change locking state include two that cannot be dealt with by the above approach, because they are not filesystem-specific and do not use a current filehandle as an implicit parameter.
The approach outlined above, wherein NFS4ERR_DELAY is returned based primarily on the use of current and saved filehandles in the filesystem, prevents all reference to the transitioning filesystem, rather than limiting the delayed operations to those that change locking state on the transitioning filesystem. Because of this, servers may choose to limit the time during which this broad approach is used by adopting a layered approach to the issue.
A possible sequence would be the following.
This chapter contains a number of items which relate to the changes in the chapters above, but which, for one reason or another, exist in different portions of the specification to be updated.
We summarize here all the remaining changes, not included in the two main chapters.
The definition of this error is now as follows:
The existing error tables should be considered modified to allow NFS4ERR_DELAY to be returned by RELEASE_LOCKOWNER. However, the scope of this addition is limited and is not to be considered as making this error return generally acceptable.
It needs to be made clear that servers may not return this error to clients not prepared to support filesystem migration. Such clients may be following the error specifications in [RFC3530] and [cur-rfc3530-bis] and so might not expect NFS4ERR_DELAY to be returned on RELEASE_LOCKOWNER.
The following constraint applies to this additional error return, as if it were a note appearing together with the newly allowed error code:
client, callback, callback_ident -> clientid, setclientid_confirm
struct SETCLIENTID4args {
        nfs_client_id4  client;
        cb_client4      callback;
        uint32_t        callback_ident;
};
struct SETCLIENTID4resok {
        clientid4       clientid;
        verifier4       setclientid_confirm;
};

union SETCLIENTID4res switch (nfsstat4 status) {
 case NFS4_OK:
        SETCLIENTID4resok resok4;
 case NFS4ERR_CLID_INUSE:
        clientaddr4     client_using;
 default:
        void;
};
The client uses the SETCLIENTID operation to notify the server of its intention to use a particular client identifier, callback, and callback_ident for subsequent requests that entail creating lock, share reservation, and delegation state on the server. Upon successful completion the server will return a shorthand client ID which, if confirmed via a separate step, will be used in subsequent file locking and file open requests. Confirmation of the client ID must be done via the SETCLIENTID_CONFIRM operation to return the client ID and setclientid_confirm values, as verifiers, to the server. The reason why two verifiers are necessary is that it is possible to use SETCLIENTID and SETCLIENTID_CONFIRM to modify the callback and callback_ident information but not the shorthand client ID. In that event, the setclientid_confirm value is effectively the only verifier.
The callback information provided in this operation will be used if the client is provided an open delegation at a future point. Therefore, the client must correctly reflect the program and port numbers for the callback program at the time SETCLIENTID is used.
The callback_ident value is used by the server on the callback. The client can leverage the callback_ident to eliminate the need for more than one callback RPC program number, while still being able to determine which server is initiating the callback.
To understand how to implement SETCLIENTID, we introduce the following notation. Let:
Since SETCLIENTID is a non-idempotent operation, we assume that the server implements a duplicate request cache (DRC).
When the server gets a SETCLIENTID { v, x, k } request, it first does a number of preliminary checks as listed below before proceeding to the main part of SETCLIENTID processing.
If the SETCLIENTID has not been dealt with by DRC processing, and has not been rejected with an NFS4ERR_CLID_INUSE error, then the main part of SETCLIENTID processing proceeds, as described below.
The server generates the clientid and setclientid_confirm values and must take care to ensure that these values are extremely unlikely to ever be regenerated.
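One common (entirely implementation-specific) way to satisfy this uniqueness requirement is sketched below: derive the clientid from the server boot time plus a monotonic counter, and the setclientid_confirm verifier from random bits, so values are extremely unlikely to recur even across server reboots. The function and variable names are illustrative.

```python
# Hypothetical sketch of clientid / setclientid_confirm generation.
# Boot time in the high 32 bits makes values from different server
# instances distinct; the counter makes values within one instance
# distinct; the verifier is 64 random bits.

import itertools
import secrets
import time

_boot_time = int(time.time()) & 0xFFFFFFFF   # captured once at startup
_counter = itertools.count()                  # monotonic within this boot

def new_clientid_pair():
    clientid = (_boot_time << 32) | (next(_counter) & 0xFFFFFFFF)
    setclientid_confirm = secrets.token_bytes(8)   # 64-bit verifier
    return clientid, setclientid_confirm

a, _ = new_clientid_pair()
b, _ = new_clientid_pair()
assert a != b      # counter guarantees uniqueness within one boot
```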
The last paragraph of the "Security Considerations" section should be revised to read as follows:
This section is modified as specified in Section 7.5.
This document does not require actions by IANA.
The editor and authors of this document gratefully acknowledge the contributions of Trond Myklebust of NetApp and Robert Thurlow of Oracle. We also thank Tom Haynes of NetApp and Spencer Shepler of Microsoft for their guidance and suggestions.
Special thanks go to members of the Oracle Solaris NFS team, especially Rick Mesta and James Wahlig, for their work implementing an NFSv4.0 migration prototype and identifying many of the issues addressed here.
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |
[RFC3530] | Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, C., Eisler, M. and D. Noveck, "Network File System (NFS) version 4 Protocol", RFC 3530, April 2003. |
[RFC1813] | Callaghan, B., Pawlowski, B. and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, June 1995. |
[RFC5661] | Shepler, S., Eisler, M. and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, January 2010. |
[cur-rfc3530-bis] | Haynes, T. and D. Noveck, "Network File System (NFS) Version 4 Protocol", 2014. Work in progress. |
[info-migr] | Noveck, D., Shivam, P., Lever, C. and B. Baker, "NFSv4 migration: Implementation experience and spec issues to resolve", 2014. Work in progress. |