Network Working Group                                             Y. Cui
Internet-Draft                                                    Z. Lai
Intended status: Informational                                    L. Sun
Expires: May 5, 2016                                 Tsinghua University
                                                        November 2, 2015
Internet Storage Sync: Problem Statement
draft-cui-iss-problem-03
Internet storage services have become increasingly popular. They attract a huge number of users and account for a significant share of Internet traffic. Most existing Internet storage services use proprietary sync protocols with differing capabilities to achieve data sync. However, a single Internet storage service using its own proprietary sync protocol has intrinsic limitations on service usability and network performance. This document outlines the problems caused by using proprietary sync protocols and by missing key capabilities. It also shows the demand for a standard sync protocol designed to achieve better usability and sync performance.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 5, 2016.
Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Internet storage services provide a convenient way for users to synchronize local files or folders with remote servers. In recent years, Internet storage services have gained tremendous popularity and account for a large amount of Internet traffic. This high public interest has also pushed various providers to enter the Internet storage market. Services like Dropbox, Google Drive, OneDrive and Box are becoming pervasive in people's routines. Dropbox, typically considered one of the leading providers, announced in June 2015 that it had more than 400 million registered users [users], and this number will keep growing. Internet storage services enable users to access, operate on and share their data from anywhere, on any device, at any time and with any connectivity. Internet storage services also provide powerful APIs which allow third-party applications to offload the burden of data storage and management to the server. By aggregating users' files and application data on the server, Internet storage services are becoming the "data entrance" for personal users.
The sync protocol is the key design consideration of an Internet storage service. A sync protocol can be equipped with several capabilities that optimize storage usage and speed up data transmission. Existing Internet storage services employ proprietary sync protocols to store/retrieve user data to/from their remote servers. However, using proprietary sync protocols with different capabilities in different Internet storage services imposes intrinsic limitations on service usability and network performance.
Multi-service usability: Users may use multiple Internet storage services for their diverse performance and functionality. In addition, because an Internet storage service has full access to user data, that data is at risk when the service is attacked or when authorities require the provider to expose it; some enterprise users may therefore want to run their own network-based storage service. Furthermore, it is complicated for developers to combine their applications with several Internet storage services through different APIs. Proprietary protocols also make it impossible for a user of one Internet storage service to synchronize data with users of another service. Moreover, to use multiple services a user must install a series of client applications with similar functionality, which wastes local resources and degrades the user experience.
Missing or misused capabilities: Previous works show that existing Internet storage services have different capability configurations and implementations. These capabilities are closely related to each other and help to synchronize user data efficiently. However, most storage services are found to lack key capabilities, or to configure them unreasonably, which may result in unexpected sync failures and sync inefficiency. How to reasonably design and implement the capabilities of a sync protocol has become a critical problem for providers.
To address the problems mentioned above, an open, standard sync protocol is required. In addition, this standard sync protocol is expected to support the useful capabilities needed to avoid unexpected sync failures and to improve network performance.
This document outlines the problems arising in existing Internet storage services with their various proprietary sync protocols. Section 2 lists the terminology and related concepts of Internet storage services. Section 3 introduces the architecture of existing Internet storage services. Section 4 describes the main problems and issues that need to be considered. Section 5 explains the advantages of using an open, standard sync protocol. Section 6 presents a high-level understanding of the sync protocol. Section 7 identifies the differences between ISS and related IETF work (i.e. WebDAV).
Data synchronization (sync): The primary technique of Internet storage services. It enables the client to automatically propagate local file changes to the remote servers through network communication.
Client: An application installed at the user side (possibly on multiple terminals). It enables users to access and experience the Internet storage service.
Control server: The entity that takes the responsibility of authenticating users, managing metadata information and also notifying changes to the client. It stores authentication and metadata information of users.
Data storage server: The entity that stores the synchronized files of users.
Control data: The control information exchanged with the control server to fulfil the data sync process. Typical control data includes metadata (e.g. hashes of chunks) and authentication information.
Content data: The original data of the local file, often transmitted in the form of small chunks.
Sync protocol: A communication protocol between the client and the remote servers to achieve data sync. It contains a control flow and a data flow. Sync protocols are typically built on HTTP/HTTPS.
Sync efficiency: A performance metric that indicates how fast the changes can be synchronized to the Internet with the lowest traffic overhead.
Useful capabilities to improve sync efficiency: chunking, bundling, deduplication, delta-encoding and compression. These capabilities are discussed in detail in Section 4.
The architecture of most Internet storage services is generally composed of three major components: client, control server and data storage server. The whole architecture is shown in Figure 1.
    * * * * * * * * * * * * * * * * * * * * * * *
    *                  INTERNET                  *
    *  +------------+                            *
 ------|  Control   |      +------------+        *
 |  *  |  server    |      |Data storage|==========
 |  *  +------------+      |  servers   |        * |
 |  *                      +------------+        * |
 |  * * * * * * * * * * * * * * * * * * * * * * *  |
 |                                                 |
 | Control Flow                          Data Flow |
 |                                                 |
 |                  +--------+                     |
 -------------------| Client |======================
                    +--------+

                      Figure 1
With the help of the sync protocol, the three components can communicate with each other. The control server is responsible for storing all the control data, including authentication information and metadata. Once changes are made to synchronized files, the control server notifies the clients. The other type of data, content data, is stored in the form of chunks on the data storage servers, with no knowledge of its source, its owner or its relationship to other data chunks. As a result, a complete user file is split into small chunks, and those chunks may be stored on several different data storage servers. These two types of servers are separate logical entities and are usually deployed in different locations. Every time the client synchronizes a local file to the Internet, it needs to exchange control data and content data with the two types of servers in different flows.
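The split described above can be sketched in a few lines of Python: the ordered list of chunk hashes is the control data (metadata) sent to the control server, while the chunk bodies themselves are the content data sent to the data storage servers. This is an illustrative model only; the function names, the 4 MB chunk size and the use of SHA-256 are assumptions, not any real service's protocol.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, an assumed fixed chunk size

def make_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split content data into chunks and build the control-data manifest.

    The manifest (ordered list of chunk hashes) goes to the control server;
    the chunk bodies go to the data storage servers, keyed by hash.
    """
    chunks = {}
    manifest = []
    for off in range(0, len(data), chunk_size):
        body = data[off:off + chunk_size]
        digest = hashlib.sha256(body).hexdigest()
        chunks[digest] = body           # -> data storage servers
        manifest.append(digest)         # -> control server (metadata)
    return manifest, chunks

def reassemble(manifest, chunks):
    """A client reconstructs the file by fetching the chunks listed
    in the manifest, in order."""
    return b"".join(chunks[h] for h in manifest)
```

Note that identical chunks collapse to a single stored body while the manifest preserves their order, which is also the basis for the deduplication capability discussed in Section 4.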
Existing popular Internet storage services, including Dropbox, OneDrive and GoogleDrive, use their own proprietary sync protocols to achieve data sync. Using different proprietary protocols is generally considered harmful to the development of Internet services. This section describes the current problems of Internet storage services caused by their sync protocols. We summarize six specific problems from three different aspects: service usability, protocol capabilities and concurrent work ability. As discussed in Section 1, users prefer to use multiple storage services for considerations of performance, reliability and security. Usability across multiple services is still lacking to some extent due to the proprietary nature of the sync protocols; Section 4.1, Section 4.2 and Section 4.3 describe the problems concerning usability. Moreover, previous works and measurements have revealed that most sync protocols lack key service capabilities, or do not configure them well, which significantly degrades network performance, especially in mobile and wireless environments; Section 4.4 and Section 4.5 illustrate the problems of current protocol capabilities. In addition, the unsatisfactory concurrent work ability is specified in Section 4.6.
Popular Internet storage services provide APIs that expose the content management features of the client software to third-party applications. In practice, these APIs take care of synchronizing data with the Internet storage servers in a familiar, filesystem-like way. Behind the scenes, the API synchronizes changes to the server and automatically notifies the client when changes are made on other devices. These APIs can also include further advanced features, e.g. revision or restoration of files, to make the client work better. Different providers offer different APIs to developers, with different styles and features in order to support different platforms (e.g. Windows and Android).
Third-party applications prefer to combine multiple Internet storage services to achieve better performance, reliability and security. However, developers who want to use multiple storage services need to learn the APIs of all the service providers in order to design and implement their own clients. Although there are already some successful third-party clients that support multiple services (e.g. ExpanDrive [ExpanDrive], IFTTT [IFTTT]), it is not easy for developers to learn and apply so many different APIs to develop and maintain such clients.
Synchronizing is one of the most important functions provided by Internet storage services. With this function, files on the Internet can easily be shared and manipulated by different people and groups: anyone who is permitted to read and download a file is also able to modify it and upload new versions to the Internet.
However, this synchronizing function only works well inside a single service. Users of the same Internet storage service can easily share files (i.e. download them) and operate on them in a coordinated manner. Sync among different Internet storage services, by contrast, is simply not available. For example, if a Dropbox user currently wants to collaborate on a file with a Google Drive user, he can only share the file by sending an open HTTP link to it. After clicking on that link, the Google Drive user can read and download the file over HTTP, but cannot modify or update it: the cooperative file is stored on Dropbox servers, and the Google Drive client cannot download/upload it because it has no knowledge of Dropbox's proprietary sync protocol. Different services using different proprietary sync protocols is what causes this unavailability.
The emergence of more and more Internet storage services provides users with a wide range of choices for storing their local files remotely. Like other Internet applications, users are not restricted to only one of these services. In practice, they tend to hold accounts with several Internet storage services and use them simultaneously. One important reason is that users always pursue better functionality. For example, Dropbox is better at file processing, OneDrive offers better interoperability and compatibility with Microsoft Office, while GoogleDrive performs better for mail attachments. To enable all the desired functions and features, a simple way is to register for and use all the desired Internet storage services. Furthermore, people may simply need multiple Internet storage services for larger storage space and higher reliability.
However, using different Internet storage services forces users to install multiple similar client applications. Since almost all commercial Internet storage services have their own proprietary sync protocols and corresponding client applications, installing and running multiple similar clients degrades the user experience and increases the complexity of synchronizing files with the servers of different providers. For instance, users usually suffer duplicate operations in order to upload the same file to their different service accounts.
Data sync is not a simple remote file transfer process; it can implement several capabilities that optimize data storage usage and speed up data transmission. There are five well-known capabilities that Internet storage services can employ to improve sync efficiency and reliability: chunking, bundling, deduplication, delta-encoding and compression. All of these capabilities help to synchronize user data efficiently over the Internet.
However, the investigation in [Benchmarking] shows that different Internet storage services have different capability configurations and implementations, and most existing Internet storage services do not implement all five capabilities in their sync protocols. The lack of these capabilities does affect sync efficiency. Table 1, from [QuickSync], shows the capability implementations of four popular Internet storage services (i.e. Dropbox, GoogleDrive, OneDrive and Seafile) on Windows OS.
 +----------------+-----------+-------------+-----------+-----------+
 | Capabilities   | Dropbox   | GoogleDrive | OneDrive  | Seafile   |
 +----------------+-----------+-------------+-----------+-----------+
 | Chunking       | 4MB       | 8MB         | Variable  | Variable  |
 +----------------+-----------+-------------+-----------+-----------+
 | Bundling       | Yes       | No          | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+
 | Deduplication  | Yes       | No          | No        | Yes       |
 +----------------+-----------+-------------+-----------+-----------+
 | Delta-encoding | Yes       | No          | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+
 | Compression    | Yes       | Yes         | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+

                              Table 1
Measurements and studies in [QuickSync] also reveal that these key capabilities significantly affect sync performance. Most of them should be implemented and well configured to achieve efficient data sync. The remainder of this subsection lists the problems caused by missing or unreasonably configured capabilities.
Chunking is the most widely implemented capability; it simplifies transmission recovery when the sync of a large file is interrupted. Different implementations of chunking have different chunking schemes (i.e. dynamic or static chunking) and chunk sizes. Chunking is closely related to deduplication, since deduplication is performed at chunk granularity. Typically, a smaller chunk size and a dynamic chunking scheme (e.g. Content Defined Chunking) are better at detecting and eliminating redundancy. However, the ability to detect more redundancy does not always translate into better sync efficiency, because it introduces more computation overhead (i.e. finding more redundancy needs more CPU time). An aggressive dynamic chunking scheme (e.g. Content Defined Chunking) performs better in a high-delay (i.e. high-RTT) environment, while a fixed-size scheme performs well under good network conditions. A trade-off between computation time and transmission time needs to be considered to achieve effective chunking. A better chunking strategy may be network-aware, meaning the client should be able to employ the appropriate chunking strategy according to its current network conditions.
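The Content Defined Chunking idea mentioned above can be sketched as follows: instead of cutting at fixed offsets, cut wherever a hash of the most recent input bytes matches a boundary pattern, so that cut points follow content rather than absolute positions. This is a minimal sketch, not a production algorithm; real systems use a Rabin fingerprint and far larger chunk sizes (KBs to MBs), and all parameter values here are illustrative assumptions.

```python
def cdc_chunks(data: bytes, mask: int = 0x3FF,
               min_size: int = 64, max_size: int = 4096):
    """Content-defined chunking with a cheap shift-and-add hash.

    The lower bits of `h` depend only on the last few input bytes, so a
    boundary decision is purely local to the content. min_size/max_size
    bound the chunk-size distribution.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])   # cut here
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # trailing partial chunk
    return chunks
```

The key property, which fixed-size chunking lacks, is that an insertion near the head of a file only disturbs nearby chunks: later boundaries realign to the same content positions, so most chunk hashes survive for deduplication.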
Delta-encoding is an algorithm that finds the differing portion of two files and thereby achieves incremental sync. However, not all Internet storage services implement delta-encoding. One possible reason is that most delta-encoding algorithms work at file granularity, while Internet storage services often split files into chunks for management in order to save storage space and reduce cost. Naively piecing all chunks together to reconstruct the whole file for incremental sync would waste massive intra-cluster bandwidth. Therefore, some Internet storage services, e.g. Dropbox, implement delta-encoding at chunk granularity: delta-encoding is performed between the two chunks of the original and the modified version that share the same offset from the beginning of the file. If a service uses fixed-size chunking, some types of modification, e.g. inserting new data at the head of a file, can leave the two chunks used for delta-encoding with very little similarity. In this circumstance delta-encoding cannot reveal the delta between the original and modified file, so the incremental sync fails. To solve the problem, we need an improved delta-encoding algorithm, with appropriate chunking, that keeps incremental sync available in various scenarios.
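The misalignment failure described above is easy to demonstrate: with fixed-size chunking, inserting a few bytes at the head of a file shifts every chunk boundary, so no chunk at a given index matches its counterpart even though the files are nearly identical. A minimal sketch (chunk size and file contents are illustrative assumptions):

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 1024):
    """Static chunking: cut at fixed absolute offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_hashes(data: bytes, size: int = 1024):
    return [hashlib.sha256(c).hexdigest() for c in fixed_chunks(data, size)]

# Original file vs. a version with 3 bytes inserted at the head.
old = bytes(range(256)) * 16      # 4096 bytes -> 4 chunks of 1024
new = b"hdr" + old                # insertion at offset 0

old_h, new_h = chunk_hashes(old), chunk_hashes(new)

# Every boundary shifted by 3 bytes: chunk-by-chunk comparison at the
# same index finds nothing in common, despite 4096 of 4099 shared bytes.
unchanged = sum(a == b for a, b in zip(old_h, new_h))
```

Here `unchanged` comes out as 0, so chunk-offset-based delta-encoding sees two "completely different" files and incremental sync degenerates to a full upload.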
Small files are more likely to be modified and synchronized frequently. For example, people usually collaborate on a number of small files (e.g. a project's source code typically consists of many small files). In a high-delay environment, synchronizing a large number of small files is not efficient. One reason is that most existing Internet storage services employ a sequential acknowledgement mechanism: the next chunk may only be transmitted after the previous chunk's acknowledgement has been received. The sequential acknowledgement mechanism wastes the limited bandwidth, since the TCP connection sits idle for long periods. Bundling small files together and employing a delayed acknowledgement mechanism can make full use of the limited bandwidth, so that the total sync time and traffic overhead can be decreased significantly.
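A back-of-the-envelope model makes the cost of sequential acknowledgement concrete. Under the simplifying assumptions that each file costs one RTT of idle waiting plus its serialization time, while bundling pays a single RTT for the whole batch, the gap grows linearly with the number of files. The numbers below are illustrative assumptions, not measurements.

```python
def sync_time_sequential(n_files: int, file_size: float,
                         bandwidth: float, rtt: float) -> float:
    """Each file waits for the previous file's acknowledgement:
    one RTT of idle link time per file, plus the transfer itself."""
    return n_files * (rtt + file_size / bandwidth)

def sync_time_bundled(n_files: int, file_size: float,
                      bandwidth: float, rtt: float) -> float:
    """Files bundled into one transfer with delayed acknowledgement:
    a single RTT is paid for the whole batch."""
    return rtt + n_files * file_size / bandwidth

# Assumed scenario: 100 files of 10 KB over a 1 MB/s link with 200 ms RTT.
seq = sync_time_sequential(100, 10_000, 1_000_000, 0.2)  # about 21 s
bun = sync_time_bundled(100, 10_000, 1_000_000, 0.2)     # about 1.2 s
```

In this model the sequential scheme spends 20 of its 21 seconds waiting on acknowledgements, which is exactly the idle-link waste the paragraph above describes.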
The increasing number of mobile terminals introduces the requirement of synchronizing data on any device, via any connectivity, at any time and anywhere. A change made to the data on the desktop should be automatically transferred to the user's mobile phone or other mobile devices. Based on the measurements in [Look_at_Mobile_Cloud], the problem of missing capabilities is more severe for mobile Internet storage services. The root causes and problems are twofold:
First of all, mobile devices have limited storage and computation abilities, so it is really hard to implement all five useful capabilities discussed previously in a mobile client, since their implementation brings extra overhead (Table 2 shows the capability implementations on Android OS). The measurement results in [Look_at_Mobile_Cloud] show that no existing mobile Internet storage service implements all five key capabilities, and only very few of them can be found in any mobile Internet storage client. This explains why most Internet storage services waste limited bandwidth, produce large amounts of useless traffic and suffer long sync times in the mobile environment. How to implement all the desired capabilities with lower storage and computation requirements is a critical problem that needs to be addressed.
 +----------------+-----------+-------------+-----------+-----------+
 | Capabilities   | Dropbox   | GoogleDrive | OneDrive  | Seafile   |
 +----------------+-----------+-------------+-----------+-----------+
 | Chunking       | 4MB       | 260K        | 1MB       | No        |
 +----------------+-----------+-------------+-----------+-----------+
 | Bundling       | No        | No          | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+
 | Deduplication  | Yes       | No          | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+
 | Delta-encoding | No        | No          | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+
 | Compression    | No        | No          | No        | No        |
 +----------------+-----------+-------------+-----------+-----------+

                              Table 2
Secondly, current sync protocols cannot handle network disruptions caused by unstable connections well. For example, some services fail to resume sync when a data transmission is interrupted, or incur too much additional recovery overhead when an exception happens. A well-designed sync protocol that guarantees reliability and efficiency in mobile or wireless networks is expected.
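One way a sync protocol can tolerate disruption is to track, per chunk, which uploads the server has acknowledged, so that after a failure the client resumes from the first unacknowledged chunk instead of restarting the whole transfer. The sketch below illustrates this idea only; the class, its method names and the send() callback contract are all assumptions, not any real service's API.

```python
class ResumableUpload:
    """Sketch of disruption-tolerant chunk upload.

    `manifest` is the ordered list of chunk hashes to upload; `acked`
    records server acknowledgements (a real client would persist it
    across restarts)."""

    def __init__(self, manifest):
        self.manifest = list(manifest)
        self.acked = set()

    def pending(self):
        """Chunks still owed to the server, in manifest order."""
        return [h for h in self.manifest if h not in self.acked]

    def run(self, send) -> bool:
        """Try to upload all pending chunks.

        `send(chunk_hash)` returns True on acknowledgement; it returns
        False or raises ConnectionError on a network disruption. On
        disruption we keep state and report failure so the caller can
        retry later, resuming where we left off."""
        for h in self.pending():
            try:
                if not send(h):
                    return False
            except ConnectionError:
                return False
            self.acked.add(h)
        return True
```

A second call to run() after a disruption only retransmits the unacknowledged tail, which is precisely the "resume sync" behavior the paragraph above finds missing in some services.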
With the popularity of Internet storage services, collaborative work is becoming an important feature of such services. This feature is especially important for a team or an organization, since participants can easily retrieve and edit a target file on the Internet. Currently, this collaborative work ability is still unsatisfactory: some common, frequent operations may lead to redundant file versions. More specifically, parallel updates from different end users may result in a version conflict. If two or more users edit the same file concurrently, it is hard to keep the file updated correctly. To ensure that every participant's modification is considered, the typical approach is to lock the file and let other participants create separate versions of it. To obtain a final version, participants have to negotiate with each other about their modifications (versions) and merge the final version manually. This hurts work efficiency, since people have to spend a lot of time and effort managing redundant versions and merging them into a final version.
A desirable concurrent work ability would be the following: when different people work on the same file, the client automatically creates exclusive local versions for its users, and after they finish and upload to the server, the server automatically merges the different versions into a final version without any human involvement. An even better solution is what [GoogleDocs] does, which provides actual real-time editing: multiple people can edit the same file and see each other's cursors and operations in real time. Such an ability does improve collaborative work, but is really challenging to design into a protocol.
An open, standard sync protocol between client and server can effectively address the problems mentioned above. The sync protocol consists of two types of flow: control flow and data flow. The control flow runs between the client and the control server; it is intended for user authentication, metadata management and active notification of data changes. The data flow runs between the client and the data storage servers and only transmits actual file data (in the form of numerous chunks). The combination of control flow and data flow enables the whole data sync. According to the analysis of the problems above, the key capabilities could be supported as optional features of the sync protocol, and it would be better if the protocol were network-aware. The rest of this section lists the advantages of employing an open, standard sync protocol.
First, with a standard sync protocol, a third-party client supporting multiple Internet storage services is easy to implement, since the APIs provided by different providers become unnecessary or at least simplified. This would attract more people and organizations to develop their own clients (it might even be possible for a user to implement his own). As a result, users would no longer need multiple clients for multiple services, and their user experience would improve. Furthermore, competition in the (third-party) client market would increase, which benefits users: they could choose their clients flexibly, and frequent client updates would bring them more functions and a better user experience.
Another advantage of a standard sync protocol is that sync among different services becomes available, or at least possible. If two different services both employ the standard sync protocol, their users can synchronize files with each other using that protocol (no longer just basic HTTP download). In this way, users of different services can share and perform coordinated operations on their local files.
Using a standard sync protocol also makes it easy to improve Internet storage services. Compared with the existing proprietary formats, a standard sync protocol is fully open and designed by many contributors, and anyone is welcome to revise and improve it. We believe that both users and providers will benefit greatly from such a standard sync protocol.
 Client                 Control Server          Data Storage Server
   |                          |                          |
   |---meta data, auth info-->|                          |
   |<-------start sync--------|                          |
   |     sync preparation     |                          |
   |                          |                          |
   |--------------------store/retrieve------------------>|
   |<--------------------ok/content----------------------|
   |                         ...                         |
   |--------------------store/retrieve------------------>|
   |<--------------------ok/content----------------------|
   |                data transmission                    |
   |                          |                          |
   |---meta data, ver info--->|                          |
   |<-----conclude sync-------|                          |
   |        sync finish       |                          |
   |                          |                          |

                           Figure 2
Figure 2 shows a preliminary, high-level understanding of the sync protocol. The whole sync process can be divided into three stages: sync preparation, data transmission and sync finish. In the first stage, the client exchanges its metadata and authentication information with the control server to initiate a sync process; during this stage, capabilities such as network-aware chunking and deduplication are performed. In the second stage, data transmission, the client sends/retrieves chunks to/from the data storage servers; to speed up the data sync and make it more reliable, capabilities like bundling and delta-encoding can be employed. When the sync finishes (i.e. the sync finish stage), the client sends its metadata again so the control server can check and conclude the sync process, and some version information is also exchanged for version control. From this description we can see that the control flow and the data flow are closely related and cannot work without each other.
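The three stages of Figure 2 can be sketched as a small in-memory simulation: the control server handles metadata, deduplication decisions and sync conclusion, while the storage server only ever sees opaque chunks. All class and method names here are illustrative assumptions, not a defined protocol API.

```python
import hashlib

class ControlServer:
    """Stand-in for the control server in Figure 2."""
    def __init__(self):
        self.known = set()   # chunk hashes the service already stores
        self.files = {}      # path -> (manifest, version)

    def start_sync(self, manifest):
        # Deduplication: only chunks never seen before need transmitting.
        return [h for h in manifest if h not in self.known]

    def conclude_sync(self, path, manifest, version):
        self.known.update(manifest)
        self.files[path] = (manifest, version)

class StorageServer:
    """Stand-in for a data storage server: stores opaque chunks by hash."""
    def __init__(self):
        self.chunks = {}

    def store(self, h, body):
        self.chunks[h] = body

def sync_file(path, data, control, storage, chunk_size=4):
    """Run the three stages of Figure 2 for one file; returns the number
    of chunks actually transmitted."""
    offsets = range(0, len(data), chunk_size)
    bodies = {hashlib.sha256(data[i:i + chunk_size]).hexdigest():
              data[i:i + chunk_size] for i in offsets}
    manifest = [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
                for i in offsets]
    # Stage 1: sync preparation (control flow) -- metadata + dedup check.
    needed = control.start_sync(manifest)
    # Stage 2: data transmission (data flow) -- only missing chunks move.
    for h in needed:
        storage.store(h, bodies[h])
    # Stage 3: sync finish (control flow) -- metadata + version info.
    control.conclude_sync(path, manifest, version=1)
    return len(needed)
```

Syncing a second file with identical content transmits zero chunks, showing how the control-flow deduplication decision in stage 1 shapes the data flow of stage 2.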
WebDAV [RFC4918] provides an alternative way to exchange local data with remote web servers. It can be regarded as a previous IETF effort on file collections, authoring and versioning over HTTP. WebDAV mainly focuses on authoring and versioning of distributed web content: it extends HTTP to enable users to collaboratively edit and manage files on remote servers. WebDAV thus targets distributed work (authoring and versioning), while ISS will focus on data sync. A major potential difference between data sync and distributed authoring/versioning is the frequency of data transmission. In data sync, the client automatically exchanges data with the remote servers whenever anything changes; in practice, every 'save' operation on a file triggers a sync process. Such frequent data transmission causes a large amount of network traffic, which introduces challenges for the design of sync protocols. A possible solution is to make use of the well-known service capabilities and to make the protocol network-aware to some extent. The ISS protocol suite could build on the WebDAV protocol or on basic HTTP.
TBD
The authors would like to thank Barry Leiba, Mark Nottingham, Julian Reschke, Marc Blanchet, Mike Bishop, Haibin Song, Philip Hallam Baker, Michiel de Jong and Ted Lemon for their valuable comments and contributions to this work.