Network Working Group | Y. Cui |
Internet-Draft | L. Sun |
Intended status: Informational | Tsinghua University |
Expires: February 21, 2016 | August 20, 2015 |
Internet Storage Sync: Problem Statement
draft-cui-iss-problem-00
Internet storage services have become increasingly popular. They attract a huge number of users and produce a significant share of Internet traffic. However, most existing Internet storage services use proprietary sync protocols to achieve data synchronization, and almost all of these protocols have been shown to be inefficient and to leave room for improvement. This document outlines the problems caused by inefficient proprietary sync protocols and shows the demand for an efficient, standard sync protocol.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 21, 2016.
Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Internet storage services provide a way for users to synchronize local files or folders with remote servers. They enable users to back up and share their local data on the Internet and to access, retrieve and modify their synchronized data from multiple terminals. In recent years, the growing popularity of Internet storage services has attracted more users and more providers (e.g. Google, Microsoft and Amazon) offering cheaper and larger storage space. Dropbox, typically considered the market leader, has announced that it has more than 400 million registered users. It is therefore not surprising that Internet storage services account for a significant share of Internet traffic, and this share is expected to keep growing.
Existing Internet storage services employ data synchronization (sync) to upload local files to, and retrieve them from, the remote servers. A sync protocol between client and server is required to achieve that. Almost all existing Internet storage services use their own proprietary sync protocols. This variety of proprietary sync protocols impedes the development of Internet storage services, since it degrades the user experience when users want to use multiple services or share local files with users of other services. Furthermore, proprietary protocols increase the complexity of developing a new Internet storage service for newcomers to this market.
Previous work shows that most sync protocols employed by existing Internet storage services are inefficient: they waste limited bandwidth and introduce extra traffic. Such inefficiency is more of a problem in mobile and wireless environments. For example, if wireless connectivity is interrupted while a user is uploading pictures from a mobile phone, the synchronization fails but has still consumed a lot of traffic and battery. The unsatisfactory performance caused by immature sync protocols and poor system designs has become a critical problem in the development of Internet storage services.
To address the problems mentioned above, an open and standard storage sync protocol is required. In addition, this standard sync protocol is expected to be more efficient: it should accelerate the sync process, reduce unnecessary traffic and perform better in mobile and wireless environments.
This document outlines the problems that arise in existing Internet storage services with inefficient proprietary sync protocols. Section 2 lists the terminology and related concepts of Internet storage services. Section 3 introduces the architecture of existing Internet storage services. Section 4 describes the main problems and issues that need to be considered. Section 5 explains the advantages of using an open and standard sync protocol.
Data synchronization (sync): The most important operation of Internet storage services; it is more than remote file transfer. It makes it possible for the client to automatically propagate changes to the files stored on the remote servers. Changes to a local file are notified to the client promptly.
Client: An application which is installed at the user side (i.e. on multiple terminals). It enables users to access and experience Internet storage service.
Control server: The entity that takes the responsibility of authenticating users, managing metadata information and also notifying changes to the client. It stores authentication and metadata information of users.
Data storage server: The entity that stores the synchronized files of users.
Control data: The control information exchanged with the control server to carry out the data sync process. Typical control data includes metadata (e.g. hashes for chunks) and authentication information.
Content data: The original data of a user's local file, often in the form of small chunks.
Sync protocol: A communication protocol between client and remote servers to achieve data synchronization. It contains control flow and data flow. Sync protocols are always built on HTTPS/HTTP.
Sync efficiency: A performance metric that indicates how fast the changes can be synchronized to the Internet with the lowest traffic overhead.
Useful capabilities to improve sync efficiency: chunking, bundling, deduplication, delta-encoding and compression, which are discussed in Section 4.
The architecture of most Internet storage services is composed of three major components: client, control server and data storage server. The whole architecture is shown in Figure 1.
        * * * * * * * * * * * * * * * * * * * * * *
        *                INTERNET                 *
        *   +------------+       +------------+   *
   ---------|  Control   |       |Data storage|==========
   |    *   |   server   |       |  servers   |   *     |
   |    *   +------------+       +------------+   *     |
   |    * * * * * * * * * * * * * * * * * * * * * *     |
   |                                                    |
   | Control Flow                             Data Flow |
   |                                                    |
   |                    +--------+                      |
   ---------------------| Client |=======================
                        +--------+

                         Figure 1
With the help of the sync protocol, the three components can communicate with each other. The control server is responsible for storing all the control data, including authentication information and metadata. Once changes are made to synchronized files, the control server notifies the clients. The other type of data, content data, is stored in the form of chunks on the data storage servers, which have no knowledge of the sources, the users or the relationships among data chunks. That is to say, one user file may be split and stored on several different data storage servers. The two kinds of servers are separate logical entities and are usually deployed in different locations. Every time the client synchronizes a local file to the Internet, it needs to exchange control data with the control server and content data with the data storage servers.
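The split between control data and content data described above can be illustrated with a small in-memory sketch. All names used here (ControlServer, DataStorageServer, sync_file), the 4-byte chunk size, the use of SHA-256 as the chunk hash and the round-robin placement are assumptions made for illustration only; they do not describe any particular service.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


class ControlServer:
    """Holds control data only: per-file ordered lists of chunk hashes."""

    def __init__(self):
        self.metadata = {}  # file name -> ordered list of chunk hashes

    def commit(self, name, chunk_hashes):
        self.metadata[name] = list(chunk_hashes)


class DataStorageServer:
    """Holds content data only: opaque chunks keyed by their hash."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes

    def put(self, chunk_hash, chunk):
        self.chunks[chunk_hash] = chunk


def sync_file(name, data, control, storage_servers):
    """Upload one file: chunks follow the data flow, hashes the control flow."""
    hashes = []
    for index, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        # Data flow: spread the chunks over the storage servers (round-robin).
        storage_servers[index % len(storage_servers)].put(digest, chunk)
        hashes.append(digest)
    # Control flow: only metadata reaches the control server.
    control.commit(name, hashes)
    return hashes
```

In this sketch a data storage server sees only hashes and bytes, never file names or users, which mirrors the statement that chunks are stored with no knowledge of their source.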
Existing popular Internet storage services, including Dropbox, OneDrive and GoogleDrive, use their own proprietary sync protocols to achieve data synchronization. The use of different proprietary protocols is generally considered detrimental to the development of Internet services. Moreover, previous work and measurements have revealed that the sync efficiency of existing Internet storage services is poor. This section describes current problems for Internet storage services caused by their sync protocols. We summarize five specific problems that we hope the IETF community will address.
Popular Internet storage services provide APIs to encourage more people to develop third party clients. The APIs allow user programs to access the service provider's servers and then synchronize local files with those servers. These APIs can also include further advanced features or functions to make the client work better. Different providers offer different APIs to developers, and these APIs commonly differ in style and features. Typically, a service provider needs to provide different sets of APIs for different platforms (e.g. Windows or Android) and update them frequently.
As for the developers, they need to learn the provided APIs in order to design and implement their own clients. Supporting multiple Internet storage services is an obvious advantage for a third party client, and some successful third party clients already do so (e.g. ExpanDrive [ExpanDrive], IFTTT [IFTTT]). However, it is not easy for developers to learn and apply so many different APIs in order to develop and maintain their third party clients.
In summary, both providers and developers suffer to some extent from the complexity of supporting these APIs.
Sharing is one of the most important functions provided by Internet storage services. With this function, files on the Internet can easily be synchronized and manipulated by different people and groups. Anyone who is permitted to read and download a file is able to modify it and upload new versions of it to the Internet.
However, this sharing function works well only within a single service. That is to say, users of the same Internet storage service can easily share files and coordinate operations on them. Sharing across different Internet storage services is incomplete, since sync among different services is not available: different services use different proprietary sync protocols. For example, if the shared files are stored on a Dropbox server, a GoogleDrive client cannot retrieve or upload them, since it does not implement Dropbox's sync protocol, and it cannot use GoogleDrive's own sync protocol to retrieve or upload files on a Dropbox server.
Currently, if a Dropbox user wishes to share a file with a GoogleDrive user, he has to fall back on basic HTTP connections. The Dropbox user sends an HTTP link to the file to the GoogleDrive user. After clicking on that link, the GoogleDrive user can download the file over HTTP. However, the only thing the GoogleDrive user can do with the shared file is read and download it. He cannot modify and update the shared file, since Dropbox and GoogleDrive use two different proprietary sync protocols.
The existing sharing function among different services is incomplete (i.e. only download is available) and falls far short of users' expectations. To achieve a complete and useful sharing function, sync among different services should be available.
The emergence of more and more Internet storage services provides users with a wide range of choices for storing their local files remotely. As with other Internet applications, users are not restricted to using only one of those services. In fact, they tend to have accounts with several Internet storage services and use them simultaneously. One important reason is that users pursue better functionality. For example, Dropbox is better at file processing, OneDrive offers better interoperability and compatibility with Microsoft Office, and GoogleDrive handles mail attachments better. A simple way to obtain all the desired functions and features is to register for and use all the desired Internet storage services. Furthermore, people may simply need multiple Internet storage services for larger storage space and higher reliability.
However, using different Internet storage services means that a user must run multiple similar client applications, since almost all commercial Internet storage services have their own proprietary sync protocols and corresponding client applications. Installing and running multiple client applications degrades the user experience and increases the complexity of syncing files with different providers' servers. For instance, users have to repeat the same operation in order to upload the same file to each of their service accounts.
Data synchronization is not a simple remote file transfer process; it can implement several capabilities to optimize storage usage and speed up data transmission. There are five well-known capabilities that Internet storage services employ to improve sync efficiency: chunking, bundling, deduplication, delta-encoding and compression.
However, the investigation in [Benchmarking] shows that different Internet storage services have different capability configurations and implementations, and most existing Internet storage services do not implement all five capabilities in their sync processes. The lack of such capabilities does affect sync efficiency. For example, when a user wishes to synchronize multiple small files to the Internet, bundling is a useful capability for reducing the sync time. If bundling is not implemented, the user suffers from the TCP slow start effect, since a new connection is opened for each small file. Bundling small files together effectively reduces the number of TCP connections, so that the overall sync time and traffic overhead decrease significantly. Further measurement details and conclusions for the other capabilities can be found in [Benchmarking]. Table 1 shows the capabilities implemented by four popular Internet storage services (i.e. Dropbox, GoogleDrive, OneDrive and Seafile) on Windows OS.
+----------------+-------------+-------------+-------------+-------------+
| Capabilities   | Dropbox     | GoogleDrive | OneDrive    | Seafile     |
+----------------+-------------+-------------+-------------+-------------+
| Chunking       | 4MB         | 8MB         | Variable    | Variable    |
+----------------+-------------+-------------+-------------+-------------+
| Bundling       | Yes         | No          | No          | No          |
+----------------+-------------+-------------+-------------+-------------+
| Deduplication  | Yes         | No          | No          | Yes         |
+----------------+-------------+-------------+-------------+-------------+
| Delta-encoding | Yes         | No          | No          | No          |
+----------------+-------------+-------------+-------------+-------------+
| Compression    | Yes         | Yes         | No          | No          |
+----------------+-------------+-------------+-------------+-------------+

                                 Table 1
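As a rough illustration of the bundling capability listed in Table 1, the following sketch groups small files into bundles so that each bundle, rather than each file, would need its own transfer. The function name bundle_files and the size limit are hypothetical; how real services form bundles is not specified in this document.

```python
def bundle_files(files, max_bundle_bytes):
    """Group (name, size) pairs into bundles of at most max_bundle_bytes.

    Files are taken in order. A file larger than the limit ends up in a
    bundle of its own.
    """
    bundles, current, current_size = [], [], 0
    for name, size in files:
        # Close the current bundle if adding this file would overflow it.
        if current and current_size + size > max_bundle_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


# 100 files of 10,000 bytes each, bundled with a 250,000-byte limit.
small_files = [("file%03d" % i, 10_000) for i in range(100)]
bundles = bundle_files(small_files, 250_000)
print(len(small_files), "files ->", len(bundles), "transfers")
```

Under these assumed numbers, 100 separate transfers collapse into 4, which is the effect on connection count that the bundling discussion above relies on.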
Measurements and studies in [QuickSync] reveal that the sync efficiency of current Internet storage services still has plenty of room for improvement, since the services do not implement the key capabilities and the sync protocol correctly. The remainder of this subsection lists a few specific problems.
Chunking is the most widely implemented capability; it simplifies transmission recovery when the synchronization of a large file is interrupted. Different implementations of chunking use different chunking schemes (i.e. dynamic chunking or static chunking) and chunk sizes. Typically, a smaller chunk size and a dynamic chunking scheme (e.g. Content Defined Chunking) are better at detecting and eliminating redundancy. However, the ability to detect more redundancy does not always mean better sync efficiency, since it introduces more computation overhead. A trade-off between computation time and transmission time needs to be considered to achieve effective chunking. A better chunking strategy may be network-aware, meaning that the sync process selects an appropriate chunking strategy according to the current network conditions.
Delta-encoding is an algorithm for achieving incremental sync, in which only the modified data is transmitted. It is hard to implement: among existing commercial Internet storage services, only Dropbox supports this capability. However, measurement results from [QuickSync] show that incremental sync is not available in all cases. For some typical sync workloads, incremental sync results in sync traffic 10 times larger than the size of the modification. An improved delta-encoding algorithm is needed that makes incremental sync available in a wide range of scenarios.
An application-layer acknowledgement mechanism is another critical feature that has an impact on sync time and efficiency. Most existing Internet storage services employ a sequential acknowledgement mechanism in which the next chunk may be transmitted only after the acknowledgement of the previous chunk has been received. As a result, users usually suffer from high sync latency when synchronizing many small files in a high RTT environment. A delayed acknowledgement mechanism enables the client to send and pipeline chunks without waiting for previous acknowledgements, which markedly reduces the sync time.
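The difference between static and dynamic chunking can be made concrete with a minimal sketch. The content-defined variant below cuts a chunk wherever a checksum of the last few bytes satisfies a fixed condition, so boundaries depend on content rather than on byte offsets. The window size, the divisor and the use of CRC-32 recomputed per position (instead of a true rolling hash, and without minimum or maximum chunk sizes) are simplifying assumptions; this is not the scheme used by any service named in this document.

```python
import zlib


def static_chunks(data, size):
    """Static chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, divisor=64):
    """Dynamic chunking: cut after position `end` when the checksum of the
    preceding `window` bytes is a multiple of `divisor`.

    The expected distance between cut points is roughly `divisor` bytes.
    Recomputing CRC-32 at every position is slow but keeps the sketch short.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks
```

If one byte is inserted at the front of a file, every static chunk shifts and changes, whereas the content-defined cut points move with the data, so all chunks after the first are produced again unchanged and can be recognized as redundant. The per-byte checksum work is an example of the extra computation overhead mentioned above.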
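To illustrate how an incremental sync can end up transmitting far more than the modified bytes, the following sketch implements a deliberately naive delta scheme: the file is cut into fixed-size blocks and only the blocks whose hash the server does not already know are uploaded. The function names, the 64-byte block size and the use of SHA-256 are assumptions for illustration; the sketch is not Dropbox's algorithm and does not reproduce the workloads measured in [QuickSync].

```python
import hashlib


def block_hashes(data, block_size):
    """Hashes of all fixed-size blocks of `data` (what the server knows)."""
    return {hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)}


def delta_upload(old, new, block_size):
    """Return the blocks of `new` whose hash does not occur in `old`."""
    known = block_hashes(old, block_size)
    missing = []
    for i in range(0, len(new), block_size):
        block = new[i:i + block_size]
        if hashlib.sha256(block).digest() not in known:
            missing.append(block)
    return missing


old = bytes(range(256)) * 4            # a 1024-byte file

edited = bytearray(old)
edited[100] ^= 0xFF                    # overwrite one byte in place
in_place = delta_upload(old, bytes(edited), 64)

inserted = delta_upload(old, b"\x00" + old, 64)   # insert one byte at the front

print(sum(len(b) for b in in_place))   # bytes uploaded for the in-place edit
print(sum(len(b) for b in inserted))   # bytes uploaded for the insertion
```

With this naive scheme the in-place edit uploads a single 64-byte block, while the one-byte insertion shifts every block boundary and causes the whole file to be uploaded again.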
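The effect of the acknowledgement mechanism on sync time can be estimated with a simple model. The sketch below assumes a fixed bandwidth and RTT, ignores TCP dynamics and server processing time, and assumes that a pipelined sender waits one RTT in total for the final acknowledgement; the function names and the example numbers are illustrative assumptions.

```python
def sequential_sync_time(n_chunks, chunk_bytes, bandwidth_bps, rtt_s):
    """Each chunk is sent only after the previous chunk has been acknowledged,
    so every chunk pays its transmission time plus one RTT."""
    per_chunk = chunk_bytes * 8 / bandwidth_bps
    return n_chunks * (per_chunk + rtt_s)


def pipelined_sync_time(n_chunks, chunk_bytes, bandwidth_bps, rtt_s):
    """Chunks are sent back to back; only the last acknowledgement is
    waited for, so the RTT is paid once."""
    per_chunk = chunk_bytes * 8 / bandwidth_bps
    return n_chunks * per_chunk + rtt_s


# 100 chunks of 10,000 bytes over an 8 Mbit/s link with a 200 ms RTT.
seq = sequential_sync_time(100, 10_000, 8_000_000, 0.2)
pipe = pipelined_sync_time(100, 10_000, 8_000_000, 0.2)
print("sequential: %.1f s, pipelined: %.1f s" % (seq, pipe))
```

Under these assumed numbers the sequential scheme takes about 21 seconds and the pipelined scheme about 1.2 seconds; the gap grows with the RTT and with the number of small chunks.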
The increasing number of mobile terminals introduces the requirement of synchronizing data on any device via any connectivity at any time and anywhere. A change made to the data on the desktop must be automatically transferred to the user's mobile phone or other mobile devices. According to the measurements in [Look_at_Mobile_Cloud], current mobile Internet storage services do not achieve satisfactory sync efficiency. The root causes are twofold:
First of all, mobile devices have limited storage and computation ability, so it is hard to implement all five useful capabilities discussed previously on a mobile client (Table 2 shows the capabilities implemented on Android OS). The measurement results from [Look_at_Mobile_Cloud] show that no existing mobile Internet storage service implements all five key capabilities. In fact, only very few of them are found in a mobile Internet storage client. That explains why most Internet storage services waste limited bandwidth, produce a large amount of useless traffic and suffer long sync times in the mobile environment. How to implement all the desired capabilities with lower storage and computation requirements is a critical problem that needs to be addressed.
+----------------+-------------+-------------+-------------+-------------+
| Capabilities   | Dropbox     | GoogleDrive | OneDrive    | Seafile     |
+----------------+-------------+-------------+-------------+-------------+
| Chunking       | 4MB         | 260K        | 1MB         | No          |
+----------------+-------------+-------------+-------------+-------------+
| Bundling       | No          | No          | No          | No          |
+----------------+-------------+-------------+-------------+-------------+
| Deduplication  | Yes         | No          | No          | No          |
+----------------+-------------+-------------+-------------+-------------+
| Delta-encoding | No          | No          | No          | No          |
+----------------+-------------+-------------+-------------+-------------+
| Compression    | No          | No          | No          | No          |
+----------------+-------------+-------------+-------------+-------------+

                                 Table 2
Secondly, wireless connectivity is not very stable due to the nature of wireless signals. Its limited bandwidth, higher packet loss and other drawbacks make incremental sync all the more necessary. Full-file sync is not a wise choice in wireless conditions, since users may suffer frequent sync failures and heavy traffic. [Look_at_Mobile_Cloud] points out two challenges that make incremental sync difficult to implement in a practical mobile Internet storage service. The first is that many existing Internet storage services are built on top of RESTful infrastructure, which means that data can only be accessed at the file level. The second is that most delta-encoding algorithms work at file granularity. Both conflict with the architecture of Internet storage services, which splits a file into small chunks and distributes them across different servers.
An open and standard sync protocol between client and server can effectively address the problems mentioned above. The sync protocol consists of two types of flows: control flow and data flow. The control flow runs between the client and the control server. It is intended for user authentication, metadata management and the active notification of data changes. The data flow runs between the client and the data storage servers and is used only for transmitting the actual file data (in the form of numerous chunks). Together, the control flow and the data flow enable the whole data synchronization. According to the analysis of the problems above, the key capabilities should be supported as options in the sync protocol, and it would be better if the protocol were network-aware. The rest of this section lists the advantages of employing an open and standard sync protocol.
First, with a standard sync protocol, a third party client that supports multiple Internet storage services is easy to implement, since the APIs provided by different providers would become unnecessary or at least simpler. This would attract more people and organizations to develop and implement their own clients (in some cases it may even be possible for users to implement their own clients). As a result, users no longer need multiple clients for multiple services, and their experience improves. Furthermore, competition in the (third party) client market increases, which benefits users: they can choose their clients flexibly, and frequent client updates bring them better features and functions.
Another advantage of a standard sync protocol is that sync among different services becomes available, or at least possible to achieve. If two different services both employ the standard sync protocol, their users can share files with each other using that protocol (instead of basic HTTP). That is to say, a user can access, retrieve, modify or upload the files of users of another service.
Using a standard sync protocol also makes it easier to improve Internet storage services. Compared with the existing proprietary formats, a standard sync protocol is fully open and designed by many contributors. People are welcome to revise and improve the standard protocol. We believe that both users and providers will benefit greatly from such a standard sync protocol.
TBD
The authors would like to thank Barry Leiba, Mark Nottingham, Julian Reschke, Marc Blenchet, Mike Bishop, Haibing Song, Philip Hallam Baker, Michiel de Jong and Zeqi Lai for their valuable comments and contributions to this work.
[Benchmarking] | Drago, I., Bocchi, E., Mellia, M., Slatman, H. and A. Pras, "Benchmarking Personal Cloud Storage", IMC, 2013. |
[ExpanDrive] | "ExpanDrive". |
[IFTTT] | "IFTTT". |
[Inside_Dropbox] | Drago, I., Mellia, M., Munafo, M., Sperotto, A., Sadre, R. and A. Pras, "Inside Dropbox: Understanding Personal Cloud Storage Services", IMC, 2012. |
[Look_at_Mobile_Cloud] | Cui, Y., Lai, Z. and N. Dai, "A First Look at Mobile Cloud Storage Services: Architecture, Experimentation and Challenge", IEEE Network, 2015. |
[QuickSync] | Cui, Y., Lai, Z., Wang, X., Dai, N. and C. Miao, "QuickSync: Improving Synchronization Efficiency for Mobile Cloud Storage Services", MOBICOM, 2015. |