Internet Engineering Task Force                             C. Yang, Ed.
Internet-Draft                         Y. Liu & Y. Wang & S.Y. Pan, Ed.
Intended status: Standards Track     South China University of Technology
Expires: May 29, 2020                                            C. Chen
                                                                  Inspur
                                                                 G. Chen
                                                                    GSTA
                                                                  Y. Wei
                                                                  Huawei
                                                       November 26, 2019
A Massive Data Migration Framework
draft-yangcan-ietf-data-migration-standards-03
This document describes a standardized framework for implementing massive data migration between traditional databases and big-data platforms on the cloud via the Internet, especially for an instance of the Hadoop data architecture. The main goal of the framework is to provide concise and friendly interfaces so that users can more easily and quickly migrate massive data from a relational database to a distributed platform under a variety of requirements, in order to make full use of distributed storage resources and distributed computing capabilities to solve the bottleneck problems of both storage and computing performance in traditional enterprise-level applications. This document covers the fundamental architecture, data element specifications, operations, and interfaces related to massive data migration.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 29, 2020.
Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
With the widespread adoption of cloud computing and big data technology, the scale of data is increasing rapidly, and the requirements for distributed computing are more significant than before. For a long time, the majority of companies have used relational databases to store and manage their data, and a great amount of structured data still exists in legacy systems and accumulates as the business develops. With the daily growth of data size, the storage bottleneck and the performance degradation experienced when analyzing and processing the data have become serious problems that need to be solved in global enterprise-level applications.

A distributed platform, in this context, refers to a software platform that builds data storage, data analysis, and computation on a cluster of multiple hosts. Its core architecture involves distributed storage and distributed computing. In terms of storage, capacity can in theory be expanded indefinitely, and storage can be scaled out dynamically as the data grows. In terms of computing, frameworks such as MapReduce can be used to perform parallel computing on large-scale datasets and improve the efficiency of massive data processing. Therefore, when the data size exceeds the storage capacity of a single system, or the computation exceeds the computing capacity of a stand-alone system, massive data can be migrated to a distributed platform. The resource sharing and collaborative computing provided by a distributed platform can solve large-scale data processing problems well.

This document focuses on putting forward a standard for implementing a big data migration framework through web access via the Internet, and considers how to help users more easily and quickly migrate massive data from a traditional relational database to a cloud platform under a variety of requirements. By using the distributed storage and distributed computing technologies offered by the cloud platform, the framework addresses the storage bottleneck and the low data analyzing and processing performance of relational databases. Because it is accessed through the web, the framework supports an open working state and promotes global applications of data migration.
Note: It is also permissible to implement this framework in a non-web environment.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The following definitions are for terms used in the context of this document.
The main goal of this data migration framework is to help companies migrate their massive data stored in relational databases to cloud platforms through web access. We propose a series of rules and constraints on the implementation of the framework, by which users can conduct massive data migration from a multi-demand perspective.
Note: The cloud platforms mentioned in this document refer to the Hadoop platform by default. All statements about the operations and the environment of the framework assume web access by default.
Figure 1 shows the working diagram of the framework.
   +---------+        +----------------+
   |         |  (1)   |   WebServer    |
   | Browser |------->|                |--------------------+
   |         |        |  +-----------+ |                    |
   +---------+        |  |   DMOW    | |                    |
                      |  +-----------+ |                    |
                      +----------------+                    |
                              |(2)                          |
                              |                             |
   +-------------+    +-----------------------+             |
   |             |(3) |                       |             |
   | Data Source |--->|    Cloud Platform     |             |
   |             |    | +-----------------+   |<------------+
   +-------------+    | | Migration Engine|   |
                      | +-----------------+   |
                      +-----------------------+

                 Figure 1: Reference Architecture
The workflow of the framework is as follows:
This framework MUST support data migration between relational databases and cloud platforms over the web, and MUST meet the following requirements:
Before conducting data migration, the framework MUST support testing the connection to the data sources to be migrated, and then deciding whether to migrate.
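For example, if Apache Sqoop [sqoop] is used as the migration engine (the host name, database name, and credentials below are hypothetical), the connection to a source could be tested before migration as follows:

      # List the tables reachable through the JDBC connection
      sqoop list-tables \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P

      # Or run a trivial query to verify that the source responds
      sqoop eval \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --query "SELECT 1"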
This framework MUST allow users to migrate large amounts of data from a relational database to at least two of the following types of target storage containers:
This framework MUST allow an authorized user to specify the target cloud platform to which the data will be migrated.
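As a sketch, assuming Sqoop as the migration engine, the target cloud platform can be selected by pointing the import at that platform's HDFS name node (the host names, table name, and paths below are hypothetical):

      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table customers \
        --target-dir hdfs://namenode.example.com:8020/user/migration/customers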
This framework SHALL support the migration of large amounts of data from relational databases to one or more data containers for third-party Web applications. The target storage containers of the third-party Web application systems can be:
This framework is required to meet the following requirements:
This framework MUST support the migration of all tables in a relational database to at least two types of target storage containers:
This framework MUST allow users to specify a single table in a relational database and migrate it to at least two types of target storage containers:
This framework MUST allow users to specify multiple tables in a relational database and migrate them to at least two types of target storage containers:
This framework is required to meet the following requirements regarding split-by.
This framework MAY allow the user to specify multiple columns in the data table to slice the data linearly into multiple parallel tasks and then migrate the data to one or more of the following target data containers:
It is OPTIONAL for this framework to support non-linear, intelligent segmentation of data on one or more columns before migrating the data to one or more of the following target data containers:
This framework SHALL allow users to specify query conditions, and then query out the corresponding data records and migrate them.
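For instance, with Sqoop as the migration engine, query conditions can be expressed as a free-form query (the token $CONDITIONS is required by Sqoop so that it can partition the query across parallel tasks; the table and column names are hypothetical):

      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --query 'SELECT id, name, amount FROM orders WHERE amount > 1000 AND $CONDITIONS' \
        --split-by id \
        --target-dir /user/migration/large_orders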
It is OPTIONAL for the framework to allow users to add data redundancy labels and a label communication mechanism, so that redundant data can be detected dynamically during the migration to achieve non-redundant migration.
The specific requirements are as follows:
During the data migration process, the data is not compressed by default. This framework MUST support at least one of the following data compression encoding formats, allowing the user to compress and migrate the data:
This framework SHALL support the migration of appending data to existing datasets in HDFS.
When importing data into HIVE, the framework SHALL support overwriting the original dataset and saving it.
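As an illustration of the compression, append, and HIVE overwrite requirements above, assuming Sqoop as the migration engine (table names and paths are hypothetical):

      # Append new rows to an existing dataset in HDFS, compressing the output
      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table orders \
        --append \
        --target-dir /user/migration/orders \
        --compress \
        --compression-codec org.apache.hadoop.io.compress.GzipCodec

      # Import into HIVE, overwriting the existing dataset
      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table orders \
        --hive-import \
        --hive-table orders \
        --hive-overwrite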
This framework is required to meet the following requirements:
The framework SHOULD support incremental migration of table records in a relational database, and it MUST allow the user to specify a field in the table whose value serves as "last_value" in order to characterize the row-record increment. The framework SHOULD then migrate those records in the table whose field value is greater than the specified "last_value", and afterwards update the "last_value".
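A minimal sketch with Sqoop (the check column and last value below are hypothetical):

      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table orders \
        --incremental append \
        --check-column order_id \
        --last-value 42000 \
        --target-dir /user/migration/orders

Only rows with order_id greater than 42000 are migrated; Sqoop reports the new maximum value at the end of the run, which can be supplied as the "last_value" of the next incremental migration.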
The framework SHALL support real-time synchronous migration of updated data and incremental data from a relational database to one or more of the following target data containers:
This framework MUST support data migration in direct mode, which can increase the data migration rate.
Note: This mode is supported only for MySQL and PostgreSQL.
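For example, with Sqoop and a PostgreSQL source (host, database, and table names hypothetical), direct mode is enabled with a single option:

      sqoop import \
        --connect jdbc:postgresql://db.example.com:5432/sales \
        --username dbuser -P \
        --table customers \
        --direct \
        --target-dir /user/migration/customers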
This framework MUST allow saving the migrated data in at least one of the following data file formats:
This framework MUST allow the user to specify the number of map tasks, so that a corresponding number of map tasks are started to migrate large amounts of data in parallel.
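A sketch combining the split-by, parallelism, and file-format requirements above, assuming Sqoop (the column, table, and path names are hypothetical):

      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table orders \
        --split-by order_id \
        --num-mappers 8 \
        --as-parquetfile \
        --target-dir /user/migration/orders_parquet

Here the table is sliced on the order_id column into eight parallel map tasks and the output is written as Parquet files.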
After the framework has migrated the data in the relational database, it MUST support the visualization of the dataset on the cloud platform.
The framework SHOULD support dynamically showing the migration progress to users in a graphical mode while migrating.
The framework MAY provide automated migration proposals to facilitate the user's estimation of migration workload and costs.
The framework SHALL allow the user to set various migration parameters (such as the number of map tasks, the storage format of data files, the type of data compression, and so on) and the task execution time, and then perform the scheduled offline/online migration tasks.
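One possible way to realize such scheduling, assuming Sqoop as the migration engine, is to save the parameters as a named job and trigger it from an external scheduler such as cron (the job name, parameters, password file, and schedule below are hypothetical):

      # Define a reusable migration job with the desired parameters
      sqoop job --create nightly_orders -- import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser \
        --password-file /user/dbuser/.db-password \
        --table orders \
        --num-mappers 4 \
        --as-avrodatafile \
        --target-dir /user/migration/orders

      # crontab entry: execute the saved job at 02:00 every day
      0 2 * * * sqoop job --exec nightly_orders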
When a task fails, the framework MUST at least support notifying stakeholders in a predefined way.
Figure 2 shows the framework's working diagram of exporting data.
   +---------+        +----------------+
   |         |  (1)   |   WebServer    |
   | Browser |------->|                |--------------------+
   |         |        |  +-----------+ |                    |
   +---------+        |  |   DMOW    | |                    |
                      |  +-----------+ |                    |
                      +----------------+                    |
                              |(2)                          |
                              |                             |
   +-------------+    +-----------------------+             |
   |             |(3) |                       |             |
   | Data Source |<---|    Cloud Platform     |             |
   |             |    | +-----------------+   |<------------+
   +-------------+    | | Migration Engine|   |
                      | +-----------------+   |
                      +-----------------------+

                   Figure 2: Reference Diagram
The workflow of exporting data through the framework is as follows:
The framework MUST at least support exporting data from HDFS to one of the following relational databases:
The framework SHALL support exporting data from HBASE to one of the following relational databases:
The framework SHALL support exporting data from HIVE to one of the following relational databases:
The framework SHALL allow the user to specify a range of keys on the cloud platform and export the elements in the specified range to a relational database.

Exporting into a Subset of Columns
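As an illustration, with Sqoop as the migration engine, a subset of columns can be exported from HDFS into a relational table (the database, table, column names, and export directory are hypothetical):

      sqoop export \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table customers_backup \
        --columns "id,name,email" \
        --export-dir /user/migration/customers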
The framework SHALL support merging data in different directories in HDFS and storing it in a specified directory.
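One way to meet this requirement, assuming Sqoop, is its merge tool, which combines a newer and an older dataset held in different HDFS directories into a single target directory (the directories, key column, and the jar/class produced by a prior "sqoop codegen" run are hypothetical):

      sqoop merge \
        --new-data /user/migration/orders_new \
        --onto /user/migration/orders_old \
        --target-dir /user/migration/orders_merged \
        --merge-key order_id \
        --jar-file orders.jar \
        --class-name orders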
The framework MUST allow the user to specify the separator between fields in the migration process.
The framework MUST allow the user to specify the separator between the record lines after the migration is complete.
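For example, with Sqoop, both separators can be specified on the command line (the tab and newline characters below are illustrative choices; other names are hypothetical):

      sqoop import \
        --connect jdbc:mysql://db.example.com:3306/sales \
        --username dbuser -P \
        --table orders \
        --fields-terminated-by '\t' \
        --lines-terminated-by '\n' \
        --target-dir /user/migration/orders_tsv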
The framework provides the following shells as character interfaces operated through web access.
The framework SHALL support a Linux shell through web access, which allows users to execute basic Linux commands for the configuration management of the migrated data on the web.
The framework SHALL support the hbase shell through web access, which allows users to perform basic operations such as adding, deleting, and modifying the data migrated to HBASE through the web shell.
The framework SHALL support the hive shell through web access, which allows users to perform basic operations such as adding, deleting, and modifying the data migrated to HIVE through the web shell.
The framework SHALL support the Hadoop shell through web access so that users can perform basic Hadoop command operations through the web shell.
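The following commands, typed into the respective web shells, sketch the kinds of basic operations intended above (the table and path names are hypothetical):

      # Linux shell: inspect the local configuration
      ls -l /etc/hadoop/conf

      # Hadoop shell: browse the migrated dataset in HDFS
      hadoop fs -ls /user/migration/orders
      hadoop fs -du -h /user/migration/orders

      # hive shell: query the migrated table
      hive -e 'SELECT COUNT(*) FROM orders'

      # hbase shell: scan a few rows of the migrated table
      echo "scan 'orders', {LIMIT => 5}" | hbase shell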
The framework SHALL support spark shell through web access and provide an interactive way to analyze and process the data in the cloud platform.
In spark web shell, the framework SHALL support at least one of the following programming languages:
The framework SHOULD support securing the data migration process. During data migration, it should support encrypting the data before transmission and then decrypting it for storage in the target after the transfer is complete. At the same time, it must support authentication when obtaining data from the migration source, and it shall support the verification of identity and permissions when accessing the target platform.
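As a minimal sketch of transport protection on the source side, assuming Sqoop and a MySQL JDBC driver, the JDBC URL can require TLS, and the password can be supplied from a protected file rather than on the command line (host, database, and file names are hypothetical):

      sqoop import \
        --connect "jdbc:mysql://db.example.com:3306/sales?useSSL=true&requireSSL=true" \
        --username dbuser \
        --password-file /user/dbuser/.db-password \
        --table customers \
        --target-dir /user/migration/customers

Authentication towards the target platform (for example, Kerberos on a Hadoop cluster) is handled by the cluster's own security configuration and is not shown here.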
This memo includes no request to IANA.
[RFC2026]      Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, DOI 10.17487/RFC2026, October 1996.
[RFC2119]      Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC2578]      McCloghrie, K., Perkins, D., and J. Schoenwaelder, "Structure of Management Information Version 2 (SMIv2)", STD 58, RFC 2578, DOI 10.17487/RFC2578, April 1999.
[RFC2629]      Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629, DOI 10.17487/RFC2629, June 1999.
[RFC4710]      Siddiqui, A., Romascanu, D., and E. Golovinsky, "Real-time Application Quality-of-Service Monitoring (RAQMON) Framework", RFC 4710, DOI 10.17487/RFC4710, October 2006.
[RFC5694]      Camarillo, G. and IAB, "Peer-to-Peer (P2P) Architecture: Definition, Taxonomies, Examples, and Applicability", RFC 5694, DOI 10.17487/RFC5694, November 2009.
[hadoop]       The Apache Software Foundation, "http://hadoop.apache.org/"
[hbase]        The Apache Software Foundation, "http://hbase.apache.org/"
[hive]         The Apache Software Foundation, "http://hive.apache.org/"
[idguidelines] IETF Internet Drafts editor, "http://www.ietf.org/ietf/1id-guidelines.txt"
[idnits]       IETF Internet Drafts editor, "http://www.ietf.org/ID-Checklist.html"
[ietf]         IETF Tools Team, "http://tools.ietf.org"
[ops]          the IETF OPS Area, "http://www.ops.ietf.org"
[spark]        The Apache Software Foundation, "http://spark.apache.org/"
[sqoop]        The Apache Software Foundation, "http://sqoop.apache.org/"
[xml2rfc]      XML2RFC tools and documentation, "http://xml.resource.org"