Internet DRAFT - draft-jiang-nmlrg-traffic-machine-learning
draft-jiang-nmlrg-traffic-machine-learning
Network Machine Learning Research Group S. Jiang, Ed.
Internet-Draft B. Liu
Intended status: Informational Huawei Technologies Co., Ltd
Expires: December 5, 2016 P. Demestichas
University of Piraeus
J. Francois
Inria
G. M. Moura
SIDN Labs
P. Barlet
Network Polygraph
June 3, 2016
Use Cases of Applying Machine Learning Mechanism with Network Traffic
draft-jiang-nmlrg-traffic-machine-learning-00
Abstract
This document introduces a set of use cases in which machine learning
technologies are applied to network traffic relevant activities,
including machine learning based traffic classification, traffic
management, etc.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 5, 2016.
Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
Jiang, et al. Expires December 5, 2016 [Page 1]
Internet-Draft Network Machine Learning June 2016
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Methodology of Learning from Traffic . . . . . . . . . . . . 4
3.1. Data of the Network Traffic . . . . . . . . . . . . . . . 4
3.2. Data Source and Storage . . . . . . . . . . . . . . . . . 5
3.3. Architecture Considerations . . . . . . . . . . . . . . . 5
3.4. Closed Control Loop . . . . . . . . . . . . . . . . . . . 6
4. Use Cases Study of Applying Machine Learning in Network . . . 6
4.1. HTTPS Traffic Classification . . . . . . . . . . . . . . 6
4.2. Malicious Domains: Automatic Detection with DNS Traffic
Analysis . . . . . . . . . . . . . . . . . . . . . . . . 9
4.3. Machine-learning based Policy Derivation and Evaluation
in Broadband Networks . . . . . . . . . . . . . . . . . . 10
4.4. Traffic Anomaly Detection in the Router . . . . . . . . . 11
4.5. Applications of Machine Learning to Flow Monitoring . . . 12
5. Security Considerations . . . . . . . . . . . . . . . . . . . 15
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 15
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 15
8. Change log [RFC Editor: Please remove] . . . . . . . . . . . 16
9. Informative References . . . . . . . . . . . . . . . . . . . 16
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 17
1. Introduction
Machine learning technology has been successful in solving
complicated issues. It helps to make predictions or decisions based
on large datasets. It could also dynamically adapt to varying
situations and response to real-time issues. Therefore, more and
more research starts on applying machine learning in the network
area.
Among many aspects of networks, the network traffic is one of the
most complicated managed objectives. Its volume is rapidly growing
along with the Internet explosion. It is always dynamically
changing. Most network traffic flows only last a few minutes, or
even shorter. And the user contents within traffic is becoming more
diverse due to the development of various network services, and
increasing use of encryption. Consequently, it is more and more
Jiang, et al. Expires December 5, 2016 [Page 2]
Internet-Draft Network Machine Learning June 2016
challenging for administrators to get aware of the network's running
status and efficiently manage the network traffic flows. Although
more and more data regarding network traffics are generated,
traditional mechanisms based on pre-designed network traffic patterns
become less and less efficient.
It is natural to utilize powerful machine learning technology to
analyze the large mount of data regarding network traffic, to
understand the network's status, such as performance, failures,
security, etc. It is a big advantage that machines can measure and
analyse the network traffic, then report the results and predictions
to humans for further decision. The machines could handle vast
amounts of data which is almost impossible for humans to deal with,
in close to real time. Even more, if the speed and accuracy of the
prediction is high enough, it is possible that the subsequent action
based on the prediction result could form a closed control loop to
achieve autonomic management. However, the maturity of latter might
be far in the future. Today, the traditional control programs still
look more reliable than machine learning based control mechanisms.
This document firstly analyzes the data of the network traffic from
various perspectives; and also discusses several important practical
considerations, including the training data source, data storage and
the learning system architecture. It then introduce a set of use
cases, which have been shown to work well although there is large
scope for improvements, including ML-based traffic classification,
traffic management, interface failure prediction, etc.
Editor notice: this document is in the primary stage. It collects
the use cases presented in the proposed Network Machine Learning
Research Group (NMLRG) session in IETF95 meeting.
2. Terminology
The terminology defined in this document.
Machine Learning A computational mechanism that analyzes and learns
from data input, either historic data or real-time feedback data,
following a set of designed features and algorithms. It can be
used to make analysis, predictions or decisions, rather than
following strictly static program instructions.
Network Traffic The amount of data moving across a network at a
given point of time. They are mostly encapsulated in network
packets.
Traffic Flow A sequence of packets from a source computer to a
destination [RFC6437]. It is the unit of network traffic.
Jiang, et al. Expires December 5, 2016 [Page 3]
Internet-Draft Network Machine Learning June 2016
Feature (machine learning) In machine learning and pattern
recognition, a feature is an individual measurable property of a
phenomenon being observed. Choosing informative, discriminating
and independent features is a crucial step for effective
algorithms in pattern recognition, classification and regression.
Algorithm (machine learning) Machine learning algorithms operate by
building a model from example inputs in order to make data-driven
predictions or decisions expressed as outputs, rather than
following strictly static program instructions. A incomplete list
of machine learning algorithms includes supervised learning,
unsupervised learning, semi-supervised learning, reinforcement
learning, deep learning, etc.
3. Methodology of Learning from Traffic
3.1. Data of the Network Traffic
There is plenty of valuable data related to the network traffic.
These data are raw features in learning process. Following is a
simple classification of network traffic data.
Measurable properties There are many measurable properties of
network traffic, such as latency, number of packets, duration,
etc. These properties are also very essential features,
especially for use cases relevant to performance, QoS (Quality of
Service), etc.
Data within communication protocols The user contents are
encapsulated in layered communication protocols. Many information
are contained within the protocol headers, for example the source
and destination IP addresses in the IP header, the port numbers in
the TCP/UDP header, etc. Transport layer protocols are often
related to the type of applications, such as FTP (File Transfer
Protocol) for file transfer, HTTP (Hyper Text Transfer Protocol)
for web, etc; and many application-relevant data are embedded
within these protocols. These could also be essential data for
classification or application-oriented analysis. However, some
traffic will not provide transport or application information, due
to unknown protocols or encryption.
User content User contents are the payload of packets, which might
be obtained by DPI (Deep Packet Inspection) within the transit
network if the packets are unencrypted, or they could be analyzed
by the source or destination nodes.
Data in network signaling protocols Traffic flows are managed or
indirectly influenced by various network signaling protocols. For
Jiang, et al. Expires December 5, 2016 [Page 4]
Internet-Draft Network Machine Learning June 2016
example, the routing protocols determine the next hop of a
specific network traffic flow, or even the traffic path (by some
sophisticated routing protocol such as MPLS-TE (Multi-Protocol
Label Switching - Traffic Engineering), segment routing, etc.);
the P2P (Peer to Peer) protocol can even decide the destination of
a specific content traffic. They are relevant and are potential
features for traffic analysis. Furthermore, the traffic of these
signaling protocols themselves may also be learning objectives.
3.2. Data Source and Storage
Within networks, forwarding devices such as routers, switches,
firewalls, etc., are the entities that directly handle the network
traffic. Thus, they could collect network traffic data, such as
measurable properties, protocol information, etc. Source nodes or
destination nodes, particularly servers, could also be the source of
network traffic data. They could either report the collected data to
a central repository for storage and learning, or collect and store
the data by themselves for local learning. This depends on the
learning architecture, which is discussed in the following section.
3.3. Architecture Considerations
Global learning vs. local learning
* Global learning refers to the tasks that are mostly network-
level, so that they need to be done in a global viewpoint. In
this case, the learning entity is normally centralized and is
different from the data source entities.
* Local learning is more applicable to the tasks that are only
relevant to one or a limited group of devices, and they could
be done directly within that one node or that limited group of
nodes. In this case of grouped nodes, the data may also need
to be transited from the data source entity to learning entity.
Offline & online learning
* Co-located mode: training (offline, based on historic data) and
prediction (online, based on real-time data) are both done
within the same entity. The entity could be a central
repository or a specific node.
* De-coupled mode: training is done in the central repository,
and prediction is made by the routers/switches/firewalls or
other devices that directly process the network traffic.
Jiang, et al. Expires December 5, 2016 [Page 5]
Internet-Draft Network Machine Learning June 2016
Central learning & distributed learning Central learning means the
learning process is done at a single entity, which is either a
central repository or a node. Distributed learning refer to
ensemble learning that multiple entities do the learning
simultaneously and ensemble the results together to sort out a
final results. Since network devices are naturally distributed,
it could be foreseen that ensemble learning is a good approach for
a certain of use cases.
3.4. Closed Control Loop
The prediction made by machine learning mechanism could be directly
used on manipulating the network traffic, or other relevant actions,
such as changing the device configuration, etc.
However, as the introduction section said, this kind of utilization
might be suitable only for a small set of the use cases, due to the
limited accuracy of machine learning technologies. Besides, some
critical usages simply cannot tolerate any false decision.
4. Use Cases Study of Applying Machine Learning in Network
Editor notes: This section is a collection of the work presented in
the proposed NMLRG session in IETF95 meeting. More contributions on
use cases are welcome.
4.1. HTTPS Traffic Classification
Managing network traffic requires a good understanding of the content
of traffic flows for various purposes. Indeed, enhancing the QoS by
prioritizing or scheduling the flows or enforcing security policies
by filtering some of them cannot solely on rely protocol headers like
IP, TCP or UDP headers. Analyzing the user content with DPI is so
necessary. However, this poses serious concerns regarding the user
privacy. In addition, OTT (Over-the-Top) actors would prefer to
fully control their network traffic rather than being subject to any
intermediaries policies. As a result, encrypting the traffic has
been widely adopted in last years.
In that context, traffic management is facing to severe difficulties
since DPI is not efficient anymore. Using an intermediary service or
proxy are the only ways to analyze the content of encrypted traffic
but it requires a high trustfulness in the intermediaries and so not
always guaranteed, for example with end-users of an operator
networks.
Therefore, new techniques wit the ability to extract knowledge and
insight from encrypted flows is necessary. Especially HTTPS
Jiang, et al. Expires December 5, 2016 [Page 6]
Internet-Draft Network Machine Learning June 2016
[RFC2818] is now a major protocol use over Internet because it
provides secure Web communication while Web is now embracing various
services which have been provided apart in the past: email, video
streaming, chat, VoIP, file sharing, etc. It relies on TLS
(Transport Layer Security) [RFC5246], [RFC6066] to encapsulate HTTP
requests.
Being able to identify the service and the providers of an HTTPS
connection would help in applying different strategies for managing
the corresponding flow. For instance, VoIP (Voice over IP) and email
do not require the same QoS or some service use might be prohibited
like file sharing to avoid data leakage in a company.
As a concrete example, Google, Facebook or Amazon are service
providers while maps, drive, gmail are services of Google. To
identify them when they are accessed by a user, IP addresses and DNS
(Domain Name System) names based identification is not reliable as
the users can relies on intermediates to respectively serve as proxy
or resolve DNS requests. The SNI (Server Name Indication) [RFC5246]
is an extension of HTTPS which is indicated by the user when
initiating the TLS handshake (Client Hello). SNI actually contains
the hostname to which the request is addressed. Such an hostname is
significative of the service and service provider name. However, SNI
is an optional field and can be easily forged to circumvent HTTPS
filtering without impacting service use [bypasssni]. More advanced
mechanisms are hence necessary to improve the robustness of
identification even in the case of non collaborative users.
Because the objective is to automatically label an HTTPS connection
by the service and service provider associated with. The TLS
handshake is not encrypted but data exchanged during this phase
(random number, selected ciphers,...) is not distinctive of the
accessed service. However, the nature of accessed service directly
impacts on user content transmitted through the secure channel
especially on the type, size and way to transmit those data. Such
metadata are still measurable properties.
Jiang, et al. Expires December 5, 2016 [Page 7]
Internet-Draft Network Machine Learning June 2016
HTTPS Connection
+
|(1)
+-------v------+
|TLS Connection|
|Reconstruction|
+-------+------+
|(2)
+-------v------+ (3') (4')
| Features +-------------+----------------------------+
| Extraction | | |
+-------+------+ +-------v---------+ +----v----+
| |Service Provider +------------->Services |
|(3) |L1 model | Load |L2 model |
| +-------^---------+ services +----^----+
+-------v------+ | model X |
|SNI Labelling | +----------------------------+
+-------+------+ |(5)
| +-----------------------------------------+
+------------> Training and |
(4) | Models building |
+-----------------------------------------+
Two-levels HTTPS traffic classification
In figure above, step(1) consists in reconstructing the HTTPS
connection and retrieving packets on top of which the following
metrics are observed (2):
o Inter Arrival Time
o Packet size
o Encrypted data size: this feature has the advantage to be strongly
related to the service accessed instead of the packet size which
is biased by other lower layer headers
Based on these values, aggregated features are computed: average,
minimum, maximum, 25th percentile, median, 75th percentile.
Because different providers may offer a similar service, a single
classifier could fail to to distinguish them. A multi-level machine
learning approach has been proposed. For learning, a dataset without
forged SNI is used (3) to build the classifiers (4). The result is
(5):
o a first level model (L1 model) whose the goal is to identify the
service provider,
Jiang, et al. Expires December 5, 2016 [Page 8]
Internet-Draft Network Machine Learning June 2016
o a set of second level models (L2 models), one for each service
provider to identify specific service of a service provider
Once all classifiers are trained, a new unknown HTTPS connection is
first matched against the LV1 model (3'). The output is the
predicted service provider but also leads to load the corresponding
LV2 model (4') to determine the specific service of this service
provider.
This framework is independent of the ML technique. being used. Each
model could be also built with a different technique but our study
have shown that best results are obtained with Random Forest.
The HTTPS classification framework has been tested over 288,901
connections from lab users. Standard evaluation procedure have been
applied. Less representative features have been automatically
discarded. Using a ten-fold cross-validation, each tested connection
has been marked as perfect identification (both the service provider
and the service name are rightly identified), partial identification
(only the service provider is identified) or invalid (none of them).
93.1% falls in the first category, 2.9% in the second and the rest in
the third. Full results are available in [httpsframework].
Although results are promising, the current method can only be
applied at the end once the HTTPS connection, i.e. after being
reconstructed. This avoids to apply any kind of policies to the
corresponding traffic flow. Future challenge is thus to classify the
connection before it ends in order to apply.
4.2. Malicious Domains: Automatic Detection with DNS Traffic Analysis
Since their inception, domain names have been used to provide a
simple identification label for hosts, services, applications, and
networks on the Internet [RFC1034]. In the same way, domains and the
DNS infrastructure have also been misused in various types of abuses,
such as phishing, spam, malware distribution, among others.
Newly registered malicious domain names are well-know to a very
distinct initial DNS lookup pattern than legitimates ones: typically,
they exhibit an abnormally higher number of lookups [Hao2011]. One
of the reasons is that malicious domains tend to rely upon spam
campaigns within the first ours after the registration of these
domains in order to maximize the number of victims before the domain
is detected and taken down.
In order to protect users from such domains, nDEWS (New Domains Early
Warning System) [Moura2016], a tool that classifies the newly
registered domains based on their initial lookup pattern, has been
Jiang, et al. Expires December 5, 2016 [Page 9]
Internet-Draft Network Machine Learning June 2016
proposed. To perform that, it is required to have access to (i) a
domains registration database and (ii) authoritative DNS server
traffic data, which is typically the case for Top-Level Domains (TLD)
registries. These domains are classified using k-means as a
clustering method into two clusters using four features extracted
from the analyzed DNS traffic: # DNS queries, # IP addresses, #
Autonomous Systems (ASes), and # Countries, which were chosen
empirically.
As a result, in an automated fashion, a large variety of suspicious
domains can be detected, including phishing, malware, but also other
types, such as fake pharmaceutical shops as well as counterfeit
sneakers. In this particular case, the responsible registrars are
notified in this pilot study about these websites. Ultimately, it
allows these websites to be taken down, minimizing the potential
number of victims.
4.3. Machine-learning based Policy Derivation and Evaluation in
Broadband Networks
Service provisioning is becoming more complex. For instance, there
are services having diverse quality requirements, there is variance
of the requirements in time and space, and there is the need for
utmost resource efficiency. Moreover, full agility in time and space
(in order to accomplish resource efficient service provisioning)
requires the solution of computationally intensive tasks. In this
respect, policies can play a role: specify the network behaviour in
time periods and service area regions.
In this direction, machine learning can have a fundamental role,
e.g., for learning situations encountered and "good" ways (policies)
for handling them. The contribution addresses the role that machine
learning can play for policy derivation and evaluation. In more
detail it addresses the requirements on the role of machine learning,
including potential inputs and outputs.
Knowledge and machine learning can be an important aspect of wireless
networks. Knowledge is created both regarding the contexts and their
occurrence, as well as on the association of the context with
specific actions and its scoring. The latter encompasses development
of knowledge on how to handle acquired contexts; this knowledge will
include the contexts encountered, the corresponding handlings done
(decisions applied), the potential alternative handlings, and the
respective efficiency of each handling (actually applied or
alternate).
Reinforcing "good" solutions per each encountered context (e.g.
reinforcement learning) can be a vital and unique element of a
Jiang, et al. Expires December 5, 2016 [Page 10]
Internet-Draft Network Machine Learning June 2016
knowledge-based management system. Machine learning can be realized
through clustering to discover underlying structures in data,
regression to identify patterns and predict values in cell and
network usage, classification to classify first-seen unknown users,
and density estimation to model complex user behavior and network
usage. Several deep architectures and techniques (such as pre-
training) can be utilized, in order to generalize better on complex
data with underlying information and be able to make accurate
predictions, even on unseen data.
As a result, depending on what we want to achieve, the proper machine
learning approach can be used.
Through machine learning it will be possible to provide faster and
targeted solutions to specific network problems. Moreover, it is
possible cluster various usage profiles and prioritize the traffic
according to the criticality level. For instance, mission critical
services need special attention with respect to latency and
prioritization, compared to plain services which may tolerate a bit
of delay without jeopardizing the overall quality. In addition,
machine learning can lead to improved results in KPIs (Key
Performance Indicator) such as end-user throughput, latency, energy
consumption and overall cost effectiveness. Moreover, reliability
can be increased since certain problematic situations may be
predicted before happening, hence it will be possible to act pro-
actively and alleviate the negative impact of a problem in the
network.
It is evident that machine learning can have significant importance
in policy derivation and evaluation in broadband networks, especially
towards in 5G infrastructures which will be complex, heterogeneous
and need to accommodate multi-services ranging from mobile broadband
to massive machine type, mission critical and vehicular
communications.
4.4. Traffic Anomaly Detection in the Router
Modern routers usually have the capability that makes alarms of high
bandwidth usage rate of a specific interface. When network traffic
exceeds a certain threshold, the router will consider it as an
anomaly event and report it to the NMS (Network Management System).
For instance, in some routers/switches, there exists configuration
such as "trap-threshold { input-rate | output-rate }" to trigger
traffic alarms, which is statically configured by experienced
administrators. However, network traffic is usually not static and
even changes significantly due to the changes of carried services,
residential situation, and etc. Thus, static configuration could not
effectively identify the traffic anomaly events.
Jiang, et al. Expires December 5, 2016 [Page 11]
Internet-Draft Network Machine Learning June 2016
To address above issue, machine learning technologies are applied for
routers/switches to learn local traffic pattern and detect the
traffic anomaly events based on the learning results.
Wavelets are employed to analyze time-series network traffic for
anomaly detection. In some certain interval, the routers measure,
record, and analyze the input and output traffic rates respectively,
or in the form of rate sums. (The former is recommended for a finer
granularity analysis.)
Running for some time, the router would get a set of "time-rate"
data, collected as time-series waves for further wavelet analysis.
Besides wavelets, this use case proposes other machine learning
techniques such as outlier detection. For this way, features are to
be extracted from wavelets for supervised or unsupervised learning.
After data collection, the router would sort up the data and figure
out the alarm threshold statistically based on data distribution, to
discriminate the normal and outlier traffic rates. When interface
traffic exceeds the threshold, the router would make alarms to the
NMS. The router could dynamically adjust the alarm threshold with
new coming data, by periodical anomaly analysis. This approach helps
devices detect traffic anomaly more efficiently and effectively,
compared to traditional way of learning at the central repository
that collects traffic information from various devices.
This use case could be extended from single interface to multiple
ones, that is, device scope of multiple traffic waves, and even wider
scope of multiple devices in a certain domain. Thus would make the
analysis more comprehensive.
Besides wavelet analysis, there might be more techniques to explore,
such as correlation analysis of traffic anomaly events among multiple
devices.
4.5. Applications of Machine Learning to Flow Monitoring
A commercial cloud-based flow monitoring service from Network
Polygraph [polygraph] has used Machine Learning analysis as a cost-
effective alternative to DPI for traffic classification, which
identifies the application responsible for each network traffic flow.
Nowadays, DPI is considered as the standard technology for traffic
classification. However, DPI is generally expensive as it requires
the analysis of the payload of every single packet. This usually
involves the use of powerful, specialized hardware appliances, which
need to be deployed in every link to obtain full coverage of the
network. In the case of Network Polygraph, the use of DPI is
Jiang, et al. Expires December 5, 2016 [Page 12]
Internet-Draft Network Machine Learning June 2016
impractical, because the volume of data to be exported to the cloud
would be overwhelming (i.e., all traffic should be replicated). A
more viable alternative is the use flow-based monitoring
technologies, such as NetFlow [RFC3954] or IPFIX [RFC7011], where the
volume of exported data is significantly lower. Flow-based
monitoring technologies provide summarized information (e.g.,
duration, traffic volume) for every connection (or "traffic flow")
handled by a router. The information available in flow records is
more limited compared to DPI (e.g., packet payloads are not
available). As a result, most flow-based monitoring tools base their
classification on the port numbers or simple heuristics, which are
known to be highly unreliable.
To address this problem, Network Polygraph uses a traffic
classification approach based on ML. Several studies showed that
supervised learning can achieve similar classification accuracy to
DPI at a fraction of its cost. However, supervised methods suffer
from some practical limitations that make them very difficult to
deploy and maintain in production environments. For example, they
require a costly training phase prior to its deployment and need to
be frequently retrained, every time there is a change in the network
or in the network applications.
This section describes the ML approach used by Network Polygraph for
online classification of NetFlow/IPFIX traffic. To solve the
practical limitations of supervised learning, Network Polygraph
incorporates an automatic retraining system. Figure 1 shows the
components and data flow of the classification engine, which is
divided in two parts:
o The classification path (Figure 1, top) is in charge of the
classification of the traffic online using ML. The input of the
classification path are the NetFlow/IPFIX flows exported by the
routers, while the output are the classified flows. Several
traffic features are extracted from each flow, including the
information directly available in the flow records (e.g.,
addresses, ports, packet and byte counts) together with some
features we construct (e.g., average packet size, rate and
interarrival time). The traffic features are the input of the
traffic classification algorithm, whose function is to identify
the application that generated the flow. Among the different
supervised algorithms, a C5.0 decision tree was selected, because
it has been shown to present the best accuracy/cost ratio for
traffic classification. Other supervised methods, e.g., Support
Vector Machine (SVM) and Artificial Neural Network (ANN), obtain
similar accuracy, but classification and training times are faster
with decision trees. In Network Polygraph, training times are
Jiang, et al. Expires December 5, 2016 [Page 13]
Internet-Draft Network Machine Learning June 2016
critical as the training path is continuously updating the
classification model in the background.
o The training path (Figure 1, bottom) implements the automatic
retraining system, which is responsible of automatically updating
the classification model when it becomes obsolete. To that end, a
random packet-level sample of the network traffic is continuously
collected using flow-based sampling. Sampled flows are then
labeled using DPI. It is possible to use DPI in the training path
because training can be performed only with a small data sample
(e.g., 1/1000 flows). This significantly reduces the
computational overhead and volume of data to be exported. The
labeled sample is used to verify the accuracy of the
classification model. The system accuracy is estimated by
comparing the output of DPI (training path) and C5.0
(classification path) for those flows sampled in the training
path. If the estimated accuracy falls below a configurable
threshold, the labeled sample is used to generate an updated model
using only those features available in NetFlow/IPFIX (IP Flow
Information Export) records. This training process can also be
performed in few vantage points, and use it for other networks
where only NetFlow/IPFIX monitoring data is available.
CLASSIFICATION PATH
NetFlow/ +----------+ +----------+ Classified
IPFIX | Feature | | C5.0 | flows
+-------->|Extraction+------------------------>|Classifier+----------->
| | | |
+----------+ +----------+
^
|
TRAINING PATH +----------+ +----------+ |
| NetFlow/ | | Feature | | Retraining
+-->| IPFIX +-->|Extraction+--+ |
Packet stream | |Generation| | | | |
(flow sampling) | +----------+ +----------+ | |
+--------------->| +--+ DPI-labeled
| +----------+ | NetFlow/
| | DPI | | IPFIX
+---------->| App. +---------+
| Labeling |
+----------+
Network Polygraph classification engine data flow
Figure 1
Jiang, et al. Expires December 5, 2016 [Page 14]
Internet-Draft Network Machine Learning June 2016
In order to validate the performance of the described ML approach,
the accuracy of Network Polygraph was measured using a complete
14-day trace from the 10-Gigabit link that connects the Catalan
Research and Education Network (Anella Cientifica) to its Spanish
counterpart (RedIRIS). The trace contained about 70 million flows
with a flow sampling rate of 1/400. The experimental results showed
that, with a 96% retraining threshold, the system sustained an
average classification accuracy of 97.5%, needing only 15 retrainings
during the 14 days, which were performed automatically without
requiring any human intervention. When the retraining threshold was
decreased to 94%, the accuracy was slightly reduced to 96.76% with
only 5 retrainings.
The target objective is to progressively reduce the dependence on DPI
technologies, which are expensive, difficult to deploy, not scalable,
and not robust against encryption, in favor of flow-based machine
learning approaches that are more cost-effective and can be easily
offered as a cloud service. In this direction, some research
challenges include the classification of web services and CDN traffic
from flow-based measurements, and the combination of multiple ground
truths obtained from vantage points in different networks.
5. Security Considerations
This document is focused on applying machine learning in network,
including of course applying machine learning in network security, on
higher-layer concepts. Therefore, it does not itself create any new
security issues.
6. IANA Considerations
This memo includes no request to IANA.
7. Acknowledgements
The authors would like to acknowledge Josep Sanjuas, Andreas
Georgakopoulos, Kostas Tsagkaris, Valentin Carela, Wazen M. Shbair,
Thibault Cholez, and Isabelle Chrisment for their contributions.
The author would like to acknowledge the valuable comments made by
participants in the IRTF Network Machine Learning Research Group,
particular thanks to Lars Eggert, Brian Carpenter, Albert Cabellos,
Shufan Ji, Susan Hares, Rudra Saha, and Dacheng Zhang.
Jerome Francois was partly funded by Flamingo, a Network of
Excellence project (ICT-318488) supported by the European Commission
under its 7th Framework Programme.
Jiang, et al. Expires December 5, 2016 [Page 15]
Internet-Draft Network Machine Learning June 2016
This document was produced using the xml2rfc tool [RFC7749].
8. Change log [RFC Editor: Please remove]
draft-jiang-nmlrg-traffic-machine-learning-00: original version,
2016-06-03.
9. Informative References
[bypasssni]
Shbair, W., Cholez, T., Goichot, A., and I. Chrisment,
"Efficiently Bypassing SNI-based HTTPS Filtering", IFIP/
IEEE International Symposium on Integrated Network
Management (IM2015) , 2015.
[Hao2011] Hao, S., Feamster, N., and R. Pandrangi, "Monitoring the
Initial DNS Behavior of Malicious Domains", Proceedings of
the 2011 ACM SIGCOMM Conference on Internet Measurement
Conference (IMC 2011) , Nov 2011.
[httpsframework]
Shbair, W., Cholez, T., Francois, J., and I. Chrisment, "A
Multi-Level Framework to Identify HTTPS Services", IEEE/
IFIP Network Operations and Management Symposium , 2016.
[Moura2016]
M. Moura, G., Mueller, M., Wullink, M., and C. Hesselman,
"nDEWS: a New Domains Early Warning System for TLDs",
IEEE/IFIP International Workshop on Analytics for Network
and Service Management (AnNet 2016), co-located with IEEE/
IFIP Network Operations and Management Symposium (NOMS
2016) , 04 2016.
[polygraph]
"Network Polygraph", <https://polygraph.io>.
[RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
<http://www.rfc-editor.org/info/rfc1034>.
[RFC2818] Rescorla, E., "HTTP Over TLS", RFC 2818,
DOI 10.17487/RFC2818, May 2000,
<http://www.rfc-editor.org/info/rfc2818>.
[RFC3954] Claise, B., Ed., "Cisco Systems NetFlow Services Export
Version 9", RFC 3954, DOI 10.17487/RFC3954, October 2004,
<http://www.rfc-editor.org/info/rfc3954>.
Jiang, et al. Expires December 5, 2016 [Page 16]
Internet-Draft Network Machine Learning June 2016
[RFC5246] Dierks, T. and E. Rescorla, "The Transport Layer Security
(TLS) Protocol Version 1.2", RFC 5246,
DOI 10.17487/RFC5246, August 2008,
<http://www.rfc-editor.org/info/rfc5246>.
[RFC6066] Eastlake 3rd, D., "Transport Layer Security (TLS)
Extensions: Extension Definitions", RFC 6066,
DOI 10.17487/RFC6066, January 2011,
<http://www.rfc-editor.org/info/rfc6066>.
[RFC6437] Amante, S., Carpenter, B., Jiang, S., and J. Rajahalme,
"IPv6 Flow Label Specification", RFC 6437,
DOI 10.17487/RFC6437, November 2011,
<http://www.rfc-editor.org/info/rfc6437>.
[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
"Specification of the IP Flow Information Export (IPFIX)
Protocol for the Exchange of Flow Information", STD 77,
RFC 7011, DOI 10.17487/RFC7011, September 2013,
<http://www.rfc-editor.org/info/rfc7011>.
[RFC7749] Reschke, J., "The "xml2rfc" Version 2 Vocabulary",
RFC 7749, DOI 10.17487/RFC7749, February 2016,
<http://www.rfc-editor.org/info/rfc7749>.
Authors' Addresses
Sheng Jiang (editor)
Huawei Technologies Co., Ltd
Q 22, Huawei Campus, No.156 Beiqing Road
Hai-Dian District, Beijing, 100095
P.R. China
Email: jiangsheng@huawei.com
Bing Liu
Huawei Technologies Co., Ltd
Q 22, Huawei Campus, No.156 Beiqing Road
Hai-Dian District, Beijing, 100095
P.R. China
Email: leo.liubing@huawei.com
Jiang, et al. Expires December 5, 2016 [Page 17]
Internet-Draft Network Machine Learning June 2016
Panagiotis Demestichas
University of Piraeus
Piraeus
Greece
Email: pdemestichas@gmail.com
Jerome Francois
Inria
615 rue du jardin botanique
54600 Villers-les-Nancy
France
Email: jerome.francois@inria.fr
Giovane C. M. Moura
SIDN Labs
Meander 501
Arnhem, 6825 MD
The Netherlands
Email: giovane.moura@sidn.nl
Pere Barlet
Network Polygraph
Edifici K2M - Parc UPC
Jordi Girona, 1-3, Barcelona 08034
Spain
Email: pbarlet@polygraph.io
Jiang, et al. Expires December 5, 2016 [Page 18]