NFVRG                                                       Bose Perumal
Internet-Draft                                               Wenjing Chu
Intended Status: Informational                               R. Krishnan
                                                           Hemalathaa. S
                                                                    Dell
Expires: December 3, 2015                                  June 29, 2015


             NFV Compute Acceleration Evaluation and APIs
           draft-perumal-nfvrg-nfv-compute-acceleration-00
Abstract
   Network functions are being virtualized and moved to industry-
   standard servers. Steady growth in traffic volume requires more
   compute power to process network functions. Packet-based network
   architectures offer considerable scope for parallel processing.
   Generic parallel processing can be done on common multicore
   platforms such as GPUs, coprocessors like the Intel Xeon Phi [6][7],
   and Intel [7]/AMD [10] multicore CPUs. In this draft, to check the
   feasibility of exploiting this parallel processing capability,
   multi-string matching for URL filtering is taken as the sample
   network function. The Aho-Corasick algorithm is used for
   multi-pattern matching. The implementation uses OpenCL [3] to
   support many common platforms [7][10][11]. A series of optimizations
   is applied, and the application is tested on an Nvidia Tesla K10
   GPU. A common API for NFV Compute Acceleration is proposed.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright and License Notice
Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents

   1. Introduction
      1.1 Terminology
   2. OpenCL based Virtual Network Function Architecture
      2.1 CPU Process
      2.2 Device Discovery
      2.3 Mixed Version Support
      2.4 Scheduler
   3. Aho-Corasick Algorithm
   4. Optimizations
      4.1 Variable size packet packing
      4.2 Pinned Memory
      4.3 Pipelined Scheduler
      4.4 Reduce Global memory access
      4.5 Organizing GPU cores
   5. Results
      5.1 Worst case 0 string match
      5.2 Packet match
   6. Compute Acceleration API
      6.1 Add Network Function
      6.2 Add Traffic Stream
      6.3 Add Packets to Buffer
      6.4 Process Packets
      6.5 Event Notification
      6.6 Read Results
   7. Other Accelerators
   8. Conclusion
   9. Future Work
   10. Security Considerations
   11. IANA Considerations
   12. References
      12.1 Normative References
      12.2 Informative References
   Acknowledgements
   Authors' Addresses
1. Introduction
   Network equipment vendors use specialized hardware to process data
   at low latency and high throughput. Packet processing above 4 Gb/s
   is done using expensive, purpose-built application-specific
   integrated circuits (ASICs). However, low unit volumes force
   manufacturers to price these devices at many times the cost of
   producing them in order to recover the R&D cost.
   Network Function Virtualization (NFV) [1] is a key emerging area for
   network operators, hardware and software vendors, cloud service
   providers, and network practitioners and researchers in general. NFV
   introduces virtualization technologies into the core network to
   create a more intelligent, more agile service infrastructure.
   Network functions that are traditionally implemented in dedicated
   hardware appliances will need to be decomposed and executed in
   virtual machines running in data centers. The parallelism of
   graphics processors gives them the potential to function as network
   coprocessors.
   A virtual network function is responsible for a specific treatment
   of received packets and can act at various layers of a protocol
   stack. When more compute power is available, multiple virtual
   network functions can be executed in a single system or VM, and some
   of them can be processed in parallel with other network functions.
   This draft proposes a method to represent an ordered set of virtual
   network functions as a combination of sequential and parallel
   stages. This draft concerns software-based network functions, so any
   further reference to a network function means a virtual network
   function.
   Software written for specialized hardware such as network
   processors, ASICs, and FPGAs is closely tied to the hardware and to
   specific vendor products; it cannot be reused on other hardware
   platforms. For generic compute acceleration, different hardware
   platforms can be used: GPUs from different vendors, Intel Xeon Phi
   coprocessors, and multicore CPUs from different vendors. All of
   these compute acceleration platforms support OpenCL as a parallel
   programming language. Instead of every vendor writing its own OpenCL
   code, this draft proposes an NFV Compute Acceleration (NCA) API as a
   common compute accelerator interface. The API will be a library of C
   functions for declaring network functions as an ordered set and for
   moving packets around.
   Multi-pattern string matching is used in a number of applications,
   including network intrusion detection and digital forensics; hence
   multi-pattern matching is chosen as the sample network function. The
   Aho-Corasick algorithm [2], with a few modifications, is used to
   find the first occurrence of any pattern from the signature
   database. Throughput numbers are measured based on this network
   function.
1.1 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
2. OpenCL based Virtual Network Function Architecture
   Network functions like multi-pattern matching are compute intensive
   and common to multiple NFV applications. Generic compute
   acceleration framework functions and service-specific functions are
   clearly separated in this prototype. The architecture with one
   network function is shown in Figure 1; multiple network functions
   can also be loaded. Most signature-based algorithms, such as
   Aho-Corasick [2] and regular expression matching [8], generate a
   Deterministic Finite Automaton (DFA) [2][8]. The DFA database is
   generated on the CPU and loaded into the accelerator; kernels
   executed on the accelerator use the DFA.
+----------------------------------------+ +-------------------+
| CPU Process | | GPU/Xeon Phi,etc. |
| | | |
| | | |
| | | |
| Scheduler | | |
| +------------+ +------------+ | | +------------+ |
| | Packet | |Copy Packet | | | |Input Buffer| |
| |Generator +----->|CPU to GPU +------------->|P1,P2,...,Pn| |
| +------------+ +------------+ | | +-----+------+ |
| | | | |
| +------------+ | | +-----v------+ |
| |Launch GPU | | | |GPU Kernels | |
| |Kernels +------------->|K1,K2,...,Kn|<+ |
| +------------+ | | +-----+------+ | |
| | | | | |
| +-------------+ +------------+ | | +-----v------+ | |
| |Results for | |Copy Results| | | |Result Buf | | |
| |each packet +<-----+GPU to CPU |<-------------+R1,R2,...,Rn| | |
| +-------------+ +------------+ | | +------------+ | |
| | | | |
| | | | |
| +----------------+ +-----------+ | | +------------+ | |
| |Network Function| | NF | | | | NF | | |
| |(AC,Regex,etc) +-->| Database +------------->| Database +-+ |
| +-------+--------+ +-----------+ | | +------------+ |
| ^ | | |
+-----------|----------------------------+ +-------------------+
|
+----+------+
| Signature |
| Database |
+-----------+
              Figure 1. OpenCL based Virtual Network Function
                       Software Architecture Diagram
2.1 CPU Process
   Accelerators like GPUs and coprocessors augment the CPU; currently
   they cannot function alone. A virtual network function is therefore
   split between CPU and GPU: the CPU process owns packet
   preprocessing, packet movement, and scheduling, while the GPU
   performs the core functionality of the network functions. The CPU
   process interfaces between packet I/O and the GPU. During
   initialization it performs the following steps:
1. Device Discovery
2. Initialize OpenCL object model
3. Initialize memory module
4. Initialize network functions
5. Trigger scheduler
2.2 Device Discovery
   Using OpenCL functions, the device discovery module discovers the
   available platforms and devices. Based on the number of devices
   discovered, a device context and command queues are created.
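
   The following is a minimal sketch of this discovery sequence using
   standard OpenCL 1.2 host API calls; names such as MAX_DEVICES and
   the single-platform assumption are illustrative.

      #include <CL/cl.h>

      #define MAX_DEVICES 8

      cl_platform_id   platform;
      cl_device_id     devices[MAX_DEVICES];
      cl_uint          num_devices;
      cl_context       context;
      cl_command_queue queues[MAX_DEVICES];
      cl_int           err;

      /* Discover the first platform and all GPU devices on it. */
      err = clGetPlatformIDs(1, &platform, NULL);
      err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                           MAX_DEVICES, devices, &num_devices);

      /* One context for all devices, one command queue per device;
       * profiling is enabled for the measurements in Section 5. */
      context = clCreateContext(NULL, num_devices, devices,
                                NULL, NULL, &err);
      for (cl_uint i = 0; i < num_devices; i++)
          queues[i] = clCreateCommandQueue(context, devices[i],
                                           CL_QUEUE_PROFILING_ENABLE,
                                           &err);

   In the prototype more than one queue is created per device (see
   Section 2.4); a single queue per device is shown here for brevity.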
2.3 Mixed Version Support
   OpenCL is designed to support devices with different capabilities
   under a single platform [3]. There are three version identifiers in
   OpenCL: the platform version, the version of a device, and the
   version(s) of the OpenCL C language supported on a device.

   The platform version indicates the version of the OpenCL runtime
   supported. The device version is an indication of the device's
   capabilities. The language version for a device represents the
   OpenCL programming language features a developer can assume are
   supported on a given device.

   OpenCL C is designed to be backwards compatible, so a device is not
   required to support more than a single language version to be
   considered conformant. If multiple language versions are supported,
   the compiler defaults to the highest language version supported by
   the device.

   Code written for an old device version may not utilize the full
   capabilities of a new device if there have been hardware
   architectural changes.
2.4 Scheduler
   Scheduling between the packet buffers coming from network I/O and
   the device command queues is carried out by the scheduler. The
   scheduler operates on the following parameters:

   N - Number of packet buffers (default 6)
   M - Number of packets in each buffer (default 16384)
   K - Number of devices (2 discovered in our setup)
   J - Number of command queues for each device (default 3)
   I - Number of commands to the device to complete a
       single network function (default 3)
   S - Number of network functions executed in parallel (default 1)

   The default values above gave the best results in our current
   hardware environment for the multi-string match function.
   Operations for completing a network function for one packet buffer:
1. Identify a free command queue
2. Copy packets from IO memory to pinned memory for GPU
3. Fill Kernel function parameters
4. Copy pinned memory to GPU global memory
5. Launch kernels for number of packets in the packet buffer
6. Check kernel execution completion and collect results
7. Report results to application
   The scheduler calls the OpenCL API with the number of kernels to be
   executed in parallel; distributing the kernels to cores is handled
   by the OpenCL library. If any error occurs while launching the
   kernels, the OpenCL API returns an error code and appropriate error
   handling can be done.
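
   As a minimal sketch (variable names such as global_size are
   illustrative), a kernel launch and its error check might look like:

      size_t global_size = num_packets; /* one work-item per packet */
      size_t local_size  = 128;         /* tuned per device, see 4.5 */
      cl_event done_event;

      cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                          &global_size, &local_size,
                                          0, NULL, &done_event);
      if (err != CL_SUCCESS) {
          fprintf(stderr, "kernel launch failed: %d\n", err);
          /* e.g. retry on another queue or fall back to the CPU */
      }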
3. Aho-Corasick Algorithm
   The Aho-Corasick algorithm [2] is a widely used and effective
   multi-pattern matching algorithm. It is a dictionary-matching
   algorithm that locates elements of a finite set of strings within an
   input text. The complexity of the algorithm is linear in the length
   of the patterns plus the length of the searched text plus the number
   of output matches.

   The algorithm works in two parts. The first part builds a tree
   (state machine) from the keywords to be searched for; the second
   part searches the text for the keywords using the previously built
   state machine. Searching for a keyword is efficient because it only
   moves through the states of the state machine. If a character
   matches, the goto() function is followed; otherwise the fail()
   function is followed. A found match is reported by the out()
   function.
   All three functions simply access indexed data structures and return
   a value. The goto() data structure is a two-dimensional matrix
   indexed by the current state and the character currently being
   compared. The fail() function is an array that links each state to
   its alternate path. The out() function is an array indexed by state
   that records whether a string search completes at that state.
   Based on the signature database, all three data structures are
   constructed on the CPU. They are copied to GPU global memory during
   the initialization stage, and pointers to them are passed as kernel
   parameters when the kernels are launched.
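
   A minimal OpenCL C sketch of such a search kernel is shown below. It
   assumes a dense goto() table of 256 entries per state, with -1
   marking fail transitions and a root state (state 0) whose row is
   fully populated; the separate offsets/lens arrays are one possible
   batch layout (Section 4.1 stores the length inside the buffer
   instead).

      __kernel void ac_match(__global const uchar *pkts,
                             __global const int  *offsets, /* per pkt */
                             __global const int  *lens,    /* per pkt */
                             __global const int  *goto_tbl,/* [s][256]*/
                             __global const int  *fail_tbl,
                             __global const int  *out_tbl,
                             __global int        *results)
      {
          int gid = get_global_id(0);   /* one packet per work-item */
          __global const uchar *p = pkts + offsets[gid];
          int s = 0;

          for (int i = 0; i < lens[gid]; i++) {
              int nxt = goto_tbl[s * 256 + p[i]];
              while (nxt == -1) {       /* follow fail() links      */
                  s   = fail_tbl[s];
                  nxt = goto_tbl[s * 256 + p[i]];
              }
              s = nxt;
              if (out_tbl[s]) {         /* first match: stop early  */
                  results[gid] = out_tbl[s];
                  return;
              }
          }
          results[gid] = 0;             /* no pattern matched       */
      }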
4. Optimizations
   For this prototype an Nvidia Tesla K10 GPU [5] is used, which has 2
   processors with 1536 cores each, running at 745 MHz. Each processor
   has 4 GB of memory attached to it. The card is connected to the CPU
   via a PCIe 3.0 x16 interface.

   The server used is a Dell R720 with two Intel Xeon 2665 processors,
   each having 16 cores. Only one CPU core is used for our experiment.
4.1 Variable size packet packing
   Multiple small copies from CPU to GPU are costly, so packets are
   batched for processing on the GPU. Packet sizes vary from 64 bytes
   to 1500 bytes; a fixed-size slot for each packet would copy a lot of
   unwanted memory from CPU to GPU when packets are small.
   For variable-size packing, one single large buffer is allocated for
   the number of packets in the batch. The initial portion of the
   buffer holds the packet start offsets for all packets; at each
   packet offset, the packet size and the packet contents are placed.
   Only the portion of the buffer actually filled with packets is
   copied from CPU to GPU.
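
   The following C sketch shows one possible packing routine; the exact
   record layout (count, offset table, then length-prefixed packets) is
   an assumption consistent with the description above.

      #include <stddef.h>
      #include <string.h>

      /* Pack variable-size packets into one buffer.  Returns the
       * number of bytes used, i.e. how much must be copied CPU -> GPU,
       * or 0 if the batch does not fit. */
      size_t pack_packets(char *buf, size_t buf_size,
                          char **pkts, const int *lens, int count)
      {
          int *hdr = (int *)buf;            /* count + offset table */
          size_t pos = sizeof(int) * (size_t)(1 + count);

          hdr[0] = count;
          for (int i = 0; i < count; i++) {
              pos = (pos + 3) & ~(size_t)3; /* 4-byte align record  */
              if (pos + sizeof(int) + (size_t)lens[i] > buf_size)
                  return 0;
              hdr[1 + i] = (int)pos;        /* offset of packet i   */
              memcpy(buf + pos, &lens[i], sizeof(int));
              memcpy(buf + pos + sizeof(int), pkts[i],
                     (size_t)lens[i]);
              pos += sizeof(int) + (size_t)lens[i];
          }
          return pos;
      }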
4.2 Pinned Memory
   Memory allocated using malloc is paged memory. When copying from CPU
   to GPU, data is first copied from paged memory to non-paged memory,
   and then from non-paged memory to GPU global memory.

   OpenCL provides commands and a procedure to allocate and copy memory
   from non-paged (pinned) memory [3][4]. Using pinned memory avoids
   one internal copy and showed a 3x improvement in memory copy time.
   In our experiments, pinned memory was used for the CPU-to-GPU packet
   buffer copy and the GPU-to-CPU result buffer copy.
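
   A minimal sketch of the usual OpenCL pinned-memory idiom [4]
   follows; error checking is omitted and buffer names are
   illustrative.

      /* Allocate a pinned host buffer and map it into host memory. */
      cl_mem pinned = clCreateBuffer(context,
                                     CL_MEM_READ_ONLY |
                                     CL_MEM_ALLOC_HOST_PTR,
                                     buf_size, NULL, &err);
      char *host_ptr = (char *)clEnqueueMapBuffer(queue, pinned,
                                                  CL_TRUE, CL_MAP_WRITE,
                                                  0, buf_size, 0, NULL,
                                                  NULL, &err);

      /* Fill host_ptr with the packed packet batch (Section 4.1),
       * then copy to the device buffer; no extra staging copy. */
      err = clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0,
                                 batch_bytes, host_ptr,
                                 0, NULL, &copy_event);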
4.3 Pipelined Scheduler
   OpenCL supports multiple command queues, and Nvidia supports 32
   command queues. Using non-blocking calls, commands can be placed on
   each queue, so memory copies between CPU and GPU can proceed in
   parallel while GPU kernel functions are executing.

   In our experiment 6 command queues were created, 3 for each GPU
   processor. Copying a packet buffer to the GPU, launching GPU kernel
   functions, and reading results from the GPU are executed in parallel
   for 6 batches of data. Scheduling is round robin to maintain packet
   order. Pipelining hides 99% of the copy time and allows the full
   processing power of the GPU to be utilized.
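
   A sketch of one pipeline stage, chaining non-blocking enqueues with
   events (the queue and buffer arrays are illustrative):

      cl_event copied, executed;

      /* Stage one batch on queue q without blocking the host. */
      clEnqueueWriteBuffer(queues[q], dev_pkts[q], CL_FALSE, 0,
                           batch_bytes, host_pkts[q],
                           0, NULL, &copied);
      clEnqueueNDRangeKernel(queues[q], kernel, 1, NULL,
                             &global_size, &local_size,
                             1, &copied, &executed);
      clEnqueueReadBuffer(queues[q], dev_results[q], CL_FALSE, 0,
                          result_bytes, host_results[q],
                          1, &executed, &batch_done[q]);

      /* The host immediately moves on to queue (q + 1) % num_queues;
       * batch_done[q] is checked later, in round-robin order, which
       * preserves packet order across batches. */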
4.4 Reduce Global memory access
   The OpenCL architecture and the Nvidia GPU architecture have 3
   levels of memory: global, local, and private. Packets from the CPU
   are copied to GPU global memory. Global memory access is costly, and
   character-by-character access is inefficient.

   Accessing private memory is faster, but private memory is small and
   cannot hold a complete packet. Packets are therefore read 32 bytes
   at a time using vload8 with the float type.
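
   A sketch of this access pattern inside a kernel (assuming each
   packet record is 4-byte aligned, as in the packing sketch of
   Section 4.1):

      /* Read 32 bytes per global-memory access, not char by char. */
      union { float8 v; uchar c[32]; } chunk;  /* private staging */

      for (int i = 0; i < len / 32; i++) {
          chunk.v = vload8(i, (__global const float *)p);
          for (int j = 0; j < 32; j++) {
              /* process chunk.c[j], e.g. one Aho-Corasick step */
          }
      }
      /* A tail loop (not shown) handles the final len % 32 bytes. */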
4.5 Organizing GPU cores
   The number of kernel functions launched (the global size) should be
   larger than the number of GPU cores to hide latency. The GPU
   provides sub-grouping of cores that share memory; the optimal group
   size (the local size) is calculated specifically for each GPU card.
5. Results
   The performance of the GPU system was measured with different
   parameters using the Aho-Corasick algorithm. The signature database
   consists of top website names. Ethernet and IP headers, 34 bytes per
   packet, are skipped in the search. Protocol-header-only or
   application-header-only analysis can also be performed.
   The Aho-Corasick algorithm is modified to match any one string from
   the signature database. After the first string matches, the result
   is written to the result buffer and the function exits. If no string
   matches, the whole packet is searched; this is the worst-case
   performance. If a string matches early, the remainder of the packet
   is not searched.
   To understand performance and keep track of command execution
   timing, OpenCL provides the function clGetEventProfilingInfo, which
   allows a cl_event to be queried for counter values. The device time
   counter is returned in nanoseconds.
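
   For example, assuming the command queue was created with
   CL_QUEUE_PROFILING_ENABLE, the execution time of an enqueued command
   can be read from its event as:

      cl_ulong start_ns, end_ns;

      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                              sizeof(start_ns), &start_ns, NULL);
      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                              sizeof(end_ns), &end_ns, NULL);

      double elapsed_ms = (end_ns - start_ns) / 1e6; /* counters in ns */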
   For these experimental results the Nvidia Tesla K10 GPU and Dell
   R720 server are used. Results were taken by executing on a
   bare-metal Linux server; the same code can also be executed inside a
   virtual machine.
5.1 Worst case 0 string match
   Performance was measured by varying the signature database size
   among 1000, 2500, and 5000 strings. Fixed-size packets were
   generated with sizes of 64, 583, and 1450 bytes; variable-size
   packets were generated with sizes from 64 to 1500 bytes and an
   average packet size of 583 bytes. The results are shown in Table 1
   and Table 2.
   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  37.03   |  30.74   |   31.68   |   15.08    |
   |     2500     |  37.03   |  30.17   |   31.15   |   14.94    |
   |     5000     |  36.75   |  30.03   |   31.15   |   14.87    |
   +--------------+----------+----------+-----------+------------+

    Table 1: Bandwidth in Gbps for different packet sizes of traffic
   +--------------+----------+----------+-----------+------------+
   |No of Strings | 64 Fixed | 583 Fixed| 1450 Fixed|583 Variable|
   +--------------+----------+----------+-----------+------------+
   |     1000     |  77.67   |   7.07   |    2.93   |    3.47    |
   |     2500     |  77.41   |   6.95   |    2.88   |    3.44    |
   |     5000     |  77.07   |   6.91   |    2.88   |    3.42    |
   +--------------+----------+----------+-----------+------------+

      Table 2: Packet rate in million packets per second (mpps)
               for different packet sizes of traffic
   Varying the signature database size among 1000, 2500, and 5000
   strings has no major impact: the state machine grows with the
   signature database, but the processing time per packet remains the
   same. For fixed-size packets the total bandwidth processed was
   always above 30 Gbps; for variable-size packets it was 14.9 Gbps.

   Variable packet sizes range from 64 to 1500 bytes, with each packet
   assigned to one core. A core that finishes early sits idle until the
   other cores complete their work, so the full GPU power is not
   effectively utilized with variable-length packets.
5.2 Packet match
   With the match percentage as the key variable, different parameters
   are measured. Table 3 shows the match percentage against the
   bandwidth in Gbps. For this experiment, variable-size packets with
   an average of 583 bytes are used; 16384 packets are batched for
   processing on the GPU and 16384 threads are instantiated. Each
   packet is checked against 5000 strings.
   +--------------+-----------+
   | % of packets | Bandwidth |
   |   matched    |  in Gbps  |
   +--------------+-----------+
   |       0      |   14.87   |
   |      15      |   18.50   |
   |      25      |   20.85   |
   |      35      |   33.02   |
   +--------------+-----------+

   Table 3: Bandwidth in Gbps for different packet match percentages
   +--------------+---------------+
   | % of packets | No of Packets |
   |   matched    |    in mpps    |
   +--------------+---------------+
   |       0      |      3.42     |
   |      15      |      4.25     |
   |      25      |      4.80     |
   |      35      |      7.60     |
   +--------------+---------------+

        Table 4: Packet rate in mpps for different packet
                 match percentages
   The packet match percentage against the number of packets processed
   in mpps is shown in Table 4. The worst case is 0 packets matched, so
   the whole packet must be searched. The time to copy a single buffer
   (16384 packets) from CPU to GPU is 0.903 milliseconds, kernel
   execution time for a single buffer is 9.784 milliseconds, and the
   result buffer copy from GPU to CPU takes 0.161 milliseconds. A total
   of 209 buffers are processed in one second, which is 3.42 million
   packets and 14.9 Gbps.
   The best case was executed with 35% of packets matching. The time to
   copy a single buffer (16384 packets) from CPU to GPU is 0.923
   milliseconds, kernel execution time for a single buffer is 4.38
   milliseconds, and the result buffer copy from GPU to CPU takes 0.168
   milliseconds. A total of 464 buffers are processed in one second,
   which is 7.6 million packets and 33.02 Gbps.
6. Compute Acceleration API
   Multiple compute accelerators such as GPUs, coprocessors,
   ASICs/FPGAs, and multicore CPUs can be used for NFV. A common API
   for NFV Compute Acceleration (NCA) can abstract the hardware details
   and enable NFV applications to use compute acceleration. The API
   will be a C library that users can compile along with their code.

   The delivery of end-to-end services often requires multiple network
   functions. The compute acceleration API should support the
   definition of an ordered set of network functions and of subsets of
   those network functions that can be processed in parallel.
6.1 Add Network Function
   Multiple network functions can be defined in the system, each
   identified by a network function id. Based on service chain
   requirements, network functions are dynamically loaded onto the
   cores and executed. The API function nca_add_network_function adds a
   new network function to the NCA.
   In OpenCL terminology, a kernel is a function, or set of functions,
   executed on a compute core. OpenCL code files are small files
   containing these kernel functions.
   int nca_add_network_function(
       int network_func_id,
       int (*network_func_init)(int network_func_id, void *init_params),
       char *cl_file_name,
       char *kernel_name,
       int (*set_kernel_arg)(int network_func_id, void *sf_args,
                             char *pkt_buf),
       int result_buffer_size
   )
   network_func_id    : Network function identifier, unique for every
                        network function in the framework
   network_func_init  : Initializes the network function; device
                        memory allocations and service-specific data
                        structures are created
   cl_file_name       : File with the network function kernel code
   kernel_name        : Network function kernel entry function name
   set_kernel_arg     : Function that sets up kernel arguments before
                        the kernel is called
   result_buffer_size : Result buffer size for this network function
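
   For example, registering the Aho-Corasick URL filter of Section 3
   might look like the following; the id 101, the file name, and the
   helper functions are illustrative.

      int ret = nca_add_network_function(
              101,                /* network_func_id of the filter   */
              ac_init,            /* builds and loads the DFA        */
              "ac_match.cl",      /* OpenCL file with kernel code    */
              "ac_match",         /* kernel entry function name      */
              ac_set_kernel_arg,  /* binds packet buf, DFA pointers  */
              sizeof(int) * 16384 /* one int result per batched pkt  */
      );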
6.2 Add Traffic Stream
   Traffic streams are identified by a stream id. A traffic stream is
   initialized with the number of buffers and the size of each buffer
   allocated for the stream. Each buffer is identified by a buffer id
   and can hold N packets. The buffers are treated as ring buffers;
   they are allocated as contiguous memory by the NCA and a pointer is
   returned.

   Any notification during buffer processing is delivered through the
   callback function with the stream_id, buffer_id, and event.

   A traffic stream is associated with a service function chain, which
   is defined by three parameters. The number of network functions is
   given in num_network_funcs. The actual network function ids are in
   the service_func_chain array. Network functions are divided into
   subsets, each with a subset number. Network functions within a
   subset can be executed in parallel; subsets must be executed in
   sequence. The special subset number 0 can be executed independently
   of any network functions in the chain.
   For example, a chain of 6 network functions can be represented as:

   num_network_funcs = 6;
   service_func_chain = {101, 105, 107, 108, 109, 110};
   network_func_parallel_sets = {1, 1, 1, 2, 2, 0};

   In this example, subset 1 (101, 105, 107) is executed first, and all
   three functions in it may run in parallel. After subset 1, subset 2
   (108, 109) is executed. Subset 0 has no dependencies; the scheduler
   can execute it at any time.
   typedef struct dev_params_s {
       int dev_type;
       int num_devices;
   } nca_dev;

   int nca_traffic_stream_init (
       int num_buffers,
       int buffer_size,
       int (*notify_callback)(int stream_id, int buffer_id, int event),
       int num_network_funcs,
       int service_func_chain[CAF_MAX_SF],
       int network_func_parallel_sets[CAF_MAX_SF],
       nca_dev dev_params
   )
   num_buffers                : Number of buffers
   buffer_size                : Size of each buffer
   notify_callback            : Event notification callback
   num_network_funcs          : Number of network functions in the
                                service chain
   service_func_chain         : Network function ids in this service
                                chain
   network_func_parallel_sets : Subsets for sequential and parallel
                                ordering of the network functions
   dev_params                 : Lets the user choose the device on
                                which this traffic stream is processed
   Return Value               : stream_id, a unique identifier for the
                                traffic stream
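
   Continuing the example above, a stream for that chain could be
   created as follows; the callback and the device-type constant are
   illustrative.

      int chain[CAF_MAX_SF] = {101, 105, 107, 108, 109, 110};
      int sets[CAF_MAX_SF]  = {1, 1, 1, 2, 2, 0};
      nca_dev dev = { NCA_DEV_GPU, 2 };  /* hypothetical dev_type */

      int stream_id = nca_traffic_stream_init(
              6,               /* num_buffers                   */
              4 * 1024 * 1024, /* buffer_size in bytes          */
              my_notify_cb,    /* called on buffer events (6.5) */
              6,               /* num_network_funcs             */
              chain, sets, dev);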
6.3 Add Packets to Buffer
   Packets are added to a buffer either directly by the client
   application or by calling nca_add_packets. One or more packets can
   be added per call.
   int nca_add_packets(
       int stream_id,
       int buffer_id,
       char *packets,
       int packet_len[],
       int num_packets
   )

   stream_id    : Stream id of the traffic stream
   buffer_id    : Identifies the buffer to add the packets to
   packets      : Packet contents
   packet_len[] : Length of each packet
   num_packets  : Number of packets
6.4 Process Packets
   Once packets have been placed in a buffer, nca_buffer_ready is
   called to process the buffer. This function can also be called
   before the buffer is completely filled. The NCA scheduler marks the
   buffer for processing.
   int nca_buffer_ready(
       int stream_id,
       int buffer_id
   )

   stream_id : Stream id identifying the traffic stream
   buffer_id : Identifies the buffer to be processed
6.5 Event Notification
   The NCA notifies events about a buffer using the registered callback
   function. After a buffer has been processed by the registered
   network functions, the notify event callback is called and the
   client can read the result buffer.
   int (*notify_callback) (
       int stream_id,
       int buffer_id,
       int event
   )

   stream_id : Stream id identifying the traffic stream
   buffer_id : Identifies the buffer the event refers to
   event     : Maps to one of the buffer events. If the event is not
               specific to a buffer, buffer_id is 0
6.6 Read Results
   The client can read the results after service chain processing.
   Completion of service chain processing is signaled by an event
   through the callback function.

   int nca_read_results(
       int stream_id,
       int buffer_id,
       char *result_buffer
   )

   stream_id     : Stream id identifying the traffic stream
   buffer_id     : Identifies the buffer whose results are read
   result_buffer : Pointer to which the results are copied
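
   Putting the API together, a minimal, illustrative flow for one
   buffer might look like the following; my_notify_cb,
   NCA_EVENT_BUFFER_DONE, and the packet arrays are hypothetical.

      /* Fill a buffer and hand it to the NCA scheduler. */
      nca_add_packets(stream_id, buf_id, pkt_data, pkt_lens, n_pkts);
      nca_buffer_ready(stream_id, buf_id);

      /* Invoked by the NCA once the service chain has processed the
       * buffer (see Section 6.5). */
      int my_notify_cb(int stream_id, int buffer_id, int event)
      {
          if (event == NCA_EVENT_BUFFER_DONE) {
              nca_read_results(stream_id, buffer_id, result_buf);
              /* act on the per-packet match results */
          }
          return 0;
      }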
7. Other Accelerators
   The prototype multi-string search written in OpenCL compiled and
   executed successfully on both an Intel Xeon Phi coprocessor and a
   CPU-only system, with minimal changes to the makefile. For CPU-only
   systems, memory copies can be avoided. Since optimizations for these
   platforms were not carried out, their performance numbers are not
   published.
8. Conclusion
   To get the best performance out of GPUs with a large number of
   cores, the number of threads executed in parallel should be large.
   For a single network function the latency will be in milliseconds,
   so GPUs are well suited to network monitoring functions; if GPUs are
   tasked with multiple network functions in parallel, they can be used
   for other NFV functions as well.

   Assigning a single core to each packet gives the best performance
   when all packet sizes are equal. For variable-length packets,
   performance drops because a core processing a smaller packet sits
   idle until the other cores complete processing the larger packets.

   Code written in OpenCL is easily portable to other platforms, such
   as the Intel Xeon Phi and multicore CPUs, with only makefile
   changes. Though the same code executes correctly on all platforms,
   platform-specific optimizations are needed to achieve good
   performance.

   This draft proposes a network compute acceleration framework that
   contains all hardware-specific optimizations and exposes high-level
   APIs to applications: a set of APIs for defining traffic streams,
   adding network functions, and declaring service chains with an
   ordering method that includes both sequential and parallel
   execution.
9. Future Work
   Dynamic device discovery and optimized code for different algorithms
   and devices will make the NCA a common platform on which to develop
   applications.

   Integration of compute acceleration with I/O acceleration
   technologies like Intel DPDK [9] can provide a complete networking
   platform for applications.

   Further work includes verification and performance measurement of
   the compute acceleration platform running inside virtual machines,
   as well as running it inside Linux containers or Docker.
10. Security Considerations

   Not Applicable

11. IANA Considerations

   Not Applicable
12. References

12.1 Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2 Informative References

   [1]  ETSI NFV White Paper, "Network Functions Virtualisation: An
        Introduction, Benefits, Enablers, Challenges, & Call for
        Action", http://portal.etsi.org/NFV/NFV_White_Paper.pdf

   [2]  Aho, A.V. and Corasick, M.J., "Efficient string matching: An
        aid to bibliographic search", Communications of the ACM,
        vol. 18, no. 6, June 1975.

   [3]  OpenCL Specification,
        https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf

   [4]  OpenCL Best Practices Guide,
        http://www.nvidia.com/content/cudazone/CUDABrowser/
        downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf

   [5]  Nvidia Tesla K10,
        http://www.nvidia.in/content/PDF/kepler/
        Tesla-K10-Board-Spec-BD-06280-001-v07.pdf

   [6]  Intel Xeon Phi,
        http://www.intel.in/content/www/in/en/processors/xeon/
        xeon-phi-detail.html

   [7]  Intel OpenCL, https://software.intel.com/en-us/intel-opencl

   [8]  Implementing Regular Expressions,
        https://swtch.com/~rsc/regexp/

   [9]  Intel DPDK, http://dpdk.org/

   [10] AMD OpenCL Zone,
        http://developer.amd.com/tools-and-sdks/opencl-zone/

   [11] Nvidia OpenCL, https://developer.nvidia.com/opencl
Acknowledgements
   The authors would like to thank the following individuals for their
   support in verifying the prototype on different platforms: Shiva
   Katta and K. Narendra.
Authors' Addresses
Bose Perumal
Dell
Bose_Perumal@Dell.com
Wenjing Chu
Dell
Wenjing_Chu@Dell.com
Ram (Ramki) Krishnan
Dell
Ramki_Krishnan@Dell.com
Hemalathaa S
Dell
Hemalathaa_S@Dell.com