Internet DRAFT - draft-yang-ipsfcc-algorithm
draft-yang-ipsfcc-algorithm
INTERNET-DRAFT Yixian Yang
Expires: April 2006 Ming Cao
Jian Li
Beijing University of
Posts and Telecom.
October 2005
An Algorithm for Intrusion Prevention System
based on Fuzzy Connectedness Clustering
draft-yang-ipsfcc-algorithm-00.txt
Intellectual Property Rights (IPR) statement:
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Status of this Memo
By submitting this Internet-Draft, I certify that any applicable
patent or other IPR claims of which I am aware have been disclosed,
and any of which I become aware will be disclosed, in accordance with
RFC 3668.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Copyright Notice
Copyright (C) The Internet Society (2005). All Rights Reserved.
Abstract
In this document, an algorithm for intrusion prevention systems based
on fuzzy connectedness clustering was presented in order to measure
the similarity. Fuzzy connectedness that was first applied in image
segmentation was extended to clustering algorithm, and used as the
similarity metric between objects to be clustered. By a little expert
knowledge, starting with at least one seed data in each cluster, the
similarity between each data in dataset and each seed data in each
Yixian Yang, et al. Expires April, 2006 [Page 1]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
cluster was measured. According to the highest fuzzy connectedness
of each data, clustering was finished. This algorithm was applied in
intrusion prevention systems as a tool of detecting intrusion,
overcoming some shortcomings of previous clustering algorithm.
Table of Contents
Status of This Memo ......................................... 1
Abstract .................................................... 1
1. Introduction ............................................. 3
2. Glossary ................................................. 4
3. IPS based on Data Mining ................................. 5
3.1 Related Work .......................................... 5
3.2 Clustering Algorithms Applied in IPS .................. 6
4. FCC Algorithm Proposed ................................... 11
4.1 Background ............................................. 11
4.2 Fuzzy Connectedness ................................... 11
4.3 Methodology ........................................... 12
4.3.1 Definition ......................................... 12
4.3.2 Overview of FCC Algorithm .......................... 13
4.3.3 Description of FCC Algorithm ....................... 13
4.3.4 Parameters ......................................... 14
4.4 Application in IPS .................................... 15
4.4.1 Application Flow ................................... 15
4.4.2 Explanation about Detail ........................... 16
4.4.3 System Evaluation and Results....................... 17
5 Conclusions ........................................... 18
6. Acknowledgements.......................................... 19
7. Informative References.................................... 19
8. Authors' Addresses........................................ 19
Full Copyright Statement .................................... 20
Yixian Yang, et al. Expires April, 2006 [Page 2]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
1. Introduction
Human attention has become a precious resource. So, we must find
ways to automatically analyze the data, to automatically classify
it, to automatically summarize it, to automatically discover and
characterize trends in it, and to automatically flag anomalies.
In this document, we introduce a new clustering algorithm, FCC, for
intrusion detection based on the concept of fuzzy connectedness.
This concept was introduced by Rosenfeld in 1979 and used with
success in image segmentation; here we extend this approach to
clustering and demonstrate its effectiveness in intrusion prevention.
Starting with a single or a few seed points in each cluster, all the
data points are dynamically assigned to the cluster that has the
highest fuzzy connectedness value (strongest connection). With an
efficient heuristic algorithm, the time complexity of the clustering
process is O( NlogN), where N is the number of data points. The value
of fuzzy connectedness is calculated using both the Euclidean
distance and the statistical properties of clusters.
This unsupervised learning method allows the discovery of clusters
of any shape. Application of the method in intrusion detection
demonstrates that it can detect not only known intrusion types, but
also their variants. Experimental results on the KDD-99 intrusion
detection data set show the efficiency and accuracy of this method.
Yixian Yang, et al. Expires April, 2006 [Page 3]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
2. Glossary
This document uses terminology that is defined in [DSARCH]. There
is also current work-in-progress on this terminology in the IETF
and some of the definitions provided here are taken from that work.
Some of the terms from these other references are defined again
here in order to provide additional detail, along with some new
terms specific to this document.
IPS devices deployed in carrier, corporate, and
small-to-medium business environments stop upper and
lower layer network-based attacks that other network
security devices are not designed to handle. Network
IPS devices are placed inline normally either in front
of or behind the network firewall. There may be
multiple related deployments depending on your
network's size, complexity, and number of entry points.
Clustering builds models from data without predefined classes.
Data instances are grouped together based on a
similarity scheme defined by the clustering system.
With the help of one or several evaluation techniques,
it is up to us to decide the meaning of the formed
clusters.
Yixian Yang, et al. Expires April, 2006 [Page 4]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
3. IDS based on Data Mining
3.1 Related Work
Data mining based intrusion prevention systems can be roughly
categorized into two major groups: misuse detection and anomaly
detection. In misuse detection, a model is trained with labeled
data to recognize the patterns of ''normal'' visits and
''intrusion'' attempts. Unlike the traditional knowledge base
method, signatures of different types of intrusions are learnt
automatically, and they are much more powerful than manually
defined signatures in recording the subtle characteristics.
Misuse detection has been shown to be very successful in detecting
previously known attacks. However, since the misuse model is highly
dependent on the labeled data used in the training stage, its
capabilities of detecting new intrusion types is limited.
Different from misuse detection, anomaly detection first
establishes a model of normal system behaviors, and anomaly events
are then distinguished based on this model. The implicit assumption
is that any intrusive activity will be anomalous. Anomaly detection
is able to detect newly emerging attacks (if only the assumption
still holds), but it also has some drawbacks. It may fail to detect
some known attacks if the behaviors of them are not significantly
different from what is considered to be normal. Moreover, the false
alarm rate tends to be high when the data of some normal system
behaviors are not involved in the training phase.
With the help of data mining approaches, these difficulties can be
easily overcome. Briefly, the data mining techniques have provided
the following benefits:
+ Improved variants detection. This is especially true for anomaly
detection. Not limited to pre-defined signatures, the concern
with variants is not as much as before, since any deviation from
a unique (normal) signature will be treated as intrusion,
including those previously unknown variants of intrusions.
+ Controlled false alarms. Even though these are false positives,
with a learning process to identify recurring sequences of false
alarms, it is possible for us to filter those normal system
activities and keep the rate of false alarms at an acceptable
level.
+ Reduced false dismissals. With data mining techniques, patterns
(or signatures) of normal activities and abnormal events
(intrusions) can be created automatically. It is also possible
to introduce new types of attacks through an incremental learning
process. As a result, more and more attacks can be detected
correctly. This leads to a reduced number of false dismissals.
Yixian Yang, et al. Expires April, 2006 [Page 5]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
+ Improved efficiency. One very attractive feature of data mining
techniques is the ability to extract most meaningful information
out of large amounts of data. After a step of feature extraction
and/or feature selection, the learning process can be done much
more efficiently.
A wide variety of data mining techniques have been applied to
intrusion prevention, which include decision trees and neural
networks. With pre-labeled (normal or intrusion) network traffic
data, classification-based intrusion detection models can be trained
and proved to be very useful. Actually, however, it is not always
possible to have enough training data to guarantee the effectiveness
of the classifiers. At the same time, temporal changes in network
intrusion patterns and characteristics also tend to invalidate the
usability of trained models. These challenges lead to the emerging
interests of researchers in applying unsupervised or clustering
techniques to intrusion detection.
3.2 Clustering Algorithms Applied in IPS
Clustering is a well known and studied problem. It has been studied
in many fields including statistics, machine learning, databases, and
visualization. Basic methods for clustering include the Linkage
bases and K-means techniques. K-means makes several passes through
the training data and on each pass shifts cluster centers to the mean
of the data points assigned to that cluster. It then re-assigned
data point to the nearest prototype, and continues interacting in
this manner until no significant changes in cluster center positions
occur. The K-means method generally produces a more accurate
clustering than linkage based methods, but it has a greater time
complexity and this becomes an extremely important factor in network
intrusion detection due to very large dataset sizes. Although some
optimizations of K-means for very large datasets exist, they still
do not perform sufficiently fast for datasets with high
dimensionality.
Image that you are given a set of data objects for analysis
where, unlike in classification, the class label of each object
is not known. Clustering is the process of grouping the data
into classes or clusters so that objects within a cluster have
high similarity in comparison to one another, but are very
dissimilar to objects in other clusters. Dissimilarities are
assessed based on the attribute values describing the objects.
Often, distance measures are used. Clustering has its rots in
many areas, including data mining, statistics, biology and
machine learning.
Yixian Yang, et al. Expires April, 2006 [Page 6]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
The process of grouping a set of physical or abstract objects into
classes of similar objects is called clustering. A cluster is
collection of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in other
clusters. A cluster of data objects can be treated collectively
as one group in many applications.
Cluster analysis is an important human activity. Early in childhood,
one learns how to distinguish between cats and dogs, or between
animals and plants, by continuously improving subconscious
clustering schemes. Cluster analysis has been widely used in
numerous applications, including pattern recognition, data
analysis, image processing, and market research. By clustering,
one can identify dense and sparse regions and, therefore, discover
overall distribution patterns and interesting correlations among
data attributes.
"What are some typical applications of clustering?" In business,
clustering can help marketers discover distinct groups in their
customer bases and characterize customer groups based on
purchasing patterns. In biology, it can be used to derive plant
and animal taxonomies, categorize genes with similar functionality,
and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar
land use in an earth observation database, and in the identification
of groups of automobile insurance policy holders with a high average
claim cost, as well as the identification of groups of houses in a
city according to house type, value, and geographical location. It
can also be used to help classify documents on the Web for
information discovery. As a data mining function, cluster analysis
can be used as a stand-alone tool to gain insight into the
distribution of data, to observe the characteristics of each
cluster, and to focus on a particular set of clusters for further
analysis. Alternatively, it may serve as a preprocessing step for
other algorithms, such as characterization and classification,
which would then operate on the detected clusters.
Owing to the huge amounts of data collected in databases, cluster
analysis has recently become a highly active topic in data mining
research.
Yixian Yang, et al. Expires April, 2006 [Page 7]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
In machine learning, clustering is an example of unsupervised
learning. Unlike classicality, clustering and unsupervised learning
do not rely on predefined classes and class-labeled training
examples. For this reason, clustering is a form of learning by
observation, rather than learning by examples. In conceptual
clustering, a group of objects forms a class only if it is
describable by a concept. This differs from conventional clustering,
which measures similarity based on geometric distance. Conceptual
clustering consists of two components: (1) it discovers the
appropriate classes, and (2) it forms descriptions for each class,
as in classification. The guideline of striving for high interclass
similarity and low interclass similarity still applies.
In data mining, efforts have focused on finding methods for efficient
and effective cluster analysis in large databases. Active themes of
research focus on the scalability of clustering methods, the
effectiveness of methods for clustering complex shapes and types of
data, high-dimensional clustering techniques, and methods for
clustering mixed numerical and categorical data in large databases.
Clustering is a challenging field of research where its potential
applications pose their own special requirements. The following
are typical requirements of clustering in data mining:
+ Scalability: Many clustering algorithms work well on small data
sets containing fewer than 200 data objects; however, a large
database may contain millions of objects. Clustering on a sample
of a given large data set may lead to biased results. Highly
scalable clustering algorithms are needed.
+ Ability to deal with different types of attributes: Many
algorithms are designed to cluster interval-based(numerical)
data. However, applications may require clustering other types
of data, such as binary, categorical(nominal), and ordinal data,
or mixtures of these data types.
+ Discovery of clusters with arbitrary shape: Many clustering
algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures
tend to find spherical clusters with similar size and density.
However, a cluster could be of any shape. It is important to
develop algorithms that can detect clusters of arbitrary shape.
+ Mining requirements for domain knowledge input parameters: Many
clustering algorithms require users to input certain parameters
in cluster analysis(such as the number of desired clusters).
The clustering results can be quite sensitive to input parameters.
Parameters are often hard to determine, especially for data sets
containing high-dimensional objects. This not only burdens users,
but also makes the quality of clustering difficult to control.
Yixian Yang, et al. Expires April, 2006 [Page 8]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
+ Ability to deal with noisy data: Most real-world databases contain
outliers or missing, unknown, or erroneous data. Some clustering
algorithms are sensitive to such data and may lead to clusters of
poor quality.
+ Insensitivity to the order of input records: Some clustering
algorithms are sensitive to the order of input data; for example,
the same set of data, when presented with different orderings to
such an algorithm, may generate dramatically different clusters.
It is important to develop algorithms that are insensitive to the
order of input.
+ High dimensionality: A database or a data warehouse can contain
several dimension or attributes. Many clustering algorithms are
good at handling low-dimensional data, involving only two to three
dimensions. Human eyes are good at judging the quality of
clustering for up to three dimensions. It is challenging to
cluster data objects in high-dimensional space, especially
considering that such data can be very sparse and highly skewed.
+ Constraint-based clustering: Real-world applications may need to
perform clustering under various kinds of constraints. Suppose
that your job is to choose the locations for a given number of new
automatic cash-dispensing machines(i.e., ATMs) in a city. To
decide upon this, you may cluster households while considering
constraints such as the city's rivers and highway networks, and
customer requirements per region. A challenging task is to find
groups of data with good clustering behavior that satisfy
specified constraints.
+ Interpretability and usability: Users expect clustering results to
to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied up with specific semantic
interpretations and applications. It is important to study how
an application goal may influence the selection of clustering
methods.
Recently, several unsupervised methods have been proposed. Portnoy
et al. introduced an algorithm to detect both known and new
intrusion types without the need to label the training data. This
method, however, has its limitations. It is based on the assumption
that ''the normal instances constitute an overwhelmingly large
portion (>98%)''. Also, it requires a predefined parameter of
clustering width which is not always easy to find. Y-means,
introduced by Guan et al., overcomes the shortcomings of the
traditional K-means algorithm and can automatically decide the
number of clusters through splitting and merging clusters. However,
like other centroid-based clustering algorithms, Y-means can only
deal with clusters with a spherical shape and one still needs to
define a threshold for the "confident area".
Yixian Yang, et al. Expires April, 2006 [Page 9]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
To deal with these issues, we introduce a new clustering algorithm,
the Fuzzy-Connectedness Clustering (FCC) for intrusion detection
based on the concept of fuzzy connectedness. With little prior
knowledge, the proposed method allows the discovery of clusters
of any shape and can detect not only the known intrusion types,
but also their variants.
Yixian Yang, et al. Expires April, 2006 [Page 10]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
4. FCC Algorithm
4.1 Background
In data mining, clustering is the most important unsupervised
learning process used to find the structures or patterns in a
collection of unlabeled data. During the learning process, the
similarities between pairs of data instances are considered and an
optimized output is found which maximizes the similarities within
the same group and the dissimilarities between different groups.
While there is no absolute criterion to judge the optimality of the
clustering result, the choice of similarity measures can surely lead
to totally different groupings.
In many clustering algorithms, the spatial distance (usually
Euclidean distance) between different instances is used to measure
the similarity. This has proved to be successful in many
applications. It is intuitive that similar objects are closer
together while objects from different groups are far from each other.
This intuition, however, is not always true in more complicated
applications, especially when the shape of data clusters is not
limited to be spherical. Figure 1 shows an example in which the
popular k-means clustering algorithm can not achieve satisfactory
result.
To deal with more complex situations, it should be more suitable to
take both local neighborhood and global information into
consideration when deciding the membership of each data instance.
In this document, we propose a new clustering algorithm which uses
fuzzy connectedness as the similarity metric.
4.2 Fuzzy Connectedness
The concept of fuzzy connectedness was first introduced by
Rosenfeld in 1979. It can be utilized to deal with the spatial
vagueness when the spatial entities of data do not have homogeneous
interiors and sharply boundaries. Fuzzy connectedness has been
successfully applied to image processing. In an image segmentation
framework introduced by Udupa et al., fuzzy affinities are assigned
to the target data object during the classification. The affinity
between a pair of pixels is defined as a weighted function of their
adjacencies in the coordinate space, the intensity space, and the
intensity gradient space.
Yixian Yang, et al. Expires April, 2006 [Page 11]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
In a recent work, Herman and Carvalho generalized the concept of
fuzzy connectedness. For a finite set of data points, the strength
of the link from one point to another is defined as the reciprocal
of the distance between them and it can be automatically defined
based on statistical properties of the links within regions of
points identified as belonging to the same groups. A chain is a
sequence of ordered points and the strength of a chain is its
weakest link. Taking fuzzy connectedness as the similarity metric,
the task of a clustering process is to find the strongest path of
every data point and assign the point to the group that path leads
to.
4.3 Methodology
4.3.1 Definition
Basic definitions that we use in this document are presented below.
A fuzzy set A in a space of points X={x} is a class of events with
a continuum of grades of membership and is characterized by a
membership function 米_A(x) with associates with each point in X a
real number in the interval [0, 1] with the value of 米_A(x) at x
representing the grade of membership of x in A. Formally, a fuzzy
set A with its finite number of supports x1, x2, ..., xn is defined
as a collection of ordered pairs:
A={(米_A(xi),xi), xi, i=1, 2, ..., n}
where the supports of A is the subset of X defined as
S(A)={x, x﹋X and 米A(x)>0}
米i, the grade of membership of xi in A, denotes the degree to which
an event xi may be a membership of A or belong to A. This
characteristic function in fact, can be viewed as a weighting
coefficient which reflects the ambiguity in a set, and as it
approaches unity, the grade of membership of an event in A becomes
higher.
Definition 1: For a positive integer K, a K-FCC of a set D (of
points) is a function which maps each point c﹋D into a
(K+1)-dimensional vector 考_c = {考_c0, 考_c1, 考_c2, ... , 考_cK},
such that 考_ci﹋[0, 1], and for at least one k, 1 ≒ k ≒ K,
考_c0 = 考_ck; for 1 ≒ m ≒ K and m ≧ k, 考_cm = 0 or 考_cm =
考_ck.
Definition 2: A fuzzy affinity on D is a function:耳: D2 ↙ [0, 1].
耳(c, d) is the strength of the link (c,d).
Definition 3: A chain in U﹋D from c_0 to c_N is a sequence
C_(0,N ) = (c_0, c_1, c_2, ... , c_N) in U and the strength of the
chain 耳_U(C_(0, N)) = min(耳_U(c_n-1, c_n)), 1 ≒ n ≒ N.
Yixian Yang, et al. Expires April, 2006 [Page 12]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
4.3.2 Overview of FCC Algorithm
Given the above definitions, a clustering process is to find the
mapping of K每FCC which is deduced from the calculation of fuzzy
affinities between pairs of points. In this process, the definition
of the fuzzy affinity function 耳 is very essential; one can include
the statistical information of each cluster to make the results more
reliable. When only the Euclidean distance between pairs of points
is considered, it is not possible to separate the two clusters
correctly.
In the proposed method, in order to initialize the clustering
process, we need a little prior knowledge. This is in the form of a
few labeled data points in each possible cluster. This requirement,
however, is usually easy to satisfy in intrusion prevention
applications.
4.3.3 Description of FCC Algorithm
Every data collected is regarded as a vector of characteristic,
because FCC algorithm will be applied in IPS. So, we call an
element in the data as a vector here.
Input: Data D, number of clusters K, seed points S(U_k), 1 ≒ k ≒ K;
Output: K-FCC matrix 考={考_ck | c﹋D, 0 ≒ k ≒ K} and label matrix
label(c) for c﹋D.
Step1: for every c﹋D
Step2: for k = 1:K
Step3: 考_ck = 0
Step4: Heap = 耳
Step5: for k = 1:K
Step6: U_k = S(U_k) ﹍ {nearest neighbors of S(U_k)}
Step7: Heap = Heap ﹍ U_k
Step8: for every c ﹋ U_k 考_ck = 1
Step9: Max_link = 1
Step10: while Max_link > 0
Step11: for k = 1:K
Step12: while |U_k | > 0
Step13: d = U_k(1); U_k = U_k - d
Yixian Yang, et al. Expires April, 2006 [Page 13]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
Step14: NN = {nearest neighbors of d}
Step15: C = { c﹋NN | 考_ck < min(Max_link, 考_k(d, c)
& 考_c0 ≒ Max_link}
Step16: while |C| > 0
Step17: c = C(1); C = C - c
Step18: l = min(Max_link, 耳_k(d, c))
Step19: if Max_link = l and 考_ck < Max_link then U_k = U_k ﹍{c}
Step20: if 考_c0 < l then
Step21: if 考_c0 = 0 then Heap = Heap ﹍{c}
Step22: for n = 1:K 考_cn = 0
Step23: if 考_c0 ≒ l then 考_c0 = 考_ck = l
Step24: Heap = Heap - {c|c ﹋ Heap and 考_c0 = Max_link}
Step25: Max_link = Max(考_c0), c ﹋ Heap
Step26: for every c﹋D
Step27: label(c) = k with 考_ck = 考_c0
4.3.4 Parameters
In the FCC algorithm, two parameters are involved: (a) the number
of seed points and (b) the number of neighbors in the neighborhood.
The first parameter is decided using prior knowledge about the
data. The minimum value is 1, which means that each cluster has
at least one instance at the very beginning. Usually, the more
seed points we have, the more information about the cluster will
be given, and this will possibly provide more accurate results.
However, depending on the distributions of different clusters, a
small number of seed points may also give satisfactory results.
Yixian Yang, et al. Expires April, 2006 [Page 14]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
The number of neighbors used to define the neighborhood is also
important. As discussed earlier, when calculating the fuzzy
connectedness of an instance to other instances in a certain
cluster, both the local information about the distance between two
instances and the statistical information about that cluster are
involved. An instance x is fuzzy connected to another instance y
only if y falls in the neighborhood of x. If the number of
neighbors is too small, only a few instances tightly close to x are
considered reachable from x, and this may separate x from the
cluster it really belongs to; when this number is too large, the
importance of local connectivity will be reduced and it is possible
for two instances far from each other to be considered as fuzzy
connected. This may bring two or more clusters close to each other
forming a single cluster, which is also not desirable. So, there
should be a trade off among the effects of varying the two
parameters.
With different configurations of the parameters, while Detection
Rate stays high and False Alarm Rate stays low, there are some
changes in the numbers. When the number of nearest neighbors
increases from 2 to 4, both Detection Rate and False Alarm Rate are
slightly improved. The reason is that there are more choices for an
instance to get connected and this helps it to find the correct
cluster. Consequently, fewer attacks will be mislabeled as normal
and fewer normal instances will be treated as intrusion. However,
there is not always improvement with the increase of the
number of nearest neighbors considered.
4.4 Application in IPS
4.4.1 Application Flow
+--------+ +-----------+ +------------------+
|Database|-------->|Abstraction|-------->|Vector Description|
+--------+ +-----------+ +------------------+
|
|
V
+-----------+ +------------------+
| FCC |<--------|Measure Similarity|
+-----------+ +------------------+
Figure 1: Application Flow in IPS
Yixian Yang, et al. Expires April, 2006 [Page 15]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
4.4.2 Explanation about Detail
The algorithm starts with an initialization process (steps 1-9).
For every data point c﹋D , its fuzzy mapping vector is initialized
with zeros. As shown later in the algorithm, the values in the
mappings vectors are used to decide the memberships of the data
points. All the seed points are put into corresponding clusters
and certain items in their mapping vectors are set to 1. A heap
data structure is used to temporarily contain the unlabelled data
points. It is set to empty at the beginning. Before the main loop
(10-27) of the algorithm, only the seed points S(Uk) and their
nearest neighbors are labeled. Their linkage to their cluster is
considered the strongest and the value is set to one. Starting
with these points, all the clusters try to expand. The strength of
the linkage of each unlabeled point to a cluster is the strength
of a chain (path) that connects this point to one seed point in
that cluster. After all the fuzzy affinities are calculated, we
get the K-FCC for the dataset D. The label of each data point c﹋D
is finally decided by its mapping vector 考 c . It is assigned to
cluster k if c k 考 is the largest element of the vector.
In the FCC algorithm, a priority heap is used for the operations of
insertion into (step 21) and removal (step 24) from the priority
queue. With the definition of nearest neighborhood of an instance,
there are no more than a fixed number of nearest neighbors that
need to be processed in step 15. If there are N instances in D
and n is used to denote the number of nearest neighbors, the
computational complexity of FCC is O(N(logN + nK)).
There are two issues that are very important to this algorithm: one
is related to the definition of neighborhood (NN) of a data
instance, the other is how to calculate the fuzzy affinity between
each pair of instances. For the first issue, the neighborhood is
defined in Euclidean space. A small number of neighbors around the
target point with the shortest Euclidean distances are defined to
be its nearest neighbors. With an indexing scheme such as an R-
tree, it is very easy to find the neighborhood in a short time.
In order to calculate the fuzzy affinity of an instance c and
another instance k d ﹋U (target cluster), besides their spatial
adjacency, we also involve the statistical information about the
cluster Uk. The distribution information of a certain variable
(e.g., Euclidean distance between nearest neighbors) of Uk is
stored and kept up to date. When a new instance comes, its
probability of falling into Uk is calculated using the
distribution information and the Euclidean distance between the
new instance and its nearest neighbor d in Uk. Let E(c, d) be the
Euclidean distance between c and d, and Mk and Sk be the mean and
standard deviation of the Euclidean distance between every pair of
instances already in Uk.
Yixian Yang, et al. Expires April, 2006 [Page 16]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
4.4.3 System Evaluation and Results
To evaluate our system we are interested in two major indicators of
performance: the detection rate and the false positive rate. The
detection rate is defined as the number of intrusion instances
detected by the system divided by the total number of intrusion
instances present in the test set. The false positive rate is
defined as the total number of normal instances that were
(incorrectly) classified as intrusions divided by the total number
of normal instances. These are good indicators of performance, since
they measure what percentage of intrusions the system is able to
detect and how many incorrect classifications it makes in the
process. We calculate these values over the labeled data to measure
performance.
Yixian Yang, et al. Expires April, 2006 [Page 17]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
5 Conclusions
In this document, we introduce a new clustering algorithm, FCC, for
intrusion detection. This algorithm uses the concept of fuzzy
connectedness to calculate the similarities among different data
instances. Starting with a single or a few seed points in each
cluster, all the data points are dynamically assigned to the
cluster that has the highest fuzzy connectedness value (strongest
connection). In calculating the fuzzy connectedness, both local
distance information and statistical properties of clusters are
considered. Comparing with the frequently utilized Euclidean
distance, the new similarity metric is more robust, and there is
no restriction on the shape of clusters that can be discovered.
FCC has also some other advantages over the other previously
proposed clustering algorithms: no predefined ※cluster width§, no
threshold for ※confidence area§, etc. Although the FCC algorithm
requires a parameter K which denotes the number of clusters, most
of the time it is not necessary to know an exact value for it.
In many cases where we just care about whether an instance is an
intrusion and do not need to know about the exact intrusion type,
the parameter K can be fixed as a constant, i.e., K=2. With an
efficient heuristic algorithm, the time complexity of the
clustering process is O(NlogN+nK), where N is the number of data
points and n, K are constants. Moreover, this unsupervised learning
method can detect not only the known intrusion types, but also
their variants. The two other parameters in FCC, the number of seed
points and the number of neighbors, do not seem to affect greatly
the performance of the algorithm. Experimental results on a subset
of KDD-99 dataset showed the stability of efficiency and accuracy
of the FCC algorithm. With different settings, the Detection Rate
stayed always above 94% while the False Alarm Rate was below 4%.
For future work, we need to find out if there is a general accepted
definition of fuzzy affinity. With different distribution variable
involved in the calculation, the values of fuzzy affinity may be
different and it may lead to different clustering results. Even
though the one we employ in equation (1) gives good results on
the KDD-99 data set, it is not necessary that it will be appropriate
for other datasets.
Yixian Yang, et al. Expires April, 2006 [Page 18]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
6. Acknowledgement
The authors wish to thank Jian Li.
7. Informative References
[1] Leonid Portnoy, Eleazar Eskin and Sal Stolfo, ''Intrusion
Detection with Unlabeled Data Using Clustering'',
Proceedings of ACM CSS Workshop on Data Mining Applied
to Security(DMSA-2001),Philadelphia, PA: November 5-8, 2001.
[2] Wenke Lee, ''A Data Mining Framework for Constructing
Features and Models for Intrusion Detection Systems'',
PhD Thesis, Columbia University, 2000.
[3] Pierre Hansen and Nenad Mladenoviu, ''J-Means: A New
Local Search Heuristic for Minimum Sum of Squares
Clustering'', Pattern Recognition , 2001, 34(2): 405-413
[4] Yu Guan, Ail A, Ghorbani and Nabil Belacel, ''Y-Means: A
Clustering Method for Intrusion Detection'', Proceedings
of Canadian Conference on Electrical and Computer
Engineering, Montreal, Quebec, Canada. May 4-7, 2003.
[5] Azriel Rosenfeld, ''Fuzzy Digital Topology'', Information
and Control, 1979, Vol.40, 76-87.
[6] Rebecca Gurley Bace, ''Intrusion Detection'', Macmillan
Technical Publishing, 2000.
[7] Pan JJ, Yang XN, Wang GZ. ''An Image Segmentation and
its Algorithm based on Fuzzy Connetedness'', Journal of
Software, 2005, 16(1): 67-76.
8. Authors' Addresses
Yixian Yang
Information Security Center,
Beijing University of posts and telecom.(BUPT),
Beijing, China,100876
Phone:8610-62283366
Email:yxyang@bupt.edu.cn
Yixian Yang, et al. Expires April, 2006 [Page 19]
INTERNET-DRAFT algorithm for IPSFCC October, 2005
Full Copyright Statement
Copyright (C) The Internet Society (2005).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf-
ipr@ietf.org.
Yixian Yang, et al. Expires April, 2006 [Page 20]