Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Spatiotemporal Patterns and Predictability of Cyberattacks

  • Yu-Zhong Chen,

    Affiliation School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA

  • Zi-Gang Huang ,

    huangzg@lzu.edu.cn (ZGH); Ying-Cheng.Lai@asu.edu (YCL)

    Affiliations School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA, Institute of Computational Physics and Complex Systems, Lanzhou University, Lanzhou Gansu 730000, China

  • Shouhuai Xu,

    Affiliation Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249, USA

  • Ying-Cheng Lai

    huangzg@lzu.edu.cn (ZGH); Ying-Cheng.Lai@asu.edu (YCL)

    Affiliations School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA, Department of Physics, Arizona State University, Tempe, Arizona 85287, USA

Correction

22 Jun 2015: Chen YZ, Huang ZG, Xu S, Lai YC (2015) Correction: Spatiotemporal Patterns and Predictability of Cyberattacks. PLOS ONE 10(6): e0131501. https://doi.org/10.1371/journal.pone.0131501 View correction

Abstract

A relatively unexplored issue in cybersecurity science and engineering is whether there exist intrinsic patterns of cyberattacks. Conventional wisdom favors absence of such patterns due to the overwhelming complexity of the modern cyberspace. Surprisingly, through a detailed analysis of an extensive data set that records the time-dependent frequencies of attacks over a relatively wide range of consecutive IP addresses, we successfully uncover intrinsic spatiotemporal patterns underlying cyberattacks, where the term “spatio” refers to the IP address space. In particular, we focus on analyzing macroscopic properties of the attack traffic flows and identify two main patterns with distinct spatiotemporal characteristics: deterministic and stochastic. Strikingly, there are very few sets of major attackers committing almost all the attacks, since their attack “fingerprints” and target selection scheme can be unequivocally identified according to the very limited number of unique spatiotemporal characteristics, each of which only exists on a consecutive IP region and differs significantly from the others. We utilize a number of quantitative measures, including the flux-fluctuation law, the Markov state transition probability matrix, and predictability measures, to characterize the attack patterns in a comprehensive manner. A general finding is that the attack patterns possess high degrees of predictability, potentially paving the way to anticipating and, consequently, mitigating or even preventing large-scale cyberattacks using macroscopic approaches.

Introduction

Highly networked communication and information infrastructures built via various state-of-the-art technologies play crucial roles in modern economic, social, military, and political activities. However, such sophisticated infrastructures are facing more and more severe security challenges on the global scale [14]. Earlier theoretical works focused on understanding the complex topologies of the Internet [5] and on the likelihood of large scale failures caused by node removal in complex networks [610]. Recent years have witnessed tremendous efforts devoted to mitigating and coping with increasing cybersecurity threats. For example, attack graphs were invented to analyze the overall network vulnerability and to generate a global view of network security against attacks [1114]. By deploying network sensors at particular points in the Internet, monitoring systems were built to detect cyberthreats and statistically analyze the time, sources, and the types of attacks [15], and various visualization methods were developed to better understand the result of the detection and analysis [1619]. Quite recently, a genetic epidemiology approach to cybersecurity was proposed to understand the factors that determine the likelihood that individual computers are compromised [20], and the general concept of cybersecurity dynamics was introduced [21].

Attack traffic analysis were mainly done in the field of Intrusion Detection System (IDS), the cyberspace’s equivalent to the burglar alarm. IDS has become one of the fundamental technologies for network security [22]. There are three approaches to building an IDS: (1) signature or misuse detection, (2) anomaly detection, and (3) hybrid or compound detection. In particular, signature detection technique is based on a predefined set of known attack signatures obtained from security experts. The system observes the activities of subjects and alarms if their behaviors match the malicious ones in the attack signature set. Both host-based [23, 24] and network-based [25, 26] detection systems were developed. Anomaly detection technique is based on machine learning methodologies, such as system call based sequence analysis [2729], Bayesian networks [3032], principal component analysis [3335], and Markov models [36, 37]. The IDS monitoring capability can be improved by taking a hybrid approach that combines both signature and anomaly detection strategies [38, 39]. All these methods are often based on data packet payload inspection and thus are difficult to perform for high speed networks. Another limitation of these approaches is the assumption that either the attacks are well defined (i.e., signatures) or the normal behaviors are well defined (so are the abnormal behaviors). Recently, there has been a growing interest in flow-based intrusion detection technologies, by which communication patterns within the network are analyzed, instead of the contents of individual packets [40, 41]. Interestingly, a quite recent study analyzing the data obtained from the host IDSs reveals strong associations between the network services running on the host and the specific types of threats to which it is susceptible [20]. Making use of the plan recognition method in artificial intelligence, one can predict the attack plan from the IDS alert information [42]. Utilizing virtual or physical networks to test these IDS techniques can be costly and time consuming, hence, as an alternative, simulation modeling approaches were developed to represent computer networks and IDS to efficiently simulate cyberattack scenarios [4345]. As botnets have become a major threat in cyberspace, cyberattack traffic patterns have also been used to understand botnet’s Command-and-Control strategies [4648].

In this paper, we uncover the existence of intrinsic spatiotemporal patterns underlying cyberattacks and address the important question of whether certain such attacks may be predicted or anticipated in advance. The overwhelming complexity of the modern cyberspace would suggest complete randomness in the distribution of cyberattacks and, as a result, the intuitive expectation is that attackers’ behaviors are random and attacks are unpredictable. However, our discovery of the spatiotemporal patterns and quantitative characterization of the predictability of these patterns suggest the otherwise. In particular, distinct from previous works on cyberattack analysis, our efforts concentrate on analyzing the macroscopic properties of the attack traffic flows using a data set of cyberattacks available to us. Especially, the data set recorded attacks on 491 consecutive victim IP addresses (sensors) in 18 days. The IP addresses can thus be regarded, approximately, as a variable in space. An attack is regarded as an event occurring in both space and time, and we speak of events in spatiotemporal dimensions. This is much more comprehensive than the analysis of the individual time series obtained from sampled IP addresses or the time series obtained by treating the IP addresses as a whole [4957]. Our results reveal, for the first time, that robust macroscopic patterns exist in the seemingly random cyberspace: majority of the attacks are governed by a few very limited number of patterns, indicating that cyberattacks are mainly committed by a few types of major attackers, each with unique spatiotemporal characteristics. More specifically, the patterns can be divided into two types: deterministic and stochastic. The emergence of deterministic patterns implies predictability, which can potentially be exploited to anticipate certain types of attacks to achieve greater cybersecurity. We characterize the predictability of attack frequency time series based on information entropy [58]. Our results suggest a surprisingly high degree of predictability, especially for the IPs under deterministic attack. Effective algorithms can then be developed to predict the future attack frequencies. We also develop methods to evaluate the inference probability between the attack frequency time series based on series similarity, which may allow us to plant much fewer attack probes into the Internet while still achieving effective monitoring. The stochastic patterns can be quantified using the flux-fluctuation law in statistical and nonlinear physics [5968]. Our findings outline a global picture of how cyberattacks are initiated and distributed into the Internet. This will be of potential value to the development of defense strategies against cyberattacks on a global scale.

Results

Existence of IP address blocks sharing similar attack patterns and common backgrounds

We define attack frequency w(t) as the number of attacks received by a victim IP per time unit Δt. Fig 1 shows the time series w(t) from all the victim IP addresses for Δt = 1000 seconds. Surprisingly, instead of overwhelming randomness, we observe substantial regularity: the IP-space can be unequivocally divided into distinct colored blocks, where the amplitudes of the time series within each block are approximately of the same order of magnitude, but the amplitudes from different blocks vary considerably. Note that, within each block, the time series w(t) are approximately synchronized, which correspond to a particular attack pattern. For different attack patterns, there exists a common background in the different IP sectors. For example, about half of the IP addresses (from 1 to 246) belong to the background of aqua block [w(t) ∼ 101.5], in which yellow, orange, and red patterns all exist, while all other IP addresses possess a dark-blue background [w(t) < 100.5] with a number of vertical light-blue lines going through them.

thumbnail
Fig 1. Time series of attack frequency w(t) for all IP addresses.

(a) Spatiotemporal representation of attack frequencies w(t) of all IP addresses on a logarithmic scale, where the x-axis and y-axis are, respectively, time t and IP address index from 1 to 491 (top to bottom). The IP region 1–246 is denoted as the aqua background, where each of the four IP regions (19–31, 35–47, 50–130, and 131–191) exhibits a particular attack pattern that is overlayed on the background. The IP region 247–363 possesses an attack pattern that is overlayed on the dark-blue background lying under the entire IP-space. (b) Three-dimensional presentation of the attack frequency w(t) on a logarithmic scale. The “walls” of unity height are not visible on the logarithmic scale, but higher “walls” are visible in the IP region 364–491 [corresponding to the vertical light-blue lines on the bottom dark-blue background in (a)], which actually occur in the entire IP-space but are mixed with other patterns. The time resolution is Δt = 1000 seconds.

https://doi.org/10.1371/journal.pone.0124472.g001

To further investigate the properties of each pattern, we concentrate on a small time period, as shown in Fig 2. Firstly, under much higher time resolution (Δt = 10sec), the large number of vertical light-blue lines in Fig 1(a) become curves evenly distributed along the t axis and occupying about half of the IP-space (from 1 to 246). These curves are approximately parallel lines within the same IP region but with different slopes in different IP regions (the sloped lines become visually vertical under low time resolution or large observational time period). It is these curves that form the aqua background throughout the upper half of the IP space in Fig 1. Secondly, all the light-blue curves exist on a dark-blue background with three vertical light-blue lines appearing throughout the entire IP-space, indicating a common background covering every single IP address on which other attack patterns are superimposed. We call the vertical lines “walls,” which represent the level of simultaneous attacks on each victim IP and are visually similar to walls in the three-dimensional representation, as shown in Figs 1(b) and 2(b).

thumbnail
Fig 2. Time series of attack frequency with higher time resolution.

Time range: t = 86200 to 86600. Time resolution: Δt = 10 seconds. (a,b) Two- and three-dimensional representations of the attack frequency w(t) for the entire IP-space on a linear scale, respectively.

https://doi.org/10.1371/journal.pone.0124472.g002

These observations have the following implications. Firstly, IP addresses in the same colored block are attacked through quite similar patterns, and the block as a whole shows unique spatial-temporal features, which can be effectively associated with the “fingerprint” of a particular set of attackers. For an IP block corresponding to a particular attack pattern, since the relatively strong spatiotemporal regularity of the pattern can only be achieved via attacks highly organized in time and space, it is reasonable to speculate that the corresponding attacks are solely committed by a set of intimately correlated attackers, or even one major attacker, rather than by a set of multiple independent attackers. A single major attacker is more likely to play the role, due to the requirement of close cooperation between the attackers that may be unrealistic. Thus, a particular type of attack patterns, or fingerprints, corresponds to a particular major attacker. Globally, according to the limited types of attack patterns, a small number of major attackers are responsible for almost all the attacks. Secondly, the consecutiveness of IP addresses attacked in similar patterns reveals the way how the corresponding attackers select their targets, i.e., targeting each IP address within a consecutive IP sector rather than distantly separated IP addresses. Thirdly, the coexistence of multiple patterns overlaying on some identical background indicates that some IP addresses may be under the cross fire from multiple major attackers.

Deterministic attacks

Counterintuitively, the parallel light-blue lines in Fig 2(a) constituting the aqua background in the upper half IP-space show a substantial component of deterministic attack behaviors. We speculate that each light-blue line is generated by one of the attacker’s devices, such as a zombie computer in a botnet that launches attacks on a certain IP region with a constant sweeping speed and a certain order with similar time intervals. Such an “organized” attack pattern makes the light-blue lines nearly evenly distributed along the time axis, which can be regarded as deterministic attacks. Because of the deterministic rules that the attacks follow, it is possible to predict when and where the next attack is going to take place by identifying the ordered time interval and extrapolating the sweeping speed. We observe that the attacker associated with a light-blue line typically attacks each IP address once approximately within every 3 to 8 seconds. Specifically, the three different sweeping rates in the three sub-regions of IP addresses are about 8 seconds per IP from 51 to 130, 3 seconds per IP from 131 to 191, and 6 seconds per IP from 192 to 246. Similar deterministic attack patterns can also be observed in a relatively narrow IP region close to the top of the IP region (IP 19–31), but in a much larger time scale, indicated as the thicker light-green lines right below the top of Fig 3(a). Fig 3(b) is the enlarged graph with the contrast slopes of the two deterministic attack patterns shown more clearly. The deterministic attacks in the IP range 19–31 typically take about 200 to 1100 seconds per IP before switching to the next, with the attack rate of about once per second.

thumbnail
Fig 3. Deterministic attack pattern on different time scales.

For time resolution Δt = 10 seconds and IP region 1–250, (a) deterministic attack pattern in a relatively large time scale: the sloped light-green lines in the IP region 19–31, where the time range is t = 38000 to 42000, (b) enlarged section between the two vertical pink lines in (a) in the time range from t = 39500 to 40000, which provides a time-scale comparison between two types of deterministic attack patterns.

https://doi.org/10.1371/journal.pone.0124472.g003

The “wall” attacks in the dark-blue background show another type of deterministic attack patterns [Figs 1 and 2], which are instantaneous attacks to each IP that can be observed with time resolution of Δt = 1 second. However, when a higher time resolution is used, e.g., Δt = 10−3 second, the “wall” attacks are found to occur one after another in the order of the IP index. Fig 4 shows two typical types of “wall” attacks. For the first type, attacks are performed exactly once on each IP in the order of IP index but with time delays of the order of 10−2 second. This is the most frequently observed walls with unity height in the data set, while close occurrence of several “wall” attacks induces a higher wall when larger values of Δt are used. The second kind of “wall” attacks shown in Fig 4(b) and 4(c) consists of 5 consecutive attacks on one IP before skipping to the next, strictly in terms of the IP order.

thumbnail
Fig 4. Deterministic attack pattern of “walls”.

For time resolution Δt = 10−3 second, (a) one type of “wall” attacks over the entire IP space, with each IP receiving one attack. The time delays for the attacks to reach the targeted IP addresses exhibit three distinct values, separating the dots in this panel into three lines. (b) A different type of “wall” attack with different delay, and (c) enlarged section in the small red square in (a). Five consecutive attacks on each IP occurred before the next IP is attacked.

https://doi.org/10.1371/journal.pone.0124472.g004

The three different deterministic attack patterns observed in Fig 3, i.e., the thick green lines in IP 19–31, the background in IP 1–246, the “wall” attacks over the whole IP space, all sweep the IP space strictly in order but with time scales differing by orders of magnitude. The multiscale behaviors associated with the deterministic attack patterns are crucial for understanding and predicting attacks in the cyberspace.

Flux-fluctuation relation and stochastic attack

Recently there has been a great deal of attention to the flux-fluctuation relations in complex networked systems with stochastic transportation dynamics [5968], where the flux w(t) of a node is the amount of goods or the number of data packets that it transmits per time unit at time t, and the standard deviation σ of w(t) corresponds to the fluctuation. The following flux-fluctuation relation holds when the traffic system evolves under a given external drive [66]: (1) where ⟨w⟩ is the time average of flux w(t), ⟨W⟩ is the time average of the external drive W(t), and σext denotes the standard deviation of W(t). If the external drive W is approximately a constant, i.e., σext ≈ 0, or if W follows the Poisson distribution (σext = ⟨W⟩), then σw. However, if the external drive has large fluctuations, the relation becomes σ ∼ ⟨w⟩.

Our key idea is that the stochastic component of cyberattacks can be characterized as a flux distribution processes among all the IP addresses. Using the flux-fluctuation relation, we can identify and distinguish the patterns of the external drives. Fig 5(a) shows the flux-fluctuation relation on a double logarithmic scale, where the attack frequency w(t) of a victim IP corresponds to its flux, and the total number of attacks on a certain IP region is regarded as the external drive M(t). We observe that a substantial part of the flux-fluctuation relation follows the scaling σw (with slope 1/2 on the logarithmic scale), while a small portion follows the scaling σ ∼ ⟨w⟩ (with slope 1). These results suggest that cyberattacks share some intrinsic common features with stochastic transportation dynamics. The flux-fluctuation theory can then be used to analyze the stochastic components of cyberattack patterns.

thumbnail
Fig 5. Flux-fluctuation relation.

For time resolution Δt = 10 seconds, ⟨w⟩-σ relations (a) for each IP in the IP space, (b) for IP regions 247–257 and 364–491, and for each IP in the dark-blue background with walls (red circles) and without walls (blue squares), (c) for IP region 258–363 with (red circles) and without (blue squares) walls. The blue dashed lines in (b) and (c) have the slope 1/2.

https://doi.org/10.1371/journal.pone.0124472.g005

From Fig 5(a), we see that the flux-fluctuation relation for the most heavily bombarded IP region (35–47, see Fig 1) displays the unity slope, indicating a non-Poisson type of external drive with strong fluctuations. The IP region under mostly deterministic attacks (IP 192–246, the aqua background with no other overlayed attack patterns) corresponds to only one dot in the flux-fluctuation diagram, establishing the deterministic nature of the attack without any randomness. Fig 5(b) shows the σ-⟨w⟩ relation for the dark-blue background with no overlayed attack patterns, where we obtain the relation σ=w by removing the deterministic attacks. This is mainly due to the sparsity of attacks associated with the dark-blue background. That is, without walls, mostly only one attack was received in the corresponding IP region within each time unit. We thus have W = 1 or 0. The average flux per unit time for one given IP is ⟨w⟩ ≤ 1/N, with N being the size of the IP region. For observation with T time units, the number of time units with W = 1 is denoted by t. We have (2) Substituting this relation into Eq (1), we obtain (3) When deterministic attacks such as those represented by the walls are included, the fluctuation becomes (4) where ⟨w′⟩ = (t + twall)/T, and twall is the number of walls with one attack to each IP. Since twallta, ⟨w′⟩2 cannot be ignored, and so the slope of the σ′-⟨w′⟩ relation on a double logarithmic scale is given by (5) As shown in Fig 5(b), walls tend to reduce the slope slightly. This example illustrates the effect of deterministic attacks on the fluctuation of the attack flux. Similar phenomenon is observed in another IP region (IP 258–363) where a less sparse but larger ⟨w⟩ scale attack pattern takes place, as shown in Fig 5(c).

Spatial concentration of attacks

Our analysis so far has focused on the temporal flux and fluctuation behavior of each IP. It is useful to study the distribution of the attacks in the IP space to distinguish the attack patterns in different IP blocks. The relation between the average attack frequency over the IP addresses in a given region, denoted by w¯IP, and the standard deviation of the attacks distributed among these IP addresses, denoted by σIP, is shown in Fig 6, where the left and right panels correspond to the attacks associated with the dark-blue background and with IP address block 258–363, respectively. In the dark-blue background, attacks are sparse. The extreme case with n attacks homogeneously distributed among N IP addresses leads to w¯IP=n/N, where each of n IP addresses receives w = 1 attack and the remaining IP addresses have w = 0. The standard deviation of the attacks among these IP addresses can simply be written as (6) which is the equation of an ellipse. This equation matches the lower bound of the real data very well, while any degree of inhomogeneity in the attack distribution will lead to a larger value of σIP. In addition, for the time resolution of 10 seconds, the effect of “wall” attacks with unity height is to increase each w¯IP exactly by 1, since a wall introduces one attack to each single IP. However, the change in σIP due to the “wall” attacks is small. Thus we observe two ellipses from real data in both Fig 6(a) and 6(b).

thumbnail
Fig 6. Attack frequency deviation within an IP region versus the average at each time unit for (a) the the dark-blue background and (b) IP region 258–363.

The time resolution is Δt = 10 seconds.

https://doi.org/10.1371/journal.pone.0124472.g006

The major difference between Fig 6(a) and 6(b) is the straight line that appears only in Fig 6(b). This is due to the concentration of multiple attacks on a few IP addresses, leaving the remaining IP addresses free of attacks. The extreme case is where all n attacks focus merely on one IP. Then, the average value of w¯IP is still n/N, but the standard deviation becomes (7) which gives the upper bound fit (the straight line) with the real data, as shown in Fig 6(b). Consequently, it is the concentration of attacks that distinguishes the attack pattern in the IP region (258–363) from the dark-blue background. We thus see that, for observational time resolution Δt < 10 seconds in which the attacks are sparse, the plots of σIP-w¯IP relation are confined in the region bounded by Eqs (6) and (7), and larger values of σIP imply that the attacks are inhomogeneous in the IP space with uneven spatial concentration.

We conclude that the target selection scheme in the IP region 259–491 is highly stochastic due to the lack of regular spatiotemporal pattern. Within each time unit, there are mainly two target selection schemes: (1) all attacks are concentrated only on a single IP; (2) attacks are evenly spread over the whole region, i.e., each attack packet is received by a different IP.

Inference probability of attack patterns

Due to the surprisingly stable similarity of the attack patterns within an IP block, attack time sequences of frequencies on all IP addresses may be inferred from any single sequence. High inference probability of a consecutive IP region indicates that information obtained from one sensor (one IP) may be sufficient to capture the key features of the attack patterns in the whole region. Thus, our discovery of the attack pattern similarity may help to reduce dramatically the number of sensors needed to monitor the attack behavior in the whole cyberspace of interest.

A straightforward measure of inference probability is the correlation coefficient, as the time series belonging to a similar attack pattern are likely to be highly correlated. The correlation coefficient between two time series i and j is given by (8) where wi and wj are the attack frequencies of IP addresses i and j, and σi and σj are the corresponding standard deviations. Fig 7(a) shows the correlation coefficient matrix containing ρi,j for each i-j pair with relatively low time resolution (Δt = 104 seconds). We see that relatively large correlation coefficients appear within the IP regions corresponding to the readily distinguishable colored blocks in Fig 1. To further exploit the use of the correlation coefficients, we consider the IP group generated from one given IP i by calculating the correlation coefficients between the time series from this IP and those from all other IP addresses, ρij, over a certain threshold ρc. The sizes of the groups generated from the IP addresses under a similar attack pattern would be close to each other. From the group size of each IP shown in Fig 7(b), we see that time series associated with IP addresses in different blocks behave differently, despite the large fluctuations. The correlation coefficient, however, may not be a reliable measure to characterize similarity and inference probability of the main attack pattern. Fig 7(c) shows a case where the correlation coefficient works well, but an unsuccessful case is shown in Fig 7(d) where the two abrupt peaks produce very low correlation coefficients among the three time series, in spite of their similarities in regions excluding the peaks. We note that the peaks correspond to extreme events with highly concentrated attacks. When the extreme events on different IP addresses are out of phase, the correlation coefficient fails to reflect the similarity of the main attack pattern and can lead to large fluctuations. Other statistical properties, such as M, the total number of attacks on a particular IP, and ⟨τ⟩, the average time interval between consecutive attacks, can also be used to characterize the attack pattern similarity, as shown in Fig 7(b). We see that the fluctuations of these quantities are much smaller than ρij and they are thus able to better distinguish the IP addresses under different attack patterns. Another disadvantage of the correlation coefficients is that they are sensitive to time resolution Δt. For high resolution (small Δt), the reduction in the attack frequency values may make a time series so sparse that it is dominated by random fluctuations, and this can result in a sharp decrease in the correlation coefficients between such sparse series. For example, for Δt = 10 seconds, ρi,j is close to 0 for almost any i-j pairs.

thumbnail
Fig 7. Correlation coefficients.

(a) Correlation coefficient matrix associated with the IP space, (b) total number of attacks on an IP address, M (blue), the average time interval between consecutive attacks on an IP, ⟨τ⟩ (red), and the cluster size to which each IP belongs (green). The clustering threshold is set to be 0.7 (somewhat arbitrary). (c) Time series of attack frequency for IP 155 (red), 160 (blue), and 165 (green). All three IP addresses belong to the region of the aqua background without any other overlayed attack patterns. (d) Time series of attack frequency for IP 405 (red), 409 (blue), and 420 (green), which belong to the region of the dark-blue background without any other overlayed attack patterns. One abrupt peak occurs at IP 409, and another at IP 420. The time resolution is Δt = 10000 seconds for all four panels.

https://doi.org/10.1371/journal.pone.0124472.g007

To better characterize the attack patterns and overcome the sensitivity on time resolution and frequency value fluctuations, we coarse-grain the time series according to the order of magnitude of their amplitude, i.e., we replace each w(t) by a state number, X(t) = int[log10 w(t)], where int denotes the integer closest to log10 w(t) [for w(t) = 0, we set w(t) = 0.1 so that X(t) = −1]. Based on X(t), we can construct a Markov state transition probability matrix (MSTPM) to capture the key information about the attack pattern on any particular IP. The entry in the matrix at the mth row and nth column denotes the probability that the IP is in state m at t − Δt and it transitions to state n at time t. Since the length of the time series is relatively large, the transition probabilities are robust against random fluctuations. In addition, for IP addresses under similar attack patterns, the similarity in their state transition patterns would hold irrespective of the time resolution. As shown in Fig 8, IP addresses within the same attack pattern region have similar MSTPMs, even for high resolution (e.g., Δt = 10 second). This suggests that MSTPMs, which can be measured via the correlation coefficients where the matrix entries are organized into a vector according to a certain order, can be used to quantify the inference probability, as shown in Fig 9. We observe that the IP addresses attacked under similar patterns have relatively high correlation coefficients, while low correlation coefficients can distinguish the deterministic from stochastic attacks. In general, the attack frequencies of a group of IP addresses under deterministic (or stochastic) attack can be inferred from one of them under attack of the same type.

thumbnail
Fig 8. Markov state transition probability matrix (MSTPMs).

For time resolution Δt = 10 seconds, MSTPMs obtained from two randomly selected IP addresses within the corresponding IP region. See the legend of Fig 6(a) for the eight IP regions under different attack patterns: IP addresses (a1,a2) 10 and 33, (b1,b2) 20 and 30, (c1,c2) 40 and 45, (d1,d2) 60 and 120, (e1,e2) 140 and 190, (f1,f2) 200 and 240, (g1,g2) 260 and 360, and (h1,h2) 420 and 490.

https://doi.org/10.1371/journal.pone.0124472.g008

thumbnail
Fig 9. Correlation coefficient matrix associated with the MSTPMs for (a) Δt = 10 seconds and (b) Δt = 100 seconds.

https://doi.org/10.1371/journal.pone.0124472.g009

Generally, attack frequency time series have high inference probability, which can be measured by the correlation coefficients or the MSTPMs. The latter, however, are more reliable especially for high time resolutions. This finding would enable one to monitor the cyberattack patterns throughout the entire IP space of interest using fewer sensors.

Predictability of cyberattacks

Predicting future cyberattacks is an ultimate goal in investigating cyberattack patterns through data analysis. The degree of predictability can be characterized by the uncertainty associated with the state transitions in the coarse-grained time series of attack frequencies, which can be further quantified by the information entropy. Taking into account the temporal correlations in the state transition process, we define the information entropy of IP i as (9) where Si = {X1, X2, …, XT} denotes the sequence of states that the time series reached for IP i, and P(Si) is the probability that the state sequence Si appears to IP i. The information entropy Ei takes into account both the heterogeneous probability distribution in different states and the temporal correlations among the states. The entropy is thus able to provide a realistic characterization of the attack patterns.

The predictability Π of a state sequence is defined as the success rate that an algorithm can achieve in predicting the sequence’s future states [58]. For a sequence with NS possible states, the predictability measure is subject to the Fano’s inequality: Π ≤ Πmax(E, NS), where the predictability upper bound Πmax(E, NS) is obtained by solving the following equation (10) In the predictability calculation, high time resolutions, e.g., Δt = 10 seconds, can make the series so sparse that the overwhelming majority of the states are reduced to the “ground state” X(t) = −1 [corresponding to w(t) = 0]. In this case, the predictability Πmax assumes artificially high values, as the future states can trivially be predicted to be X(t) = −1, leading to a high success rate. To avoid this artifact, we need to use moderate time resolutions, e.g., Δt = 100 seconds or 1000 seconds. The length of the time series can affect the value of the predictability. Predictable patterns may not be fully contained within short time series and, hence, we evenly divide the whole time series into multiple sections, each of length of h hours. Fig 10(a) and 10(b) show the predictability upper bound Πmax for different section lengths and time resolutions. While longer time series (higher h) tend to be more predictable as expected, all time series have surprisingly high predictability. There are cases where the predictability probability is over 90%, indicating that it is possible to make correct predictions for over 90% of the cases. We observe that the IP regions of different attack patterns in Fig 1 have different Πmax values, and the IP addresses under each attack pattern are approximately equally predictable. Deterministic attacks have higher predictability than stochastic attacks for Δt = 1000 seconds, but the opposite occurs for Δt = 100 seconds, due to the fact that stochastic attacks are much more sparse and, as such, a higher time resolution can introduce overwhelmingly more ground states into the state sequence. Despite the different attack patterns, the average predictability over all IP addresses converges to about 93% with relatively small error bars as the section length is increased, as shown in Fig 10(c). Fig 10(d) shows the average predictability of the deterministic and stochastic attacks. We see that deterministic attacks systematically have higher predictability than stochastic attacks for Δ = 1000 seconds. Again, sparsity contributes to the relatively high predictability (over 85% for long sections) of stochastic attacks.

thumbnail
Fig 10. Predictability calculated from coarse-grained time series.

(a,b) Predictability upper bound Πmax for all 491 IP addresses for various section lengths under time resolutions Δt = 100 seconds and 1000 seconds, respectively, where Πmax values are averaged over all sections. (c) Average Πmax versus section length h for Δt = 100 seconds (red circles) and 1000 seconds (blue squares). (d) Average Πmax of the deterministic (open symbols) and stochastic (closed) attacks versus the section length h for Δt = 100 seconds (red circles) and 1000 seconds (blue squares). In the figure, “D” denotes deterministic and “R” stands for random or stochastic. See text for explanation on why, in the case of Δt = 100, stochastic attacks can have a higher predictability upper bound than deterministic attacks.

https://doi.org/10.1371/journal.pone.0124472.g010

Discussion

How do attackers choose their targets?

From Fig 7(b), we see that the IP addresses under a similar attack pattern have similar M values, which means that different IP addresses in the IP region dominated by one particular attack pattern are approximately equally visited by the attackers. Due to this, we speculate that in most cases, an attacker does not target some specific IP addresses, but a consecutive IP region. In the case of deterministic attacks, the attacker scans a whole block of adjacent IP addresses by using its botnet, and each IP in the targeted block has equal probability to be hit. In the case of stochastic attacks, at each time an attack is launched, IP addresses in the targeted block have similar chances to receive it. It may indicate that the attackers are most likely to view the targeted IP block as a black-box, the internal structure of which may be irrelevant.

Interplay between different attack patterns

Except for the parts in Fig 5 that show slope 1 or 1/2, there are some parts of the data that do not exhibit a plausible scaling, which correspond to the high attack density patterns that are overlayed on the deterministic aqua background. Figs 2(a) and 3(b) show that the light-blue lines change their slopes when other attack pattern are superimposed on them. This reveals an interesting phenomenon: different attack patterns may interfere with each other. If so, it would be of great importance to investigate how a certain type of attack pattern promotes or suppresses the efficiency of another type, and how the attack packets differ from normal data packets when transporting through the Internet. Making use of the suppression effect can lead to a dramatic reduction in the transportation efficiency of the attack packets. Our work indicates that it is potentially possible to mitigate or eliminate cybersecurity threats substantially through a macroscopic approach, a field that requires much further efforts.

Attack pattern inference and prediction

The surprising finding that cyberattack patterns are highly predictable encourages us to develop inference and prediction algorithms. The former can provide us with global insights into cybersecurity based on limited information resources, and the latter would enable us not just to get a better understanding of current cyberattack data but also help us to forecast future cyberthreats.

Possible improvement for future IDS

The analysis on the deterministic attack reveals the possibility for an IDS to be prepared in advance to the coming attack based on the estimation of the relatively constant attack frequency. If we could make the IDSs in a consecutive IP region communicate on attack information, then the sweeping speed can be estimated, which more accurately tells each IDS when shall the next attack take place, since the adjacent victim IPs are swept by the attacker one after another in order within a constant time interval.

About the IP information of the attackers

This paper focuses on the analysis of data obtained via the victim side. In addition, with the IP information of the attackers, if possible, we would be able to further verify our speculation that each sweeping is generated from a single IP address. The spatial distribution of the attackers’ IPs can also be retrieved. For deterministic attack, the sweepings are targeted on consecutive IPs, and it is very likely that a substantial part of the attackers’ IPs are also consecutive or nearly consecutive, since those attacker could be ex-victims of the same type of attack, which have been successfully compromised by a major attacker and are now serving as a part of its botnet. More properties of cyberattack which can not be observed on the victims’ side could also be revealed via such information.

Materials and Methods

The data set we analyzed was collected between 2/9/2011 and 2/25/2011 by a cyber instrument known as honeypot (see http://www.honeynet.org). The data set contains 491 honeypot IP addresses. The honeypot simulates vulnerable computer services corresponding to distinct TCP ports. Since there are no legitimate services associated with these IP addresses, the traffic arriving at these ports is widely deemed as attacks. The raw network traffic arriving at these ports was initially recorded as pcap files. By using some standard pre-processing procedure, we reformulated the raw traffic into flows, which are the commonly accepted representation of cyber attacks. The pre-processing procedure uses two widely used parameters: the flow timeout time of 60 seconds, meaning that a flow expires after 60 seconds of no more packets arriving activities; the flow lifetime of 300 seconds, meaning that a flow expires after 300 seconds. Sec. Supporting Information contains the data set file (c.f. “S1 Data”).

Supporting Information

S1 Data. There are two columns in the file S1 Data.

Each row is a tuple (a, b), which represents that IP address a was attacked at time b at some TCP port. Since our analysis treats an IP address as a whole, we do not distinguish the specific ports. For the purpose of protecting privacy, the honeypot IP addresses (i.e., the first column) were anonymized via a one-to-one mapping that also maintains their consecutiveness.

https://doi.org/10.1371/journal.pone.0124472.s001

(ZIP)

Author Contributions

Conceived and designed the experiments: SX YCL. Performed the experiments: YZC. Analyzed the data: YZC ZGH SX YCL. Wrote the paper: YZC YCL.

References

  1. 1. Pastor-Satorras R, Vespignani A (2007) Evolution and Structure of the Internet: A Statistical Physics Approach. Cambridge University Press.
  2. 2. Choo KKR (2011) The cyber threat landscape: Challenges and future research directions. Computers and Security 30: 719–731.
  3. 3. Brenner J (2011) America the Vulnerable: Inside the New Threat Matrix of Digital Espionage, Crime, and Warfare. Penguin Group US.
  4. 4. Verizon (2012) 2012 Data Breach Investigations Report.
  5. 5. Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. ACM SIGCOMM Comput Commun Rev 29: 251–262.
  6. 6. Pastor-Satorras R, Vázquez A, Vespignani A (2001) Dynamical and correlation properties of the internet. Phys Rev Lett 87: 258701. pmid:11736611
  7. 7. Cohen R, Erez K, Ben-Avraham D, Havlin S (2000) Resilience of the internet to random breakdowns. Phys Rev Lett 85: 4626–4629. pmid:11082612
  8. 8. Cohen R, Erez K, Ben-Avraham D, Havlin S (2001) Breakdown of the internet under intentional attack. Phys Rev Lett 86: 3682–3685. pmid:11328053
  9. 9. Motter AE, Lai YC (2002) Cascade-based attacks on complex networks. Phys Rev E 66: 065102(R).
  10. 10. Zhao L, Park K, Lai YC (2004) Attack vulnerability of scale-free networks due to cascading breakdown. Phys RevE 70: 035101(R).
  11. 11. Sheyner O, Haines J, Jha S, Lippmann R, Wing J (2002) Automated generation and analysis of attack graphs. In: Proceedings. 2002 IEEE Symposium on Security and Privacy. IEEE, pp. 273–284.
  12. 12. Wang L, Islam T, Long T, Singhal A, Jajodia S (Springer Berlin Heidelberg, 2008) An attack graph-based probabilistic security metric. Data and Applications Security XXII vol 5094 of Lecture Notes in Computer Science: 283–296.
  13. 13. Ingols K, Chu M, Lippmann R, Webster S, Boyer S (2009) Modeling modern network attacks and countermeasures using attack graphs. In: Computer Security Applications Conference, 2009. ACSAC’09. Annual. pp. 117–126.
  14. 14. Jajodia S, Noel S, O’Berry B (2005) Topological analysis of network attack vulnerability. Managing Cyber Threats of Massive Computing 5: 247–266.
  15. 15. Snapp SR, Brentano J, Dias GV, Goan TL, Heberlein LT, et al. (1991) Dids (distributed intrusion detection system)-motivation, architecture, and an early prototype. In: Proceedings of the 14th national computer security conference. Citeseer, volume 1, pp. 167–176.
  16. 16. Koike H, Ohno K (2004) Snortview: visualization system of snort logs. In: VizSEC/DMSEC’04 Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security. pp. 143–147.
  17. 17. Lakkaraju K, Yurcik W, Lee AJ (2004) Nvisionip: netflow visualizations of system state for security situational awareness. In: VizSEC/DMSEC’04 Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security. pp. 65–72.
  18. 18. Ball R, Fink GA, North C (2004) Home-centric visualization of network traffic for security administration. In: VizSEC/DMSEC’04 Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security. pp. 55–64.
  19. 19. Koike H, Ohno K, Koizumi K (2005) Visualizing cyber attacks using ip matrix. In: VIZSEC’05 Proceedings of the IEEE Workshops on Visualization for Computer Security. p. 11.
  20. 20. Gil S, Kott A, Barabási AL (2014) A genetic epidemiology approach to cyber-security. Scientific Reports 4: 5659–5665. pmid:25028059
  21. 21. Xu S (2014) Cybersecurity dynamics. In: 2014 Symposium and Bootcamp on the Science of Security (HotSoS’14).
  22. 22. Patcha A, Park JM (2007) An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51: 3448–3470.
  23. 23. Smaha SE (1988) An intrusion detection system. In: IEEE Fourth Aerospace Computer Security Applications Conference, Orlando, FL. IEEE, pp. 37–44.
  24. 24. Ye N, Emran SM, Chen Q, Vilbert S (2002) Multivariate statistical analysis of audit trails for host-based intrusion detection. In: IEEE Transactions on Computers. IEEE, volume 51, pp. 810–820.
  25. 25. Anderson D, Frivold T, Tamaru A, Valdes A (1994) Multivariate statistical analysis of audit trails for host-based intrusion detection. In: Computer Science Laboratory, SRI International, Menlo Park, CA, USA, Technical Report SRI–CSL–95–0.
  26. 26. Staniford S, Hoagland JA, McAlerney JM (2002) Practical automated detection of stealthy portscans. Journal of Computer Security 10: 105–136.
  27. 27. Forrest S, Hofmeyr SA, Somayaji A, Longstaff TA (1996) A sense of self for unix processes. In: IEEE Symposium on Research in Security and Privacy, Oakland, CA, USA. IEEE, pp. 120–128.
  28. 28. Warrender C, Forrest S, Pearlmutter B (1999) Detecting intrusions using system calls: Alternative data models. In: IEEE Symposium on Security and Privacy, Oakland, CA, USA. pp. 133–145.
  29. 29. Eskin E, Stolfo SJ, Lee W (2001) Modeling system calls for intrusion detection with dynamic window sizes. In: DARPA Information Survivability Conference and Exposition II, Anaheim, CA. pp. 165–175.
  30. 30. Valdes A, Skinner K (2000) Adaptive model-based monitoring for cyber attack detection. In: Recent Advances in Intrusion Detection Toulouse, France. pp. 80–92.
  31. 31. Ye N, Xu M, Emran SM (2000) Probabilistic networks with undirected links for anomaly detection. In: IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, West Point, New York. pp. 175–179.
  32. 32. Kruegel C, Mutz D, Robertson W, Valeur F (2003) Bayesian event classification for intrusion detection. In: 19th Annual Computer Security Applications Conference, Las Vegas, NV. IEEE, pp. 14–23.
  33. 33. Shyu ML, Chen SC, Sarinnapakorn K, Chang L (2003) A novel anomaly detection scheme based on principal component classifier. In: IEEE Foundations and New Directions of Data Mining Workshop, Melbourne, Florida, USA. pp. 172–179.
  34. 34. Wang W, Guan X, Zhang X (2004) A novel intrusion detection method based on principle component analysis in computer security. In: International Symposium on Neural Networks, Dalian, China. pp. 657–662.
  35. 35. Wang W, Battiti R (2006) Identifying intrusions in computer networks with principal component analysis. In: The First International Conference on Availability, Reliability and Security, Vienna, Austria. pp. 270–279.
  36. 36. Ye N, Zhang Y, Borror CM (2004) Robustness of the markov-chain model for cyber-attack detection. IEEE Transactions on Reliability 53: 116–123.
  37. 37. Yeung DY, Ding Y (2003) Host-based intrusion detection using dynamic and static behavioral models. Pattern Recognition 36: 229–243.
  38. 38. Tombini E, Debar H, Mé L, Ducassé M (2004) A serial combination of anomaly and misuse idses applied to http traffic. In: 20th Annual Computer Security Applications Conference, Tucson, AZ, USA. IEEE, pp. 428–437.
  39. 39. Zhang J, Zulkernine M (2006) A hybrid network intrusion detection technique using random forests. In: The First International Conference on Availability, Reliability and Security, Vieena University of Technology. IEEE, pp. 262–269.
  40. 40. Kim AS, Kang H, Hang SC, Chung SH, Hang JW (2004) A flow-based method for abnormal network traffic detection. In: Network Operations and Management Symposium. pp. 599–612.
  41. 41. Sperotto A, Schaffrath G, Sadre R, Morariu C, Pras A, et al. (2010) An overview of ip flow-based intrusion detection. Communications Surveys and Tutorials, IEEE 12: 343–356.
  42. 42. Qin X, Lee W (2004) Attack plan recognition and prediction using causal networks. In: Proceedings of the 20th Annual Computer Security Applications Conference. pp. 370–379.
  43. 43. Cheung S, Lindqvist U, Fong MW (2003) Modeling multistep cyber attacks for scenario recognition. In: Proceedings of the DARPA Information Survivability Conference and Exposition. IEEE, pp. 284–292.
  44. 44. Nicol DM, Yan GH, Liu J, Jijenstam M (2003) Simulation of large-scale networks using ssf. In: Pmceedings of the 2003 Winter Simulation Conference. IEEE, pp. 650–657.
  45. 45. Kuhl M, Kistner J, Costantini K, Sudit M (2007) Cyber attack modeling and simulation for network security analysis. In: Proceedings of the 2007 Winter Simulation Conference. IEEE, pp. 1180–1188.
  46. 46. Gu G, Perdisci R, Zhang J, Lee W (2008) Botminer: clustering analysis of network traffic for protocol- and structure-independent botnet detection. In: SS’08 Proceedings of the 17th conference on Security symposium. pp. 139–154.
  47. 47. Gu G, Zhang J, Lee W (2008) Botminer: clustering analysis of network traffic for protocol- and structure-independent botnet detection. In: Proceedings of 16th Annual Network and Distributed System Security Symposium.
  48. 48. Kartaltepe EJ, Morales JA, Xu SH, Sandhu R (2010) Social network-based botnet command-and-control: Emerging threats and countermeasures. In: Proceedings of the 8th international conference on Applied cryptography and network security. pp. 511–528.
  49. 49. Zhan Z, Xu M, Xu S (2013) Characterizing honeypot-captured cyber attacks: Statistical framework and case study. IEEE Transactions on Information Forensics and Security 8: 1775–1789.
  50. 50. Li Z, Goyal A, Chen Y, Paxson V (2011) Towards situational awareness of large-scale botnet probing events. Information Forensics and Security, IEEE Transactions on 6: 175–188.
  51. 51. Almotairi S, Clark A, Mohay G, Zimmermann J (2009) A technique for detecting new attacks in low-interaction honeypot traffic. In: Proc. International Conference on Internet Monitoring and Protection. pp. 7–13.
  52. 52. Almotairi SI, Clark AJ, Mohay GM, Zimmermann J (2008) Characterization of attackers’ activities in honeypot traffic using principal component analysis. In: Proc. IFIP International Conference on Network and Parallel Computing. pp. 147–154.
  53. 53. Clark A, Dacier M, Mohay G, Pouget F, Zimmermann J (2006) Internet attack knowledge discovery via clusters and cliques of attack traces. Journal of Information Assurance and Security 1: 21–32.
  54. 54. Almotairi SI, Clark AJ, Dacier M, Leita C, Mohay GM, et al. (2007) Extracting inter-arrival time based behaviour from honeypot traffic using cliques. In: 5th Australian Digital Forensics Conference. pp. 79–87.
  55. 55. Conti G, Abdullah K (2004) Passive visual fingerprinting of network attack tools. In: Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security. pp. 45–54.
  56. 56. Pang R, Yegneswaran V, Barford P, Paxson V, Peterson L (2004) Characteristics of internet background radiation. In: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC’04). New York, NY, USA: ACM, IMC’04, pp. 27–40.
  57. 57. Wustrow E, Karir M, Bailey M, Jahanian F, Huston G (2010) Internet background radiation revisited. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC’10). pp. 62–74.
  58. 58. Song CM, Qu ZH, Blumm N, Barabàsi A (2010) Limits of predictability in human mobility. Science 327: 1018. pmid:20167789
  59. 59. de Menezes M, Barabási AL (2004) Fluctuations in network dynamics. Phys Rev Lett 92: 028701. pmid:14753972
  60. 60. Eisler Z, Kertész J (2005) Random walks on complex networks with inhomogeneous impact. Phys Rev E 71: 057104.
  61. 61. Duch J, Arenas A (2006) Scaling of fluctuations in traffic on complex networks. Phys Rev Lett 96: 218702. pmid:16803280
  62. 62. Yoon S, Yook SH, Kim Y (2007) Scaling property of flux fluctuations from random walks. Phys Rev E 76: 056104.
  63. 63. Kujawski B, Tadii B, Rodgers GJ (2007) Preferential behaviour and scaling in diffusive dynamics on networks. New J Phys 9: 154–154.
  64. 64. Meloni S, Gómez-Gardees J, Latora V, Moreno Y (2008) Scaling breakdown in flow fluctuations on complex networks. Phys Rev Lett 100: 208701. pmid:18518584
  65. 65. Zhou Z, Huang ZG, Huang L, Lai YC, Yang L, et al. (2013) Universality of flux-fluctuation law in complex dynamical systems. Phys Rev E 87: 012808.
  66. 66. Huang ZG, Dong JQ, Huang L, Lai YC (2014) Universal flux-fluctuation law in small systems. Scientific Reports 4: 6787. pmid:25345973
  67. 67. Kishore V, Santhanam MS, Amritkar RE (2011) Extreme events on complex networks. Phys Rev Lett 106: 188701. pmid:21635132
  68. 68. Chen YZ, Huang ZG, Lai YC (2014) Controlling extreme events on complex networks. Scientific Reports 4: 6121. pmid:25131344