International Journal of Business Data Communications and Networking Volume 13 • Issue 2 • July-December 2017
An Insider Threat Detection Method Based on Business Process Mining
Taiming Zhu, Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China
Yuanbo Guo, Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China
Ankang Ju, Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China
Jun Ma, Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China
Xuan Wang, Department of Electronics Technology, Engineering University of the Armed Police Force, Xi’an, China
ABSTRACT
Current intrusion detection systems mostly target external attacks, but the “Prism Door” and similar events indicate that internal staff may do greater harm to an organization’s information security. Traditional insider threat detection methods consider only the audit records of personal behavior and fail to combine them with business activities, so they may miss insider threats that occur during a business process. The authors consider operators’ behavior together with the correctness and performance of business activities, and propose an insider threat detection system based on business process mining. The system first establishes normal profiles of business activities and operators by mining the business log, and then detects specific anomalies by comparing the content of the real-time log with the corresponding normal profile in order to identify insiders and the threats they pose. The relevant anomalies are defined and the corresponding detection algorithms are presented. The authors performed experiments using the ProM framework and Java programming with five synthetic business cases, and found that the system can effectively identify anomalies of both operators and business activities that may be indicative of potential insider threats.
Keywords
Anomaly Detection, Insider Threat, Process Mining
DOI: 10.4018/ijbdcn.2017070107
Copyright © 2017, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
1. INTRODUCTION
The insider threat is a long-term problem faced by most organizations. It usually results in significant damage, ranging from financial theft and intellectual property theft to the destruction of property and business processes. Compared with attacks from external networks that exploit hardware or software vulnerabilities, insider threats are more harmful and more difficult to detect. The main causes of insider threats are as follows. First, some employees lack security awareness and violate safety regulations by accident. Second, some employees intentionally bypass security measures for their own convenience and efficiency at work. Last but not least, some employees choose to leak the organization’s confidential information or sabotage its systems out of resentment or at the inducement of others. In general, the insider threat is a comprehensive problem, which
consists of human factors and systemic factors. How to detect and prevent the insider threat has become a huge challenge for all organizations.
For organizations, business activities are the main activities carried out during daily operations, and one of the main tasks is to ensure the successful completion of each business process. To improve efficiency, more and more organizations use business systems to accomplish business activities. However, most business systems consider only how to achieve normal business functions during the design phase and ignore the security demands of business activities. This can leave the business system vulnerable to insider threats and to different kinds of anomalies, and in severe cases can lead to the destruction or disclosure of critical business data. Therefore, in this paper, we approach the problem from the perspective of business activity and try to detect insider threats through a comprehensive analysis of operators’ abnormal behavior and of the anomalies that emerge during business process execution.
Business processes are series of activities completed by a group of people in an organization in order to achieve specific goals. The order of the activities is strictly defined, as are the content, modalities and responsibilities of each activity. In addition to the staff, the execution of a business process usually depends on a specific business system and software programs, so it is a complex activity that involves humans, machines, software and other factors. Clearly, inspecting the daily work of an organization from the perspective of business processes and establishing a normal business process model can provide more comprehensive information support for insider threat detection.
Since an actual business activity involves many factors, its process model must also be multidimensional: it must reflect not only the sequence of business events, but also the behavior of operators, the features of business cases, and the time and frequency information of business events. Traditional pre-designed manual modeling methods cannot meet this requirement. Manual modeling usually relies on limited expert knowledge, provides only an idealized view of some of the factors in business activities, and cannot take complex realistic conditions into consideration, so it is often out of touch with reality and of little practical use. To solve this problem, most organizations turn to log-based process mining, which has several advantages. First, the system log is easily available and collecting it has almost no impact on the running system. Second, detailed information about the execution of a business system is recorded in the log, which helps managers understand what happened during the process. Finally, mining the business process from the system log is more objective and efficient. As far as we know, many current process mining methods establish a business process model from the control-flow perspective [18-26]; few of them bring the human factor and the performance factor into process mining to obtain a multi-dimensional representation of business activities and to identify human-oriented vulnerabilities and threats. In this paper, we present a business process mining based insider threat detection system, which includes the operators’ behavior in business process mining.
The system mines normal business process models from the logs generated during the execution of business activities under normal conditions, and then detects anomalous operator behavior, logical anomalies and performance anomalies in real-time logs by comparing them with the corresponding models, helping security managers to find insider threats in time. To test the approach, the system was implemented with the ProM framework and Java programming, and experiments were conducted with synthetic data. The system was found to perform well in detecting insider threats.
The remainder of this paper is organized as follows. Related work is introduced in Section 2. Section 3 briefly introduces the proposed insider threat detection system, and the details of each component are described in Section 4. Section 5 presents the experiments and discusses the results, and Section 6 concludes the paper.
2. RELATED WORK
In this section, we introduce related work in two research areas: insider threat detection and business process mining.
The insider threat has always been an important factor harming the security of organizations and enterprises. As early as 2000, a variety of challenging problems in this research area were presented (Anderson, Bozek, Longstaff, Meitzler, & Skroch, 2000), which drew further attention to insider threats. Many researchers try to model the insider threat and propose targeted solutions based on the model. For example, a generic model named SKRAM was proposed for detecting malicious attacks on information systems (Parker, 1998). This model defines five key elements that constitute an insider threat, i.e. Skill, Knowledge, Resource, Authority and Motivation. As one of the earliest insider threat models, it is a valuable reference for subsequent research. The CMO model uses an integrated approach to simulate malicious operations by insiders and points out that Capability, Motivation and Opportunity are the three necessary conditions for carrying out insider attacks (Wood, 2000). However, this model ignores the unintentional behavior of normal insiders and does not study how to quantify the insider threat. Similar to signature-based attack detection methods, an insider threat prediction and detection model has been presented (Schultz, 2002) that uses indicators to describe the type, features and operations of an insider attack, and also gives a quantitative formula. The shortcoming of this model is that a large number of attack behaviors must first be correctly classified and stored, and unknown attacks cannot be detected. Magklaras and Furnell put forward an insider threat prediction model that calculates the probability of an insider threat using a three-tier structure of mathematical functions (Magklaras & Furnell, 2001).
A user-motivation-based insider threat detection model has also been presented (Mantha, Chinchani, Upadhyaya, & Kwiat, 2000). It converts the purpose of use that the user submits before logging into the system into a list consisting of the operation subject, the operation target, the required actions and the deadline, and then compares the actual operations with the list to detect anomalous behavior. Schneier proposes the attack tree model to describe possible attack paths and analyze the relation between these attacks and system vulnerabilities, and then combines the analysis result with the attack tree to reproduce the attack scenario (Schneier, 1999). Many other researchers have put forward detection, prediction and reasoning models of insider threats. However, these models all ignore the business activities that make up a major part of an organization’s work, so they cannot help managers assess the insider threats that may occur in daily business activities.
In addition to abstract models, researchers have also proposed a variety of specific methods to detect insider threats. For instance, Spitzner tries to decoy insider attacks using a honeypot (Spitzner, 2003). However, as attacks become more subtle and advanced, people have begun to seek more sophisticated methods. A role-based access control (RBAC) model is used to establish user behavior rules and detect insider threats by looking for behavior that violates the rules (Hu, Bradford, & Liu, 2006). Bishop et al. extend the RBAC model by focusing on generalized attributes of people and data, and placing the insider threat in the context of modeling policies (Bishop, Engle, Peisert, Whalen, & Gates, 2009). Moreover, Greitzer and Frincke combine traditional cyber security audit data with psychosocial data so that insider threats can be detected and predicted (Greitzer & Frincke, 2010).
They also propose a framework for integrating and analyzing an organization’s internal data and network security data to predict possible insider threats. Similarly, Brdiczka et al. present an insider threat detection method that combines structural anomaly (SA) detection with personal psychology (PP) (Brdiczka et al., 2012); it identifies insider threats by integrating the SA and PP results and ranking the integrated results. Parveen et al. use stream mining and graph mining to detect insider threats and extend this approach (Parveen, Evans, Thuraisingham, Hamlen, & Khan, 2011). Unlike these approaches, the detection system proposed in this paper not only considers the anomalous behavior of insiders, but also detects anomalies in business activities to help managers evaluate the impact on the business system.
Business process mining is a business process reconstruction technique that grew out of the workflow management area, and many researchers have proposed related mining algorithms. Van der Aalst and de Medeiros point out that a complete business process model should consider three perspectives (Van der Aalst & de Medeiros, 2005), i.e. the control-flow perspective, the organization perspective and the case perspective. The control-flow perspective is concerned with the “How?” question and focuses on the ordering of activities. The organization perspective is concerned with the “Who?” question and focuses on the originator field. The case perspective is concerned with the “What?” question and focuses on the properties of cases. However, most existing process mining methods consider only the control-flow perspective. For example, the α-algorithm builds the control-flow structure by discovering binary relations between activities (Van der Aalst, Weijters, & Maruster, 2000), but cannot capture the dependencies in non-free-choice constructs. The α++-algorithm extends the α-algorithm but cannot deal with noisy logs (Wen, Wang, & Sun, 2006). The heuristic algorithm considers the frequency of the following relationship between activities so that it can handle noise in logs (Weijters & Ribeiro, 2011; Weijters & Van der Aalst, 2003), but it is unable to deal with non-free-choice constructs and duplicate activities. The two-stage mining algorithm first creates binary relation models of the activities in logs and then integrates these models to form the sequence and selection structures (Weijters & Ribeiro, 2004, 2005), but it cannot mine loop structures. The genetic mining algorithm overcomes these limitations by using a global optimization strategy to search for a globally optimal outcome (Van der Aalst, De Medeiros, & Weijters, 2005), which is preferable to a locally optimal one in process modeling.
In this paper, we first use the genetic mining algorithm to discover the business control-flow model, and then mine the business performance model and the operators’ behavior profiles based on the control-flow model and the information in business logs in order to detect possible insider threats in business activities.
3. SYSTEM OVERVIEW
Detecting insider threats requires considering both how a threat manifests and what impact it has, that is, detecting both the anomalous operations of attackers and the anomalies caused by attacks. Therefore, we need to establish the normal model from the perspective of both the execution process of business activities and the operation behavior of each operator, and then judge whether the actual execution deviates from the normal model. Following this idea, we propose the insider threat detection system shown in Figure 1. The system is divided into two modules: a model mining module and an anomaly detection module.
The model mining module first captures the event logs generated during the execution of a business process in normal circumstances. Note that “normal circumstances” means that the processes and results of business execution are in line with expectations and that there is no significant hardware or software fault and no deliberate malicious action by operators. Due to fluctuations in the execution efficiency of hardware or software and differences in operators’ habits, the event logs of a business process may differ from each other, but they all reflect normal execution. Hence, when mining process models, we need to collect enough event logs generated during different executions to obtain a more robust and practical model. Then, we filter the event logs by removing unrelated events according to the target business process and choosing the legal starting and ending events of the process. The output is called the training log.
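The filtering step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Event` record, its field names (in particular a per-execution `case_id`) and the sets of legal start and end tasks are assumptions introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical event record; field names are illustrative assumptions.
Event = namedtuple("Event", ["case_id", "process_id", "task", "timestamp"])

def build_training_log(events, target_process, legal_starts, legal_ends):
    """Group events into per-case sequences and keep only well-formed ones."""
    sequences = {}
    for ev in events:
        # Drop events that do not belong to the target business process.
        if ev.process_id != target_process:
            continue
        sequences.setdefault(ev.case_id, []).append(ev)

    training_log = []
    for case_id in sorted(sequences):
        seq = sorted(sequences[case_id], key=lambda e: e.timestamp)
        # Keep a sequence only if it starts and ends with legal events.
        if seq[0].task in legal_starts and seq[-1].task in legal_ends:
            training_log.append(seq)
    return training_log

events = [
    Event("c1", "order", "receive", 1), Event("c1", "order", "ship", 3),
    Event("c2", "order", "check", 2),   Event("c2", "order", "ship", 4),
    Event("c3", "audit", "receive", 1), Event("c3", "audit", "ship", 2),
]
training_log = build_training_log(events, "order", {"receive"}, {"ship"})
kept_cases = [seq[0].case_id for seq in training_log]
```

In this toy log, case `c3` belongs to another process and case `c2` does not begin with a legal starting event, so only `c1` would enter the training log.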
Next, the system mines a control-flow model that represents the normal logical structure of a business process from the training logs using a mining algorithm. On this basis, we determine the events or sub-processes that have specific time or frequency requirements based on expert knowledge and obtain their performance model by statistical methods. Besides, we also mine a tree-structured behavior profile for each operator using extra information in the training log. This profile represents the operator’s roles, the tasks he must carry out and the detailed operation content.

Figure 1. Business process mining based insider threat detection system

The anomaly detection module is divided into two parts: detecting anomalous operations and detecting anomalies in business activities. We consider the former from both the longitudinal and the horizontal angle. The longitudinal angle means comparing the current operations with the operator’s own behavior profile to detect anomalous deviations, and the horizontal angle means comparing the operator’s current operations with those of peer operators to detect outliers. Similarly, anomalies in business activities are classified into logical anomalies and performance anomalies. By comparing the real-time log with the normal models, we can identify the anomalies that occur during the execution of business activities. The working principle of each module is introduced in detail in the following sections.
4. SYSTEM DESIGN AND IMPLEMENTATION
4.1. Log Preprocessing
An event log generated during the execution of business activities contains many events. Each event consists of the identifier of the business activity it belongs to, the task name, the timestamp, the operator, the operations, the changes of data or files, etc. The events in an event log are ordered by timestamp. Since event logs are the data source of process mining, their completeness and correctness directly affect the accuracy of the mining results, so it is necessary to preprocess the event logs before using them. We first give formal definitions of the relevant concepts and then describe the process of log preprocessing.
Definition 1 - Event: An Event E is a sextuple (Id, TaskName, Type, Operator, Timestamp, ExtraInfo), where:
◦ Id is an identifier of the kind of business activity;
◦ TaskName is the name of the task in this business activity;
◦ Type is the execution state of this task;
◦ Operator is the person who performs this task;
◦ Timestamp is the time the event is recorded;
◦ ExtraInfo is some other relevant information about this task.
Definition 2 - Event Sequence: An Event Sequence ES is a sequence of events recorded in an execution of a business activity, i.e. $ES = (E_1, E_2, \ldots, E_n)$, where $E_1$ and $E_n$ are the starting and ending events of ES respectively, and it satisfies the following condition:
$$\forall E_i, E_j \in ES \; (i < j): \quad E_i.\mathit{Timestamp} \le E_j.\mathit{Timestamp}$$
Definition 3 - Event Log: An Event Log EL is a set of several event sequences, i.e. $EL = (ES_1, ES_2, \ldots, ES_n)$.
The first step of preprocessing is to gain an overall understanding of the log’s statistical information, such as the number of different business activities, the number of event sequences in each business activity, and the occurrence number and ratio of each kind of event in the event sequences. This information facilitates the next step of log filtering and screening. Because we want to mine a complete process model of a specific business activity, we only need the complete and relevant event sequences. To obtain them, we first remove the irrelevant event sequences according to the Id. Second, the legal starting and ending events are determined. Finally, we choose the event sequences that have a correct starting event and ending event and in which every intermediate event is completed as the input of the process mining algorithm. The preprocessed log is called the training log in this paper.
4.2. Business Process Mining
After obtaining the training log, we can use it to mine the normal business process model. However, selective routing, parallel routing and iterative routing may produce different event sequences for the same process model, so a model mined from a single event sequence is just one particular solution among all the possibilities. To avoid this problem and enhance the robustness of the mined model, we need to collect enough different event sequences of the business activity and select an appropriate process mining algorithm according to its merits and drawbacks. Once we have the control-flow model, we can further mine the performance model and the operators’ behavior profiles by using the control-flow model and the information in the training log.
4.2.1. Mining Process Model in Control-Flow Perspective
We can see process mining as a search for the most appropriate process in the search space of candidate process models. To obtain an optimal process, selecting an appropriate algorithm is very important. Usually, either global or local strategies can be used to measure how optimal a model is (Van der Aalst, De Medeiros, & Weijters, 2005). Local strategies build the optimal process model step by step based on local information. Many process mining algorithms are based on a local strategy; a representative example is the α-algorithm (Van der Aalst, Weijters, & Maruster, 2004). Their common drawback is that a local strategy cannot guarantee that a series of locally optimal steps results in a globally optimal process model. Once the necessary information is not locally available, the performance of these algorithms is seriously hampered, since each step has a significant influence on the following steps, and the result may deviate greatly from the “real” optimal model. In contrast, global strategies are primarily based on
a one-strike search for the optimal model. The genetic algorithm is a typical example. In addition, compared with traditional methods, the genetic algorithm can handle more problems, such as non-free-choice constructs, invisible tasks and duplicate tasks. Therefore, we use the genetic algorithm for process mining. The main steps are shown in Figure 2.
The genetic algorithm is roughly divided into four phases: initialization, selection, propagation and termination. In the initialization phase, the initial population is created, and each generation of the population may have hundreds of thousands of individuals. Here, an individual refers to a process model. The algorithm creates the initial population by randomly combining the events that appear in the training log, which generates a large number of individuals. Since the population is large, it may contain some individuals that are broadly consistent with the correct model. Next, the algorithm uses the fitness function to calculate the fitness of each individual; the fitness function is a comprehensive measure of the completeness and accuracy of an individual. The individuals with the highest fitness are put directly into the next generation, which is called “elitism”, and the ones with the lowest fitness are discarded as “dead” individuals. The rest are selected as parents for creating the next generation. In the propagation phase, the algorithm creates new individuals through crossover and mutation of the parents. Crossover recombines each pair of parents to produce a pool of sub-models that share the genetic material of their parents. Then, the algorithm uses mutation to modify the sub-models, for example by randomly adding or deleting a causal dependency. This ensures that new genetic material is inserted into the next generation, which facilitates its continued evolution.
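The four phases can be made concrete with a deliberately small Python sketch. It is only a toy under stated assumptions, not the genetic miner used in the paper: an individual is reduced to a set of causal dependencies `(a, b)`, the `fitness` function and its 0.5 penalty factor are invented for illustration, and the population size, elite count and stopping rule are arbitrary choices.

```python
import random

def fitness(model, traces):
    """Toy fitness: share of observed direct successions the model allows,
    minus a penalty for dependencies that never occur in the log."""
    observed = [(a, b) for t in traces for a, b in zip(t, t[1:])]
    covered = sum(1 for pair in observed if pair in model) / len(observed)
    extra = len(model - set(observed)) / max(len(model), 1)
    return covered - 0.5 * extra

def crossover(p1, p2, rng):
    # Each dependency of either parent is inherited with probability 1/2.
    return frozenset(d for d in p1 | p2 if rng.random() < 0.5)

def mutate(model, all_pairs, rng):
    # Randomly add or delete one causal dependency.
    return model ^ {rng.choice(all_pairs)}

def genetic_mine(traces, pop_size=30, elite=3, generations=50, seed=0):
    rng = random.Random(seed)
    tasks = sorted({task for t in traces for task in t})
    all_pairs = [(a, b) for a in tasks for b in tasks if a != b]
    # Initialization: random combinations of dependencies between logged tasks.
    population = [frozenset(p for p in all_pairs if rng.random() < 0.3)
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, traces), reverse=True)
        history.append(fitness(ranked[0], traces))
        if history[-1] == 1.0:                    # termination condition
            break
        dead = pop_size // 3                      # least fit individuals are dropped
        parents = ranked[elite:pop_size - dead]   # the rest become parents
        children = list(ranked[:elite])           # elitism: best survive unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, rng), all_pairs, rng))
        population = children
    best = max(population, key=lambda m: fitness(m, traces))
    return best, history

traces = [["A", "B", "C", "D"], ["A", "C", "B", "D"]]
best, history = genetic_mine(traces)
```

Because the elite individuals are copied unchanged, the best fitness recorded in `history` never decreases from one generation to the next, which is the property elitism is meant to provide.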
The above processes are repeated until certain conditions are met by a best individual, and then the algorithm stops.

Figure 2. Main steps of genetic algorithm

After getting the control-flow model, we need to determine the events or sub-processes that have specific time or frequency requirements based on expert knowledge. Then we can obtain the statistical information about these events or sub-processes through relevant fields in the training log,
such as the time interval between events, the timestamp and the occurrence number of specific events. Because these indicators vary slightly from one normal event sequence to another, we create a multiset for each indicator and calculate its average value; these averages make up the performance model.
4.2.2. Mining Normal Profile of Operator
As stated above, most current process mining methods are concerned only with the control-flow perspective and usually ignore the operators, who are the dominant factor in the execution of business activities. To detect insider threats due to human factors, we further extend the mined model on the basis of the control-flow model. By combining the logical structure with the extra information in the training log, we can confirm the operator’s role and tasks, as well as the details of the objects operated on and their changes, and thus establish the normal profile of the operator for detecting deviation anomalies and outliers.
Through statistical analysis of the training log and observation of the control-flow model, we can easily determine who is involved in the business activity, what his tasks are, the order of these tasks, and their time interval and frequency. The role of an operator can be read from the training log, or it can be reasonably inferred if the corresponding information is not contained in the training log. For example, we can assume that operators who perform the same task set belong to the same role. More details can be obtained from the ExtraInfo field, such as the devices, software, files or data that are related to the event, the operations on these objects and the changes to them. After obtaining these data, we use a multi-way tree to represent the normal behavior profile of an operator; the outline is shown in Figure 3. For simplicity, we only give some of the branches. The root of the tree is “Id-Operator”, which indicates the business case and the operator’s name.
The hierarchical structure of the tree is as follows:
• Role level, which contains all the roles of this operator;
• Task level, which lists all the tasks that may be performed based on a specific role;
• Device level, which lists all the devices that are operated in the corresponding task;
• Data level, which lists all the software, files and data that are involved in the corresponding device;
• Operational level, which lists the operations and their frequency.
Figure 3. An example of tree-structured profile of operator behaviors
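One possible in-memory form of the five-level hierarchy is a nested dictionary, sketched below in Python. The input format (tuples of role, task, device, data, operation), the sample values and the helper names are assumptions made for illustration; the paper does not prescribe a data structure.

```python
def build_profile(records):
    """records: iterable of (role, task, device, data, operation) tuples
    extracted from one operator's events in the training log."""
    tree = {}
    for role, task, device, data, op in records:
        node = tree
        # Descend (and create on demand) the role, task, device and data levels.
        for key in (role, task, device, data):
            node = node.setdefault(key, {})
        # Operational level: occurrence count of the operation.
        node[op] = node.get(op, 0) + 1
    return tree

def operations_on_path(tree, role, task, device, data):
    """Follow one root-to-leaf path and return the operation counts found there."""
    return tree[role][task][device][data]

records = [
    ("clerk", "enter order", "PC-1", "orders.db", "write"),
    ("clerk", "enter order", "PC-1", "orders.db", "write"),
    ("clerk", "enter order", "PC-1", "orders.db", "read"),
]
# The root key plays the part of the "Id-Operator" root node.
profile = {("order-process", "alice"): build_profile(records)}
ops = operations_on_path(profile[("order-process", "alice")],
                         "clerk", "enter order", "PC-1", "orders.db")
```

The counts returned for one path are the raw material from which operation frequencies can later be derived.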
Thus, along a path from the root down to a leaf, we can easily obtain the set of operations that the operator should carry out when performing a certain task, as well as the normal frequency range of these operations. To facilitate the subsequent calculation, we use a behavior vector to represent these operations and their frequencies.
Definition 4 - Behavior Vector: A Behavior Vector BV is a tuple that indicates the operations of an operator on a device, and their frequencies, in the execution of a certain task, i.e. $BV(Operator, Task, Device) = (f(op_1), f(op_2), \ldots, f(op_n))$, where $f(op_i)$ is the frequency of operation $op_i$, which satisfies the following equation:
$$f(op_i) = \frac{\#(op_i)}{\sum_{j=1}^{n} \#(op_j)}$$
where $\#(op_i)$ is the occurrence number of $op_i$. In particular, BVN represents the behavior vector in the normal profile, and BVE represents the behavior vector in the actual execution process.
4.3. Anomaly Detection
The insider threat is a comprehensive issue that involves human factors and system factors, so detection needs to consider both aspects simultaneously. The anomalous behavior of insiders is the cause and should be the focus of detection. The loss and damage at the system level are the result, which provides necessary information and helps assess the severity of the threat. In the following, we introduce the anomaly detection method from these two aspects. We list the types of anomalies considered in this paper and then give the pseudo code of the anomaly detection algorithms.
4.3.1. Operator’s Anomalous Behavior Detection
In Section 4.2.2, we discovered the operator’s normal behavior profile, which helps to detect anomalous behavior by comparing the current behavior with the profile. However, when the organization’s business changes legitimately, the task sets and operations of an operator may also change. If we rely only on the longitudinal comparison, we are likely to get many false positives. Since the behavior of operators who belong to the same role is highly similar, we can also compare their actual behavior horizontally to find abnormal outliers and thereby detect potentially malicious operators. The formal definitions of the operator’s behavior anomalies and the pseudo code of the detection algorithms are given below.
Definition 5 - Individual Anomalous Behavior: Let σ be the threshold. Individual Anomalous Behavior is the situation in which BVE contains operations that do not belong to BVN, or the difference between the frequency of the same operation in BVE and in BVN is greater than σ.
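Definitions 4 and 5 can be illustrated with a short Python sketch. It assumes that a behavior vector is stored as a dictionary from operation name to frequency; the function names, the alarm labels as tuples and the sample counts are hypothetical.

```python
def behavior_vector(op_counts):
    """Definition 4: relative frequency of each operation."""
    total = sum(op_counts.values())
    return {op: n / total for op, n in op_counts.items()}

def individual_anomalies(bv_normal, bv_exec, sigma):
    """Definition 5: alarms raised for one (operator, task, device) triple."""
    alarms = []
    # Operations that are not part of the normal profile.
    unknown = set(bv_exec) - set(bv_normal)
    if unknown:
        alarms.append(("Operation-Content", sorted(unknown)))
    # Shared operations whose frequency deviates by more than sigma.
    drifted = sorted(op for op in set(bv_exec) & set(bv_normal)
                     if abs(bv_exec[op] - bv_normal[op]) > sigma)
    if drifted:
        alarms.append(("Operation-Frequency", drifted))
    return alarms

bvn = behavior_vector({"read": 6, "write": 4})               # normal profile
bve = behavior_vector({"read": 2, "write": 2, "delete": 1})  # current execution
alarms = individual_anomalies(bvn, bve, sigma=0.15)
```

In this example the unseen `delete` operation would raise a content alarm, and the drop of `read` from 0.6 to 0.4 exceeds the assumed threshold of 0.15 and would raise a frequency alarm.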
Definition 6 - Abnormal Outlier: Let γ be the threshold and AvDis be the average distance from an operator’s BVE to all the BVs of the generated cluster. If AvDis is greater than γ, the operator is an Abnormal Outlier.
The pseudo code of the individual behavior anomaly detection algorithm is presented in Algorithm 1, according to Definition 5.

Algorithm 1. Individual-Behavior_AD (Normal Profile, Execution Log)
for each task T and device D
    for each BVE in Execution Log
        find the corresponding BVN in Normal Profile
        if {op | op ∈ BVE} ≠ {op | op ∈ BVN}
            trigger an Operation-Content alarm
        end if
        if ∃ opi such that |BVE·f(opi) − BVN·f(opi)| > σ
            trigger an Operation-Frequency alarm
        end if
    end for
end for

When detecting abnormal outliers, we also inspect both the content and the frequency of the operations in the behavior vector, and use a distance-based clustering method to find abnormal outliers. Because operators may perform different operations, the behavior vectors that belong to the same task and the same device can have different dimensions, which is not allowed when calculating the distance between each vector pair. To deal with
it, we take the union of the elements in the above vectors as the operation set of each operator. For each operator, we set the frequency of the operations that did not previously belong to his own vector (called “different operations”) to zero and leave the frequency of the other operations unchanged. We also allow a weight to be associated with each operation so that operations of greater importance can be emphasized, as dictated by an analyst. For example, an analyst who wants to emphasize the different operations would assign a higher weight to them. If no weights are specified, each weight is set to 1/n, where n is the total number of operations. Then, the Euclidean distance between each pair of behavior vectors is computed, and all the behavior vectors are clustered based on this distance using an agglomerative hierarchical clustering algorithm. Finally, an operator is identified as an abnormal outlier if the distance between his vector and the corresponding cluster is greater than the threshold γ. The pseudo code of detecting abnormal outliers is presented in Algorithm 2.
4.3.2. Control-Flow Anomaly Detection
From the control-flow perspective, we focus on the logical anomaly and the performance anomaly.
Definition 7 - Logical Anomaly: A Logical Anomaly is the situation in which a business process cannot follow the normal process structure, causing abnormal termination or incorrect results. In other words, the event sequence of a business process with a logical anomaly cannot be properly parsed by the normal control-flow model.
Conformance checking is usually used to detect logical anomalies in a business process. The criteria fall into two categories. The first is global: it treats any event sequence whose order is inconsistent with the normal model as an anomalous sequence. The other is local: it only checks whether the event sequence contains an event that is not in the normal model.
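The difference between the two criteria can be seen in a small Python sketch. As a simplifying assumption, the normal control-flow model is reduced here to a set of known tasks, allowed start and end tasks, and an allowed directly-follows relation; the model actually mined in ProM is richer, and replaying a trace against it is correspondingly more involved.

```python
def violates_order(trace, model):
    """Global criterion: the trace must start, proceed and end as the model allows."""
    if not trace or trace[0] not in model["start"] or trace[-1] not in model["end"]:
        return True
    return any((a, b) not in model["follows"] for a, b in zip(trace, trace[1:]))

def has_unknown_event(trace, model):
    """Local criterion: the trace contains a task the model does not know."""
    return any(task not in model["tasks"] for task in trace)

# Hypothetical normal model in which B and C may run in either order.
model = {
    "tasks": {"A", "B", "C", "D"},
    "start": {"A"},
    "end": {"D"},
    "follows": {("A", "B"), ("A", "C"), ("B", "C"), ("C", "B"),
                ("B", "D"), ("C", "D")},
}

normal_trace = ["A", "B", "C", "D"]
skipped_trace = ["A", "D"]            # known tasks, but steps are skipped
foreign_trace = ["A", "B", "X", "D"]  # contains a task outside the model
```

The trace that skips steps uses only known tasks, so the local check would pass it while the global check would flag it.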
This paper adopts the first criterion because it is stricter and helps collect more clues about insider threats. The pseudo code for detecting logical anomalies is presented in Algorithm 3.
Definition 8 - Performance Anomaly: A Performance Anomaly refers to the situation in which the indicators of some specific events in the current event sequence fall outside the normal range given by the performance model. In detail, the Performance Anomalies include:
• Timestamp Anomaly: The timestamp of a specific event is beyond the scope of threshold τ1;
Algorithm 2. Outlier_AD(Execution Log)
for each role R
    for each task T and device D
        set op-set to ∅
        for each operator O belonging to R
            op-set = op-set ∪ O·BVE·op_i
        end for
        unify the op of each BV according to op-set
        for each operator O belonging to R
            for each op which is newly added into O·BVE
                set f(op_i) = 0
            end for
            for each op in O·BVE
                set the weight of op
            end for
        end for
        cluster the BVEs according to the Euclidean distance between each BV-pair
        for each cluster of BVEs
            if someone's AvDis is bigger than γ
                trigger an Abnormal Outlier alarm
            end if
        end for
    end for
end for
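The steps of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation. It assumes that weights enter the distance as sqrt(Σ w_i (x_i − y_i)²), that clustering uses average linkage and stops merging once the closest clusters are more than γ apart, and that AvDis is an operator's average distance to the members of the largest cluster. The paper does not fix these details, and all function and operator names are hypothetical.

```python
import math
from itertools import combinations


def unify(vectors):
    """Zero-fill every operator's vector over the union of all operations."""
    op_set = sorted(set().union(*vectors.values()))
    points = {name: [vec.get(op, 0) for op in op_set] for name, vec in vectors.items()}
    return op_set, points


def distance(x, y, weights):
    """Weighted Euclidean distance (the weighting scheme is an assumption)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def agglomerate(points, weights, cut):
    """Average-linkage agglomerative clustering, stopped at distance `cut`."""
    clusters = [[name] for name in points]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(distance(points[a], points[b], weights) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > cut:
            break
        clusters[i].extend(clusters.pop(j))  # i < j, so index i is unaffected
    return clusters


def find_outliers(vectors, gamma, weights=None):
    """Return (clusters, outliers) for the operators of one role.

    `weights`, if given, must follow the alphabetical order of the operations.
    """
    op_set, points = unify(vectors)
    if weights is None:
        weights = [1 / len(op_set)] * len(op_set)  # default weight 1/n
    clusters = agglomerate(points, weights, cut=gamma)
    main = max(clusters, key=len)
    outliers = []
    for name in points:
        others = [m for m in main if m != name]
        if not others:
            continue
        av_dis = sum(distance(points[name], points[m], weights) for m in others) / len(others)
        if av_dis > gamma:
            outliers.append(name)
    return clusters, outliers


# Hypothetical role with three operators; op_c performs many extra copy operations.
vectors = {
    "op_a": {"read": 10, "write": 4},
    "op_b": {"read": 11, "write": 5},
    "op_c": {"read": 10, "write": 4, "copy": 30},
}
clusters, outliers = find_outliers(vectors, gamma=5.0)
```

With these made-up vectors, `op_a` and `op_b` merge into one cluster and `op_c` remains alone and is reported as the outlier.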
Algorithm 3. Logic_AD(Normal Model, Execution Log)
for each ES in Execution Log
    if ES cannot be parsed by Normal Model
        trigger a Control-Flow-Logic alarm and log the fault event in ES
    end if
end for
• Time Interval Anomaly: The time interval of two specific events is beyond the scope of threshold τ2;
• Frequency Anomaly: The occurrence number of a specific event is beyond the scope of threshold µ.
where τ1, τ2 and µ are the average values of the corresponding indicators in the performance model. The pseudo code for detecting performance anomalies is presented in Algorithm 4. It can easily be shown that all the algorithms presented above run in polynomial time.
5. EXPERIMENTAL EVALUATION
We conducted experiments to assess the performance of the proposed insider threat detection system. The experiment is divided into three steps. First, we use a tool to generate a synthetic dataset as the original data, which represents the normal business execution logs. Second, the process mining
Algorithm 4. Performance_AD (Time-Frequency Constraint Table, Execution Log)
for each ES in Execution Log
    if event E has timestamp-constraint
        if E·Timestamp ∉ τ1
            trigger a Timestamp-Anomaly alarm and log E
        end if
    end if
    if event E1 and E2 have time-interval constraint
        if |E1·Timestamp − E2·Timestamp| > τ2
            trigger a Time-Interval-Anomaly alarm and log E1 and E2
        end if
    end if
    let S be the set of events which have frequency constraint
    if #(S) ∉ µ
        trigger a Frequency-Anomaly alarm and log S
    end if
end for
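The three checks of Algorithm 4 can be illustrated with the following Python sketch. It is a sketch under stated assumptions, not the authors' Java code: timestamps are taken as seconds since the start of the case, τ1 and µ are modeled as closed ranges and τ2 as a maximum gap, and intervals are measured between the first occurrences of the two events. The layout of the constraint table and all names are hypothetical.

```python
def check_performance(trace, timestamp_windows, max_intervals, frequency_ranges):
    """Check one event sequence against time and frequency constraints.

    trace: list of (event_name, timestamp) pairs, timestamps in seconds
    since the start of the case (an assumed representation).
    """
    alarms = []
    times = {}
    for name, ts in trace:
        times.setdefault(name, []).append(ts)

    # Timestamp anomaly: an event occurs outside its allowed window (tau_1).
    for name, (lo, hi) in timestamp_windows.items():
        for ts in times.get(name, []):
            if not lo <= ts <= hi:
                alarms.append(("Timestamp-Anomaly", name))

    # Time-interval anomaly: two events are further apart than tau_2.
    # Only the first occurrence of each event is compared (an assumption).
    for (first, second), tau2 in max_intervals.items():
        if first in times and second in times:
            if abs(times[second][0] - times[first][0]) > tau2:
                alarms.append(("Time-Interval-Anomaly", first, second))

    # Frequency anomaly: an event's occurrence count is outside its range (mu).
    for name, (lo, hi) in frequency_ranges.items():
        if not lo <= len(times.get(name, [])) <= hi:
            alarms.append(("Frequency-Anomaly", name))
    return alarms


# Hypothetical event sequence and constraint table.
trace = [("login", 0), ("query", 40), ("copy", 60), ("copy", 65),
         ("copy", 70), ("logout", 900)]
alarms = check_performance(
    trace,
    timestamp_windows={"logout": (0, 600)},
    max_intervals={("copy", "logout"): 300},
    frequency_ranges={"copy": (0, 2)},
)
```

In this made-up sequence the late `logout`, the long gap between `copy` and `logout`, and the third `copy` each raise one alarm.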
software ProM is used to preprocess the original data and mine the control-flow model (Van der Aalst et al., 2009). Meanwhile, we use Java programs to mine the performance model and the operators’ behavior profiles on the basis of the control-flow model and the training logs. Finally, some anomalies are injected into the dataset and used to evaluate the proposed detection methods. Because the models are established in advance and the algorithms are written in Java, the detection phase can also be ported to a big-data processing platform when very large volumes of log data streams must be analyzed in real time.
We use synthetic data for two reasons. First, the real internal audit logs of an organization are usually highly confidential and private, which makes them difficult for researchers to collect and use. Second, synthetic data can be flexibly modified in scale and time scope according to the needs of the research, which makes it more convenient to use. In the following, we detail the experimental procedure.
5.1. Generating Dataset
To generate the experimental dataset, we used the Process Log Generator (PLG) to simulate the execution of five business cases (Burattin & Sperduti, 2010). By setting the relevant parameters, PLG can generate a random business process model that approximates a real one and simulate its execution. For a more accurate result, we simulate twenty executions of each business case, and the event sequences of each business case are recorded in a log file in the MXML format. This format is supported by the ProMimport plugin and can easily be imported into the mining tool ProM.
After that, some anomalies are injected into the dataset to simulate the insider threats the system suffers. Each injected anomaly corresponds to a pre-designed insider threat scenario, which describes the attacker’s role, operating information and the affected activities.
For example, an attack scenario might describe an operator in a given role who uses far more copy and paste commands than usual while carrying out his tasks, leading to a significant increase in execution time. Once a scenario is created, an operator in the correct role is selected and the attack data are injected into the log files. Due to the random nature of the data generation process, very little is known about the behavior of the attacker before the anomalies are injected. To make the simulated attack scenarios as realistic as possible, the anomalies are directly related to some operators and the tasks they
perform, and the operator’s abnormal behavior is associated with the control-flow anomalies and the performance anomalies. The details of the original data and the injected anomalies are shown in Table 1, where OCA denotes Operation-Content Anomaly, OFA denotes Operation-Frequency Anomaly, LA denotes Logical Anomaly, TA denotes Timestamp Anomaly, TIA denotes Time-Interval Anomaly and FA denotes Frequency Anomaly.
5.2. Model Implementation
Model mining begins with log preprocessing. First, the log file of a business case, which contains the event sequences generated during several executions, is imported by the ProMimport plugin of the ProM framework. Then, the simple log filtering tools are used to filter the event sequences in the imported log file and retain the complete ones. These tools help define the expected starting and ending events and thus remove incomplete sequences. Next, the genetic algorithm plugin is applied to the preprocessed log file to discover a proper control-flow model of the business case. The output is in the form of a C-net. Finally, a Java program is written to discover the performance model and the operators’ behavior profiles. The relevant parameters are collected and calculated during the process.
To reduce experimental error and obtain a more stable and reliable model, the original dataset is partitioned into training sets and test sets. The training sets are used to discover the appropriate model and the test sets are used to evaluate how well the model fits. In this paper, we use 10-fold cross-validation: the original dataset is partitioned into ten equal folds, and in each mining procedure a different fold is held out for validation while the remaining nine folds are used for training. The value of each predefined metric of a business case’s model is the average of the ten results obtained from the ten mining procedures.
5.3. Results
Table 2 shows the results from the detection system.
Table 1. Details of five business process cases and injected anomalies
| No. | Average Number of Events | # Roles | # Operators | Injected Anomalies |
|---|---|---|---|---|
| 1 | 574 | 3 | 6 | 5 OCA, 2 LA |
| 2 | 1227 | 4 | 9 | 9 OCA, 3 OFA, 3 LA, 3 TIA |
| 3 | 2072 | 4 | 12 | 13 OCA, 5 OFA, 5 LA, 4 TA |
| 4 | 3356 | 5 | 14 | 16 OCA, 7 OFA, 2 FA |
| 5 | 5184 | 6 | 18 | 21 OCA, 12 OFA, 6 LA |

Table 2. Anomaly detection results
| No. | # Alerts of Anomalies |
|---|---|
| 1 | 5 OCA, 1 LA |
| 2 | 9 OCA, 4 OFA, 3 LA, 2 TIA |
| 3 | 14 OCA, 4 OFA, 4 LA, 4 TA |
| 4 | 16 OCA, 8 OFA, 3 FA |
| 5 | 22 OCA, 12 OFA, 7 LA |

To evaluate the detection results, we use the F1 score as the evaluation indicator, which is commonly used in the fields of information retrieval and classification. The F1 score satisfies the following equations:
F1 = 2 × P × R / (P + R)
P = TP / (TP + FP)
R = TP / (TP + FN)
where P represents the precision, R the recall, TP the number of true positive results, FP the number of false positive results and FN the number of false negative results. The values of P, R and the F1 score for each scenario are shown in Table 3. Scenario 1 has the best precision (100%) and scenario 4 has the best recall (100%). Because the data size and the number of injected anomalies are limited, a slight deviation in FP or FN has a large impact on precision and recall, so we judge the overall effect of the method by the F1 score, which is the harmonic mean of precision and recall. Table 1 and Table 3 show that, regardless of the number of business events, the F1 score stays above 90%, which indicates that the proposed methods perform well in detecting the specific anomalies.
In the test of detecting abnormal outliers, we choose the fifth business case as test data and use agglomerative hierarchical clustering to cluster the behavior vectors of operators that belong to the same role in order to discover abnormal outliers. The results are shown in Table 4. The first, third and fifth roles each have one outlier, which is consistent with our injected anomalies. Because the experiments are conducted in a static environment with no legitimate dynamic change of business activities, an operator with an individual anomaly may also be an abnormal outlier, which makes the number of abnormal outliers in Table 4 equal to the number of injected anomalies in Table 2. In practical cases, if business activities are legitimately adjusted at the role level, the normal behavior vectors of the operators in the same role will change in a similar way, so the abnormal outliers may be fewer than the operators with individual anomalies in that role.
Table 3. Values of P, R and F1 score
| No. | P | R | F1 Score |
|---|---|---|---|
| 1 | 100% | 85.7% | 92.3% |
| 2 | 94.4% | 94.4% | 94.4% |
| 3 | 96.2% | 92.6% | 94.4% |
| 4 | 92.6% | 100% | 96.2% |
| 5 | 92.7% | 97.4% | 95.0% |
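As a worked check of the F1 definition, the short Python sketch below recomputes P, R and F1 for two cases. The TP, FP and FN counts are not listed in the paper; they are inferred here from the injected and alert totals in Tables 1 and 2 together with the 100% precision of case 1 and the 100% recall of case 4, so they should be read as assumptions. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true positive, false positive and false negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1


# Case 1: 7 injected anomalies and 6 alerts; P = 100% implies all 6 alerts are correct.
case1 = precision_recall_f1(tp=6, fp=0, fn=1)
# Case 4: 25 injected anomalies and 27 alerts; R = 100% implies none was missed.
case4 = precision_recall_f1(tp=25, fp=2, fn=0)

for label, (p, r, f1) in (("case 1", case1), ("case 4", case4)):
    print(f"{label}: P={p:.1%} R={r:.1%} F1={f1:.1%}")
```

Under these inferred counts the computed values agree with the rows for cases 1 and 4 in Table 3.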
Table 4. Results of outlier detection
| Role | # Operator | # Cluster | # Outlier |
|---|---|---|---|
| 1 | 3 | 2 | 1 |
| 2 | 4 | 1 | 0 |
| 3 | 5 | 2 | 1 |
| 4 | 3 | 1 | 0 |
| 5 | 3 | 2 | 1 |
REFERENCES
Althebyan, Q., & Panda, B. (2007). A knowledge-base model for insider threat prediction. Proceedings of the Information Assurance and Security Workshop IAW’07 (pp. 239-246). IEEE. doi:10.1109/IAW.2007.381939
Anderson, R. H., Bozek, T., Longstaff, T., Meitzler, W., & Skroch, M. (2000). Research on mitigating the insider threat to information systems-#2 (No. RAND-CF-163-DARPA). Rand National Defense Research Inst., Santa Monica, CA.
Bishop, M., Engle, S., Peisert, S., Whalen, S., & Gates, C. (2009). We have met the enemy and he is us. Proceedings of the 2008 workshop on New security paradigms (pp. 1-12). ACM.
Brdiczka, O., Liu, J., Price, B., Shen, J., Patil, A., Chow, R., ... Ducheneaut, N. (2012). Proactive insider threat detection through graph learning and psychological context. Proceedings of the 2012 IEEE Symposium on Security and Privacy Workshops (SPW) (pp. 142-149). IEEE. doi:10.1109/SPW.2012.29
Burattin, A., & Sperduti, A. (2010). PLG: A framework for the generation of business process models and their execution logs. Proceedings of the International Conference on Business Process Management (pp. 214-219). Springer Berlin Heidelberg.
Butts, J. W., Mills, R. F., & Baldwin, R. O. (2005). Developing an insider threat model using functional decomposition. Proceedings of the International Workshop on Mathematical Methods, Models, and Architectures for Computer Network Security (pp. 412-417). Springer Berlin Heidelberg. doi:10.1007/11560326_32
Greitzer, F. L., & Frincke, D. A. (2010). Combining traditional cyber security audit data with psychosocial data: towards predictive modeling for insider threat mitigation. In Insider Threats in Cyber Security (pp. 85-113). Springer US. doi:10.1007/978-1-4419-7133-3_5
Hu, N., Bradford, P. G., & Liu, J. (2006). Applying role based access control and genetic algorithms to insider threat detection. Proceedings of the 44th annual Southeast regional conference (pp. 790-791). ACM. doi:10.1145/1185448.1185638
Magklaras, G. B., & Furnell, S. M. (2001). Insider threat prediction tool: Evaluating the probability of IT misuse. Computers & Security, 21(1), 62–73. doi:10.1016/S0167-4048(02)00109-8
Mantha, K., Chinchani, R., Upadhyaya, S., & Kwiat, K. (2000). A Comprehensive Simulation Platform for Intrusion Detection in Distributed Systems. Proceedings of the Summer Computer Simulation Conference (pp. 586-591). Society for Computer Simulation International.
Nithiyanandam, C., Tamilselvan, D., Balaji, S., & Sivaguru, V. (2012). Advanced framework of defense system for prevetion of insider’s malicious behaviors. Proceedings of the 2012 International Conference on Recent Trends In Information Technology (ICRTIT) (pp. 434-438). IEEE. doi:10.1109/ICRTIT.2012.6206788
Parker, D. B. (1998). Fighting computer crime: A new framework for protecting information. John Wiley & Sons, Inc.
Parveen, P., Evans, J., Thuraisingham, B., Hamlen, K. W., & Khan, L. (2011). Insider threat detection using stream mining and graph mining. Proceedings of the 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom) (pp. 1102-1110). IEEE. doi:10.1109/PASSAT/SocialCom.2011.211
Parveen, P., & Thuraisingham, B. (2012). Unsupervised incremental sequence learning for insider threat detection. Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics (ISI) (pp. 141-143). IEEE. doi:10.1109/ISI.2012.6284271
Schneier, B. (1999). Attack trees. Dr. Dobb’s journal, 24(12), 21-29.
Spitzner, L. (2003). Honeypots: Catching the insider threat. Proceedings of the 19th Annual Computer Security Applications Conference (pp. 170-179). IEEE. doi:10.1109/CSAC.2003.1254322
Van der Aalst, W., Weijters, T., & Maruster, L. (2004). Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering, 16(9), 1128–1142. doi:10.1109/TKDE.2004.47
Van der Aalst, W. M., De Medeiros, A. A., & Weijters, A. J. M. M. (2005). Genetic process mining. Proceedings of the International Conference on Application and Theory of Petri Nets (pp. 48-69). Springer Berlin Heidelberg.
Van der Aalst, W. M., & de Medeiros, A. K. A. (2005). Process mining and security: Detecting anomalous process executions and checking process conformance. Electronic Notes in Theoretical Computer Science, 121, 3–21. doi:10.1016/j.entcs.2004.10.013
Van der Aalst, W. M., van Dongen, B. F., Günther, C. W., Rozinat, A., Verbeek, E., & Weijters, T. (2009). ProM: The process mining toolkit. BPM (Demos), 489(31), 2.
Van Dongen, B. F., & Van der Aalst, W. M. (2004). Multi-phase process mining: Building instance graphs. Proceedings of the International Conference on Conceptual Modeling (pp. 362-376). Springer Berlin Heidelberg.
Van Dongen, B. F., & Van der Aalst, W. M. (2005). Multi-phase process mining: Aggregating instance graphs into EPCs and Petri nets. Proceedings of the PNCWB 2005 workshop (pp. 35-58).
Weijters, A. J., & Van der Aalst, W. M. (2003). Rediscovering workflow models from event-based data using little thumb. Integrated Computer-Aided Engineering, 10(2), 151–162.
Weijters, A. J. M. M., & Ribeiro, J. T. S. (2011). Flexible heuristics miner (FHM). Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) (pp. 310-317). IEEE.
Wen, L., Wang, J., & Sun, J. (2006). Detecting implicit dependencies between tasks from event logs. Proceedings of the Asia-Pacific Web Conference (pp. 591-603). Springer Berlin Heidelberg. doi:10.1007/11610113_52
Wood, B. (2000). An insider threat model for adversary simulation. SRI International. Research on Mitigating the Insider Threat to Information Systems, 2, 1–3.
Taiming Zhu is currently a postgraduate student at the Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China. His research interests include information security and big data analysis.
Yuanbo Guo is currently a professor at the Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China. His research interests include cryptology and information security.
Jun Ma is currently a lecturer at the Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China. His research interests include IoT and information security.
Ankang Ju is currently a postgraduate student at the Institute of Cyber Space Security, Information Engineering University, Zhengzhou, China. His research interests include information security and big data analysis.
Xuan Wang is currently an associate professor at the Department of Electronics Technology, Engineering University of the Armed Police Force, Xi’an, China. His research interests include cryptology and information security. He has also published articles on the consequences of ICT on the strategies of firms in terms of innovation and e-commerce.