Improving Classification Performance with Supervised Filter Discretization

Shailesh Singh Panwar

Ph.D Research Scholar, Department of Computer Science and Engineering

H.N.B. Garhwal University Srinagar Garhwal, Uttarakhand 246174, India

Dr. Y. P. Raiwani

Associate Professor, Department of Computer Science and Engineering

H.N.B. Garhwal University Srinagar Garhwal, Uttarakhand 246174, India

Abstract:- Naive Bayes and Bayes Net are important classification methods in data mining and have proven to be useful tools for the classification, description and generalization of data. Both algorithms are available as open-source Java implementations in the Weka data mining tool. This paper presents a strategy for improving the accuracy of the Naive Bayes and Bayes Net algorithms through data preprocessing. We applied the supervised discretization filter to the two classification algorithms and compared the results with and without discretization. The experiments showed an improvement in accuracy for both algorithms, most markedly for Naive Bayes.

Keywords:- Naive Bayes, Bayes Net, Weka, KDD Data Set, Preprocessing, Discretization.

INTRODUCTION

Data mining is the process of extracting useful information and knowledge from noisy and inconsistent raw data. It is one part of the knowledge discovery process. Data mining extracts information from large datasets and converts it into an understandable form.

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical class labels. For example, a classification model can be built to categorize bank loan applications as either safe or risky [1].

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is an algorithm commonly used to build predictive models and to discover valuable information through the classification of large amounts of data. A decision tree is a simple flowchart-like tree structure in which the topmost node is the root node [2]. Every leaf (terminal) node holds a class label, each internal (non-leaf) node denotes a test on an attribute, and every branch represents an outcome of the test.

Bayesian classifiers are statistical classifiers. Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computation involved and, in this sense, is considered “naive” [3].
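The class conditional independence assumption can be written compactly as follows. The notation is an assumption introduced here for illustration: C_i denotes a class, x_1, ..., x_n the attribute values of one instance, and \hat{C} the predicted class.

```latex
P(C_i \mid x_1, \dots, x_n) \;\propto\; P(C_i) \prod_{j=1}^{n} P(x_j \mid C_i),
\qquad
\hat{C} = \arg\max_{C_i} \; P(C_i) \prod_{j=1}^{n} P(x_j \mid C_i)
```

Each factor P(x_j | C_i) is estimated separately from the training data, which is what keeps the computation simple.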

The rest of this paper is organized as follows. Section II gives a brief note on related work. Section III presents and discusses our research methodology, followed by a description of the dataset used in our experiments and the experimental setup. Section IV analyses the results, examines the improvement in the performance of the classification algorithms, and identifies the best technique. Finally, Section V gives conclusions.

RELATED WORKS

To address the problem of improving decision tree classification algorithms for large data sets, several algorithms have been developed for building decision trees from large data sets.

Kohavi and John (1995) [3] searched for parameter settings of the C4.5 decision tree algorithm that would result in optimal performance on a particular data set. The optimization objective was the “optimal performance” of the tree, i.e., the accuracy measured using 10-fold cross-validation. J48, random forest, Naive Bayes and other algorithms [4] have been used for disease diagnosis, as they gave good accuracy and were used to make predictions. The dynamic interface can also use the built models, which suggests that the application worked well in every considered case. The algorithms Naive Bayes, decision tree (J48), sequential minimal optimization (SMO), k-nearest neighbour (IBk) and Multi-Layer Perceptron have been compared using the confusion matrix and classification accuracy [5].

Three different breast cancer databases were used, and classification accuracy was reported on the basis of the 10-fold cross-validation method. A fusion at classification level was performed between these classifiers to obtain the most suitable multi-classifier approach and accuracy for each dataset. Diabetes and cardiac diseases have been predicted at an early stage using decision trees and incremental learning [6].

Liu X.H. (1998) proposed a new optimized decision tree algorithm [7]. Based on ID3, this algorithm considers attribute selection at the levels of the decision tree, and the classification accuracy of the improved algorithm was shown to be better than that of ID3.

Liu Yuxun and Xie Niuniu (2010) proposed a decision tree algorithm based on attribute importance to address a shortcoming of ID3. The improved algorithm uses attribute importance to increase the information gain of attributes that have fewer values, and compares ID3 with the improved ID3 using an example. Experimental analysis of the data shows that the improved ID3 algorithm can produce more reasonable and effective rules [8].

Gaurav and Hitesh (2013) proposed a C4.5 algorithm improved by the use of L'Hôpital's rule, which simplifies the calculation process and improves the efficiency of the decision-making algorithm [9].

METHODOLOGY

Our method is to select the dataset, apply the Naive Bayes and Bayes Net algorithms to it, and calculate the accuracy. The next step is preprocessing: we apply the supervised discretization filter to the dataset, run the Naive Bayes and Bayes Net classification algorithms (machine learning algorithms) again, and evaluate the accuracy. Finally, we compare the two sets of accuracy results to find out which is higher.

Naive Bayes Classifier: – In machine learning, Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all Naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable [10].
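A minimal sketch of a Naive Bayes classifier for nominal features may help make the principle concrete. It is not Weka's implementation: the function names, the toy records and the use of add-one (Laplace) smoothing are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    # distinct values seen for each feature, needed for smoothing
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the class with the highest log-posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior of the class
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing: an assumption of this sketch
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (protocol, connection flag)
rows = [("tcp", "SF"), ("tcp", "SF"), ("udp", "SF"),
        ("tcp", "S0"), ("tcp", "S0"), ("icmp", "S0")]
labels = ["normal", "normal", "normal", "anomaly", "anomaly", "anomaly"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("tcp", "SF")))
print(predict(model, ("icmp", "S0")))
```

Because every feature contributes an independent factor, training reduces to counting, which is why the number of parameters grows only linearly with the number of features. The sketch handles nominal values only, which is one reason numeric attributes are often discretized first.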

Bayes Net: – Bayesian networks, also called belief networks, are probabilistic networks, and Naive Bayes classifiers are a special case of them. The Bayes Net algorithm learns Bayesian networks from data and can implement the probabilistic Naive Bayes classifier [11].

Pre-processing:- Data usually come in mixed formats: nominal, discrete, and/or continuous. Discrete and continuous data, which have an order among their values, are ordinal data types, while nominal values do not possess any order among them. Discrete data are spaced out with intervals in a continuous spectrum of values. We used discretization as the data preprocessing technique.

Discretization of continuous attributes is both a requirement and a means of performance improvement for many machine learning algorithms. The main benefit of discretization is that some classifiers can work only on nominal attributes, not on numeric attributes. Another advantage is that it can increase the classification accuracy of tree-based and rule-based algorithms that depend on nominal data.

Discretization can be grouped into two categories, unsupervised discretization and supervised discretization. Unsupervised discretization is mostly applied to datasets that have no class information. Its main methods are equal-width binning and equal-frequency binning, although more complex ones are based on clustering methods [12]. Supervised discretization techniques, as the name suggests, take the class information into account before making subgroups. Supervised methods are mainly based on the Fayyad-Irani [13] or Kononenko [14] algorithms.
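The two unsupervised binning schemes can be sketched as follows. This is an illustrative sketch rather than Weka's code; the function names and the sample values are assumptions.

```python
def equal_width_edges(values, n_bins):
    """Cut points that split [min, max] into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Cut points chosen so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def assign_bin(value, edges):
    """Index of the interval containing value (a value equal to an edge goes to the upper bin)."""
    return sum(value >= edge for edge in edges)


# Hypothetical attribute values with one outlier
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_edges = equal_width_edges(values, 3)
freq_edges = equal_frequency_edges(values, 3)
print(width_edges, [assign_bin(v, width_edges) for v in values])
print(freq_edges, [assign_bin(v, freq_edges) for v in values])
```

With these sample values, equal-width binning places eight of the nine values in the first bin because the outlier stretches the range, whereas equal-frequency binning puts three values in each bin. Neither scheme looks at the class labels, which is the gap that supervised discretization fills.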

In preprocessing, Weka provides both unsupervised and supervised discretization algorithms. As shown in Fig. 1, the supervised discretization filter is located under the Weka filters, supervised, attribute options. This discretization approach is a supervised, hierarchical splitting method.

Fig. 1. Selecting Discretization from Preprocess Tab

Class information entropy is a measure of purity: it measures the amount of information needed to specify the class to which an instance belongs. The method starts with one large interval containing all known values of a feature and then recursively partitions this interval into smaller subintervals until an optimal number of intervals is reached.
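In the notation commonly used for entropy-based discretization (the symbols are assumptions introduced here, not taken from the text above), let S be a set of instances with k classes C_1, ..., C_k, and let a cut point T on attribute A split S into S_1 and S_2. The class entropy of S and the class information entropy of the partition can then be written as:

```latex
\mathrm{Ent}(S) = -\sum_{i=1}^{k} p(C_i, S)\,\log_2 p(C_i, S),
\qquad
E(A, T; S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)
```

Here p(C_i, S) is the proportion of instances in S that belong to class C_i. A pure interval has entropy 0, so a lower value of E(A, T; S) indicates a better cut point.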

Fayyad and Irani introduced a supervised discretization method called entropy-based discretization. Supervised discretization methods sort the feature values to determine potential cut points such that each resulting interval has a strong majority of one particular class. The cut point for discretization is selected by evaluating a disparity measure (i.e., the class entropies).
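The recursive procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Weka's Discretize filter: the function names and sample data are hypothetical, every midpoint between distinct neighbouring values is treated as a candidate cut, and recursion stops according to the minimum description length (MDL) criterion described by Fayyad and Irani [13].

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) with the lowest class information entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or weighted < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
            best = (cut, weighted)
    return best


def mdl_accepts(values, labels, cut):
    """MDL stopping criterion: accept the cut only if the information gain is large enough."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    ent, ent1, ent2 = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) * ent1 + len(right) * ent2) / n
    delta = math.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    return gain > (math.log2(n - 1) + delta) / n


def discretize(values, labels):
    """Recursively split the value range; returns the accepted cut points in ascending order."""
    found = best_cut(values, labels)
    if found is None or not mdl_accepts(values, labels, found[0]):
        return []
    cut = found[0]
    left = [(v, l) for v, l in zip(values, labels) if v <= cut]
    right = [(v, l) for v, l in zip(values, labels) if v > cut]
    return discretize(*zip(*left)) + [cut] + discretize(*zip(*right))


# Hypothetical attribute whose low values are all "normal" and high values all "anomaly"
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = ["normal"] * 4 + ["anomaly"] * 4
print(discretize(values, labels))
```

For the sample data the sketch returns a single cut at 7.0, separating the two pure intervals; the pure subintervals are not split further because no additional cut yields any gain. When the classes alternate along the attribute, no cut passes the MDL test and the attribute is left as one interval.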

EXPERIMENTS AND RESULTS

To assess the performance of our approach, a series of experiments was performed.

WEKA TOOL- In this paper, we used the WEKA software tool to investigate and analyze the NSL-KDD dataset with machine learning algorithms. Weka is an open-source GUI application whose name stands for Waikato Environment for Knowledge Analysis. It was developed at the University of Waikato in New Zealand for the purpose of identifying information in raw data collected from different domains. It supports many data mining and machine learning tasks, including preprocessing, clustering, classification, regression, feature selection and visualization.

The basic premise of the software is to use a computer application that can be trained to perform machine learning tasks and derive useful information in the form of trends and patterns. It operates on the assumption that the user's data is available as a flat file or relation, which means that each data object is described by a fixed number of attributes that are usually of a specific type, normally alphanumeric or numeric values. The WEKA software allows novice users to discover hidden information in databases and file systems with simple-to-use options and visual interfaces [15].

NSL-KDD Dataset- The NSL-KDD dataset was proposed to solve some of the inherent problems of the KDD’99 dataset. This new version of the KDD dataset still suffers from some problems and may not be a perfect representative of real networks. However, because of the lack of public datasets for network-based IDSs, we believe it can still be applied as an effective benchmark dataset to help researchers compare different intrusion detection methods. In the NSL-KDD dataset there are no duplicate records in the proposed test sets; therefore, the performance of the learners is not biased by methods that have better detection rates on frequent records. The dataset contains a variety of attributes, which are helpful for measuring attacks. The NSL-KDD dataset has 22544 instances in the test dataset (KDD Test) and 125973 instances in the training dataset (KDD Train) [16].

Table I and Table III show the confusion matrices for the training dataset with and without preprocessing, and Table II and Table IV show the confusion matrices for the test dataset with and without preprocessing. The results indicate that preprocessing the dataset increased the accuracy of both machine learning algorithms (Naive Bayes and Bayes Net).

Table I Training Dataset Confusion Matrix (Naive Bayes)

Confusion Matrix for Training Dataset

                  Without Preprocessing       With Preprocessing
                  Classified as               Classified as
Actual class      a           b               a           b
a = normal        21448       1414            22713       149
b = anomaly       2615        17354           1027        18942

Table II Test Dataset Confusion Matrix (Naive Bayes)

Confusion Matrix for Test Dataset

                  Without Preprocessing       With Preprocessing
                  Classified as               Classified as
Actual class      a           b               a           b
a = normal        3139        158             3156        141
b = anomaly       1314        3054            226         4142

Table III Training Dataset Confusion Matrix (BayesNet)

Confusion Matrix for Training Dataset

                  Without Preprocessing       With Preprocessing
                  Classified as               Classified as
Actual class      a           b               a           b
a = normal        22715       147             22717       145
b = anomaly       1007        18962           1005        18964

Table IV Test Dataset Confusion Matrix (BayesNet)

Confusion Matrix for Test Dataset

                  Without Preprocessing       With Preprocessing
                  Classified as               Classified as
Actual class      a           b               a           b
a = normal        3142        155             3157        140
b = anomaly       226         4142            224         4144

Performance Analysis: – The WEKA tool was applied to both the training and test datasets (NSL-KDD dataset) to find the accuracies of the Naive Bayes and Bayes Net algorithms without and with supervised discretization. The accuracies obtained are shown in Table V and Fig. 2. The results indicate that supervised discretization of the attributes improved the performance of both machine learning algorithms. Naive Bayes accuracy improved by approximately 6.84 percentage points on the training dataset and 14.42 percentage points on the test dataset. Bayes Net accuracy improved by approximately 0.01 percentage points on the training dataset and 0.09 percentage points on the test dataset.

Table V Performance Analysis Table

                         Naive Bayes                         BayesNet
Classifier               Training Dataset   Test Dataset     Training Dataset   Test Dataset
Without Preprocessing    90.59%             80.79%           97.30%             95.02%
With Preprocessing       97.43%             95.21%           97.31%             95.11%
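Accuracy is the share of instances on the diagonal of a confusion matrix. As a small illustration, the sketch below recomputes the Naive Bayes test-set accuracies from the counts in Table II; the function and variable names are assumptions made for this example.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts from Table II (Naive Bayes, test dataset).
# Rows are actual classes, columns are predicted classes (a = normal, b = anomaly).
without_preprocessing = [[3139, 158], [1314, 3054]]
with_preprocessing = [[3156, 141], [226, 4142]]

before = accuracy(without_preprocessing)
after = accuracy(with_preprocessing)
print(f"without: {before:.2%}  with: {after:.2%}  "
      f"gain: {(after - before) * 100:.2f} percentage points")
```

The computed values agree with the Naive Bayes test-dataset column of Table V up to rounding in the last digit, and the difference corresponds to the gain of about 14.42 percentage points reported in the performance analysis.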

Fig. 2. Performance Analysis

CONCLUSION

Preprocessing, the initial step of data mining, demonstrated its benefits in the classification accuracy tests. In this paper, a filtered supervised discretization method is used to improve the classification accuracy for datasets with continuous-valued features. In the first stage, the continuous-valued features of the given dataset are discretized. In the second stage, we tested the performance of this approach with the Naive Bayes and Bayes Net classifiers and compared it with the performance of the same classifiers without supervised discretization.

According to Table V, the Bayes Net algorithm has almost the same accuracy with and without preprocessing. With the Naive Bayes algorithm, accuracy increases substantially with preprocessing for both the training and test datasets. So

1. With preprocessing, the Naive Bayes algorithm is more accurate than Bayes Net on both the training and test datasets.

2. The Naive Bayes algorithm is more suitable and accurate with preprocessing than without preprocessing.

The results show that the Naive Bayes classifier with filtered supervised discretization achieves higher prediction accuracy, and they indicate that filtered supervised discretization has a large effect on the performance of this classifier.

REFERENCES

1. Mehmed Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms”, ISBN: 0471228524, John Wiley & Sons, 2003.

2. Sushmita Mitra, Tinku Acharya, “Data Mining: Multimedia, Soft Computing, and Bioinformatics”, John Wiley & Sons, Inc., 2003.

3. Tea Tusar, “Optimizing Accuracy and Size of Decision Trees”, Department of Intelligent Systems, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia, 2007.

4. Robu, R., Hora, C., “Medical data mining with extended WEKA”, Intelligent Engineering Systems (INES), 2012 IEEE 16th International Conference on, pp. 347-350, 13-15 June 2012.

5. Salama, G.I., Abdelhalim, M.B., Zeid, M.A., “Experimental comparison of classifiers for breast cancer diagnosis”, Computer Engineering & Systems (ICCES), Seventh International Conference, pp. 180-185, 27-29 Nov. 2012.

6. Ashwin Kumar U.M., Ananda Kumar K.R., “Predicting Early Detection of Cardiac and Diabetes Symptoms using Data Mining Techniques”, pp. 161-165, 2011.

7. Weiguo Yi, Jing Duan, Mingyu Lu, “Optimization of Decision Tree Based on Variable Precision Rough Set”, International Conference on Artificial Intelligence and Computational Intelligence, 2011.

8. Liu Yuxun, Xie Niuniu, “Improved ID3 Algorithm”, 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), 2010.

9. Gaurav L. Agrawal, Prof. Hitesh Gupta, “Optimization of C4.5 Decision Tree Algorithm for Data Mining Application”, International Journal of Emerging Technology and Advanced Engineering, Volume 3, Issue 3, March 2013.

10. http://cis.poly.edu/~mleung/FRE7851/f07/naiveBayesianClassifier.pdf.

11. http://web.ydu.edu.tw/~alan9956/.

12. Joao Gama and Carlos Pinto, “Discretization from data streams: applications to histograms and data mining”, In Proceedings of the 2006 ACM Symposium on Applied Computing (SAC), pages 662–667, New York, NY, USA, 2006.

13. Usama M. Fayyad and Keki B. Irani, “Multi-interval discretization of continuous valued attributes for classification learning”, In Thirteenth International Joint Conference on Artificial Intelligence, volume 2, pages 1022–1027, Morgan Kaufmann Publishers, 1993.

14. Igor Kononenko, “On biases in estimating multi-valued attributes”, In 14th International Joint Conference on Artificial Intelligence, pages 1034–1040, 1995.

15. Weka User Manual, Available Online www.gtbit.org/downloads/dwdmsem6/dwdmsem6lman.pdf.

16. NSL-KDD dataset, Available Online http://iscx.ca/NSL-KDD/.

17. S.S. Panwar, Y.P. Raiwani, “Data Reduction Techniques to Analyze NSL-KDD Dataset”, International Journal of Computer Engineering & Technology, vol. 5, issue 10, pp. 21-31, 2014.

18. Y.P. Raiwani and S.S. Panwar, “Research Challenges and Performance of Clustering Techniques to Analyze NSL-KDD Dataset”, International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 3, Issue 6, pp. 172-177, 2014.

19. Y.P. Raiwani and S.S. Panwar, “Data Reduction and Neural Networking Algorithms to Improve Intrusion Detection System with NSL-KDD Dataset”, International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 4, Issue 1, pp. 219-225, 2015.