Data Mining using the Weka toolkit

 

by Hedinn Steingrimsson

 

Table of Contents

 

1      Abstract

2      Introduction

3      Definition of the task

3.1       Data mining

3.2       The data mining process

3.3       Can the data mining process be automated ?

3.4       The automatic data mining process

4      The method

4.1       Extracting important/relevant fields from a dataset

4.1.1        The basic principles of feature extraction

4.1.2        Specific feature extraction algorithms

4.2       Finding classification models that generally perform well

4.2.1        How do classification models work ?

5      Mining different types of data

5.1       The datasets

5.2       Use of the Weka package

5.3       The setup of the classifiers

5.4       The setup of the test environment

5.5       The results

5.5.1        Annealing data results

5.5.2        German Credit data results

5.5.3        Mushrooms data results

5.5.4        Breast cancer data results

5.5.5        Audiology data results

5.5.6        Primary tumor analysed

6      Bibliography

 

Table of Figures

Figure 1: Data flow diagram of the data mining process

 

           


                                                            

1         Abstract

This paper starts from the question: is it possible to find a classifier that generally performs well regardless of the kind of data it works with?

To answer this question, several simple classifiers are examined as well as more complex meta classifiers.

The Weka software package is used to test whether such a classifier can be found.

The conclusion is that further tests are needed, but that several promising meta classifiers are strong candidates for such a classifier.

2         Introduction

 

It is generally known that all classification methods have their advantages and disadvantages and are thus suited to different types of data.  The aim of the research described in this paper is to find out whether this is always the case, or whether some methods generally outperform others when used on various types of data.  The performance measure in this paper is classification accuracy.  The reasons for this choice include the following.  First, in today's information environment software, not hardware, is the primary cost factor, since hardware is no longer particularly expensive.  Second, the business applications that would benefit from a generally well performing method are typically of such a nature that the data mining processes are run overnight or used to analyse long-term trends.  In both cases the most important feature of the algorithm is its accuracy, not its execution time.

The WEKA (Waikato Environment for Knowledge Analysis) toolkit is used for the analysis.

 

 

3          Definition of the task

 

3.1      Data mining

Data mining has been described as the process of discovering previously unknown and potentially interesting patterns in large datasets [PIAT].

The “mined” information can be represented as a model of the semantic structure of the dataset, which can be used on new data for prediction or classification. 

Alternatively the use of the model and the knowledge of a human domain expert may be combined in such a way that:

o       the domain expert may spot portions of the model that explain previously misunderstood or unknown characteristics of the domain under study

o       the domain expert may be able to correct some deficiencies in the classification of the model.  This is primarily applicable if the dataset that can be used for the model creation is small.

[SALLY]

 

3.2      The data mining process

 

The process of building a data mining application is described by the following diagram: [SALLY]

 

Figure 1: Data flow diagram of the data mining process

 

 As can be seen from the diagram the process involves:

It is very important to work with good data when building data mining applications, so the importance of this step should not be underestimated.

The data mining expert and the domain expert can work together to clarify those anomalies.

If it is decided to build the model from a representative sample rather than from all the data, it is important that the sample truly represents the real data, e.g. that the class proportions in the sample are similar to those in the real data.

An important issue is to present the results of the data mining model in a clear and interactive way.

 

It should be added that the data mining process is iterative and that steps can depend on steps that come later in the process.  For example, the choice of data mining algorithm can affect which fields of the data should be provided as input, because some algorithms require the fields to be independent of each other in order to give realistic results.

 

3.3      Can the data mining process be automated ?

 

An interesting question concerning the data mining process described above is whether it can be automated.  The question matters because it cannot generally be assumed that the people who would benefit most from using data mining methods on their data know much about how data mining works.

It is expensive to tailor data mining applications to each dataset.  It would thus be greatly beneficial if a data mining solution could be created that is directly applicable to business data without modifications that require expert data mining knowledge.  Such a solution would be more than welcome to smaller companies that would not otherwise invest in tailoring a data mining application to their data.  It is clear that such a solution would not be as effective on a particular dataset as a tailor-made one, but how large would the difference be?

 

 

 

3.4      The automatic data mining process

 

It is clear that this step is hard to automate because each business dataset is generally different, with different inconsistencies.  There do, however, exist software solutions that can partly help here, e.g. by correcting postal codes so that they match addresses.

By presenting the results of the data mining model clearly and interactively in a non-technical way, the maximum benefit of the data mining solution is provided to the user.

 

4         The method

4.1      Extracting important/relevant fields from a dataset

Because the main task of this paper is to explore generally well performing data mining algorithms, the topic of feature extraction is only briefly touched upon.

4.1.1       The basic principles of feature extraction

4.1.1.1  What is the problem ?

It can be argued that, in theory, data mining models should be tolerant of noisy or irrelevant data because, after all, they describe patterns that can be found in the data, not the noise.  But is there a difference between theory and practice?  One answer is: in theory there is no difference between theory and practice, but in practice there is.  In practice it matters whether the data contains noise, because the amount of available quality data is almost always limited, and thus sometimes a noisy attribute will be used to make a selection instead of the correct, relevant one.

How this happens can be seen in decision trees: because the number of nodes at each level increases exponentially with depth, the chance of a rogue attribute looking good somewhere along the frontier multiplies up as the tree deepens.  Experiments have shown that this can account for 5-10% classification accuracy deterioration e.g. when a random binary attribute generated by tossing an unbiased coin is added to a dataset.

Instance-based methods are especially susceptible to irrelevant attributes.  It has been shown that the number of training instances needed to produce a predetermined level of performance for instance-based learning increases exponentially with the number of irrelevant attributes present.

Other methods, e.g. Naïve Bayes, are sensitive to other kinds of problematic data, e.g. duplicates of relevant data fields.  It has been shown that adding a field that predicts the class of a data tuple in the same proportions as the actual class proportions of the dataset causes Naïve Bayes classification accuracy to deteriorate by up to 5%.  It has also been shown that the performance of classification models deteriorates if the data is very noisy [John97].  The reason is that real-world data mining models are usually built using limited amounts of data, and at some point the noisy fields will prevail over the correct ones and be used for selection.

 

4.1.1.2  What kind of solutions are possible ?

A solution to these problems is to extract only the meaningful fields of the data and use those for the classification.  This can be done by two approaches:

 

The first of these methods is called filtering and the second is the wrapper method.

4.1.1.2.1   Filtering methods

It is possible to extract the most important attributes of the dataset by, for example, building a decision tree that fully classifies the training set and keeping only those attributes that the tree actually uses.  The classification model itself can then be built from these attributes with another classification method, e.g. nearest neighbour.

Another possibility is to use the 1R classification method that is described in chapter 4.2 to rank attributes according to their branching value and let the user choose how many of the best attributes are selected.

It is also possible to use nearest neighbour methods for this task by adjusting the weights of attributes in such a way that attributes that differ in a near hit (the nearest instance of the same class) get decreased weight, because they are likely irrelevant to the classification, while attributes that differ in a near miss (the nearest instance of a different class) get increased weight, because they are probably relevant.

This method will, however, not be able to detect attributes that are redundant because they are correlated with other attributes, since those attributes would all be treated alike as a hit or a miss.
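The near-hit / near-miss weighting idea can be sketched as follows.  This is a minimal illustration for nominal attributes in the spirit of the Relief family of algorithms; the function name `relief_weights`, the unit-difference distance and the unnormalised weight updates are assumptions made for the example and are not taken from Weka's ReliefFAttributeEval.

```python
def diff(a, b):
    """Unit difference for nominal values: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


def distance(x, y):
    """Number of attributes in which two instances differ."""
    return sum(diff(a, b) for a, b in zip(x, y))


def relief_weights(instances, labels):
    """Return one weight per attribute (higher means more relevant)."""
    n_attrs = len(instances[0])
    weights = [0.0] * n_attrs
    for i, (x, cls) in enumerate(zip(instances, labels)):
        others = [j for j in range(len(instances)) if j != i]
        hits = [j for j in others if labels[j] == cls]
        misses = [j for j in others if labels[j] != cls]
        if not hits or not misses:
            continue  # no near hit or no near miss available for this instance
        near_hit = instances[min(hits, key=lambda j: distance(x, instances[j]))]
        near_miss = instances[min(misses, key=lambda j: distance(x, instances[j]))]
        for a in range(n_attrs):
            weights[a] -= diff(x[a], near_hit[a])   # differs in the near hit: less relevant
            weights[a] += diff(x[a], near_miss[a])  # differs in the near miss: more relevant
    return weights
```

On a toy dataset in which the first attribute determines the class and the second is noise, the first attribute ends up with the larger weight.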

The drawback of filtering methods is that they depend only on the training data set and thus there is a danger of overfitting.

The benefit is that these methods are independent of the subsequent classification methods.

4.1.1.2.2   Wrapping methods

These methods try to find the best attributes by probing which attributes provide the best results when used for training and evaluation of the classification model.

Typically a greedy search is applied, either starting with no attributes and adding them one at a time, which is called forward selection, or starting with the full set and deleting attributes one at a time, which is called backward elimination.

In the forward selection approach the search ends when the evaluation of the classifier does not improve when any one further attribute is added.  The same principle applies to backward elimination.
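Greedy forward selection as described above can be sketched in a few lines.  The function name `forward_selection` and the caller-supplied `evaluate` function (which would typically train and evaluate the chosen classifier on the given attribute subset) are assumptions made for illustration; this is not Weka's WrapperSubsetEval.

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection.

    `evaluate` is a caller-supplied function mapping a list of attribute
    names to a score (e.g. estimated accuracy); higher is better.
    """
    selected = []
    best_score = evaluate(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        # Try adding each remaining attribute and keep the best one.
        score, attr = max((evaluate(selected + [a]), a) for a in candidates)
        if score <= best_score:
            break  # no single further attribute improves the evaluation
        selected.append(attr)
        best_score = score
    return selected, best_score
```

Backward elimination follows the same pattern, starting from the full attribute list and removing one attribute per step.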

4.1.2      Specific feature extraction algorithms

The Weka package provides the following attribute selection algorithms.  The algorithms are split into two adjustable components:

 

The attribute evaluators are:

o       CfsSubsetEval

o       ClassifierSubsetEval

o       WrapperSubsetEval

o       ConsistencySubsetEval

o       ReliefFAttributeEval

o       InfoGainAttributeEval

o       GainRatioAttributeEval

o       SymmetricalUncertAttributeEval

o       OneRAttributeEval

o       ChiSquaredAttributeEval

o       PrincipalComponents

 

The selection methods are:

 

A detailed explanation of how these algorithms work is to be found in the Weka documentation and is beyond the scope of this paper. 

4.2      Finding classification models that generally perform well

4.2.1      How do classification models work ?

The classification models in the Weka package are the following:

 

4.2.1.1  Simple classifiers

 

ZeroR: This is the most primitive learning scheme in Weka.  It models the dataset with a single rule. Given a new data item for classification, ZeroR always predicts the most frequent category value in the training data for problems with a nominal class value, or the average class value for numeric prediction problems.  Although it seems to make little sense to use this scheme for classification, it can be useful for generating a baseline performance that other learning schemes are compared to.  In some datasets it is possible for other learning schemes to induce models that perform worse on new data than ZeroR which is a clear indicator of serious overfitting.
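The ZeroR rule is simple enough to state as code.  The sketch below is only an illustration of the idea; the function name `zero_r` is an assumption and the code is not the Weka implementation.

```python
from collections import Counter
from statistics import mean


def zero_r(class_values):
    """Return the single ZeroR prediction for a list of training class values."""
    if all(isinstance(v, (int, float)) for v in class_values):
        return mean(class_values)                       # numeric class: predict the average
    return Counter(class_values).most_common(1)[0][0]   # nominal class: predict the mode
```

Every new data item receives this same prediction, which is what makes the scheme useful as a baseline.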

 

OneR: This is the second simplest classification method in Weka.  It is designed for nominal data.  It produces simple rules based on a single attribute.  The attribute chosen is the one that has the smallest error rate when used as the sole classification attribute.

A description of the OneR algorithm is:

 

            For each attribute

                        For each value of that attribute, make a rule as follows:

                                    Count how often each class appears

                                    Find the most frequent class

                                    Make the rule assign that class to this attribute-value

                        Calculate the error rate of the rules

            Choose the rules with the smallest error rate

 

 

 Like ZeroR, it is useful for generating a baseline for classification performance.  It has, somewhat surprisingly, been shown that this simple method can outperform many of the more sophisticated algorithms over many standard data mining datasets [HOLTE].  The reason is that many real-world datasets contain very simply structured information, with simple relationships that can be parsimoniously detected and represented by OneR.  The primary use of OneR is nevertheless as a baseline.
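The OneR procedure described above can be sketched as follows for nominal attributes.  The function name `one_r`, the use of raw error counts instead of error rates (equivalent here because every attribute covers the same instances) and the tie-breaking behaviour are assumptions of this sketch, not details of the Weka implementation.

```python
from collections import Counter, defaultdict


def one_r(instances, labels):
    """Return (attribute index, {value: class}, training errors) for the best attribute."""
    best = None
    for a in range(len(instances[0])):
        counts = defaultdict(Counter)            # attribute value -> class counts
        for x, cls in zip(instances, labels):
            counts[x[a]][cls] += 1
        # One rule per attribute value: predict the most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (a, rules, errors)
    return best
```

The returned rule set can be applied to a new instance by looking up its value for the chosen attribute.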

 

NaiveBayes: This method implements a naïve Bayes classifier, which produces probabilistic rules.  It can be used for nominal data as well as numeric data.  If the data is numeric then its probability distribution must be assumed.  Common choices are either to assume a normal distribution or, if the data is not normally distributed, to use a kernel density estimator.  When presented with a new data item, this model indicates the probability that the item belongs to each of the possible class categories.  The Bayesian classifier is naïve in the sense that attributes are treated as though they are completely independent and as if each attribute contributes equally to the model.  If extraneous attributes are included in the dataset, those attributes will skew the model.  Despite its simplicity, NaiveBayes can give good results on many real-world datasets.
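For nominal attributes the calculation can be sketched as below.  The function names and the add-one smoothing of the counts are assumptions made for this illustration; the text does not describe how Weka's NaiveBayes handles zero counts.

```python
from collections import Counter, defaultdict


def train_naive_bayes(instances, labels):
    """Collect the counts needed for a nominal naive Bayes model."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    values = defaultdict(set)            # attribute index -> distinct values seen
    for x, cls in zip(instances, labels):
        for a, v in enumerate(x):
            value_counts[(cls, a)][v] += 1
            values[a].add(v)
    return class_counts, value_counts, values


def class_probabilities(model, x):
    """Return {class: probability} for instance x, treating attributes as independent."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        p = n / total                                   # prior probability of the class
        for a, v in enumerate(x):
            # add-one smoothing (an assumption) avoids zero probabilities
            p *= (value_counts[(cls, a)][v] + 1) / (n + len(values[a]))
        scores[cls] = p
    norm = sum(scores.values())
    return {cls: p / norm for cls, p in scores.items()}
```

The product over attributes is where the independence assumption enters: a duplicated attribute would simply be multiplied in twice.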

 

IBk: This is an instance-based nearest neighbour learning scheme.  Nearest neighbour methods have a rather long history (they originated in the late 1950s).  They work by storing the whole dataset and assigning each new data item to the same class as its “nearest neighbour(s)”.  Either only the single nearest neighbour is considered, or a number of nearest neighbours vote on the classification of the new data item.  There are several definitions of the “nearest” concept.  If the data is numeric then a good candidate for the distance between an instance with attribute values $a_1^{(1)}, a_2^{(1)}, a_3^{(1)}, \dots$ and one with $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}, \dots$ is the Euclidean distance function:

$$\sqrt{\left(a_1^{(1)} - a_1^{(2)}\right)^2 + \left(a_2^{(1)} - a_2^{(2)}\right)^2 + \dots}$$

This distance function strikes a balance between pairs of instances that are similar except for a considerable difference in one attribute and pairs that differ slightly in many attributes but have no great difference in any.  The former would be given more weight if powers higher than the square were taken.  If the data contains nominal attributes, a possible distance function gives a unit difference if the attribute values differ and zero difference if they are the same.  Missing values can be treated by assigning the highest possible distance value if the attribute is missing in both data instances.  If it is missing in only one, a common method is to take as the difference either the (normalised) value of the existing attribute or one minus that value, whichever is larger.  It is also possible to tailor the distance function so that it expresses how distance is perceived in the specific dataset.  The Weka package offers the possibility of determining how many neighbours give the best results by using cross-validation.  The nearest neighbour algorithm is rather time consuming, especially when more than one nearest neighbour is to be considered.  It also requires rather a lot of memory to store the instances.  Various optimisation methods exist, e.g. ones that target the amount of data that is stored.
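The distance rules above, including the treatment of nominal and missing values, can be sketched as follows.  The function names, the use of `None` for a missing value, the assumption that numeric attributes are already normalised to the range 0 to 1, and the choice of a unit difference when a nominal value is compared with a missing one are all assumptions of this illustration rather than details of Weka's IBk.

```python
import math
from collections import Counter


def attribute_difference(a, b):
    """Difference between two attribute values; None marks a missing value."""
    if a is None and b is None:
        return 1.0                               # both missing: maximal difference
    if a is None or b is None:
        present = b if a is None else a
        if isinstance(present, (int, float)):
            return max(present, 1.0 - present)   # the larger of the value and one minus it
        return 1.0                               # nominal vs. missing (assumption)
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0.0 if a == b else 1.0                # nominal: unit difference if different


def euclidean(x, y):
    """Euclidean distance built from the per-attribute differences."""
    return math.sqrt(sum(attribute_difference(a, b) ** 2 for a, b in zip(x, y)))


def knn_classify(train, labels, query, k=1):
    """Let the k nearest training instances vote on the class of the query."""
    nearest = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

Every query scans the whole stored training set, which reflects the time and memory cost mentioned above.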

The nearest neighbour model can be used as a classifier, but unlike e.g. decision trees or rules it does not explain its choices.  Another possible application of nearest neighbour is to use the method to classify the data and then apply another method, based on that classification, that can provide more understandable reasoning for its classifications.

It should be added that nearest neighbour methods are able to create complex decision boundaries for numeric attributes, e.g. a hyperplane if Euclidean distance is used for a two-class classification model, while methods based on rules or trees are only capable of representing class boundaries that are parallel to the axes defined by the attributes.  This more flexible classification mechanism can in some cases provide a better solution than a more human-understandable decision tree or rule set.  [Aha]

 

J48: This is an implementation of C4.5 release 8 [QUINLAN], a standard algorithm that is widely used for practical machine learning.  This implementation produces decision tree models.

PART: This is a more recent scheme for producing sets of rules called “decision lists”, which are ordered sets of rules.  A new data item is compared to each rule in the list in turn, and the item is assigned the category of the first matching rule (a default is applied if no rule successfully matches).  This algorithm works by forming pruned partial decision trees (built using C4.5 heuristics) and immediately converting each of them into a corresponding rule.

 

Decision Stump: This method builds simple binary decision “stumps” (1-level decision trees) for both numeric and nominal classification problems.  It copes with missing values by extending a third branch from the stump, in other words, by treating “missing” as a separate attribute value.  Decision Stump is mainly used in conjunction with the LogitBoost boosting method, which is discussed later in this chapter.

 

 

 

 

 

 

4.2.1.2   Meta-Classifiers

Recent developments in computational learning theory have led to methods that enhance the performance or extend the capabilities of these basic learning schemes.  Those learning schemes have been called “meta-learning schemes” or “meta-classifiers” because they operate on the output of other learners.  Instead of using a single classifier to make predictions, why not arrange a committee of classifiers to vote on the classification ?  This is the basic idea behind combining multiple models to form an ensemble or meta classifier.

The reason why meta classifiers often outperform other methods lies in the nature of the error that is always inherent in a classification.

Basically there are two errors:

By combining multiple classifiers that work on different datasets it is possible to decrease this type of error.

 

Three of the most prominent methods for constructing ensemble classifiers are boosting, bagging and stacking [BREIMAN].  More often than not these classifiers improve on the performance of a single classifier, which makes them very strong candidates for a general classification model.  There is, however, a price to pay: unlike e.g. rule- or tree-based classifiers, which provide rules or trees that humans can explore in order to understand why the model classifies as it does, these meta classifiers generally do not enable an understanding of what lies behind the improved decision making.

 

All the meta classifiers vote on classifications using a weighted vote.  Each model in the ensemble predicts a class and can also assign a confidence value to the prediction.  These values are summed and the class with the largest value ( most confidence ) is chosen.
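The weighted vote can be written down directly.  The function name `weighted_vote` and the representation of each model's output as a (class, confidence) pair are assumptions made for this sketch.

```python
from collections import defaultdict


def weighted_vote(predictions):
    """`predictions` is a list of (predicted class, confidence) pairs, one per model."""
    totals = defaultdict(float)
    for cls, confidence in predictions:
        totals[cls] += confidence          # sum the confidence values per class
    return max(totals, key=totals.get)     # the class with the largest total wins
```

Two moderately confident models can thus outvote one highly confident model, as in `weighted_vote([("yes", 0.9), ("no", 0.6), ("no", 0.5)])`.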

 

The bagging and boosting meta classifiers use a single simple classification method but create more than one model with it, while stacking combines different classification methods.

Because clean datasets that can be used for training and evaluating classifiers are scarce, bagging normally uses a resampling technique to get enough data for all the models.  The resampling works by drawing, for each new dataset, as many instances as there are in the original dataset, selecting them randomly with replacement.  The new datasets are then used to create classification models in the normal way, using the chosen simple classifier, and the results of the models are combined as described above to form a final classification.

 

Algorithmically:

            Let n be the number of instances in the training data

            For each of t iterations do

                        Randomly sample n instances ( using deletion and replication )

                        Apply a learning technique to build a model from the sample

                        Store the model

            End
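The bagging loop above translates almost line by line into code.  In this sketch the names `bagging` and `bagged_predict`, the caller-supplied `learn` function and the use of an unweighted majority vote are assumptions made for illustration.

```python
import random
from collections import Counter


def bagging(instances, labels, learn, t=10, seed=1):
    """Build t models, each from a bootstrap sample of the training data.

    `learn(instances, labels)` must return a function mapping an instance to a class.
    """
    rng = random.Random(seed)
    n = len(instances)
    models = []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n instances with replacement
        models.append(learn([instances[i] for i in idx], [labels[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Combine the stored models by an unweighted majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

Because sampling is done with replacement, some instances appear several times in a sample and others not at all, which is what makes the individual models differ.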

 

 

Like bagging, boosting is iterative but instead of sampling fresh training data, each new model is influenced by the performance of those built previously.  Instances that are incorrectly classified in previous iterations are promoted and those correctly classified are relegated.  The key idea is to weight the instances and to use a learning algorithm that can take into account these weights when constructing its models.  Initially the weights are even and a model is constructed.  The instances correctly classified by this model are given less weight so that the incorrectly classified instances will have more “importance” in the next iteration.

 

 

 

 

Algorithmically:

            Assign equal weight to all instances

            For each of t iterations do

                        Apply a learning technique to build a model from the weighted instances and store the resulting model

                        Down-weight each instance correctly classified by the model

            End
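A concrete form of this loop is sketched below.  The specific weight update (multiplying the weights of correctly classified instances by e / (1 - e), where e is the weighted error, and then renormalising), the stopping conditions and the vote weight -log(e / (1 - e)) follow the usual AdaBoost.M1 formulation and are assumptions here, since the pseudocode above does not fix them.  The names `boost`, `boosted_predict` and `learn` are hypothetical.

```python
import math


def boost(instances, labels, learn, t=10):
    """AdaBoost.M1-style boosting sketch.

    `learn(instances, labels, weights)` must return a function mapping an
    instance to a class. Returns a list of (model, vote weight) pairs.
    """
    n = len(instances)
    weights = [1.0 / n] * n                      # assign equal weight to all instances
    ensemble = []
    for _ in range(t):
        model = learn(instances, labels, weights)
        correct = [model(x) == y for x, y in zip(instances, labels)]
        error = sum(w for w, ok in zip(weights, correct) if not ok)
        if error == 0:
            return [(model, 1.0)]                # perfect model: use it alone (simplification)
        if error >= 0.5:
            break                                # too weak to be useful: stop
        beta = error / (1.0 - error)
        ensemble.append((model, -math.log(beta)))
        # Down-weight the correctly classified instances, then renormalise.
        weights = [w * beta if ok else w for w, ok in zip(weights, correct)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def boosted_predict(ensemble, x):
    """Weighted vote of the stored models."""
    totals = {}
    for model, vote in ensemble:
        cls = model(x)
        totals[cls] = totals.get(cls, 0.0) + vote
    return max(totals, key=totals.get)
```

Models with a low weighted error receive a large vote, so the final decision leans on the more reliable members of the ensemble.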

 

The AdaBoost.M1 [FREUND] boosting algorithm gives the user control over the number of boosting iterations performed.  Another boosting procedure is implemented by LogitBoost [FRIEDMAN], which is suited to two-class problems.  In order to apply such schemes to multi-class datasets it is necessary to transform the multi-class problem into several two-class ones and combine the results.  The MultiClassClassifier meta scheme does exactly that.

           

Stacking is an interesting method in which different types of classifier can be combined to form a single classifier through a voting mechanism as described above.

Stacking normally works in such a way that the different classifiers build their models from the training dataset and their performance is then tested on the test set as before, but the information obtained in this way is used to provide a weighting for each model in the final classification step.  This final step can also use the confidence of the models' classifications in order to provide a good final classification.

 

 

5         Mining different types of data

 

5.1      The datasets

The author tested the algorithms described above on several datasets from the UCI repository that can be found on the Weka website [WEKA].  Detailed descriptions of the datasets can also be found there.  These datasets were chosen because they contain the highest numbers of instances among the datasets listed on the Weka website and come from a variety of fields, which means that they can be considered to provide a rather generic view of the data that data mining methods work on.  According to the Weka site the datasets were ready for use and were therefore not pre-processed using e.g. filter or wrapper functions.

 

5.2      Use of the Weka package

The primary focus was on using the Weka package to find out which classification methods generally performed best.

 

The Weka package provides three kinds of interface: a command-line interface; the Explorer, which is well suited to filtering and wrapping but offers few possibilities for analysing the results of experiments; and the Experimenter, which is arguably the most complicated interface but also provides the best functionality, especially for analysing different experiments.

After some testing it was decided to use the Experimenter for the evaluations.

It should be added that the Weka algorithms are open source, so it is possible to change them and to add new ones.  It is simplest to run new or changed algorithms from the command-line interface, but with additional programming they can also be included in the graphical ones.

 

 

5.3      The setup of the classifiers

Tenfold cross-validation was used for each model to get a reliable estimate of the model's capability in classifying the dataset.

Tenfold cross-validation works by splitting the available dataset into ten portions.  In each iteration nine portions are used for training the model and the remaining one for testing.  This is repeated ten times so that each data item appears in a test set exactly once.
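The splitting scheme can be sketched as follows.  The function name `cross_validation_folds` is an assumption, and the sketch shuffles the data and deals it into folds without attempting to keep the class proportions equal across folds; it illustrates the principle only and is not the Experimenter's implementation.

```python
import random


def cross_validation_folds(n_instances, k=10, seed=1):
    """Yield (train indices, test indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]    # deal the shuffled indices into k portions
    for i in range(k):
        test = folds[i]                          # one portion for testing
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

A model is trained and tested once per yielded pair, and the k accuracy figures are then averaged.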

The classification methods were those described in chapter 4.

They are shown here with the run parameters used and a short explanation if appropriate:

(1) AdaBoostM1 '-P 100 -I 10 -S 1 -W j48.J48 -- -C 0.25 -M 2'

(2) IBk '-K 1 -W 0' - nearest neighbour algorithm used 1 nearest neighbour.

(3) j48.J48 '-C 0.25 -M 2'

(4) Stacking '-B \"IBk -K 1 -W 0\" -B \"AdaBoostM1 -P 100 -I 10 -S 1 -W DecisionStump --\" -B \"j48.J48 -C 0.25 -M 2\" -X 10 -S 1 -M \"NaiveBayes \ LogitBoost '-P 100 -I 10 -W DecisionStump --' "' - used NaiveBayes as its meta classifier.  Linear regression is the meta classifier recommended in [WekaB], but as it can only be used on numeric attributes Naive Bayes was used instead.  It is possible that Naive Bayes is not an optimal choice, because it normally weights all attributes/class decisions equally, while some classifiers may perform better and have higher confidence in their decisions.  This topic was, however, not examined in detail in the study described in this paper because it would require inspecting the actual implementation of the Weka boosting algorithm.

 The stacking classifiers were: IBk, LogitBoost with 10 iterations, J48, NaiveBayes, AdaBoostM1 using DecisionStump and AdaBoostM1 using J48.  Those were the “real” classifiers.  OneR, DecisionStump and ZeroR create such simple models that they are mostly useful for checking whether a model is seriously overfitted, which would be the case if its performance were worse than that of those trivial classifiers.

(5) LogitBoost '-P 100 -I 100 -W DecisionStump --'

(6) LogitBoost '-P 100 -I 10 -W DecisionStump --'

(7) NaiveBayes ''

(8) AdaBoostM1 '-P 100 -I 10 -S 1 -W DecisionStump --'

(9) OneR '-B 6'

(10) DecisionStump ''

(11) ZeroR ''

 

 

5.4      The setup of the test environment

The performance of different classifiers on each dataset was analysed with the Analyse component of the Weka Experiment Environment.

The Weka toolkit offers better evaluation capabilities than those shown in this paper: all the datasets can be run together against all the classifiers, providing one summary with information about which classifiers performed statistically better than others in most cases, using the methods shown next.  This, however, requires very delicate and long runs of the Experimenter.

Running the experiments in this way would be a very good next step in the research described in this paper.

The primary method used was to test, by means of a t-test, whether there was a statistically significant difference between the performance of the methods.

A t-test is a statistical method that can be used to test whether there is a statistically significant difference between two classifiers, typically a baseline classifier and one other classifier.

The error rates of each classifier in the cross validations can be looked upon as different, independent samples from a probability distribution.

When comparing two learning schemes by comparing the average error rate over several cross validations, an attempt is thus made to find out if the mean of a set of examples is significantly greater than, or less than, the mean of another.  The t-test is the statistical method used for these purposes.

Because the same cross-validation split can be used for both methods to obtain a matched pair of results, a more sensitive version of the t-test, called the paired t-test, can be used.
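The statistic behind the paired t-test is simple to compute.  The sketch below is the textbook form of the statistic; the function name `paired_t_statistic` is an assumption, and the code is not claimed to match the exact test implemented in the Experimenter.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for matched pairs, with len(scores_a) - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]   # per-fold differences
    k = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(k))
```

The result is compared with a critical value of Student's t distribution.  With ten folds there are nine degrees of freedom, for which the two-sided 5% critical value is about 2.26, so an absolute value above that would indicate a significant difference at that level.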

 

Another comparison method used was to rank the classification methods by simply counting, for each method, the number of methods that it outperformed and the number that it was worse than, giving a ranked list in which the best methods have the most wins.
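The ranking can be sketched as follows, assuming the pairwise significance results are already available.  The function name `rank_by_wins_minus_losses` and the input format (a mapping from each method to the set of methods it significantly outperformed) are assumptions made for the example.

```python
def rank_by_wins_minus_losses(significantly_better):
    """Return rows (method, wins - losses, wins, losses), best first.

    `significantly_better[a]` is the set of methods that method a
    outperformed with statistical significance.
    """
    methods = list(significantly_better)
    table = []
    for m in methods:
        wins = len(significantly_better[m])
        losses = sum(1 for other in methods if m in significantly_better[other])
        table.append((m, wins - losses, wins, losses))
    return sorted(table, key=lambda row: row[1], reverse=True)
```

Each returned row corresponds to the "Wins-Losses", "Wins" and "Losses" columns of the result tables in section 5.5.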

 

 

5.5      The results

The results of each classifier on each dataset are presented as the number of classifiers that it performed statistically better than, the number that it performed statistically worse than, the percentage of test samples that it classified correctly, the standard deviation of that percentage, and whether it performed statistically better or worse than a baseline classifier.

 

Judging from the results, it can be concluded that further tests are required to get a reliable estimate of the true performance of classifier models when given different types of datasets.

Some trends are nevertheless visible, which further research could test for validity.

·        The 100-iteration LogitBoost algorithm performs better than the 10-iteration one, except when the performance of both is below 50%, as in the primary tumor case.

·        The boosting algorithms generally perform well, with AdaBoost performing somewhat better with J48 as its simple classifier than with DecisionStump.

·        The Stacking algorithm performs very well when most of its members perform very well, as in the mushroom case.  It probably needs some refinement, however; in particular it would be interesting to see whether a better final decision algorithm than NaiveBayes could be found, as discussed above.

 

 

 

 

 

 

5.5.1      Annealing data results

 

 

 

Annealing                         Wins-Losses  Wins  Losses  Percent correct  Std   Statistical significance (v better / * worse)
(1) AdaBoostM1 - J48              10           10    0       99.60 %          0.78  v
(2) IBk                           8            9     1       99.04 %          1.14  v
(3) J48                           6            8     2       98.58 %          1.14  v
(4) Stacking                      4            7     3       98.55 %          -     v
(5) LogitBoost - 100 iterations   1            5     4       98.27 %          1.37  Baseline
(6) LogitBoost - 10 iterations    1            5     4       98.17 %          1.37  Baseline
(7) NaiveBayes                    -2           4     6       86.51 %          3.34  *
(8) OneR                          -5           2     7       83.63 %          0.5   *
(9) AdaBoostM1 - DecisionStump    -5           2     7       83.63 %          0.5   *
(10) DecisionStump                -8           1     9       77.19 %          2.72  *
(11) ZeroR                        -10          0     10      76.17 %          0.6   -

 

5.5.2      German Credit data results

 

| German Credit | Wins - Losses | Wins | Losses | Percent correct | Std | Statistical significance (v better / * worse) |
|---|---|---|---|---|---|---|
| (7) NaiveBayes | 8 | 8 | 0 | 74,98 % | 4,41 | v |
| (5) LogitBoost - 100 iterations | 8 | 8 | 0 | 75,3 % | 3,21 | v |
| (6) LogitBoost - 10 iterations | 4 | 6 | 2 | 72,22 % | 3,58 | neutral |
| (2) IBk | 3 | 5 | 2 | 72,38 % | 4,35 | Baseline |
| (4) Stacking | 1 | 4 | 3 | 72,18 % |  | neutral |
| (9) AdaBoostM1 - DecisionStump | -2 | 2 | 4 | 71,4 % | 3,41 | neutral |
| (3) J48 | -3 | 2 | 5 | 71,18 % | 3,92 | * |
| (1) AdaBoostM1 - J48 | -4 | 1 | 5 | 70,77 % | 3,95 | * |
| (11) ZeroR | -6 | 1 | 7 | 70,0 % | 0 | * |
| (8) OneR | -8 | 0 | 8 | 60,43 % | 3,43 | * |
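The significance column marks each classifier as better (v), worse (*) or not distinguishable (neutral) relative to the classifier labelled Baseline. The sketch below shows one way such a marker can be derived, assuming a plain two-sided paired t-test at the 5 % level over ten paired runs. The test actually applied by the Weka Experimenter may differ, the function name is hypothetical, and the per-run accuracies in the usage example are invented for illustration.

```python
import math
from statistics import mean, stdev

# Two-sided 5 % critical value of Student's t for 9 degrees of freedom,
# i.e. for ten paired runs. Pass another value for a different run count.
T_CRITICAL_10_RUNS = 2.262


def significance_marker(candidate, baseline, t_critical=T_CRITICAL_10_RUNS):
    """Return 'v', '*' or 'neutral' for candidate versus baseline."""
    if len(candidate) != len(baseline) or len(candidate) < 2:
        raise ValueError("need two equally long lists of at least two runs")
    diffs = [c - b for c, b in zip(candidate, baseline)]
    d_mean = mean(diffs)
    d_std = stdev(diffs)
    if d_std == 0:
        # All differences are identical, so the sign of the mean decides.
        if d_mean == 0:
            return "neutral"
        return "v" if d_mean > 0 else "*"
    t = d_mean / (d_std / math.sqrt(len(diffs)))
    if t > t_critical:
        return "v"
    if t < -t_critical:
        return "*"
    return "neutral"


if __name__ == "__main__":
    # Invented per-run accuracies, for illustration only.
    baseline_runs = [70, 72, 71, 69, 73, 70, 71, 72, 70, 71]
    better_runs = [b + d for b, d in zip(baseline_runs, [2, 3] * 5)]
    print(significance_marker(better_runs, baseline_runs))
```

Pairing the runs matters here: each difference is taken on the same train/test partition, so variation between partitions does not mask a consistent gap between two classifiers.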


5.5.3      Mushrooms data results

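| Mushrooms | Wins - Losses | Wins | Losses | Percent correct | Std | Statistical significance (v better / * worse) |
|---|---|---|---|---|---|---|
| Stacking | 7 | 7 | 0 | 100 % | 0 |  |
| IBk | 7 | 7 | 0 | 100 % | 0 | neutral |
| AdaBoostM1 - J48 | 7 | 7 | 0 | 100 % | 0 | Baseline |
| J48 | 7 | 7 | 0 | 100 % | 0 | neutral |
| LogitBoost - 100 iterations | 2 | 6 | 4 | 98,91 % | 0,18 | * |
| OneR | 0 | 5 | 5 | 98,41 % | 0,23 | * |
| AdaBoostM1 - DecisionStump | -3 | 3 | 6 | 96,82 % | 1,27 | * |
| LogitBoost - 10 iterations | -3 | 3 | 6 | 97,52 % | 1,13 | * |
| NaiveBayes | -6 | 2 | 8 | 95,38 % | 0,37 | * |
| DecisionStump | -8 | 1 | 9 | 88,64 % | 0,55 | * |
| ZeroR | -10 | 0 | 10 | 51,5 % | 0,91 | * |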
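ZeroR marks the floor in each of these tables. A minimal sketch of what it measures follows, assuming ZeroR simply predicts the most frequent class seen in training; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the most frequent class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def percent_correct(predicted_label, test_labels):
    """Accuracy, in percent, of always predicting the same label."""
    hits = sum(1 for label in test_labels if label == predicted_label)
    return 100.0 * hits / len(test_labels)


if __name__ == "__main__":
    # Toy two-class data, not the real mushroom dataset.
    train = ["edible"] * 52 + ["poisonous"] * 48
    test = ["edible", "edible", "edible", "poisonous"]
    majority = zero_r_fit(train)
    print(majority, percent_correct(majority, test))
```

Under this assumption, a ZeroR score close to 50 % is what one would expect when two classes are almost equally frequent, so any classifier worth using should clear it by a wide margin.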

 

5.5.4      Breast cancer data results

 

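| Breast cancer | Wins - Losses | Wins | Losses | Percent correct | Std | Statistical significance (v better / * worse) |
|---|---|---|---|---|---|---|
| NaiveBayes | 4 | 4 | 0 | 72,26 % |  | v |
| Stacking | 3 | 3 | 0 | 71,44 % |  | v |
| J48 | 3 | 3 | 0 | 71,34 % |  | v |
| LogitBoost - 100 iterations | 3 | 3 | 0 | 71,34 % |  | v |
| IBk | 2 | 2 | 0 | 70,52 % |  | v |
| AdaBoostM1 - DecisionStump | 2 | 2 | 0 | 70,93 % |  | v |
| LogitBoost - 10 iterations | 2 | 2 | 0 | 70,21 % |  | v |
| DecisionStump | -1 | 0 | 1 | 69,28 % |  | neutral |
| AdaBoostM1 - J48 | -4 | 0 | 4 | 67,53 % |  | neutral |
| ZeroR | -7 | 0 | 7 | 68,35 % |  | Baseline |
| OneR | -7 | 0 | 7 | 67,53 % |  | neutral |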

 

 

 

5.5.5      Audiology data results

| Audiology | Wins - Losses | Wins | Losses | Percent correct | Std | Statistical significance (v better / * worse) |
|---|---|---|---|---|---|---|
| LogitBoost - 100 iterations | 7 | 7 | 0 | 79,87 % | 3,74 | v |
| AdaBoostM1 - J48 | 6 | 6 | 0 | 80,65 % | 3,15 | v |
| LogitBoost - 10 iterations | 4 | 5 | 1 | 77,92 % | 3,46 | v |
| J48 | 3 | 5 | 2 | 74,03 % | 5,37 | v |
| IBk | -1 | 3 | 4 | 69,48 % | 4,63 | neutral |
| Stacking | -1 | 3 | 4 | 65,84 % | 6,93 | Baseline |
| OneR | -5 | 1 | 6 | 47,79 % | 4,23 | * |
| AdaBoostM1 - DecisionStump | -5 | 1 | 6 | 47,79 % | 4,23 | * |
| ZeroR | -8 | 0 | 8 | 21,69 % | 4,54 | * |

 

 

 

5.5.6      Primary tumor analysed

This run used Weka's random split result producer (RandomSplitResultProducer).
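A random split evaluates a classifier by shuffling the data and holding out part of it for testing. A minimal sketch of the idea follows, assuming a 66 % training share and a fixed seed. Both values and the function name are illustrative assumptions, not settings reported for this run.

```python
import random


def random_split(instances, train_fraction=0.66, seed=1):
    """Shuffle the instances and split them into (train, test) lists."""
    rng = random.Random(seed)   # fixed seed keeps the split repeatable
    shuffled = list(instances)  # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = random_split(range(100))
    print(len(train), len(test))
```

Repeating the split with different seeds and averaging the resulting accuracies gives the mean and standard deviation that a table of this kind reports.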

 

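| Primary tumor | Wins - Losses | Wins | Losses | Percent correct | Std | Statistical significance (v better / * worse) |
|---|---|---|---|---|---|---|
| NaiveBayes | 8 | 8 | 0 | 49,35 % |  | v |
| LogitBoost - 10 iterations | 8 | 8 | 0 | 44,17 % | 3,97 | v |
| LogitBoost - 100 iterations | 3 | 5 | 2 | 40,26 % | 3,31 | v |
| J48 | 2 | 4 | 2 | 38,43 % | 2,83 | v |
| AdaBoostM1 - J48 | 2 | 4 | 2 | 38,96 % | 2,68 | v |
| IBk | 0 | 3 | 3 | 37,13 % | 5,42 | Baseline |
| OneR | 0 | 0 | 0 |  |  |  |
| Stacking | -2 | 3 | 5 | 34,17 % | 5,22 | neutral |
| DecisionStump | -7 | 0 | 7 | 26,09 % | 2,25 | * |
| ZeroR | -7 | 0 | 7 | 23,91 % | 4,36 | * |
| AdaBoostM1 - DecisionStump | -7 | 0 | 7 | 26,09 % | 2,25 | * |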


6         Bibliography

 

 

[Aha]: Aha, D. 1992. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, Vol. 36, 267-287.

[BREIMAN]: Breiman, L. 1996. Bagging predictors. Machine Learning, Vol. 24, 123-140.

[FREUND]: Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, 148-156. Morgan Kaufmann, San Francisco, CA.

[FRIEDMAN]: Friedman, J.H., Hastie, T. and Tibshirani, R. 1998. Additive logistic regression: a statistical view of boosting. Technical Report, Department of Statistics, Stanford University.

[John97]: John, G.H. 1997. Enhancements to the data mining process. PhD dissertation, Computer Science Department, Stanford University.

[HOLTE]: Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, Vol. 11, 63-91.

[PIAT]: Piatetsky-Shapiro, G. and Frawley, W.J., eds. 1991. Knowledge Discovery in Databases. AAAI Press, Menlo Park, CA.

[SALLY]: Cunningham, S.J. and Holmes, G. 2000. Developing innovative applications in agriculture using data mining. University of Waikato, Hamilton, New Zealand.

[QUINLAN]: Quinlan, J.R. 1993. C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo, CA.

[WEKA]: www.cs.waikato.ac.nz/ml/weka

[WEKAB]: Witten, I.H. and Frank, E. 2000. Data Mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Diego, CA.