The Use of Classification Trees (Data Mining) in Predictive Ecology

What are classification trees?

Classification trees are a supervised learning method for classification problems, in which the response variable is categorical and each category represents one target class. Given the values of one or more predictor (independent) variables measured on a case or object, a classification tree predicts the class to which that case belongs. Classification tree analysis is one of the main techniques used in data mining (Hastie and Friedman 2001, 95).
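
As a brief, hedged illustration of this idea, the following sketch fits a classification tree with scikit-learn's DecisionTreeClassifier to a small invented ecological data set, using site temperature and rainfall as predictors and species presence/absence as the categorical response. The data and variable names are hypothetical and are only meant to show the mechanics of fitting and prediction.

    # A minimal sketch of fitting a classification tree with scikit-learn.
    # The data are invented: each row is a survey site described by two
    # environmental predictors, and the target records whether a hypothetical
    # species was observed (1) or not (0) at that site.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Predictor matrix X: mean temperature (degrees C) and annual rainfall (mm).
    X = np.array([
        [12.0,  800], [14.5,  950], [16.0, 1100], [18.5, 1200],
        [21.0,  600], [23.5,  550], [25.0,  500], [27.5,  450],
    ])
    # Categorical response y: 1 = species present, 0 = species absent.
    y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y)

    # Predict the class of a new, unsurveyed site from its predictor values.
    new_site = np.array([[19.0, 700]])
    print(tree.predict(new_site))        # predicted class label
    print(tree.predict_proba(new_site))  # estimated class membership probabilities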

There are many different tree-based methods, and the classification tree is the most basic of them. Besides being used on its own, the basic classification tree can be integrated into more complex learning methods, for example as a base learner for ensemble methods (Friedman and Popescu 2003, 84). The trees themselves can be kept simple or allowed to become complex.
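
The following sketch, based on synthetic data, illustrates one such use: a classification tree serving as the base learner of an ensemble method, here bagging as implemented by scikit-learn's BaggingClassifier. It is an illustrative example of tree-based ensembles in general, not the particular method of Friedman and Popescu.

    # A sketch of classification trees as base learners in a bagged ensemble.
    # The data are synthetic and purely illustrative.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A single, deliberately shallow tree for comparison.
    single_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    single_tree.fit(X_train, y_train)

    # An ensemble of 100 such trees, each fitted to a bootstrap sample of the
    # training data. (In scikit-learn releases before 1.2 the keyword is
    # base_estimator rather than estimator.)
    ensemble = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                                 n_estimators=100, random_state=0)
    ensemble.fit(X_train, y_train)

    print("single tree :", accuracy_score(y_test, single_tree.predict(X_test)))
    print("bagged trees:", accuracy_score(y_test, ensemble.predict(X_test)))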

A typical tree-learning algorithm is a realization of the "divide and conquer" methodology, which is recursive in nature, and it proceeds greedily at each "divide" step. Given a data set with a target class variable Y and possibly many independent variables (predictors) X, the data set can be split by rules based on the predictors X. After a split, one hopefully has a clearer picture of the dependent variable within each part of the data. More splits can be made recursively on each part until the sub-parts are fine enough to support a satisfactory conclusion about the dependent variable Y, or until there are no more good splits (Hastie and Friedman 2001, 95). The split step is the "divide"; the subsequent work on the divided sub-parts, which may involve further recursive dividing, is the "conquer". A tree-learning algorithm is an efficient realization of this divide-and-conquer methodology.
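
A bare-bones sketch of this recursive divide-and-conquer procedure, written in plain Python with NumPy, is given below. The function names (gini, best_split, grow_tree) and the stopping rules are invented for illustration; a production implementation such as CART adds pruning, categorical predictors, missing-value handling, and much more.

    import numpy as np

    def gini(y):
        """Gini impurity of a set of class labels (used as the purity measure)."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y):
        """Greedy 'divide': try every feature/threshold pair and keep the split
        with the lowest weighted impurity of the two resulting parts."""
        best, best_score = None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                if left.all() or (~left).all():
                    continue  # this threshold does not actually divide the data
                score = (left.sum() * gini(y[left])
                         + (~left).sum() * gini(y[~left])) / len(y)
                if score < best_score:
                    best, best_score = (j, t), score
        return best

    def grow_tree(X, y, depth=0, max_depth=3):
        """'Conquer': recurse on each part produced by the best split."""
        split = best_split(X, y)
        # Stop when the node is pure, no useful split exists, or the tree is deep.
        if len(np.unique(y)) == 1 or split is None or depth >= max_depth:
            values, counts = np.unique(y, return_counts=True)
            return {"predict": values[np.argmax(counts)]}  # leaf: majority class
        j, t = split
        left = X[:, j] <= t
        return {"feature": j, "threshold": t,
                "left": grow_tree(X[left], y[left], depth + 1, max_depth),
                "right": grow_tree(X[~left], y[~left], depth + 1, max_depth)}

    # Tiny demonstration on invented data: two predictors, a binary class.
    X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
    y = np.array([0, 0, 1, 1])
    print(grow_tree(X, y))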

At each "divide" step, a tree learning algorithm does a "greedy search," i.e. searches over all possible splits and picks the best one in terms of some purity measure. This "greedy" search makes sure that at each step, we do the best we can in the hope of shortening the whole "divide" process and finally enabling the "conquer" to give better results. Notice that this greedy approach does not necessarily give an optimum global solution although it is locally optimal.

Perlich and Simonoff (2003) showed that the splitting measure affects the size of the final classification tree but does not have a significant impact on classification accuracy. They argued, backed by their experimental results, that even a random selection of the splitting variables at the nodes would give comparable classification accuracy ...
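
The sketch below is not a reproduction of those experiments, but it illustrates the kind of comparison involved: the same synthetic data are fitted with two different splitting measures (Gini impurity and entropy), and the resulting tree sizes and test accuracies are printed side by side.

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                               random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    for criterion in ("gini", "entropy"):
        tree = DecisionTreeClassifier(criterion=criterion, random_state=1)
        tree.fit(X_train, y_train)
        print(criterion,
              "| nodes:", tree.tree_.node_count,
              "| test accuracy:", round(accuracy_score(y_test, tree.predict(X_test)), 3))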