Mengjie Zhang, Genetic Programming for Classification Tasks with Highly Unbalanced Data


  • Overview:

    This project concerns the development of fitness functions in Genetic Programming (GP) for classification tasks in which the data are unbalanced across classes. Classification tasks arise in a wide variety of practical situations. Diagnosing medical conditions from medical imaging, recognising words in streams of speech, and identifying fraudulent financial transactions are just three examples. Given the amount of data that needs to be classified, computer-based solutions to many of these tasks would be of immense social and economic value.

    Genetic Programming (GP) is a promising approach for building reliable classification programs quickly and automatically, given only a set of example data on which a program can be evaluated. GP uses ideas analogous to biological evolution to search the space of possible programs to evolve a good program for a particular task. GP has been applied to a range of classification tasks with some success.

    Current GP approaches to classification usually use classification accuracy or error rate as the fitness function that drives evolution toward a solution. For tasks whose instances are distributed relatively evenly across classes, this fitness function is quite reasonable. For tasks with highly uneven classes, however, it frequently leads to a strong performance bias: a high accuracy rate on the majority class but a very low rate on the minority classes. Furthermore, it does not take the complexity of the evolved programs into account, which often results in very large programs that are difficult to understand.
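
    The bias described above can be illustrated with a small sketch. The data here are hypothetical (990 majority instances and 10 minority instances, not taken from this project), and `class_accuracies` is an illustrative helper rather than part of any GP system. A trivial classifier that always predicts the majority class scores very well on overall accuracy while never recognising a minority instance.

```python
def class_accuracies(y_true, y_pred):
    """Return overall accuracy and a dict of per-class accuracy."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for c in set(y_true):
        # Indices of all instances whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return overall, per_class


# Hypothetical unbalanced data: class 0 is the majority, class 1 the minority.
y_true = [0] * 990 + [1] * 10
# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * 1000

overall, per_class = class_accuracies(y_true, y_pred)
print(overall)    # 0.99
print(per_class)  # {0: 1.0, 1: 0.0}
```

    Under an accuracy-only fitness function, such a program would look nearly perfect, which is why evolution can settle on it.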

  • Tasks and Goals of this Project

    The goal of this project is to investigate fitness functions in genetic programming that can deal effectively with classification tasks in which the data are highly unbalanced across classes. Specifically, this project will investigate:

    • How to effectively organise the training and test data for different classes;
    • How to balance the overall classification accuracy against the performance on the minority class in the fitness function;
    • How to balance the effectiveness measure against the program complexity measure in the fitness function.
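
    As one possible starting point for the first question, the sketch below splits the data class by class so that the minority class appears in both the training and the test set in roughly its original proportion. This is a common stratified split, not necessarily the organisation this project will adopt; the function name `stratified_split`, the 30% test fraction, and the example labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split instance indices so that each class keeps roughly the
    same proportion in the training and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        if len(indices) > 1:
            # Keep at least one test instance for every class.
            n_test = max(1, round(len(indices) * test_fraction))
        else:
            n_test = 0  # a single instance can only go to training
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Hypothetical labels: 990 majority (0) and 10 minority (1) instances.
labels = [0] * 990 + [1] * 10
train, test = stratified_split(labels)
minority_in_test = sum(labels[i] == 1 for i in test)
print(len(train), len(test), minority_in_test)
```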

    We expect this new fitness function to enable GP to be used successfully on a wider set of classification tasks, to improve the effectiveness of the evolved classifiers, and to generate programs that are easier to interpret.
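
    To make the trade-offs concrete, here is a minimal sketch of one shape such a fitness function could take: a weighted sum of overall accuracy and minority-class accuracy, minus a penalty proportional to program size. It is not the fitness function developed in this project. The weights `w_minority` and `w_size`, the use of a node count as the complexity measure, and the example programs are all assumptions chosen for illustration.

```python
def fitness(y_true, y_pred, program_size, minority_class,
            w_minority=0.5, w_size=0.001):
    """Hypothetical fitness (higher is better): a weighted mix of overall
    and minority-class accuracy, minus a program-size penalty."""
    minority_idx = [i for i, t in enumerate(y_true) if t == minority_class]
    if not minority_idx:
        raise ValueError("no instances of the minority class in y_true")

    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    minority_acc = (sum(y_pred[i] == minority_class for i in minority_idx)
                    / len(minority_idx))

    effectiveness = (1 - w_minority) * overall + w_minority * minority_acc
    # program_size is assumed to be the node count of the evolved program.
    return effectiveness - w_size * program_size


# Hypothetical data: 990 majority (0) and 10 minority (1) instances.
y_true = [0] * 990 + [1] * 10

# Program A: tiny, always predicts the majority class.
pred_a = [0] * 1000
# Program B: larger, misclassifies 50 majority and 2 minority instances.
pred_b = [1] * 50 + [0] * 940 + [1] * 8 + [0] * 2

fit_a = fitness(y_true, pred_a, program_size=1, minority_class=1)
fit_b = fitness(y_true, pred_b, program_size=40, minority_class=1)
print(fit_a < fit_b)  # True
```

    With these assumed weights, program B is preferred even though its overall accuracy is lower and its size is larger, because it recognises most minority instances. How the weights should be set is exactly the kind of question listed above.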

  • More information is available on Meng's research projects page. Related publications are listed under Meng's recent publications. Contact me if you would like a copy of those papers or more detail about this project.