Mengjie Zhang, Evolutionary Data Mining for Big Data


Data mining tasks arise in a wide variety of practical situations and include classification, regression, clustering, and optimisation. Applications range from the military domain, such as detecting F-15 aircraft and tanks in satellite images, through the economic domain, such as finding association rules for retailers and predicting the GDP or CPI of a nation or region, and the engineering domain, such as network intrusion detection and pattern matching in signal processing, to daily life, such as postal code recognition, human face detection and security control. The problem domains span Computer Science, Network Engineering, Software Engineering and Electronic Engineering.

The "big data" aspect of this project means that the problems involve a huge number of instances (examples) and/or a huge number of features (input variables), not all of which are useful: the datasets contain many irrelevant and redundant features. Evolutionary computation algorithms such as genetic programming (GP), particle swarm optimisation (PSO) and differential evolution (DE) are powerful methods that can automatically learn or evolve multiple good solutions to a particular problem, and they have been used successfully to solve data mining tasks with large numbers of features and instances.
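
For readers new to these methods, the following is a minimal sketch of one widely used DE variant (DE/rand/1/bin), applied to a toy minimisation problem. It is only an illustration, not one of the algorithms to be developed in the project; the function names, parameter values and the sphere test function are assumptions chosen for the example.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=100, seed=0):
    """Minimise `objective` inside the box `bounds` using DE/rand/1/bin.

    All parameter values are illustrative defaults; pop_size must be >= 4.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, none of them the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # ensures at least one mutated gene
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if j == j_rand or rng.random() < cr:
                    # Mutation: base vector plus a scaled difference vector.
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][j]  # binomial crossover keeps the parent gene
                trial.append(v)
            trial_score = objective(trial)
            # Greedy selection: the trial replaces the target if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def sphere(x):
    """Toy objective (assumed for illustration): sum of squares, minimum at 0."""
    return sum(v * v for v in x)

best_x, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 5)
print(best_score)
```

Because selection is greedy per individual, the best score in the population never gets worse from one generation to the next.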

The project aims to develop and investigate new methods and algorithms using GP/PSO/DE for data mining tasks such as classification, regression and optimisation. Specifically, at least one of the following research topics will be considered in the project:

  1. Develop new representations and structures of computer programs in the population that GP can more effectively evolve and that are more suitable for feature selection and construction in symbolic regression tasks;
  2. Develop new methods and algorithms using GP/PSO/DE for automatically selecting important features from a large number of dimensions of low-level features and/or constructing a small number of high-level features from the relevant low-level features for classification tasks;
  3. Develop new efficient algorithms using GP/PSO/DE that can effectively use/select a small number of instances for classification, regression or optimisation tasks;
  4. Develop new GP/PSO/DE algorithms to deal with classification/regression tasks with missing data;
  5. Develop new GP/PSO/DE algorithms that can scale well for big data; or
  6. Develop new algorithms that can transfer knowledge from one domain to a different but related domain.
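
To give a flavour of topic 2, the sketch below uses a standard binary PSO with a sigmoid transfer function to select features for a simple classifier. Everything in it is an assumption made for illustration: the synthetic dataset (two informative features plus eight noise features), the nearest-centroid classifier, the weighted fitness function and all parameter values. It does not represent the methods the project would develop.

```python
import math
import random

def make_data(rng, n=200, n_noise=8):
    """Synthetic two-class data: features 0 and 1 are informative, the rest noise."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randrange(2)
        informative = [rng.gauss(2.0 * y, 1.0), rng.gauss(-2.0 * y, 1.0)]
        rows.append(informative + [rng.gauss(0.0, 1.0) for _ in range(n_noise)])
        labels.append(y)
    return rows, labels

def accuracy(mask, train, valid):
    """Nearest-centroid accuracy using only the features switched on in `mask`."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    (xtr, ytr), (xva, yva) = train, valid
    centroids = {}
    for label in set(ytr):
        members = [x for x, y in zip(xtr, ytr) if y == label]
        centroids[label] = [sum(x[j] for x in members) / len(members) for j in cols]

    def dist(x, centre):
        return sum((x[j] - m) ** 2 for j, m in zip(cols, centre))

    correct = sum(min(centroids, key=lambda c: dist(x, centroids[c])) == y
                  for x, y in zip(xva, yva))
    return correct / len(yva)

def fitness(mask, train, valid, alpha=0.9):
    """Weighted sum of accuracy and fraction of features removed (higher is better)."""
    return alpha * accuracy(mask, train, valid) + (1 - alpha) * (1 - sum(mask) / len(mask))

def binary_pso(train, valid, n_features, swarm=15, iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=1):
    rng = random.Random(seed)
    pos = [[rng.randrange(2) for _ in range(n_features)] for _ in range(swarm)]
    vel = [[0.0] * n_features for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, train, valid) for p in pos]
    g = max(range(swarm), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(swarm):
            for j in range(n_features):
                vel[i][j] = (w * vel[i][j]
                             + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                             + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, vel[i][j]))  # velocity clamping
                # Sigmoid transfer: the velocity sets the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i], train, valid)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

data_rng = random.Random(0)
X, y = make_data(data_rng)
train, valid = (X[:150], y[:150]), (X[150:], y[150:])
mask, score = binary_pso(train, valid, n_features=10)
print(mask, score)
```

The fitness function trades classification accuracy against subset size through the weight `alpha`; this is one simple way to encourage small feature subsets, and a multi-objective formulation would be an alternative.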

A strong background in Java/C/C++ programming and a basic background in Artificial Intelligence and statistics are required. A good background in machine learning, databases, and operations research is desirable (COMP307, SWEN304, COMP361).

This project will be co-supervised by Dr Bing Xue. The School has a good international reputation in the field and would like to maintain this momentum. Please see http://homepages.ecs.vuw.ac.nz/~mengjie/papers/index.shtml, http://ecs.victoria.ac.nz/Main/MengjieZhang, and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.