Evolutionary Computation for Feature Selection and Instance Selection in Large-Scale Classification

Classification is one of the most important tasks in real-world applications such as medical diagnosis, speech recognition, and object recognition. A small error in the classification process can have a huge impact (e.g. an inaccurate classification of a disease can kill a patient, and an inaccurate classification of objects by a missile control system might miss an enemy aircraft). However, many classification tasks involve a large number of features and instances, so existing classification techniques take a long time to train a classifier and achieve poor classification performance. Solving such tasks typically requires feature selection and/or instance selection as a pre-processing step to reduce the size of the data. This project focuses on developing new feature selection and instance selection algorithms based on evolutionary computation techniques. The following objectives will be considered in this project: (1) develop a new evolutionary feature selection algorithm to select a small subset of informative features to reduce the dimensionality of the data, (2) develop a new evolutionary instance selection algorithm to select a small set of representative instances to describe the task, and (3) propose a new evolutionary feature and instance selection approach that can reduce the size of the data, improve the representation power of the data, increase the classification accuracy, and reduce the processing time.
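As a rough illustration of what evolutionary feature selection can look like, the sketch below uses a simple genetic algorithm over bit masks, where bit j says whether feature j is kept. It is a minimal sketch under stated assumptions, not the method this project will develop: the synthetic data, the leave-one-out 1-nearest-neighbour fitness, the size penalty, and all function names (make_data, loo_1nn_accuracy, fitness, ga_feature_selection) are hypothetical choices made for the example.

```python
import random

def make_data(n=40, n_features=8, seed=0):
    """Synthetic two-class data (assumption): only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 3.0 * label
        row[1] -= 3.0 * label
        X.append(row)
        y.append(label)
    return X, y

def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out 1-nearest-neighbour accuracy using only the selected features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        best_d, best_label = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if best_d is None or d < best_d:
                best_d, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)

def fitness(X, y, mask, penalty=0.01):
    """Wrapper fitness: accuracy minus a small cost per selected feature."""
    return loo_1nn_accuracy(X, y, mask) - penalty * sum(mask)

def ga_feature_selection(X, y, pop_size=12, generations=10, seed=1):
    rng = random.Random(seed)
    n = len(X[0])
    # Random masks plus the all-features mask as a baseline individual.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    pop.append([1] * n)
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(X, y, m), reverse=True)
        next_pop = [scored[0][:]]  # elitism: the best mask survives unchanged
        while len(next_pop) < pop_size:
            # Truncation selection: parents come from the better half.
            p1, p2 = rng.sample(scored[: pop_size // 2], 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability 1/n per bit.
            child = [b ^ (rng.random() < 1.0 / n) for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=lambda m: fitness(X, y, m))

if __name__ == "__main__":
    X, y = make_data()
    best = ga_feature_selection(X, y)
    print(best, fitness(X, y, best))
```

Because the all-features mask is in the initial population and the best individual is always carried over, the returned mask is never worse than using every feature under this fitness. The same bit-mask encoding could in principle be applied to instances instead of features.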

Good programming skills and a background in evolutionary computation and classification are preferred. It is also desirable if you have already completed COMP307. This project will be co-supervised with Prof. Mengjie Zhang. Please check http://ecs.victoria.ac.nz/Main/BingXue and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.

Evolutionary Transfer Learning for Data Mining

One of the major differences between traditional machine learning and human learning is our innate ability to transfer knowledge from one problem domain to another. Human learners appear to have inherent ways to transfer knowledge between tasks. That is, we recognise and apply relevant knowledge from previous learning experiences when we encounter new tasks. The more related a new task is to our previous experience, the more easily we can master it. Common machine learning algorithms, in contrast, traditionally address isolated tasks and must complete a learning process from scratch on each new task. Transfer learning is "the ability of a system to recognise and apply knowledge and skills learned in previous tasks to novel tasks", with the goals of increasing the fitness of the obtained model on the target domain and decreasing the computational effort required to train the target model, even when there is limited or no labelled data in the target domain.

Evolutionary Computation (EC) based learning and optimisation has shown promising performance for data mining. EC approaches such as genetic programming (GP) have shown potential for transfer learning, since GP is able to automatically generate computer programs for solving a given task. Because GP is problem-independent and has a flexible representation, it provides a natural way of automatically finding useful knowledge/information during the learning or evolutionary process, which is called building blocks, extracted knowledge, or blocks of knowledge (represented by sub-trees). Meanwhile, GP and other EC methods such as particle swarm optimisation have shown success in feature selection and construction in traditional machine learning. Feature transformation is a major way to tackle transfer learning, but such evolutionary feature transformation methods have not been investigated in transfer learning.

This project focuses mainly on developing new evolutionary computation approaches to transfer learning. The following objectives will be considered in this project: (1) develop a method for discovering useful knowledge or building blocks for transfer learning from the source domain, (2) develop a new mechanism to find the shared information represented by features between the source and target domains, and (3) develop methods for deciding when and how to transfer the identified knowledge to the target domain. The proposed approaches are expected to provide a good initialisation for learning in the target domain, faster convergence, and better final accuracy in the target domain. Good programming skills and a background in evolutionary computation and classification are preferred. It is also desirable if you have already completed COMP307.

This project will be co-supervised with Prof. Mengjie Zhang. Please check http://ecs.victoria.ac.nz/Main/BingXue, and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.

Evolving Deep Neural Networks for Image Analysis

Deep convolutional neural networks (CNNs) have demonstrated their exceptional superiority in visual recognition tasks, such as image classification, traffic sign recognition, and biological image segmentation. However, the main challenges in CNNs are (1) the manual design of the architecture, which requires experts with rich domain knowledge who are often not available, (2) the huge number of parameters to tune, and (3) the massive amount of computational power required, which is not available to many researchers. Evolutionary computation methods have been applied successfully to neural networks for at least a decade, but they do not scale well to modern deep neural networks because of the complicated architectures and large numbers of connection weights.

This project aims to develop new methods using evolutionary computation, such as differential evolution and particle swarm optimisation, to evolve deep CNNs for image classification tasks. The following objectives will be considered in this project: (1) develop a new flexible variable-length representation to effectively encode the large number of parameters in a deep CNN, (2) develop a new evaluation method to quickly evaluate the performance of an evolved CNN, and (3) develop an instance (i.e. training image) selection method that uses only a set of representative instances to speed up the evolutionary process. The proposed approaches are expected to increase the classification performance of deep CNNs over the state of the art, reduce the computational cost, and simplify the learnt CNN architectures.
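To give a feel for what a variable-length representation might look like, the sketch below encodes a CNN as a list of layer "genes" of arbitrary length and decodes it far enough to count trainable weights. This is only one possible encoding, assumed for illustration: the gene formats ("conv", filters, kernel), ("pool",) and ("fc", units), the stride and padding choices, and the function name count_parameters are all hypothetical, and fully connected genes are assumed to come after all convolution and pooling genes.

```python
def count_parameters(genome, input_shape=(28, 28, 1)):
    """Decode a variable-length list of layer genes and count trainable weights.

    Assumptions: convolutions use stride 1 with 'same' padding, so they keep
    height and width; pooling is 2x2 and halves height and width.
    """
    height, width, channels = input_shape
    flat = None  # number of inputs to the next fully connected layer
    total = 0
    for gene in genome:
        kind = gene[0]
        if kind == "conv":
            _, filters, kernel = gene
            # kernel*kernel weights per input/output channel pair, plus biases
            total += kernel * kernel * channels * filters + filters
            channels = filters
        elif kind == "pool":
            height, width = height // 2, width // 2
        elif kind == "fc":
            units = gene[1]
            if flat is None:
                flat = height * width * channels  # flatten the feature maps
            total += flat * units + units
            flat = units
        else:
            raise ValueError(f"unknown layer gene: {gene!r}")
    return total

if __name__ == "__main__":
    genome = [("conv", 8, 3), ("pool",), ("conv", 16, 3), ("pool",), ("fc", 10)]
    print(count_parameters(genome))
```

A list-based genome like this can grow or shrink under crossover and mutation, which is the property a fixed-length vector lacks. A parameter count is also one cheap proxy an evolutionary search could use when penalising overly complex architectures.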

Good programming skills and a background in evolutionary computation and classification are preferred. It is also desirable if you have already completed COMP307. This project will be co-supervised with Prof. Mengjie Zhang and Dr Yanan Sun. Please check http://ecs.victoria.ac.nz/Main/BingXue and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.

Evolutionary Computation for High-Dimensional Clustering in Big Data

The world continues to generate a large amount of data daily, creating a pressing need for new efforts to deal with the challenges brought by Big Data, where the difficulty of analysing data with a large number (thousands to millions) of features/attributes, i.e. the “curse of dimensionality”, has rendered many state-of-the-art methods ineffective. However, when addressing volume in Big Data analytics, researchers have largely taken a one-sided view of volume, namely the “Big Instance Size” factor of the data. The flip side of volume, i.e. the dimensionality factor of Big Data, has received much less attention.

Data clustering is one of the most popular techniques in data analytics, which is a process of partitioning an unlabeled dataset into groups, where each group contains objects which are related to each other with respect to a certain similarity measure and different from those of other groups. The applicability of clustering is manifold, ranging from image processing and bioinformatics through to document categorisation and web mining.

High-dimensional clustering is increasingly common in Big Data, but classical clustering techniques exhibit disappointing behaviour in high-dimensional spaces due to "the curse of dimensionality". As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes less meaningful and the average density of points anywhere in the data is likely to be low. Meanwhile, many features may be redundant and/or uninformative, and using many features increases the search space, confuses clustering algorithms, and makes it hard to interpret the formed clusters. Furthermore, high dimensionality often also means that more computational resources are required.
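The claim that distances become less meaningful can be checked with a small experiment. The sketch below is an illustration under assumed settings (uniformly random points in the unit hypercube, Euclidean distance, and a hypothetical function name relative_contrast): it measures how much farther the farthest point is than the nearest point, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (farthest - nearest) / nearest distance from one random query
    point to n_points uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    for dims in (2, 10, 100, 500):
        print(dims, round(relative_contrast(200, dims), 3))
```

In low dimensions the nearest neighbour is much closer than the farthest point, so the ratio is large. As the dimensionality grows the ratio shrinks towards zero, which is why a similarity measure designed for high-dimensional spaces is one of the objectives listed for this project.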

Evolutionary computation (EC) includes a group of powerful techniques. Because of their powerful search ability, lack of assumption about the data, lack of requirement of domain information, and/or flexible representation, EC methods (such as genetic programming and particle swarm optimisation) have been successfully used in many areas, particularly clustering, classification, dimensionality reduction, and large-scale optimisation.

This project focuses on developing novel methods for high-dimensional clustering based on EC techniques with the goal of increasing the clustering performance, improving the interpretability, and reducing the computational time. Specifically, the following objectives will be investigated: (1) compare different performance measures/indicators on different types of clustering problems using both real-world and synthetic datasets to find the most suitable measures on each type of clustering problem; (2) propose a new similarity measure that is able to evaluate the similarity between data points in a high-dimensional space; (3) develop a new method for automatically and simultaneously removing unnecessary features and performing clustering; and (4) investigate a visualisation method for analysing the clusters and improving the interpretability of the clusters.

Good programming skills and a background in evolutionary computation and classification are preferred. It is also desirable if you have already completed COMP307. This project will be co-supervised with Prof. Mengjie Zhang and Andrew Lensen. Please check http://ecs.victoria.ac.nz/Main/BingXue and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.

Evolutionary Computation for Document Classification

Document classification aims to assign a document to one or more classes or categories, making it easier to manage and sort. The documents to be classified may be texts, images, music, etc. Each kind of document has its own particular classification problems. This project will focus on text classification.

Normally, document classification includes the stages of pre-processing, feature extraction, model building, and document class prediction. In the pre-processing stage, some words, such as "and", are filtered out. Then a number of features are extracted from each document; the number of occurrences of each word in a text document is a general and basic feature. Since the number of distinct words is large and some words have very low frequencies, the space of frequency-based bag-of-words features is large and sparse. Feature selection techniques are generally required to select only the most useful features. The selected features are used to train a model with a learning algorithm, such as a support vector machine, and the trained model is then used for document class prediction.
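The pre-processing and feature extraction stages described above can be sketched in a few lines. This is a minimal illustration only: the stop-word list, the tokenisation rule (lower-case alphabetic runs), and the function names tokenize and bag_of_words are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); real systems use much longer lists.
STOP_WORDS = {"and", "the", "a", "of", "to", "in", "is"}

def tokenize(text):
    """Lower-case the text, split it into alphabetic words, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    occurrences of vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

if __name__ == "__main__":
    vocab, rows = bag_of_words(["The cat and the dog", "A dog is a dog"])
    print(vocab)
    print(rows)
```

Each row of the matrix is the feature vector of one document. With a realistic vocabulary most entries are zero, which is the sparsity that motivates feature selection before model building.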

Genetic programming (GP) is an evolutionary computation approach that is able to automatically generate computer programs for solving a given task. GP uses a population of trees to address a task, where each tree takes data of the problem as terminals and uses operators, such as mathematical or logic operators, as internal nodes to evolve tree-based models. Therefore, GP can be used to automatically extract features, select features, and evolve classification models in a single process. However, the potential of GP for document classification has not been fully investigated.
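The tree representation can be made concrete with a small sketch. Here a GP individual is written as nested tuples, with operators as internal nodes and feature names or constants as terminals. The representation, the operator set, the decision rule (output greater than zero means "positive"), and the function names are assumptions for illustration, not the system this project will build.

```python
import operator

# Operator set for internal nodes (assumption for this sketch).
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, features):
    """Evaluate a tree: tuples are internal nodes (op, left, right),
    strings are feature terminals, numbers are constant terminals."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, features), evaluate(right, features))
    if isinstance(tree, str):
        return features[tree]
    return tree

def used_features(tree):
    """Features that appear as terminals, i.e. the features the tree selected."""
    if isinstance(tree, tuple):
        return used_features(tree[1]) | used_features(tree[2])
    return {tree} if isinstance(tree, str) else set()

def classify(tree, features):
    """Binary decision from the tree output (hypothetical threshold at zero)."""
    return "positive" if evaluate(tree, features) > 0 else "negative"

if __name__ == "__main__":
    tree = ("-", ("*", "f1", "f3"), 2.0)
    document = {"f1": 3.0, "f2": 100.0, "f3": 1.0}
    print(evaluate(tree, document), classify(tree, document), used_features(tree))
```

The example tree never refers to f2, so evolving the tree also performs feature selection implicitly, and the sub-tree ("*", "f1", "f3") acts as a constructed feature. This is the sense in which feature selection, feature construction, and model building can happen in a single process.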

This project focuses on investigating the use of GP for document classification with the goal of improving the classification accuracy. Specifically, the following objectives will be investigated: (1) develop a GP-based feature selection method that uses features extracted by a bag-of-words model (or other methods) to select important features, reducing the dimensionality and improving the classification accuracy; (2) develop a value-based GP system that uses extracted features as input to perform simultaneous feature selection and classifier building; and (3) develop a rule-based GP system that uses the original data as terminals to automatically extract features, select important features, and build a classification model in a single process.

Good programming skills and a background in evolutionary computation and classification are preferred. It is also desirable if you have already completed COMP307. This project will be co-supervised with Dr Xiaoying Gao. Please check http://ecs.victoria.ac.nz/Main/BingXue and http://ecs.victoria.ac.nz/Main/XiaoyingGao for publications and other information.

Genetic Programming for Software/Program Testing

Search Based Software Engineering is an approach to software engineering in which search-based optimization algorithms are used to identify optimal or near-optimal solutions and to yield insight. Software testing is a major application area of Search Based Software Engineering and an essential part of the software development process, where the quality of the test data set plays a critical role in the success of the testing activity. Manually generating test data is time-consuming, error-prone and complex. To avoid such problems, automatic generation of test data is necessary to improve the performance and reduce the time and cost.

Genetic programming (GP) is an evolutionary learning and optimisation technique that has been used for many real-world applications. GP can deal with different types of data/variables, such as continuous, categorical (e.g. binary), and ordinal data, which makes it a promising approach to automatic test data generation.

The goal of this project is to propose a GP approach to automatic software test data generation. The following objectives can be investigated: (1) investigate complex programs or software as benchmarks, either from the literature or by designing new ones, (2) develop a new encoding scheme to allow different types of test data to be generated automatically and simultaneously, and (3) develop a multi-objective GP method for simultaneously maximising software testing measures, such as Block or Branch coverage, Path coverage, and Condition/Decision coverage, and minimising the number of test data suites. This project is also open to other topics in Search Based Software Engineering, such as Planning, Prediction, Design, Bug Fixing, Maintenance and Re-Engineering, provided that the student has some background in the area.
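The two competing objectives in (3) can be illustrated with a toy program under test. In the sketch below, the program classify_triangle, the way it records which branch was taken, and the objectives function are all hypothetical and chosen only for the example. It scores a test suite by the fraction of branches it exercises (to be maximised) and by its size (to be minimised).

```python
def classify_triangle(a, b, c, trace):
    """Toy program under test; `trace` collects the ids of the branches taken."""
    if a + b <= c or a + c <= b or b + c <= a:
        trace.add("invalid")
        return "invalid"
    if a == b == c:
        trace.add("equilateral")
        return "equilateral"
    if a == b or b == c or a == c:
        trace.add("isosceles")
        return "isosceles"
    trace.add("scalene")
    return "scalene"

ALL_BRANCHES = {"invalid", "equilateral", "isosceles", "scalene"}

def objectives(test_suite):
    """Return (branch coverage, suite size) for a list of (a, b, c) inputs."""
    trace = set()
    for a, b, c in test_suite:
        classify_triangle(a, b, c, trace)
    return len(trace) / len(ALL_BRANCHES), len(test_suite)

if __name__ == "__main__":
    print(objectives([(3, 3, 3), (3, 4, 5)]))
    print(objectives([(1, 1, 5), (2, 2, 2), (2, 2, 3), (3, 4, 5)]))
```

A multi-objective search would evolve candidate suites and keep those for which no other suite has both higher coverage and fewer tests. Real coverage measurement would normally come from an instrumentation tool rather than hand-placed trace calls.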

This project requires a student in Computer Science or Software Engineering with good knowledge of Evolutionary Computation, particularly genetic programming, and machine learning (COMP307). The student should have a strong programming background in Java or C++ (COMP261 and SWEN221). Good programming skills and a background in evolutionary computation and classification are preferred. It is also desirable if you have already completed COMP307. This project will be co-supervised with Prof. Mengjie Zhang. Please check http://ecs.victoria.ac.nz/Main/BingXue and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.

Evolutionary Computation for Data Mining in Big Data

[This project can take up to two students]

Data mining tasks arise in a wide variety of practical situations, ranging from classification to regression, clustering, and optimisation tasks. The applications range from the military domain, such as detecting F-15 aircraft and tanks in a set of satellite images, through the economic domain, such as finding association rules for retail sellers and predicting the GDP or CPI of a nation/region, and the engineering domain, such as network intrusion detection and pattern matching in signal processing, to daily life, such as postal code recognition, human face detection and security control. The problem domains range from Computer Science to Network Engineering, Software Engineering, and Electronic Engineering.

The aspect of "big data" in this project means that there are a huge number of features/input variables in the problems, but not all of them are useful: the datasets contain many irrelevant and redundant features. Evolutionary computation algorithms such as genetic programming (GP) and particle swarm optimisation (PSO) are powerful methods which can automatically learn/evolve multiple good solutions for a particular problem, and have been successfully used to solve data mining tasks with a large number of features.

The project aims to develop and investigate new methods and algorithms using GP/PSO for data mining tasks such as classification, regression and optimisation. Specifically, at least one of the following research topics will be considered in the project:

(1) Develop new representations and structures of computer programs in the population that GP can evolve more effectively and that are more suitable for feature selection and construction in symbolic regression tasks; or

(2) Develop new methods and algorithms using GP and PSO for automatically selecting important features from a large number of dimensions of low-level features and constructing a small number of high-level features from the relevant low-level features for classification tasks; or

(3) Apply GP/PSO to engineering and optimisation applications.

A strong background in Java/C/C++ programming and a basic background in Artificial Intelligence and statistics are required. A good background in machine learning, statistics and operations research is desired (COMP307).

This project will be co-supervised by Dr Bing Xue. The School has a good international reputation in the field and would like to continue the momentum. Please check http://homepages.ecs.vuw.ac.nz/~mengjie/, http://ecs.victoria.ac.nz/Main/MengjieZhang, and http://ecs.victoria.ac.nz/Groups/ECRG/ for publications and other information.

2015: Particle Swarm Optimization for Automated DNA Sequence Design

New advances in molecular biology require the automated design of efficient nanocarriers of functional nucleic acids for intracellular molecular sensors. Currently, biologists rely mainly on manual approaches for designing desirable DNA sequences that can serve as nanocarriers for the purpose of monitoring biological molecules in living cells. In this project, based on existing optimization technologies such as the particle swarm optimization (PSO) algorithm, we seek to develop and implement an effective evolutionary computation system that can, to a large extent, automate the design process. At the same time, we also expect to significantly improve the stability and usability of the DNA sequences discovered through our evolutionary algorithms, in comparison with the traditional manual design method. This project is in collaboration with researchers from the School of Biological Sciences. Your main job is to design, implement and evaluate some PSO algorithms. A widely used DNA sequence analysis tool will be utilized to guide the search for optimal DNA sequences. If successful, our computing technology will help biologists to quickly design new medicines to treat infections and other diseases. To take this project, good programming skills in Java or C++ are essential. It is also desirable if you have already completed COMP307 successfully.

Dr Aaron Chen is the primary supervisor of this project and I will be the co-supervisor. Please check http://ecs.victoria.ac.nz/Main/AaronChen/

2014: Evolutionary Feature Reduction to Large-Scale Classification

Classification is one of the most important and essential processes in many real-world applications such as medical diagnosis, speech recognition, and object recognition. A small error in the classification process can have a huge impact (e.g. an inaccurate classification of a disease can kill a patient, and an inaccurate classification of objects by a missile control system might miss an enemy aircraft). However, many large-scale classification tasks involve a large number of features, so existing classification techniques take a long time to train a classifier and achieve poor classification performance. Solving such tasks typically requires feature reduction as a pre-processing step to reduce the dimensionality of the data. This project focuses on developing new feature reduction algorithms based on evolutionary computation techniques. The following objectives will be considered in this project: (1) develop a new evolutionary feature reduction approach that can quickly select a subset of important features, (2) develop a new evolutionary multi-objective feature reduction approach that incorporates users' preferences to minimise both the number of features and the classification error rate, and (3) develop a feature construction algorithm to construct a small set of new high-level features to improve the classification performance. Good programming skills and a background in machine learning, particularly evolutionary computation and classification, will be useful.
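Objective (2) treats the number of features and the error rate as two objectives to be minimised together. The sketch below shows the standard Pareto dominance test that multi-objective methods are built on; the candidate values and the function names dominates and pareto_front are made up for illustration.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on every objective and
    strictly better on at least one. Solutions are (n_features, error_rate)
    pairs and both objectives are minimised."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]

if __name__ == "__main__":
    # Hypothetical (number of features, error rate) results for five subsets.
    candidates = [(5, 0.10), (3, 0.12), (8, 0.10), (3, 0.20), (10, 0.05)]
    print(pareto_front(candidates))
```

The front contains the trade-off solutions among which a user's preference would choose, for example the smallest subset whose error rate stays under an acceptable threshold.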

2014: Particle Swarm Optimisation and Statistical Clustering for Feature Selection

Feature selection is an important step in machine learning and data mining tasks, such as classification. Feature selection aims to select a small subset of features from the original large feature set, but it is a difficult task due to the large search space and feature interaction problems. Particle swarm optimisation (PSO) is a powerful search technique, and statistical clustering methods can effectively consider feature interaction to group features into different clusters. This project aims to develop a new feature selection approach based on PSO and statistical clustering information. Specifically, this project will focus on (1) developing a new algorithm to select features from different clusters to maximise the classification accuracy, (2) developing a new algorithm to minimise the number of features and simultaneously maximise the classification accuracy based on statistical clustering information, and (3) analysing the interactions between features to further improve the performance. Good programming skills and a background in evolutionary computation (e.g. PSO), classification and feature selection are preferred. This project will be co-supervised with Prof. Mengjie Zhang and Dr Ivy Liu (Statistician).
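One way to picture how PSO and feature clusters could be combined is sketched below. It is a guess at a possible encoding, not the approach the project will develop: each particle has one coordinate per cluster, and that coordinate is decoded into the choice of a single feature from the cluster. The clusters, the relevance scores used as a stand-in fitness, the PSO constants, and the names decode and pso_select are all assumptions.

```python
import random

def decode(position, clusters):
    """Map each coordinate in [0, 1] to one feature index from its cluster."""
    return [cluster[min(int(x * len(cluster)), len(cluster) - 1)]
            for x, cluster in zip(position, clusters)]

def pso_select(clusters, score, n_particles=10, iterations=30, seed=0,
               w=0.7, c1=1.5, c2=1.5):
    """Global-best PSO that maximises score(selected_features)."""
    rng = random.Random(seed)
    dim = len(clusters)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [score(decode(p, clusters)) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Clamp so the position stays inside [0, 1].
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = score(decode(pos[i], clusters))
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return decode(gbest, clusters), gbest_fit

if __name__ == "__main__":
    clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]          # hypothetical clusters
    relevance = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.1, 0.7, 0.5]  # made-up scores
    print(pso_select(clusters, lambda feats: sum(relevance[f] for f in feats)))
```

In a real wrapper approach the score function would be the accuracy of a classifier trained on the selected features rather than a sum of fixed relevance values. Choosing exactly one feature per cluster is also just the simplest case; an encoding could allow zero or several features per cluster.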