Graduate Projects, Peter Andreae

Agents that learn in micro worlds:

(See the general introduction to these projects first)


This is now dated - Maciej Wojnar, Adam Clarke, and James Bebbington are currently working on some of these projects, and their results will shortly give rise to new projects. These projects address different aspects of the larger project of building an agent that learns how to act in a complex world in order to achieve goals and/or obtain reward. All the projects will build on the same world simulator (which has already been written), and should (hopefully) be able to be integrated into a single agent that handles multiple kinds and levels of learning from a variety of sources of knowledge.
The projects will use ideas on representation from a thesis by David Andreae, a previous PhD student at VUW. The project goals are very similar to the goals of a project at M.I.T. The project page contains two interesting papers and the slides of a presentation. It also has a link to a good reading list. An interesting paper out of that project is titled "The Thing that we tried didn't work very well".

We have a world simulator designed for experimenting with learning agents, written by David Gilligan, a former student and research assistant. The simulator is written in Java, and provides all the classes for simulating the actions and perception of the agent in a world. One can experiment with an agent simply by providing a "Robot Brain" class that uses the agent's perception and controls the agent's actions.
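
As a rough illustration of this plug-in structure, a brain might look something like the sketch below. The interface and type names here (RobotBrain, Percept, Action, RandomBrain) are assumptions made for the illustration, not the simulator's actual API.

    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch of the plug-in point for an agent "brain".
    // None of these names are taken from the actual simulator.
    interface Action { }
    interface Percept { List<Action> availableActions(); }

    interface RobotBrain {
        // Called each time step with the agent's current perception;
        // returns the action the agent should perform next.
        Action chooseAction(Percept percept);

        // Called after the action has executed, so a learning brain can
        // observe the outcome and update its knowledge of the world.
        void observeResult(Percept before, Action action, Percept after);
    }

    // The simplest possible brain: act at random, learn nothing.
    class RandomBrain implements RobotBrain {
        private final Random random = new Random();

        public Action chooseAction(Percept percept) {
            List<Action> actions = percept.availableActions();
            return actions.get(random.nextInt(actions.size()));
        }

        public void observeResult(Percept before, Action action, Percept after) { }
    }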

In 2003, Tom Lee constructed a Robot Brain that learns simple STRIPS-style rules that describe the effects of actions in terms of the necessary state of the world before the action and the changes to the state of the world that the action will accomplish. The agent builds these rules by observing the results of performing actions in the world and generalising the observations based on focus of attention. With further observations, it can generalise the rules further. The key ideas underlying this system are the use of heuristics about focus of attention to guide the generalisation, so that the agent can construct generalised rules from a very small number of examples, and the use of symbolic representations of the uncertainty that the agent has about the components of a rule.
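
The flavour of such a rule can be pictured roughly as below. This is a deliberately simplified sketch (ground facts as plain strings), and it omits the symbolic representations of uncertainty that the real system attaches to each component of a rule.

    import java.util.Set;

    // Deliberately simplified sketch of a STRIPS-style rule: the facts that
    // must hold before the action, and the facts the action adds and deletes.
    // The learned rules also carry symbolic uncertainty annotations on each
    // component, which are omitted here.
    record LearnedRule(String action,
                       Set<String> preconditions,   // e.g. "on(cup, table)"
                       Set<String> addEffects,      // facts the action makes true
                       Set<String> deleteEffects) { // facts the action makes false

        // A rule is applicable in a state when all of its preconditions hold.
        boolean applicableIn(Set<String> state) {
            return state.containsAll(preconditions);
        }
    }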

This system does not address the question of how to choose an action - it assumes that some other module (or the teacher) chooses the actions. Tom Lee is currently implementing a simple planner that can use these rules to construct a plan to accomplish a goal, and therefore enable the agent to choose its own actions once it has learned enough about the world.


Learning Reactive Rules from Experience

The STRIPS-style rules describe how the world works, but do not say what the robot should do - using the rules requires some kind of a planner. By extending the rules to include the intentional context of the action (ie, the goal(s) that the action was used for), we can construct rules that the agent can use without a planner. Given a goal, the agent can find rules that were used in the context of similar goals and whose preconditions are satisfied, and apply them in a "reactive" manner. This allows the agent to choose some actions, even when it has not learned enough about the world to construct a plan for its goal.
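
A toy sketch of the idea is given below, using the same simplified string-based representation as above. The exact set-containment matching used here is only a placeholder for the partial, generalised matching that the project is actually about.

    import java.util.List;
    import java.util.Optional;
    import java.util.Set;

    // Toy sketch only: a rule extended with the goal context in which its
    // action was observed, plus a reactive selection step that picks a rule
    // whose goal context matches the current goal and whose preconditions
    // hold in the current state.
    record ReactiveRule(String action,
                        Set<String> goalContext,
                        Set<String> preconditions) { }

    class ReactiveChooser {
        Optional<String> chooseAction(Set<String> goal, Set<String> state,
                                      List<ReactiveRule> rules) {
            return rules.stream()
                    .filter(r -> goal.containsAll(r.goalContext()))      // similar goal context
                    .filter(r -> state.containsAll(r.preconditions()))   // preconditions satisfied
                    .map(ReactiveRule::action)
                    .findFirst();
        }
    }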

The central technology of this project involves generalisation: it must find appropriate partial matches of (symbolic) descriptions of actions, states of the world, and goals in order to construct generalised rules, and to apply appropriate rules to new situations.

This project will build on ideas from classical, symbolic AI learning, including version spaces and the candidate elimination algorithm.
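
As a reminder of the flavour of these techniques, here is a toy version of the specific-to-general half of a version space over conjunctive attribute-value hypotheses. The full candidate elimination algorithm also maintains a general boundary driven by negative examples, which is omitted here; the attribute values are invented for the illustration.

    import java.util.Arrays;

    // Toy sketch of the specific boundary of a version space: a hypothesis is
    // a conjunction of attribute values, with "?" matching anything. Each new
    // positive example is folded in by the least generalisation that still
    // covers it (as in Find-S / the S boundary of candidate elimination).
    class VersionSpaceSketch {
        static String[] generalise(String[] hypothesis, String[] positiveExample) {
            String[] result = hypothesis.clone();
            for (int i = 0; i < result.length; i++) {
                if (!result[i].equals(positiveExample[i])) {
                    result[i] = "?";        // generalise the mismatching attribute
                }
            }
            return result;
        }

        public static void main(String[] args) {
            String[] first  = {"cup", "table", "red"};   // first observation
            String[] second = {"cup", "table", "blue"};  // second observation
            System.out.println(Arrays.toString(generalise(first, second)));
            // prints: [cup, table, ?]
        }
    }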


An Agent that Plans with Weak Rules

Traditional AI planning (from STRIPS to the complete partially ordered planning algorithms) assumed that the planner had a collection of true rules that described the way the world worked, or at least, the way the world responded to an agent's actions. These rules specified what had to be true of the world before the action was applied, and what changes to the world would be made by the action. The rules assumed that the agent was the only active entity in the world, that actions were atomic, and that only one action could be performed at a time. The planners worked backwards from the desired goal towards the initial state of the agent, adding actions to achieve the goals and the preconditions of other actions, producing an entire plan before executing any action. If the rules were unreliable, the plan would fail when executed.
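
For concreteness, a naive goal-regression planner over such rules might look like the sketch below (ground facts as strings, depth-bounded depth-first search, no partial ordering). It is only meant to show the backward-chaining idea, and it inherits all the classical assumptions listed above.

    import java.util.*;

    // Toy sketch of classical backward (regression) planning over STRIPS-style
    // rules: work back from the goal, pick an action that achieves an open goal
    // without destroying another, replace the achieved goals with the action's
    // preconditions, and stop when the remaining goals hold in the initial state.
    class RegressionPlannerSketch {
        record Rule(String action, Set<String> pre, Set<String> add, Set<String> del) { }

        static Optional<List<String>> plan(Set<String> initial, Set<String> goals,
                                           List<Rule> rules, int depthBound) {
            if (initial.containsAll(goals)) return Optional.of(new ArrayList<>());
            if (depthBound == 0) return Optional.empty();
            for (Rule r : rules) {
                if (Collections.disjoint(r.add(), goals)) continue;   // achieves nothing we need
                if (!Collections.disjoint(r.del(), goals)) continue;  // would destroy a goal
                Set<String> regressed = new HashSet<>(goals);
                regressed.removeAll(r.add());
                regressed.addAll(r.pre());
                Optional<List<String>> sub = plan(initial, regressed, rules, depthBound - 1);
                if (sub.isPresent()) {
                    sub.get().add(r.action());   // this action comes after the sub-plan
                    return sub;
                }
            }
            return Optional.empty();
        }
    }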

More sophisticated planners can integrate the execution of a plan with the planning (and replanning) process, and can also handle more complex kinds of rules that deal with duration, concurrent actions, conditional actions, sensing actions, etc. These planners still assume that the rules are correct. However, complete, correct planning for non-trivial goals can be very expensive in planning time.

If an agent is learning about the world from its experience with the world, the rules that it has will mostly be wrong - too general, or too specific, or missing alternatives. Also, it cannot afford to spend a long time planning between each action, so the agent needs to use whatever knowledge it currently has in order to choose its next action.

This project will explore approximate planning algorithms that make plans for an agent whose rules are almost certainly over-specific and possibly wrong. The agent will use a proposed plan to choose its actions, but must monitor the plan as it goes, both to check that the world is responding as expected and to identify rules that can be improved. An important requirement is that the planner must be able to use rules that include uncertain components.
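
One way to picture the execute-and-monitor loop is sketched below. All of the types and method names here (WorldInterface, RuleBase, Planner, Step) are assumptions made for the sketch, not parts of the existing simulator.

    import java.util.List;
    import java.util.Set;

    // Illustrative sketch of plan execution with monitoring: run the proposed
    // plan step by step, compare the observed state with the rule's prediction,
    // and on a mismatch both flag the rule for refinement and replan.
    interface WorldInterface { Set<String> execute(String action); }
    interface RuleBase { void recordSurprise(Step step, Set<String> observed); }
    interface Planner { List<Step> plan(Set<String> state, Set<String> goal, RuleBase rules); }
    record Step(String action, Set<String> predictedFacts) { }

    class MonitoredExecutor {
        void achieve(Set<String> goal, Set<String> state, WorldInterface world,
                     Planner planner, RuleBase rules) {
            while (!state.containsAll(goal)) {
                List<Step> plan = planner.plan(state, goal, rules);
                if (plan.isEmpty()) return;                     // cannot plan yet: fall back to exploring
                for (Step step : plan) {
                    state = world.execute(step.action());
                    if (!state.containsAll(step.predictedFacts())) {
                        rules.recordSurprise(step, state);      // the rule was wrong: refine it
                        break;                                  // abandon the rest of the plan and replan
                    }
                }
            }
        }
    }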

This project will build on ideas from classical AI planning and other AI algorithms, but will use the same rule representation as in the project above.


Choosing Actions for Exploration

Reinforcement learning agents choose their actions to maximise expected reward (or minimise expected penalty). Such agents are appropriate for worlds/tasks with a built-in reward function (eg, landing an airplane), worlds in which the agent has a body with built-in drives, and single-task worlds with a teacher who provides reward and penalty. Reinforcement learning agents are not so appropriate for worlds with constantly varying tasks where the value of an action varies depending on which task the agent is doing. Reinforcement learning is also problematic for very rich worlds with enormous state spaces, because we do not know how to generalise the states in a way that allows effective propagation of reinforcement through actions.

Planning agents choose their actions to achieve an explicit goal, using knowledge of the effects of actions. Such agents are appropriate for worlds with a teacher/master who specifies tasks for the agent, as long as the agent has sufficient knowledge of the way the world works and the effects of its actions.

For other kinds of worlds and tasks, or for agents that do not yet know enough about the world, the agent needs an alternative basis for choosing actions. Exploration is a useful activity for an agent, since by exploring, the agent may learn more about the world. This will let it plan its actions if it is ever given a task. It will also enable it to learn to choose the right actions to obtain reward (or avoid penalty) if the world starts providing reward or penalty.

Random exploration (which is a component of most reinforcement learning algorithms) is not effective in very rich worlds - when there are hundreds of possible actions from any state, a random explorer will rapidly get lost, and may never learn anything of any use. Somehow, the agent needs to constrain its exploration.

This project (or projects - there are several possible variants) will look at other mechanisms for choosing actions for exploration. The principle underlying all the mechanisms is that learning requires repetition of similar (though not identical) situations, so the exploration strategy must be likely to bring the agent back to similar states rather than constantly taking the agent into brand new areas of the world. This could be described as "conservative exploration".

Focus

A first approach to exploration is to guide the choice of action by the "focus of attention".

In a world with many different objects, it is much more likely to be useful for the agent to keep acting on the same object than to switch constantly between objects. The objects (and places) that were involved in the action that the agent has just performed will be part of the agent's focus of attention. The agent should prefer to choose an action involving these objects.

One issue the project must address is how to use the focus of attention. If the agent is too conservative in its action choice, it may simply repeat a short cycle of actions indefinitely - picking a cup off the table, replacing it, picking it up again, etc. The project will explore ways of being conservative without being too conservative.
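
One possible (and very crude) way to combine the focus preference with this need for variety is sketched below: score each available action by its overlap with the focus set, penalise exact repeats of very recent actions, and add a little noise. The FocusAction type, the weights, and the memory length are all assumptions made for the sketch.

    import java.util.*;

    // Crude sketch of focus-guided action choice: prefer actions that involve
    // objects in the current focus of attention, but discourage exact repeats
    // of very recent actions so the agent does not fall into a short cycle.
    record FocusAction(String name, Set<String> objects) { }

    class FocusChooser {
        private final Deque<FocusAction> recent = new ArrayDeque<>();
        private final Random random = new Random();

        // Assumes 'available' is non-empty.
        FocusAction choose(List<FocusAction> available, Set<String> focus) {
            FocusAction best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (FocusAction a : available) {
                long shared = a.objects().stream().filter(focus::contains).count();
                double score = shared                       // reward overlap with the focus
                        - (recent.contains(a) ? 2.0 : 0.0)  // penalise very recent repeats
                        + random.nextDouble() * 0.1;        // small tie-breaking noise
                if (score > bestScore) { bestScore = score; best = a; }
            }
            recent.addLast(best);
            if (recent.size() > 5) recent.removeFirst();    // only a short memory of recent actions
            return best;
        }
    }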

Using the focus of attention may also allow an agent to learn from a teacher without requiring linguistic communication: a teacher could put objects into the agent's focus of attention to lead the agent in exploring the world. For example, the teacher could point at or touch objects and places to draw attention away from the current objects, so that the agent would then choose an action involving the new objects and/or places. This would make the learning agent teachable as well.

Copying

A second approach to exploration is to give an agent a drive to copy.

Humans like to copy other humans. Children's play often consists of copying adults. One possible reason is that the drive to copy makes humans much more teachable - a teacher can simply demonstrate a behaviour, and a child may try to copy it. One interesting feature of this drive to copy is that the child is somehow able to map the observations of the behaviour of another human into his/her own actions that would generate the same (or similar) behaviour. This mapping is not trivial.

The project would explore how to give an agent a "copy" drive and a "repeat" drive. The copy drive would involve mapping the teacher's behaviour to the agent's actions. One possible approach is to focus not on the teacher but on the objects in the world - characterising the teacher's behaviour in terms of a sequence of state changes. If the agent has already learned simple associations between actions and their immediate effects, the agent can work out how to copy the teacher's large-scale behaviour by choosing actions that will bring about the same sequence of state changes in the world. By copying the teacher, the agent can then learn more about the world and also learn about the consequences of long sequences of behaviour.
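
A minimal sketch of that idea, assuming the agent has already learned single-step associations between its own actions and the state changes they cause. The StateChange record and the association map are inventions for the illustration.

    import java.util.*;

    // Illustrative sketch of the "copy" drive: the teacher's demonstration is
    // recorded not as the teacher's movements but as the sequence of state
    // changes it caused; the agent then replays the demonstration by looking
    // up, for each change, an action that it already knows produces that change.
    record StateChange(String fact, boolean becameTrue) { }

    class CopyDrive {
        // Learned associations: which of the agent's own actions produce which change.
        private final Map<StateChange, String> actionFor = new HashMap<>();

        void learnAssociation(String ownAction, StateChange effect) {
            actionFor.put(effect, ownAction);
        }

        // Map an observed demonstration (a sequence of state changes) onto a
        // sequence of the agent's own actions; unknown changes are skipped.
        List<String> planCopy(List<StateChange> demonstration) {
            List<String> actions = new ArrayList<>();
            for (StateChange change : demonstration) {
                String action = actionFor.get(change);
                if (action != null) actions.add(action);
            }
            return actions;
        }
    }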

Repetition

A third approach is to give the agent a drive to repeat what it has done before. Humans not only like to copy other humans, they also like copying themselves - repeating things that they have been successful at, repeating familiar patterns of behaviour, or just doing an action several times until they are satisfied with it. Note that when an agent repeats behaviour, it does not have to be an exact repetition - the agent may introduce small variations into the behaviour.

Repetition seems easy to implement - whenever the agent recognises the current state as the same or similar to one it has been in before, it simply chooses the action that it did before. One challenge is to be able to identify what counts as a "similar" state; another is to work out how to introduce the appropriate level of variation of behaviour that will lead to useful exploration.
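
A minimal sketch of the mechanism, assuming states are sets of ground facts and similarity is measured by fact overlap; the similarity threshold and variation rate are arbitrary choices made for the illustration.

    import java.util.*;

    // Illustrative sketch of the "repeat" drive: remember which action was taken
    // in each past state; when the current state is sufficiently similar to a
    // remembered one, repeat that action, occasionally substituting a random
    // action so the repetition still explores.
    class RepeatDrive {
        private final Map<Set<String>, String> memory = new HashMap<>();
        private final Random random = new Random();

        void remember(Set<String> state, String action) { memory.put(state, action); }

        Optional<String> choose(Set<String> current, List<String> available) {
            for (Map.Entry<Set<String>, String> past : memory.entrySet()) {
                if (similarity(past.getKey(), current) > 0.8) {
                    if (random.nextDouble() < 0.1) {             // small variation
                        return Optional.of(available.get(random.nextInt(available.size())));
                    }
                    return Optional.of(past.getValue());         // repeat what was done before
                }
            }
            return Optional.empty();                             // nothing familiar: some other drive chooses
        }

        private double similarity(Set<String> a, Set<String> b) {
            Set<String> shared = new HashSet<>(a);
            shared.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0 : (double) shared.size() / union.size();
        }
    }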

Novelty

A fourth approach is to give the agent the goal of seeking novelty. Simply seeking states that the agent has not seen before is not useful - it leads either to random search or to deliberately doing the opposite of what it has learned. However, there is a constrained form of novelty (introduced in John Andreae's work on the PURR-PUSS system) that does not have this problem. In the PURR-PUSS system, a novel state is one that surprised the agent - the state was not what the agent had predicted would follow the action. The agent's goal is to get back to a state that was novel, thereby removing its novelty. If the world is boring, then the agent will eventually be able to predict all states, no states will be novel, and the agent will have no goal (and become bored). If the world is rich and interesting, then the agent will constantly (either by chance or by the teacher's design) stumble into states that surprise it. The agent will then set up goals to return to those states.
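
A minimal sketch of the surprise-based bookkeeping, assuming states are sets of ground facts and predictions are stored in a simple lookup table; both are assumptions made for the illustration, and the actual PURR-PUSS system is organised quite differently.

    import java.util.*;

    // Crude sketch of surprise-based novelty: predict the state that should
    // follow each (state, action) pair; an unpredicted outcome is novel and is
    // adopted as a goal; getting back to that state as predicted removes it.
    class NoveltyDrive {
        private final Map<String, Set<String>> predictions = new HashMap<>();
        private final Set<Set<String>> novelGoals = new HashSet<>();

        void observe(Set<String> before, String action, Set<String> after) {
            String key = before + "|" + action;     // crude lookup key for (state, action)
            Set<String> predicted = predictions.get(key);
            if (predicted != null && predicted.equals(after)) {
                novelGoals.remove(after);           // reached as expected: novelty gone
            } else {
                novelGoals.add(after);              // surprising outcome: adopt it as a goal
            }
            predictions.put(key, after);            // update the prediction for next time
        }

        // The transient goals the agent is currently trying to get back to.
        Set<Set<String>> currentGoals() { return novelGoals; }
    }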

There are two desirable features of this "novelty goal":

  • All goal states are only transient goals - the agent does not end up repeating a small behaviour ad infinitum, since once it can reliably get back to the state, it is no longer novel, and therefore no longer a goal.
  • The agent tends to build up a model of the world with many small loops - sequences of actions that return to the same state. The effect of this model is that the agent is likely to have paths in its world model from everywhere to everywhere.
There are several possible projects that combine these different drives in different ways.

Some Links

U Mass EKSL