Focusing on information extraction from semi-structured data, we have
examined thousands of Web pages. Below we summarize the knowledge that is
useful for guiding information extraction. We classify this knowledge into
three categories: general knowledge, domain specific knowledge, and site
specific knowledge.
- General Knowledge. General knowledge is true for most online documents,
if not all of them; that is, the knowledge is both domain independent and
site independent. Typical examples concern the common usage of HTML tags,
for example, what constitutes a table, a paragraph, or a line.
- Domain Specific Knowledge. Domain specific knowledge is true in a
particular domain. The knowledge is site independent; that is, it is
consistent across most, if not all, Web sites that present data in the same
domain. For example, in the real estate domain, each property in an online
advertisement has a suburb where it is located, and the price for renting a
property is usually denoted by a "$", followed by a number and a unit such
as "per week" or "per month". Domain specific knowledge is usually specified
using terms of a specific domain and may not generalize to other domains.
- Site Specific Knowledge. Site specific knowledge is true for a particular
site. To avoid overlap with domain knowledge, we define site specific
knowledge to be domain independent; that is, if a piece of knowledge is true
for this site and also true for this domain, it is classified as domain
knowledge. Site specific knowledge mainly consists of site specific data
formatting conventions. For example, on a particular Web site called
NewsClassifieds, suburb names are printed in all capital letters. Site
specific knowledge is tailored to a specific site and is unlikely to be
consistent with other sites.
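To make the three categories concrete, the following is a minimal sketch in
Python, not the agent's actual implementation. It assumes that general
knowledge can be approximated by a set of HTML tags that end a line, that the
rental price convention can be written as a regular expression, and that the
NewsClassifieds suburb convention means one or more all-capital words. The
names LineSplitter, split_lines, RENT_PATTERN, SUBURB_PATTERN, extract_rent
and extract_suburb, as well as the sample advertisement, are hypothetical.

```python
import re
from html.parser import HTMLParser


class LineSplitter(HTMLParser):
    """General knowledge (assumed form): some HTML tags delimit a line."""

    BREAK_TAGS = {"p", "br", "tr", "li", "div"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._current = []

    def _flush(self):
        # Join buffered text, normalize whitespace, and keep non-empty lines.
        text = " ".join("".join(self._current).split())
        if text:
            self.lines.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)


def split_lines(html):
    """Split an HTML fragment into text lines using the break tags above."""
    parser = LineSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.lines


# Domain specific knowledge (real estate): a rent is a "$", a number,
# and a unit such as "per week" or "per month".
RENT_PATTERN = re.compile(r"\$\s*(\d[\d,]*)\s*(per week|per month)", re.IGNORECASE)

# Site specific knowledge (assumed encoding of the NewsClassifieds
# convention): a suburb name is a run of all-capital words.
SUBURB_PATTERN = re.compile(r"\b[A-Z]{2,}(?: [A-Z]{2,})*\b")


def extract_rent(text):
    """Return (amount, unit) for the first rent found, or None."""
    match = RENT_PATTERN.search(text)
    if match is None:
        return None
    amount = int(match.group(1).replace(",", ""))
    return amount, match.group(2).lower()


def extract_suburb(text):
    """Return the first all-capitals run in the text, or None."""
    match = SUBURB_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical advertisement, invented for illustration only.
page = "<p>KELBURN Sunny two bedroom flat</p><p>$250 per week</p>"
lines = split_lines(page)
suburb = extract_suburb(lines[0])
rent = extract_rent(lines[1])
```

In this sketch only SUBURB_PATTERN would have to change for another site, and
only RENT_PATTERN for another domain, while LineSplitter stays the same.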
This knowledge classification enables knowledge reuse and sharing, and also
gives guidance for agent adaptation. General knowledge is completely
reusable and can be shared across many information extraction tasks. Domain
knowledge can be reused and shared across Web sites in the same domain. Site
specific knowledge is limited to Web pages on the same site. To adapt an
agent to a new domain, new domain specific knowledge is needed. To adapt an
agent to a new site, new site specific knowledge needs to be added.
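The reuse and adaptation rules above can be illustrated with a small sketch.
It assumes that each category of knowledge is stored as a separate layer, so
that adapting an agent means swapping one layer and keeping the others. The
KnowledgeBase class, its method names, and all rule values shown are
hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class KnowledgeBase:
    """Three knowledge layers, kept separate so each can be reused."""

    general: dict = field(default_factory=dict)  # shared by all tasks
    domain: dict = field(default_factory=dict)   # shared within one domain
    site: dict = field(default_factory=dict)     # tied to a single site

    def adapt_to_site(self, site_rules):
        # A new site needs only new site specific knowledge.
        return replace(self, site=dict(site_rules))

    def adapt_to_domain(self, domain_rules):
        # A new domain needs new domain specific knowledge.
        return replace(self, domain=dict(domain_rules))


# Hypothetical rule values, invented for illustration only.
general = {"line_break_tags": ("p", "br", "tr")}
real_estate = {"rent_units": ("per week", "per month")}
news_classifieds = {"suburb_style": "ALL CAPS"}

agent = KnowledgeBase(general, real_estate, news_classifieds)

# Same domain, different site: general and domain layers are reused as-is.
other_site_agent = agent.adapt_to_site({"suburb_style": "Title Case"})

# Different domain: the general layer is still reused.
car_agent = agent.adapt_to_domain({"price_units": ("ono", "neg")})
```

Keeping the layers immutable and separate means an adapted agent shares the
unchanged layers with the original rather than copying them.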
Xiaoying Gao
Tue Dec 11 16:30:56 NZDT 2001