
Information Gathering as Interpretation

Our approach to the IG problem is based on two observations. The first observation is that a significant portion of human IG is itself an intermediate step in a much larger decision-making process. For example, a person preparing to buy a car may search the Web for data to assist in the decision process, e.g., finding out what car models are available and gathering crash test results, dealer invoice prices, reviews, and reliability statistics. In this information search process, the human gatherer first plans to gather information and reasons, perhaps at a superficial level, about the time/quality/cost trade-offs of different possible gathering actions before actually gathering information. For example, the gatherer may know that the Microsoft CarPoint site has detailed and varied information on the models but that it is slow, relative to the Kelley Blue Book site, which has less varied information. Accordingly, a gatherer pressed for time may choose to browse the Kelley site over CarPoint, whereas a gatherer with unconstrained resources may choose to browse-and-wait for information from the slower CarPoint site. Human gatherers also typically use information learned during the search to refine and recast the search process; perhaps while looking for data on the new Honda Accord a gatherer would come across a positive review of the Toyota Camry and would then broaden the search to include the Camry. Thus the human-centric process is both top-down and bottom-up: structured, but also opportunistic. The final result of this semi-structured search process is a decision or a suggestion of which product to purchase, accompanied by the extracted information and raw supporting documents.
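The time/quality trade-off reasoning described above can be sketched as a simple weighted scoring of candidate sources. This is a minimal illustration only: the source names echo the car-shopping example, but the normalized quality and speed scores are assumptions, not actual ratings used by any system described here.

```python
# Illustrative sketch: selecting a gathering source under a
# quality-vs-time trade-off. Scores are assumed, not real data.

def choose_source(sources, quality_weight, time_weight):
    """Pick the source with the best weighted score.

    Each source is (name, quality, speed), with quality and speed
    normalized to [0, 1]; a fast source has speed near 1.
    """
    def score(src):
        _, quality, speed = src
        return quality_weight * quality + time_weight * speed
    return max(sources, key=score)[0]

sources = [
    ("CarPoint", 0.9, 0.3),        # detailed and varied, but slow
    ("KelleyBlueBook", 0.6, 0.9),  # less varied, but fast
]

# A gatherer pressed for time weights speed heavily...
print(choose_source(sources, quality_weight=0.2, time_weight=0.8))  # KelleyBlueBook
# ...while one with unconstrained resources favors quality.
print(choose_source(sources, quality_weight=0.9, time_weight=0.1))  # CarPoint
```

The same weighting idea reappears later in the client's search parameters, where quality importance and time importance are supplied explicitly.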

The second observation that shapes our solution is that WWW-based IG is an instance of the interpretation problem. Interpretation is the process of constructing high-level models (e.g. product descriptions) from low-level data (e.g. raw documents) using feature-extraction methods that can produce evidence that is incomplete (e.g. requested documents are unavailable or product prices are not found) or inconsistent (e.g. different documents provide different prices for the same product). Coming from disparate sources of information of varying quality, these pieces of uncertain evidence must be carefully combined in a well-defined manner to provide support for the interpretation models under consideration.
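The evidence-combination requirement can be made concrete with a small sketch. The certainty-weighted averaging and disagreement penalty below are illustrative assumptions, not the specific combination scheme used in BIG; the point is only that inconsistent evidence (e.g., two documents reporting different prices) must be merged into a single value with an explicit confidence.

```python
# Illustrative sketch: combining inconsistent evidence about one
# attribute (a product's price) from sources of varying quality.
# The combination rule here is an assumption for exposition.

def combine_evidence(observations):
    """observations: list of (value, certainty) pairs, certainty in (0, 1].

    Returns (combined_value, combined_certainty). Agreement among
    high-certainty sources raises confidence; disagreement lowers it.
    """
    total = sum(c for _, c in observations)
    value = sum(v * c for v, c in observations) / total
    # Penalize spread: certainty drops when sources disagree.
    spread = max(v for v, _ in observations) - min(v for v, _ in observations)
    certainty = (total / len(observations)) / (1.0 + spread / max(value, 1e-9))
    return value, certainty

# Two documents provide different prices for the same product.
price, certainty = combine_evidence([(129.0, 0.9), (149.0, 0.5)])
```

The combined price lands between the two reports, weighted toward the more certain source, and the disagreement is reflected in a reduced overall certainty.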

In recasting IG as an interpretation problem, we face a search problem characterized by a generally combinatorially explosive state space. In the IG task, as in other interpretation problems, it is impossible to perform an exhaustive search to gather information on a particular subject, or even in many cases to determine the total number of instances (e.g. particular word processing programs) of the general subject (e.g. word processing) that is being investigated. Consequently, any solution to this IG problem needs to support reasoning about tradeoffs among resource constraints (e.g. the decision must be made in 1 hour), the quality of the selected item, and the quality of the decision process (e.g. comprehensiveness of search, effectiveness of IE methods usable within specified time limits). Because of the need to conserve time, it is important for an interpretation-based IG system to be able to save and exploit information about pertinent objects learned from earlier forays into the WWW. Additionally, we argue that an IG solution needs to support constructive problem solving, in which potential answers (e.g. models of products) to a user's query are incrementally built up from features extracted from raw documents and compared for consistency or suitability against other partially-completed answers.
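The constructive problem-solving requirement can be sketched as incremental merging of extracted features into a partially-completed answer. The field names and merge policy below are illustrative assumptions; the essential behavior is that conflicting evidence is recorded as a source of uncertainty rather than silently discarded.

```python
# Illustrative sketch: building a product model incrementally from
# (field, value, certainty) triples extracted from raw documents.
# The merge policy (keep the more certain value, log conflicts) is
# an assumption for exposition.

def merge_feature(model, field, value, certainty):
    """Fold one extracted triple into a partially-completed product
    model, keeping the more certain value and recording any
    inconsistency as an explicit source of uncertainty."""
    if field not in model:
        model[field] = (value, certainty)
    else:
        old_value, old_certainty = model[field]
        if old_value != value:
            model.setdefault("inconsistencies", []).append(field)
        if certainty > old_certainty:
            model[field] = (value, certainty)
    return model

model = {"name": ("Word 5.0", 1.0)}
merge_feature(model, "price", 129.0, 0.6)  # first evidence for price
merge_feature(model, "price", 149.0, 0.9)  # conflicting, more certain
```

After both merges the model carries the more certain price along with a record that the price evidence was inconsistent, which later processing can choose to resolve by further gathering.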

In connection with this incremental model-building process, an interpretation-based IG solution must also support sophisticated scheduling to achieve interleaved data-driven and expectation-driven processing. Processing for interpretation must be driven by expectations of what is reasonable, but expectations in turn must be influenced by what is found in the data. For example, during a search to find information on word processors for Windows95, with the goal of recommending some package to purchase, an agent finding Excel in a review article that also contains Word 5.0 might conclude, based on IE-derived expectations, that Excel is a competing word processor. However, scheduling of methods to resolve the uncertainties stemming from Excel's missing features would lead to additional gathering for Excel, which in turn would associate Excel with spreadsheet features and would thus change the expectations about Excel (and drop it from the search once enough of the uncertainty is resolved). Where possible, the scheduling should permit parallel invocation of IE methods or requests for WWW documents.
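The Excel example can be distilled into a small data-driven refinement step: a candidate admitted by initial expectations is dropped once its gathered features contradict the expected category. The feature vocabulary and overlap threshold below are illustrative assumptions, not BIG's actual representation.

```python
# Illustrative sketch: expectation-driven admission followed by
# data-driven revision. Feature names and threshold are assumed.

WORD_PROCESSOR_FEATURES = {"spell-check", "mail-merge", "styles"}

def refine_candidates(candidates, min_overlap=1):
    """Keep only candidates whose gathered features still support
    the expectation that they are word processors."""
    kept = []
    for name, features in candidates:
        if len(features & WORD_PROCESSOR_FEATURES) >= min_overlap:
            kept.append(name)
        # else: the data revises the expectation; drop from the search
    return kept

candidates = [
    ("Word 5.0", {"spell-check", "styles"}),
    ("Excel", {"pivot-table", "cell-formula"}),  # spreadsheet features
]
print(refine_candidates(candidates))  # ['Word 5.0']
```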


  
Figure 1: BIG's Problem Solving Control Flow

To illustrate our objective, consider a simple sketch of BIG in action. A simplified control flow view of this BIG sketch is shown in Figure 1. A client is interested in finding a drawing program for Windows95. The client submits goal criteria that describe the desired software characteristics and specifications for BIG's search-and-decide process. The search parameters are: quality importance = 80%, time importance = 20%, a soft time deadline of 20 minutes, and a hard cost limitation of 0. This translates into emphasizing quality over duration, a preference for a response in 20 minutes if possible, and a hard constraint that the search use only free information. The product parameters are: product price: $200 or less, platform: Windows95, usefulness importance rating 100 units, future usefulness rating 25, product stability 100, value 100, ease of use 100, power features 25, enjoyability 100. The client is a middle-weight home-office user who is primarily concerned with using the product today with a minimum of hassles but who also doesn't want to pay too much for power-user features. Upon receipt of the criteria, BIG first invokes its planner to determine what information gathering activities are likely to lead to a solution path; activities include retrieving documents from known drawing program makers such as Corel and MacroMedia as well as from consumer sites containing software reviews, such as the Benchin Web site. Other activities pertain to document processing options for retrieved text; for a given document, there is a range of processing possibilities, each with different costs and different advantages. For example, the heavyweight information extractor pulls data from free-format text, fills templates, and associates certainty factors with the extracted items. In contrast, the simple and inexpensive pattern matcher attempts to locate items within the text via simple grep-like behavior. 
These problem solving options are then considered and weighed by the task scheduler, which performs quality/cost/time trade-off analysis and determines a course of action for BIG. The resulting schedule is then executed; multiple retrieval requests are issued and documents are retrieved and processed. Data extracted from documents at the MacroMedia site is integrated with data extracted from documents at the Benchin site to form a product description object for MacroMedia Freehand. However, when BIG looks for information on Adobe Illustrator at the Benchin site it also comes across products such as the Bible Illustrator for Windows, and creates product description objects for these products as well. After sufficient information is gathered, and the search resources are nearly consumed, BIG compares the different product objects and selects a product for the client. In this case, BIG's data indicates that the ``best'' product is MacroMedia Freehand, though it is specifically the academic version that falls below our client's price threshold. (The regular suggested retail price is $595.) BIG returns this recommendation to the client along with the gathered information and the corresponding extracted data.
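The final selection step of the sketch can be approximated as importance-weighted scoring over completed product objects, subject to the hard price constraint. Only the importance ratings and the $200 limit come from the example above; the scoring rule and the per-product attribute values are illustrative assumptions.

```python
# Illustrative sketch: selecting a product from completed product
# description objects. Importance ratings and the price limit are
# taken from the example; attribute values are assumed.

IMPORTANCE = {
    "usefulness": 100, "future_usefulness": 25, "stability": 100,
    "value": 100, "ease_of_use": 100, "power_features": 25,
    "enjoyability": 100,
}
PRICE_LIMIT = 200.0  # product price: $200 or less (hard constraint)

def select_product(products):
    """Score products by importance-weighted attributes (in [0, 1]),
    discarding any that violate the hard price constraint."""
    eligible = [p for p in products if p["price"] <= PRICE_LIMIT]
    def score(p):
        return sum(IMPORTANCE[attr] * p.get(attr, 0.0) for attr in IMPORTANCE)
    return max(eligible, key=score)["name"] if eligible else None

products = [
    {"name": "Freehand (academic)", "price": 149.0, "usefulness": 0.9,
     "stability": 0.9, "value": 0.8, "ease_of_use": 0.8, "enjoyability": 0.8},
    {"name": "Freehand (retail)", "price": 595.0, "usefulness": 0.9,
     "stability": 0.9, "value": 0.8, "ease_of_use": 0.8, "enjoyability": 0.8},
    {"name": "Bible Illustrator", "price": 59.0, "usefulness": 0.3,
     "stability": 0.7, "value": 0.5, "ease_of_use": 0.6, "enjoyability": 0.2},
]
```

With these assumed values, the retail Freehand is ruled out by the price constraint and the academic version wins on weighted score, mirroring the outcome in the sketch.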

Though the sketch above actually exposes one of the problem areas of BIG's text processing, namely identifying special versions of products, it also illustrates one of the cornerstones of our approach to the information explosion: we believe that merely retrieving relevant documents is not a viable end solution. The next generation of information systems must use the information to make decisions and thus provide a higher-level client interface to the enormous volume of on-line information. Our work is related to other agent approaches [16] that process and use gathered information, such as the WARREN [6] portfolio management system, the original BargainFinder [11] agent, or Shopbot [8], the latter two of which work to find the best available price for a music CD. However, our research differs in its direct representation of, and reasoning about, the time/quality/cost trade-offs of alternative ways to gather information, its ambitious use of gathered information to drive further gathering activities, its bottom-up and top-down directed processing, and its explicit representation of sources-of-uncertainty associated with both inferred and extracted information.


Thomas A. Wagner
1/26/1998