The second observation that shapes our solution is that WWW-based IG is an instance of the interpretation problem. Interpretation is the process of constructing high-level models (e.g. product descriptions) from low-level data (e.g. raw documents) using feature-extraction methods that can produce evidence that is incomplete (e.g. requested documents are unavailable or product prices are not found) or inconsistent (e.g. different documents provide different prices for the same product). Coming from disparate sources of information of varying quality, these pieces of uncertain evidence must be carefully combined in a well-defined manner to provide support for the interpretation models under consideration.
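As a minimal illustration of such evidence combination, the following sketch pools support for a single product attribute (here, price) reported with different certainty factors by different sources. The function names and the MYCIN-style combination rule are illustrative assumptions chosen for simplicity, not BIG's actual machinery:

    from collections import defaultdict

    def combine_cf(cf1, cf2):
        """Combine two certainty factors supporting the same hypothesis
        (a MYCIN-style rule, assumed here for illustration)."""
        return cf1 + cf2 * (1.0 - cf1)

    def support_for(evidence):
        """Pool per-value support for one attribute across sources."""
        pooled = defaultdict(float)
        for value, cf in evidence:      # e.g. prices found in documents
            pooled[value] = combine_cf(pooled[value], cf)
        return dict(pooled)

    # Two documents agree on a $229 price while a third reports $595; the
    # interpretation keeps both hypotheses with their accumulated support
    # rather than committing early.
    print(support_for([(229, 0.6), (229, 0.5), (595, 0.7)]))
    # {229: 0.8, 595: 0.7}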
In recasting IG as an interpretation problem, we face a search problem characterized by a combinatorially explosive state space. In the IG task, as in other interpretation problems, it is impossible to search exhaustively for information on a particular subject, or, in many cases, even to determine the total number of instances (e.g. particular word processing programs) of the general subject (e.g. word processing) under investigation. Consequently, any solution to the IG problem must support reasoning about trade-offs among resource constraints (e.g. the decision must be made within one hour), the quality of the selected item, and the quality of the decision process (e.g. the comprehensiveness of the search, or the effectiveness of the IE methods usable within the specified time limits). Because time must be conserved, it is also important for an interpretation-based IG system to save and exploit information about pertinent objects learned during earlier forays into the WWW. Additionally, we argue that an IG solution must support constructive problem solving, in which potential answers to a user's query (e.g. models of products) are incrementally built up from features extracted from raw documents and compared for consistency or suitability against other partially-completed answers.
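As a sketch of the kind of trade-off reasoning involved, consider scoring alternative gathering plans against a client's preferences. The linear utility form, weights, and names below are assumptions for illustration, not BIG's scheduling criteria:

    def plan_utility(quality, minutes, cost,
                     q_weight=0.8, t_weight=0.2,
                     soft_deadline=20.0, cost_limit=0.0):
        """Score a candidate gathering plan; higher is better."""
        if cost > cost_limit:           # hard constraint: free sources only
            return float("-inf")
        # Time utility degrades once the soft deadline is exceeded.
        timeliness = min(1.0, soft_deadline / max(minutes, 1e-9))
        return q_weight * quality + t_weight * timeliness

    plans = {"thorough": (0.9, 35.0, 0.0), "fast": (0.6, 12.0, 0.0)}
    best = max(plans, key=lambda p: plan_utility(*plans[p]))
    # "thorough" wins here: the quality gain outweighs missing the deadline.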
In connection with this incremental model-building process, an interpretation-based IG solution must also support sophisticated scheduling to achieve interleaved data-driven and expectation-driven processing. Processing for interpretation must be driven by expectations of what is reasonable, but expectations, in turn, must be influenced by what is found in the data. For example, during a search for information on word processors for Windows95, with the goal of recommending a package to purchase, an agent finding Excel in a review article that also mentions Word 5.0 might conclude, based on IE-derived expectations, that Excel is a competitor word processor. However, scheduling methods to resolve the uncertainties stemming from Excel's missing features would lead to additional gathering about Excel, which would in turn associate Excel with spreadsheet features and thus change the expectations about Excel (dropping it from the search once enough of the uncertainty is resolved). Where possible, the scheduling should permit parallel invocation of IE methods and parallel requests for WWW documents.
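This interleaving can be pictured as a loop in which unresolved uncertainty in a partial product model triggers further gathering, and the gathered data can revise the model's classification. All names below are hypothetical, and the loop is a schematic of the idea rather than BIG's scheduler:

    def refine(models, fetch_more):
        """fetch_more: product name -> newly gathered feature set."""
        agenda = [m for m in models if m["uncertain"]]
        while agenda:
            model = agenda.pop(0)       # expectation-driven: resolve doubts
            model["features"] |= fetch_more(model["name"])
            if "spreadsheet" in model["features"]:
                # Data revises expectations: Excel leaves the candidate set.
                models.remove(model)
            else:
                model["uncertain"] = False
        return models

    candidates = [
        {"name": "Word 5.0", "features": {"word-processor"}, "uncertain": False},
        {"name": "Excel", "features": set(), "uncertain": True},
    ]
    refine(candidates, lambda name: {"spreadsheet"} if name == "Excel" else set())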
To illustrate our objective, consider a simple sketch of BIG in action; a simplified control-flow view of this sketch is shown in Figure 1. A client is interested in finding a drawing program for Windows95 and submits goal criteria that describe the desired software characteristics along with specifications for BIG's search-and-decide process. The search parameters are: quality importance of 80%, time importance of 20%, a soft time deadline of 20 minutes, and a hard cost limit of 0. This translates into emphasizing quality over duration, a preference for a response within 20 minutes if possible, and a hard constraint that the search use only free information. The product parameters are: a product price of $200 or less; platform Windows95; and importance ratings of 100 units for usefulness, 25 for future usefulness, 100 for product stability, 100 for value, 100 for ease of use, 25 for power features, and 100 for enjoyability. In other words, the client is a middle-weight home-office user who is primarily concerned with using the product today with a minimum of hassles, but who also doesn't want to pay too much for power-user features.

Upon receipt of the criteria, BIG first invokes its planner to determine which information gathering activities are likely to lead to a solution path. These activities include retrieving documents from known drawing program makers, such as Corel and MacroMedia, as well as from consumer sites containing software reviews, such as the Benchin Web site. Other activities pertain to document processing options for the retrieved text; for a given document there is a range of processing possibilities, each with different costs and different advantages. For example, the heavyweight information extractor pulls data from free-format text, fills templates, and associates certainty factors with the extracted items. In contrast, the simple and inexpensive pattern matcher attempts to locate items within the text via simple grep-like behavior. These problem solving options are then considered and weighed by the task scheduler, which performs quality/cost/time trade-off analysis and determines a course of action for BIG.

The resulting schedule is then executed: multiple retrieval requests are issued, and documents are retrieved and processed. Data extracted from documents at the MacroMedia site is integrated with data extracted from documents at the Benchin site to form a product description object for MacroMedia Freehand. However, when BIG looks for information on Adobe Illustrator at the Benchin site, it also comes across products such as the Bible Illustrator for Windows and creates product description objects for these products as well. After sufficient information is gathered and the search resources are nearly consumed, BIG compares the different product objects and selects a product for the client. In this case, BIG's data indicates that the ``best'' product is MacroMedia Freehand, though it is the academic version that falls below the client's price threshold (the regular suggested retail price is $595). BIG returns this recommendation to the client along with the gathered information and the corresponding extracted data.
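The goal criteria from this sketch might be represented as follows; the field names and the dataclass encoding are illustrative assumptions rather than BIG's actual client interface:

    from dataclasses import dataclass, field

    @dataclass
    class SearchCriteria:
        quality_importance: float = 0.80   # emphasize quality...
        time_importance: float = 0.20      # ...over duration
        soft_deadline_minutes: int = 20    # prefer an answer in 20 minutes
        hard_cost_limit: float = 0.0       # free information only

    @dataclass
    class ProductCriteria:
        max_price: float = 200.0
        platform: str = "Windows95"
        importance: dict = field(default_factory=lambda: {
            "usefulness": 100, "future usefulness": 25, "stability": 100,
            "value": 100, "ease of use": 100, "power features": 25,
            "enjoyability": 100})

    query = (SearchCriteria(), ProductCriteria())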
Though the sketch above also exposes one of the problem areas of BIG's text processing, namely identifying special versions of products, it illustrates one of the cornerstones of our approach to the information explosion: we believe that retrieving relevant documents is not a viable end solution. The next generation of information systems must use the gathered information to make decisions and thus provide a higher-level client interface to the enormous volume of on-line information. Our work is related to other agent approaches [16] that process and use gathered information, such as the WARREN [6] portfolio management system, or the original BargainFinder [11] agent and Shopbot [8], both of which work to find the best available price for a music CD. However, our research differs in its direct representation of, and reasoning about, the time/quality/cost trade-offs of alternative ways to gather information; its ambitious use of gathered information to drive further gathering activities; its combination of bottom-up and top-down directed processing; and its explicit representation of the sources-of-uncertainty associated with both inferred and extracted information.