Title: Incremental Dialogue System Faster than and Preferred to its Nonincremental Counterpart
Authors: Gregory Aist (1), James Allen (2), Ellen Campana (2, 3, 4, 5), Carlos Gomez Gallo (2), Scott Stoness (2), Mary Swift (2), and Michael K. Tanenhaus (3)
Affiliations: (1) Department of Computer Science and Engineering, Arizona State University, P.O. Box 878809, Tempe AZ 85287; (2) Department of Computer Science, University of Rochester, P.O. Box 270226, Rochester NY 14627; (3) Department of Brain and Cognitive Sciences, University of Rochester, P.O. Box 270268, Rochester NY 14627; (4) Arts, Media, and Engineering Program, Arizona State University, P.O. Box 878709, Tempe AZ 85287; (5) Department of Psychology, Arizona State University, P.O. Box 871104, Tempe AZ 85287
Abstract: Current dialogue systems generally operate in a pipelined, modular fashion on one complete utterance at a time. Evidence from human language understanding shows that human understanding operates incrementally and makes use of multiple sources of information during the parsing process, including traditionally “later” components such as pragmatics. In this paper we describe a spoken dialogue system that understands language incrementally, provides visual feedback on possible referents during the course of the user’s utterance, and allows for overlapping speech and actions. We further present findings from an empirical study showing that the resulting dialogue system is faster overall than its nonincremental counterpart. Furthermore, the incremental system is preferred to its nonincremental counterpart, beyond what can be accounted for by factors such as speed and accuracy. These results indicate that successful incremental understanding systems will improve both performance and usability.
Keywords: natural language systems; dialogue understanding; incremental processing

Introduction

The standard model of natural language understanding for dialogue systems is pipelined, modular, and operates on complete utterances. By pipelined we mean that only one level of processing operates at a time, in sequence. By modular we mean that each level of processing depends only on the previous level. By complete utterances we mean that the system operates on one sentence at a time.

There is, however, considerable evidence that human language processing is neither pipelined, nor modular, nor restricted to whole utterances (Marslen-Wilson 1993). Evidence is converging from a variety of sources, in particular from actions that listeners take while speech is still arriving. For example, natural turn-taking behaviors such as backchanneling (uh-huh) and interruption occur while the speaker is still speaking. Eye movements to possible referents also occur while listening: individuals process instructions incrementally, making saccadic eye movements to objects right after hearing relevant words in the instruction (Tanenhaus et al. 1995), and verbs appearing earlier in sentences affect which objects are brought into context, as determined by hearer eye fixations (Altmann and Kamide 1999). Other actions can also be taken based on partial utterances.

Many different sources of knowledge are available for use in understanding. On the speech recognition side, commonly used sources of information include acoustics, phonetics and phonemics, lexical probability, and word order. In dialogue systems, additional sources of information often include syntax and semantics (both general and domain-specific). Some sources of information, however, are less frequently incorporated into systems, including linguistic information such as morphology and prosody. Knowledge-based features are also available, such as world knowledge (triangles have three sides), domain knowledge (here there are two sizes of triangles), and task knowledge (the next step is to click on a small triangle). Pragmatic information is also available from the visual context (there is a small triangle near the flag).

In this paper we discuss some of the progress we have made towards building methods for incremental understanding of spoken language by machines. We first discuss related work in this area, both our own and others’. We then describe the testbed domain that we have been developing and show some of the characteristics of human dialogue in that domain. Next we describe the incremental architecture that we have been developing, highlighting its differences from traditional architectures. Finally, we present an experimental evaluation of the system showing that incremental systems are both faster than and preferred to their nonincremental counterparts.

Related Work

We have previously shown that incremental parsing can be faster and more accurate than nonincremental parsing (Stoness et al. 2005). In addition, we have shown that in our testbed domain the relative percentage of language that is of a more interactive style increases over time (Aist et al. 2005).
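The contrast between waiting for a complete utterance and narrowing down possible referents as each word arrives can be illustrated with a minimal sketch. It assumes a toy scene and a lexicon in which every content word is reduced to a simple filter on object attributes, with no syntax, prosody, or task knowledge. All names here (`SceneObject`, `LEXICON`, `SCENE`, `resolve_incremental`, `resolve_whole`) are hypothetical, and this is not the architecture the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Set


@dataclass(frozen=True)
class SceneObject:
    """A hypothetical object in a toy visual scene (frozen so it is hashable)."""
    name: str
    shape: str
    size: str


# Hypothetical toy lexicon: each content word is reduced to a filter on scene objects.
# Words missing from the lexicon ("click", "on", "the") leave the candidates unchanged.
LEXICON: Dict[str, Callable[[SceneObject], bool]] = {
    "small": lambda o: o.size == "small",
    "large": lambda o: o.size == "large",
    "triangle": lambda o: o.shape == "triangle",
    "square": lambda o: o.shape == "square",
}

# Hypothetical scene echoing the domain knowledge above: two sizes of triangles.
SCENE: List[SceneObject] = [
    SceneObject("t1", "triangle", "small"),
    SceneObject("t2", "triangle", "large"),
    SceneObject("s1", "square", "small"),
    SceneObject("s2", "square", "large"),
]


def resolve_incremental(words: List[str], scene: List[SceneObject]) -> Iterator[Set[str]]:
    """Yield the names of the remaining candidate referents after every word.

    Each yielded set is what an incremental system could already show as
    visual feedback before the utterance is finished.
    """
    candidates = set(scene)
    for word in words:
        keep = LEXICON.get(word)
        if keep is not None:
            candidates = {o for o in candidates if keep(o)}
        yield {o.name for o in candidates}


def resolve_whole(words: List[str], scene: List[SceneObject]) -> Set[str]:
    """Nonincremental baseline: interpret only once the complete utterance is
    available, applying every word's filter in a single step."""
    filters = [LEXICON[w] for w in words if w in LEXICON]
    return {o.name for o in scene if all(f(o) for f in filters)}


if __name__ == "__main__":
    utterance = "click on the small triangle".split()
    for word, names in zip(utterance, resolve_incremental(utterance, SCENE)):
        print(f"after {word!r}: {sorted(names)}")
    print("whole utterance:", sorted(resolve_whole(utterance, SCENE)))
```

On "click on the small triangle", the incremental resolver narrows the candidates to s1 and t1 after "small" and to t1 after "triangle". `resolve_whole` reaches the same answer, but only once the whole word list is available. A real incremental understander would combine many more of the knowledge sources listed above than these attribute filters.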
Publication Year: 2007
Publication Date: 2007-01-01
Language: en
Type: article
Cited By Count: 30