Voice interfaces will not replace screens as the medium of choice for most user interfaces.

Voice interfaces do have a way of capturing the imagination, however. In 1986, I asked a group of 57 computer professionals to predict the biggest change in user interfaces by the year 2000. The top answer was speech I/O, which got twice as many votes as graphical user interfaces.

It may be hard to remember, but in 1986, there was no guarantee that the graphical user interface would win the day. It was mainly used by the "toy-like" Macintosh machines — not by the "serious" systems used by IT professionals. Now, three years after the prediction target, GUIs are clearly the interface of choice.

Voice Interfaces: Getting Real

Many people have an exaggerated impression of the benefits of voice interfaces, likely based on the prominence of voice-operated computers in Star Trek. You know, the captain says, "Computer, locate Commander Data" and the computer answers, "Commander Data is no longer on the ship: he left half an hour ago on an unauthorized shuttle launch."

I've always thought that Captain Picard would have been much better off with a design that informed him immediately when a shuttle was stolen, without first waiting to be asked.

In any case, what to say is the key issue in interaction design, and the main usability determinant. Whether you say something by speaking or by typing is less important to most users. Thus, voice interfaces will not free us from the most substantial problems of user interface design:

  • selecting the tasks to support,
  • determining the structure of the dialogue,
  • deciding which commands or features are available,
  • letting users specify what they want, and
  • making the computer provide feedback on its actions.

All that voice does is let users speak, rather than write, commands and parameters. A small part of the puzzle, indeed.

When to Use Voice

Voice interfaces have their greatest potential in the following cases, which make relying on the traditional keyboard-mouse-monitor combination problematic:

  • Users with various disabilities, who cannot use a mouse and/or a keyboard or who cannot see pictures on the screen. Voice output is the main way for visually impaired users to interact with computers, and because these users rely so heavily on audio presentation of information, it is very important to design Web pages with voice-only browsers in mind.
  • Users who are in an eyes-busy, hands-busy situation. Whether or not they have disabilities, the keyboard-mouse-monitor combo fails users in these situations, such as when they're driving cars or repairing complex equipment.
  • Users who don't have access to a keyboard and/or a monitor. In this case, users might, for example, access a system by payphone.

So, it's not that voice is useless. It's just that it is often a secondary interaction mode when additional media are available. It's much easier to pick out the desired item from a list when the list is displayed on a monitor than when it's read aloud. Voice is a one-dimensional medium with zero persistence; a monitor is a two-dimensional medium that combines persistence (you can look at it for as long as you please) with selective updating (you can type a value into a field anywhere on the screen without changing the rest of the screen).

In the future, we may even move up to three-dimensional interfaces, even though 3D is rarely superior to 2D. Animation and other multimedia effects also add to the richness of visual interfaces, though animation is frequently used poorly in today's designs. The bottom line, though, is that visual interfaces can communicate much more information than auditory interfaces whenever users have a monitor and are capable of looking at it.

Voice in Information Appliances

There are many situations where people don't carry a display around with them and where telephone-based interfaces are the only way to access information. Checking your voice mail while grounded at O'Hare is the most notorious example, but who really likes to listen to linear voice mail?

In the future, we will have many small devices available that are perfectly portable and allow wireless Internet access. The first information appliances are already on the market. And, on occasion, it will be preferable to interact with an information appliance by voice — such as when your inbound flight was late and you're forced to run through the airport to catch your connecting flight. No time to look at anything, but it would be very useful to have a voice-operated assistant that tells you to "turn left here" or that "the outbound flight has been delayed 10 minutes, so you have time to stop at the Starbucks that's around the next corner."

My new Danger PDA nicely says "new message" when email arrives, but phone calls are announced by a selection of annoyingly funky ring tones that don't remind me of anybody I would actually want to talk to. It would be better to be able to record custom announcements such as "Louise calling" or "it's your mother."

A voice system's usability improves dramatically the more it knows about the surrounding environment. Because voice is less rich than visual displays, voice designers cannot rely on users to pick out important information or create connections between separate data items. Doing so will be the system's responsibility. Contextual design will become important, as will tight management of the user's time: the computer shouldn't drone on and on about things that are of minimal importance.
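One way to picture this kind of time management is a small Python sketch in which the system, not the user, decides what is worth saying. Everything here is an assumption made for illustration: the Announcement type, the 1-to-10 importance scale, the estimated speaking times, and the select_announcements function are hypothetical names, not part of any real product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Announcement:
    text: str
    importance: int   # hypothetical scale: 1 (trivial) to 10 (critical)
    seconds: float    # estimated time needed to speak the text aloud


def select_announcements(items: List[Announcement],
                         budget_seconds: float,
                         min_importance: int = 5) -> List[Announcement]:
    """Pick what to say, most important first, within a speaking-time budget."""
    chosen = []
    remaining = budget_seconds
    for item in sorted(items, key=lambda a: a.importance, reverse=True):
        if item.importance < min_importance:
            break  # everything after this point is even less important
        if item.seconds <= remaining:
            chosen.append(item)
            remaining -= item.seconds
    return chosen
```

With a ten-second budget, a gate change (importance 9, four seconds) and a flight delay (importance 8, five seconds) would both be spoken, while a duty-free promotion (importance 2) would be dropped. A real system would also need to infer importance from context, which is the hard part this sketch leaves out.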

I believe that voice interfaces hold their greatest promise as an additional component to a multi-modal dialogue, rather than as the only interface channel. For example, if you have a visual display and mouse available, it would be faster to point to something on the screen and say "red" or "bigger" than to first select the object, then move the mouse to a different screen area to pull down a menu or click a function button that conveys the same information.
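The point-and-speak example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Shape type, the tiny spoken vocabulary, and the point_and_say function are all hypothetical, and speech recognition itself is assumed to have already produced a single word.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shape:
    name: str
    left: float
    top: float
    right: float
    bottom: float
    color: str = "black"
    scale: float = 1.0

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


# Hypothetical spoken vocabulary: color names and relative size changes.
COLORS = {"red", "green", "blue"}
SCALE_WORDS = {"bigger": 1.25, "smaller": 0.8}


def shape_at(shapes: List[Shape], x: float, y: float) -> Optional[Shape]:
    """Return the topmost shape under the pointer (later shapes are on top)."""
    for shape in reversed(shapes):
        if shape.contains(x, y):
            return shape
    return None


def point_and_say(shapes: List[Shape], x: float, y: float,
                  word: str) -> Optional[Shape]:
    """Combine the two channels: the pointer picks the object, the word the change."""
    target = shape_at(shapes, x, y)
    if target is None:
        return None  # nothing under the pointer, so the word has no object
    if word in COLORS:
        target.color = word
    elif word in SCALE_WORDS:
        target.scale *= SCALE_WORDS[word]
    else:
        raise ValueError(f"unrecognized word: {word!r}")
    return target
```

The design choice worth noticing is that each channel does what it is good at: the pointer supplies the "which one" that is awkward to describe in speech, and the voice supplies the command without a trip to a menu.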

Similarly, voice could be used to direct the user's attention to important events or elements on the screen in a richer way than the obnoxious beep that currently constitutes most computers' audio vocabulary. Grow up, computer. You're not a baby any more, and you can do better than inarticulate beeps.