Multi-modal input

I think the future of human-computer interaction lies more in the multi-modal realm than in the keyboard-centric approach used in Jef Raskin’s Humane Environment (THE) project and advocated in his recent Desktop Linux Summit keynote.

The keyboard is very efficient for some tasks (text input and editing being the best examples), but it doesn’t let people take advantage of all of their senses and communicative capabilities; a keyboard-centric approach ignores spoken and gestural input altogether.

We are not going to be using interfaces like those depicted in Minority Report (your arms would fall off after a day at the office), or pedantically telling our computers:

Copy this to the clipboard.

Create a new e-mail message.

Paste the clipboard here.

Send this message.

More likely, we’ll use a combination of currently common input mechanisms (mouse and keyboard) along with gestures made using the fingers and some spoken commands that are contextually understood by our computers.

The computer could understand what you mean when you say:

Mail this to Joe.

The computer knows what content you are looking at, and knows who Joe is, so it can act on the command. You wouldn’t have to baby-step through a command sequence.
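To make the idea concrete, here’s a minimal sketch of how that kind of contextual resolution might work. This is pure illustration – the names (Context, ADDRESS_BOOK, handle_command) are invented, and a real system would involve speech recognition and language understanding far beyond this toy pattern match:

```python
# Hypothetical sketch of resolving a contextual command such as
# "Mail this to Joe". Every name here is invented for illustration;
# nothing below is a real API.

ADDRESS_BOOK = {"joe": "joe@example.com"}

class Context:
    """What the system already knows: the content currently in focus."""
    def __init__(self, focused_content):
        self.focused_content = focused_content

def handle_command(command, context):
    # A real system would use speech recognition and language
    # understanding; this toy version pattern-matches "mail this to <name>".
    words = command.lower().split()
    if len(words) >= 4 and words[:3] == ["mail", "this", "to"]:
        recipient = ADDRESS_BOOK.get(words[3])
        if recipient is None:
            return "I don't know who " + words[3] + " is."
        # "this" comes from context, so no copy/paste steps are needed.
        return "Sending " + repr(context.focused_content) + " to " + recipient
    return "Command not understood."

ctx = Context(focused_content="the article on screen")
print(handle_command("Mail this to Joe", ctx))
# -> Sending 'the article on screen' to joe@example.com
```

The point of the sketch is the dictionary-style lookup of “this” and “Joe” from state the computer already has, rather than a sequence of explicit copy, create, paste, and send steps.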

Rather than Bernstein-on-crack arm flapping, a trackpad-like surface that could interpret finger movements as gestural input is one possibility. The fingers are capable of repeatedly making very precise motions – think of how complicated and precise handwriting can be. Your hand may tire eventually, but finger gestures would be only one of several input methods.
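As a rough sketch of that idea (invented names, and a deliberately crude classifier), a gesture recognizer might reduce a sampled finger stroke to a direction and map it onto a command, much as keyboard shortcuts map keys onto commands today:

```python
# Hypothetical sketch: classifying a finger stroke on a trackpad-like
# surface into a named gesture, then into a command. All names invented.

def classify_stroke(points):
    """Reduce a stroke (a list of (x, y) samples) to a coarse direction."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    if abs(dx) > abs(dy):
        return "swipe-right" if dx > 0 else "swipe-left"
    return "swipe-down" if dy > 0 else "swipe-up"

# Gestures bind to commands the way keyboard shortcuts do today.
GESTURE_COMMANDS = {
    "swipe-left": "back",
    "swipe-right": "forward",
    "swipe-up": "scroll up",
    "swipe-down": "scroll down",
}

stroke = [(0, 0), (4, 1), (9, 2)]  # sampled finger positions
gesture = classify_stroke(stroke)
print(gesture, "->", GESTURE_COMMANDS[gesture])  # swipe-right -> forward
```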

I’ve also thought a bit about completely non-graphical interfaces. What information can be communicated just as easily, or more easily, audibly? Do I really need to see movie time listings? Weather information? Dictionary definitions? Think of how many simple questions you “ask” your computer throughout the day – Who won the game tonight? When does that movie open? How much is that new book? The problem is that to get the information they want, users currently have to think in terms of data retrieval rather than conversation or simple Q&A. You’ve already formed the question in your mind, so why should you have to reformat it to fit the interface?
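A toy sketch of that kind of question answering – canned data and invented names, standing in for real services and real language understanding – might route a naturally phrased question straight to the fact the user wants:

```python
# Hypothetical sketch: answering simple spoken questions directly,
# rather than making the user recast them as database queries.
import re

# Canned data standing in for real weather and sports services.
FACTS = {
    "weather": "Partly cloudy, high of 72",
    "score": "Home team won, 3-2",
}

# Each pattern maps a natural phrasing to a fact; a real system would
# need speech recognition and far richer language understanding.
PATTERNS = [
    (re.compile(r"\bweather\b", re.I), "weather"),
    (re.compile(r"\bwho won\b", re.I), "score"),
]

def answer(question):
    for pattern, key in PATTERNS:
        if pattern.search(question):
            return FACTS[key]
    return "I don't know yet."

print(answer("What's the weather like today?"))  # Partly cloudy, high of 72
print(answer("Who won the game tonight?"))       # Home team won, 3-2
```

The user’s question goes in as asked; the translation into a lookup happens on the computer’s side, which is exactly the reversal the paragraph above argues for.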

I recognize that the input method is only one small part of Raskin’s vision for THE. I plan to write a bit about the other aspects in the future.