Multi-modal input

I think the future of human-computer interaction lies more in the multi-modal realm than the keyboard-centric approach used in Jef Raskin’s Humane Interface project and advocated in his recent Desktop Linux Summit keynote.

The keyboard is very efficient for some tasks (text input and editing being the best examples), but it doesn’t allow people to take advantage of all of their senses and communicative capabilities, ignoring spoken and gestural input altogether.

We are not going to be using interfaces like those depicted in Minority Report (your arms would fall off after a day at the office), or pedantically telling our computers:

Copy this to the clipboard.

Create a new e-mail message.

Paste the clipboard here.

Send this message.

More likely, we’ll use a combination of currently common input mechanisms (mouse and keyboard) along with gestures made using the fingers and some spoken commands that are contextually understood by our computers.

The computer could understand what you mean when you say:

Mail this to Joe.

The computer knows what content you are looking at and who Joe is, so it can act on the command. You wouldn’t have to baby-step through a command sequence.
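To make the idea concrete, here is a minimal sketch of how that kind of contextual resolution might work. Everything here is hypothetical – the `Context` class, the command grammar, and the contact lookup are invented for illustration, not drawn from any real system:

```python
# Hypothetical sketch: resolving a spoken command against ambient context.
# The computer already "knows" what the user is viewing and who their
# contacts are, so a terse command like "Mail this to Joe." is enough.

from dataclasses import dataclass, field

@dataclass
class Context:
    current_item: str                      # what the user is looking at
    contacts: dict = field(default_factory=dict)  # name -> e-mail address

def handle_command(command: str, ctx: Context) -> str:
    """Interpret a command like 'Mail this to Joe' using the context."""
    words = command.rstrip(".").split()
    if words[:1] == ["Mail"] and "to" in words:
        name = words[words.index("to") + 1]
        address = ctx.contacts.get(name)
        if address is None:
            return f"Who is {name}?"
        # "this" resolves to whatever the user is currently viewing.
        return f"Sending {ctx.current_item!r} to {address}"
    return "Command not understood"

ctx = Context(current_item="quarterly-report.pdf",
              contacts={"Joe": "joe@example.com"})
print(handle_command("Mail this to Joe.", ctx))
# -> Sending 'quarterly-report.pdf' to joe@example.com
```

The point is not the string matching (real speech systems do far more), but that resolving “this” and “Joe” requires no dialogue with the user when the context is already at hand.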

Rather than Bernstein-on-crack arm flapping, a trackpad-like surface that interprets finger movements as gestural input is one possibility. The fingers are capable of repeatedly making very precise motions without tiring – think of how intricate handwriting can be. Your hand may tire after a while, but finger gestures would be only one input method among several.

I’ve also thought a bit about completely non-graphical interfaces. What information can be as easily or more easily communicated audibly? Do I really need to see movie time listings? Weather information? Dictionary definitions? Think of how many simple questions you “ask” your computer throughout the day – who won the game tonight? When does that movie open? How much is that new book? The problem is that to get the information they want, users must currently think in terms of data retrieval rather than conversation or simple Q&A. You’ve already formed the question in your mind, so why should you have to reformat it to fit the interface?

I recognize that the input method is only one small part of Raskin’s vision for THE (The Humane Environment). I plan to write a bit about the other aspects in the future.

Drag and Drop to the Dock

My idea of triggering the single-application Exposé view by hovering a dragged item over the Dock works great (theoretically!) if an application already has a window or document (or two) open. But what if you dragged an object onto an application that had no open windows? What about applications that are not running at all? The obvious solution is to create a new document containing the dragged object – assuming the application can indeed handle that type – regardless of the application’s activation state.

  • For pictures dragged to an image viewer, a copy of the image would be opened in a new window. For iPhoto, this would add it to the library.
  • Text snippets would create new documents in text editors.
  • In e-mail apps, any object type would open a new message containing the dragged object. For links to movies and images, the link itself would be placed in the body of the message, not the movie or image.
  • Spring objects could create a new canvas to be used as a holding spot.
  • Web links would open a new browser window or tab depending on existing user preference.
  • FTP links would open a new connection. AFP links dropped on the Finder would do likewise.
  • Files dropped on chat clients are tricky. Prompting for a buddy to send the file to seems like the only real choice if no conversations are in progress.
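The rules above can be restated as a dispatch table keyed on application kind and dropped-object kind. This is purely illustrative – the category names and action strings are made up, and a real implementation would live inside each application’s drag-and-drop handling:

```python
# Hypothetical sketch of the per-application drop rules as a dispatch
# table. An ("app", "any") entry acts as that app's catch-all, and the
# final fallback is the generic rule: create a new document holding
# the dropped object.

def drop_action(app_kind: str, obj_kind: str) -> str:
    """Decide what dropping obj_kind on a Dock icon of app_kind does."""
    rules = {
        ("image_viewer", "picture"): "open copy in new window",
        ("iphoto", "picture"): "add to library",
        ("text_editor", "text"): "create new document with snippet",
        ("mail", "link"): "new message with link in body",
        ("mail", "any"): "new message containing object",
        ("browser", "web_link"): "open new window or tab (per preference)",
        ("finder", "afp_link"): "open new connection",
        ("chat", "file"): "prompt for a buddy to send to",
    }
    return rules.get((app_kind, obj_kind),
                     rules.get((app_kind, "any"),
                               "create new document containing object"))

print(drop_action("iphoto", "picture"))   # add to library
print(drop_action("mail", "picture"))     # new message containing object
```

Framing it this way makes the tricky cases obvious: they are exactly the cells of the table where no sensible default exists, like a file dropped on a chat client with no conversation in progress.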

This would work best in concert with my spring-loaded Dock combined with Exposé idea. This portion handles situations in which no document or window is open, eliminating the need to choose one as a destination. The single-app Exposé would handle situations in which an application already presents one or more valid destinations for a dragged object.

For single-window applications whose sole window has been minimized, the window would obviously need to unminimize to provide a destination for the object.

Some problems to further ponder:

  • It is probably necessary for the cursor itself to actually hit the Dock icon to initiate the drop. It may even be necessary for the dragged object(s) to stop just outside of the Dock so that the cursor is clearly visible for selecting an application.
  • Would documents created by a Dock-icon drop come to the foreground or not?
  • Where would items dropped on the Finder’s icon go? The Desktop folder or the frontmost Finder window?

I’m sure there are many other potential problems that I cannot currently imagine.


Zooming interfaces

I’ve been playing with the Flash zooming interface demo that is a part of Jef Raskin’s Humane Interface concept. As noted in the demo itself, it is limited and does not reflect the full capabilities of today’s rendering engines. With that said, it is still easy to see the potential of such a design.

The space is easily navigable and items can be grouped in particular locations. This is a more genuinely “spatial” UI than the one advocated by John Siracusa. A “spatial” metaphor doesn’t really make sense when items can disappear. Sure, a window or desktop can have items arranged spatially, but the value of that spatial representation is greatly decreased as soon as the window is closed or the desktop covered.
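What makes the zooming model genuinely spatial is that objects live at fixed world coordinates and the view is nothing but a center point plus a zoom factor. A minimal sketch of that core transform (my own simplification, not taken from Raskin’s demo):

```python
# Minimal sketch of the core math behind a zooming interface: the view
# never destroys spatial arrangement, it only translates and scales it.

def world_to_screen(wx: float, wy: float,
                    cx: float, cy: float, zoom: float) -> tuple:
    """Project a world-space point into screen space for the current
    view, defined by its center (cx, cy) and zoom factor."""
    return ((wx - cx) * zoom, (wy - cy) * zoom)

# Two documents grouped near each other in world space stay grouped on
# screen at any zoom level.
a = world_to_screen(10.0, 10.0, 0.0, 0.0, 2.0)   # (20.0, 20.0)
b = world_to_screen(12.0, 10.0, 0.0, 0.0, 2.0)   # (24.0, 20.0)
print(a, b)
```

Unlike a window, nothing here can be “closed” – zooming out always recovers the same arrangement, which is why the spatial memory the metaphor relies on is never invalidated.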

My questions about the implementation of the zooming interface:

How is a given object’s zoom level determined?

Does the user have control over an object’s zoom level? Can it be set relative to nearby objects or a global scale?

How would things such as playing media be handled in such a UI? How would I create a playlist of audio files?

I know – just buy the book. Since I don’t see anything more detailed on the website, I imagine the book delves into the specifics.