Gaze | Telepresence — Dan Rosenfeld

During my time as a researcher at Microsoft, I began investigating eye contact in video calling. The lack of proper eye contact between participants makes video calls less pleasant and less effective than in-person conversations. This impedes the adoption of a critical technology needed to improve collaboration and reduce the need for travel. This section describes my work in this area, including some promising results.

In the most common configuration for video calls, the camera is placed on top of a monitor, significantly above the image of the remote participant. As a result, each participant appears to the other to be looking downward, as if ignoring the other person, even when looking directly at the eyes of the remote participant on screen. In this way, video conferencing becomes a communications medium that transforms a positive social signal, eye contact, into a negative one, gaze aversion. Psychological research shows that people who avoid gaze are considered hostile or deceptive by their interlocutors.
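The size of the effect follows from simple geometry. As a rough sketch (the function name and the example dimensions are illustrative assumptions, not measurements from this work), the apparent gaze offset is the angle, seen from the viewer's eye, between the camera and the on-screen eyes of the remote participant:

```python
import math


def apparent_gaze_offset_deg(camera_offset_m: float, viewing_distance_m: float) -> float:
    """Angle in degrees between the camera and the point being looked at,
    as seen from the viewer's eye."""
    if viewing_distance_m <= 0:
        raise ValueError("viewing distance must be positive")
    return math.degrees(math.atan2(camera_offset_m, viewing_distance_m))


# Hypothetical desktop setup: camera 0.20 m above the remote person's
# on-screen eyes, viewer sitting 0.60 m from the screen.
angle = apparent_gaze_offset_deg(0.20, 0.60)
print(f"apparent gaze offset: {angle:.1f} degrees")
```

Moving the camera closer to the on-screen eyes, or sitting farther back, shrinks this angle. The optical approaches described here aim to bring it to zero.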

I decided to see if I could develop a solution to this problem.

I started by building a teleprompter-type device—a well-known approach to the eye contact problem. The specific configuration I built is shown in the diagram.

This kind of device can provide very good image quality for display and capture, though like virtually all optical solutions for eye contact, it involves tradeoffs between display brightness and the amount of light available to the camera. (The semi-silvered glass used as a beam-splitter is typically about 50% reflective and 50% transmissive. This implies that 50% of the incoming light is lost before getting to the camera and that 50% of the light from the display is reflected away, before getting to the viewer's eyes.)
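The tradeoff can be made concrete with a small calculation. This is a minimal sketch that assumes an idealized, lossless beam-splitter (reflectance plus transmittance equals 1) and assumes, as the wording above suggests, that the viewer sees the display through the glass while the camera sees the reflected scene. The function name is hypothetical.

```python
def light_budget(reflectance: float) -> tuple[float, float]:
    """Return (fraction of scene light reaching the camera,
    fraction of display light reaching the viewer)."""
    if not 0.0 <= reflectance <= 1.0:
        raise ValueError("reflectance must be between 0 and 1")
    transmittance = 1.0 - reflectance  # assumption: no absorption in the glass
    to_camera = reflectance            # assumption: camera uses the reflected path
    to_viewer = transmittance          # assumption: viewer uses the transmitted path
    return to_camera, to_viewer


for r in (0.3, 0.5, 0.7):
    to_camera, to_viewer = light_budget(r)
    print(f"reflectance {r:.0%}: camera gets {to_camera:.0%}, viewer gets {to_viewer:.0%}")
```

Under these assumptions, any gain for the camera is paid for by the viewer, and the reverse. The 50/50 case simply splits the loss evenly.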

The simplicity and quality of this approach provided a good starting point and a reference for future experiments. However, it comes with an important drawback: the beam-splitter requires a depth that is roughly 70% of the display's width or height, depending on beam-splitter orientation, which is a showstopper for many applications. At Microsoft, where this work was conducted, this issue eliminated the teleprompter approach from consideration for inclusion in a product.
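To get a feel for what that depth means in practice, here is a short worked example. It takes the roughly 70% ratio from the text at face value. The display dimensions are hypothetical and the function name is illustrative.

```python
def beam_splitter_depth_m(display_dimension_m: float, depth_ratio: float = 0.7) -> float:
    """Approximate enclosure depth needed for the beam-splitter.

    display_dimension_m is the display's width or height, whichever the
    beam-splitter is tilted across. depth_ratio is the rough figure
    quoted in the text, not a derived constant.
    """
    if display_dimension_m <= 0:
        raise ValueError("display dimension must be positive")
    return display_dimension_m * depth_ratio


# Hypothetical display: 0.53 m wide and 0.30 m tall.
for label, dimension in (("height", 0.30), ("width", 0.53)):
    depth = beam_splitter_depth_m(dimension)
    print(f"splitter tilted across the {label}: about {depth:.2f} m deep")
```

Even in the more favorable orientation, a display of this size would need an enclosure roughly 20 cm deep, far thicker than a flat panel.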

How the Device Works

The teleprompter experiments led to two new solutions for the eye contact problem.

The first approach I developed used an electronically switchable diffuser, time-multiplexed and synchronized with a camera viewing from behind. The video above describes how it works. (Greater detail can be found in this patent.)
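The timing idea can be sketched in a few lines. This is an illustrative model only: the text says the diffuser is time-multiplexed and synchronized with a camera behind it, while the 60 Hz cycle rate, the even split between phases, and the blanking of the image light during capture are assumptions made for the example. All names are hypothetical.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator


@dataclass(frozen=True)
class Phase:
    name: str
    diffuser_scattering: bool  # True: screen is diffuse and can show an image
    display_lit: bool          # True: image light is on (assumed blanked during capture)
    camera_exposing: bool      # True: camera behind the screen is capturing
    duration_ms: float


def multiplex_schedule(cycle_hz: float = 60.0, display_fraction: float = 0.5) -> Iterator[Phase]:
    """Yield alternating display and capture phases, forever."""
    if cycle_hz <= 0:
        raise ValueError("cycle_hz must be positive")
    if not 0.0 < display_fraction < 1.0:
        raise ValueError("display_fraction must be strictly between 0 and 1")
    period_ms = 1000.0 / cycle_hz
    show = Phase("display", True, True, False, period_ms * display_fraction)
    capture = Phase("capture", False, False, True, period_ms * (1.0 - display_fraction))
    while True:
        yield show      # diffuser scatters light: the viewer sees the image
        yield capture   # diffuser is clear: the camera sees through the screen


for phase in islice(multiplex_schedule(), 4):
    print(phase)
```

The key invariant is that the camera only exposes while the diffuser is clear. If the alternation is fast enough, the viewer perceives a steady image while the camera gets an unobstructed view from behind the screen.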

The adjacent picture shows two prototype devices based on this first invention.

(The Second Light device I co-invented uses a similar technique to build a device that "augments the typical interactions afforded by multi-touch and tangible tabletops with the ability to project and sense both through and beyond the display".)

The black object in the photo is the visible bottom part of an extremely clever optical invention called an optical wedge (PDF link). It enables flat, compact projection displays or image capture systems to be constructed.

Subsequent versions of the wedge eliminated the visible lower portion, making the wedge barely larger than the display area.

At this point, I started a broader effort to attack this problem. This included Bill Buxton, who, in addition to having his hand in virtually every important development in human-computer interaction since the late 1970s, coined the term telepresence and had been working in the area since the early 1990s. Bill broadened my thinking beyond concerns about eye contact alone, exposing me to the larger issues of gaze awareness (i.e., awareness of who or what the person I’m talking to is looking at) and spatial awareness as mediated by video conferencing systems.

Around the same time, Microsoft acquired Cambridge FPD—developers of the optical wedge—and Tim Large, an especially brilliant optical engineer from Cambridge, turned his attention to the problem. Tim's invention of a new optical technique enabled a second novel device we called Cyrano. (Shown at the top of this page.)

It works as follows. A camera is placed in the front of the device, pointing back towards the display. A sheet of tiny, partially reflective prisms acts as the equivalent of a mirror tilted forward in front of the display, directing light that arrives perpendicular to the display surface into the camera. The camera sees the image it would see if it were located behind the display looking forward, which is just what we need to achieve eye contact.

An angle-dependent view control film behind the prismatic sheet is arranged so that no light from the display reaches the camera, while allowing the display to be seen from normal viewing angles. (If you are interested, you can find more detail in the patent.)

One of Bill's contributions to the project was his long-standing idea that a properly designed device could act as a local surrogate for a remote participant. As Bill would say, the device's microphone, speaker, camera, and display act as the ears, mouth, eyes, and face of a remote person.

This approach had huge benefits for the experience. First, in addition to eye contact, we achieved full gaze awareness in multiparty conferences. We typically provided two devices at each site, with one on either side of a main monitor. This enabled a three-party call in which each participant knew when another participant was looking at them, looking at someone else, or directing their attention to a shared document presented on the monitor. These social cues, absent from conventional video calls, are critical to regulating and directing conversation.

In addition, the system allowed for normal conversational behaviors such as sidebar conversations: if a participant leaned towards another's surrogate device and whispered, they would be heard only by that person, not by everyone in the conference.

It's hard to overstate the effect of all these elements working together. In preparation for our first major demo, we set up three interconnected stations in a large meeting hall. We then began to test the system, one of us at each location. After a few minutes of testing, we slipped into a natural conversation without noticing, and didn't realize we had done so for some time. I knew we had something really valuable when it became clear that this experience, mediated by Cyrano devices, was good enough that we didn't feel the need to walk fifteen feet to talk in person.

Momentum behind the effort increased, leading first to a significant trial inside Microsoft, and then to a product incubation effort. Unfortunately, priorities changed, as they often do in large companies, and the effort was dropped. I really hope that it won't be too long until someone else picks up where our team left off; we all could be having much better video calls than we are now.