HK1171532B - Low-latency fusing of virtual and real content

Info

Publication number: HK1171532B
Application number: HK12112172.1A
Authority: HK
Inventors: A．巴-泽埃夫; J．A．古森斯; J．塔迪夫; M．S．格罗斯曼; H.辛格
Original assignee: Microsoft Technology Licensing, LLC (微软技术许可有限责任公司)
Priority date: 2010-10-27
Filing date: 2012-11-27
Publication date: 2017-08-11

Description

Low latency fusion of virtual and real content

Technical Field

The invention relates to low latency fusion of virtual and real content.

Background

Mixed reality is a technology that allows virtual images to be mixed with a real-world physical environment. For example, a mixed reality system may be used to insert an image of a dinosaur into a view of a user's room so that the user sees the dinosaur walking in the room.

A significant drawback of conventional mixed reality systems is latency. When the user turns his or her head, the user's view of the real world changes essentially instantaneously. However, in conventional mixed reality systems, the sensors take time to sense new image data, and the system takes time to render graphical images for display on the headwear worn by the user. Although recent advances have reduced this latency, it is still on the order of one frame to a few frames. This mismatch between the actual view and the virtual view may cause discomfort and disorientation to the user. Conventional mixed reality systems also rely on powerful but bulky processing systems to perform the bulk of the processing, including constructing maps of real-world environments and rendering graphical images.

Disclosure of Invention

Techniques are provided herein for fusing virtual content with real content to provide a mixed reality experience for one or more users. The system includes one or more mobile display devices in wireless communication with a powerful hub computing system. Each mobile display device may include a mobile processing unit coupled to a head mounted display device (or other suitable apparatus) having a display element. Each user wears a head mounted display device that allows the user to view a room through the display element. The display device allows actual direct viewing of the room and real world objects in the room through the display element. The display element also provides the ability to project a virtual image into the user's field of view such that the virtual image appears to be in the room. The system automatically tracks where the user is looking so that the system can determine where in the user's field of view to insert the virtual image. Once the system knows where to project the virtual image, the image is projected using the display element.

In embodiments, the hub computing system and one or more processing units may cooperate to build an environmental model that includes the x, y, z Cartesian positions of all users, real world objects, and virtual three-dimensional objects in a room or other environment. The position of each head mounted display device worn by a user in the environment may be calibrated to the environmental model and to each other. This allows the system to determine each user's line of sight and field of view of the environment. Thus, a virtual image may be displayed to each user, but the system determines the display of the virtual image from each user's perspective, adjusting the virtual image for parallax and for any occlusion by other objects in the environment. The environmental model, referred to herein as a scene map, and all tracking of the user's field of view and of objects in the environment may be generated by the hub computing system and the one or more processing units working in concert. In further embodiments, the one or more processing units may perform all system operations, and the hub computing system may be omitted.
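As an illustration of how Cartesian positions in a scene map can be related to a user's field of view, the following minimal sketch tests whether an object lies inside a viewing cone around the user's line of sight. The function name `in_field_of_view`, the conical shape of the field of view, and the half-angle parameter are assumptions made for illustration; the source does not specify how the field of view is represented.

```python
import math


def in_field_of_view(head_pos, view_dir, obj_pos, half_angle_deg):
    """Return True if obj_pos lies within a cone around view_dir.

    head_pos, obj_pos: (x, y, z) positions in scene-map coordinates.
    view_dir: (x, y, z) line-of-sight vector (need not be unit length).
    half_angle_deg: half of the cone's opening angle (an assumed FOV model).
    """
    to_obj = [o - h for o, h in zip(obj_pos, head_pos)]
    dist = math.sqrt(sum(c * c for c in to_obj))
    norm = math.sqrt(sum(c * c for c in view_dir))
    if dist == 0 or norm == 0:
        return False  # degenerate input: no meaningful direction
    # Cosine of the angle between the line of sight and the object direction.
    cos_angle = sum(a * b for a, b in zip(to_obj, view_dir)) / (dist * norm)
    return cos_angle >= math.cos(math.radians(half_angle_deg))


# Hypothetical usage: a user at the origin looking along +z.
print(in_field_of_view((0, 0, 0), (0, 0, 1), (0, 0, 5), 30))  # object straight ahead
print(in_field_of_view((0, 0, 0), (0, 0, 1), (5, 0, 0), 30))  # object off to the side
```

A real system would also need to account for occlusion and for the actual shape of the display's viewing frustum, which this sketch omits.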

It takes time to generate and update the positions of all objects in the environment, and it takes time to render virtual objects from each user's perspective. These operations therefore introduce inherent latency into the system. However, over small time periods, such as a few frames of data, movement tends to be generally smooth and steady. Thus, in accordance with the present technique, data from current and previous time periods may be examined to extrapolate the future positions of objects, as well as each user's future view of those objects. Using these predictions of the final scene map and of the user's field of view of the scene map, a virtual image of the scene may be displayed to each user without latency. The predictions of the final scene map and of the user's field of view may be updated throughout the frame to narrow down the possible solutions until it is time to send the rendered images to each user's head mounted display device to display the virtual elements of the mixed reality experience.
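The extrapolation described above can be sketched as follows. This is a minimal example assuming simple linear (constant-velocity) extrapolation from two samples; the source does not state which extrapolation method is used. The `Sample` type, its fields, and the `extrapolate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float          # timestamp in seconds
    position: tuple   # (x, y, z) in scene-map coordinates
    yaw: float        # head yaw in degrees (wrap-around is ignored here)


def extrapolate(prev: Sample, curr: Sample, t_display: float) -> Sample:
    """Linearly extrapolate position and yaw to the future display time."""
    dt = curr.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    # Fraction of the last sample interval to project beyond curr.
    k = (t_display - curr.t) / dt
    position = tuple(c + (c - p) * k for p, c in zip(prev.position, curr.position))
    yaw = curr.yaw + (curr.yaw - prev.yaw) * k
    return Sample(t_display, position, yaw)


# Hypothetical usage: two past samples, prediction one interval ahead.
prev = Sample(0.0, (0.0, 0.0, 0.0), 10.0)
curr = Sample(1.0, (1.0, 2.0, 0.0), 12.0)
predicted = extrapolate(prev, curr, 2.0)
print(predicted.position, predicted.yaw)
```

Because the prediction can be recomputed whenever a newer sample arrives, the estimate can be refined repeatedly within a frame until the image must be sent to the display.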

In an embodiment, the technology relates to a system for presenting a mixed reality experience to one or more users, the system comprising: one or more display devices for the one or more users, each display device comprising a first set of sensors for sensing data relating to a location of the display device, and a display unit for displaying a virtual image to a user of the display device; one or more processing units, each associated with a display device of the one or more display devices and each receiving sensor data from the associated display device; and a hub computing system operatively coupled to each of the one or more processing units, the hub computing system including a second set of sensors, the hub computing system and the one or more processing units cooperatively determining a three-dimensional map of an environment in which the system is used based on data from the first and second sets of sensors.

In another example, the technology relates to a system for presenting a mixed reality experience to one or more users, the system comprising: a first head mounted display device comprising: a camera to obtain image data of an environment for which the first head mounted display device is used; an inertial sensor to provide inertial measurements of the first head-mounted display; and a display device for displaying a virtual image to a user of the first head mounted display device; and a first processing unit associated with the first head mounted display, the first processing unit to determine a three dimensional map of an environment for which the first head mounted display device is used, and a field of view for which the first head mounted display device views the three dimensional map.

In another example, the technology relates to a method for presenting a mixed reality experience to one or more users, the method comprising: (a) determining status information for at least two time periods, the status information relating to a user's view of an environment comprising a mixed reality of one or more real-world objects and one or more virtual objects; (b) extrapolating status information relating to the user's view of the environment for a third time period, the third time period being a time in the future at which one or more virtual images of the mixed reality will be displayed to the user; and (c) displaying, at the third time period, at least one of the one or more virtual images to the user based on the status information relating to the user's view extrapolated in said step (b).

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

Drawings

FIG. 1 is an illustration of exemplary components of one embodiment of a system for presenting a mixed reality environment to one or more users.

FIG. 2 is a perspective view of one embodiment of a head mounted display unit.

FIG. 3 is a side view of a portion of one embodiment of a head mounted display unit.

FIG. 4 is a block diagram of one embodiment of components of a head mounted display unit.

FIG. 5 is a block diagram of one embodiment of the components of a processing unit associated with a head mounted display unit.

FIG. 6 is a block diagram of one embodiment of the components of a hub computing system associated with a head mounted display unit.

FIG. 7 is a block diagram depicting one embodiment of a computing system that may be used to implement the hub computing system described herein.

FIG. 8 is an illustration of exemplary components of a mobile embodiment of a system for presenting a mixed reality environment to one or more users in an outdoor context.

FIG. 9 is a flow chart illustrating the operation and cooperation of the hub computing system, one or more processing units, and one or more head mounted display units of the present system.

FIGS. 10-16 and 16A are more detailed flow charts of examples of the various steps shown in the flow chart of FIG. 9.

FIG. 17 is an exemplary field of view illustrating a user-independent virtual object displayed in a mixed reality environment.

FIG. 18 is an exemplary field of view illustrating a user-dependent virtual object displayed in a mixed reality environment.

FIGS. 19 and 20 illustrate a pair of expanded exemplary fields of view in accordance with further embodiments of the present technique.

Detailed Description

A system is disclosed herein that can fuse a virtual object with a real object. In one embodiment, the system includes a head mounted display device, and a processing unit in communication with the head mounted display device worn by each of the one or more users. Head mounted display devices include a display that allows direct view of real world objects through the display. The system may project a virtual image onto a display, where the virtual image may be viewed by a person wearing the head mounted display device while the person is also viewing real world objects through the display. Multiple users may also see the same virtual object from different perspectives thereof as if each user were viewing a real-world object from different locations thereof. Various sensors are used to detect the position and orientation of the one or more users in order to determine where to project the virtual image.

One or more of the sensors are used to scan the surrounding environment and to construct a model of the scanned environment. Using the model, a virtual image is added to a view of the model at a location referenced to a real-world object that is part of the model. The system automatically tracks where the one or more users are looking so that the system can determine each user's field of view through the display of the head mounted display device. Each user may be tracked using any of a variety of sensors, including depth sensors, image sensors, inertial sensors, eye position sensors, and the like.

Using current and past data relating to a model of an environment (including the user, real world objects and virtual objects) and the user's view of the environment, the system extrapolates to future scenarios to predict the model of the environment and the user's view of the environment when an image of the environment is to be displayed to the user. Using this prediction, the virtual image can then be rendered without latency. The image is rendered by: determining the size and orientation of the virtual image; and rendering the sized/oriented image on a display of the head mounted display device for each user having a view of the virtual image. In embodiments, the view of the virtual image or real world object may be changed to account for occlusions.

FIG. 1 illustrates a system 10 for providing a mixed reality experience by fusing virtual content into real content. FIG. 1 shows a plurality of users 18a, 18b and 18c, each wearing a head mounted display device 2. As can be seen in FIGS. 2 and 3, each head mounted display device 2 communicates with its own processing unit 4 via wire 6. In other embodiments, head mounted display device 2 communicates with processing unit 4 through wireless communication. Head mounted display device 2, which in one embodiment is in the shape of glasses, is worn on the head of a user so that the user can see through the display and thereby have an actual direct view of the space in front of the user. The term "actual direct view" refers to the ability to view real-world objects directly with the human eye, rather than viewing a created image representation of the objects. For example, looking through glasses at a room allows a user to have an actual direct view of the room, while viewing a video of the room on a television is not an actual direct view of the room. More details of head mounted display device 2 are provided below.

In one embodiment, processing unit 4 is a small portable device worn, for example, on the user's wrist or stored in the user's pocket. The processing unit may be, for example, the size and form factor of a cellular telephone, but it may be other shapes and sizes in further examples. Processing unit 4 may include much of the computing capability used to operate head mounted display device 2. In an embodiment, processing unit 4 communicates wirelessly (e.g., WiFi, Bluetooth, infrared, or other wireless communication means) with one or more hub computing systems 12. As will be explained later, hub computing system 12 may be omitted in further embodiments to provide a fully mobile mixed reality experience using only the head mounted display device and processing unit 4.

Hub computing system 12 may be a computer, a gaming system or console, or the like. According to an example embodiment, hub computing system 12 may include hardware components and/or software components such that hub computing system 12 may be used to execute applications such as gaming applications, non-gaming applications, and the like. In one embodiment, hub computing system 12 may include a processor, such as a standardized processor, a specialized processor, a microprocessor, or the like that may execute instructions stored on a processor readable storage device for performing the processes described herein.

Hub computing system 12 also includes a capture device 20, which capture device 20 is used to capture image data from a portion of the scene within its field of view (FOV). As used herein, a scene is an environment in which a user moves around, captured within the FOV of the capture device 20 and/or the FOV of each head mounted display device 2. FIG. 1 shows a single capture device 20, but in further embodiments there may be multiple capture devices that cooperate with each other to collectively capture image data from a scene within the composite FOV of the multiple capture devices 20. The capture device 20 may include one or more cameras that visually monitor the one or more users 18a, 18b, 18c and the surrounding space so that gestures and/or movements performed by the one or more users and the structure of the surrounding space may be captured, analyzed, and tracked to perform one or more controls or actions in an application and/or animate an avatar or on-screen character.

Hub computing system 12 may be connected to an audiovisual device 16 such as a television, a monitor, a high-definition television (HDTV), or the like that may provide game or application visuals. For example, hub computing system 12 may include a video adapter such as a graphics card and/or an audio adapter such as a sound card that may provide audiovisual signals associated with the game application, non-game application, or the like. The audiovisual device 16 may receive the audiovisual signals from hub computing system 12 and may then output game or application visuals and/or audio associated with the audiovisual signals. According to one embodiment, the audiovisual device 16 may be connected to hub computing system 12 via, for example, an S-video cable, a coaxial cable, an HDMI cable, a DVI cable, a VGA cable, a component video cable, an RCA cable, or the like. In one example, the audiovisual device 16 includes a built-in speaker. In other embodiments, the audiovisual device 16 and hub computing system 12 may be connected to external speakers 22.

Hub computing system 12 may be used with capture device 20 to recognize, analyze, and/or track human (and other types of) targets. For example, one or more of the users 18a, 18b, and 18c wearing the head mounted display device 2 may be tracked using the capture device 20 such that gestures and/or motions of the user may be captured to animate one or more avatars or on-screen characters. These motions may also, or alternatively, be interpreted as controls that may be used to affect the application executed by hub computing system 12. Hub computing system 12, together with head mounted display device 2 and processing unit 4, may also provide a mixed reality experience in which one or more virtual images, such as virtual image 21 in fig. 1, may be mixed with real world objects in a scene.

Fig. 2 and 3 show perspective and side views of the head mounted display device 2. Fig. 3 shows only the right side of head mounted display device 2, which includes the portion with temple 102 and nose bridge 104. A microphone 110 is placed in nose bridge 104 for recording sounds and transmitting audio data to processing unit 4, as will be described below. In front of the head mounted display device 2 is a room-facing video camera 112 that can capture video and still images. These images are transmitted to a processing unit 4, which will be described below.

A portion of the frame of head mounted display device 2 surrounds the display (which includes one or more lenses). To show the components of head mounted display device 2, the portion of the frame surrounding the display is not depicted. The display includes light guide optical element 115, opacity filter 114, see-through lens 116, and see-through lens 118. In one embodiment, opacity filter 114 is behind and aligned with see-through lens 116, light guide optical element 115 is behind and aligned with opacity filter 114, and see-through lens 118 is behind and aligned with light guide optical element 115. See-through lenses 116 and 118 are standard lenses used in eyeglasses and may be made to any prescription (including no prescription). In one embodiment, see-through lenses 116 and 118 may be replaced with a variable prescription lens. In some embodiments, head mounted display device 2 will include only one see-through lens or no see-through lenses. In another alternative, a prescription lens may be placed inside light guide optical element 115. Opacity filter 114 filters out natural light (either on a per-pixel basis or uniformly) to enhance the contrast of the virtual image. Light guide optical element 115 directs artificial light to the eye. More details of opacity filter 114 and light guide optical element 115 are provided below.

Mounted at or within temple 102 is an image source that (in one embodiment) includes a microdisplay 120 for projecting a virtual image and a lens 122 for directing the image from microdisplay 120 into light guide optical element 115. In one embodiment, the lens 122 is a collimating lens.

Control circuitry 136 provides various electronics that support other components of head mounted display device 2. More details of the control circuit 136 are provided below with reference to fig. 4. Inside the temple 102 or mounted at the temple 102 are an earphone 130, an inertial sensor 132, and a temperature sensor 138. In one embodiment shown in FIG. 4, inertial sensors 132 include a three-axis magnetometer 132A, three-axis gyroscope 132B, and three-axis accelerometer 132C. Inertial sensors 132 are used to sense the position, orientation, and sudden accelerations (pitch, roll, and yaw) of head mounted display device 2. The inertial sensors may be collectively referred to below as the inertial measurement unit 132 or IMU 132. IMU 132 may include other inertial sensors in addition to or in lieu of magnetometer 132A, gyroscope 132B, and accelerometer 132C.

Microdisplay 120 projects an image through lens 122. There are different image generation technologies that can be used to implement microdisplay 120. For example, microdisplay 120 can be implemented using a transmissive projection technology where the light source is modulated by an optically active material, backlit with white light. These technologies are typically implemented using LCD-type displays with powerful backlights and high optical power densities. Microdisplay 120 can also be implemented using a reflective technology where external light is reflected and modulated by an optically active material. Depending on the technology, the illumination is forward lit by either a white light source or an RGB source. Digital Light Processing (DLP), Liquid Crystal On Silicon (LCOS), and a display technology from Qualcomm, Inc. are all examples of efficient reflective technologies, as most of the energy is reflected away from the modulated structure, and they may be used in the systems described herein. Additionally, microdisplay 120 can be implemented using an emissive technology, where light is generated by the display. For example, the PicoP™ display engine from Microvision, Inc. uses micro-mirror steering to emit a laser signal either onto a small screen that acts as a transmissive element or directly into the eye (e.g., a laser).

Light guide optical element 115 transmits light from microdisplay 120 to the eye 140 of a user wearing head mounted display device 2. Light guide optical element 115 also allows light from in front of head mounted display device 2 to be transmitted through light guide optical element 115 to the user's eye, as indicated by arrow 142, allowing the user to have an actual direct view of the space in front of head mounted display device 2 in addition to receiving the virtual image from microdisplay 120. Thus, the walls of light guide optical element 115 are see-through. Light guide optical element 115 includes a first reflective surface 124 (e.g., a mirror or other surface). Light from microdisplay 120 passes through lens 122 and is incident on reflective surface 124. Reflective surface 124 reflects the incident light from microdisplay 120 such that the light is trapped by internal reflection within the planar substrate comprising light guide optical element 115. After several reflections off the surfaces of the substrate, the trapped light waves reach an array of selectively reflective surfaces 126. Note that only one of the five surfaces is labeled 126 to prevent over-crowding of the drawing. Reflective surfaces 126 couple the light waves incident on them out of the substrate and into the user's eye 140.

Since different light rays travel and bounce off the interior of the substrate at different angles, these different light rays will hit the various reflective surfaces 126 at different angles. Thus, different light rays will be reflected out of the substrate by different ones of the reflective surfaces. The choice of which rays will be reflected out of the substrate by which surface 126 is engineered by selecting an appropriate angle for each surface 126. More details of light guide optical elements can be found in U.S. Patent Application Publication No. 2008/0285140, entitled "Substrate-Guided Optical Devices," published on November 20, 2008, the entire contents of which are incorporated herein by reference. In one embodiment, each eye will have its own light guide optical element 115. When head mounted display device 2 has two light guide optical elements, each eye may have its own microdisplay 120, which may display the same image in both eyes or different images in the two eyes. In another embodiment, there may be one light guide optical element that reflects light into both eyes.

Opacity filter 114, which is aligned with light guide optical element 115, selectively blocks natural light from passing through light guide optical element 115, either uniformly or on a per-pixel basis. Details of the opacity filter are provided in U.S. Patent Application No. 12/887,426, entitled "Opacity Filter For See-Through Mounted Display," filed on September 21, 2010, the entire contents of which are incorporated herein by reference. In general, however, an embodiment of the opacity filter may be a see-through LCD panel, an electrochromic film, or a similar device capable of acting as an opacity filter. Opacity filter 114 may include a dense grid of pixels, where the light transmittance of each pixel can be individually controlled between a minimum and a maximum transmittance. Although a transmittance range of 0-100% is ideal, a more limited range is also acceptable, such as, for example, about 50-90% per pixel, up to the resolution of the LCD.

After z-buffering with proxies for real world objects, a mask of alpha values from the rendering pipeline may be used. When the system renders a scene for an augmented reality display, the system notes which real world objects are in front of which virtual objects, as will be explained later. If a virtual object is in front of a real world object, the opacity should be on for the coverage area of the virtual object. If the virtual object is (virtually) behind a real-world object, the opacity should be off, as should any color for that pixel, so that for the corresponding region of real light (one pixel or more in size), the user will see only the real-world object. Coverage is on a pixel-by-pixel basis, so the system can handle the case where a portion of a virtual object is in front of a real-world object, a portion of the virtual object is behind the real-world object, and a portion of the virtual object coincides with the real-world object. For such uses, displays that can go from 0% to 100% opacity at low cost, power and weight are most desirable. In addition, the opacity filter may be rendered in color, such as with a color LCD or with other displays such as organic LEDs, to provide a wide field of view.
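The per-pixel depth comparison described above can be sketched as follows. This is a simplified illustration, not the patent's rendering pipeline: the function name `opacity_mask`, the use of plain nested lists as depth buffers, the `None` marker for pixels with no virtual content, and the choice to treat coincident depths as "not in front" are all assumptions.

```python
def opacity_mask(virtual_depth, real_depth):
    """Return a per-pixel mask: True where opacity should be on.

    virtual_depth: rows of depths for virtual content, or None where a
                   pixel has no virtual content (assumed convention).
    real_depth:    rows of depths for the proxies of real-world objects.
    Opacity is on only where a virtual object is closer to the viewer
    than the real-world surface at that pixel.
    """
    mask = []
    for v_row, r_row in zip(virtual_depth, real_depth):
        mask.append([v is not None and v < r for v, r in zip(v_row, r_row)])
    return mask


# Hypothetical 1x3 buffers: virtual in front, virtual behind, no virtual content.
virtual = [[1.0, 3.0, None]]
real = [[2.0, 2.0, 2.0]]
print(opacity_mask(virtual, real))
```

Where the mask is False, both the opacity and the pixel's color would be turned off so that only the real-world object is seen, as the paragraph above describes.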

Head mounted display device 2 also includes a system for tracking the position of the user's eyes. As will be explained below, the system will track the position and orientation of the user so that the system can determine the user's field of view. However, a person will not perceive everything in front of him or her. Instead, the user's eyes will be directed at a subset of the environment. Thus, in one embodiment, the system will include techniques for tracking the position of the user's eyes in order to refine the measurement of the user's field of view. For example, head mounted display device 2 includes eye tracking component 134 (see FIG. 3), which includes eye tracking illumination device 134A and eye tracking camera 134B (see FIG. 4). In one embodiment, eye tracking illumination device 134A includes one or more infrared (IR) emitters that emit IR light toward the eye. Eye tracking camera 134B includes one or more cameras that sense the reflected IR light. The position of the pupil can be identified by known imaging techniques that detect the reflection of the cornea. See, for example, U.S. Patent No. 7,401,920, entitled "Head mounted eye tracking and display system," issued on July 22, 2008, which is incorporated herein by reference. Such techniques can locate the position of the center of the eye relative to the tracking camera. In general, eye tracking involves obtaining images of the eye and using computer vision techniques to determine the location of the pupil within the eye socket. In one embodiment, it is sufficient to track the position of one eye, since the eyes typically move in unison. However, it is possible to track each eye separately.

In one embodiment, the system will use 4 IR LEDs and 4 IR photodetectors arranged in a rectangle such that there is one IR LED and IR photodetector at each corner of the lens of head mounted display device 2. Light from the LED reflects off the eye. The pupil position is determined by the amount of infrared light detected at each of the 4 IR photodetectors. That is, the amount of white versus black in the eye will determine the amount of light reflected off the eye for that particular photodetector. Thus, the photodetector will have a measure of the amount of white or black in the eye. From the 4 samples, the system can determine the direction of the eye.
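One way the four photodetector samples could be combined into an eye direction is sketched below. The differential formula, the function name `gaze_direction`, the corner naming, and the sign conventions are assumptions made for illustration; the source states only that the direction can be determined from the four samples. The sketch assumes the dark pupil reflects less IR light than the white of the eye, so the detectors nearest the pupil read lower values.

```python
def gaze_direction(top_left, top_right, bottom_left, bottom_right):
    """Estimate a normalized (x, y) gaze offset from four IR readings.

    Each argument is the amount of reflected IR light measured at one
    corner of the lens. Returns values in [-1, 1]; positive x means the
    eye is turned toward the right-hand detectors, positive y toward the
    top detectors (assumed convention). Uncalibrated and illustrative only.
    """
    total = top_left + top_right + bottom_left + bottom_right
    if total <= 0:
        raise ValueError("no reflected light detected")
    # Less light on one side implies the dark pupil is closer to that side.
    x = ((top_left + bottom_left) - (top_right + bottom_right)) / total
    y = ((bottom_left + bottom_right) - (top_left + top_right)) / total
    return x, y


# Hypothetical readings: the right-hand detectors see less reflected light.
print(gaze_direction(2.0, 1.0, 2.0, 1.0))
```

A practical system would need per-user calibration to map these normalized offsets to actual gaze angles.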

Another alternative is to use 4 infrared LEDs as discussed above, but only one infrared CCD at the side of the lens of head mounted display device 2. The CCD will use a small mirror and/or lens (fish eye) so that the CCD can image up to 75% of the visible eye from the frame. The CCD will then sense an image, and computer vision will be used to find the eye in the image, as discussed above. Thus, although FIG. 3 shows one assembly with one IR emitter, the configuration of FIG. 3 may be adapted to have 4 IR emitters and/or 4 IR sensors. More or fewer than 4 IR emitters and/or more or fewer than 4 IR sensors may also be used.

Another embodiment for tracking eye direction is based on charge tracking. This scheme is based on the following observation: the retina carries a measurable positive charge and the cornea has a negative charge. Sensors are mounted by the user's ears (near earphones 130) to detect the electrical potential as the eyes move around and effectively read out what the eyes are doing in real time. Other embodiments for tracking the eyes may also be used.

FIG. 3 shows only half of head mounted display device 2. A complete head mounted display device would include another set of see-through lenses, another opacity filter, another light guide optical element, another microdisplay 120, another lens 122, another room-facing camera, another eye tracking assembly, another earphone, and another temperature sensor.

FIG. 4 is a block diagram depicting various components of head mounted display device 2. FIG. 5 is a block diagram depicting the various components of processing unit 4. Components of head mounted display device 2 are depicted in FIG. 4, where head mounted display device 2 is used to provide a mixed reality experience to a user by seamlessly fusing one or more virtual images with the user's view of the real world. Additionally, the head mounted display device components of FIG. 4 include a number of sensors that track various conditions. Head mounted display device 2 will receive instructions about the virtual image from processing unit 4 and will provide sensor information back to processing unit 4. The components of processing unit 4 are depicted in FIG. 5, and processing unit 4 will receive sensor information from head mounted display device 2 and will exchange information and data with hub computing device 12 (see FIG. 1). Based on this exchange of information and data, processing unit 4 will determine where and when to provide a virtual image to the user and will send instructions to the head mounted display device of FIG. 4 accordingly.

Some of the components of FIG. 4 (e.g., room-facing camera 112, eye tracking camera 134B, microdisplay 120, opacity filter 114, eye tracking illumination 134A, earphones 130, and temperature sensor 138) are shown shaded to indicate that there are two of each of these, one for the left side of head mounted display device 2 and one for the right side of head mounted display device 2. FIG. 4 shows a control circuit 200 in communication with a power management circuit 202. The control circuit 200 includes a processor 210, a memory controller 212 in communication with a memory 214 (e.g., D-RAM), a camera interface 216, a camera buffer 218, a display driver 220, a display formatter 222, a timing generator 226, a display output interface 228, and a display input interface 230.

In one embodiment, all components of the control circuit 200 communicate with each other over dedicated lines or one or more buses. In another embodiment, each component of the control circuit 200 is in communication with the processor 210. The camera interface 216 provides an interface to the two room-facing cameras 112 and stores images received from the room-facing cameras in the camera buffer 218. Display driver 220 will drive microdisplay 120. Display formatter 222 provides information about the virtual image displayed on microdisplay 120 to opacity control circuit 224, which controls opacity filter 114. A timing generator 226 is used to provide timing data to the system. The display output 228 is a buffer for providing images from the room-facing camera 112 to the processing unit 4. Display input 230 is a buffer for receiving images, such as virtual images to be displayed on microdisplay 120. The display output 228 and the display input 230 communicate with a band interface 232 that is an interface to the processing unit 4.

Power management circuit 202 includes voltage regulator 234, eye tracking illumination driver 236, audio DAC and amplifier 238, microphone preamplifier and audio ADC 240, temperature sensor interface 242, and clock generator 244. Voltage regulator 234 receives power from processing unit 4 through band interface 232 and provides the power to the other components of head mounted display device 2. The eye tracking illumination driver 236 drives the IR light source for eye tracking illumination 134A as described above. Audio DAC and amplifier 238 outputs audio information to headphones 130. Microphone preamplifier and audio ADC 240 provides an interface for microphone 110. The temperature sensor interface 242 is an interface for the temperature sensor 134. Power management circuit 202 also provides power to and receives data back from three axis magnetometer 132A, three axis gyroscope 132B, and three axis accelerometer 132C.

FIG. 5 is a block diagram depicting the various components of the processing unit 4. FIG. 5 shows control circuitry 304 in communication with power management circuitry 306. The control circuit 304 includes: a Central Processing Unit (CPU) 320; a Graphics Processing Unit (GPU) 322; a cache 324; RAM 326; a memory controller 328 in communication with memory 330 (e.g., D-RAM); a flash controller 332 in communication with flash memory 334 (or other type of non-volatile storage); a display output buffer 336 in communication with head mounted display device 2 through band interface 302 and band interface 232; a display input buffer 338 in communication with head mounted display device 2 through band interface 302 and band interface 232; a microphone interface 340 in communication with external microphone connector 342 for connecting to a microphone; a PCI Express interface for connecting to wireless communication device 346; and a USB port 348. In one embodiment, the wireless communication device 346 may include a Wi-Fi enabled communication device, a Bluetooth communication device, an infrared communication device, and the like. The USB port may be used to interface processing unit 4 to hub computing device 12 for loading data or software onto processing unit 4 and for charging processing unit 4. In one embodiment, CPU 320 and GPU 322 are the main workhorses for determining where, when, and how to insert three-dimensional virtual images into a user's field of view. More details are provided below.

Power management circuitry 306 includes a clock generator 360, an analog-to-digital converter 362, a battery charger 364, a voltage regulator 366, a head mounted display (HMD) power supply 376, and a temperature sensor interface 372 in communication with a temperature sensor 374 (which may be located on a wrist band of processing unit 4). The analog-to-digital converter 362 is used to monitor the battery voltage and the temperature sensor and to control the battery charging function. The voltage regulator 366 communicates with a battery 368 for providing power to the system. Battery charger 364 is used to charge battery 368 (via voltage regulator 366) upon receiving power from charging jack 370. The HMD power supply 376 provides power to the head mounted display device 2.

FIG. 6 illustrates an exemplary embodiment of hub computing system 12 having capture device 20. According to an example embodiment, the capture device 20 may be configured to capture video with depth information including a depth image, which may include depth values, by any suitable technique, which may include, for example, time-of-flight, structured light, stereo image, or the like. According to one embodiment, the capture device 20 may organize the depth information into "Z layers," or layers that may be perpendicular to a Z axis extending from the depth camera along its line of sight.

As shown in FIG. 6, the capture device 20 may include a camera component 423. According to an exemplary embodiment, camera component 423 may be or may include a depth camera that may capture a depth image of a scene. The depth image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may represent a depth value, such as a distance in, for example, centimeters, millimeters, or the like of an object in the captured scene from the camera.

The camera component 423 may include an infrared (IR) light component 425, a three-dimensional (3D) camera 426, and an RGB (visual image) camera 428 that may be used to capture a depth image of a scene. For example, in time-of-flight analysis, the IR light component 425 of the capture device 20 may emit infrared light onto the scene and may then detect the backscattered light from the surface of one or more targets and objects in the scene using sensors (in some embodiments, including sensors not shown), for example using the 3D camera 426 and/or the RGB camera 428. In some embodiments, pulsed infrared light may be used such that the time difference between an outgoing light pulse and a corresponding incoming light pulse may be measured and used to determine a physical distance from the capture device 20 to a particular location on the targets or objects in the scene. Additionally, in other exemplary embodiments, the phase of the outgoing light wave may be compared to the phase of the incoming light wave to determine the phase shift. The phase shift may then be used to determine a physical distance from the capture device to a particular location on the targets or objects.
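By way of illustration only, the two time-of-flight relationships described above may be sketched as follows. The function names and the modulation frequency parameter are assumptions introduced for this sketch and do not appear in the embodiments; the arithmetic itself is the standard round-trip relationship between light travel time (or phase shift of a modulated wave) and distance.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_from_pulse_delay(delay_s: float) -> float:
    """Distance from the time between an outgoing pulse and its return.

    The light travels to the target and back, so the one-way distance
    is half of the round-trip path.
    """
    return SPEED_OF_LIGHT_M_PER_S * delay_s / 2.0


def distance_from_phase_shift(phase_shift_rad: float, modulation_hz: float) -> float:
    """Distance from the phase shift of a continuously modulated wave.

    modulation_hz is a hypothetical parameter (the modulation frequency
    of the emitted light). The result is unambiguous only for distances
    below c / (2 * modulation_hz), after which the phase wraps around.
    """
    return SPEED_OF_LIGHT_M_PER_S * phase_shift_rad / (4.0 * math.pi * modulation_hz)


if __name__ == "__main__":
    print(distance_from_pulse_delay(20e-9))            # 20 ns round trip
    print(distance_from_phase_shift(math.pi, 30e6))    # half-cycle shift at 30 MHz
```

A 20 nanosecond round trip corresponds to roughly three meters, which is on the order of a typical room-scale scene.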

According to another example embodiment, time-of-flight analysis may be used to indirectly determine a physical distance from the capture device 20 to a particular location on the targets or objects by analyzing the intensity of the reflected beam of light over time via various techniques including, for example, shuttered light pulse imaging.

In another exemplary embodiment, the capture device 20 may use a structured light to capture depth information. In such an analysis, patterned light (i.e., light displayed as a known pattern such as a grid pattern, a stripe pattern, or a different pattern) may be projected onto the scene via, for example, the IR light component 425. Upon falling onto the surface of one or more targets or objects in the scene, the pattern may deform in response. Such a deformation of the pattern may be captured by, for example, the 3D camera 426 and/or the RGB camera 428 (and/or other sensors) and may then be analyzed to determine a physical distance from the capture device to a particular location on the targets or objects. In some implementations, the IR light component 425 is displaced from the cameras 426 and 428 so that triangulation can be used to determine the distance from the cameras 426 and 428. In some implementations, the capture device 20 will include a dedicated IR sensor that senses IR light or a sensor with an IR filter.

According to another embodiment, one or more capture devices 20 may include two or more physically separated cameras that may view a scene from different angles to obtain visual stereo data that may be resolved to generate depth information. Other types of depth image sensors may also be used to create the depth image.

The capture device 20 may also include a microphone 430, the microphone 430 including a transducer or sensor that may receive and convert sound into an electrical signal. Microphone 430 may be used to receive audio signals that may also be provided to hub computing system 12.

In an example embodiment, the capture device 20 may also include a processor 432 that may be in communication with the image camera component 423. Processor 432 may include a standard processor, a special purpose processor, a microprocessor, etc. that may execute instructions including, for example, instructions for receiving depth images, generating appropriate data formats (e.g., frames), and transmitting data to hub computing system 12.

The capture device 20 may also include a memory 434, which memory 434 may store instructions executed by the processor 432, images or image frames captured by the 3D camera and/or the RGB camera, or any other suitable information, images, or the like. According to an example embodiment, the memory 434 may include Random Access Memory (RAM), Read Only Memory (ROM), cache, flash memory, a hard disk, or any other suitable storage component. As shown in FIG. 6, in one embodiment, the memory 434 may be a separate component in communication with the image capture component 423 and the processor 432. According to another embodiment, memory 434 may be integrated into processor 432 and/or image camera component 423.

Capture device 20 communicates with hub computing system 12 via communication link 436. The communication link 436 may be a wired connection including, for example, a USB connection, a firewire connection, an ethernet cable connection, etc., and/or a wireless connection such as a wireless 802.11b, 802.11g, 802.11a, or 802.11n connection, etc. According to one embodiment, hub computing system 12 may provide a clock to capture device 20 that may be used to determine when to capture a scene, for example, via communication link 436. Additionally, capture device 20 provides depth information and visual (e.g., RGB) images captured by, for example, 3-D camera 426 and/or RGB camera 428 to hub computing system 12 via communication link 436. In one embodiment, the depth image and visual image are transmitted at a rate of 30 frames per second, although other frame rates may be used. Hub computing system 12 may then create a model and use the model, depth information, and captured images to, for example, control an application such as a game or word processor and/or animate an avatar or on-screen character.

Hub computing system 12 includes a skeletal tracking module 450. Module 450 uses the depth images obtained in each frame from capture device 20, and possibly from cameras on one or more head mounted display devices 2, to develop a model of each user 18a, 18b, 18c (or other user) within the FOV of capture device 20 as each user moves around the scene. The model may be a skeletal model described below. Hub computing system 12 may also include a scene mapping module 452. The scene mapping module 452 uses depth image data and possibly RGB image data obtained from the capture device 20, and possibly from cameras on one or more head mounted display devices 2, to develop a map or model of the scene in which the users 18a, 18b, 18c are located. The scene map may also include user locations obtained from the skeletal tracking module 450. The hub computing system may also include a gesture recognition engine 454 for receiving skeletal model data for one or more users in a scene and determining whether a user is performing a predefined gesture or an application-control movement affecting an application running on hub computing system 12.

The skeletal tracking module 450 and the scene mapping module 452 are explained in more detail below. More information about the gesture recognition engine 454 may be found in U.S. patent application No.12/422,661 entitled "Gesture Recognizer System Architecture," filed on 13.4.2009, the entire contents of which are incorporated herein by reference. More information about recognizing gestures may also be found in U.S. patent application No.12/391,150 entitled "Standard Gestures" filed on 23.2.2009 and U.S. patent application No.12/474,655 entitled "Gesture Tool" filed on 29.5.2009, both of which are incorporated herein by reference in their entirety.

Capture device 20 provides RGB images (or visual images in other formats or color spaces) and depth images to hub computing system 12. The depth image may be a plurality of observed pixels, where each observed pixel has an observed depth value. For example, the depth image may include a two-dimensional (2-D) pixel area of the captured scene where each pixel in the 2-D pixel area may have a depth value, such as a distance of an object in the captured scene from the capture device. Hub computing system 12 will use the RGB images and depth images to develop a skeletal model of the user and to track the movement of the user or other objects. Many methods can be used to model and track the human skeleton by using depth images. One suitable example of using depth images to track a skeleton is provided in U.S. patent application No.12/603,437 entitled "Pose Tracking Pipeline" (hereinafter the '437 application), filed on 21.10.2009, which is incorporated herein by reference in its entirety.

The process of the '437 application includes: obtaining a depth image; down-sampling the data; removing and/or smoothing high variance noise data; identifying and removing the background; and assigning each of the foreground pixels to a different part of the body. Based on these steps, the system will fit a model to the data and create a skeleton. The skeleton will include a set of joints and connections between the joints. Other methods for user modeling and tracking may also be used. Suitable tracking techniques are also disclosed in the following four U.S. patent applications, all of which are incorporated herein by reference in their entirety: U.S. patent application No.12/475,308 entitled "Device for Identifying and Tracking Multiple Humans Over Time" filed on 29.5.2009; U.S. patent application No.12/696,282 entitled "Visual Based Identity Tracking" filed on 29.1.2010; U.S. patent application No.12/641,788 entitled "Motion Detection Using Depth Images" filed on 18.12.2009; and U.S. patent application No.12/575,388 entitled "Human Tracking System" filed on 7.10.2009.
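The early preprocessing steps listed above (down-sampling and background removal) may be pictured with the following simplified sketch. It is not the method of the '437 application: the stride-based down-sampling, the fixed depth threshold, the treatment of a zero reading as invalid, and all names are assumptions chosen only to make the order of operations concrete. Noise smoothing, body-part assignment, and model fitting are omitted.

```python
from typing import List, Optional

# A depth image as rows of depth values in millimeters.
# None marks a pixel with no usable reading (an assumption of this sketch).
DepthImage = List[List[Optional[int]]]


def downsample(depth: DepthImage, factor: int = 2) -> DepthImage:
    """Keep every `factor`-th pixel in each direction to reduce the data volume."""
    return [row[::factor] for row in depth[::factor]]


def remove_background(depth: DepthImage, max_depth_mm: int) -> DepthImage:
    """Discard invalid readings (0 or None) and anything beyond max_depth_mm.

    A fixed threshold is a stand-in; a real system would use a more
    robust background model.
    """
    return [
        [d if d is not None and 0 < d <= max_depth_mm else None for d in row]
        for row in depth
    ]


if __name__ == "__main__":
    frame = [
        [1000, 1010, 4000, 4000],
        [1005, 0, 4000, 4000],
        [1200, 1210, 1190, 4000],
        [1195, 1205, 4000, 4000],
    ]
    small = downsample(frame, factor=2)
    foreground = remove_background(small, max_depth_mm=2000)
    print(small)
    print(foreground)
```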

Hub computing system 12, in conjunction with head mounted display device 2 and processing unit 4, described above, is capable of inserting a virtual three-dimensional object into the field of view of one or more users such that the virtual three-dimensional object augments and/or replaces a view of the real world. In one embodiment, head mounted display device 2, processing unit 4, and hub computing system 12 work together in that each device includes a subset of sensors for obtaining the data needed to determine where, when, and how to insert the virtual three-dimensional object. In one embodiment, the calculations to determine where, when, and how to insert the virtual three-dimensional object are performed by the hub computing system 12 and the processing unit 4 working in cooperation with each other. However, in other embodiments, all of the calculations may be performed by the hub computing system 12 operating alone or by the processing unit 4 operating alone. In other embodiments, at least some of these calculations may be performed by head mounted display device 2.

In an exemplary embodiment, hub computing device 12 and processing unit 4 work together to create a map or model of a scene in which the one or more users are located and to track moving objects in the environment. In addition, hub computing system 12 and/or processing unit 4 track the FOV of head mounted display device 2 worn by users 18a, 18b, 18c by tracking the position and orientation of head mounted display device 2. The sensor information obtained by the head mounted display device 2 is transmitted to the processing unit 4. In one embodiment, this information is communicated to hub computing system 12, and hub computing system 12 updates the scene model and communicates it back to the processing unit. Processing unit 4 then uses the additional sensor information it receives from head mounted display device 2 to refine the user's field of view and provide instructions to head mounted display device 2 as to how, where, and when to insert the virtual three-dimensional object. Based on sensor information from the camera in capture device 20 and head mounted display device 2, the scene model and tracking information may be periodically updated between hub computing system 12 and processing unit 4 in a closed loop feedback system explained below.

FIG. 7 illustrates an exemplary embodiment of a computing system that may be used to implement hub computing system 12. As shown in FIG. 7, the multimedia console 500 has a Central Processing Unit (CPU) 501 that has a level one cache 502, a level two cache 504, and a flash ROM (read only memory) 506. The level one cache 502 and the level two cache 504 temporarily store data and thus reduce the number of memory access cycles, thereby improving processing speed and throughput. CPU 501 may be equipped with more than one core and thus have additional level one and level two caches 502 and 504. The flash ROM 506 may store executable code that is loaded during an initial phase of a boot process when the multimedia console 500 is powered on.

A Graphics Processing Unit (GPU) 508 and a video encoder/video codec (coder/decoder) 514 form a video processing pipeline for high speed, high resolution graphics processing. Data is carried from the graphics processing unit 508 to the video encoder/video codec 514 via a bus. The video processing pipeline outputs data to an A/V (audio/video) port 540 for transmission to a television or other display. A memory controller 510 is connected to the GPU 508 to facilitate processor access to various types of memory 512, such as, but not limited to, RAM (random access memory).

The multimedia console 500 includes an I/O controller 520, a system management controller 522, an audio processing unit 523, a network interface controller 524, a first USB host controller 526, a second USB controller 528 and a front panel I/O subassembly 530 that are preferably implemented on a module 518. The USB controllers 526 and 528 serve as hosts for peripheral controllers 542(1)-542(2), a wireless adapter 548, and an external memory device 546 (e.g., flash memory, external CD/DVD ROM drive, removable media, etc.). The network interface 524 and/or wireless adapter 548 provide access to a network (e.g., the Internet, home network, etc.) and may be any of a wide variety of wired or wireless adapter components including an Ethernet card, a modem, a Bluetooth module, a cable modem, and the like.

System memory 543 is provided to store application data that is loaded during the boot process. A media drive 544 is provided and may include a DVD/CD drive, a blu-ray drive, a hard disk drive, or other removable media drive, among others. The media drive 544 may be located internal or external to the multimedia console 500. Application data may be accessed via the media drive 544 for execution, playback, etc. by the multimedia console 500. The media drive 544 is connected to the I/O controller 520 via a bus, such as a Serial ATA bus or other high speed connection (e.g., IEEE 1394).

The system management controller 522 provides various service functions related to ensuring availability of the multimedia console 500. The audio processing unit 523 and the audio codec 532 form a corresponding audio processing pipeline with high fidelity and stereo processing. Audio data is transferred between the audio processing unit 523 and the audio codec 532 via a communication link. The audio processing pipeline outputs data to the A/V port 540 for reproduction by an external audio player or device having audio capabilities.

The front panel I/O subassembly 530 supports the functionality of the power button 550 and the eject button 552, as well as any LEDs (light emitting diodes) or other indicators exposed on the outer surface of the multimedia console 500. The system power supply module 536 provides power to the components of the multimedia console 500. A fan 538 cools the circuitry within the multimedia console 500.

The CPU 501, GPU 508, memory controller 510, and various other components within the multimedia console 500 are interconnected via one or more buses, including serial and parallel buses, a memory bus, a peripheral bus, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures may include a Peripheral Component Interconnect (PCI) bus, a PCI-Express bus, and the like.

When the multimedia console 500 is powered ON, application data may be loaded from the system memory 543 into memory 512 and/or caches 502, 504 and executed on the CPU 501. The application may present a graphical user interface that provides a consistent user experience when navigating to different media types available on the multimedia console 500. In operation, applications and/or other media contained within the media drive 544 may be launched or played from the media drive 544 to provide additional functionalities to the multimedia console 500.

The multimedia console 500 may be operated as a standalone system by simply connecting the system to a television or other display. In this standalone mode, the multimedia console 500 allows one or more users to interact with the system, watch movies, or listen to music. However, with the integration of broadband connectivity made available through the network interface 524 or the wireless adapter 548, the multimedia console 500 may further be operated as a participant in a larger network community. Additionally, the multimedia console 500 may communicate with the processing unit 4 through a wireless adapter 548.

When the multimedia console 500 is powered on, a set amount of hardware resources may be reserved for system use by the multimedia console operating system. These resources may include reservations of memory, CPU and GPU cycles, network bandwidth, and so on. Because these resources are reserved at system boot, the reserved resources are not present from an application perspective. In particular, the memory reservation is preferably large enough to contain the launch kernel, concurrent system applications, and drivers. The CPU reservation is preferably constant such that if the reserved CPU usage is not used by the system applications, the idle thread will consume any unused cycles.

For GPU reservation, lightweight messages (e.g., popups) generated by system applications are displayed by using a GPU interrupt to schedule code to render the popup into an overlay. The amount of memory required for the overlay depends on the overlay area size, and the overlay preferably scales with the screen resolution. Where the concurrent system application uses a full user interface, it is preferable to use a resolution that is independent of the application resolution. A scaler may be used to set this resolution, thereby eliminating the need to change the frequency and cause a TV resync.

After the multimedia console 500 boots and system resources are reserved, concurrent system applications execute to provide system functionality. The system functions are encapsulated in a set of system applications that execute within the reserved system resources described above. The operating system kernel identifies threads that are system application threads and not game application threads. The system applications are preferably scheduled to run on CPU 501 at predetermined times and intervals in order to provide a consistent system resource view for the application. The scheduling is done to minimize cache disruption caused by the gaming application running on the console.

When the concurrent system application requires audio, audio processing is scheduled asynchronously to the gaming application due to time sensitivity. A multimedia console application manager (described below) controls the audio level (e.g., mute, attenuate) of the gaming application when system applications are active.

Optional input devices (e.g., controllers 542(1) and 542(2)) are shared by the gaming application and the system application. The input devices are not reserved resources, but are switched between the system application and the gaming application so that each will have the focus of the device. The application manager preferably controls the switching of the input stream without the gaming application's knowledge, and a driver maintains state information regarding focus switches. Capture device 20 may define additional input devices for console 500 through USB controller 526 or other interface. In other embodiments, hub computing system 12 may be implemented using other hardware architectures. No particular hardware architecture is required.

Each head mounted display device 2 and processing unit 4 (sometimes collectively referred to as a mobile display device) shown in FIG. 1 communicates with one hub computing system 12 (also referred to as hub 12). In further embodiments, there may be one, two, or more than three mobile display devices in communication with the hub 12. Each mobile display device will communicate with the hub using wireless communication as described above. It is contemplated in such embodiments that much of the information that would benefit all mobile display devices would be calculated and stored at the hub and communicated to each mobile display device. For example, the hub will generate a model of the environment and provide the model to all mobile display devices in communication with the hub. Additionally, the hub may track the location and orientation of the mobile display devices and the moving object in the room and then transmit this information to each mobile display device.

In another embodiment, the system may include a plurality of hubs 12, where each hub includes one or more mobile display devices. These hubs may communicate with each other directly or through the internet (or other network). Such an embodiment is disclosed in U.S. patent application No.12/905,952 entitled "Fusing Virtual Content Into Real Content" (MS #330057.01) filed by Flaks et al. on 15.10.2010, the entire contents of which are incorporated herein by reference.

Furthermore, in other embodiments, the hub 12 may be omitted entirely. Such an embodiment is shown, for example, in FIG. 8. In further embodiments, the system may include one, two, or more than three mobile display devices 580. One advantage of such an embodiment is that the mixed reality experience of the present system becomes fully mobile and can be used in both indoor and outdoor settings.

In the embodiment of FIG. 8, all functions performed by hub 12 in the following description may alternatively be performed by one of the processing units 4, by some of the processing units 4 working in cooperation, or by all of the processing units 4 working in cooperation. In such embodiments, the respective mobile display devices 580 perform all of the functions of the system 10, including generating and updating state data, scene maps, each user's view of the scene maps, all texture and rendering information, video and audio data, and other information needed to perform the operations described herein. The embodiment described below with reference to the flowchart of FIG. 9 includes a hub 12. However, in each such embodiment, one or more of the processing units 4 may alternatively perform all of the functions of the hub 12.

FIG. 9 is a high-level flow diagram of the operation and interactivity of hub computing system 12, processing unit 4 and head mounted display device 2 during a discrete time period, such as the time taken to generate, render and display a single frame of image data to each user. In an embodiment, data may be refreshed at a rate of 60 Hertz, but in further embodiments it may be refreshed more or less frequently.

In general, the system generates a scene map having x, y, z coordinates of an environment and objects in the environment, such as users, real objects, and virtual objects. The virtual objects may be virtually placed in the environment, for example, by an application running on hub computing system 12. The system also tracks the FOV of each user. While all users may view the same aspects of a scene, they view these aspects from different perspectives. Thus, the system generates for each person a unique view of the scene, adjusted for parallax and for occlusion of virtual or real-world objects, both of which may differ for each user.

For a given frame of image data, a user's view may include one or more real and/or virtual objects. When a user turns their head, for example, from left to right or from top to bottom, the relative position of the real-world object in the user's field of view inherently moves within the user's field of view. However, displaying virtual objects to a user as the user moves their head is a more difficult problem. In the example of a user viewing a stationary virtual object within their FOV, if the user moves their head to the left to move the FOV to the left, the display of the virtual object needs to be shifted to the right by the offset of the user's FOV so that the net effect is that the virtual three-dimensional object remains stationary within the FOV. This can be more difficult when the virtual objects themselves, as well as other users, can also move in the scene, possibly obscuring the user's view of the objects in the scene. One system for accomplishing these operations is explained below with reference to the flow diagrams of fig. 9-16.
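The compensation described above for a stationary virtual object may be sketched, in one dimension only, as follows. The sketch assumes that head orientation is reduced to a single yaw angle in degrees, with positive angles to the user's right; the function names and the example field of view are hypothetical. A full implementation would use a three-dimensional pose and would also account for occlusion, which this sketch ignores.

```python
def wrap_degrees(angle_deg: float) -> float:
    """Wrap an angle into the range [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def display_angle(object_azimuth_deg: float, head_yaw_deg: float) -> float:
    """Angle at which a world-fixed object must be drawn, relative to the view center.

    When the head turns left (yaw decreases), the result increases, so the
    object is drawn further to the right and appears to stay put in the room.
    """
    return wrap_degrees(object_azimuth_deg - head_yaw_deg)


def is_in_fov(angle_deg: float, horizontal_fov_deg: float) -> bool:
    """True if the drawn angle falls inside the horizontal field of view."""
    return abs(angle_deg) <= horizontal_fov_deg / 2.0


if __name__ == "__main__":
    # Object straight ahead in the room; user turns the head 10 degrees left.
    angle = display_angle(object_azimuth_deg=0.0, head_yaw_deg=-10.0)
    print(angle, is_in_fov(angle, horizontal_fov_deg=40.0))
```

The latency problem discussed in the Background arises when the yaw value used here is older than the head's actual orientation at the moment the frame is shown.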

A system for presenting mixed reality to one or more users 18a, 18b, and 18c may be configured at step 600. For example, the users 18a, 18b, 18c and other users or operators of the system may specify the virtual objects to be presented, and how, when, and where the virtual objects are to be presented. In an alternative embodiment, an application running on the hub 12 and/or the processing unit 4 may configure the system with respect to the virtual object to be presented.

In steps 604 and 630, the hub 12 and processing unit 4 collect data from the scene. For the hub 12, this may be image and audio data sensed by the depth camera 426, the RGB camera 428, and the microphone 430 of the capture device 20. For processing unit 4, this may be the image data sensed by head mounted display device 2, and specifically by camera 112, eye tracking component 134, and IMU 132 in step 652. In step 656, the data collected by the head mounted display device 2 is sent to the processing unit 4. At step 630, the processing unit 4 processes the data and sends it to the hub 12.

At step 608, the hub 12 performs various setup operations that allow the hub 12 to coordinate the image data of its capture device 20 and one or more processing units 4. In particular, even if the position of the capture device 20 relative to the scene is known (and may not be), the cameras on the head mounted display device 2 still move around the scene. Thus, in embodiments, the position and time capture of each imaging camera requires calibration with the scene, with each other, and with the hub 12. Further details of step 608 are described below in the flowchart of FIG. 10.

The operations of step 608 include determining, at step 670, the clock skew for each imaging device in system 10. Specifically, in order to coordinate the image data from each camera in the system, it is necessary to ensure that the coordinated image data is from the same time. Details regarding determining clock skew and synchronizing image data are disclosed in the following documents: U.S. patent application No.12/772,802 entitled "Heterogeneous Image Sensor Synchronization" filed on 3.5.2010; and U.S. patent application No.12/792,961 entitled "Synthesis Of Information From Multiple Audio-Visual Sources," filed on 3.6.2010, the entire contents of both of which are incorporated herein by reference. In general, image data from capture device 20 and image data from one or more processing units 4 are time stamped from a single master clock in hub 12. The hub 12 determines the time offset for each imaging camera in the system by using the time stamps of all such data for a given frame and a known resolution for each camera. From these offsets, the hub 12 may determine the differences between the images received from each camera and the adjustments to be made to those images.

The hub 12 may select a reference timestamp from one of the frames received by the camera. The hub 12 may then add or subtract time to or from the image data received from all other cameras to synchronize with the reference timestamp. It can be appreciated that various other operations may be used for the calibration process to determine the time offset and/or synchronize different cameras together. The determination of the time offset may be performed once after initially receiving image data from all cameras. Alternatively, it may be performed periodically, such as for each frame or some number of frames.
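
By way of illustration only, the selection of a reference timestamp and the addition or subtraction of time for the other cameras might be sketched as follows. The camera names, the millisecond values, and the helper functions `time_offsets` and `synchronize` are hypothetical and are not taken from this disclosure.

```python
def time_offsets(frame_timestamps, reference_camera):
    """Return, per camera, the time to add to its timestamps to match the reference.

    frame_timestamps maps a camera name to the master-clock timestamp (ms)
    stamped on that camera's copy of the same frame.
    """
    reference = frame_timestamps[reference_camera]
    return {camera: reference - stamp for camera, stamp in frame_timestamps.items()}


def synchronize(timestamp, camera, offsets):
    """Shift one camera's timestamp onto the reference camera's timeline."""
    return timestamp + offsets[camera]


# Hypothetical timestamps (ms) for one frame as seen by three cameras.
stamps = {"capture_device_20": 1000.0, "hmd_camera_a": 1004.0, "hmd_camera_b": 997.5}
offsets = time_offsets(stamps, "capture_device_20")

# A later image from hmd_camera_a stamped 1020.0 ms is moved to the reference timeline.
aligned = synchronize(1020.0, "hmd_camera_a", offsets)
```

The offsets may be recomputed once, per frame, or every several frames, in keeping with the alternatives described above.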

Step 608 also includes the operation of calibrating the position of all cameras relative to each other in the x, y, z Cartesian space of the scene. Once this information is known, the hub 12 and/or the one or more processing units 4 can form a scene map or model to identify the geometry of the scene and the geometry and location of objects (including users) within the scene. The depth and/or RGB data may be used when calibrating the image data of all cameras to each other. Techniques for calibrating camera views using RGB information alone are described, for example, in U.S. patent publication No.2007/0110338 entitled "Navigating Images Based Geometric Alignment and Object Based Controls," published on 17.5.2007, the entire contents of which are incorporated herein by reference.

The imaging cameras in system 10 may each have some lens distortion that needs to be corrected in order to calibrate the images from the different cameras. Once all of the image data is received from the various cameras of the system at steps 604 and 630, the image data may be adjusted to account for lens distortion of the various cameras at step 674. The distortion of a given camera (depth or RGB) may be a known attribute provided by the camera manufacturer. If not, algorithms for calculating the distortion of the camera are also known, including for example imaging objects of known dimensions, such as checkerboard patterns at different positions within the field of view of the camera. The deviation of the camera view coordinates of the points in the image will be a result of camera lens distortion. Once the degree of lens distortion is known, the distortion can be corrected by knowing the inverse matrix transform that produces a uniform camera view map of the points in the point cloud for a given camera.

Next, hub 12 may translate the distortion corrected image data points captured by each camera from the camera view to the orthogonal 3-D world view at step 678. The orthogonal 3-D world view is a point cloud mapping of all image data captured by capture device 20 and the head mounted display device cameras in an orthogonal x, y, z Cartesian coordinate system. Matrix transformation equations for converting camera views into orthogonal 3-D world views are known. See, for example, David H. Eberly, "3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics," Morgan Kaufmann Publishers (2000), which is incorporated herein by reference in its entirety. See also previously cited U.S. patent application No.12/792,961.
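
As a minimal sketch of one such conversion, and assuming a simple pinhole camera model that this description does not itself specify, a depth pixel may be back-projected into x, y, z coordinates in the frame of the camera that captured it. The intrinsic parameters `fx`, `fy`, `cx`, `cy` and the function name are hypothetical.

```python
def pixel_to_camera_xyz(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a measured depth into camera-centered x, y, z.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    The result is still expressed from the perspective of this one camera.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical 640x480 depth camera with a 500-pixel focal length.
center_point = pixel_to_camera_xyz(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0)
right_point = pixel_to_camera_xyz(420, 240, 2.0, 500.0, 500.0, 320.0, 240.0)
```

Applying this to every distortion corrected pixel yields a per-camera point cloud of the kind that the following steps correlate with the point clouds of the other cameras.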

Each camera in system 10 may construct an orthogonal 3-D world view at step 678. At the end of step 678, the x, y, z world coordinates of the data points from a given camera are still from the perspective of that camera, and have not been correlated with the x, y, z world coordinates of the data points from other cameras of system 10. The next step is to translate the individual orthogonal 3-D world views of the different cameras into a single overall 3-D world view that is shared by all the cameras of the system 10.

To accomplish this, embodiments of hub 12 may then look for keypoint discontinuities, or cues, in the point clouds of the world views of the respective cameras at step 682, and then identify cues that are the same between different point clouds of different cameras at step 684. Once the hub 12 is able to determine that two world views of two different cameras include the same cues, the hub 12 can determine the position, orientation, and focal length of the two cameras relative to each other and to the cues in step 688. In embodiments, not all cameras in system 10 will share the same common cues. However, as long as the first and second cameras have shared cues, and at least one of those cameras has a shared view with a third camera, the hub 12 is able to determine the position, orientation, and focal length of the first, second, and third cameras relative to each other and to a single overall 3-D world view. The same is true for additional cameras in the system.

There are various known algorithms for identifying cues from an image point cloud. Such algorithms are set forth, for example, in "A Performance Evaluation of Local Descriptors" by Mikolajczyk, K. and Schmid, C. (IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 10, 1615-). Another method for detecting cues using image data is the scale-invariant feature transform (SIFT) algorithm. The SIFT algorithm is described, for example, in U.S. Pat. No.6,711,293, entitled "Method and Apparatus for Identifying Scale Invariant Features in an Image and Use of Same for Locating an Object in an Image", issued 3/23/2004, the entire contents of which are incorporated herein by reference. Another cue detector method is the Maximally Stable Extremal Regions (MSER) algorithm. The MSER algorithm is described, for example, in the paper "Robust Wide Baseline Stereo From Maximally Stable Extremal Regions" by J. Matas, O. Chum, M. Urban, and T. Pajdla (Proceedings of the British Machine Vision Conference, page 384-).

In step 684, cues that are shared between point clouds from two or more cameras are identified. Conceptually, when there is a first set of vectors between a first camera and a set of cues in the first camera's Cartesian coordinate system, and a second set of vectors between a second camera and the same set of cues in the second camera's Cartesian coordinate system, the two systems may be resolved relative to each other into a single Cartesian coordinate system that includes the two cameras. There are a number of known techniques for finding shared cues between point clouds from two or more cameras. Such techniques are shown, for example, in "An Optimal Algorithm for Approximate Nearest Neighbor Searching Fixed Dimensions" by Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., and Wu, A.Y. (Journal of the ACM 45, 6, 891-). Other techniques may be used instead of or in addition to the above-incorporated approximate nearest neighbor solution of Arya et al., including but not limited to hashing or context sensitive hashing.

When point clouds from two different cameras share a sufficiently large number of matching cues, a matrix correlating the two point clouds together may be estimated, for example, by random sample consensus (RANSAC) or various other estimation techniques. Matches that are outliers to the recovered fundamental matrix may then be removed. After finding a set of assumed, geometrically consistent matches between a pair of point clouds, the matches may be organized into a set of tracks for the respective point clouds, where a track is a set of mutually matching cues between the point clouds. The first track in the set may contain a projection of each common cue in the first point cloud. The second track in the set may contain a projection of each common cue in the second point cloud. Using the information from steps 448 and 450, the point clouds from the different cameras can be resolved into a single point cloud in a single orthogonal 3-D real world view.
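
The RANSAC idea may be illustrated with a deliberately simplified sketch. Instead of the fundamental matrix referred to above, the example below estimates a 2-D rigid transform (one rotation angle and a translation) between matched cues, which keeps the arithmetic small enough for the standard library. The tolerance, iteration count, test data, and all function names are assumptions made for illustration only.

```python
import math
import random


def fit_rigid_2d(p1, p2, q1, q2):
    """Rotation angle and translation that carry segment p1->p2 onto q1->q2."""
    theta = (math.atan2(q2[1] - q1[1], q2[0] - q1[0])
             - math.atan2(p2[1] - p1[1], p2[0] - p1[0]))
    c, s = math.cos(theta), math.sin(theta)
    tx = q1[0] - (c * p1[0] - s * p1[1])
    ty = q1[1] - (s * p1[0] + c * p1[1])
    return theta, tx, ty


def apply_rigid_2d(model, p):
    """Rotate point p by the model angle, then translate it."""
    theta, tx, ty = model
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def ransac_rigid_2d(matches, tolerance=0.05, iterations=200, seed=0):
    """Return the model with the most inlier matches, and those inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (p1, q1), (p2, q2) = rng.sample(matches, 2)  # minimal sample: two matches
        if p1 == p2:
            continue  # degenerate sample, direction undefined
        model = fit_rigid_2d(p1, p2, q1, q2)
        inliers = []
        for p, q in matches:
            mapped = apply_rigid_2d(model, p)
            if math.hypot(mapped[0] - q[0], mapped[1] - q[1]) <= tolerance:
                inliers.append((p, q))
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Six consistent cue matches generated from a known transform, plus two false matches.
true_model = (math.radians(30.0), 1.0, -2.0)
cues_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (1.0, 3.0), (3.0, 2.0)]
matches = [(p, apply_rigid_2d(true_model, p)) for p in cues_a]
matches += [((5.0, 5.0), (0.0, 0.0)), ((4.0, 1.0), (9.0, 9.0))]
model, inliers = ransac_rigid_2d(matches)
```

Matches outside the returned inlier set correspond to the outliers that are removed before the point clouds are resolved together.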

The positions and orientations of all cameras are calibrated relative to the single point cloud and the single orthogonal 3-D real world view. To resolve the individual point clouds together, the projections of the cues in the set of trajectories are analyzed for both point clouds. From these projections, the hub 12 may determine a perspective of the first camera relative to the cue and may also determine a perspective of the second camera relative to the cue. Thus, hub 12 may resolve the point clouds into a best estimate for a single point cloud and a single orthogonal 3-D real-world view that includes the cues and other data points from both point clouds.

This process is repeated for any other camera until the single orthogonal 3-D real world view includes all cameras. Once this is done, the hub 12 can determine the relative position and orientation of the cameras with respect to the single orthogonal 3-D real world view and with respect to each other. The hub 12 may also determine the focal length of each camera relative to a single orthogonal 3-D real world view.

Referring again to FIG. 9, once the system is calibrated at step 608, a scene map may be developed at step 610 that identifies the geometry of the scene and the geometry and location of objects within the scene. In an embodiment, the scene map generated in a given frame may include the x, y, and z positions of all users, real world objects, and virtual objects in the scene. All of this information is obtained during the image data collection steps 604, 656 and is calibrated together at step 608.

At least the capture device 20 includes a depth camera for determining the depth of the scene (to the extent that it may be bounded by walls, etc.) and the depth position of objects within the scene. As explained below, the scene map is used to locate virtual objects within the scene and to display virtual three-dimensional objects with appropriate occlusions (a virtual three-dimensional object may be occluded, or a virtual three-dimensional object may occlude a real world object or another virtual three-dimensional object). The system 10 may include multiple depth image cameras to obtain all depth images from the scene, or a single depth image camera, such as, for example, the depth image camera 426 of the capture device 20, may be sufficient to capture all depth images from the scene. A similar method for determining scene mapping within an unknown environment is known as simultaneous localization and mapping (SLAM). An example of a SLAM is disclosed in U.S. Pat. No.7,774,158 entitled "Systems and Methods for Visual Simultaneous localization and Mapping", issued on 10.8.2010, the entire contents of which are incorporated herein by reference.

At step 612, the system will detect and track moving objects, such as people moving in the room, and update the scene map based on the locations of the moving objects. This includes using a skeletal model of the user within the scene as described above. At step 616, the hub determines the x, y, and z position, the orientation, and the FOV of each head mounted display device 2 for all users within system 10. Further details of step 616 are described below with reference to the flowchart of FIG. 11. The steps of FIG. 11 are described below with reference to a single user. However, the steps of FIG. 11 can be performed for each user within the scene.

At step 700, calibrated image data of a scene is analyzed at the hub to determine both the user head position and a face unit vector looking straight out from the user's face. The head position is identified in the skeletal model. The face unit vector may be determined by defining a plane of the user's face from the skeletal model, and taking a vector perpendicular to that plane. The plane may be identified by determining the position of the user's eyes, nose, mouth, ears, or other facial features. The face unit vector may be used to define the head orientation of the user and may be considered to be the center of the FOV of the user. The face unit vector may also or alternatively be identified from camera image data returned from the camera 112 on the head mounted display device 2. In particular, based on what the camera 112 on the head mounted display device 2 sees, the associated processing unit 4 and/or hub 12 can determine a face unit vector that represents the orientation of the user's head.
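
For illustration, the perpendicular to the plane of the face can be computed as a normalized cross product of two vectors lying in that plane. The sketch below assumes three landmarks (both eyes and the mouth), a right-handed coordinate system with y up, and a landmark ordering chosen so that the normal points out of the face; the function names and coordinates are hypothetical.

```python
import math


def subtract(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    if length == 0.0:
        raise ValueError("landmarks are collinear; face plane is undefined")
    return (v[0] / length, v[1] / length, v[2] / length)


def face_unit_vector(left_eye, right_eye, mouth):
    """Unit normal of the plane through three facial landmarks."""
    across = subtract(right_eye, left_eye)  # along the eye line
    down = subtract(mouth, left_eye)        # from an eye toward the mouth
    return normalize(cross(across, down))


# Hypothetical landmarks (meters) for a user facing the -z direction.
face_vector = face_unit_vector((-0.03, 1.70, 0.0), (0.03, 1.70, 0.0), (0.0, 1.62, 0.0))
```

Swapping the two eye landmarks reverses the sign of the result, so a real system would need a convention for which side of the plane is the front of the face.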

The position and orientation of the user's head may also or alternatively be determined at step 704 by analyzing the position and orientation of the user's head from an earlier time (either earlier in the frame or from a previous frame), and then using inertial information from the IMU 132 to update the position and orientation of the user's head. Information from the IMU 132 may provide accurate kinematic data for the user's head, but the IMU typically does not provide absolute position information about the user's head. This absolute position information, also known as "ground truth," may be provided from image data obtained from the capture device 20, from the camera on the head mounted display device 2 of the subject user, and/or from the head mounted display devices 2 of other users.

In an embodiment, the position and orientation of the user's head may be determined by steps 700 and 704 acting in concert. In further embodiments, one or the other of steps 700 and 704 may be used to determine the position and orientation of the user's head.

It may happen that the user is not looking straight ahead. Thus, in addition to identifying the user head position and orientation, the hub may also take into account the position of the user's eyes in their head. This information may be provided by the eye tracking component 134 described above. The eye tracking component can identify the position of the user's eyes, which can be expressed as an eye unit vector showing the left, right, up, and/or down deviation from the position where the user's eyes are centered and looking straight ahead (i.e., the face unit vector). The face unit vector may be adjusted by the eye unit vector to define where the user is looking.

At step 710, the FOV of the user may then be determined. The view range of the user of head mounted display device 2 may be predefined based on the up, down, left, and right peripheral vision of a hypothetical user. To ensure that the FOV calculated for a given user includes the objects that a particular user may be able to see within the FOV, the hypothetical user may be considered to have the largest possible peripheral vision. In embodiments, some predetermined extra FOV may be added to this to ensure that enough data is captured for a given user.

The FOV of the user at a given moment may then be calculated by taking the view range and centering it around the face unit vector adjusted by any deviation of the eye unit vector. In addition to defining what the user is viewing at a given time, this determination of the user's field of view also helps to determine what the user cannot see. As explained above, limiting the processing of virtual objects to only those areas that are visible to a particular user will increase processing speed and reduce latency.
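
A minimal sketch of this calculation follows. It makes several simplifying assumptions that are not part of the description: the head orientation and the eye deviation are expressed as yaw and pitch angles that simply add, the view range is a circular cone with a single half angle rather than separate up, down, left, and right extents, and the user looks along -z when all angles are zero. All names are hypothetical.

```python
import math


def gaze_direction(face_yaw, face_pitch, eye_yaw=0.0, eye_pitch=0.0):
    """Unit gaze vector from head yaw/pitch adjusted by eye deviation (radians)."""
    yaw = face_yaw + eye_yaw
    pitch = face_pitch + eye_pitch
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))


def in_fov(head_position, gaze, object_position, half_angle, margin=0.0):
    """True if the object lies within the cone centered on the gaze vector."""
    to_object = [o - h for o, h in zip(object_position, head_position)]
    distance = math.sqrt(sum(c * c for c in to_object))
    if distance == 0.0:
        return True  # object coincides with the head position
    cos_angle = sum(g * c for g, c in zip(gaze, to_object)) / distance
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.acos(cos_angle) <= half_angle + margin


head = (0.0, 0.0, 0.0)
half_angle = math.radians(50.0)
straight_ahead = gaze_direction(0.0, 0.0)
eyes_turned_right = gaze_direction(0.0, 0.0, eye_yaw=math.radians(45.0))

sees_front = in_fov(head, straight_ahead, (0.0, 0.0, -2.0), half_angle)
sees_side = in_fov(head, straight_ahead, (2.0, 0.0, 0.0), half_angle)
sees_side_with_eyes = in_fov(head, eyes_turned_right, (2.0, 0.0, 0.0), half_angle)
```

The `margin` parameter plays the role of the predetermined extra FOV mentioned above; objects for which `in_fov` is false need not be processed for that user.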

In the above embodiment, the hub 12 calculates the FOV for each user in the scene. In a further embodiment, the processing unit 4 of the user may share in this task. For example, once the user head position and eye orientation are estimated, this information may be sent to the processing unit, which may update the position and orientation, etc., based on more recent data regarding the head position (from IMU 132) and eye position (from eye tracking component 134).

Returning now to FIG. 9, the application running on the hub 12 may have placed virtual objects into the scene. At step 618, the hub may use the scene map, and any application-defined motion of the virtual objects, to determine the x, y, and z positions of all such virtual objects at the current time. Alternatively, the information may be generated by one or more of the processing units 4 and sent to the hub 12 at step 618.

Further details of step 618 are shown in the flowchart of FIG. 12. At step 714, the hub determines whether the virtual three-dimensional object is user-dependent. That is, some virtual objects may be animated independently of any user. They may be added by the application and interact with the user, but their location is not dependent on the particular user (except for collision detection as explained below). An example of this is shown in FIG. 17, explained below, as virtual objects 804 and 806.

On the other hand, some virtual three-dimensional objects that are user-dependent may be added. These virtual objects may be provided to augment the user. They may be provided around or on top of the user. For example, a halo, fire, or light may be provided around the user's contour. Text may be displayed over the user, such as the user's name, contact information, or other information about the user. The user may be provided with a virtual article of apparel such as a jacket, sports pants, or hat. As another alternative, a virtual three-dimensional object may be provided on top of the user to change some aspect of the user. The user's hair or clothing may be changed, or a virtual three-dimensional object may turn the user a certain color. An example of this is shown in FIG. 18, explained below, as virtual object 850. As indicated above, the type of virtual object added may be controlled by an application running on the hub 12 and/or the processing unit 4. However, the location of the virtual object may be calculated by the hub 12, in communication with the application, based in part on whether the virtual object is user-dependent or user-independent.

Thus, at step 714, the hub 12 may check whether a virtual three-dimensional object to be added to a scene is user-dependent. If not, the hub calculates a new position of the virtual three-dimensional object based on one or more application metrics at step 718. For example, the application may set whether and how quickly a virtual three-dimensional object moves in the scene. A change in the shape, appearance, or orientation of the virtual three-dimensional object may be determined. The application may effect various other changes to the virtual object. These changes are provided to the hub at step 718, and the hub may then update the position, orientation, shape, appearance, etc. of the virtual three-dimensional object at step 718. At step 720, the hub may check whether the updated virtual object occupies the same space as a real-world object in the scene. In particular, all pixel locations of real world objects in the scene are known, and all pixel locations of the updated virtual object are also known. If there is any overlap, the hub 12 may adjust the position of the virtual object according to default rules or metrics defined by the application.
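
The overlap check of step 720 can be sketched with sets of occupied grid cells. The integer cell representation, the default rule of shifting the virtual object one cell at a time in a fixed direction, and the function names are assumptions for illustration; the description leaves the actual rule to defaults or to the application.

```python
def overlaps(virtual_cells, real_cells):
    """True if any cell is occupied by both the virtual and the real object."""
    return not virtual_cells.isdisjoint(real_cells)


def resolve_overlap(virtual_cells, real_cells, step=(0, 1, 0), max_steps=100):
    """Shift the virtual object by `step` until it no longer overlaps real objects."""
    cells = set(virtual_cells)
    for _ in range(max_steps):
        if cells.isdisjoint(real_cells):
            return cells
        cells = {(x + step[0], y + step[1], z + step[2]) for x, y, z in cells}
    raise ValueError("no overlap-free placement found within max_steps")


# Hypothetical scene: a real table occupies two cells, and the updated
# virtual object has been placed so that it intersects the table top.
real_table = {(0, 0, 0), (0, 1, 0)}
virtual_object = {(0, 1, 0), (0, 2, 0)}

collided = overlaps(virtual_object, real_table)
adjusted = resolve_overlap(virtual_object, real_table)
```

Set intersection keeps the check proportional to the size of the smaller object, which matters if the check is repeated every frame.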

If it is determined at step 714 that the virtual three-dimensional object is associated with a particular user, the hub performs step 724 of updating the position, orientation, shape, appearance, etc. based at least in part on the updated position of the user. The skeletal model provides a volumetric description of the user, including a pixel-by-pixel description of the user in x, y, z Cartesian coordinate space. Using the skeletal model, the hub can change pixels around, on, or within the user's outline depending on the application metrics. This need not be the body contour of the user. In further embodiments, a particular body part may be designated as having an associated virtual three-dimensional object such that the virtual three-dimensional object moves with the body part.

Steps 714 to 724 for providing the updated position, orientation, shape, appearance, etc. of each virtual object are agnostic as to the user's perspective. That is, steps 714 through 724 are not dependent on the particular FOV of the user. These steps merely define the 3-D positions that the virtual object will occupy in x, y, z Cartesian space. This information may be incorporated into the scene map for the frame.

Once the above steps 600-618 have been performed, the hub 12 may transmit the determined information to one or more processing units 4 at step 626. The information transmitted in step 626 comprises the transmission of scene maps to processing units 4 of all users. The transmitted information may also include the transmission of the determined FOV of each head mounted display device 2 to the processing unit 4 of the respective head mounted display device 2. The information transferred may also include the transfer of virtual object characteristics including: the determined position, orientation, shape, appearance, and occlusion properties (i.e., whether the virtual object occludes or is occluded by another object from the view of the particular user).

The process steps 600 to 626 are described above by way of example only. It is understood that one or more of these steps may be omitted in further embodiments, the steps may be performed in a different order, or additional steps may be added. The processing steps 604 to 618 may be computationally expensive, but the powerful hub 12 may perform these steps several times within a 60 Hz frame. In further embodiments, one or more of steps 604 to 618 may alternatively or additionally be performed by one or more of the processing units 4. Furthermore, although FIG. 9 illustrates the determination of various parameters, with all of these parameters then being communicated at once at step 626, it is to be understood that the determined parameters may be sent asynchronously to the processing unit 4 as soon as they are determined.

Although not shown in the flowchart of FIG. 9, the hub 12 and/or the processing unit 4 may also identify objects within the scene. For example, the system may have a user profile with a visual image that may be matched to an image of the detected object. Alternatively, the user profile may describe characteristics of a person that may be matched based on the depth image or the visual image. In another embodiment, users may log into the system, and hub 12 may use the login process to identify a particular user and track that user throughout the interaction described herein. The hub 12 may also have access to a database of known shapes. In this example, the hub computing device matches as many objects in the model as possible to shapes in the database. Upon identifying a user or object, metadata identifying the user or object may be added to the scene map. For example, the metadata may indicate that the particular object is a round shiny table, a particular person, a green leather couch, etc.

The operation of the processing unit 4 and the head mounted display device 2 will now be explained with reference to steps 630 to 658. In general, for each frame, processing unit 4 extrapolates the received data to predict the final positions of the objects in the scene, and the associated user's view of those objects, at the future time when the virtual objects will be displayed to the user by head mounted display device 2. The extrapolated prediction of the final FOV generated by the processing unit 4 may be continuously or repeatedly updated during a frame based on data received from the hub 12 and the head mounted display device 2. Given the inherent latency in determining the final FOV and the virtual objects within the final FOV, extrapolating the data into the future allows the system to predict the view of the virtual objects at the time they will be displayed, thereby effectively eliminating latency from the system. This feature is explained in more detail below. The following description is directed to a single processing unit 4 and head mounted display device 2. However, the following description may be applied to each processing unit 4 and display device 2 in the system.

As described above, at initial step 656, head mounted display device 2 generates image and IMU data, which is sent to hub 12 by processing unit 4 at step 630. While the hub 12 processes the image data, the processing unit 4 also processes the image data and performs steps to prepare the rendered image. At step 632, processing unit 4 may use state information from the past and/or present to extrapolate a state estimate for the future time when head mounted display device 2 presents the rendered frame of image data to the user of head mounted display device 2. In particular, the processing unit determines, at step 632, for the current frame, a prediction of the final FOV of the head mounted display device 2 at the future time when the image is to be displayed on the head mounted display device 2. Further details of step 632 are described below with reference to the flowchart of FIG. 13.

At step 750, the processing unit 4 receives the image and IMU data from the head mounted display device 2, and at step 752, the processing unit 4 receives processed image data including the scene map, the FOV of the head mounted display device 2, and occlusion data.

In step 756, the processing unit 4 calculates the time from the current time t until the image is displayed by the head mounted display device 2, X milliseconds (ms). In general, X may be up to 250 milliseconds, but may be greater or less than this in other embodiments. Further, while the embodiments are described below in terms of predicting X milliseconds into the future, it will be appreciated that X may be described in units of time metrics greater or less than milliseconds. The time period X becomes smaller as the processing unit 4 cycles through its operation as explained below and comes closer to the time at which the image will be displayed in a given frame.

In step 760, the processing unit 4 extrapolates the final FOV of the head mounted display device 2 at the time the image is to be displayed on the head mounted display device 2. Depending on the timing of the processing steps between the hub 12 and the processing unit 4, the processing unit 4 may not yet have received the data from the hub when the processing unit 4 first performs step 632. In this case, the processing unit is still able to make these determinations if camera 112 of head mounted display device 2 comprises a depth camera. If not, processing unit 4 may perform step 760 after receiving the information from hub 12.

The step 760 of extrapolating the final FOV is based on the fact that motion tends to be generally smooth and stable over small time periods, such as several frames of data. Thus, by looking at the data at the current time t as well as the data at previous times, the data can be extrapolated into the future to predict the user's final view position at the time the frame image is to be displayed. Using this prediction of the final FOV, the final FOV can be displayed to the user at t + X milliseconds without any latency. The processing unit may cycle through its steps multiple times per frame, updating the extrapolation to narrow the possible solutions as the time within the frame at which the final FOV is to be displayed approaches.

Further details of step 760 are provided in the flowchart of FIG. 14. At step 764, image data relating to the FOV received from the hub 12 and/or the head mounted display device 2 is reviewed. At step 768, a smoothing function may be applied to the examined data that captures patterns in the head position data while ignoring noisy or anomalous data points. The number of time periods examined may be two or more different time periods.

In addition to or instead of steps 764 and 768, processing unit 4 may perform step 770 of using the current FOV indicated by head mounted display device 2 and/or hub 12 as a ground truth value for head mounted display device 2. The processing unit 4 may then apply the data from the IMU 132 for the current time period to determine the final field of view X milliseconds in the future. The IMU 132 may provide kinematic measurements such as the velocity, acceleration, and jerk of the movement of the head mounted display device 2 in six degrees of freedom: translation along three axes and rotation about three axes. Using these measurements for the current time period, a direct extrapolation can be made to determine the net change from the current FOV position to the final field of view X milliseconds into the future. Using the data from steps 764, 768, and 770, the final FOV at time t + X milliseconds can be extrapolated in step 772.
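
As a hedged sketch of such a direct extrapolation, each of the six degrees of freedom may be advanced with a third-order kinematic expansion, p + v·dt + a·dt²/2 + j·dt³/6. Treating the three rotations as independent angles is a small-motion simplification assumed here, and the function names and sample values are hypothetical.

```python
def extrapolate_axis(position, velocity, acceleration, jerk, dt):
    """Third-order kinematic prediction for one degree of freedom."""
    return (position
            + velocity * dt
            + acceleration * dt ** 2 / 2.0
            + jerk * dt ** 3 / 6.0)


def extrapolate_pose(pose, velocity, acceleration, jerk, x_ms):
    """Predict a 6-DOF pose (x, y, z, roll, pitch, yaw) X milliseconds ahead."""
    dt = x_ms / 1000.0  # the kinematic terms are assumed to be per second
    return tuple(
        extrapolate_axis(p, v, a, j, dt)
        for p, v, a, j in zip(pose, velocity, acceleration, jerk)
    )


# Hypothetical ground-truth pose and IMU readings; only x and yaw are moving.
current_pose = (1.0, 1.7, 0.0, 0.0, 0.0, 0.2)
imu_velocity = (2.0, 0.0, 0.0, 0.0, 0.0, 1.0)
imu_acceleration = (4.0, 0.0, 0.0, 0.0, 0.0, 0.0)
imu_jerk = (6.0, 0.0, 0.0, 0.0, 0.0, 0.0)

predicted_pose = extrapolate_pose(current_pose, imu_velocity, imu_acceleration, imu_jerk, 500.0)
```

Because the error of such an expansion grows with dt, repeating the prediction as X shrinks during the frame gives progressively tighter estimates.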

In addition to predicting the final FOV of the head mounted display device 2 by extrapolating to t + X milliseconds in the future, the processing unit 4 may also determine a confidence value for the prediction, referred to herein as the instantaneous prediction error. It may happen that the user moves his or her head so quickly that the processing unit 4 cannot extrapolate the data with an acceptable level of accuracy. When the instantaneous prediction error is above some predetermined threshold level, mitigation techniques may be used rather than relying on the extrapolated prediction of the final view position. One mitigation technique is to reduce or turn off the display of the virtual image. Although not ideal, this situation is likely to be temporary and may be preferable to presenting a mismatch between virtual and real images. Another mitigation technique is to fall back to the last acquired data that had an acceptable instantaneous prediction error. Additional mitigation techniques include blurring the data (which may be an entirely acceptable method for displaying virtual objects during rapid head movements), and mixing one or more of the mitigation techniques described above.
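
The threshold test may be sketched as below. The order in which the mitigation techniques are tried, the returned labels, and the function name are assumptions made for illustration; the description does not rank the techniques.

```python
def choose_display_action(prediction_error, threshold, last_good_prediction=None):
    """Pick how to display virtual content given the instantaneous prediction error."""
    if prediction_error <= threshold:
        return "use_extrapolated_prediction"
    if last_good_prediction is not None:
        # Fall back to the most recent prediction whose error was acceptable.
        return "use_last_good_prediction"
    # No acceptable earlier data: reduce or turn off the virtual image.
    return "dim_or_hide_virtual_image"


normal = choose_display_action(0.5, threshold=1.0)
fast_turn = choose_display_action(3.0, threshold=1.0, last_good_prediction=(0.0, 0.1))
fast_turn_no_history = choose_display_action(3.0, threshold=1.0)
```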

Referring again to the flowchart of FIG. 9, after extrapolating the final view in step 632, the processing unit 4 may then cull the rendering operations in step 634 so that only those virtual objects that may appear within the final FOV of the head mounted display device 2 are rendered. The positions of other virtual objects may still be tracked, but they are not rendered. As explained below with reference to FIGS. 19 and 20, in an alternative embodiment, step 634 may include culling the rendering operations to the possible FOV plus an additional border around the periphery of the FOV. This allows image adjustment at high frame rates without having to re-render the data over the entire FOV. It is also contemplated that in further embodiments, step 634 may be skipped altogether and the entire image rendered.

Processing unit 4 may then perform a rendering setup step 638, in which setup rendering operations are performed using the extrapolated final FOV prediction at time t + X milliseconds. Step 638 performs setup rendering operations on the virtual three-dimensional objects to be rendered. In embodiments where virtual object data is provided to the processing unit 4 from the hub 12, step 638 may be skipped until such time as the virtual object data is provided to the processing unit (e.g., during the first pass through the processing unit steps).

Upon receiving the virtual object data, the processing unit may perform the rendering setup operations in step 638 for virtual objects that may appear within the final FOV at time t + X milliseconds. The setup rendering operation in step 638 includes a common rendering task associated with the virtual object to be displayed in the final FOV. These rendering tasks may include, for example, shadow map generation, lighting, and animation. In an embodiment, the rendering setup step 638 may further include: possible drawing information, such as vertex buffers, textures and states of virtual objects to be displayed in the predicted final FOV, is compiled.

Step 632 determines a prediction of the FOV of head mounted display device 2 when the frame of image data is displayed on the head mounted display device. However, in addition to the FOV, virtual and real objects (such as the user's hands and other users) may also move in the scene. Thus, in addition to extrapolating the final FOV position for each user at time t + X milliseconds, the system also extrapolates the positions for all objects (or moving objects) in the scene, both real and virtual objects, at time t + X milliseconds, at step 640. This information may be helpful in properly realising virtual and real objects and displaying those objects with proper occlusion. Further details of step 640 are shown in the flowchart of FIG. 15.

In step 776, the processing unit 4 may examine position data for the user's hands in x, y, z space at the current time t and at previous times. The hand position data may come from the head mounted display device 2 and possibly from the hub 12. At step 778, the processing unit may similarly examine the position data of other objects in the scene at the current time t and at previous times. In an embodiment, the examined objects may be all objects in the scene, or only those objects identified as moving objects, such as people. In further embodiments, the examined objects may be limited to those calculated to be within the user's final FOV at time t + X milliseconds. The number of time periods examined in steps 776 and 778 may be two or more.

At step 782, a smoothing function may be applied to the examined data at steps 776 and 778 while ignoring noisy or anomalous data points. Using steps 776, 778, and 782, the processing unit may extrapolate the positions of the user's hand and other objects in the scene at time t + X milliseconds into the future.
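Steps 776, 778 and 782 can be sketched, purely as an example, as a robust linear extrapolation. The description does not specify the smoothing function; this sketch substitutes a Theil-Sen style fit (median of pairwise slopes), which tolerates an occasional anomalous sample. The function names and the sample values are hypothetical, and the sketch assumes that every sample has a distinct timestamp.

```python
from itertools import combinations
from statistics import median

def robust_line(ts, vs):
    """Theil-Sen style fit: median pairwise slope, then median intercept."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(ts, vs), 2)
    ]
    slope = median(slopes)
    intercept = median(v - slope * t for t, v in zip(ts, vs))
    return slope, intercept

def extrapolate_position(samples, t_future):
    """samples: list of (t_ms, (x, y, z)). Returns the predicted (x, y, z)."""
    ts = [t for t, _ in samples]
    predicted = []
    for axis in range(3):
        slope, intercept = robust_line(ts, [p[axis] for _, p in samples])
        predicted.append(slope * t_future + intercept)
    return tuple(predicted)

# Hypothetical hand track moving 0.1 units/ms in x, with one anomalous
# reading at t = 12 ms that the median-based fit effectively ignores.
samples = [
    (0.0, (0.0, 1.0, 0.5)),
    (4.0, (0.4, 1.0, 0.5)),
    (8.0, (0.8, 1.0, 0.5)),
    (12.0, (5.0, 1.0, 0.5)),
    (16.0, (1.6, 1.0, 0.5)),
]
print(extrapolate_position(samples, t_future=24.0))
```

Here the current time is t = 16 ms and X = 8 ms, so the prediction is made for t = 24 ms; the predicted x coordinate comes out close to 2.4 despite the outlier.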

In one example, a user may move their hand in front of their eyes. By tracking this movement with data from head mounted display device 2 and/or hub 12, the processing unit may predict the position of the user's hand when the image will be displayed at time t + X milliseconds, and any virtual objects in the user's FOV that are occluded by the user's hand at that time are displayed accordingly. As another example, a virtual object may be tagged to the outline of another user in the scene. By tracking the movement of the tagged user with data from the hub 12 and/or the head mounted display device 2, the processing unit may predict the location of the tagged user when the image will be displayed at time t + X milliseconds, and the associated virtual object may be properly displayed around that user's outline. Other examples are contemplated in which extrapolating FOV data and object position data into the future allows virtual objects to be properly displayed within the user's FOV at each frame without latency.

Referring again to FIG. 9, using the extrapolated positions of the objects at time t + X milliseconds, the processing unit 4 may then determine occlusions and shadows in the user's predicted FOV at step 644. In particular, the scene map has the x, y, and z positions of all objects in the scene, including moving and non-moving objects and virtual objects. Knowing the user's position and his line of sight to objects in the FOV, the processing unit 4 may then determine whether a virtual object partially or completely obscures the user's view of a real-world object. Additionally, the processing unit 4 may determine whether a real-world object partially or completely obscures the user's view of a virtual object. Occlusion is user specific. A virtual object may be occluded or blocked from the view of a first user, but not of a second user. Thus, occlusion determination may be performed in the processing unit 4 of each user. However, it can be appreciated that occlusion determination can additionally or alternatively be performed by the hub 12.
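The user-specific nature of occlusion can be illustrated with a minimal line-of-sight test. This is a simplified sketch, not the method of the described system: it works in two dimensions, treats the real-world obstacle as a circle, and all names and coordinates are assumptions chosen for the example.

```python
import math

def blocks_line_of_sight(eye, target, obstacle_center, obstacle_radius):
    """True if a circular obstacle intersects the segment from eye to target."""
    ex, ey = eye
    tx, ty = target
    cx, cy = obstacle_center
    dx, dy = tx - ex, ty - ey
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return False
    # Closest point on the segment to the obstacle centre, clamped to [0, 1].
    u = ((cx - ex) * dx + (cy - ey) * dy) / seg_len_sq
    u = max(0.0, min(1.0, u))
    px, py = ex + u * dx, ey + u * dy
    return math.hypot(cx - px, cy - py) <= obstacle_radius

# Hypothetical layout: a virtual object behind a chair as seen by the first
# user, but in plain view of a second user standing off to the side.
virtual_object = (0.0, 4.0)
chair_center, chair_radius = (0.0, 2.0), 0.5
first_user, second_user = (0.0, 0.0), (3.0, 4.0)

print(blocks_line_of_sight(first_user, virtual_object, chair_center, chair_radius))
print(blocks_line_of_sight(second_user, virtual_object, chair_center, chair_radius))
```

The same virtual object is occluded for the first user and visible to the second, which is why the determination is made per user.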

Using the predicted final FOV and predicted object position and occlusion, GPU 322 of processing unit 4 may then render an image to be displayed to the user at time t + X milliseconds, at step 646. Part of the rendering operations may have been performed in the render setup step 638 and updated periodically.

Further details of the rendering step 646 are now described with reference to the flow diagrams of FIGS. 16 and 16A. In step 790 of FIG. 16, the processing unit 4 accesses a model of the environment. At step 792, the processing unit 4 determines the point of view of the user relative to the environmental model. That is, the system determines which portion of the environment or space the user is viewing. In one embodiment, step 792 is a collaborative effort using hub computing device 12, processing unit 4, and head mounted display device 2, as described above.

In one embodiment, the processing unit 4 will attempt to add multiple virtual objects to the scene. In other embodiments, the processing unit 4 may attempt to insert only a single virtual object into the scene. For each virtual object, the system has a target location at which to insert the virtual object. In one embodiment, the target may be a real world object, such that the virtual object will be tagged to the real object and augment the view of the real world. In other embodiments, the target of a virtual object may be defined relative to a real world object.

In step 794, the system renders the previously created three-dimensional environment model from the viewpoint of the user of head mounted display device 2 into a z-buffer without rendering any color information into the corresponding color buffer. This effectively leaves the rendered image of the environment completely black, but does store z (depth) data for objects in the environment. Step 794 results in storing depth data for each pixel (or for a subset of pixels). At step 798, virtual content (e.g., a virtual image corresponding to a virtual object) is rendered into the same z-buffer, and color information for the virtual content is written into the corresponding color buffer. This effectively allows the virtual image to be drawn on the microdisplay 120 of the head mounted display device while taking into account real-world objects or other virtual objects occluding all or part of the virtual object.
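The two passes of steps 794 and 798 can be sketched as follows. This is a minimal software illustration rather than the GPU implementation; the function name, the dictionary of real-world depths, and the fragment tuples are assumptions made for the example.

```python
INF = float("inf")

def render_with_occlusion(width, height, real_depth, virtual_fragments):
    """Depth-only pass for the real environment, then depth-tested virtual pass.

    real_depth: dict of (x, y) -> depth of the nearest real-world surface.
    virtual_fragments: iterable of (x, y, depth, rgb) for virtual content.
    """
    # Pass 1 (step 794): write depth only; the colour buffer stays empty.
    zbuf = [[real_depth.get((x, y), INF) for x in range(width)]
            for y in range(height)]
    color = [[None] * width for _ in range(height)]
    # Pass 2 (step 798): virtual content is drawn only where it is nearer
    # than whatever is already recorded in the z-buffer.
    for x, y, z, rgb in virtual_fragments:
        if z < zbuf[y][x]:
            zbuf[y][x] = z
            color[y][x] = rgb
    return zbuf, color

# Hypothetical 4x1 display: a chair at depth 2.0 covers columns 2 and 3,
# and a virtual object at depth 3.0 covers columns 1 and 2.
real_depth = {(2, 0): 2.0, (3, 0): 2.0}
fragments = [(1, 0, 3.0, (255, 0, 0)), (2, 0, 3.0, (255, 0, 0))]
zbuf, color = render_with_occlusion(4, 1, real_depth, fragments)
print(color[0])
```

The virtual object is drawn in column 1 but not in column 2, where the nearer chair occludes it; the chair itself contributes no color because the user sees it directly through the display.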

In step 800, virtual objects rendered onto or tagged to moving objects are blurred just enough to give the appearance of motion. At step 802, the system identifies the pixels of microdisplay 120 that display the virtual image. At step 806, alpha values are determined for pixels of microdisplay 120. In conventional chroma key systems, alpha values are used on a pixel-by-pixel basis to identify how opaque an image is. In some applications, the alpha value may be binary (e.g., on and off). In other applications, the alpha value may be a number with a range. In one example, each pixel identified in step 802 will have a first alpha value and all other pixels will have a second alpha value.

At step 810, pixels of the opacity filter are determined based on the alpha values. In one example, the opacity filter has the same resolution as microdisplay 120, and thus the opacity filter can be controlled using alpha values. In another embodiment, the opacity filter has a different resolution than microdisplay 120, and thus data for darkening or not darkening the opacity filter will be derived from the alpha values using any of a variety of mathematical algorithms for converting between resolutions. Other means for deriving control data for the opacity filter based on alpha values (or other data) may also be used.
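As one possible illustration of steps 802 through 810, the sketch below derives binary alpha values from a color buffer and converts them to a lower-resolution opacity filter by block averaging. Block averaging is only one of the many resolution-conversion algorithms the description allows for, and the function names and the convention that `None` marks an undrawn pixel are assumptions of this sketch.

```python
def alpha_mask(color):
    """Binary alpha: 1.0 where a virtual pixel was drawn, 0.0 elsewhere."""
    return [[1.0 if px is not None else 0.0 for px in row] for row in color]

def downsample_alpha(alpha, factor):
    """Average alpha over factor x factor blocks for a coarser opacity filter."""
    height, width = len(alpha), len(alpha[0])
    out = []
    for by in range(0, height, factor):
        row = []
        for bx in range(0, width, factor):
            block = [
                alpha[y][x]
                for y in range(by, min(by + factor, height))
                for x in range(bx, min(bx + factor, width))
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Hypothetical 4x2 colour buffer in which three pixels carry virtual content.
red = (255, 0, 0)
color = [
    [red, red, None, None],
    [red, None, None, None],
]
alpha = alpha_mask(color)
print(downsample_alpha(alpha, factor=2))
```

Each opacity filter cell then carries the fraction of its microdisplay pixels that show virtual content, which could be thresholded or used directly as a darkening level.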

At step 812, the images in the z-buffer and color buffer, as well as the alpha values and control data for the opacity filter, are adjusted to account for light sources (virtual or real) and shadows (virtual or real). More details of step 812 are provided below with respect to fig. 16A. The process of fig. 16 allows for the virtual image to be automatically displayed on a stationary or moving object (or relative to a stationary or moving object) on a display that allows for actual direct viewing of at least a portion of the space through the display.

FIG. 16A is a flowchart describing one embodiment of a process for accounting for light sources and shadows, which is an exemplary implementation of step 812 of FIG. 16. At step 820, the processing unit 4 identifies one or more light sources that need to be considered. For example, when rendering a virtual image, real light sources may need to be considered. The effect of the virtual light source may also be taken into account in the head mounted display device 2 if the system adds the virtual light source to the user's view. In step 822, portions of the model (including virtual objects) that are illuminated by the light sources are identified. At step 824, the image depicting the illumination is added to the color buffer.

At step 828, processing unit 4 identifies one or more shadow regions that need to be added by head mounted display device 2. For example, if a virtual object is added to the area of the shadow, the shadow needs to be accounted for when drawing the virtual object by adjusting the color buffer at step 830. If a virtual shadow is to be added where no virtual object is present, the pixels of opacity filter 114 corresponding to the location of the virtual shadow are darkened at step 834.

In step 650, the processing unit checks whether it is time to send the rendered image to head mounted display device 2, or whether there is still time to use more recent position feedback data from hub 12 and/or head mounted display device 2 to further refine the extrapolated prediction. In a system using a 60 Hz frame refresh rate, a single frame is approximately 16 milliseconds. In an embodiment, a frame may be sent to head mounted display device 2 for display halfway through the frame period. Thus, when X = 8 milliseconds (or less), the processing unit will send the rendered image to head mounted display device 2 at step 650.

Specifically, a composite image based on a z-buffer and a color buffer (which are described above with reference to FIGS. 16 and 16A) is sent to microdisplay 120. That is, the virtual image to be displayed at the appropriate pixels is sent to microdisplay 120, taking into account perspective and occlusion. At this time, control data of the opacity filter is also transmitted from processing unit 4 to head mounted display device 2 to control opacity filter 114. The head mounted display device will then display the image to the user at step 658. If the processing unit has correctly predicted the FOV and object positions, then any virtual objects that are expected to appear where the user is looking are displayed at their proper positions and with proper occlusion at the end of the frame.

On the other hand, when it is not yet time at step 650 to send a frame of image data to be displayed, the processing unit may loop back for more recent data to further refine the prediction of the final FOV and the prediction of the final positions of objects in the FOV. In particular, if there is still time at step 650, the processing unit 4 may return to step 608 to obtain updated sensor data from the hub 12 and may return to step 656 to obtain updated sensor data from the head mounted display device 2. On each successive pass through the loop of steps 632 through 650, the extrapolation reaches a shorter time period into the future. As the time period over which the data is extrapolated becomes smaller (X decreases), the extrapolation of the final FOV and object positions at t + X milliseconds becomes more predictable and accurate.
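The timing of the refinement loop can be sketched numerically. At 60 Hz a frame lasts 1000/60, or about 16.7 milliseconds, so half a frame is about 8.3 milliseconds. The head motion model below (a yaw angle that grows quadratically with time), the fixed cost per pass, and all names are assumptions chosen only to show that the prediction error shrinks as X decreases.

```python
FRAME_MS = 1000.0 / 60.0     # one frame at 60 Hz, about 16.7 ms
SEND_AT_MS = FRAME_MS / 2.0  # send once about half a frame remains

def yaw_at(t_ms):
    """Assumed head yaw in degrees for an accelerating head turn."""
    return 0.01 * t_ms ** 2

def yaw_rate_at(t_ms):
    """Angular rate of the assumed motion, in degrees per millisecond."""
    return 0.02 * t_ms

def refine_until_send(pass_cost_ms):
    """Re-extrapolate on each pass until X falls to the send threshold.

    Returns a list of (x_ms, error_deg), one entry per pass of the loop.
    """
    if pass_cost_ms <= 0:
        raise ValueError("pass_cost_ms must be positive")
    t = 0.0
    passes = []
    while True:
        x_ms = FRAME_MS - t  # time remaining until the frame is displayed
        predicted = yaw_at(t) + yaw_rate_at(t) * x_ms  # linear extrapolation
        error = abs(yaw_at(FRAME_MS) - predicted)
        passes.append((x_ms, error))
        if x_ms <= SEND_AT_MS:
            return passes    # time to send the rendered image
        t += pass_cost_ms    # otherwise fetch newer data and refine again

for x_ms, error in refine_until_send(pass_cost_ms=2.0):
    print(f"X = {x_ms:5.2f} ms  error = {error:.3f} deg")
```

With a pass cost of 2 milliseconds the loop runs six times before X drops below half a frame, and under this motion model the error is proportional to the square of X, so each pass is more accurate than the one before.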

The processing steps 630 to 652 are described above by way of example only. It is understood that one or more of these steps may be omitted in further embodiments, the steps may be performed in a different order, or additional steps may be added.

Further, the flow chart of processing unit steps in FIG. 9 shows all data from the hub 12 and the head mounted display device 2 being provided to the processing unit 4 in a single step 632 loop. However, it can be appreciated that processing unit 4 may receive data updates from different sensors of hub 12 and head mounted display device 2 asynchronously at different times. Head mounted display device 2 provides image data from camera 112 and inertial data from IMU 132. The sampling of data from these sensors may be done at different rates and may be sent to the processing unit 4 at different times. Similarly, processed data from the hub 12 may be sent to the processing unit 4 at a different time and with a different periodicity than data from both the camera 112 and the IMU 132. In general, processing unit 4 may receive updated data from hub 12 and head mounted display device 2 multiple times asynchronously during a frame. As the processing unit cycles through its steps, it uses the most recent data it has received in extrapolating the final predictions of FOV and object positions.

Fig. 17 is an illustration of FOV 840 seen through head mounted display device 2 of a user (not shown). The FOV 840 includes the actual direct view of real world objects including another user 18 and a chair 842. The FOV 840 also includes the display of virtual images 844 and 846. The virtual image 846 is shown climbing the chair 842. Since the chair is in front of the lower half of the virtual image 846, the chair obscures the lower half of the virtual image 846 as described above. User 18 is also shown wearing head mounted display device 2. Thus, the user 18 will also see the same virtual images 844 and 846, but these virtual images are adjusted for parallax and occlusion from their line of sight. For example, as the user 18 gets closer to the virtual images 844, 846, they may appear larger in their FOV. The virtual images may also have different occlusions.

Fig. 18 is another illustration of FOV 840 seen through head mounted display device 2 of a user (not shown). The FOV 840 includes an actual direct view of another user 18, and a virtual image tagged to the user 18 that partially outlines the user 18. As the user 18 moves (left, right, turns, bends, etc.), the virtual image will continue to partially outline the user 18.

Fig. 19 and 20 illustrate another embodiment of the present system. In the above embodiment, for example, in step 632 of fig. 9, the processing unit 4 of the given user extrapolates the position of the final FOV at display time t + X milliseconds. Any virtual objects in the extrapolated FOV are then rendered in step 646. In the embodiment of FIG. 19, instead of merely extrapolating the final predicted FOV, the processing unit (or hub 12) adds a boundary 854 around the FOV 840 to provide an expanded FOV 858. The size of the boundary 854 may vary from embodiment to embodiment, but may be large enough to encompass the possible new FOV resulting from a user turning their head in any direction during a next time period of predetermined length (e.g., half a frame).

In the embodiment of FIG. 19, the position of any virtual objects within the FOV is extrapolated as in step 640 above. However, in this embodiment, all virtual objects within the expanded FOV 858 are considered in the extrapolation. Thus, in fig. 19, the processing unit 4 will extrapolate the position of virtual object 862 in the expanded FOV 858 in addition to the virtual objects 860 in the predicted FOV 840.

In the next time period, if the user turns his head, for example, to the left, the FOV 840 will shift to the left (thereby causing the positions of all virtual objects to move to the right relative to the new FOV 840). This scenario is shown in fig. 20. In this embodiment, instead of having to re-render all objects in the new FOV 840, all object pixels in the last FOV 840 shown in FIG. 19 may be shifted by the distance corresponding to the determined change in position of the new FOV 840. Thus, the virtual objects 860 may be displayed in their proper positions offset to the right without having to re-render them. The only rendering required is for any region of the expanded FOV 858 that is newly included in the new FOV 840. Thus, the processing unit 4 will render the virtual image 862. Its location will be known because it is included in the expanded FOV 858 from the previous time period.
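The shift-and-fill idea of FIGS. 19 and 20 can be sketched on a single row of pixels. The function `shift_and_fill`, the character "pixels", and the lookup of known object positions are hypothetical and serve only to show that previously rendered pixels are reused while only the newly exposed columns are rendered.

```python
def shift_and_fill(prev_row, shift_px, render_column):
    """Reuse the pixels of the last FOV and render only newly exposed columns.

    prev_row: pixels of the last FOV.
    shift_px: positive when content moves right (the head turned left).
    render_column: callable that renders column i of the new FOV.
    Returns (new_row, list of columns that had to be rendered).
    """
    width = len(prev_row)
    new_row = []
    rendered = []
    for i in range(width):
        src = i - shift_px
        if 0 <= src < width:
            new_row.append(prev_row[src])     # reused, not re-rendered
        else:
            new_row.append(render_column(i))  # newly exposed region
            rendered.append(i)
    return new_row, rendered

# Hypothetical last FOV with object "A" in column 2. Object "B" sat one column
# to the left of the last FOV, inside the boundary, so its position is known.
prev_row = list("..A...")
known_in_boundary = {-1: "B"}  # keyed by column in last-FOV coordinates
shift = 2                      # head turned left; content moves 2 px right

def render(i):
    return known_in_boundary.get(i - shift, ".")

new_row, rendered = shift_and_fill(prev_row, shift, render)
print("".join(new_row), rendered)  # prints ".B..A. [0, 1]"
```

Object "A" appears two columns to the right without being re-rendered, and only columns 0 and 1 are rendered, which brings object "B" into view at its already extrapolated position.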

Using the embodiments described in fig. 19 and 20, an updated display of the user's FOV can be generated quickly by rendering only a portion of the image and reusing the rest of the image from the last time period. Thus, updated image data may be sent to the head mounted display device 2 for display partway through a frame, e.g., halfway through the frame, effectively doubling the frame refresh rate. As described above, the image data may still be resampled and re-rendered once per frame.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. The scope of the invention is defined by the appended claims.

Claims

1. A system (10) for presenting a mixed reality experience to one or more users (18), the system comprising:

one or more display devices (2) for the one or more users, each display device comprising a first set of sensors (132) for sensing data relating to the position of the display device, and a display unit (120) for displaying a virtual image to the user of the display device;

one or more processing units (4), each associated with a display device (2) of the one or more display devices and each receiving sensor data from the associated display device (2) at a current time within a video frame, the processing units extrapolating the position of the virtual three-dimensional object at a future time within the same video frame to display the virtual three-dimensional object at the future time as it is appropriately occluded by a real world object when displayed to the user by the display device; and

a hub computing system (12), the hub computing system (12) operably coupled to each of the one or more processing units (4), the hub computing system (12) including a second set of sensors (20), the hub computing system and the one or more processing units cooperatively determining a three-dimensional map of an environment in which the system (10) is used based on data from the first and second sets of sensors (132, 20).

2. The system of claim 1, wherein the hub computing system and the one or more processing units cooperatively determine a three-dimensional mapping of the environment in real-time.

3. The system of claim 1, wherein the hub computing system provides computer models of one or more users in the environment and the one or more processing units provide locations of display devices.

4. The system of claim 3, wherein the model of the user and the location of the display device enable a field of view of the display device associated with a first processing unit of the one or more processing units to be determined by at least one of the hub computing system and the first processing unit.

5. The system of claim 1, wherein one of the hub computing system and a first processing unit of the one or more processing units determines a position of a virtual three-dimensional object in a field of view of a display device associated with the first processing unit.

6. The system of claim 5, wherein one of the hub computing system and the first processing unit further determines: from a first perspective of a display device associated with the first processing unit, whether the virtual three-dimensional object is occluded by a real-world object in a three-dimensional mapping of the environment, or whether the virtual three-dimensional object occludes a real-world object in a mapping of the environment.

7. A system (10) for presenting a mixed reality experience to one or more users (18), the system comprising:

a first head mounted display device (2) comprising: a camera (112), the camera (112) for obtaining image data of an environment for which the first head mounted display device (2) is used; an inertial sensor (132), the inertial sensor (132) for providing inertial measurements of the first head mounted display device (2); and a display device (120), the display device (120) being for displaying a virtual image to a user (18) of the first head mounted display device (2); and

a first processing unit (4) associated with the first head mounted display device (2), the first processing unit (4) for determining a three dimensional map of an environment for which the first head mounted display device (2) is used, and a field of view for which the first head mounted display device (2) views the three dimensional map, wherein the processing unit uses state information from the past and/or present to extrapolate an estimated state of the first head mounted display unit at a future time in the same video frame as the state information for presenting the rendered frame of image data to a user of the first head mounted display device.

8. The system of claim 7, wherein the status information relates to the position of the user, real world objects and virtual objects in the three-dimensional map of the environment, and to the field of view of the user on the three-dimensional map of the environment.

9. A method for presenting a mixed reality experience to one or more users, the method comprising:

(a) determining status information for at least a first time period from a first video frame and a second time period from a second video frame subsequent to the first video frame (steps 610, 612, 614, 618), the status information relating to a user's (18) view of an environment comprising a mixed reality of one or more real-world objects (842) and one or more virtual objects (844, 846);

(b) extrapolating (step 640) status information relating to the user's (18) view of the environment for a third time period in the second video frame, the third time period being a time at a future time instant after the at least two time periods at which the one or more virtual objects (844, 846) of mixed reality are to be displayed to the user (18); and

(c) displaying (step 658) at least one of the one or more virtual objects to the user at said third time period, based on the status information relating to the user's view of the environment extrapolated in said step (b).