Virtual Reality With Webcam

A couple of summers ago I saw a cool hack where a sensor placed on the TV detected where an emitter mounted in a pair of glasses was, and based on that information adjusted the viewpoint of the 3D scene displayed on the TV.

Cool, right? And all you needed were these two simple devices and the right software.

Well, why use two if you can use only one? 😀

There is already a sensor mounted on your laptop screen, called a webcam, and if you’ve ever tagged someone on Facebook, you know face recognition is a real thing. That means your computer can literally “see” where you are if you teach it how to look.

It took me a couple of days to bang something together and prove this was possible. It then collected dust on my hard drive for a year or two, until I woke up a couple of days ago with an idea of how to tweak it and make it presentable.

How it works

I used the OpenCV library to detect faces and their locations in the image the webcam sees. Then I picked the closest one, i.e. the biggest one, and did all the calculations with it. OpenGL is responsible for rendering the 3D scene, and Qt (pronounced “cute”) is the putty that binds them together, handling all the events and window management.
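
If you’ve never seen OpenCV’s face detection in action, the loop looks roughly like this. It’s a minimal Python sketch using the stock Haar cascade (my actual code is C++ with Qt and OpenGL, so everything here is illustrative), but the idea is the same: detect faces, then keep the biggest rectangle.

```python
# Minimal sketch: detect faces with OpenCV's bundled Haar cascade and
# keep the biggest one. Illustrative only -- the real project is C++/Qt/OpenGL.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)  # default webcam

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    if len(faces) > 0:
        # The biggest face is the closest one (and the least likely false positive).
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

    cv2.imshow("preview", frame)
    if cv2.waitKey(1) == 27:  # Esc quits
        break

cap.release()
cv2.destroyAllWindows()
```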

Several things influence the right viewpoint for the 3D scene:

  1. Size of the scene you’re viewing
  2. Angle you’re viewing the scene from
  3. Distance from the scene

Those things depend on:

  1. Size of the screen
  2. Distance from the screen
  3. Distance from the camera
  4. Your camera lens width
  5. Angle you’re looking at the screen from

For a real-life application, things can be simplified. With the webcam attached to your display or laptop, the distance to the screen and the distance to the camera are very similar and point in the same direction.

For the demonstration, I simplified things further. The distance from the screen, for the lack of stereo vision, can only be estimated.
Since I’m doing this on a laptop screen sitting on a table, I decided that a length of 1 in OpenGL would be about 1 dm (decimetre), roughly the size of a coffee mug. The distance from the camera and from the screen is then roughly 6 dm.

I had no control over the webcam lens width, and I couldn’t find any info about it online, but this was enough to run the prototype and guesstimate the other parameters.

The position of your eyes in the image defines the x and y coordinates of the 3D scene’s viewpoint, and it depends on the lens width. A wider lens covers more (left, right, up and down) than a narrower one, meaning your face won’t appear in the same place in images from two different cameras, even if they have the same resolution. The parameters should be adjusted accordingly.
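
To make that mapping concrete, here is a rough sketch of how the detected face rectangle could be turned into viewpoint coordinates. The field-of-view constant is a guesstimate (remember, I couldn’t find the real lens width), and the scale of 1 unit ≈ 1 dm with a 6 dm viewing distance comes from the assumptions above.

```python
# Sketch: map a detected face rectangle to a viewpoint in scene coordinates.
# CAM_H_FOV_DEG is a guesstimate of the webcam's horizontal field of view.
import math

CAM_H_FOV_DEG = 60.0   # assumed lens width (horizontal field of view)
VIEW_DISTANCE = 6.0    # distance from the screen, in OpenGL units (1 unit ~ 1 dm)

def face_to_viewpoint(face, frame_w, frame_h):
    x, y, w, h = face
    # Face centre, normalised to [-1, 1] with (0, 0) at the image centre.
    cx = (x + w / 2.0) / frame_w * 2.0 - 1.0
    cy = (y + h / 2.0) / frame_h * 2.0 - 1.0

    # A wider lens maps the same pixel offset to a larger sideways distance.
    half_width = VIEW_DISTANCE * math.tan(math.radians(CAM_H_FOV_DEG / 2.0))
    aspect = frame_h / frame_w

    vx = -cx * half_width           # flip x so the view follows you like a mirror
    vy = -cy * half_width * aspect  # image y grows downwards, scene y grows upwards
    vz = VIEW_DISTANCE              # fixed for now, see below
    return vx, vy, vz
```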

The z axis represents the distance from the scene. You know how, when you get closer to a window, you can see more of the outside?
Now imagine your screen being a window into a virtual world; it would have to do the same. That’s possible to adjust with the OpenGL camera angle. I tried to estimate the distance from the screen based on the size of the face in the image. Unfortunately, with my webcam, that proved too volatile for a smooth experience, so I just used a fixed distance of 6 dm. This could be improved with stereo vision in the future.
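
One common way to get that window effect is an asymmetric (off-axis) viewing frustum: keep the camera pointed at the screen plane and skew the frustum towards the viewer’s position. A sketch of the math, assuming the screen is a rectangle centred at the scene origin and everything is measured in the same units (1 unit ≈ 1 dm):

```python
# Sketch: glFrustum-style bounds for a viewer at (eye_x, eye_y, eye_z) looking
# at a screen of size screen_w x screen_h centred at the origin.
def off_axis_frustum(eye_x, eye_y, eye_z, screen_w, screen_h,
                     near=0.1, far=100.0):
    scale = near / eye_z  # project the screen edges onto the near plane
    left   = (-screen_w / 2.0 - eye_x) * scale
    right  = ( screen_w / 2.0 - eye_x) * scale
    bottom = (-screen_h / 2.0 - eye_y) * scale
    top    = ( screen_h / 2.0 - eye_y) * scale
    return left, right, bottom, top, near, far

# Example: a 3 dm x 2 dm laptop screen viewed from 6 dm away, slightly to the left.
print(off_axis_frustum(-1.0, 0.0, 6.0, 3.0, 2.0))
```

Move closer (a smaller eye_z) and the frustum widens, so more of the virtual world becomes visible, just like with a real window.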

The first scene in the video shows the laptop screen tilted to a 45° angle and viewed from slightly above. I had to adjust the viewpoint manually since the program is not aware of the angle. If a gyroscope-like sensor were embedded in the laptop, like in most tablets and phones, this could be done automatically.

The second scene is from the angle you would normally look at your screen from.

The third scene is a screen capture of the program output, with the camera feed overlaid in the bottom left. The blue rectangle marks the location where a face was detected. If you watch carefully, you can see a false positive briefly detected on my shirt. That’s why it pays to always pick the biggest detected face in the scene. Also, you can play-fight with the person sitting next to you over whom the program will pick 😀

Possible improvements and future work

Let’s start by saying that face detection can be CPU intensive, especially when you do it 60 times a second. We could offload that work to the GPU, and OpenCV does in fact support that kind of operation. Unfortunately for me, the driver for my GPU doesn’t. The proprietary driver does, but it doesn’t support extending my desktop to a second display. Guess which one I prefer 🙂
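
For the curious: with newer OpenCV builds, the easiest path is the so-called transparent API, where wrapping a frame in a UMat lets OpenCV run supported operations through OpenCL. A hedged sketch; whether anything actually lands on the GPU depends entirely on your driver, which was exactly my problem:

```python
# Sketch: OpenCV's transparent API -- wrap frames in UMat and let OpenCV
# dispatch to OpenCL if the driver supports it. Falls back to CPU otherwise.
import cv2

cv2.ocl.setUseOpenCL(True)
print("OpenCL available:", cv2.ocl.haveOpenCL())

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)

ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(cv2.UMat(frame), cv2.COLOR_BGR2GRAY)  # UMat: may run on GPU
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print("faces:", faces)

cap.release()
```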

Another obvious improvement is putting more interesting graphics inside that virtual-reality window. I know boxes aren’t very interesting to look at. What would you like to see instead?

Since all the code is cross-platform, it’s fairly feasible to port it to your phone or tablet. Putting that selfie camera to good use, what would you do with this technology on your phone?

Technology limitations and possible improvements

I already mentioned the lack of depth adjustment. Because we use a single camera, it’s hard to know the distance of the viewer. Using a higher-resolution webcam might make the guesstimate smoother. By adding a second camera and calibrating the pair, it’s possible to extract that information from the scene and adjust both the viewpoint and the virtual camera’s lens width.
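
Roughly, with two calibrated cameras the distance comes from the disparity between the two images instead of from the face size. A very rough sketch of the idea (the calibration step you’d actually need is left out, and the focal length and baseline below are made-up numbers):

```python
# Sketch: depth from a stereo pair. Real use needs calibration
# (cv2.stereoCalibrate / cv2.stereoRectify); the constants here are invented.
import cv2
import numpy as np

FOCAL_PX = 700.0   # focal length in pixels (made up)
BASELINE = 1.0     # distance between the two cameras, in dm (made up)

left_cam, right_cam = cv2.VideoCapture(0), cv2.VideoCapture(1)
ok_l, left = left_cam.read()
ok_r, right = right_cam.read()
left_cam.release()
right_cam.release()

if ok_l and ok_r:
    gray_l = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)

    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(gray_l, gray_r).astype(np.float32) / 16.0

    # depth = focal_length * baseline / disparity; sample it at the detected
    # face centre to get the viewer's distance from the screen.
    valid = disparity > 0
    depth = np.zeros_like(disparity)
    depth[valid] = FOCAL_PX * BASELINE / disparity[valid]
```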

Speaking of stereo images, we humans naturally have two eyes, see in stereo, and sense depth. It’s hard to fool your mind into thinking a 2D image is 3D unless you cover one eye 😀
That said, 3D video is here and you’ve probably experienced it. It’s possible to produce 3D video for 3D TVs by rendering the scene twice with a slight offset of the viewpoint. To go even further, instead of detecting just the face, it’s possible to detect the eyes and use their individual locations in the stereo image.
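
OpenCV also ships a Haar cascade for eyes, so finding each eye inside the already-detected face rectangle is not much extra work. A sketch, reusing the face detection from earlier; each eye centre could then drive its own render pass, one slightly offset image per eye:

```python
# Sketch: find eyes inside the detected face rectangle with OpenCV's eye cascade.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) > 0:
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
        # Search for eyes only inside the face rectangle -- cheaper and more robust.
        eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w], 1.1, 5)
        for ex, ey, ew, eh in eyes:
            centre = (x + ex + ew // 2, y + ey + eh // 2)
            print("eye centre:", centre)  # one viewpoint per eye for the stereo pair
```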

That would bring your gaming experience to the next level, plus ducking when things are flying your way would actually make sense. 😀

What would you do with it? How could this technology help you?
