Detecting Spherical Media Files

In many ways, VR is still a bit of a “wild west” as far as technology goes.  There are very few true standards, and those that do exist haven’t been widely implemented.

Recently, we’ve been looking at how to automatically identify spherical (equirectangular) photos and videos so they can be displayed properly in our Elevator digital asset management tool.  “Why is this such a problem in the first place?” you may be wondering.  Well, spherical photos and videos are packaged in a way that makes them resemble pretty much any other photo or video.  At this point, we’re working primarily with images from our Ricoh Theta spherical cameras, which save photos as .JPG files and videos as .MP4 files.  Our computers recognize these file types as photo and video files – which they are – but don’t have an automatic way of detecting the “special sauce”: the fact that they’re spherical!  You can open these files in your standard photo/video viewer, but they look a little odd and distorted:

[Image: R0010012, an unprocessed equirectangular photo straight from the camera]

So, we clearly need some way of detecting if our photos and videos were shot with a spherical camera.  That way, when we view them, we can automatically plop them into a spherical viewer, which can project our photos and videos into a spherical shape so they can be experienced as they were intended to be experienced!  As it turns out, this gets a bit messy…

Let’s start by looking at spherical photos.  We hypothesized that there must be metadata within the files to identify them as spherical.  The best way to investigate a file in a case like this is with ExifTool, which extracts metadata from nearly every media format.

While there’s lots of metadata in an image file (camera settings, date and time information, etc.), our Ricoh Theta files had some very promising additional items:

Projection Type : equirectangular
Use Panorama Viewer : True
Pose Heading Degrees : 0.0
Pose Pitch Degrees : 5.8
Pose Roll Degrees : 2.8

Additional googling reveals that the UsePanoramaViewer attribute has its origins in Google Street View’s panoramic metadata extensions.  This is somewhere in the “quasi-standard” category – there’s no standards body that has agreed on this as the way to flag panoramic images, but manufacturers have adopted it.
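To make that concrete, here’s a minimal Python sketch (not our production Elevator code) that shells out to ExifTool and checks for the flag; the ProjectionType and UsePanoramaViewer tag names are the ones ExifTool reported for our Theta files above:

import subprocess

def is_spherical_photo(path):
    # Ask ExifTool for the two panorama tags our Theta files carry.
    output = subprocess.check_output(
        ['exiftool', '-s', '-ProjectionType', '-UsePanoramaViewer', path])
    return b'equirectangular' in output.lower()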

Video, on the other hand, is a little harder to deal with at the moment.  Fortunately, it promises to become easier in the future.  There’s a “request for comments” (RFC) proposing a standard for spherical metadata.  The RFC is specifically focused on storing spherical metadata in web-delivery files (WebM and MP4), using a special identifier (a “UUID”) and some XML.

Right now, reading that metadata is pretty problematic.  None of the common video tools can display it.  However, open source projects are moving quickly to adopt it, and Google is already leveraging this metadata with files uploaded to YouTube.  In the case of the Ricoh cameras we use, their desktop video conversion tool has recently been updated to incorporate this type of metadata as well.
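In the meantime, a quick-and-dirty check is possible.  Here’s a rough Python heuristic (a sketch, not a proper MP4/WebM parser) that simply scans a file for the “SphericalVideo” XML marker the proposed spec embeds:

def looks_spherical(path, chunk_size=1024 * 1024):
    # Scan the file in chunks for the XML root element named in the RFC.
    # A real implementation would walk the container's box structure and
    # verify the UUID instead of string-matching.
    marker = b'SphericalVideo'
    with open(path, 'rb') as f:
        tail = b''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if marker in tail + chunk:
                return True
            tail = chunk[-len(marker):]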

One of the most exciting parts of working in VR right now is that the landscape is changing on a week-by-week basis.  Problems are being solved quickly, and new problems are being discovered just as quickly.

Sharing code in higher ed

The “sharing first” sentiment is gaining momentum across academia…

The Economist recently ran an article about commercial applications of code from higher education.  While LATIS Labs isn’t exactly planning to churn out million-dollar software to help monetize eyeballs or synergize business practices, we do want to be sharing software.

We believe that sharing software is a part of our responsibility as developers at a public institution.  Of course we’ll be releasing code – we’re in higher education.  This “sharing first” sentiment is also gaining momentum in other parts of academia, from open textbooks, to open access journals, to open data (see some related links below).

We also believe that releasing code makes for better code.  At a big institution like the University of Minnesota, it’s easy to cut corners on software development by relying on private access to databases or by making assumptions about your users.  Writing with an eye towards open source forces you to design software the right way, it forces you to document your code, and it forces you to write software you’re proud of.

As we work on things in LATIS Labs, you’ll find them at github.com/umn-latis.   Clone them, fork them, file issues on them.  We’ll keep sharing.

Resources & references

I’ve just seen a face…

For many of us, facial recognition in digital images may seem like one of Facebook and Google’s recent parlor tricks to make it easier for you to “tag” your friends in vacation photos.  But if you do any work in privacy law, ethics, etc., the spread of facial recognition technology may open up some interesting policy implications and research opportunities. Here, we dig a little deeper into how facial recognition technologies work “under the hood”.

Facebook, your iPhone, Google…they all seem to know where the faces are.  You snap a picture and upload it to Facebook?  It instantly recognizes a face and tells you to tag your friends.  And when it does that, it’s actually just being polite; it already has a pretty good sense of which friends it’s recognized–it’s just looking to you to confirm.

If you’re like me, you’ve probably had some combination of reactions, ranging from “Awesome!” to “Well, that’s kinda creepy…” to “How the heck does it do that?”  And if you do any work in privacy law, ethics, etc., the spread of facial recognition technology may be more than a mere parlor trick to you.  It has major policy implications, and will likely open up a lot of interesting research opportunities.

But how the heck does it work?  Well, we dug into this a bit recently to find out…

Fortunately, there happens to be a very nice open source library called OpenCV that we can use to explore some of the various facial recognition algorithms that are floating around out there.  OpenCV can get pretty labyrinthine pretty fast, so you may also want to dig into a few wonderful tutorials (see “Resources” below) that are emerging on the subject.

We explored an algorithm called Eigenfaces, along with a nifty little method called Haar Cascades, to get a sense of how algorithms can be trained to recognize faces in a digital image and match them to unique individuals.  These are just a few algorithms among many, but the exploration should give you a nice idea of the kinds of problems that need to be tackled in order to effectively recognize a face.

But first, let’s jump to the punchline! When it’s all said and done, here’s what it does:

And here’s how it does it, in both layman’s terms and in code snippets:

First, create two sets of images.  The first will be a set of “negative” images of human faces. These are images of generic people, but not those that we want our algorithm to be able to recognize. (Note: Thanks to Samaria & Harter–see “Resources” below–for making these images available to researchers and geeks to use when experimenting with facial recognition!)

The second is a set of “positive” images of the faces that we want our algorithm to be able to recognize. We’ll need about 10-15 snapshots of each person we want to be able to recognize, and we can easily capture these using our computer’s webcam.  And for simplicity’s sake, we’ll make sure all of these images are the same size and are nicely zoomed into the center of each face, so we don’t have to take size variation or image quality into account for now.
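To illustrate the capture step, here’s a small OpenCV sketch (the output path is just an example, and in practice we also crop to the face before saving; the Samaria & Harter images are 92x112-pixel grayscale .pgm files, so we match that size):

import cv2

# Grab one frame from the default webcam, convert it to grayscale,
# resize it, and save it as a positive training image.
camera = cv2.VideoCapture(0)
ok, frame = camera.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    face = cv2.resize(gray, (92, 112))
    cv2.imwrite('positive/alice/01.pgm', face)
camera.release()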

Then, feed all of these images into the OpenCV Eigenfaces algorithm:


import os

# Read all positive images, one sub-folder per user.  (walk_files,
# prepare_image, and the config module are project helpers defined
# elsewhere.)
faces = []
labels = []
pos_count = 0
neg_count = 0

USER_LIST = os.listdir(config.POSITIVE_DIR)
for user in USER_LIST:
    for filename in walk_files(config.POSITIVE_DIR + "/" + user, '*.pgm'):
        faces.append(prepare_image(filename))
        labels.append(USER_LIST.index(user))
        pos_count += 1

# Read all negative images
for filename in walk_files(config.NEGATIVE_DIR, '*.pgm'):
    faces.append(prepare_image(filename))
    labels.append(config.NEGATIVE_LABEL)
    neg_count += 1

print 'Read', pos_count, 'positive images and', neg_count, 'negative images.'

Next, “train” the Eigenfaces algorithm to recognize whose faces are whose. It does this by mathematically figuring out all the ways the negative and positive faces are similar to each other and essentially ignoring this information as “fluff” that’s not particularly useful to identifying individuals.  Then, it focuses on all of the ways the faces are different from each other, and uses these unique variations as key information to predict whose face is whose. So, for example, if your friend has a unibrow and a mole on their chin and you don’t, the Eigenfaces algorithm would latch onto these as meaningful ways of identifying your friend.  The exact statistics of this are slightly over my head, but for those of you who are into that kind of thing, principal component analysis is the “special statistical sauce” that powers this process.

# Train model
print 'Training model...'
model = cv2.face.createEigenFaceRecognizer()
model.train(np.asarray(faces), np.asarray(labels))
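For the statistically curious, here’s a rough NumPy sketch of what that train() call is doing conceptually (this isn’t code we ran, and OpenCV’s internals differ in the details):

import numpy as np

# Flatten each training face into a row vector.
X = np.asarray([face.flatten() for face in faces], dtype=np.float64)
mean_face = X.mean(axis=0)          # this is the "mean Eigenface"
centered = X - mean_face
# The principal components ("Eigenfaces") are the directions in which
# the training faces differ from each other the most.
_, _, components = np.linalg.svd(centered, full_matrices=False)
eigenfaces = components[:50]        # keep the top 50 components
# Each face is then summarized by its projection coefficients.
weights = centered.dot(eigenfaces.T)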

When training the model, we can also examine an interesting byproduct–the “mean Eigenface”.  This is essentially an abstraction of what it means to have an entirely “average face”, according to our model:

[Image: the mean Eigenface generated from our training images]

Kind of bizarre, huh?

Now, the real test: we need to be able to recognize these faces from a webcam feed.  And unlike our training images, our faces may not be well-centered in our video feed.  We may have people moving around or off kilter, so how do we deal with this?  Enter…the Haar Cascade!

The Haar Cascade will scan through our webcam feed and look for “face-like” objects.  It does this by taking a “face-like” geometric template, and scanning it across each frame in our video feed very, very quickly.  It examines the edges of the various shapes in our images to see if they match this very basic template.  It even stretches and shrinks the template between scans, so it can detect faces of different sizes, just in case our face happens to be very close up or very far away.  Note that the Haar Cascade isn’t looking for specific individuals’ faces–it’s just looking for “face-like” geometric patterns, which makes it relatively efficient to run:

haar_faces = cv2.CascadeClassifier(config.HAAR_FACES)

def detect_single(image):
    """Return bounds (x, y, width, height) of detected face in grayscale image.
    If no face or more than one face are detected, None is returned.
    """
    faces = haar_faces.detectMultiScale(image,
                                        scaleFactor=config.HAAR_SCALE_FACTOR,
                                        minNeighbors=config.HAAR_MIN_NEIGHBORS,
                                        minSize=config.HAAR_MIN_SIZE,
                                        flags=cv2.CASCADE_SCALE_IMAGE)
    if len(faces) != 1:
        return None
    return faces[0]

Once the Haar Cascade has identified a “face-like” thing in the video feed, it crops off that portion of the video frame and passes it back to the Eigenfaces algorithm.  The Eigenfaces algorithm then churns this image back through its classifier.  If the image matches the unique set of statistically identifying characteristics of one of the users we trained it to recognize, it will spit out their name.  If it doesn’t recognize the face as someone from the group of users it was trained to recognize, it will tell us that, too!
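In between those two steps, the detected region gets cropped out of the frame and resized to the dimensions the model was trained on.  Roughly (gray_frame, config.FACE_WIDTH, and config.FACE_HEIGHT are assumed names here, not necessarily what the project uses):

# Detect a single face in the grayscale frame, then crop and resize it
# to match the training images before handing it to the recognizer.
result = detect_single(gray_frame)
if result is not None:
    x, y, w, h = result
    crop = cv2.resize(gray_frame[y:y + h, x:x + w],
                      (config.FACE_WIDTH, config.FACE_HEIGHT))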

# Test face against model.
label, confidence = model.predict(crop)

if label >= 0 and confidence < config.POSITIVE_THRESHOLD:
    print 'Recognized ' + USER_LIST[label]
else:
    print 'Did not recognize face!'

Interested in exploring this further with a class or as part of a research project?  Get in touch and we’re happy to help you on your way!

Related resources

  • Sobel, B. (11 June 2015). “Facial recognition technology is everywhere. It may not be legal.” The Washington Post. https://www.washingtonpost.com/news/the-switch/wp/2015/06/11/facial-recognition-technology-is-everywhere-it-may-not-be-legal/
  • Meyer, R. (24 July 2014). “Anti-Surveillance Camouflage for Your Face”. The Atlantic. http://www.theatlantic.com/technology/archive/2014/07/makeup/374929/
  • “How does Facebook suggest tags?” Facebook Help Center. https://www.facebook.com/help/122175507864081
  • “OpenCV Tutorials, Resources, and Guides”. PyImageSearch. http://www.pyimagesearch.com/opencv-tutorials-resources-guides/
  • “Face Recognition with OpenCV: Eigenfaces”. OpenCV docs 2.4.12.0. http://docs.opencv.org/2.4/modules/contrib/doc/facerec/facerec_tutorial.html#eigenfaces
  • “Face Detection Using Haar Cascades”. OpenCV Docs 3.1.0. http://docs.opencv.org/3.1.0/d7/d8b/tutorial_py_face_detection.html
  • “Raspberry Pi Face Recognition Treasure Box”. Adafruit tutorials. https://learn.adafruit.com/raspberry-pi-face-recognition-treasure-box/overview
  • F. Samaria & A. Harter. “Parameterisation of a stochastic model for human face identification” 2nd IEEE Workshop on Applications of Computer Vision December 1994, Sarasota (Florida).

Interaction via Google Cardboard

While devices like the Oculus Rift and HTC Vive generate a lot of the headlines in the world of VR, for most users, lower-cost and lower-fidelity devices like the Google Cardboard and Samsung GearVR are the more typical form of interaction.

The advanced headsets have many modes of interaction; at a minimum, they support gaming controllers or hand controllers, and many of them accurately track movement within a room via infrared technology.  When users are viewing content via Google Cardboard, the options for interaction are much more limited.

Cardboard provides a few methods for getting input from your user, depending on how creative you’re willing to be.  First off, all Cardboard units have a single button.  This button translates as a “tap” on the screen.  You can’t use it to track a specific touch location on the screen, but combined with gaze detection (figuring out what the user is looking at) you can build simple interactivity.  The “gaze – tap” interaction holds some nice potential as the basic “bread and butter” interaction for Google Cardboard.

We’ve been exploring this as a way to do simple interactive walkthroughs. In addition to the obvious “pick the direction you want to go” options, we believe additional controls can often be “hidden” at the top and bottom of the view sphere.  For example, your users may be able to look down to trigger a menu, or look up to go to a map.

Google Cardboard also gives you access to the various sensors available within a smartphone, like the accelerometer and gyroscopes.  These are useful, first and foremost, for tracking the movement of the headset itself, but these could also potentially be used for gesture control.  For example, you could watch for sudden impacts – such as tapping the side of the cardboard, or having the user jump up and down – to trigger certain interactions.
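As a rough sketch of what that kind of impact detection could look like (the threshold is a made-up starting point, and how you read the sensor depends on the platform):

import math

def is_impact(sample, threshold=2.5):
    # sample is an (x, y, z) accelerometer reading in g's.  A magnitude
    # well above 1 g suggests a jolt, such as a tap on the headset or a
    # jump; tune the threshold against real sensor data.
    magnitude = math.sqrt(sum(axis * axis for axis in sample))
    return magnitude > threshold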

If you’re building a native iOS or Android app (as opposed to a web application), you’ll also have access to the device’s microphone.  If that’s the case, speech recognition could also provide a lot of interesting flexibility.  For example, even basic detection of loud noises can allow for start/stop controls.

While the Cardboard technology is limiting in many ways, the limitations can actually be exciting, because they spur you to think of new ways to leverage the technology.  We’re excited about what’s possible, and we’re excited to hear from others!

Parsing and plotting OMNIC Specta SPA files with R and PHP

[This is a repost of an article that was originally published at DiscreteCosine.]

This is a quick “how-to” post to describe how to parse OMNIC Specta SPA files, in case anyone goes a-google’n for a similar solution in the future.

SPA files consist of some metadata, along with the data stored as little-endian float32 values. The files contain a basic manifest right near the start, including the offset and run length for the data. The start offset is stored at byte 386 (a two-byte integer), and the run length, in bytes, is at byte 390 (another two-byte int). The actual data is strictly made up of the little-endian floats – no start and stop, no control characters.

These files are pretty easy to parse and plot, at least to get a simple display. Here’s some R code to read and plot an SPA:

pathToSource <- "fill_in_your_path";
to.read = file(pathToSource, "rb");

# Read the start offset
seek(to.read, 386, origin="start");
startOffset # Read the length
seek(to.read, 390, origin="start");
readLength

# seek to the start
seek(to.read, startOffset, origin="start");

# we'll read four byte chunks
floatCount

# read all our floats
floatData

floatDataFrame floatDataFrame$ID<-seq.int(nrow(floatDataFrame))
p.plot p.plot + geom_line() + theme_bw()

In my particular case, I need to plot them from PHP, and already have a pipeline that shells out to gnuplot to plot other types of data. So, in case it’s helpful to anyone, here’s the same plotting in PHP.

<?php
function generatePlotForSPA($source, $targetFile) {
    $sourceFile = fopen($source, "rb");

    // Read the two-byte data offset at byte 386.
    fseek($sourceFile, 386);
    $targetOffset = current(unpack("v", fread($sourceFile, 2)));
    if($targetOffset > filesize($source)) {
        return false;
    }

    // Read the two-byte run length at byte 390.
    fseek($sourceFile, 390);
    $dataLength = current(unpack("v", fread($sourceFile, 2)));
    if($dataLength + $targetOffset > filesize($source)) {
        return false;
    }

    // Copy the raw float32 data out to a temporary file for gnuplot.
    fseek($sourceFile, $targetOffset);
    $rawData = fread($sourceFile, $dataLength);
    $rawDataOutputPath = $source . "_raw_data";
    $outputFile = fopen($rawDataOutputPath, "w");
    fwrite($outputFile, $rawData);
    fclose($outputFile);

    $gnuScript = "set terminal png size {width},{height};
set output '{output}';

unset key;
unset border;

plot '<cat' binary filetype=bin format='%float32' endian=little array=1:0 with lines lt rgb 'black';";

    $targetScript = str_replace("{output}", $targetFile, $gnuScript);
    $targetScript = str_replace("{width}", 500, $targetScript);
    $targetScript = str_replace("{height}", 400, $targetScript);
    $gnuPath = "gnuplot";
    $outputScript = "cat \"" . $rawDataOutputPath . "\" | " . $gnuPath . " -e \"" . $targetScript . "\"";
    exec($outputScript);

    if(!file_exists($targetFile)) {
        return false;
    }
    return true;
}
?>

VR, 360, and 3D oh my!

“Virtual reality” is here to stay, and the tools for authoring virtual reality content are becoming increasingly easy to access.  Whether it’s a basic spherical image or a fully interactive “real” virtual reality simulation, the semantics matter less than the perspective shift it can offer to learners and researchers in the liberal arts.

There’s a new technology on the horizon!  Quick, let’s have an argument over semantics!

If you spend any time with someone embedded in the world of virtual reality, at some point they’re likely to comment that such-and-such technology “isn’t actual virtual reality, it’s just spherical video.”  (The author, in fact, has been guilty of this on several occasions.)

In this post, we’ll break out the different terminology in the space. But first, let’s be clear: “Virtual Reality” has already won the semantic smackdown.  Just like we spent the 90s arguing about the difference between “hackers” and “crackers”, this argument has already been lost.

Spherical Imaging

Spherical imaging involves capturing an image of everything around a single point.  Think of it as a panorama photo, except the panorama goes all the way around you.  Spherical imaging lets you place your viewer in a position, and then they decide where they want to look.  It’s a great way to give someone a sense of a place without actually being there.  Spherical imaging can capture either still images or video, and the results can be viewed on either a normal computer screen or with some type of VR viewing headset.  Here’s an example of a spherical image, as it’s captured by the camera.  You’ll notice its raw form is kind of stretched and distorted.  This is called an “equirectangular” image:

[Image: R0010056, a raw equirectangular capture from the spherical camera]

And here’s an example of how you can interact with it.  Go ahead and poke, click and drag it – it won’t bite!

[Interactive spherical image viewer embedded here]
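The stretched look of the raw image above comes from the equirectangular projection itself: latitude and longitude on the sphere are mapped linearly onto the y and x axes of a flat image.  A quick Python sketch of that mapping (pixel conventions are an assumption, and real viewers handle these details for you):

import math

def sphere_to_pixel(longitude, latitude, width, height):
    # longitude runs from -pi to pi, latitude from -pi/2 to pi/2 (radians).
    # The linear mapping is why regions near the poles look smeared.
    x = (longitude + math.pi) / (2 * math.pi) * width
    y = (math.pi / 2 - latitude) / math.pi * height
    return x, y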

There are a few important distinctions to think about with spherical imaging.  First off, it’s not three dimensional.  Even though you can look all around, you can’t see different sides of an object.  This is particularly noticeable when objects are close to the camera.  Additionally, your viewer can’t move around.  The perspective is stuck wherever the camera was when the image was captured.

Many folks would argue that these two facts disqualify spherical images from being considered “virtual reality.”  We disagree, but we’ll get to that later.  If you’re interested in capturing your own spherical images, LATIS currently has two Ricoh Theta360 cameras available to borrow.  These are a simple, one-button solution for capturing these types of images.  If you’d like to give them a try, get in touch!

3D Imaging

At its most basic, 3D is just a matter of putting two cameras side by side, in a position that mimics the distance between human eyes.  Then, you simply capture two sets of images or videos set apart at this distance.  When displaying, you just need to send the correct image to the correct eye, and the viewer will have a 3D experience.  However, that’s a pretty limited experience, as the “gaze” remains relatively fixed.  The viewer can’t turn their head and look elsewhere, and they certainly can’t move around.  The more interesting type of 3D combines 3D with spherical imaging.

In order to capture spherical 3D, you need two spherical images, offset just like they’d be in the human head.  It’s a lot more complicated than putting two spherical cameras next to each other, though.  If you did that, you’d only get a 3D image when looking straight ahead or straight behind. At any other position, the cameras would block each other.  This is where things get math-y.

When folks capture spherical 3D today, they often do so by combining many traditional two-dimensional cameras in an array, with lots of overlap between the images.  Afterwards, software builds two complete spherical images with the right offsets.  This is a very processing-intensive approach.  Most of the camera arrays available on the market use inexpensive cameras like the GoPro, but require many cameras to generate the 3D effect.

If you’ve got something like a Google Cardboard viewer, you can see an example of a 3D Spherical video on YouTube.

Unfortunately, we don’t currently have any equipment for this type of capture.  Later in 2016, we expect a variety of more affordable 3D spherical cameras will begin shipping, and we’re excited to explore this space further.

Virtual Reality

When purists use the term “virtual reality,” they’re thinking about a very literal interpretation of the term.  “Real” virtual reality would be an experience so real, you wouldn’t differentiate it from actual reality.  We’re obviously not there yet, but there are a few basic features that are important to think about.

The first, most important factor in “real” virtual reality is freedom of movement.  Within a given space, the viewer should be able to move wherever they want, and look at whatever they want.  In a computer-generated environment, like a video game, that’s relatively easy.  If you want to provide that sort of experience using a real location, it’s a lot harder – after all, you can’t place a camera at every possible location in a room (though some advanced technology is getting close to that).

Today, virtual reality means creating a simulation, using technology similar to what’s used when making video games or animated films.  The creator pieces together different 3D models and images, adds animation and interactivity, and then the viewer “plays” the simulation.  While free or inexpensive software like Unity3d makes that feasible, it’s still a pretty complicated process.

Another important part of the “real” virtual reality experience is the ability to manipulate objects in a natural way.  Some of the newest virtual reality viewer hardware on the market, like the Oculus Rift and HTC Vive, offers hand controllers that allow you to gesture naturally in space.  Some technologies even track your movement within a room, so you can walk around.

We’re just getting started exploring these technologies, and are learning to build simulations with Unity3d.  If you’d like to work with us on this, please get in touch!