Tuesday, August 31, 2010

How *would* you test this?

So, I am using Google Voice as my voicemail these days. It's neat, and I like it, mostly. They offer, for free, a transcription service that sends me an email/txt message of what the caller said. Over the last 3 days, I've gotten calls from my son's middle school about him being absent (he has actually been withdrawn!).

The beauty of this setup is that the tester in me LOVES certain characteristics of this test set. The call is an automated one -- in this case, I know that the caller is saying the *same exact* thing every time. The tone is the same, the inflections are the same, it's the same voice source data. However, over 3 days, I've gotten 3 different transcriptions from Google Voice!

The voice mail *actually* says this:
Hello. This is the attendance office at Durant Road Middle school, calling to inform you that your child, Steven Cannan, was marked absent today. Please send a signed note upon returning to school, explaining the reason for the absence. Thank you.

And following are THREE Google Voice transcriptions ... they crack me UP!

Day 1:
Hello, this is the attendance office ed to Ron corrode middle school calling to inform you that your child. Hey, it's Dawn Cannan was marked absent today. Please send a signed note on returning to school, explaining the reason for the absence. Thank you.


Day 2:
Hello, this is the attendance office and to Ron corrode middle school calling to inform you that your child. Jeanne cannon. You're a smart ass in today. Please send a signed note up on returning to school, explaining the reason for the absence. Thank you.


Day 3:
Hello, this is the attendance office at Deron corrode middle school calling to inform you that your child G intended You're a smart ass IN today. Please send a signed note up. I'm returning to school, explaining the reason for the absence. Thank you.

I *kind of* think they did the best job on the first day :)

But this brings up an interesting point for me, and it's one that I have encountered several times in my career as a tester.

Roughly, the question is this: How do you test something when it is not feasible to prove, logically or mathematically, that it is *correct* every time?

I can think of so many examples where testing isn't cut and dried. I started working as a tester at a company that made software for processing genetics algorithms. How did we know the algorithms' results were correct? At the time, I started by having all of my testers learn to work the algorithms by hand, so that results that weren't right would just stick out to them.

At one point, I was testing a search engine. I was lucky in this case to have knowledge of the source data in a database. One approach I took here was to find objects in the source data that exhibited certain special (and easy to identify) characteristics. Then, I could perform searches that I *knew* would return those objects and look for them early in the test results. But even then, I didn't really know that the search engine was performing correctly.
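The seeded-object approach can be sketched as a simple oracle test. Everything here is hypothetical: `search_engine` is a stand-in for the real system under test, and the sentinel record is the kind of deliberately distinctive data I would plant in the source database.

```python
def search_engine(query, corpus):
    """Stand-in for the real engine: rank documents by query-word hits."""
    words = set(query.lower().split())
    scored = [(sum(w in doc.lower() for w in words), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, key=lambda s: -s[0]) if score > 0]

# Seed the corpus with a sentinel document whose contents we control
# completely, using terms no other record contains.
SENTINEL = "zzq-sentinel widget, sku 99999, unique marker phrase"
corpus = [
    "ordinary product description one",
    SENTINEL,
    "ordinary product description two",
]

results = search_engine("zzq-sentinel unique marker", corpus)

# Oracle: the planted document must come back, and come back first,
# because nothing else in the corpus matches those terms at all.
assert SENTINEL in results
assert results[0] == SENTINEL
```

The point isn't that this proves the engine correct; it's that a handful of planted records with known, unmistakable matches gives a cheap smoke test that catches gross ranking or indexing failures.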

Recently, I read a set of slides by Harry Robinson, talking about testing Google Maps. Here's another good example: How do you test something that has multiple right answers, in such a way that you feel confident releasing it to customers?

I've given two approaches that I have used, but I'd like to know how others have solved similar issues. Please, let's talk; I'd love to hear more from you.

8 comments:

Chris McMahon said...

very related: https://www.socialtext.net/writing-about-testing/index.cgi?breaking_the_bug_reporting_rules

This is a huge gaping hole, let us find some answers.

DiscoveredTester said...

I've actually worked on a voice transcription project before, and I can tell you that, even a few years ago, it often took a lot of time to train the transcription software before it could be relied on to be reliably accurate.

I'm not sure how you'd go about testing something like this and verifying that A B or C is the correct translation. I can remember as a child, waiting for my mother's soaps to go off so I could watch some cartoon, and I was almost convinced that Central And Mountain was some place or show LOL.

RN said...

Interesting post.

Another difficult kind of software to test would be an optimization algorithm.

"How can you confirm whether the result is truly optimal?"

Coming back to Google Voice -- I guess you will have to break the test into smaller components:

A) Does it detect the language (and then stop trying to make a fool of itself if it's not English)?

B) Test words rather than sentences.

C) Test the words together in a sentence after they pass individual tests.

Anonymous said...

It is pretty obvious that they are testing words.

I remember vividly --
when we were very young we loudly sang
"My country tis of V sweet land of livery"

It didn't make any sense - much of what we did in school did not make any sense, esp when you're four.


Crowdsourcing might work if you can mostly trust the testers/users.

"Please press 2 for 'no good.'" Some telco did that for a while. They asked for feedback on their automated system.

It is pretty obvious that they are NOT parsing the sentence. This is a difficult problem -- but of the same order of magnitude as voice-to-text.

If you knew the type of communication, you could filter for the words one used to not say on TV .. like "ass," which should not appear in business communication.

If you had some semantic parsing .. a large step up .. you could verify some of the info -- like: you are not your own child.

In related news...
You could check whether the transcriptions were the same .. the same sentence said by different speakers should at least agree. Here all three were different.

If you had text-to-speech .. you could run it back and compare .. speech to text .. text to speech ..

assuming you had that technology, which may or may not exist.
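The agreement check this comment suggests can be sketched with the standard library: `difflib` gives a rough word-level similarity ratio between two transcripts. The strings below are shortened from the transcriptions in the post; a real harness would compare each run against the others (or against a known reference) and flag any pair that falls below a chosen threshold.

```python
import difflib

def agreement(a, b):
    """Word-level agreement ratio between two transcripts (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

reference = "hello this is the attendance office at durant road middle school"
day1 = "hello this is the attendance office ed to ron corrode middle school"

score = agreement(reference, day1)

# Repeated transcriptions of the same recording should agree closely;
# a threshold like this flags runs that drift too far apart.
assert score > 0.5, "transcription disagrees too much with reference"
```

The threshold is arbitrary here; the useful property is that identical audio should produce near-identical text, so pairwise agreement is a test oracle that needs no ground truth at all.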

Software Testing said...

Interesting Post.

Anonymous said...

A related problem:

web says:
    component   current   fault
    CPU 1       26.00     85

telnet says:
    cpu 1       n/a       85

Now I don't know what the current temp is, but:
1. I know the two interfaces should report the same value.
2. I know the value should be between 0 and 200.
3. I know it should move no more than about 2C up or down between checks within the same day, or something is wrong.
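Those three invariants translate directly into a consistency check even when the "right" value is unknown. A minimal sketch, assuming the readings arrive as plain numbers (the function name and the 2-degree bound are just taken from the comment):

```python
def check_temp(web_value, telnet_value, previous=None, max_step=2.0):
    """Apply the three invariants above to one pair of sensor readings."""
    problems = []
    # 1. Both interfaces should report the same value.
    if web_value != telnet_value:
        problems.append("web and telnet interfaces disagree")
    # 2. The value should be physically plausible.
    if not 0 <= web_value <= 200:
        problems.append("value outside plausible range 0-200")
    # 3. The value should not jump more than ~2C between checks.
    if previous is not None and abs(web_value - previous) > max_step:
        problems.append("value jumped more than allowed since last check")
    return problems

# A healthy reading passes all three checks...
assert check_temp(85, 85, previous=84) == []
# ...while a disagreement between interfaces is flagged.
assert check_temp(85, 84) == ["web and telnet interfaces disagree"]
```

This is the same idea as the transcription problem: with no exact oracle, you test the properties the answer must satisfy rather than the answer itself.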

Anonymous said...

I would like to first understand the voice recognition system -- in particular, the demographic region where it will be deployed, for accents. Once I understood that, I could start on the transcription software. Each component would introduce possible failures, but broken down into functional components, you could probably redirect, with a database of unusual transcriptions, to start a correction process. Take some time and, in this case, as much as I hate the words -- test quality in! ..... blak, can't believe I said that.

JulianHarty said...

Hi,
You might find useful some ideas I described on StickyMinds: "Improving the Accuracy of Tests by Weighing the Results"
http://www.stickyminds.com/s.asp?F=S11983_COL_2