Geeks With Blogs
Ulterior Motive Lounge UML Comics and more from Martin L. Shoemaker (The UML Guy),
Offering UML Instruction and Consulting for your projects and teams.
In Part 3, we built a Grammar for Dee Jay to recognize.

Update to Part 3

Driving around last night, it occurred to me that I can let the user specify what sort of media is expected. For example, I could say "Dee Jay, play song Has Been" to play the song, or "Dee Jay, play album Has Been" to play the album. This specifier should be optional, so the user only has to use it when there's a potential conflict. Besides making my Dee Jay experience a little more convenient, this also gives me a chance to demonstrate two more facets of M-SAPI Grammars: SemanticResultValue and repetitions.

A SemanticResultValue lets you map phrases to a given result value, which must be a bool, int, float, or string value. Recall from Part 2 that Dee Jay has three different types of MediaDescriptor: song, album, and collection. All sorts of musical information (artist, composer, publisher, genre, etc.) are treated simply as collection descriptors; but I wanted the user to be able to say "singer" or "artist" or "composer", as made sense for a given song. (And I wanted a good example for SemanticResultValue...) So I made a Choices, and then wrapped it in a SemanticResultValue:


private const string _Specifier = "Specifier";

private const string _Album = "Album";
private const string _Song = "Song";
private const string _Collection = "Collection";

private const string _Artist = "Artist";
private const string _Singer = "Singer";
private const string _Writer = "Writer";
private const string _Songwriter = "Song Writer";
private const string _Musician = "Musician";
private const string _Composer = "Composer";
private const string _Publisher = "Publisher";
private const string _Genre = "Genre";

/// <summary>
/// The set of collection names.
/// </summary>
private string[] mCollectionTypes;

...

mCollectionTypes = new string[] {_Collection, _Artist, _Singer, _Writer, _Songwriter, _Musician, _Composer, _Publisher, _Genre };

...

// Build the optional specifier.
Choices chcCollectionTypes = new Choices();
foreach (string collectionType in mCollectionTypes)
{
    GrammarBuilder gbCollectionType = new GrammarBuilder(collectionType);
    chcCollectionTypes.Add(gbCollectionType);
}
GrammarBuilder gbCollectionTypes = new GrammarBuilder(chcCollectionTypes);
SemanticResultValue semCollectionType = new SemanticResultValue(gbCollectionTypes, _Collection);


This code makes a Choices with all the different collection type phrases; and then it wraps them all up in a SemanticResultValue that maps every one of them to the value "Collection". So the user can say...


  • Dee Jay, play singer Jonathan Richman.
  • Dee Jay, play artist Jonathan Richman.
  • Dee Jay, play musician Jonathan Richman.
  • Dee Jay, play song writer Jonathan Richman.

But Dee Jay will hear "Dee Jay, play collection Jonathan Richman."

Next, I add the other specifiers (song and album), and wrap these all in a SemanticResultKey:


Choices chcSpecifiers = new Choices();
chcSpecifiers.Add(new GrammarBuilder(semCollectionType));
chcSpecifiers.Add(_Album);
chcSpecifiers.Add(_Song);
GrammarBuilder gbSpecifier = new GrammarBuilder(chcSpecifiers);
SemanticResultKey keySpecifier = new SemanticResultKey(_Specifier, gbSpecifier);
GrammarBuilder gbOptionalSpecifier = new GrammarBuilder(keySpecifier);


Now we need to modify the keyed commands to optionally include the specifier. GrammarBuilder includes a constructor which takes an existing GrammarBuilder and a minimum and maximum number of repetitions. The Append method has a similar overload:


// Build the keyed command grammar by appending music key
// to each command.
Choices chcKeyedCommands = new Choices();
foreach (string cmd in mKeyedCommands)
{
    GrammarBuilder gbKeyed = new GrammarBuilder(new SemanticResultKey(_Command, cmd));
    gbKeyed.Append(gbOptionalSpecifier, 0, 1);
    gbKeyed.Append(gbMusic);
    chcKeyedCommands.Add(gbKeyed);
}


With this code, any keyed command includes 0 or 1 specifier elements.

And now...

On with Part 4!



Now we need to create a SpeechRecognitionEngine and tell it to recognize the Grammar. And for any .NET programmer, this is honestly the easiest part:


/// <summary>
/// The recognition engine.
/// </summary>
private SpeechRecognitionEngine mRecoEngine = new SpeechRecognitionEngine();

...

// Start listening.
mRecoEngine.LoadGrammar(mGrammar);
mRecoEngine.SetInputToDefaultAudioDevice();
mRecoEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(mEngine_SpeechRecognized);
mRecoEngine.RecognizeAsync(RecognizeMode.Multiple);


We create a SpeechRecognitionEngine. We load our Grammar. We connect to an audio source (in this case, the default audio input). We add an event handler. And we start listening. It's as simple as that.

Only that's not so simple.

First, we have to decide whether to use SpeechRecognitionEngine or SpeechRecognizer. SpeechRecognizer is higher level and simpler, but more limited. In particular, it is limited to the default audio input. SpeechRecognitionEngine is lower level and has more options, including the option to read audio from files or streams. The MS docs are confusing about which one you should use:


While SpeechRecognitionEngine based applications can use the system default audio input and recognition engines, it is recommended that the SpeechRecognitionEngine object be used instead for that purpose.


Unless I'm missing something, I think that should read:


While SpeechRecognitionEngine based applications can use the system default audio input and recognition engines, it is recommended that the SpeechRecognizer object be used instead for that purpose.


But regardless, I prefer to use SpeechRecognitionEngine. SpeechRecognizer pops up the SpeechUI, a window that shows progress and tips as the user speaks. I find that annoying, honestly. Plus I like the added flexibility of SpeechRecognitionEngine. And, well, SpeechRecognitionEngine was the first recognizer class I found, so it's what I use by default. Maybe I'll explore the choice in more detail at another time.
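For the curious, here's a minimal sketch of that added flexibility: pointing a SpeechRecognitionEngine at a recorded file instead of the microphone. The file name and the DictationGrammar stand-in are assumptions for illustration only; Dee Jay itself listens to the default audio device with its own Grammar.

```csharp
using System;
using System.Speech.Recognition;

class WaveFileSketch
{
    static void Main()
    {
        using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())
        {
            // DictationGrammar is a stand-in; Dee Jay would load its own Grammar here.
            engine.LoadGrammar(new DictationGrammar());

            // Hypothetical file name: any .wav recording of speech.
            engine.SetInputToWaveFile("command.wav");

            // Recognize() blocks until a phrase is recognized. It returns null
            // if the engine could not recognize anything.
            RecognitionResult result = engine.Recognize();
            Console.WriteLine(result == null ? "(nothing recognized)" : result.Text);
        }
    }
}
```

SpeechRecognizer has no equivalent of SetInputToWaveFile, which is one concrete reason to prefer the engine when you want repeatable tests against recorded audio.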

Then we have to choose how we'll perform our recognition. There are two basic modes: synchronous and asynchronous. And then for asynchronous, we can choose to wait for just one event, or keep listening for multiple events. For Dee Jay, we choose asynchronous with multiple events, since that means Dee Jay listens continuously as it works.
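To make the three choices concrete, here's a rough sketch of how each one looks in code. The class name and the DictationGrammar stand-in are my assumptions for the example; Dee Jay uses only the last form, with its own Grammar.

```csharp
using System;
using System.Speech.Recognition;

class RecognitionModesSketch
{
    static void Main()
    {
        using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())
        {
            // DictationGrammar is a stand-in for a real command Grammar.
            engine.LoadGrammar(new DictationGrammar());
            engine.SetInputToDefaultAudioDevice();
            engine.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
            {
                Console.WriteLine("Heard: " + e.Result.Text);
            };

            // 1. Synchronous: block the calling thread for one recognition.
            //    The return value is null if nothing was recognized.
            RecognitionResult first = engine.Recognize();
            Console.WriteLine(first == null ? "(nothing)" : "Sync result: " + first.Text);

            // 2. Asynchronous, single: return immediately, stop after one recognition.
            // engine.RecognizeAsync(RecognizeMode.Single);

            // 3. Asynchronous, multiple: return immediately and keep listening
            //    until told to stop. This is the mode Dee Jay uses.
            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Listening. Press Enter to stop.");
            Console.ReadLine();
            engine.RecognizeAsyncStop();
        }
    }
}
```

The asynchronous forms deliver their results only through the SpeechRecognized event, which is why the event handler gets so much attention below.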

Next we have to implement our recognition event handler. And that's where the complexity can come in. I say can come in, because you can make it really simple; but simple for you is complex for your users, and vice versa. If you want satisfied users, you'll need to do some work.

Let's look at the declaration of the event handler. This should be old hat to .NET developers:


/// <summary>
/// A phrase was recognized.
/// </summary>
/// <param name="sender">The engine.</param>
/// <param name="e">The details.</param>
void mEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)


This is a standard EventHandler-style method, taking a sender and an argument object. In this case, the argument object is of type SpeechRecognizedEventArgs, a rich type with all the complexity you could ever want. The rest of our processing will focus on the contents of the SpeechRecognizedEventArgs.

The main component of SpeechRecognizedEventArgs is Result, an object of type RecognitionResult. This is a subclass of RecognizedPhrase, a more general class which we'll see more of later. RecognitionResult adds information about the audio stream, and also a list of alternate RecognizedPhrases.
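As a quick way to poke at those pieces, here's a small sketch that prints the best match, the audio duration, and the alternates for each recognition. DescribeResult is a hypothetical helper of my own, not part of Dee Jay, and the DictationGrammar is again just a stand-in.

```csharp
using System;
using System.Speech.Recognition;

class ResultSketch
{
    // Hypothetical helper: dump the interesting parts of a RecognitionResult.
    static void DescribeResult(RecognitionResult result)
    {
        // Text and Confidence are inherited from RecognizedPhrase.
        Console.WriteLine("Best match: {0} ({1:F2})", result.Text, result.Confidence);

        // Audio is added by RecognitionResult; guard in case it is unavailable.
        if (result.Audio != null)
        {
            Console.WriteLine("Audio duration: {0}", result.Audio.Duration);
        }

        // Alternates is a collection of RecognizedPhrase objects.
        foreach (RecognizedPhrase alternate in result.Alternates)
        {
            Console.WriteLine("  Candidate: {0} ({1:F2})", alternate.Text, alternate.Confidence);
        }
    }

    static void Main()
    {
        using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new DictationGrammar());
            engine.SetInputToDefaultAudioDevice();
            engine.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
            {
                DescribeResult(e.Result);
            };
            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();
            engine.RecognizeAsyncStop();
        }
    }
}
```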

Result contains the matched phrase; but as we saw in Part 3, we want the recognition engine to automatically break the phrase into SemanticValue objects for us. Here, for example, is the code for finding the command:


// Read the command.
string command = "";
if (e.Result.Semantics.ContainsKey(_Command))
{
    SemanticValue valCommand = e.Result.Semantics[_Command];
    command = valCommand.Value.ToString();
}


e.Result.Semantics is a dictionary that maps text keys to SemanticValue objects. Each SemanticValue in turn has a Value property that holds a bool, an int, a float, or a string.
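When I'm debugging a Grammar, I find it handy to dump the whole dictionary rather than probe one key at a time. Here's a sketch of that idea. DumpSemantics and the tiny two-word Grammar are assumptions made up for this example; only the "Command" key mirrors Dee Jay.

```csharp
using System;
using System.Collections.Generic;
using System.Speech.Recognition;

class SemanticsSketch
{
    // Hypothetical helper: print every key in a result's semantics.
    static void DumpSemantics(SemanticValue semantics)
    {
        // SemanticValue enumerates as key/value pairs, like any dictionary.
        foreach (KeyValuePair<string, SemanticValue> pair in semantics)
        {
            Console.WriteLine("{0} = {1} (confidence {2:F2})",
                pair.Key, pair.Value.Value, pair.Value.Confidence);
        }
    }

    static void Main()
    {
        // A toy Grammar: one keyed element that matches "play" or "stop".
        GrammarBuilder words = new GrammarBuilder(new Choices("play", "stop"));
        GrammarBuilder keyed = new GrammarBuilder(new SemanticResultKey("Command", words));

        using (SpeechRecognitionEngine engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new Grammar(keyed));
            engine.SetInputToDefaultAudioDevice();
            engine.SpeechRecognized += delegate(object sender, SpeechRecognizedEventArgs e)
            {
                DumpSemantics(e.Result.Semantics);
            };
            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();
            engine.RecognizeAsyncStop();
        }
    }
}
```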

Now we can read our Dee Jay name:


// All other commands require a name.
if (!e.Result.Semantics.ContainsKey(_DJ))
{
    return;
}
SemanticValue valName = e.Result.Semantics[_DJ];
if (valName.Confidence < 0.8)
{
    return;
}


Each SemanticValue includes a Confidence value from 0 to 1, indicating how strongly that element was matched. I found that it was easy for an entire command to be matched by casual conversation, without me ever actually saying "Dee Jay". So I separately test the Confidence of the name, just to be sure it was there. (RecognizedPhrases also have a Confidence value, which will be useful in other parts of Dee Jay.)

Next we read the optional specifier:


// Music commands may include a specifier.
string specifier = "";
if (e.Result.Semantics.ContainsKey(_Specifier))
{
    SemanticValue valSpecifier = e.Result.Semantics[_Specifier];
    if (valSpecifier.Confidence >= 0.8)
    {
        specifier = valSpecifier.Value.ToString();
    }
}
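The "look up the key, then check its Confidence" pattern shows up for both the name and the specifier, so it could be folded into one helper. This is only a sketch: ReadConfident and the SemanticReader class are hypothetical names of mine, and the 0.8 threshold is simply the value used above.

```csharp
using System;
using System.Speech.Recognition;

static class SemanticReader
{
    // Hypothetical helper: return the value stored under key, or "" when the
    // key is missing, has no value, or was matched below minConfidence.
    public static string ReadConfident(SemanticValue semantics, string key, float minConfidence)
    {
        if (!semantics.ContainsKey(key))
        {
            return "";
        }
        SemanticValue value = semantics[key];
        if (value.Confidence < minConfidence || value.Value == null)
        {
            return "";
        }
        return value.Value.ToString();
    }

    static void Main()
    {
        // A bare SemanticValue has no child keys, so the lookup falls through to "".
        SemanticValue empty = new SemanticValue("placeholder");
        string specifier = ReadConfident(empty, "Specifier", 0.8f);
        Console.WriteLine(specifier.Length);
    }
}
```

Inside the event handler, the specifier block would then shrink to a single call such as ReadConfident(e.Result.Semantics, _Specifier, 0.8f).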


The most complicated part of Dee Jay's recognition, though, is the music phrase itself. That's complex, and my time here is short. So I'll save that for the next post.
Posted on Saturday, November 15, 2008 4:38 PM | .NET, M-SAPI


Comments on this post: Dee Jay, Part 4: I recognize that!

No comments posted yet.


Copyright © Martin L. Shoemaker | Powered by: GeeksWithBlogs.net