Speech Recognition

From wiki.visual-prolog.com

I am still going up the learning curve here, so I am sure I will say things which are not correct.

Revision as of 12:57, 2 January 2010


Microsoft has developed (or bought, probably!) a speech engine, provided as a DLL, which works very nicely (thanks to Thomas and co.) with Visual Prolog 7.2. It is surprisingly accurate and responsive, even with a cheap microphone.

The COM/DLL is purportedly provided with all copies of Windows. The file is "SAPI.DLL", and the version used here is SAPI 5.1. You should be able to locate this file on your computer, but to be sure you are using SAPI 5.1 you should download your own copy and place it in your project's EXE folder.

MS Download site; you need the 68 MB file SpeechSDK51.exe near the bottom of the page.

The SAPI overviews are here:

To do - SAPI5.1 on the MS site?

When your program runs, you say something into the mic, pause, and the speech callback is called. You then simply extract what was said as a string_list, which you pass on to your own predicate to process. To do - Other data can be extracted

Setting up your project

When you create a new VIP project and try to generate the Prolog "glue" to the SAPI COM, the code generated is not perfect. You will have noticed in the forum that this is not a trivial task, and everyone seems to write their COMs differently. So Thomas has provided the tidied-up code here:

So first generate the faulty COM code using the IDE (at the point where you add the DLL to the project) so that all the folders and classes are created, and then overwrite these files with the correct code in Windows Explorer.

Dictation versus Commands

Briefly, SAPI works in two modes. One is dictation - a "free form" mode for dictating letters etc. SAPI does a reasonable job. ToDo - is there a training mode?

The second mode is "command mode", whereby SAPI is provided with a grammar that makes it easier for it to understand, since there is only a restricted number of words to work with. If you give it a grammar such as:

  • "turn red"
  • "turn blue"

it will totally ignore the command if you say "turn yellow" - the callback function is not called at all. If you say "turn bed", it will probably return "turn red", or nothing at all.
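A minimal grammar file for the "turn red"/"turn blue" commands above might look as follows. This is a sketch based on my reading of the SDK manual; the attribute names (LANGID, NAME, TOPLEVEL) and their values are assumptions, so check them against the manual before relying on them:

```xml
<!-- Sketch of a SAPI 5.1 command grammar; attributes assumed from the SDK manual -->
<GRAMMAR LANGID="409">
  <RULE NAME="colour" TOPLEVEL="ACTIVE">
    <P>turn</P>
    <L>
      <P>red</P>
      <P>blue</P>
    </L>
  </RULE>
</GRAMMAR>
```

The <L> list means "exactly one of these children", so the rule matches "turn red" or "turn blue" and nothing else.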

The grammar file required (if you don't want dictation) is an XML file. The help file for writing an XML grammar file is provided in the download above. The rules for writing the XML file are straightforward, and for simple grammars the XML is easy to write. But if you get something slightly wrong (even though the XML structure itself is correct), you will get an exception when your program loads and compiles it (the SAPI engine compiles the XML).

The grammar file can have a rule saying "dictation" (free form) is expected.
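A sketch of such a rule, mixing a fixed command word with free-form dictation (again, the exact tag and attribute spellings are my assumptions from the SDK manual):

```xml
<!-- Sketch: a rule whose tail is free-form dictation -->
<GRAMMAR LANGID="409">
  <RULE NAME="takeNote" TOPLEVEL="ACTIVE">
    <P>note</P>      <!-- fixed lead-in word -->
    <DICTATION/>     <!-- everything after "note" is free-form -->
  </RULE>
</GRAMMAR>
```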

Grammar Format Tags

The XML Grammar Format Tags are described in the SDK Manual. Briefly these are:

  • <GRAMMAR> - the file starts with this, and ends with </GRAMMAR>;
  • <RULE> - the tag for defining sentences (a list of other tags). A RULE parent must always be <GRAMMAR>;
  • <DICTATION> - for free form dictation;
  • <LIST> or <L> - children can be lists of PHRASEs for example;
  • <PHRASE> or <P> - specifying the words to be recognised;
  • <OPT> or <O> - specifying words that might be said;
  • <RULEREF> - for recursively calling other RULEs;
  • <WILDCARD> - to do
  • <RESOURCE> - to do
  • <TEXTBUFFER> - to do

Many of the tags can have children which are other tags, but not all, and equally some tags are restricted as to their parent tag. <DICTATION> and <RULEREF> can have no children. <RULE> can only have <GRAMMAR> as a parent. <GRAMMAR> must have one or more <RULE>s as children, and no other type (except <ID> which is discussed below).
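To illustrate those parent/child constraints, here is a sketch of a grammar where one rule calls another via <RULEREF>. The attribute names are assumptions from the SDK manual:

```xml
<!-- Sketch: both RULEs are direct children of GRAMMAR, as required -->
<GRAMMAR LANGID="409">
  <RULE NAME="colour">              <!-- referenced rule -->
    <L>
      <P>red</P>
      <P>blue</P>
    </L>
  </RULE>
  <RULE NAME="command" TOPLEVEL="ACTIVE">
    <P>turn</P>
    <RULEREF NAME="colour"/>        <!-- RULEREF can have no children -->
    <O>please</O>                   <!-- an optional trailing word -->
  </RULE>
</GRAMMAR>
```

This matches "turn red", "turn blue please", and so on; note that <RULE> never nests inside another <RULE>.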