Difference between revisions of "Speech Recognition"

From wiki.visual-prolog.com

(DLL => COM component)
m (→‎Basics: fix spelling mistake)
 
(3 intermediate revisions by one other user not shown)
Line 1: Line 1:
===Basics===
===Basics===


Microsoft provides a Speech API - SAPI - which covers both Speech Recognition and Speech Synthesis. Most of this page deals with Speech Recognition (SR). It is provided as a COM component, which works very nicely (thanks to Thomas and co.) with Visual Prolog 7.2, and with a defined grammar. It is surprisingly accurate and responsive, even with a cheap microphone. Microsoft has produced SAPI recognizers for 4-5 languages including English. PDC has produced a SAPI recognizer for Danish:
Microsoft provides a Speech API - SAPI - which covers both Speech Recognition and Speech Synthesis. Most of this page deals with Speech Recognition (SR). It is provided as a COM component, which works very nicely (thanks to Thomas and co.) with Visual Prolog 7.2, and with a defined grammar. It is surprisingly accurate and responsive, even with a cheap microphone. Microsoft has produced SAPI recognizers for 4-5 languages including English. PDC has produced a SAPI recognizer for Danish: [http://www.pdc.dk/dk/dictus/ Dictus]. There may also be SAPI recognizers for other languages (but other than MS and PDC there don't seem to be any).
[http://www.pdc.dk/dk/dictus/ Dictus]. There may also be SAPI recognizers for other languages (but other than MS and PDC there don't seem to be any).


SAPI is the link between programs and Speech Recognition Engines, much like ODBC is the link between programs and SQL databases.
SAPI is the link between programs and Speech Recognition Engines, much like ODBC is the link between programs and SQL databases.


Generally, the COM/DLL is provided with Windows (but see below - "Availability").  With a copy of the DLL in your project folder, it's easy to import it into your project. It is not necessary for it to be there at run time (i.e. there's no need to distribute it with your app to end users).
Generally, the COM component is provided with Windows (but see below - [[#SAPI Availability]]).  With a copy of the DLL in your project folder, it's easy to import it into your project. It is not necessary for it to be there at run time (i.e. there's no need to distribute it with your app to end users).


If you need it (see "availability" below) you can download the SDK here:
If you need it (see "availability" below) you can download the SDK here: [http://www.microsoft.com/downloads/details.aspx?FamilyID=5e86ec97-40a7-453f-b0ee-6583171b4530&displaylang=en MS Download site]; you need the 68Mb file '''SpeechSDK51.exe''' near the bottom of the page.
[http://www.microsoft.com/downloads/details.aspx?FamilyID=5e86ec97-40a7-453f-b0ee-6583171b4530&displaylang=en MS Download site]; you need the 68Mb file '''SpeechSDK51.exe''' near the bottom of the page.


The SAPI overviews are here:
The SAPI overviews are here:


*[http://msdn.microsoft.com/en-us/library/aa911607.aspx SAPI5.0 overview]
*[http://msdn.microsoft.com/en-us/library/ee705648.aspx Microsoft Speech API (SAPI) 5.4 & 5.3]
*[http://msdn.microsoft.com/en-us/library/ms720151(VS.85).aspx SAPI5.3 overview]
 
<font color="red">To do - SAPI5.1 on the MS site?</font>
Also see:
<br>Also see:  
*[[wikipedia:Microsoft Speech API]]
*[http://en.wikipedia.org/wiki/Microsoft_Speech_API http://en.wikipedia.org/wiki/Microsoft_Speech_API]
*[http://support.microsoft.com/kb/306901/ http://support.microsoft.com/kb/306901/]
*[http://support.microsoft.com/kb/306901/ http://support.microsoft.com/kb/306901/]
*[http://support.microsoft.com/kb/306537/EN-US/ http://support.microsoft.com/kb/306537/EN-US/]
*[http://support.microsoft.com/kb/306537/EN-US/ http://support.microsoft.com/kb/306537/EN-US/]
*[http://msdn.microsoft.com/en-us/library/ee431799 http://msdn.microsoft.com/en-us/library/ee431799] (good introduction)
*[http://msdn.microsoft.com/en-us/library/ee431799 http://msdn.microsoft.com/en-us/library/ee431799] (good introduction)


When your program runs, you say something into the mic, pause, and the speech callback predicate is called. You then extract what was spoken as a string_list, which you pass onto your own predicate to process. <BR>
When your program runs, you say something into the mic, pause, and the speech callback predicate is called. You then extract what was spoken as a string_list, which you pass onto your own predicate to process.
<font color="red">To do - Other data can be extracted?<BR>
 
</font><BR><BR>
<font color="red">To do - Other data can be extracted?</font>


===Setting up your project===
===Setting up your project===
Line 32: Line 29:
* [http://discuss.visual-prolog.com/viewtopic.php?t=8206 SAPI COM glue]
* [http://discuss.visual-prolog.com/viewtopic.php?t=8206 SAPI COM glue]


So first generate the faulty COM code using the IDE (at the point where you add the DLL to the project) so that all the folders and classes are created, and then overwrite these files with the correct code in Windows explorer.<BR><BR>
So first generate the faulty COM code using the IDE (at the point where you add the DLL to the project) so that all the folders and classes are created, and then overwrite these files with the correct code in Windows explorer.


===Dictation versus Commands===
===Dictation versus Commands===


Briefly, SAPI SR works in two modes. One is dictation - a "free form" mode for dictating letters etc. SR isn't great in this mode. <font color="red">ToDo - is there a training mode?</font><br>
Briefly, SAPI SR works in two modes. One is dictation - a "free form" mode for dictating letters etc. SR isn't great in this mode. <font color="red">ToDo - is there a training mode?</font>


The second mode is "command mode" whereby SAPI is provided with a grammar that makes it easier for it to understand, since there is a restricted limited number of words to work with. Results are much better than in dictation mode. If you give it a grammar such as:
The second mode is "command mode" whereby SAPI is provided with a grammar that makes it easier for it to understand, since there is a restricted limited number of words to work with. Results are much better than in dictation mode. If you give it a grammar such as:
Line 50: Line 47:


===Grammar Format Tags===
===Grammar Format Tags===
The XML Grammar Format Tags are described the SDK Manual. Briefly these are:
The XML Grammar Format Tags are described the SDK Manual. Briefly these are:
<pre>
<GRAMMAR> - the file starts with this, and ends with </GRAMMAR>;
<RULE> - the tag for defining sentences (a list of other tags). A RULE parent must always be <GRAMMAR>;
<DICTATION> - for free-form dictation;
<LIST> or <L> - children can be lists of PHRASEs for example;
<PHRASE> or <P> - specifying the words to be recognised;
<OPT> or <O> - specifying words that might be said (i.e optional);
<RULEREF> - for recursively calling other RULEs;
<WILDCARD> - to allow recognition of some phrases without failing due to irrelevant, or ignorable words;
<RESOURCE> - to store arbitrary string data on rules (e.g. for use by a CFG Interpreter);
<TEXTBUFFER> - used for applications needing to integrate a dynamic text box or text selection with a voice command.
</pre><BR>
Many of the tags can have children which are other tags, but not all, and equally some tags are restricted as to their parent tag. <DICTATION> and <RULEREF> can have no children. <RULE> can only have <GRAMMAR> as a parent. <GRAMMAR> must have one or more <RULE>s as children, and no other type (except <ID> which is discussed below).<BR><BR>


Only the <PHRASE> and <OPT> tags contain words/phrases that will be recognisable spoken words<BR><BR>
:<GRAMMAR> - the file starts with this, and ends with </GRAMMAR>;
=== SAPI.DLL Availability/Versions ===
:<RULE> - the tag for defining sentences (a list of other tags). A RULE parent must always be <GRAMMAR>;
:<DICTATION> - for free-form dictation;
:<LIST> or <L> - children can be lists of PHRASEs for example;
:<PHRASE> or <P> - specifying the words to be recognised;
:<OPT> or <O> - specifying words that might be said (i.e optional);
:<RULEREF> - for recursively calling other RULEs;
:<WILDCARD> - to allow recognition of some phrases without failing due to irrelevant, or ignorable words;
:<RESOURCE> - to store arbitrary string data on rules (e.g. for use by a CFG Interpreter);
:<TEXTBUFFER> - used for applications needing to integrate a dynamic text box or text selection with a voice command.
 
Many of the tags can have children which are other tags, but not all, and equally some tags are restricted as to their parent tag. <DICTATION> and <RULEREF> can have no children. <RULE> can only have <GRAMMAR> as a parent. <GRAMMAR> must have one or more <RULE>s as children, and no other type (except <ID> which is discussed below).
 
Only the <PHRASE> and <OPT> tags contain words/phrases that will be recognisable spoken words
 
=== SAPI Availability/Versions ===


*<b>Windows Vista and Windows 7:</b> SAPI 5.3 is part of Windows Vista and Windows 7, but it will only work for the languages that Microsoft supports (and Danish with PDC's engine).
*'''Windows Vista and Windows 7:''' SAPI 5.3 is part of Windows Vista and Windows 7, but it will only work for the languages that Microsoft supports (and Danish with PDC's engine).


*<b>Windows XP:</b> On XP you will get SAPI 5.1 with Office 2003 (but not 2007), and you can get it as part of the SDK download mentioned above. And you can get it as a installer merge module to merge into an installer you create yourself.
*'''Windows XP:''' On XP you will get SAPI 5.1 with Office 2003 (but not 2007), and you can get it as part of the SDK download mentioned above. And you can get it as a installer merge module to merge into an installer you create yourself.


*<b>Notes:</b><br>
*'''Notes:'''
** You cannot (do not!) install SAPI 5.1 on a Vista or Windows 7.<br>
** You cannot (do not!) install SAPI 5.1 on a Vista or Windows 7.
** A program that uses SAPI 5.1 can also (without any changes) use SAPI 5.3.<br>
** A program that uses SAPI 5.1 can also (without any changes) use SAPI 5.3.
** The SAPI import provided by PDC is actually based on SAPI 5.3 (but it probably does not expose anything that is not also in 5.1). SAPI 5.3 is a conservative extension of SAPI 5.1. It's forwards compatible: a program that works with 5.1 will also work with 5.3, but not necessarily the other way around.<BR><BR>
** The SAPI import provided by PDC is actually based on SAPI 5.3 (but it probably does not expose anything that is not also in 5.1). SAPI 5.3 is a conservative extension of SAPI 5.1. It's forwards compatible: a program that works with 5.1 will also work with 5.3, but not necessarily the other way around.


=== Examples ===
=== Examples ===
Here are a few examples of grammars. These are the actual contents as would be stored in an XML file.
Here are a few examples of grammars. These are the actual contents as would be stored in an XML file.


* Example 1 - Recognises the word "hello" only.
* Example 1 - Recognises the word "hello" only.
<PRE>
 
<source lang="xml">
<GRAMMAR>
<GRAMMAR>
<DEFINE>
<DEFINE>
<ID NAME="test" VAL="1"/>
    <ID NAME="test" VAL="1"/>
</DEFINE>
</DEFINE>
  <RULE NAME="test" TOPLEVEL="ACTIVE">
    <RULE NAME="test" TOPLEVEL="ACTIVE">
    <P>hello</P>
        <P>hello</P>
  </RULE>>
    </RULE>>
</GRAMMAR>
</GRAMMAR>
</PRE>
</source>
<BR>
* Example 2 - All recognise the phrase "hello world".
* Example 2 - All recognise the phrase "hello world".
<PRE>
<source lang="xml">
<GRAMMAR>
<GRAMMAR>
<DEFINE>
<DEFINE>
Line 103: Line 103:
   </RULE>
   </RULE>
</GRAMMAR>
</GRAMMAR>
</PRE>
</source>
<BR>
 
<PRE>
<source lang="xml">
<GRAMMAR>
<GRAMMAR>
<DEFINE>
<DEFINE>
Line 114: Line 114:
   </RULE>
   </RULE>
</GRAMMAR>
</GRAMMAR>
</PRE>
</source>
<BR>
 
<PRE>
<source lang="xml">
<GRAMMAR>
<GRAMMAR>
<DEFINE>
<DEFINE>
Line 127: Line 127:
   </RULE>
   </RULE>
</GRAMMAR>
</GRAMMAR>
</PRE>
</source>
<BR>
 
* Example 3 - Recognises the phrases "hello" and "hello world" (i.e "world" is optional)
* Example 3 - Recognises the phrases "hello" and "hello world" (i.e "world" is optional)
<PRE>
<source lang="xml">
<GRAMMAR>
<GRAMMAR>
<DEFINE>
<DEFINE>
Line 140: Line 140:
   </RULE>
   </RULE>
</GRAMMAR>
</GRAMMAR>
</PRE>
</source>
<BR>
 
* Example 4 - Recognises "hello one two three"
* Example 4 - Recognises "hello one two three"
<PRE>
<source lang="xml">
<GRAMMAR>
<GRAMMAR>
<DEFINE>
<DEFINE>
Line 159: Line 159:
   </RULE>
   </RULE>
</GRAMMAR>
</GRAMMAR>
</PRE>
</source>
<BR>
 
<font color="red">provide more examples</font><BR><BR>
<font color="red">provide more examples</font>


=== Training ===
=== Training ===


Voice training is performed via the Windows Control Panel - Speech.
Voice training is performed via the Windows Control Panel - Speech.

Latest revision as of 19:35, 16 February 2010

Basics

Microsoft provides a Speech API - SAPI - which covers both Speech Recognition and Speech Synthesis. Most of this page deals with Speech Recognition (SR). It is provided as a COM component, which works very nicely (thanks to Thomas and co.) with Visual Prolog 7.2, and with a defined grammar. It is surprisingly accurate and responsive, even with a cheap microphone. Microsoft has produced SAPI recognizers for 4-5 languages including English. PDC has produced a SAPI recognizer for Danish: Dictus. There may also be SAPI recognizers for other languages (but other than MS and PDC there don't seem to be any).

SAPI is the link between programs and Speech Recognition Engines, much like ODBC is the link between programs and SQL databases.

Generally, the COM component is provided with Windows (but see below - #SAPI Availability). With a copy of the DLL in your project folder, it's easy to import it into your project. It is not necessary for it to be there at run time (i.e. there's no need to distribute it with your app to end users).

If you need it (see "availability" below) you can download the SDK here: MS Download site; you need the 68 MB file SpeechSDK51.exe near the bottom of the page.

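The SAPI overviews are here:

  • Microsoft Speech API (SAPI) 5.4 & 5.3: http://msdn.microsoft.com/en-us/library/ee705648.aspx

Also see:

  • Wikipedia: Microsoft Speech API
  • http://support.microsoft.com/kb/306901/
  • http://support.microsoft.com/kb/306537/EN-US/
  • http://msdn.microsoft.com/en-us/library/ee431799 (good introduction)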

When your program runs, you say something into the mic, pause, and the speech callback predicate is called. You then extract what was spoken as a string_list, which you pass on to your own predicate to process.

To do - Other data can be extracted?

Setting up your project

When you create a new VIP project and try to generate the Prolog "glue" to the SAPI COM component, the generated code is not perfect. As the forum discussions show, this is not a trivial task, and everyone seems to write their COM glue differently. So Thomas has provided the tidied-up code here:

  • SAPI COM glue: http://discuss.visual-prolog.com/viewtopic.php?t=8206

So first generate the faulty COM code using the IDE (at the point where you add the DLL to the project) so that all the folders and classes are created, and then overwrite these files with the correct code in Windows Explorer.

Dictation versus Commands

Briefly, SAPI SR works in two modes. One is dictation - a "free form" mode for dictating letters etc. SR isn't great in this mode. ToDo - is there a training mode?

The second mode is "command mode", in which SAPI is given a grammar. The grammar restricts the words it has to choose between, so recognition is much better than in dictation mode. If you give it a grammar such as:

  • "turn red"
  • "turn blue"

it will totally ignore the command if you say "turn yellow" - the callback function is not called at all. If you say "turn bed", it will probably return "turn red", or nothing at all.
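
The two-phrase grammar described above could be written roughly as follows. This is only a sketch modelled on the examples further down this page: the rule name "turn" and its ID value are made up for illustration, and the <L> list is used so that exactly one of the colours must follow "turn".

<GRAMMAR>
<DEFINE>
	<ID NAME="turn" VAL="1"/>
</DEFINE>
  <RULE NAME="turn" TOPLEVEL="ACTIVE">
      <P>turn</P>
      <L>
          <P>red</P>
          <P>blue</P>
      </L>
  </RULE>
</GRAMMAR>

With a grammar like this loaded, "turn yellow" matches no path through the rule, which would explain why the callback is never called for it.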

The grammar file required (if you don't want dictation) is an XML file. The help file for writing an XML grammar file is provided in the download above. The rules for writing the XML file are straightforward, and for simple grammars the XML is easy to write. But if you get something slightly wrong (even though the XML structure itself is correct), you will get an exception when your program loads and compiles it (the SAPI engine compiles the XML).

A rule in the grammar file can also specify that free-form dictation is expected as part of that rule.
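
As a rough illustration of mixing a fixed command with dictation, a rule might look like the sketch below. The rule name "note", its ID value and the phrase "take a note" are invented for this example. It assumes that a bare <DICTATION/> element stands for a dictated word at that position; the SDK manual should be checked for the attributes that control how many dictated words are accepted.

<GRAMMAR>
<DEFINE>
	<ID NAME="note" VAL="1"/>
</DEFINE>
  <RULE NAME="note" TOPLEVEL="ACTIVE">
      <P>take a note</P>
      <DICTATION/>
  </RULE>
</GRAMMAR>

The fixed phrase gives the recogniser a reliable anchor, and only the part after it is left to the weaker free-form mode.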

Grammar Format Tags

The XML Grammar Format Tags are described in the SDK Manual. Briefly these are:

<GRAMMAR> - the file starts with this, and ends with </GRAMMAR>;
<RULE> - the tag for defining sentences (a list of other tags). The parent of a <RULE> must always be <GRAMMAR>;
<DICTATION> - for free-form dictation;
<LIST> or <L> - children can be lists of PHRASEs for example;
<PHRASE> or <P> - specifying the words to be recognised;

<OPT> or <O> - specifying words that might be said (i.e. optional);
<RULEREF> - for recursively calling other RULEs;
<WILDCARD> - to allow recognition of some phrases without failing due to irrelevant, or ignorable words;
<RESOURCE> - to store arbitrary string data on rules (e.g. for use by a CFG Interpreter);
<TEXTBUFFER> - used for applications needing to integrate a dynamic text box or text selection with a voice command.

Many of the tags can have other tags as children, but not all, and some tags are restricted as to their parent tag. <DICTATION> and <RULEREF> can have no children. <RULE> can only have <GRAMMAR> as a parent. <GRAMMAR> must have one or more <RULE>s as children, and no other type (except <DEFINE>, which wraps the <ID> entries shown in the examples below).

Only the <PHRASE> and <OPT> tags contain the words or phrases that will be recognised when spoken.
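
To make the parent/child rules a little more concrete, here is a sketch that nests several of the tags. The rule name "door", its ID value and the vocabulary are all invented for illustration. The intent is that "please" is optional, and that one verb and one object must each be chosen from a list.

<GRAMMAR>
<DEFINE>
	<ID NAME="door" VAL="1"/>
</DEFINE>
  <RULE NAME="door" TOPLEVEL="ACTIVE">
      <O>please</O>
      <L>
          <P>open</P>
          <P>close</P>
      </L>
      <P>the</P>
      <L>
          <P>door</P>
          <P>window</P>
      </L>
  </RULE>
</GRAMMAR>

Note that the spoken words appear only inside <P> and <O>; the <L> and <RULE> elements merely group them.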

SAPI Availability/Versions

  • Windows Vista and Windows 7: SAPI 5.3 is part of Windows Vista and Windows 7, but it will only work for the languages that Microsoft supports (and Danish with PDC's engine).
  • Windows XP: On XP you will get SAPI 5.1 with Office 2003 (but not 2007), and you can get it as part of the SDK download mentioned above. You can also get it as an installer merge module to merge into an installer you create yourself.
  • Notes:
    • You cannot (do not!) install SAPI 5.1 on Vista or Windows 7.
    • A program that uses SAPI 5.1 can also (without any changes) use SAPI 5.3.
    • The SAPI import provided by PDC is actually based on SAPI 5.3 (but it probably does not expose anything that is not also in 5.1). SAPI 5.3 is a conservative extension of SAPI 5.1. It's forwards compatible: a program that works with 5.1 will also work with 5.3, but not necessarily the other way around.

Examples

Here are a few examples of grammars. These are the actual contents as would be stored in an XML file.

  • Example 1 - Recognises the word "hello" only.
<GRAMMAR>
<DEFINE>
    <ID NAME="test" VAL="1"/>
</DEFINE>
    <RULE NAME="test" TOPLEVEL="ACTIVE">
        <P>hello</P>
    </RULE>
</GRAMMAR>
  • Example 2 - Each of the following three grammars recognises the phrase "hello world".
<GRAMMAR>
<DEFINE>
	<ID NAME="test" VAL="1"/>
</DEFINE>
  <RULE NAME="test" TOPLEVEL="ACTIVE">
      <P>hello</P>
      <P>world</P>
  </RULE>
</GRAMMAR>

<GRAMMAR>
<DEFINE>
	<ID NAME="test" VAL="1"/>
</DEFINE>
  <RULE NAME="test" TOPLEVEL="ACTIVE">
      <P>hello world</P>
  </RULE>
</GRAMMAR>

<GRAMMAR>
<DEFINE>
	<ID NAME="test" VAL="1"/>
</DEFINE>
  <RULE NAME="test" TOPLEVEL="ACTIVE">
      <P>hello
         <P>world</P>
      </P>
  </RULE>
</GRAMMAR>
  • Example 3 - Recognises the phrases "hello" and "hello world" (i.e. "world" is optional)
<GRAMMAR>
<DEFINE>
	<ID NAME="test" VAL="1"/>
</DEFINE>
  <RULE NAME="test" TOPLEVEL="ACTIVE">
      <P>hello</P>
      <O>world</O>
  </RULE>
</GRAMMAR>
  • Example 4 - Recognises "hello one two three"
<GRAMMAR>
<DEFINE>
	<ID NAME="ref44" VAL="1"/>
	<ID NAME="test" VAL="2"/>
</DEFINE>
  <RULE NAME="test" TOPLEVEL="ACTIVE">
      <P>hello</P>
      <RULEREF NAME="ref44"/>
  </RULE>
  <RULE NAME="ref44">
      <P>one</P>
      <P>two</P>
      <P>three</P>
  </RULE>
</GRAMMAR>

provide more examples

Training

Voice training is performed via the Windows Control Panel - Speech.