This month was mostly fine-tuning and an efficiency pass.
Processing takes approx 1ms per 32ms frame.
Framing is the same. One 16ms frame, the rest 32ms.
Resolution has been increased to the values below:
438, 469, 500, 532, 563, 594, 625, 657, 688, 719, 750, 782, 1000, 1250, 1500, 1625, 1750, 1875, 2000, 2125, 2250, 2500, 3000, 3500, 4000, 5500, 8000
HRTF/Fletcher-Munson calibration values for the above are better.
Frequency Transition detection is the same.
More phonemes added (26 unique patterns: 23 vowels, 1 non-plosive consonant, 2 consonants).
That is basically all the vowels I need to add, which include some duplicates for accents. I have more patterns for consonants but I'm not testing those just yet.
Finally added Peak Volume Normalisation which doubled the reliability/robustness.
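As a rough illustration of what peak volume normalisation does (a minimal sketch, not the actual implementation): scale the samples so the loudest one hits a fixed target, so that a quiet take and a loud take of the same sound look alike to the later stages. The function name, the float sample range of -1.0 to 1.0 and the target of 1.0 are all assumptions.

```python
def peak_normalise(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    Assumes samples are floats in the range -1.0..1.0 (hypothetical format).
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Pure silence: nothing to scale, and this avoids dividing by zero.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet = [0.05, -0.125, 0.1]
loud = [0.1, -0.25, 0.2]
# The same shape at two volumes ends up identical after normalisation.
print(peak_normalise(quiet) == peak_normalise(loud))
```

Whether this is applied per frame or over a whole utterance is not stated above, so the sketch leaves that choice to the caller.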
"Cake" and "Tree" are IMO comfortably robust for a command-based system. I think most people would get them in 1-3 shots. That might seem like it has little value, but this uses no data, no training and no learning, handles mild accents, vagueness, and pitch & volume change, and is instant.
Phoneme Animations: This month I also planned a total redesign to suit realtime phoneme animations. The idea is to send an Animation ID & Length along with the audio chunk data, for 3D experiences and games.
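To make the "Animation ID & Length along with audio chunk data" idea concrete, here is one possible packet shape. This is only a sketch under assumptions: the 8-byte header layout, the field widths and the function names are all hypothetical, not the planned design.

```python
import struct

# Hypothetical header: animation id (uint16), animation length in ms (uint16),
# audio payload size in bytes (uint32). Little-endian, no padding.
HEADER = struct.Struct("<HHI")


def pack_chunk(animation_id, animation_ms, audio):
    """Prefix one audio chunk with its animation id and length."""
    return HEADER.pack(animation_id, animation_ms, len(audio)) + audio


def unpack_chunk(packet):
    """Split a packet back into (animation_id, animation_ms, audio)."""
    animation_id, animation_ms, size = HEADER.unpack_from(packet)
    audio = packet[HEADER.size:HEADER.size + size]
    return animation_id, animation_ms, audio


packet = pack_chunk(7, 48, bytes(4096))  # id 7 and 48ms are made-up values
print(len(packet), unpack_chunk(packet)[:2])
```

Keeping the animation fields in a small fixed header means a receiver can start the mouth movement as soon as the header arrives, before it has touched the audio payload.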
Sound waves travel one metre in ~2.9ms, so to be as real as possible, at 10 metres phoneme animations should start 29ms before the audio chunk data plays, and be seamless.
I calculated that with 4KB data chunks (128ms total @ 16kHz), the initial plosive consonant (48ms) can be processed and sent up to 78ms early (which is 27 metres of sound travel). 3KB data chunks work as well (up to 46ms early, or 16 metres). Animations will be 50-100ms late if you look at your own character while you speak.
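The timing figures above can be reproduced with a few lines of arithmetic. This is a sketch of one reading of the numbers, with two assumptions that are not stated in the text: the audio is 16-bit mono (which is what makes 4KB come out at 128ms @ 16kHz), and the 2ms gap is processing time for the two frames (16ms + 32ms) that cover the 48ms plosive.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # assumption: 16-bit mono PCM
MS_PER_METRE = 2.9     # sound travel time per metre, from the text
PLOSIVE_MS = 48        # initial plosive consonant, from the text
PROCESSING_MS = 2      # assumption: ~1ms for each of the two frames


def chunk_duration_ms(chunk_bytes):
    """Playback length of one audio chunk in milliseconds."""
    return chunk_bytes * 1000 / (BYTES_PER_SAMPLE * SAMPLE_RATE_HZ)


def early_send_ms(chunk_bytes):
    """How far ahead of the chunk's end the plosive result is ready."""
    return chunk_duration_ms(chunk_bytes) - PLOSIVE_MS - PROCESSING_MS


def covered_distance_m(lead_ms):
    """Distance at which sound travel time uses up the given lead."""
    return lead_ms / MS_PER_METRE


for size in (4096, 3072):
    early = early_send_ms(size)
    print(f"{size} bytes: {chunk_duration_ms(size):.0f}ms chunk, "
          f"{early:.0f}ms early, ~{covered_distance_m(early):.0f} metres")
```

With these assumptions the results are 78ms / 27 metres for 4KB chunks and 46ms / 16 metres for 3KB chunks, matching the figures above.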
90% of the quality factor in this will be initial plosive detection (regardless of the actual consonant), which depends only on sample volume. If the actual consonant is missed, only a general plosive mouth animation needs to be played. So the initial mouth movement will start at the right time, look 75%+ right, and end at the right time.
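A volume-only plosive check could look something like the following. It is a minimal sketch, not the detector described here: it flags the first short window that jumps from near-silence to a loud burst, and falls back to a generic plosive animation when no specific consonant was identified. The window size, both thresholds and the animation names are hypothetical.

```python
GENERIC_PLOSIVE = "plosive_generic"  # hypothetical animation name


def find_plosive_onset(samples, window=64, quiet=0.05, burst=0.3):
    """Return the start index of the first loud window that directly
    follows a quiet one, or None if there is no such jump.

    Assumes float samples in -1.0..1.0; 64 samples is 4ms at 16kHz.
    """
    prev_peak = None
    for start in range(0, len(samples) - window + 1, window):
        peak = max(abs(s) for s in samples[start:start + window])
        if prev_peak is not None and prev_peak < quiet and peak >= burst:
            return start
        prev_peak = peak
    return None


def choose_animation(onset, consonant=None):
    """Pick the recognised consonant's animation, else the generic one."""
    if onset is None:
        return None
    return consonant if consonant is not None else GENERIC_PLOSIVE


samples = [0.0] * 128 + [0.5] * 64   # silence, then a sudden burst
onset = find_plosive_onset(samples)
print(onset, choose_animation(onset))
```

Because the onset comes from volume alone, the animation can start on time even when the consonant itself is never identified.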
Current phoneme animations in games are either pre-calculated to pre-recorded audio, or random mouth movements.
Saying "tree":