Like many others, I went down the path of trying to control my computer by voice when I developed some RSI pain. This is a collection of my thoughts and links to different resources.

The best dictation software that I’m aware of is Dragon. Regrettably, the Mac version has been discontinued, and the copies I find online are pricier than comparable Windows versions.

Dictation engines take context into account, so dictating a paragraph is going to be more accurate than dictating a single random word. They also rightfully prioritize prose, which makes programming, with its heavier use of symbols, cumbersome.

Custom grammars help a lot. They can be used to restrict what the speech engine recognizes, which can vastly improve accuracy. These grammars can be defined for Dragon using Dragonfly/Natlink. Dictation can also be forwarded between machines, such as from a guest Windows virtual machine to its host, using aenea. Aenea’s Getting Started guide takes you through all the steps of getting this setup up and running.
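For a flavor of what such a grammar looks like, here’s a minimal Dragonfly sketch; the phrases and keybindings are made up for illustration:

```python
from dragonfly import Grammar, MappingRule, Key, Text

class ExampleRule(MappingRule):
    # Spoken phrase on the left, emulated keystrokes/text on the right.
    mapping = {
        "save file": Key("c-s"),
        "new line": Key("enter"),
        "print statement": Text("print()") + Key("left"),
    }

# Restricting recognition to these phrases is what buys the accuracy.
grammar = Grammar("example commands")
grammar.add_rule(ExampleRule())
grammar.load()
```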

A virtual machine dedicated to speech recognition did not seem like the most efficient approach. Without Dragon for Mac, I looked into building my own setup, using CMUSphinx for commands and falling back to OS X Dictation when dictating an e-mail or some other prose.

“Python speech to text with PocketSphinx” is a great starting point for getting set up.

For the above, I had to tweak the threshold values. When the average silent intensity is low, the script defaults to a threshold of 3500, which is too high for me. I’m guessing too low a threshold might pick up too much noise; it’s probably also related to the mic.
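For reference, the decision being tuned is roughly this (a hypothetical sketch, not the blog post’s actual code; `THRESHOLD` and `is_speech` are names I made up):

```python
import audioop

THRESHOLD = 3500  # the default; too high for my mic, so I lowered it

def is_speech(chunk, sample_width=2):
    # Compare the chunk's root-mean-square intensity against the threshold;
    # anything above it is treated as speech rather than background noise.
    return audioop.rms(chunk, sample_width) > THRESHOLD
```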

The blog post uses the default PocketSphinx dictionary and language model, but we can generate our own using the online lmtool. Once generated, we’d swap the path in `config.set_string('-lm', os.path.join(MODELDIR, 'en-us/en-us.lm.bin'))` for the path to the generated lm file, and the path in `config.set_string('-dict', os.path.join(MODELDIR, 'en-us/cmudict-en-us.dict'))` for the path to the dict file. `config.set_string('-hmm', os.path.join(MODELDIR, 'en-us/en-us'))` would remain unchanged.
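Put together, the configuration would look something like this (assuming the same pocketsphinx-python bindings as the blog post; the `1234.lm`/`1234.dic` filenames are placeholders for whatever lmtool returns):

```python
import os
from pocketsphinx.pocketsphinx import Decoder

MODELDIR = 'pocketsphinx/model'  # wherever your models live

config = Decoder.default_config()
# The acoustic model stays the stock en-us one.
config.set_string('-hmm', os.path.join(MODELDIR, 'en-us/en-us'))
# Point -lm and -dict at the files generated by lmtool instead.
config.set_string('-lm', '/path/to/1234.lm')
config.set_string('-dict', '/path/to/1234.dic')
decoder = Decoder(config)
```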

With those modifications, we can inject our own handling where the script prints the detected phrase. Aenea’s server is a good place to look for examples of mapping phrases to actions, like aenea’s OS X server code.
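A minimal sketch of that kind of dispatch, in the spirit of aenea’s OS X server (the phrases and commands here are made up for illustration):

```python
import subprocess

COMMANDS = {
    "open terminal": ["open", "-a", "Terminal"],
    "volume up": ["osascript", "-e",
                  "set volume output volume "
                  "((output volume of (get volume settings)) + 10)"],
}

def handle_phrase(phrase):
    # Called where the detection script would otherwise just print the phrase.
    action = COMMANDS.get(phrase.strip().lower())
    if action:
        subprocess.call(action)
```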

I still have to look into mapping grammars to speech engines; I haven’t dug into how they work yet. One possibility is to have a program permute all possible sentences from the grammar, upload that corpus to lmtool, and then parse the results from the detection script.
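The permutation step itself is simple enough; a sketch with a made-up two-slot grammar:

```python
from itertools import product

# Each slot lists its alternatives; the full corpus is the cross product.
GRAMMAR = [
    ("go", "scroll"),                 # action alternatives
    ("up", "down", "left", "right"),  # direction alternatives
]

# Write every possible sentence out as the corpus to upload to lmtool.
with open("corpus.txt", "w") as f:
    for words in product(*GRAMMAR):
        f.write(" ".join(words) + "\n")
```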

I’ve heard of, but haven’t looked into, Talon. It seems they started with Dragon but are moving to a homegrown solution.

Travis Rudd’s voice coding video is the talk that started me on this path.

Emily Shea has a more recent voice coding talk, and it does a great job of showing what modern voice coding looks like. I hadn’t seen the homophones pop-up before, and I’ve had performance issues with my phonetic alphabet grammar; the talk is a reminder to look into both and see what else I can take away from it.