Barry's Receptionist

Barry is a young, busy professor, and getting hold of him is notoriously difficult. His calendar at times shows busy even when he is in his office coming up with yet another grand plan for world domination with his unseen minions. To make matters worse, he doesn’t have a receptionist whom you could ask before knocking on his door and disturbing his super-secret meeting. Moreover, as a graduate student who would rather be near his computer than in the lab, I hate having to periodically check whether the boss is in when I want to have a quick talk. That’s why I have decided to get him a receptionist! With the brains and beauty of the Spark Core and muscle in the form of a microphone and a PIR motion detector, I think we have a complete package.

Here are the use cases we care about so far:
Barry is in and not talking on the phone or to a person in front of him -> Can meet
Barry is in but in a meeting (evident from him speaking at least once in, say, 5 mins) -> Can’t meet
Barry is not in -> Can’t meet
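
To make the three cases concrete, here is a minimal sketch of the decision logic, written in Python purely for prototyping (it would need porting to the Core's firmware). The names `Status`, `classify`, `motion_seen` and `seconds_since_speech` are hypothetical, and treating "PIR saw motion recently" as "Barry is in" is an assumption that may fail if he sits very still.

```python
from enum import Enum


class Status(Enum):
    CAN_MEET = "can-meet"
    IN_MEETING = "in-meeting"
    NOT_IN = "not-in"


# "at least once in say 5 mins" from the use cases above
SPEECH_WINDOW_S = 5 * 60


def classify(motion_seen, seconds_since_speech):
    """Map sensor summaries onto the three use cases.

    motion_seen: True if the PIR reported motion in the reporting window
                 (assumed here to mean Barry is in).
    seconds_since_speech: seconds since speech was last detected,
                          or None if none has been detected yet.
    """
    if not motion_seen:
        return Status.NOT_IN
    if seconds_since_speech is not None and seconds_since_speech <= SPEECH_WINDOW_S:
        return Status.IN_MEETING
    return Status.CAN_MEET
```

For example, `classify(True, 42)` gives `Status.IN_MEETING`, while `classify(True, None)` gives `Status.CAN_MEET`.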

I can use Spark.publish to post the current status (one of the three listed above) every 5 mins, and we graduate students could monitor it by subscribing to the private events.

PIR motion detection is straightforward. What I have no clue about is the mic part. There are several posts here regarding audio acquisition and processing, but none of them seems to have a working solution (I could have missed one, though). I feel what I need is the ability to sample the audio fast enough, store a few seconds’ worth of data, and perform an FFT to look for speech components. It may not be as trivial as this given the noise (though it’s usually fairly quiet), and that’s what I am not sure about. I have an ISD1820 voice module handy; it can be used to record audio and offload some tasks from the Spark Core.

Any other input or suggestions are appreciated. Does the overall approach seem feasible, or is there a better way to achieve this?

Of course, Barry has agreed to be tracked this way.

I hope Project Share is the right category to discuss ideas.


Very fun project!

I think you could get partway there by doing a simple ‘loudness’ measurement and trying to self-calibrate against the background noise level of the room when he is / isn’t talking to someone. Adafruit has a general example for a loudness measurement here -

I hope that helps! :slight_smile:
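
As a rough illustration of the self-calibration idea, here is a small Python sketch for experimenting with logged ADC readings. It uses peak-to-peak amplitude over a short window as the loudness measure, which is one common proxy and an assumption on my part, not necessarily what the linked example does. The function names and the `margin` factor are hypothetical.

```python
from statistics import median


def peak_to_peak(samples):
    """Loudness proxy: spread between the highest and lowest raw reading."""
    return max(samples) - min(samples)


def calibrate_threshold(quiet_levels, margin=2.0):
    """Derive a 'someone is talking' threshold from quiet-room readings.

    quiet_levels: peak-to-peak values recorded while the room is quiet.
    margin: assumed safety factor above the typical background level.
    """
    return margin * median(quiet_levels)
```

You would record a few windows while the office is known to be quiet, run each through `peak_to_peak`, and feed the results to `calibrate_threshold`; the median keeps a single stray bang from skewing the baseline.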



You probably want to try sampling a few loudness measurements over time, so you can (somewhat) discern actual speech from random environmental noise, for example Barry’s squeaky chair, a dropped pencil, or some shuffled papers. If you stay above some minimum loudness threshold over a period of, say, 0.5 seconds, that’s a pretty long, sustained sound – much more likely to be someone speaking.

The code that @Dave pointed to does this, somewhat, but I just wanted to suggest that you look at tweaking how long you sample, and how many samples you take. Experiment and see what seems to work with the most acceptable error rate. :smile:
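
The sustained-loudness check described above could be prototyped like this in Python on logged loudness values. The 10 ms sample period is an assumed value, `has_sustained_sound` is a hypothetical name, and requiring every sample in the run to be loud is a deliberately strict simplification.

```python
def has_sustained_sound(levels, threshold, sample_period_ms=10, min_duration_ms=500):
    """Return True if levels stay above threshold for at least min_duration_ms.

    levels: loudness readings taken every sample_period_ms (assumed spacing).
    """
    # Number of consecutive loud samples needed, rounded up.
    needed = -(-min_duration_ms // sample_period_ms)
    run = 0
    for level in levels:
        run = run + 1 if level > threshold else 0
        if run >= needed:
            return True
    return False
```

Real speech has short gaps between syllables, so in practice you may want to tolerate a few quiet samples inside a run rather than resetting the counter immediately.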


Thanks @Dave and @dougal for the pointers. I like the general idea of measuring loudness over a sufficiently long period of time. How fast do you reckon I should be sampling? Since I don’t care about aliasing with this approach, I can sample well below the standard 44.1kHz, right? I remember reading at some point that an Arduino can do 8kHz. I am sure the Spark Core can do better, even though I may not need that kinda resolution.


Like I said, you might want to experiment. Just guessing, I’d say a sampling window of somewhere from 200-500 ms would suffice, with a sample taken every 10 ms? I’d try logging data for various types of sounds, and comparing them. You might try different techniques for analyzing the data: average strength, median value, different minimum thresholds, etc.
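
For comparing logged sounds, something like the following Python helper might be a starting point. It is only a sketch: `summarize` is a hypothetical name, and the three statistics are just the ones mentioned above (average, median, and how often a minimum threshold is exceeded).

```python
from statistics import mean, median


def summarize(levels, threshold):
    """Summarize one window of loudness readings for side-by-side comparison."""
    loud = sum(1 for v in levels if v > threshold)
    return {
        "mean": mean(levels),
        "median": median(levels),
        "loud_fraction": loud / len(levels),  # share of samples above threshold
    }
```

Running it over windows labelled "typing", "chair", "conversation" and so on, with a few different thresholds, should show which statistic separates them best.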

After all, you aren’t trying to really analyze the sound in depth. You just need to be able to tell the difference between the sounds of a conversation versus the clicks of typing, chair squeaks, desk drawers closing, etc. Most of those will be pretty short duration.

Let’s see… A really fast typist can hit about 10 keystrokes/sec, with the sound of each click being pretty short, say around 25ms in duration. So on average, 25ms of sound and 75ms of silence between keystrokes. Sampling every 10ms, you might see 1 or 2 loud values out of every 10.

Compare that to speech, where you get much more continuous sounds. Even a very short spoken sound will probably fill up 100ms with lots of loud values.
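
That contrast (a couple of loud values per 100ms for typing versus mostly loud values for speech) suggests a very simple classifier on the fraction of loud samples. Here is a Python sketch; the function names and the 0.6 cutoff are assumptions picked for illustration, not measured values.

```python
def loud_fraction(levels, threshold):
    """Share of readings in the window that exceed the loudness threshold."""
    return sum(v > threshold for v in levels) / len(levels)


def classify_window(levels, threshold, speech_cutoff=0.6):
    """Label a window 'speech-like' if most of its samples are loud.

    speech_cutoff is an assumed value; tune it against logged data.
    """
    if loud_fraction(levels, threshold) >= speech_cutoff:
        return "speech-like"
    return "background"
```

With ten samples per 100ms, a keystroke window such as `[100, 100, 0, 0, 0, 0, 0, 0, 0, 0]` has a loud fraction of 0.2 and is labelled background, while a window that is loud in nine of ten samples is labelled speech-like.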

Again, I’m doing a lot of guesswork here. And I’m not accounting for frequency oscillations and such. I think I’d just have to experiment, sample, collect data, and determine how to sort things out from there.

In case it’s of interest, I found some articles about analyzing keystroke audio to try to capture data based on timing signatures. It’s primarily concerned with the timing between keystrokes to determine patterns, but they include some waveforms:

Anyhow, typing is just one sort of sound you’d want to classify as background noise. You’ll have to figure out what else to filter out. :smile: