How to synthesise consonant sounds


A bit off-topic from Drambo, but I've been searching for examples of consonant sound synthesis and have really found nothing. My thought is that it has to be done with noise, fast envelopes and filters. But since there are members with quite some skill in the synth area in this forum, maybe someone can give me links or advice on that topic?
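
As a rough illustration of the "noise, fast envelopes and filters" idea, here is a minimal Python sketch of an "s"-like burst: white noise through a one-pole high-pass filter, shaped by a fast exponential decay. The function name `noise_burst`, the 3 kHz cutoff and the 30 ms decay are assumptions picked for illustration, not values anyone in this thread has confirmed.

```python
import math
import random


def noise_burst(duration_s=0.12, sample_rate=16000, decay_s=0.03,
                cutoff_hz=3000.0, seed=0):
    """White noise -> one-pole high-pass -> fast exponential decay envelope."""
    rng = random.Random(seed)
    n = round(duration_s * sample_rate)
    # One-pole low-pass coefficient; the high-pass is input minus low-pass.
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    low = 0.0
    out = []
    for i in range(n):
        x = rng.uniform(-1.0, 1.0)
        low = (1.0 - a) * x + a * low
        high = x - low
        env = math.exp(-i / (decay_s * sample_rate))
        out.append(high * env)
    return out


burst = noise_burst()
```

Lowering `cutoff_hz` should darken the hiss towards something more "sh"-like, though a single one-pole filter is probably far too crude for anything intelligible.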




  • Making a formant filter out of a few bandpass filters would be a good start.

    Where did you get that term from? Do you mean vowel-consonant synthesis?

  • Indeed, consonants leave some room for research, although there are scientific papers that go into detail on speech synthesis.

    You basically have several options:

    • Use samples for the ssh, plop, rrr, b, p, g, k, d, t and so on
    • Synthesize them by using custom pulses (Graphic shaper to the rescue), filters and Drambo's physical models & FX
    • Use wavetables (2048 samples per WT frame can be enough for doing limited bandwidth speech synthesis).
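
To make the wavetable option a bit more concrete, here is a minimal Python sketch that builds one 2048-sample, single-cycle frame by additive synthesis, with harmonic amplitudes shaped by a few formant bumps. The helper names, the bell-shaped gain curve and the formant values are assumptions for illustration; this is not how Drambo builds its wavetables.

```python
import math

FRAME_SIZE = 2048  # samples per wavetable frame, as mentioned above


def formant_gain(freq_hz, formants, bandwidth_hz=120.0):
    """Sum of bell-shaped bumps centred on each formant frequency."""
    return sum(math.exp(-((freq_hz - f) / bandwidth_hz) ** 2) for f in formants)


def vowel_frame(formants, f0_hz=110.0, max_hz=5000.0):
    """Build one single-cycle frame by additive synthesis."""
    n_harmonics = int(max_hz // f0_hz)
    amps = [formant_gain(k * f0_hz, formants) for k in range(1, n_harmonics + 1)]
    frame = []
    for i in range(FRAME_SIZE):
        phase = 2.0 * math.pi * i / FRAME_SIZE
        frame.append(sum(a * math.sin(k * phase)
                         for k, a in enumerate(amps, start=1)))
    peak = max(abs(s) for s in frame)
    return [s / peak for s in frame]


# Assumed, rough formant values for an "a"-like vowel (illustrative only).
frame_a = vowel_frame([750.0, 1200.0, 2500.0])
```

Because the frame only contains whole-number harmonics, it loops cleanly; a set of such frames could then be morphed between, which covers steady sounds but not the noisy parts of consonants.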
  • Well it's not a term, I just tried to describe what I'm after 🙂

  • edited November 2020

    Vowels are simple (a, e, i, o, u, ü, ö, ä, y).

    Use something sample-based for the rest, as plosives (p, b, d, t) are a nightmare to synthesize.

    ch, sh, sch and zzz kind of work.

    L, n and m are my worst nightmare, so use something sample-based for those too. (It's so tricky: you think, if I can make it go "aaaaaaa", I can make it go "LLLLLLLL" or "nnnnnnn" or "mmmmmm".)



    Btw, it's easy to fool yourself into thinking it works here.

    Play it to someone else to see if it works ;)

    I wasted months and months on this (analyzing my voice and the newsreader from channel one).

    In the end I couldn't make it say something simple like

    "the bakerman has baked bread" ^^

    in a way that was halfway understandable. I ended up with, hm, interesting frequency garbage,

    so I simply gave up.

    More luck to you ;)

  • This is in German, so you will have to use Google Translate.

  • edited November 2020

    If you look at Wolfgang Palm (PPG),

    he is using wavetables to do the tricky stuff.

    To me it looks like that is the only way to go (or something else sample-based).

    (I tried all kinds of other shit that I could think of and it didn't really work out in the end; it was unintelligible ;) )

    Above is the frequency table for all formants:

    m is male, w is female, ch is children.

    You just need 4 bandpass filters and a rich source,

    or take 4 sine waves ... whatever.

    Good luck with the l, m, n stuff,

    sadly it just doesn't work like that at all ;)
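
The "4 bandpass filters and a rich source" recipe can be sketched in Python roughly like this. The biquad coefficients follow the widely used RBJ cookbook band-pass form; the class and function names, the Q value and the formant frequencies are assumptions for illustration (the table from the thread is not reproduced here).

```python
import math


class BandPass:
    """Biquad band-pass (RBJ cookbook form, 0 dB peak gain)."""

    def __init__(self, center_hz, q, sample_rate):
        w0 = 2.0 * math.pi * center_hz / sample_rate
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        self.b0 = alpha / a0
        self.b2 = -alpha / a0
        self.a1 = -2.0 * math.cos(w0) / a0
        self.a2 = (1.0 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, x):
        # b1 is zero for this filter type, so x1 does not appear here.
        y = (self.b0 * x + self.b2 * self.x2
             - self.a1 * self.y1 - self.a2 * self.y2)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y


def formant_bank(signal, formants_hz, sample_rate=16000, q=12.0):
    """Run the signal through parallel band-pass filters and sum them."""
    filters = [BandPass(f, q, sample_rate) for f in formants_hz]
    return [sum(f.process(x) for f in filters) for x in signal]


def sawtooth(freq_hz, n, sample_rate=16000):
    """Naive (aliasing) sawtooth as a harmonically rich source."""
    return [2.0 * ((i * freq_hz / sample_rate) % 1.0) - 1.0 for i in range(n)]


# Assumed, rough formant values for an "a"-like vowel.
vowel = formant_bank(sawtooth(110.0, 4000), [750.0, 1200.0, 2500.0, 3500.0])
```

This kind of static bank is fine for held vowels; it says nothing about the l, m, n problem, which is exactly where the approach falls short in this thread.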

  • edited November 2020

    Formants here are done with a graphic EQ.

    (In 3 I show off some of the stuff that didn't work out.)

    You can basically use anything, it's just about the specific frequencies.

    Sadly I no longer have the examples that clearly show it doesn't work.

    I tried to make it say simple words like "Ananas" (pineapple).

    Doesn't work.

  • edited November 2020

    Have fun

    another way to get to formants is the graphic shaper ;)

  • edited November 2020

    On the other hand, I may give l, m, n a 2nd try.

  • edited November 2020

    Hehe, as I remembered, the other stuff doesn't work.

    It's not recognizable

    (with a little noise we get a little closer, but it's still unintelligible).

    I can fool myself into thinking this sounds kind of like L,

    but nope,

    it's an illusion.

    If I try to make it say "lala",

    I get "?a?a" back.



  • edited November 2020

    You can hear that L doesn't work.

    It still sounds somewhat vowel-like, but nothing recognizable.

    (Move the morph fader.)

  • edited November 2020

    Pseudo N

    Oh, this actually works a little better than I remembered,

    so there is room for experiments here.

    It sure works for musical purposes as some absurd filtering, but not as intelligible speech.

    If someone comes up with something intelligible I will invite him over for dinner ;)

    (I expect this won't happen.)

    Btw, it doesn't get easier: k is just a click followed by silence ...

    Hm, the frustrating thing is that I find only papers from linguistic research.

    There must be better papers from sound people about non-vowels???

    Most of the stuff where you think "this is synthesized speech" is fake, meaning not really synthesized.

    It's little samples glued together, or someone talking into a vocoder or frequency shifter and so on.

    Everything except vowels is a ridiculous amount of work with little to show for it.

  • Yes, it's a lot of work, no question.

    Arbitrary control of custom wavetables for vowels and consonants is possible in Drambo.

    I've done it in the wavetable demo song and I would still stick to it and combine with consonant samples where appropriate.

    Some consonants like the "L" @lala mentioned are actually vowels too.

    BTW @lala, dinner sounds good 😋

  • edited November 2020

    See, these linguists confuse me ^^

    They don't call "L" a vowel,

    it's a "lateral" or something for them.

    I can't see much difference between L and A either (except there is something different about "L",

    because I just can't make it work ... 3 formants vs 4?). Meh.

    I get kind of close with the formant frequencies from the pictures above, but I definitely don't hear "L".

    I hear something that is similar to an "Ä" or so.

    I have only looked at the spectrum in my research on speech. Maybe I should start looking at the actual waveforms and see what I can bend, because I am stuck?

    I make passable lasagne. ^^

    L, M, N can't be that complex, I can hum them as a constant sound, god damned 🤯

  • I don't want to cheat with samples, I want to really understand what's going on ;)

  • Hehe, sounds like we need a tweakable vocal tract and mouth/tongue/teeth model 😁

  • I'm just gonna uh… leave this here 😅

  • @orchid This one is excellent indeed, I'm glad it's still online!

  • edited November 2020

    Well, it's pretty primitive, isn't it?

    While it does some articulations, it doesn't do them all (lol, they left plosives out because the interaction of things is too heavy): say pineapple, say concubine, say F### you.

    Nor is it able to articulate something (say lala).

    (They are still making endoscope videos of how certain articulations in certain languages are made.)

    With spoken or sung language the frequencies change depending on which letter came before (vowels don't change).

  • edited November 2020

    @lala it's not easy indeed.

    Imagine we already had a model that could produce all sounds, how would you control the maybe 40 parameters to produce actual speech?

    That old black and white video showing the secretary typing speech might be the way to go, I mean we have pad controllers and MIDI keyboards and the sequencer could record it 😅

  • edited November 2020

    You mean the Bell Labs demo?

    I'm not quite sure how that worked.

    I think it had the words to say preprogrammed?

    Oh no, it was all realtime (they say something about 20 sounds, hm, it couldn't do it all, there are more letters in the alphabet ...).

    Looking at the specs of the Voder, it's just pulse waves, noise and a formant filter

    (that explains why you can't understand shit it is trying to say) ^^
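
The "pulse waves plus noise" half of that design can be sketched as a switchable excitation source: a buzz for voiced sounds and a hiss for unvoiced ones, which would then be fed into whatever formant filter is at hand. This is only a Python sketch under assumptions; the function names and the `(kind, duration)` segment format are hypothetical and not taken from the Voder's actual specs.

```python
import random


def pulse_train(n, f0_hz, sample_rate=16000, width=0.1):
    """Narrow pulse wave: 1.0 for the first `width` of each period, else 0.0."""
    return [1.0 if (i * f0_hz / sample_rate) % 1.0 < width else 0.0
            for i in range(n)]


def hiss(n, seed=0):
    """Uniform white noise in the range -1..1."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def excitation(segments, sample_rate=16000, f0_hz=110.0):
    """Concatenate voiced ("buzz") and unvoiced ("hiss") segments.

    `segments` is a list of (kind, duration_s) pairs (hypothetical format).
    """
    out = []
    for kind, duration_s in segments:
        n = round(duration_s * sample_rate)
        if kind == "buzz":
            out.extend(pulse_train(n, f0_hz, sample_rate))
        elif kind == "hiss":
            out.extend(hiss(n))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return out


# 50 ms of hiss followed by 200 ms of buzz, e.g. a crude "s" into a vowel.
source = excitation([("hiss", 0.05), ("buzz", 0.2)])
```

The hard switch between the two sources is deliberate here; it makes it easy to hear how much of the intelligibility problem sits in the transitions rather than in the steady sounds.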

  • edited November 2020

    thinking about it,

    A French rolling "RRRRR" should also be achievable.

    so to do:


    (L is somehow a mixture of vowel and consonant)




    F, (V) (F is similar to M & N & S)





    english TH



    Someone remind me to look into this again when we have had a few updates.

    I grouped them into similar sounds already.


    If I do M, N, F, S, it's kind of like I'm starting low in the spectrum and then going all the way through it;

    it starts at the throat and ends at the lips.


    Note to myself: 4 or 5 rooms.

  • edited November 2020

    It would be quite helpful to find a way to synthesize all kinds of transitions between any vowel and consonant without bloating the project. Wavetables are a good start but they can't cover them all. Maybe an additional bank of steady noisy samples that can be crossfaded freely? At least for a start; once everything works as expected, one can still rebuild the noise samples with synth modules.

    I would use MIDI note numbers to control all that, 128 notes should be enough to control the robot...

    Pitch would be controlled independently (pitch bend), as would be vibrato (mod wheel) and maybe even a few CCs for shaping the vocal tract, mouth, lips, teeth and tongue position. Oh my, that sounds like a project eating up the whole Christmas holidays, just to implement something that I can do myself 😅
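
A minimal Python sketch of that control scheme, working on raw 3-byte MIDI messages so it needs no MIDI library. The note-to-sound layout, the 2-semitone bend range and the `VoiceState` fields are all assumptions for illustration; only the message layout (note on `0x90`, control change `0xB0`, 14-bit pitch bend `0xE0` centred on 8192, mod wheel as CC 1) is standard MIDI.

```python
from dataclasses import dataclass, field

# Hypothetical note-to-sound layout; any assignment would do.
NOTE_TO_SOUND = {60: "a", 61: "e", 62: "i", 63: "o", 64: "u",
                 65: "s", 66: "sh", 67: "f", 68: "m", 69: "n", 70: "l"}

BEND_RANGE_SEMITONES = 2.0  # assumed pitch-bend range


@dataclass
class VoiceState:
    sound: str = ""             # currently selected vowel/consonant
    pitch_offset: float = 0.0   # semitones, from pitch bend
    vibrato_depth: float = 0.0  # 0..1, from the mod wheel (CC 1)
    tract: dict = field(default_factory=dict)  # other CCs, scaled to 0..1


def handle_midi(state, status, data1, data2):
    """Update `state` from one raw 3-byte MIDI channel message."""
    kind = status & 0xF0
    if kind == 0x90 and data2 > 0:   # note on selects the sound
        state.sound = NOTE_TO_SOUND.get(data1, state.sound)
    elif kind == 0xE0:               # pitch bend: LSB first, then MSB
        value = (data2 << 7) | data1
        state.pitch_offset = (value - 8192) / 8192 * BEND_RANGE_SEMITONES
    elif kind == 0xB0:               # control change
        if data1 == 1:
            state.vibrato_depth = data2 / 127
        else:
            state.tract[data1] = data2 / 127
    return state
```

For example, `handle_midi(VoiceState(), 0x90, 62, 100)` selects the "i" sound in this layout. Notes outside the table leave the current sound unchanged, which leaves room to grow the table towards the full 128 notes.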

  • edited December 2020

    Hehe, I remember biting my nails.

    I think I spent 4 months or so on it (without wavetables/samples) until I realized it was complete trash, because you couldn't understand a word.

    I thought, hah, I found it (it's really easy to trick yourself here if you work alone), then made some show-off demos, and then I realized, oh fuck,

    it doesn't work.

    If I have to write text next to the example (this is supposed to say ananas), it clearly doesn't work.

    Ananas ;)

    Nevertheless, it is a good exercise in sound design. :)

    I don't think of it as wasted time, even if I have nothing amazeballs to show for it. :)

  • edited December 2020

    Hm, the frequency change seems like a thing that we should be able to figure out?

    Do these frequencies after those frequencies.

    It must be a small number of changes, as we don't have endless letters in the alphabet,

    and some combinations of letters are really odd (in Roman-based languages), like qk or something like that.

    (Say that 5 times quickly: gkgkgkgk. The dogs look at me to check I'm not having a stroke 🤣)

    So a lot falls by the wayside in the first place (and doesn't need to be figured out).

    (I only know about Roman-based languages: German, English, French, Italian, Spanish.

    Non-Roman-based languages like Thai, Japanese and Chinese are completely alien to me.

    Thai has words with 3 k and an "ish", something like kkkgulasch ...)

    I would appreciate some input from a native speaker of a non-Roman-based language or languages here.
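
The "do these frequencies after those frequencies" idea can be sketched as a glide between two sets of formant targets. This Python sketch uses plain linear interpolation, which is an assumption; real transitions are unlikely to be linear, and the formant values below are rough illustrative numbers, not measurements from this thread.

```python
def formant_glide(start, end, n_frames):
    """Linearly interpolate each formant frequency from `start` to `end`."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        frames.append([a + (b - a) * t for a, b in zip(start, end)])
    return frames


# Assumed, rough formant targets in Hz (illustrative, not measured).
A_LIKE = [750.0, 1200.0, 2500.0]
I_LIKE = [300.0, 2300.0, 3000.0]

glide = formant_glide(A_LIKE, I_LIKE, n_frames=5)
```

Each row of `glide` could drive the centre frequencies of a filter bank for one control frame. The number of frames then sets how fast the transition is, which is probably one of the things that has to be tuned per letter pair.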

  • edited December 2020

    Looking at the sonogram,

    comparing "ana" to "ama":

    I guess the "frequency smear" we see at the start of the M

    is the trallala that happens when certain letters follow each other?

    I kind of bet it's resonance frequencies from one or more of the "chambers" during the transition.
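
One way to put numbers on such a smear is to track the energy in a few chosen bands frame by frame. Here is a minimal Python sketch using the Goertzel algorithm (a single-frequency DFT); the function names, frame size and hop are assumptions, there is no windowing, and a real sonogram would normally use a windowed FFT instead.

```python
import math


def goertzel_power(samples, freq_hz, sample_rate):
    """Power of `samples` at one frequency (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def track_bands(signal, freqs_hz, sample_rate=16000, frame=256, hop=128):
    """Return one row of band powers per analysis frame."""
    rows = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame]
        rows.append([goertzel_power(chunk, f, sample_rate) for f in freqs_hz])
    return rows
```

Running `track_bands` on recordings of "ana" and "ama" with the same list of frequencies would give two tables that can be compared row by row around the consonant.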

  • Who knows. In Drambo, it would be possible to model the vocal tract with seamless parameter control. Maybe wavetables are just a dirty workaround and a proper model would be better?

    Wavetables and samples could cover the excitation signals while filters, delays and impulse responses (from the vocal tract using Thafknar 😎) would shape them. Sounds like a nice experiment.

    Who's volunteering to place the full-blown IR recording setup into his throat? 😂
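
The "excitation shaped by an impulse response" part boils down to convolution, which can be sketched in a few lines of Python. The toy impulse response below is made up for illustration; it is not a vocal tract measurement and not how Thafknar works internally.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution: every input sample triggers a scaled IR."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        if x == 0.0:
            continue  # silent input samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


# Toy impulse response: direct sound plus two decaying echoes (made up).
toy_ir = [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.25]

# Two clicks, ten samples apart, standing in for an excitation signal.
clicks = [1.0] + [0.0] * 9 + [1.0] + [0.0] * 9
shaped = convolve(clicks, toy_ir)
```

Direct convolution like this is slow for long impulse responses, so a practical version would likely use FFT-based convolution, but the result is the same.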
