I have an idea to make a vocal synthesizer with Blender to use in games, but I need the sounds of every letter in the English alphabet (at least 26). This project won't be concerned with accuracy, but I'd like to have both male and female vocals for use within a game (numbers and numerical words will also help). If it is successful I'll post a Let's Blending tutorial; otherwise there'll be nothing. Placing a text file in the zip with your name will allow me to add you to the credits of any game I use this in. Please, no GPL licensing. If you don't want to do every vowel type, I'd rather have long vowels than short (like cat, cot, cool, city).
This could also be useful for other projects, thanks.
Well... every letter of the English alphabet won't get you to a vocal synthesizer (actually you are speaking of speech synthesis rather than vocal synthesis, which is more of a musical term) - you need phonetic sounds (or even larger segments), not letters. There are a few open-source speech synthesis programs you may check, like FreeTTS, Praat, Ekho, eSpeak, and Festival; they may have their samples under open licenses.
Yeah, generating speech is much more difficult than stringing together the sounds of letters. I suggest reading about it on Wikipedia; it is a fascinating and complex topic.
Thanks, I'll look into it. Meanwhile I'll still try to work on something.
[ signature ]
It is extremely complex to do speech synthesis; computer programs are not really that good at it. But it's OK if you just want speech synthesis as a prototype to test dialogue before deciding to hire voice actors for the dialogue (which is a lot more expensive).
Actually, Russian and Ukrainian speech synthesis is not that complex at all. You really have an absolute relation between a letter and a sound, with some easy, rigorous rules to follow. It's just the stresses that will pose a problem.
But English is a disaster :D Not to speak of French... where different letters in different combinations and conditions produce dramatically different results. Actually you'd need to make either a complete vocabulary voice-over (yielding nice, natural words, but inconsistent sound), or a phonetic vocabulary (consistency and much less work, but junction problems and unnatural artifacts). Even this approach will fail in the general case, e.g. for read [ri:d] - read [red] - read [red].
So... recording the samples is the problem of least concern.
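The heteronym point above (read/read) can be shown with a toy grapheme-to-phoneme lookup. Everything here is illustrative - the dictionary entries, the crude phoneme spellings, and the `sense` parameter are all assumptions, not a real pronunciation database:

```python
# Toy sketch of why letter-by-letter conversion fails for English:
# pronunciation must be looked up per word, and heteronyms like
# "read" need context to pick the right variant.
PRONUNCIATIONS = {
    "read": ["ri:d", "red"],  # heteronym: present vs. past tense
    "cat":  ["kaet"],
    "cool": ["ku:l"],
}

def to_phonemes(word, sense=0):
    """Return a phoneme string for `word`, or None if unknown.

    `sense` picks among heteronym variants; a real system would need
    part-of-speech or sentence context to choose it automatically.
    """
    variants = PRONUNCIATIONS.get(word.lower())
    if not variants:
        return None
    return variants[min(sense, len(variants) - 1)]
```

The `None` fallback is the "junction problem" in miniature: any word outside the dictionary forces a fallback strategy (letter-to-sound rules, or silence).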
They are thinking of shifting the pitch value of the voice to try to make better English synthesis. It does not work with a linear approach; with English they find it difficult to get a perfectly natural-sounding synthesis.
I was fully aware of possible accents, that is the reason to do this. Anthro characters have more leeway than humans.
[ signature ]
@eugeneloza, I believe Russian and Ukrainian are more complex than this. There's a more-or-less straightforward relationship between a letter and a phoneme, a meaningful sound. But we don't pronounce phonemes, we pronounce their positional variants, allophones. Sounds interact with each other in quite complex ways. For example, the phoneme /o/ can be pronounced roughly as [ɔ] in слон /slon/ and as [œ] in лёд /lʲot/ to accommodate the previous soft consonant. /g/ will be pronounced as [gʷ] in год /god/, because /o/ is rounded and we round our lips before we even start pronouncing /o/. And this is only the tip of the iceberg.
So, if we only take phonemes into account, we’ll get a very unnatural result. And taking all the allophones into account is a lot of work.
Also, I'm not sure even letter→phoneme is that straightforward. I seriously doubt anyone ever says /'zdrafstvujtʲɪ/.
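The allophone selection described above can be sketched as a context rule. This is a deliberately tiny toy, assuming only the /o/-after-soft-consonant example from the post; the symbol set and rule are simplifications, not a real phonological model:

```python
# Toy allophone rule: the phoneme /o/ surfaces as [œ] after a
# palatalized ("soft") consonant, and as [ɔ] otherwise, following the
# слон/лёд example above. The consonant inventory is illustrative.
SOFT_CONSONANTS = {"lʲ", "tʲ", "nʲ", "dʲ"}

def realize_o(previous_phone):
    """Pick the surface allophone of /o/ based on the preceding sound."""
    return "œ" if previous_phone in SOFT_CONSONANTS else "ɔ"
```

A real synthesizer would need rules like this for every phoneme and every relevant context, which is the "lot of work" the post refers to.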
Nat'rally ("истес-но", a sloppy shortening of "естественно" / "naturally", read as yestestvenno :D)
I just remember a tiny DOS program reading Russian texts letter by letter and yielding relatively audible results, if you didn't expect too much. And I couldn't understand then, in the late 90s - early 2000s, why English speech synthesis was so awful in comparison.
P.S. I was somewhat confused listening to Blender tutorials narrated with synthesized speech :D However, contemporary synthesis quality is really high.
I guess English is a difficult language for speech synthesis programs to master.
English is hard for Americans to master (I r one), but I think OSS programs use Windows engines, not audio files.
We're not helping you, but we are talking about the subject to make the thread look active - Dad
A Link to one of these sound libraries would help since I'm stuck on a phone.
[ signature ]
Well... I've already provided a sketch list of open-source programs that might contain a good set of phonetic sounds under applicable licenses: FreeTTS, Praat, Ekho, eSpeak, Festival. If not explicit, then extractable. I'm afraid it's up to you to check their repositories and maybe contact the authors if the library is not available in a usable format. I don't think they use Windows engines, because they're Linux programs.
I can't do that for you, because I'm not a pro in speech synthesis, and your innuendo request for 'English alphabet letters' seems incorrect in my opinion. So I wouldn't even be able to formulate the request e-mail. And there are other problems to solve, which I've pointed to in my previous posts. I've never written a speech synthesis program, so this is just a sketch of my virtual to-do list if I were ever to do it. Maybe I'm wrong.
But let's be more specific then. I'm a voice actor and I can contribute to the project. But since the first post I still have extremely little data on 'what to record' and 'what the requirements are', and when I try to point this out... I get an innuendo doggie... hmm... doesn't sound too good.
I really didn't like Festival, although I have looked at it, because you had to type everything out inside brackets with that program to get it to say anything. I didn't like the syntax it uses to get it to talk. Festival was written by Unix programmers, and I don't like Unix commands with their parentheses. I thought that in this program you could just type out your words and it would speak, like all the other synthesis programs, but not Festival: in Festival you have to type a whole code script line to get it to speak a line of text or save it to a WAV file.
The documentation for the program has been written all wrong. It's been written at the advanced level of the engine, not at the layman's level. The professors have written it at their level of understanding. The documents should be explained in plain, simple English, without all the heavy Unix scripting that confuses those who have no knowledge of Unix scripts; or they should redo the whole program and give it a simple, friendly user interface, because not everybody understands Unix script commands or the complexities of the speech engine they built and tried to explain in their documentation. In other words, to understand all about Festival, you need the knowledge of those professors. So there's a big gap in the learning curve, and this is the biggest flaw in free software and why it's not user friendly.
And when I installed Festival, the program called text2wave had NOT been compiled as text2wave.exe (a Windows executable); instead the program is called text2wave.sh. It is written in Festival's script format, so I can't even run the program, because it would first have to be compiled into an executable in order to run. So the programmers of Festival should have shipped all the scripts as a Windows .exe installer instead of expecting us to compile the program ourselves. Failing to provide a Windows installer for the program is just lazy, in my opinion.
Voice acting is all right for games that use a little spoken dialogue (small projects); under a few hundred or a thousand lines is OK, if your budget can afford it. But it becomes a problem when you've got a big project with thousands of lines of dialogue, like the KotOR games, because it gets very expensive when you have to pay something like $5 or so per line for a voice actor to do it for you, instead of doing it through a computer synthesis program.
So for your game character to speak just one paragraph of text, you could easily be paying $100 just to get 10 lines of text done.
So because I already have over 15,000 lines of dialogue in my game script (the script is now over 35,000 lines long), I realize that I can't get voice actors to do it for me, because it's going to be too costly. So it looks like I have to stick with voice synthesis for the majority of the dialogue, which at least allows me to test all the storylines for the main game characters. So I think only the big major publishing companies have the kind of resources to cover these expenses for big projects.
(I get an innuendo doggie... hmm... doesn't sound too good.) That's my signature, I'll change it.
(Your innuendo request for 'English alphabet letters' seems incorrect in my opinion.) What innuendo? What seems incorrect? I just wanted vocalizations that would be common within words - not someone saying "Aye", "Bee", "See", but "Ahh", "Buh", "Sss" (or "Kuh").
(But let's be more specific then. I'm a voice actor and I can contribute to the project. But since the first post I still have extremely little data on 'what to record' and 'what are the requirements'.)
(Sorry, I cannot spell correctly on a phone. I thought maybe there would be a collection of audio samples on this site that could be arranged into words under a clear license, or, failing that, that this thread would cause people to add one, since I use a pair of headphones in place of a microphone.)
My plan was to use PYTHON's AUDASPACE within BLENDER to parse text and assign a sound to each letter, and have my characters speak using a dynamic text generation system. I assumed that I'd at least try to create a demo and see if it could be done (if so, I'd refine it; otherwise I'd scrap it, but there would be resources HERE, an attempt MADE, and knowledge GAINED).
[Words capitalised to allow focus upon important areas, not to be rude or to be a capitalist]
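The letter-to-sample plan can be sketched in a few lines of Python. The sample directory, file naming scheme, and the `letter_queue` helper are all assumptions for illustration; actual playback through Blender's `aud` (Audaspace) module is only hinted at in a comment, since its API differs between Blender versions:

```python
# Sketch of the letter-to-sample idea: map each letter of input text to
# a recorded sample path and build an ordered playback queue.
import string

SAMPLE_DIR = "//sounds/"  # Blender-style relative path (assumption)

def letter_queue(text):
    """Turn text into a list of per-letter sample paths, skipping
    anything that is not an ASCII letter."""
    return [SAMPLE_DIR + c + ".wav"
            for c in text.lower() if c in string.ascii_lowercase]

# Playback inside Blender might look roughly like this (untested;
# older versions use aud.Factory / aud.device() instead):
#   import aud
#   device = aud.Device()
#   for path in letter_queue("hello"):
#       device.play(aud.Sound(path))
```

As the earlier replies point out, stringing these per-letter samples together will sound very unnatural, but it is enough to test whether the dynamic playback pipeline works at all.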
FreeTTS = Have to sign up with another forum.
Praat = No contact info
Ekho = Chinese
eSpeak = Has contact info from 2012, maybe he'll answer, will try.
Festival = looks promising, has other links.
I'll try this, but I don't have much time to access the internet on my laptop; I have a job that takes up most of my week. I may try using my phone a bit.
I don't want to / can't pay for voice acting; I want the game to DO the voice acting. But I guess I'll just use text if I cannot get the resources to try my idea.
[ signature ]
Yes, not all of us have the financial resources to pay for all the voice acting for our characters, so we have to turn to what free software can offer us. Festival is OK if you don't mind typing your text out inside brackets.
If you are trying to build an English speech engine from scratch with Audaspace in Blender, then you would probably need a dictionary database of words, so the synthesizer knows how to group the words: our speech is not grouped in letters, but in words and clusters of words. It also uses different rising and falling pitches within words to express a wide range of emotions and feelings. And because there are many different feelings expressed in our speech, there are many different pitch groupings for each word, and for each word group, depending on the speaker's emotional state at the time. There was an Australian professor who explained on YouTube how speech synthesis works, and why English synthesis is a disaster at the moment with the traditional methods: the traditional way only groups words together, without grouping all the different pitch patterns within the words for the wide range of emotions and feelings. That's why, in most TTS programs, spoken words sound flat and monotonous and not always natural, or sound like a disaster or a trainwreck (like Anna, etc.), or sound robotic and without feeling.
So I don't think speech synthesis can form clear, audible words with just the 26 letters of the alphabet; I don't think that's enough to form a speech engine. I think you need all the words, and also all the different pitch groupings, but that's complicated, because each word we speak uses a different pitch grouping depending on what emotion or feeling is being expressed at the time.
So when Gandalf the White says "Mary Went" or "Mary Had", he's only showing one grouping for one emotion, not the whole range of them.
There is a huge range of emotional characteristics. HAPPY: ecstatic, joyful, glad, cheerful. NEGATIVE: anger, fear, sadness, distrust, worry, jealousy.
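One crude way to approximate the per-emotion pitch groupings described above is a lookup table of prosody multipliers that a playback layer could apply per word. The emotion names and all the numbers here are invented for illustration; real emotional prosody is far richer than two scalar factors:

```python
# Toy per-emotion prosody table: each emotion maps to pitch and rate
# multipliers (1.0 = unchanged). Values are made up for illustration.
EMOTION_PROSODY = {
    "happy":   {"pitch": 1.15, "rate": 1.10},
    "sad":     {"pitch": 0.90, "rate": 0.85},
    "angry":   {"pitch": 1.05, "rate": 1.20},
    "neutral": {"pitch": 1.00, "rate": 1.00},
}

def prosody_for(emotion):
    """Return prosody settings, falling back to neutral delivery."""
    return EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["neutral"])
```

In an Audaspace-based setup these factors could feed pitch/speed adjustments on each sample, though, as the post says, flat global multipliers are exactly why synthetic speech tends to sound monotonous.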
The first result of a Google search for English phonetic sounds led me to http://www.antimoon.com/how/pronunc-soundsipa.htm . However, I'm no expert in linguistics and can't tell for sure whether their list covers all the sounds required for speech synthesis (if it does, the set isn't that large after all - just about 40 sounds). Moreover, they have the respective sounds recorded. They do not have an identified license, though.
This page http://teflpedia.com/IPA_phoneme_/%C9%99/ shows that it's not all that simple after all: even a single phoneme may be pronounced differently depending on its location within a word.
It seems that searching for "phonetic alphabet mp3" gives the required results. Maybe there are ready-made samples available under free licenses.
Human speech has a wide range of different emotions. Look on Wikipedia under "emotion" and you'll see that, beyond the basic emotions, there are many groupings of emotions nested inside other groupings, and it's all these groupings within groupings that make it complicated to build a naturally intelligent, human-sounding speech engine in the English language.
So if they can figure out how to synthesize all those different emotional groupings of the words of the English language, then they might be able to get more natural-sounding English synthesis.
Thanks, I'll look into those links; I've already contacted the makers of Festival. Audaspace can change pitch and speed, and even join sounds, so it may give me some flexibility. I'm not going for accuracy right now, I just want to try making a working concept. I've heard that there are 42 different English vocal sounds, but I could be wrong.
[ signature ]
Do you know anything about the UDK Unreal Engine, by any chance? I'm trying to get my custom models to animate in the game.