US6366883B1 - Concatenation of speech segments by use of a speech synthesizer - Google Patents

Concatenation of speech segments by use of a speech synthesizer

Info

Publication number
US6366883B1
US6366883B1 (application US09/250,405)
Authority
US
United States
Prior art keywords
speech
phoneme
feature parameters
cost
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US09/250,405
Inventor
Nick Campbell
Andrew Hunt
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ATR Advanced Telecommunications Research Institute International
Original Assignee
ATR Interpreting Telecommunications Research Laboratories
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ATR Interpreting Telecommunications Research Laboratories filed Critical ATR Interpreting Telecommunications Research Laboratories
Priority to US09/250,405 priority Critical patent/US6366883B1/en
Assigned to ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES reassignment ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CAMPBELL, NICK, HUNT, ANDREW
Application granted granted Critical
Publication of US6366883B1 publication Critical patent/US6366883B1/en
Assigned to ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES reassignment ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES CHANGE OF ADDRESS Assignors: ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES
Assigned to ADVANCED TELECOMMUNICATIONS RESEARCH INSTITUTE INTERNATIONAL reassignment ADVANCED TELECOMMUNICATIONS RESEARCH INSTITUTE INTERNATIONAL ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 - Concatenation rules
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 - Hidden Markov Models [HMMs]
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Definitions

  • the present invention relates to a speech synthesizer apparatus, and in particular, to a speech synthesizer apparatus for performing speech synthesis of any arbitrary sequence of phonemes by concatenation of speech segments of speech waveform signals extracted at synthesis time from a natural utterance.
  • FIG. 2 is a block diagram of a conventional speech synthesizer apparatus.
  • LPC analysis is executed on speech waveform signal data of a speaker for training, and then acoustic feature parameters including 16-degree cepstrum coefficients are extracted.
  • the extracted acoustic feature parameters are temporarily stored in a feature parameter memory 62 of a buffer memory, and then, are transferred from the feature parameter memory 62 to a parameter time sequence generator 52 .
  • the parameter time sequence generator 52 executes a signal process, including a time normalization process and a parameter time sequence generation process using prosodic control rules stored in a prosodic rule memory 63 , based on the extracted acoustic feature parameters, so as to generate a time sequence of parameters including, for example, the 16-degree cepstrum coefficients, which are required for speech synthesis, and output the generated time sequence thereof to a speech synthesizer 53 .
  • the speech synthesizer 53 is a speech synthesizer apparatus which is already known to those skilled in the art, and comprises a pulse generator 53 a for generating voiced speech, a noise generator 53 b for generating unvoiced speech, and a filter 53 c whose filter coefficient is changeable.
  • the speech synthesizer 53 switches between voiced speech generated by the pulse generator 53 a and unvoiced speech generated by the noise generator 53 b based on an inputted time sequence of parameters, controls the amplitude of the voiced speech or unvoiced speech, and further changes filter coefficients corresponding to transfer coefficients of the filter 53 c . Then, the speech synthesizer 53 generates and outputs a speech signal of attained speech synthesis to a loudspeaker 54 , and then the speech of the speech signal is outputted from the loudspeaker 54 .
  • An essential object of the present invention is therefore to provide a speech synthesizer apparatus capable of converting any arbitrary phoneme sequence into uttered speech of speech signal without using any prosodic modification rules and without executing any signal processing, and further obtaining a voice quality closer to the natural voice, as compared with that of the conventional apparatus.
  • a speech synthesizer apparatus comprising:
  • first storage means for storing speech segments of speech waveform signals of natural utterance
  • speech analyzing means based on the speech segments of the speech waveform signals stored in said first storage means and a phoneme sequence corresponding to the speech waveform signals, for extracting and outputting index information on each phoneme of the speech waveform signals, first acoustic feature parameters of each phoneme indicated by the index information, and prosodic feature parameters for each phoneme indicated by the index information;
  • second storage means for storing the index information, the first acoustic feature parameters, and the prosodic feature parameters outputted from said speech analyzing means;
  • weighting coefficient training means for calculating acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in said second storage means, and for determining weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances;
  • third storage means for storing weighting coefficient vectors for the respective target phonemes determined by the weighting coefficient training means
  • speech unit selecting means for searching for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and for outputting index information on the searched out combination of phoneme candidates;
  • speech synthesizing means for synthesizing and outputting a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information from said first storage means based on the index information outputted from said speech unit selecting means, and by concatenating the read-out speech segments of the speech waveform signals.
  • said speech analyzing means may preferably comprise phoneme predicting means for predicting a phoneme sequence corresponding to the speech waveform signals based on input speech waveform signals.
  • said weighting coefficient training means may preferably determine the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N 1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
  • said weighting coefficient training means may preferably determine the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N 1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis using a predetermined neural network for each of the second acoustic feature parameters.
  • said speech unit selecting means may preferably extract a plurality of top N 2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, search for a combination of phoneme candidates that minimizes the cost.
  • the first acoustic feature parameters may preferably include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
  • the first acoustic feature parameters may preferably include formant parameters and voice source parameters.
  • the prosodic feature parameters may preferably include phoneme durations, speech fundamental frequencies F 0 , and powers.
  • the second acoustic feature parameters may preferably include cepstral distances.
  • any arbitrary phoneme sequence can be converted into uttered speech without using any prosodic control rule or executing any signal processing. Still further, voice quality close to the natural one can be obtained, as compared with that of the conventional apparatus.
  • the speech analyzing means may preferably comprise phoneme predicting means for predicting a phoneme sequence corresponding to an input speech waveform signal based on the input speech waveform signal. Accordingly, since there is no need of giving a phoneme sequence beforehand, the part of manual work can be simplified.
  • FIG. 1 is a block diagram of a speech synthesizer apparatus utilizing concatenation of speech segments of speech waveform signals of natural utterance, which is a preferred embodiment according to the present invention
  • FIG. 2 is a block diagram of a conventional speech synthesizer apparatus
  • FIG. 3 is a model diagram showing a definition of speech unit selection cost calculated by a speech unit selector of FIG. 1;
  • FIG. 4 is a flowchart of a speech analysis process which is executed by a speech analyzer of FIG. 1;
  • FIG. 5 is a flowchart of a first part of a weighting coefficient training process which is processed by a weighting coefficient training controller of FIG. 1;
  • FIG. 6 is a flowchart of a second part of the weighting coefficient training process which is executed by the weighting coefficient training controller of FIG. 1;
  • FIG. 7 is a flowchart of a speech unit selection process which is executed by the speech unit selector of FIG. 1;
  • FIG. 8 is a graph showing a first example of a non-linear suitability function S to a target value t i which is used in the cost function of a modified preferred embodiment according to the present invention
  • FIG. 9 is a graph showing a second example of a non-linear suitability function S to a target value t i which is used in the cost function of a modified preferred embodiment according to the present invention.
  • FIG. 10 is a graph showing a third example of a non-linear suitability function S to a target value t i which is used in the cost function of a modified preferred embodiment according to the present invention.
  • FIG. 11 is a graph showing a fourth example of a non-linear suitability function S to a target value t i which is used in the cost function of a modified preferred embodiment according to the present invention
  • FIG. 1 is a block diagram of a speech synthesizer apparatus utilizing concatenation of speech segments of speech waveform signals of natural utterance, which is a preferred embodiment according to the present invention.
  • the conventional speech synthesizer apparatus for example as shown in FIG. 2, performs the processes from the extraction of text corresponding to input uttered speech to the generation of a speech waveform signal, as a sequence of processes.
  • the speech synthesizer apparatus of the present preferred embodiment can be roughly comprised of the following four processing units or controllers:
  • a speech analyzer 10 for performing speech analysis of a speech waveform database stored in a speech waveform database memory 21 , more specifically, a process including generation of a phonemic symbol sequence, alignment of the phonemes, and extraction of acoustic feature parameters;
  • a weighting coefficient training controller 11 for determining, through a training process over the speech waveform database, the weighting coefficient vectors that define the degrees of contribution of the respective feature parameters;
  • a speech unit selector 12 for executing selection of a speech unit based on an input phoneme sequence and outputting index information on speech segments of speech waveform signals corresponding to the input phoneme sequence;
  • a speech synthesizer 13 for generating the speech segments of the respective phoneme candidates that have been determined as the optimum ones, by randomly accessing (i.e., skipping through) the speech waveform database stored in the speech waveform database memory 21 and concatenating the accessed segments, based on the index information outputted from the speech unit selector 12, and for D/A converting and outputting the speech segments of the speech waveform signals to the loudspeaker 14.
  • based on speech segments of an input speech waveform signal of natural utterance and a phoneme sequence corresponding to the speech waveform signal, the speech analyzer 10 extracts and outputs index information for each phoneme in the speech segments of the speech waveform signal, first acoustic feature parameters for each phoneme indicated by the index information, and first prosodic feature parameters for each phoneme indicated by the index information. Then, a feature parameter memory 30 temporarily stores the index information, the first acoustic feature parameters, and the first prosodic feature parameters outputted from the speech analyzer 10.
  • the weighting coefficient training controller 11 calculates acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in the feature parameter memory 30 , and determines weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis such as a linear regression analysis or the like for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances.
  • a weighting coefficient vector memory 31 temporarily stores not only the weighting coefficient vectors for the respective target phonemes in the second acoustic feature parameters determined by the weighting coefficient training controller 11 , but also previously given weighting coefficient vectors for the respective target phonemes that represent the degrees of contribution to the second prosodic feature parameters for the phoneme candidates.
  • the speech unit selector 12 searches the phoneme sequence of an input sentence of natural utterance for a combination of phoneme candidates that minimizes the cost including a target cost representing approximate costs between target phonemes and phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and then outputs index information on the searched-out combination of phoneme candidates.
  • the speech synthesizer 13 reads out speech segments of speech waveform signals corresponding to the index information from the speech waveform database memory 21 sequentially, concatenates the read-out speech segments thereof, D/A converts the concatenated speech segments of the speech waveform signal data into speech waveform signals, and outputs the D/A converted speech waveform signals to a loudspeaker 14 , and then, synthesized speech of the speech waveform signals corresponding to the input phoneme sequence is outputted from the loudspeaker 14 .
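The read-out-and-concatenate step performed by the speech synthesizer 13 can be pictured with a short sketch. This is only an illustration, not the patent's implementation: it assumes the index information has already been resolved into (file id, start time, duration) triples, that each database file is available as a one-dimensional sample array, and the 16 kHz sampling rate and all names are assumptions.

    import numpy as np

    def concatenate_units(index_entries, waveform_files, sample_rate=16000):
        """Read the selected speech segments out of the waveform database and
        join them end to end without any further signal processing."""
        pieces = []
        for file_id, start_s, dur_s in index_entries:
            samples = waveform_files[file_id]              # one 1-D array per database file
            begin = int(round(start_s * sample_rate))
            end = begin + int(round(dur_s * sample_rate))
            pieces.append(samples[begin:end])
        return np.concatenate(pieces) if pieces else np.zeros(0)

The concatenated array would then simply be D/A converted and played back, which is the point of the approach: no prosodic modification of the waveform is performed.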
  • the process by the speech analyzer 10 needs to be performed only once for a newly introduced speech waveform database. The process by the weighting coefficient training controller 11 also generally needs to be performed only once, and the optimal weighting coefficients determined by the weighting coefficient training controller 11 can also be reused under different speech analysis conditions. Further, the processes by the speech unit selector 12 and the speech synthesizer 13 are executed each time the input phoneme sequence to be subjected to speech synthesis changes.
  • the speech synthesizer apparatus of the present preferred embodiment predicts all the feature parameters that are required according to any given level of input, and selects the samples (i.e., speech segments of phoneme candidates) closest to the features of desired speech out of the speech waveform database stored in the speech waveform database memory 21 .
  • the processing can be performed given at least a sequence of phoneme labels; however, if the phoneme fundamental frequency F0 and the phoneme duration are also given in advance, even higher quality of synthesized speech can be obtained.
  • if the phoneme sequence is not given, it is predicted based on dictionaries and rules such as a phonemic hidden Markov model (hereinafter, the hidden Markov model is referred to as an HMM) or the like.
  • every speech waveform database can be used as speech segments of speech waveform signal data for synthesis.
  • the quality of the output speech depends largely on the recording conditions, the balance of phonemes in the speech waveform database, and the like. Therefore, if the speech waveform database stored in the speech waveform database memory 21 has an abundance of contents, a wider variety of speech can be synthesized. Conversely, if the speech waveform database is poor, the synthesized speech will contain more discontinuities, i.e., sound more broken.
  • the speech unit is a phoneme.
  • the contents of orthographical utterance imparted to the recorded speech are converted into a sequence of phonemes and further assigned to speech segments of speech waveform signals. Based on the result of this, the extraction of prosodic feature parameters is carried out.
  • the input data of the speech analyzer 10 is speech segments of speech waveform signal data stored in the speech waveform database memory 21 accompanied by the representation of phonemes stored in the text database memory 22 , and its output is feature vectors or feature parameters. These feature vectors serve as the fundamental units for representing speech samples or segments in the speech waveform database, and are used to select an optimal speech unit.
  • the first stage of the processing by the speech analyzer 10 is the transformation from orthographical text into phonemic symbols for describing how the contents of utterance written in orthography are pronounced with actual speech waveform signal data.
  • the second stage is a process of associating the respective phonemic symbols with speech segments of speech waveform signals in order to determine the start and end time points of each phoneme to measure prosodic and acoustic characteristics (hereinafter, the process is referred to as a phoneme alignment process).
  • the third stage is to generate feature vectors or feature parameters for respective phonemes.
  • in these acoustic feature vectors, the phoneme label, the start time (or start position) of the phoneme in each file within the speech waveform database stored in the speech waveform database memory 21, the speech fundamental frequency F0, the phoneme duration, and the power value are stored as essential information.
  • as optional information of the feature parameters, stress, accent type, position with respect to the prosodic boundary, spectral inclination, and the like are further stored.
  • Index information (Table 1): index number (assigned to one file); start time (or start position) of the phoneme in each file in the speech waveform database stored in the speech waveform database memory 21
  • First acoustic feature parameters: 12-degree melcepstrum coefficients; 12-degree delta melcepstrum coefficients; phoneme label; discriminative characteristics: vocalic (+) / non-vocalic (-), consonantal (+) / non-consonantal (-), interrupted (+) / continuant (-), checked (+) / unchecked (-), strident (+) / mellow (-), voiced (+) / unvoiced (-), compact (+) / diffuse (-), grave (+) / acute (-), flat (+) / plain (-), sharp (+) / plain (-), tense (+) / lax (-), nasal (+) / oral (-)
  • First prosodic feature parameters: phoneme duration; speech fundamental frequency F0; power value
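For illustration only, the per-phoneme record summarized above (index information, first acoustic feature parameters, first prosodic feature parameters) could be held in a structure along the following lines; the field names and types are assumptions, not taken from the patent.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class PhonemeRecord:
        # Index information: locates the segment inside the waveform database
        file_index: int          # index number assigned to one database file
        start_time: float        # start position of the phoneme in that file (seconds)

        # First prosodic feature parameters
        duration: float          # phoneme duration (seconds)
        f0: float                # speech fundamental frequency F0 (Hz; 0 for unvoiced)
        power: float             # power value

        # First acoustic feature parameters
        label: str               # phoneme label
        mel_cepstrum: List[float] = field(default_factory=list)        # 12 coefficients
        delta_mel_cepstrum: List[float] = field(default_factory=list)  # 12 delta coefficients
        distinctive: Dict[str, bool] = field(default_factory=dict)     # vocalic, nasal, ...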
  • the first acoustic feature parameters include the above-mentioned parameters shown in Table 1; however, the present invention is not limited to this.
  • the first acoustic feature parameters may include formant parameters and voice source parameters.
  • the start time (or start position), first acoustic feature parameters, and first prosodic feature parameters within the index information are stored in the feature parameter memory 30 for each phoneme.
  • the twelve feature parameters of discriminative characteristics to be assigned to the phoneme label are given by parameter values of (+) or (-) for each item.
  • An example of the feature parameters, which are output results of the speech analyzer 10, is shown in Table 2.
  • one index number is given to each file in the speech waveform database memory 21, a file being either one paragraph composed of a plurality of sentences or one sentence; the start time of a phoneme in the file and its phoneme duration counted from that start time are imparted in order to indicate the position of an arbitrary phoneme within the file to which the index number is assigned.
  • from this information, the speech waveform of the phoneme concerned can be specifically determined.
  • weighting coefficients for the respective feature parameters are calculated for all the speech samples in the speech waveform database.
  • every speech waveform database can be used as speech waveform data for synthesis, as described before.
  • in aligning phonemes by the speech analyzer 10, when the speech is read aloud, the words are pronounced, in many cases, nearly in their respective standard pronunciations, and rarely with hesitation or stammering.
  • the phoneme labeling will therefore be correctly achieved by simple dictionary search, enabling the training of phoneme models (phoneme HMMs) for use in phoneme alignment.
  • the phoneme alignment is conducted by using the Viterbi training algorithm with all the speech waveform data so that appropriate segmentation is performed, and the feature parameters are re-estimated.
  • the pauses between words are processed according to intra-word pause generation rules; however, any failures of alignment due to pauses present in the words need to be corrected by hand.
  • prosodic feature parameters for describing intonational characteristics of respective phonemes are extracted.
  • linguistic sounds have been classified according to such characteristics as utterance position and utterance mode.
  • in theories of prosody such as that of the Firth school, clearly intoned places and emphasized places are distinguished from each other in order to capture fine differences in tone arising from differences in prosodic context.
  • the way of selection from these methods depends on the predictive ability of the speech synthesizer apparatus. If the speech waveform database has previously undergone the phoneme labeling, the task of the speech synthesizer apparatus is to appropriately train how to obtain actual speech in the speech waveform database from internal expressions. On the other hand, if the speech waveform database has not undergone the phoneme labeling, it is necessary to first investigate which feature parameters, when used, allow the speech synthesizer apparatus to predict the most appropriate speech unit. This investigation and the training of determining the weighting coefficients for feature parameters are executed by the weighting coefficient training controller 11 that calculates the weighting coefficients for respective feature parameters through training process.
  • the weighting coefficient training process which is executed by the weighting coefficient training controller 11 is described.
  • the speech fundamental frequency F0, although significantly effective for the selection of voiced speech, has almost no effect on the selection of unvoiced speech.
  • the acoustic features of fricative sound have different effects depending on the kinds of the preceding and succeeding phonemes.
  • what degrees of weights are placed on the respective features is automatically determined through the optimal weight determining process, i.e., the weighting coefficient training process.
  • the first step is to list features which are used for selecting an optimal sample from among all the applicable samples or speech segments of uttered speech in the speech waveform database.
  • Employed in this case are phonemic features such as intonation position and intonation mode, as well as prosodic feature parameters such as the speech fundamental frequency F 0 , phoneme duration, and power of the preceding phoneme, the target phoneme, and the succeeding phoneme.
  • the second prosodic parameters which will be detailed later are used.
  • the acoustic distance, including the difference in phoneme duration, from all the other phoneme samples is calculated for one speech sample or segment (or including non-speech segments of speech signals of a phoneme), and the speech waveform segments of the N2 best analogous speech samples or segments, i.e., the N2 best phoneme candidates, are selected.
  • a linear regression analysis is performed, where the weighting coefficients representing the degrees of importance or contribution of respective feature parameters in various acoustic and prosodic environments are determined or calculated by using the pseudo speech samples.
  • the prosodic feature parameters in this linear regression analysis process for example, the following feature parameters (hereinafter, referred to as second prosodic feature parameters) are employed:
  • first prosodic feature parameters of a preceding phoneme, i.e., the phoneme that is just one precedent to a target phoneme to be processed (hereinafter referred to as the preceding phoneme);
  • the linear regression analysis is performed for determining the weighting coefficients; however, the present invention is not limited to this.
  • another type of statistical analysis may be performed for determining the weighting coefficients.
  • a statistical analysis may be performed using a predetermined neural network for determining weighting coefficients.
  • the preceding phoneme is defined as the phoneme that is just one precedent to the target phoneme.
  • however, the present invention is not limited to this; the preceding phoneme may include phonemes that precede the target phoneme by a plurality of phonemes.
  • the succeeding phoneme is defined as the phoneme that is just one subsequent to the target phoneme.
  • however, the present invention is not limited to this; the succeeding phoneme may include phonemes that follow the target phoneme by a plurality of phonemes.
  • the speech fundamental frequency F 0 of the succeeding phoneme may be excluded.
  • the conventional speech synthesizer apparatus involves the steps of determining a phoneme sequence for a target utterance of speech, and further calculating target values of F 0 and phoneme duration for use of prosodic control.
  • the speech synthesizer of the present preferred embodiment involves only the step of calculating the prosody for the purpose of appropriately selecting an optimal speech sample, where the prosody is not controlled directly.
  • the inputs of the processing by the speech unit selector 12 of FIG. 1 are the phoneme sequence of a target utterance of speech, the weight vectors for the respective features determined on the respective phonemes, and the feature vectors representing all the samples within the speech waveform database.
  • the output thereof is index information representing the positions of phoneme samples within the speech waveform database.
  • FIG. 3 shows the start position and speech unit duration of respective speech units for concatenation of speech segments of speech waveform signals (where, more specifically, a phoneme, or in some cases, a sequence of a plurality of phonemes are selected in continuation as one speech unit).
  • An optimal speech unit can be determined as a path that minimizes the sum of the target cost, which represents an approximate cost of the difference from the target utterance of speech, and the concatenation cost, which represents an approximate cost of discontinuity between adjacent speech units.
  • a known Viterbi search algorithm is used for the path search.
  • given a target speech t_1^n = (t_1, . . . , t_n), the speech synthesis of the contents of any arbitrary utterance can be performed.
  • the speech unit selection cost comprises the target cost C_t(u_i, t_i) and the concatenation cost C_c(u_{i-1}, u_i).
  • the target cost C_t(u_i, t_i) is a predictive value of the difference between a speech unit (or phoneme candidate) u_i in the speech waveform database and a speech unit (or target phoneme) t_i to be realized as synthesized speech.
  • the concatenation cost C_c(u_{i-1}, u_i) is a predictive value of the discontinuity that results from the concatenation between the concatenation units (two phonemes to be concatenated) u_{i-1} and u_i.
  • the target cost is a weighted sum of differences between the respective elements of the feature vector of the speech unit to be realized and the respective elements of the feature vector of a speech unit that is a candidate selected from the speech waveform database.
  • the differences between the respective elements of the feature vectors are represented by p target sub-costs C_t^j(t_i, u_i) (where j is a natural number from 1 to p), and the number of dimensions p of the feature vectors is variable within a range of 20 to 30 in the present preferred embodiment.
  • in this case, the number of dimensions p is 30.
  • the feature vectors or feature parameters with the variable j in the target sub-costs C_t^j(t_i, u_i) and the weighting coefficients w_t^j are the above-mentioned second prosodic feature parameters.
  • the concatenation cost C_c(u_{i-1}, u_i) can be represented likewise by a weighted sum of q concatenation sub-costs C_c^j(u_{i-1}, u_i) (where j is a natural number from 1 to q).
  • the concatenation sub-costs can be determined or calculated from the acoustic characteristics of the speech units u_{i-1} and u_i to be concatenated; in the preferred embodiment, the following three kinds are used.
  • these acoustic feature parameters, the phoneme label of the preceding phoneme, and the phoneme label of the succeeding phoneme are referred to as third acoustic feature parameters.
  • the concatenation cost is determined or calculated based on the first acoustic feature parameters and the first prosodic feature parameters in the feature parameter memory 30 , where the concatenation cost, which involves the above-mentioned three third acoustic feature parameters of continuous quantity, assumes any analog quantity in the range of, for example, 0 to 1.
  • the target cost, which involves the above-mentioned 30 second acoustic feature parameters showing whether or not the discriminative characteristics of the respective preceding or succeeding phonemes are coincident, includes elements represented by digital quantities of, for example, zero when the features are coincident for a parameter and one when they are not.
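As described above, both costs are weighted sums of sub-costs. A minimal sketch of that computation follows; the concrete sub-cost functions are passed in as callables because their exact definitions are only summarized here, so everything below is an assumption about shape, not the patent's code.

    def target_cost(target, unit, sub_costs, weights):
        """C_t(t_i, u_i): weighted sum of the p target sub-costs C_t^j(t_i, u_i)."""
        return sum(w * c(target, unit) for w, c in zip(weights, sub_costs))

    def concatenation_cost(prev_unit, unit, sub_costs, weights):
        """C_c(u_{i-1}, u_i): weighted sum of the q concatenation sub-costs C_c^j(u_{i-1}, u_i)."""
        return sum(w * c(prev_unit, unit) for w, c in zip(weights, sub_costs))

A sub-cost may be continuous (for example, an F0 or spectral difference scaled into 0 to 1) or binary (0 when a discriminative characteristic matches, 1 when it does not), matching the analog and digital quantities described above.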
  • the total cost for n speech units is the sum of the target costs and the concatenation costs for the respective speech units, and can be represented by the following equation (3):
  • C_c(S, u_1) and C_c(u_n, S) represent the concatenation costs for a concatenation from a pause to the first speech unit and for another concatenation from the last speech unit to a pause, respectively.
  • the present preferred embodiment treats the pause in absolutely the same way as that of the other phonemes in the speech waveform database.
  • Equation (3) can be expressed directly with sub-costs by the following equation (4):
  • the purpose of the speech unit selection process is to determine the combination of speech units, denoted by ū_1^n, that minimizes the total cost given by the above-mentioned equation (4), as follows.
  • ū_1^n = min_{u_1, u_2, . . . , u_n} C(t_1^n, u_1^n)    (5)
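Because the total cost referenced by equations (3) and (4) decomposes into per-unit target costs plus pairwise concatenation costs, with the pause S closing both ends, the minimizing combination ū_1^n of equation (5) can be found by dynamic programming, i.e. the Viterbi-style path search mentioned above. The sketch below is a reconstruction from the definitions given, not the patent's code; it assumes the cost functions of the previous snippet and one candidate list per target phoneme.

    def select_units(targets, candidates, t_cost, c_cost, pause):
        """Viterbi search over the candidate lattice.

        targets:      target phonemes t_1 .. t_n
        candidates:   candidates[i] is the list of database units for target i
        t_cost(t, u): target cost; c_cost(u_prev, u): concatenation cost
        pause:        pseudo-unit S used for the leading/trailing concatenation
        Returns the unit sequence minimizing the total cost."""
        n = len(targets)
        # best[i][k] = (cost of the best path ending in candidates[i][k], back-pointer)
        best = [[(c_cost(pause, u) + t_cost(targets[0], u), None) for u in candidates[0]]]
        for i in range(1, n):
            column = []
            for u in candidates[i]:
                scores = [best[i - 1][k][0] + c_cost(candidates[i - 1][k], u)
                          for k in range(len(candidates[i - 1]))]
                k_min = min(range(len(scores)), key=scores.__getitem__)
                column.append((scores[k_min] + t_cost(targets[i], u), k_min))
            best.append(column)
        # close the path with the final pause and trace back
        finals = [best[-1][k][0] + c_cost(u, pause) for k, u in enumerate(candidates[-1])]
        k = min(range(len(finals)), key=finals.__getitem__)
        path = []
        for i in range(n - 1, -1, -1):
            path.append(candidates[i][k])
            k = best[i][k][1]
        return list(reversed(path))

Pruning each column to the N2 best candidates before the search, as described later for the speech unit selector 12, keeps the lattice small without changing the structure of this search.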
  • the linear combination is used as the cost function.
  • however, the present invention is not limited to this; the following non-linear multiplication or non-linear combination may be used.
  • Evaluating the target distance for a candidate unit, or the distance between a pair of units to be concatenated, returns only a physical measure of the distance separating the two speech waveform signals, and is not necessarily a true indicator of the distortion that may be perceived when using the particular candidate units for speech synthesis.
  • the goal of the current approach is to try to find a function relating the physical distances measured from the signal to the suitability of the unit for synthesis in a given context, in a manner analogous to the perceptual stimuli relationship.
  • This modeling concentrates all assumptions about the capabilities of the subsequent signal processing routines, and about human audio-perception etc., at this central point.
  • suitability functions for each single distance are defined in the range of zero to one, where "1" denotes a perfect match and "0" denotes an unacceptable mismatch, as illustrated in FIGS. 8 to 11.
  • a typical suitability function e.g., for pitch target mismatch, could be a Gaussian over the pitch axis centered at the target pitch. Small (perceptually irrelevant) distances close to the target value result in suitabilities close to one. Big unacceptable mismatches result in zero suitability.
  • Other units which are not perfectly matching but are within an acceptable distance are gradually weighted by the monotone decline between the extreme values. Any similar monotonically declining function may match these assumptions as well.
  • Suitability functions S are extensively employed in the target and joint functions C_t and C_c which follow.
  • a suitability function is defined by the following formula:
  • t_i is a target value of some variable, and t_i is replaced by u_{i-1} in the joint cost;
  • T is a tolerance;
  • R is a rate; and
  • S_1 is the suitability function for the target cost and S_2 is the suitability function for the joint cost.
  • FIGS. 8 to 11 are graphs each showing an example of a non-linear suitability function S to a target value t i which is used in the cost function of a modified preferred embodiment according to the present invention.
  • in FIG. 8, the parameters are set such that the target value is 40, the rate is 1, the tolerance is 10, min is 0 and max is 1.
  • in FIG. 9, the parameters are set such that the target value is 40, the rate is 2, the tolerance is 10, min is 0 and max is 1.
  • in FIG. 10, the parameters are set such that the target value is 40, the rate is 10, the tolerance is 10, min is 0 and max is 1.
  • in FIG. 11, the parameters are set such that the target value is 40, the rate is 2, the tolerance is 20, min is 0.1 and max is 1.
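The formula of the suitability function itself is not reproduced in this text, so the following is only an assumed form chosen to match the description and the parameter sets for FIGS. 8 to 11: a logistic curve over the distance |x - t_i| that stays near max while the distance is within the tolerance T, declines monotonically at the rate R beyond it, and approaches min for unacceptable mismatches.

    import math

    def suitability(x, target, tolerance, rate, s_min=0.0, s_max=1.0):
        """Assumed logistic suitability: near s_max inside the tolerance band,
        a monotone decline governed by the rate, s_min for large mismatches."""
        z = rate * (abs(x - target) - tolerance)
        if z > 60.0:                 # avoid overflow; the value is s_min anyway
            return s_min
        return s_min + (s_max - s_min) / (1.0 + math.exp(z))

    # Roughly reproducing the parameter settings described for FIGS. 8 to 11:
    # suitability(x, target=40, tolerance=10, rate=1)
    # suitability(x, target=40, tolerance=10, rate=2)
    # suitability(x, target=40, tolerance=10, rate=10)
    # suitability(x, target=40, tolerance=20, rate=2, s_min=0.1)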
  • Weights for the target sub-costs are determined or calculated by using the linear regression analysis based on the acoustic distances.
  • different weighting coefficients may be determined or calculated for all the phonemes, or weighting coefficients may be determined or calculated for respective phoneme categories (e.g., all nasal sounds). Otherwise, a common weighting coefficient for all the phonemes may be determined or calculated. In this case, however, different weighting coefficients for respective phonemes are employed.
  • Each token or speech sample stored in the database of the feature parameter memory 30 is described by a set of the first phonemic and prosodic feature parameters related to its acoustic characteristics.
  • the weight coefficients are trained in order to determine the strength of the relationship between each individual first phonemic and prosodic feature parameter and the differences in the acoustic characteristics of the token (phone in context).
  • Step I: For all the samples or speech segments in the speech waveform database that belong to the phonemic kind (or phonemic category) under the current training, the following four processes (a) to (d) are executed repeatedly:
  • Step II: The acoustic distances and the target sub-costs C_t^j(t_i, u_i) are calculated for all the target phonemes t_i and the top N1 optimal samples.
  • Step III: Linear regression is used to predict the contribution of each factor of the first phonemic and prosodic feature parameters representing the target phoneme by a linear weighting of the target sub-costs.
  • the weight coefficients determined by the linear regression are used as the weight coefficients w_t^j for the target sub-costs for the current phoneme set or kind (category).
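Steps I to III amount to an ordinary least-squares fit per phonemic kind: the measured acoustic distances of the top N1 samples are regressed onto their target sub-costs, and the fitted coefficients are taken as the weights w_t^j. A sketch using numpy least squares; the array layout is an assumption.

    import numpy as np

    def train_target_weights(sub_cost_matrix, acoustic_distances):
        """sub_cost_matrix:    shape (num_pairs, p), target sub-costs C_t^j(t_i, u_i)
        acoustic_distances: shape (num_pairs,), measured cepstral/duration distances
        Returns the p weights w_t^j for the phonemic kind currently being trained."""
        X = np.asarray(sub_cost_matrix, dtype=float)
        y = np.asarray(acoustic_distances, dtype=float)
        weights, *_ = np.linalg.lstsq(X, y, rcond=None)
        return weights

Running this once per phonemic kind (or once per phoneme category, or once globally, as discussed above) yields the per-phoneme weight vectors stored in the weighting coefficient vector memory 31.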
  • in other words, the weighting coefficient training controller 11 determines what weighting coefficients the respective target sub-costs should be multiplied by so as to select the speech sample that would be the closest if the acoustic distance to the target speech unit could be determined directly.
  • An advantage of the present preferred embodiment is that the speech segments of the speech waveform signals in the speech waveform database can be directly utilized.
  • the speech analyzer 10, the weighting coefficient training controller 11, the speech unit selector 12 and the speech synthesizer 13 are implemented by, for example, a digital computer or an arithmetic and control unit or controller such as a microprocessing unit (MPU) or the like, while the text database memory 22, a phoneme HMM memory 23, the feature parameter memory 30 and the weighting-coefficient vector memory 31 are implemented by, for example, a storage unit such as a hard disk or the like.
  • the speech waveform database memory 21 is a storage unit of CD-ROM type.
  • FIG. 4 is a flowchart of the speech analysis process which is executed by the speech analyzer 10 of FIG. 1 .
  • at step S11, speech segments of speech waveform signals of natural utterance are inputted from the speech waveform database memory 21 to the speech analyzer 10, and the speech segments of the speech waveform signals are converted into digital speech waveform signal data through analog to digital conversion, while text data or character data obtained by writing down the speech sentence of the above speech waveform signal is inputted from the text database stored in the text database memory 22 to the speech analyzer 10.
  • the text data may be absent; if any text data is absent, text data may be obtained from the speech waveform signal data through speech recognition using a known speech recognizing apparatus.
  • at step S12, it is decided whether or not the phoneme sequence has been predicted.
  • if the phoneme sequence has not been predicted at step S12, the phoneme sequence is predicted and stored, for example, by using the phoneme HMM, and then the program flow proceeds to step S14. If the phoneme sequence has been predicted or previously given, or the phoneme label has been given by manual work, at step S12, the program flow goes directly to step S14.
  • at step S14, the start position and end position of each phoneme segment in the speech waveform database file, which is composed of either a plurality of sentences or one sentence, are recorded, and an index number is assigned to the file.
  • at step S15, the first acoustic feature parameters for each phoneme segment are extracted by using, for example, a known pitch extraction method.
  • at step S16, the phoneme labeling is executed for each phoneme segment, and the phoneme labels and the first acoustic feature parameters for the phoneme labels are recorded.
  • then, the first acoustic feature parameters for each phoneme segment, the phoneme labels and the first prosodic feature parameters for the phoneme labels are stored in the feature parameter memory 30 together with the file index number and the start position and time duration in the file.
  • finally, index information including the index number of the file and the start position and time duration in the file is given to each phoneme segment, and the index information is stored in the feature parameter memory 30; then the speech analysis process is completed.
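One pass of this analysis flow over a single database file might be organized as below. The helpers predict_phonemes, align_phonemes and extract_features are placeholders standing in for the phoneme HMM prediction, the phoneme alignment and the acoustic/prosodic analysis described above; they, and the record layout, are assumptions for illustration.

    def analyze_file(file_index, waveform, given_phonemes, feature_memory,
                     predict_phonemes, align_phonemes, extract_features):
        """One pass of the speech analysis process of FIG. 4 over one database file."""
        # Step S12: use the given phoneme sequence, or predict one (e.g. with a
        # phoneme HMM) when it has not been given.
        phonemes = given_phonemes if given_phonemes is not None else predict_phonemes(waveform)
        # Step S14: determine the start position and duration of each phoneme segment.
        segments = align_phonemes(waveform, phonemes)   # [(label, start, duration), ...]
        # Steps S15 to S16: extract the feature parameters for each segment.
        for label, start, duration in segments:
            features = extract_features(waveform, start, duration)
            # Store the record together with its index information.
            feature_memory.append({"file_index": file_index, "label": label,
                                   "start": start, "duration": duration,
                                   "features": features})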
  • FIGS. 5 and 6 are flowcharts of the weighting coefficient training process which is executed by the weighting coefficient training controller of FIG. 1 .
  • at step S21, one phonemic kind is selected from the feature parameter memory 30.
  • at step S22, the second acoustic feature parameters are extracted from the first acoustic feature parameters of a phoneme that has the same phonemic kind as the selected phonemic kind, and then, are taken as the second acoustic feature parameters of the target phoneme.
  • at step S23, the Euclidean cepstral distances, as the acoustic distances in the second acoustic feature parameters between the target phoneme and the remaining phonemes of the same phonemic kind other than the target phoneme, as well as the base-2 logarithm of the phoneme duration, are calculated.
  • at step S24, it is decided whether or not the processes of steps S22 and S23 have been done on all the remaining phonemes. If the processes have not been completed for all the remaining phonemes at step S24, another remaining phoneme is selected at step S25, and then the processes of step S23 and those following it are iterated.
  • otherwise, the top N1 best phoneme candidates are selected at step S26 based on the distances and time durations obtained at step S23.
  • the selected N1 best phoneme candidates are ranked from first to N1-th place.
  • the scale conversion values are calculated by subtracting intermediate values from the respective distances.
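Steps S23 to S26 compare the target phoneme with every other phoneme of the same phonemic kind through a Euclidean distance over the cepstral coefficients together with the base-2 log of the phoneme duration, and keep the N1 closest candidates. A sketch follows, reusing the PhonemeRecord layout assumed earlier; combining the two distances by a plain sum is also an assumption.

    import math

    def cepstral_distance(a, b):
        """Euclidean distance between two mel-cepstrum coefficient vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a.mel_cepstrum, b.mel_cepstrum)))

    def duration_distance(a, b):
        """Difference of the base-2 logarithms of the phoneme durations."""
        return abs(math.log2(a.duration) - math.log2(b.duration))

    def top_n1_candidates(target, same_kind_phones, n1):
        """Rank all other phones of the same phonemic kind by acoustic distance
        and keep the N1 best (steps S23 to S26)."""
        others = [p for p in same_kind_phones if p is not target]
        others.sort(key=lambda p: cepstral_distance(target, p) + duration_distance(target, p))
        return others[:n1]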
  • at step S29, it is decided whether or not the processes of steps S22 to S28 have been completed for all the phonemic kinds. If they have not been completed for all the phonemic kinds, another phonemic kind and phoneme is selected at step S30, and then the processes of step S22 and those following it are iterated. On the other hand, if the processes of steps S22 to S28 have been completed for all the phonemic kinds at step S29, the program flow goes to step S31 of FIG. 6.
  • at step S31, one phonemic kind is selected.
  • at step S32, the second acoustic feature parameters for each phoneme are extracted for the selected phonemic kind.
  • at step S33, by performing the linear regression analysis based on the scale conversion values for the selected phonemic kind, the degrees of contribution to the scale conversion values in the second acoustic feature parameters are calculated, and the calculated degrees of contribution are stored in the weighting coefficient vector memory 31 as the weighting coefficients for each target phoneme.
  • at step S34, it is decided whether or not the processes of steps S32 and S33 have been completed for all the phonemic kinds.
  • if the processes have not been completed for all the phonemic kinds at step S34, another phonemic kind is selected at step S35, and the processes of step S32 and those following it are iterated. On the other hand, if the processes have been completed for all the phonemic kinds at step S34, the weighting coefficient training process is completed.
  • FIG. 7 is a flowchart of the speech unit selection process which is executed by the speech unit selector of FIG. 1 .
  • at step S41, the first phoneme located at the first position of an input phoneme sequence is selected.
  • at step S42, a weighting coefficient vector of a phoneme having the same phonemic kind as the selected phoneme is read out from the weighting coefficient vector memory 31, and the target sub-costs and the necessary feature parameters are read out and listed from the feature parameter memory 30.
  • at step S43, it is decided whether or not the processing has been completed for all the phonemes. If the processes have not been completed for all the phonemes at step S43, the next phoneme is selected at step S44, and then the process of step S42 is iterated. On the other hand, if the processes have been completed for all the phonemes at step S43, the program flow goes to step S45.
  • at step S45, the total cost for each phoneme candidate is calculated using Equation (4) for the input phoneme sequence.
  • then, the top N2 best phoneme candidates are selected for the respective target phonemes based on the calculated costs.
  • finally, index information on the combination of phoneme candidates that minimizes the total cost, together with the start time and the time duration of each phoneme, is searched for by the Viterbi search using Equation (5) and outputted to the speech synthesizer 13, and then the speech unit selection process is completed.
  • the speech synthesizer 13 reads out the digital speech waveform signal data of the selected phoneme candidate units by accessing the speech waveform database memory 21, the read-out digital speech waveform signal data is D/A converted into an analog speech signal, and then the converted analog speech signal is outputted through the loudspeaker 14.
  • synthesized speech corresponding to the input phoneme sequence is outputted from the loudspeaker 14 .
  • the speech synthesizer of the present preferred embodiment comprises the four processing units 10 to 13 .
  • the speech analyzer 10, as one processing unit, receives, as inputs, any arbitrary speech waveform signal data accompanied by text written in orthography, and then calculates and outputs feature vectors for describing the characteristics of all the phonemes in the speech waveform database.
  • the weighting coefficient training controller 11, as one processing unit, determines or calculates the optimal weighting coefficients of the respective feature parameters, as weight vectors, for selecting a speech unit that best fits the synthesis of target speech, by using the feature vectors of the speech segments of the speech waveform database and the original waveforms of the speech waveform database.
  • the speech unit selector 12, as one processing unit, generates index information for the speech waveform database memory 21 from the feature vectors and weight vectors of all the phonemes in the speech waveform database as well as the description of the utterance contents of the objective speech.
  • the speech synthesizer 13, as one processing unit, synthesizes speech by accessing and reading out the speech segments of the speech signals in the speech waveform database stored in the speech waveform database memory 21, skipping through the database and concatenating the read-out segments, according to the generated index information, and by D/A converting and outputting the objective speech signal data comprised of the read-out speech segments to the loudspeaker 14.
  • the fundamental unit for the speech synthesis method of the present preferred embodiment is the phoneme, which is generated by dictionaries or text-to-phoneme conversion programs; it is required that sufficient variations, even of the same phoneme, be contained in the speech waveform database.
  • in the speech unit selection process, a combination of phoneme samples that fits the objective prosodic environment and yet has the lowest discontinuity between adjacent speech units at the time of concatenation is selected from the speech waveform database. For this purpose, the optimal weighting coefficients for the respective feature parameters are determined or calculated for each phoneme.
  • prosodic features are introduced as speech unit selection criteria.
  • the speech waveform database is treated fully as external information.
  • a speech synthesizer apparatus that replays the speech waveform signal data stored simply on a CD-ROM or the like has been built.
  • the inventors have so far conducted evaluations with various kinds of speech waveform databases covering four languages.
  • the method of the present preferred embodiment has overcome differences in sex, age and the like.
  • synthesized speech of the highest quality has been obtained with the use of a speech waveform database that contains a corpus of short stories read aloud by a young female speaker.
  • synthesized speech has also been outputted using CD-ROM data of read-aloud sentences to which prosodic labels and detailed phoneme labels have been imparted.
  • the speech synthesizer apparatus of the present preferred embodiment can freely use various types of existing speech waveform signal data, other than speech waveform signal data specially recorded for use in speech synthesis. Also, for English, the best speech quality has been obtained with 45 minutes of speech waveform signal data of a radio announcer from the Boston University news corpus. For Korean, read-aloud speech of short stories has been used.

Abstract

In a speech synthesizer apparatus, a weighting coefficient training controller calculates acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on first acoustic feature parameters and prosodic feature parameters, and determines weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis therefor. Then, a speech unit selector searches for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and outputs index information on the searched out combination of phoneme candidates. Further, a speech synthesizer synthesizes a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information and concatenating the read-out speech segments of the speech waveform signals.

Description

This application is a continuation-in-part of application Ser. No. 08/856,578 filed on May 15, 1997 now abandoned, the entire contents of which are hereby incorporated by reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech synthesizer apparatus, and in particular, to a speech synthesizer apparatus for performing speech synthesis of any arbitrary sequence of phonemes by concatenation of speech segments of speech waveform signals extracted at synthesis time from a natural utterance.
2. Description of the Prior Art
FIG. 2 is a block diagram of a conventional speech synthesizer apparatus.
Referring to FIG. 2, for example, LPC analysis is executed on speech waveform signal data of a speaker for training, and then acoustic feature parameters including 16-degree cepstrum coefficients are extracted. The extracted acoustic feature parameters are temporarily stored in a feature parameter memory 62 of a buffer memory, and then, are transferred from the feature parameter memory 62 to a parameter time sequence generator 52. Next, the parameter time sequence generator 52 executes a signal process, including a time normalization process and a parameter time sequence generation process using prosodic control rules stored in a prosodic rule memory 63, based on the extracted acoustic feature parameters, so as to generate a time sequence of parameters including, for example, the 16-degree cepstrum coefficients, which are required for speech synthesis, and output the generated time sequence thereof to a speech synthesizer 53.
The speech synthesizer 53 is a speech synthesizer apparatus which is already known to those skilled in the art, and comprises a pulse generator 53 a for generating voiced speech, a noise generator 53 b for generating unvoiced speech, and a filter 53 c whose filter coefficient is changeable. The speech synthesizer 53 switches between voiced speech generated by the pulse generator 53 a and unvoiced speech generated by the noise generator 53 b based on an inputted time sequence of parameters, controls the amplitude of the voiced speech or unvoiced speech, and further changes the filter coefficients corresponding to the transfer coefficients of the filter 53 c. Then, the speech synthesizer 53 generates and outputs a speech signal of the attained speech synthesis to a loudspeaker 54, and then the speech of the speech signal is outputted from the loudspeaker 54.
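For context, the source-filter synthesis performed by the conventional synthesizer 53 can be sketched per frame as follows: a pulse train for voiced frames or white noise for unvoiced frames is driven through an all-pole filter. The sketch uses LPC coefficients directly rather than the cepstrum coefficients mentioned above, and every parameter name is an assumption; it is given only to illustrate the kind of signal processing that the invention described below avoids.

    import numpy as np
    from scipy.signal import lfilter

    def synthesize_frame(voiced, gain, lpc_coeffs, pitch_period, frame_len=160):
        """One frame of classical LPC synthesis: pulse-train (voiced) or white-noise
        (unvoiced) excitation passed through the time-varying all-pole filter."""
        if voiced:
            excitation = np.zeros(frame_len)
            excitation[::pitch_period] = 1.0           # pulse generator (cf. 53a)
        else:
            excitation = np.random.randn(frame_len)    # noise generator (cf. 53b)
        # all-pole filter 1 / A(z), A(z) = 1 + a1*z^-1 + ... (cf. the filter 53c)
        return lfilter([gain], np.concatenate(([1.0], np.asarray(lpc_coeffs))), excitation)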
However, the conventional speech synthesizer apparatus has had a problem that the quality of the resulting voice is considerably poor, because signal processing using the prosodic control rules is required and because the speech synthesis is performed based on processed acoustic feature parameters.
SUMMARY OF THE INVENTION
An essential object of the present invention is therefore to provide a speech synthesizer apparatus capable of converting any arbitrary phoneme sequence into an uttered speech signal without using any prosodic modification rules and without executing any signal processing, and further capable of obtaining a voice quality closer to the natural voice than that of the conventional apparatus.
In order to achieve the aforementioned objective, according to one aspect of the present invention, there is provided a speech synthesizer apparatus comprising:
first storage means for storing speech segments of speech waveform signals of natural utterance;
speech analyzing means, based on the speech segments of the speech waveform signals stored in said first storage means and a phoneme sequence corresponding to the speech waveform signals, for extracting and outputting index information on each phoneme of the speech waveform signals, first acoustic feature parameters of each phoneme indicated by the index information, and prosodic feature parameters for each phoneme indicated by the index information;
second storage means for storing the index information, the first acoustic feature parameters, and the prosodic feature parameters outputted from said speech analyzing means;
weighting coefficient training means for calculating acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in said second storage means, and for determining weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances;
third storage means for storing weighting coefficient vectors for the respective target phonemes determined by the weighting coefficient training means;
speech unit selecting means, based on the weighting coefficient vectors for the respective target phonemes stored in said third storage means, and the prosodic feature parameters stored in said second storage means, for searching for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and for outputting index information on the searched out combination of phoneme candidates; and
speech synthesizing means for synthesizing and outputting a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information from said first storage means based on the index information outputted from said speech unit selecting means, and by concatenating the read-out speech segments of the speech waveform signals.
In the above-mentioned speech synthesizer apparatus, said speech analyzing means may preferably comprise phoneme predicting means for predicting a phoneme sequence corresponding to the speech waveform signals based on input speech waveform signals.
In the above-mentioned speech synthesizer apparatus, said weighting coefficient training means may preferably determine the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
In the above-mentioned speech synthesizer apparatus, said weighting coefficient training means may preferably determine the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis using a predetermined neural network for each of the second acoustic feature parameters.
In the above-mentioned speech synthesizer apparatus, said speech unit selecting means may preferably extract a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, search for a combination of phoneme candidates that minimizes the cost.
In the above-mentioned speech synthesizer apparatus, the first acoustic feature parameters may preferably include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
In the above-mentioned speech synthesizer apparatus, the first acoustic feature parameters may preferably include formant parameters and voice source parameters.
In the above-mentioned speech synthesizer apparatus, the prosodic feature parameters may preferably include phoneme durations, speech fundamental frequencies F0, and powers.
In the above-mentioned speech synthesizer apparatus, the second acoustic feature parameters may preferably include cepstral distances.
According to one aspect of the present invention, any arbitrary phoneme sequence can be converted into uttered speech without using any prosodic control rule or executing any signal processing. Still further, voice quality close to the natural one can be obtained, as compared with that of the conventional apparatus.
In another aspect of the present invention, the speech analyzing means may preferably comprise phoneme predicting means for predicting a phoneme sequence corresponding to an input speech waveform signal based on the input speech waveform signal. Accordingly, since there is no need of giving a phoneme sequence beforehand, the part of manual work can be simplified.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects and features of the present invention will become clear from the following description taken in conjunction with the preferred embodiments thereof with reference to the accompanying drawings throughout which like parts are designated by like reference numerals, and in which:
FIG. 1 is a block diagram of a speech synthesizer apparatus utilizing concatenation of speech segments of speech waveform signals of natural utterance, which is a preferred embodiment according to the present invention;
FIG. 2 is a block diagram of a conventional speech synthesizer apparatus;
FIG. 3 is a model diagram showing a definition of speech unit selection cost calculated by a speech unit selector of FIG. 1;
FIG. 4 is a flowchart of a speech analysis process which is executed by a speech analyzer of FIG. 1;
FIG. 5 is a flowchart of a first part of a weighting coefficient training process which is processed by a weighting coefficient training controller of FIG. 1;
FIG. 6 is a flowchart of a second part of the weighting coefficient training process which is executed by the weighting coefficient training controller of FIG. 1;
FIG. 7 is a flowchart of a speech unit selection process which is executed by the speech unit selector of FIG. 1;
FIG. 8 is a graph showing a first example of a non-linear suitability function S to a target value ti which is used in the cost function of a modified preferred embodiment according to the present invention;
FIG. 9 is a graph showing a second example of a non-linear suitability function S to a target value ti which is used in the cost function of a modified preferred embodiment according to the present invention;
FIG. 10 is a graph showing a third example of a non-linear suitability function S to a target value ti which is used in the cost function of a modified preferred embodiment according to the present invention; and
FIG. 11 is a graph showing a fourth example of a non-linear suitability function S to a target value ti which is used in the cost function of a modified preferred embodiment according to the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Preferred embodiments according to the present invention will be described below with reference to the attached drawings.
FIG. 1 is a block diagram of a speech synthesizer apparatus utilizing concatenation of speech segments of speech waveform signals of natural utterance, which is a preferred embodiment according to the present invention. The conventional speech synthesizer apparatus, for example as shown in FIG. 2, performs the processes from the extraction of text corresponding to input uttered speech to the generation of a speech waveform signal, as a sequence of processes. On the other hand, the speech synthesizer apparatus of the present preferred embodiment can be roughly comprised of the following four processing units or controllers:
(1) a speech analyzer 10 for performing speech analysis of a speech waveform database stored in a speech waveform database memory 21, more specifically, a process including generation of a phonemic symbol sequence, alignment of the phonemes, and extraction of acoustic feature parameters;
(2) a weighting coefficient training controller 11 for determining optimal weighting coefficients through a training process;
(3) a speech unit selector 12 for executing selection of a speech unit based on an input phoneme sequence and outputting index information on speech segments of speech waveform signals corresponding to the input phoneme sequence; and
(4) a speech synthesizer 13 for generating the speech segments of the respective phoneme candidates that have been determined as the optimum ones, by randomly accessing the speech waveform database stored in the speech waveform database memory 21, skipping over unused portions and concatenating the read-out segments, based on the index information outputted from the speech unit selector 12, and for D/A converting and outputting the speech segments of the speech waveform signals to the loudspeaker 14.
Concretely speaking, based on speech segments of an input speech waveform signal of natural utterance and a phoneme sequence corresponding to the speech waveform signal, the speech analyzer 10 extracts and outputs index information for each phoneme in the speech segments of the speech waveform signal, first acoustic feature parameters for each phoneme shown by the index information, and first prosodic feature parameters for each phoneme shown by the index information. Then, a feature parameter memory 30 temporarily stores the index information outputted from the speech analyzer 10, the first acoustic feature parameters, and the first prosodic feature parameters. Next, the weighting coefficient training controller 11 calculates acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in the feature parameter memory 30, and determines weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis such as a linear regression analysis or the like for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances. A weighting coefficient vector memory 31 temporarily stores not only the weighting coefficient vectors for the respective target phonemes in the second acoustic feature parameters determined by the weighting coefficient training controller 11, but also previously given weighting coefficient vectors for the respective target phonemes that represent the degrees of contribution to the second prosodic feature parameters for the phoneme candidates. Further, based on the weighting coefficient vectors for the respective target phonemes stored in the weighting coefficient vector memory 31 and the first prosodic feature parameters stored in the feature parameter memory 30, the speech unit selector 12 searches the phoneme sequence of an input sentence of natural utterance for a combination of phoneme candidates that minimizes the cost including a target cost representing approximate costs between target phonemes and phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and then outputs index information on the searched-out combination of phoneme candidates. Then, based on the index information outputted from the speech unit selector 12, the speech synthesizer 13 reads out speech segments of speech waveform signals corresponding to the index information from the speech waveform database memory 21 sequentially, concatenates the read-out speech segments thereof, D/A converts the concatenated speech segments of the speech waveform signal data into speech waveform signals, and outputs the D/A converted speech waveform signals to a loudspeaker 14, and then, synthesized speech of the speech waveform signals corresponding to the input phoneme sequence is outputted from the loudspeaker 14.
The process by the speech analyzer 10 needs to be performed once for each newly introduced speech waveform database. It is generally necessary to perform the process by the weighting coefficient training controller 11 only once, and the optimal weighting coefficients determined by the weighting coefficient training controller 11 can also be reused for different speech analysis conditions. Further, the processes by the speech unit selector 12 and the speech synthesizer 13 are executed each time the input phoneme sequence to be subjected to speech synthesis changes.
The speech synthesizer apparatus of the present preferred embodiment predicts all the feature parameters that are required according to any given level of input, and selects the samples (i.e., speech segments of phoneme candidates) closest to the features of the desired speech out of the speech waveform database stored in the speech waveform database memory 21. The processing can be performed given at least a sequence of phoneme labels; however, if the phoneme fundamental frequency F0 and the phoneme duration are previously given, even higher quality synthesized speech can be obtained. In addition, when only word information is given as the input, it is necessary to predict the sequence of phonemes based on dictionaries and rules such as a phonemic hidden Markov model (hereinafter, the hidden Markov model will be referred to as an HMM) or the like. Given no prosodic features, a standard prosody is generated based on known features of phonemes under various environments in the speech waveform database.
In the present preferred embodiment, if text data that orthographically describes the recorded contents of the speech waveform database memory 21 is present, for example, as a text database in a text database memory 22, every speech waveform database can be used as speech segments of speech waveform signal data for synthesis. However, the quality of the output speech is largely conditioned by the recording conditions, the balance of phonemes in the speech waveform database, and the like. Therefore, if the speech waveform database stored in the speech waveform database memory 21 has an abundance of contents, a wider variety of speech can be synthesized. Conversely, if the speech waveform database is poor, the synthesized speech would exhibit more discontinuity, or sound more broken.
Next described is the phoneme labeling for speech of natural utterance. Whether or not the selection of a speech unit is appropriate depends on the labeling, as well as search method, of phonemes in the speech waveform database. In the present preferred embodiment, the speech unit is a phoneme. First of all, the contents of orthographical utterance imparted to the recorded speech are converted into a sequence of phonemes and further assigned to speech segments of speech waveform signals. Based on the result of this, the extraction of prosodic feature parameters is carried out. The input data of the speech analyzer 10 is speech segments of speech waveform signal data stored in the speech waveform database memory 21 accompanied by the representation of phonemes stored in the text database memory 22, and its output is feature vectors or feature parameters. These feature vectors serve as the fundamental units for representing speech samples or segments in the speech waveform database, and are used to select an optimal speech unit.
The first stage of the processing by the speech analyzer 10 is the transformation from orthographical text into phonemic symbols for describing how the contents of utterance written in orthography are pronounced with actual speech waveform signal data. Next, the second stage is a process of associating the respective phonemic symbols with speech segments of speech waveform signals in order to determine the start and end time points of each phoneme to measure prosodic and acoustic characteristics (hereinafter, the process is referred to as a phoneme alignment process). Further, the third stage is to generate feature vectors or feature parameters for respective phonemes. In these acoustic feature vectors, the phoneme label, the start time (or start position) of the phoneme in each file within the speech waveform database stored in the speech waveform database memory 21, the speech fundamental frequency F0, a phoneme duration, and a power value are stored as essential information. As optional information of the feature parameters, stress, accent type, position with respect to the prosodic boundary, spectral inclination, and the like are further stored. These feature parameters can be summarized, for example, as shown in Table 1.
TABLE 1
Index information:
index number (assigned to one file)
start time (or start position) of a phoneme in each file in the speech waveform database stored in the speech waveform database memory 21
First acoustic feature parameters:
12-degree melcepstrum coefficients
12-degree delta melcepstrum coefficients
phoneme label
discriminative characteristics:
vocalic (+) / non-vocalic (−)
consonantal (+) / non-consonantal (−)
interrupted (+) / continuant (−)
checked (+) / unchecked (−)
strident (+) / mellow (−)
voiced (+) / unvoiced (−)
compact (+) / diffuse (−)
grave (+) / acute (−)
flat (+) / plain (−)
sharp (+) / plain (−)
tense (+) / lax (−)
nasal (+) / oral (−)
First prosodic feature parameters:
phoneme duration
speech fundamental frequency F0
power value
In the present preferred embodiment, the first acoustic feature parameters include the above-mentioned parameters shown in Table 1; however, the present invention is not limited to this. The first acoustic feature parameters may include formant parameters and voice source parameters.
The start time (or start position), first acoustic feature parameters, and first prosodic feature parameters within the index information are stored in the feature parameter memory 30 for each phoneme. In this process, for example, twelve feature parameters of discriminative characteristics to be assigned to the phoneme label are given by parameter values of (+) or (−) for each item. An example of feature parameters, which are output results of the speech analyzer 10, is shown in Table 2.
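As an illustration only, the following minimal Python sketch shows one possible way to hold a per-phoneme entry of the feature parameter memory 30 corresponding to Table 1; the class name, field names and units (milliseconds, Hz) are assumptions of this sketch, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhonemeRecord:
    """One per-phoneme entry of the feature parameter memory 30 (cf. Table 1)."""
    # Index information
    index_number: str                  # file index, e.g. "X0005"
    start_time_ms: float               # start position of the phoneme in the file
    # First acoustic feature parameters
    melcepstrum: List[float]           # 12-degree mel-cepstrum coefficients
    delta_melcepstrum: List[float]     # 12-degree delta mel-cepstrum coefficients
    phoneme_label: str
    discriminative: Dict[str, bool] = field(default_factory=dict)  # e.g. {"vocalic": True}
    # First prosodic feature parameters
    duration_ms: float = 0.0
    f0_hz: float = 0.0
    power: float = 0.0

# One illustrative entry in the spirit of Table 2 (values are examples only)
rec = PhonemeRecord(index_number="X0005", start_time_ms=120.0,
                    melcepstrum=[0.0] * 12, delta_melcepstrum=[0.0] * 12,
                    phoneme_label="s", duration_ms=175.0, f0_hz=98.0, power=4.7)
```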
Referring to Table 2, one index number is given to each file, which contains either one paragraph composed of a plurality of sentences or one sentence, in the speech waveform database memory 21. The start time of a phoneme in the file and its phoneme duration counted from that start time are imparted in order to indicate the position of an arbitrary phoneme in the file to which the index number is assigned. Thus, the speech waveform of the phoneme concerned can be uniquely determined.
TABLE 2
An example of feature parameters that are output results of the speech analyzer 10:
Index number: X0005
Phoneme   Duration   Fundamental frequency   Power   . . .
#            120            90                4.0    . . .
s            175            98                4.7    . . .
ei            95           102                6.5    . . .
dh            30           114                4.9    . . .
ih            75           143                6.9    . . .
s            150           140                5.7    . . .
p             87           137                5.1    . . .
l             34           107                4.9    . . .
ii           150            98                6.3    . . .
z            140            87                5.8    . . .
#            253            87                4.0    . . .
In Table 2, the symbol “#” represents a pause.
Before selecting a speech unit, it is necessary to calculate how much the acoustic and prosodic feature parameters contribute in each phoneme. In the fourth stage, for this purpose, weighting coefficients for the respective feature parameters are calculated for all the speech samples in the speech waveform database.
In the process of generating a phonemic symbol sequence by the speech analyzer 10, for the present preferred embodiment, at least if recorded contents are described in orthography, every speech waveform database can be used as speech waveform data for synthesis, as described before. When only word information is given as the input, it is necessary to predict the sequence of phonemes based on dictionaries and rules. Also, in the process of aligning phonemes by the speech analyzer 10, when the speech is read aloud, the words would be pronounced, in many cases, nearly in their respective standard pronunciations, and rarely with hesitation or stammer. In the case of such speech waveform data, the phoneme labeling will be correctly achieved by simple dictionary search, enabling the training of phoneme models of phoneme HMM for use of phoneme alignment.
In the training of phoneme models for use of phoneme alignment, unlike complete speech recognition, it is unnecessary to completely separate speech waveform data for training and speech waveform data for tests from each other, so that the training can be done on all the speech waveform data. First of all, with a model for another speaker used as an initial model, and with only the standard pronunciation or limited pronunciational variations permitted for every word, the phoneme alignment is conducted by using the Viterbi training algorithm on all the speech waveform data so that appropriate segmentation is performed, and the feature parameters are re-estimated. Whereas the pauses between words are processed according to intra-word pause generation rules, any failures of alignment due to pauses that are present within words need to be corrected by hand.
A selection needs to be made as to which phoneme labels should be used as the representation of phonemes. If a phoneme set that allows the use of well-trained HMM models is available, it is advantageous to use that phoneme set. Conversely, if the speech synthesizer apparatus has a complete dictionary, a method of completely checking the labels of the speech waveform database against the dictionary is also advantageous. Because there is room for selection in the training of the weighting coefficients, the most important criterion may appropriately be whether or not an equivalent to what the speech synthesizer apparatus will later predict can be looked up in the speech waveform database. Since subtle differences in pronunciation are automatically captured through the prosodic environment of the pronunciation, there is no need to execute the phoneme labeling by manual work.
As the stage succeeding the pre-processing, prosodic feature parameters for describing the intonational characteristics of the respective phonemes are extracted. In conventional phonetics, linguistic sounds have been classified according to such characteristics as utterance position and utterance mode. By contrast, in phonetics that involves prosody, such as that of the Firth school or the like, clearly intoned places and emphasized places are distinguished from each other in order to capture fine differences in tone arising from differences in prosodic context. Although various methods are available for describing these differences, the following two methods are employed here. First, at the lower-order level, values obtained by averaging the power, the length of the phoneme duration, and the phoneme fundamental frequency F0 over one phoneme are used to determine one-dimensional features. At the higher-order level, a method of marking prosodic boundaries and emphasized places in view of the above-mentioned differences in the prosodic features is used. Whereas these two kinds of places have features so closely correlated to each other that one can be predicted from the other, both have strong effects on the characteristics of the respective phonemes.
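As a minimal sketch of the lower-order prosodic description just mentioned, assuming frame-level F0 and power tracks and known phoneme frame boundaries (hypothetical inputs; the treatment of unvoiced frames is an assumption of this example):

```python
from typing import List, Tuple

def lower_order_prosody(f0: List[float], power: List[float],
                        start_frame: int, end_frame: int) -> Tuple[float, float, int]:
    """One-dimensional prosodic features of one phoneme: mean F0, mean power and
    duration (in frames) over the phoneme's frame range.  Unvoiced frames are
    assumed to carry F0 = 0 and are skipped when averaging F0."""
    n = max(1, end_frame - start_frame)
    voiced = [f0[i] for i in range(start_frame, end_frame) if f0[i] > 0.0]
    mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0
    mean_power = sum(power[start_frame:end_frame]) / n
    return mean_f0, mean_power, end_frame - start_frame
```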
Just as there is a degree of freedom in the method of prescribing the phoneme sets with which the speech waveform database is described, so there is a degree of freedom in the method of describing the prosodic feature parameters. However, the way of selecting among these methods depends on the predictive ability of the speech synthesizer apparatus. If the speech waveform database has previously undergone phoneme labeling, the task of the speech synthesizer apparatus is to appropriately learn how to obtain actual speech in the speech waveform database from its internal representations. On the other hand, if the speech waveform database has not undergone phoneme labeling, it is necessary to first investigate which feature parameters, when used, allow the speech synthesizer apparatus to predict the most appropriate speech unit. This investigation and the training for determining the weighting coefficients for the feature parameters are executed by the weighting coefficient training controller 11, which calculates the weighting coefficients for the respective feature parameters through a training process.
Next, the weighting coefficient training process which is executed by the weighting coefficient training controller 11 is described. In order to select an optimal sample for acoustic and prosodic environments of any given target speech from the speech waveform database, it is necessary to first determine which features, and to what extent, contribute, depending on the differences in phonemic and prosodic environments. This is due to the fact that the kinds of important feature parameters change with properties of the phonemes. For example, the speech fundamental frequency F0, although significantly effective for the selection of voiced speech, has almost no effect on the selection of unvoiced speech. Also, the acoustic features of fricative sound have different effects depending on the kinds of the preceding and succeeding phonemes. In order to select an optimal phoneme, what degrees of weights are placed on the respective features is automatically determined through the optimal weight determining process, i.e., the weighting coefficient training process.
In the optimal weighting coefficient determining process which is executed by the weighting coefficient training controller 11, the first step is to list features which are used for selecting an optimal sample from among all the applicable samples or speech segments of uttered speech in the speech waveform database. Employed in this case are phonemic features such as intonation position and intonation mode, as well as prosodic feature parameters such as the speech fundamental frequency F0, phoneme duration, and power of the preceding phoneme, the target phoneme, and the succeeding phoneme. Actually, the second prosodic feature parameters which will be detailed later are used. Next, in the second step, in order to determine which feature parameters, and to what extent, are important in selecting optimal candidates for each phoneme, the acoustic distance, including the difference in phoneme duration, between one speech sample or segment (or a non-speech segment of a speech signal of a phoneme) and all the other phoneme samples is calculated, and the speech waveform segments of the N1 best analogous speech samples or segments, i.e., the N1 best phoneme candidates, are selected.
Further, in the third step, a linear regression analysis is performed, where the weighting coefficients representing the degrees of importance or contribution of the respective feature parameters in various acoustic and prosodic environments are determined or calculated by using the pseudo speech samples. As the prosodic feature parameters in this linear regression analysis process, for example, the following feature parameters (hereinafter, referred to as second prosodic feature parameters) are employed:
(1) first prosodic feature parameters of a preceding phoneme that is just one precedent to a target phoneme to be processed (hereinafter, referred to as a preceding phoneme);
(2) first prosodic feature parameters of a succeeding phoneme that is just one subsequent to a target phoneme to be processed (hereinafter, referred to as a succeeding phoneme);
(3) phoneme duration of the target phoneme;
(4) speech fundamental frequency F0 of the target phoneme;
(5) speech fundamental frequency F0 of the preceding phoneme; and
(6) speech fundamental frequency F0 of the succeeding phoneme.
In the present preferred embodiment, the linear regression analysis is performed for determining the weighting coefficients; however, the present invention is not limited to this. Another type of statistical analysis may be performed for determining the weighting coefficients. For example, a statistical analysis using a predetermined neural network may be performed for determining the weighting coefficients.
In this case, the preceding phoneme is defined as the phoneme that is just one precedent to the target phoneme. However, the present invention is not limited to this; the preceding phoneme may include phonemes that precede the target phoneme by a plurality of phonemes. Also, the succeeding phoneme is defined as the phoneme that is just one subsequent to the target phoneme. However, the present invention is not limited to this; the succeeding phoneme may include phonemes that follow the target phoneme by a plurality of phonemes. Furthermore, the speech fundamental frequency F0 of the succeeding phoneme may be excluded.
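A minimal sketch of assembling the second prosodic feature parameters (1) to (6) listed above for one target phoneme, assuming per-phoneme records like the PhonemeRecord sketch shown earlier; how items (1) and (2) are encoded is an assumption of this example, not a definition from the embodiment.

```python
def second_prosodic_features(prev_rec, target_rec, next_rec) -> dict:
    """Collect items (1)-(6) above for one target phoneme.  Each *_rec is a
    per-phoneme record holding duration_ms, f0_hz and power; the exact
    encoding of the context items (1) and (2) is illustrative only."""
    return {
        "prev_prosody": (prev_rec.duration_ms, prev_rec.f0_hz, prev_rec.power),  # (1)
        "next_prosody": (next_rec.duration_ms, next_rec.f0_hz, next_rec.power),  # (2)
        "target_duration": target_rec.duration_ms,                               # (3)
        "target_f0": target_rec.f0_hz,                                           # (4)
        "prev_f0": prev_rec.f0_hz,                                               # (5)
        "next_f0": next_rec.f0_hz,                                               # (6)
    }
```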
Next, the processing by the speech unit selector 12 for executing the selection of natural speech samples or segments will be described hereinafter. The conventional speech synthesizer apparatus involves the steps of determining a phoneme sequence for a target utterance of speech, and further calculating target values of F0 and phoneme duration for use of prosodic control. In contrast to this, the speech synthesizer of the present preferred embodiment involves only the step of calculating the prosody for the purpose of appropriately selecting an optimal speech sample, where the prosody is not controlled directly.
Referring to FIG. 3, the input of the processing by the speech unit selector 12 of FIG. 1 is the phoneme sequence of a target utterance of speech, weight vectors for respective features determined on the respective phonemes and feature vectors representing all the samples within the speech waveform database. On the other hand, the output thereof is index information representing the positions of phoneme samples within the speech waveform database. Thus, FIG. 3 shows the start position and speech unit duration of respective speech units for concatenation of speech segments of speech waveform signals (where, more specifically, a phoneme, or in some cases, a sequence of a plurality of phonemes are selected in continuation as one speech unit).
An optimal speech unit can be determined as a path that minimizes the sum of the target cost, which represents an approximate cost of the difference from the target utterance of speech, and the concatenation cost, which represents an approximate cost of discontinuity between adjacent speech units. A known Viterbi algorithm is used for the path search. With respect to a target speech $t_1^n = (t_1, \ldots, t_n)$, minimizing the sum of the target cost and the concatenation cost makes it possible to select such a combination of speech units $u_1^n = (u_1, \ldots, u_n)$ in the speech waveform database that the features are closer to those of the target speech and the discontinuity between the speech units is smaller. Thus, by indicating the positions of these speech units in the speech waveform database, the speech synthesis of the contents of any arbitrary utterance can be performed.
Referring to FIG. 3, the speech unit selection cost comprises the target cost $C^t(t_i, u_i)$ and the concatenation cost $C^c(u_{i-1}, u_i)$. The target cost $C^t(t_i, u_i)$ is a predictive value of the difference between a speech unit (or phoneme candidate) $u_i$ in the speech waveform database and a speech unit (or target phoneme) $t_i$ to be realized as synthesized speech, while the concatenation cost $C^c(u_{i-1}, u_i)$ is a predictive value of the discontinuity that results from the concatenation between the concatenation units (two phonemes to be concatenated) $u_{i-1}$ and $u_i$. In terms of minimizing the target cost and the concatenation cost, a similar concept was adopted in, for example, the conventional ATR ν-Talk speech synthesizing system of ATR Interpreting Telecommunications Research Laboratories, which was studied and developed into practical use by the present applicant. However, the fact that the prosodic feature parameters are used directly for unit selection forms a novel feature of the speech synthesizer apparatus of the present preferred embodiment.
Next, cost calculation will be described. The target cost is a weighted sum of differences between the respective elements of the feature vector of the speech unit to be realized and the respective elements of the feature vector of a speech unit that is a candidate selected from the speech waveform database. Given weighting coefficients $w_j^t$ for the respective target sub-costs $C_j^t(t_i, u_i)$, the target cost $C^t(t_i, u_i)$ can be calculated by the following equation (1):

$$C^t(t_i, u_i) = \sum_{j=1}^{p} w_j^t\, C_j^t(t_i, u_i) \qquad (1)$$
where the differences between the respective elements of the feature vectors are represented by $p$ target sub-costs $C_j^t(t_i, u_i)$ (where $j$ is a natural number from 1 to $p$), and the number of dimensions $p$ of the feature vectors is variable within a range of 20 to 30 in the present preferred embodiment. In a more preferred embodiment, the number of dimensions is $p = 30$, and the feature vectors or feature parameters indexed by the variable $j$ in the target sub-costs $C_j^t(t_i, u_i)$ and the weighting coefficients $w_j^t$ are the above-mentioned second prosodic feature parameters.
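As a minimal sketch of equation (1), assuming for illustration that each target sub-cost is a simple per-feature mismatch (the real sub-costs depend on the feature type), the weighted sum might be computed as follows:

```python
from typing import Sequence

def target_cost(target_feats: Sequence[float], cand_feats: Sequence[float],
                weights: Sequence[float]) -> float:
    """Equation (1): C^t(t_i, u_i) = sum_j w_j^t * C_j^t(t_i, u_i).
    Each sub-cost is approximated here by the absolute difference of one
    feature element; categorical features would instead use a 0/1 mismatch."""
    return sum(w * abs(t - u)
               for w, t, u in zip(weights, target_feats, cand_feats))
```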
On the other hand, the concatenation cost $C^c(u_{i-1}, u_i)$ can likewise be represented by a weighted sum of $q$ concatenation sub-costs $C_j^c(u_{i-1}, u_i)$ (where $j$ is a natural number from 1 to $q$). The concatenation sub-costs can be determined or calculated from the acoustic characteristics of the speech units $u_{i-1}$ and $u_i$ to be concatenated. In the preferred embodiment, the following three kinds:
(1) cepstral distance at a concatenation point of phonemes,
(2) absolute value of a difference in a logarithmic power, and
(3) absolute value of a difference in a speech fundamental frequency F0,
are used as the concatenation sub-costs, where $q = 3$. These three kinds of acoustic feature parameters, the phoneme label of the preceding phoneme, and the phoneme label of the succeeding phoneme are referred to as third acoustic feature parameters. The weights $w_j^c$ of the respective concatenation sub-costs $C_j^c(u_{i-1}, u_i)$ are given heuristically (or experimentally) beforehand. In this case, the concatenation cost $C^c(u_{i-1}, u_i)$ can be calculated by the following equation (2):

$$C^c(u_{i-1}, u_i) = \sum_{j=1}^{q} w_j^c\, C_j^c(u_{i-1}, u_i) \qquad (2)$$
If the phoneme candidates $u_{i-1}$ and $u_i$ are adjacent speech units in the speech waveform database, then the concatenation is a natural one, resulting in a concatenation cost of 0. In the preferred embodiment, the concatenation cost is determined or calculated based on the first acoustic feature parameters and the first prosodic feature parameters in the feature parameter memory 30; the concatenation cost, which involves the above-mentioned three third acoustic feature parameters of continuous quantity, assumes an analog quantity in the range of, for example, 0 to 1. On the other hand, the target cost, which involves the 30 above-mentioned second acoustic feature parameters showing whether or not the discriminative characteristics of the respective preceding or succeeding phonemes are coincident, includes elements represented by digital quantities of, for example, zero when the features are coincident for a given parameter or one when they are not. Then, the total cost for N speech units is the sum of the target costs and the concatenation costs for the respective speech units, and can be represented by the following equation (3):

$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^t(t_i, u_i) + \sum_{i=2}^{n} C^c(u_{i-1}, u_i) + C^c(S, u_1) + C^c(u_n, S) \qquad (3)$$
where $S$ represents a pause, and $C^c(S, u_1)$ and $C^c(u_n, S)$ represent the concatenation costs for a concatenation from a pause to the first speech unit and for another concatenation from the last speech unit to a pause, respectively. As is apparent from this expression, the present preferred embodiment treats a pause in exactly the same way as the other phonemes in the speech waveform database. The above-mentioned equation (3) can be expressed directly with sub-costs by the following equation (4):

$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} \sum_{j=1}^{p} w_j^t\, C_j^t(t_i, u_i) + \sum_{i=2}^{n} \sum_{j=1}^{q} w_j^c\, C_j^c(u_{i-1}, u_i) + C^c(S, u_1) + C^c(u_n, S) \qquad (4)$$
The purpose of the speech unit selection process is to determine the combination of speech units $\overline{u_1^n}$ that minimizes the total cost defined by the above-mentioned equation (4), as follows:

$$\overline{u_1^n} = \min_{u_1, u_2, \ldots, u_n} C(t_1^n, u_1^n) \qquad (5)$$
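The following is a minimal dynamic-programming sketch of the search in equations (3) to (5). Here `target_cost` and `concat_cost` are assumed to be callables implementing equations (1) and (2), `candidates[i]` holds the units considered for the i-th target (for example, the N2 best ones), and the pause handling mirrors equation (3); all names are illustrative.

```python
from typing import Any, Callable, List, Sequence

def select_units(targets: Sequence[Any], candidates: List[List[Any]],
                 target_cost: Callable[[Any, Any], float],
                 concat_cost: Callable[[Any, Any], float],
                 pause: Any = None) -> List[Any]:
    """Viterbi-style search for the combination u_1..u_n minimizing equation (3)."""
    n = len(targets)
    # best[i][k] = (cost of the best path ending in candidates[i][k], back-pointer)
    best = [[(concat_cost(pause, u) + target_cost(targets[0], u), -1)
             for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            prev = [best[i - 1][k][0] + concat_cost(candidates[i - 1][k], u)
                    for k in range(len(candidates[i - 1]))]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_best] + target_cost(targets[i], u), k_best))
        best.append(row)
    # Close the path with the concatenation back to a pause, then backtrack.
    finals = [best[n - 1][k][0] + concat_cost(candidates[n - 1][k], pause)
              for k in range(len(candidates[n - 1]))]
    k = min(range(len(finals)), key=finals.__getitem__)
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return list(reversed(path))
```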
In the above-mentioned preferred embodiment, as is apparent from the equation (4), a linear combination is used as the cost function. However, the present invention is not limited to this; the following non-linear multiplication or non-linear combination may be used.
Evaluating the target distance for a candidate unit, or the distance between a pair of units to be concatenated, returns only a physical measure of the distance separating the two speech waveform signals, and is not necessarily a true indicator of the distortion that may be perceived when using the particular candidate units for speech synthesis.
The goal of the current approach is to try to find a function relating the physical distances measured from the signal to the suitability of the unit for synthesis in a given context, in a manner analogous to the perceptual stimuli relationship. This modeling concentrates all assumptions about the capabilities of the subsequent signal processing routines, and about human audio-perception etc., at this central point.
The mathematical framework for the computation and combination of those suitabilities is motivated by fuzzy logic. Assuming that the perfect sequence of units is not contained in a finite database, calculated mismatches between existing units have to be gradually quantized, balanced and a compromise found.
Using a procedure analogous to fuzzy membership functions, the suitability functions for each single distance are defined in the range of zero to one, where “1” denotes a perfect match and “0” denotes an unacceptable mismatch, as illustrated in FIGS. 8 to 11. A typical suitability function, e.g., for pitch target mismatch, could be a Gaussian over the pitch axis centered at the target pitch. Small (perceptually irrelevant) distances close to the target value result in suitabilities close to one. Large, unacceptable mismatches result in zero suitability. Other units which are not perfectly matching but are within an acceptable distance are gradually weighted by the monotone decline between the extreme values. Any similar monotonically declining function may satisfy these assumptions as well.
Suitability functions $S$ are extensively employed in the target and joint cost functions $C^t$ and $C^c$ which follow. A suitability function is defined by the following formula:

$$S = \exp\left\{-\left(\frac{|t_i - u_i|}{T}\right)^{R} + b\right\} + \min \qquad (6)$$

where

$$b = \log(\max - \min), \quad\text{and} \qquad (7)$$

$$0 \le \min < \max = 1 \qquad (8)$$

and where
$t_i$ is a target value of some variable ($t_i$ is replaced by $u_{i-1}$ in the joint cost);
$u_i$ is an actual value;
$T$ is a tolerance;
$R$ is a rate;
$\min$ is the minimum value; and
$\max$ is the maximum value.
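A minimal sketch of the suitability function of equations (6) to (8); the parameter names s_min and s_max stand for min and max, and nothing beyond the formula itself is implied.

```python
import math

def suitability(t_i: float, u_i: float, T: float, R: float,
                s_min: float = 0.0, s_max: float = 1.0) -> float:
    """Equations (6)-(8): S = exp{-(|t_i - u_i| / T)^R + b} + min with
    b = log(max - min).  Returns max (= 1) for a perfect match and decays
    monotonically towards min as the mismatch grows."""
    b = math.log(s_max - s_min)
    return math.exp(-((abs(t_i - u_i) / T) ** R) + b) + s_min
```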
The total cost for N speech units corresponding to the formula (3) is represented by the following formula:

$$C(t_1^n, u_1^n) = \sum_i S_1 + \sum_i S_2 \qquad (9)$$
where $S_1$ is the suitability function for the target cost and $S_2$ is the suitability function for the joint cost.
Otherwise, the following total cost for N speech units corresponding to the formula (3) may be used instead of the formula (9):

$$C(t_1^n, u_1^n) = \prod_i S_1 + \prod_i S_2 \qquad (10)$$
The following Figures show suitability functions for various values of T, R, min and max. FIGS. 8 to 11 are graphs each showing an example of a non-linear suitability function S to a target value ti which is used in the cost function of a modified preferred embodiment according to the present invention.
In the suitability function S shown in FIG. 8, the parameters are set such that the target value is 40, the rate is 1, the tolerance is 10, min is 0 and max is 1.
In the suitability function S shown in FIG. 9, the parameters are set such that the target value is 40, the rate is 2, the tolerance is 10, min is 0 and max is 1.
In the suitability function S shown in FIG. 10, the parameters are set such that the target value is 40, the rate is 10, the tolerance is 10, min is 0 and max is 1.
In the suitability function S shown in FIG. 11, the parameters are set such that the target value is 40, the rate is 2, the tolerance is 20, min is 0.1 and max is 1.
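For example, with the FIG. 8-style settings (target value 40, rate 1, tolerance 10, min 0, max 1), the suitability sketch given earlier would behave roughly as follows; the evaluation points are illustrative only.

```python
# FIG. 8-style settings: target 40, rate R = 1, tolerance T = 10, min 0, max 1
print(suitability(40.0, 40.0, T=10.0, R=1.0))  # 1.0        (perfect match)
print(suitability(40.0, 50.0, T=10.0, R=1.0))  # exp(-1) ~= 0.37
print(suitability(40.0, 80.0, T=10.0, R=1.0))  # exp(-4) ~= 0.02 (near-zero suitability)
```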
The cost functions of the non-linear multiplication and the non-linear combination result in improvement in the precision of the unit selection.
In the above-mentioned equation (5), the function $\min$ returns the combination of phoneme candidates (i.e., the phoneme sequence candidate) $u_1, u_2, \ldots, u_n = \overline{u_1^n}$ that minimizes its argument, $C(t_1^n, u_1^n)$.
Now the weighting coefficient training process which is performed by the weighting coefficient training controller 11 of FIG. 1 will be described below. Weights for the target sub-costs are determined or calculated by using the linear regression analysis based on the acoustic distances. In the weighting coefficient training process, different weighting coefficients may be determined or calculated for all the phonemes, or weighting coefficients may be determined or calculated for respective phoneme categories (e.g., all nasal sounds). Otherwise, a common weighting coefficient for all the phonemes may be determined or calculated. In this case, however, different weighting coefficients for respective phonemes are employed.
Each token or speech sample stored in the database of the feature parameter memory 30 is described by a set of the first phonemic and prosodic feature parameters related to its acoustic characteristics. The weight coefficients are trained in order to determine the strength of the relationship between each individual first phonemic or prosodic feature parameter and differences in the acoustic characteristics of the token (phone in context).
The flow of the process of the linear regression analysis is shown below:
Step I: For all the samples or speech segments in the speech waveform database that belong to the phonemic kind (or phonemic category) under the current training, the following four processes (a) to (d) are executed repeatedly:
(a) Take one speech sample or segment picked up from the speech waveform database as the target utterance content;
(b) Calculate acoustic distances between the speech sample and all the other samples belonging to the same phonemic kind or category in the speech waveform database;
(c) Select top N1 best phoneme candidates (for example, N1=20) closest to the target phoneme; and
(d) Determine or calculate the target sub-costs $C_j^t(t_i, u_i)$ for the target phoneme $t_i$ itself and the top N1 samples selected in the above (c).
Step II: The acoustic distances and the target sub-costs $C_j^t(t_i, u_i)$ are calculated for all the target phonemes $t_i$ and the top N1 optimal samples.
Step III: The linear regression is used to predict the contribution of each factor of the first phonemic and prosodic feature parameters representing the target phoneme by a linear weighting of the target sub-costs. The weight coefficients determined by the linear regression are used as the weight coefficients $w_j^t$ for the target sub-costs for the current phonemic kind (or category).
The above-mentioned costs are calculated by using these weighting coefficients. Then, the processes of Step I to Step III are repeated for all the phonemic kinds or categories.
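As a rough sketch of Steps I to III for one phonemic kind, assuming numpy is available and that `acoustic_distance(t, u)` and `sub_costs(t, u)` are hypothetical helpers returning, respectively, the acoustic distance and the vector of target sub-costs for a pair of samples; ordinary least squares stands in for the linear regression.

```python
import numpy as np

def train_target_weights(samples, acoustic_distance, sub_costs, n_best=20):
    """Steps I to III for one phonemic kind: for every sample, find its N1
    acoustically closest samples of the same kind, collect their target sub-cost
    vectors, and regress the acoustic distances on the sub-costs; the regression
    coefficients are then used as the target-cost weights w_j^t for this kind."""
    X, y = [], []
    for t in samples:                                              # Step I (a)
        dists = sorted(((acoustic_distance(t, u), u)
                        for u in samples if u is not t),
                       key=lambda pair: pair[0])                   # Step I (b)
        for d, u in dists[:n_best]:                                # Step I (c): N1 best
            X.append(sub_costs(t, u))                              # Step I (d)
            y.append(d)
    X = np.asarray(X, dtype=float)                                 # Step II
    y = np.asarray(y, dtype=float)
    # Step III: least-squares fit of distances against sub-costs.
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights
```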
The purpose of the weighting coefficient training controller 11 is to determine by what weighting coefficients the respective target sub-costs should be multiplied in order to select the speech sample that would be the closest one if the acoustic distances to the target speech unit could be determined directly. An advantage of the present preferred embodiment is that the speech segments of the speech waveform signals in the speech waveform database can be directly utilized.
In the speech synthesizer apparatus of the preferred embodiment shown in FIG. 1 which is constructed as described above, the speech analyzer 10, the weighting coefficient training controller 11, the speech unit selector 12 and the speech synthesizer 13 are implemented by, for example, a digital computer or arithmetic and control unit or controller such as a microprocessing unit (MPU) or the like, while the text database memory 22, a phoneme HMM memory 23, the feature parameter memory 30 and the weighting-coefficient vector memory 31 are implemented by, for example, a storage unit such as a hard disk or the like. In the present preferred embodiment, the speech waveform database memory 21 is a storage unit of CD-ROM type.
The processing which is performed by the respective processing units 10 to 13 of the speech synthesizer apparatus of FIG. 1 constructed as described above will be described below.
FIG. 4 is a flowchart of the speech analysis process which is executed by the speech analyzer 10 of FIG. 1.
Referring to FIG. 4, first of all, at step S11, speech segments of speech waveform signals of natural utterance are inputted from the speech waveform database memory 21 to the speech analyzer 10 and are converted into digital speech waveform signal data through analog to digital conversion, while text data or character data obtained by writing down the speech sentence of the above speech waveform signal is inputted from the text database stored in the text database memory 22 to the speech analyzer 10. It is noted that the text data may be absent; in that case, text data may be obtained from the speech waveform signal data through speech recognition using a known speech recognizing apparatus. In addition, the digital speech waveform signal data resulting from the analog to digital conversion has been divided into speech segments in units of, for example, 10 milliseconds. Then, at step S12, it is decided whether or not the phoneme sequence has been predicted. If the phoneme sequence has not been predicted at step S12, the phoneme sequence is predicted and stored, for example, by using the phoneme HMM, and then the program flow proceeds to step S14. If the phoneme sequence has been predicted or previously given, or the phoneme label has been given by manual work, at step S12, the program flow goes to step S14 directly.
At step S14, the start position and end position of each phoneme segment in the speech waveform database file, which is composed of either a plurality of sentences or one sentence, are recorded, and an index number is assigned to the file. Next, at step S15, the first acoustic feature parameters for each phoneme segment are extracted by using, for example, a known pitch extraction method. Then, at step S16, the phoneme labeling is executed for each phoneme segment, and the phoneme labels and the first acoustic feature parameters for the phoneme labels are recorded. Further, at step S17, the first acoustic feature parameters for each phoneme segment, the phoneme labels and the first prosodic feature parameters for the phoneme labels are stored in the feature parameter memory 30 together with the file index number and the start position and time duration in the file. Finally, at step S18, index information including the index number of the file and the start position and time duration in the file is given to each phoneme segment, and the index information is stored in the feature parameter memory 30, and then the speech analysis process is completed.
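A minimal sketch of steps S14 to S18 for one database file, assuming hypothetical helpers `align`, `extract_acoustic` and `extract_prosodic` that stand in for the phoneme alignment and feature extraction described above; the record layout loosely mirrors Table 1.

```python
def analyze_file(index_number, waveform, phoneme_sequence,
                 align, extract_acoustic, extract_prosodic, feature_memory):
    """Record each phoneme's position in the file, extract its acoustic and
    prosodic feature parameters, and store them with the index information
    in the feature parameter memory (a plain list here)."""
    for label, (start, end) in zip(phoneme_sequence, align(waveform, phoneme_sequence)):
        entry = {
            "index_number": index_number,              # S14: file index number
            "start": start, "duration": end - start,   # S14/S18: position in the file
            "label": label,                            # S16: phoneme label
            "acoustic": extract_acoustic(waveform, start, end),   # S15
            "prosodic": extract_prosodic(waveform, start, end),   # S17
        }
        feature_memory.append(entry)                   # S17/S18: store with index info
    return feature_memory
```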
FIGS. 5 and 6 are flowcharts of the weighting coefficient training process which is executed by the weighting coefficient training controller of FIG. 1.
Referring to FIG. 5, first of all, at step S21, one phonemic kind is selected from the feature parameter memory 30. Next, at step S22, the second acoustic feature parameters are extracted from the first acoustic feature parameters of a phoneme that has the same phonemic kind as the selected phonemic kind, and then are taken as the second acoustic feature parameters of the target phoneme. Then, at step S23, the Euclidean cepstral distances, as acoustic distances in the second acoustic feature parameters between the target phoneme and the remaining phonemes of the same phonemic kind other than the target phoneme, as well as the base-2 logarithms of the phoneme durations, are calculated. At step S24, it is decided whether or not the processes of steps S22 and S23 have been done on all the remaining phonemes. If the processes have not been completed for all the remaining phonemes at step S24, another remaining phoneme is selected at step S25, and then the processes of step S23 and the following are iterated.
On the other hand, if the processing has been completed at step S24, the top N1 best phoneme candidates are selected at step S26 based on the distances and time durations obtained at step S23. Subsequently, at step S27, the selected N1 best phoneme candidates are ranked into the first to N1-th places. Then, at step S28, for the ranked N1 best phoneme candidates, the scale conversion values are calculated by subtracting intermediate values from the respective distances. Further, at step S29, it is decided whether or not the processes of steps S22 to S28 have been completed for all the phonemic kinds and phonemes. If the processes of steps S22 to S28 have not been completed for all the phonemic kinds, another phonemic kind and phoneme is selected at step S30, and then the processes of step S22 and the following are iterated. On the other hand, if the processes of steps S22 to S28 have been completed for all the phonemic kinds at step S29, the program flow goes to step S31 of FIG. 6.
Referring to FIG. 6, at step S31, one phonemic kind is selected. Subsequently, at step S32, the second acoustic feature parameters for each phoneme are extracted for the selected phonemic kind. Then, at step S33, by performing the linear regression analysis based on the scale conversion values for the selected phonemic kind, the degrees of contribution to the scale conversion values in the second acoustic feature parameters are calculated, and the calculated degrees of contribution are stored in the weighting coefficient vector memory 31 as weighting coefficients for each target phoneme. At step S34, it is decided whether or not the processes of steps S32 and S33 have been completed for all the phonemic kinds. If the processes have not been completed for all the phonemic kinds at step S34, another phonemic kind is selected at step S35, and the processes of step S32 and the following are iterated. On the other hand, if the processes have been completed for all the phonemic kinds at step S34, the weighting coefficient training process is completed.
It is noted that degrees of contribution in the second prosodic feature parameters are previously given heuristically or experimentally, and the degrees of contribution are stored in the weighting coefficient vector memory 31 as weighting coefficient vectors for each target phoneme.
FIG. 7 is a flowchart of the speech unit selection process which is executed by the speech unit selector of FIG. 1.
Referring to FIG. 7, first of all, at step S41, the first phoneme located at the first position of an input phoneme sequence is selected. Subsequently, at step S42, a weighting coefficient vector of a phoneme having the same phonemic kind as the selected phoneme is read out from the weighting coefficient vector memory 31, and target sub-costs and necessary feature parameters are read out and listed from the feature parameter memory 30. Then, at step S43, it is decided whether or not the processing has been completed for all the phonemes. If the processes have not been completed for all the phonemes at step S43, the next phoneme is selected at step S44, and then the process of step S42 is iterated. On the other hand, if the processes have been completed for all the phonemes at step S43, the program flow goes to step S45.
At step S45, the total cost for each phoneme candidate is calculated using the equation (4) for the input phoneme sequence. Subsequently, at step S46, the top N2 best phoneme candidates are selected for the respective target phonemes based on the calculated cost. Thereafter, at step S47, index information on the combination of phoneme candidates that minimizes the total cost, together with the start time and the time duration of each phoneme, is searched for by utilizing the Viterbi search based on the equation (5) and is outputted to the speech synthesizer 13, and then the speech unit selection process is completed.
Further, based on the index information and the start time and time duration of each phoneme which are outputted from the speech unit selector 12, the speech synthesizer 13 reads out the digital speech waveform signal data of the selected phoneme candidates by accessing the speech waveform database memory 21, and the read-out digital speech waveform signal data is D/A converted to an analog speech signal, and then the converted analog speech signal is outputted through the loudspeaker 14. Thus, synthesized speech corresponding to the input phoneme sequence is outputted from the loudspeaker 14.
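A minimal sketch of the read-out and butt-joining of the selected segments, assuming the speech waveform database is available as per-file sample arrays keyed by index number and that the index information carries start time and duration in seconds; these representation details and the sample rate are assumptions of the example, and the subsequent D/A conversion is omitted.

```python
import numpy as np

def concatenate_segments(index_info, waveform_db, sample_rate=16000):
    """Read out the selected speech segments in order and butt-join them into
    one signal, without further signal processing.  `index_info` is a list of
    (index_number, start_seconds, duration_seconds) tuples; `waveform_db` maps
    an index number to that file's array of samples."""
    pieces = []
    for index_number, start_s, dur_s in index_info:
        samples = waveform_db[index_number]
        a = int(round(start_s * sample_rate))
        b = int(round((start_s + dur_s) * sample_rate))
        pieces.append(samples[a:b])
    return np.concatenate(pieces) if pieces else np.zeros(0)
```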
As described above, for the speech synthesizer apparatus of the present preferred embodiment, a method of minimizing processing by using a large-scale database of natural speech has been described, with a view to maximizing the naturalness of the output speech. The speech synthesizer of the present preferred embodiment comprises the four processing units 10 to 13.
(A) SPEECH ANALYZER 10
The speech analyzer 10 of a processing unit receives, as inputs, any arbitrary speech waveform signal data accompanied by text written in orthography, and then calculates and outputs feature vectors for describing the characteristics of all the phonemes in the speech waveform database.
(B) WEIGHTING COEFFICIENT TRAINING CONTROLLER 11
The weighting coefficient training controller 11 of a processing unit determines or calculates optimal weighting coefficients of respective feature parameters, as weight vectors, for selecting a speech unit that best fits the synthesis of target speech by using the feature vectors of speech segments of the speech waveform database and the original waveforms of the speech waveform database.
(C) SPEECH UNIT SELECTOR 12
The speech unit selector 12 of a processing unit generates index information for the speech waveform database memory 21 from the feature vectors and weight vectors of all the phonemes in the speech waveform database as well as the description of utterance contents of objective speech.
(D) SPEECH SYNTHESIZER 13
The speech synthesizer 13, which is a processing unit, synthesizes speech by accessing the speech waveform database stored in the speech waveform database memory 21, reading out the speech segments indicated by the generated index information while skipping over the intervening material, concatenating the read-out segments, and D/A converting and outputting the resulting target speech signal data to the loudspeaker 14.
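As an illustration of the weighting coefficient training controller 11 described in (B) above, the weights for one phonemic kind might be obtained by a per-feature regression of measured acoustic distances (for example, cepstral distances) over candidate samples; the least-squares form below, and the names, are assumptions made only for this sketch.

    import numpy as np

    def train_weights(feature_matrix, acoustic_distances):
        """For one phonemic kind: regress the acoustic distance of each candidate
        (e.g. a cepstral distance to the target sample) on its per-feature
        differences, and use the fitted coefficients as the weight vector.
        feature_matrix: (num_candidates, num_features) array of feature differences,
        typically built from the best candidates ranked by acoustic distance.
        acoustic_distances: (num_candidates,) array of measured acoustic distances."""
        # Least-squares fit: acoustic_distance ~ feature_differences @ weights.
        weights, *_ = np.linalg.lstsq(feature_matrix, acoustic_distances, rcond=None)
        # Larger coefficients mean the feature contributes more to perceived distance.
        return weights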
In the present preferred embodiment, compression of the speech waveform signal data and correction of the speech fundamental frequency F0 and the phoneme durations are eliminated; instead, it becomes necessary to carefully label the speech samples and to select the optimal ones from the large-scale speech waveform database. The fundamental unit of the speech synthesis method of the present preferred embodiment is the phoneme, which is generated by dictionaries or text-to-phoneme conversion programs, and it is therefore required that the speech waveform database contain sufficient variations of each phoneme. In the speech unit selection process, a combination of phoneme samples is selected from the speech waveform database that fits the target prosodic environment and yet has the lowest discontinuity between adjacent speech units at the time of concatenation. For this purpose, the optimal weighting coefficients of the respective feature parameters are determined for each phoneme.
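For concreteness, such a target sub-cost can be sketched as a weighted combination of per-feature differences between the target specification and a candidate. The Equation (4) of the embodiment is not reproduced here, so the simple weighted absolute difference below, and the field names, are assumptions used only for illustration.

    def target_cost(weights, target, candidate):
        """Weighted per-feature mismatch between the target phoneme specification
        (desired duration, F0, power, phonemic context, ...) and a stored candidate.
        `weights` is the trained weight vector for this phonemic kind, given here as
        a mapping from feature name to weight; `target` and `candidate` map the same
        feature names to values."""
        return sum(
            w * abs(target[name] - candidate[name])
            for name, w in weights.items()
        )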
The features of the speech synthesizer apparatus of the present preferred embodiment are as follows.
(A) USE OF PROSODIC INFORMATION AS UNIT SELECTION CRITERIA
From the standpoint that the spectral features are inseparably related to prosodic features, prosodic features are introduced as speech unit selection criteria.
(B) AUTOMATIC TRAINING OF WEIGHTING COEFFICIENTS FOR ACOUSTIC AND PROSODIC FEATURE PARAMETERS
How much the various feature quantities, such as phonemic environments, acoustic features and prosodic features, contribute to the selection of a speech unit is determined automatically by making use of all the speech samples in the speech waveform database. Thus, a corpus-based speech synthesizer apparatus has been built up.
(C) DIRECT CONCATENATION OF SPEECH WAVEFORM SEGMENTS
Based on the above automatic training, an optimal speech sample is selected from the large-scale speech waveform database. Thus, a speech synthesizer apparatus for arbitrary input that uses no signal processing has been built up.
(D) USE OF SPEECH WAVEFORM DATABASE AS EXTERNAL INFORMATION
The speech waveform database is treated entirely as external information. Thus, a speech synthesizer apparatus has been built up in which the speech waveform signal data can be replaced simply by exchanging a CD-ROM or the like on which it is stored.
EXPERIMENTS
With the speech synthesizer apparatus of the present preferred embodiment, the inventors have so far conducted evaluations with various speech waveform databases covering four languages. As is well known to those skilled in the art, it has hitherto been technically quite difficult to synthesize high-quality speech using the speech of female speakers. However, the method of the present preferred embodiment has overcome differences in sex, age and the like. For Japanese, the synthesized speech of highest quality so far has been obtained with a speech waveform database containing a corpus of short stories read aloud by a young female speaker. For German, synthesized speech has been produced using CD-ROM data of read-aloud sentences to which prosodic labels and detailed phoneme labels had been attached. This indicates that the speech synthesizer apparatus of the present preferred embodiment can, technically, make free use of various types of existing speech waveform signal data other than data specially recorded for speech synthesis. For English, the best speech quality has been obtained with 45 minutes of speech waveform signal data of a radio announcer from the Boston University news corpus. For Korean, read-aloud speech of short stories has been used.
According to the present preferred embodiments of the present invention, any arbitrary phoneme sequence can be converted into uttered speech without using any prosodic control rule or executing any signal processing. Furthermore, a voice quality closer to natural speech than that of the conventional apparatus can be obtained.
In another aspect of the present preferred embodiments of the present invention, the speech analyzing means may preferably comprise phoneme predicting means for predicting a phoneme sequence corresponding to an input speech waveform signal based on the input speech waveform signal. Accordingly, since there is no need to provide a phoneme sequence beforehand, the manual work involved can be reduced.
Although the present invention has been fully described in connection with the preferred embodiments thereof with reference to the accompanying drawings, it is to be noted that various changes and modifications are apparent to those skilled in the art. Such changes and modifications are to be understood as included within the scope of the present invention as defined by the appended claims unless they depart therefrom.

Claims (23)

What is claimed is:
1. A speech synthesizer apparatus comprising:
first storage means for storing speech segments of speech waveform signals of natural utterance;
speech analyzing means, based on the speech segments of the speech waveform signals stored in said first storage means and a phoneme sequence corresponding to the speech waveform signals, for extracting and outputting index information on each phoneme of the speech waveform signals, first acoustic feature parameters of each phoneme indicated by the index information, and prosodic feature parameters for each phoneme indicated by the index information;
second storage means for storing the index information, the first acoustic feature parameters, and the prosodic feature parameters outputted from said speech analyzing means;
weighting coefficient training means for calculating acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in said second storage means, and for determining weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances;
third storage means for storing weighting coefficient vectors for the respective target phonemes determined by the weighting coefficient training means;
speech unit selecting means, based on the weighting coefficient vectors for the respective target phonemes stored in said third storage means, and the prosodic feature parameters stored in said second storage means, for searching for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and for outputting index information on the searched-out combination of phoneme candidates, said target cost being represented by either one of a predetermined non-linear multiplication and a predetermined non-linear combination, with use of predetermined suitability functions each being a fuzzy membership function, said concatenation cost being represented by either one of another predetermined non-linear multiplication and another predetermined non-linear combination, with use of other predetermined suitability functions each being a fuzzy membership function; and
speech synthesizing means for synthesizing and outputting a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information from said first storage means based on the index information outputted from said speech unit selecting means, and by concatenating the read-out speech segments of the speech waveform signals.
2. The speech synthesizer apparatus as claimed in claim 1,
wherein said speech analyzing means comprises phoneme predicting means for predicting a phoneme sequence corresponding to the speech waveform signals based on input speech waveform signals.
3. The speech synthesizer apparatus as claimed in claim 1,
wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
4. The speech synthesizer apparatus as claimed in claim 2,
wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
5. The speech synthesizer apparatus as claimed in claim 1,
wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis using a predetermined neural network for each of the second acoustic feature parameters.
6. The speech synthesizer apparatus as claimed in claim 2,
wherein said weighting coefficient training means determines the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis for each of the second acoustic feature parameters.
7. The speech synthesizer apparatus as claimed in claim 1,
wherein said speech unit selecting means extracts a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, searches for a combination of phoneme candidates that minimizes the cost.
8. The speech synthesizer apparatus as claimed in claim 2,
wherein said speech unit selecting means extracts a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, searches for a combination of phoneme candidates that minimizes the cost.
9. The speech synthesizer apparatus as claimed in claim 3,
wherein said speech unit selecting means extracts a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, searches for a combination of phoneme candidates that minimizes the cost.
10. The speech synthesizer apparatus as claimed in claim 1,
wherein the first acoustic feature parameters include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
11. The speech synthesizer apparatus as claimed in claim 3,
wherein the first acoustic feature parameters include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
12. The speech synthesizer apparatus as claimed in claim 7,
wherein the first acoustic feature parameters include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
13. The speech synthesizer apparatus as claimed in claim 1,
wherein the first acoustic feature parameters include formant parameters and voice source parameters.
14. The speech synthesizer apparatus as claimed in claim 3,
wherein the first acoustic feature parameters include formant parameters and voice source parameters.
15. The speech synthesizer apparatus as claimed in claim 7,
wherein the first acoustic feature parameters include formant parameters and voice source parameters.
16. The speech synthesizer apparatus as claimed in claim 1,
wherein the prosodic feature parameters include phoneme durations, speech fundamental frequencies F0, and powers.
17. The speech synthesizer apparatus as claimed in claim 3,
wherein the prosodic feature parameters include phoneme durations, speech fundamental frequencies F0, and powers.
18. The speech synthesizer apparatus as claimed in claim 7,
wherein the prosodic feature parameters include phoneme durations, speech fundamental frequencies F0, and powers.
19. The speech synthesizer apparatus as claimed in claim 1,
wherein the second acoustic feature parameters include cepstral distances.
20. The speech synthesizer apparatus as claimed in claim 3,
wherein the second acoustic feature parameters include cepstral distances.
21. The speech synthesizer apparatus as claimed in claim 7,
wherein the second acoustic feature parameters include cepstral distances.
22. The speech synthesizer apparatus as claimed in claim 1,
wherein said concatenation cost C is represented by the following equation:
C=ΠS1+ΠS2
wherein S1 is a predetermined suitability function for target cost, and S2 is a predetermined suitability function for joint cost.
23. The speech synthesizer apparatus as claimed in claim 1,
wherein said concatenation cost C is represented by the following equation:
C=ΣS1+ΣS2
where S1 is a predetermined suitability function for target cost, and S2 is a predetermined suitability function for joint cost.
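Claims 22 and 23 recite the cost C either as a sum of products or as a sum of sums of suitability (fuzzy membership) values. A minimal sketch of the two recited forms, assuming each suitability function has already been evaluated to a numeric value, is given below; it is an illustration only, not the claimed apparatus.

    from math import prod

    def cost_product(target_suitabilities, joint_suitabilities):
        """Claim 22 form: C = (product of S1 values) + (product of S2 values)."""
        return prod(target_suitabilities) + prod(joint_suitabilities)

    def cost_sum(target_suitabilities, joint_suitabilities):
        """Claim 23 form: C = (sum of S1 values) + (sum of S2 values)."""
        return sum(target_suitabilities) + sum(joint_suitabilities)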
US09/250,405 1996-05-15 1999-02-16 Concatenation of speech segments by use of a speech synthesizer Expired - Lifetime US6366883B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/250,405 US6366883B1 (en) 1996-05-15 1999-02-16 Concatenation of speech segments by use of a speech synthesizer

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP12011396 1996-05-15
JP8-120113 1996-05-15
US85657897A 1997-05-15 1997-05-15
US09/250,405 US6366883B1 (en) 1996-05-15 1999-02-16 Concatenation of speech segments by use of a speech synthesizer

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US85657897A Continuation-In-Part 1996-05-15 1997-05-15

Publications (1)

Publication Number Publication Date
US6366883B1 true US6366883B1 (en) 2002-04-02

Family

ID=26457743

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/250,405 Expired - Lifetime US6366883B1 (en) 1996-05-15 1999-02-16 Concatenation of speech segments by use of a speech synthesizer

Country Status (1)

Country Link
US (1) US6366883B1 (en)

Cited By (233)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010032079A1 (en) * 2000-03-31 2001-10-18 Yasuo Okutani Speech signal processing apparatus and method, and storage medium
US20010047259A1 (en) * 2000-03-31 2001-11-29 Yasuo Okutani Speech synthesis apparatus and method, and storage medium
US20010056347A1 (en) * 1999-11-02 2001-12-27 International Business Machines Corporation Feature-domain concatenative speech synthesis
US20020072907A1 (en) * 2000-10-19 2002-06-13 Case Eliot M. System and method for converting text-to-voice
US20020072908A1 (en) * 2000-10-19 2002-06-13 Case Eliot M. System and method for converting text-to-voice
US20020077822A1 (en) * 2000-10-19 2002-06-20 Case Eliot M. System and method for converting text-to-voice
US20020077821A1 (en) * 2000-10-19 2002-06-20 Case Eliot M. System and method for converting text-to-voice
US20020099553A1 (en) * 2000-12-02 2002-07-25 Brittan Paul St John Voice site personality setting
US20020103648A1 (en) * 2000-10-19 2002-08-01 Case Eliot M. System and method for converting text-to-voice
US20020143526A1 (en) * 2000-09-15 2002-10-03 Geert Coorman Fast waveform synchronization for concentration and time-scale modification of speech
US20020152073A1 (en) * 2000-09-29 2002-10-17 Demoortel Jan Corpus-based prosody translation system
US6470316B1 (en) * 1999-04-23 2002-10-22 Oki Electric Industry Co., Ltd. Speech synthesis apparatus having prosody generator with user-set speech-rate- or adjusted phoneme-duration-dependent selective vowel devoicing
US20020184030A1 (en) * 2001-06-04 2002-12-05 Hewlett Packard Company Speech synthesis apparatus and method
US20020188449A1 (en) * 2001-06-11 2002-12-12 Nobuo Nukaga Voice synthesizing method and voice synthesizer performing the same
US6505158B1 (en) * 2000-07-05 2003-01-07 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
US6535851B1 (en) * 2000-03-24 2003-03-18 Speechworks, International, Inc. Segmentation approach for speech recognition systems
US20030083878A1 (en) * 2001-10-31 2003-05-01 Samsung Electronics Co., Ltd. System and method for speech synthesis using a smoothing filter
US20030229494A1 (en) * 2002-04-17 2003-12-11 Peter Rutten Method and apparatus for sculpting synthesized speech
US6684187B1 (en) 2000-06-30 2004-01-27 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US6697780B1 (en) * 1999-04-30 2004-02-24 At&T Corp. Method and apparatus for rapid acoustic unit selection from a large speech corpus
US20040049375A1 (en) * 2001-06-04 2004-03-11 Brittan Paul St John Speech synthesis apparatus and method
US6725199B2 (en) * 2001-06-04 2004-04-20 Hewlett-Packard Development Company, L.P. Speech synthesis apparatus and selection method
US20040083102A1 (en) * 2002-10-25 2004-04-29 France Telecom Method of automatic processing of a speech signal
US20040128330A1 (en) * 2002-12-26 2004-07-01 Yao-Tung Chu Real time data compression method and apparatus for a data recorder
US20040148171A1 (en) * 2000-12-04 2004-07-29 Microsoft Corporation Method and apparatus for speech synthesis without prosody modification
US20040153324A1 (en) * 2003-01-31 2004-08-05 Phillips Michael S. Reduced unit database generation based on cost information
WO2004070701A2 (en) * 2003-01-31 2004-08-19 Scansoft, Inc. Linguistic prosodic model-based text to speech
US20040172249A1 (en) * 2001-05-25 2004-09-02 Taylor Paul Alexander Speech synthesis
US20040215462A1 (en) * 2003-04-25 2004-10-28 Alcatel Method of generating speech from text
US20040254792A1 (en) * 2003-06-10 2004-12-16 Bellsouth Intellectual Proprerty Corporation Methods and system for creating voice files using a VoiceXML application
US20050027532A1 (en) * 2000-03-31 2005-02-03 Canon Kabushiki Kaisha Speech synthesis apparatus and method, and storage medium
US20050086060A1 (en) * 2003-10-17 2005-04-21 International Business Machines Corporation Interactive debugging and tuning method for CTTS voice building
US6950798B1 (en) * 2001-04-13 2005-09-27 At&T Corp. Employing speech models in concatenative speech synthesis
US20060041429A1 (en) * 2004-08-11 2006-02-23 International Business Machines Corporation Text-to-speech system and method
US20060085187A1 (en) * 2004-10-15 2006-04-20 Microsoft Corporation Testing and tuning of automatic speech recognition systems using synthetic inputs generated from its acoustic models
US7076426B1 (en) * 1998-01-30 2006-07-11 At&T Corp. Advance TTS for facial animation
US7082396B1 (en) 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US20060200352A1 (en) * 2005-03-01 2006-09-07 Canon Kabushiki Kaisha Speech synthesis method
US20060229876A1 (en) * 2005-04-07 2006-10-12 International Business Machines Corporation Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
US20060229874A1 (en) * 2005-04-11 2006-10-12 Oki Electric Industry Co., Ltd. Speech synthesizer, speech synthesizing method, and computer program
US20070061145A1 (en) * 2005-09-13 2007-03-15 Voice Signal Technologies, Inc. Methods and apparatus for formant-based voice systems
US20070129945A1 (en) * 2005-12-06 2007-06-07 Ma Changxue C Voice quality control for high quality speech reconstruction
US20070192113A1 (en) * 2006-01-27 2007-08-16 Accenture Global Services, Gmbh IVR system manager
US20080059184A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Calculating cost measures between HMM acoustic models
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
US7369994B1 (en) 1999-04-30 2008-05-06 At&T Corp. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US20080183473A1 (en) * 2007-01-30 2008-07-31 International Business Machines Corporation Technique of Generating High Quality Synthetic Speech
US7409347B1 (en) * 2003-10-23 2008-08-05 Apple Inc. Data-driven global boundary optimization
US20080243511A1 (en) * 2006-10-24 2008-10-02 Yusuke Fujita Speech synthesizer
US20090070115A1 (en) * 2007-09-07 2009-03-12 International Business Machines Corporation Speech synthesis system, speech synthesis program product, and speech synthesis method
US20090132253A1 (en) * 2007-11-20 2009-05-21 Jerome Bellegarda Context-aware unit selection
US20090216535A1 (en) * 2008-02-22 2009-08-27 Avraham Entlis Engine For Speech Recognition
US20090254349A1 (en) * 2006-06-05 2009-10-08 Yoshifumi Hirose Speech synthesizer
US20090258333A1 (en) * 2008-03-17 2009-10-15 Kai Yu Spoken language learning systems
US20090288232A1 (en) * 2001-08-27 2009-11-19 Syngenta Participations Ag Self processing plants and plant parts
US20100004937A1 (en) * 2008-07-03 2010-01-07 Thomson Licensing Method for time scaling of a sequence of input signal values
US20100098224A1 (en) * 2003-12-19 2010-04-22 At&T Corp. Method and Apparatus for Automatically Building Conversational Systems
US20100145691A1 (en) * 2003-10-23 2010-06-10 Bellegarda Jerome R Global boundary-centric feature extraction and associated discontinuity metrics
US20100305949A1 (en) * 2007-11-28 2010-12-02 Masanori Kato Speech synthesis device, speech synthesis method, and speech synthesis program
CN101000765B (en) * 2007-01-09 2011-03-30 黑龙江大学 Speech synthetic method based on rhythm character
US20110087488A1 (en) * 2009-03-25 2011-04-14 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method
US20110246200A1 (en) * 2010-04-05 2011-10-06 Microsoft Corporation Pre-saved data compression for tts concatenation cost
US20120221339A1 (en) * 2011-02-25 2012-08-30 Kabushiki Kaisha Toshiba Method, apparatus for synthesizing speech and acoustic model training method for speech synthesis
CN103325381A (en) * 2013-05-29 2013-09-25 吉林大学 Speech separation method based on fuzzy membership function
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US8600743B2 (en) 2010-01-06 2013-12-03 Apple Inc. Noise profile determination for voice-related feature
US8614431B2 (en) 2005-09-30 2013-12-24 Apple Inc. Automated response to and sensing of user activity in portable devices
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US8660849B2 (en) 2010-01-18 2014-02-25 Apple Inc. Prioritizing selection criteria by automated assistant
US8670985B2 (en) 2010-01-13 2014-03-11 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8682649B2 (en) 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
US8688446B2 (en) 2008-02-22 2014-04-01 Apple Inc. Providing text input using speech data and non-speech data
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8712776B2 (en) 2008-09-29 2014-04-29 Apple Inc. Systems and methods for selective text to speech synthesis
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US8718047B2 (en) 2001-10-22 2014-05-06 Apple Inc. Text to speech conversion of text messages from mobile communication devices
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US8751238B2 (en) 2009-03-09 2014-06-10 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US8768702B2 (en) 2008-09-05 2014-07-01 Apple Inc. Multi-tiered voice feedback in an electronic device
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
US8898568B2 (en) 2008-09-09 2014-11-25 Apple Inc. Audio user interface
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US8977584B2 (en) 2010-01-25 2015-03-10 Newvaluexchange Global Ai Llp Apparatuses, methods and systems for a digital conversation management platform
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US20150073804A1 (en) * 2013-09-06 2015-03-12 Google Inc. Deep networks for unit selection speech synthesis
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US9053089B2 (en) 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
US9240184B1 (en) * 2012-11-15 2016-01-19 Google Inc. Frame-level combination of deep neural network and gaussian mixture models
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US9300784B2 (en) 2013-06-13 2016-03-29 Apple Inc. System and method for emergency calls initiated by voice command
US9311043B2 (en) 2010-01-13 2016-04-12 Apple Inc. Adaptive audio feedback system and method
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US20160140953A1 (en) * 2014-11-17 2016-05-19 Samsung Electronics Co., Ltd. Speech synthesis apparatus and control method thereof
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9697822B1 (en) 2013-03-15 2017-07-04 Apple Inc. System and method for updating an adaptive speech recognition model
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
US20170249953A1 (en) * 2014-04-15 2017-08-31 Speech Morphing Systems, Inc. Method and apparatus for exemplary morphing computer system background
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
US9798653B1 (en) * 2010-05-05 2017-10-24 Nuance Communications, Inc. Methods, apparatus and data structure for cross-language speech adaptation
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US20170345411A1 (en) * 2016-05-26 2017-11-30 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US20180018957A1 (en) * 2015-03-25 2018-01-18 Yamaha Corporation Sound control device, sound control method, and sound control program
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9922642B2 (en) 2013-03-15 2018-03-20 Apple Inc. Training an at least partial voice command system
US9946706B2 (en) 2008-06-07 2018-04-17 Apple Inc. Automatic language identification for dynamic text processing
WO2018072543A1 (en) * 2016-10-17 2018-04-26 腾讯科技(深圳)有限公司 Model generation method, speech synthesis method and apparatus
US9959870B2 (en) 2008-12-11 2018-05-01 Apple Inc. Speech recognition involving a mobile device
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US10019994B2 (en) 2012-06-08 2018-07-10 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10078487B2 (en) 2013-03-15 2018-09-18 Apple Inc. Context-sensitive handling of interruptions
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
GB2560599A (en) * 2017-03-14 2018-09-19 Google Llc Speech synthesis unit selection
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
CN109087627A (en) * 2018-10-16 2018-12-25 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10262646B2 (en) 2017-01-09 2019-04-16 Media Overkill, LLC Multi-source switched sequence oscillator waveform compositing system
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis
US10923103B2 (en) 2017-03-14 2021-02-16 Google Llc Speech synthesis unit selection
US20210110817A1 (en) * 2019-10-15 2021-04-15 Samsung Electronics Co., Ltd. Method and apparatus for generating speech
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11151899B2 (en) 2013-03-15 2021-10-19 Apple Inc. User training by intelligent digital assistant
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5459815A (en) 1992-06-25 1995-10-17 Atr Auditory And Visual Perception Research Laboratories Speech recognition method using time-frequency masking mechanism
EP0674307B1 (en) 1994-03-22 2001-01-17 Canon Kabushiki Kaisha Method and apparatus for processing speech information

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
A. J. Hunt et al., ICASSP-96, Japan, pp. 373-376 (1996).
John R. Deller et al., Discrete-Time Processing of Speech Signals, Prentice-Hall, 1987, pp. 385 and 841-842.* *
John R. Deller et al., Discrete-Time Processing of Speech Signals, Prentice-Hall, 1987, pp. 837-839.
N. Iwahashi et al., IEEE, Acoustics, Speech, and Signal Processing, 1992, ICASSP-92, vol. 2, pp. 65-68.
T. Hirokawa et al., Segment Selection and Pitch Modification, Proc. ICSLP-90, pp. 337-340 (1990).
T. Hirokawa, Speech Synthesis using a Waveform Dictionary, European Conference on Speech Communication and Technology, pp. 140-143 (1989).
Y. Sagisaka, Speech synthesis by rule using an optimal selection of non-uniform synthesis units, Acoustics, Speech, and Signal Processing, 1988, ICASSP-88, 1988 International Conference on, New York, NY, USA, 1988, pp. 679-682, vol. 1.
Y. Sagisaka, Speech, Image Processing and Neural Networks, 1994, Proceedings, ISSIPNN '94, 1994 International Symposium on, New York, NY, USA, 1994, pp. 146-150, vol. 1.

Cited By (384)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7076426B1 (en) * 1998-01-30 2006-07-11 At&T Corp. Advance TTS for facial animation
US6470316B1 (en) * 1999-04-23 2002-10-22 Oki Electric Industry Co., Ltd. Speech synthesis apparatus having prosody generator with user-set speech-rate- or adjusted phoneme-duration-dependent selective vowel devoicing
US7082396B1 (en) 1999-04-30 2006-07-25 At&T Corp Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US7761299B1 (en) 1999-04-30 2010-07-20 At&T Intellectual Property Ii, L.P. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US8086456B2 (en) 1999-04-30 2011-12-27 At&T Intellectual Property Ii, L.P. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US9236044B2 (en) 1999-04-30 2016-01-12 At&T Intellectual Property Ii, L.P. Recording concatenation costs of most common acoustic unit sequential pairs to a concatenation cost database for speech synthesis
US6697780B1 (en) * 1999-04-30 2004-02-24 At&T Corp. Method and apparatus for rapid acoustic unit selection from a large speech corpus
US8315872B2 (en) 1999-04-30 2012-11-20 At&T Intellectual Property Ii, L.P. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US9691376B2 (en) 1999-04-30 2017-06-27 Nuance Communications, Inc. Concatenation cost in speech synthesis for acoustic unit sequential pair using hash table and default concatenation cost
US7369994B1 (en) 1999-04-30 2008-05-06 At&T Corp. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US6701295B2 (en) 1999-04-30 2004-03-02 At&T Corp. Methods and apparatus for rapid acoustic unit selection from a large speech corpus
US20100286986A1 (en) * 1999-04-30 2010-11-11 At&T Intellectual Property Ii, L.P. Via Transfer From At&T Corp. Methods and Apparatus for Rapid Acoustic Unit Selection From a Large Speech Corpus
US8788268B2 (en) 1999-04-30 2014-07-22 At&T Intellectual Property Ii, L.P. Speech synthesis from acoustic units with default values of concatenation cost
US20010056347A1 (en) * 1999-11-02 2001-12-27 International Business Machines Corporation Feature-domain concatenative speech synthesis
US7035791B2 (en) * 1999-11-02 2006-04-25 International Business Machines Corporaiton Feature-domain concatenative speech synthesis
US9646614B2 (en) 2000-03-16 2017-05-09 Apple Inc. Fast, language-independent method for user authentication by voice
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
US6535851B1 (en) * 2000-03-24 2003-03-18 Speechworks, International, Inc. Segmentation approach for speech recognition systems
US20010032079A1 (en) * 2000-03-31 2001-10-18 Yasuo Okutani Speech signal processing apparatus and method, and storage medium
US6980955B2 (en) * 2000-03-31 2005-12-27 Canon Kabushiki Kaisha Synthesis unit selection apparatus and method, and storage medium
US7039588B2 (en) 2000-03-31 2006-05-02 Canon Kabushiki Kaisha Synthesis unit selection apparatus and method, and storage medium
US20010047259A1 (en) * 2000-03-31 2001-11-29 Yasuo Okutani Speech synthesis apparatus and method, and storage medium
US20050027532A1 (en) * 2000-03-31 2005-02-03 Canon Kabushiki Kaisha Speech synthesis apparatus and method, and storage medium
US20090094035A1 (en) * 2000-06-30 2009-04-09 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US7124083B2 (en) 2000-06-30 2006-10-17 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US6684187B1 (en) 2000-06-30 2004-01-27 At&T Corp. Method and system for preselection of suitable units for concatenative speech
US7460997B1 (en) 2000-06-30 2008-12-02 At&T Intellectual Property Ii, L.P. Method and system for preselection of suitable units for concatenative speech
US8566099B2 (en) 2000-06-30 2013-10-22 At&T Intellectual Property Ii, L.P. Tabulating triphone sequences by 5-phoneme contexts for speech synthesis
US20040093213A1 (en) * 2000-06-30 2004-05-13 Conkie Alistair D. Method and system for preselection of suitable units for concatenative speech
US8224645B2 (en) 2000-06-30 2012-07-17 At+T Intellectual Property Ii, L.P. Method and system for preselection of suitable units for concatenative speech
US7013278B1 (en) 2000-07-05 2006-03-14 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
US7565291B2 (en) 2000-07-05 2009-07-21 At&T Intellectual Property Ii, L.P. Synthesis-based pre-selection of suitable units for concatenative speech
US7233901B2 (en) 2000-07-05 2007-06-19 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
US6505158B1 (en) * 2000-07-05 2003-01-07 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
US20070282608A1 (en) * 2000-07-05 2007-12-06 At&T Corp. Synthesis-based pre-selection of suitable units for concatenative speech
US7058569B2 (en) 2000-09-15 2006-06-06 Nuance Communications, Inc. Fast waveform synchronization for concentration and time-scale modification of speech
US20020143526A1 (en) * 2000-09-15 2002-10-03 Geert Coorman Fast waveform synchronization for concentration and time-scale modification of speech
US20020152073A1 (en) * 2000-09-29 2002-10-17 Demoortel Jan Corpus-based prosody translation system
US7069216B2 (en) * 2000-09-29 2006-06-27 Nuance Communications, Inc. Corpus-based prosody translation system
US20020072908A1 (en) * 2000-10-19 2002-06-13 Case Eliot M. System and method for converting text-to-voice
US20020072907A1 (en) * 2000-10-19 2002-06-13 Case Eliot M. System and method for converting text-to-voice
US20020077822A1 (en) * 2000-10-19 2002-06-20 Case Eliot M. System and method for converting text-to-voice
US20020103648A1 (en) * 2000-10-19 2002-08-01 Case Eliot M. System and method for converting text-to-voice
US7451087B2 (en) 2000-10-19 2008-11-11 Qwest Communications International Inc. System and method for converting text-to-voice
US6990450B2 (en) 2000-10-19 2006-01-24 Qwest Communications International Inc. System and method for converting text-to-voice
US20020077821A1 (en) * 2000-10-19 2002-06-20 Case Eliot M. System and method for converting text-to-voice
US6862568B2 (en) * 2000-10-19 2005-03-01 Qwest Communications International, Inc. System and method for converting text-to-voice
US6990449B2 (en) 2000-10-19 2006-01-24 Qwest Communications International Inc. Method of training a digital voice library to associate syllable speech items with literal text syllables
US6871178B2 (en) 2000-10-19 2005-03-22 Qwest Communications International, Inc. System and method for converting text-to-voice
US20040049390A1 (en) * 2000-12-02 2004-03-11 Hewlett-Packard Company Voice site personality setting
US20020099553A1 (en) * 2000-12-02 2002-07-25 Brittan Paul St John Voice site personality setting
US20040148171A1 (en) * 2000-12-04 2004-07-29 Microsoft Corporation Method and apparatus for speech synthesis without prosody modification
US6950798B1 (en) * 2001-04-13 2005-09-27 At&T Corp. Employing speech models in concatenative speech synthesis
US20040172249A1 (en) * 2001-05-25 2004-09-02 Taylor Paul Alexander Speech synthesis
US7062439B2 (en) * 2001-06-04 2006-06-13 Hewlett-Packard Development Company, L.P. Speech synthesis apparatus and method
US20040049375A1 (en) * 2001-06-04 2004-03-11 Brittan Paul St John Speech synthesis apparatus and method
US6725199B2 (en) * 2001-06-04 2004-04-20 Hewlett-Packard Development Company, L.P. Speech synthesis apparatus and selection method
US7191132B2 (en) * 2001-06-04 2007-03-13 Hewlett-Packard Development Company, L.P. Speech synthesis apparatus and method
US20020184030A1 (en) * 2001-06-04 2002-12-05 Hewlett Packard Company Speech synthesis apparatus and method
US7113909B2 (en) * 2001-06-11 2006-09-26 Hitachi, Ltd. Voice synthesizing method and voice synthesizer performing the same
US20020188449A1 (en) * 2001-06-11 2002-12-12 Nobuo Nukaga Voice synthesizing method and voice synthesizer performing the same
US20090288232A1 (en) * 2001-08-27 2009-11-19 Syngenta Participations Ag Self processing plants and plant parts
US8718047B2 (en) 2001-10-22 2014-05-06 Apple Inc. Text to speech conversion of text messages from mobile communication devices
US20030083878A1 (en) * 2001-10-31 2003-05-01 Samsung Electronics Co., Ltd. System and method for speech synthesis using a smoothing filter
US7277856B2 (en) * 2001-10-31 2007-10-02 Samsung Electronics Co., Ltd. System and method for speech synthesis using a smoothing filter
US20030229494A1 (en) * 2002-04-17 2003-12-11 Peter Rutten Method and apparatus for sculpting synthesized speech
US8527281B2 (en) * 2002-04-17 2013-09-03 Nuance Communications, Inc. Method and apparatus for sculpting synthesized speech
FR2846458A1 (en) * 2002-10-25 2004-04-30 France Telecom METHOD FOR AUTOMATIC PROCESSING OF A SPOKEN SIGNAL.
US7457748B2 (en) * 2002-10-25 2008-11-25 France Telecom Method of automatic processing of a speech signal
US20040083102A1 (en) * 2002-10-25 2004-04-29 France Telecom Method of automatic processing of a speech signal
US6950041B2 (en) * 2002-12-26 2005-09-27 Industrial Technology Research Institute Real time date compression method and apparatus for a data recorder
US20040128330A1 (en) * 2002-12-26 2004-07-01 Yao-Tung Chu Real time data compression method and apparatus for a data recorder
WO2004070701A2 (en) * 2003-01-31 2004-08-19 Scansoft, Inc. Linguistic prosodic model-based text to speech
WO2004070701A3 (en) * 2003-01-31 2005-06-02 Scansoft Inc Linguistic prosodic model-based text to speech
US6988069B2 (en) 2003-01-31 2006-01-17 Speechworks International, Inc. Reduced unit database generation based on cost information
US20040153324A1 (en) * 2003-01-31 2004-08-05 Phillips Michael S. Reduced unit database generation based on cost information
US6961704B1 (en) * 2003-01-31 2005-11-01 Speechworks International, Inc. Linguistic prosodic model-based text to speech
US20040215462A1 (en) * 2003-04-25 2004-10-28 Alcatel Method of generating speech from text
US9286885B2 (en) * 2003-04-25 2016-03-15 Alcatel Lucent Method of generating speech from text in a client/server architecture
US20090290694A1 (en) * 2003-06-10 2009-11-26 At&T Corp. Methods and system for creating voice files using a voicexml application
US20040254792A1 (en) * 2003-06-10 2004-12-16 Bellsouth Intellectual Proprerty Corporation Methods and system for creating voice files using a VoiceXML application
US7577568B2 (en) * 2003-06-10 2009-08-18 At&T Intellctual Property Ii, L.P. Methods and system for creating voice files using a VoiceXML application
US20050086060A1 (en) * 2003-10-17 2005-04-21 International Business Machines Corporation Interactive debugging and tuning method for CTTS voice building
US20090083037A1 (en) * 2003-10-17 2009-03-26 International Business Machines Corporation Interactive debugging and tuning of methods for ctts voice building
US7487092B2 (en) * 2003-10-17 2009-02-03 International Business Machines Corporation Interactive debugging and tuning method for CTTS voice building
US7853452B2 (en) 2003-10-17 2010-12-14 Nuance Communications, Inc. Interactive debugging and tuning of methods for CTTS voice building
US8015012B2 (en) * 2003-10-23 2011-09-06 Apple Inc. Data-driven global boundary optimization
US20090048836A1 (en) * 2003-10-23 2009-02-19 Bellegarda Jerome R Data-driven global boundary optimization
US7930172B2 (en) 2003-10-23 2011-04-19 Apple Inc. Global boundary-centric feature extraction and associated discontinuity metrics
US7409347B1 (en) * 2003-10-23 2008-08-05 Apple Inc. Data-driven global boundary optimization
US20100145691A1 (en) * 2003-10-23 2010-06-10 Bellegarda Jerome R Global boundary-centric feature extraction and associated discontinuity metrics
US8462917B2 (en) 2003-12-19 2013-06-11 At&T Intellectual Property Ii, L.P. Method and apparatus for automatically building conversational systems
US8175230B2 (en) 2003-12-19 2012-05-08 At&T Intellectual Property Ii, L.P. Method and apparatus for automatically building conversational systems
US8718242B2 (en) 2003-12-19 2014-05-06 At&T Intellectual Property Ii, L.P. Method and apparatus for automatically building conversational systems
US20100098224A1 (en) * 2003-12-19 2010-04-22 At&T Corp. Method and Apparatus for Automatically Building Conversational Systems
US7869999B2 (en) * 2004-08-11 2011-01-11 Nuance Communications, Inc. Systems and methods for selecting from multiple phonectic transcriptions for text-to-speech synthesis
US20060041429A1 (en) * 2004-08-11 2006-02-23 International Business Machines Corporation Text-to-speech system and method
US7684988B2 (en) * 2004-10-15 2010-03-23 Microsoft Corporation Testing and tuning of automatic speech recognition systems using synthetic inputs generated from its acoustic models
US20060085187A1 (en) * 2004-10-15 2006-04-20 Microsoft Corporation Testing and tuning of automatic speech recognition systems using synthetic inputs generated from its acoustic models
US20060200352A1 (en) * 2005-03-01 2006-09-07 Canon Kabushiki Kaisha Speech synthesis method
US7716052B2 (en) * 2005-04-07 2010-05-11 Nuance Communications, Inc. Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
US20060229876A1 (en) * 2005-04-07 2006-10-12 International Business Machines Corporation Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
US20060229874A1 (en) * 2005-04-11 2006-10-12 Oki Electric Industry Co., Ltd. Speech synthesizer, speech synthesizing method, and computer program
US9501741B2 (en) 2005-09-08 2016-11-22 Apple Inc. Method and apparatus for building an intelligent automated assistant
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US20070061145A1 (en) * 2005-09-13 2007-03-15 Voice Signal Technologies, Inc. Methods and apparatus for formant-based voice systems
US8706488B2 (en) * 2005-09-13 2014-04-22 Nuance Communications, Inc. Methods and apparatus for formant-based voice synthesis
US8447592B2 (en) * 2005-09-13 2013-05-21 Nuance Communications, Inc. Methods and apparatus for formant-based voice systems
US20130179167A1 (en) * 2005-09-13 2013-07-11 Nuance Communications, Inc. Methods and apparatus for formant-based voice synthesis
US9389729B2 (en) 2005-09-30 2016-07-12 Apple Inc. Automated response to and sensing of user activity in portable devices
US9619079B2 (en) 2005-09-30 2017-04-11 Apple Inc. Automated response to and sensing of user activity in portable devices
US9958987B2 (en) 2005-09-30 2018-05-01 Apple Inc. Automated response to and sensing of user activity in portable devices
US8614431B2 (en) 2005-09-30 2013-12-24 Apple Inc. Automated response to and sensing of user activity in portable devices
US20070129945A1 (en) * 2005-12-06 2007-06-07 Ma Changxue C Voice quality control for high quality speech reconstruction
US7924986B2 (en) * 2006-01-27 2011-04-12 Accenture Global Services Limited IVR system manager
US20070192113A1 (en) * 2006-01-27 2007-08-16 Accenture Global Services, Gmbh IVR system manager
US20090254349A1 (en) * 2006-06-05 2009-10-08 Yoshifumi Hirose Speech synthesizer
US20080059184A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Calculating cost measures between HMM acoustic models
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
US8234116B2 (en) 2006-08-22 2012-07-31 Microsoft Corporation Calculating cost measures between HMM acoustic models
US8942986B2 (en) 2006-09-08 2015-01-27 Apple Inc. Determining user intent based on ontologies of domains
US9117447B2 (en) 2006-09-08 2015-08-25 Apple Inc. Using event alert text as input to an automated assistant
US8930191B2 (en) 2006-09-08 2015-01-06 Apple Inc. Paraphrasing of user requests and results by automated digital assistant
US7991616B2 (en) * 2006-10-24 2011-08-02 Hitachi, Ltd. Speech synthesizer
US20080243511A1 (en) * 2006-10-24 2008-10-02 Yusuke Fujita Speech synthesizer
CN101000765B (en) * 2007-01-09 2011-03-30 黑龙江大学 Speech synthesis method based on prosodic features
US8015011B2 (en) * 2007-01-30 2011-09-06 Nuance Communications, Inc. Generating objectively evaluated sufficiently natural synthetic speech from text by using selective paraphrases
US20080183473A1 (en) * 2007-01-30 2008-07-31 International Business Machines Corporation Technique of Generating High Quality Synthetic Speech
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10568032B2 (en) 2007-04-03 2020-02-18 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US9275631B2 (en) * 2007-09-07 2016-03-01 Nuance Communications, Inc. Speech synthesis system, speech synthesis program product, and speech synthesis method
US20090070115A1 (en) * 2007-09-07 2009-03-12 International Business Machines Corporation Speech synthesis system, speech synthesis program product, and speech synthesis method
US20130268275A1 (en) * 2007-09-07 2013-10-10 Nuance Communications, Inc. Speech synthesis system, speech synthesis program product, and speech synthesis method
US8370149B2 (en) * 2007-09-07 2013-02-05 Nuance Communications, Inc. Speech synthesis system, speech synthesis program product, and speech synthesis method
US9053089B2 (en) 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
US8620662B2 (en) * 2007-11-20 2013-12-31 Apple Inc. Context-aware unit selection
US20090132253A1 (en) * 2007-11-20 2009-05-21 Jerome Bellegarda Context-aware unit selection
US20100305949A1 (en) * 2007-11-28 2010-12-02 Masanori Kato Speech synthesis device, speech synthesis method, and speech synthesis program
US11023513B2 (en) 2007-12-20 2021-06-01 Apple Inc. Method and apparatus for searching using an active ontology
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US9361886B2 (en) 2008-02-22 2016-06-07 Apple Inc. Providing text input using speech data and non-speech data
US20090216535A1 (en) * 2008-02-22 2009-08-27 Avraham Entlis Engine For Speech Recognition
US8688446B2 (en) 2008-02-22 2014-04-01 Apple Inc. Providing text input using speech data and non-speech data
US20090258333A1 (en) * 2008-03-17 2009-10-15 Kai Yu Spoken language learning systems
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US9626955B2 (en) 2008-04-05 2017-04-18 Apple Inc. Intelligent text-to-speech conversion
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US9946706B2 (en) 2008-06-07 2018-04-17 Apple Inc. Automatic language identification for dynamic text processing
US8676584B2 (en) * 2008-07-03 2014-03-18 Thomson Licensing Method for time scaling of a sequence of input signal values
US20100004937A1 (en) * 2008-07-03 2010-01-07 Thomson Licensing Method for time scaling of a sequence of input signal values
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US8768702B2 (en) 2008-09-05 2014-07-01 Apple Inc. Multi-tiered voice feedback in an electronic device
US9691383B2 (en) 2008-09-05 2017-06-27 Apple Inc. Multi-tiered voice feedback in an electronic device
US8898568B2 (en) 2008-09-09 2014-11-25 Apple Inc. Audio user interface
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US8712776B2 (en) 2008-09-29 2014-04-29 Apple Inc. Systems and methods for selective text to speech synthesis
US8713119B2 (en) 2008-10-02 2014-04-29 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10643611B2 (en) 2008-10-02 2020-05-05 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US9412392B2 (en) 2008-10-02 2016-08-09 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US8762469B2 (en) 2008-10-02 2014-06-24 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US9959870B2 (en) 2008-12-11 2018-05-01 Apple Inc. Speech recognition involving a mobile device
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
US8751238B2 (en) 2009-03-09 2014-06-10 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US9002711B2 (en) * 2009-03-25 2015-04-07 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method
US20110087488A1 (en) * 2009-03-25 2011-04-14 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method
US10475446B2 (en) 2009-06-05 2019-11-12 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US8682649B2 (en) 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
US8600743B2 (en) 2010-01-06 2013-12-03 Apple Inc. Noise profile determination for voice-related feature
US9311043B2 (en) 2010-01-13 2016-04-12 Apple Inc. Adaptive audio feedback system and method
US8670985B2 (en) 2010-01-13 2014-03-11 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8706503B2 (en) 2010-01-18 2014-04-22 Apple Inc. Intent deduction based on previous user interactions with voice assistant
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US8731942B2 (en) 2010-01-18 2014-05-20 Apple Inc. Maintaining context information between user interactions with a voice assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8799000B2 (en) 2010-01-18 2014-08-05 Apple Inc. Disambiguation based on active input elicitation by intelligent automated assistant
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8670979B2 (en) 2010-01-18 2014-03-11 Apple Inc. Active input elicitation by intelligent automated assistant
US8892446B2 (en) 2010-01-18 2014-11-18 Apple Inc. Service orchestration for intelligent automated assistant
US8660849B2 (en) 2010-01-18 2014-02-25 Apple Inc. Prioritizing selection criteria by automated assistant
US8903716B2 (en) 2010-01-18 2014-12-02 Apple Inc. Personalized vocabulary for digital assistant
US9548050B2 (en) 2010-01-18 2017-01-17 Apple Inc. Intelligent automated assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US9424862B2 (en) 2010-01-25 2016-08-23 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US9424861B2 (en) 2010-01-25 2016-08-23 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US8977584B2 (en) 2010-01-25 2015-03-10 Newvaluexchange Global Ai Llp Apparatuses, methods and systems for a digital conversation management platform
US9431028B2 (en) 2010-01-25 2016-08-30 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US9190062B2 (en) 2010-02-25 2015-11-17 Apple Inc. User profiling for voice input processing
US20110246200A1 (en) * 2010-04-05 2011-10-06 Microsoft Corporation Pre-saved data compression for tts concatenation cost
US8798998B2 (en) * 2010-04-05 2014-08-05 Microsoft Corporation Pre-saved data compression for TTS concatenation cost
US9798653B1 (en) * 2010-05-05 2017-10-24 Nuance Communications, Inc. Methods, apparatus and data structure for cross-language speech adaptation
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US9075783B2 (en) 2010-09-27 2015-07-07 Apple Inc. Electronic device with text error correction based on voice recognition data
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US9058811B2 (en) * 2011-02-25 2015-06-16 Kabushiki Kaisha Toshiba Speech synthesis with fuzzy heteronym prediction using decision trees
US20120221339A1 (en) * 2011-02-25 2012-08-30 Kabushiki Kaisha Toshiba Method, apparatus for synthesizing speech and acoustic model training method for speech synthesis
US10102359B2 (en) 2011-03-21 2018-10-16 Apple Inc. Device access using voice authentication
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9953088B2 (en) 2012-05-14 2018-04-24 Apple Inc. Crowd sourcing information to fulfill user requests
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US10019994B2 (en) 2012-06-08 2018-07-10 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US9240184B1 (en) * 2012-11-15 2016-01-19 Google Inc. Frame-level combination of deep neural network and gaussian mixture models
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US9922642B2 (en) 2013-03-15 2018-03-20 Apple Inc. Training an at least partial voice command system
US10078487B2 (en) 2013-03-15 2018-09-18 Apple Inc. Context-sensitive handling of interruptions
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11151899B2 (en) 2013-03-15 2021-10-19 Apple Inc. User training by intelligent digital assistant
US9697822B1 (en) 2013-03-15 2017-07-04 Apple Inc. System and method for updating an adaptive speech recognition model
CN103325381A (en) * 2013-05-29 2013-09-25 吉林大学 Speech separation method based on fuzzy membership function
CN103325381B (en) * 2013-05-29 2015-09-02 吉林大学 Speech separation method based on fuzzy membership functions
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US9300784B2 (en) 2013-06-13 2016-03-29 Apple Inc. System and method for emergency calls initiated by voice command
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US9460704B2 (en) * 2013-09-06 2016-10-04 Google Inc. Deep networks for unit selection speech synthesis
US20150073804A1 (en) * 2013-09-06 2015-03-12 Google Inc. Deep networks for unit selection speech synthesis
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US10008216B2 (en) * 2014-04-15 2018-06-26 Speech Morphing Systems, Inc. Method and apparatus for exemplary morphing computer system background
US20170249953A1 (en) * 2014-04-15 2017-08-31 Speech Morphing Systems, Inc. Method and apparatus for exemplary morphing computer system background
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US10169329B2 (en) 2014-05-30 2019-01-01 Apple Inc. Exemplar-based natural language processing
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US9668024B2 (en) 2014-06-30 2017-05-30 Apple Inc. Intelligent automated assistant for TV user interactions
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US20160140953A1 (en) * 2014-11-17 2016-05-19 Samsung Electronics Co., Ltd. Speech synthesis apparatus and control method thereof
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US11556230B2 (en) 2014-12-02 2023-01-17 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US20180018957A1 (en) * 2015-03-25 2018-01-18 Yamaha Corporation Sound control device, sound control method, and sound control program
US10504502B2 (en) * 2015-03-25 2019-12-10 Yamaha Corporation Sound control device, sound control method, and sound control program
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US20170345411A1 (en) * 2016-05-26 2017-11-30 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9934775B2 (en) * 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US10832652B2 (en) 2016-10-17 2020-11-10 Tencent Technology (Shenzhen) Company Limited Model generating method, and speech synthesis method and apparatus
WO2018072543A1 (en) * 2016-10-17 2018-04-26 腾讯科技(深圳)有限公司 Model generation method, speech synthesis method and apparatus
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10262646B2 (en) 2017-01-09 2019-04-16 Media Overkill, LLC Multi-source switched sequence oscillator waveform compositing system
GB2560599A (en) * 2017-03-14 2018-09-19 Google Llc Speech synthesis unit selection
US11393450B2 (en) 2017-03-14 2022-07-19 Google Llc Speech synthesis unit selection
GB2560599B (en) * 2017-03-14 2020-07-29 Google Llc Speech synthesis unit selection
US10923103B2 (en) 2017-03-14 2021-02-16 Google Llc Speech synthesis unit selection
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
CN109087627A (en) * 2018-10-16 2018-12-25 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
US20210110817A1 (en) * 2019-10-15 2021-04-15 Samsung Electronics Co., Ltd. Method and apparatus for generating speech
US11580963B2 (en) * 2019-10-15 2023-02-14 Samsung Electronics Co., Ltd. Method and apparatus for generating speech
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis

Similar Documents

Publication Publication Date Title
US6366883B1 (en) Concatenation of speech segments by use of a speech synthesizer
Kawai et al. XIMERA: A new TTS from ATR based on corpus-based technologies
US7869999B2 (en) Systems and methods for selecting from multiple phonetic transcriptions for text-to-speech synthesis
US5857173A (en) Pronunciation measurement device and method
US7013276B2 (en) Method of assessing degree of acoustic confusability, and system therefor
US6173263B1 (en) Method and system for performing concatenative speech synthesis using half-phonemes
US20100004931A1 (en) Apparatus and method for speech utterance verification
US20030154081A1 (en) Objective measure for estimating mean opinion score of synthesized speech
Turk et al. Robust processing techniques for voice conversion
JPH11143346A (en) Method and device for evaluating language practicing speech and storage medium storing speech evaluation processing program
JP5007401B2 (en) Pronunciation rating device and program
KR100362292B1 (en) Method and system for English pronunciation study based on speech recognition technology
JP4811993B2 (en) Audio processing apparatus and program
JP3050832B2 (en) Speech synthesizer with spontaneous speech waveform signal connection
Ipsic et al. Croatian HMM-based speech synthesis
GB2313530A (en) Speech Synthesizer
Kayte et al. A Corpus-Based Concatenative Speech Synthesis System for Marathi
Abdelmalek et al. High quality Arabic text-to-speech synthesis using unit selection
JP2975586B2 (en) Speech synthesis system
Narendra et al. Syllable specific unit selection cost functions for text-to-speech synthesis
KR102274766B1 (en) Pronunciation prediction and evaluation system for beginner foreign language learners
JP3091426B2 (en) Speech synthesizer with spontaneous speech waveform signal connection
JP5028599B2 (en) Audio processing apparatus and program
JP3459600B2 (en) Speech data amount reduction device and speech synthesis device for speech synthesis device
KR102274764B1 (en) User-defined pronunciation evaluation system for providing statistics information

Legal Events

Date Code Title Description
AS Assignment
Owner name: ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABOR
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CAMPBELL, NICK;HUNT, ANDREW;REEL/FRAME:009933/0393
Effective date: 19990405

STCF Information on status: patent grant
Free format text: PATENTED CASE

AS Assignment
Owner name: ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABOR
Free format text: CHANGE OF ADDRESS;ASSIGNOR:ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES;REEL/FRAME:013211/0068
Effective date: 20000325

CC Certificate of correction

AS Assignment
Owner name: ADVANCED TELECOMMUNICATIONS RESEARCH INSTITUTE INT
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ATR INTERPRETING TELECOMMUNICATIONS RESEARCH LABORATORIES;REEL/FRAME:013552/0470
Effective date: 20021031

FPAY Fee payment
Year of fee payment: 4

FEPP Fee payment procedure
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment
Year of fee payment: 8

FPAY Fee payment
Year of fee payment: 12