96:12 • Jin, Z. et al
cost is 0. Here is the complete mathematical definition.
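\[
M_{\mathrm{Pre},i,j} = \min\Bigl(
  \min_t \bigl\{ M_{\mathrm{Post},i-1,t} + C(F_{\mathrm{Post},i-1,t},\, F_{\mathrm{Pre},i,j}) \bigr\},\;
  \min_t \bigl\{ M_{\mathrm{End},i-1,t} + C(F_{\mathrm{End},i-1,t},\, F_{\mathrm{Pre},i,j}) \bigr\}
\Bigr)
\]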
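\[
M_{\mathrm{Start},i,j} = \min\Bigl(
  \min_t \bigl\{ M_{\mathrm{Post},i-1,t} + C(F_{\mathrm{Post},i-1,t},\, F_{\mathrm{Start},i,j}) \bigr\},\;
  \min_t \bigl\{ M_{\mathrm{End},i-1,t} + C(F_{\mathrm{End},i-1,t},\, F_{\mathrm{Start},i,j}) \bigr\}
\Bigr)
\]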
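As a rough illustration of the two updates above, here is a minimal Python sketch; it is not the authors' implementation. It assumes each range is a `(first_frame, last_frame)` tuple and that the concatenation cost C is 0 for consecutive ranges and a fixed, hypothetical `penalty` otherwise. The function names are invented for this sketch.

```python
import math

def concat_cost(prev_range, next_range, penalty=1.0):
    """Assumed form of C: free if next_range starts right after prev_range ends."""
    return 0.0 if next_range[0] == prev_range[1] + 1 else penalty

def enter_phoneme(m_post_prev, post_prev, m_end_prev, end_prev, candidates):
    """Cost of opening phoneme i at each candidate range j.

    The same routine serves M_Pre and M_Start: both take the cheapest
    predecessor among the Post and End choices of phoneme i-1.
    """
    costs = []
    for cand in candidates:
        best = math.inf
        # Predecessor closed phoneme i-1 with a Post range.
        for m, rng in zip(m_post_prev, post_prev):
            best = min(best, m + concat_cost(rng, cand))
        # Predecessor closed phoneme i-1 with an End range.
        for m, rng in zip(m_end_prev, end_prev):
            best = min(best, m + concat_cost(rng, cand))
        costs.append(best)
    return costs
```

Calling `enter_phoneme` once with the Pre candidates and once with the Start candidates fills both tables for phoneme i.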
Each Post range, when combined with a preceding Pre range, forms a valid range choice for the current phoneme. We can then use the corresponding frames to determine the similarity cost and the duration cost. If Pre and Post are not consecutive, there is also a concatenation cost. Therefore,
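\[
M_{\mathrm{Post},i,j} = \min_t \bigl\{ M_{\mathrm{Pre},i,t}
  + \alpha\, S(q_i, \{F_{\mathrm{Pre},i,t},\, F_{\mathrm{Post},i,j}\})
  + \beta\, L(q_i, \{F_{\mathrm{Pre},i,t},\, F_{\mathrm{Post},i,j}\})
  + C(F_{\mathrm{Pre},i,t},\, F_{\mathrm{Post},i,j}) \bigr\}
\]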
in which the definitions of α, β, S and L follow those used in Equation 1. Similarly, combining an End range with a Start range forms a valid range choice that defines the similarity and duration cost. Since only one range is selected, there is no concatenation cost. Therefore,
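\[
M_{\mathrm{End},i,j} = \min_t \bigl\{ M_{\mathrm{Start},i,t}
  + \alpha\, S(q_i, \{\langle F_{\mathrm{Start},i,t},\, F_{\mathrm{End},i,j} \rangle\})
  + \beta\, L(q_i, \{\langle F_{\mathrm{Start},i,t},\, F_{\mathrm{End},i,j} \rangle\}) \bigr\}
\]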
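The two within-phoneme updates (Post and End) share one shape: minimise over the opening choice t for each closing choice j. The sketch below is one possible way to write that in Python, not the paper's code. The callables `sim`, `dur` and `concat` are hypothetical stand-ins for S, L and C, assumed to already close over the query phoneme q_i; the function name and the returned back pointers are likewise assumptions of this sketch.

```python
import math

def finish_phoneme(m_open, opening, closing, sim, dur, alpha, beta, concat=None):
    """Return (costs, back) for every closing choice j of one phoneme.

    With `concat` given this mirrors the M_Post update (Pre + Post ranges);
    with concat=None it mirrors the M_End update (a single Start..End range).
    back[j] is the minimising opening index t, kept for back tracing.
    """
    costs, back = [], []
    for close in closing:
        best, best_t = math.inf, None
        for t, (m, opened) in enumerate(zip(m_open, opening)):
            cost = m + alpha * sim(opened, close) + beta * dur(opened, close)
            if concat is not None:
                # Only the two-range case pays a concatenation cost.
                cost += concat(opened, close)
            if cost < best:
                best, best_t = cost, t
        costs.append(best)
        back.append(best_t)
    return costs, back
```

Storing `back` alongside the costs is a design choice of this sketch: it lets the minimal-cost path be recovered later without recomputing any minimum.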
Using back tracing, we can extract the selected ranges that produce the minimal cost. Figure 15 shows an example selection result. When combined, the final selected frames form two consecutive segments: 1-6 and 10-13.
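A minimal sketch of the back-trace step and of how adjacent ranges collapse into consecutive segments follows. It assumes the dynamic program stored, for every step, the index of the best predecessor of each choice; the data layout, the function names, and the input ranges in the usage note are hypothetical and are not taken from Figure 15.

```python
def back_trace(back_pointers, last_choice):
    """Recover the chosen index at every step of the dynamic program.

    back_pointers[k][j] is the index chosen at step k that leads to
    choice j at step k + 1; last_choice is the best index at the final step.
    """
    path = [last_choice]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def merge_consecutive(ranges):
    """Fuse (first, last) frame ranges that directly follow one another."""
    merged = []
    for first, last in ranges:
        if merged and first == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], last)
        else:
            merged.append((first, last))
    return merged
```

For instance, the made-up ranges `[(1, 3), (4, 6), (10, 11), (12, 13)]` merge into `[(1, 6), (10, 13)]`, the same kind of two-segment result described in the text.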
Range selection is more efficient than Unit selection. Suppose each query phoneme lasts m frames and has n candidates. Unit selection has a complexity of $O(mn^2)$ per phoneme with the Viterbi algorithm. In Range selection, however, the list of candidates per phoneme contains $O(n)$ rows and only 2 columns (Figure 15). Therefore, the complexity of Range selection per phoneme is $O(n^2)$. Because of this efficiency, we can mass-produce selection results for all possible values of α and β, which leads to alternative syntheses (Section 4.5).
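As a worked example of the gap, take the hypothetical values m = 10 frames and n = 100 candidates (chosen only for illustration, and ignoring constant factors):

\[
m n^2 = 10 \cdot 100^2 = 10^5, \qquad n^2 = 100^2 = 10^4, \qquad \frac{m n^2}{n^2} = m = 10 .
\]

Under these assumptions, Range selection does roughly m times less work per phoneme, so the saving grows with phoneme duration.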
ACM Transactions on Graphics, Vol. 36, No. 4, Article 96. Publication date: July 2017.