{"id":10155,"date":"2023-02-24T17:06:30","date_gmt":"2023-02-24T16:06:30","guid":{"rendered":"https:\/\/www.lenseup.com\/text-to-speech-tts-et-synthese-vocale-3-approches-innovantes\/"},"modified":"2023-07-31T17:09:06","modified_gmt":"2023-07-31T15:09:06","slug":"fresh-approaches-to-improve-synthetic-speech-and-text-to-speech","status":"publish","type":"post","link":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/","title":{"rendered":"Fresh approaches to improve the quality of synthetic speech and text-to-speech"},"content":{"rendered":"<p>AI has changed the way people go about their daily lives. Voice recognition has simplified activities such as taking notes and typing documents, and its speed and efficiency are what make it so popular. Progress in AI has produced many voice recognition applications: Google Assistant, Alexa, and Siri are a few examples of virtual assistants that use voice recognition software to communicate with users. Text-to-speech, speech-to-text, and text-to-text systems have also been widely adopted in various applications<span 
data-offset-key=\"1o6dh-232-0\">.<\/span><!--more--><\/p>\n<p>Creating human-level speech is essential for artificial intelligence (AI), especially for chatbots. Recent advances in deep learning have drastically improved the quality of speech produced by neural text-to-speech (TTS) systems. However, most of the data used to train these systems has been limited to recordings from controlled environments, such as reading aloud or performing a script. Human beings, on the other hand, can speak spontaneously with varied prosody that conveys paralinguistic information, such as subtle emotions. This ability is acquired through long exposure to real-world speech.<\/p>\n<p>Here are three recent approaches that aim to improve text-to-speech.<\/p>\n<h2>Multi-codebook vector quantized TTS<\/h2>\n<p>Researchers at Carnegie Mellon University have developed an AI system that can be trained to generate speech from text in a wide range of voices. To do this, they used real speech taken from YouTube videos and podcasts. By using existing recordings, they could simplify the setup and focus on text-to-speech itself. They hope that this will replicate the success of large language models like GPT-3.<\/p>\n<p>Using a limited amount of resources, these systems can be tailored to particular speaker qualities<span 
data-offset-key=\"dreev-154-0\"> or<\/span> recording conditions. Their paper examines the new challenges that arise when training TTS systems on real-world speech, such as greater prosodic variance and background noise, neither of which is found in speech recorded in controlled environments. The authors show that mel-spectrogram-based autoregressive models cannot maintain accurate text-audio alignment when applied to real-world speech, which results in distorted output. They attribute this alignment failure at inference time to errors that accumulate during decoding, since they also demonstrate that precise alignments can be learned during the training phase.<\/p>\n<p>The researchers found that replacing<span data-offset-key=\"a1tlr-142-0\"> 
the<\/span> mel-spectrogram with a learned discrete codebook could address the problem, because discrete representations are more resistant to input noise. However, a single codebook still produced distorted speech, even when its size was increased. The authors suggest that spontaneous speech contains too many prosody patterns for a single codebook to capture. They therefore used multiple codebooks and designed architectures for multi-code sampling and monotonic alignment. During inference, a pure-silence audio prompt was used to encourage the model to produce clean speech despite being trained on a noisy corpus.<\/p>\n<p>In their paper, the authors present this system as MQTTS (multi-codebook vector quantized TTS). To understand its potential for<span data-offset-key=\"eudaa-209-0\"> 
real<\/span>-world speech synthesis, they first compare it with mel-spectrogram-based systems (Section 5 of the paper) and carry out an ablation analysis. They then compare MQTTS with non-autoregressive models and find that it offers better intelligibility and speaker transferability, as well as greater prosody diversity and naturalness. Non-autoregressive models, however, remain faster and more robust. MQTTS can also achieve a better signal-to-noise ratio (SNR) when it is given a clean silence prompt. The authors have made their source code publicly available on GitHub.<\/p>\n<h2>Hugging Face Transformers has recently gained its first text-to-speech model, SpeechT5<\/h2>\n<p><span data-offset-key=\"c0v1h-54-0\">The highly successful T5 (Text-To-Text Transfer Transformer) inspired the SpeechT5 framework, a unified model that uses encoder-decoder pre-training for self-supervised learning of speech\/text representations. <a href=\"https:\/\/huggingface.co\/spaces\/Matthijs\/speecht5-tts-demo\">The SpeechT5 model has now been added to the Hugging Face Transformers toolkit<\/a>, an open-source library that provides easy access to the latest machine learning models.
<\/span><\/p>\n<p>SpeechT5 uses a conventional encoder-decoder design to learn joint contextual representations for both speech and text. It covers three kinds of speech model in one architecture: text-to-speech (for synthesizing audio from text), speech-to-text (for automatic speech recognition), and speech-to-speech (for speech enhancement or converting between voices).<\/p>\n<p>The core concept of SpeechT<span 
data-offset-key=\"a6pj9-246-0\">5<\/span> is to pre-train a single model on a mixture of text-to-speech, speech-to-text, text-to-text, and speech-to-speech data. This encourages the model to learn from speech and written text at the same time. The base of SpeechT5 is a standard Transformer encoder-decoder which, like any other Transformer, performs a sequence-to-sequence transformation using hidden representations. Pre-nets and post-nets are added to make the same Transformer suitable for both text and audio. The pre-nets convert the input text or speech into the Transformer&#8217;s hidden representations, while the post-nets convert the Transformer&#8217;s outputs into text or speech. To train the model on both modalities, the team feeds it text or speech as input and has it produce the corresponding output as text or speech.<\/p>\n<p>SpeechT5 stands out from other models because it allows multiple tasks to be carried out with one architecture, simply by changing the<span data-offset-key=\"2idq7-198-0\"> pre<\/span><span 
data-offset-key=\"2idq7-199-0\">-<\/span>nets and post-nets. The model has been fine-tuned for a variety of tasks, and the authors report that it outperforms all baseline models on a number of spoken language processing tasks. To improve it further, the researchers plan to pre-train a larger SpeechT5 model on more unlabeled data. They are also exploring ways to extend the framework to spoken language processing tasks in multiple languages.<\/p>\n<h2>VALL-E by Microsoft<\/h2>\n<p>Microsoft has developed a language model for text-to-speech (TTS) synthesis known as <a href=\"https:\/\/valle-demo.github.io\/\">VALL-E<\/a>. The model uses audio codec codes as intermediate representations and can replicate someone&#8217;s voice from only three seconds of audio input. VALL-E is a neural codec <span data-offset-key=\"5ti69-354-0\"> 
language<\/span> model: it tokenizes speech into discrete codes with a neural audio codec, predicts the codes for the target text, and decodes them into a waveform that sounds like the speaker, preserving their timbre and emotional tone. According to the research paper, VALL-E can produce high-quality personalized speech from just a three-second sample of the speaker&#8217;s voice, without additional structure engineering, pre-designed acoustic features, or fine-tuning. It also shows in-context learning capabilities and supports prompt-based zero-shot TTS. Demonstration audio clips accompany the research paper. In each example, one sample is the three-second prompt that VALL-E must imitate. For comparison, the &#8220;ground truth&#8221; sample is a pre-recorded phrase by the same speaker, the &#8220;baseline&#8221; <span data-offset-key=\"5ti69-508-0\"> 
sample<\/span> is the output of a conventional text-to-speech system, and the &#8220;VALL-E&#8221; sample is the output of the VALL-E model.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI has drastically altered the way people go about their daily lives. Voice recognition has simplified activities like taking notes, typing documents, and more. Its speed and efficiency are what makes it so popular. With the progress made in AI, many voice recognition applications have been created. 
Google, Alexa, and Siri are a few examples [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":10159,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[77,62],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Fresh approaches to improve the quality of synthetic speech and text-to-speech<\/title>\n<meta name=\"description\" content=\"Here are three innovative approaches that have been observed recently and that suggest significant progress in the field of synthetic voices.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Fresh approaches to improve the quality of synthetic speech and text-to-speech\" \/>\n<meta property=\"og:description\" content=\"Here are three innovative approaches that have been observed recently and that suggest significant progress in the field of synthetic voices.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/\" \/>\n<meta property=\"og:site_name\" content=\"LenseUp, video and audio solutions\" \/>\n<meta property=\"article:published_time\" content=\"2023-02-24T16:06:30+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-07-31T15:09:06+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1640\" \/>\n\t<meta property=\"og:image:height\" content=\"924\" \/>\n\t<meta 
property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"LenseUp\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"LenseUp\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/\",\"url\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/\",\"name\":\"Fresh approaches to improve the quality of synthetic speech and text-to-speech\",\"isPartOf\":{\"@id\":\"https:\/\/www.lenseup.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg\",\"datePublished\":\"2023-02-24T16:06:30+00:00\",\"dateModified\":\"2023-07-31T15:09:06+00:00\",\"author\":{\"@id\":\"https:\/\/www.lenseup.com\/en\/#\/schema\/person\/dadfed1f52570f3378a4679e8e398337\"},\"description\":\"Here are three innovative approaches that have been observed recently and that suggest significant progress in the field of synthetic 
voices.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#primaryimage\",\"url\":\"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg\",\"contentUrl\":\"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg\",\"width\":1640,\"height\":924},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Accueil\",\"item\":\"https:\/\/www.lenseup.com\/en\/home-oct-2021\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Fresh approaches to improve the quality of synthetic speech and text-to-speech\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.lenseup.com\/en\/#website\",\"url\":\"https:\/\/www.lenseup.com\/en\/\",\"name\":\"LenseUp, multilingual audio and video solutions\",\"description\":\"Audioguides, audio books, audio and video translations, multilingual chatbots... 
discover LenseUp.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.lenseup.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.lenseup.com\/en\/#\/schema\/person\/dadfed1f52570f3378a4679e8e398337\",\"name\":\"LenseUp\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.lenseup.com\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/630b0f43e55077cd2abe39e3e9e2a52c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/630b0f43e55077cd2abe39e3e9e2a52c?s=96&d=mm&r=g\",\"caption\":\"LenseUp\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Fresh approaches to improve the quality of synthetic speech and text-to-speech","description":"Here are three innovative approaches that have been observed recently and that suggest significant progress in the field of synthetic voices.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/","og_locale":"en_US","og_type":"article","og_title":"Fresh approaches to improve the quality of synthetic speech and text-to-speech","og_description":"Here are three innovative approaches that have been observed recently and that suggest significant progress in the field of synthetic voices.","og_url":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/","og_site_name":"LenseUp, video and audio 
solutions","article_published_time":"2023-02-24T16:06:30+00:00","article_modified_time":"2023-07-31T15:09:06+00:00","og_image":[{"width":1640,"height":924,"url":"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg","type":"image\/jpeg"}],"author":"LenseUp","twitter_card":"summary_large_image","twitter_misc":{"Written by":"LenseUp","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/","url":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/","name":"Fresh approaches to improve the quality of synthetic speech and text-to-speech","isPartOf":{"@id":"https:\/\/www.lenseup.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#primaryimage"},"image":{"@id":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#primaryimage"},"thumbnailUrl":"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg","datePublished":"2023-02-24T16:06:30+00:00","dateModified":"2023-07-31T15:09:06+00:00","author":{"@id":"https:\/\/www.lenseup.com\/en\/#\/schema\/person\/dadfed1f52570f3378a4679e8e398337"},"description":"Here are three innovative approaches that have been observed recently and that suggest significant progress in the field of synthetic 
voices.","breadcrumb":{"@id":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#primaryimage","url":"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg","contentUrl":"https:\/\/www.lenseup.com\/wp-content\/uploads\/2023\/02\/1.jpg","width":1640,"height":924},{"@type":"BreadcrumbList","@id":"https:\/\/www.lenseup.com\/en\/fresh-approaches-to-improve-synthetic-speech-and-text-to-speech\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Accueil","item":"https:\/\/www.lenseup.com\/en\/home-oct-2021\/"},{"@type":"ListItem","position":2,"name":"Fresh approaches to improve the quality of synthetic speech and text-to-speech"}]},{"@type":"WebSite","@id":"https:\/\/www.lenseup.com\/en\/#website","url":"https:\/\/www.lenseup.com\/en\/","name":"LenseUp, multilingual audio and video solutions","description":"Audioguides, audio books, audio and video translations, multilingual chatbots... 
discover LenseUp.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.lenseup.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.lenseup.com\/en\/#\/schema\/person\/dadfed1f52570f3378a4679e8e398337","name":"LenseUp","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.lenseup.com\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/630b0f43e55077cd2abe39e3e9e2a52c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/630b0f43e55077cd2abe39e3e9e2a52c?s=96&d=mm&r=g","caption":"LenseUp"}}]}},"_links":{"self":[{"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/posts\/10155"}],"collection":[{"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/comments?post=10155"}],"version-history":[{"count":4,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/posts\/10155\/revisions"}],"predecessor-version":[{"id":10984,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/posts\/10155\/revisions\/10984"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/media\/10159"}],"wp:attachment":[{"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/media?parent=10155"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/categories?post=10155"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lenseup.com\/en\/wp-json\/wp\/v2\/tags?post=10155"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}