How to Clone Voice and Lip-Sync a Video Like a Pro Using Open-source Tools #Imaginations Hub



AI voice-cloning has taken social media by storm. It has opened a world of creative possibilities. You must have seen memes or AI voice-overs of famous personalities on social media. Have you ever wondered how it’s done? Sure, many platforms such as Eleven Labs provide APIs, but can we do it for free, using open-source software? The short answer is YES. The open-source ecosystem has TTS models and lip-syncing tools for voice synthesis. So, in this article, we will explore open-source tools and models for voice-cloning and lip-syncing.

Learning Objectives

  • Explore open-source tools for AI voice-cloning and lip-syncing.
  • Use FFmpeg and Whisper to transcribe videos.
  • Use Coqui-AI’s xTTS model to clone a voice.
  • Use Wav2Lip for lip-syncing videos.
  • Explore real-world use cases of this technology.

This article was published as a part of the Data Science Blogathon.

Open-Source Stack

As you already know, we will use OpenAI’s Whisper, FFmpeg, Coqui-ai’s xTTS model, and Wav2Lip as our tech stack. But before delving into the code, let’s briefly discuss these tools. Thanks also go to the authors of these projects.

Whisper: Whisper is OpenAI’s ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer trained on 680k hours of diverse audio data and corresponding transcripts, which makes it very capable at multilingual transcription.

The encoder receives the log-mel spectrogram of 30-second chunks of audio. Each encoder block uses self-attention to model different parts of the audio signal. The decoder then receives the encoder’s hidden states along with learned positional encodings, and uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
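To make the 30-second chunking concrete, here is a small sketch of the arithmetic involved. The constants mirror the values used in Whisper’s audio preprocessing (16 kHz sampling, a hop of 160 samples between spectrogram frames, 30-second windows). The function names are my own and are not part of the Whisper API, and the window count is only a lower bound, since the actual transcription loop advances by predicted timestamps rather than in fixed steps.

```python
import math

SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30     # each encoder input covers 30 seconds of audio
HOP_LENGTH = 160       # samples between spectrogram frames (10 ms)

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS     # samples in one window
FRAMES_PER_CHUNK = SAMPLES_PER_CHUNK // HOP_LENGTH  # spectrogram frames in one window


def min_windows(duration_seconds: float) -> int:
    """Lower bound on the number of 30-second windows needed to cover the audio."""
    return max(1, math.ceil(duration_seconds / CHUNK_SECONDS))


def padding_seconds(duration_seconds: float) -> float:
    """Silence needed to fill the last window if windows were laid end to end."""
    return min_windows(duration_seconds) * CHUNK_SECONDS - duration_seconds


print(SAMPLES_PER_CHUNK, FRAMES_PER_CHUNK)
print(min_windows(95), padding_seconds(95))
```

A 95-second clip therefore needs at least four encoder passes, with the last window mostly padding.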

Coqui TTS: TTS is an open-source library from Coqui-ai. It hosts multiple text-to-speech models: end-to-end models like Bark, Tortoise, and xTTS; spectrogram models like Glow-TTS and FastSpeech; and vocoders like HiFi-GAN and MelGAN. It also provides a unified API for inference, fine-tuning, and training of text-to-speech models. In this project, we will use xTTS, an end-to-end multilingual voice-cloning model. It supports 16 languages, including English, Japanese, Hindi, and Mandarin. For more information, refer to the official TTS repository.

Wav2Lip: Wav2Lip is a Python repository for the paper “A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild.” It trains its generator against a pre-trained lip-sync discriminator that judges whether the lip movements match the audio, which works well for dubbing voices. For more information, refer to the official repository. We will use this forked repository of Wav2Lip.


Now that we are familiar with the tools and models we will use, let’s understand the workflow. It is a simple one. Here is what we will do.

  • Upload a video to the Colab runtime and resize it to 720p for better lip-syncing.
  • Use FFmpeg to extract 24-bit audio from the video, then use Whisper to transcribe the audio file.
  • Use Google Translate or an LLM to translate the transcript into another language.
  • Load the multilingual xTTS model with the TTS library and pass it the script and the reference audio for voice synthesis.
  • Clone the Wav2Lip repository and download the model checkpoints. Run the inference script to sync the original video with the synthesized audio.

Now, let’s delve into the code.

Step 1: Install Dependencies

This project requires significant RAM and GPU resources, so it is prudent to use a Colab runtime. The free tier of Colab provides about 12 GB of CPU RAM and a T4 GPU with 15 GB of memory, which should be enough for this project. So, head over to Colab and connect to a GPU runtime.

Now, install TTS and Whisper.

!pip install TTS
!pip install git+ 

Step 2: Upload Videos to Colab

Now, we will upload a video and resize it to 720p. Wav2Lip tends to perform better when videos are in 720p. This can be done using FFmpeg.

#@title Upload Video

from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
  global uploaded
  global video_path  # Declare video_path as global to modify it
  uploaded = files.upload()
  for filename in uploaded.keys():
    print(f'Uploaded {filename}')
    if resize_to_720p:
        filename = resize_video(filename)  # Get the name of the resized video
    video_path = filename  # Update video_path with either original or resized filename
    return filename

def resize_video(filename):
    output_filename = f"resized_{filename}"
    # scale=-2:720 keeps the aspect ratio and forces an even width, which most encoders require
    cmd = f"ffmpeg -i '{filename}' -vf 'scale=-2:720' '{output_filename}'"
    subprocess.run(cmd, shell=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename

# Create a form button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
  global video_path
  global resize_to_720p
  with output:
    resize_to_720p = checkbox.value
    video_path = upload_video()

# Register the callback so that clicking the button starts the upload
button.on_click(on_button_clicked)
display(checkbox, button, output)

This will output a form button for uploading videos from a local device and a checkbox for enabling 720p resizing. You can also upload a video manually to the current Colab session and resize it using a subprocess.
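For the manual route, a minimal sketch might look like the following. It passes FFmpeg an argument list instead of a shell string, so file names with spaces or quotes need no extra escaping. The helper names and the resized_ prefix are illustrative choices, and FFmpeg is assumed to be on the PATH, as it is on Colab.

```python
import shlex
import subprocess
from pathlib import Path


def build_resize_command(src: str, height: int = 720) -> list[str]:
    """Return an FFmpeg argument list that scales `src` to `height` pixels tall."""
    src_path = Path(src)
    dst = src_path.with_name(f"resized_{src_path.name}")
    # scale=-2:H keeps the aspect ratio and rounds the width to an even number
    return ["ffmpeg", "-y", "-i", str(src_path), "-vf", f"scale=-2:{height}", str(dst)]


def resize_video(src: str, height: int = 720) -> str:
    """Run FFmpeg and return the path of the resized file."""
    cmd = build_resize_command(src, height)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)  # raises CalledProcessError if FFmpeg fails
    return cmd[-1]


print(shlex.join(build_resize_command("my clip.mp4")))
```

Calling resize_video("my clip.mp4") would then write resized_my clip.mp4 next to the original, and the returned path can be assigned to video_path.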

Step 3: Audio Extraction and Whisper Transcription

Now that we have our video, the next step is to extract the audio using FFmpeg and transcribe it with Whisper.

# @title Audio extraction (24 bit) and whisper conversion
import subprocess

# Ensure video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    ffmpeg_command = f"ffmpeg -i '{video_path}' -acodec pcm_s24le -ar 48000 -q:a 0 -map a -y 'output_audio.wav'"
    subprocess.run(ffmpeg_command, shell=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")

whisper_text = result["text"]
whisper_language = result['language']

print("Whisper text:", whisper_text)

This will extract the audio from the video in 24-bit format and use the Whisper base model to transcribe it. For better transcription, use the Whisper small or medium models.
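Besides "text" and "language", the dictionary returned by Whisper’s transcribe also carries a "segments" list whose entries have "start", "end", and "text" fields. As a rough sketch, the helper below turns such a list into timestamped lines, which makes it easier to check the transcript against the video before translating. The function names and the sample data are made up for illustration; in the notebook you would pass the real segments list instead.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_lines(segments: list[dict]) -> list[str]:
    """Build one '[start -> end] text' line per transcribed segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


# Stand-in for the real segments list; these values are invented
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 65.04, "text": " This is a test."},
]
for line in segments_to_lines(sample_segments):
    print(line)
```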

Step 4: Voice Synthesis

Now, to the voice cloning part. As mentioned before, we will use Coqui-ai’s xTTS model. It is one of the best open-source models out there for voice synthesis. Coqui-ai also provides many TTS models for different purposes; do check them out. For our use case, which is voice-cloning, we will use the xTTS v2 model.

Load the xTTS model. This is a large model, about 1.87 GB in size, so it will take a while.

# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display  # Import the Audio and display modules

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

xTTS currently supports 16 languages, each selected by its ISO language code.

Note: Languages like English and French do not have a character limit, whereas Hindi has a character limit of 250. A few other languages might have a limit as well.

For this project, we will use Hindi; you can experiment with other languages as well.

So, the first thing we need is to translate the transcribed text into Hindi. This can be done either with the Google Translate package or with an LLM. In my experience, GPT-3.5-Turbo performs much better than Google Translate. We can use the OpenAI API to get our translation.

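import openai

client = openai.OpenAI(api_key = "api_key")
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"translate the texts to Hindi {whisper_text}"}
  ]
)
# .content holds the translated string; .message alone is a message object
translated_text = completion.choices[0].message.content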

As we know, Hindi has a character limit, so we need to pre-process the text before passing it to the TTS model. We need to split the text into chunks of fewer than 250 characters.

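text_chunks = translated_text.split(sep = "।")
final_chunks = [""]
for chunk in text_chunks:
  if not chunk.strip():
    continue  # skip the empty piece left after the last danda
  chunk = chunk.strip() + "।"
  if not final_chunks[-1] or len(final_chunks[-1]) + len(chunk) < 250:
    final_chunks[-1] += chunk
  else:
    final_chunks.append(chunk)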

This is a very simple splitter. You can create a different one or use LangChain’s recursive text splitter. Now, we will pass each chunk to the TTS model. The resulting audio files will be merged using FFmpeg.
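As one possible "different" splitter, the sketch below packs sentences greedily and also breaks any single sentence that is itself longer than the limit at word boundaries, a case the simple splitter does not handle. The function names and defaults are my own. It assumes sentences end with the danda (।) and appends one to the final sentence if it is missing; a single word longer than the limit is left as it is.

```python
def _split_words(sentence: str, limit: int) -> list[str]:
    """Break one over-long sentence into pieces shorter than `limit` at spaces."""
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) < limit:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_for_tts(text: str, limit: int = 250, sep: str = "।") -> list[str]:
    """Greedily pack sentences into chunks shorter than `limit` characters."""
    sentences = [s.strip() + sep for s in text.split(sep) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        pieces = [sentence] if len(sentence) < limit else _split_words(sentence, limit)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) < limit:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("one. two. three.", limit=10, sep="."))
```

The resulting list can be used in place of final_chunks in the synthesis loop that follows.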

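def audio_synthesis(text, file_name):
  tts.tts_to_file(
      text,
      speaker_wav='output_audio.wav',  # reference voice: the audio extracted in Step 3
      file_path=file_name,
      language="hi"
  )
  return file_name
file_names = []
for i in range(len(final_chunks)):
    file_name = audio_synthesis(final_chunks[i], f"output_synth_audio_{i}.wav")
    file_names.append(file_name)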

As all the files share the same codec, we can easily merge them with FFmpeg. To do this, create a text file (my_files.txt here) and add the file paths.

# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
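Rather than typing this list by hand, it can be generated from the file names collected during synthesis. The helper below is a sketch with a name of my own choosing. It writes one file line per input and escapes single quotes in the way FFmpeg’s quoting rules expect (close the quote, add an escaped quote, reopen).

```python
from pathlib import Path


def write_concat_list(file_names: list[str], list_path: str = "my_files.txt") -> str:
    """Write an FFmpeg concat list with one `file '<path>'` line per input."""
    lines = []
    for name in file_names:
        # Inside single quotes, a literal quote must be written as '\''
        escaped = name.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return list_path


# Example with the three chunk files shown above
example_names = [f"output_synth_audio_{i}.wav" for i in range(3)]
print(Path(write_concat_list(example_names)).read_text(encoding="utf-8"))
```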

Now, run the code below to merge the files.

import subprocess

cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)

This will output the final concatenated audio file. You can also play the audio in Colab.

from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))
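Before moving on, it can be worth checking that the merged audio is roughly as long as the video, since a large mismatch tends to show up later as poor lip-sync. The sketch below reads the duration from the WAV header with Python’s standard wave module. The function name is my own, and it assumes a plain PCM WAV file; other encodings make wave.open raise wave.Error.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Length of a PCM WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

For example, wav_duration_seconds("final_output_synth_audio_hi.wav") gives a number to compare against the length of the original clip.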

Step 5: Lip-Syncing

Now, to the lip-syncing part. To lip-sync our synthetic audio with the original video, we will use the Wav2Lip repository. To use Wav2Lip, we need to download the model checkpoints. But before that, if you are on a T4 GPU runtime, delete the xTTS and Whisper models in the current Colab session, or restart the session, to free GPU memory.

import torch

try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

# Release the cached GPU memory held by the deleted models
torch.cuda.empty_cache()


Now, clone the Wav2Lip repository and download the checkpoints.

# @title Dependencies
%cd /content/

!git clone
!cd Wav2Lip && pip install -r requirements_colab.txt

%cd /content/Wav2Lip

!wget ' 
/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'

!wget ' 
/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'

!wget ' 
/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'

!pip install batch-face

Wav2Lip has two models for lip-syncing: wav2lip and wav2lip_gan. According to the authors, the GAN model requires less effort in face detection but produces slightly inferior results. In contrast, the non-GAN model can produce better results with more manual padding and rescaling of the detection box. You can try out both and see which one does better.

Run the inference with the model checkpoint path, video, and audio files.

%cd /content/Wav2Lip

# This is the detection box padding; adjust it in case of poor results.
# Usually, the bottom one is the biggest issue
pad_top =  0
pad_bottom =  15
pad_left =  0
pad_right =  0
rescaleFactor =  1

video_path_fix = f"'../{video_path}'"

!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' \
--face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
--pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor --nosmooth \
--outfile '/content/output_video.mp4'

This will output a lip-synced video. If the video doesn’t look good, adjust the parameters and retry.
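To make retries less tedious, the command can be assembled by a small helper and run once per padding value, with each result written to its own file. This is only a sketch: the helper name, the input file name, and the output naming are hypothetical, the script name inference.py is taken from the Wav2Lip repository layout, and the flags are the ones used in the command above.

```python
import shlex


def build_wav2lip_command(face, audio, outfile,
                          checkpoint="checkpoints/wav2lip_gan.pth",
                          pads=(0, 15, 0, 0), resize_factor=1, nosmooth=True):
    """Assemble the Wav2Lip inference command as an argument list."""
    top, bottom, left, right = pads
    cmd = ["python", "inference.py",
           "--checkpoint_path", checkpoint,
           "--face", face,
           "--audio", audio,
           "--pads", str(top), str(bottom), str(left), str(right),
           "--resize_factor", str(resize_factor),
           "--outfile", outfile]
    if nosmooth:
        cmd.append("--nosmooth")
    return cmd


# Try a few bottom paddings; "../input.mp4" is a placeholder for your video
for bottom in (10, 15, 20):
    cmd = build_wav2lip_command(
        "../input.mp4",
        "/content/final_output_synth_audio_hi.wav",
        f"/content/output_video_pad{bottom}.mp4",
        pads=(0, bottom, 0, 0),
    )
    print(shlex.join(cmd))
```

Each printed command can be pasted after a ! in a Colab cell, or passed to subprocess.run from inside the Wav2Lip directory.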

So, here is the repository for the notebook and a few samples.

GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync

Real-world Use Cases

Video voice-cloning and lip-syncing technology has a wide range of use cases across industries. Here are a few cases where it can be helpful.

Entertainment: The entertainment industry will be the most affected of all. We are already witnessing the change. Voices of celebrities of current and bygone eras can be synthesized and reused. This also poses ethical challenges. Synthesized voices should be used responsibly and within the limits of the law.

Marketing: Personalized ad campaigns with familiar and relatable voices can greatly enhance brand appeal.

Communication: Language has always been a barrier to all kinds of activities, and cross-language communication is still a challenge. Real-time end-to-end translation that preserves one’s accent and voice will revolutionize the way we communicate. This may become a reality in a few years.

Content Creation: Content creators will no longer depend on translators to reach a bigger audience. With efficient voice cloning and lip-syncing, cross-language content creation will be easier. Podcast and audiobook narration can also be enhanced with voice synthesis.


Voice synthesis is one of the most sought-after use cases of generative AI. It has the potential to revolutionize the way we communicate. Ever since the dawn of civilization, the language barrier between communities has been a hurdle to forging deeper relationships, culturally and commercially. With AI voice synthesis, this gap can be filled. So, in this article, we explored the open-source way of voice-cloning and lip-syncing.

Key Takeaways

  • TTS, a Python library by Coqui-ai, serves and maintains popular text-to-speech models.
  • xTTS is a multilingual voice-cloning model capable of cloning a voice into 16 different languages.
  • Whisper is an ASR model from OpenAI for efficient transcription and English translation.
  • Wav2Lip is an open-source tool for lip-syncing videos.
  • Voice cloning is one of the most active frontiers of generative AI, with significant potential impact on industries from entertainment to marketing.

Frequently Asked Questions

Q1. Is AI voice cloning legal?

A. Cloning a voice can be illegal if it infringes on copyright. However, getting permission from the person before cloning is the right way to go about it.

Q2. Is AI voice cloning free?

A. Most AI voice cloning API services charge fees. However, some open-source models offer fairly decent voice synthesis capability.

Q3. What is the best voice cloning model?

A. This depends on the particular use case. The xTTS model is a good choice for multilingual voice synthesis. But for more languages, Meta’s Fairseq models might be preferable.

Q4. Can AI clone celebrity voices?

A. Yes, it is possible to clone the voice of a celebrity. However, keep in mind that any misuse can land you in legal trouble.

Q5. What is the use of voice cloning?

A. Voice cloning can be helpful for a range of use cases, such as content creation, narration in games and movies, ad campaigns, etc.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
