Video Summarization Using OpenAI Whisper and Hugging Chat API



“Less is more,” as architect Ludwig Mies van der Rohe famously said, and that is what summarization means. Summarization is a vital tool for reducing voluminous textual content into succinct, relevant morsels suited to today’s fast-paced information consumption. In text applications, summarization aids information retrieval and supports decision-making. The integration of Generative AI, like OpenAI GPT-3-based models, has revolutionized this process by not only extracting key elements from text but also producing coherent summaries that retain the source’s essence. Interestingly, Generative AI’s capabilities extend beyond text to video summarization. This involves extracting pivotal scenes, dialogues, and concepts from videos, creating abridged representations of the content. You can achieve video summarization in many different ways, including generating a short summary video, performing video content analysis, highlighting key sections of the video, or creating a textual summary of the video using video transcription.

The OpenAI Whisper API leverages automatic speech recognition technology to convert spoken language into written text, thereby increasing the accuracy and efficiency of text summarization. The Hugging Face Chat API, on the other hand, gives access to state-of-the-art open-source large language models.

Learning Objectives

In this article, we will:

  • Learn about video summarization techniques
  • Understand the applications of video summarization
  • Explore the OpenAI Whisper model architecture
  • Learn to implement textual video summarization using OpenAI Whisper and the Hugging Chat API

This article was published as a part of the Data Science Blogathon.

Video Summarization Strategies

Video Analytics

Video analytics is the process of extracting meaningful information from a video. Deep learning is used to track and identify objects and actions in a video and to identify its scenes. Some of the common techniques for video summarization are:

Keyframe Extraction and Shot Boundary Detection

This process involves converting the video to a limited number of still images. Video skim is another term for this shorter video of keyshots.

Video shots are uninterrupted, continuous sequences of frames. Shot boundary detection finds transitions between shots, like cuts, fades, or dissolves, and chooses frames from each shot to build a summary. Below are the major steps to extract a continuous short video summary from a longer video:

  • Frame Extraction – Snapshots are extracted from the video; for example, we can take 1 frame per second from a 30 fps video.
  • Face and Emotion Detection – We then extract faces from the video and score the emotions of those faces. Faces are detected using SSD (Single Shot MultiBox Detector).
  • Frame Ranking & Selection – Select the frames that have a high emotion score and then rank them.
  • Final Extraction – We extract subtitles from the video along with their timestamps. We then extract the sentences corresponding to the frames selected above, together with their start and end times in the video. Finally, we merge the video parts corresponding to these intervals to generate the final summary video.
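The steps above can be sketched in plain Python. This is a minimal illustration rather than the pipeline of any particular library: the function names, the per-frame emotion scores, and the subtitle intervals are all hypothetical, and the face and emotion detection itself is assumed to have happened upstream.

```python
from typing import List, Tuple


def sample_frame_indices(total_frames: int, video_fps: int, target_fps: int = 1) -> List[int]:
    """Pick frame indices so that roughly `target_fps` frames per second are kept."""
    step = max(1, video_fps // target_fps)
    return list(range(0, total_frames, step))


def top_k_frames(scores: List[float], k: int) -> List[int]:
    """Return the positions of the k highest-scoring sampled frames, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def merge_intervals(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Merge overlapping (start, end) intervals, in seconds, into a sorted list."""
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous clip, so extend it instead of adding a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three seconds of 30 fps video sampled at 1 fps
indices = sample_frame_indices(total_frames=90, video_fps=30)
# Hypothetical emotion scores, one per sampled frame
best = top_k_frames([0.2, 0.9, 0.5], k=2)
# Hypothetical subtitle intervals belonging to the selected frames
clips = merge_intervals([(4.0, 7.5), (1.0, 3.0), (2.5, 5.0)])
```

Merging the intervals before cutting avoids duplicating footage when two selected frames fall inside overlapping subtitle spans.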

Action Recognition and Temporal Subsampling

Here we try to identify the human actions performed in the video, which is a widely used application of video analytics. We break the video down into small subsequences instead of single frames and try to estimate the action performed in each segment using classification and pattern recognition techniques such as HMC (Hidden Markov Chain) analysis.
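Breaking a video into subsequences amounts to computing frame ranges. The sketch below is a minimal illustration under assumed parameters: `split_into_segments`, `segment_len`, and `stride` are hypothetical names, and the action classifier that would consume each segment is not shown.

```python
from typing import List, Tuple


def split_into_segments(n_frames: int, segment_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, for fixed-length segments."""
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    if n_frames <= 0:
        return []
    segments = []
    # The last start is chosen so that every segment fits inside the video
    for start in range(0, max(n_frames - segment_len, 0) + 1, stride):
        segments.append((start, min(start + segment_len, n_frames)))
    return segments


# Ten frames, four-frame segments, a new segment every three frames
segments = split_into_segments(n_frames=10, segment_len=4, stride=3)
```

A stride smaller than the segment length makes neighbouring segments overlap, which helps when an action straddles a segment boundary.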

Single and Multi-modal Approaches

In this article we use a single-modal approach, in which the audio of the video is used to create a textual summary. Only one aspect of the video, its audio, is used: we convert it to text and then summarize that text.

In a multi-modal approach, we combine information from several modalities, such as audio, visuals, and text, to gain a holistic understanding of the video content for more accurate summarization.
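The single-modal approach can be pictured as a three-stage pipeline. The sketch below is illustrative only: `summarize_video` is a hypothetical helper, and the three stages are passed in as callables so the example runs with simple stand-ins instead of real download, speech-to-text, and language-model calls.

```python
from typing import Callable


def summarize_video(
    video_id: str,
    download: Callable[[str], str],
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
) -> str:
    """Single-modal pipeline: video -> audio file -> transcript -> summary."""
    audio_path = download(video_id)
    transcript = transcribe(audio_path)
    return summarize(transcript)


# Stand-in stages so the sketch runs without network access or models
summary = summarize_video(
    "demo-id",
    download=lambda vid: f"audio/{vid}.m4a",
    transcribe=lambda path: f"transcript of {path}",
    summarize=lambda text: text.upper(),
)
```

Keeping the stages separate makes it easy to swap one of them, for example a different speech recognizer, without touching the rest.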

Applications of Video Summarization

Before diving into the implementation, we should first know where video summarization is applied. Below are some examples across a variety of fields and domains:

  • Security and Surveillance: Video summarization lets us analyze large amounts of surveillance video and surface important events without manually reviewing the footage.
  • Education and Training: Key notes can be delivered alongside training videos, so students can revise the video contents without going through the whole video.
  • Content Browsing: YouTube uses this to highlight the part of a video relevant to a user's search, so that users can decide whether they want to watch that particular video.
  • Disaster Management: In emergencies and crises, video summarization makes it possible to take action based on the situations highlighted in the video summary.

OpenAI Whisper Model Overview

OpenAI's Whisper is an automatic speech recognition (ASR) model. It is used for transcribing speech audio into text.

Architecture of the OpenAI Whisper Model

It is based on the transformer architecture, which stacks encoder and decoder blocks with an attention mechanism that propagates information between them. It takes the audio recording, divides it into 30-second chunks, and processes each one individually. For each 30-second chunk, the encoder encodes the audio and preserves the position of each word spoken, and the decoder uses this encoded information to determine what was said.

The decoder predicts tokens from all of this information, which are roughly the words spoken. It then repeats this process for the following word, using all of the same information to help it identify the next one that makes the most sense.
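The 30-second chunking comes down to simple arithmetic on sample offsets. The sketch below shows only that arithmetic, with `chunk_bounds` as a hypothetical helper; it assumes the 16 kHz sample rate Whisper works with, and the library's own windowing (padding short chunks, shifting windows by predicted timestamps) is more involved.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Whisper resamples input audio to 16 kHz
CHUNK_SECONDS = 30     # window length described above


def chunk_bounds(n_samples: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets, end exclusive, for 30-second windows."""
    chunk = SAMPLE_RATE * CHUNK_SECONDS
    return [(start, min(start + chunk, n_samples)) for start in range(0, n_samples, chunk)]


# 70 seconds of audio: two full windows and one 10-second remainder
bounds = chunk_bounds(70 * SAMPLE_RATE)
```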

Whisper model task flowchart

Coding Instance for Video Textual Summarization

Flowchart of Textual Video Summarization

1 – Install and Load Libraries

!pip install yt-dlp openai-whisper hugchat
import yt_dlp
import whisper
from hugchat import hugchat

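2 – Download audio from the YouTube video

# Function for saving the audio of a YouTube video, given its video ID
def download(video_id: str) -> str:
    # Assumes the standard YouTube watch-URL format
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        # Extract the audio track and store it as m4a (requires FFmpeg)
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'

# Call the function with the bare video ID (no extra URL parameters such as &t=99s,
# otherwise the returned path would not match the saved file name)
file_path = download('A_JQK_k4Kyc')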

3 – Transcribe audio to text using Whisper

# Load the Whisper model ("tiny" is the smallest and fastest variant)
whisper_model = whisper.load_model("tiny")

# Transcribe an audio file and return the recognized text
def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True` (half precision, intended for GPU);
    # setting it to `False` runs the model in FP32, which works on CPU.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']

# Call the transcribe function with the file path of the audio
transcript = transcribe('/content/audio/A_JQK_k4Kyc.m4a')

4 – Summarize transcribed text using Hugging Chat

Note: to use the Hugging Chat API, we need to log in or sign up on the Hugging Face platform. After that, in place of “username” and “password”, we need to pass in our Hugging Face credentials.

from hugchat.login import Login

# Log in and save the session cookies
sign = Login("username", "password")
cookies = sign.login()
sign.saveCookiesToDir("/content")

# Load cookies from the saved directory
cookies = sign.loadCookiesFromDir("/content") # This will detect if the JSON file exists, return cookies if it does and raise an Exception if it doesn't.

# Create a ChatBot
chatbot = hugchat.ChatBot(cookies=cookies.get_dict())  # or cookie_path="usercookies/<email>.json"

# Summarize the transcript
print(chatbot.chat('''Summarize the following :-'''+transcript))
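Chat models accept only a bounded amount of input, so a long transcript may not fit in a single message. One way around this, sketched here under assumptions, is to split the transcript and build one prompt per piece. `split_transcript` and `build_prompts` are hypothetical helpers, and the default `max_chars` of 2000 is an arbitrary placeholder, not a documented Hugging Chat limit.

```python
from typing import List


def split_transcript(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into chunks of roughly at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # Adding this word would exceed the budget, so close the current chunk
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompts(text: str, max_chars: int = 2000) -> List[str]:
    """Prefix each chunk with the summarization instruction."""
    return [f"Summarize the following :- {chunk}" for chunk in split_transcript(text, max_chars)]


prompts = build_prompts("one two three four five", max_chars=9)
```

Each prompt could then be sent to the chatbot in turn, and the partial summaries combined and summarized once more.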


In conclusion, the concept of summarization is a transformative force in information management. It is a powerful tool that distills voluminous content into concise, meaningful forms, tailored to the fast-paced consumption of today’s world.

Through the integration of Generative AI models like OpenAI’s GPT-3, summarization has transcended its traditional boundaries, evolving into a process that not only extracts but also generates coherent and contextually accurate summaries.

The journey into video summarization unveils its relevance across various sectors. The implementation shows how audio extraction, transcription using Whisper, and summarization through Hugging Face Chat can be seamlessly integrated to create textual video summaries.

Key Takeaways

1. Generative AI: Video summarization can be achieved using generative AI technologies such as LLMs and ASR.

2. Applications in Fields: Video summarization is useful in many important fields where one has to analyze large amounts of video to mine important information.

3. Basic Implementation: In this article, we explored a basic code implementation of video summarization based on the audio dimension.

4. Model Architecture: We also learned about the basic architecture of the OpenAI Whisper model and its process flow.

Frequently Asked Questions

Q1. What are the limits of the Whisper API?

A. The Whisper API call limit is 50 requests per minute. There is no audio length limit, but only files up to 25 MB can be uploaded. The file size of the audio can be reduced by lowering its bitrate.
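As a rough worked example of the bitrate advice above, the highest average bitrate that fits a recording under the size limit can be estimated from its duration. This sketch is a back-of-the-envelope calculation: `max_bitrate_kbps` is a hypothetical helper, it assumes 1 MB means 10^6 bytes, and it ignores container overhead.

```python
def max_bitrate_kbps(duration_seconds: float, size_limit_mb: float = 25.0) -> float:
    """Highest average bitrate (kbit/s) that keeps a file under the size limit."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    limit_bits = size_limit_mb * 1_000_000 * 8  # assumes 1 MB = 10^6 bytes
    return limit_bits / duration_seconds / 1000


# A one-hour recording
kbps = max_bitrate_kbps(3600)
```

For a one-hour file this comes out to about 55 kbit/s, so in practice one would pick a somewhat lower bitrate to leave headroom.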

Q2. Which file formats does the Whisper API support?

A. The following file formats: m4a, mp3, webm, mp4, mpga, wav, mpeg

Q3. What are the alternatives to the Whisper API?

A. Some of the major alternatives for automatic speech recognition are Twilio Voice, Deepgram, Azure Speech-to-Text, and Google Cloud Speech-to-Text.

Q4. What are the limitations of Automatic Speech Recognition (ASR) systems?

A. Limitations include difficulty in comprehending diverse accents of the same language and the need for specialized training for applications in specialized fields.

Q5. What are the alternatives to Automatic Speech Recognition (ASR)?

A. Advanced research is taking place in the field of speech recognition, such as decoding imagined speech from EEG signals using neural architectures. This allows people with speech disabilities to communicate their thoughts to the outside world with the help of devices.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
