Install the required libraries

! pip install git+https://github.com/openai/whisper.git -q

! pip install gradio -q

import gradio as gr 
import os
import sys
import subprocess
#from moviepy.editor import VideoFileClip

import whisper

model = whisper.load_model("medium")

100%|██████████████████████████████████████| 1.42G/1.42G [00:14<00:00, 108MiB/s]

def video2mp3(video_file, output_ext="mp3"):
    filename, ext = os.path.splitext(video_file)
    subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"], 
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.STDOUT)
    return f"{filename}.{output_ext}"
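The helper above discards ffmpeg's output and ignores its exit status, so a failed conversion only shows up later as a missing file. A minimal sketch of a stricter variant follows; the names `build_audio_extract_cmd` and `video2mp3_checked` are hypothetical and not part of the original notebook, and it assumes `ffmpeg` is on the PATH.

```python
import os
import subprocess


def build_audio_extract_cmd(video_file, output_ext="mp3"):
    """Return the ffmpeg argument list and the output path (hypothetical helper)."""
    base, _ = os.path.splitext(video_file)
    output_file = f"{base}.{output_ext}"
    # -y overwrites an existing output file instead of prompting.
    return ["ffmpeg", "-y", "-i", video_file, output_file], output_file


def video2mp3_checked(video_file, output_ext="mp3"):
    """Same conversion as video2mp3, but raise if ffmpeg exits with an error."""
    cmd, output_file = build_audio_extract_cmd(video_file, output_ext)
    # check=True raises subprocess.CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.STDOUT, check=True)
    return output_file
```

Keeping the command construction separate from the call makes it possible to inspect the arguments without running ffmpeg.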

audio_file = "Speech_Martin_Luther_King-1963-08-26.mp3"

from IPython.display import Audio
Audio(audio_file)

def translate(audio_file):
    
    options = dict(beam_size=5, best_of=5)
    translate_options = dict(task="translate", **options)
    result = model.transcribe(audio_file,**translate_options)
    return result
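The `task` option is what switches Whisper between transcribing in the source language and translating into English. A small sketch of assembling these keyword arguments in one place follows; `build_whisper_options` is a hypothetical helper, not part of Whisper or of the original notebook, and the defaults simply mirror the values used in `translate` above.

```python
def build_whisper_options(task="translate", beam_size=5, best_of=5, language=None):
    """Assemble keyword arguments for model.transcribe (hypothetical helper)."""
    # Whisper accepts "transcribe" (keep the source language)
    # or "translate" (output English).
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    options = dict(task=task, beam_size=beam_size, best_of=best_of)
    # Leave the language out to let the model detect it from the audio.
    if language is not None:
        options["language"] = language
    return options


# Usage, assuming `model` and `audio_file` are defined as in the cells above:
# result = model.transcribe(audio_file, **build_whisper_options(task="transcribe"))
```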

result = translate(audio_file)

print(result["text"])

 I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation. Five score years ago, a great American in whose symbolic shadow we stand today signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity. But 100 years later, the Negro still is not free. 100 years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. 100 years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. 100 years later, the Negro is still languished in the corners of American society and finds himself in exile in his own land. And so we've come here today to dramatize a shameful condition. I say to you today, my friend, so even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream. I have a dream that one day this nation will rise up and live out the true meaning of its creed. We hold these truths to be self-evident that all men are created equal. I have a dream that one day on the Red Hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood. I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice. I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today. 
I have a dream that one day down in Alabama with its vicious racists, with its governor having his lips dripping with the words of interposition and nullification, one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today. I have a dream that one day every valley shall be exalted and every hill and mountain shall be made low. The rough places will be made plain and the crooked places will be made straight and the glory of the Lord shall be revealed and all flesh shall see it together. This is our hope. So let freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty mountains of New York. Let freedom ring from the heightening alligators of Pennsylvania. Let freedom ring from the snow-capped Rockies of Colorado. Let freedom ring from the curvaceous slopes of California. But not only that, let freedom ring from Stone Mountain of Georgia. Let freedom ring from Lookout Mountain of Tennessee. Let freedom ring from every hill and mole hill of Mississippi, from every mountainside. Let freedom ring and when this happens, when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual, free at last, free at last. Thank God Almighty, we are free at last.

from whisper.utils import write_vtt

output_dir = '/content/'
# Strip only the final extension; split(".")[0] would cut the name at its first dot.
audio_path = os.path.splitext(audio_file)[0]

with open(os.path.join(output_dir, audio_path + ".vtt"), "w") as vtt:
    write_vtt(result["segments"], file=vtt)

subtitle = audio_path + ".vtt"
output_video = audio_path + "_subtitled.mp4"

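# input_video is not defined in any cell shown above; it must hold the path of the source video before this cell runs.
os.system(f"ffmpeg -i {input_video} -vf subtitles={subtitle} {output_video}")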


# subprocess.call(["ffmpeg", "-i", input_video , "-vf", f"subtitles={subtitle}", f"{output_video}"], 
#                 stdout=subprocess.DEVNULL,
#                 stderr=subprocess.STDOUT)

0
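To make the subtitle file less of a black box, here is a minimal sketch of how a list of segments can be turned into WebVTT text. It assumes each segment is a dict with `start` and `end` in seconds and a `text` string, as in `result["segments"]`. The functions `format_timestamp` and `segments_to_vtt` are illustrative stand-ins written for this notebook, not Whisper's own implementation, and the exact output of `write_vtt` may differ in details.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm for a WebVTT cue."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{milliseconds:03d}"


def segments_to_vtt(segments):
    """Build WebVTT text: a header line, then one cue per segment."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


print(segments_to_vtt([{"start": 0.0, "end": 2.5, "text": " I have a dream."}]))
```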

Upload a file from the local machine

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving gty_march_on_washington_martin_luther_king_ll_130819_16x9_992.jpg to gty_march_on_washington_martin_luther_king_ll_130819_16x9_992.jpg
User uploaded file "gty_march_on_washington_martin_luther_king_ll_130819_16x9_992.jpg" with length 133849 bytes

subtitle = "Speech_Martin_Luther_King-1963-08-26.vtt"
subtitle

'Speech_Martin_Luther_King-1963-08-26.vtt'

!ffmpeg -loop 1 -framerate 25 -i gty_march_on_washington_martin_luther_king_ll_130819_16x9_992.jpg -i Speech_Martin_Luther_King-1963-08-26.mp3 -c:a aac \
-vf subtitles={subtitle}  -movflags +faststart -shortest output.mp4

ffmpeg version 3.4.11-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared
  libavutil      55. 78.100 / 55. 78.100
  libavcodec     57.107.100 / 57.107.100
  libavformat    57. 83.100 / 57. 83.100
  libavdevice    57. 10.100 / 57. 10.100
  libavfilter     6.107.100 /  6.107.100
  libavresample   3.  7.  0 /  3.  7.  0
  libswscale      4.  8.100 /  4.  8.100
  libswresample   2.  9.100 /  2.  9.100
  libpostproc    54.  7.100 / 54.  7.100
Input #0, image2, from 'gty_march_on_washington_martin_luther_king_ll_130819_16x9_992.jpg':
  Duration: 00:00:00.04, start: 0.000000, bitrate: 26769 kb/s
    Stream #0:0: Video: mjpeg, yuvj420p(pc, bt470bg/unknown/unknown), 992x558 [SAR 1:1 DAR 16:9], 25 fps, 25 tbr, 25 tbn, 25 tbc
[mp3 @ 0x55c0cc0e0600] Estimating duration from bitrate, this may be inaccurate
Input #1, mp3, from 'Speech_Martin_Luther_King-1963-08-26.mp3':
  Metadata:
    title           : Track 02
    track           : 2
    TLEN            : 000000403400
  Duration: 00:06:43.46, start: 0.000000, bitrate: 320 kb/s
    Stream #1:0: Audio: mp3, 44100 Hz, stereo, s16p, 320 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (mjpeg (native) -> h264 (libx264))
  Stream #1:0 -> #0:1 (mp3 (native) -> aac (native))
Press [q] to stop, [?] for help
[Parsed_subtitles_0 @ 0x55c0cc0d8fa0] Shaper: FriBidi 0.19.7 (SIMPLE) HarfBuzz-ng 1.7.2 (COMPLEX)
[Parsed_subtitles_0 @ 0x55c0cc0d8fa0] Using font provider fontconfig
[Parsed_subtitles_0 @ 0x55c0cc0d8fa0] fontselect: (Arial, 400, 0) -> /usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf, 0, LiberationSans
[libx264 @ 0x55c0cc0fb200] using SAR=1/1
[libx264 @ 0x55c0cc0fb200] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
[libx264 @ 0x55c0cc0fb200] profile High, level 3.1
[libx264 @ 0x55c0cc0fb200] 264 - core 152 r2854 e9a5903 - H.264/MPEG-4 AVC codec - Copyleft 2003-2017 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=3 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'output.mp4':
  Metadata:
    encoder         : Lavf57.83.100
    Stream #0:0: Video: h264 (libx264) (avc1 / 0x31637661), yuvj420p(pc, progressive), 992x558 [SAR 1:1 DAR 16:9], q=-1--1, 25 fps, 12800 tbn, 25 tbc
    Metadata:
      encoder         : Lavc57.107.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
    Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s
    Metadata:
      encoder         : Lavc57.107.100 aac
[mp4 @ 0x55c0cc0e0c00] Starting second pass: moving the moov atom to the beginning of the file
frame=10136 fps= 78 q=-1.0 Lsize=   13555kB time=00:06:45.32 bitrate= 274.0kbits/s speed=3.12x    
video:6904kB audio:6318kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 2.512975%
[libx264 @ 0x55c0cc0fb200] frame I:41    Avg QP:17.75  size:154502
[libx264 @ 0x55c0cc0fb200] frame P:2554  Avg QP:21.20  size:   209
[libx264 @ 0x55c0cc0fb200] frame B:7541  Avg QP:31.33  size:    27
[libx264 @ 0x55c0cc0fb200] consecutive B-frames:  0.8%  0.0%  0.0% 99.2%
[libx264 @ 0x55c0cc0fb200] mb I  I16..4:  0.3% 88.7% 11.1%
[libx264 @ 0x55c0cc0fb200] mb P  I16..4:  0.0%  0.0%  0.0%  P16..4:  0.0%  0.0%  0.0%  0.0%  0.0%    skip:99.9%
[libx264 @ 0x55c0cc0fb200] mb B  I16..4:  0.0%  0.0%  0.0%  B16..8:  0.0%  0.0%  0.0%  direct: 0.0%  skip:100.0%  L0:41.5% L1:58.5% BI: 0.0%
[libx264 @ 0x55c0cc0fb200] 8x8 transform intra:86.2% inter:8.3%
[libx264 @ 0x55c0cc0fb200] coded y,uvDC,uvAC intra: 99.2% 0.0% 0.0% inter: 0.0% 0.0% 0.0%
[libx264 @ 0x55c0cc0fb200] i16 v,h,dc,p:  0%  0% 83% 16%
[libx264 @ 0x55c0cc0fb200] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 14% 13% 22%  6%  9%  8%  9%  8% 10%
[libx264 @ 0x55c0cc0fb200] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 21% 13%  9%  7% 10% 12% 10%  9% 10%
[libx264 @ 0x55c0cc0fb200] i8c dc,h,v,p: 100%  0%  0%  0%
[libx264 @ 0x55c0cc0fb200] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x55c0cc0fb200] ref P L0: 70.6% 18.5%  9.8%  1.1%
[libx264 @ 0x55c0cc0fb200] ref B L0: 81.9% 16.7%  1.5%
[libx264 @ 0x55c0cc0fb200] ref B L1: 87.2% 12.8%
[libx264 @ 0x55c0cc0fb200] kb/s:139.49
[aac @ 0x55c0cc660500] Qavg: 1038.156
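The `!ffmpeg` cell above relies on IPython substituting `{subtitle}` into a shell command. Outside a notebook, the same still-image video can be built from Python. The sketch below mirrors the flags of that cell one for one; `build_still_video_cmd` is a hypothetical helper name, and the commented call assumes `ffmpeg` is installed.

```python
import subprocess


def build_still_video_cmd(image, audio, subtitle, output="output.mp4"):
    """Return the ffmpeg arguments for a looped still image with burned-in subtitles."""
    return [
        "ffmpeg",
        "-loop", "1", "-framerate", "25", "-i", image,  # repeat the image at 25 fps
        "-i", audio,                                    # audio track
        "-c:a", "aac",                                  # re-encode the audio as AAC
        "-vf", f"subtitles={subtitle}",                 # burn the subtitles into the frames
        "-movflags", "+faststart",                      # move the moov atom to the front
        "-shortest",                                    # stop when the audio ends
        output,
    ]


cmd = build_still_video_cmd(
    "gty_march_on_washington_martin_luther_king_ll_130819_16x9_992.jpg",
    "Speech_Martin_Luther_King-1963-08-26.mp3",
    "Speech_Martin_Luther_King-1963-08-26.vtt",
)
# subprocess.run(cmd, check=True)  # uncomment to run ffmpeg
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.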

output_video = "output.mp4"

from IPython.display import HTML
from base64 import b64encode
# Read the video inside a context manager so the file handle is closed afterwards.
with open(output_video, 'rb') as f:
    mp4 = f.read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=400 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

import gradio as gr 
import os
import sys
import subprocess
#from moviepy.editor import VideoFileClip

import whisper
from whisper.utils import write_vtt

model = whisper.load_model("medium")

def video2mp3(video_file, output_ext="mp3"):
    filename, ext = os.path.splitext(video_file)
    subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"], 
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.STDOUT)
    return f"{filename}.{output_ext}"


def translate(input_video):

    audio_file = video2mp3(input_video)
    
    options = dict(beam_size=5, best_of=5)
    translate_options = dict(task="translate", **options)
    result = model.transcribe(audio_file,**translate_options)

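    output_dir = '/content/'
    # Strip only the final extension; split(".")[0] would cut the path at its first dot.
    audio_path = os.path.splitext(audio_file)[0]

    with open(os.path.join(output_dir, audio_path + ".vtt"), "w") as vtt:
        write_vtt(result["segments"], file=vtt)

    subtitle = audio_path + ".vtt"
    output_video = audio_path + "_subtitled.mp4"

    # Pass the arguments as a list so paths containing spaces are not split by the shell;
    # -y overwrites an existing output instead of waiting for a prompt.
    subprocess.run(["ffmpeg", "-y", "-i", input_video, "-vf", f"subtitles={subtitle}", output_video])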

    return output_video



title = "Add Text/Caption to your YouTube Shorts - MultiLingual"

block = gr.Blocks()

with block:

    with gr.Group():
        with gr.Box():
            with gr.Row():
                # The deprecated 'type' argument is dropped; Gradio warns about it at launch.
                inp_video = gr.Video(
                    label="Input Video",
                    mirror_webcam=False
                )
                op_video = gr.Video()
            btn = gr.Button("Generate Subtitle Video")

        btn.click(translate, inputs=[inp_video], outputs=[op_video])
 
        gr.HTML('''
        <div class="footer">
                    <p>Model by <a href="https://github.com/openai/whisper" style="text-decoration: underline;" target="_blank">OpenAI</a> - Gradio App by <a href="https://twitter.com/1littlecoder" style="text-decoration: underline;" target="_blank">1littlecoder</a>
                    </p>
        </div>
        ''')

block.launch(debug = True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://10733.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces