captionist-ISL

An AI-based system that converts speech into multilingual Indian language captions for Deaf users and low-literacy individuals.


Project Aim

  • Real-time speech-to-text captioning in major Indian languages for Deaf users
  • Simplified captions for easy comprehension by low-literacy users
  • Low latency, high accuracy, and accessibility across diverse platforms

Project Structure

captionist-ISL/
|
|-- testing/                        <- Phase 1 test scripts and sample audio
|   |-- testing_dependencies.py     <- Whisper install and Hindi transcription test
|   |-- conversion.py               <- Audio format conversion (m4a, mp3, wav)
|   |-- audio.wav                   <- Test audio (mic recorded)
|   |-- audio.mp3                   <- Converted audio
|   `-- Recording.m4a               <- Raw recording for testing
|
|-- Video_captioning.py             <- Phase 2A main script
|-- extract_audios/                 <- Output folder (auto-created)
|   |-- {filename}_audio.wav        <- Extracted audio from video
|   `-- {filename}_captions.srt     <- Generated subtitle file
|
|-- README.md
|-- Requirements.txt
`-- .gitignore

Tech Stack

Component           Tool
Audio capture       sounddevice + scipy
Format convert      ffmpeg
Transcription       faster-whisper
GPU acceleration    CUDA 12.7 + RTX 2050 (float16)
Caption format      SRT (universal subtitle format)
Video playback      VLC

Installation

Step 1 - Install PyTorch with CUDA

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

Step 2 - Install Whisper and audio packages

pip install openai-whisper faster-whisper sounddevice numpy scipy

Step 3 - Install ffmpeg

Windows: Download from https://ffmpeg.org/download.html
Ubuntu : sudo apt install ffmpeg
macOS  : brew install ffmpeg

Step 4 - Install VLC

Download from: https://www.videolan.org/

Step 5 - Verify GPU

python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
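
The same check can drive the device choice at runtime. The following is a minimal sketch, not code from this repository: `pick_device` is a hypothetical helper, and falling back to CPU with the `int8` compute type is an assumption about what you would want when no GPU is found.

```python
def pick_device():
    """Return (device, compute_type) for faster-whisper, falling back to CPU.

    Hypothetical helper: float16 on GPU matches the Tech Stack table above,
    int8 on CPU is an assumed fallback.
    """
    try:
        import torch  # installed in Step 1
        has_gpu = torch.cuda.is_available()
    except ImportError:
        # PyTorch missing entirely: treat as "no GPU"
        has_gpu = False
    return ("cuda", "float16") if has_gpu else ("cpu", "int8")


device, compute_type = pick_device()
print(f"Using {device} with compute type {compute_type}")
```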

Supported Languages

Code    Language     Script
hi      Hindi        हिन्दी
ta      Tamil        தமிழ்
te      Telugu       తెలుగు
bn      Bengali      বাংলা
mr      Marathi      मराठी
gu      Gujarati     ગુજરાતી
kn      Kannada      ಕನ್ನಡ
ml      Malayalam    മലയാളം
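
One way to validate the language code typed at the prompt is a simple lookup against this table. This is an illustrative sketch only: `SUPPORTED_LANGUAGES` and `resolve_language` are hypothetical names, not identifiers taken from Video_captioning.py.

```python
# Codes and names copied from the table above.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi",
    "ta": "Tamil",
    "te": "Telugu",
    "bn": "Bengali",
    "mr": "Marathi",
    "gu": "Gujarati",
    "kn": "Kannada",
    "ml": "Malayalam",
}


def resolve_language(user_input, default="hi"):
    """Normalize a typed language code; empty input falls back to the default."""
    code = user_input.strip().lower() or default
    if code not in SUPPORTED_LANGUAGES:
        valid = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language code {code!r}; expected one of: {valid}")
    return code
```

Rejecting unknown codes early avoids silently handing an unsupported language to the transcription step.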

Phase 1 - Dependency Testing

Tests Whisper installation, Hindi transcription, and audio conversion.

cd testing
python testing_dependencies.py
python conversion.py

What was tested:

  • Whisper base and small model loading
  • Hindi Devanagari script output with language="hi"
  • Audio format conversion using ffmpeg (m4a to wav, mp4 to wav)
  • GPU detection with torch.cuda.is_available()
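
A check along these lines could be sketched as follows. It is an assumption about how such a test might look, not the contents of testing_dependencies.py: `transcribe_hindi` and `dominant_script` are hypothetical helpers. The only library calls used are `whisper.load_model` and `model.transcribe` from openai-whisper.

```python
def transcribe_hindi(audio_path, model_name="base"):
    """Transcribe a file with openai-whisper, forcing Hindi output."""
    import whisper  # openai-whisper; imported lazily so the helper below works without it

    model = whisper.load_model(model_name)  # "base" or "small" were tested in Phase 1
    result = model.transcribe(audio_path, language="hi")
    return result["text"]


def dominant_script(text):
    """Classify text as 'devanagari', 'arabic', or 'other' by counting code points."""
    devanagari = sum("\u0900" <= ch <= "\u097f" for ch in text)  # Devanagari block
    arabic = sum("\u0600" <= ch <= "\u06ff" for ch in text)      # Arabic block (used for Urdu)
    if devanagari == 0 and arabic == 0:
        return "other"
    return "devanagari" if devanagari >= arabic else "arabic"


# Example (requires openai-whisper and the sample audio):
# text = transcribe_hindi("audio.wav")
# assert dominant_script(text) == "devanagari"
```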

Phase 2A - Video Caption Pipeline

Full pipeline: video or audio file in, SRT captions out, VLC auto-launched.

Usage

python Video_captioning.py
Enter video/audio file path: E:\videos\speech.mp4
Enter language code (default hi): hi

Pipeline Steps

Step 1 - extract_audio()    Video/audio file -> 16kHz WAV (ffmpeg)
Step 2 - transcribe()       WAV -> timestamped segments (faster-whisper GPU)
Step 3 - generate_srt()     Segments -> .SRT subtitle file
Step 4 - play_vlc()         Original video + SRT -> VLC player
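
The first two steps can be sketched roughly as below. This is a hedged illustration, not the code in Video_captioning.py: `build_extract_cmd` is a hypothetical helper, the mono (`-ac 1`) setting and the default model size are assumptions, and only the `WhisperModel` constructor and `transcribe` call come from faster-whisper.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(input_path, out_dir="extract_audios"):
    """Build the ffmpeg command that writes a 16 kHz mono WAV for the input file."""
    src = Path(input_path)
    wav = Path(out_dir) / f"{src.stem}_audio.wav"
    # -y overwrite, -vn drop video, -ac 1 mono (assumed), -ar 16000 resample to 16 kHz
    cmd = ["ffmpeg", "-y", "-i", str(src), "-vn", "-ac", "1", "-ar", "16000", str(wav)]
    return cmd, wav


def extract_audio(input_path, out_dir="extract_audios"):
    """Run ffmpeg and return the path of the extracted WAV."""
    cmd, wav = build_extract_cmd(input_path, out_dir)
    wav.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
    return wav


def transcribe(wav_path, language="hi", model_size="small"):
    """Return a list of (start, end, text) tuples using faster-whisper on the GPU."""
    from faster_whisper import WhisperModel  # lazy import: needs faster-whisper installed

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(str(wav_path), language=language)
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```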

Output

Two files, named after the input file, are saved to the auto-created extract_audios/ folder:

speech_audio.wav        <- Extracted audio
speech_captions.srt     <- Subtitle file

SRT format example:

1
00:00:00,000 --> 00:00:03,640
तो जब मैं मुंबई आया

2
00:00:03,640 --> 00:00:07,200
मेरे दिमाग में बहुत सारे सपने थे
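
Producing this layout from timestamped segments takes only a few lines. The sketch below assumes segments arrive as (start_seconds, end_seconds, text) tuples. The function names echo the pipeline step but the bodies are illustrative, not the repository's implementation.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to the SRT form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def generate_srt(segments):
    """Render (start, end, text) tuples as SRT text, one numbered block per segment."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    # Blocks already end with a newline; joining with "\n" adds the blank separator line.
    return "\n".join(blocks)


print(format_timestamp(3.64))  # 00:00:03,640
```

Write the result with encoding="utf-8" so Indic scripts survive on Windows.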

Performance on RTX 2050

Model     Speed            VRAM used
small     2.5x realtime    ~1.5 GB
medium    1.5x realtime    ~2.5 GB
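
To turn these figures into a wait-time estimate, divide the audio length by the realtime factor. The sketch assumes "2.5x realtime" means 2.5 seconds of audio are processed per second of wall-clock time. `processing_seconds` is a hypothetical helper.

```python
def processing_seconds(audio_seconds, realtime_factor):
    """Estimate wall-clock transcription time for a given realtime factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ten_minutes = 10 * 60
print(processing_seconds(ten_minutes, 2.5))  # small:  240.0 seconds (4 minutes)
print(processing_seconds(ten_minutes, 1.5))  # medium: 400.0 seconds (about 6.7 minutes)
```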

Whisper Model Reference

Model     Speed       Accuracy    RAM needed
tiny      fastest     low         ~1 GB
base      fast        medium      ~1 GB
small     balanced    good        ~2 GB
medium    slow        better      ~5 GB
large     slowest     best        ~10 GB

Recommended: small for speed, medium for accuracy on Indian languages.


Known Issues

  • Audio encoded at 48 kbps produces garbled transcriptions; use audio above 128 kbps for best results
  • Whisper often cannot tell spoken Hindi from Urdu, so auto-detection may emit Urdu script; always set language="hi" explicitly
  • The small model may mix Hindi and English on code-switched speech; use the medium model instead
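
A pre-flight bitrate check can catch the first issue before transcription starts. The sketch below is an assumption about how that might be done with ffprobe (which ships with ffmpeg). `audio_bitrate`, `parse_bitrate`, and `is_low_bitrate` are hypothetical helpers, and some containers report no bitrate, in which case the result is None.

```python
import subprocess

MIN_BITRATE_BPS = 128_000  # threshold taken from the known issue above


def parse_bitrate(ffprobe_output):
    """Parse ffprobe's bit_rate output; return None when it is missing or 'N/A'."""
    value = ffprobe_output.strip()
    return int(value) if value.isdigit() else None


def audio_bitrate(path):
    """Ask ffprobe for the bitrate (bits per second) of the first audio stream."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=bit_rate",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_bitrate(result.stdout)


def is_low_bitrate(bitrate_bps, minimum=MIN_BITRATE_BPS):
    """True when the bitrate is known and below the recommended minimum."""
    return bitrate_bps is not None and bitrate_bps < minimum
```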

Phase Roadmap

Phase 1     (done)    Dependency testing, Hindi transcription, audio conversion
Phase 2A    (done)    Video -> SRT -> VLC pipeline with GPU acceleration
Phase 2B    (next)    Live mic streaming, chunked real-time captions under 2s
Phase 3               Bhashini API integration for more regional languages
Phase 4               Low-literacy feature: chunk audio -> LLM summarize -> simple caption
Phase 5               Web UI and platform integration

Troubleshooting

torch.cuda.is_available() returns False
Reinstall PyTorch with the correct CUDA version:

pip uninstall torch torchaudio -y
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

ffmpeg not found
Add ffmpeg to the system PATH or reinstall it from https://ffmpeg.org/download.html
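
To confirm whether Python can actually see ffmpeg, a quick PATH lookup helps. This is a small sketch using the standard library's `shutil.which`. `require_tool` is a hypothetical helper name.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable on PATH, or raise with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} was not found on PATH; install it or add its folder to PATH")
    return path


# Example: print(require_tool("ffmpeg"))
```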

VLC not launching
Check that VLC is installed at one of these paths:

C:\Program Files\VideoLAN\VLC\vlc.exe
C:\Program Files (x86)\VideoLAN\VLC\vlc.exe

Or add your custom VLC path to the VLC_PATHS list in Video_captioning.py
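
For reference, a lookup of this kind might resemble the sketch below. The `VLC_PATHS` name comes from this README, but the function bodies are assumptions rather than the script's actual code. The PATH fallback via `shutil.which` is an added convenience, and `--sub-file` is VLC's command-line option for loading a subtitle file.

```python
import os
import shutil
import subprocess

# The two default Windows install locations listed above; append custom paths here.
VLC_PATHS = [
    r"C:\Program Files\VideoLAN\VLC\vlc.exe",
    r"C:\Program Files (x86)\VideoLAN\VLC\vlc.exe",
]


def find_vlc(candidates=VLC_PATHS):
    """Return the first existing VLC path, else whatever 'vlc' resolves to on PATH."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("vlc")  # None when VLC is not on PATH either


def play_vlc(video_path, srt_path):
    """Launch VLC with the video and the generated subtitle file."""
    vlc = find_vlc()
    if vlc is None:
        raise FileNotFoundError("VLC not found; add its path to VLC_PATHS")
    return subprocess.Popen([vlc, video_path, f"--sub-file={srt_path}"])
```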

Hindi output comes in Urdu script
Always pass language="hi" explicitly; never let Whisper auto-detect the language for Indian languages.

Script exits after transcription with no error
Add these lines near the top of Video_captioning.py, after the other imports, so that Indic text can be printed to the console:

import sys
sys.stdout.reconfigure(encoding='utf-8')  # requires Python 3.7+
