Free English Transcription
Transcribe English audio and video to text with AI. Fast, accurate, and free.
How It Works
- Go to the Free.ai Transcriber
- Upload your English audio or video file
- Our AI automatically detects English and transcribes it
- Download your transcript as text or SRT subtitles
English Transcription Features
- ✓ Powered by faster-whisper (MIT licensed)
- ✓ Automatic English language detection
- ✓ Supports MP3, WAV, MP4, M4A, FLAC, and more
- ✓ Timestamps and subtitle export (SRT)
- ✓ No file size limits on paid plans
- ✓ Private and secure: files are deleted after processing
Language Details
| Language | English |
| ISO Code | en |
| AI Model | faster-whisper |
| Price | Free |
FAQ
Whisper large-v3-turbo lands in the top accuracy tier (Tier A) on English: under 7% word error rate (WER) on standard benchmark sets. In practice, clean studio audio comes back near-perfect, and conversational audio is usable with minimal cleanup. We publish honest WER tiers rather than marketing claims.
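For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is a simplified illustration, not the benchmark tooling behind the figure above: it only lowercases and splits on whitespace, whereas real evaluations usually also normalize punctuation and numerals. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

By this definition, a 7% WER corresponds to about 7 word-level errors per 100 reference words.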
Yes — English transcription draws from your daily free token pool first. Audio costs about 50 tokens per minute, so the anonymous daily pool covers a few hours of audio per day. Signed-in accounts get a larger pool plus 10,000 signup tokens. Past that, $1 buys 750,000 tokens (~250 hours of audio).
English transcription covers US, UK, Australian, Indian, and other major accents in one model. Whisper was trained on all of them and the transcript comes out in standard English spelling regardless of the speaker's accent.
MP3, WAV, M4A, FLAC, OGG, OPUS, and WEBM are accepted directly. For video (MP4, MOV, MKV), we extract the audio track server-side before sending it to Whisper, so you do not need to convert anything yourself. The pipeline is the same for every source language, including English.
Anonymous uploads cap at roughly 500 MB per file. Signed-in accounts go up to 2 GB. Duration is not a hard limit — long files are chunked automatically (30-second windows with overlap) and stitched back into a single transcript with continuous timestamps. Multi-hour English recordings (podcasts, full lectures, meetings) work fine.
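As a rough illustration of the chunking described above, the following sketch computes overlapping 30-second windows over a recording. The 5-second overlap and the function name are assumptions made for the example; the actual overlap and the stitching logic are not documented here.

```python
def chunk_windows(duration_s: float, window_s: float = 30.0,
                  overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording.

    overlap_s is an assumed value, chosen only for illustration.
    """
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")

    step = window_s - overlap_s  # how far each new window advances
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the audio
        start += step
    return windows
```

Because each window starts before the previous one ends, words that straddle a boundary appear in two windows, which gives the stitching step material to align on.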
Yes — speaker diarization is on by default for every English transcript. The output is segmented as Speaker 1 / Speaker 2 / Speaker 3 with timestamps, so interviews, panel discussions, and multi-party meetings come back labeled. Diarization runs on a separate model and works the same across all languages we support.
Yes. Paste the URL into /transcribe/youtube/ for YouTube or /transcribe/podcast/ for podcast feeds (Apple, Spotify, RSS). We download the audio, run it through Whisper with language=en, and return the transcript with timestamps and speaker labels. Typical English content such as lectures, interviews, and voice notes all works, whether you paste a URL or upload the file directly.
Whisper costs about 50 tokens per minute of audio, so a one-hour recording is ~3,000 tokens. $1 buys 750,000 tokens, which works out to roughly 250 hours of audio per dollar. Most users never spend anything — the free daily pool covers short clips, voice notes, and one-off podcasts.
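The arithmetic above can be checked with a few lines of Python. The two rates come from the figures quoted here; rounding up to whole tokens is an assumption, since the billing granularity is not stated, and the function names are hypothetical.

```python
import math

TOKENS_PER_MINUTE = 50       # quoted rate for audio transcription
TOKENS_PER_DOLLAR = 750_000  # quoted purchase rate


def tokens_for_audio(minutes: float) -> int:
    """Tokens for a recording; rounding up is an assumption."""
    return math.ceil(minutes * TOKENS_PER_MINUTE)


def hours_per_dollar() -> float:
    """Hours of audio that one dollar of tokens covers."""
    return TOKENS_PER_DOLLAR / TOKENS_PER_MINUTE / 60


def cost_in_dollars(minutes: float) -> float:
    """Dollar cost of a recording once the free pool is used up."""
    return tokens_for_audio(minutes) / TOKENS_PER_DOLLAR
```

A one-hour recording comes to 3,000 tokens, or $0.004 at the paid rate.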
Yes. Both segment-level (every ~10-30 seconds) and word-level timestamps are available. Word-level is the default for VTT/SRT subtitle export so the captions sync line-by-line. On the API, set timestamps="word" in the request body. English transcripts are returned in UTF-8 with standard English orthography.
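To show how timestamped segments map onto SRT subtitle blocks, here is a minimal sketch. The segment shape (dictionaries with start, end, and text keys, times in seconds) is an assumption for illustration and is not the documented response schema; only the SRT layout itself (index line, HH:MM:SS,mmm --> HH:MM:SS,mmm, text, blank line) is standard.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments (assumed keys: start, end, text) as SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

The same function works for word-level output if each word is passed as its own segment, though captions are usually easier to read when words are grouped into short lines first.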
Yes. POST audio (multipart/form-data, field name "file") to /v1/transcribe/ with language=en — or omit the language parameter to let Whisper auto-detect. Returns JSON with the transcript, segments, timestamps, and speaker labels. Full reference and SDK snippets at /api/.
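The request described above could be assembled along these lines using only the Python standard library. Several details are assumptions: the https://free.ai host is inferred from the site name, language and timestamps are sent as plain form fields, and authentication is left out because it is not described here. Check the reference at /api/ before relying on any of them. The function name is hypothetical.

```python
import urllib.request
import uuid

BASE_URL = "https://free.ai"  # assumption: host inferred from the site name


def build_transcribe_request(filename, audio_bytes, language="en",
                             timestamps=None):
    """Build (but do not send) a multipart POST to /v1/transcribe/.

    Pass language=None to omit the field and let the service auto-detect.
    Sending language/timestamps as form fields is an assumption.
    """
    boundary = uuid.uuid4().hex
    fields = {}
    if language:
        fields["language"] = language
    if timestamps:
        fields["timestamps"] = timestamps

    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The audio goes in the form field named "file", as stated above.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        BASE_URL + "/v1/transcribe/",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and parse the response body with json.load. Building the request separately from sending it makes the payload easy to inspect before any audio leaves your machine.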
Yes. Once transcription finishes, click Translate or paste the text into /translate/. English pairs with every other language we support (200+). For meeting minutes, pipe the transcript through /summarize/; for dubbing, send it to /voice/tts/ to render audio in the target language.
Whisper was trained on roughly 680K hours of noisy real-world audio, so English transcription is robust to background noise, music beds, and phone-quality recordings. Severe clipping or multiple overlapping speakers will still hurt accuracy. If a transcript comes back unusable, email contact@free.ai with the file. We will refund the tokens and look at whether a different engine handles your audio better.