Fal Speech-to-Text

Free.ai · stt · ~500 tokens per minute



Fal Speech-to-Text is a speech-to-text model. Requests are routed through external models and cost about 500 tokens per minute of audio (a 50% markup over upstream cost).

Use via API

OpenAI-compatible REST API. Generate a key and call this model in seconds.

curl -X POST https://api.free.ai/v1/stt/ \
  -H "Authorization: Bearer sk-free-..." \
  -H "Content-Type: application/json" \
  -d '{"model":"premium/speech-to-text","audio_url":"https://..."}'

Frequently Asked Questions

Fal Speech-to-Text transcribes spoken audio into text. Upload an MP3, WAV, M4A, or video file and Fal Speech-to-Text returns the full transcript plus optional SRT/VTT subtitles with timestamps.

Fal Speech-to-Text handles dozens of languages: Whisper-family models cover 90+, Parakeet covers about 25, and others vary. Pick "auto-detect", or specify the language yourself for the highest accuracy.

Word-error rate is 5–10% on clean English audio and 10–20% on noisy or accented audio. Larger variants of the same architecture do meaningfully better on hard cases, so pick a larger variant when the audio is rough.
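For reference, word-error rate is conventionally computed as the word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference transcript. The sketch below shows that standard definition; it is not necessarily how the figures above were measured, and the function name and the lowercase/whitespace normalization are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 0.10 on this definition means roughly one word in ten needs correcting.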

Every segment includes start and end timestamps. Export as SRT or VTT and the times map straight onto your video.
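If you would rather build subtitles yourself from the per-segment timestamps, the conversion to SRT is mechanical. The sketch below assumes each segment is a dict with `start` and `end` in seconds and a `text` string; those field names are hypothetical and should be checked against the actual API response.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as SRT cues: index, time range, text, blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        # 'start', 'end', and 'text' are assumed field names.
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": "Hello and welcome."},
        {"start": 2.5, "end": 5.0, "text": "Let's get started."},
    ]
    print(segments_to_srt(demo))
```

VTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.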

Fal Speech-to-Text is a premium transcription engine. It costs roughly 500–1,500 tokens per minute of audio, and $1 buys 750,000 tokens.
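To turn those rates into a budget, multiply the audio length by the per-minute token rate and divide by 750,000 tokens per dollar. A minimal sketch of the arithmetic, using only the figures quoted above (the function and constant names are illustrative):

```python
TOKENS_PER_DOLLAR = 750_000  # from the pricing stated above


def estimate_cost(audio_minutes: float, tokens_per_minute: float):
    """Return (tokens, dollars) for a recording of the given length."""
    tokens = audio_minutes * tokens_per_minute
    return tokens, tokens / TOKENS_PER_DOLLAR


if __name__ == "__main__":
    # Low and high ends of the quoted 500-1,500 tokens-per-minute range.
    for rate in (500, 1500):
        tokens, dollars = estimate_cost(60, rate)
        print(f"60 min at {rate} tokens/min: {tokens:,.0f} tokens, ${dollars:.2f}")
```

At these rates, a 60-minute recording comes to 30,000–90,000 tokens, or about $0.04–$0.12.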

MP3, WAV, M4A, FLAC, OGG, plus video (MP4, MOV, WebM) — we extract the audio. Max 500 MB per upload. Longer files? Split with /audio/cut/ or use /v1/stt/batch/.

Speaker diarization is a separate pass — toggle "diarize" on /transcribe/. Fal Speech-to-Text handles the transcription; diarization labels each segment with Speaker 1 / Speaker 2 / etc.

/batch/ accepts a folder of audio files. Each transcript lands in /account/?tab=history with the original filename. To preserve the folder tree, use the API.

POST your audio to /v1/stt/transcribe/ with model="Fal Speech-to-Text". The response is JSON containing the text, segments, and word-level timestamps. /api/ has the full reference.
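As a rough sketch of that call from Python, the snippet below builds the POST request with the standard library. It makes several assumptions, which are not confirmed by this page: that the endpoint lives on the same `https://api.free.ai` host as the curl example, that it accepts the same JSON body with an `audio_url` field, and that a Bearer key is used. Check /api/ for the actual request shape.

```python
import json
import urllib.request

# Assumed: same host as the curl example, path as described above.
API_URL = "https://api.free.ai/v1/stt/transcribe/"


def build_transcribe_request(api_key: str, audio_url: str,
                             model: str = "Fal Speech-to-Text") -> urllib.request.Request:
    """Build (but do not send) a POST request for a transcription job."""
    # The "audio_url" field name is borrowed from the curl example
    # and is an assumption for this endpoint.
    payload = json.dumps({"model": model, "audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_transcribe_request("sk-free-...", "https://example.com/audio.mp3")
    print(req.get_method(), req.full_url)
```

To send it, pass the request to `urllib.request.urlopen(req)` and decode the body with `json.load`; building the request separately keeps the sketch inspectable without a network call.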

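Self-hosted models keep audio on our GPUs; premium models pass audio through to the external provider under a data processing agreement (DPA). Audio is deleted after the share window (24 hours for anonymous users, 7 days for signed-in users). We do not train on your inputs.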

Free.ai grants commercial use of transcripts. You need rights to the audio you upload (your own recording, licensed material, or content with consent).

Real-time factor is roughly 0.05–0.2×, so a 60-minute podcast transcribes in 3–12 minutes. Premium models often finish faster. Use the queue button if you want to close the tab while the job runs.
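The real-time factor is processing time divided by audio duration, so the expected wait is the audio length multiplied by the factor. A small sketch of that calculation, using the 0.05–0.2 range quoted above as defaults (the function name is illustrative):

```python
def processing_minutes(audio_minutes: float,
                       rtf_low: float = 0.05,
                       rtf_high: float = 0.2):
    """Return the (fastest, slowest) expected processing time in minutes."""
    return audio_minutes * rtf_low, audio_minutes * rtf_high


if __name__ == "__main__":
    fastest, slowest = processing_minutes(60)
    print(f"60 min of audio: about {fastest:.0f} to {slowest:.0f} minutes")
```

These are estimates only; the actual time will also depend on queueing and on which model handles the job.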
