For text/srt subtitles, translation would probably be easier. There's a plugin for that already if you're okay with online translation services: https://github.com/nopium/vlc-trans-lua
@item destination
If set, the transcription output will be sent to the specified file or URL
(use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages.
The output will also be set in the "lavfi.whisper.text" frame metadata.
If the destination is a file and it already exists, it will be overwritten.
@item format
The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json".
Default value: @code{"text"}
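For example, a command along these lines (a hedged sketch: the model path is a placeholder, but the destination and format options are the ones documented above) drops the video, runs the filter, and writes an SRT file while discarding the media output:
ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.en.bin:destination=out.srt:format=srt" -f null -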
I don't know if this can embed the subtitles, but it does support generating accompanying srt files. Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.
It's up to the site admin to configure it that way, but it's possible some IP ranges/user agents are more often used by bots and therefore have an increased weight.
For old browsers there's also an option to use meta refresh instead of JS (https://anubis.techaro.lol/docs/admin/configuration/challenges/metarefresh) but that's quite a recent addition and not enabled by default.
I'm currently roaming in Finland with a Spanish SIM so would have expected the opposite in that case.
This page loaded pretty much instantly (certainly in the time it took to switch to the background tab I loaded in). But then ffmpeg is written by old school engineers with old school ways of working. Their social media accounts are a hilarity of trolling worthy of slashdot in its peak.
You can read it on one of these without having to pass that specific bot check.
With the current broken default config my browser can't even run the JS challenge due to it using unsupported bleeding edge JS features.
Should they add Voice Activity Detection? Are these separate filters or just making the whisper filter more fancy?
https://en.wikipedia.org/wiki/Whisper_(speech_recognition_system)
From the documentation:
> It runs automatic speech recognition using the OpenAI's Whisper model.
E.g., if I say "I scream", it sounds phonetically identical to "ice cream".
Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".
Doing this seems necessary to get both low latency and high accuracy; things like live transcription on Android do it, and you can see the guesses adjust as you talk.
queue
The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value, the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as with the whisper-cli tool), but the transcription latency will be higher, and thus not useful for processing real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
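As a hedged illustration of that trade-off (the model file names are placeholders; the option names are the documented ones), an offline pass might pair a large queue with a VAD model, while a near-real-time pass keeps the queue small and lets the text go to the log:
# more accurate, higher latency (offline-style)
ffmpeg -i talk.wav -af "whisper=model=ggml-base.en.bin:queue=20:vad_model=ggml-silero-v5.1.2.bin:destination=talk.srt:format=srt" -f null -
# lower latency, lower accuracy (near-real-time)
ffmpeg -i talk.wav -af "whisper=model=ggml-base.en.bin:queue=3" -f null -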
I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.
The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).
This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
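A rough shell sketch of that naive approach (this is not what the ffmpeg filter does; the whisper-cli binary, model path, and file names are assumptions based on whisper.cpp's CLI):
# every 3 seconds, re-transcribe only the most recent 15 seconds of a growing recording
while true; do
  dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 recording.wav)
  start=$(awk -v d="$dur" 'BEGIN { s = d - 15; if (s < 0) s = 0; print s }')
  ffmpeg -v error -ss "$start" -i recording.wav -ac 1 -ar 16000 -y window.wav
  ./whisper-cli -m models/ggml-tiny.en.bin -f window.wav --no-timestamps
  sleep 3
done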
https://tomwh.uk/git/whisper-chunk.git/
I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.
Unfortunately, you're only getting attention in 3 second chunks.
The "alternatives" and "confidence" field is the result of the N-best decodings described elsewhere in the thread.
That said, I haven't run into the icecream problem with Whisper. Plenty of other systems fail but Whisper just seems to get lucky and guess the right words more than anything else.
The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.
I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."
I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.
Fortunately, I think in English, and it's an ever-evolving language, expanding as the world does. That's compared to the majority of people where I'm from: English was a second language they had to learn, and the people who taught them weren't well equipped with the resources to do a good job.
A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.
Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to put what they are thinking into words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head unless I actively try to think in it (at least, not since I was a teenager).
(This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)
As for how it happens to me: either something closer to speech than to raw thought reports back that the data in shared memory is invalid for the selected language, or I find there's no text representation for what I am trying to say.
The "raw" thoughts work in the currently active language for me, so, at least for me, the strong Sapir-Whorf hypothesis isn't even a hypothesis, just a reasonable verbalization that closely matches my own observations.
I don't get why people can't accept it, even in the age of LLMs. It is what it is, and that old guy is just never correct, not even once.
(then there's also a feedback loop type of argument, that always happens when discussing any sort of perception-reality distinction, but let's ignore that for now)
At least for me, my brain is so bad and it's hard for me to truly hold a single thought in my head for a long time. Maybe it eventually settles into my subconscious but I don't really have a way to verify that.
I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
"How to wreck a nice beach you sing calm incense"
Consider the way "Commonwealth Bank" is pronounced in this news story: https://youtube.com/watch?v=MhkuHGRAAbg. An Australian English speaker would consider (most) Americans to be saying something like "Carmenwealth" rather than "Commonwealth". See also the pronunciation of dog vs father in https://www.goalsenglish.com/lessons/2020/5/4/australian-english-vs-american-english-part-one-accent-differences.
It really ruins some poetry.
(Agree that the title is awesome, by the way!)
"Threesomes, with and without blame"
https://dl.acm.org/doi/10.1145/1570506.1570511
(From a professor I worked with a bit in grad school)
Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?
It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomena with their abstractions with their claims of a beauty like music... hmm!
The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.
I try to limit my use of it to just enough for my accent and way of talking to bleed through. I don't go for full-on phonetics, but I'm often "droppin' my g's and usin' lotsa regional sayin's." It probably helps that the people I text have the same accent I do, though.
And when I'm watching subtitles in my own language (say because I want the volume low so I'm not disturbing others), I hate when the words I see don't match the words I hear. It's the quickest way I can imagine to get sucked out of the content and into awareness of the delivery of the content.
Sometimes they're edited down simply for space, because there wouldn't be time to easily read all the dialog otherwise. And sometimes repetition of words or phrases is removed, because it's clearer, and the emphasis is obvious from watching the moving image. And filler words like "uh" or "um" generally aren't included unless they were in the original script.
Most interestingly, swearing is sometimes toned down, just by skipping it -- removing an f-word in a sentence or similar. Not out of any kind of puritanism, but because swear words genuinely come across as more powerful in print than they do in speech. What sounds right when spoken can sometimes look like too much in print.
Subtitles are an art. Determining when to best time them, how to split up long sentences, how to handle different speakers, how to handle repetition, how to handle limited space. I used to want subtitles that were perfectly faithful to what was spoken. Then I actually got involved in making subtitles at one point, and was very surprised to discover that perfectly faithful subtitles didn't actually do the best job of communicating meaning.
Fictional subtitles aren't court transcripts. They serve the purpose of storytelling, which is the combination of a visible moving image full of emotion and action, and the subtitles. Their interplay is complex.
The artists are the writers, voice actors, and everyone else involved in creating the original media. Never, ever should a random stranger contaminate it with his/her opinions or points of view.
Subtitles should be perfect transcriptions or the most accurate translations, never reinterpretations.
That's the thing though, subtitles aren't intended as full transcripts. They are intended to allow a wide variety of people to follow the content.
A lot of people read slower than they would hear speech. So subtitles often need to condense or rephrase speech to keep pace with the video. The goal is usually to convey meaning clearly within the time available on screen. Not to capture every single word.
If they tried to be fully verbatim, you'd either have subtitles disappearing before most viewers could finish reading them or large blocks of text covering the screen. Subtitlers also have to account for things like overlapping dialogue, filler words, and false starts, which can make exact transcriptions harder to read and more distracting in a visual medium.
I mean, yeah in your own native language I agree it sort of sucks if you can still hear the spoken words as well. But, to be frank, you are also the minority group here as far as subtitle target audiences go.
And to be honest, if they were fully verbatim, I'd wager you quickly would be annoyed as well. Simply because you will notice how much attention they then draw, making you less able to actually view the content.
If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.
That's just tone deaf, plain and simple. I was not talking about myself, or just YouTube. You are not everyone else; your use case is not everyone else's use case. It really isn't that difficult.
And what are deaf people supposed to do in a cinema, or with broadcast TV?
(And I'm ignoring other uses, e.g. learning a foreign language; for that, sometimes you want the exact words, sometimes the gist, but it's highly situational; but even once you've learned the language itself, regional accents even without vocabulary changes can be tough).
But it's a great point that you need context to be sure.
Unless it was trained end-to-end on Dutch-subtitled English text?? Which might make the translation a somewhat inextricable part of the model...? Does anyone know?
That's how I anecdotally feel my own brain works, so it could be different from how interpreters or actual human brains work, but as far as I can see, professional simultaneous interpreters don't seem to be agnostic about the relevant pair of languages at all.
"Madam, please believe me, maine homework kiya ha" [I did my homework].
I've seen professionally produced recordings on dry and technical subjects with good sound quality where they've decided to use distracting subtitles with no way to disable them.
It seems so unnecessary if you're not making novelty videos about cats.
Also local transcription allows for automatic translation and again overlaying subtitles on top of an existing burnt in set is a really poor reading experience.
I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.
I think it's a toolkit thing where some sort of event or timer goes off at the wrong time and the subtitles get cleared when they shouldn't. And then if you rewind and replay, it doesn't happen again (because spurious event/timer issue).
I don't disagree, yet here we are. It's got race condition vibes.
I don't know if it's related to the TV OS (LG WebOS in our case) but I guess that would be the common factor since it happens across multiple apps and languages.
Anyway, it's quirky and occasionally annoying, but that's about it. :)
Must be a union thing.
It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.
Those are still cool IMO
https://kyutai.org/next/stt is natively streaming STT.
I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.
A couple years ago I had been negotiating with a guy on Fiver to translate them. At his usual rate-per-minute of footage it would have cost thousands of dollars but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.
It's decent for classification but poor at transcription.
It also doesn't understand context, so it makes a lot of the errors you see in automatic translations of YouTube videos, for example.
I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flair. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-made-anime-trollsubs/
Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.
The conversion process from pronunciation to intended text is not deterministic either, so it probably can't be solved by "simply" generating all-pronunciation outputs. Maybe a multimodal LLM as ASR/STT, or a novel dual input as-spoken+estimated-text validation model could be made? I wouldn't know, though. It seemed like a semi-open question.
You can also transcribe it to Japanese and use a translator to convert to English. This can sometimes help for more semantically complex dialogue.
For example, using faster-whisper-xxl [1]:
Direct translation:
faster-whisper-xxl.exe --language English --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
Use Japanese, then translate:
faster-whisper-xxl.exe --language Japanese --task translate --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
1. https://github.com/Purfview/whisper-standalone-win

Another option is to use something like VideoToTextAI, which allows you to transcribe it fast and then translate it into 100+ languages, from which you can then export the subtitle (SRT) file.
https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomment-2513
I think having this flow out to all of the deps of libav is a greater good than notions of lib purity.
People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.
HOWTO:
Drop a video or audio file to the right window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is for cleaning up imperfect transcripts with features like Tools > Fix common errors.
EDIT: Oh, and if you're on the current gen of Nvidia card, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error is about an empty file, output or something like that.
EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index like something along these lines (depending on whether you use pip or uv):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
If you get the errors and the above fixes work, please type your error message in a reply with what worked to help those who come after. Or at least the web crawlers for those searching for help.

You say issue, I say feature. It's a great way to just ignore boring babbling at parties or other social engagements where you're just not that engaged. Sort of like selective hearing in relationships, but used on a wider audience.
So when I say I call it a feature, it's something I actually deal with unlike your uncharitable assumption.
But play two songs at the same time, or try talking to me with significant background noise, and I seem to be distinctly impaired vs. most others.
If I concentrate, I can sometimes work through it.
My uninformed model is a pipeline of sorts, and some sort of pre-processing isn't turned on. So the stuff after it has a much harder job.
I don't think I have any harder time appreciating complex music than I did before, but I'm more of a 60s-70s rock kinda guy and a former bass player, so I tend to focus more on the low end. Bass tends to be less complex because you can't fit as much signal into the waveform without getting unpleasant muddling.
And of course, just because we have similar symptoms doesn't mean the underlying causes are the same. My grandfather was hard of hearing so for all I know it's genetic and the timing was a coincidence. Who knows?
I have always pictured it working this way:
In the cochlea, we have all the fine hair-like sensors. The spread of them determines our range of frequencies, and this declines with age. Usually not too much, but it could be as much as half: 10 to 12 kHz.
Good news is that all the good stuff we crave is below 10 kHz. Don't sweat age-related hearing loss too much.
The number of these sensors determines our ability to hear concurrent sounds, or complexity.
The shape of them impacts how loud sounds need to be to be heard.
Chances are, your loud exposure had harmonics that impacted many of these sensing hairs, but not in one place. The result is a loss of discrimination of concurrent sounds.
There are plenty to cover the frequency range, so things do not seem muffled or low. Their shape is good, not worn so you hear faint sounds well.
The lower number of them is the issue. Or, they are still there, just bent -- something prevents them from contributing.
Another way to think of this is in reverse:
Say you had 30 oscillators you could start at any frequency and time. How complex of a sound could you make? Now cut that in half.
What is lost?
The most complex, concurrent sound cases.
But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining to hear when there's captions for it.
Obviously one important factor to the convenience is how fast your computer is at transcription or translation. I don't use the features in real-time personally currently, although I'd like to if a great UX comes along through other software.
There's also a great podcast app opportunity here I hope someone seizes.
I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as well as I could possibly hope. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv with subtitles.
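Something like this sketch, assuming whisper.cpp's CLI (whisper-cli in recent builds, main in older ones) and a downloaded ggml model; the paths and model are placeholders:
for f in podcasts/*.mp3; do
  ffmpeg -v error -i "$f" -ac 1 -ar 16000 "${f%.mp3}.wav"   # whisper.cpp wants 16 kHz mono WAV
  ./whisper-cli -m models/ggml-base.en.bin -f "${f%.mp3}.wav" -otxt -of "${f%.mp3}"
  rm "${f%.mp3}.wav"
done
rg -il "some topic" podcasts/   # later: find episodes mentioning a topic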
Start playing a YouTube video in the browser, select "start capture" in the extension, and it starts writing subtitles in white text on a black background below the video. When you stop capturing you can download the subtitles as a standard .srt file.
Download -> generate subtitles -> feed to AI for summary works pretty well
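A minimal sketch of that flow, assuming yt-dlp and the openai-whisper CLI (the URL, model, and summarization step are placeholders):
yt-dlp -x --audio-format mp3 -o "lecture.%(ext)s" "$VIDEO_URL"
whisper lecture.mp3 --model small --output_format srt --output_dir .
# then feed lecture.srt to whatever LLM you use for the summary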
10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to, in order to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.
You can do some clever things like take the foreign sub, have Whisper also transcribe it and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask about why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.
Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.
uv has a feature to get the correct version of torch based on your available cuda (and some non-cuda) drivers (though I suggest using a venv not the system Python):
> uv pip install torch torchvision torchaudio --torch-backend=auto
More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection
This also means you can safely mix torch requirements with non-torch requirements as it will only pull the torch related things from the torch index and everything else from PyPI.
But when I hear about these kinds of extras, it makes me even more excited. Getting CUDA and torch to work together is something I have struggled with countless times.
The team at Astral should be nominated for a Nobel Peace Prize.
One life-changing thing I've been using `uv` for:
System python version is 3.12:
$ python3 --version
Python 3.12.3
A script that requires a library we don't have, and won't work on our local python:

$ cat test.py
#!/usr/bin/env python3
import sys
from rich import print
if sys.version_info < (3, 13):
print("This script will not work on Python 3.12")
else:
print(f"Hello world, this is python {sys.version}")
It fails:

$ python3 test.py
Traceback (most recent call last):
File "/tmp/tmp/test.py", line 10, in <module>
from rich import print
ModuleNotFoundError: No module named 'rich'
Tell `uv` what our requirements are:

$ uv add --script=test.py --python '3.13' rich
Updated `test.py`
`uv` updates the script:

$ cat test.py
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "rich",
# ]
# ///
import sys
from rich import print
if sys.version_info < (3, 13):
print("This script will not work on Python 3.12")
else:
print(f"Hello world, this is python {sys.version}")
`uv` runs the script, after installing packages and fetching Python 3.13:

$ uv run test.py
Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
Downloading cpython-3.13.5-linux-x86_64-gnu (download)
Installed 4 packages in 7ms
Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
And if we run it with Python 3.12, we can see that it errors:

$ uv run --python 3.12 test.py
warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
Installed 4 packages in 7ms
This script will not work on Python 3.12
Works for any Python you're likely to want:

$ uv python list
cpython-3.14.0b2-linux-x86_64-gnu <download available>
cpython-3.14.0b2+freethreaded-linux-x86_64-gnu <download available>
cpython-3.13.5-linux-x86_64-gnu /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
cpython-3.13.5+freethreaded-linux-x86_64-gnu <download available>
cpython-3.12.11-linux-x86_64-gnu <download available>
cpython-3.12.3-linux-x86_64-gnu /usr/bin/python3.12
cpython-3.12.3-linux-x86_64-gnu /usr/bin/python3 -> python3.12
cpython-3.11.13-linux-x86_64-gnu /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
cpython-3.10.18-linux-x86_64-gnu /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
cpython-3.9.23-linux-x86_64-gnu <download available>
cpython-3.8.20-linux-x86_64-gnu <download available>
pypy-3.11.11-linux-x86_64-gnu <download available>
pypy-3.10.16-linux-x86_64-gnu <download available>
pypy-3.9.19-linux-x86_64-gnu <download available>
pypy-3.8.16-linux-x86_64-gnu <download available>
graalpy-3.11.0-linux-x86_64-gnu <download available>
graalpy-3.10.0-linux-x86_64-gnu <download available>
graalpy-3.8.5-linux-x86_64-gnu <download available>
It enables dictation that actually works and it's as fast as you can think. I also have a set of scripts which just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice with F5-TTS back and it's like having a local Jarvis.
The main limitation is that it's English-only.
# NeMo does not run on 3.13+
python3.12 -m venv .venv
source .venv/bin/activate
git clone https://github.com/NVIDIA/NeMo.git nemo
cd nemo
pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
pip install .[asr]
deactivate
Then run a transcribe.py script in that venv:

import os
import sys
import nemo.collections.asr as nemo_asr
model_path = sys.argv[1]
audio_path = sys.argv[2]
# Load from a local path...
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(restore_path=model_path)
# ...or download from huggingface ('org/model') instead:
# asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_path)
output = asr_model.transcribe([audio_path])
print(output[0])
With that I was able to run the model, but I ran out of memory on my lower-spec laptop. I haven't yet got around to running it on my workstation.

You'll need to modify the Python script to process the response and output it in a format you can use.
winget install --id=Nikse.SubtitleEdit -e
Last I looked into it, the main options required API access to external services, which put me off. I think it was pyannote.audio[1].
whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2
Works a treat for Zoom interviews. Diarization is sometimes a bit off, but generally it's correct.

Thanks, but I'm looking for live diarization.
I ran it last night using docker and it worked extremely well. You need a HuggingFace read-only API token for the Diarization. I found that the web UI ignored the token, but worked fine when I added it to docker compose as an environment variable.
Run Whisper audio transcriptions with one FFmpeg command
https://medium.com/@vpalmisano/run-whisper-audio-transcriptions-with-one-ffmpeg-command-c6ecda51901f
Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254
Reminds me of one of my own experiences with one of the Whisper models, where some random noise in the middle of the conversation was translated into "Don't forget to like and subscribe".
Really illustrates where the training data is coming from.
ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - \
| ./main - \
| head -2 \
| tail -1 \
| cut -d] -f2 \
| awk '{$1=$1};1'
The reading-from-mic part (-f pulse, pactl...) is Linux-specific; the rest of it should be cross-platform. The `main` executable is the whisper.cpp executable (see the whisper.cpp GitHub readme; it's just the output of `make base.en` from that).

Edit: -t 5 controls the recording duration.
Oh and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM that then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc) which I then....
> which I then....
Yes, please, go on...

The LLM can screw up now and then and output absolute garbage. But I've got a knack now for figuring out what prompts it's gonna be hopeless on and I manually enter those.
Example: saying "Remove makhana from shopping list" ends up running the command:

gkeep items edit shopping_list --check makhana
There is a direct text interface too that skips the voice transcription.
The main thing is it does this in a background window without interrupting my screen or me needing to wait for whatever slow webpage to load. I had it do a few things on GitHub, like remind me when checks pass on PRs. You could potentially connect it to various things, like your Amazon account to check on your orders, etc. As I write this I now realise I did what basically amounts to what folks do with MCP today. Maybe I should update it to use the protocol.
These days I have a little more idle time as a grad student than I did in a tech company, and I don't really need to manage home/cooking/... so I don't really use some of the more complicated features. I mostly just use it to schedule 1on1s with my guide and add reminders about assignments and TA work and talks and my music class.
Anyone found a way?
I could share a python script that is working pretty reliably for me.
https://code.ffmpeg.org/FFmpeg/FFmpeg/issues
I still see their old one too, but the Forgejo one is nice.
Basically a simple audio-to-text for personal use?
I tried several times to get this into a reasonable shape, but all have been failures. If anyone has pointers I really appreciate it.
Other than for the "live transcription" use case (which they made unnecessarily complicated), I don't see how this is any better than running Whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2], but LLMs make that point moot since you can just ask them to do the drudgery for you.
[1] https://medium.com/@vpalmisano/run-whisper-audio-transcriptions-with-one-ffmpeg-command-c6ecda51901f
That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.
1. git clone whisper.cpp
2. Make sure they have all dependencies for `that` library
3. Hope the build passes
4. Download the actual model
AND only then be able to use the `-af "whisper=model...` filter.
If they try to use the filter without all the prereqs they'll fail and it'll create frustration.
It'd be better to natively create a Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually make people use it much more.
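For reference, those steps currently look roughly like this (a sketch; the cmake invocation, model-download script, and FFmpeg's --enable-whisper configure flag are assumptions based on the whisper.cpp and FFmpeg docs):
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build -j          # hope the build passes
sh ./models/download-ggml-model.sh base.en        # download the actual model
# build ffmpeg itself with ./configure --enable-whisper ..., then:
ffmpeg -i input.mp4 -af "whisper=model=models/ggml-base.en.bin" -f null -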
brew install uv
uv tool install openai-whisper
then add ~/.local/bin/ to $PATH
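After that, a quick hedged usage example (the file name and model are placeholders; the flags are from the openai-whisper CLI):
whisper interview.m4a --model small --language en --output_format srt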
https://developer.apple.com/documentation/speech/speechtranscriber
https://developer.apple.com/documentation/speech/speechanalyzer
This is going to be great for real-time audio translation.