WhisperKit Versions

Swift native on-device speech recognition with Whisper for Apple Silicon

v0.7.2

1 week ago

Early stopping now tracks the chunked window internally when running async transcription via the VAD chunking method. This gives you finer control for stopping specific windows based on your custom criteria in the TranscriptionCallback.
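As a rough illustration, a per-window early-stop might look like the sketch below. This assumes TranscriptionProgress exposes the decoded tokens and that returning false from the callback signals early stopping; the token-budget criterion is purely hypothetical.

```swift
import WhisperKit

// Hypothetical criterion: stop a window once it has produced 128 tokens.
let maxTokensPerWindow = 128
let callback: TranscriptionCallback = { progress in
    // Returning false requests early stopping for the current window;
    // returning true (or nil) lets decoding continue.
    progress.tokens.count < maxTokensPerWindow
}

let whisperKit = try await WhisperKit()
let results = try await whisperKit.transcribe(
    audioPath: "your/audio/path.wav",
    decodeOptions: DecodingOptions(chunkingStrategy: .vad),
    callback: callback
)
```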

What's Changed

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.7.1...v0.7.2

v0.7.1

1 week ago

Hotfix for the shouldEarlyStop logic

What's Changed

  • Ensures early stopping flag on TextDecoder is always reset at the beginning of a new loop

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.7.0...v0.7.1

v0.7.0

2 weeks ago

This is a very exciting release because we're seeing yet another massive speedup in offline throughput thanks to VAD based chunking 🚀

Highlights

  • Energy VAD based chunking 🗣️ @jkrukowski
    • There is a new decoding option called chunkingStrategy which can significantly speed up your single file transcriptions with minimal WER downsides.
    • It works by finding a clip point in the middle of the longest silence (lowest audio energy) in the last 15s of a 30s window and uses that to split up all the audio ahead of time so it can be asynchronously decoded in parallel.
  • Here's a video of it in action, comparing the .none chunking strategy with .vad:

https://github.com/argmaxinc/WhisperKit/assets/1981179/0f865caa-3a08-412e-a0bf-080ec16a439a
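Enabling the new strategy is a one-line change in the decoding options. A minimal sketch, assuming chunkingStrategy is settable via the DecodingOptions initializer in this release:

```swift
import WhisperKit

// .vad splits the audio at low-energy points ahead of time so the
// resulting windows can be decoded asynchronously in parallel;
// .none preserves the previous sequential behavior.
let whisperKit = try await WhisperKit()
let results = try await whisperKit.transcribe(
    audioPath: "path/to/long_audio.wav",
    decodeOptions: DecodingOptions(chunkingStrategy: .vad)
)
```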

  • Detect language helper:
    • You can now call detectLanguage on the main WhisperKit object with just an audio path as input. It returns a simple language code and probability as a tuple, with minimal logging/timing.
    • Example:
let whisperKit = try await WhisperKit()
let (language, probs) = try await whisperKit.detectLanguage(audioPath: "your/audio/path/spanish.wav")
print(language) // "es"
  • WhisperKit via Expo @seb-sep
    • For anyone who's been wanting to use WhisperKit in React Native, @seb-sep is maintaining a repo that makes it easy. He also set up an automation that updates it with each new WhisperKit release; check it out here: https://github.com/seb-sep/whisper-kit-expo
  • Bug fixes and enhancements:
    • @jiangdi0924 and @fengcunhan contributed some nice fixes in this release with #136 and #138 (see below)
    • Also moved the decoding progress callback to be fully async so that it doesn't block the decoder thread

What's Changed

New Contributors

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.6.1...v0.7.0

v0.6.1

1 month ago

Smaller patch release with some nice improvements and two new contributors 🙌

Highlights

  • Tokenizer no longer requires a HubApi request to succeed if the files are already downloaded
    • This was a big request from the community and should enable offline transcription as long as everything is downloaded already
    • Also made the function public so you can bundle the tokenizer with the app along with the model files
  • @smpanaro found a really nice speedup across the board by using IOSurface backed MLMultiArrays
    • Especially noticeable on older devices
  • General cleanup, including a nice bug fix from @couche1 when streaming via the CLI

What's Changed

New Contributors

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.6.0...v0.6.1

v0.6.0

1 month ago

Highlights

  • Async batch transcription is here 🎉 contributed by @jkrukowski
    • With this release, you can now transcribe multiple audio files simultaneously, fully utilizing the new async prediction APIs released with iOS 17/macOS 14 (see the WWDC video here).
    • New interface with audioPaths input:

      let audioPaths = [
          "/path/to/file1.wav",
          "/path/to/file2.wav"
      ]
      let whisperKit = try await WhisperKit()
      let transcriptionResults: [[TranscriptionResult]?] = await whisperKit.transcribe(audioPaths: audioPaths)
    • You can also use it via the CLI using the new argument --audio-folder "path/to/folder/"
    • Future work will be chunking up single files to significantly speed up long-form transcription
    • Note that this entails breaking changes and deprecations, see below for the full upgrade guide.
  • Several bug fixes, accuracy improvements, and quality of life upgrades by @hewigovens @shawiz and @jkrukowski
    • Every issue raised and PR merged from the community helps make WhisperKit better every release, thank you and keep them coming! 🙏

⚠️ Upgrade Guide

We aim to minimize breaking changes, so this update adds deprecation flags for the changed interfaces; they will be removed later, but for now they remain usable and will not throw build errors. There are some breaking changes to lower-level and newer methods, so if you do notice build errors, click the dropdown below for the full guide.

Full Upgrade Guide

API changes

Deprecations

WhisperKit

Deprecated

public func transcribe(
    audioPath: String,
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?

use instead

public func transcribe(
    audioPath: String,
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]

Deprecated

public func transcribe(
    audioArray: [Float],
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?

use instead

public func transcribe(
    audioArray: [Float],
    decodeOptions: DecodingOptions? = nil,
    callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]
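Migrating call sites is mostly a matter of handling an array instead of an optional. A sketch of the change, assuming you previously read a single optional result:

```swift
import WhisperKit

let whisperKit = try await WhisperKit()

// Before (deprecated): an optional single result
// let result: TranscriptionResult? = try await whisperKit.transcribe(audioPath: path)
// let text = result?.text ?? ""

// After: an array of results (e.g. one per chunk); join their text
let results: [TranscriptionResult] = try await whisperKit.transcribe(
    audioPath: "path/to/file.wav"
)
let text = results.map(\.text).joined(separator: " ")
```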

TextDecoding

Deprecated

func decodeText(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options decoderOptions: DecodingOptions,
    callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> [DecodingResult]

use instead

func decodeText(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options decoderOptions: DecodingOptions,
    callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> DecodingResult

Deprecated

func detectLanguage(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options: DecodingOptions,
    temperature: FloatType
) async throws -> [DecodingResult]

use instead

func detectLanguage(
    from encoderOutput: MLMultiArray,
    using decoderInputs: DecodingInputs,
    sampler tokenSampler: TokenSampling,
    options: DecodingOptions,
    temperature: FloatType
) async throws -> DecodingResult

Breaking changes

  • removed Transcriber protocol

AudioProcessing

static func loadAudio(fromPath audioFilePath: String) -> AVAudioPCMBuffer?

becomes

static func loadAudio(fromPath audioFilePath: String) throws -> AVAudioPCMBuffer
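In practice this means swapping optional unwrapping for error handling. A sketch, assuming the static method lives on the concrete AudioProcessor type:

```swift
import AVFoundation
import WhisperKit

// Before: optional-based
// guard let buffer = AudioProcessor.loadAudio(fromPath: "path/to/file.wav") else { return }

// After: the failure reason now surfaces as a thrown error
do {
    let buffer: AVAudioPCMBuffer = try AudioProcessor.loadAudio(fromPath: "path/to/file.wav")
    // use buffer...
} catch {
    print("Failed to load audio: \(error)")
}
```

If you don't care about the error, `try?` restores the old optional behavior.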

AudioStreamTranscriber

public init(
    audioProcessor: any AudioProcessing, 
    transcriber: any Transcriber, 
    decodingOptions: DecodingOptions, 
    requiredSegmentsForConfirmation: Int = 2, 
    silenceThreshold: Float = 0.3, 
    compressionCheckWindow: Int = 20, 
    useVAD: Bool = true, 
    stateChangeCallback: AudioStreamTranscriberCallback?
)

becomes

public init(
    audioEncoder: any AudioEncoding,
    featureExtractor: any FeatureExtracting,
    segmentSeeker: any SegmentSeeking,
    textDecoder: any TextDecoding,
    tokenizer: any WhisperTokenizer,
    audioProcessor: any AudioProcessing,
    decodingOptions: DecodingOptions,
    requiredSegmentsForConfirmation: Int = 2,
    silenceThreshold: Float = 0.3,
    compressionCheckWindow: Int = 20,
    useVAD: Bool = true,
    stateChangeCallback: AudioStreamTranscriberCallback?
)

TextDecoding

func prepareDecoderInputs(withPrompt initialPrompt: [Int]) -> DecodingInputs?

becomes

func prepareDecoderInputs(withPrompt initialPrompt: [Int]) throws -> DecodingInputs

What's Changed

New Contributors

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.5.0...v0.6.0

v0.5.0

2 months ago

This is a HUGE release with some great new features and fixes 🙌

Highlights

  • Timestamp logits filter by @jkrukowski
    • Significantly increases the number of timestamp tokens in a particular window, which helps a lot with segmentation
    • This is on by default but can be disabled using the decoding option withoutTimestamps: true
  • Language detection by @Abhinay1997
    • New function on the TextDecoding protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
    • Enabled by default when the decoding options have usePrefillPrompt: false and language: nil, and the model is not English-only.
  • First token log prob thresholds fallback check by @jkrukowski
    • This feature is not in the original openai implementation but helps reduce hallucinations quite a bit.
    • Often, fallbacks due to log prob threshold are immediately identifiable by the first token, so this reduces the amount of forward passes needed to move to a higher temperature
  • Distil whisper support
    • distil-large-v3 was recently released and massively speeds up predictions with minimal quality loss. We've converted and optimized four distil models for WhisperKit on Core ML, and they're really fast!
    • distil-large-v3 distil-large-v3_594MB distil-large-v3_turbo distil-large-v3_turbo_600MB
    • Note that these do not yet have word timestamp alignment heads, so they can't be used with wordTimestamps: true
    • It can be run via CLI as well:
      • swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav
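Putting the decoding options above together in code might look like the following sketch. The property names are taken from this release's notes (usePrefillPrompt, withoutTimestamps, wordTimestamps) and the WhisperKit(model:) convenience initializer is an assumption; check the DecodingOptions definition in your WhisperKit version.

```swift
import WhisperKit

var options = DecodingOptions()
options.language = nil            // nil language...
options.usePrefillPrompt = false  // ...plus no prefill prompt enables language detection
options.withoutTimestamps = false // keep the timestamp logits filter active
options.wordTimestamps = false    // word timestamps not yet supported for distil models

let whisperKit = try await WhisperKit(model: "distil-large-v3_turbo_600MB")
let results = try await whisperKit.transcribe(
    audioPath: "your_audio.wav",
    decodeOptions: options
)
```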

⚠️ Experimental new stream mode

We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight, or take a look at the code, and let us know how it can be improved.

Recommended settings for the best performance for this iteration are:

  • Max tokens per loop < 100
  • Max fallback count < 2
  • Prompt and cache prefill true

Looking for feedback on:

  • Token confirmation numbers that work well
  • Model, device, and settings combinations that work well

https://github.com/argmaxinc/WhisperKit/assets/1981179/0a88ca34-3a0e-4ff5-9829-9f980a4661ea

What's Changed

New Contributors

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.4.1...v0.5.0

v0.4.1

2 months ago

v0.4.0 was our first release on Homebrew, and this will be our first automated update to the formula, huge props to @jkrukowski for his contributions on this.

What's Changed

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.4.0...v0.4.1

v0.4.0

2 months ago

Lots of nice fixes in this release!

⚠️ Breaking change

We had to rename the CLI entry point in preparation for Homebrew distribution. Here is how to use it now:

Old: swift run transcribe --audio-path path/to/your/audio.mp3
New: swift run whisperkit-cli transcribe --audio-path path/to/your/audio.mp3

What's Changed

New Contributors

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.3.3...v0.4.0

v0.3.3

2 months ago

What's Changed

Some great contributions in this patch:

New Contributors

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.3.2...v0.3.3

v0.3.2

3 months ago

What's Changed

With these changes, our build warnings are now down to 0 🎉

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.3.1...v0.3.2