PyWhisperCpp API Reference

pywhispercpp.model

This module contains a simple Python API on top of the C-style whisper.cpp API.

Segment

Segment(t0, t1, text, probability=np.nan)

A small class representing a transcription segment

Parameters:

t0 (int) –

start time, in whisper.cpp timestamp units of 10 ms (utils.to_timestamp converts these to hh:mm:ss)
t1 (int) –

end time, in the same 10 ms units as t0
text (str) –

text
probability (float, default: nan ) –

Confidence score for the segment, computed as the geometric mean of the token probabilities for the segment (NaN if not calculated). This makes it interpretable as a probability in [0, 1].

Source code in pywhispercpp/model.py

def __init__(self, t0: int, t1: int, text: str, probability: float = np.nan):
    """
    :param t0: start time, in whisper.cpp timestamp units of 10 ms
    :param t1: end time, in whisper.cpp timestamp units of 10 ms
    :param text: text
    :param probability: Confidence score for the segment, computed as the geometric mean of
        the token probabilities for the segment (NaN if not calculated).
        This makes it interpretable as a probability in [0, 1].
    """
    self.t0 = t0
    self.t1 = t1
    self.text = text
    self.probability = probability
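
The geometric mean mentioned for probability can be illustrated with a short, self-contained sketch. The helper name geometric_mean_probability and the sample token probabilities are assumptions made for illustration; this is not the library's internal code, only one common way to compute such a score in log space.

```python
import math
from typing import Sequence


def geometric_mean_probability(token_probs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, computed in log space.

    Illustrative helper (not part of pywhispercpp). Returns NaN for an
    empty sequence, mirroring the "not calculated" default of Segment.
    """
    if not token_probs:
        return math.nan
    if any(p <= 0.0 for p in token_probs):
        # A single zero-probability token drives the geometric mean to 0.
        return 0.0
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


# Made-up token probabilities for one segment.
confidence = geometric_mean_probability([0.9, 0.8, 0.95])
print(f"{confidence:.3f}")
```

Because every input lies in [0, 1], the result does too, which is why the score can be read as a per-token average probability for the segment.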

Model

Model(
    model="tiny",
    models_dir=None,
    params_sampling_strategy=0,
    redirect_whispercpp_logs_to=False,
    use_openvino=False,
    openvino_model_path=None,
    openvino_device="CPU",
    openvino_cache_dir=None,
    context_params=None,
    **params
)

This class defines a whisper.cpp model.

Example usage.

model = Model('base.en', n_threads=6)
segments = model.transcribe('file.mp3')
for segment in segments:
    print(segment.text)

Parameters:

model (str, default: 'tiny' ) –

model name, default tiny, or a direct path to a ggml model file.
models_dir (Optional[str], default: None ) –

directory containing model files; if omitted, uses MODELS_DIR unless model is already a direct file path.
params_sampling_strategy (int, default: 0 ) –

sampling strategy selector; 0 uses greedy decoding and any other value uses beam search.
redirect_whispercpp_logs_to (Union[bool, TextIO, str, None], default: False ) –

log redirection target. Use False for no redirection, None for /dev/null, a file path string, or sys.stdout/sys.stderr.
use_openvino (bool, default: False ) –

whether to initialize the OpenVINO encoder backend.
openvino_model_path (Optional[str], default: None ) –

path to the OpenVINO model directory or files.
openvino_device (str, default: 'CPU' ) –

OpenVINO device name, default CPU.
openvino_cache_dir (Optional[str], default: None ) –

OpenVINO cache directory.
context_params (Optional[ContextParams], default: None ) –

optional whisper context loader params. Accepted keys are use_gpu, flash_attn, gpu_device, dtw_token_timestamps, dtw_aheads_preset, dtw_n_top, and dtw_mem_size. Omitted keys inherit from whisper_context_default_params().
params –

keyword-only decode parameters matching the public API documented in model.pyi. These values are forwarded to whisper_full_params and remain active for future calls. Supported keys:

- n_threads: number of inference threads. Default is min(4, hardware_concurrency()).
- n_max_text_ctx: max prompt-text tokens carried into the decoder. Default 16384.
- offset_ms: audio start offset in milliseconds. Default 0.
- duration_ms: audio duration to process in milliseconds. Default 0.
- translate: translate output to English. Default False.
- no_context: disable reuse of past transcription context. Default True.
- no_timestamps: disable timestamp generation. Default False.
- single_segment: force a single output segment. Default False.
- print_special: print special tokens. Default False.
- print_progress: print progress information. Default True.
- print_realtime: print realtime output from whisper.cpp. Default False.
- print_timestamps: print timestamps during realtime output. Default True.
- token_timestamps: enable token-level timestamps. Default False.
- thold_pt: token timestamp probability threshold. Default 0.01.
- thold_ptsum: token timestamp sum threshold. Default 0.01.
- max_len: max segment length in characters. Default 0.
- split_on_word: split on words when max_len is used. Default False.
- max_tokens: max tokens per segment. Default 0.
- debug_mode: enable whisper.cpp debug mode. Default False.
- audio_ctx: override audio context size. Default 0.
- tdrz_enable: enable tinydiarize speaker-turn detection. Default False.
- initial_prompt: initial text prompt prepended before decoding. Default None.
- prompt_tokens: explicit prompt token sequence. Default None.
- prompt_n_tokens: number of prompt tokens. Default 0.
- carry_initial_prompt: prepend the initial prompt to each decode window. Default False.
- language: language code. Default '' (empty string).
- detect_language: enable automatic language detection during transcription. Default False.
- suppress_blank: suppress blank outputs. Default True.
- suppress_non_speech_tokens: Python alias for suppress_nst. Default False.
- suppress_nst: suppress non-speech tokens. Default False.
- suppress_regex: regex pattern used to suppress matching text during decoding. Default ''.
- temperature: initial decoding temperature. Default 0.0.
- max_initial_ts: maximum initial timestamp. Default 1.0.
- length_penalty: length penalty. Default -1.0.
- temperature_inc: fallback temperature increment. Default 0.2.
- entropy_thold: entropy threshold. Default 2.4.
- logprob_thold: logprob threshold. Default -1.0.
- no_speech_thold: no-speech threshold. Default 0.6.
- greedy: greedy-decoder settings, typically {"best_of": 5}.
- beam_search: beam-search settings. Default {"beam_size": -1, "patience": -1.0}.
- vad: enable VAD. Default False.
- vad_model_path: path to the VAD model. Default None.

Source code in pywhispercpp/model.py

def __init__(self,
             model: str = 'tiny',
             models_dir: Optional[str] = None,
             params_sampling_strategy: int = 0,
             redirect_whispercpp_logs_to: Union[bool, TextIO, str, None] = False,
             use_openvino: bool = False,
             openvino_model_path: Optional[str] = None,
             openvino_device: str = 'CPU',
             openvino_cache_dir: Optional[str] = None,
             context_params: Optional[ContextParams] = None,
             **params):
    """
    :param model: model name, default `tiny`, or a direct path to a ggml model file.
    :param models_dir: directory containing model files; if omitted, uses `MODELS_DIR` unless `model`
                       is already a direct file path.
    :param params_sampling_strategy: sampling strategy selector; `0` uses greedy decoding and any
                                     other value uses beam search.
    :param redirect_whispercpp_logs_to: log redirection target. Use `False` for no redirection, `None`
                                        for `/dev/null`, a file path string, or `sys.stdout`/`sys.stderr`.
    :param use_openvino: whether to initialize the OpenVINO encoder backend.
    :param openvino_model_path: path to the OpenVINO model directory or files.
    :param openvino_device: OpenVINO device name, default `CPU`.
    :param openvino_cache_dir: OpenVINO cache directory.
    :param context_params: optional whisper context loader params. Accepted keys are `use_gpu`,
                           `flash_attn`, `gpu_device`, `dtw_token_timestamps`,
                           `dtw_aheads_preset`, `dtw_n_top`, and `dtw_mem_size`. Omitted keys inherit
                           from `whisper_context_default_params()`.
    :param params: keyword-only decode parameters matching the public API documented in `model.pyi`.
        These values are forwarded to `whisper_full_params` and remain active for future calls.
        Supported keys:
        - `n_threads`: number of inference threads. Default is `min(4, hardware_concurrency())`.
        - `n_max_text_ctx`: max prompt-text tokens carried into the decoder. Default `16384`.
        - `offset_ms`: audio start offset in milliseconds. Default `0`.
        - `duration_ms`: audio duration to process in milliseconds. Default `0`.
        - `translate`: translate output to English. Default `False`.
        - `no_context`: disable reuse of past transcription context. Default `True`.
        - `no_timestamps`: disable timestamp generation. Default `False`.
        - `single_segment`: force a single output segment. Default `False`.
        - `print_special`: print special tokens. Default `False`.
        - `print_progress`: print progress information. Default `True`.
        - `print_realtime`: print realtime output from whisper.cpp. Default `False`.
        - `print_timestamps`: print timestamps during realtime output. Default `True`.
        - `token_timestamps`: enable token-level timestamps. Default `False`.
        - `thold_pt`: token timestamp probability threshold. Default `0.01`.
        - `thold_ptsum`: token timestamp sum threshold. Default `0.01`.
        - `max_len`: max segment length in characters. Default `0`.
        - `split_on_word`: split on words when `max_len` is used. Default `False`.
        - `max_tokens`: max tokens per segment. Default `0`.
        - `debug_mode`: enable whisper.cpp debug mode. Default `False`.
        - `audio_ctx`: override audio context size. Default `0`.
        - `tdrz_enable`: enable tinydiarize speaker-turn detection. Default `False`.
        - `initial_prompt`: initial text prompt prepended before decoding. Default `None`.
        - `prompt_tokens`: explicit prompt token sequence. Default `None`.
        - `prompt_n_tokens`: number of prompt tokens. Default `0`.
        - `carry_initial_prompt`: prepend the initial prompt to each decode window. Default `False`.
        - `language`: language code. Default `''`.
        - `detect_language`: enable automatic language detection during transcription. Default `False`.
        - `suppress_blank`: suppress blank outputs. Default `True`.
        - `suppress_non_speech_tokens`: Python alias for `suppress_nst`. Default `False`.
        - `suppress_nst`: suppress non-speech tokens. Default `False`.
        - `suppress_regex`: regex pattern used to suppress matching text during decoding. Default `''`.
        - `temperature`: initial decoding temperature. Default `0.0`.
        - `max_initial_ts`: maximum initial timestamp. Default `1.0`.
        - `length_penalty`: length penalty. Default `-1.0`.
        - `temperature_inc`: fallback temperature increment. Default `0.2`.
        - `entropy_thold`: entropy threshold. Default `2.4`.
        - `logprob_thold`: logprob threshold. Default `-1.0`.
        - `no_speech_thold`: no-speech threshold. Default `0.6`.
        - `greedy`: greedy-decoder settings, typically `{"best_of": 5}`.
        - `beam_search`: beam-search settings. Default `{"beam_size": -1, "patience": -1.0}`.
        - `vad`: enable VAD. Default `False`.
        - `vad_model_path`: path to the VAD model. Default `None`.
    """
    self.model_path = utils.resolve_model_path(model, models_dir)
    self._ctx = None
    self._context_params = self._resolve_context_params(context_params)
    self._sampling_strategy = pw.whisper_sampling_strategy.WHISPER_SAMPLING_GREEDY if params_sampling_strategy == 0 else \
        pw.whisper_sampling_strategy.WHISPER_SAMPLING_BEAM_SEARCH
    self._params = pw.whisper_full_default_params(self._sampling_strategy)
    # assign params
    self.params = params
    self._set_params(params)
    self.redirect_whispercpp_logs_to = redirect_whispercpp_logs_to
    self.use_openvino = use_openvino
    self.openvino_model_path = openvino_model_path
    self.openvino_device = openvino_device
    self.openvino_cache_dir = openvino_cache_dir
    # TODO: maybe set up default callbacks for segments and abort, globally and/or per model instance?
    self._new_segment_callback = None
    # init the model
    self._init_model()
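
As a rough sketch of how the sampling-strategy selector and the matching decoder settings fit together, the keyword arguments for Model can be assembled in one place. The helper build_model_kwargs is hypothetical, the beam size of 5 is an arbitrary illustrative value, and the example only builds a dictionary, so it runs without the library installed.

```python
from typing import Any, Dict


def build_model_kwargs(use_beam_search: bool, beam_size: int = 5,
                       **decode_params: Any) -> Dict[str, Any]:
    """Assemble keyword arguments for Model(...) (hypothetical helper)."""
    kwargs: Dict[str, Any] = dict(decode_params)
    if use_beam_search:
        # Any non-zero selector picks beam search, per the parameter docs.
        kwargs["params_sampling_strategy"] = 1
        kwargs["beam_search"] = {"beam_size": beam_size, "patience": -1.0}
    else:
        kwargs["params_sampling_strategy"] = 0  # 0 selects greedy decoding
    return kwargs


kwargs = build_model_kwargs(True, beam_size=5, n_threads=6, language="en")
print(kwargs)
```

With pywhispercpp installed, the result could presumably be passed on as Model("base.en", **kwargs); that call is left out here because it loads or downloads a model.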

transcribe

transcribe(
    media,
    n_processors=None,
    new_segment_callback=None,
    abort_callback=None,
    extract_probability=False,
    **params
)

Transcribes the media provided as input and returns a list of Segment objects. Accepts a media file path (audio/video) or a raw numpy array.

Parameters:

media (AudioInput) –

Media file path or a numpy array
n_processors (Optional[int], default: None ) –

number of worker processes for whisper_full_parallel. If omitted, runs a single-process whisper_full() decode.
new_segment_callback (Optional[Callable[[Segment], None]], default: None ) –

callback invoked for each newly produced Segment during decoding.
abort_callback (Optional[Callable[[], bool]], default: None ) –

callback function returning True to abort an in-flight transcription early.
extract_probability (bool, default: False ) –

If True, calculates the geometric mean of token probabilities for each segment, providing a confidence score interpretable as a probability in [0, 1].
params –

additional keyword-only decode parameters matching the public API documented in model.pyi, with the same supported keys and defaults as Model.__init__. Any overrides applied here remain active for future calls.

Returns:

List[Segment] –

List of transcription segments
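
The two callback parameters can be sketched without the library by matching their documented signatures: Callable[[Segment], None] for new segments and Callable[[], bool] for aborting. The names make_deadline_abort and on_new_segment are hypothetical, and SimpleNamespace objects stand in for Segment instances.

```python
import time
from types import SimpleNamespace
from typing import Callable, List


def make_deadline_abort(timeout_s: float) -> Callable[[], bool]:
    """Return an abort callback that reports True once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s

    def should_abort() -> bool:
        return time.monotonic() >= deadline

    return should_abort


collected: List[str] = []


def on_new_segment(segment) -> None:
    """Collect the text of each segment as it is produced."""
    collected.append(segment.text.strip())


# Stand-in objects exposing the same attributes as Segment (t0, t1, text).
for fake in (SimpleNamespace(t0=0, t1=120, text=" Hello"),
             SimpleNamespace(t0=120, t1=260, text=" world")):
    on_new_segment(fake)

abort_now = make_deadline_abort(0.0)     # deadline has already passed
abort_later = make_deadline_abort(60.0)  # deadline is a minute away
print(collected, abort_now(), abort_later())
```

In real use these would be supplied as model.transcribe(media, new_segment_callback=on_new_segment, abort_callback=make_deadline_abort(30.0)), where the 30-second budget is again only an example value.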

Source code in pywhispercpp/model.py

def transcribe(self,
               media: Union[str, np.ndarray],
               n_processors: Optional[int] = None,
               new_segment_callback: Optional[Callable[[Segment], None]] = None,
               abort_callback: Optional[Callable[[], bool]] = None,
               extract_probability: bool = False,
               **params) -> List[Segment]:
    """
    Transcribes the media provided as input and returns a list of `Segment` objects.
    Accepts a media file path (audio/video) or a raw numpy array.

    :param media: Media file path or a numpy array
    :param n_processors: number of worker processes for `whisper_full_parallel`. If omitted, runs a
                 single-process `whisper_full()` decode.
    :param new_segment_callback: callback invoked for each newly produced `Segment` during decoding.
    :param abort_callback: callback function returning True to abort an in-flight transcription early.
    :param extract_probability: If True, calculates the geometric mean of token probabilities for each segment,
        providing a confidence score interpretable as a probability in [0, 1].
    :param params: additional keyword-only decode parameters matching the public API documented in
        `model.pyi`, with the same supported keys and defaults as `Model.__init__`.
        Any overrides applied here remain active for future calls.
    :return: List of transcription segments
    """
    if isinstance(media, np.ndarray):
        audio = media
    else:
        if not Path(media).exists():
            raise FileNotFoundError(media)
        audio = self._load_audio(media)

    # update params if any
    self._set_params(params)

    # Set up the callback. Make sure self._new_segment_callback is None when new_segment_callback is None,
    # since the callback is no longer bound to the Model but to self.
    self._new_segment_callback = new_segment_callback
    pw.assign_new_segment_callback(
        self._params,
        self.__call_new_segment_callback if new_segment_callback is not None else None,
    )

    pw.assign_abort_callback(self._params, abort_callback)

    # run inference
    start_time = time()
    logger.info("Transcribing ...")
    res = self._transcribe(audio, n_processors=n_processors, extract_probability=extract_probability)
    end_time = time()
    logger.info(f"Inference time: {end_time - start_time:.3f} s")
    return res
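
Since overrides passed to transcribe remain active for later calls, it can help to picture the parameters as shared, mutable state. The class StickyParams below is a toy stand-in written for this illustration, not the library's implementation; it only mimics the "update, then keep" behaviour described in the docstring.

```python
from typing import Any, Dict


class StickyParams:
    """Toy stand-in for the persistent decode parameters of a model."""

    def __init__(self, **defaults: Any) -> None:
        self._values: Dict[str, Any] = dict(defaults)

    def run(self, **overrides: Any) -> Dict[str, Any]:
        # Overrides are merged into the stored state, so they outlive this call.
        self._values.update(overrides)
        return dict(self._values)


params = StickyParams(language="", translate=False)
first = params.run(translate=True)   # override applied here...
second = params.run()                # ...and still active here
third = params.run(translate=False)  # reset explicitly when no longer wanted
print(first["translate"], second["translate"], third["translate"])
```

The practical consequence is that a one-off override should be reset explicitly on the next call if the earlier behaviour is wanted back.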

get_params

get_params()

Returns a dict representation of the current params

Returns:

Dict[str, Any] –

params dict

Source code in pywhispercpp/model.py

def get_params(self) -> dict:
    """
    Returns a `dict` representation of the current params

    :return: params dict
    """
    res = {}
    for param in dir(self._params):
        if param.startswith('__'):
            continue
        try:
            res[param] = getattr(self._params, param)
        except Exception:
            # ignore callback functions
            continue
    return res

get_params_schema `staticmethod`

get_params_schema()

Returns constants.PARAMS_SCHEMA, the schema of the supported parameters.

Returns:

Dict[str, Dict[str, Any]] –

dict of params schema

Source code in pywhispercpp/model.py

@staticmethod
def get_params_schema() -> dict:
    """
    A simple link to ::: constants.PARAMS_SCHEMA
    :return: dict of params schema
    """
    return constants.PARAMS_SCHEMA

lang_max_id `staticmethod`

lang_max_id()

Largest language id (i.e. number of available languages - 1). Direct binding to whisper.cpp/whisper_lang_max_id

Returns:

int –

Source code in pywhispercpp/model.py

@staticmethod
def lang_max_id() -> int:
    """
    Largest language id (i.e. number of available languages - 1).
    Direct binding to whisper.cpp/whisper_lang_max_id
    :return:
    """
    return pw.whisper_lang_max_id()

print_timings

print_timings()

Direct binding to whisper.cpp/whisper_print_timings

Returns:

None –

None

Source code in pywhispercpp/model.py

def print_timings(self) -> None:
    """
    Direct binding to whisper.cpp/whisper_print_timings

    :return: None
    """
    pw.whisper_print_timings(self._ctx)

system_info `staticmethod`

system_info()

Direct binding to whisper.cpp/whisper_print_system_info

Returns:

Any –

None

Source code in pywhispercpp/model.py

@staticmethod
def system_info() -> None:
    """
    Direct binding to whisper.cpp/whisper_print_system_info

    :return: None
    """
    return pw.whisper_print_system_info()

available_languages `staticmethod`

available_languages()

Returns a list of supported language codes

Returns:

List[str] –

list of supported language codes

Source code in pywhispercpp/model.py

@staticmethod
def available_languages() -> List[str]:
    """
    Returns a list of supported language codes

    :return: list of supported language codes
    """
    n = pw.whisper_lang_max_id()
    res = []
    for i in range(n+1):
        res.append(pw.whisper_lang_str(i))
    return res

auto_detect_language

auto_detect_language(media, offset_ms=None, n_threads=None)

Automatic language detection using whisper.cpp/whisper_pcm_to_mel and whisper.cpp/whisper_lang_auto_detect

Parameters:

media (AudioInput) –

Media file path or a numpy array
offset_ms (Optional[int], default: None ) –

offset in milliseconds; when omitted, uses the model's current offset_ms
n_threads (Optional[int], default: None ) –

number of threads to use; when omitted, uses the model's current n_threads

Returns:

Tuple[Tuple[str, float32], Dict[str, float32]] –

((detected_language, probability), probabilities for all languages)

Source code in pywhispercpp/model.py

def auto_detect_language(self, media: Union[str, np.ndarray], offset_ms: Optional[int] = None, n_threads: Optional[int] = None) -> Tuple[Tuple[str, np.float32], Dict[str, np.float32]]:
    """
    Automatic language detection using whisper.cpp/whisper_pcm_to_mel and whisper.cpp/whisper_lang_auto_detect

    :param media: Media file path or a numpy array
    :param offset_ms: offset in milliseconds; when omitted, uses the model's current `offset_ms`
    :param n_threads: number of threads to use; when omitted, uses the model's current `n_threads`
    :return: ((detected_language, probability), probabilities for all languages)
    """
    if isinstance(media, np.ndarray):
        audio = media
    else:
        if not Path(media).exists():
            raise FileNotFoundError(media)
        audio = self._load_audio(media)

    if offset_ms is None:
        offset_ms = self._params.offset_ms

    if n_threads is None:
        n_threads = self._params.n_threads

    pw.whisper_pcm_to_mel(self._ctx, audio, len(audio), n_threads)
    lang_count = self.lang_max_id() + 1
    probs = np.zeros(lang_count, dtype=np.float32)
    auto_detect = pw.whisper_lang_auto_detect(self._ctx, offset_ms, n_threads, probs)
    langs = self.available_languages()
    lang_probs = {langs[i]: probs[i] for i in range(lang_count)}
    return (langs[auto_detect], np.float32(probs[auto_detect])), lang_probs
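
The second element of the returned tuple maps every language code to a probability, so ranking candidates is a small post-processing step. The sketch below assumes a plain dict of floats with made-up values; top_languages is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Tuple


def top_languages(lang_probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable languages, highest probability first."""
    ranked = sorted(lang_probs.items(), key=lambda item: item[1], reverse=True)
    return [(lang, float(prob)) for lang, prob in ranked[:k]]


# Made-up values shaped like the second element of the return tuple.
fake_lang_probs = {"en": 0.81, "de": 0.09, "fr": 0.06, "es": 0.04}
best = top_languages(fake_lang_probs, k=2)
print(best)
```

Converting with float() also turns np.float32 values into plain Python floats, which is convenient when the result is logged or serialized.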

pywhispercpp.constants

Constants

WHISPER_SAMPLE_RATE `module-attribute`

WHISPER_SAMPLE_RATE = WHISPER_SAMPLE_RATE

MODELS_BASE_URL `module-attribute`

MODELS_BASE_URL = (
    "https://huggingface.co/ggerganov/whisper.cpp"
)

MODELS_PREFIX_URL `module-attribute`

MODELS_PREFIX_URL = 'resolve/main/ggml'
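
The two URL constants are presumably joined with a model name to form a download link. The private helper that does this is not shown in this reference, so the sketch below is an assumption based on the ggml-<model_name>.bin naming convention; guess_model_url is a hypothetical name.

```python
MODELS_BASE_URL = "https://huggingface.co/ggerganov/whisper.cpp"
MODELS_PREFIX_URL = "resolve/main/ggml"


def guess_model_url(model_name: str) -> str:
    """Hypothetical reconstruction; the real helper is not shown in this reference."""
    # Assumption: file names follow the "ggml-<model_name>.bin" convention.
    return f"{MODELS_BASE_URL}/{MODELS_PREFIX_URL}-{model_name}.bin"


url = guess_model_url("base.en")
print(url)
```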

PACKAGE_NAME `module-attribute`

PACKAGE_NAME = 'pywhispercpp'

MODELS_DIR `module-attribute`

MODELS_DIR = Path(user_data_dir(PACKAGE_NAME)) / 'models'

AVAILABLE_MODELS `module-attribute`

AVAILABLE_MODELS = [
    "base",
    "base-q5_1",
    "base-q8_0",
    "base.en",
    "base.en-q5_1",
    "base.en-q8_0",
    "large-v1",
    "large-v2",
    "large-v2-q5_0",
    "large-v2-q8_0",
    "large-v3",
    "large-v3-q5_0",
    "large-v3-turbo",
    "large-v3-turbo-q5_0",
    "large-v3-turbo-q8_0",
    "medium",
    "medium-q5_0",
    "medium-q8_0",
    "medium.en",
    "medium.en-q5_0",
    "medium.en-q8_0",
    "small",
    "small-q5_1",
    "small-q8_0",
    "small.en",
    "small.en-q5_1",
    "small.en-q8_0",
    "tiny",
    "tiny-q5_1",
    "tiny-q8_0",
    "tiny.en",
    "tiny.en-q5_1",
    "tiny.en-q8_0",
]

PARAMS_SCHEMA `module-attribute`

PARAMS_SCHEMA = {
    "n_threads": {
        "type": int,
        "description": "Number of threads to allocate for the inference, defaults to min(4, available hardware_concurrency)",
        "options": None,
        "default": None,
    },
    "n_max_text_ctx": {
        "type": int,
        "description": "max tokens to use from past text as prompt for the decoder",
        "options": None,
        "default": 16384,
    },
    "offset_ms": {
        "type": int,
        "description": "start offset in ms",
        "options": None,
        "default": 0,
    },
    "duration_ms": {
        "type": int,
        "description": "audio duration to process in ms",
        "options": None,
        "default": 0,
    },
    "translate": {
        "type": bool,
        "description": "whether to translate the audio to English",
        "options": None,
        "default": False,
    },
    "no_context": {
        "type": bool,
        "description": "do not use past transcription (if any) as initial prompt for the decoder",
        "options": None,
        "default": True,
    },
    "no_timestamps": {
        "type": bool,
        "description": "do not generate timestamps",
        "options": None,
        "default": False,
    },
    "single_segment": {
        "type": bool,
        "description": "force single segment output (useful for streaming)",
        "options": None,
        "default": False,
    },
    "print_special": {
        "type": bool,
        "description": "print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)",
        "options": None,
        "default": False,
    },
    "print_progress": {
        "type": bool,
        "description": "print progress information",
        "options": None,
        "default": True,
    },
    "print_realtime": {
        "type": bool,
        "description": "print results from within whisper.cpp (avoid it, use callback instead)",
        "options": None,
        "default": False,
    },
    "print_timestamps": {
        "type": bool,
        "description": "print timestamps for each text segment when printing realtime",
        "options": None,
        "default": True,
    },
    "token_timestamps": {
        "type": bool,
        "description": "enable token-level timestamps",
        "options": None,
        "default": False,
    },
    "thold_pt": {
        "type": float,
        "description": "timestamp token probability threshold (~0.01)",
        "options": None,
        "default": 0.01,
    },
    "thold_ptsum": {
        "type": float,
        "description": "timestamp token sum probability threshold (~0.01)",
        "options": None,
        "default": 0.01,
    },
    "max_len": {
        "type": int,
        "description": "max segment length in characters, note: token_timestamps needs to be set to True for this to work",
        "options": None,
        "default": 0,
    },
    "split_on_word": {
        "type": bool,
        "description": "split on word rather than on token (when used with max_len)",
        "options": None,
        "default": False,
    },
    "max_tokens": {
        "type": int,
        "description": "max tokens per segment (0 = no limit)",
        "options": None,
        "default": 0,
    },
    "debug_mode": {
        "type": bool,
        "description": "enable debug mode in whisper.cpp",
        "options": None,
        "default": False,
    },
    "audio_ctx": {
        "type": int,
        "description": "overwrite the audio context size (0 = use default)",
        "options": None,
        "default": 0,
    },
    "tdrz_enable": {
        "type": bool,
        "description": "enable tinydiarize speaker turn detection",
        "options": None,
        "default": False,
    },
    "initial_prompt": {
        "type": str,
        "description": "Initial prompt, these are prepended to any existing text context from a previous call",
        "options": None,
        "default": None,
    },
    "prompt_tokens": {
        "type": Tuple,
        "description": "tokens to provide to the whisper decoder as initial prompt",
        "options": None,
        "default": None,
    },
    "prompt_n_tokens": {
        "type": int,
        "description": "number of tokens in prompt_tokens",
        "options": None,
        "default": 0,
    },
    "carry_initial_prompt": {
        "type": bool,
        "description": "always prepend the initial prompt to each decode window",
        "options": None,
        "default": False,
    },
    "language": {
        "type": str,
        "description": 'for auto-detection, set to None, "" or "auto"',
        "options": None,
        "default": "",
    },
    "detect_language": {
        "type": bool,
        "description": "enable automatic language detection during transcription",
        "options": None,
        "default": False,
    },
    "suppress_blank": {
        "type": bool,
        "description": "suppress blank outputs",
        "options": None,
        "default": True,
    },
    "suppress_non_speech_tokens": {
        "type": bool,
        "description": "Python alias for suppress_nst (non-speech token suppression)",
        "options": None,
        "default": False,
    },
    "suppress_nst": {
        "type": bool,
        "description": "canonical whisper.cpp name for non-speech token suppression",
        "options": None,
        "default": False,
    },
    "suppress_regex": {
        "type": str,
        "description": "regex pattern used to suppress matching text during decoding",
        "options": None,
        "default": "",
    },
    "temperature": {
        "type": float,
        "description": "initial decoding temperature",
        "options": None,
        "default": 0.0,
    },
    "max_initial_ts": {
        "type": float,
        "description": "max_initial_ts",
        "options": None,
        "default": 1.0,
    },
    "length_penalty": {
        "type": float,
        "description": "length_penalty",
        "options": None,
        "default": -1.0,
    },
    "temperature_inc": {
        "type": float,
        "description": "temperature_inc",
        "options": None,
        "default": 0.2,
    },
    "entropy_thold": {
        "type": float,
        "description": 'similar to OpenAI\'s "compression_ratio_threshold"',
        "options": None,
        "default": 2.4,
    },
    "logprob_thold": {
        "type": float,
        "description": "logprob_thold",
        "options": None,
        "default": -1.0,
    },
    "no_speech_thold": {
        "type": float,
        "description": "no_speech_thold",
        "options": None,
        "default": 0.6,
    },
    "greedy": {
        "type": dict,
        "description": "greedy",
        "options": None,
        "default": {"best_of": 5},
    },
    "beam_search": {
        "type": dict,
        "description": "beam_search",
        "options": None,
        "default": {"beam_size": -1, "patience": -1.0},
    },
    "vad": {
        "type": bool,
        "description": "Enable VAD",
        "options": None,
        "default": False,
    },
    "vad_model_path": {
        "type": str,
        "description": "Path to VAD model",
        "options": None,
        "default": None,
    },
}
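
One possible use of a schema of this shape is checking user-supplied keyword arguments before they reach the model. The sketch below works on a small excerpt with the same structure; validate_params and SCHEMA_EXCERPT are names invented for the example, and the type rules (rejecting bool for numeric fields, accepting int for float) are design choices of the sketch rather than documented library behaviour.

```python
from typing import Any, Dict, List

# Small excerpt with the same shape as PARAMS_SCHEMA.
SCHEMA_EXCERPT: Dict[str, Dict[str, Any]] = {
    "n_threads": {"type": int, "default": None},
    "translate": {"type": bool, "default": False},
    "temperature": {"type": float, "default": 0.0},
    "language": {"type": str, "default": ""},
}


def validate_params(params: Dict[str, Any],
                    schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; empty means the params look fine."""
    problems: List[str] = []
    for name, value in params.items():
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
            continue
        expected = schema[name]["type"]
        if value is None:
            continue  # None is used as a default for several entries
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        if expected in (int, float) and isinstance(value, bool):
            problems.append(f"{name}: expected {expected.__name__}, got bool")
        elif expected is float and isinstance(value, int):
            continue  # ints are acceptable where floats are expected
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


issues = validate_params({"n_threads": "4", "translate": True, "beam": 5}, SCHEMA_EXCERPT)
print(issues)
```

With the library installed, the dictionary returned by Model.get_params_schema() could be passed in place of the excerpt.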

pywhispercpp.utils

Helper functions

download_model

download_model(
    model_name, download_dir=None, chunk_size=1024
)

Helper function to download the ggml models

Parameters:

model_name (str) –

name of the model, one of constants.AVAILABLE_MODELS
download_dir –

Where to store the models
chunk_size –

size of the download chunk

Returns:

str –

Absolute path of the downloaded model

Source code in pywhispercpp/utils.py

def download_model(model_name: str, download_dir=None, chunk_size=1024) -> str:
    """
    Helper function to download the `ggml` models
    :param model_name: name of the model, one of ::: constants.AVAILABLE_MODELS
    :param download_dir: Where to store the models
    :param chunk_size: size of the download chunk

    :return: Absolute path of the downloaded model
    """
    if model_name not in AVAILABLE_MODELS:
        logger.error(f"Invalid model name `{model_name}`, available models are: {AVAILABLE_MODELS}")
        return
    if download_dir is None:
        download_dir = MODELS_DIR
        logger.info(f"No download directory was provided, models will be downloaded to {download_dir}")

    os.makedirs(download_dir, exist_ok=True)

    url = _get_model_url(model_name=model_name)
    file_path = Path(download_dir) / os.path.basename(url)
    # check if the file is already there
    if file_path.exists():
        logger.info(f"Model {model_name} already exists in {download_dir}")
    else:
        # download it from huggingface
        resp = requests.get(url, stream=True)
        total = int(resp.headers.get('content-length', 0))

        progress_bar = tqdm(desc=f"Downloading Model {model_name} ...",
                            total=total,
                            unit='iB',
                            unit_scale=True,
                            unit_divisor=1024)

        try:
            with open(file_path, 'wb') as file, progress_bar:
                for data in resp.iter_content(chunk_size=chunk_size):
                    size = file.write(data)
                    progress_bar.update(size)
            logger.info(f"Model downloaded to {file_path.absolute()}")
        except Exception as e:
            # error download, just remove the file
            os.remove(file_path)
            raise e
    return str(file_path.absolute())

resolve_model_path

resolve_model_path(model_name, models_dir=None)

Resolve a model name to a local model file.

Resolution order: 1. If model_name is an existing file path, return it. 2. Look for model_name and model_name.bin in models_dir. 3. If no local file is found, fall back to downloading a built-in model.

Parameters:

model_name (str) –

A built-in model name, a custom model name, or a direct path to a model file.
models_dir –

Directory to search for local models before downloading. Defaults to MODELS_DIR.

Returns:

str –

Absolute path to the resolved model file.

Source code in pywhispercpp/utils.py

def resolve_model_path(model_name: str, models_dir=None) -> str:
    """
    Resolve a model name to a local model file.

    Resolution order:
    1. If `model_name` is an existing file path, return it.
    2. Look for `model_name` and `model_name.bin` in `models_dir`.
    3. If no local file is found, fall back to downloading a built-in model.

    :param model_name: A built-in model name, a custom model name, or a direct path to a model file.
    :param models_dir: Directory to search for local models before downloading. Defaults to `MODELS_DIR`.
    :return: Absolute path to the resolved model file.
    """
    if Path(model_name).is_file():
        return str(Path(model_name).resolve())

    search_dir = Path(models_dir) if models_dir is not None else MODELS_DIR

    candidates = [
        search_dir / model_name,
        search_dir / f"{model_name}.bin",
    ]

    for candidate in candidates:
        if candidate.is_file():
            return str(candidate.resolve())

    return download_model(model_name, search_dir)
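
The local part of the resolution order can be tried out without touching the network. The following is a minimal sketch, not the library's code: `find_local_model` is a hypothetical helper that mirrors steps 1 and 2 and returns None at the point where `resolve_model_path` would fall back to `download_model`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_local_model(model_name: str, models_dir: str) -> Optional[str]:
    """Hypothetical helper: mirror steps 1 and 2 of the resolution order."""
    # Step 1: the name is already a path to an existing file.
    if Path(model_name).is_file():
        return str(Path(model_name).resolve())

    # Step 2: look for `<name>` and `<name>.bin` inside the models directory.
    search_dir = Path(models_dir)
    for candidate in (search_dir / model_name, search_dir / f"{model_name}.bin"):
        if candidate.is_file():
            return str(candidate.resolve())

    # Step 3 (downloading) is deliberately left out of this sketch.
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Create an empty stand-in model file so the lookup has something to find.
    (Path(tmp) / "my-model.bin").write_bytes(b"")
    found = find_local_model("my-model", tmp)
    missing = find_local_model("not-there", tmp)
    found_name = Path(found).name

print(found_name, missing)
```

Because the `.bin` suffix is tried automatically, a custom model stored as `my-model.bin` can be referred to as `my-model`.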

to_timestamp

to_timestamp(t, separator=',')

Converts a whisper timestamp (in units of 10 ms) to a time string. Examples:

376 -> 00:00:03,760
1344 -> 00:00:13,440

Implementation from whisper.cpp/examples/main

Parameters:

t (int) –

input time from whisper timestamps
separator –

separator between seconds and milliseconds

Returns:

str –

time representation in hh:mm:ss[separator]ms

Source code in pywhispercpp/utils.py

def to_timestamp(t: int, separator=',') -> str:
    """
    376 -> 00:00:03,760
    1344 -> 00:00:13,440

    Implementation from `whisper.cpp/examples/main`

    :param t: input time from whisper timestamps
    :param separator: separator between seconds and milliseconds
    :return: time representation in hh:mm:ss[separator]ms
    """
    # logic exactly from whisper.cpp

    msec = t * 10
    hr = msec // (1000 * 60 * 60)
    msec = msec - hr * (1000 * 60 * 60)
    min = msec // (1000 * 60)
    msec = msec - min * (1000 * 60)
    sec = msec // 1000
    msec = msec - sec * 1000
    return f"{int(hr):02,.0f}:{int(min):02,.0f}:{int(sec):02,.0f}{separator}{int(msec):03,.0f}"
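
The arithmetic is easier to follow with `divmod`. The sketch below is an illustrative re-implementation under the assumption that `t` is a non-negative integer in units of 10 ms; `to_timestamp_sketch` is a hypothetical name, not part of pywhispercpp.

```python
def to_timestamp_sketch(t: int, separator: str = ",") -> str:
    """Hypothetical re-implementation; t is a time in units of 10 ms."""
    msec_total = t * 10                              # convert to milliseconds
    hours, rest = divmod(msec_total, 3_600_000)      # 1000 * 60 * 60 ms per hour
    minutes, rest = divmod(rest, 60_000)             # 1000 * 60 ms per minute
    seconds, msec = divmod(rest, 1_000)              # 1000 ms per second
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{msec:03d}"


print(to_timestamp_sketch(376))        # 00:00:03,760
print(to_timestamp_sketch(1344, "."))  # 00:00:13.440
```

The comma separator matches the SRT convention and the dot matches WebVTT, which is why the output helpers pass different separators.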

output_txt

output_txt(segments, output_file_path)

Writes the text of each segment to a plain-text file, one segment per line

Implementation from whisper.cpp/examples/main

Parameters:

segments (list) –

list of segments
output_file_path (str) –

path of the output file; '.txt' is appended if it is missing

Returns:

str –

path of the file

Source code in pywhispercpp/utils.py

def output_txt(segments: list, output_file_path: str) -> str:
    """
    Creates a raw text from a list of segments

    Implementation from `whisper.cpp/examples/main`

    :param segments: list of segments
    :return: path of the file
    """
    if not output_file_path.endswith('.txt'):
        output_file_path = output_file_path + '.txt'

    absolute_path = Path(output_file_path).absolute()

    with open(str(absolute_path), 'w') as file:
        for seg in segments:
            file.write(seg.text)
            file.write('\n')
    return absolute_path

output_vtt

output_vtt(segments, output_file_path)

Creates a vtt file from a list of segments

Implementation from whisper.cpp/examples/main

Parameters:

segments (list) –

list of segments
output_file_path (str) –

path of the output file; '.vtt' is appended if it is missing

Returns:

str –

Absolute path of the file

Source code in pywhispercpp/utils.py

def output_vtt(segments: list, output_file_path: str) -> str:
    """
    Creates a vtt file from a list of segments

    Implementation from `whisper.cpp/examples/main`

    :param segments: list of segments
    :return: Absolute path of the file
    """
    if not output_file_path.endswith('.vtt'):
        output_file_path = output_file_path + '.vtt'

    absolute_path = Path(output_file_path).absolute()

    with open(absolute_path, 'w') as file:
        file.write("WEBVTT\n\n")
        for seg in segments:
            file.write(f"{to_timestamp(seg.t0, separator='.')} --> {to_timestamp(seg.t1, separator='.')}\n")
            file.write(f"{seg.text}\n\n")
    return absolute_path

output_srt

output_srt(segments, output_file_path)

Creates a srt file from a list of segments

Parameters:

segments (list) –

list of segments
output_file_path (str) –

path of the output file; '.srt' is appended if it is missing

Returns:

str –

Absolute path of the file

Source code in pywhispercpp/utils.py

def output_srt(segments: list, output_file_path: str) -> str:
    """
    Creates a srt file from a list of segments

    :param segments: list of segments
    :return: Absolute path of the file
    """
    if not output_file_path.endswith('.srt'):
        output_file_path = output_file_path + '.srt'

    absolute_path = Path(output_file_path).absolute()

    with open(absolute_path, 'w') as file:
        for i in range(len(segments)):
            seg = segments[i]
            file.write(f"{i+1}\n")
            file.write(f"{to_timestamp(seg.t0, separator=',')} --> {to_timestamp(seg.t1, separator=',')}\n")
            file.write(f"{seg.text}\n\n")
    return absolute_path
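
To see what the resulting file looks like, the same layout can be built in memory. This is a sketch under stated assumptions: `FakeSegment` stands in for pywhispercpp's Segment, and `ts` and `render_srt` are hypothetical helpers that only imitate the formatting shown in the listing above.

```python
from collections import namedtuple

# Stand-in for pywhispercpp's Segment; only t0, t1 and text are needed here.
FakeSegment = namedtuple("FakeSegment", ["t0", "t1", "text"])


def ts(t: int, separator: str) -> str:
    """Format a time given in units of 10 ms as hh:mm:ss<separator>mmm."""
    seconds, msec = divmod(t * 10, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{msec:03d}"


def render_srt(segments) -> str:
    """Build SRT text: index line, time range, text, then a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{ts(seg.t0, ',')} --> {ts(seg.t1, ',')}\n{seg.text}\n\n"
        )
    return "".join(blocks)


segments = [FakeSegment(0, 376, "Hello."), FakeSegment(376, 1344, "How are you?")]
srt_text = render_srt(segments)
print(srt_text)
```

The WebVTT output differs only in the leading `WEBVTT` header, the missing index line, and the `.` separator before the milliseconds.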

output_csv

output_csv(segments, output_file_path)

Creates a csv file from a list of segments. Each row holds the start time in ms, the end time in ms, and the quoted segment text.

Parameters:

segments (list) –

list of segments
output_file_path (str) –

path of the output file; '.csv' is appended if it is missing

Returns:

str –

Absolute path of the file

Source code in pywhispercpp/utils.py

def output_csv(segments: list, output_file_path: str) -> str:
    """
    Creates a csv file from a list of segments

    :param segments: list of segments
    :return: Absolute path of the file
    """
    if not output_file_path.endswith('.csv'):
        output_file_path = output_file_path + '.csv'

    absolute_path = Path(output_file_path).absolute()

    with open(absolute_path, 'w') as file:
        for seg in segments:
            file.write(f"{10 * seg.t0}, {10 * seg.t1}, \"{seg.text}\"\n")
    return absolute_path
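
The row format can be checked in isolation. In the sketch below, `FakeSegment` is a stand-in for Segment. The first row uses the same f-string as the listing above. The second row shows an alternative that is not what the library does: Python's standard `csv` module, which escapes quotation marks inside the text.

```python
import csv
import io
from collections import namedtuple

# Stand-in for pywhispercpp's Segment; only t0, t1 and text are needed here.
FakeSegment = namedtuple("FakeSegment", ["t0", "t1", "text"])
seg = FakeSegment(376, 1344, 'She said "hi"')

# Same f-string as in the listing: times become milliseconds (t * 10).
manual_row = f"{10 * seg.t0}, {10 * seg.t1}, \"{seg.text}\"\n"

# Alternative (not the library's behavior): let the csv module handle quoting.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerow([10 * seg.t0, 10 * seg.t1, seg.text])
escaped_row = buffer.getvalue()

print(manual_row, end="")   # 3760, 13440, "She said "hi""
print(escaped_row, end="")  # 3760,13440,"She said ""hi"""
```

If transcribed text may contain quotation marks, a strict CSV parser could misread the hand-written row, which is the case the second variant guards against.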

redirect_stderr

redirect_stderr(to=False)

Redirect stderr to the specified target.

Parameters:

to (bool | TextIO | str | None, default: False ) –
- None to suppress output (redirect to devnull),
- A file-like object such as sys.stdout to redirect to that stream,
- A file path (str) to redirect to a file,
- False to do nothing (no redirection).

Source code in pywhispercpp/utils.py

@contextlib.contextmanager
def redirect_stderr(to: bool | TextIO | str | None = False) -> None:
    """
    Redirect stderr to the specified target.

    :param to:
        - None to suppress output (redirect to devnull),
        - sys.stdout to redirect to stdout,
        - A file path (str) to redirect to a file,
        - False to do nothing (no redirection).
    """

    if to is False:
        # do nothing
        yield
        return

    def _resolve_target(target):
        opened_stream = None
        if target is None:
            opened_stream = open(os.devnull, "w")
            return opened_stream, True
        if isinstance(target, str):
            opened_stream = open(target, "w")
            return opened_stream, True
        if hasattr(target, "write"):
            return target, False
        raise ValueError(
            "Invalid `to` parameter; expected None, a filepath string, or a file-like object."
        )

    sys.stderr.flush()
    try:
        original_fd = sys.stderr.fileno()
    except (AttributeError, OSError):
        # Jupyter or non-standard stderr implementations
        original_fd = None

    stream, should_close = _resolve_target(to)

    if original_fd is not None and hasattr(stream, "fileno"):
        saved_fd = os.dup(original_fd)
        try:
            os.dup2(stream.fileno(), original_fd)
            yield
        finally:
            os.dup2(saved_fd, original_fd)
            os.close(saved_fd)
            if should_close:
                stream.close()
        return

    # Fallback: Python-level redirect
    try:
        with contextlib.redirect_stderr(stream):
            yield
    finally:
        if should_close:
            stream.close()
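
The file-descriptor branch matters because output written directly to descriptor 2 (as native code typically does) bypasses Python's `sys.stderr` object. The sketch below illustrates the `os.dup`/`os.dup2` technique on its own; `redirect_fd` is a hypothetical helper, and the example assumes the process has a valid descriptor 2.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def redirect_fd(fd: int, path: str):
    """Hypothetical helper: point descriptor `fd` at `path`, then restore it."""
    saved_fd = os.dup(fd)                      # keep a copy of the original target
    with open(path, "wb") as target:
        try:
            os.dup2(target.fileno(), fd)       # fd now writes into the file
            yield
        finally:
            os.dup2(saved_fd, fd)              # restore the original target
            os.close(saved_fd)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "stderr.log")
    with redirect_fd(2, log_path):
        # os.write bypasses sys.stderr, much like output from a C library.
        os.write(2, b"written straight to fd 2\n")
    with open(log_path, "rb") as log_file:
        captured = log_file.read()

print(captured)
```

A Python-level redirect such as `contextlib.redirect_stderr` only swaps `sys.stderr`, so it would not capture the `os.write` call above. That is consistent with the listing keeping it as a fallback for environments where no descriptor is available.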

pywhispercpp.examples

assistant

A simple example showing how to use pywhispercpp as a voice assistant. A voice activity detector (VAD; webrtcvad in this example) detects speech, and inference runs once speech has been detected.

Assistant

Assistant(
    model="tiny",
    input_device=None,
    silence_threshold=8,
    q_threshold=16,
    block_duration=30,
    commands_callback=None,
    **model_params
)

Assistant class

Example usage

from pywhispercpp.examples.assistant import Assistant

my_assistant = Assistant(commands_callback=print, n_threads=8)
my_assistant.start()

Parameters:

model –

whisper.cpp model name or a direct path to a ggml model
input_device (int, default: None ) –

The input device (aka microphone), keep it None to take the default
silence_threshold (int, default: 8 ) –

The duration of silence after which inference runs
q_threshold (int, default: 16 ) –

Inference does not run until the data queue holds at least q_threshold elements
block_duration (int, default: 30 ) –

Minimum duration of an audio update, in ms
commands_callback (Callable[[str], None], default: None ) –

The callback to run when a command is received
model_log_level –

Logging level
model_params –

any other parameter to pass to the whisper.cpp model; see pywhispercpp.constants.PARAMS_SCHEMA

Source code in pywhispercpp/examples/assistant.py

def __init__(self,
             model='tiny',
             input_device: int = None,
             silence_threshold: int = 8,
             q_threshold: int = 16,
             block_duration: int = 30,
             commands_callback: Callable[[str], None] = None,
             **model_params):

    """
    :param model: whisper.cpp model name or a direct path to a `ggml` model
    :param input_device: The input device (aka microphone), keep it None to take the default
    :param silence_threshold: The duration of silence after which the inference will be running
    :param q_threshold: The inference won't be running until the data queue is having at least `q_threshold` elements
    :param block_duration: minimum time audio updates in ms
    :param commands_callback: The callback to run when a command is received
    :param model_log_level: Logging level
    :param model_params: any other parameter to pass to the whisper.cpp model, see pywhispercpp.constants.PARAMS_SCHEMA
    """

    self.input_device = input_device
    self.sample_rate = constants.WHISPER_SAMPLE_RATE  # same as whisper.cpp
    self.channels = 1  # same as whisper.cpp
    self.block_duration = block_duration
    self.block_size = int(self.sample_rate * self.block_duration / 1000)
    self.q = queue.Queue()

    self.vad = webrtcvad.Vad()
    self.silence_threshold = silence_threshold
    self.q_threshold = q_threshold
    self._silence_counter = 0

    self.pwccp_model = Model(model,
                             print_realtime=False,
                             print_progress=False,
                             print_timestamps=False,
                             single_segment=True,
                             no_context=True,
                             **model_params)
    self.commands_callback = commands_callback
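
The block size passed to the audio stream follows directly from the sample rate and `block_duration`. The sketch below repeats that arithmetic on its own. It assumes `constants.WHISPER_SAMPLE_RATE` is 16000 Hz, and `block_size` is a hypothetical free function rather than part of the Assistant class.

```python
# Assumed value of constants.WHISPER_SAMPLE_RATE (whisper models expect 16 kHz audio).
WHISPER_SAMPLE_RATE = 16000


def block_size(sample_rate: int, block_duration_ms: int) -> int:
    """Number of samples per audio block, using the formula from __init__."""
    return int(sample_rate * block_duration_ms / 1000)


# webrtcvad accepts frames of 10, 20 or 30 ms, so these are the useful durations.
sizes = {ms: block_size(WHISPER_SAMPLE_RATE, ms) for ms in (10, 20, 30)}
print(sizes)  # {10: 160, 20: 320, 30: 480}
```

With the default `block_duration` of 30 ms, each audio callback therefore receives 480 samples under this assumption.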

start

start()

Use this function to start the assistant

Returns:

None –

None

Source code in pywhispercpp/examples/assistant.py

def start(self) -> None:
    """
    Use this function to start the assistant
    :return: None
    """
    logging.info(f"Starting Assistant ...")
    with sd.InputStream(
            device=self.input_device,  # the default input device
            channels=self.channels,
            samplerate=constants.WHISPER_SAMPLE_RATE,
            blocksize=self.block_size,
            callback=self._audio_callback):

        try:
            logging.info(f"Assistant is listening ... (CTRL+C to stop)")
            while True:
                time.sleep(0.1)
        except KeyboardInterrupt:
            logging.info("Assistant stopped")

gui

WorkerSignals

Bases: QObject

Defines signals available from a running worker thread. Supported signals are:
- finished: No data
- error: tuple (exctype, value, traceback.format_exc())
- result: list (the transcribed segments)
- progress: int (0-100)
- status_update: str