Package april_asr

april_asr provides Python bindings for the aprilasr library.

aprilasr provides an API for offline streaming speech-to-text applications, and enables low-latency on-device realtime speech recognition for live captioning or other speech recognition use cases.

Sub-modules

april_asr.example

Classes

class Model (path: str)
class Model:
    """
    Models end with the file extension `.april`. To construct a Model, pass
    the path to such a file.

    Each model expects audio at its own sample rate, usually 16000 Hz. Use
    `get_sample_rate` to query it.

    Models also have additional metadata such as name, description, and
    language.

    After loading a model, you can create one or more sessions that use the
    model.
    """
    def __init__(self, path: str):
        self._handle = _c.ffi.aam_create_model(path)

        if self._handle is None:
            raise Exception("Failed to load model")

    def get_name(self) -> str:
        """Get the name from the model's metadata"""
        return _c.ffi.aam_get_name(self._handle)

    def get_description(self) -> str:
        """Get the description from the model's metadata"""
        return _c.ffi.aam_get_description(self._handle)

    def get_language(self) -> str:
        """Get the language from the model's metadata"""
        return _c.ffi.aam_get_language(self._handle)

    def get_sample_rate(self) -> int:
        """Get the sample rate from the model's metadata"""
        return _c.ffi.aam_get_sample_rate(self._handle)

    def __del__(self):
        _c.ffi.aam_free(self._handle)
        self._handle = None

Models end with the file extension .april. To construct a Model, pass the path to such a file.

Each model expects audio at its own sample rate, usually 16000 Hz. Use get_sample_rate to query it.

Models also have additional metadata such as name, description, and language.

After loading a model, you can create one or more sessions that use the model.
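
Because each model expects a particular sample rate, it can help to check an input file before feeding it. The following is a minimal sketch, not part of the library: `wav_matches_model` is a hypothetical helper, and the requirement for mono 16-bit audio is an assumption based on the `feed_pcm16` method name. It only relies on the documented `get_sample_rate` method.

```python
import wave


def wav_matches_model(wav_path: str, model) -> bool:
    """Return True if the WAV file is mono 16-bit PCM at the model's rate.

    `model` is any object with a get_sample_rate() method, such as an
    april_asr.Model. The mono requirement is an assumption.
    """
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and wav.getframerate() == model.get_sample_rate()
        )
```

If the check fails, the audio would need to be resampled or converted before it is passed to a session.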

Methods

def get_description(self) ‑> str
def get_description(self) -> str:
    """Get the description from the model's metadata"""
    return _c.ffi.aam_get_description(self._handle)

Get the description from the model's metadata

def get_language(self) ‑> str
def get_language(self) -> str:
    """Get the language from the model's metadata"""
    return _c.ffi.aam_get_language(self._handle)

Get the language from the model's metadata

def get_name(self) ‑> str
def get_name(self) -> str:
    """Get the name from the model's metadata"""
    return _c.ffi.aam_get_name(self._handle)

Get the name from the model's metadata

def get_sample_rate(self) ‑> int
def get_sample_rate(self) -> int:
    """Get the sample rate from the model's metadata"""
    return _c.ffi.aam_get_sample_rate(self._handle)

Get the sample rate from the model's metadata

class Result (*args, **kwds)
class Result(IntEnum):
    """
    Result type that is passed to your handler
    """

    PARTIAL_RECOGNITION = 1
    """A partial recognition. The next handler call will contain an updated
    list of tokens."""

    FINAL_RECOGNITION = 2
    """A final recognition. The next handler call will start from an empty
    token list."""

    ERROR_CANT_KEEP_UP = 3
    """In an asynchronous session, the handler may be called with this result
    when the system can't keep up with the incoming audio and samples have
    been dropped. Accuracy will be affected. An empty token list is given."""

    SILENCE = 4
    """The handler is called with this result after some silence. An empty
    token list is given."""

Result type that is passed to your handler

Ancestors

  • enum.IntEnum
  • builtins.int
  • enum.ReprEnum
  • enum.Enum

Class variables

var ERROR_CANT_KEEP_UP

In an asynchronous session, the handler may be called with this result when the system can't keep up with the incoming audio and samples have been dropped. Accuracy will be affected. An empty token list is given.

var FINAL_RECOGNITION

A final recognition. The next handler call will start from an empty token list.

var PARTIAL_RECOGNITION

A partial recognition. The next handler call will contain an updated list of tokens.

var SILENCE

The handler is called with this result after some silence. An empty token list is given.
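
A handler typically branches on the result value. The sketch below shows one possible shape. To keep it runnable without the library it defines a stand-in `Result` enum that mirrors the values listed above; with the library installed, `april_asr.Result` would be used instead. The names `describe` and `handler` are hypothetical.

```python
from enum import IntEnum


# Stand-in mirroring the documented values so the sketch runs on its own.
# With april_asr installed, use april_asr.Result instead.
class Result(IntEnum):
    PARTIAL_RECOGNITION = 1
    FINAL_RECOGNITION = 2
    ERROR_CANT_KEEP_UP = 3
    SILENCE = 4


def describe(result, tokens) -> str:
    """Turn one handler call into a printable line."""
    # Tokens carry their own spacing, so plain concatenation is enough.
    text = "".join(t.token for t in tokens)
    if result == Result.PARTIAL_RECOGNITION:
        return f"partial:{text}"
    if result == Result.FINAL_RECOGNITION:
        return f"final:{text}"
    if result == Result.ERROR_CANT_KEEP_UP:
        return "warning: samples were dropped"
    if result == Result.SILENCE:
        return "(silence)"
    return f"unknown result {result!r}"


def handler(result, tokens) -> None:
    """Callback with the (result, tokens) shape that Session expects."""
    print(describe(result, tokens))
```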

class Session (model: april_asr.Model,
callback: Callable[[april_asr.Result, List[april_asr.Token]], None],
asynchronous: bool = False,
no_rt: bool = False,
speaker_name: str = '')
class Session:
    """
    The session is what performs the actual speech recognition. It has
    methods to input audio, and it calls your given handler with decoded
    results.

    You need to pass a Model when constructing a Session.
    """
    def __init__(self,
            model: Model,
            callback: Callable[[Result, List[Token]], None],
            asynchronous: bool = False,
            no_rt: bool = False,
            speaker_name: str = ""
        ):
        config = _c.AprilConfig()
        config.flags = _c.AprilConfigFlagBits()

        if asynchronous and no_rt:
            config.flags.value = 2
        elif asynchronous:
            config.flags.value = 1
        else:
            config.flags.value = 0

        if speaker_name != "":
            spkr_data = struct.pack("@q", hash(speaker_name)) * 2
            config.speaker = _c.AprilSpeakerID.from_buffer_copy(spkr_data)

        config.handler = _HANDLER
        config.userdata = id(self)

        self.model = model
        self._handle = _c.ffi.aas_create_session(model._handle, config)
        if self._handle is None:
            raise Exception("Failed to create session")

        self.callback = callback

    def get_rt_speedup(self) -> float:
        """
        If the session is asynchronous and realtime, this will return a
        positive float. A value below 1.0 means the session is keeping up, and
        a value greater than 1.0 means the input audio is being sped up by that
        factor in order to keep up. When the value is greater than 1.0, the
        accuracy is likely to be affected.
        """
        return _c.ffi.aas_realtime_get_speedup(self._handle)

    def feed_pcm16(self, data: bytes) -> None:
        """
        Feed the given pcm16 samples in bytes to the session. If the session is
        asynchronous, this will return immediately and queue the data for the
        background thread to process. If the session is not asynchronous, this
        will block your thread and potentially call the handler before
        returning.
        """
        _c.ffi.aas_feed_pcm16(self._handle, data)

    def flush(self) -> None:
        """
        Flush any remaining samples and force the session to produce a final
        result.
        """
        _c.ffi.aas_flush(self._handle)

    def __del__(self):
        _c.ffi.aas_free(self._handle)
        self.model = None
        self._handle = None

The session is what performs the actual speech recognition. It has methods to input audio, and it calls your given handler with decoded results.

You need to pass a Model when constructing a Session.
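
Putting the pieces together, a synchronous session could be used roughly as follows. This is a sketch under stated assumptions rather than an official example: `transcribe_wav` is a hypothetical helper, the input is assumed to be a mono 16-bit WAV file, and it assumes that in a synchronous session every handler call has completed by the time `flush` returns. Only the `Model`, `Session`, `Result`, `feed_pcm16`, `flush` and `get_sample_rate` names come from this page.

```python
import wave


def transcribe_wav(model_path: str, wav_path: str) -> str:
    """Transcribe a WAV file and return the concatenated final results."""
    # Imported lazily so this block can be loaded without the library.
    import april_asr

    finals = []

    def handler(result, tokens):
        # Keep only final results; partial ones are superseded later.
        if result == april_asr.Result.FINAL_RECOGNITION:
            finals.append("".join(t.token for t in tokens))

    model = april_asr.Model(model_path)
    session = april_asr.Session(model, handler)  # synchronous by default

    with wave.open(wav_path, "rb") as wav:
        if wav.getframerate() != model.get_sample_rate():
            raise ValueError("WAV sample rate does not match the model")
        session.feed_pcm16(wav.readframes(wav.getnframes()))

    # Force a final result for any audio still buffered.
    session.flush()
    return "".join(finals).strip()
```

A call such as `transcribe_wav("model.april", "speech.wav")` would then return the recognized text; both file names are placeholders.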

Methods

def feed_pcm16(self, data: bytes) ‑> None
def feed_pcm16(self, data: bytes) -> None:
    """
    Feed the given pcm16 samples in bytes to the session. If the session is
    asynchronous, this will return immediately and queue the data for the
    background thread to process. If the session is not asynchronous, this
    will block your thread and potentially call the handler before
    returning.
    """
    _c.ffi.aas_feed_pcm16(self._handle, data)

Feed the given pcm16 samples in bytes to the session. If the session is asynchronous, this will return immediately and queue the data for the background thread to process. If the session is not asynchronous, this will block your thread and potentially call the handler before returning.
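
Audio often arrives as floating-point samples or as one large buffer, so some preparation may be needed before calling `feed_pcm16`. The helpers below are a hypothetical sketch: `floats_to_pcm16` and `feed_in_chunks` are not part of the library, and the byte layout (signed 16-bit, native byte order) is an assumption based on the bytes being handed directly to the C library.

```python
import struct


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM bytes.

    Out-of-range values are clamped. Native byte order is an assumption.
    """
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"={len(ints)}h", *ints)


def feed_in_chunks(session, pcm: bytes, chunk_bytes: int = 3200) -> int:
    """Feed `pcm` to the session in fixed-size pieces; return the call count.

    3200 bytes is 100 ms of mono 16-bit audio at 16000 Hz.
    """
    if chunk_bytes < 2:
        raise ValueError("chunk_bytes must hold at least one 16-bit sample")
    chunk_bytes -= chunk_bytes % 2  # never split a sample across two calls
    calls = 0
    for start in range(0, len(pcm), chunk_bytes):
        session.feed_pcm16(pcm[start:start + chunk_bytes])
        calls += 1
    return calls
```

Smaller chunks let a synchronous session return control to the caller more often, since each `feed_pcm16` call blocks only for the audio it was given.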

def flush(self) ‑> None
def flush(self) -> None:
    """
    Flush any remaining samples and force the session to produce a final
    result.
    """
    _c.ffi.aas_flush(self._handle)

Flush any remaining samples and force the session to produce a final result.

def get_rt_speedup(self) ‑> float
def get_rt_speedup(self) -> float:
    """
    If the session is asynchronous and realtime, this will return a
    positive float. A value below 1.0 means the session is keeping up, and
    a value greater than 1.0 means the input audio is being sped up by that
    factor in order to keep up. When the value is greater than 1.0, the
    accuracy is likely to be affected.
    """
    return _c.ffi.aas_realtime_get_speedup(self._handle)

If the session is asynchronous and realtime, this will return a positive float. A value below 1.0 means the session is keeping up, and a value greater than 1.0 means the input audio is being sped up by that factor in order to keep up. When the value is greater than 1.0, the accuracy is likely to be affected.
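
One way to act on this value is to translate it into a status message that an application can log or display. The sketch below is illustrative only: `speedup_status` and `check_session` are hypothetical names, and treating exactly 1.0 as "keeping up" is an assumption, since the description above only covers values below and above 1.0.

```python
def speedup_status(speedup: float) -> str:
    """Summarize a value returned by Session.get_rt_speedup()."""
    if speedup > 1.0:
        return f"falling behind: audio sped up {speedup:.2f}x, accuracy may drop"
    # Exactly 1.0 is treated as keeping up; this is an assumption.
    return "keeping up"


def check_session(session) -> str:
    """Query an asynchronous realtime session and describe its state."""
    return speedup_status(session.get_rt_speedup())
```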

class Token (token)
class Token:
    """
    A token may be a single letter, a word chunk, an entire word, punctuation,
    or other arbitrary set of characters.

    To convert a token array to a string, simply concatenate the strings from
    each token. You don't need to add spaces between tokens because the
    tokens contain their own formatting.

    Tokens also contain the log probability, and a boolean denoting whether or
    not it's a word boundary. In English, the word boundary value is equivalent
    to checking if the first character is a space.
    """

    token: str = ""
    logprob: float = 0.0
    word_boundary: bool = False
    sentence_end: bool = False
    time: float = 0.0

    def __init__(self, token):
        self.token = token.token.decode("utf-8")
        self.logprob = token.logprob
        self.word_boundary = (token.flags.value & 1) != 0
        self.sentence_end = (token.flags.value & 2) != 0
        self.time = float(token.time_ms) / 1000.0

A token may be a single letter, a word chunk, an entire word, punctuation, or other arbitrary set of characters.

To convert a token array to a string, simply concatenate the strings from each token. You don't need to add spaces between tokens because the tokens contain their own formatting.

Tokens also contain the log probability, and a boolean denoting whether or not it's a word boundary. In English, the word boundary value is equivalent to checking if the first character is a space.
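
The concatenation rule can be sketched as follows. `FakeToken` is a hypothetical stand-in carrying the same attribute names as `Token`, used so the example runs without the library; `tokens_to_text` and `count_words` are likewise illustrative names.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in with the attribute names documented for Token."""
    token: str
    logprob: float = 0.0
    word_boundary: bool = False
    sentence_end: bool = False
    time: float = 0.0


def tokens_to_text(tokens) -> str:
    """Join token strings as-is; the tokens carry their own spacing."""
    return "".join(t.token for t in tokens)


def count_words(tokens) -> int:
    """Count tokens that start a new word."""
    return sum(1 for t in tokens if t.word_boundary)
```

Stripping the result with `str.strip()` removes the leading space that the first word token usually carries.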

Class variables

var logprob : float
var sentence_end : bool
var time : float
var token : str
var word_boundary : bool
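
These attributes can be combined to produce per-word timing and a rough confidence value. The sketch below is a hypothetical helper, not library code: `group_words` is an invented name, it assumes `logprob` is a natural logarithm (so `math.exp` recovers a probability), and it attaches non-boundary tokens such as punctuation to the preceding word. According to the source above, `time` is derived from `time_ms / 1000.0` and is therefore in seconds.

```python
import math


def group_words(tokens):
    """Return a list of (word, start_time_seconds, min_probability) tuples."""
    words = []
    for t in tokens:
        prob = math.exp(t.logprob)  # assumes a natural-log probability
        if t.word_boundary or not words:
            # Start a new word; drop the leading space the token carries.
            words.append([t.token.strip(), t.time, prob])
        else:
            # Continue the current word and keep its weakest token score.
            words[-1][0] += t.token
            words[-1][2] = min(words[-1][2], prob)
    return [tuple(w) for w in words]
```

Using the minimum probability within a word is one design choice among several; an average would be less sensitive to a single uncertain chunk.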