This library is currently under development. Some features are unimplemented, it may have bugs or crashes, and there may be significant changes to the API. It may not yet be production-ready.
Furthermore, there is currently only one model available; it supports only English and has some accuracy issues.
An example use of this library is provided in example.cpp. It can perform speech recognition on a wave file, or do streaming recognition by reading stdin.
It is built as the main target. After building aprilasr, you can run it like so:
$ ./main /path/to/file.wav /path/to/model.april
For streaming recognition, you can pipe parec into it:
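The exact command depends on your audio setup. Assuming the example accepts - to mean stdin (check example.cpp) and the model expects 16 kHz mono audio, something like this should work:

$ parec --rate=16000 --channels=1 --format=s16le --latency-msec=100 | ./main - /path/to/model.april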
For other platforms the script should be very similar. Alternatively, visit https://github.com/microsoft/onnxruntime/releases/tag/v1.13.1, download the right zip/tgz file for your platform, and extract its contents to a directory named lib.
You may also define the environment variable ONNX_ROOT containing the path to where you extracted the archive, if placing it in lib isn't an option.
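For example, on Linux (the archive name below is only an example; use whichever ONNXRuntime archive you actually downloaded):

$ export ONNX_ROOT=/path/to/onnxruntime-linux-x64-1.13.1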
Create a folder called lib in the april-asr folder.
Download onnxruntime-win-x64-1.13.1.zip and extract the contents of the onnxruntime-win-x64-1.13.1 folder into the lib folder.
Run cmake to configure and generate Visual Studio project files. Make sure you select x64 as the target if you have downloaded the x64 version of ONNXRuntime.
Open the ALL_BUILD.vcxproj and everything should build. The output will be in the Release or Debug folders.
When running main.exe you may receive an error message like this:
The application was unable to start correctly (0xc000007b)
To fix this, you need to make onnxruntime.dll available to the executable. One way to do this is to copy it from lib/lib/onnxruntime.dll to build/Debug and build/Release. You may also need to distribute the DLL together with your application.
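For example, from a Command Prompt in the april-asr folder, using the paths described above:

> copy lib\lib\onnxruntime.dll build\Debug\
> copy lib\lib\onnxruntime.dll build\Release\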
Thanks to the k2-fsa/icefall contributors for creating the speech recognition recipes and models.
This project makes use of a few libraries:
pocketfft, authored by Martin Reinecke, Copyright (C) 2008-2018 Max-Planck-Society, licensed under BSD-3-Clause
Sonic library, authored by Bill Cox, Copyright (C) 2010 Bill Cox, licensed under Apache 2.0 license
tinycthread, authored by Marcus Geelnard and Evan Nemerson, licensed under zlib/libpng license
The bindings are based on the Vosk API bindings. Vosk is another speech recognition library, based on previous-generation Kaldi; it is Copyright 2019 Alpha Cephei Inc. and licensed under the Apache 2.0 license.
To perform speech-to-text, feed PCM16 audio of the speech to the session through the feed_pcm16 method (or its equivalent in your language). Make sure the audio is mono and at the correct sample rate.
PCM16 means an array of shorts (16-bit signed integers) with values between -32768 and 32767, each describing one sample.
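As a rough illustration (plain Python, not part of the aprilasr API), here is one second of a quiet 440 Hz test tone encoded as PCM16 bytes; the 16000 Hz sample rate is an assumption, in practice use the model's reported sample rate:

import math
import struct

sample_rate = 16000  # assumed here; query model.get_sample_rate() in practice
samples = [
    int(32767 * 0.1 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(sample_rate)  # one second of audio
]
# PCM16: every sample packed as a little-endian signed 16-bit integer
pcm16_bytes = struct.pack("<%dh" % len(samples), *samples)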
After calling feed_pcm16, the session will invoke the neural network and call your specified handler with a result. You can present this result to the user or do whatever you want with the result.
In more advanced use cases, you may have multiple sessions performing recognition on multiple separate audio streams. When doing this, you can re-use the same model to minimize the memory use.
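A minimal sketch of this with the Python bindings (the model filename and the commented-out audio streams are placeholders):

import april_asr as april

def handler_a(result_type, tokens):
    print("A:", "".join(t.token for t in tokens))

def handler_b(result_type, tokens):
    print("B:", "".join(t.token for t in tokens))

# One model shared by both sessions, so the model weights are loaded only once
model = april.Model("aprilv0_en-us.april")
session_a = april.Session(model, handler_a)
session_b = april.Session(model, handler_b)

# Each session is then fed its own, independent audio stream:
# session_a.feed_pcm16(stream_a_pcm16)
# session_b.feed_pcm16(stream_b_pcm16)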
The simplest (and default) mode of operation is the synchronous session.
In a synchronous session, when you call the function to feed audio, it will process the audio synchronously, call the handler if a new result is decoded, and finally return once it's done.
This means that calls to feed audio can block for a long time while the audio is processed. This may be undesirable in some cases, such as in a live captioning situation. For this reason, you can choose to construct asynchronous sessions instead.
An asynchronous session does not perform calculations on the calling thread.
Calls to feed audio are quick, as it copies the data and triggers a second thread to do the actual calculations. The second thread calls the handler at some point, when processing is done.
A caveat is that you must feed audio at a rate of roughly one second of audio per second of real time. You should not feed multiple seconds or minutes at once, as the internal buffer cannot hold more than a few seconds.
Asynchronous sessions are intended for streaming audio as it comes in, for live captioning for example. If you feed more than 1 second every second, you will get poor results (if any).
In an asynchronous session, there is a problem that the system may not be fast enough to process audio at the rate that it's coming in. This is where realtime and non-realtime sessions differ in behavior.
A realtime session will work around this by automatically deciding to speed up incoming audio to a rate where the system can keep up. This involves some audio processing code, which may or may not be desirable.
Speeding up audio may reduce accuracy. It may not be severe at small values (such as 1.2x), but at larger values (such as over 2.0x) the accuracy may be severely impacted. There is a method you can call to get the current speedup value to know when this is happening, so you can display a warning to the user or similar.
A non-realtime session ignores this problem and assumes the system is fast enough. If this is not the case, the results will fall behind, the internal buffer will fill up, your handler will be called with the ErrorCantKeepUp result, and the output will be severely degraded.
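With the Python bindings, the mode is chosen when constructing the session. A sketch, assuming model and handler are defined as in the examples later in this document:

# Synchronous session (default): feed_pcm16 blocks while the audio is processed
sync_session = april.Session(model, handler)

# Asynchronous realtime session: feed_pcm16 returns quickly, and incoming audio
# may be sped up if the system cannot keep up
async_rt_session = april.Session(model, handler, asynchronous=True)

# Asynchronous non-realtime session: assumes the system can keep up; if it
# cannot, ERROR_CANT_KEEP_UP results will be reported
async_nonrt_session = april.Session(model, handler, asynchronous=True, no_rt=True)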
The results are given via a callback (handler). It gets called by the session whenever it has new results. The parameters given to the callback include the result type and the token array.
Note that in an asynchronous session, the handler will be called from a different thread. Be sure to expect this and write thread-safe code, or use a synchronous session.
You should try to make your handler function fast to avoid slowing down the session.
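One way to keep the handler fast, sketched here, is to push the result onto a queue and do the heavy work (UI updates, logging, and so on) on whichever thread drains the queue; the queue name is arbitrary:

import queue

results = queue.Queue()

def handler(result_type, tokens):
    # Copy out the token strings and return immediately
    results.put((result_type, [t.token for t in tokens]))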
The actual text can be extracted from the token array.
The handler gets called with an enum explaining the result type:
Partial Recognition - the token array is a partial result and an updated array will be given in the next call
Final Recognition - the token array is final, the next call will start from an empty array
Error Can't Keep Up - called in an asynchronous non-realtime session if the system is not fast enough; may also be called briefly in an asynchronous realtime session; the token array is empty or null
Silence - there has been a period of silence; the token array is empty or null
A token may be a single letter, a word chunk, an entire word, punctuation, or another arbitrary set of characters.
To convert a token array to a string, simply concatenate the strings from each token. You don't need to add spaces between tokens; the tokens contain their own formatting.
For example, the text "THAT'S COOL ELEPHANTS" may be represented as tokens like so:
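The exact tokens depend on the model, but an illustrative (made-up) tokenization might be:

tokens = [" THAT", "'S", " COOL", " ELEPHANT", "S"]
print("".join(tokens))  # " THAT'S COOL ELEPHANTS"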
Simply concatenating these strings will give you " THAT'S COOL ELEPHANTS", which is correct apart from the extra space at the beginning. You may want to strip the final string to remove it.
Tokens contain more data than just the string, however. They also contain the log probability, and a boolean denoting whether or not the token is a word boundary. In English, the word boundary value is equivalent to checking if the first character is a space.
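For example, a small sketch (not part of the API) that groups a token list back into words using the word_boundary flag:

def tokens_to_words(tokens):
    words = []
    for token in tokens:
        if token.word_boundary or not words:
            # A word boundary starts a new word; strip the leading space
            words.append(token.token.strip())
        else:
            words[-1] += token.token
    return words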
Before creating a session, define a handler callback. Here is an example handler function that concatenates the tokens to a string and prints it:
def handler(result_type, tokens):
    s = ""
    for token in tokens:
        s = s + token.token

    if result_type == april.Result.FINAL_RECOGNITION:
        print("@"+s)
    elif result_type == april.Result.PARTIAL_RECOGNITION:
        print("-"+s)
    else:
        print("")
Most of the examples use a very simple method like this to load and feed audio:
with open(wav_file_path, "rb") as f:
    data = f.read()

session.feed_pcm16(data)
This works only if the wav file is PCM16 and sampled in the correct sample rate. When you attempt to load an mp3, non-PCM16/non-16kHz wav file, or any other audio file in this way, you will likely get gibberish or no results.
To load more arbitrary audio files, you can use a Python library that handles audio loading (make sure librosa is installed: pip install librosa):
import librosa
# Load the audio samples as numpy floats
data, sr = librosa.load("/path/to/anything.mp3", sr=model.get_sample_rate(), mono=True)
# Convert the floats to PCM16 bytes
data = (data * 32767).astype("short").astype("<u2").tobytes()
session.feed_pcm16(data)
You can flush the session once the end of the file has been reached to force a final result:
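session.flush()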
Now, when feeding audio to an asynchronous session, be sure to feed it in realtime:
import librosa
import time

data, sr = librosa.load("/path/to/anything.mp3", sr=model.get_sample_rate(), mono=True)
data = (data * 32767).astype("short").astype("<u2").tobytes()

while len(data) > 0:
    chunk = data[:2400]
    data = data[2400:]

    session.feed_pcm16(chunk)
    if session.get_rt_speedup() > 1.5:
        print("Warning: System can't keep up, realtime speedup value of " + str(session.get_rt_speedup()))

    # Sleep for the duration of the chunk just fed (2 bytes per PCM16 sample)
    time.sleep((len(chunk) / 2) / model.get_sample_rate())
import april_asr as april
import librosa
# Change these values
model_path = "aprilv0_en-us.april"
audio_path = "audio.wav"
model = april.Model(model_path)
def handler(result_type, tokens):
    s = ""
    for token in tokens:
        s = s + token.token

    if result_type == april.Result.FINAL_RECOGNITION:
        print("@"+s)
    elif result_type == april.Result.PARTIAL_RECOGNITION:
        print("-"+s)
    else:
        print("")
session = april.Session(model, handler)
data, sr = librosa.load(audio_path, sr=model.get_sample_rate(), mono=True)
data = (data * 32767).astype("short").astype("<u2").tobytes()
session.feed_pcm16(data)
session.flush()
Congratulations! You have just performed speech recognition using AprilASR!
april_asr API documentation
Package april_asr
april_asr provides Python bindings for the aprilasr library.
aprilasr provides an API for offline streaming speech-to-text applications, and
enables low-latency on-device realtime speech recognition for live captioning
or other speech recognition use cases.
"""
april_asr provides Python bindings for the aprilasr library.
aprilasr provides an API for offline streaming speech-to-text applications, and
enables low-latency on-device realtime speech recognition for live captioning
or other speech recognition use cases.
"""
all = ["Token", "Result", "Model", "Session"]
from ._april import Token, Result, Model, Session
Classes
class Model(path: str)
Models end with the file extension .april. You need to pass a path to
such a file to construct a Model type.
Each model has its own sample rate in which it expects audio. There is a
method to get the expected sample rate. Usually, this is 16000 Hz.
Models also have additional metadata such as name, description, language.
After loading a model, you can create one or more sessions that use the
model.
class Model:
    """
    Models end with the file extension `.april`. You need to pass a path to
    such a file to construct a Model type.

    Each model has its own sample rate in which it expects audio. There is a
    method to get the expected sample rate. Usually, this is 16000 Hz.

    Models also have additional metadata such as name, description, language.

    After loading a model, you can create one or more sessions that use the
    model.
    """
    def __init__(self, path: str):
        self._handle = _c.ffi.aam_create_model(path)
        if self._handle is None:
            raise Exception("Failed to load model")

    def get_name(self) -> str:
        """Get the name from the model's metadata"""
        return _c.ffi.aam_get_name(self._handle)

    def get_description(self) -> str:
        """Get the description from the model's metadata"""
        return _c.ffi.aam_get_description(self._handle)

    def get_language(self) -> str:
        """Get the language from the model's metadata"""
        return _c.ffi.aam_get_language(self._handle)

    def get_sample_rate(self) -> int:
        """Get the sample rate from the model's metadata"""
        return _c.ffi.aam_get_sample_rate(self._handle)

    def __del__(self):
        _c.ffi.aam_free(self._handle)
        self._handle = None
class Result(value, names=None, *, module=None, qualname=None, type=None, start=1)
Result type that is passed to your handler
class Result(IntEnum):
    """
    Result type that is passed to your handler
    """
    PARTIAL_RECOGNITION = 1,
    """A partial recognition. The next handler call will contain an updated
    list of tokens."""

    FINAL_RECOGNITION = 2,
    """A final recognition. The next handler call will start from an empty
    token list."""

    ERROR_CANT_KEEP_UP = 3,
    """In an asynchronous session, this may be called when the system can't
    keep up with the incoming audio, and samples have been dropped. The
    accuracy will be affected. An empty token list is given"""

    SILENCE = 4
    """Called after some silence. An empty token list is given"""
class Session(model: Model, callback: Callable[[Result, List[Token]], None], asynchronous: bool = False, no_rt: bool = False, speaker_name: str = '')
The session is what performs the actual speech recognition. It has
methods to input audio, and it calls your given handler with decoded
results.
You need to pass a Model when constructing a Session.
class Session:
    """
    The session is what performs the actual speech recognition. It has
    methods to input audio, and it calls your given handler with decoded
    results.

    You need to pass a Model when constructing a Session.
    """
    def __init__(self,
            model: Model,
            callback: Callable[[Result, List[Token]], None],
            asynchronous: bool = False,
            no_rt: bool = False,
            speaker_name: str = ""
        ):
        config = _c.AprilConfig()
        config.flags = _c.AprilConfigFlagBits()

        if asynchronous and no_rt:
            config.flags.value = 2
        elif asynchronous:
            config.flags.value = 1
        else:
            config.flags.value = 0

        if speaker_name != "":
            spkr_data = struct.pack("@q", hash(speaker_name)) * 2
            config.speaker = _c.AprilSpeakerID.from_buffer_copy(spkr_data)

        config.handler = _HANDLER
        config.userdata = id(self)

        self.model = model
        self._handle = _c.ffi.aas_create_session(model._handle, config)
        if self._handle is None:
            raise Exception()

        self.callback = callback

    def get_rt_speedup(self) -> float:
        """
        If the session is asynchronous and realtime, this will return a
        positive float. A value below 1.0 means the session is keeping up, and
        a value greater than 1.0 means the input audio is being sped up by that
        factor in order to keep up. When the value is greater 1.0, the accuracy
        is likely to be affected.
        """
        return _c.ffi.aas_realtime_get_speedup(self._handle)

    def feed_pcm16(self, data: bytes) -> None:
        """
        Feed the given pcm16 samples in bytes to the session. If the session is
        asynchronous, this will return immediately and queue the data for the
        background thread to process. If the session is not asynchronous, this
        will block your thread and potentially call the handler before
        returning.
        """
        _c.ffi.aas_feed_pcm16(self._handle, data)

    def flush(self) -> None:
        """
        Flush any remaining samples and force the session to produce a final
        result.
        """
        _c.ffi.aas_flush(self._handle)

    def __del__(self):
        _c.ffi.aas_free(self._handle)
        self.model = None
        self._handle = None
class Token(token)
A token may be a single letter, a word chunk, an entire word, punctuation,
or other arbitrary set of characters.
To convert a token array to a string, simply concatenate the strings from
each token. You don't need to add spaces between tokens, the tokens
contain their own formatting.
Tokens also contain the log probability, and a boolean denoting whether or
not it's a word boundary. In English, the word boundary value is equivalent
to checking if the first character is a space.
class Token:
    """
    A token may be a single letter, a word chunk, an entire word, punctuation,
    or other arbitrary set of characters.

    To convert a token array to a string, simply concatenate the strings from
    each token. You don't need to add spaces between tokens, the tokens
    contain their own formatting.

    Tokens also contain the log probability, and a boolean denoting whether or
    not it's a word boundary. In English, the word boundary value is equivalent
    to checking if the first character is a space.
    """
    token: str = ""
    logprob: float = 0.0
    word_boundary: bool = False
    sentence_end: bool = False
    time: float = 0.0

    def __init__(self, token):
        self.token = token.token.decode("utf-8")
        self.logprob = token.logprob
        self.word_boundary = (token.flags.value & 1) != 0
        self.sentence_end = (token.flags.value & 2) != 0
        self.time = float(token.time_ms) / 1000.0
The C# bindings follow the same pattern. Most of the examples use a very simple method like this to load and feed audio:
// Read the file data (assumes wav file is 16-bit PCM wav)
var fileData = File.ReadAllBytes(wavFilePath);
short[] shorts = new short[fileData.Length / 2];
Buffer.BlockCopy(fileData, 0, shorts, 0, fileData.Length);
// Feed the data
session.FeedPCM16(shorts, shorts.Length);
This works only if the wav file is PCM16 and sampled in the correct sample rate. When you attempt to load an mp3, non-PCM16/non-16kHz wav file, or any other audio file in this way, you will likely get gibberish or no results.
using AprilAsr;
var modelPath = "aprilv0_en-us.april";
var wavFilePath = "audio.wav";
// Load the model and print metadata
var model = new AprilModel(modelPath);
Console.WriteLine("Name: " + model.Name);
Console.WriteLine("Description: " + model.Description);
Console.WriteLine("Language: " + model.Language);
// Create the session with an inline callback
var session = new AprilSession(model, (result, tokens) => {
    string s = "";
    if(result == AprilResultKind.PartialRecognition) {
        s = "- ";
    }else if(result == AprilResultKind.FinalRecognition) {
        s = "@ ";
    }else{
        s = " ";
    }

    foreach(AprilToken token in tokens) {
        s += token.Token;
    }

    Console.WriteLine(s);
});
// Read the file data (assumes wav file is 16-bit PCM wav)
var fileData = File.ReadAllBytes(wavFilePath);
short[] shorts = new short[fileData.Length / 2];
Buffer.BlockCopy(fileData, 0, shorts, 0, fileData.Length);
// Feed the data and flush
session.FeedPCM16(shorts, shorts.Length);
session.Flush();
Congratulations! You have just performed speech recognition using AprilAsr!
The session is what performs the actual speech recognition. It has methods to input audio, and it calls your given handler with decoded results.
PartialRecognition
A partial recognition. The next handler call will contain an updated list of tokens.
FinalRecognition
A final recognition. The next handler call will start from an empty token list.
ErrorCantKeepUp
In an asynchronous session, this may be called when the system can't keep up with the incoming audio, and samples have been dropped. The accuracy will be affected. An empty token list is given
Silence
Called after some silence. An empty token list is given