Inference Framework

Overview

wav2letter@anywhere is a multithreaded, multi-platform library that lets researchers, production engineers, and students quickly put together trained DNN modules for online inference. This document is a user guide with an in-depth description of the streaming library architecture and its tradeoffs.

Why Should You Use it?

The streaming inference DNN processing graph allows researchers, production engineers, and students to quickly:

  • Build a streaming speech recognition system using wav2letter++.
  • Easily load trained modules into memory and assemble an efficient processing graph.
  • Run an unlimited number of concurrent processing streams.
  • Use compressed loaded modules for efficient memory use while maintaining high throughput and very low latency.
  • Use the currently released version, which supports hosts; versions supporting Android and iOS will be released soon.
  • Load existing trained modules, or your own, from the wav2letter training framework.

Features

Module Library

The inference streaming framework comes packed with ASR lego-block modules such as a fully connected linear layer, streaming convolution, a decoder, activation functions, feature extraction, and more.

Trained modules

Trained English modules are freely available at: ...

Conversion tool

This easily extensible tool imports trained modules into the inference format, which can use less memory. It currently supports a 16-bit floating-point internal representation; converting 32-bit modules to 16-bit reduces the memory size while keeping high inference quality.

The tool currently supports input from wav2letter, but users are encouraged to extend this simple tool for any needed input format.

Serialization

The inference streaming framework supports serialization into binary files, as well as into JSON and XML formats and into any streamable destination. The user can later read such a file to create a fully functional streaming inference DNN.

Multithreading

A single configured module supports any number of concurrent streams. This architecture maximizes efficiency while minimizing memory use.

Multi Platform

The streaming inference architecture is easily expandable to new platforms. Currently we release the FBGEMM backend for hosts and servers. Soon, we'll release the Android and iOS backends for on-device inference.

Free and open source

Wav2letter and wav2letter-inference are free and open-source projects supported by Facebook scientists and engineers and by the community.

Software architecture

This section describes the streaming library's main abstractions.

Inference Module

An inference module is the base abstraction for objects that process streaming input into output. It is the base class for DNN modules, activation functions, and composite modules. Composite modules chain modules, activation functions, and possibly other composite modules.


Composite modules are composed of simpler modules. For example, the TDSBlock is a subclass of Sequential and is composed of other modules, including Residual modules; it also contains activation functions. Each module has a memory manager, which it uses to allocate temporary workspaces for processing. Users can use the default memory manager or extend that class in order to optimize for specific cases, collect statistics, and so on.
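
Since composite modules are themselves inference modules, composites can be nested inside other composites. A minimal sketch (using the Sequential, Relu, and factory APIs shown later in this guide; the conv, linear, and dataType variables are assumed to be built as in the Code Example section):

auto block = std::make_shared<Sequential>();
block->add(conv);                             // a Conv1dCreate(...) module
block->add(std::make_shared<Relu>(dataType)); // activation function inside the composite

auto network = std::make_shared<Sequential>();
network->add(block);  // one composite nested inside another
network->add(linear); // a LinearCreate(...) module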

class InferenceModule {
 public:
  using StatePtr = std::shared_ptr<ModuleProcessingState>;

  virtual StatePtr start(StatePtr input) = 0;

  virtual StatePtr run(StatePtr input) = 0;

  virtual StatePtr finish(StatePtr input) {
    return run(input);
  }

  void setMemoryManager(std::shared_ptr<MemoryManager> memoryManager);
};

An inference module processes a stream using three methods: start(), run(), and finish(). The user should call start() at the beginning of the stream, run() to process streaming input as it arrives, and finish() at the end of the stream. Each of these methods takes a shared pointer to a ModuleProcessingState as input and returns another object of the same type as output. The same input object always yields the same output object, regardless of which method is called and how many times.
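
As a minimal sketch of this calling convention (module stands for any already-constructed inference module):

auto input = std::make_shared<ModuleProcessingState>(1);
auto output = module->start(input);
// run() and finish() hand back the very same output object that start() returned.
assert(module->run(input) == output);    // requires <cassert>
assert(module->finish(input) == output);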

Module Processing State

Stream processing requires keeping some intermediate state. We keep that state in a vector of buffers per stream per module (only for modules that need it). ModuleProcessingState abstracts the complete state per stream as a linked list from first input to final output.


A module processing state is a linked-list node. The user creates a ModuleProcessingState, writes the first input of the stream into one or more of its buffers, and calls start(). start() allocates an output ModuleProcessingState and sets it as the next node in the list. The user gets that output object; its buffer(s) hold the result. Complex modules may create multiple links in the ModuleProcessingState list between the user's input and the returned output.

In the diagram above, mps is short for ModuleProcessingState; each mps holds the state for one stream.

class ModuleProcessingState {
 public:
  std::shared_ptr<IOBuffer>& buffer(int index);

  std::vector<std::shared_ptr<IOBuffer>>& buffers();

  std::shared_ptr<ModuleProcessingState> next(bool createIfNotExists = false);

 private:
  std::vector<std::shared_ptr<IOBuffer>> buffers_;
  std::shared_ptr<ModuleProcessingState> next_;
};
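
Because each node points to the next one, the chain from the user's input to a module's final output can be walked with next(). A small sketch (assuming next() returns an empty pointer when no next node exists and createIfNotExists is false):

// Walk from the user's input state to the last state in the chain.
auto state = input; // the ModuleProcessingState passed to start()/run()
while (state->next() != nullptr) {
  state = state->next();
}
// `state` is now the final node; its buffers hold the module's output.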

IOBuffer

The IOBuffer is a simple, self-growing memory buffer. You'll mostly use these to write input and read output. A ModuleProcessingState essentially holds a vector of them. Users and modules may use an IOBuffer for different element types; to that end, templated methods allow accessing the buffer as a buffer of any type. The size in all these methods is counted in elements of the specified type.

class IOBuffer {
 public:
  template <typename T> void write(const T* buf, int size);

  template <typename T> void consume(int size);

  template <typename T> int size() const;

  template <typename T> T* data();
};
 

Example use of member template methods:

vector<float> myInput = {..};
inputBuffer->write<float>(myInput.data(), myInput.size());
float* bufPtr = inputBuffer->data<float>();
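
The remaining template methods follow the same pattern; for example, a brief sketch of a read-then-consume cycle (here consume() is assumed to discard elements that have already been processed):

int available = inputBuffer->size<float>(); // number of floats currently in the buffer
const float* all = inputBuffer->data<float>();
// ... process `available` floats starting at `all`, then drop them:
inputBuffer->consume<float>(available);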

Memory Manager

When modules need workspace for calculations, they ask the memory manager. Users can create specialized memory managers by subclassing MemoryManager. If the user does not set a memory manager, the DefaultMemoryManager is used; this manager simply calls malloc and free.
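
As a purely hypothetical illustration (the real virtual interface is declared in the framework's MemoryManager header; the allocate()/free() signatures below are assumptions), a statistics-collecting manager might look like this:

// Hypothetical sketch: assumes MemoryManager declares virtual allocate()/free().
class CountingMemoryManager : public MemoryManager {
 public:
  void* allocate(size_t sizeInBytes) override {
    bytesAllocated_ += sizeInBytes; // atomic, since many streams may run at once
    return std::malloc(sizeInBytes);
  }

  void free(void* ptr) override {
    std::free(ptr);
  }

  size_t bytesAllocated() const {
    return bytesAllocated_;
  }

 private:
  std::atomic<size_t> bytesAllocated_{0};
};

// setMemoryManager() is part of InferenceModule (see above):
// myModule->setMemoryManager(std::make_shared<CountingMemoryManager>());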

Flexible Backend

Modules that can use backend acceleration, such as Conv1d and Linear layers, are instantiated using a factory function. The factory function creates a subclass object that is accelerated and optimized for the current architecture. For example, createLinear() is declared in Linear.h in the module/nn directory.

using ParamPtr = std::shared_ptr<ModuleParameter>;
 
std::shared_ptr<Linear> createLinear(
   int nInput,
   int nOutput,
   ParamPtr weights,
   ParamPtr bias);

createLinear() returns a subclass of Linear that uses the best backend for the current architecture.

The backend is selected at build time by setting W2L_INFERENCE_BACKEND:

cmake -DW2L_INFERENCE_BACKEND=fbgemm

This will create a Makefile that picks the backend implemented in the fbgemm source directory.

inference
 ├── common
 ├── decoder
 └── module
      └── nn
           ├── Linear.h
           └── backend
                └── fbgemm
                     └── LinearFbGemm.cpp
The function createLinear() is implemented in LinearFbGemm.cpp

using ParamPtr = std::shared_ptr<ModuleParameter>;
 
std::shared_ptr<Linear> createLinear(
   int nInput,
   int nOutput,
   ParamPtr weights,
   ParamPtr bias) {
 return std::make_shared<LinearFbGemm>(nInput, nOutput, weights, bias);
}

It returns a LinearFbGemm, a subclass of Linear that uses the FBGEMM library, which is optimized for high-performance, low-precision (16-bit FP) computation on x86 machines.

Code Example

Create a simple module

Creating a module is simple. All you need are the module's parameter values. We wrap the values in ModuleParameter objects and create the module directly.

#include "inference/module/Module.h"
#include "inference/module/Conv1dCreate.h"
 
// Create or load the parameter raw data
std::vector<float> convWeights = {-0.02, 0.21, ...};
std::vector<float> convBias = {0.1, -0.2};
 
// Use the raw data to create inference parameter objects.
const auto convWeightParam = std::make_shared<ModuleParameter>(
   DataType::FLOAT, convWeights.data(), convWeights.size());
const auto convBiasParam = std::make_shared<ModuleParameter>(
   DataType::FLOAT, convBias.data(), convBias.size());
 
// Create a configured DNN module.
auto conv = Conv1dCreate(
   inputChannels,
   outputChannels,
   kernelSize,
   stride,
   {leftPadding, rightPadding},
   groups,
   convWeightParam,
   convBiasParam);
 

Assemble Complex Modules

Complex networks are assembled from simple layers using the Sequential module.

auto linear = LinearCreate(inputChannels,
                       outputChannels,
                       linearWeightParam,
                       linearBiasParam);
 
auto layerNorm = std::make_shared<LayerNorm>(channels, layerNormWeight, layerNormbias);
 
auto sequence = std::make_shared<Sequential>();
sequence->add(conv);
sequence->add(std::make_shared<Relu>(dataType));
sequence->add(layerNorm);
sequence->add(std::make_shared<Relu>(dataType));
sequence->add(linear);

Process Input

auto input = std::make_shared<ModuleProcessingState>(1);
auto output = sequence->start(input);
 
std::shared_ptr<IOBuffer> inputBuffer = input->buffer(0);
std::shared_ptr<IOBuffer> outputBuffer = output->buffer(0);
 
while (yourInputSource.hasMore()) {
  vector<float> yourInput = yourInputSource.nextChunk();
  inputBuffer->write<float>(yourInput.data(), yourInput.size());
 
  // Run the module on the next input.
  // The buffers of the output are updated. The output object is the same one
  // returned by start() and by every call to run().
  output = sequence->run(input);
 
  UseTheResult(outputBuffer->data<float>(), outputBuffer->size<float>());
}
 
output = sequence->finish(input);
UseTheResult(outputBuffer->data<float>(), outputBuffer->size<float>());
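
Because a single configured module supports any number of concurrent streams, the same sequence can serve several independent streams at once; each stream only needs its own ModuleProcessingState. A minimal sketch (chunkA and chunkB stand for per-stream input vectors; thread management is omitted):

// One configured module, two independent streams.
auto streamA = std::make_shared<ModuleProcessingState>(1);
auto streamB = std::make_shared<ModuleProcessingState>(1);
auto outputA = sequence->start(streamA);
auto outputB = sequence->start(streamB);

// Each stream is driven independently (possibly from different threads)
// while sharing the same `sequence` object.
streamA->buffer(0)->write<float>(chunkA.data(), chunkA.size());
outputA = sequence->run(streamA);

streamB->buffer(0)->write<float>(chunkB.data(), chunkB.size());
outputB = sequence->run(streamB);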

Serialize

// Save sequence to a binary file.
{
  ofstream myfile("dnn.bin");
  cereal::BinaryOutputArchive archive(myfile);
  archive(sequence);
}
 
// Load sequence from a binary file.
{
  ifstream myfile("dnn.bin");
  std::shared_ptr<Sequential> sequence;
  cereal::BinaryInputArchive archive(myfile);
  archive(sequence);

  // ... sequence->run(...)
}
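
JSON and XML archives follow the same pattern; for example, with cereal's JSON archive (a sketch, assuming the modules' serialization functions are registered for this archive type):

// Save sequence to a JSON file.
{
  ofstream myfile("dnn.json");
  cereal::JSONOutputArchive archive(myfile); // from <cereal/archives/json.hpp>
  archive(sequence);
}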

Conclusion

wav2letter@anywhere is a high-performance, low-overhead, multithreaded, multi-platform framework for quickly assembling ASR inference for research and for embedding in products.