Keyword Detection integration

Keyword Detection (KD), also known as Voice Activation or Sound Trigger, is a feature that triggers a speech recognition engine when a predefined keyphrase (keyword) is successfully detected. Offloading the keyphrase detection algorithm to an embedded processing environment (e.g. a dedicated DSP) reduces system power consumption while listening for an utterance.

The terms Voice Activation and Keyphrase Detection are often used interchangeably to describe end-to-end system-level use cases that include:

  • Keyphrase detection algorithm

  • Keyphrase enrollment (parametrization of keyphrase detection algorithm)

  • Management of an audio stream that is used to transport utterances

  • Steps made to reduce system-level power consumption

  • System wakeup on keyphrase detection

The term Keyphrase Detector typically identifies a firmware processing component that implements an algorithm for keyphrase detection in an audio stream.

The term speech audio stream indicates that the stream is primarily used to deliver data to an automatic speech recognition (ASR) algorithm, while voice audio stream typically indicates that the recipient of the audio data is a human.

Depending on system-level requirements for the keyphrase detection algorithm and the speech recognition engine, different policies for keyphrase buffering and voice data streaming may be applied. This document covers the reference implementation available in SOF. The following sections cover the functional scope.

Note

Currently, SOF implements the Keyphrase Detector component with a reference trigger function that allows testing of the E2E flow by detecting a rapid volume change.
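
The reference trigger is conceptually simple. Below is a minimal, self-contained sketch of a rapid-volume-change detector operating on blocks of 16-bit PCM samples; it illustrates the idea only and is not the actual SOF reference implementation (the function names are hypothetical).

    /* Illustrative sketch: report a trigger when the mean absolute level of
     * the current block jumps sharply relative to the previous block.
     * Names (kd_block_level, kd_volume_trigger) are hypothetical.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Mean absolute level of one block of 16-bit samples. */
    static uint32_t kd_block_level(const int16_t *samples, size_t count)
    {
        uint64_t sum = 0;

        for (size_t i = 0; i < count; i++)
            sum += (uint64_t)abs(samples[i]);

        return count ? (uint32_t)(sum / count) : 0;
    }

    /* True when the current block is 'ratio' times louder than the previous
     * one, i.e. on a rapid volume change.
     */
    static bool kd_volume_trigger(uint32_t prev_level, uint32_t cur_level,
                                  uint32_t ratio)
    {
        return prev_level && cur_level > prev_level * ratio;
    }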

Timing sequence

@startuml



scale max 1024 width



footer: timeline not to scale 

robust "Speech application" as App

concise "Audio Stream" as Audio



App is idle

Audio is "Preceding"



@App

0 is idle

+180 is Processing



@Audio

0 is Keyphrase

@0 <-> @100 : keyphrase length - L1

@100 <-> @+80 : detection\ntime - L2

@180 <-> @+80 : burst \ntransmission time - L3

Audio@180 -> App@180 : notification

@260 <-> @+60 : safety \nmargin - L4

100 is Command

+200 is Following

@enduml

Figure 67 Basic diagram for a timing sequence

A keyphrase is preceded by a period of silence and is followed by a user command. To balance power savings and user experience, the host system (CPU) is activated only if a keyphrase is detected. To reduce the number of false triggers for user commands, the keyphrase can be sent to the host for additional (2nd stage) verification. This requires the FW to buffer the keyphrase in memory. Keyphrase transmission to the host is performed as fast as possible (faster than real time) to reduce the latency of the system response.
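
To put "faster than real time" in perspective, consider a purely illustrative case: 2 s of buffered 16 kHz, 16-bit, 2-channel audio amounts to 2 * 16000 * 2 * 2 = 128,000 bytes (125 KiB). If the burst to the host sustains, for example, 10 MB/s (an assumed figure, chosen only for the order of magnitude), draining that history takes roughly 13 ms, compared with the 2 s it took to capture.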

End-2-End flows

@startuml



scale max 1024 width



participant "Userspace component" as usr

participant "Audio driver" as drv

participant "FW infrastructure" as fw

participant "Data transfer to Host" as dma

participant "Keyword detection algorithm" as kda

participant "Data transfer to DSP" as gpdma



box "Linux User/Kernel space" #LavenderBlush

	participant usr

	participant drv

end box



box "DSP" #LightBlue

	participant fw

	participant dma

	participant kda

	participant gpdma

end box



activate fw



drv -> fw : Setup audio topology \n (Speech Capture & Keyword Detection pipes)

usr -> drv : Prepare & Open PCM capture \n(snd_pcm_open/snd_pcm_hw_params)

drv -> fw : Stream Open & Preparation

drv -> fw : HW Params

group optional (depends on keyword detection algorithm implementation)

 usr -> drv : Send keyword detection algorithm parameters \n (snd_ctl_elem_tlv_write)

 drv -> fw : Send keyword detection algorithm parameters

 fw -> kda : Send keyword detection algorithm parameters

end



drv ->drv : DAPM power up event

drv -> fw : HW Params for Keyphrase Detection Pipeline

usr -> drv : Trigger start (alsamixer)

drv -> fw : Keyword detection algorithm & buffer manager triggered



fw -> fw : Keyphrase Buffer Manager \nin acquisition mode

fw -> gpdma 



activate gpdma



fw -> kda : keyphrase detection enabled



activate kda



usr -> drv : Trigger start (snd_pcm_read)



note over usr

Speech application indefinitely 

waits for data.

end note 



ref over usr, drv, fw , gpdma, kda, dma  

Speech Capture pipeline is not transmitting data to Host system

Host system may enter the low power state

end ref



loop keyword detection algorithm \nexecuted on DSP

 kda <- gpdma 

end



hnote over kda : keyword is detected



fw <-- kda : FW event on keyword detection

fw -> kda : keyword detection disabled



deactivate kda 



fw -> fw : Keyphrase Buffer Manager \nin drain mode

drv <-- fw : notification on keyword detection

'drv -> fw : enable data transission to Host \n(Capture[Speech] pipeline to Host is running)

usr <-- drv : notification on keyword detection (optional)

gpdma -> dma 



activate dma



ref over dma 

Sending a burst of historic data (approx. 2 s)

with detected keyword for

second stage verification on host.

end ref



gpdma <-- dma 



deactivate dma



usr <-- drv : snd_pcm_read completed 



fw -> fw : Keyphrase Buffer Manager \nin passthrough mode 



loop Realtime capture

 usr -> drv : snd_pcm_read

 gpdma -> dma 

 activate dma

 gpdma <-- dma 



 deactivate dma

 usr <-- drv : snd_pcm_read completed 

end 



ref over usr 

User space optionally performs second stage keyword verification.

end ref



usr -> drv : Trigger stop (alsamixer)

drv ->drv : DAPM power down event

drv -> fw : Stop Keyphrase Detection algorithm pipeline

usr -> drv : Trigger stop (snd_pcm_drop / snd_pcm_free)

drv -> fw : Close Speech capture stream

fw -> gpdma 



deactivate gpdma



ref over usr, drv, fw , gpdma, kda, dma  

The flow can be repeated for next user command starting from snd_pcm_open()

end ref



deactivate fw

@enduml

Figure 68 E2E flow for SW/FW components

The fundamental assumption for the flow is that the keyphrase detection sequence is controlled by the user space component (application) that opens and closes the speech audio stream. The audio topology must be set up before the speech stream is opened. There is an optional sequence to customize the keyword detection algorithm's behavior by sending run-time parameters.

During the Stream Open and Preparation phase, HW parameters are sent to the DAI and configuration parameters are passed from the topology to the FW components. The audio driver uses DAPM event handlers to control the Keyphrase Detector node of the FW topology graph. Once the keyphrase is detected, a notification is sent to the driver. At the same time, an internal FW event triggers draining of the buffered audio data to the host in burst mode. Once the buffer is drained, the speech capture pipeline works as a passthrough capture until it is closed by the user space application.
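
The user-space steps in the sequence above map onto standard ALSA calls. The sketch below is a minimal illustration, assuming a capture PCM exposed as "hw:0,7" with a 16 kHz, 16-bit, 2-channel format (the device name and format are assumptions, not SOF defaults); the "Trigger start (alsamixer)" step is a separate mixer control and is not shown.

    /* Minimal sketch of the user-space capture side of the E2E flow.
     * Device name, rate and channel count are illustrative assumptions.
     */
    #include <alsa/asoundlib.h>
    #include <stdint.h>

    int capture_after_keyphrase(void)
    {
        snd_pcm_t *pcm;
        snd_pcm_hw_params_t *params;
        int16_t buf[1600 * 2];    /* 100 ms of 16 kHz stereo S16_LE */
        int err;

        /* "Prepare & Open PCM capture" */
        err = snd_pcm_open(&pcm, "hw:0,7", SND_PCM_STREAM_CAPTURE, 0);
        if (err < 0)
            return err;

        snd_pcm_hw_params_alloca(&params);
        snd_pcm_hw_params_any(pcm, params);
        snd_pcm_hw_params_set_access(pcm, params, SND_PCM_ACCESS_RW_INTERLEAVED);
        snd_pcm_hw_params_set_format(pcm, params, SND_PCM_FORMAT_S16_LE);
        snd_pcm_hw_params_set_channels(pcm, params, 2);
        snd_pcm_hw_params_set_rate(pcm, params, 16000, 0);
        err = snd_pcm_hw_params(pcm, params);    /* "HW Params" step */
        if (err < 0)
            goto out;

        /* snd_pcm_readi() blocks until the DSP detects the keyphrase and the
         * buffered (historic) audio is drained to the host in a burst; after
         * that, reads continue as a real-time passthrough capture.
         */
        for (;;) {
            snd_pcm_sframes_t frames = snd_pcm_readi(pcm, buf, 1600);

            if (frames < 0) {
                err = (int)frames;
                break;
            }
            /* Hand 'frames' frames to 2nd stage verification / ASR here;
             * break out when the application decides to stop capture.
             */
        }
    out:
        snd_pcm_close(pcm);
        return err;
    }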

FW topology

@startuml



scale max 1024 width



skinparam rectangle {

   backgroundColor<<dai>> #6fccdd

   backgroundColor<<dma>> #f6ed80

   backgroundColor<<stream>> #d6d6de

   borderColor<<stream>> #d6d6de

   borderColor<<ppl>> #a1a1ca



   backgroundColor<<event>> #f05772

   stereotypeFontColor<<event>> #ffffff

   fontColor<<event>> #ffffff



   backgroundColor<<cpu>> #f0f0f0

}





together {

rectangle "MIC HW" as dmic #DDDDDD



rectangle "Speech Capture Pipeline" as ppl_1 <<FW pipeline >>{

 rectangle "MIC DAI" as dai_1 <<dai>>

 rectangle "Keyphrase Buffer Manager" as kpb

 dai_1 -> kpb : 2ch/16kHz/16bit

 rectangle "Host" as host

 }



}



rectangle "Keyphrase Detector Pipeline" as ppl_2 <<FW pipeline >>{

 rectangle "Channel selector" as sel

 rectangle "Keyphrase detection algorithm" as wov

 sel -> wov : 1ch/16kHz/16bit

}



rectangle "Host System" as hsys {

 rectangle "Host Memory" as hmem #DDDDDD

}



dmic -> dai_1

kpb -> host

kpb -> sel : 2ch/16kHz/16bit

host -> hmem : 2ch/16kHz/16bit

wov ..> kpb : FW events

wov ..> hsys : FW notifications

@enduml

Figure 69 Basic diagram for FW components topology

The diagram above provides an overview of FW and HW components that play a role in keyphrase detection flows. The components are organized in pipelines:

  1. Speech capture pipeline

    1. DMIC DAI configures the HW interface to capture data from microphones.

    2. The Keyphrase Buffer Manager is responsible for managing the data captured by the microphones. This includes control of an internal buffer for incoming data and routing of incoming audio samples. The buffer holding historic audio data is implemented as a cyclic buffer. While listening for a keyphrase, the component stores incoming data in the internal buffer and copies it to a sink that leads toward the keyword detector component. On successful detection of a keyphrase, the buffer is drained in a burst transmission to the host. Once the buffer is drained, the component starts to work as a passthrough component on the capture pipeline (a simplified ring-buffer sketch follows this list).

    3. The host component configures transport (over DMA) to the host system. The component is responsible for transmitting data from local memory (FW accessible) to remote (host CPU accessible) memory.

  2. Keyphrase detector pipeline

    1. The channel selector is responsible for providing a single channel on input to the keyphrase detection algorithm. The decision of which channel to select is made by the platform integrator. The component can accept parameters from a topology file.

    2. The keyphrase detection algorithm accepts audio frames and reports whether a keyphrase has been detected. Note that the FW infrastructure allows a FW event to be sent to the Keyphrase Buffer Manager component when a keyphrase is detected. The component also sends a notification to the audio driver and implements support for large (run-time) parameters.
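
The history-buffer behavior of the Keyphrase Buffer Manager can be illustrated with a small ring-buffer sketch. The structure and function names below are hypothetical and are not taken from the SOF kpb sources; the sketch only shows how the write pointer wraps around and where a drain would start.

    /* Illustrative cyclic ("history") buffer. Names are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    struct history_buf {
        uint8_t *data;    /* backing storage */
        size_t size;      /* total size in bytes */
        size_t wp;        /* write pointer: where new mic data lands */
    };

    /* Store newly captured audio, wrapping around at the end of the buffer. */
    static void hb_write(struct history_buf *hb, const uint8_t *src, size_t bytes)
    {
        for (size_t i = 0; i < bytes; i++) {
            hb->data[hb->wp] = src[i];
            hb->wp = (hb->wp + 1) % hb->size;
        }
    }

    /* On keyphrase detection: place the read pointer "history depth" bytes
     * behind the write pointer so that draining starts at the beginning of
     * the buffered keyphrase (history_depth must not exceed hb->size).
     */
    static size_t hb_drain_start(const struct history_buf *hb, size_t history_depth)
    {
        return (hb->wp + hb->size - history_depth) % hb->size;
    }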

KPBM state diagram

The state diagram below presents all possible keyphrase buffer manager states as well as the valid transitions between them.

@startuml

[*] --> KPB_DISABLED:  Start \nor\n [IPC] free \nmessage \nfrom either  state

KPB_DISABLED: Starting state of KPB - \nNo action has been taken yet

KPB_DISABLED--> KPB_CREATED: [IPC] \nnew component

KPB_DISABLED-[#0000FF]-> KPB_DISABLED: [IPC] \nreset



KPB_CREATED : New KPB component has been created

KPB_CREATED --> KPB_PREPARING: [IPC] \npcm params

KPB_CREATED -[#0000FF]-> KPB_CREATED : [IPC] \nreset



KPB_PREPARING: Prepare Key Phrase Buffer component.

KPB_PREPARING-> KPB_STATE_RUN: Success

KPB_PREPARING-> KPB_PREPARING: Failure

KPB_PREPARING-[#0000FF]-> KPB_PREPARING: [IPC] \nreset



KPB_STATE_RUN: KPB is prepared and ready.

KPB_STATE_RUN-[#0000FF]-> KPB_PREPARING: [IPC] \nreset

KPB_STATE_RUN---> KPB_STATE_INIT_DRAINING: [EVENT] \nkey phrase detected

KPB_STATE_RUN-> KPB_STATE_BUFFERING: Start \nbuffering



KPB_STATE_BUFFERING: Buffer incoming samples in the \ninternal history buffer

KPB_STATE_BUFFERING-> KPB_STATE_RUN: Done

KPB_STATE_BUFFERING-> KPB_STATE_INIT_DRAINING: Done

KPB_STATE_BUFFERING-> KPB_STATE_DRAINING: Done

KPB_STATE_BUFFERING-[#0000FF]-> KPB_STATE_RESETTING: [IPC] \nreset



KPB_STATE_INIT_DRAINING: KPB received detection event

KPB_STATE_INIT_DRAINING-[#0000FF]-> KPB_PREPARING: [IPC] \nreset

KPB_STATE_INIT_DRAINING--> KPB_STATE_DRAINING: Draining task starts

KPB_STATE_INIT_DRAINING--> KPB_STATE_BUFFERING: Start \nbuffering



KPB_STATE_DRAINING: KPB is draining internal history buffer \nto the client's buffer

KPB_STATE_DRAINING-->KPB_STATE_HOST_COPY: Draining done

KPB_STATE_DRAINING-[#0000FF]-> KPB_STATE_RESETTING: [IPC] \nreset

KPB_STATE_DRAINING--> KPB_STATE_BUFFERING: Start \nbuffering



KPB_STATE_RESETTING: KPB is preparing itself for the reset

KPB_STATE_RESETTING-->KPB_STATE_RESET_FINISHING



KPB_STATE_RESET_FINISHING: KPB is finishing reset sequence

KPB_STATE_RESET_FINISHING->KPB_PREPARING: Reset done



KPB_STATE_HOST_COPY: KPB is copying real time \nstream into client's buffer

@enduml

Figure 70 Keyphrase buffer manager state diagram
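
For reference, the states in the diagram above can be mirrored as a C enumeration with an explicit transition check. The sketch below is illustrative only: it encodes just the detection path (RUN, INIT_DRAINING, DRAINING, HOST_COPY) and is not the transition logic of the SOF kpb component.

    /* State names mirror Figure 70; only a subset of transitions is encoded. */
    #include <stdbool.h>

    enum kpb_state {
        KPB_DISABLED,
        KPB_CREATED,
        KPB_PREPARING,
        KPB_STATE_RUN,
        KPB_STATE_BUFFERING,
        KPB_STATE_INIT_DRAINING,
        KPB_STATE_DRAINING,
        KPB_STATE_HOST_COPY,
        KPB_STATE_RESETTING,
        KPB_STATE_RESET_FINISHING,
    };

    /* Example validity check for the detection path shown in the diagram. */
    static bool kpb_transition_valid(enum kpb_state from, enum kpb_state to)
    {
        switch (from) {
        case KPB_STATE_RUN:
            return to == KPB_STATE_INIT_DRAINING ||  /* key phrase detected */
                   to == KPB_STATE_BUFFERING ||      /* start buffering */
                   to == KPB_PREPARING;              /* [IPC] reset */
        case KPB_STATE_INIT_DRAINING:
            return to == KPB_STATE_DRAINING ||
                   to == KPB_STATE_BUFFERING ||
                   to == KPB_PREPARING;
        case KPB_STATE_DRAINING:
            return to == KPB_STATE_HOST_COPY ||
                   to == KPB_STATE_BUFFERING ||
                   to == KPB_STATE_RESETTING;
        default:
            return false;                            /* other states omitted */
        }
    }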

Latency & buffering

This section covers the calculations needed to properly configure the keyphrase buffer size. The symbols used in the formula below are depicted in the timing diagram above; see Basic diagram for a timing sequence.

Note

The formula for the size of the keyphrase buffer: ( L1 + L2 + L3 + L4 ) * sample rate * number of channels * bit depth = Size [bits] (see the worked example at the end of this section).

Specifically:

  1. L1 is defined as the length of the keyphrase, including any preceding or trailing silence. The value depends highly on the keyphrase itself and on the requirements of the detection algorithm.

  2. L2 is the sum of the algorithmic (processing) latency of the detection algorithm and the additional time needed to execute the other components in the pipelines and to prepare and send notifications.

  3. L3 is the time required to send already-buffered data to the host. Typically, a Write Pointer (WP) indicates where data coming from the microphones is written to the keyphrase buffer. The keyphrase buffer is organized as a cyclic buffer, and the WP advances as data arrives from the mics at a regular rate. The Read Pointer (RP) indicates the offset in the buffer from which data is fetched to the host. To start a burst transmission, the RP is set to the position WP - “history depth”. The history depth is defined in FW or passed from the topology. The RP moves faster than the WP because draining is executed as a background task. The draining phase lasts until the RP catches up with the WP, which advances at the regular (slower) rate. This marks the end of the L3 period; afterwards, the RP follows the WP at the rate at which data becomes available in the DAI DMA buffer. Implementation note: the “history depth” may be updated on the fly during the draining phase if new data is captured in the meantime.

  4. L4 is a safety margin that could be absorbed into any of the periods defined above; it is defined explicitly to make sure it is included in the calculation. The length of L4 depends on the audio frame size processed by the detector, the detector’s compute time, the output audio format, the keyphrase buffer size, etc.
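
A worked example of the formula, using purely illustrative timings (1.5 s keyphrase, 0.2 s detection, 0.1 s burst transmission, 0.2 s safety margin) together with the 16 kHz / 16-bit / 2-channel format shown in the FW topology above:

    /* Worked example of the keyphrase buffer size formula.
     * The L1..L4 values are illustrative assumptions, not SOF defaults.
     */
    #include <stdio.h>

    int main(void)
    {
        double l1 = 1.5;    /* keyphrase length [s] */
        double l2 = 0.2;    /* detection time [s] */
        double l3 = 0.1;    /* burst transmission time [s] */
        double l4 = 0.2;    /* safety margin [s] */

        double rate = 16000.0;    /* sample rate [Hz] */
        int channels = 2;
        int bit_depth = 16;

        double bits = (l1 + l2 + l3 + l4) * rate * channels * bit_depth;

        /* 2.0 s * 16000 Hz * 2 ch * 16 bit = 1,024,000 bits = 125 KiB */
        printf("keyphrase buffer: %.0f bits (%.1f KiB)\n",
               bits, bits / 8.0 / 1024.0);
        return 0;
    }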