	Field \| Response
	:------------------------------------------------------------------------------------------------------\|:---------------------------------------------------------------------------------
	Intended Task/Domain: \| Speech-to-Speech Conversational Chat Agent
	Model Type: \| Speech Encoder (ConvNet, Transformer), Temporal Transformer, Depth Transformer, Speech Decoder (ConvNet, Transformer)
	Intended Users: \| People working with conversational AI systems that respond to user speech with generated speech (and optionally accompanying text).
	Output: \| Speech, Text
	Describe how the model works: \| Personaplex is a real-time speech-to-speech conversational model that jointly performs streaming speech understanding and speech generation. The model operates on continuous audio encoded with a neural codec and predicts both text tokens and audio tokens autoregressively to produce its spoken responses. Incoming user audio is incrementally encoded and fed to the model while Personaplex simultaneously generates its own outgoing speech, enabling natural conversational dynamics such as interruptions, barge-ins, overlaps, and rapid turn-taking.<br> Personaplex runs in a dual-stream configuration in which listening and speaking occur concurrently. This design allows the model to update its internal state based on the user’s ongoing speech while still producing fluent output audio, supporting highly interactive conversations. <br> Prior to conversation, Personaplex is conditioned on two prompts: a voice prompt and a text prompt. The voice prompt consists of a sequence of audio tokens that establish the target vocal characteristics and speaking style. The text prompt specifies persona attributes such as role, background, and scenario context. Together, these prompts define the model's conversational identity and guide its linguistic and acoustic behavior throughout the interaction.
	Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: \| Not Applicable
	Technical Limitations & Mitigation: \| Personaplex was trained with a 2048-token context window, corresponding to roughly 160 seconds of audio. Conversational context older than this window may not be retained reliably.<br> The model’s knowledge and linguistic competence derive from its underlying Moshi base model. As a result, Personaplex may produce inaccurate or outdated responses and does not have access to recent events or comprehensive world knowledge. <br>Personaplex was not explicitly trained for reasoning or alignment. Its performance on tasks requiring multi-step reasoning, arithmetic, or safety-aligned behavior may therefore be limited.
	Verified to have met prescribed NVIDIA quality standards: \| Yes
	Performance Metrics: \| Conversational dynamics metrics include takeover rate (TOR); latency (seconds) for interruptions, pause handling, and turn-taking; frequency and Jensen–Shannon divergence (JSD) for backchannels; question-answering accuracy (score); and speaker similarity (SSIM).
	Potential Known Risks: \| The model may make errors in speech understanding and/or pronunciation. Its training data consists exclusively of English speech, so it is not expected to generalize well to non-English languages. Personaplex was trained primarily on assistant-style, customer service, and general conversational interactions; its behavior may degrade when applied outside these domains. For security and privacy reasons, Personaplex does not support generating voice prompts from real user voice recordings.
	Licensing: \| GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). ADDITIONAL INFORMATION: [CC-BY-4.0](https://huggingface.co/kyutai/moshiko-pytorch-bf16).
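
The context-window figure in the Technical Limitations row (2048 tokens, roughly 160 seconds) can be checked with a short calculation. The sketch below is illustrative only: it assumes one model step per codec frame at 12.5 frames per second, which is the frame rate of the Mimi codec used by the Moshi base model and is not stated in this card. The function names are hypothetical.

```python
import math

# Assumption (not stated in this card): one token step per codec frame,
# at 12.5 frames per second as in the Mimi codec used by Moshi.
FRAME_RATE_HZ = 12.5
CONTEXT_TOKENS = 2048  # context window size given in the card


def context_seconds(tokens: int = CONTEXT_TOKENS,
                    frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Seconds of audio covered by a context window of `tokens` steps."""
    return tokens / frame_rate_hz


def frames_outside_context(conversation_seconds: float,
                           tokens: int = CONTEXT_TOKENS,
                           frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of earliest frames that no longer fit in the window."""
    total_frames = math.ceil(conversation_seconds * frame_rate_hz)
    return max(0, total_frames - tokens)


if __name__ == "__main__":
    print(f"window covers about {context_seconds():.2f} s of audio")
    print("frames dropped after a 5-minute call:", frames_outside_context(300))
```

Under this assumption the window spans about 164 seconds, consistent with the "roughly 160 seconds" stated in the table, so a conversation longer than about two and a half minutes would start to lose its earliest turns.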
