---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- helios-pyramid
- text-to-image
---

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: HeliosPyramidAutoBlocks

**Description**: Auto Modular pipeline for pyramid progressive generation (T2V/I2V/V2V) using Helios.

This pipeline uses a 4-block architecture that can be customized and extended.

## Example Usage

[TODO]

## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`HeliosTextEncoderStep`) - Text encoder step that generates text embeddings to guide the video generation.
2. **vae_encoder** (`HeliosPyramidAutoVaeEncoderStep`) - Encoder step that encodes video or image inputs. This is an auto pipeline block.
   - *video_encoder*: `HeliosVideoVaeEncoderStep` - Video encoder step that encodes an input video into VAE latent space, producing image_latents (first frame) and video_latents (chunked video frames) for video-to-video generation.
   - *image_encoder*: `HeliosImageVaeEncoderStep` - Image encoder step that encodes an input image into VAE latent space, producing image_latents (first frame prefix) and fake_image_latents (history seed) for image-to-video generation.
3. **denoise** (`HeliosPyramidAutoCoreDenoiseStep`) - Pyramid core denoise step that selects the appropriate denoising block.
   - *video2video*: `HeliosPyramidV2VCoreDenoiseStep` - V2V pyramid denoise block with progressive multi-resolution denoising.
   - *image2video*: `HeliosPyramidI2VCoreDenoiseStep` - I2V pyramid denoise block with progressive multi-resolution denoising.
   - *text2video*: `HeliosPyramidCoreDenoiseStep` - T2V pyramid denoise block with progressive multi-resolution denoising.
4. **decode** (`HeliosDecodeStep`) - Decodes all chunk latents with the VAE, concatenates them, trims to the target frame count, and postprocesses into the final video output.

## Model Components

1. text_encoder (`UMT5EncoderModel`)
2. tokenizer (`AutoTokenizer`)
3. guider (`ClassifierFreeGuidance`)
4. vae (`AutoencoderKLWan`)
5. video_processor (`VideoProcessor`)
6. transformer (`HeliosTransformer3DModel`)
7. scheduler (`HeliosScheduler`)

## Input/Output Specification

### Inputs

**Required:**

- `prompt` (`str`): The prompt or prompts to guide video generation.
- `history_sizes` (`list`): Sizes of the long/mid/short history buffers for temporal context.

**Optional:**

- `negative_prompt` (`str`): The prompt or prompts not to guide video generation.
- `max_sequence_length` (`int`), default: `512`: Maximum sequence length for prompt encoding.
- `video` (`Any`): Input video for video-to-video generation.
- `height` (`int`), default: `384`: The height in pixels of the generated video.
- `width` (`int`), default: `640`: The width in pixels of the generated video.
- `num_latent_frames_per_chunk` (`int`), default: `9`: Number of latent frames per temporal chunk.
- `generator` (`Generator`): Torch generator for deterministic generation.
- `image` (`PIL.Image.Image | list[PIL.Image.Image]`): Reference image(s) for denoising. Can be a single image or a list of images.
- `num_videos_per_prompt` (`int`), default: `1`: Number of videos to generate per prompt.
- `image_latents` (`Tensor`): Image latents used to guide generation. Can be produced by the vae_encoder step.
- `video_latents` (`Tensor`): Encoded video latents for V2V generation.
- `image_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for image latent noise.
- `image_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for image latent noise.
- `video_noise_sigma_min` (`float`), default: `0.111`: Minimum sigma for video latent noise.
- `video_noise_sigma_max` (`float`), default: `0.135`: Maximum sigma for video latent noise.
- `num_frames` (`int`), default: `132`: Total number of video frames to generate.
- `keep_first_frame` (`bool`), default: `True`: Whether to keep the first frame as a prefix in history.
- `pyramid_num_inference_steps_list` (`list`), default: `[10, 10, 10]`: Number of denoising steps per pyramid stage.
- `latents` (`Tensor`): Pre-generated noisy latents for video generation.
- Additional conditional model inputs for the denoiser (`Any`), e.g. `prompt_embeds`, `negative_prompt_embeds`, etc.
- `attention_kwargs` (`dict`): Additional kwargs for attention processors.
- `fake_image_latents` (`Tensor`): Fake image latents used as a history seed for I2V generation.
- `output_type` (`str`), default: `np`: Output format: `'pil'`, `'np'`, or `'pt'`.

### Outputs

- `videos` (`list`): The generated videos.
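As a hedged sketch of how a modular pipeline like this is typically loaded and run with the Modular Diffusers API: the repo id, dtype, device, prompt, and `history_sizes` values below are placeholders/assumptions, not taken from this card, so check them against the actual repository before use.

```python
import torch
from diffusers import ModularPipeline
from diffusers.utils import export_to_video

# Load the pipeline definition from the Hub (repo id is a placeholder).
pipe = ModularPipeline.from_pretrained("<your-org>/<this-repo>", trust_remote_code=True)
pipe.load_components(torch_dtype=torch.bfloat16)  # fetch and instantiate model components
pipe.to("cuda")

# Text-to-video: only `prompt` and `history_sizes` are required per the spec above.
videos = pipe(
    prompt="A red panda climbing a snowy tree at dawn",
    history_sizes=[16, 8, 4],  # placeholder values; see `history_sizes` above
    num_frames=132,
    height=384,
    width=640,
    output="videos",
)
export_to_video(videos[0], "helios_t2v.mp4", fps=24)
```

For I2V or V2V, pass `image` or `video` instead and the auto blocks select the matching encoder and denoise path.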
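To make the chunking defaults concrete: assuming the Wan VAE's usual 4x temporal compression (an assumption — the compression factor is not stated on this card), the defaults `num_frames=132` and `num_latent_frames_per_chunk=9` work out as below; the decode step then concatenates the per-chunk frames and trims back to `num_frames`.

```python
import math

num_frames = 132        # default `num_frames`
frames_per_chunk = 9    # default `num_latent_frames_per_chunk`

# Wan-style VAE (assumed): the first frame maps to one latent frame, and every
# 4 subsequent frames map to one more latent frame.
latent_frames = (num_frames - 1) // 4 + 1
num_chunks = math.ceil(latent_frames / frames_per_chunk)

print(latent_frames, num_chunks)  # 33 latent frames -> 4 chunks of up to 9
```

The last chunk is only partially filled (33 = 3 × 9 + 6), which is why the decode step trims the concatenated output to the target frame count.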
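The `guider` component is a `ClassifierFreeGuidance` object. As a sketch of the textbook CFG combination rule (not necessarily this library's exact internals), the guided prediction pushes the conditional prediction away from the unconditional one:

```python
def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: move `guidance_scale` times further along the
    direction from the unconditional to the conditional prediction.
    Works elementwise on floats or tensors."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# A scale of 1.0 reduces to the conditional prediction; larger scales
# strengthen adherence to the prompt.
print(cfg_combine(0.0, 1.0, 5.0))  # 5.0
print(cfg_combine(0.0, 1.0, 1.0))  # 1.0
```

This is why `negative_prompt` has an effect only when guidance is active: its embeddings produce `noise_uncond`, the prediction being steered away from.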