---
inference: false
pipeline_tag: image-text-to-text
datasets:
- teamcraft/TeamCraft-Data-Cen
---

# TeamCraft-VLA-7B-Cen Model Card

TeamCraft-VLA-7B-Cen is a multi-modal vision-language-action model designed for centralized multi-agent collaboration. At each timestep, the model encodes multi-modal prompts specifying the task, visual observations, and agent inventory information to generate actionable outputs.

## Usage

We provide a full environment with detailed running instructions on [GitHub](https://github.com/teamcraft-bench/teamcraft).

## Model details

The TeamCraft-VLA (Vision-Language-Action) architecture integrates a CLIP ViT-L/14 visual encoder with a linear projector for modality alignment and Vicuna-v1.5-7B (Llama 2.0) as the LLM backbone, combining visual and text embeddings to generate actions for multi-agent tasks.

**Model type:**
- Vision-Language-Action model

**Model version:**
- v1.0

**Model date:**
- TeamCraft-VLA-7B-Cen was trained in September 2024

**Training dataset:**
- [TeamCraft centralized full dataset](https://huggingface.co/datasets/teamcraft/TeamCraft-Data-Cen)

## Uses

### Direct use

- **Primary intended uses:** The primary use of TeamCraft-VLA-7B-Cen is research on multi-agent systems in multi-modal settings.
- **Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, multi-agent systems, and artificial intelligence.

### Out-of-scope use

- The model is not designed for real-world decision-making or deployment in safety-critical systems.
- The model must not be used for tasks requiring ethical reasoning, moral judgments, or any application where improper actions could lead to harm or violation of regulations.

## License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
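To illustrate the modality alignment described in the Model details section, the sketch below shows, in toy form, how a linear projector maps CLIP ViT-L/14 visual features into the LLM embedding space so visual tokens can be concatenated with text embeddings. The dimensions (1024 for CLIP ViT-L/14, 4096 for a 7B LLM) follow the standard sizes of those components; the random weights, patch count, and token count are illustrative assumptions, not the actual TeamCraft-VLA weights or tokenization.

```python
import numpy as np

# Toy sketch (not the released TeamCraft-VLA weights): a linear projector
# maps CLIP ViT-L/14 patch features (1024-dim) into the Vicuna-7B hidden
# space (4096-dim), so projected visual tokens and text embeddings form
# one input sequence for the LLM backbone.
rng = np.random.default_rng(0)

clip_dim, llm_dim = 1024, 4096        # ViT-L/14 feature dim, 7B hidden dim
num_patches, num_text_tokens = 256, 32  # illustrative sequence lengths

W = rng.normal(scale=0.02, size=(clip_dim, llm_dim))       # projector weights
visual_feats = rng.normal(size=(num_patches, clip_dim))    # from visual encoder
text_embeds = rng.normal(size=(num_text_tokens, llm_dim))  # from LLM embedding table

visual_tokens = visual_feats @ W                            # project to LLM space
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (288, 4096)
```

The LLM backbone then autoregressively decodes action tokens conditioned on this combined sequence; the real model applies this per agent view and per timestep under the centralized setting.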