---
inference: false
pipeline_tag: image-text-to-text
datasets:
- teamcraft/TeamCraft-Data-Cen
---

# TeamCraft-VLA-7B-Cen Model Card

TeamCraft-VLA-7B-Cen is a multi-modal vision-language-action model designed for centralized multi-agent collaboration. At each timestep, the model encodes multi-modal prompts specifying the task, visual observations, and agent inventory information to generate actionable outputs.

## Usage

We provide a full environment with detailed running instructions on [GitHub](https://github.com/teamcraft-bench/teamcraft).

## Model details

The TeamCraft-VLA (Vision-Language-Action) architecture integrates a CLIP ViT-L/14 visual encoder with a linear projector for modality alignment and Vicuna-v1.5-7B (Llama 2.0) as the LLM backbone, combining visual and text embeddings to generate actions for multi-agent tasks.

**Model type:**
- Vision-Language-Action model

**Model version:**
- v1.0

**Model date:**
- TeamCraft-VLA-7B-Cen was trained in September 2024

**Training dataset:**
- [TeamCraft centralized full dataset](https://huggingface.co/datasets/teamcraft/TeamCraft-Data-Cen)

## Uses

### Direct use

- **Primary intended uses:** The primary use of TeamCraft-VLA-7B-Cen is research on multi-agent systems in multi-modal settings.
- **Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, multi-agent systems, and artificial intelligence.

### Out-of-scope use

- The model is not designed for real-world decision-making or deployment in safety-critical systems.
- The model must not be used for tasks requiring ethical reasoning, moral judgments, or any application where improper actions could lead to harm or violation of regulations.

## License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
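To illustrate the modality alignment described in the Model details section, the sketch below shows, in toy form, how a linear projector maps CLIP ViT-L/14 visual features into the LLM embedding space so visual tokens can be concatenated with text embeddings. The dimensions (1024 for CLIP ViT-L/14, 4096 for a 7B LLM) follow the standard sizes of those components; the random weights, patch count, and token count are illustrative assumptions, not the actual TeamCraft-VLA weights or tokenization.

```python
import numpy as np

# Toy sketch (not the released TeamCraft-VLA weights): a linear projector
# maps CLIP ViT-L/14 patch features (1024-dim) into the Vicuna-7B hidden
# space (4096-dim), so projected visual tokens and text embeddings form
# one input sequence for the LLM backbone.
rng = np.random.default_rng(0)

clip_dim, llm_dim = 1024, 4096        # ViT-L/14 feature dim, 7B hidden dim
num_patches, num_text_tokens = 256, 32  # illustrative sequence lengths

W = rng.normal(scale=0.02, size=(clip_dim, llm_dim))       # projector weights
visual_feats = rng.normal(size=(num_patches, clip_dim))    # from visual encoder
text_embeds = rng.normal(size=(num_text_tokens, llm_dim))  # from LLM embedding table

visual_tokens = visual_feats @ W                            # project to LLM space
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (288, 4096)
```

The LLM backbone then autoregressively decodes action tokens conditioned on this combined sequence; the real model applies this per agent view and per timestep under the centralized setting.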