Instructions to use meituan-longcat/LongCat-Flash-Thinking-2601 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use meituan-longcat/LongCat-Flash-Thinking-2601 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="meituan-longcat/LongCat-Flash-Thinking-2601", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

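# Load model directly
# AutoModelForCausalLM resolves the model class from the repository's config (remote code is required)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meituan-longcat/LongCat-Flash-Thinking-2601", trust_remote_code=True, dtype="auto")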

vLLM

How to use meituan-longcat/LongCat-Flash-Thinking-2601 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "meituan-longcat/LongCat-Flash-Thinking-2601"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meituan-longcat/LongCat-Flash-Thinking-2601",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
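
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server is reachable at the default address used in the curl example above and returns the usual OpenAI-style response shape. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed values: same host, port, and model name as the curl example above.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meituan-longcat/LongCat-Flash-Thinking-2601"


def build_chat_request(prompt: str, url: str = BASE_URL, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request carrying a single-turn chat completion payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the reply text. The long default timeout is a guess to leave room for a reasoning model's longer generations.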

Use Docker

docker model run hf.co/meituan-longcat/LongCat-Flash-Thinking-2601

SGLang

How to use meituan-longcat/LongCat-Flash-Thinking-2601 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "meituan-longcat/LongCat-Flash-Thinking-2601" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meituan-longcat/LongCat-Flash-Thinking-2601",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "meituan-longcat/LongCat-Flash-Thinking-2601" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "meituan-longcat/LongCat-Flash-Thinking-2601",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use meituan-longcat/LongCat-Flash-Thinking-2601 with Docker Model Runner:
```
docker model run hf.co/meituan-longcat/LongCat-Flash-Thinking-2601
```

REAP & Auto Round

by infinityai - opened Jan 20

Discussion

infinityai

Jan 20

Could someone run REAP on the model weights to reduce the memory footprint?

https://github.com/CerebrasResearch/reap

It would also be interesting to see how well it can be quantised afterwards using Intel AutoRound.

https://github.com/intel/auto-round

Raymond257

Jan 21

Which hardware are you going to deploy the model on?

infinityai

Jan 21

I have a Mac M2 with 96 GB that I can try models on. I wish I had more RAM, though. I was going to build a PC with 512 GB, but then RAM prices went through the roof, so I ended up buying a second-hand M2 Mac to test on for the moment.

I am quite surprised that, with some good quantisation, a lot of the biggest models can fit within that RAM size. I'm just pushing the limits at the moment.

Raymond257

Jan 21

For application or for research? It's very difficult to run this model on a Mac M2 with 96 GB... the memory is too small, even with two such machines combined...
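
The memory concern above can be checked with rough arithmetic: weight storage is approximately parameters × bits per weight ÷ 8. The sketch below is only a back-of-the-envelope estimate. The total parameter count is an assumption that is not stated in this thread (check the model card), the 50% pruning ratio is purely hypothetical, and KV cache, activations, and runtime overhead are ignored.

```python
# Assumption, not stated in this thread: replace with the figure from the model card.
ASSUMED_TOTAL_PARAMS = 560e9


def weight_memory_gb(num_params: float, bits_per_weight: float, keep_fraction: float = 1.0) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    keep_fraction is the share of weights left after pruning. Applying it to
    all weights is a simplification: expert pruning such as REAP only removes
    expert parameters, so real savings would be somewhat smaller.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return num_params * keep_fraction * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    scenarios = [
        ("16-bit, unpruned", 16, 1.0),
        ("4-bit, unpruned", 4, 1.0),
        ("4-bit, hypothetical 50% pruning", 4, 0.5),
    ]
    for label, bits, keep in scenarios:
        gb = weight_memory_gb(ASSUMED_TOTAL_PARAMS, bits, keep)
        print(f"{label}: ~{gb:.0f} GB of weights")
```

Comparing the printed figures against the available unified memory (96 GB per machine here) gives a first indication of whether a given pruning and quantisation combination is worth attempting.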

infinityai

Jan 22

I just wanted to use the model and see if I can get it working, mostly to compare it against other models. I'm trying to find the ideal model for running on my machine.
