REAP & Auto Round
Can someone run REAP on the model weights to reduce the memory footprint?
https://github.com/CerebrasResearch/reap
It would also be interesting to see how well the pruned model can then be quantised with Intel AutoRound.
Which hardware are you going to deploy the model on?
I have an M2 Mac with 96 GB that I can try models on. I wish I had more RAM, though: I was going to build a PC with 512 GB, but RAM prices went through the roof, so I ended up buying a second-hand M2 Mac to test on for the moment.
I am quite surprised that, with good quantisation, a lot of the biggest models can fit within that RAM size. I'm just pushing the limits at the moment.
For an application or for research? It's much more difficult to run this model on an M2 Mac with 96 GB. The memory is too small, even with two machines combined.
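To put rough numbers on the memory question, here is a back-of-the-envelope sketch of weight storage at different bit widths, with and without pruning. The parameter count (about 560B total) is an assumption based on what is commonly reported for the LongCat-Flash family, so check the model card before relying on it. The `keep_fraction` argument is a hypothetical simplification that scales all parameters uniformly, whereas expert pruning such as REAP only removes expert weights. KV cache, activations, and quantisation overhead (scales, zero points) are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float,
                     keep_fraction: float = 1.0) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    keep_fraction is the share of parameters left after pruning
    (1.0 = no pruning). Uniform scaling is a simplification.
    """
    return n_params * keep_fraction * bits_per_weight / 8 / 1e9


# Assumption: ~560B total parameters; not stated in this thread.
ASSUMED_PARAMS = 560e9
AVAILABLE_GB = 96  # the M2 Mac mentioned above

scenarios = [
    ("BF16, unpruned", 16, 1.0),
    ("4-bit, unpruned", 4, 1.0),
    ("4-bit, 50% pruned", 4, 0.5),
    ("2-bit, 50% pruned", 2, 0.5),
]

for label, bits, keep in scenarios:
    gb = weight_memory_gb(ASSUMED_PARAMS, bits, keep)
    verdict = "fits" if gb <= AVAILABLE_GB else "does not fit"
    print(f"{label:>18}: {gb:7.1f} GB ({verdict} in {AVAILABLE_GB} GB)")
```

Under these assumptions, only an aggressive combination of pruning and very low-bit quantisation brings the weights alone under 96 GB, and that leaves little or no headroom for the KV cache or the operating system.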
I just wanted to use the model and see if I can get it working, mostly to compare it against other models. I'm trying to find the ideal model for running on my machine.