Thank you
Thank you so much for being the first to do this, seriously. Can you post the hyperparameters you used for training?
You're welcome, and many thanks for leading the work on the dataset.
The hyperparameters are the same as in the 1.0 snapshot of the FastChat repository, except for the modifications needed to train on 40GB A100s (mainly gradient accumulation steps and such):
..... (most setup specific items removed)
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps $((128 * 512 / 2048 / 2 / 4 / 4)) \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
I will probably also try --fsdp "shard_grad_op auto_wrap" for the next run, since there are reports that it could make training a bit faster.
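For anyone wondering what the --gradient_accumulation_steps expression above works out to, here is a small sketch that repeats the shell's integer arithmetic in Python. The parameter names are made up for illustration. Only the 2048 (matching --model_max_length) and the 2 (matching --per_device_train_batch_size) can be tied to flags in the post; the two trailing 4s are unlabelled there, and treating them as device counts is purely an assumption.

```python
# Mirror of $((128 * 512 / 2048 / 2 / 4 / 4)) using integer division,
# evaluated left to right like shell arithmetic.
# All parameter names are hypothetical labels, not taken from the post.
def grad_accum_steps(base_batch=128, base_seq_len=512, model_max_length=2048,
                     per_device_batch=2, factor_a=4, factor_b=4):
    value = base_batch * base_seq_len   # 65536
    value //= model_max_length          # 32
    value //= per_device_batch          # 16
    value //= factor_a                  # 4
    value //= factor_b                  # 1
    return value

steps = grad_accum_steps()

# ASSUMPTION: if the two 4s are device counts (e.g. GPUs per node and nodes),
# this is the number of sequences seen per optimizer step.
sequences_per_step = 2 * 4 * 4 * steps

print(steps, sequences_per_step)
```

With the numbers from the post the expression evaluates to 1, so no accumulation actually happens in that configuration; the expression mainly keeps the batch size consistent if one of the factors is changed.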
Yes, thanks! And thanks for going straight to the ggml version -- long live llama.cpp!!
By the way, any reason for not going with the 1.1 version? I assume you already started before it was released but maybe there was another reason?
Thanks for training this. Seems good! I re-quantized the f16 bin (thanks for providing it as well!) with https://github.com/ggerganov/llama.cpp/pull/896, as it seems to improve outputs/perplexity. Side note: I suggest editing the pull so that bool useNewQuantization = false; becomes bool useNewQuantization = true; and/or so that case 2: quantized_type = GGML_TYPE_Q4_0; break; becomes case 2: quantized_type = GGML_TYPE_Q4_0; useNewQuantization = true; break; so that you can pass 2 to quantize. Otherwise the model will be written with ftype 4 and will report as mostly q4_1/some f16 when it isn't.
Will this work with llama.cpp?
Also, what do you recommend for best results in the prompt text file? I currently use:
You are an AI language model designed to assist the Human by answering their questions, offering advice, and engaging in casual conversation in a friendly, helpful, and informative manner. You respond clearly, coherently, and you consider the conversation history.
Human: Hey, how's it going?
Assistant: Hey there! I'm doing great, thank you. What can I help you with today? Let's have a fun chat!
With v1.0 it's best to use "### Assistant:" and "### Human:", whereas what you listed in your message is the format for v1.1 of this model. (Correct me if I'm wrong!) And yes, it's working with llama.cpp for me :)
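To make the v1.0 format concrete, here is a minimal sketch of a prompt builder. Only the two markers, "### Human:" and "### Assistant:", come from the reply above. The function name, the newline separators and the trailing open "### Assistant:" line are assumptions, so adjust them if the model behaves better with different spacing.

```python
# Hypothetical helper: assembles a v1.0-style prompt from a system text,
# earlier (human, assistant) turns, and the new user message.
def build_v1_0_prompt(system, turns, user_message):
    parts = [system.strip()] if system else []
    for human, assistant in turns:
        parts.append(f"### Human: {human}")
        parts.append(f"### Assistant: {assistant}")
    parts.append(f"### Human: {user_message}")
    # ASSUMPTION: leave the assistant marker open so the model continues from it.
    parts.append("### Assistant:")
    return "\n".join(parts)

prompt = build_v1_0_prompt(
    "You are an AI language model designed to assist the Human.",
    [("Hey, how's it going?", "Hey there! I'm doing great, thank you.")],
    "Can you recommend a book?",
)
print(prompt)
```

The output can be saved to the prompt text file in place of the "Human:" / "Assistant:" version.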
Thank you!!
@spanielrassler Yeah, 1.1 dropped while 1.0 was already two epochs into training. Bummer, but the next version will be 1.1.
That stopped the repeating, but it now seems to default to censored answers again. Do you think you might be able to do an uncensored version of "vicuna-13B-1.1-GPTQ-4bit-32g.GGML.bin"?
https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g-GGML
Please don't focus only on ggml; some of us still use graphics cards for LLMs, too. :)
Yes, sorry, and I agree, but the large majority of people will be able to buy more RAM than a 24GB GPU ;)
Absolutely correct. Good things come to those who wait. Thanks for the model, btw!