Instructions to use reeducator/vicuna-13b-free with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use reeducator/vicuna-13b-free with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="reeducator/vicuna-13b-free")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Vicuna is a LLaMA-based, text-only causal language model
tokenizer = AutoTokenizer.from_pretrained("reeducator/vicuna-13b-free")
model = AutoModelForCausalLM.from_pretrained("reeducator/vicuna-13b-free")

Local Apps

vLLM

How to use reeducator/vicuna-13b-free with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "reeducator/vicuna-13b-free"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "reeducator/vicuna-13b-free",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
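
The same request can be built from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `send` are hypothetical, and it assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style `choices` list.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="reeducator/vicuna-13b-free",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the first completion text.

    Assumes an OpenAI-compatible response body with a "choices" list.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `send(build_completion_request("Once upon a time,"))` should return the generated continuation once the server is up. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.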

Use Docker

docker model run hf.co/reeducator/vicuna-13b-free

SGLang

How to use reeducator/vicuna-13b-free with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "reeducator/vicuna-13b-free" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "reeducator/vicuna-13b-free",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "reeducator/vicuna-13b-free" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "reeducator/vicuna-13b-free",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Thank you

by anon8231489123 - opened Apr 14, 2023

Discussion

anon8231489123

Apr 14, 2023

Thank you so much for being the first to do this, seriously. Can you post the hyperparameters you used for training?

reeducator

Owner Apr 14, 2023

You're welcome, and many thanks for leading the work on the dataset.

The hyperparameters are the same as in the 1.0 snapshot of the FastChat repository, except for the modifications needed to train on 40GB A100s (mainly gradient accumulation steps and such):

    ..... (most setup specific items removed)
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps $((128 * 512 / 2048 / 2 / 4 / 4)) \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

I will probably also try adding --fsdp "shard_grad_op auto_wrap" for the next run, since there are reports that it could make training a bit faster.
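
As a quick check of the flags above, the `--gradient_accumulation_steps` expression is plain shell integer arithmetic and can be reproduced in a few lines. This is only a sketch: the post does not say what the individual divisors stand for, so reading the `2` as the per-device batch size and the two `4`s as GPUs per node and node count is an assumption, and the function names are hypothetical.

```python
def grad_accum_steps(per_device_batch=2, gpus_per_node=4, num_nodes=4):
    """Evaluate $((128 * 512 / 2048 / 2 / 4 / 4)) with integer division."""
    steps = 128 * 512 // 2048  # 32
    # Assumed meaning of the remaining divisors (not stated in the post).
    for divisor in (per_device_batch, gpus_per_node, num_nodes):
        steps //= divisor
    return steps


def sequences_per_optimizer_step(per_device_batch=2, gpus_per_node=4, num_nodes=4):
    """Sequences consumed per optimizer step under the same assumptions."""
    steps = grad_accum_steps(per_device_batch, gpus_per_node, num_nodes)
    return per_device_batch * gpus_per_node * num_nodes * steps


print(grad_accum_steps())              # 1
print(sequences_per_optimizer_step())  # 32
```

Under that reading the expression evaluates to 1, so no gradient accumulation actually takes place, and each optimizer step covers 32 sequences of up to 2048 tokens.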

snakerassler

Apr 14, 2023

Yes, thanks! And thanks for going straight to the ggml version -- long live llama.cpp!!

By the way, any reason for not going with the 1.1 version? I assume you already started before it was released but maybe there was another reason?

zatochu

Apr 14, 2023 • edited Apr 14, 2023

Thanks for training this. Seems good! I re-quantized the f16 bin (thanks for providing it as well!) with https://github.com/ggerganov/llama.cpp/pull/896, as it seems to improve outputs/perplexity. Side note: I suggest editing the pull so that bool useNewQuantization = false; becomes bool useNewQuantization = true; and/or so that case 2: quantized_type = GGML_TYPE_Q4_0; break; becomes case 2: quantized_type = GGML_TYPE_Q4_0; useNewQuantization = true; break; so you can use 2 with quantize. Otherwise the model will be written with ftype 4 and will report as mostly q4_1/some f16 when it isn't.

nucleardiffusion

Apr 14, 2023 • edited Apr 14, 2023

Will this work with llama.cpp?

nucleardiffusion

Apr 14, 2023

Also, what do you recommend for best results in the prompt text file? I currently use:

You are an AI language model designed to assist the Human by answering their questions, offering advice, and engaging in casual conversation in a friendly, helpful, and informative manner. You respond clearly, coherently, and you consider the conversation history.

Human: Hey, how's it going?

Assistant: Hey there! I'm doing great, thank you. What can I help you with today? Let's have a fun chat!

snakerassler

Apr 14, 2023 • edited Apr 14, 2023

With v1.0 it's best to use "### Assistant:" and "### Human:", while what you listed in your message is the format for v1.1 of this model. (Correct me if I'm wrong!) And yes, it's working with llama.cpp for me :)
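
To make the v1.0 role markers concrete, here is a minimal sketch of a prompt builder. Only the "### Human:" and "### Assistant:" markers come from the thread; the function name `build_v1_0_prompt`, the newline separator between turns, and the optional system line are assumptions for illustration.

```python
def build_v1_0_prompt(turns, system=""):
    """Format (role, text) pairs with "### Human:" / "### Assistant:" markers.

    The newline separator and the optional system line are assumptions;
    the thread only specifies the two role markers.
    """
    allowed_roles = {"Human", "Assistant"}
    parts = [system] if system else []
    for role, text in turns:
        if role not in allowed_roles:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"### {role}: {text}")
    # End with an open assistant marker so the model writes the next reply.
    parts.append("### Assistant:")
    return "\n".join(parts)


print(build_v1_0_prompt([("Human", "Hey, how's it going?")]))
```

The printed prompt ends with an open "### Assistant:" marker; when running interactively, setting the reverse prompt to "### Human:" is a common way to hand control back to the user.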

mancub

Apr 15, 2023

Thank you!!

reeducator

Owner Apr 15, 2023

@snakerassler Yeah, 1.1 dropped while 1.0 was already two epochs into training. Bummer, but the next version will be 1.1.

nucleardiffusion

Apr 15, 2023

> With v1.0 it's best to use "### Assistant:" and "### Human:", while what you listed in your message is the format for v1.1 of this model. (Correct me if I'm wrong!) And yes, it's working with llama.cpp for me :)

That stopped the repeating, but it now seems to default to censored answers again. Do you think you might be able to do an uncensored version of "vicuna-13B-1.1-GPTQ-4bit-32g.GGML.bin"?

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g-GGML

mancub

Apr 15, 2023

Please don't focus on ggml only; some of us still use graphics cards for LLMs, too. :)

nucleardiffusion

Apr 15, 2023

> Please don't focus on ggml only; some of us still use graphics cards for LLMs, too. :)

Yes, sorry, and I agree, but the large majority of people will be able to buy more RAM rather than a 24GB GPU ;)

Chille9

Apr 16, 2023

>> Please don't focus on ggml only; some of us still use graphics cards for LLMs, too. :)

> Yes, sorry, and I agree, but the large majority of people will be able to buy more RAM rather than a 24GB GPU ;)

Absolutely correct. Good things come to those who wait. Thanks for the model, btw!
