can't run inference on multi GPU
#8
by daryl149 - opened
Works on a single A6000:
from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
# Single GPU: "sequential" fills devices in order; 8-bit weights, overflow offloaded to disk.
model = LlamaForCausalLM.from_pretrained(
    "oasst-rlhf-2-llama-30b",
    device_map="sequential",
    offload_folder="offload",
    load_in_8bit=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)
Throws an error on 2 V100S cards (hosting 17GB of model weights each):
from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
# Multi GPU: device_map="auto" shards the layers across both cards.
model = LlamaForCausalLM.from_pretrained(
    "oasst-rlhf-2-llama-30b",
    device_map="auto",
    offload_folder="offload",
    load_in_8bit=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)
throws:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "myvenv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1558, in generate
return self.sample(
File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2641, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
The only difference is that I'm using device_map="auto" to spread the model across both GPUs. (The same error also occurs with .to('cuda'), .to(0), or .to(1) instead of .to(model.device).)
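For context on the traceback above: torch.multinomial refuses to sample from a distribution containing inf, nan, or negative entries, and a single bad logit poisons the entire softmax that generate() samples from. A minimal, GPU-free sketch of that failure mode (pure Python softmax; not the actual transformers code path):

```python
import math

def softmax(logits):
    # Standard max-subtracted softmax. If any logit is nan, the
    # normalizing sum becomes nan, so EVERY probability comes out nan --
    # which is exactly what torch.multinomial then rejects.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

clean = softmax([1.0, 2.0, 3.0])
print(abs(sum(clean) - 1.0) < 1e-9)  # True: a valid distribution

poisoned = softmax([1.0, float("nan"), 3.0])
print(all(math.isnan(p) for p in poisoned))  # True: whole row is nan
```

So the RuntimeError is a symptom: the nan is already present in the logits coming out of the sharded 8-bit forward pass, and sampling is merely where it gets caught.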
ah, there's an open bug in transformers for it:
https://github.com/huggingface/transformers/issues/22914
daryl149 changed discussion status to closed
Update:
The inf/nan error is caused by the combination of CUDA 11.8 and bitsandbytes==0.38.1. It's solved by downgrading to CUDA 11.6 and bitsandbytes 0.31.8.
However, multi-GPU inference is still broken: with load_in_8bit=True the model returns gibberish. See the issue I opened in transformers: https://github.com/huggingface/transformers/issues/23989
daryl149 changed discussion status to open