Description
System Info
I'm running inference on a GPU EC2 instance using CUDA. After a little profiling, I noticed that the model.generate method was the clear bottleneck. On closer inspection, htop showed that during this call only a single CPU core is used, and it is pegged at 100%. I've made sure that all of my weights, biases, and activations are on the GPU, and nvidia-smi shows the expected VRAM usage. My question: is there a way to speed this method up using multiple CPU cores? I haven't dug into the method yet to see exactly what is causing the issue; I'm just curious whether there's something obvious I'm missing.
(Basic sample code)
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
import torch
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
PROMPT = f"""### Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
### Input: "What is the difference between a llama and a vicuna?"
### Response:"""
inputs = tokenizer(
    PROMPT,
    return_tensors="pt",
)
input_ids = inputs["input_ids"].cuda()
generation_config = GenerationConfig(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.15,
)
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=128,
)
for s in generation_output.sequences:
    print(tokenizer.decode(s))
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run this method with any model and watch the per-core CPU usage (for example in htop) to see that only a single core is used:
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=True,
    max_new_tokens=128,
)
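To put a number on the htop observation, here is a minimal sketch using only the standard library. It compares the process CPU time with the wall-clock time of a single call. A ratio near 1.0 would suggest one saturated core, a ratio well above 1.0 would suggest several cores busy, and a ratio near 0 would suggest the process is mostly waiting. The helper name `cpu_to_wall_ratio` and the `busy_loop` stand-in are hypothetical and not part of transformers. In the reproduction above you would pass `model.generate` and its keyword arguments instead.

```python
import time


def cpu_to_wall_ratio(fn, *args, **kwargs):
    """Call fn once and return (result, cpu_seconds / wall_seconds).

    time.process_time() sums user and system CPU time over all threads
    of the current process, so the ratio approximates how many cores
    were kept busy during the call.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    ratio = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, ratio


def busy_loop(n):
    # Hypothetical CPU-bound stand-in for model.generate, so the
    # sketch runs without a GPU or any third-party package.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    _, ratio = cpu_to_wall_ratio(busy_loop, 2_000_000)
    print(f"CPU/wall ratio: {ratio:.2f}")
```

With the issue's setup, the call would look like `cpu_to_wall_ratio(model.generate, input_ids=input_ids, generation_config=generation_config, max_new_tokens=128)`. This only measures the current process, so it is a rough complement to htop rather than a replacement for a real profiler.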
Expected behavior
Multiprocessing to help balance the load.