vLLM SamplingParams parameters


vLLM deployment example


            
from vllm import LLM, SamplingParams  
  
# Sample prompts.  
prompts = [  
    "Hello, my name is",  
    "The president of the United States is",  
    "The capital of France is",  
    "The future of AI is",  
]  
# Create a sampling params object.  
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)  
  
# Create an LLM.  
llm = LLM(model="facebook/opt-125m")  
# Generate texts from the prompts. The output is a list of RequestOutput objects  
# that contain the prompt, generated text, and other information.  
outputs = llm.generate(prompts, sampling_params)  
# Print the outputs.  
for output in outputs:  
    prompt = output.prompt  
    generated_text = output.outputs[0].text  
    print(f"Prompt: {prompt!r}, Generated text: {generated\_text!r}"  

        

Parameter list


            
n: Number of output sequences to return for the given prompt.  
best_of: Number of output sequences that are generated from the prompt.  
    From these `best_of` sequences, the top `n` sequences are returned.  
    `best_of` must be greater than or equal to `n`. This is treated as  
    the beam width when `use_beam_search` is True. By default, `best_of`  
    is set to `n`.  
presence_penalty: Float that penalizes new tokens based on whether they  
    appear in the generated text so far. Values > 0 encourage the model  
    to use new tokens, while values < 0 encourage the model to repeat  
    tokens.  
frequency_penalty: Float that penalizes new tokens based on their  
    frequency in the generated text so far. Values > 0 encourage the  
    model to use new tokens, while values < 0 encourage the model to  
    repeat tokens.  
repetition_penalty: Float that penalizes new tokens based on whether  
    they appear in the prompt and the generated text so far. Values > 1  
    encourage the model to use new tokens, while values < 1 encourage  
    the model to repeat tokens.  
temperature: Float that controls the randomness of the sampling. Lower  
    values make the model more deterministic, while higher values make  
    the model more random. Zero means greedy sampling.  
top_p: Float that controls the cumulative probability of the top tokens  
    to consider. Must be in (0, 1]. Set to 1 to consider all tokens.  
top_k: Integer that controls the number of top tokens to consider. Set  
    to -1 to consider all tokens.  
min_p: Float that represents the minimum probability for a token to be  
    considered, relative to the probability of the most likely token.  
    Must be in [0, 1]. Set to 0 to disable this.  
use_beam_search: Whether to use beam search instead of sampling.  
length_penalty: Float that penalizes sequences based on their length.  
    Used in beam search.  
early_stopping: Controls the stopping condition for beam search. It  
    accepts the following values: `True`, where the generation stops as  
    soon as there are `best_of` complete candidates; `False`, where a  
    heuristic is applied and the generation stops when it is very  
    unlikely to find better candidates; `"never"`, where the beam search  
    procedure only stops when there cannot be better candidates  
    (canonical beam search algorithm).  
stop: List of strings that stop the generation when they are generated.  
    The returned output will not contain the stop strings.  
stop_token_ids: List of tokens that stop the generation when they are  
    generated. The returned output will contain the stop tokens unless  
    the stop tokens are special tokens.  
include_stop_str_in_output: Whether to include the stop strings in output  
    text. Defaults to False.  
ignore_eos: Whether to ignore the EOS token and continue generating  
    tokens after the EOS token is generated.  
max_tokens: Maximum number of tokens to generate per output sequence.  
logprobs: Number of log probabilities to return per output token.  
    Note that the implementation follows the OpenAI API: The return  
    result includes the log probabilities on the `logprobs` most likely  
    tokens, as well as the chosen tokens. The API will always return the  
    log probability of the sampled token, so there may be up to  
    `logprobs+1` elements in the response.  
prompt_logprobs: Number of log probabilities to return per prompt token.  
skip_special_tokens: Whether to skip special tokens in the output.  
spaces_between_special_tokens: Whether to add spaces between special  
    tokens in the output.  Defaults to True.  
logits_processors: List of functions that modify logits based on  
    previously generated tokens.  
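
Putting several of these parameters together, the snippet below is a minimal sketch: it requests two sequences per prompt from four sampled candidates and applies nucleus sampling, a repetition penalty, a stop string, a token limit, and per-token log probabilities. The parameter values are illustrative, not tuned recommendations.


from vllm import LLM, SamplingParams  
  
# Combine several SamplingParams fields from the list above.  
# The values below are only examples.  
sampling_params = SamplingParams(  
    n=2,                     # return the top 2 sequences per prompt  
    best_of=4,               # sample 4 candidates, keep the best 2 (best_of >= n)  
    temperature=0.7,         # moderate randomness  
    top_p=0.9,               # nucleus sampling over the top 90% probability mass  
    top_k=50,                # restrict sampling to the 50 most likely tokens  
    repetition_penalty=1.1,  # mildly discourage repeated tokens  
    stop=["\n\n"],           # stop at the first blank line  
    max_tokens=64,           # cap each output at 64 tokens  
    logprobs=5,              # also return log probabilities for the top 5 tokens  
)  
  
llm = LLM(model="facebook/opt-125m")  
outputs = llm.generate(["The capital of France is"], sampling_params)  
for output in outputs:  
    # Each RequestOutput holds n CompletionOutput objects in output.outputs.  
    for seq in output.outputs:  
        print(f"Generated text: {seq.text!r}")  
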

        