bash /scratch/serve/exe/run.sh
This will load the API, but will not register any models or start any workers. It requires all the GPUs in the node to be free, since it runs them in exclusive-process mode.
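Before registering anything, it is worth a quick health check. A sketch, assuming TorchServe's standard ping endpoint on the inference port (8080) used below:

import requests

# Should print {'status': 'Healthy'} once the API is up.
print(requests.get("http://localhost:8080/ping").json())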
curl -X POST "http://localhost:8081/models?url={model_name}.tar.gz&initial_workers={num_workers}"
curl http://localhost:8081/models/{model_name}
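Registration can return before the workers have finished loading the model, so it helps to poll the describe call above until every worker is READY. A sketch, assuming the describe response is a JSON list whose entries carry a workers array with per-worker status:

import time
import requests

def wait_until_ready(model_name: str, timeout_s: float = 600.0) -> None:
    # Poll the management API until all workers report READY.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        info = requests.get(f"http://localhost:8081/models/{model_name}").json()
        workers = info[0].get("workers", [])
        if workers and all(w.get("status") == "READY" for w in workers):
            return
        time.sleep(5)
    raise TimeoutError(f"workers for {model_name} not ready after {timeout_s}s")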
import requests

endpoint = "http://0.0.0.0:8080/predictions/{model_name}"  # substitute the registered model name
response = requests.post(endpoint, json={
    "seed": 0,
    "prompt": "Alan Turing was a ",
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "logprobs": 0,
    "stop": [],
})
Note that the API is not set up for batching, so prompt should be a single string. To run several prompts, loop client-side as sketched below.
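A minimal sketch of that client-side loop, assuming the same endpoint; the prompt list and sampling parameters are illustrative:

import requests

endpoint = "http://0.0.0.0:8080/predictions/{model_name}"  # substitute the registered model name
prompts = ["Alan Turing was a ", "The halting problem is "]  # illustrative prompts

# The server accepts one prompt per request, so send them sequentially.
results = []
for prompt in prompts:
    r = requests.post(endpoint, json={"prompt": prompt, "max_tokens": 128, "temperature": 0.7})
    r.raise_for_status()
    results.append(r.json())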
- Launch a Singularity shell
singularity run --nv /scratch/serve.sif /bin/bash
- Create a handler for incoming requests (see the example handlers, and the minimal skeleton sketched below)
- Create a model configuration file (see examples in /scratch/serve/model_store)
- Create the .tar.gz runtime file:
cd /scratch/serve/model_store
torch-model-archiver --model-name {model_name} --version 1.0 --handler /scratch/serve/custom_handler/{handler}.py -r requirements.txt -f -c {model_name}-config.yaml --archive-format tgz
Note that the handler assumes that each model fits on a single GPU. Also, if the model needs any additional packages, add them to /scratch/serve/model_store/requirements.txt before creating the .tar.gz file.
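For orientation, here is a minimal handler skeleton. It is a sketch, not the cluster's actual handler: the echo response stands in for real generation code, and the class name is hypothetical.

import json
import torch
from ts.torch_handler.base_handler import BaseHandler

class TextGenerationHandler(BaseHandler):
    def initialize(self, context):
        # TorchServe hands each worker a GPU id, which is why a model
        # must fit on a single GPU under this setup.
        gpu_id = context.system_properties.get("gpu_id")
        if torch.cuda.is_available() and gpu_id is not None:
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        # Load the model/tokenizer here, e.g. from
        # context.system_properties["model_dir"].
        self.initialized = True

    def preprocess(self, data):
        # One request at a time, matching the no-batching note above.
        body = data[0].get("body") or data[0].get("data")
        if isinstance(body, (bytes, bytearray)):
            body = json.loads(body)
        return body

    def inference(self, request):
        # Placeholder: run generation from request["prompt"] and the
        # sampling parameters shown in the query example above.
        return {"text": "echo: " + request.get("prompt", "")}

    def postprocess(self, output):
        # TorchServe expects a list with one element per request.
        return [output]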
singularity run --nv /scratch/serve.sif /bin/bash
export TEMP=/scratch/tmp
torchserve --stop
Beware of the following: worker reallocation is a bit tricky, as the workers often crash or try to run on GPUs already in use. You are much better off assigning workers when you register the model.
curl -v -X PUT "http://localhost:8081/models/{model_name}?min_worker={num_workers}"
Be careful not to assign more than 8 workers in total across all models, otherwise models may start to OOM. So always remove workers before adding new ones; a sketch of that order follows.
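A minimal sketch of a reallocation in the safe order, via the management API. Model names and worker counts are illustrative; synchronous=true is assumed to make each call block until workers are actually torn down or started:

import requests

MGMT = "http://localhost:8081"

def set_workers(model_name: str, num_workers: int) -> None:
    # Block until the new worker count is in effect before returning.
    r = requests.put(
        f"{MGMT}/models/{model_name}",
        params={"min_worker": num_workers, "synchronous": "true"},
    )
    r.raise_for_status()

# Free workers on one model first, then grow the other,
# keeping the total across all models at or below 8.
set_workers("model_a", 2)  # scale down from 4
set_workers("model_b", 4)  # scale up from 2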