bash /scratch/serve/exe/run.sh
This will load the API, but will not register any models or start any workers. It requires all the GPUs in the node to be free, since it runs them in exclusive-process mode.
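Before registering anything, it is worth a quick health check. A sketch, assuming TorchServe's standard ping endpoint on the inference port (8080) used below:

import requests

# Should print {'status': 'Healthy'} once the API is up.
print(requests.get("http://localhost:8080/ping").json())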
curl -X POST "http://localhost:8081/models?url={model_name}.tar.gz&initial_workers={num_workers}"
curl http://localhost:8081/models/{model_name}
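Registration can return before the workers have finished loading the model, so it helps to poll the describe call above until every worker is READY. A sketch, assuming the describe response is a JSON list whose entries carry a workers array with per-worker status:

import time
import requests

def wait_until_ready(model_name: str, timeout_s: float = 600.0) -> None:
    # Poll the management API until all workers report READY.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        info = requests.get(f"http://localhost:8081/models/{model_name}").json()
        workers = info[0].get("workers", [])
        if workers and all(w.get("status") == "READY" for w in workers):
            return
        time.sleep(5)
    raise TimeoutError(f"workers for {model_name} not ready after {timeout_s}s")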
import requests

endpoint = "http://0.0.0.0:8080/predictions/{model_name}"  # substitute the registered model name
response = requests.post(endpoint, json={
    "seed": 0,
    "prompt": "Alan Turing was a ",
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "logprobs": 0,
    "stop": [],
})
Note that the API is not set up for batching, so prompt should be a single string. To run several prompts, loop client-side as sketched below.
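A minimal sketch of that client-side loop, assuming the same endpoint; the prompt list and sampling parameters are illustrative:

import requests

endpoint = "http://0.0.0.0:8080/predictions/{model_name}"  # substitute the registered model name
prompts = ["Alan Turing was a ", "The halting problem is "]  # illustrative prompts

# The server accepts one prompt per request, so send them sequentially.
results = []
for prompt in prompts:
    r = requests.post(endpoint, json={"prompt": prompt, "max_tokens": 128, "temperature": 0.7})
    r.raise_for_status()
    results.append(r.json())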
- Launch a Singularity shell
singularity run --nv /scratch/serve.sif /bin/bash
- Create a handler for incoming requests (see the example handlers, and the minimal skeleton sketched below)
- Create a model configuration file (see examples in /scratch/serve/model_store)
- Create the .tar.gz runtime file:
cd /scratch/serve/model_store
torch-model-archiver --model-name {model_name} --version 1.0 --handler /scratch/serve/custom_handler/{handler}.py -r requirements.txt -f -c {model_name}-config.yaml --archive-format tgz
Note that the handler assumes that each model fits on a single GPU. Also, if the model needs any additional packages, add them to /scratch/serve/model_store/requirements.txt before creating the .tar.gz file.
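For orientation, here is a minimal handler skeleton. It is a sketch, not the cluster's actual handler: the echo response stands in for real generation code, and the class name is hypothetical.

import json
import torch
from ts.torch_handler.base_handler import BaseHandler

class TextGenerationHandler(BaseHandler):
    def initialize(self, context):
        # TorchServe hands each worker a GPU id, which is why a model
        # must fit on a single GPU under this setup.
        gpu_id = context.system_properties.get("gpu_id")
        if torch.cuda.is_available() and gpu_id is not None:
            self.device = torch.device(f"cuda:{gpu_id}")
        else:
            self.device = torch.device("cpu")
        # Load the model/tokenizer here, e.g. from
        # context.system_properties["model_dir"].
        self.initialized = True

    def preprocess(self, data):
        # One request at a time, matching the no-batching note above.
        body = data[0].get("body") or data[0].get("data")
        if isinstance(body, (bytes, bytearray)):
            body = json.loads(body)
        return body

    def inference(self, request):
        # Placeholder: run generation from request["prompt"] and the
        # sampling parameters shown in the query example above.
        return {"text": "echo: " + request.get("prompt", "")}

    def postprocess(self, output):
        # TorchServe expects a list with one element per request.
        return [output]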
singularity run --nv /scratch/serve.sif /bin/bash
export TEMP=/scratch/tmp
torchserve --stop
Beware of the following: worker reallocation is a bit tricky, as the workers often crash or try to run on GPUs already in use. You are much better off assigning workers when you register the model.
curl -v -X PUT "http://localhost:8081/models/{model_name}?min_worker={num_workers}"
Be careful not to assign more than 8 workers in total across all models, otherwise models may start to OOM. So always remove workers before adding new ones; a sketch of that order follows.
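A minimal sketch of a reallocation in the safe order, via the management API. Model names and worker counts are illustrative; synchronous=true is assumed to make each call block until workers are actually torn down or started:

import requests

MGMT = "http://localhost:8081"

def set_workers(model_name: str, num_workers: int) -> None:
    # Block until the new worker count is in effect before returning.
    r = requests.put(
        f"{MGMT}/models/{model_name}",
        params={"min_worker": num_workers, "synchronous": "true"},
    )
    r.raise_for_status()

# Free workers on one model first, then grow the other,
# keeping the total across all models at or below 8.
set_workers("model_a", 2)  # scale down from 4
set_workers("model_b", 4)  # scale up from 2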