Running BLOOM 176B inference on ABCI

With the ever-growing size of recent pretrained language models (or foundation models), running inference with these models poses many challenges. In this post, I will explain the steps needed to run GPT-3-sized models (e.g., BLOOM) on a single 8xA100 (40GB) machine on ABCI using the Huggingface inference server, DeepSpeed and bitsandbytes.

Since the dependent libraries are under rapid development, these steps are subject to change in the near future (though this post pins down all the library versions, so it should keep working for the foreseeable future).

Prerequisite

Log onto an ABCI interactive node (A). It must be an (A) node (see https://docs.abci.ai/ja/getting-started/), not the old interactive node, as they have different stdlib versions, etc.
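If you have not used ABCI before, access is a two-hop SSH through the access server. Below is a minimal sketch; the interactive node (A) hostname is assumed here to be es-a based on the documentation linked above, and ${YOUR_SSH_KEY_FILE} / ${YOUR_USER_ID} are placeholders, so check the docs if anything has changed.

# Terminal 1: open a tunnel to the interactive node (A) via the access server
ssh -i ${YOUR_SSH_KEY_FILE} -L 10022:es-a:22 -l ${YOUR_USER_ID} as.abci.ai

# Terminal 2: log onto the interactive node (A) through the tunnel
ssh -i ${YOUR_SSH_KEY_FILE} -p 10022 -l ${YOUR_USER_ID} localhost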

You can run the following on the login node.

git clone https://github.com/huggingface/transformers-bloom-inference.git
cd transformers-bloom-inference
git checkout 556ccac6534648800ada288f528c5c23d0c5f5ed

module load cuda/11.6/11.6.2 cudnn/8.3/8.3.3 nccl/2.12/2.12.12-1 gcc/11.2.0

# Use virtualenv of your choice --- I used pyenv
pyenv virtualenv 3.9.5 transformers-bloom-inference-ag
pyenv local transformers-bloom-inference-ag

pip install -U pip
pip install torch==1.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install transformers==4.21.2 flask flask_api gunicorn pydantic accelerate==0.14.0 huggingface-hub==0.10.1 deepspeed==0.7.5 deepspeed-mii==0.0.2

# Set cache dir because the home directory has limited quota
# I have set it to global scratch area for the sake of simplicity but I suggest you set it to a persistent storage area
export TRANSFORMERS_CACHE="/scratch/${USER}/cache/huggingface/transformers"
export HF_DATASETS_CACHE="/scratch/${USER}/cache/huggingface/dataset"
mkdir -p $TRANSFORMERS_CACHE
mkdir -p $HF_DATASETS_CACHE

# Download the model (transformers-bloom-inference does not currently support local models)
# It will take about 4 hours and approx 400GB of disk
python -m inference_server.download_model --model_name microsoft/bloom-deepspeed-inference-int8
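Once the download finishes, it is worth sanity-checking that the weights actually landed in the cache before requesting a compute node. A minimal check (the exact directory layout inside the cache is managed by the library, so just looking at the total size is enough):

# Should roughly match the ~400GB figure mentioned above if the download completed
du -sh "$TRANSFORMERS_CACHE"
ls "$TRANSFORMERS_CACHE" | head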

Running benchmark

Let's run the benchmark to confirm that everything was configured correctly.

Launch an interactive job.

qrsh -l rt_AF=1 -l h_rt=2:00:00 -g ${YOUR_GROUP_ID}

Run the following commands:

cd transformers-bloom-inference
module load cuda/11.6/11.6.2 cudnn/8.3/8.3.3 nccl/2.12/2.12.12-1 gcc/11.2.0

export TRANSFORMERS_CACHE="/scratch/${USER}/cache/huggingface/transformers"
export HF_DATASETS_CACHE="/scratch/${USER}/cache/huggingface/dataset"

deepspeed \
  --num_gpus 8 \
  --module inference_server.benchmark \
  --model_name microsoft/bloom-deepspeed-inference-int8 \
  --dtype int8 \
  --deployment_framework ds_inference \
  --benchmark_cycles 5 \
  --model_class AutoModelForCausalLM \
  --batch_size 8

I was able to run with a batch size of up to 8, but encountered OOM at a batch size of 16. While this is much smaller than the 256 reported in Huggingface's blog post, it is probably because we are using 8xA100 (40GB) instead of their 8xA100 (80GB)1.
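If you want to find the largest batch size that fits on your own nodes, the easiest approach is a simple sweep that reuses the exact benchmark command above and stops at the first failure (assuming the deepspeed launcher exits with a nonzero status on OOM):

# Sweep batch sizes until the first failure (e.g., CUDA OOM)
for bs in 1 2 4 8 16 32; do
  echo "=== batch_size=${bs} ==="
  deepspeed \
    --num_gpus 8 \
    --module inference_server.benchmark \
    --model_name microsoft/bloom-deepspeed-inference-int8 \
    --dtype int8 \
    --deployment_framework ds_inference \
    --benchmark_cycles 5 \
    --model_class AutoModelForCausalLM \
    --batch_size ${bs} || break
done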

Launching server

Launch an interactive job.

qrsh -l rt_AF=1 -l h_rt=2:00:00 -g ${YOUR_GROUP_ID}

Copy the host name.

hostname

Run the following commands:

cd transformers-bloom-inference
module load cuda/11.6/11.6.2 cudnn/8.3/8.3.3 nccl/2.12/2.12.12-1 gcc/11.2.0

export TRANSFORMERS_CACHE="/scratch/${USER}/cache/huggingface/transformers"
export HF_DATASETS_CACHE="/scratch/${USER}/cache/huggingface/dataset"

# You might be able to increase MAX_BATCH_SIZE and/or MAX_INPUT_LENGTH
TOKENIZERS_PARALLELISM=false \
MODEL_NAME=microsoft/bloom-deepspeed-inference-int8 \
MODEL_CLASS=AutoModelForCausalLM \
DEPLOYMENT_FRAMEWORK=ds_inference \
DTYPE=int8 \
MAX_INPUT_LENGTH=2048 \
MAX_BATCH_SIZE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
gunicorn -t 0 -w 1 -b 0.0.0.0:5000 inference_server.server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s'

Wait until you see server has started on 50950 (or something similar) printed before sending any request; the process dies if it receives a request while it is still launching.
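If you prefer to script this waiting step rather than watching the terminal, one option (a sketch, not part of the repository) is to append 2>&1 | tee server.log to the gunicorn command above and then block on the startup message from another shell on the same node, which avoids sending a request while the model is still loading:

# Wait until the startup message appears in the log
tail -n +1 -f server.log | grep -m 1 "server has started"
echo "server is ready"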

Then, from your local machine:

# This is an example. Use the host name from above
ABCI_HOST=a0104
YOUR_LOCAL_PORT=25000
ssh -i ${YOUR_SSH_KEY_FILE} -L 10022:es:22 -l ${YOUR_USER_ID} as.abci.ai
ssh -N -L ${YOUR_LOCAL_PORT}:${ABCI_HOST}:5000 -l ${YOUR_USER_ID} -i ${YOUR_SSH_KEY_FILE} -p 10022 localhost

Congrats! Port 25000 on your local machine is now connected to port 5000 on the worker node.

You can POST a query from your local machine.

curl -X POST -H "Content-Type: application/json" -d '{"text": ["DeepSpeed is a machine", "You can post multiple texts"], "max_new_tokens": 40}' http://localhost:${YOUR_LOCAL_PORT}/generate/
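If you plan to send many queries, a small shell wrapper around the same endpoint can be convenient. This is just a hypothetical helper, assuming jq is installed for pretty-printing (drop | jq . otherwise) and that the prompt contains no characters that need JSON escaping:

# Hypothetical helper around the /generate/ endpoint shown above
generate() {
  local prompt="$1"
  local max_new_tokens="${2:-40}"
  curl -s -X POST -H "Content-Type: application/json" \
    -d "{\"text\": [\"${prompt}\"], \"max_new_tokens\": ${max_new_tokens}}" \
    "http://localhost:${YOUR_LOCAL_PORT}/generate/" | jq .
}

generate "DeepSpeed is a machine" 40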
  1. Since the model weights themselves take a fixed, considerable amount of memory, the memory left for activations is much less than half, so the batch size we can fit is considerably smaller than half of theirs.
