Date: 21.09.2025
CUDA PyTorch BitsAndBytes Mistral Test
Requirements
- NVIDIA Tesla P100, 16 GB VRAM
- Workstation with 40 GB RAM, 500 GB SSD, 750 W power supply
- Ubuntu 24.04 LTS
- Python 3.12
Tests
Prepare the Python environment for CUDA + PyTorch + BitsAndBytes:
- Preparing PyTorch
mkdir -p ~/llm && cd ~/llm
python3 -m venv .venv_llm_cuda_bnb_mistral
source ./.venv_llm_cuda_bnb_mistral/bin/activate
python -m pip install --upgrade pip
pip install "torch==2.5.0" "torchvision==0.20.0" "torchaudio==2.5.0" --index-url https://download.pytorch.org/whl/cu124
pip install "bitsandbytes==0.44.1"
pip install transformers accelerate
python3 -c "import torch; print(torch.__version__); print(torch.cuda.is_available());print(torch.cuda.get_device_name(0));"
- Expected response
2.5.0+cu124
True
Tesla P100-PCIE-16GB
- Check BitsAndBytes installation
python -m bitsandbytes
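bitsandbytes kernel support can vary with the GPU's compute capability; the P100 reports 6.0. A quick check via PyTorch's own API:
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"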
Get the Mistral model
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1 mistral
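The git lfs clone typically pulls every weight format in the repository, including the legacy .bin files. As an alternative sketch (assuming huggingface_hub, which transformers already depends on), only the safetensors weights can be fetched:
from huggingface_hub import snapshot_download

# Sketch: download the repo into ./mistral, skipping the legacy .bin weights.
snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    local_dir="mistral",
    ignore_patterns=["*.bin"],
)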
Create script test_cuda_mistral.py:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
print("GPU available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0))
model_path = "/home/sysadmin/llm/mistral"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16
).to("cuda")
generator = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
device=0 # Use GPU
)
print(generator("What do you know about the Sun?", max_new_tokens=160)[0]["generated_text"])
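To record how much VRAM the FP16 run actually uses, a line like the following could be appended to the script (a sketch, not part of the original test; it reports this process's peak allocation as seen by PyTorch's allocator):
# Optional: report peak VRAM allocated by this process
print(f"Peak VRAM: {torch.cuda.max_memory_allocated(0) / 1024**3:.2f} GiB")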
Create script test_cuda_bnb4_mistral.py:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transformers import BitsAndBytesConfig
import torch
print("GPU available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0))
model_path = "/home/sysadmin/llm/mistral"
qconf = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to FP16 for matmuls
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=qconf,
    torch_dtype=torch.float16,
    device_map="auto",  # let accelerate place the quantized weights on the GPU
)
generator = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer
)
print(generator("What do you know about the Sun?", max_new_tokens=160)[0]["generated_text"])
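For a quick size comparison between the FP16 and quantized runs, transformers exposes get_memory_footprint() on loaded models; appending a line like this to each script should work (a sketch, untested here):
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")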
Create script test_cuda_bnb8_mistral.py:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transformers import BitsAndBytesConfig
import torch
print("GPU available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0))
model_path = "/home/sysadmin/llm/mistral"
qconf = BitsAndBytesConfig(
    load_in_8bit=True  # LLM.int8() quantization
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=qconf,
    torch_dtype=torch.float16,
    device_map="auto",  # let accelerate place the quantized weights on the GPU
)
generator = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer
)
print(generator("What do you know about the Sun?", max_new_tokens=160)[0]["generated_text"])
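To compare generation speed across the three configurations, the final print in each script can be wrapped with a timer; a minimal sketch:
import time

start = time.perf_counter()
out = generator("What do you know about the Sun?", max_new_tokens=160)
elapsed = time.perf_counter() - start
print(out[0]["generated_text"])
# Upper bound only: generation may stop before max_new_tokens is reached.
print(f"Elapsed: {elapsed:.1f} s (at most {160 / elapsed:.1f} tokens/s)")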
Run tests
Check GPU utilization with nvidia-smi during each test:
while true; do nvidia-smi; sleep 1; done
- Without quantization
python ./test_cuda_mistral.py
- With quantization 4bit
python ./test_cuda_bnb4_mistral.py
- With quantization 8bit
python ./test_cuda_bnb8_mistral.py
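Instead of eyeballing the nvidia-smi loop, used VRAM can be logged once per second for later comparison; a sketch of a hypothetical helper script (not part of the original tests; nvidia-smi's --query-gpu and --format flags are standard):
import subprocess
import time

# Poll nvidia-smi once per second and print used VRAM in MiB.
while True:
    used = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    ).strip()
    print(f"{time.strftime('%H:%M:%S')}  {used} MiB")
    time.sleep(1)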