Quantizing a model means reducing the number of bits used to represent its weights and activations, for example storing 32-bit floating-point values as 8-bit integers. This makes the model smaller and usually faster, so it can run on less powerful computers and devices. The cost is precision: with fewer bits, the model cannot represent its parameters as exactly as before, which can reduce accuracy or output quality. Quantization is therefore a trade-off between size and speed on one hand and accuracy or quality on the other. Overall, it is a useful technique for making machine learning more accessible and efficient.
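To make the precision trade-off concrete, here is a minimal sketch of min/max affine quantization in plain Python. The helper names `quantize_affine` and `dequantize_affine` and the sample `weights` are made up for illustration, and this is only one common scheme, not necessarily what any particular library does internally. It maps floats to integer codes, maps them back, and reports how far the restored values drift from the originals at 8 and 4 bits.

```python
def quantize_affine(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] (hypothetical helper)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include zero so that 0.0 gets an exact integer code.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_affine(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.2, -0.35, 0.0, 0.4, 0.77, 2.5]
for bits in (8, 4):
    codes, scale, zero_point = quantize_affine(weights, num_bits=bits)
    restored = dequantize_affine(codes, scale, zero_point)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: codes={codes}, max round-trip error={max_err:.4f}")
```

Fewer bits means a coarser grid of representable values, so the round-trip error grows as `num_bits` shrinks. That error is the accuracy side of the trade-off described above.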
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the GPT-2 model and tokenizer
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Generate some sample text
prompt = "The quick brown fox jumps over the lazy dog."
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=100, do_sample=True, num_return_sequences=1)
sample_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Sample text: ", sample_text)

# Quantize the model.
# quantize_dynamic expects a set of module types to quantize, not a nested
# activation/weight config dict. Weights are stored as int8; activations are
# quantized on the fly at inference time.
# Note: Hugging Face GPT-2 implements most of its projection layers as Conv1D
# rather than nn.Linear, so only the nn.Linear modules (such as lm_head) are
# converted here.
quantized_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Save the quantized model
torch.save(quantized_model.state_dict(), 'quantized_model.pt')

## ------------------------------- Generate using Quantized Model

# Load the quantized model.
# The saved state_dict holds packed int8 parameters, so rebuild the same
# quantized architecture before loading it; a plain float model cannot load it.
quantized_model = GPT2LMHeadModel.from_pretrained(model_name)
quantized_model = torch.quantization.quantize_dynamic(quantized_model, {nn.Linear}, dtype=torch.qint8)
quantized_model.load_state_dict(torch.load('quantized_model.pt'))

# Generate some sample text using the quantized model
output_quantized = quantized_model.generate(input_ids, max_length=100, do_sample=True, num_return_sequences=1)
sample_text_quantized = tokenizer.decode(output_quantized[0], skip_special_tokens=True)
print("Sample text (quantized): ", sample_text_quantized)