What is FlexGen?
FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves high throughput through IO-efficient offloading, compression, and large effective batch sizes.
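The intuition behind "large effective batch sizes" can be sketched with a toy cost model (the numbers below are made up for illustration, not FlexGen measurements): when weights are offloaded, each decoding step pays a fixed IO cost to stream them onto the GPU, and a bigger batch amortizes that cost over more sequences.

```python
# Toy throughput model for offloaded generation (illustrative numbers only):
# each decoding step pays a fixed weight-streaming IO cost, plus a small
# per-token compute cost, so throughput rises with the batch size.

def tokens_per_second(batch_size, io_seconds=1.0, compute_seconds_per_token=0.01):
    """Tokens/s for one decoding step when weight IO dominates."""
    step_time = io_seconds + batch_size * compute_seconds_per_token
    return batch_size / step_time

small = tokens_per_second(1)    # IO cost barely amortized
large = tokens_per_second(256)  # IO cost spread over many sequences
print(small, large)
```

With these assumed costs, a batch of 256 is dozens of times faster per token than a batch of 1, which is why FlexGen targets throughput-oriented (batch) workloads rather than low-latency chat.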
Can we run it on Google Colab?
Yes, we can. However, the free tier's resources are limited, so we need to use the lightest model.
The supported model configurations are listed in FlexGen/flexgen/opt_config.py on GitHub.
!git clone https://github.com/FMInference/FlexGen.git
%cd FlexGen
!git checkout 9d888e5e3e6d78d6d4e1fdda7c8af508b889aeae
!pip install flexgen
!python flexgen/apps/chatbot.py --model facebook/opt-125m
As you can see, the conversation doesn't make much sense, but we did manage to run FlexGen on Google Colab 😂
Note that opt-350m is not implemented in FlexGen.
!python flexgen/apps/chatbot.py --model facebook/opt-1.3b
With a paid Colab plan, you will probably be able to run opt-1.3b and larger models.
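Beyond the chatbot demo, FlexGen's README runs models through the flexgen.flex_opt entry point, where the --percent flag splits weights, KV cache, and activations between GPU and CPU as three pairs of percentages. A sketch, assuming the flag behaves as documented at the commit checked out above:

```shell
# The six numbers are GPU%/CPU% pairs for weights, KV cache, and activations.
# Keep everything on the GPU for opt-1.3b:
python3 -m flexgen.flex_opt --model facebook/opt-1.3b --percent 100 0 100 0 100 0
# Offload all weights to CPU RAM to fit a larger model in less GPU memory:
python3 -m flexgen.flex_opt --model facebook/opt-6.7b --percent 0 100 100 0 100 0
```

These commands download multi-gigabyte weights on first run, so they are only practical on a paid Colab tier or a local GPU machine.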
What is OPT?
OPT (Open Pre-trained Transformer) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
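To see where a size like "125M" comes from, we can roughly count the parameters of a decoder-only transformer from its shape. The numbers below (hidden size 768, 12 layers, FFN dimension 3072, vocabulary 50272, 2050 position embeddings) follow the published opt-125m configuration; treat this as an estimate, since small details like bias terms and the position-embedding offset vary between implementations.

```python
# Rough parameter count for a decoder-only transformer, using the opt-125m
# shape as an example (values taken from the published OPT configs).

def opt_param_count(hidden, layers, vocab, max_pos, ffn_mult=4):
    ffn = ffn_mult * hidden
    per_layer = (
        4 * (hidden * hidden + hidden)  # Q, K, V, and output projections
        + (hidden * ffn + ffn)          # FFN up-projection
        + (ffn * hidden + hidden)       # FFN down-projection
        + 2 * 2 * hidden                # two LayerNorms (scale + bias)
    )
    embeddings = vocab * hidden + max_pos * hidden
    return layers * per_layer + embeddings + 2 * hidden  # + final LayerNorm

print(opt_param_count(768, 12, 50272, 2050))  # roughly 125M
```

Plugging in the larger configs from the same family (e.g. hidden 2048 and 24 layers for opt-1.3b) reproduces the other advertised sizes to within a few percent.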