Run FlexGen on Google Colab

Mar 14, 2023

What is FlexGen?

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

FlexGen on GitHub: https://github.com/FMInference/FlexGen

Can we run it on Google Colab?

Yes, we can. However, the free tier's resources are limited, so we need to use the lightest model.

FlexGen/opt_config.py at main: https://github.com/FMInference/FlexGen/blob/main/flexgen/opt_config.py
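Before picking a model, it can help to see which GPU the Colab runtime was given and how much memory it has. The cell below is only a sketch: it assumes the `nvidia-smi` tool is available on the runtime (it returns an empty list otherwise), and the helper names `parse_gpu_line` and `query_gpus` are made up for this post, not part of FlexGen.

```python
import shutil
import subprocess


def parse_gpu_line(line):
    """Parse one CSV line such as 'Tesla T4, 15360 MiB' into (name, MiB)."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory.split()[0])


def query_gpus():
    """Return [(name, total_MiB), ...], or [] if no NVIDIA GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            # Skip lines that do not look like "name, <number> MiB"
            continue
    return gpus


for name, mib in query_gpus():
    print(f"{name}: {mib} MiB total")
```

If nothing is printed, the runtime probably has no GPU attached; in Colab the runtime type can be changed from the Runtime menu.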

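# Clone the repository and install the FlexGen package from PyPI
!git clone https://github.com/FMInference/FlexGen.git
!pip install flexgen
# %cd changes the notebook's working directory for the following commands
%cd FlexGen
# Check out the commit used in this post
!git checkout 9d888e5e3e6d78d6d4e1fdda7c8af508b889aeae
# Start the chatbot with the smallest OPT model
!python flexgen/apps/chatbot.py --model facebook/opt-125m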

The conversation didn’t make much sense, but FlexGen did actually run on Google Colab 😂

Note that opt-350m is not implemented:
https://github.com/FMInference/FlexGen/blob/main/flexgen/opt_config.py#L75-L76
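When switching between models, a small guard can catch the unsupported one before the chatbot even starts. This is a minimal sketch, not FlexGen code: `chatbot_command` and `UNSUPPORTED_MODELS` are hypothetical names, and the only flag used is the `--model` flag shown in this post.

```python
# Hypothetical helper, not part of FlexGen: builds the chatbot command
# used in this post and rejects the model noted above as not implemented.
UNSUPPORTED_MODELS = {"facebook/opt-350m"}


def chatbot_command(model):
    """Return the argument list for running the chatbot with `model`."""
    if model in UNSUPPORTED_MODELS:
        raise ValueError(f"{model} is not implemented in FlexGen's opt_config.py")
    return ["python", "flexgen/apps/chatbot.py", "--model", model]


# Print the command line for a model that is expected to be accepted
print(" ".join(chatbot_command("facebook/opt-1.3b")))
```

The returned list can be passed to `subprocess.run` from the FlexGen directory, or the printed line can be pasted into a cell after a `!`.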

!python flexgen/apps/chatbot.py --model facebook/opt-1.3b

If you pay for more Colab resources, you will probably be able to run opt-1.3b and larger models.

What is OPT?

OPT (Open Pre-trained Transformer) is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.

OPT: Open Pre-trained Transformer Language Models (arxiv.org)
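To get a rough feel for why only the small OPT models are comfortable on a free Colab GPU, the parameter counts can be turned into a weight-size estimate. This is back-of-the-envelope arithmetic, not a measurement: it assumes fp16 weights (2 bytes per parameter) and counts weights only, ignoring activations, the KV cache, and anything FlexGen saves through offloading or compression. The helper name `weight_memory_gb` is made up for this post.

```python
# Rough weight-size estimate for a few OPT sizes.
# Assumption: weights are stored in fp16, i.e. 2 bytes per parameter.
BYTES_PER_PARAM_FP16 = 2

# Nominal parameter counts read off the model names
OPT_PARAMS = {
    "opt-125m": 125e6,
    "opt-1.3b": 1.3e9,
    "opt-175b": 175e9,
}


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Return the approximate size of the weights in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in OPT_PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB of weights")
```

Under these assumptions opt-125m needs about 0.25 GB for its weights, opt-1.3b about 2.6 GB, and opt-175b about 350 GB, which is where offloading starts to matter.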