EXAONE-3.5-2.4B: A Ultra-lightweight but High Performing LLM on Just 6GB GPU

EXAONE-3.5-2.4B

Today, there are many Large Language Models (LLMs) available, ranging from billions to trillions of parameters in size. While the list of LLMs is extensive, it becomes much shorter when considering in-house models with lower GPU requirements. Performance remains the critical factor, and finding a model that combines low GPU consumption with high performance is increasingly rare in today's landscape.

Recently, I explored the EXAONE-3.5-2.4B-Instruct model developed by LG AI Research. I tested it across NER, Text Classification, Code Generation, Q&A, Summarization, Text Generation, and Article Generation — showing better performance than IBM's Granite 2B, Google's Gemma 2B, and Microsoft's Phi-2, while consuming only ~6GB GPU memory.

Background

I'm working on a RAG chatbot project handling sensitive business data that requires high contextual understanding. Key challenges:

Data Security: Sensitive business data must not be exposed.
Context Window: Token count grows exponentially with chat history, making commercial APIs prohibitively expensive.
Multi-Task Handling: Needs NER, Python & SQL code generation, moderation, segmentation, and RAG-based Q&A — either one versatile model or multiple specialized ones.
Latency: One task waits on another's output, so latency must be minimal.
Limited Budget: GPU requirements should stay around 5GB for cost-effective inference.
Accuracy: High contextual understanding is essential.

After evaluating top-ranked smaller models from the Open LLM Leaderboard — tiiuae/Falcon3-1B-Instruct, google/gemma-2b, ibm-granite/granite-3.1-2b-instruct, microsoft/phi-2 — EXAONE-3.5-2.4B-Instruct stood out for contextual understanding, multi-task capability, efficient resource usage (~6GB GPU & 2GB RAM), and structured output formatting.

Benchmark

EXAONE-3.5 comes in three variants: 2.4B, 7.8B, and 32B parameters. Below are benchmark comparison scores of the 2.4B & 7.8B variants against similar models.

Benchmark score comparison

Hardware Requirements

Testing was conducted on Kaggle Free Tier:

GPU: Tesla P100 16GB
RAM: 29GB
Disk: 60GB

Resource consumption after a 2-hour session:

GPU Memory: 6.3GB
RAM Usage: ~2GB
Disk Space: 11.1GB

Code

Explore the Kaggle Notebook for the pretrained variant. Change the accelerator to GPU — setup takes ~3–4 minutes. Use cases explored:

Named Entity Recognition (NER)
Text Classification
Python & HTML Code Generation
Q&A
Text Summarization
Text Generation

Future Works

Next steps with EXAONE-3.5-2.4B:

Fine-tuning: The pre-trained model was used for quick testing. Fine-tuning should improve JSON formatting and context understanding.
Quantization: Will optimize GPU consumption while preserving knowledge.
Multilingual Capability: Currently performs well in English and Korean only — multilingual capability needs further exploration.

Conclusion

EXAONE-3.5-2.4B-Instruct demonstrates that smaller language models can deliver effective performance while being resource-efficient. With a modest ~6GB GPU requirement and strong NLP capabilities across multiple tasks, it's an excellent choice for production deployments where both performance and resource optimization matter.