
Today, there are many Large Language Models (LLMs) available, ranging from billions to trillions of parameters in size. While the list of LLMs is extensive, it becomes much shorter when considering in-house models with lower GPU requirements. Performance remains the critical factor, and finding a model that combines low GPU consumption with high performance is increasingly rare in today's landscape.
Recently, I explored the EXAONE-3.5-2.4B-Instruct model developed by LG AI Research. I tested it across NER, Text Classification, Code Generation, Q&A, Summarization, Text Generation, and Article Generation — showing better performance than IBM's Granite 2B, Google's Gemma 2B, and Microsoft's Phi-2, while consuming only ~6GB GPU memory.
Background
I'm working on a RAG chatbot project handling sensitive business data that requires high contextual understanding. Key challenges:
- Data Security: Sensitive business data must not be exposed.
- Context Window: Token count grows exponentially with chat history, making commercial APIs prohibitively expensive.
- Multi-Task Handling: Needs NER, Python & SQL code generation, moderation, segmentation, and RAG-based Q&A — either one versatile model or multiple specialized ones.
- Latency: One task waits on another's output, so latency must be minimal.
- Limited Budget: GPU requirements should stay around 5GB for cost-effective inference.
- Accuracy: High contextual understanding is essential.
After evaluating top-ranked smaller models from the Open LLM Leaderboard — tiiuae/Falcon3-1B-Instruct, google/gemma-2b, ibm-granite/granite-3.1-2b-instruct, microsoft/phi-2 — EXAONE-3.5-2.4B-Instruct stood out for contextual understanding, multi-task capability, efficient resource usage (~6GB GPU & 2GB RAM), and structured output formatting.
Benchmark
EXAONE-3.5 comes in three variants: 2.4B, 7.8B, and 32B parameters. Below are benchmark comparison scores of the 2.4B & 7.8B variants against similar models.
Hardware Requirements
Testing was conducted on Kaggle Free Tier:
- GPU: Tesla P100 16GB
- RAM: 29GB
- Disk: 60GB
Resource consumption after a 2-hour session:
- GPU Memory: 6.3GB
- RAM Usage: ~2GB
- Disk Space: 11.1GB
Code
Explore the Kaggle Notebook for the pretrained variant. Change the accelerator to GPU — setup takes ~3–4 minutes. Use cases explored:
- Named Entity Recognition (NER)
- Text Classification
- Python & HTML Code Generation
- Q&A
- Text Summarization
- Text Generation
Future Works
Next steps with EXAONE-3.5-2.4B:
- Fine-tuning: The pre-trained model was used for quick testing. Fine-tuning should improve JSON formatting and context understanding.
- Quantization: Will optimize GPU consumption while preserving knowledge.
- Multilingual Capability: Currently performs well in English and Korean only — multilingual capability needs further exploration.
Conclusion
EXAONE-3.5-2.4B-Instruct demonstrates that smaller language models can deliver effective performance while being resource-efficient. With a modest ~6GB GPU requirement and strong NLP capabilities across multiple tasks, it's an excellent choice for production deployments where both performance and resource optimization matter.