ILMU, Malaysia’s first fully homegrown large language model (LLM), was launched in Kuala Lumpur in August by YTL AI Labs Sdn Bhd, a subsidiary of YTL Power International Bhd, in a bid to develop sovereign AI capabilities for Malaysia. The acronym ILMU stands for “Intelek Luhur Malaysia Untukmu”, or in English, “Malaysian Intellect Integrity for You”, representing Malaysian intelligence, developed in Malaysia, for Malaysians.
The LLM was trained on Malaysian languages, data and cultural context. It understands and responds in Malay, Manglish (Malaysian English) and regional dialects such as Kelantanese, across text, voice and visual inputs.
w.media interviews Professor Chan Chee Seng of the Faculty of Computer Science and Information Technology, Universiti Malaya, for a deep dive into ILMU. Chan led the university’s team that collaborated with YTL AI Labs.
Q1. When did the ILMU project start?
The ILMU project traces its roots to early 2023 at Universiti Malaya, where it began as a final-year project by three students (Lawerence Chieng, Jeraelyn Tan, and Jia Xuan). Their initial goal was to study ChatGPT, which had just been released in late 2022, with a particular focus on understanding and mitigating the problem of hallucination in large language models. What started as a student-led research effort quickly gained momentum and, by late 2023, evolved into a full-fledged national initiative led by YTL AI Labs in collaboration with Universiti Malaya. This transition from student research to a sovereign foundation model trained from scratch highlights Malaysia’s talent pipeline and innovation capacity, ensuring that ILMU is both wholly Malaysian in intellectual property and deeply grounded in our national context.
ILMU was built from scratch as a foundation model, not fine-tuned from an existing one. We are not alone in this direction: local pioneers like Mesolitica, which developed MaLLaM using about 10 nodes of NVIDIA A100 GPUs, have shown it is possible for Malaysians to build large language models independently. ILMU takes this much further. It was trained on well over 100 GPU nodes, an order of magnitude larger in scale, giving us the capacity to compete with the world’s leading systems.
To ensure ILMU is not just technically capable but deeply Malaysian, we also created MalayMMLU, the first dedicated benchmark for Bahasa Malaysia. This benchmark has been accepted at Empirical Methods in Natural Language Processing (EMNLP), one of the world’s leading NLP conferences, giving Malaysia recognition on the global stage while ensuring ILMU is trained, tested, and validated for Malaysian contexts.
On the MalayMMLU benchmark, ILMU achieved 87.2%, outperforming models like GPT-5, GPT-4o, and DeepSeek-V3.
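To make the evaluation setup concrete, here is a minimal sketch of how a multiple-choice benchmark like MalayMMLU can be scored. The prompt template and the `query_model` helper are illustrative placeholders, not the team’s actual evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MalayMMLU-style).
# `query_model` is a placeholder that maps a prompt to the model's reply.

def score_mcq(questions, query_model):
    """Score a model on multiple-choice questions.

    Each question is a dict: {"question": str, "choices": list[str], "answer": "A".."D"}.
    Returns accuracy as a fraction in [0, 1].
    """
    correct = 0
    for q in questions:
        options = "\n".join(
            f"{letter}. {text}"
            for letter, text in zip("ABCD", q["choices"])
        )
        prompt = (
            "Jawab soalan berikut dengan satu huruf sahaja.\n\n"  # "Answer with one letter only."
            f"{q['question']}\n{options}\nJawapan:"
        )
        # Take the first character of the reply as the chosen option letter.
        prediction = query_model(prompt).strip()[:1].upper()
        correct += prediction == q["answer"]
    return correct / len(questions)
```

A reported score like 87.2% would then simply be this accuracy across the full benchmark set.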
Q2. How do you obtain the data needed for training/inference?
ILMU’s training data was carefully curated from diverse sources to support pre-training and downstream applications. These include:
- Publicly available data
- Licensed third-party corpora
- Malaysia-centric sources, such as educational, cultural, and government materials
Malay-language data is indeed a low-resource area globally, and that is precisely why ILMU exists. The challenge is not only about quantity, but also about quality and relevance. To address this, we expand our corpus through partnerships with local institutions and communities, rigorous curation of trusted sources, and human-guided synthetic data generation to fill gaps in underrepresented topics.
We also have a dedicated in-house data team that ensures high-quality annotation, filtering, and validation, so that ILMU reflects Malaysia’s linguistic richness and cultural diversity.
In short, while global LLMs may have access to far more raw data overall, ILMU is built on the ‘right data’ for Malaysia.
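As an illustration of that “right data” point, the sketch below shows the kind of corpus-cleaning pass such pipelines typically apply: exact deduplication, a minimum-length filter, and a crude language heuristic. The threshold and the stop-word list are assumptions for illustration, not ILMU’s actual curation rules.

```python
import hashlib
import re

# Crude Malay-language signal: common function words. An assumption for
# illustration only, not ILMU's actual language-identification method.
MALAY_HINTS = re.compile(r"\b(dan|yang|untuk|dengan|adalah|tidak)\b", re.IGNORECASE)

def clean_corpus(documents, min_words=50):
    """Deduplicate and quality-filter raw text documents."""
    seen = set()
    for doc in documents:
        text = doc.strip()
        if len(text.split()) < min_words:  # drop short fragments
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                 # drop exact duplicates
            continue
        seen.add(digest)
        if not MALAY_HINTS.search(text):   # keep documents with Malay signal
            continue
        yield text
```

Real pipelines add near-duplicate detection, toxicity filtering, and human review on top of passes like this; the point is that quality gates, not raw volume, shape the corpus.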
Q3. Can you give examples of sources for the ILMU library?
Examples include:
- Curriculum-aligned content, spanning primary to secondary school subjects
- Linguistic diversity data, including literary hikayat, colloquial Bahasa Pasar, and royal Bahasa Istana
- Cultural content such as Malaysian foods (ondeh-ondeh, satay), traditional games (congkak, wau), and landmarks (Batu Caves, Petronas Towers) for vision grounding
- Audio corpora covering Malaysian-accented speech, dialects, and code-switching
Q4. How many people were involved?
To be honest, I’ve probably lost count, but certainly more than 100 people have been involved in ILMU’s journey in one way or another. It goes far beyond just the core research team: from school teachers who helped mark ILMU’s PT3 benchmark papers, to interns, engineers, academics, and industry researchers contributing to different stages of development.
We also want to acknowledge the open-source community, both in Malaysia and abroad, whose tools and insights helped guide us. That ecosystem of sharing is part of why projects like ILMU could succeed. But it is important to emphasize that ILMU was built by Malaysians, in Malaysia, for Malaysians. The architecture, training, and deployment were led here, ensuring the intellectual property and cultural grounding remain sovereign.
Q5. How safe is it from data breaches and hackers?
Safety is one of ILMU’s core design pillars. We distinguish clearly between two categories of information:
- Training Data → Model Weights
  - All data used to train ILMU is transformed into model weights during training. Once training is complete, the model does not store or expose raw training data.
  - ILMU is served through a closed API, meaning access is controlled and internal data cannot be retrieved through standard queries.
  - The entire system is hosted in Malaysia, fully owned and operated in-country. This ensures that both the compute infrastructure and data sovereignty remain under Malaysian control.
- User Inputs → Runtime Data
  - User queries are handled at runtime and are not incorporated into the base model weights. They remain transient and are protected under strict data privacy and governance protocols.
  - We apply guardrail layers at both the input and output stages, including approaches inspired by Llama Guard, which provide runtime filtering for harmful prompts, prompt injections, and unsafe outputs (see the sketch after this list).
  - Additional monitoring and alignment checks are conducted in collaboration with trusted AI safety partners, ensuring the system meets both local regulatory expectations and global best practices.
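In generic terms, input- and output-stage guardrails wire together as below. This is a sketch under stated assumptions: `safety_classifier` stands in for a Llama Guard-style safety model and `llm_generate` for the serving endpoint; neither represents ILMU’s actual stack.

```python
# Generic two-stage guardrail wiring: screen the user prompt before
# generation, then screen the draft answer before it is released.
# Both helpers are placeholders, not ILMU's production components.

REFUSAL = "Maaf, saya tidak dapat membantu dengan permintaan itu."  # polite refusal in Malay

def guarded_generate(user_prompt, llm_generate, safety_classifier):
    # Input-stage guardrail: block harmful prompts and injection attempts.
    if safety_classifier(role="user", text=user_prompt) != "safe":
        return REFUSAL

    draft = llm_generate(user_prompt)

    # Output-stage guardrail: re-check the model's own answer before release.
    if safety_classifier(role="assistant", text=draft) != "safe":
        return REFUSAL

    return draft
```

Checking both stages matters because a benign-looking prompt can still elicit an unsafe completion, and vice versa.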
At a systems level, ILMU employs defence-in-depth: encrypted data storage, role-based access controls, network isolation, and continuous auditing. Safety evaluation has been benchmarked on SafetyBench, where ILMU demonstrated strong resilience to unsafe prompts.
Our guiding principle is clear: open where possible, closed where necessary. This means sharing research, benchmarks, and learnings openly, while keeping sensitive infrastructure and APIs tightly secured to protect Malaysian users and data.
ILMU was not built just as a research experiment, but as an infrastructure model designed to support Malaysia’s most critical sectors. In fact, ILMU is already being used in the financial sector through Ryt Bank, where it powers AI-driven services that are secure, compliant, and tailored to Malaysian users. This shows how a sovereign model can directly support regulated industries while ensuring that both data and governance remain local.
That is to say, ILMU is not just a product; it is a national ecosystem. With every iteration, Malaysians learn and improve together, from the students and teachers who help build benchmarks like MalayMMLU to researchers, engineers, industry partners, and policymakers. ILMU is about more than technology; it is about building Malaysia’s AI future collectively.
All these developments, from ILMU’s progress to student achievements and the national AI strategy, reflect a broader ecosystem in which Malaysia is building not just models, but capacity and governance. As our students, researchers, and policymakers grow, ILMU becomes more than a technical feat; it becomes part of our national journey towards becoming an AI-producing society.
Q6. Which other countries are implementing sovereign LLMs?
- China: DeepSeek, GLM, Qwen families
- Indonesia: Sahabat AI
- Europe: Mistral (France), part of Europe’s push for AI sovereignty
We also believe that every country should pursue its own sovereign model. The reason is simple: language, culture, and values are not universal. A model trained mostly on English, Chinese, or French data will never fully capture the nuances of Bahasa Malaysia, Manglish, or our multicultural society. Sovereign LLMs allow each nation to safeguard its linguistic heritage, legal frameworks, cultural identity, and data sovereignty.
In short, sovereign AI is not just about technology. It is about digital independence, cultural preservation, and national resilience.