“Training a Successful Turkish Large Language Model from Scratch – How Much Text Data Do We Need?”

Bilişim Dergisi

📌 Published in Bilişim Journal 2025 / Issue 199

Developing large language models (LLMs) for Turkish poses significant challenges because data resources are limited. This article by Prof. Dr. Murat Karakaya examines the data requirements and the key factors that determine whether training a Turkish LLM from scratch succeeds.

The article includes:

  • Estimates of the approximate amount of text data needed to train a successful LLM,
  • An evaluation of how well open-source language models can be adapted to Turkish,
  • Discussions of copyright issues, data quality, ethical concerns, and strategies to follow during model training.

Aimed especially at academics, researchers, AI developers, and public/private sector organizations, this study seeks to contribute to the advancement of Turkish natural language processing.

📄 Access the article here: https://www.bilisimdergisi.org.tr/bilisim-dergisi-2025-sayi-199