Annex-1 2209-A Research Proposal Form 28.09.2022
Year 2023
2nd Term Application
2209/A UNIVERSITY STUDENTS RESEARCH PROJECTS SUPPORT PROGRAM
RESEARCH PROPOSAL FORM
A. GENERAL INFORMATION
ABSTRACT
The Turkish abstract is expected to cover (a) the original value, (b) the method, (c) the management, and
(d) the widespread impact of the research proposal. It is recommended that this section be written last.
Abstract
1. ORIGINAL VALUE
The scope, boundaries, and importance of the topic addressed in the research proposal are explained through a
critical evaluation of the literature as well as qualitative or quantitative data.
When describing the original value, the proposal's scientific value, distinctiveness, and novelty, which gap it will
fill and how, or which problem it will solve and how, and/or what original conceptual, theoretical, and/or
methodological contributions it will make to the relevant field(s) of science or technology are explained with
reference to the literature.
The research question of the proposed study and, if any, its hypothesis or the problem(s) it addresses are
clearly stated.
In Natural Language Processing (NLP), extensive progress has been made for widely used
languages such as English, Spanish, French, German, and Mandarin Chinese.
However, a noticeable gap persists in the literature, datasets, and models dedicated to
agglutinative languages, particularly Turkish. Commendable efforts have been made in
Turkish NLP, including datasets made available by Türkiye Açık Kaynak Platformu and by
contributors on platforms like Kaggle, and models such as BERT, ALBERT, ELECTRA, and
DistilBERT. Nevertheless, significant opportunities for innovation and advancement
remain.
This project does not aim to rival existing efforts. It aims to contribute to and enrich the
broader landscape of Turkish NLP. The data produced, the models trained, and the literature
generated within this project will serve its own objectives and will also be available as
resources for fellow researchers and related projects in the field. This collaborative
orientation adds to the value of the research.
Moreover, the importance of this research transcends its impact on existing initiatives. The
ultimate aspiration of this research is the development of a web application with wide-
ranging utility, catering to public and private institutions, government agencies, and the
general populace. This application's capabilities encompass real-time monitoring of social
media, blogs, and websites to identify derogatory language. This pioneering approach serves
as a bulwark against the adverse effects of harmful content in the online sphere, contributing
to a safer and more constructive digital environment.
The value of this project does not hinge solely on the end product. The data produced
during this research will serve as a resource for subsequent investigations, both
by the researcher and by the broader community. By providing a foundation for future
work, this project supports the continued development of solutions in
Turkish Natural Language Processing.
The aim and objectives of the research proposal are written so as to be clear, measurable, realistic, and
attainable within the research period.
1. Main Objective: Our research project aims to push the boundaries of Turkish Natural Language
Processing, bridging the gap between NLP in Turkish and more widely used languages. This will be achieved by
creating an extensive Turkish language dataset and training a model to effectively detect harmful language in
social media. The ultimate goal is to make this model accessible via a web application, benefitting public and
private institutions and the general population.
2. Specific Objectives:
a. Extensive Dataset Creation: The first objective is to construct a comprehensive Turkish language
dataset. This dataset will serve not only our project but also the broader research community working in the field
of Turkish NLP.
b. Model Development: The second objective is to build a high-accuracy model that can proficiently identify
harmful posts and comments in Turkish social media. This model will be trained using the extensive dataset we
create.
c. Web Application: The third goal is to develop a web application that integrates our model and utilizes
APIs to access social media, allowing real-time monitoring and mitigation of harmful language.
3. Alignment with Original Value: These objectives align with the originality and significance of our project, as
discussed in the "Original Value" (Özgün Değer) section. They represent concrete steps toward addressing the
challenges and gaps in Turkish NLP.
4. Realistic and Attainable: With a project timeline of 6 months and a maximum period of 12 months, we
consider these objectives realistic and attainable within the scope of our research project.
5. Focus on Dataset Quality: A substantial portion of our project timeline will be dedicated to creating a large,
high-quality dataset. This process involves gathering and curating various data sources, unifying them into a
single dataset, and labeling the data through the use of pretrained models and LLMs.
6. Model Training and Fine-Tuning: Following dataset creation, our efforts will focus on training different
models, creating various features, and fine-tuning the model to ensure the highest possible accuracy in
identifying harmful language.
7. Web Application Development: The final step involves the creation of a web application designed to access
and analyze social media data for harmful language in real time.
8. Evaluation Metrics: While evaluation is discussed in detail in the "Method" (Yöntem) section, we will use
cross-validation and time-based data splitting to assess model performance. Cross-validation, a
widely accepted evaluation procedure, will help gauge model performance on currently available data. The
time-based split will evaluate the model's ability to predict future harmful content.
9. Real-World Impact: Our project's potential impact is significant, not only for the Turkish-speaking world but
also for Turkic languages (e.g., Azerbaijani and Turkmen). A single achieved goal can have far-reaching
implications. For example, a well-curated dataset could revolutionize future solutions in Turkish and Turkic NLP.
10. Modern Methodology: We will employ high-quality, contemporary approaches that are widely used in
addressing similar problems. This ensures that our results are realistically achievable through proven, state-of-
the-art methods.
2. METHOD
The method and research techniques to be applied in the research proposal (including data collection tools and
analysis methods) are explained with reference to the relevant literature. It is shown that the methods and
techniques are suitable for achieving the aims and objectives envisaged in the study.
The research methodology for this project comprises several key phases, each meticulously designed to
achieve the primary objectives and deliver invaluable outcomes.
Data Collection and Preprocessing:
• Initial Data Collection: Our journey begins with the collection of freely available, already labeled data
from diverse sources such as Türkiye Açık Kaynak Platformu, Kaggle, Hugging Face, GitHub, and
others. These sources provide a wealth of labeled data, forming the foundation for our subsequent
processes.
• Exploratory Data Analysis: This phase initiates with a comprehensive analysis of the dataset to gain
essential insights. The insights derived from this analysis guide subsequent data preprocessing steps.
• Model Baseline: To establish a reference point for evaluating the impact of specific preprocessing steps
and overall model performance, a baseline model is created using the dataset's core features.
• TurkishNLP: Utilizing the TurkishNLP library, which includes advanced components for text
normalization, stemming, and tokenization, we perform data preprocessing. This not only ensures data
homogeneity but also significantly enhances the quality of Turkish language data.
• Preprocessing Micro Service: To empower users with the flexibility to customize data preprocessing
according to their preferences, a FastAPI service is developed for parametric data preprocessing using
the TurkishNLP library.
• Preprocessing Micro Service (Parameter Tuning): Through experimentation, we identify the optimal data
preprocessing steps for the shared dataset, refining the data to enhance model performance.
• SHAP Analysis: The SHAP analysis provides invaluable insights into the model's classification
performance. While it may not predict the need for additional data, it uncovers factors influencing
decision-making, highlighting areas where the model can benefit from improved feature engineering.
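The idea of parametric preprocessing described in the list above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the TurkishNLP library or the FastAPI service itself. The option names (`lowercase`, `strip_urls`, `strip_mentions`, `collapse_whitespace`) are hypothetical and chosen for illustration. The sketch also shows why Turkish text needs language-aware lowercasing: the default `str.lower()` maps "I" to "i" rather than "ı".

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessOptions:
    # Hypothetical switches; the proposal does not list the service's parameters.
    lowercase: bool = True
    strip_urls: bool = True
    strip_mentions: bool = True
    collapse_whitespace: bool = True


URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    # Handle the two Turkish-specific capitals before the generic lower().
    # Note: this treats every 'I' as Turkish, including in foreign words.
    return text.replace("I", "ı").replace("İ", "i").lower()


def preprocess(text: str, opts: PreprocessOptions = PreprocessOptions()) -> str:
    """Apply the enabled preprocessing steps in a fixed order."""
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed forms
    if opts.strip_urls:
        text = URL_RE.sub(" ", text)
    if opts.strip_mentions:
        text = MENTION_RE.sub(" ", text)
    if opts.lowercase:
        text = turkish_lower(text)
    if opts.collapse_whitespace:
        text = " ".join(text.split())
    return text
```

For example, `preprocess("@ali IŞIK İYİ  https://example.com")` yields `"ışık iyi"`. Passing `PreprocessOptions(lowercase=False)` leaves the casing untouched, which is the kind of per-request customization a parametric service could expose.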
This phase focuses on sourcing data from various social media platforms, including Twitter, Facebook,
Instagram, Reddit, and others, using APIs to gather a diverse range of text samples in Turkish. These datasets
are amalgamated into a single, comprehensive, labeled dataset to ensure data quality and relevance.
1. Data Collection from Various Social Media Platforms: We aggregate data from various social media
sources, each with its unique format and structure. Preprocessing and data normalization are essential
steps, considering these variations.
2. Labeling with Pretrained Models: To label this extensive dataset accurately, we harness the capabilities
of state-of-the-art pretrained language models, such as BERT, ALBERT, ELECTRA, DistilBERT, GPT,
Bard, and others. These models, driven by cutting-edge algorithms, help identify harmful language,
4. Creation of a Comprehensive Dataset: The labeled data is amalgamated into a unified, extensive
dataset, characterized by both quantity and quality. This dataset serves as a valuable resource,
benefitting a broad user base, including researchers, educators, businesses, and the general public.
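Merging sources that differ in format and structure into one labeled dataset, as described in the steps above, could look roughly like the sketch below. The source names, field names, and the two target labels (`harmful`, `clean`) are assumptions made for illustration. The actual sources and label scheme are not specified in this section.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Example:
    text: str
    label: str
    source: str


# Hypothetical per-source schemas: (text field, label field, label mapping).
SCHEMAS = {
    "source_a": ("tweet", "category", {"OFF": "harmful", "NOT": "clean"}),
    "source_b": ("content", "is_toxic", {1: "harmful", 0: "clean"}),
}


def unify(records_by_source: dict[str, Iterable[dict]]) -> list[Example]:
    """Map each source's records onto one schema and drop duplicate texts."""
    seen: set[str] = set()
    unified: list[Example] = []
    for source, records in records_by_source.items():
        text_key, label_key, label_map = SCHEMAS[source]
        for rec in records:
            text = " ".join(str(rec[text_key]).split())  # normalize whitespace
            key = text.casefold()  # case-insensitive duplicate check
            if not text or key in seen:
                continue
            seen.add(key)
            unified.append(Example(text, label_map[rec[label_key]], source))
    return unified
```

When the same text appears in more than one source, this sketch keeps the first occurrence and its label. A real pipeline would need an explicit policy for conflicting labels.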
Broad Accessibility:
Our commitment extends to making this comprehensive dataset accessible to the public, encompassing
corporate and private organizations, researchers, professors, entrepreneurs, and anyone interested in
advancing natural language processing in Turkish. By providing such a high-quality dataset, we catalyze the
growth and development of the Turkish NLP community, fostering innovation and solutions.
Web Application:
To bring our developed model and insights to a diverse audience, we will create a user-friendly web application.
This application empowers users to make predictions, access model insights, and seamlessly interact with the
model's capabilities for diverse applications.
Building on the findings from the previous phases, we develop and refine multilabel models, tuning them
against specific performance metrics during the development phase.
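The text does not name the metrics used for the multilabel models. As one common choice, and purely as an assumption, the sketch below computes micro- and macro-averaged F1 over sets of labels. The label names in the usage note are hypothetical.

```python
def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for parallel lists of label sets."""
    tp_sum = fp_sum = fn_sum = 0
    per_label = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        per_label.append(_f1(tp, fp, fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = _f1(tp_sum, fp_sum, fn_sum)   # pools counts across labels
    macro = sum(per_label) / len(labels)  # averages per-label scores
    return micro, macro
```

With `labels = ["insult", "profanity"]`, true sets `[{"insult"}, {"insult", "profanity"}, set()]`, and predicted sets `[{"insult"}, {"insult"}, {"profanity"}]`, the macro F1 is 0.5 and the micro F1 is 2/3. Macro averaging weights rare labels equally with frequent ones, which matters when some harmful categories are scarce.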
• TDDI Model Service: We create a FastAPI service for model deployment, enabling users to make
predictions using the developed model.
• Social Content Analysis Application: Our application is dedicated to analyzing content on social media
platforms, encompassing keyword tracking, hashtag analysis, and brand mention monitoring. It offers
invaluable insights to businesses, aiding them in refining marketing strategies and enhancing their
online presence.
Evaluation Metrics:
• Time-Based Train-Test Split (TBTTS): Because the data has a temporal structure, a time-based train-test
split is used so that the model is evaluated on unseen, future data and the temporal order of the
data is preserved. Nevertheless, repeatedly evaluating against a single test set carries a risk of
overfitting to that set.
• Mixed Approach: We adopt a mixed evaluation approach, leveraging the strengths of both cross-
validation and TBTTS. Cross-validation provides insights into overall model performance, while TBTTS
evaluates the model's predictive capabilities on future data, offering a holistic understanding of the
model's performance.
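The mixed evaluation approach above can be sketched in plain Python as follows: the most recent data is held out by date, and cross-validation folds are built on the earlier portion only. The function names and the contiguous, unshuffled fold layout are illustrative assumptions, not the project's actual implementation.

```python
from datetime import datetime


def time_based_split(samples, cutoff):
    """Split (timestamp, item) pairs: train before the cutoff, test at or after it."""
    train = [s for s in samples if s[0] < cutoff]
    test = [s for s in samples if s[0] >= cutoff]
    return train, test


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


# Usage: hold out the latest posts, then cross-validate on the rest.
posts = [(datetime(2023, month, 1), f"post{month}") for month in range(1, 11)]
train, test = time_based_split(posts, datetime(2023, 9, 1))
folds = list(k_fold_indices(len(train), 4))
```

Keeping the time-based test set out of the cross-validation loop means it is consulted only for the final check, which limits the overfitting risk noted above.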
This meticulously designed methodology ensures that our project is adaptable and versatile for real-world
3. PROJECT MANAGEMENT
The main work packages to be included in the research proposal and their objectives, the period in which each work package will be carried out, its success criterion, and its
contribution to the success of the research are given by filling in the "Work-Time Schedule". Literature review, preparation of progress and final reports, sharing of research
results, article writing, and procurement of materials should not be shown as separate work packages.
As the success criterion, the criteria that each work package must meet to be considered successful are explained. The success criterion is specified with quantitative or
qualitative measures (statement, number, percentage, etc.) so as to be measurable and traceable.
Work package 4: Web Application Development
Objective: Accessibility of the model to a wide audience
Responsible: Arciel Baez
Time: 8
Success criterion: Launching a user-friendly and seamless web application
Work package 5: Dissemination of Results
Objective: Sharing research results with the scientific and entrepreneur community
Responsible: Büşra Kaya
Time: 9
Success criterion: Presenting research findings in at least two conferences and publishing one scientific article
In this section, the infrastructure/equipment (laboratory, vehicles, machinery and equipment, etc.) that is
available at the institutions where the project will be carried out and that will be used in the project is stated.
4. WIDESPREAD IMPACT
The widespread impacts expected from the research if the proposed study is carried out successfully,
in other words the outputs, results, and impacts to be obtained from the research, are given in the
table below.
Scientific/Academic
(Article, Conference Paper, Book Chapter, Book)
Economic/Commercial/Social
(Product, Prototype, Patent, Utility Model, Production Permit,
Variety Registration, Spin-off/Start-up Company, Audio-visual
Archive, Inventory/Database/Documentation Production, Copyrighted
Work, Media Coverage, Fair, Project Market,
Workshop, Training and similar Scientific Events, Institutions/Organizations
That Will Use the Project Results, and other
widespread impacts)
Machinery/Equipment
(Fixed Assets)
Service Procurement
Travel
TOTAL
NOTE: If you request a budget, both this table and the budget fields that appear on the TÜBİTAK Management
Information System (TYBS) application screen must be filled in. If the figures for the budget items entered in
the table above differ from those on the TYBS application screen, the data on the TYBS
screen are taken into account and cannot be changed after the application is submitted.
7. APPENDICES
APPENDIX-1: REFERENCES