Annex-1 2209-A Research Proposal Form 28.09.2022

TÜBİTAK 2209-A UNIVERSITY STUDENTS RESEARCH PROJECTS SUPPORT PROGRAM

The application form is expected to be prepared in Arial 9-point type, taking into account the explanations
given under each heading, and not to exceed 20 pages in total, excluding annexes (there is no lower
limit). The evaluation will be based on the research proposal's original value, method, management, and
broad impact.
RESEARCH PROPOSAL FORM
Year 2023
2nd Term Application
2209/A UNIVERSITY STUDENTS RESEARCH PROJECTS SUPPORT PROGRAM
A. GENERAL INFORMATION
Applicant's Name and Surname: Arciel Aliognis Baez Zamora

Title of the Research Proposal: XXX
Advisor's Name and Surname: Doç. Dr. Emine Baş ??? (Should the title "Doç. Dr." be included?)
Institution Where the Research Will Be Conducted: Konya Teknik Üniversitesi ???
ABSTRACT
The Turkish abstract is expected to cover (a) the original value, (b) the method, (c) the management, and
(d) the broad impact of the research proposal. It is recommended that this section be written last.
Abstract
It must be at least 25 and at most 450 words.
Keywords: ??? (Is a keyword only a single word?)
1. ORIGINAL VALUE (ÖZGÜN DEĞER)
1.1. Importance of the Topic, Original Value of the Research Proposal, and Research Question/Hypothesis
The scope, limits, and importance of the topic addressed in the research proposal are explained through a
critical evaluation of the literature as well as qualitative or quantitative data.
When describing the original value, the proposal explains, with citations to the literature, its scientific value,
distinctiveness, and novelty; which gap it will fill and how, or which problem it will solve and how; and/or what
original conceptual, theoretical, and/or methodological contributions it will make to the relevant field(s) of
science or technology.
The research question and, if any, the hypothesis of the proposed study, or the problem(s) it addresses, are
stated clearly.
In Natural Language Processing (NLP), extensive progress has been made for widely used
languages such as English, Spanish, French, German, and Mandarin Chinese. However, a
noticeable gap persists in the literature, datasets, and models dedicated to agglutinative
languages, particularly Turkish. Commendable efforts have been made in Turkish NLP,
including datasets made available by Türkiye Açık Kaynak Platformu and by contributors on
platforms like Kaggle, and Turkish versions of models such as BERT, ALBERT, ELECTRA,
and DistilBERT. Even so, significant opportunities for innovation and advancement remain.
This project is distinctive in that it does not aim to rival existing efforts; it aims to contribute
to and enrich the broader landscape of Turkish NLP. The data produced, the models trained,
and the literature generated within this project will serve its own objectives and will also be
available as resources for fellow researchers and related projects in the field. This
collaborative orientation adds to the value of the research.
Moreover, the importance of this research goes beyond its contribution to existing initiatives.
Its ultimate aspiration is a web application of wide-ranging utility for public and private
institutions, government agencies, and the general public. The application will monitor social
media, blogs, and websites in real time to identify derogatory language. This approach helps
counter the adverse effects of harmful content online, contributing to a safer and more
constructive digital environment.
Crucially, the value of this project does not hinge solely on the end product. The data created
during this research will serve as a foundational resource for subsequent investigations, both
by the researcher and by the broader community, and so supports the continued
development of solutions in Turkish Natural Language Processing.
1.2. Aim and Objectives
The aim and objectives of the research proposal are written so as to be clear, measurable, realistic, and
attainable within the research period.
1. Main Objective: Our research project aims to push the boundaries of Turkish Natural Language
Processing, bridging the gap between NLP in Turkish and more widely used languages. This will be achieved by
creating an extensive Turkish language dataset and training a model to effectively detect harmful language in
social media. The ultimate goal is to make this model accessible via a web application, benefitting public and
private institutions and the general population.
2. Specific Objectives:
a. Extensive Dataset Creation: The first objective is to construct a comprehensive Turkish language
dataset. This dataset will serve not only our project but also the broader research community working in the field
of Turkish NLP.
b. Model Development: The second objective is to build a high-accuracy model that can proficiently identify
harmful posts and comments in Turkish social media. This model will be trained using the extensive dataset we
create.
c. Web Application: The third goal is to develop a web application that integrates our model and utilizes
APIs to access social media, allowing real-time monitoring and mitigation of harmful language.
3. Alignment with Original Value: These objectives align with the originality and significance of our project, as
discussed in the "Özgün Değer" (Original Value) section. They represent concrete steps toward addressing the
challenges and gaps in Turkish NLP.
4. Realistic and Attainable: With a project timeline of 6 months and a maximum period of 12 months, we
consider these objectives realistic and attainable within the scope of our research project.
5. Focus on Dataset Quality: A substantial portion of our project timeline will be dedicated to creating a large,
high-quality dataset. This process involves gathering and curating various data sources, unifying them into a
single dataset, and labeling the data through the use of pretrained models and LLMs.
6. Model Training and Fine-Tuning: Following dataset creation, our efforts will focus on training different
models, creating various features, and fine-tuning the model to ensure the highest possible accuracy in
identifying harmful language.
7. Web Application Development: The final step involves the creation of a web application designed to access
and analyze social media data for harmful language in real time.
8. Evaluation Methods: While evaluation is discussed in detail in the "Yöntem" (Method) section, we will use
cross-validation and time-based data splitting to assess model performance. Cross-validation, a widely
accepted evaluation procedure, will help gauge model performance on currently available data. The time-based
approach will evaluate the model's ability to predict future harmful content.
9. Real-World Impact: Our project's potential impact is significant, not only for the Turkish-speaking world but
also for Turkic languages (e.g., Azerbaijani and Turkmen). Even a single achieved objective can have
far-reaching effects; for example, a well-curated dataset could support future solutions in Turkish and Turkic NLP.
10. Modern Methodology: We will employ contemporary approaches that are widely used for similar problems,
so that our results are realistically achievable through proven, state-of-the-art methods.
2. METHOD (YÖNTEM)
The methods and research techniques to be applied in the research proposal (including data collection tools
and analysis methods) are explained with citations to the relevant literature. It is shown that the methods and
techniques are suitable for reaching the aims and objectives envisaged in the study.
The method section should cover the research design, the dependent and independent variables, and the
statistical methods. If any preliminary or feasibility study has been carried out, it is expected to be presented.
The methods presented in the research proposal should be linked to the work packages.
The research methodology for this project comprises several phases, each designed to serve the primary
objectives of the project.
Data Collection and Preprocessing:
• Initial Data Collection: We begin by collecting freely available, already labeled data from sources such
as Türkiye Açık Kaynak Platformu, Kaggle, Hugging Face, GitHub, and others. These labeled datasets
form the foundation for the subsequent steps.
• Exploratory Data Analysis: We analyze the collected dataset to understand its contents, and the
insights from this analysis guide the subsequent preprocessing steps.
• Model Baseline: To establish a reference point for evaluating the impact of specific preprocessing steps
and overall model performance, a baseline model is created using the dataset's core features.
• TurkishNLP: We preprocess the data with the TurkishNLP library, which includes components for text
normalization, stemming, and tokenization. This makes the data more homogeneous and is expected
to improve the quality of the Turkish language data.
• Preprocessing Micro Service: To give users the flexibility to customize data preprocessing according to
their preferences, a FastAPI service is developed for parametric data preprocessing using the
TurkishNLP library.
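As an illustration of what "parametric" preprocessing could look like, the sketch below applies optional cleaning steps controlled by a settings object. It uses only the Python standard library. The names `PreprocessParams`, `preprocess`, and `turkish_lower`, and the individual switches, are hypothetical and do not reflect the TurkishNLP API; in the planned service a function of this kind would presumably sit behind a FastAPI endpoint, with the library's normalization steps added as further switches.

```python
import re
from dataclasses import dataclass

# Hypothetical parameter set: each flag switches one cleaning step on or off.
@dataclass
class PreprocessParams:
    lowercase: bool = True
    strip_urls: bool = True
    strip_mentions: bool = True
    collapse_whitespace: bool = True

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def turkish_lower(text: str) -> str:
    # str.lower() maps "I" to "i", but Turkish lowercases "I" to "ı" and
    # "İ" to "i", so these two letters are handled before the generic call.
    return text.replace("I", "ı").replace("İ", "i").lower()

def preprocess(text: str, params: PreprocessParams) -> str:
    """Apply the enabled cleaning steps in a fixed order."""
    if params.strip_urls:
        text = URL_RE.sub(" ", text)
    if params.strip_mentions:
        text = MENTION_RE.sub(" ", text)
    if params.lowercase:
        text = turkish_lower(text)
    if params.collapse_whitespace:
        text = " ".join(text.split())
    return text
```

Keeping every step behind a flag is what makes the later parameter search possible: the same function can be run with different settings and the resulting model scores compared.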
Model Development and Enhancement:
• Preprocessing Micro Service, Parameter Tuning: Through experimentation, we identify the optimal
data preprocessing steps for the shared dataset, refining the data to enhance model performance.
• SHAP Analysis: SHAP analysis shows which features drive the model's classification decisions. While
it cannot tell us whether additional data is needed, it uncovers the factors influencing decision-making
and highlights areas where the model can benefit from improved feature engineering.
Data Augmentation and Enrichment:
This phase focuses on sourcing Turkish text from various social media platforms, including Twitter, Facebook,
Instagram, Reddit, and others, using their APIs to gather a diverse range of samples. These datasets are
merged into a single, comprehensive, labeled dataset.
1. Data Collection from Various Social Media Platforms: We aggregate data from various social media
sources, each with its own format and structure, so preprocessing and data normalization are
essential steps.
2. Labeling with Pretrained Models: To label this extensive dataset, we use pretrained language models
such as BERT, ALBERT, ELECTRA, DistilBERT, GPT, Bard, and others. These models help identify
harmful language, sentiment, and other linguistic attributes in the text.
3. Incorporating Few-Shot Learning: In addition to pretrained models, we employ few-shot learning
techniques to enhance the model's ability to target specific data effectively.
4. Creation of a Comprehensive Dataset: The labeled data is merged into a unified, extensive dataset.
This dataset serves as a resource for a broad user base, including researchers, educators,
businesses, and the general public.
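One way to combine the outputs of several pretrained models into a single label is a majority vote with an agreement threshold, where samples on which the models disagree are set aside for manual review. The sketch below illustrates that idea only; the proposal does not specify how model outputs will be combined. The three labelers are keyword stand-ins for real model calls, and `vote_label`, `min_agreement`, and the placeholder token `badword` are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Labeler = Callable[[str], str]

def vote_label(text: str, labelers: List[Labeler],
               min_agreement: float = 0.66) -> Tuple[Optional[str], float]:
    """Return (label, agreement); label is None when agreement is too low."""
    votes = Counter(labeler(text) for labeler in labelers)
    label, count = votes.most_common(1)[0]
    agreement = count / len(labelers)
    if agreement < min_agreement:
        return None, agreement  # candidate for manual review
    return label, agreement

# Stand-in labelers: keyword rules used only to make the sketch runnable.
def labeler_a(text: str) -> str:
    return "harmful" if "badword" in text else "neutral"

def labeler_b(text: str) -> str:
    return "harmful" if "badword" in text or "slur" in text else "neutral"

def labeler_c(text: str) -> str:
    return "neutral"

labelers = [labeler_a, labeler_b, labeler_c]
```

Calling `vote_label("clean text", labelers)` yields a unanimous `"neutral"`, while raising `min_agreement` to 0.9 makes a two-out-of-three vote return `None` so the sample can be checked by hand.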
Broad Accessibility:
Our commitment extends to making this comprehensive dataset publicly accessible to corporate and private
organizations, researchers, professors, entrepreneurs, and anyone interested in advancing natural language
processing in Turkish. By providing such a dataset, we aim to support the growth of the Turkish NLP
community and the solutions it develops.
Web Application:
To bring our developed model and insights to a diverse audience, we will create a user-friendly web application.
This application empowers users to make predictions, access model insights, and seamlessly interact with the
model's capabilities for diverse applications.
Multilabel Model Development:
Building upon the findings from the previous phases, we develop and refine multilabel models, with the
specific target metrics settled during the development phase.
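The work plan names a macro-averaged F1 score as the success criterion for model development. For a multilabel problem, macro F1 is the unweighted mean of the per-label F1 scores, so rare labels count as much as frequent ones. The sketch below computes it from sets of true and predicted labels; the function name `macro_f1` and the label names in the usage note are hypothetical.

```python
from typing import List, Set

def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]],
             labels: List[str]) -> float:
    """Unweighted mean of per-label F1 over multilabel predictions."""
    per_label = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Convention chosen here: a label absent from both truth and
        # predictions scores 0.0 instead of raising a division error.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)
```

For example, with true labels `[{"insult"}, {"insult", "threat"}, set()]` and predictions `[{"insult"}, {"insult"}, {"threat"}]`, the per-label scores are 1.0 for `insult` and 0.0 for `threat`, giving a macro F1 of 0.5.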
Deployment and Application:
• TDDI Model Service: We create a FastAPI service for model deployment, enabling users to make
predictions using the developed model.
• Social Content Analysis Application: Our application is dedicated to analyzing content on social media
platforms, encompassing keyword tracking, hashtag analysis, and brand mention monitoring. It offers
invaluable insights to businesses, aiding them in refining marketing strategies and enhancing their
online presence.
Evaluation Metrics:
• Cross-Validation: Cross-validation is used to assess the model's overall performance. Testing the
model on several different data subsets reduces the risk that a single favorable split overstates
performance. However, standard cross-validation does not account for the temporal dimension of the
data, which makes it less suitable for time series problems.
• Time-Based Train-Test Split (TBTTS): For time series problems, a time-based train-test split is used so
that the model is evaluated on unseen, future data and the temporal structure of the data is preserved.
Nevertheless, repeatedly testing against the same single test set carries a risk of overfitting to that set.
• Mixed Approach: We adopt a mixed evaluation approach that draws on the strengths of both.
Cross-validation provides insight into overall model performance, while TBTTS evaluates the model's
predictive capability on future data.
This methodology is intended to keep the project adaptable to real-world applications.
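The two splitting strategies described in the evaluation bullets can be sketched with plain index lists. This is a minimal illustration, not the project's evaluation code: `k_fold_indices` and `time_based_split` are hypothetical helper names, the folds are contiguous (in practice the samples would usually be shuffled first), and timestamps are assumed to be plain numbers.

```python
from typing import Iterator, List, Sequence, Tuple

def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    # The first n % k folds get one extra sample so all n are covered.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def time_based_split(timestamps: Sequence[float],
                     cutoff: float) -> Tuple[List[int], List[int]]:
    """Train on samples strictly before the cutoff, test on the rest."""
    train = [i for i, t in enumerate(timestamps) if t < cutoff]
    test = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train, test
```

In the mixed approach, the k-fold splits would be drawn only from the samples before the cutoff, and the samples after the cutoff would be held back as the "future" test set.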
3 PROJECT MANAGEMENT
3.1 Work-Time Schedule
The main work packages and their objectives, the period in which each work package will be carried out, its success criterion, and its contribution to the success of the research are
given by filling in the "Work-Time Schedule". Literature review, preparation of progress and final reports, sharing of research results, article writing, and material procurement
should not be shown as separate work packages.
For the success criterion, explain which criteria each work package must meet to be considered successful. The success criterion is stated with quantitative or qualitative
measures (statement, number, percentage, etc.) so as to be measurable and traceable.
WORK-TIME SCHEDULE (*)
WP No | Work Package Name and Objectives | To Be Carried Out By | Time Period (..-.. Month) | Success Criterion and Contribution to the Success of the Project
1 | Literature Review: establishing a theoretical foundation | Büşra Kaya | 1 | Analysis and synthesis of relevant scientific literature
2 | Data Collection and Preparation: providing foundational data for the data-driven model | Arciel Baez | 2-5 | Sourcing and preparing a high-quality dataset
3 | Model Development: high-performance multi-label classification model | Arciel Baez and Büşra Kaya | 6-7 | Achieving an F-1 Macro score of 97% or higher on future data
4 | Web Application Development: accessibility of the model to a wide audience | Arciel Baez | 8 | Launching a user-friendly and seamless web application
5 | Dissemination of Results: sharing research results with the scientific and entrepreneur community | Büşra Kaya | 9 | Presenting research findings in at least two conferences and publishing one scientific article
(*) Rows and columns in the schedule may be expanded and duplicated as needed.
3.2 Risk Management
The risks that may adversely affect the success of the research, and the measures (Plan B) to be taken to
ensure the research is carried out successfully if these risks are encountered, are outlined in the Risk
Management Table below, with the relevant work packages indicated. Implementing a Plan B should not cause
a deviation from the main objectives of the research.
RISK MANAGEMENT TABLE (*)

WP No: 1
Most Important Risk: Difficulty obtaining data from social media platforms due to strict usage policies and
access limitations.
Risk Management (Plan B):
• Diversify Data Sources: Acquire data from various social media platforms to reduce reliance on a
single source.
• Scheduled Data Gathering: Plan data collection during low-traffic periods to minimize the risk of
account suspension.
• Platform Policy Compliance: Understand and adhere to platform terms of service and data usage
policies.
• Rate Limiting: Implement rate limiting to stay within platform-defined usage thresholds.

WP No: 2
Most Important Risk: Ineffective dissemination of research findings that could limit the project's impact.
Risk Management (Plan B): Audience-specific Platforms: Choose appropriate platforms for specific target
audiences:
• Scientific articles for the academic community.
• GitHub, HuggingFace, and Kaggle for the development community.
• Event participation for the entrepreneur community (private and public).
• Various social media platforms for the general public.
(*) Rows in the table may be expanded and duplicated as needed.
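The rate-limiting measure in the risk table can be illustrated with a small sliding-window limiter that refuses a request once the allowed number of calls in the current window has been used. This is a sketch under assumptions: the class name `RateLimiter`, its parameters, and the numbers in the usage note are hypothetical, and the real thresholds would have to come from each platform's published API terms.

```python
import time
from typing import Callable, List

class RateLimiter:
    """Allow at most max_calls requests per period seconds (sliding window)."""

    def __init__(self, max_calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested
        self.calls: List[float] = []  # timestamps of recent allowed calls

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have left the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
```

A collector would call `try_acquire()` before each API request and wait or skip when it returns `False`; for instance, `RateLimiter(max_calls=2, period=10.0)` permits two requests and refuses a third until ten seconds have passed since the first.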
3.3. Research Facilities
This section lists the infrastructure/equipment (laboratory, vehicle, machinery-equipment, etc.) that is available
at the institutions where the project will be carried out and that will be used in the project.
RESEARCH FACILITIES TABLE (*)
Type and Model of Infrastructure/Equipment Available at the Institution (Laboratory, Vehicle, Machinery-Equipment, etc.) | Purpose of Use in the Project
(*) Rows in the table may be expanded and duplicated as needed.
4. BROAD IMPACT
If the proposed study is carried out successfully, the broad impacts foreseen and expected from the
research, in other words the outputs, results, and impacts to be obtained from it, are given in the table
below.
TABLE OF BROAD IMPACT EXPECTED FROM THE RESEARCH PROPOSAL
Types of Broad Impact | Outputs, Results, and Impacts Expected from the Proposed Research
Scientific/Academic (Article, Conference Paper, Book Chapter, Book) |
Economic/Commercial/Social (Product, Prototype, Patent, Utility Model, Production Permit, Variety Registration, Spin-off/Start-up Company, Audio/Visual Archive, Inventory/Database/Documentation Production, Copyrighted Work, Media Coverage, Fair, Project Market, Workshop, Training, or similar Scientific Event, Institution/Organization That Will Use the Project Results, and other broad impacts) |
Researcher Training and Creation of New Project(s) (Master's/Doctoral Thesis, New National/International Project) |
5. BUDGET REQUEST SCHEDULE
Budget Type | Requested Budget Amount (TL) | Justification for Request
Consumables | |
Machinery/Equipment (Fixed Asset) | |
Service Procurement | |
Transportation | |
TOTAL | |
NOTE: If you have a budget request, both this table and the budget fields that appear on the TÜBİTAK Yönetim
Bilgi Sistemi (TYBS) application screen must be filled in. If the figures entered for the budget items in the table
above differ from those on the TYBS application screen, the data on the TYBS screen are taken into account
and cannot be changed after the application.
6. OTHER MATTERS YOU WISH TO MENTION

Only information/data (graphs, tables, etc.) that can contribute to the evaluation of the research proposal may be added.
7. ANNEXES
ANNEX-1: REFERENCES