
CS698V/CS779: Statistical Natural Language Processing

Course Handout 1

Course Description:
Natural language (NL) refers to the language spoken/written by humans. NL is the primary mode of
communication for humans. With the growth of the world wide web, data in the form of text has grown
exponentially. This growth calls for algorithms and techniques for processing natural language, both for
automation and for the development of intelligent machines. This course will primarily focus on understanding
and developing linguistic techniques, statistical learning algorithms and models for processing language.
We will take a statistical approach to natural language processing, wherein we will learn how one
could develop natural language understanding models from statistical regularities in large corpora of natural
language texts while leveraging linguistic theories.
CS779 is a research-project-based course: participants are required to work on open, unsolved research
problems in NLP, and consequently considerable effort is expected from each participant. As in the previous
offering of the course, work on the project can possibly lead to a publication.

Pre-requisites:
Must: Introduction to Machine Learning (CS771) or equivalent course, Proficiency in Linear Algebra,
Probability and Statistics, Proficiency in Python Programming
Desirable: Probabilistic Machine Learning (CS772), Topics in Probabilistic Modeling and Inference (CS775),
Deep Learning for Computer Vision (CS776)

Course Instructor: Dr. Ashutosh Modi


Course TAs:
Samik Some (Email: samik@cse.iitk.ac.in)
Shubham Kumar Nigam (Email: sknigam@cse.iitk.ac.in)
Tushar Shandhilya (Email: stushar@cse.iitk.ac.in)
Karishma Satchidanand Laud (Email: kslaud@iitk.ac.in)
Gargi Singh (Email: sgargi@cse.iitk.ac.in)
Chayan Dhaddha (Email: cdhaddha@cse.iitk.ac.in)
Ashwani Bhat (Email: bashwani@cse.iitk.ac.in)

Course Email: To communicate with the instructor, please do not email the instructor directly
(such emails will most likely end up in spam); use the course email instead:
nlp.course.iitk@gmail.com

Course Webpage: https://ashutosh-modi.github.io/teaching/CS779.html

Weekly Meeting Session: Monday 3:30PM to 5PM

Meeting Platform: MS Teams


If you do not have an MS Teams account, please create one via the IITK subscription. To get an IITK
subscription, please fill in this form:
https://web.iitk.ac.in/ccnew/Office365/Office_365_subscription_at_IITK.htm

You will need to log into the IITK network via VPN to fill in the form. Once you have the account, please
contact the TAs to be added to the Teams channel.
A separate team/channel (CS779: Statistical Natural Language Processing) has been set up for the course;
all announcements will be made on this channel.

Course Contents:

1. Introduction to Natural Language (NL): why is it hard to process NL, linguistics fundamentals, etc.
2. Language Models: n-grams, smoothing, class-based, Brown clustering
3. Sequence Labeling: HMM, MaxEnt, CRFs, related applications of these models, e.g., part-of-speech
tagging, etc.
4. Parsing: CFG, Lexicalized CFG, PCFGs, Dependency parsing
5. Applications: Named Entity Recognition, Coreference Resolution, text classification, toolkits, e.g.,
spaCy, etc.
6. Distributional Semantics: distributional hypothesis, vector space models, etc.
7. Distributed Representations: Neural Networks (NN), Backpropagation, Softmax, Hierarchical Softmax
8. Word Vectors: Feedforward NN, Word2Vec, GloVe, Contextualization (ELMo etc.), Subword
information (fastText, etc.)
9. Deep Models: RNNs, LSTMs, Attention, CNNs, applications in language, etc.
10. Sequence to Sequence models: machine translation and other applications
11. Transformers: BERT, transfer learning and applications

References: There are no specific references; this course gleans information from a variety of sources such as
books, research papers, other courses, etc. Relevant references will be suggested in the lectures. Some
frequently used references are as follows:

1. Speech and Language Processing, Daniel Jurafsky, James H. Martin
2. Foundations of Statistical Natural Language Processing, Christopher D. Manning, Hinrich Schütze
3. Introduction to Natural Language Processing, Jacob Eisenstein
4. Natural Language Understanding, James Allen

Grading: This is a research-project-oriented course, and the project carries the maximum weightage. Given
that the course is going to be online, all exams will be conducted online. The weightage for the different
components is as follows. Please note that this is tentative (due to COVID uncertainties and factors beyond
the instructor's control) and the weightage might change.

Quizzes: 35%
Scribe Notes and Cheat Sheets: 5%
Project: 60%
Mid-Sem Exam: Project paper and presentation
End-Sem Exam: Project paper and presentation
