
EN.580.437: Biomedical Data Design

Instructor: Adam Charles

TAs: Beepul Bharti, Jayanta Dey, Noga Mudrik, Michelle Nguyen, Yutao Tang, Zhenzhen Wang

Course Plan

BDD is a one-year course split across two semesters. The first semester focuses on 1)
getting situated with and learning about a problem, 2) developing an approach, and 3) beginning
to execute the approach. The second semester focuses on 1) executing and debugging the
approach, and 2) technical written and oral communication. Specifically:
Semester 1:
● Week 1: Intro and mini-Hackathon (Rank-order matching)
● Weeks 2-3: Group and project selection
○ Students will select from a list of projects
○ Groups will be chosen around each project (<6 members/group)
○ Rank-order voting will ensure best fit matching across projects
● Weeks 4-7: Formulate an approach to the problem
○ Deliverable: Each group will submit a short proposal outlining their project,
including:
■ Problem formulation: What is the problem you are tackling and why is it
important? What are the specific challenges?
■ The proposed method: How are you planning to solve the problem? What
methods are you going to use and how will they address the stated
problem?
■ Expected result & deliverables: What tangible evidence of your effort will
be produced? Is there a target performance? Is there a target code-base?
■ Expected pitfalls & mitigation strategies: What is most likely to go wrong?
Are there coping strategies if said issues do arise?
○ Starting in week 4, class time will include a weekly recap from each group. These
recaps will outline the group's current progress. During the project organization
weeks, the recap will consist of presenting the literature and identified
approaches, with the express goal of getting feedback from the broader group.
● Weeks 8-14: Approach execution & iteration
○ Deliverable: By the end of the semester, each group will submit a mid-project
report outlining the successes and challenges faced so far. This report should
reference the original proposal and outline:
■ Which parts of the proposal have been tried, and what were the challenges
and results?
■ What in the initial proposal did not work, and what steps were taken to
address or circumvent the problem?
■ What are the next steps, and are there any required changes to the project
deliverables in light of the work thus far?
○ After week 7, the weekly recaps will continue; however, the topic will shift from
the literature and identified approaches to current progress and the emerging
challenges or roadblocks the group is dealing with.
Semester 2:
● Weeks 1-8: Approach execution & iteration
● Weeks 9-14: Dissemination.
○ Deliverable: Each group will submit
■ A project report that will serve to present the entirety of the effort over
both semesters. This will include
● Project description and motivation
● Methods & Approach
● Results & deliverables
● Conclusions & future work
■ A finalized code-base with full commenting and documentation
■ A 30-minute oral presentation of the work
■ A poster design
Class details
- Feedback requesting material to be covered
- Compile resources
- Git organization - one repo within the org for each group
Links & references
In past years the Neural Data Design course (NDD) has built a robust set of links to best coding
practices and other important aspects of research, such as figure making. The links are
available here: https://neurodatadesign.io/links/
First in-class project: coding a rank-order assignment algorithm

Goal: The goal of the project is to write a piece of software (possible languages are
Python, Julia, and MATLAB) that matches N patients with K doctors. Each patient is allowed to
provide a ranked list of their preferences for doctors, while doctors are prohibited from
expressing preferences for patients. Thus the code should take in the following:
● A list of ranked preferences, 1 list for each patient
● A maximum capacity for each doctor (can initially assume the same capacity for all
doctors; note the total capacity should exceed the number of patients)
And the code should return:
● A list of assignments indicating which doctors are to take care of which patients

Details: For this assignment please work in groups of at most 3 individuals. Teams can choose
to implement a classical algorithm, such as the Hungarian algorithm; other algorithms
are also acceptable, including any of a number of auction or transport optimization algorithms
from the literature. The submission should include:
1. A Github repository housing all the code synced amongst the group
2. Commented and documented code, including references and an explanation of the
algorithm implemented
3. A functioning demo script (can be a Jupyter notebook)
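
To make the expected inputs and outputs concrete, below is a minimal sketch of one possible approach (not the required solution): per-doctor capacities are handled by expanding each doctor into that many identical "slots," and the resulting one-to-one problem is solved with SciPy's Hungarian algorithm implementation. The function name, the penalty for unranked doctors, and the toy example are all illustrative assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_patients(preferences, capacities):
    """preferences: one ranked list of doctor indices per patient.
    capacities: per-doctor capacity (sum must be >= number of patients).
    Returns a list giving the assigned doctor index for each patient."""
    n_patients, n_doctors = len(preferences), len(capacities)
    # Expand doctor d into capacities[d] identical slots.
    slot_owner = [d for d in range(n_doctors) for _ in range(capacities[d])]
    # Cost = position of the doctor in the patient's ranked list;
    # doctors a patient did not rank get a large penalty cost.
    cost = np.full((n_patients, len(slot_owner)), float(n_doctors + 1))
    for i, ranking in enumerate(preferences):
        for rank, doctor in enumerate(ranking):
            for j, owner in enumerate(slot_owner):
                if owner == doctor:
                    cost[i, j] = rank
    rows, cols = linear_sum_assignment(cost)  # minimizes total rank
    return [slot_owner[c] for c in cols[np.argsort(rows)]]

# Toy example: 3 patients, 2 doctors with capacity 2 each.
print(assign_patients([[0, 1], [0, 1], [1, 0]], [2, 2]))  # -> [0, 0, 1]

Duplicating doctors into slots is a standard reduction from capacitated assignment to one-to-one assignment; an auction or optimal-transport formulation would replace the linear_sum_assignment call.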

Purpose: The purpose of this project is to identify the strengths and weaknesses of the class in
scientific coding. Performance and questions during this in-class assignment will help guide the
introductory weeks and determine which additional materials might be addressed in more detail.
Written assignment #1: Project proposal

Goal: The goal of this project is to write a proposal that will serve as a blueprint for the project
that will be worked on for the remainder of the course.

Details: Each group will submit a short proposal (not to exceed 7 pages) outlining their project.
The proposal should cover all of the following:
● Problem formulation: What is the problem you are tackling and why is it important? What
are the specific challenges?
● The proposed method: How are you planning to solve the problem? What methods are
you going to use and how will they address the stated problem?
● Expected result & deliverables: What tangible evidence of your effort will be produced?
Is there a target performance? Is there a target code-base?
● Expected pitfalls & mitigation strategies: What is most likely to go wrong? Are there
coping strategies if said issues do arise?
While results figures will not yet be available, conceptual figures and example data plots (e.g.,
images of data to be classified) are welcome.

Purpose: The purpose of this project is to learn how to create a compelling narrative outlining a
proposed plan of action to perform highly technical work. This includes 1) how to develop and
motivate a set of complementary aims, 2) how to discuss feasibility and mitigation plans, and 3)
how to identify milestones of success.
Projects for Biomedical Data Design (2022)

Groups will form around the following projects:


1. Option 1: Using M-PHATE to visualize RNNs (Noga);
Option 2: M-PHATE for analysis of lottery ticket convergence and variability
2. Brain tumor segmentation OR Lung nodule detection (Yutao Tang)
3. Python codebase for popular simulation studies (Jayanta Dey)
4. Longitudinal summarization of clinical notes related to CAD (Michelle Nguyen)
5. Python implementation and optimization of a calcium imaging simulator (Adam)
6. Automated Gleason Grading (a MICCAI Challenge) (Zhenzhen Wang)

Project 1:
Option 1: M-PHATE for visualizing RNNs (Noga)

Description:

Using M-PHATE, as described here, to visualize the training evolution and the internal dynamics
of RNNs.

Steps:

● Step 1: Apply M-PHATE to RNNs to look at how RNN representations of data change
over learning. In this step the group can look at various RNN architectures (e.g., LSTMs,
GRUs, etc.) and learning methods (e.g., backpropagation through time, FORCE,
full-FORCE); see the sketch after this list.
● Step 2: Design new metrics for graph construction, suitable for RNN architectures
● Step 3: Apply to neural data (e.g., from the CRCNS data website)
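
A minimal sketch of Step 1 follows, assuming the open-source m_phate Python package, whose M_PHATE operator accepts a (timepoints, points, features) tensor per its documentation; the toy GRU regression task and random probe data are placeholders, not part of the project specification.

import numpy as np
import torch
import m_phate

rnn = torch.nn.GRU(input_size=10, hidden_size=64, batch_first=True)
head = torch.nn.Linear(64, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))
x, y = torch.randn(500, 15, 10), torch.randn(500, 1)  # toy training task
probe = torch.randn(200, 15, 10)                      # fixed probe sequences

snapshots = []                                        # one entry per epoch
for epoch in range(50):
    _, h = rnn(x)
    loss = torch.nn.functional.mse_loss(head(h[-1]), y)
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                  # hidden units are the "points":
        _, hp = rnn(probe)                 # hp: (1, 200 probes, 64 units)
        snapshots.append(hp[0].T.numpy())  # (64 units, 200 probe responses)

trace = np.stack(snapshots)                # (epochs, units, probe responses)
embedding = m_phate.M_PHATE().fit_transform(trace)    # (epochs*units, 2)

Plotting the embedding colored by epoch reproduces the style of Figure 1 from the paper; swapping the GRU for an LSTM or a FORCE-trained network only changes how snapshots is populated.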

Deliverables:

● Code that runs different RNN training procedures & M-PHATE plots (Figure 1 from the
paper) of that learning
● Code that applies the new metrics within M-PHATE, M-PHATE plots using a new metric,
and a comparison to the original metric
● Code applying the above new metrics to neural recordings. If we manage
to also get “neural learning” data, apply the model to “neural learning” as well and
submit the code. M-PHATE figures of neural learning and neural dynamics
● A short summary of the results (2 pages, including graphs, etc.)

Option 2: M-PHATE for analysis of lottery ticket convergence and variability (Noga)

Description:
Using M-PHATE, as described here, to visualize the data representation over the lottery ticket
evolution and its variability across pruning iterations. This exploration will target estimating the
robustness of the lottery ticket to hyperparameter modulations.

Steps:

● Step 1: Replicate the results and Figures 1 and 2 from the M-PHATE paper on new
networks trained on a new dataset.
● Step 2: Replicate the results of Figure 2 and the graphs from Figures 1 and 3 from the
lottery ticket hypothesis paper; each student will focus on only one of the presented
architectures (a sketch of the pruning loop follows this list).
● Step 3: Visualize each of the between-pruning phases of the lottery ticket network using
M-PHATE.
● Step 4: Compare the visualizations of different between-pruning phases and study the
variability of the lottery ticket by comparing the last image of M-PHATE from different
phases.
● Step 5 (bonus): Study the effects of modulating lottery ticket parameters on the variability
and the ability to converge to the same winning ticket.
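
A minimal sketch of the iterative magnitude-pruning loop behind the lottery ticket procedure: train, prune the smallest-magnitude surviving weights, rewind the survivors to their initial values, and repeat. The train callable is a placeholder for the group's own masked training routine, and for simplicity every parameter tensor is pruned, even though the original paper prunes weights layer-wise.

import copy
import torch

def lottery_ticket(model, train, rounds=5, p=0.2):
    init_state = copy.deepcopy(model.state_dict())        # theta_0
    masks = {n: torch.ones_like(w) for n, w in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)                               # masked training
        for n, w in model.named_parameters():
            alive = w[masks[n].bool()].abs()
            thresh = alive.quantile(p)                    # lowest p fraction
            masks[n] *= (w.abs() > thresh).float()
        with torch.no_grad():                             # rewind survivors
            for n, w in model.named_parameters():
                w.copy_(init_state[n] * masks[n])
    return masks                                          # the winning ticket

M-PHATE snapshots for Steps 3-4 can then be collected inside train, once per between-pruning phase.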

Deliverables:

● M-PHATE code applied to the new dataset and replication of Figures 1 and 2 from the
M-PHATE paper
● Code applying the lottery ticket process to a new given dataset and replicating
Figures 1 and 3, plus the results in Figure 2, from the lottery ticket paper
● M-PHATE graphs describing the network evolution between pruning phases, plus a graph
including the last image from each M-PHATE run to explore the winning ticket variability
● A metric to assess the lottery ticket variability over pruning iterations using the
M-PHATE results
● If time permits: graphs of the effect of pruning intensity, early stopping, learning rate
(and potentially more hyperparameters) on the variability and convergence to the
same ticket
● Future direction (potentially for a capstone project, etc.): study the density of different
winning tickets and the conditions under which we converge to different tickets.

Project 2a: Brain tumor segmentation (Yutao Tang)

Description: Brain tumors are among the cancers with the highest death rates; they are
challenging to diagnose, hard to treat, and resistant to conventional therapy given the difficulties
in delivering treatment to the brain. Automatic segmentation of brain tumors is important for
clinical diagnosis, treatment planning, and patient prognosis evaluation, as the boundaries
between normal tissues and tumors are usually ambiguous. Therefore, developing a tool for
automatic brain tumor segmentation would be helpful for doctors.

Deliverables: 1) Get an understanding of brain tumors and explore the datasets (e.g., BraTS [1]),
2) develop a working system for brain tumor segmentation in Python, and 3) compare the
developed algorithm to existing benchmark and state-of-the-art methods.

Reference:
[1] http://braintumorsegmentation.org/
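
Segmentation quality on BraTS-style benchmarks is commonly summarized with the Dice coefficient; a minimal NumPy sketch over binary masks follows (the eps smoothing term is a common convention, not something mandated by the challenge).

import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice overlap between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)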

Project 2b: Lung nodule detection (Yutao Tang)

Description: A lung nodule is an abnormal tissue growth that develops in the lungs. Although
most lung nodules are benign, some of them can continue to grow into lung cancer. The
size of the nodule is important: the bigger the nodule, the more frequent the follow-up checks that
patients need. The five-year survival rate for patients treated for cancerous pulmonary nodules
is around 50%, and if the nodule has a diameter less than 10 mm the rate can increase to up to
80%, which is why early detection is critical [1]. The clinical diagnosis of lung nodules is made
through CT scans, but at the early stage the nodules can be very small and may escape the
doctor's attention. Therefore, an automatic lung nodule detection tool would be helpful for doctors'
identification of the nodules and diagnosis of the disease.

Deliverables: 1) Get an understanding of lung nodules and lung cancer and explore the datasets
(e.g., LUNA [2]), 2) develop a working system for lung nodule detection in Python, and 3)
compare the developed algorithm to existing benchmark and state-of-the-art methods.

References:
[1] https://www.urmc.rochester.edu/encyclopedia/content.aspx?contenttypeid=22&contentid=pulmonarynodules#:~:text=About%2040%20percent%20of%20pulmonary,why%20early%20detection%20is%20critical.
[2] https://luna16.grand-challenge.org/
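
For evaluation, LUNA-style scoring typically counts a candidate as a hit when its predicted center falls within the radius of an annotated nodule. A minimal sketch of that matching rule (sensitivity only, without the full FROC analysis the challenge uses):

import numpy as np

def sensitivity(candidates, nodules):
    """candidates: (M, 3) array of predicted centers in mm.
    nodules: list of (center_xyz, diameter_mm) annotations.
    Returns the fraction of nodules hit by at least one candidate."""
    hits = 0
    for center, diameter in nodules:
        dists = np.linalg.norm(candidates - np.asarray(center), axis=1)
        hits += int((dists <= diameter / 2.0).any())
    return hits / len(nodules)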

Project 3: Python codebase for popular simulation studies (Jayanta Dey)

Description: Simulation studies are an important step in machine learning and hypothesis
testing research. However, there is no well-maintained, well-defined repository for generating
all of the popular simulation datasets. For example, have a look at the simulation studies
here: https://hyppo.neurodata.io/user_guide/sims.html. In this project, we would like to create a
Python package that runs all of these simulation studies and replicates some of the popular
simulation studies from the literature.

Deliverables:
1. A Python package containing documentation and tutorials on how to use the package.
2. A tutorial describing the simulation studies and insights from the experiments in popular
papers.
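
To give a flavor of what one entry in such a package could look like, here is a minimal sketch modeled loosely on the linear-dependence setting in the hyppo sims linked above; the function name, weight scheme, and noise model are illustrative assumptions.

import numpy as np

def linear(n, p, noise=0.1, seed=None):
    """Sample n points of a p-dimensional linear dependence y = x @ w + eps."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=(n, p))
    w = 1.0 / np.arange(1, p + 1)     # weights decay across dimensions
    y = x @ w + noise * rng.standard_normal(n)
    return x, y

A consistent (n, p, noise, seed) signature across all simulations is what makes such a package easy to document and test.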


Project 4: Longitudinal summarization of clinical notes related to CAD (Michelle Nguyen)

Description: Clinical notes from the electronic health record contain rich phenotypic
information; however, the glut of information can be difficult for care providers to easily access
and interpret. To reduce care provider burden, automatic text summarizers have been explored
as a solution to aggregate information from a variety of clinical notes and display the information
in an accessible way.
The 2014 i2b2/UTHealth data set contains longitudinal clinical records for three diabetic patient
groups: those who have CAD, those who develop CAD, and those who do not have CAD, each
annotated for CAD risk factors. The records contain a wide variety of notes and present
interesting NLP challenges (different tones and styles in narratives, abbreviations, long
summaries, etc.). The objective of this project is to implement and evaluate four different deep
learning models for longitudinal clinical note summarization.

Deliverables: 1) Prepare the i2b2 data set and produce working implementations of 4 clinical
NLP (cNLP) summarizers on the dataset, 2) develop evaluation metrics for clinical NLP
summarizers (quantitative and qualitative), and 3) evaluate and compare the clinical NLP
summarizers.
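
For the quantitative side of deliverable 2, summarization work commonly reports ROUGE; a minimal sketch using the rouge-score package (one common choice, not prescribed here), with invented example strings:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Patient has a history of coronary artery disease and diabetes."
generated = "History of CAD and diabetes."
print(scorer.score(reference, generated))  # precision/recall/F1 per metric

Note that n-gram overlap alone is a weak signal for clinical faithfulness, which is why the deliverable also calls for qualitative metrics.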

Project 5: Python implementation and optimization of a calcium imaging simulator (Adam)

Description: Calcium imaging is a critical technology in neuroscience, as it enables large-scale
access to neural activity at the cellular level. Validation and assessment of calcium imaging
methods, however, is hindered by the lack of available ground truth data. Recent work
developed a biophysical simulation-based approach to generating realistic ground truth
data (based on statistical models of neural anatomy and function) with which to validate
current and future approaches (optics, algorithms, etc.). The current codebase is 1) in MATLAB,
limiting access to users, 2) relatively slow at large scales, and 3) limited in its small-scale
modeling. Improving accessibility across code-bases, speed and efficiency, and scope will make
simulation-based analyses more impactful to a wider audience. This project will work on
translating the currently available software to Python and improving the code to achieve faster
runtime.
Deliverables: 1) A working implementation of the NAOMi simulation suite in Python, 2) a speed
test and run-time analysis of the code, and 3) a comparison of 3-5 publicly available algorithms
using the output of NAOMi in Python.
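
For deliverable 2, a simple harness timing each implementation across problem sizes is a reasonable starting point; in this sketch the simulate callable is a placeholder for whichever NAOMi entry point (MATLAB-via-engine or Python) the group wires in.

import statistics
import time

def benchmark(simulate, sizes, repeats=5):
    """Time simulate(size) at each problem size; report median seconds."""
    results = {}
    for size in sizes:
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            simulate(size)
            times.append(time.perf_counter() - start)
        results[size] = statistics.median(times)
    return results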

Project 6: Automated Gleason Grading (a MICCAI Challenge) (Zhenzhen Wang)

Description: Prostate cancer is one of the most common cancers in men around the world. The
Gleason grading system has remained the most powerful prognostic predictor for patients with
prostate cancer since the 1960s. The method evaluates lesion levels of prostate cancer via
microscopic inspection of stained biopsy tissue to estimate the Gleason score. However, it
is time-consuming for pathologists to identify the cellular and glandular patterns needed for
Gleason grading. Moreover, its application requires highly trained pathologists, is tedious, and
suffers from limited inter-pathologist reproducibility. With the development of digital scanners,
computer-aided diagnosis methods based on convolutional neural networks have played an
important role in medical image segmentation and detection. To help drive forward research and
innovation for automated Gleason grading in computational pathology, we organize the
Automated Gleason Grading Challenge (AGGC) 2022 [1]. The challenge requires researchers to
develop algorithms that identify distinct Gleason patterns within the H&E-stained whole slide
image dataset.
Deliverables: 1) Learn the background of prostate cancer, the Gleason system, and related work
on this problem, 2) draft a proposal describing motivation, aims, and methods, and 3) implement
the approach in Python and discuss results.
Reference:
[1] https://aggc22.grand-challenge.org/AGGC22/
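
Whole slide images are far too large to feed to a network directly, so a common first step is tiling; a minimal sketch using the openslide package (the tile size and non-overlapping grid are illustrative choices, and no tissue-vs-background filtering is shown).

import openslide

def tiles(path, size=512):
    """Yield non-overlapping RGB tiles from a whole slide image."""
    slide = openslide.OpenSlide(path)
    w, h = slide.dimensions               # full (level 0) resolution
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield slide.read_region((x, y), 0, (size, size)).convert("RGB")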

Project 7: Studying Interpretability in Machine Learning via Shapley Values

Description: Machine learning models, in particular artificial neural networks, are increasingly
used to inform decision making in high-stakes scenarios across a variety of fields, from
financial services to public safety and healthcare. While neural networks have achieved
remarkable performance in many settings, their complex nature raises concerns about their
interpretability in real-world scenarios. As a result, several a-posteriori explanation methods
have been proposed to highlight the features that influence a model's prediction. Notably, the
Shapley value, a game-theoretic quantity that satisfies several desirable properties, has
gained popularity in the machine learning explainability literature. However, one can run into
many issues when trying to use Shapley values. First, since a machine learning model takes an
input of fixed dimensionality, properly implementing Shapley-based explanations requires
“masking” features by sampling from the conditional distribution of the features. Second, for
image classification, generating Shapley values for every pixel in an
image is computationally infeasible. As a result, this project will aim to develop algorithms to
tackle these issues.

Deliverables: 1) Develop/implement conditional samplers to calculate Shapley values, 2)
implement Shapley-based explanations with respect to “super-pixels”, and 3) explore using
vision transformers to handle “masking” instead of using conditional samplers.
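
To make both issues concrete, here is a minimal Monte Carlo sketch that estimates per-feature Shapley values by sampling feature orderings; absent features are masked with values drawn from a background sample, a simple marginal stand-in for the conditional sampling deliverable 1 targets. The model interface (a callable mapping a 2D array to a 1D array of predictions) is an assumption.

import numpy as np

def shapley(model, x, background, n_perms=200, seed=0):
    """Estimate Shapley values for one input x (1D) under model."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perms):
        order = rng.permutation(d)
        z = background[rng.integers(len(background))].copy()  # all masked
        prev = model(z[None])[0]
        for j in order:                       # reveal features one at a time
            z[j] = x[j]
            curr = model(z[None])[0]
            phi[j] += curr - prev             # marginal contribution of j
            prev = curr
    return phi / n_perms

Running this per pixel is exactly what becomes infeasible for images, motivating the super-pixel grouping in deliverable 2.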
