Style and Semantics
Things
Krishnapriya Vishnubhotla (KP)
Intro: Style in NLP
● Due to:
○ Lexical choices (big words vs small words)
● Stylometry:
○ Surface features (word lengths, sentence lengths)
○ Syntactic features (function word frequencies, PoS tag frequencies, parse tree features, character trigrams)
● Problematic:
○ What should be preserved?
○ Adds to already problematic evaluation metrics
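The surface and function-word features listed under stylometry can be sketched in a few lines. This is a minimal illustration, not a standard stylometry toolkit: the function-word list, the naive sentence splitter, and the name `stylometric_features` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative function-word list (an assumption, not a standard inventory).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "that", "he", "and", "to", "it"}

def stylometric_features(text):
    """Return simple surface, function-word, and character-trigram features."""
    # Naive sentence split: whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / n_words,
        "top_char_trigrams": trigrams.most_common(3),
    }

features = stylometric_features("The dog ate the bone that he found. It was big.")
```

PoS tag frequencies and parse tree features would need a tagger or parser, so they are left out of this sketch.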
Complications
Relations
Variable / Concept
● “The dog ate the bone that he found.”
● Modified Autoencoders
Paraphrases
● Encode into two vectors
● Use both to reconstruct
● Restrict information using motivational/adversarial discriminators
(Diagram labels: Semantic z, classifier, Syntactic)
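The three bullets above (two vectors, joint reconstruction, motivational/adversarial discriminators) can be put together as one loss. The sketch below is only a forward pass with random, untrained weights, and it treats a sentence as a fixed-size input vector `x`. All sizes, weight names, and the choice of negative cross-entropy for the adversarial term are assumptions made for illustration; published models use sequence encoders/decoders and sometimes different adversarial objectives.

```python
import math
import random

rng = random.Random(0)

def linear(n_out, n_in):
    """Random weight matrix as a list of rows (stand-in for a trained layer)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

D_IN, D_SEM, D_STY, N_STYLES = 8, 4, 2, 2  # hypothetical sizes
W_sem = linear(D_SEM, D_IN)                # encoder head: semantic vector
W_sty = linear(D_STY, D_IN)                # encoder head: style vector
W_dec = linear(D_IN, D_SEM + D_STY)        # decoder sees both vectors
W_mot = linear(N_STYLES, D_STY)            # motivator: style label from style vector
W_adv = linear(N_STYLES, D_SEM)            # adversary: style label from semantic vector

def losses(x, style_label, lam=1.0):
    z_sem = matvec(W_sem, x)
    z_sty = matvec(W_sty, x)
    x_hat = matvec(W_dec, z_sem + z_sty)   # list concatenation of the two vectors
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    motiv = cross_entropy(matvec(W_mot, z_sty), style_label)
    adv = cross_entropy(matvec(W_adv, z_sem), style_label)
    # Autoencoder objective: reconstruct, keep style predictable from z_sty,
    # and make the adversary fail on z_sem (the adversary itself minimises `adv`).
    total = recon + lam * motiv - lam * adv
    return {"recon": recon, "motivational": motiv,
            "adversarial": adv, "autoencoder_total": total}

x = [rng.uniform(-1, 1) for _ in range(D_IN)]
result = losses(x, style_label=1)
```

In training, the adversary and the autoencoder would be updated in alternation, each on its own objective.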
What kinds of supervision?
● Style class labels
● Paraphrases
● Heuristic info:
○ BoW for content
● Syntax: Syntax tree features
○ Tree edit distance
Datasets
● Paraphrase datasets
● Parallel style transfer datasets
○ Formality
○ Diachronic language change
● Data-to-text datasets
○ ~Synthetic
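As an example of the "BoW for content" heuristic: a bag of content words can serve as a cheap proxy for what a sentence says, independent of how it says it. The sketch below is a minimal version under stated assumptions; the stopword list and the helper names `content_bow` and `bow_overlap` are made up for illustration.

```python
import re
from collections import Counter

# Illustrative stopword list; a real setup would use a fuller one (assumption).
STOPWORDS = {"the", "a", "an", "and", "that", "he", "she", "it", "was", "is", "of"}

def content_bow(sentence):
    """Bag of content words: lowercase tokens minus stopwords, with counts."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def bow_overlap(a, b):
    """Jaccard overlap between the content-word sets of two sentences."""
    sa, sb = set(content_bow(a)), set(content_bow(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

original = "The dog ate the bone that he found."
reworded = "That dog found a bone and ate it."
overlap = bow_overlap(original, reworded)
```

Two rewordings of the same content score high here even though word order differs, which is also the weakness of the heuristic: BoW ignores who did what to whom.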
Synthetic Dataset: PersonageNLG
Evaluation:
● Style transfer (swap variables + generate)
● Retrieval
● Prediction (kNN)
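Retrieval and kNN prediction both reduce to nearest-neighbour search in the learned latent space. A minimal sketch, assuming cosine similarity as the metric and using made-up two-dimensional "style vectors" with made-up labels:

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, vectors, k=1):
    """Indices of the k stored vectors most similar to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

def knn_predict(query, vectors, labels, k=3):
    """Majority label among the k nearest stored vectors."""
    votes = Counter(labels[i] for i in retrieve(query, vectors, k))
    return votes.most_common(1)[0][0]

# Toy style vectors with made-up labels (illustrative only).
style_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
style_labels = ["formal", "formal", "informal", "informal"]

predicted = knn_predict([0.8, 0.3], style_vectors, style_labels, k=3)
```

If the style vector predicts style labels well under kNN while the semantic vector does not, that is some evidence the two spaces are separated.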
More supervision == better representations
● Kinda boring
● Just train a separate supervised model for each end-goal?
● Style transfer:
○ Generation problems
○ Evaluation problems
● Real-world text: not so cleanly separable.
:(
What would be interesting?
● Unsupervised disentanglement?
○ beta-VAE in vision
○ At least for the synthetic dataset
● Evaluating the representations:
○ Probe for linguistic knowledge/features
○ Robust to “noise”? → domain adaptation/zero-shot prediction
● Using pre-trained models?
● (TBD) Should the latent spaces be entirely unrelated?
○ Where do style and semantics intersect?
○ What is a “latent space of sentences” anyway?
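For reference on the beta-VAE bullet above: the objective is the usual VAE loss with the KL term scaled by a factor beta, where larger beta pushes the latent dimensions toward the factorised prior. A minimal sketch of that loss for a diagonal Gaussian posterior follows; the function names and the default `beta=4.0` are illustrative assumptions, and the reconstruction loss is taken as given.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_loss, mu, log_var, beta=4.0):
    """Reconstruction loss plus a beta-weighted KL term."""
    return recon_loss + beta * gaussian_kl(mu, log_var)

loss = beta_vae_loss(recon_loss=1.0, mu=[1.0], log_var=[0.0], beta=4.0)
```

With `beta=1.0` this is the standard VAE objective. Whether the pressure from a larger beta separates style from semantics in text, as opposed to visual factors, is the open question on the slide.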