
Style, Semantics, and Other Things
Krishnapriya Vishnubhotla (KP)
Intro: Style in NLP

● Uniqueness of writing style

● Due to:
○ Lexical choices (big words vs small words)

○ Sentence structure (short and simple vs. complex with clauses)

● Stylometry (a minimal feature-extraction sketch follows the list below):
○ Surface features (word lengths, sentence lengths)

○ Lexical features (LIWC categories, number of hapax legomena)

○ Syntactic features (function word frequencies, PoS tag frequencies, parse tree features, character trigrams)

● Authorship attribution, plagiarism detection, digital forensics
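
A minimal feature-extraction sketch using only the Python standard library (the function-word list and the specific features are illustrative choices, not a standard stylometry toolkit like LIWC):

```python
import re
from collections import Counter

FUNCTION_WORDS = {"the", "a", "an", "of", "and", "or", "but", "in", "on", "to", "is"}

def style_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    return {
        # surface features
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        # lexical features
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / max(len(counts), 1),
        # a crude stand-in for syntactic features
        "function_word_ratio": sum(counts[w] for w in FUNCTION_WORDS) / max(len(words), 1),
    }

print(style_features("The dog is eating a bone. It found the bone yesterday."))
```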


Form and Meaning

● Text generation process:
○ a meaning, or content, plus
○ a form, or style
● Multiple surface realisations are possible for the same meaning
● Natural language corpora:
○ English Wikipedia vs. Simple English Wikipedia
○ Literary translations
● Closely related to: paraphrases
Paraphrases

● Paraphrase identification, generation


● Datasets: Quora Question Pairs, Microsoft Research Paraphrase Corpus, ParaNMT
● Semantic Textual Similarity (STS) tasks (a small similarity sketch follows below)
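
A small similarity sketch, assuming the sentence-transformers library and a pretrained model (the model name and the 0.7 threshold are illustrative; the cosine-similarity helper's name varies slightly across library versions):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
s1 = "How do I learn to play the guitar?"
s2 = "What is the best way to learn guitar?"
emb = model.encode([s1, s2], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"semantic similarity: {score:.2f} -> paraphrase? {score > 0.7}")
```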
NLP: Style Transfer

● Lots of work on style transfer in NLP

● “Style” → a factor of variation


○ Sentiment
○ Attributes
○ Topics

● Usually guided by the dataset used.

● Problematic:
○ What should be preserved?
○ Adds to already problematic evaluation metrics
Complications

● There are no true synonyms, only “near-synonyms”


● Changing active to passive → change of focus
● Pragmatics: viewpoint, framing, denotation, connotation, implication.
● Can draw some fuzzy boundaries between clusters of near-synonyms at a word-level
○ What about for phrases/sentences/documents?
● Style, under the literary definition: what is “lost in translation”
Meaning Representations

● Formal representation of meaning/semantics


● Lots of CL research on logical forms, compositionality
● Two relatively recent projects I came across
○ Abstract Meaning Representation (AMR)
○ Minimal Recursion Semantics (MRS)
Abstract Meaning Representation

● Rooted, directed, (edge+leaf)-labelled graph


● Uses PropBank frames
● Example: “The dog is eating a bone.”

[Figure: the AMR graph for this sentence, with variables/concepts as nodes and relations as edge labels]
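
In PENMAN notation, the AMR for this sentence can be sketched roughly as follows (using the PropBank frame eat-01; treat this as an illustrative reconstruction of the figure rather than gold annotation):

    (e / eat-01
       :ARG0 (d / dog)
       :ARG1 (b / bone))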
● “The dog ate the bone that he found.”

● Has ways to handle:


○ Coreference
○ Negation
○ Numbers/quantity
○ Names
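
For the sentence above, the coreference (“the bone that he found”) can be sketched by re-using the variable d for the dog (the frames and structure here are illustrative):

    (e / eat-01
       :ARG0 (d / dog)
       :ARG1 (b / bone
                :ARG1-of (f / find-01
                            :ARG0 d)))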
Generalisation capabilities

- The man described the mission as a disaster.
- The man’s description of the mission: disaster.
- As the man described it, the mission was a disaster.
- The man described the mission as disastrous.

● All four sentences receive the same AMR (sketched below).
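
A rough sketch of that shared AMR (essentially the classic example from the AMR literature):

    (d / describe-01
       :ARG0 (m / man)
       :ARG1 (m2 / mission)
       :ARG2 (d2 / disaster))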

● Abstracts away morphological and syntactic variations.


● But does not handle synonyms
○ “afraid” and “terrified” are treated as different concepts.
● Useful?
○ Not yet.
○ Purpose: dataset to help develop algorithms that can generate AMRs.
Minimal Recursion Semantics

● Another formalism, from the phrase structure grammar (HPSG) tradition


● More fine-grained
● Can distinguish between tense, number.
● Practical utility:
○ Has a command-line parser you can use
○ Can generate simple paraphrases
Practical Utility

● Unlikely that they can parse many real-world sentences:


○ LIT paper: successful at 19.7% of SNLI sentences

● Using AMR to detect paraphrases:


○ ~85% on the Microsoft Paraphrase Corpus

● Parsing into these representations is a separate research problem, not (yet) a tool to be used.


Back to Representation Learning

● Let us assume we have…
● Some proxy information for:
○ Form
○ Meaning

[Figure: a text t is encoded into a Form Vector and a Meaning Vector; stylistic similarity is measured in the form space, semantic similarity in the meaning space]

Neural Models

● Modified autoencoders:
○ Encode the input into two vectors (e.g., a semantic z and a syntactic z)
○ Use both vectors to reconstruct the input
○ Restrict the information in each vector using motivational/adversarial discriminators (a classifier attached to each z)

[Figure: architecture diagram showing a semantic z and a syntactic z, each feeding its own classifier, with paraphrases as one source of supervision; a minimal code sketch follows below]
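
A minimal code sketch of this kind of model, assuming a bag-of-words input and PyTorch (the layer sizes, loss weights, and the split into “meaning” and “style” vectors are illustrative, not the exact architecture discussed here):

```python
# Autoencoder that splits its latent code into a "meaning" vector and a "style"
# vector. A motivational classifier pushes style information into z_style, while
# an adversarial classifier is used to push it *out* of z_meaning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledAE(nn.Module):
    def __init__(self, vocab_size, hid=256, d_meaning=64, d_style=16, n_styles=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hid), nn.ReLU())
        self.to_meaning = nn.Linear(hid, d_meaning)      # semantic z
        self.to_style = nn.Linear(hid, d_style)          # stylistic z
        self.decoder = nn.Sequential(                    # reconstruct from both
            nn.Linear(d_meaning + d_style, hid), nn.ReLU(),
            nn.Linear(hid, vocab_size))
        self.style_classifier = nn.Linear(d_style, n_styles)   # motivational
        self.style_adversary = nn.Linear(d_meaning, n_styles)  # adversarial

    def forward(self, x):
        h = self.encoder(x)
        z_meaning, z_style = self.to_meaning(h), self.to_style(h)
        recon = self.decoder(torch.cat([z_meaning, z_style], dim=-1))
        return recon, z_meaning, z_style

def encoder_loss(model, x, style_label):
    recon, z_m, z_s = model(x)
    rec = F.mse_loss(recon, x)
    # motivational: the style vector must predict the style label
    mot = F.cross_entropy(model.style_classifier(z_s), style_label)
    # adversarial: the encoder is rewarded when the adversary fails on z_meaning
    # (the adversary itself is trained in a separate, alternating step)
    adv = -F.cross_entropy(model.style_adversary(z_m), style_label)
    return rec + mot + 0.1 * adv

model = DisentangledAE(vocab_size=5000)
x = torch.rand(8, 5000)               # e.g., bag-of-words vectors
y = torch.randint(0, 2, (8,))         # style class labels
print(encoder_loss(model, x, y).item())
```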
What kinds of supervision?

● Style class labels
● Paraphrases
● Heuristic info:
○ BoW for content
● Syntax: syntax tree features
○ Tree edit distance

Datasets
● Paraphrase datasets
● Parallel style transfer datasets
○ Formality
○ Diachronic language change
● Data-to-text datasets
○ ~Synthetic
Synthetic Dataset: PersonageNLG

● Personality model might be questionable


● BUT gives us two neat dimensions of variation.
All the losses later….

Evaluation:
● Style transfer (swap the latent vectors between examples and generate)
● Retrieval
● Prediction (kNN; a small sketch follows below)
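
A minimal sketch of the kNN-prediction evaluation, assuming we already have style vectors and labels as NumPy arrays (the split, k, and placeholder data are illustrative):

```python
# Fit a k-nearest-neighbour classifier on the learned style vectors and check
# whether the style labels can be recovered from them.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

style_vectors = np.random.randn(1000, 16)           # placeholder representations
style_labels = np.random.randint(0, 2, size=1000)   # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(style_vectors, style_labels, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("kNN style-prediction accuracy:", knn.score(X_te, y_te))
```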
More supervision == better representations

● Kinda boring
● Just train a separate supervised model for each end-goal?
● Style transfer:
○ Generation problems
○ Evaluation problems
● Real-world text: not so cleanly separable.
:(
What would be interesting?

● Unsupervised disentanglement?
○ beta-VAE in vision (a loss sketch follows at the end of this slide)
○ At least for the synthetic dataset
● Evaluating the representations:
○ Probe for linguistic knowledge/features
○ Robust to “noise”? → domain adaptation/zero-shot prediction
● Using pre-trained models?
● (TBD) Should the latent spaces be entirely unrelated?
○ Where do style and semantics intersect?
○ What is a “latent space of sentences” anyway?
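
A minimal sketch of the beta-VAE objective mentioned above: the standard VAE loss with the KL term up-weighted by beta > 1 to encourage disentangled latents (the reconstruction term and beta value here are illustrative):

```python
# beta-VAE loss: reconstruction + beta * KL(q(z|x) || N(0, I)),
# assuming a Gaussian encoder that outputs mu and logvar.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```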
