
Visual Sentiment Analysis Based on Objective Text Description of Images


Introduction
Visual Sentiment Analysis aims to estimate the polarity of the sentiment evoked by images in terms of
positive or negative sentiment. The paper observes that most state-of-the-art works exploit the text
associated with a social post as provided by the user. However, such textual data is typically noisy due
to the subjectivity of the user, who often includes text intended to maximize the diffusion of the social
post rather than to describe the image. In this paper, an Objective Text description of each image is
automatically extracted from the visual content and employed in place of the classic Subjective Text
provided by users. To see why this matters, consider the sentence "The dog is barking". The ambiguity
here is that the word 'barking' does not tell us whether the dog is happy to see the person or is being
provoked by the person; from the text alone, the sentiment cannot be identified. Another example is the
single word 'No!': we cannot tell whether the speaker is disagreeing with someone or exclaiming in
grief. To overcome such ambiguity, the project builds a model that takes not only the words but also the
visual content as input to describe the sentiment.

The proposed approach exploits one visual view and three textual features based on the objective text
extracted from the images, namely the Objective Textual (OT), Objective Sentiment (OS) and Objective
Revisited (OR) features. According to this approach, a given text is represented as a feature vector
whose elements are obtained by multiplying the sentiment scores of the contained words by their
frequencies. The sentiment scores are taken from SentiWordNet, and a re-ranking of such scores is
performed for the words whose neutral score is higher than both the negative and the positive ones. All
the text-based features considered in the proposed approach share the same preprocessing stage applied
to the text extracted with the deep learning architecture. The OR feature is hence a vector $W$ in which
each element $W_i$ is defined as follows:
$$
W_i = \begin{cases}
TF_i \times posW_i, & \text{if } w_i \in \text{positive words} \\
TF_i \times negW_i, & \text{if } w_i \in \text{negative words} \\
0, & \text{otherwise}
\end{cases}
$$
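As a rough illustration, the Python sketch below computes an OR-style vector using NLTK's SentiWordNet interface. It is a minimal sketch under stated assumptions: scores are averaged over a word's synsets, the neutral-score re-ranking step is omitted, and the `or_feature` function and its `vocabulary` argument are illustrative names, not the paper's code.

```python
# Minimal sketch (not the paper's implementation): an OR-style feature vector
# built from term frequencies and SentiWordNet scores. Requires the NLTK
# corpora: nltk.download('wordnet'); nltk.download('sentiwordnet').
from collections import Counter
from nltk.corpus import sentiwordnet as swn

def or_feature(tokens, vocabulary):
    """Return a vector W where W_i = TF_i * posW_i for positive words,
    TF_i * negW_i for negative words, and 0 otherwise."""
    tf = Counter(tokens)
    w = []
    for word in vocabulary:
        synsets = list(swn.senti_synsets(word))
        if not synsets:
            w.append(0.0)
            continue
        # Assumption: average the sentiment scores over all synsets of the word.
        pos = sum(s.pos_score() for s in synsets) / len(synsets)
        neg = sum(s.neg_score() for s in synsets) / len(synsets)
        if pos > neg:
            w.append(tf[word] * pos)   # word treated as positive
        elif neg > pos:
            w.append(tf[word] * neg)   # word treated as negative
        else:
            w.append(0.0)              # neutral word contributes nothing
    return w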

Proposed Model:

Loss Function:
The loss of each sub-network is measured by cross entropy:
$$L_{a/n} = L(z, t; I, W) = -\log P(z = t \mid I, W)$$

$$P(z = t \mid I, W) = \operatorname{softmax}(z_t) = \frac{\exp(z_t)}{\sum_{i=1}^{k} \exp(z_i)}$$
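A minimal NumPy sketch of this loss, assuming `z` is the vector of k class logits produced by a sub-network and `t` is the index of the target class:

```python
# Minimal sketch of the per-sub-network cross-entropy loss defined above.
import numpy as np

def cross_entropy_loss(z, t):
    z = z - z.max()                    # subtract the max for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax: P(z = i | I, W) for each class i
    return -np.log(p[t])               # L_{a/n} = -log P(z = t | I, W)

# Example: three classes, target class 0
print(cross_entropy_loss(np.array([2.0, 0.5, -1.0]), 0))
```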

Algorithm:

Data Sets used:


SentiBank [Borth et al., 2013]. SentiBank is a widely used dataset containing about half a million Flickr
images collected with designed adjective-noun pairs (ANPs) as queries. The sentiment label of each image
is determined by the sentiment polarity of the corresponding ANP. This dataset is used for weak
supervision, since the ANP labels it provides are noisy. The training/testing split is 90% and 10%,
respectively.
Twitter "five-agree" [You et al., 2015b]. This dataset is more challenging than SentiBank; it contains
581 positive and 301 negative samples, each labeled by at least 5 AMT workers. The training/testing split
is 80% and 20%, respectively. The network is pre-trained on SentiBank and fine-tuned on Twitter. Mutual
supervision is used here, since ANP labels are unavailable.
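A small sketch of how the stated splits might be set up with scikit-learn; the sample lists here are placeholders standing in for the actual dataset loaders, which the paper does not detail.

```python
# Hypothetical split setup mirroring the ratios above; the sample lists are
# placeholders, not the real image paths and labels.
from sklearn.model_selection import train_test_split

sentibank_samples = [f"flickr_{i}.jpg" for i in range(1000)]  # placeholder list
twitter_samples = [f"tweet_{i}.jpg" for i in range(882)]      # 581 pos + 301 neg

sb_train, sb_test = train_test_split(sentibank_samples, test_size=0.10, random_state=0)
tw_train, tw_test = train_test_split(twitter_samples, test_size=0.20, random_state=0)
```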

Comparison with past works:


Conclusions:
In this paper, we propose DCAN, a novel CNN structure with deep coupled adjective and noun networks for
visual sentiment analysis. The network can effectively learn middle-level sentiment features from noisy
web images with ANP labels, and to the best of our knowledge it achieves the best results on both the
SentiBank and Twitter datasets. Since the ANP labels are human-designed, future work will focus on
automatically discovering robust middle-level representations to guide the learning of sentiment.
