Socialized Mobile Photography: Learning To Photograph With Social Context Via Mobile Devices

Article in IEEE Transactions on Multimedia · January 2014
DOI: 10.1109/TMM.2013.2283468

Authors: Wenyuan Yin (University at Buffalo, The State University of New York), Tao Mei (Microsoft), Chang-Wen Chen (University at Buffalo, The State University of New York), and Shipeng Li (Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS))

All content following this page was uploaded by Chang-Wen Chen on 23 November 2014.


184 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

Socialized Mobile Photography: Learning to Photograph With Social Context via Mobile Devices

Wenyuan Yin, Student Member, IEEE, Tao Mei, Senior Member, IEEE, Chang Wen Chen, Fellow, IEEE, and Shipeng Li, Fellow, IEEE

Abstract—The popularity of mobile devices equipped with various cameras has revolutionized modern photography. People are able to take photos and share their experiences anytime and anywhere. However, taking a high quality photograph via a mobile device remains a challenge for mobile users. In this paper we investigate a photography model to assist mobile users in capturing high quality photos by using both the rich context available from mobile devices and crowdsourced social media on the Web. The photography model is learned from community-contributed images on the Web, and depends on the user's social context. The context includes the user's current geo-location, time (i.e., time of the day), and weather (e.g., clear, cloudy, foggy, etc.). Given a wide view of a scene, our socialized mobile photography system is able to suggest the optimal view enclosure (composition) and appropriate camera parameters (aperture, ISO, and exposure time). Extensive experiments have been performed for eight well-known hot spot landmark locations where sufficient context-tagged photos can be obtained. Through both objective and subjective evaluations, we show that the proposed socialized mobile photography system can indeed effectively suggest proper composition and camera parameters to help the user capture high quality photos.

Index Terms—Camera parameters, mobile photography, social context, social media, view enclosure.

[Manuscript received October 09, 2012; revised February 08, 2013 and May 18, 2013; accepted June 27, 2013. Date of publication September 25, 2013; date of current version December 12, 2013. This work was supported by NSF Grant 0964797 and a Gift Funding from Kodak. Part of this work was performed when the first author visited Microsoft Research Asia as a research intern. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Vasileios Mezaris. W. Yin and C. W. Chen are with the State University of New York at Buffalo, Buffalo, NY 14260 USA (e-mail: chencw@buffalo.edu). T. Mei and S. Li are with Microsoft Research Asia, Beijing 100080, China (e-mail: tmei@microsoft.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2013.2283468. 1520-9210 © 2013 IEEE]

I. INTRODUCTION

THE recent popularity of mobile devices and the rapid development of wireless network technologies have revolutionized the way people take and share multimedia content. With the pervasiveness of mobile devices, more and more people are taking photos to share their experiences using their mobile devices anytime and anywhere. Market research indicates that more than 27% of photos were captured by smartphones in 2011, while the number was merely 17% in the previous year [1]. The booming development of built-in mobile cameras (such as the advanced eight-megapixel resolution and the large aperture) has triggered a trend that may lead to mobile cameras replacing traditional handheld cameras.

However, mobile cameras cannot guarantee perfect photos. Although mobile cameras harness a variety of technologies to take care of many camera settings (e.g., auto-exposure) for point-and-shoot ease, capturing high quality photos is still a challenging task for amateur mobile users, not to mention those lacking photography knowledge and experience. Therefore, assisting amateur users to capture high quality photos via their mobile devices becomes a demanding task. While most existing research has predominantly focused on how to retrieve and manage photos on mobile devices [2], [3], or how to adapt media considering the unique characteristics of mobile devices [39], there have been few attempts to address this topic before.

To obtain high quality photos, various types of commercial software such as Photoshop have been developed for post-processing to adjust photo quality. However, most of them are designed for desktop PCs and require intensive computation. Although there exist some mobile applications for image post-processing, they are only able to conduct simple operations, such as cropping and contrast adjustment. These post-processing tools cannot always fix poorly captured images. For example, the information lost in an over-exposed image is usually unrecoverable by any post-processing technique. Therefore, it is desirable to assist mobile users in obtaining high quality photos while the photos are being taken. For example, if we can suggest the optimal scene composition and suitable camera settings (i.e., aperture, ISO, and exposure time) based on the user's current context (i.e., geo-location, time, and weather condition) and input scene, the user's ability to capture high quality photos via mobile devices will be improved significantly.

On the other hand, computational aesthetics of photography has emerged as a hot research area. It aims to automatically assess or enhance image quality with computational models based on various visual features, such as lighting (e.g., light, color, and texture) and composition (e.g., the rule of thirds). A comprehensive survey on computational aesthetics can be found in [5]. However, aesthetic assessment or enhancement methods in existing works are also designed for desktop PCs, and focus on evaluating or enhancing image quality as post-processing tools. Moreover, they only apply some simple and general photography rules to the aesthetics evaluations. For example, when assessing photo composition, they tend to place objects according to the rule of thirds or put the horizontal line lower in the frame, according to the golden ratio [6]. In many cases, instead of applying these simple rules slavishly, professional photographers adapt the composition and the camera exposure parameters to the shooting situation, e.g., the scene and the lighting conditions. Therefore, the existing general photo aesthetics assessment and enhancement approaches are far from enough to guide mobile users in capturing high quality photos on the fly.
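As background for the three camera settings discussed above, settings that admit the same amount of light are tied together by the standard exposure-value relation EV = log2(N²/t), offset by one stop per doubling of ISO. This is general photographic practice, not part of the proposed system, and the numbers below are illustrative. A minimal Python sketch:

```python
import math

def exposure_value(aperture_n, exposure_time_s, iso=100):
    """ISO-normalized exposure value: log2(N^2 / t) - log2(ISO / 100).

    Parameter triples with (approximately) equal values admit the same
    amount of light, which is why aperture, exposure time, and ISO can be
    traded against one another under a fixed lighting condition.
    """
    return math.log2(aperture_n ** 2 / exposure_time_s) - math.log2(iso / 100)

# f/8.0 at 1/200 s, ISO 100 is about EV 13.6.
ev = exposure_value(8.0, 1 / 200, 100)
# Opening up one stop to f/5.6 while halving the exposure time leaves the
# value nearly unchanged (up to f-number rounding).
ev_alt = exposure_value(5.6, 1 / 400, 100)
```

Under this relation, a single lighting condition can be expressed through several equivalent parameter triples, which is one reason a suggestion system needs to pick the three settings jointly rather than independently.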


Based on the above analysis, it is desirable to develop an intelligent built-in tool to assist mobile users in capturing high quality photos. Since composition and exposure parameters (i.e., aperture, ISO, and exposure time) are two key factors affecting photo quality [7], the system should be able to provide suggestions to mobile users in terms of these two aspects. Therefore, we propose an intelligent mobile photography system, which can not only assist mobile users in finding good view enclosures, but also help them set correct exposure parameters.

A. Challenges and Opportunities

Despite recent developments in computational aesthetics, intelligent mobile photography still remains a challenge because no unified knowledge can always be applied to various scenes and contexts. Various composition rules and exposure principles apply adaptively and flexibly when capturing different objects or scenes with diverse perspectives and arrangements under various lighting conditions. Professional photographers take years of training to obtain sufficient photography knowledge. They usually skillfully adapt this domain knowledge to capture scenes under different conditions. Photographers are artists whose knowledge is difficult to model and represent with simple rules. Descriptions of the complexity of photo composition and camera settings are given in Section III. Therefore, assisting mobile users to take professional photos under various shooting situations is a challenging task.

Fortunately, the availability of rich context sensors on mobile devices and the explosion of images associated with their capture contexts, camera parameters, and social information in the social media community bring us good opportunities to solve this challenging problem. On mobile devices, the GPS position and time of the camera can be obtained, and some other photography-related context information, like the weather condition at the shooting time, can be further inferred. On the other hand, a large volume of photos on social media websites are associated with metadata such as GPS and timestamp, as well as EXIF including the camera parameters. Moreover, in the social media community, the photo quality can be implicitly inferred from the number of views, number of favorites, and comments, or even explicitly obtained from ratings. Despite some noise in the media metadata, the aggregated photographs captured nearby containing the same scene, together with their metadata from the media community, can provide significant insight into relevant photography rules in terms of composition and exposure parameters for mobile photography assistance.

For example, as shown in Fig. 1, when capturing a photo of the "Statue of Liberty" from the perspective shown in the input wide-view image, photos with the same content from similar perspectives, together with the associated social information that can reflect their quality, can be crowdsourced using the content and context of the input image. By analyzing the composition and aesthetic quality of these crowdsourced images, we find that pictures with the Statue on the right third line are usually highly preferred; hence the optimal composition of the input image can be inferred. Moreover, with the input time and weather condition, we can also estimate the optimal camera exposure parameters from the crowdsourced photos with similar content and lighting conditions. Here, the first two crowdsourced similar photos with high ratings were captured in a similar season and time of the day, and their weather condition is also overcast, like the current weather condition; so by setting the camera parameters as in the two images, i.e., exposure time (ET): 1/200, aperture: 8.0, and ISO: 100, the user should be able to obtain high quality photos with proper exposure.

Fig. 1. Illustration of how crowdsourced images help for view suggestion and exposure parameter suggestion.

Motivated by the above observations of mobile devices and social media, we propose a socialized mobile photography system leveraging both the context and crowdsourced photography knowledge to aid mobile users in acquiring high-quality photos. We predominantly focus on outdoor landscape photography on mobile devices. To achieve intelligent mobile photography, the following problems need to be solved. First, the system should be able to suggest a view enclosure with optimal composition by mining the scene-specific composition rules from the crowdsourced community-contributed photos. Second, the optimal exposure parameters have to be recommended given the suggested composition and the lighting condition.

The proposed socialized mobile photography system is shown in Fig. 2. Given the input wide-view image and the shooting context, i.e., geo-location, time, and weather, the system is able to suggest view enclosures with optimal composition, as marked in the red rectangle, by mining the relevant composition rules from crowdsourced photos captured nearby. Then, the system suggests optimal exposure parameters, i.e., ISO, aperture, and exposure time (ET), as shown in the upper-right corner of the screen, by mining exposure rules from the crowdsourced images with similar content and context. Therefore, the optimal view enclosure and exposure parameters can be suggested by the proposed system to capture high quality
photos in the given scene and context, as long as sufficient images containing similar content are crowdsourced nearby.

Fig. 2. Demonstration of the proposed socialized mobile photography system.

To intelligently aid mobile users in capturing high quality photos, considering the complex scene- and context-dependent composition and exposure principles, two fundamental components are developed in the socialized mobile photography system: 1) offline composition and exposure parameter learning, and 2) online photography suggestion.

Considering the various views in a given scene due to differences in capture location, perspective, and content, viewpoint clusters within a scope of location are discovered based on image local features as well as their geo-locations by unsupervised methods. Relevant composition knowledge is mined by view cluster ranking and view cluster specific composition learning. To learn the effects of various contexts and different shooting content on exposure, metric learning is carried out for exposure parameter suggestion. The time-consuming view clustering, composition, and exposure learning processes are all performed offline, which makes the system applicable and practical for mobile applications. In the online process, a portion of view enclosure candidates are discarded by the cluster rankings obtained offline, which makes the online view decision quite efficient. Then the top ranked view enclosure is selected from the remaining candidates based on the learned view specific composition model. Finally, the optimal exposure parameters are suggested according to the content of the optimal view and the shooting context based on the learned exposure model.

B. Contributions

We make the following three major contributions:
• We propose a general framework for mobile photography using the rich social context from mobile devices. To the best of our knowledge, little research has been conducted on this topic before.
• We solve the problem of mobile photography by leveraging both rich context and crowdsourced images on the Web. To overcome the photography challenge due to the complex scene- and context-dependent characteristics, a view cluster discovering scheme followed by view specific composition learning and exposure parameter learning is developed to suggest the optimal view and parameters based on the discovered photography rules.
• We develop a mobile photography system and evaluate it through objective and subjective evaluations.

The remainder of the paper is organized as follows. In Section II related works are discussed. In Section III, we present the challenges of mobile photography to justify the need for the proposed mobile photography system. The proposed socialized mobile photography system overview and the details of offline photography learning and online mobile photography suggestion are introduced in Section IV. Experiment designs and evaluations are demonstrated in Section V. Finally, we conclude this paper in Section VI.

II. RELATED WORK

Recently, considerable research efforts have been made on photo aesthetics computation. Based on some heuristic low level features which are expected to discriminate between pleasing and displeasing images, linear regression is applied to predict numerical aesthetics ratings in [4]. In [24], by determining perceptual factors that distinguish professional photos and snapshots, visual features such as the spatial distribution of edges, color distribution, and hue count were adopted for photo quality assessment. Considering that professional photographers often skillfully differentiate the main subject of the photo from the background, features such as lighting difference and subject composition were adopted in [25] for photo quality classification. A visual attention model based on the saliency map is deployed for photo assessment in [26]. In [27], high level describable image attributes including compositional attributes, content attributes, and sky illumination attributes are designed for photo quality prediction. In particular, query-specific "interestingness" is estimated with consideration of the varying usefulness of the proposed attributes for different subjects. Instead of using various hand-crafted features or attributes intuitively or theoretically related to photo aesthetics, visual features widely used in the computer vision domain, such as Bag-of-Visual-Words and the Fisher Vector, are introduced for photo quality classification in [28]. The classification performance improvements achieved by [27], [28] are useful evidence of the fact that photography is a complicated and scene-dependent task.

Using the same set of features as in [4], an aesthetic quality inference engine is developed [29] to allow users to upload photographs and automatically rate their aesthetic quality based on the distance to the SVM hyperplane. Later on, they extend their work into OSCAR [30], which aims to retrieve highly rated images with similar semantics and similar composition based on low level features, by classifying image composition into textured, diagonal, horizontal, centered, and vertical categories for users to imitate. Without considering the rich context characteristics of mobile devices, in many cases the retrieved results with different semantic objects have limited power to assist photography. In addition, for novice mobile photographers, it is still difficult to take appealing photos without explicit composition and exposure suggestions under various shooting situations.

To enhance the photo quality, several schemes retarget images based on several composition guidelines [6], [8]. In [6] they relocate the object to a more aesthetically pleasing location based on the rule of thirds for single object image enhancement, while cropping or expanding the photo to achieve pleasing balance for landscape images without a dominant object. In [8], an optimal version of the input image is produced by crop-and-retarget based on known composition rules such as the rule of thirds, diagonal dominance, and visual balance. However, these
post-processing techniques for photo enhancement require intensive computation or manual manipulation. To accurately segment objects, either complicated detection and segmentation techniques or tedious user-guided manipulations are required. In addition, complicated inpainting techniques are needed for image re-composition. Moreover, a number of poor quality photos captured with improper exposure parameters cannot be recovered or enhanced, as described earlier.

Since explicit rule-based composition suggestion systems cannot capture all photography knowledge, several approaches have been developed in [31] and [32], attempting to discover the photography principles through learning. In [31], a view finding system is proposed to automatically generate a professional view enclosure by mining the underlying knowledge of professional photographers from massively crawled professional photos. To model the general photography knowledge, an omni-range context model is developed to learn the patch spatial distributions and correlation distributions of pair-wise patches to guide the photo composition. Considering the variation in photographic styles of different photographers, a preference-aware aesthetic model for view suggestion has been proposed in [32] by constructing an aesthetic feature library with bag-of-aesthetics-preserving features in a bottom-up fashion. However, we argue that a general composition model is not sufficient to model the complicated professional composition knowledge, since in various scenes containing different objects with various perspectives, professional photographers tend to compose photos in different manners, as illustrated in Section III-A. In addition, exposure is not considered in this photography system. Moreover, these general photography systems have not utilized the context information, as they are not designed for mobile devices.

With the availability of context sensing on mobile devices, context-based mobile applications have drawn considerable attention, such as mobile image search [2], landmark retrieval [3] and identification [33]. With the successful use of contexts in these mobile applications, an image recommendation application is developed in [34] to help users compose a scene by utilizing the location context. Without looking into the image visual content, nearby photos with different semantics may be retrieved for framing assistance, but in that case the recommended image may have little reference value for the current photo composition. Moreover, without composition learning, the users have to be directed to the exact place of the recommended image and align the current photo with the recommended one manually.

The work towards crowd-sourced learning to photograph in [35] tried to find the optimal view enclosure. Although limited contexts and online media metadata, i.e., location and time of the day, are leveraged for crowdsourcing to mine scene specific photography knowledge, much other necessary photography-related information is ignored, such as the weather context and media EXIF metadata. In addition, the perspective difference of the same subject is not taken into account for view suggestion, which is a non-negligible aspect of photo aesthetics. Moreover, the exposure problem is not considered in that photography system, but good view finding alone cannot guarantee a high quality photo acquisition, as mentioned earlier. From the technical perspective, the online contextual relevant image retrieval and aesthetics model learning processes are quite time consuming for mobile applications.

Fig. 3. Professional photo examples with different composition for different scenes: (a), (b), and (c) are about object placement; (d), (e), and (f) are about object size determination; (g), (h), and (i) are about horizon placement.

III. CHALLENGES OF MOBILE PHOTOGRAPHY

It is quite difficult for non-professional mobile users to take high quality photos with good composition and exposure parameters due to the following great challenges.

A. Complexity of Photo Composition

Photo composition is a complicated problem which is highly dependent on the scenes being shot, objects, and perspectives. Although a few typical composition rules have been introduced into existing works on computational aesthetics, e.g., the rule of thirds and the golden ratio placement of horizontal lines [6], these rules are not always applied dogmatically by professional photographers. Moreover, they are non-exhaustive, indicating that they cannot cover all possible composition principles.

Fig. 3 shows some professional photos. Besides placing the object by the rule of thirds as in (b), symmetrical composition, which places a single subject right in the middle as in (c) and (i), is often applied to achieve a sense of equilibrium. The unresolved balance, placing the subject far from the center, is also used to create visual tension, as in (a). Furthermore, both extremes and all varieties of balance in between have their uses in photography. Not only the subject position but also its relative size in the photo is determined by many factors, such as the information content of the subject and the relationship between the subject and its settings. This shooting-content-dependent characteristic makes simplified subject size models as in [8] inadequate for composition optimization. As the examples in (d), (e), and (f) show, these photos containing different objects have various sizes. The same objects from different views can be captured with diverse sizes, as in (a) and (d), which makes the object scale determination more complicated. Another typical example is the placement of the horizon, which is a key element in many landscape images. The simplified scheme of putting it lower, with the golden ratio division like (g) in [6], is not a hard rule to follow in real cases. When the ground has plenty of interest, a high position of the horizon is encouraged to draw attention to the ground, as in (h). A typical photography technique is to create reflection, which places the
horizon in the middle when water appears on the ground, to give a feeling of tranquility, as in (i).

Most photographers adapt the photography repertoire, a set of compositional possibilities obtained from their photography knowledge or experience, to the shooting scenes [7]. Therefore, we claim that photographic view finding is a highly complex and scene-dependent task, and that the introduction of several simple and general composition rules is insufficient to guide the mobile user to capture high quality photos.

B. Complexity of Camera Settings

Exposure has a critical effect on photo lightness, color, and contrast, and the human visual system is quite sensitive to all of them. Unsuitable exposure can produce a poor quality photo even though the composition is successful, and quite often even post-processing cannot improve it into an appealing work. An unfeasible setting of any exposure-related camera parameter can spoil the photo quality, which is difficult for mobile users to handle.

Professional photographers have to adjust the exposure parameters adaptively considering the shooting scenes, i.e., the subject and its settings. From the image lightness and color perspective, the main subject has to be captured in an acceptable tone, and at the same time, the contrast of the shooting subject and its setting has to be taken into account, given the appearance relations between them, to create an ideal attention effect. Therefore, exposure parameters highly depend on the shooting content.

Various lighting conditions influenced by many context factors such as time and weather make the exposure parameter adjustment an even more complicated problem. Even for the same shooting content, the exposure has to be varied with the lighting conditions. The exposures are significantly different between day and night, and even vary with different times of day due to the sunshine variations. In addition, in the same time period of the day at the same place, the exposure also varies with different weather conditions. Therefore, the parameters have to be adapted accordingly to obtain high quality images.

Fig. 4. Framework of the proposed socialized mobile photography system.

IV. SOCIALIZED MOBILE PHOTOGRAPHY

A. Approach Overview

The framework of the proposed socialized mobile photography system is shown in Fig. 4. The system takes the to-be-taken wide-view image along with the mobile user's geo-location as input and sends it to the cloud server. The input wide-view image can either be directly taken, or synthesized from multiple consecutive photos taken by the mobile user. By jointly utilizing the input wide-view image and its geo-location, as well as the lighting-condition-related contexts such as time, date, and weather condition, which can be obtained from the Internet, the system will suggest optimal view enclosures and proper exposure parameters best fitting the shooting content and context, based on photography rules learned from crowdsourced social media data and metadata nearby.

Two fundamental components are needed in the socialized mobile photography system to aid mobile users in capturing professional photographs. First, an offline photography learning process is needed to mine composition and exposure rules. Second, the input wide-view image content and context are utilized together to find relevant photography rules to recommend …
In the offline photography learning procedure, view cluster discovering is performed first within a certain scope of geo-locations, by clustering based on both image visual features and their geo-locations. Because some view clusters are intrinsically more appealing than others, view cluster ranking is carried out. For example, as shown in the view cluster ranking part of Fig. 4, the cluster of the first row is significantly better than that of the fourth row. The view cluster ranking results will be utilized in the online optimal view enclosure searching step to make the searching process much more efficient. As mentioned earlier, due to the non-exhaustiveness and flexibility of the composition rules when taking pictures of different portions of scenes, composition learning is performed for each view cluster discovered. As the instances shown in the view specific composition learning part of Fig. 4 illustrate, for the view cluster of the second row, the rule of thirds is more appropriate to apply than symmetrical composition; while for the cluster of the third row, instead of the rule of thirds, symmetrical composition of the close-up view of the "Golden Gate Bridge" pier is preferred. Moreover, due to the fact that professional photographers usually adjust the camera exposure parameters according to the brightness and color of the objects and the whole settings, as well as the lighting condition influenced by a variety of factors, such as the intensity and direction of sunshine, which are affected by the season and the time of the day as well as weather conditions, metric learning is carried out to model the various effects of content and context on the exposure parameters.

In the online photography suggestion stage, utilizing the visual content and geo-location of the input, relevant view clusters similar to all possible view enclosure candidates of the input image are found. Considering the fact that some view enclosure candidates are intrinsically bad no matter how the object placement is tuned, a large portion of view enclosure candidates similar to the low ranked view clusters are discarded. Afterwards, the optimal view enclosures will be selected based on the offline learned view specific composition principles, only from the remaining enclosure candidates. Once mobile users are provided with the optimal view enclosure, the appropriate exposure parameters, i.e., exposure time, aperture, and ISO, suitable for the view and lighting conditions are suggested. With the suggested …

… view cluster should contain the same main objects, we use local features with geometric verification to capture image content similarity. To facilitate the clustering and online relevant view discovering processes, the crowdsourced photos in the location scope are indexed by inverted files [9] based on SIFT [10] visual words. To overcome the false matching caused by the ambiguity of the visual words, geometric relationships among the visual words are also recorded in the index by spatial coding as in [11]. Hence, using the index, the image content similarity can be efficiently computed based on the matching score formulated by the number of matched visual words passing the spatial verification [11]. In addition, considering that images captured from close places usually have similar content from similar perspectives, location is also adopted in the view cluster discovering process. The image location similarity is calculated based on their GPS Euclidean distance. Then the view clusters can be discovered by a clustering process based on the image similarity obtained by fusing their content similarity and location similarity. The similarity fusion is achieved using their product.

However, it is difficult to manually specify the number of views, considering the difference in content and perspectives for clustering, even after manually going through the whole dataset crawled from the location scope. Therefore, affinity propagation [12], which does not require the specification of the number of clusters, is performed to cluster the images into different views based on the fused image similarity.

View Cluster Denoising. To learn the photography rules of a given content from a certain perspective, we need to model the relevant rules for each cluster. However, the noisy images without the main objects in the view clusters discovered by the above clustering process would negatively affect the photography learning process. It is necessary to denoise the images without the same content. To identify the images sharing the main objects with other images in the cluster, we first select the iconic image based on local features. The image with the maximum total content similarity score with the others within the cluster is chosen as the iconic image of the view cluster. Afterwards, the images with a content similarity score less than the threshold are considered as noisy images without the main objects in the cluster and thus are discarded. Then, noisy clusters …
view enclosure and corresponding exposure parameters, mobile without representative content in the scene such as the ones con-
users can capture high quality photos with appealing composi- taining portraits and crowds can be removed by discarding the
tion and exposure. clusters with quite a small number of images. Here we discard
the clusters with less than 5 images. Finally, we can expect im-
B. Offline Photography Learning ages with very similar content from the same viewpoints to end
1) View Cluster Discovering: Intuitively, when photogra- up in different view clusters.
phers visit a certain scope of geo-location, they tend to cap- 2) View Cluster Ranking: As aforementioned, some view
ture photos from a certain number of photo-worthy viewpoints. clusters have more appealing content and composition than
The aggregated photographs taken in the location scope asso- others, therefore, we rank the view clusters discovered. Later
ciated with their social information from the media community on, in the online view enclosure selection stage, the enclosures
can provide significant insight on the aesthetics and composi- having relative less appealing content can be discarded directly
tion rules of different viewpoints. To discover those repeated to facilitate the optimal view searching. We adopt the cluster
views from the crowdsourced photographs of the given location size and score distribution of the images in the cluster to rank
scope, we perform image clustering by jointly utilizing image the view clusters.
visual features and capture locations. The target is to expose dif- View Cluster Ranking. Suppose we have the whole ranking
ferent views with different portion of the scene from a variety list of the crowdsourced images based on their aesthetic scores,
of perspectives. to rank the view clusters, the score distributions of the individual
View Clustering. An efficient content and geo-location based images in the cluster need to be taken into account. In the ideal
clustering process is carried out to discover all typical view case, the individual images of high ranking clusters should have
clusters in the location scope. As the images within the same high aesthetic scores in the average sense. On the other hand, the
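The iconic-image selection and cluster denoising described in the View Cluster Denoising step can be sketched as follows. This is a minimal illustration: the similarity values, the threshold, and the function names are ours; only the minimum cluster size of 5 comes from the text.

```python
def iconic_image(sim):
    """Index of the cluster's iconic image: the image with the maximum
    total content similarity to the other images in the cluster."""
    totals = [sum(row) - row[i] for i, row in enumerate(sim)]
    return max(range(len(sim)), key=totals.__getitem__)

def denoise_cluster(sim, threshold, min_size=5):
    """Drop images whose similarity to the iconic image is below the
    threshold; drop the whole cluster if fewer than min_size remain."""
    iconic = iconic_image(sim)
    kept = [i for i in range(len(sim))
            if i == iconic or sim[iconic][i] >= threshold]
    return kept if len(kept) >= min_size else []

# Toy 6-image cluster: images 0-4 match each other well, image 5 does not.
sim = [[1.0 if i == j else (0.8 if i < 5 and j < 5 else 0.1)
        for j in range(6)] for i in range(6)]
print(denoise_cluster(sim, threshold=0.5))  # image 5 is discarded
```

With a stricter threshold the surviving set shrinks below five images and the whole cluster is rejected, which mirrors how portrait or crowd clusters are removed.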
individual image scores of highly ranked clusters should not be scattered too much, since clusters with compact scores tend to have more stable aesthetic scores. Inspired by average precision, a metric commonly used in the information retrieval field, we formulate the average score of a cluster by (1): for each image of the cluster, we compute the ratio of the total score of the cluster's images ranked at or above it to the total score of all images up to its cut-off rank in the whole rank list, no matter which cluster they belong to, and these ratios are averaged over the images of the cluster. In addition, the appealing degree of the view cluster can also be reflected by the size of the cluster, since pleasing objects tend to draw more photographers' attention and hence a large number of images are aggregated in the view cluster. Therefore, the view clusters are ranked according to scores calculated by (2), which combine the average score with the cluster size: the larger the view cluster, the higher the view is scored.

Image Aesthetic Score Generation. On many social media websites such as Photo.net [13] and DPChallenge [14], most photos are rated by a number of professional photographers, and the ratings can almost reflect the photo aesthetics. Due to the lack of GPS information, we did not use the data from these websites; however, we can expect the availability of such context information in the near future due to user demand and advancing techniques. Given the rich information of Flickr [15], we generate image aesthetic scores based on several heuristic criteria.

• Ranking of interestingness. Interestingness-enabled search is provided by the Flickr API. As in [16], the interestingness rankings are based on the quantity of user entered metadata such as tags, comments and annotations, the number of users who assigned metadata, user access patterns, and a lapse of time of the media objects. Based on the interestingness rankings, the top photos are usually of high quality. Hence, we utilize this important information to generate the aesthetic scores. Due to the consideration of the lapse of time, although the interestingness based ranking can reflect the photo quality, the ranking scheme tends to rank newly uploaded photos higher than old photos. To improve the rankings of those appealing but old photos, we enhance the influence of the number of views and the number of favorites, considering the fact that such photos usually score highly on these two terms, even though the interestingness has subsumed them.

• Number of favorites. The number of favorites explicitly shows how many people liked the image. Hence it is a straightforward reflection of the photo's degree of appeal.

• Number of views. The number of views of the image can reflect the attention drawn from the social media community. Usually, high quality images tend to be viewed by more users. We also consider this parameter in aesthetic score generation to complement the number of favorites.

Therefore, by weakening the time fading effect of the interestingness rankings via highlighting the impacts of the number of views and favorites, the interestingness rankings are utilized to generate photo aesthetic scores by fusing the above three factors as in (3), which combines the number of views and the number of favorites of each image with its interestingness rank and the total number of crawled images in the location scope. In this way, aesthetic scores ranging from 0 to 100 are generated, in which high quality photos are assigned high aesthetic scores. The fusion weights are set through empirical analysis.

3) View Specific Composition Learning: The view clusters obtained are expected to contain different content or to be captured from different perspectives at different location tiles. As illustrated in Fig. 4, different composition rules are adopted for those different view clusters. Therefore, view specific composition learning has to be performed to extract different composition principles for each view cluster.

The photographs of each cluster usually contain the same objects from similar perspectives but with different positions and scales in the frame. This difference is one key factor leading to the different aesthetic scores. To characterize the composition difference in terms of the main object placement for each cluster, the camera operations relative to the cluster iconic image are utilized to represent the main object composition, as shown in Fig. 5. The camera operations are defined as horizontal translation, vertical translation, zoom in/out, and rotation, as in [17]. Using the matched SIFT points, the coordinates in the given image can be represented by an affine model based on the coordinates of the corresponding matched points in the cluster iconic image, as in (4). The parameters of the affine model can be calculated by the least squares method based on all matched SIFT points in the given image. Based on the affine parameters, the camera operations can be obtained by (5) and (6). The four camera operation terms represent the camera horizontal translation, vertical translation, zoom in/out degree, and rotation, respectively [18]. As shown in Fig. 5, the object composition in terms of scale and location can be captured by the camera operations relative to the view cluster iconic image.

In addition to the modeled main object, some other salient objects in the photos can also affect the image composition. Therefore, the spatial distribution of saliency is also utilized to capture the composition. We divide the image into 5 × 5 grids, and
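The least-squares fit of the affine model in (4) and the recovery of the camera operations can be illustrated as below. The decomposition of the affine parameters into translation, zoom and rotation shown here is one common convention, assumed on our part rather than taken from (5) and (6); all function names are ours.

```python
import math

def fit_affine(src, dst):
    """Least-squares affine fit (x', y') = (a x + b y + tx, c x + d y + ty)
    from matched point pairs, via the 3x3 normal equations."""
    n = len(src)
    sx = sum(p[0] for p in src); sy = sum(p[1] for p in src)
    sxx = sum(p[0] * p[0] for p in src); syy = sum(p[1] * p[1] for p in src)
    sxy = sum(p[0] * p[1] for p in src)
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    def solve(b):
        # Gauss-Jordan elimination with partial pivoting on a 3x3 system.
        m = [row[:] + [bi] for row, bi in zip(A, b)]
        for col in range(3):
            piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
            m[col], m[piv] = m[piv], m[col]
            for r in range(3):
                if r != col:
                    f = m[r][col] / m[col][col]
                    m[r] = [v - f * w for v, w in zip(m[r], m[col])]
        return [m[i][3] / m[i][i] for i in range(3)]
    bx = [sum(p[0] * q[0] for p, q in zip(src, dst)),
          sum(p[1] * q[0] for p, q in zip(src, dst)),
          sum(q[0] for q in dst)]
    by = [sum(p[0] * q[1] for p, q in zip(src, dst)),
          sum(p[1] * q[1] for p, q in zip(src, dst)),
          sum(q[1] for q in dst)]
    return solve(bx), solve(by)  # (a, b, tx), (c, d, ty)

def camera_operations(abtx, cdty):
    """Translation, zoom and rotation from the affine parameters
    (valid for a similarity transform)."""
    (a, b, tx), (c, d, ty) = abtx, cdty
    zoom = math.sqrt(abs(a * d - b * c))  # scale change
    rot = math.atan2(c, a)                # rotation angle
    return tx, ty, zoom, rot

# Synthetic matches under a known transform: scale 2, rotation 30 degrees.
s, th = 2.0, math.radians(30)
src = [(0, 0), (1, 0), (0, 1), (2, 3)]
dst = [(s * (x * math.cos(th) - y * math.sin(th)) + 5,
        s * (x * math.sin(th) + y * math.cos(th)) - 3) for x, y in src]
tx, ty, zoom, rot = camera_operations(*fit_affine(src, dst))
```

On this synthetic data the fit recovers the translation (5, -3), the zoom factor 2 and the 30-degree rotation up to floating-point error.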
the average saliency value in each grid is calculated to form a vector representing the saliency spatial distribution. The saliency map is computed by the spectral residual approach [19]. Hence, the camera operation vector and the saliency spatial distribution are concatenated together to capture the image composition, as shown in Fig. 5.

Fig. 5. Illustration of view specific composition learning.

Note that when computing the composition features, the images are normalized to the same size, i.e., 640 × 426 or 426 × 640 for vertical and horizontal images respectively, since most medium 640 version images we crawled from Flickr have an aspect ratio of 3:2 or 2:3, and the larger side is usually 640 pixels. In our experiment, we assume the photos captured by the mobile camera also have aspect ratio 3:2 or 2:3. When the aspect ratio of the mobile captured photos is different, training images with the given input aspect ratio can be crowdsourced from the social media websites and the system can work in the same way. For each view cluster, the composition models of horizontal images and vertical images are learned separately, since they may follow different composition rules when the camera orientations are different, even though the objects and perspectives are the same.

To learn the composition rules for a view cluster, we treat composition learning as a two-class classification problem using an RBF kernel based Support Vector Machine (SVM) classifier [20]. Within the cluster, the photos with aesthetic scores sufficiently above the median score of the cluster and those with scores sufficiently below it are considered high quality and low quality photos, respectively. The photos with scores in between are not utilized in the training process, to overcome the quality ambiguity issue. The SVM training leads to a learned hyperplane which is able to separate the photos with good and bad compositions in the cluster. Afterwards, the image aesthetic score can be inferred by a rescaled sigmoid function of the distance from the given image to the hyperplane, as in (7). As the distance goes from 0 to negative infinity, the aesthetic score decreases from 50 to 0, while as the distance goes from 0 to positive infinity, the aesthetic score goes from 50 to 100. We employ five-fold cross validation to search for the corresponding optimal parameters: the error parameter, the tube width, and the kernel parameter.

4) Exposure Metric Learning: Professional photographers have to jointly consider the shooting object and the setting, as well as the lighting condition influenced by various context factors, for exposure parameter adjustment. Therefore, we have to learn the exposure parameters by jointly considering the shooting content and various lighting related contexts.

Exposure Feature. The shooting content is one primary factor for exposure learning. Different hues are perceived with different light values; the colors of the objects and settings, as well as the contrast between them in the view enclosure, have a direct influence on the exposure. As the images in the same view cluster usually contain the same objects from similar capture angles, we utilize the cluster id to represent the shooting content feature. Moreover, a series of features are extracted to capture the contextual information related to the lighting conditions. Currently, our system incorporates the following temporal and weather contextual features, since these contexts have a strong and direct influence on lighting and are easy to obtain or infer.

• Time of the day. The sunshine direction and luminance vary with the time of the day. For example, two professional photographs with the same shooting content captured at sunrise time and at noon have different exposure parameter settings to fit the lighting difference. Here, considering the variation of sunrise and sunset time, we quantize the time period of the day into six bins: [sunrise time-1hr, sunrise time+1hr), [sunrise time+1hr, 11am), [11am, 2pm), [2pm, sunset time-1hr), [sunset time-1hr, sunset time+1hr), and [sunset time+1hr, sunrise time of the next day-1hr). The sunrise and sunset time of every day in the given location can be obtained from many weather record websites [21]. In this system, we obtain the historical sunrise and sunset time from [22] by specifying the geo-location and date for the crowdsourced photos.

• Month. The light intensity changes with the season for a given geo-location. For example, the light intensity at noon is stronger in summer than in winter. Hence, month is also an important temporal factor for exposure.

• Weather condition. Lighting is obviously influenced by the weather conditions. For example, the light is stronger on sunny days than on cloudy days, which also directly affects the exposure. We define the possible values of the weather condition as: clear, cloudy, flurries, fog, overcast, rain, and snow. The historical hourly weather information is also obtained from [22] given the date, time and
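The six time-of-day bins can be implemented directly from the interval definitions of the Time of the day feature; the bin names and the fractional-hour representation are our choices, not the paper's.

```python
def time_of_day_bin(hour, sunrise, sunset):
    """Quantize a fractional local hour into the six bins from the text:
    around sunrise, morning, midday, afternoon, around sunset, night."""
    if sunrise - 1 <= hour < sunrise + 1:
        return "around sunrise"
    if sunrise + 1 <= hour < 11:
        return "morning"
    if 11 <= hour < 14:
        return "midday"
    if 14 <= hour < sunset - 1:
        return "afternoon"
    if sunset - 1 <= hour < sunset + 1:
        return "around sunset"
    return "night"  # [sunset time+1hr, sunrise time of the next day-1hr)

# With sunrise at 6:00 and sunset at 20:00:
print(time_of_day_bin(6.5, 6, 20))  # around sunrise
```

The same binning is applied both to the crowdsourced training photos (using historical sunrise/sunset times) and to the current capture time at suggestion time.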
geo-location. In this system, the dataset is built by GPS enabled search from Flickr, and hence all photos are tagged with their GPS coordinates. In addition, the date and time can be found in the EXIF data of the photos.

Exposure Feature Space Metric Learning. The exposure parameter setting is a complicated problem due to the varied shooting content and the lighting conditions affected by various contextual factors. Although we may identify all possible influencing factors, it is still quite difficult to formulate their different effects. A simple instance is that when taking photos at night, the time of day may be a dominant factor rather than the weather conditions, while at noon, the lighting difference between cloudy and clear weather may have a greater influence than time on the exposure. With these context dimensions intertwined, it is hard to determine their influences on exposure. The exposure setting problem becomes even more challenging when taking into account the diversity of exposure for different shooting content.

Therefore, the proposed system models the exposure parameters by supervised distance metric learning, which generally aims to learn a transformation of the feature space to maximize classification performance. Here, the hope is to find a transformation of the exposure feature space with the ability to rescale the feature dimensions according to their effects on the exposure parameter selections. Ideally, the learned transformation can project the photos with similar exposure parameters into clusters, as illustrated in the exposure metric learning module of Fig. 4.

To perform distance metric learning for the exposure feature space, Large Margin Nearest Neighbor (LMNN) [23], a distance metric learner designed for the k-nearest neighbor (kNN) classifier, is utilized. It aims to ensure that the k-nearest neighbors always belong to the same class while the samples from different classes are separated by a large margin. The approach maximizes the kNN classification performance without a unimodal distribution assumption for each class, which is a valuable property for fitting our model. For example, the exposure parameters for the same view cluster under similar weather conditions at sunrise and sunset may be the same, but fall into different clusters after the exposure feature space transformation. In the metric learning step, the photos are represented in the exposure feature space with their exposure parameter values as labels.

Exposure Compensation, Aperture and ISO Learning. As different exposure parameters have different sensitivity and functionality with respect to the content and context dimensions, we have to learn the metrics for the parameters separately. For example, at night photographers tend to increase the ISO value to overcome weak lighting, while, besides the concerns about lighting, they tend to use a large aperture to reduce the depth of field when shooting single objects. A straightforward way is to model the exposure feature space distance metrics for the three most commonly used exposure parameters, i.e., aperture, ISO and exposure time, directly; the exposure value (EV) can then be calculated from these parameters and the exposure time (ET) by (8). However, this is not reasonable: even though the separately predicted aperture, ISO and exposure time are each applicable, when putting them together it is not guaranteed that the resulting exposure value fits the given content and context conditions. To suggest exposure parameters with a reasonable EV, there are two possible ways. One is to learn the distance metric of the exposure feature space using the coupled aperture, ISO and ET triples as labels, but the combination of the large numbers of possible values of the three parameters makes the number of labels grow exponentially, and hence degrades performance. The other way is to predict the EV and any two of the three parameters. However, EV determination is complicated because it is sensitive to the layout of the intensity and colors in the shooting view and to the lighting conditions. Although camera manufacturers have made efforts for decades to estimate correct exposure values by improving light meters, it is still challenging to provide the correct exposure value automatically. Hence, predicting the exposure value directly from shooting content and context is not practical. Empirical analysis also validates that the performance of direct exposure value prediction is not satisfactory.

Although camera light meters have indeed improved nowadays, professional photographers sometimes need to adjust the camera computed exposure value by increasing or decreasing it by certain compensation levels to achieve the perfect EV. By setting the exposure compensation (EC), the camera light meters and the photographers' knowledge are both sufficiently utilized to obtain the ideal EV. Inspired by this, we adopt EC to achieve the optimal EV: we learn the distance metrics of the exposure feature space for EC. Similarly, metric learning of the exposure feature space is also performed for aperture and for ISO, respectively. Once the optimal EV, aperture and ISO are obtained, the ET can be calculated according to (8).

C. Online Mobile Photography Suggestion

Given the input wide-view image, view enclosure candidates at various positions and in different scales need to be generated for optimal view selection. As mentioned in Section IV-B-3, we assume the mobile captured photos are of aspect ratio 2:3 or 3:2. We slide a window of the given aspect ratio, with a fixed moving step size, starting from the smallest window size for horizontal and vertical input images respectively, up to the largest possible window size under a fixed scaling ratio, to generate all possible view enclosure candidates. Then, a number of view candidates are discarded first based on the view cluster rankings. The view enclosure with the best composition is selected from the remaining ones, utilizing the offline learned view specific composition rules.

1) Relevant View Discovering: Once the view enclosure candidates are generated, their relevant view clusters containing the same content have to be discovered for the judgment of their composition aesthetics. Using the image index built offline, the most relevant image can be efficiently retrieved for each of the enclosure candidates. We consider the view cluster which the relevant image belongs to as the relevant view cluster of the enclosure candidate. The visual word extraction only needs to be performed once on the input image, and the visual words of each candidate can be obtained from the enclosure coordinates accordingly.

2) Low Ranking View Enclosure Discarding: It is difficult to decide which part of a given input panoramic image to capture
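Assuming (8) is the standard exposure-value relation between aperture (F-number), exposure time and ISO, the forward computation and the solve for ET used later can be sketched as follows; the function names are ours.

```python
import math

def exposure_value(f_number, et, iso):
    """EV referenced to ISO 100: EV = log2(N^2 / t) - log2(ISO / 100)."""
    return math.log2(f_number ** 2 / et) - math.log2(iso / 100)

def exposure_time(ev, f_number, iso):
    """Solve the same relation for ET, as is done once the EV, aperture
    and ISO have been predicted."""
    return f_number ** 2 / (2 ** ev * iso / 100)

# f/4, 1/60 s at ISO 100; at the same EV, ISO 400 needs only 1/240 s.
ev = exposure_value(4.0, 1 / 60, 100)
et_iso400 = exposure_time(ev, 4.0, 400)
```

The example shows why the system can trade parameters under a fixed EV: quadrupling the ISO at the same EV and aperture divides the required exposure time by four.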
automatically, especially when there are other objects besides the main objects. The ranking of the relevant view clusters can help decide which objects to capture to make the photo more appealing. In addition, exhaustively assessing all view candidates with the learned composition rules is computationally costly and hence not applicable for real-time mobile processing. Taking advantage of the fact that some view enclosures, containing certain portions of the scene, have intrinsically better compositions than others, we carry out a pre-processing step to make the optimal view searching run efficiently. Once the relevant view cluster with similar content is found for each of the view enclosure candidates, utilizing the view cluster rankings already obtained offline, we only search through the candidates belonging to the highest ranked view cluster among all relevant clusters. In this way, a large percentage of the enclosure candidates, those belonging to relatively low ranked view clusters, can be discarded first.

3) Optimal View Enclosure Searching: Once the view enclosure candidates belonging to low ranking view clusters are discarded, we perform optimal view searching through the remaining candidates belonging to the top ranked relevant cluster. The view candidates relevant to the same view cluster contain similar objects with slight variance in arrangement and scale. To make a view suggestion, the candidate with the highest predicted aesthetic score is considered the optimal view enclosure. Therefore, we predict the aesthetic scores of the remaining candidates belonging to the top ranked relevant cluster using the learned view specific composition rules and suggest the most highly ranked one to mobile users.

4) Exposure Parameter Suggestion: In the offline exposure metric learning module, the system learns the distance metrics of the exposure feature space to model the diverse effects of the content and context dimensions on EC, aperture and ISO, respectively. Once the optimal view enclosure is obtained, the learned metrics are applied to the current content and contexts. In the transformed exposure feature space, the photos with similar exposure parameters are projected into clusters due to the local optimization property of LMNN designed for kNN. Hence, the simple kNN classifier, which makes the classification decision based on the most frequent label surrounding the input data in the transformed feature space learned by LMNN, is utilized to predict the optimal exposure parameters according to the content and contexts. This is consistent with the intuition that photos containing similar content and contexts in the dominant dimensions should have similar exposure parameters. Hence, the nearest samples in the transformed exposure feature space can be utilized effectively to predict the exposure parameters.

Fig. 6. The flowchart of exposure parameter suggestion.

The flowchart of exposure parameter suggestion is shown in Fig. 6. Once the optimal view enclosure is obtained, the optimal EV fitting the view and context is estimated. To take full advantage of the camera light metering results on the input image, and to avoid the inconvenience of asking mobile users to manually capture the suggested view, we estimate the EV of the suggested view enclosure from the camera meter as in (9): the EV of the suggested view enclosure is derived from the EV of the input panoramic image reported by the camera meter, adjusted according to the ratio between the median intensity of the suggested view enclosure and that of the input panoramic image, with an exponent set to 2.2 here.

Utilizing the suggested view content and context as described in Section IV-B-4, the EC can be predicted, and thus the correct EV for the suggested view enclosure can be calculated by (10), i.e., by applying the predicted EC for the suggested view to the estimated EV. In addition, with the view content and contexts, the corresponding aperture and ISO can also be predicted based on their respective learned metrics.

One problem with exposure parameter prediction is the possible conflict between the diverse capabilities of mobile cameras and the camera parameters of the crawled photos, some of which were taken by professional cameras. For example, the predicted aperture may not be supported by some mobile devices. In that case, all the training samples with the previously predicted label are temporarily removed to predict the next optimal parameter, until one allowed by the current camera capabilities is found.

Once the correct EV, aperture and ISO are obtained, the corresponding ET can be calculated with equation (8). At that point, if the ET is larger than a threshold, the picture taken would probably be blurred due to hand jitter. If it is smaller than the threshold, the aperture, ISO and ET are suggested; otherwise the exposure parameters have to be adjusted under the same EV.

The exposure parameter adjustment process is illustrated in Fig. 7. It is inspired by the fact that aperture priority mode is usually adopted when capturing landscape scenes, meaning that photographers adjust the ISO or the ET while keeping the optimal aperture value to achieve the optimal EV. Therefore, if the ET of the initially predicted exposure parameter set is overly long, then, keeping the optimal EV, we first raise the predicted ISO value by (11) within the current camera's allowable ISO settings, replacing the current ISO value with an updated one. Hence, the ET can be reduced while maintaining the same EV. Then, if the ET is below the threshold, the parameter adjustment stops and
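The kNN prediction in the transformed exposure feature space amounts to a majority vote among neighbors, with distances measured after applying the learned linear map. A stdlib sketch, where the 2-D feature vectors, the EC labels and the transformation matrix are purely illustrative (the LMNN training itself is not shown):

```python
from collections import Counter

def knn_predict(query, samples, labels, transform, k=3):
    """Majority label among the k nearest samples, with distances
    measured after applying the learned linear transformation."""
    def project(v):
        return [sum(r * x for r, x in zip(row, v)) for row in transform]
    q = project(query)
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(project(v), q))
    nearest = sorted(range(len(samples)), key=lambda i: dist(samples[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Illustrative data: a learned map that downweights an irrelevant dimension.
samples = [(0.0, 0.0), (0.1, 9.0), (0.2, -9.0),
           (5.0, 0.0), (5.1, 8.0), (4.9, -7.0)]
labels = ["+1 EC", "+1 EC", "+1 EC", "0 EC", "0 EC", "0 EC"]
transform = [[1.0, 0.0], [0.0, 0.01]]  # second dimension nearly ignored
print(knn_predict((0.3, 5.0), samples, labels, transform))
```

In raw feature space the query (0.3, 5.0) is pulled toward samples with similar second coordinates; after the transformation, only the discriminative first dimension matters, which is exactly the rescaling effect the metric learning step is meant to achieve.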
the updated parameter set is suggested; otherwise, further adjustment is needed. When updating the ISO value, we have to check the current camera's allowable range of ISO values. Once the maximum ISO value is reached, we are only able to decrease the ET by decreasing the aperture value using (12) under the same EV. Similarly, when updating the aperture, the updated aperture has to be supported by the current camera. Hence, we can obtain the optimal set of exposure parameters fitting the suggested view and lighting contexts while respecting the mobile camera capabilities.

Fig. 7. The exposure parameter adjustment process.

5) Online Mobile Photography Suggestion Computation: Given an input wide-view panoramic photo, with the view specific composition model and the exposure parameter metrics learned offline, the proposed socialized photography system can efficiently suggest the optimal view and proper camera parameters. Once the input photo and the associated GPS information are sent to the cloud server, the SIFT features and saliency features are generated. Later on, the view enclosure candidates can be generated by sliding windows, and their features can be obtained quickly by simply cutting them out of the original image features. Then, with the indices of the photos in the location, the most relevant images of the candidates can be retrieved and thus their view clusters can be found in a parallel fashion. Due to the limited number of view clusters, the highest ranked relevant cluster can be obtained quickly. The aesthetic scores of the candidates belonging to the highest ranked relevant cluster are predicted in parallel with the offline learned composition model. Hence, the optimal view candidate can be suggested efficiently. Since the current weather information can be prefetched for the given location, the offline learned exposure parameter metrics can be employed to obtain suitable parameters with a simple operation. Usually the parameter adjustment can be finished within 0-3 iterations. Hence, the optimal view and proper camera parameters can be sent back to the client quickly.

V. EXPERIMENTS AND EVALUATIONS

To validate the proposed socialized mobile photography system, we performed objective evaluations on the system components and subjective evaluations on the view suggestion and exposure parameter suggestion results.

A. Dataset Building

Up to now, there is no publicly available dataset containing sufficient photos with context information for photo quality assessment. To guarantee sufficient photos with the required context information for the composition learning and exposure learning processes, we build our own dataset from eight well-known hot spot landmark locations. The proposed photography learning process, including view cluster discovering, composition learning and exposure parameter learning, is performed utilizing the crowdsourced data from eight hot spot places: "Golden Gate Bridge," "Taj Mahal," "Sydney Opera House," "Portland Head Light," "Statue of Liberty," "Jefferson Memorial," "Eiffel Tower," and "United States Capitol." In each location, we use the Flickr API flickr.photos.search to perform an interestingness based search in descending order for photos within a radius of one kilometer, by specifying the geo-location of the landmark. Hence, all photos collected are tagged with latitude and longitude. The capture date and time are obtained from the photo EXIF data. The weather information of all photos can be found from [22] by providing their capture date, hour and city. For each location, about 2,000-4,000 photos are obtained due to the limits of the Flickr API. The number of views ranges from 0 to 61,397, and the number of favorites from 0 to 455. The photos were captured at different times of day, over the years from 2001 to 2012, and under various weather conditions: clear, cloudy, flurries, fog, overcast, rain, and snow. For each location, we randomly select 50 wide-view images for system performance evaluation, i.e., view suggestion evaluation and camera parameter suggestion evaluation, and the remaining photos serve as training and validation data for composition learning and camera parameter metric learning. We split the data into ten folds; in the system component evaluation, i.e., composition learning accuracy and exposure parameter learning accuracy, ten-fold cross validation is carried out. In the system evaluation part, we utilize the composition model and camera parameter metrics learned in one randomly selected round.

B. View Cluster Discovering Results

We carried out the proposed view cluster discovering approach as described in Section IV-B-1. After the content and geo-location based view clustering and the local feature based cluster denoising, 11-33 clusters are obtained for each location. The numbers of clusters for the eight locations are listed in Table I. Example photos of the top three ranked clusters of four hot spot locations, ranked using the view cluster ranking method illustrated in Section IV-B-2, are shown in Fig. 8. Despite the existence of noisy images, the proposed view discovering approach indeed found meaningful view clusters in the locations.

C. Evaluation of Composition Suggestion

1) Composition Learning Accuracy: To validate the proposed view-specific composition learning component, we
YIN et al.: SOCIALIZED MOBILE PHOTOGRAPHY: LEARNING TO PHOTOGRAPH WITH SOCIAL CONTEXT VIA MOBILE DEVICES 195

TABLE I TABLE II
NUMBER OF CLUSTER FOR EIGHT LOCATIONS MSE OF THE PREDICTED AESTHETIC SCORES FOR EIGHT LOCATIONS

Fig. 8. The top three view clusters discovered of four locations: from top to
bottom are “Golden Gate Bridge,” “United States Capitol,” “Taj Mahal,” and
“Portland Head Light.”.

implement the composition learning approach described in


Section IV-B-3 for each discovered cluster and calculate the
accuracy of the predicted aesthetic scores for the photos in
each hot spot place. Through ten fold cross validation, the Fig. 9. Example results of suggested views, in which each image is from one
hot-spot location: (a) input wide-view image, (b) optimal relevant cluster iconic
average Mean Squared Error (MSE) of the predicted results image, (c) suggested view.
for each location is demonstrated in Table II. The minimum
and maximum MSE is 288 and 375 for the eight locations,
respectively, which means the prediction error ranges from 2) Subjective Evaluation of View Suggestion: Due to the lack
17.0 to 19.4 on average. The variance of the images and the of a systematic evaluation function for measuring the photo
reduction percentage of the predicted error from variance for composition in terms of the users’ level of satisfaction, we con-
each location is also shown in Table II. The variances range duct user studies to evaluate the composition of the suggested
from 430 to 565. We can see that the prediction error has been views. Fifteen subjects including three females and with ages
reduced between 14% and 19% from the variance. ranging from 20 to 44 are invited to rate the view suggestion
results and the corresponding wide-view input photos for comparison. Note that the input wide-view images downloaded from Flickr were usually already carefully composed by the photographers. Three subjects are professional photographers: one is a photographer working in a photographic studio, and the other two are students majoring in photography. All of them have more than five years of photography experience with single-lens reflex cameras. Twelve subjects are amateurs who are not majoring in photography but have at least two years of experience with single-lens reflex cameras. For each of the eight locations, five photos and the corresponding results are randomly selected for the user study. Hence, 40 image pairs are rated by the 15 subjects. Due to the page limit, for each location one input wide-view photo example and the corresponding view suggestion result, as well as the iconic image of the optimal relevant view cluster, are shown in Fig. 9. The remaining four input photos and their suggested views for each location are demonstrated in Fig. 10. The subjects are asked to rate each photo from the following three perspectives:
1) Is the photo visually appealing in terms of focal length?
2) Is the photo visually appealing in terms of object placement?
3) What is the overall composition rating?
The subjects are asked to provide their ratings on the three questions for each photo using 1 (very bad), 2 (bad), 3 (neutral), 4 (good) and 5 (very good). The subjective composition evaluation and the subjective camera parameter evaluation in Section V-D-2 took 1.5-2 hours for each photographer. To avoid the inaccuracy of the subjective evaluation due to fatigue or boredom, no more samples were assessed.

Fig. 10. Results of suggested views, in which the input images and suggested views of each hot spot landmark location are shown in one row. (b)(d)(f)(h) are the suggested views of the input wide-view images (a)(c)(e)(g), respectively.

The average ratings of the professional and the amateur photographers on the composition in terms of question (3) for the eight locations are demonstrated in Fig. 13(a). The blue and green solid lines show the average professional ratings for the input and output composition, respectively. The red and black dashed lines show the average amateur ratings for the input and output composition, respectively. From the figure, we find that, for most landmark locations, the suggested compositions are better than the input compositions from either the professional or the amateur perspective. In the cases of the Statue of Liberty and the Jefferson Memorial, the ratings of the suggested compositions are slightly worse than or similar to those of the input compositions, due to the failure to detect distinguished salient points caused by noise in the background or the overly small scale of the foreground objects. Extensive adoption of more robust composition features will be performed in the future to overcome this issue. In addition, the standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of question (3) on overall composition are demonstrated in Fig. 14(a). The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively. The standard deviations of the ratings of the 40 photo pairs from the three professional photographers and the twelve amateur photographers are smaller than 1.24 and 1.85, respectively. The rating distributions of the three professional photographers on questions (1), (2) and (3) are demonstrated in Fig. 11(a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are demonstrated in (d), (e) and (f). The black bars show the rates on the input photos and the white ones show the rates on the corresponding suggested views. From the figure, we can observe that although the rating distributions of the professional and amateur photographers are slightly different, there is a tendency that the composition of the suggested views is significantly improved compared with the input wide-view photos in terms of focal length, object placement and overall composition.

Fig. 11. The rating distributions of the three professional photographers on questions (1), (2) and (3) are demonstrated in (a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are demonstrated in (d), (e) and (f). The black bars show the rates on the input photos and the white ones show the rates on the corresponding suggested views.

Fig. 12. The rating distributions of the three professional photographers on questions (1), (2) and (3) are demonstrated in (a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are demonstrated in (d), (e) and (f). The black bars show the rates on the camera parameters of the input photos and the white ones show the rates on the corresponding suggested camera parameters.

TABLE III
ERROR RATE OF THE PREDICTED EXPOSURE PARAMETERS

D. Evaluation of Camera Parameter Suggestion

1) Exposure Parameter Learning Accuracy: To validate the proposed exposure parameter learning component, we carried out metric learning for aperture, ISO and EC as described in Section IV-B-4 and calculate the average error rate of the predicted parameters through ten-fold cross validation for each hot spot place. The average error rate of the predicted results before parameter adjustment for each location is demonstrated in Table III.

Fig. 13. The average ratings of the three professional photographers and the twelve amateur photographers for the input and output of the eight landmark locations in terms of (a) overall composition and (b) overall exposure parameters. The blue and green solid lines are the average professional ratings for input and output, respectively. The red and black dashed lines are the average amateur ratings for input and output, respectively.

Fig. 14. The standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of (a) overall composition and (b) overall exposure parameters. The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively.

TABLE IV
THE PREDICTED AND SUGGESTED EXPOSURE PARAMETERS OF THE SUGGESTED VIEWS AND CORRESPONDING INPUT IMAGES OF FIG. 9

2) Subjective Evaluation of Exposure Parameter Suggestion: To evaluate the suggested exposure parameters, the 15 photographers are also invited to evaluate the suggested exposure parameters for the suggested views of the same set of 40 photos. The subjects are required to answer the following three questions:
1) Do the camera parameters make the photo underexposed or overexposed?
2) Does the exposure time make the photo blurred when using mobile cameras?
3) Are the camera parameters reasonable from an overall perspective?
For question (1), they are asked to answer with (underexposed), (proper exposure), or (overexposed). For question (2), they provide binary answers: (yes) or (no). For question (3), they are asked to answer with 1 (not reasonable), 2 (reasonable, but can be improved) or 3 (perfect). When evaluating the camera parameters, the lighting-related contexts, i.e., month, time of the day, and weather conditions, are also presented to the photographers. The average ratings of the professional and amateur photographers on the reasonability of the input and output exposure parameters in terms of question (3) for the eight landmark locations are demonstrated in Fig. 13(b), in which the blue and green solid lines show the ratings of the professional photographers on the input and output, respectively, and the red and black dashed lines show the ratings of the amateur photographers on the input and output, respectively. We can observe that for most locations, the suggested parameters are better than or similar to the input parameters, according to either the professional or the amateur ratings. In addition, the standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of question (3) on overall exposure parameters are demonstrated in Fig. 14(b). The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively. The standard deviations of the ratings of the 40 photo pairs from the three professional photographers and the twelve amateur photographers are smaller than 0.94 and 1.15, respectively. The rating distributions of the three professional photographers on questions (1), (2) and (3) are shown in Fig. 12(a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are shown in (d), (e) and (f). The black bars show the ratings on the camera parameters of the original photos, while the white ones show the ratings on the suggested camera parameters. From the evaluations of both professional and amateur photographers, we can find that improper lighting, i.e., under-exposure and over-exposure, as well as blur are both reduced via parameter suggestion. Overall, a large portion of the unreasonable exposure parameters of the input photos are corrected. Hence, the exposure parameters suggested by the proposed system significantly improve the quality of photos captured by mobile cameras.

Assuming the mobile camera aperture range and ISO range are [2.8, 22.6] and [100, 1600], respectively, the predicted exposure parameters and the suggested ones after parameter adjustments for the suggested views, as well as the original parameters of the input photos of Fig. 9, are shown in Table IV. The EV and EC are also demonstrated for comparison.
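To make the arithmetic behind the exposure parameters in Table IV concrete, the sketch below shows how a predicted aperture and ISO can be clamped to the camera-supported ranges assumed above ([2.8, 22.6] and [100, 1600]) and the shutter time solved from the standard exposure-value relation EV = log2(N^2/t) at ISO 100, with the ISO setting and the exposure compensation (EC) folded in as log2 offsets. The function names and the clamping strategy are illustrative assumptions for this sketch, not the exact adjustment procedure of the proposed system.

```python
import math

# Hypothetical device limits, matching the ranges assumed in the text.
APERTURE_RANGE = (2.8, 22.6)   # f-number
ISO_RANGE = (100, 1600)

def clamp(value, lo, hi):
    """Restrict a predicted parameter to what the camera supports."""
    return max(lo, min(hi, value))

def shutter_time(target_ev100, f_number, iso, ec=0.0):
    """Solve the standard exposure equation for the shutter time t.

    EV100 = log2(N^2 / t) - log2(ISO / 100); a positive EC
    brightens the image, i.e., lowers the effective target EV
    and therefore lengthens the exposure.
    """
    f_number = clamp(f_number, *APERTURE_RANGE)
    iso = clamp(iso, *ISO_RANGE)
    effective_ev = target_ev100 + math.log2(iso / 100) - ec
    return f_number ** 2 / 2 ** effective_ev

# Example: a bright overcast scene (EV100 about 12) at f/5.6, ISO 100
# yields a shutter time of roughly 1/130 s.
t = shutter_time(12, 5.6, 100)
```

Quadrupling the ISO quarters the shutter time at the same target EV, which is why a suggestion system can trade a shorter, blur-resistant exposure against the higher noise of a larger ISO on a fixed-aperture mobile camera.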

VI. CONCLUSION AND FUTURE WORK

The contradiction between the popularity of photo capturing and sharing by mobile devices and the lack of photography knowledge and skills of most mobile users serves as the primary motivation of our work. We propose a socialized mobile photography system to assist mobile users in capturing high quality photos by using both the rich context available from mobile devices and crowdsourced social media on the Web. Considering the flexible and adaptive adoption of photography principles with different content and perspective, view clusters are discovered and the view-specific composition rules are learned from the community-contributed images. Leveraging the user's social context, the proposed socialized mobile photography system is able to suggest an optimal view enclosure to achieve appealing composition. Due to the complex relationship between scene content, the various shooting-related contexts, and the exposure parameters, metric learning is applied to suggest appropriate camera parameters. Currently, we aim to solve the mobile photography problem for some hot spot landmark locations. Objective and subjective evaluations on the hot spot landmark photos validated the effectiveness of the proposed socialized mobile photography system.

There are several interesting directions for future work. Photography knowledge transfer from other similar geo-locations may solve the problem of insufficient data for photography learning and thus improve the performance of the socialized mobile photography system [36]. Moreover, in the current system, we can only search for the optimal view enclosure within the input wide-view photo. However, the optimal view may sometimes go beyond the scope of the input views. It is very challenging to quantitatively evaluate the effect of this constraint imposed by the initial wide-view input image. Such an evaluation is beyond the scope of this paper and will be considered in future work. A possible way of finding the optimal view enclosure beyond the input scope may be realized by building 3D models of the given location from crowdsourced images. Furthermore, people tend to capture consumer photos with mobile cameras in various scenes and events as in [37]. Hence, we may extend our system to more scene types and events by analyzing and inferring human portraits and activities. Finally, a city-scale mobile photography suggestion system can be built by integrating automatic landmark recommendation [38] into the current photography suggestion system.

REFERENCES

[1] [Online]. Available: http://tinyurl.com/c8mp7jt
[2] C. Zhu, K. Li, Q. Lv, L. Shang, and R. P. Dick, "iScope: Personalized multi-modality image search for mobile devices," in Proc. MobiSys, 2009, pp. 277-290.
[3] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao, "Location discriminative vocabulary coding for mobile landmark search," Int. J. Comput. Vision, vol. 96, no. 3, pp. 290-314, 2012.
[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Studying aesthetics in photographic images using a computational approach," in Proc. ECCV, 2006, pp. 288-301.
[5] D. Joshi, R. Datta, Q.-T. Luong, E. Fedorovskaya, J. Z. Wang, J. Li, and J. Luo, "Aesthetics and emotions in images: A computational perspective," IEEE Signal Process. Mag., vol. 28, no. 5, pp. 94-115, 2011.
[6] S. Bhattacharya, R. Sukthankar, and M. Shah, "A framework for photo-quality assessment and enhancement based on visual aesthetics," in Proc. ACM Multimedia, 2010, pp. 271-280.
[7] M. Freeman, The Photographer's Eye: Composition and Design for Better Digital Photos. Lewes, U.K.: Ilex Press, 2007.
[8] L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, "Optimizing photo composition," Comput. Graph. Forum, vol. 29, no. 2, pp. 469-478, 2010.
[9] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. ICCV, 2003, pp. 1470-1477.
[10] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, 2004.
[11] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, "Spatial coding for large scale partial-duplicate web image search," in Proc. ACM Multimedia, 2010, pp. 511-520.
[12] B. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.
[13] Photo.net [Online]. Available: http://photo.net/
[14] DPChallenge [Online]. Available: http://www.dpchallenge.com/
[15] Flickr [Online]. Available: http://www.flickr.com/
[16] D. S. Butterfield, C. Fake, C. J. Henderson-Begg, and S. Mourachov, "Interestingness ranking of media objects," U.S. Patent Application 20060242139, 2006.
[17] J.-G. Kim, H. S. Chang, J. Kim, and H.-M. Kim, "Efficient camera motion characterization for MPEG video indexing," in Proc. IEEE Int. Conf. Multimedia and Expo, 2000, pp. 1171-1174.
[18] T. Mei, X.-S. Hua, C.-Z. Zhu, H.-Q. Zhou, and S. Li, "Home video visual quality assessment with spatiotemporal factors," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 6, pp. 699-706, Jun. 2007.
[19] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[20] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, 2011.
[21] J. Zhuang, T. Mei, S. C. H. Hoi, Y.-Q. Xu, and S. Li, "When recommendation meets mobile: Contextual and personalized recommendation on the go," in Proc. UbiComp, 2011, pp. 153-162.
[22] wunderground.com [Online]. Available: http://www.wunderground.com/
[23] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207-244, 2009.
[24] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 419-426.
[25] Y. Luo and X. Tang, "Photo and video quality evaluation: Focusing on the subject," in Proc. ECCV, 2008, pp. 386-399.
[26] X. Sun, H. Yao, R. Ji, and S. Liu, "Photo assessment based on computational visual attention model," in Proc. ACM Multimedia, 2009, pp. 541-544.
[27] S. Dhar, V. Ordonez, and T. L. Berg, "High level describable attributes for predicting aesthetics and interestingness," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 1657-1664.
[28] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka, "Assessing the aesthetic quality of photographs using generic image descriptors," in Proc. ICCV, 2011, pp. 1784-1791.
[29] R. Datta and J. Z. Wang, "ACQUINE: Aesthetic quality inference engine—Real-time automatic rating of photo aesthetics," in Proc. Multimedia Information Retrieval, 2010, pp. 421-424.
[30] L. Yao, P. Suryanarayan, M. Qiao, J. Z. Wang, and J. Li, "OSCAR: On-site composition and aesthetics feedback through exemplars for photographers," Int. J. Comput. Vision, vol. 96, no. 3, pp. 353-383, 2012.
[31] B. Cheng, B. Ni, S. Yan, and Q. Tian, "Learning to photograph," in Proc. ACM Multimedia, 2010, pp. 291-300.
[32] H.-H. Su, T.-W. Chen, C.-C. Kao, W. H. Hsu, and S.-Y. Chien, "Preference-aware view recommendation system for scenic photos based on bag-of-aesthetics-preserving features," IEEE Trans. Multimedia, vol. 14, no. 3, pp. 833-843, Jun. 2012.
[33] D. M. Chen, G. Baatz, K. Koser, S. S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, and J. Bach, "City-scale landmark identification on mobile devices," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 737-744.
[34] S. Bourke, K. McCarthy, and B. Smyth, "The social camera: A case-study in contextual image recommendation," in Proc. IUI, 2011, pp. 13-22.
[35] W. Yin, T. Mei, and C. W. Chen, "Crowd-sourced learning to photograph via mobile devices," in Proc. IEEE Int. Conf. Multimedia and Expo, 2012.
[36] W. Yin, T. Mei, and C. W. Chen, "Assessing photo quality with geo-context and crowdsourced photos," in Proc. VCIP, 2012.

[37] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali, "Cluster-based landmark and event detection for tagged photo collections," IEEE Multimedia, vol. 18, no. 1, pp. 52-63, Jan. 2011.
[38] M. D. Choudhury, M. Feldman, S. Amer-Yahia, N. Golbandi, R. Lempel, and C. Yu, "Automatic construction of travel itineraries using social breadcrumbs," in Proc. 21st ACM Conf. Hypertext and Hypermedia, 2010, pp. 35-44.
[39] W. Yin, J. Luo, and C. W. Chen, "Event-based semantic image adaptation for user-centric mobile display devices," IEEE Trans. Multimedia, vol. 13, no. 3, pp. 432-442, Jun. 2011.

Wenyuan Yin (S'10) received the B.E. degree from Nanjing University of Science and Technology in 2006. She is now pursuing the Ph.D. degree in the Department of Computer Science and Engineering, State University of New York at Buffalo. Her current research interests include image and video semantic understanding, media quality assessment, mobile media adaptation, video transcoding, image processing, machine learning and computer vision. She received the Best Student Paper Award at VCIP 2012.

Tao Mei (M'07-SM'11) is a Lead Researcher with Microsoft Research Asia, Beijing, China. He received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2001 and 2006, respectively. His current research interests include multimedia information retrieval and computer vision. He has authored or co-authored over 100 papers in journals and conferences, and holds eight U.S. granted patents. He was the recipient of several best paper awards, including the Best Paper Awards at ACM Multimedia in 2007 and 2009, and the IEEE Transactions on Multimedia Prize Paper Award 2013. He is an Associate Editor of Neurocomputing and the Journal of Multimedia.

Chang Wen Chen (F'04) is a Professor of Computer Science and Engineering at the State University of New York at Buffalo, USA. Previously, he was Allen S. Henry Endowed Chair Professor at Florida Institute of Technology from 2003 to 2007, and a faculty member at the University of Missouri-Columbia from 1996 to 2003 and at the University of Rochester, Rochester, NY, from 1992 to 1996. He served as the Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Video Technology from January 2006 to December 2009, and as an Editor for the Proceedings of the IEEE, IEEE T-MM, IEEE JSAC, IEEE JETCAS and IEEE Multimedia Magazine. He and his students have received six Best Paper Awards and have been placed among Best Paper Award finalists many times. He is a recipient of the Sigma Xi Excellence in Graduate Research Mentoring Award in 2003, the Alexander von Humboldt Research Award in 2009 and the SUNY-Buffalo Exceptional Scholar-Sustained Achievements Award in 2012. He is an IEEE Fellow and an SPIE Fellow.

Shipeng Li (F'10) joined Microsoft Research Asia (MSRA) in May 1999. He is now a principal researcher and research manager of the Media Computing group. He also serves as the research area manager coordinating the multimedia research activities at MSRA. From October 1996 to May 1999, he was with the Multimedia Technology Laboratory at Sarnoff Corporation (formerly David Sarnoff Research Center and RCA Laboratories) as a member of the technical staff. He has been actively involved in research and development in broad multimedia areas. He has made several major contributions adopted by the MPEG-4 and H.264 standards. He invented and developed the world's first cost-effective high-quality legacy HDTV decoder in 1998. He started P2P streaming research at MSRA as early as August 2000. He led the building of the first working scalable video streaming prototype across the Pacific Ocean in 2001. He has been an advocate of scalable coding formats and was instrumental in the SVC extension of the H.264/AVC standard. He first proposed the Media 2.0 concepts that outlined the new directions of next-generation internet media research (2006). He has authored and coauthored more than 200 journal and conference papers and holds 90+ US patents in image/video processing, compression and communications, digital television, multimedia, and wireless communication.
