Socialized Mobile Photography: Learning To Photograph With Social Context Via Mobile Devices
Abstract—The popularity of mobile devices equipped with various cameras has revolutionized modern photography. People are able to take photos and share their experiences anytime and anywhere. However, taking a high quality photograph via a mobile device remains a challenge for mobile users. In this paper we investigate a photography model to assist mobile users in capturing high quality photos by using both the rich context available from mobile devices and crowdsourced social media on the Web. The photography model is learned from community-contributed images on the Web, and depends on the user's social context. The context includes the user's current geo-location, time (i.e., time of the day), and weather (e.g., clear, cloudy, foggy, etc.). Given a wide view of a scene, our socialized mobile photography system is able to suggest the optimal view enclosure (composition) and appropriate camera parameters (aperture, ISO, and exposure time). Extensive experiments have been performed for eight well-known hot spot landmark locations where sufficient context tagged photos can be obtained. Through both objective and subjective evaluations, we show that the proposed socialized mobile photography system can indeed effectively suggest proper composition and camera parameters to help the user capture high quality photos.

Index Terms—Camera parameters, mobile photography, social context, social media, view enclosure.

I. INTRODUCTION

THE recent popularity of mobile devices and the rapid development of wireless network technologies have revolutionized the way people take and share multimedia content. With the pervasiveness of mobile devices, more and more people are taking photos to share their experiences using their mobile devices anytime and anywhere. Market research indicates that more than 27% of photos were captured by smartphones in 2011, while the number was merely 17% in the previous year [1]. The booming development of built-in mobile cameras (such as the advanced eight megapixel resolution and the large aperture) has triggered a trend that may lead to mobile cameras replacing traditional handheld cameras.

However, mobile cameras cannot guarantee perfect photos. Although mobile cameras harness a variety of technologies to take care of many camera settings (e.g., auto-exposure) for point-and-shoot ease, capturing high quality photos is still a challenging task for amateur mobile users, not to mention those lacking photography knowledge and experience. Therefore, assisting amateur users to capture high quality photos via their mobile devices becomes a demanding task. While most existing research has predominantly focused on how to retrieve and manage photos on mobile devices [2], [3], or how to adapt media considering the unique characteristics of mobile devices [39], there have been few attempts to address this topic before.

To obtain high quality photos, various types of commercial software such as Photoshop have been developed for post-processing to adjust photo quality. However, most of them are designed for desktop PCs and require intensive computation. Although there exist some mobile applications for image post-processing, they are only able to conduct simple operations, such as cropping and contrast adjustment. These post-processing tools cannot always fix poorly captured images. For example, the information lost in an over-exposed image is usually unrecoverable by any post-processing technique. Therefore, it is desirable to assist mobile users in obtaining high quality photos while the photos are being taken. For example, if we can suggest the optimal scene composition and suitable camera settings (i.e., aperture, ISO, and exposure time) based on the user's current context (i.e., geo-location, time, and weather condition) and input scene, the user's ability to capture high quality photos via mobile devices will be improved significantly.

On the other hand, computational aesthetics of photography has emerged as a hot research area. It aims to automatically assess or enhance image quality with computational models based on various visual features, such as lighting (e.g., light, color, and texture) and composition (e.g., the rule of thirds). A comprehensive survey on computational aesthetics can be found in [5]. However, the aesthetic assessment or enhancement methods in existing works are also designed for desktop PCs, and focus on evaluating or enhancing image quality as post-processing tools. Moreover, they only apply some simple and general photography rules to the aesthetics evaluations. For example, when assessing photo composition, they tend to place objects according to the rule of thirds or put the horizon line lower in the frame, according to the golden ratio [6]. In many cases, instead of applying these simple rules slavishly, professional photographers usually adapt the composition and the camera exposure parameters to the shooting conditions, e.g., the scene and the lighting conditions. Therefore, the existing general photo aesthetics assessment and enhancement approaches are far from enough to guide mobile users to capture high quality photos on the fly.

Manuscript received October 09, 2012; revised February 08, 2013 and May 18, 2013; accepted June 27, 2013. Date of publication September 25, 2013; date of current version December 12, 2013. This work was supported by NSF Grant 0964797 and a Gift Funding from Kodak. Part of this work was performed when the first author visited Microsoft Research Asia as a research intern. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Vasileios Mezaris.
W. Yin and C. W. Chen are with the State University of New York at Buffalo, Buffalo, NY 14260 USA (e-mail: chencw@buffalo.edu). T. Mei and S. Li are with Microsoft Research Asia, Beijing 100080, China (e-mail: tmei@microsoft.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2013.2283468
horizon onto the middle when water appears on the ground, to give a feeling of tranquility, as in (i).

Most photographers adapt the photography repertoire, a set of compositional possibilities obtained from their photography knowledge or experience, to the shooting scenes [7]. Therefore, we claim that photography view finding is a highly complex and scene-dependent task, and the introduction of several simple and general composition rules is insufficient to guide the mobile user to capture high quality photos.

B. Complexity of Camera Settings

Exposure has a critical effect on photo lightness, color, and contrast, and the human visual system is quite sensitive to all of them. Unsuitable exposure can make a poor quality photo even when the composition is successful, and quite often even post-processing cannot improve it into an appealing work. An unfeasible setting of any exposure related camera parameter can spoil the photo quality, which is difficult for mobile users to handle.

Professional photographers have to adjust the exposure parameters adaptively considering the shooting scenes, i.e., the subject and its settings. From the image lightness and color perspective, the main subject has to be captured in an acceptable tone, and at the same time, the contrast between the shooting subject and its setting has to be taken into account, given the appearance relations between them, to create an ideal attention effect. Therefore, exposure parameters highly depend on the shooting content.

Various lighting conditions influenced by many context factors such as time and weather make the exposure parameter adjustment a more complicated problem. Even for the same shooting content, the exposure has to be varied with the lighting conditions. The exposures are significantly different between day and night, and even vary with different times of day due to the sunshine variations. In addition, in the same time period of the day at the same place, the exposure also varies with different weather conditions. Therefore, the parameters have to be adapted accordingly to obtain high quality images.

IV. SOCIALIZED MOBILE PHOTOGRAPHY

A. Approach Overview

The framework of the proposed socialized mobile photography system is shown in Fig. 4. The system takes the to-be-taken wide-view image along with the mobile user's geo-location as input and sends it to the cloud server. The input wide-view image can either be directly taken, or synthesized from multiple consecutive photos taken by the mobile user. By jointly utilizing the input wide-view image and its geo-location, as well as the lighting condition related contexts such as time, date, and weather condition, which can be obtained from the Internet, the system will suggest optimal view enclosures and proper exposure parameters best fitting the shooting content and context, based on photography rules learned from crowdsourced social media data and metadata nearby.

Two fundamental components are needed in the socialized mobile photography system to aid mobile users in capturing professional photographs. First, an offline photography learning process is needed to mine composition and exposure rules. Second, the input wide-view image content and context are utilized together to find relevant photography rules to recommend optimal view enclosures and exposure parameters for mobile users.
YIN et al.: SOCIALIZED MOBILE PHOTOGRAPHY: LEARNING TO PHOTOGRAPH WITH SOCIAL CONTEXT VIA MOBILE DEVICES 189
In the offline photography learning procedure, view cluster discovering is performed first in a certain scope of geo-locations, by clustering based on both image visual features and their geo-locations. Because some view clusters are intrinsically more appealing than others, view cluster ranking is carried out. For example, as shown in the view cluster ranking part of Fig. 4, the cluster of the first row is significantly better than that of the fourth row. The view cluster ranking results will be utilized in the online optimal view enclosure searching step to make the searching process much more efficient. As mentioned earlier, due to the non-exhaustiveness and flexibility of the composition rules when taking pictures of different portions of scenes, composition learning is performed for each view cluster discovered. As the instances in the view specific composition learning part of Fig. 4 show, for the view cluster of the second row, the rule of thirds is more appropriate to apply than symmetrical composition; while for the cluster of the third row, instead of the rule of thirds, symmetrical composition of the close-up view of the "Golden Gate Bridge" pier is preferred. Moreover, due to the fact that professional photographers usually adjust the camera exposure parameters according to the brightness and color of the objects and the whole setting, as well as the lighting condition influenced by a variety of factors, such as the intensity and the direction of sunshine, which are affected by the season and the time of the day as well as weather conditions, metric learning is carried out to model the various effects of content and context on the exposure parameters.

In the online photography suggestion stage, utilizing the visual content and geo-location of the input, relevant view clusters similar to all possible view enclosure candidates of the input image are found. Considering the fact that some view enclosure candidates are intrinsically bad no matter how the object placement is tuned, a large portion of the view enclosure candidates similar to the low ranked view clusters are discarded. Afterwards, the optimal view enclosures will be selected, based on the offline learned view specific composition principles, only from the remaining enclosure candidates. Once mobile users are provided with the optimal view enclosure, the appropriate exposure parameters, i.e., exposure time, aperture and ISO, suitable for the view and lighting conditions, are suggested. With the suggested view enclosure and corresponding exposure parameters, mobile users can capture high quality photos with appealing composition and exposure.

B. Offline Photography Learning

1) View Cluster Discovering: Intuitively, when photographers visit a certain scope of geo-location, they tend to capture photos from a certain number of photo-worthy viewpoints. The aggregated photographs taken in the location scope, associated with their social information from the media community, can provide significant insight into the aesthetics and composition rules of different viewpoints. To discover those repeated views from the crowdsourced photographs of the given location scope, we perform image clustering by jointly utilizing image visual features and capture locations. The target is to expose different views with different portions of the scene from a variety of perspectives.

View Clustering. An efficient content and geo-location based clustering process is carried out to discover all typical view clusters in the location scope. As the images within the same view cluster should contain the same main objects, we use local features with geometric verification to capture image content similarity. To facilitate the clustering and online relevant view discovering process, the crowdsourced photos in the location scope are indexed by inverted files [9] based on SIFT [10] visual words. To overcome the false matching caused by the ambiguity of the visual words, geometric relationships among the visual words are also recorded into the index by spatial coding as in [11]. Hence, using the index, the image content similarity can be efficiently computed based on the matching score formulated by the number of matched visual words passing the spatial verification [11]. In addition, considering that images captured from close places usually have similar content from similar perspectives, location is also adopted in the view cluster discovering process. The image location similarity is calculated based on their GPS Euclidean distance. Then the view clusters can be discovered by a clustering process based on the image similarity obtained by fusing their content similarity and location similarity. The similarity fusion is achieved using their product.

However, it is difficult to manually specify the number of views, considering the differences in content and perspectives for clustering, even after manually going through the whole dataset crawled from the location scope. Therefore, affinity propagation [12], which does not require the specification of the number of clusters, is performed to cluster the images into different views based on the fused image similarity.

View Cluster Denoising. To learn the photography rules of a given content from a certain perspective, we need to model the relevant rules for each cluster. However, the noisy images without the main objects in the view clusters discovered by the above clustering process would negatively affect the photography learning process. It is necessary to remove the images without the same content. To identify the images sharing the main objects with other images in the cluster, we first select the iconic image based on local features. The image with the maximum total content similarity score with the others within the cluster is chosen as the iconic image of the view cluster. Afterwards, the images with content similarity scores less than a threshold are considered noisy images without the main objects in the cluster and thus are discarded. Then, noisy clusters without representative content in the scene, such as the ones containing portraits and crowds, can be removed by discarding the clusters with quite a small number of images. Here we discard the clusters with fewer than 5 images. Finally, we can expect images with very similar content from the same viewpoints to end up in different view clusters.

2) View Cluster Ranking: As aforementioned, some view clusters have more appealing content and composition than others; therefore, we rank the view clusters discovered. Later on, in the online view enclosure selection stage, the enclosures having relatively less appealing content can be discarded directly to facilitate the optimal view searching. We adopt the cluster size and the score distribution of the images in the cluster to rank the view clusters.

View Cluster Ranking. Suppose we have the whole ranking list of the crowdsourced images based on their aesthetic scores; to rank the view clusters, the score distributions of the individual images in the cluster need to be taken into account. In the ideal case, the individual images of high ranking clusters should have high aesthetic scores in the average sense.

190 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 1, JANUARY 2014

On the other hand, the individual image scores of highly ranked clusters should not be scattered too much, since the clusters with compact scores tend to have more stable aesthetic scores. Inspired by the commonly used metric, average precision, from the information retrieval field, we formulate the average score of cluster k by

\bar{S}_k = \frac{1}{N_k}\sum_{i=1}^{N_k} P(x_{k,i}), \quad P(x_{k,i}) = \frac{\sum_{j=1}^{i} s(x_{k,j})}{\sum_{r=1}^{R(x_{k,i})} s(x_r)}    (1)

where N_k is the number of images in view cluster k, x_{k,i} is the i-th image belonging to cluster k in the whole rank list, and s(\cdot) is the aesthetic score. P(x_{k,i}) is the ratio of the total score of the images within cluster k to the total score of the images at the cut-off rank R(x_{k,i}) of image x_{k,i} in the whole rank list, in which x_{k,j} is the j-th image belonging to cluster k and x_r is the r-th image in the rank list no matter which cluster it belongs to. In addition, the appealing degree of the view cluster can also be reflected by the size of the cluster, since pleasing objects tend to draw more photographers' attention, and hence a large number of images are aggregated in the view cluster. Therefore, the view clusters are ranked according to the scores calculated by

(2)

The larger the view cluster size, the higher the view is scored.

Image Aesthetic Score Generation. In many social media websites such as Photo.net [13] and DPChallenge [14], most photos are rated by a number of professional photographers, and the ratings can almost reflect the photo aesthetics. Due to the lack of GPS information, we did not use the data from these websites. However, we can expect the availability of their context information in the near future due to user demand and advanced techniques. Given the rich information of Flickr [15], we generate image aesthetic scores based on several heuristic criteria.

• Ranking of interestingness. Interestingness-enabled search is provided by the Flickr API. As in [16], the interestingness rankings are based on the quantity of user entered metadata such as tags, comments and annotations, the number of users who assigned metadata, user access patterns, and a lapse of time of the media objects. Based on the interestingness rankings, the top photos usually have high quality. Hence, we utilize this important information to generate the aesthetic scores. Due to the consideration of the lapse of time, although the interestingness based ranking can reflect the photo quality, the ranking scheme tends to rank newly uploaded photos higher than old photos. To improve the rankings of those appealing but old photos, we enhance the influence of the number of views and the number of favorites, considering the fact that these photos usually have high quantities in these two terms, even though the interestingness has subsumed the two terms.

• Number of favorites. The number of favorites explicitly shows how many people liked the image. Hence it is a straightforward reflection of the photo's degree of appeal.

• Number of views. The number of views of the image can reflect the attention drawn from the social media community. Usually, high quality images tend to be viewed by more users. We also consider this parameter in aesthetic score generation to complement the number of favorites.

Therefore, by weakening the time fading effect of the interestingness rankings via highlighting the impacts of the number of views and favorites, the interestingness rankings are utilized to generate photo aesthetic scores by fusing the above three factors as follows:

(3)

where v_i and f_i are the number of views and the number of favorites of image i, respectively, r_i is the rank of image i based on interestingness, and T is the total number of crawled images in the location scope. In this way, aesthetic scores ranging from 0 to 100 are generated, in which high quality photos are assigned high aesthetic scores. Through empirical analysis, we set the fusion parameters.

3) View Specific Composition Learning: The view clusters obtained are expected to contain different content or be captured with different perspectives from different location tiles. As illustrated in Fig. 4, different composition rules are adopted for those different view clusters. Therefore, view specific composition learning has to be performed to extract different composition principles for each view cluster.

The photographs of each cluster usually contain the same objects from similar perspectives but with different positions and scales in the frame. The difference is one key factor leading to the different aesthetic scores. To characterize the composition difference in terms of the main object placement for each cluster, the camera operations compared with the cluster iconic image are utilized to represent the main object composition, as shown in Fig. 5. The camera operations are defined as horizontal translation, vertical translation, zoom in/out, and rotation, as in [17]. Using the matched SIFT points, the coordinates (x', y') in the given image can be represented by the affine model based on the coordinates (x, y) of the corresponding matched points in the cluster iconic image as follows:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_1 \\ t_2 \end{pmatrix}    (4)

The parameters of the affine model (a_1, a_2, a_3, a_4, t_1, t_2) can be calculated by the least squares method based on all matched SIFT points in the given image. Based on the affine parameters, the camera operations can be obtained by

(5)

where

(6)

The four camera operation terms represent the camera horizontal translation, vertical translation, zoom in/out degree, and rotation, respectively [18]. As shown in Fig. 5, the object composition in terms of scale and location can be captured by the camera operations compared with the view cluster iconic image.

In addition to the modeled main object, some other salient objects in the photos can also affect the image composition. Therefore, the spatial distribution of saliency is also utilized to capture the composition. We divide the image into 5×5 grids,
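The least-squares affine fit behind (4) can be sketched as follows. Since the extracted equations (5) and (6) lost their symbols, the pan/tilt/zoom/rotation read-out below uses one common parameterization of camera operations and should be taken as an illustrative assumption rather than the paper's exact formulas.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine model dst ~ A @ src + t from matched keypoints.

    src, dst: (N, 2) arrays of matched point coordinates, e.g., SIFT
    matches between the cluster iconic image and a given image.
    """
    n = src.shape[0]
    # Design matrix for [x', y'] = [a1 a2; a3 a4] [x, y]^T + [t1, t2]^T
    X = np.hstack([src, np.ones((n, 1))])              # (N, 3)
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)   # (3, 2) solution
    A = params[:2].T                                   # 2x2 linear part
    t = params[2]                                      # translation vector
    return A, t

def camera_operations(A, t):
    """Map affine parameters to pan/tilt/zoom/rotation.

    This is one common parameterization (an assumption here,
    not the paper's exact definition in (5)-(6)).
    """
    pan, tilt = t
    zoom = (A[0, 0] + A[1, 1]) / 2.0        # isotropic scale estimate
    rotation = (A[1, 0] - A[0, 1]) / 2.0    # small-angle rotation estimate
    return pan, tilt, zoom, rotation

# Synthetic check: iconic-image points zoomed by 1.2 and shifted by (5, -3).
src = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.], [5., 5.]])
dst = 1.2 * src + np.array([5., -3.])
A, t = fit_affine(src, dst)
pan, tilt, zoom, rot = camera_operations(A, t)
```

With exact matches the fit recovers the synthetic zoom and translation; with real SIFT matches, outliers would typically be rejected first (e.g., by the spatial verification mentioned above).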
(7)

geo-location. In this system, the dataset is built by GPS enabled search from Flickr, and hence all photos are tagged with their GPS. In addition, the date and time can be found from the EXIF of the photos.

Exposure Feature Space Metric Learning. The exposure parameter setting is a complicated problem due to the various shooting content and lighting conditions affected by various contextual factors. Although we may identify all possible influencing factors, it is still quite difficult to formulate their different effects. A simple instance is that when taking photos at night, the time of day may be a dominant factor rather than the weather conditions, while at noon, the lighting difference between cloudy and clear weather conditions may have a greater influence than time on the exposure. With these context dimensions intertwined, it is hard to determine their influences on exposure. The exposure setting problem becomes even more challenging when taking into account the diversity of exposure for different shooting content.

Therefore, the proposed system models the exposure parameters by supervised distance metric learning, which generally aims to learn a transformation of the feature space to maximize the classification performance. Here, the hope is to find a transformation of the exposure feature space with the ability to rescale the feature dimensions according to their effects on the exposure parameter selections. Ideally, the learned transformation can project the photos with similar exposure parameters into clusters, as illustrated in the exposure metric learning module of Fig. 4.

To perform distance metric learning for the exposure feature space, Large Margin Nearest Neighbor (LMNN) [23], a distance metric learner designed for the k-nearest neighbor (kNN) classifier, is utilized. It aims to maintain that the k-nearest neighbors always belong to the same class, while the samples from different classes are separated by a large margin. The approach maximizes the kNN classification performance without the unimodal distribution assumption for each class, which is a valuable property for fitting our model. For example, the exposure parameters for the same view cluster under similar weather conditions at sunrise and sunset may be the same, but fall into different clusters after the exposure feature space transformation. In the metric learning step, the exposure parameters of the photos are represented in the exposure feature space with their exposure parameter values as labels.

Exposure Compensation, Aperture and ISO Learning. As different exposure parameters have different sensitivity and functionality to the content and context dimensions, we have to learn the metrics for the parameters separately. For example, at night, photographers tend to increase the ISO values to overcome the issue of weak lighting, while, besides the concerns about lighting, they tend to use a large aperture to reduce the depth of field when shooting single objects. A straightforward way is to model the exposure feature space distance metrics for the three most commonly used exposure parameters, i.e., aperture, ISO and exposure time, directly. Then, the exposure value can be calculated by

EV = \log_2\left( \frac{A^2}{ET} \cdot \frac{100}{ISO} \right)    (8)

where EV and ET are the exposure value and exposure time, respectively, A is the aperture (f-number), and ISO is the film speed. However, this is not reasonable, because even though the separately predicted aperture, ISO and exposure time are each applicable, when putting them together it is not guaranteed that the exposure value fits the given content and context conditions. To suggest exposure parameters with a reasonable EV, there are two possible ways. One is to learn the distance metric of the exposure feature space by using the coupled aperture, ISO and ET triples as labels, but the combination of the large number of possible values of the three parameters makes the number of labels increase exponentially, and hence degrades performance. The other way is to predict EV and any two of the three parameters. However, EV determination is complicated because it is sensitive to the layout of the intensity and colors in the shooting view and the lighting conditions. Although camera manufacturers have made efforts to estimate correct exposure values by improving lighting meters for decades, it is still challenging to provide a correct exposure value automatically. Hence, predicting the exposure value directly according to shooting content and context is not practical. Empirical analysis also validates that the direct exposure value prediction performance is not satisfactory.

Although camera lighting meters have indeed improved nowadays, professional photographers sometimes need to adjust the camera computed exposure value by increasing or decreasing certain compensation levels to achieve the perfect EV. By setting the exposure compensation (EC), the camera lighting meters and the photographers' knowledge are both sufficiently utilized to obtain the ideal EV. Inspired by this, we adopt EC to achieve the optimal EV. We learn the distance metrics of the exposure feature space for EC. Similarly, metric learning of the exposure feature space for aperture and ISO is also performed, respectively. Once the optimal EV, aperture and ISO are obtained, ET can be calculated according to (8).

C. Online Mobile Photography Suggestion

Given an input wide-view image, view enclosure candidates at various positions and in different scales need to be generated for optimal view selection. As mentioned in Section IV-B-3, we assume the mobile captured photos are of aspect ratio 2:3 or 3:2. We slide a window of the given aspect ratio, with a moving step size, from the smallest window size for horizontal and vertical input images, respectively, up to the largest possible window size under a scaling ratio, to generate all possible view enclosure candidates. Then, a number of view candidates will first be discarded based on the view cluster rankings. The view enclosure with the best composition will be selected from the remaining ones, utilizing the offline learned view specific composition rules.

1) Relevant View Discovering: Once the view enclosure candidates are generated, their relevant view clusters containing the same content have to be discovered for their composition aesthetics judgment. Using the image index built offline, the most relevant image can be efficiently retrieved for each of the enclosure candidates. We can consider the view cluster which the relevant image belongs to as the relevant view cluster of the enclosure candidate. The visual words extraction only needs to be performed once on the input image, and the visual words of each candidate can be obtained based on the enclosure coordinates accordingly.

2) Low Ranking View Enclosure Discarding: It is difficult to decide which part of a given input panoramic image to capture
(11)

where the two estimated EVs are those of the suggested view enclosure and of the input panoramic image from the camera meter, respectively.

Hence, ET can be reduced while maintaining the same EV. Then, if ET is below the threshold, the parameter adjustment stops and
the updated parameter set is suggested; otherwise, further adjustment is needed. When updating the ISO value, we have to check the current camera's allowable range of ISO values. Once the maximum ISO value is reached, we are only able to decrease ET by decreasing the aperture value using (12) under the same EV.

(12)

Similarly, when updating the aperture, the updated aperture has to be supported by the current camera. Hence, we can obtain the optimal set of exposure parameters fitting the suggested view and lighting contexts, considering the mobile camera capabilities.

Fig. 7. The exposure parameter adjustment process.

5) Online Mobile Photography Suggestion Computation: Given an input wide-view panoramic photo, with the view specific composition model and the exposure parameter metrics learned offline, the proposed socialized photography system can efficiently suggest the optimal view and proper camera parameters. Once the input photo and the associated GPS information are sent to the cloud server, the SIFT features and saliency features are generated. Later on, the view enclosure candidates can be generated by sliding windows, and their features can be obtained quickly by simply cutting them out from the original image features. Then, with the indices of the photos in the location, the most

A. Dataset Building

Up to now, there is no publicly available dataset containing sufficient photos with context information for photo quality assessment. To guarantee sufficient photos with the required context information for the composition learning and exposure learning processes, we build our own dataset from eight well-known hot spot landmark locations. The proposed photography learning process, including view cluster discovering, composition learning and exposure parameter learning, is performed utilizing the crowdsourced data from eight hot spot places: "Golden Gate Bridge," "Taj Mahal," "Sydney Opera House," "Portland Head Light," "Statue of Liberty," "Jefferson Memorial," "Eiffel Tower," and "United States Capitol." In each location, we use the Flickr API flickr.photos.search to perform interestingness based search in descending order for photos within a radius of one kilometer by specifying the geo-location of the landmark. Hence, all photos collected are tagged with latitude and longitude. The capture date and time are obtained from the photo EXIF data. The weather information of all photos can be found from [22] by providing their capture date, hour and city. For each location, about 2,000–4,000 photos are obtained due to limits of the Flickr API. The number of views ranges from 0 to 61,397, and the number of favorites from 0 to 455. The photos were captured at different times of day, ranging from 2001 to 2012, and under various weather conditions: clear, cloudy, flurries, fog, overcast, rain, and snow. For each location, we randomly select 50 wide-view images for system performance evaluation, i.e., view suggestion evaluation and camera parameter suggestion evaluation, and the remaining photos serve as training and validation data for composition learning and camera parameter metric learning. We split the data into ten folds and, in the system component evaluation, i.e., composition learning accuracy and exposure parameter learning accuracy, ten-fold cross validation is carried out. In the system evaluation part, we utilize the composition model and camera parameter metrics learned in one randomly selected round.

B. View Cluster Discovering Results

We carried out the proposed view cluster discovering approach as described in Section IV-B-1. After the content and
relevant images of the candidates can be retrieved and thus geo-location based view clustering and local feature based
their view clusters can be found in a parallel fashion. Due to cluster denoising, 11–33 clusters are obtained for each location.
the limited number of view clusters, the highest ranked relevant The number of clusters for the eight locations are listed in
cluster can be obtained quickly. The aesthetic scores of the Table I. Example photos of the top three ranked clusters of four
candidates belonging to the highest ranked relevant cluster are hot spot locations using the view cluster ranking method illus-
parallelly predicted with the offline learned composition model. trated in Section IV-B-2 are demonstrated in Fig. 8. Despite the
Hence, the optimal view candidate can be suggested efficiently. existence of noisy images, the proposed view discovering ap-
Since the current weather information can be prefetched for the proach indeed found meaningful view clusters in the locations.
given location, the offline learned exposure parameter metrics
can be employed to obtain suitable parameters with a simple C. Evaluation of Composition Suggestion
operation. Usually the parameter adjustment can be finished
with 0–3 iterations. Hence, the optimal view and proper camera 1) Composition Learning Accuracy: To validate the pro-
parameters can be sent to the client quickly. posed view-specific composition learning component, we
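The ten-fold split and MSE computation used in the component evaluation can be sketched in Python. The `train_fn` and `predict_fn` callables below are hypothetical placeholders standing in for the paper's learned composition model; only the cross-validation protocol itself is taken from the text.

```python
import random

def ten_fold_mse(samples, train_fn, predict_fn, seed=0):
    """Ten-fold cross validation over (features, score) pairs, returning the
    mean squared error of the predicted aesthetic scores."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    folds = [data[i::10] for i in range(10)]   # ten roughly equal folds
    sq_err, n = 0.0, 0
    for k in range(10):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        model = train_fn(train)                # fit on the nine other folds
        for x, y in folds[k]:                  # score the held-out fold
            e = predict_fn(model, x) - y
            sq_err += e * e
            n += 1
    return sq_err / n
```

For instance, with a constant-score dataset and a mean-score predictor, the returned MSE is zero; a real composition model would be plugged in via `train_fn`/`predict_fn`.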
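The per-landmark crawl in the dataset building step can be sketched as the construction of a flickr.photos.search request. The parameter names follow the public Flickr REST API; the API key and the `extras` selection here are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"

def build_search_url(api_key, lat, lon, radius_km=1, per_page=500, page=1):
    """Build a flickr.photos.search URL for geo-tagged photos around a
    landmark, sorted by interestingness in descending order, within a
    1 km radius of the given geo-location (as in the dataset crawl)."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,            # placeholder key
        "lat": lat,
        "lon": lon,
        "radius": radius_km,           # kilometres (Flickr's default unit)
        "sort": "interestingness-desc",
        "extras": "geo,date_taken,views",
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)
```

Paging through the results and reading EXIF capture times would follow the same pattern with flickr.photos.getExif.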
TABLE I
NUMBER OF CLUSTERS FOR EIGHT LOCATIONS

TABLE II
MSE OF THE PREDICTED AESTHETIC SCORES FOR EIGHT LOCATIONS

Fig. 8. The top three view clusters discovered for four locations; from top to bottom: “Golden Gate Bridge,” “United States Capitol,” “Taj Mahal,” and “Portland Head Light.”

Fig. 10. Results of suggested views, in which the input images and suggested views of each hot spot landmark location are shown in one row. (b), (d), (f), (h) are the suggested views of the input wide-view images (a), (c), (e), (g), respectively.
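As a rough illustration of the geo-location side of the view cluster discovering step summarized in Table I and Fig. 8, the sketch below greedily groups geo-tagged photos by great-circle distance. It is a deliberately simplified stand-in for the paper's content-plus-geo-location clustering (affinity propagation [12]); the 50 m radius is an arbitrary choice.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def cluster_by_distance(points, radius_m=50.0):
    """Greedy single-pass clustering: assign each geo-tagged photo to the
    first cluster whose seed lies within radius_m, else start a new cluster.
    Returns one cluster label per input point."""
    seeds, labels = [], []
    for p in points:
        for i, s in enumerate(seeds):
            if haversine_m(p, s) <= radius_m:
                labels.append(i)
                break
        else:
            seeds.append(p)
            labels.append(len(seeds) - 1)
    return labels
```

In the paper the geo-location affinity is combined with visual content similarity before clustering, and noisy clusters are pruned with local features; this sketch covers only the geographic grouping.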
2) Subjective Evaluation of Composition Suggestion: Fifteen photographers are invited to rate the view suggestion results and the corresponding wide view input photos for comparison. Note that the input wide view images downloaded from Flickr were usually already carefully composed by the photographers. Three subjects are professional photographers: one of them works in a photographic studio, and the other two are students majoring in photography. All of them have more than five years of photography experience with single lens reflex cameras. Twelve subjects are amateurs, who are not majoring in photography but have at least two years of experience with single lens reflex cameras. For each of the eight locations, five photos and corresponding results are randomly selected for the user study. Hence, 40 image pairs are rated by the 15 subjects. Due to the page limit, for each location one input wide view photo example and the corresponding view suggestion result, as well as the iconic image of the optimal relevant view cluster, are shown in Fig. 9. The remaining four input photos and their suggested views of each location are demonstrated in Fig. 10. The subjects are asked to rate each photo from the following three perspectives:
1) Is the photo visually appealing in terms of focal length?
2) Is the photo visually appealing in terms of object placement?
3) What is the overall composition rating?
The subjects are asked to provide their ratings on the three questions for each photo using 1 (very bad), 2 (bad), 3 (neutral), 4 (good) and 5 (very good). The subjective composition evaluation and the subjective camera parameter evaluation in Section V-D-2 took 1.5–2 hours for each photographer. To avoid inaccuracy of the subjective evaluation due to fatigue or boredom, no more samples were assessed.

The average ratings of the professional and the amateur photographers on the composition in terms of question (3) for the eight locations are demonstrated in Fig. 13(a). The blue and green solid lines show the average professional ratings for the input and output composition, respectively. The red and black dashed lines show the average amateur ratings for the input and output composition, respectively. From the figure, we find that, for most landmark locations, the suggested compositions are better than the input compositions from either the professionals' or the amateurs' perspective. In the cases of the Statue of Liberty and the Jefferson Memorial, the ratings of the suggested compositions are slightly worse than or similar to those of the input compositions, due to failure to detect distinguished salient points caused by background noise or the overly small scale of foreground objects. More robust composition features will be adopted in the future to overcome this issue. In addition, the standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of question (3) on overall composition are demonstrated in Fig. 14(a). The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively. The standard deviations of the ratings of the 40 photo pairs from the three professional photographers and the twelve amateur photographers are smaller than 1.24 and 1.85, respectively. The rating distributions of the three professional photographers on questions (1), (2) and (3) are demonstrated in Fig. 11(a), (b) and (c), respectively, while the rating distributions of the twelve amateur photographers are demonstrated in (d), (e) and (f). The black bars show the ratings on the input photos and the white ones show the ratings on the corresponding suggested views. From the figure, we can observe that although the rating distributions of the professional and amateur photographers are slightly different, there is a tendency that the composition of the suggested views is significantly improved compared with the input wide view photos in terms of focal length, object placement and overall composition.

TABLE III
ERROR RATE OF THE PREDICTED EXPOSURE PARAMETERS

Fig. 13. The average ratings of the three professional photographers and the twelve amateur photographers for the input and output of the eight landmark locations in terms of (a) overall composition and (b) overall exposure parameters. The blue and green solid lines are the average professional ratings for input and output, respectively. The red and black dashed lines are the average amateur ratings for input and output, respectively.

D. Evaluation of Camera Parameter Suggestion

1) Exposure Parameter Learning Accuracy: To validate the proposed exposure parameter learning component, we carried out metric learning for aperture, ISO and EC as described in Section IV-B-4 and calculated the average error rate of the predicted parameters through ten-fold cross validation for each hot spot place. The average error rate of the predicted results before parameter adjustment for each location is demonstrated in Table III.

TABLE IV
THE PREDICTED AND SUGGESTED EXPOSURE PARAMETERS OF THE SUGGESTED VIEWS AND CORRESPONDING INPUT IMAGES OF FIG. 9

Fig. 14. The standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of (a) overall composition and (b) overall exposure parameters. The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively.

2) Subjective Evaluation of Exposure Parameter Suggestion: To evaluate the suggested exposure parameters, the 15 photographers are also invited to evaluate the suggested exposure parameters for the suggested views of the same set of 40 photos. The subjects are required to answer the following three questions:
1) Do the camera parameters make the photo underexposed or overexposed?
2) Does the exposure time make the photo blurred when using mobile cameras?
3) Are the camera parameters reasonable from an overall perspective?
For question (1), they are asked to answer with (underexposed), (proper exposure), or (overexposed). For question (2), they provide binary answers, (yes) or (no). For question (3), they are asked to answer with 1 (not reasonable), 2 (reasonable, but can be improved) or 3 (perfect). When evaluating the camera parameters, the lighting related contexts, i.e., month, time of the day, and weather conditions, are also presented to the photographers. The average ratings of the professional and amateur photographers on the reasonability of the input and output exposure parameters in terms of question (3) for the eight landmark locations are demonstrated in Fig. 13(b), in which the blue and green solid lines show the ratings of the professional photographers on the input and output, respectively, and the red and black dashed lines show the ratings of the amateur photographers on the input and output, respectively. We can observe that for most locations, the suggested parameters are better than or similar to the input parameters, in both the professionals' and the amateurs' ratings. In addition, the standard deviations of the ratings from the three professional photographers and the twelve amateur photographers for the input and output of the 40 photos from the eight landmark locations in terms of question (3) on overall exposure parameters are demonstrated in Fig. 14(b). The blue and green solid lines are the standard deviations of the professional ratings for input and output, respectively. The red and black dashed lines are the standard deviations of the amateur ratings for input and output, respectively. The standard deviations of the ratings of the 40 photo pairs from the three professional photographers and the twelve amateur photographers are smaller than 0.94 and 1.15, respectively. The rating distributions of the three professional photographers on questions (1), (2) and (3) are shown in Fig. 12(a), (b), and (c), respectively, while the rating distributions of the twelve amateur photographers are shown in (d), (e), and (f). The black bars show the ratings on the camera parameters of the original photos, while the white ones show the ratings on the suggested camera parameters. From the evaluations of both professional and amateur photographers, we can find that improper lighting, i.e., under-exposure and over-exposure, as well as blur are both reduced via parameter suggestion. Overall, a large portion of the unreasonable exposure parameters of the input photos are corrected. Hence, the exposure parameters suggested by the proposed system significantly improve the quality of photos captured by mobile cameras.

Assuming the mobile camera aperture range and ISO range are [2.8, 22.6] and [100, 1600], respectively, the predicted exposure parameters and the suggested ones after parameter adjustment for the suggested views, as well as the original parameters of the input photos of Fig. 9, are shown in Table IV. The EV and EC are also demonstrated for comparison.
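The adjustment loop of Fig. 7, together with the camera limits assumed above ([2.8, 22.6] aperture and [100, 1600] ISO), can be sketched as follows. The EV convention EV = log2(A^2 / ET) − log2(ISO / 100), the full-stop tables, and the 1/30 s hand-shake threshold are assumptions for illustration, not the paper's exact formulation.

```python
import math

A_STOPS = [22.6, 16.0, 11.0, 8.0, 5.6, 4.0, 2.8]   # assumed f-number stops
ISO_STOPS = [100, 200, 400, 800, 1600]             # assumed ISO stops

def exposure_time(ev, aperture, iso):
    """ET implied by a fixed EV under the (assumed) convention
    EV = log2(A^2 / ET) - log2(ISO / 100)."""
    return aperture ** 2 * 100.0 / (2.0 ** ev * iso)

def adjust(ev, aperture, iso, et_max=1 / 30):
    """Mimic Fig. 7: while ET exceeds the hand-shake threshold, first raise
    ISO, then open the aperture (smaller f-number), keeping EV constant.
    Assumes the starting ISO lies on a full stop."""
    iso_idx = ISO_STOPS.index(iso)
    a = aperture
    while exposure_time(ev, a, ISO_STOPS[iso_idx]) > et_max:
        if iso_idx + 1 < len(ISO_STOPS):
            iso_idx += 1                 # raise ISO one stop (halves ET)
        else:
            lower = [s for s in A_STOPS if s < a]
            if not lower:
                break                    # camera limits reached; give up
            a = lower[0]                 # open aperture one stop
    return a, ISO_STOPS[iso_idx], exposure_time(ev, a, ISO_STOPS[iso_idx])
```

For a dim scene (EV 6) starting at f/8 and ISO 100, the loop raises ISO to its ceiling of 1600 and then opens the aperture to f/5.6 before ET drops under the threshold, matching the order of operations described in the text.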
VI. CONCLUSION AND FUTURE WORK

The contradiction between the popularity of photo capturing and sharing by mobile devices and the lack of photography knowledge and skills of most mobile users serves as the primary motivation of our work. We propose a socialized mobile photography system to assist mobile users in capturing high quality photos by using both the rich context available from mobile devices and crowdsourced social media on the Web. Considering the flexible and adaptive adoption of photography principles with different content and perspective, view clusters are discovered and view-specific composition rules are learned from the community-contributed images. Leveraging the user's social context, the proposed socialized mobile photography system is able to suggest the optimal view enclosure to achieve appealing composition. Due to the complex mapping from scene content and a number of shooting-related contexts to exposure parameters, metric learning is applied to suggest appropriate camera parameters. Currently, we aim to solve the mobile photography problem for some hot spot landmark locations. Objective and subjective evaluations for the hot spot landmark photos validated the effectiveness of the proposed socialized mobile photography system.

There are several interesting directions for future work. Photography knowledge transfer from other similar geo-locations may solve the problem of insufficient data for photography learning and thus improve the performance of the socialized mobile photography system [36]. Moreover, in the current system, we can only search for the optimal view enclosure within the input wide-view photo. However, the optimal view may sometimes go beyond the scope of the input views. It is very challenging to quantitatively evaluate the effect of this constraint imposed by the initial wide-view input image; such an evaluation is beyond the scope of this paper and will be considered in future work. A possible way of finding the optimal view enclosure beyond the input scope may be realized by building 3D models of the given location from crowdsourced images. Furthermore, people tend to capture consumer photos with mobile cameras in various scenes and events, as in [37]. Hence, we may extend our system to more scene types and events by analyzing and inferring human portraits and activities. Finally, a city-scale mobile photography suggestion system can be built by integrating automatic landmark recommendation [38] into the current photography suggestion system.

REFERENCES

[1] [Online]. Available: http://tinyurl.com/c8mp7jt
[2] C. Zhu, K. Li, Q. Lv, L. Shang, and R. P. Dick, “iScope: Personalized multi-modality image search for mobile devices,” in Proc. MobiSys, 2009, pp. 277–290.
[3] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao, “Location discriminative vocabulary coding for mobile landmark search,” Int. J. Comput. Vision, vol. 96, no. 3, pp. 290–314, 2012.
[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Studying aesthetics in photographic images using a computational approach,” in Proc. ECCV, 2006, pp. 288–301.
[5] D. Joshi, R. Datta, Q.-T. Luong, E. Fedorovskaya, J. Z. Wang, J. Li, and J. Luo, “Aesthetics and emotions in images: A computational perspective,” IEEE Signal Process. Mag., vol. 28, no. 5, pp. 94–115, 2011.
[6] S. Bhattacharya, R. Sukthankar, and M. Shah, “A framework for photo-quality assessment and enhancement based on visual aesthetics,” in Proc. ACM Multimedia, 2010, pp. 271–280.
[7] M. Freeman, The Photographer’s Eye: Composition and Design for Better Digital Photos. Lewes, U.K.: Ilex Press, 2007.
[8] L. Liu, R. Chen, L. Wolf, and D. Cohen-Or, “Optimizing photo composition,” Comput. Graph. Forum, vol. 29, no. 2, pp. 469–478, 2010.
[9] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. ICCV, 2003, pp. 1470–1477.
[10] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[11] W. Zhou, Y. Lu, H. Li, Y. Song, and Q. Tian, “Spatial coding for large scale partial-duplicate web image search,” in Proc. ACM Multimedia, 2010, pp. 511–520.
[12] B. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.
[13] Photo.net [Online]. Available: http://photo.net/
[14] DPChallenge [Online]. Available: http://www.dpchallenge.com/
[15] Flickr [Online]. Available: http://www.flickr.com/
[16] D. S. Butterfield, C. Fake, C. J. Henderson-Begg, and S. Mourachov, “Interestingness ranking of media objects,” U.S. Patent Application 20060242139, 2006.
[17] J.-G. Kim, H. S. Chang, J. Kim, and H.-M. Kim, “Efficient camera motion characterization for MPEG video indexing,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2000, pp. 1171–1174.
[18] T. Mei, X.-S. Hua, C.-Z. Zhu, H.-Q. Zhou, and S. Li, “Home video visual quality assessment with spatiotemporal factors,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 6, pp. 699–706, Jun. 2007.
[19] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[20] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, 2011.
[21] J. Zhuang, T. Mei, S. C. H. Hoi, Y.-Q. Xu, and S. Li, “When recommendation meets mobile: Contextual and personalized recommendation on the go,” in Proc. UbiComp, 2011, pp. 153–162.
[22] wunderground.com [Online]. Available: http://www.wunderground.com/
[23] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009.
[24] Y. Ke, X. Tang, and F. Jing, “The design of high-level features for photo quality assessment,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006, pp. 419–426.
[25] Y. Luo and X. Tang, “Photo and video quality evaluation: Focusing on the subject,” in Proc. ECCV, 2008, pp. 386–399.
[26] X. Sun, H. Yao, R. Ji, and S. Liu, “Photo assessment based on computational visual attention model,” in Proc. ACM Multimedia, 2009, pp. 541–544.
[27] S. Dhar, V. Ordonez, and T. L. Berg, “High level describable attributes for predicting aesthetics and interestingness,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 1657–1664.
[28] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka, “Assessing the aesthetic quality of photographs using generic image descriptors,” in Proc. ICCV, 2011, pp. 1784–1791.
[29] R. Datta and J. Z. Wang, “ACQUINE: Aesthetic quality inference engine—Real-time automatic rating of photo aesthetics,” in Proc. Multimedia Information Retrieval, 2010, pp. 421–424.
[30] L. Yao, P. Suryanarayan, M. Qiao, J. Z. Wang, and J. Li, “OSCAR: On-site composition and aesthetics feedback through exemplars for photographers,” Int. J. Comput. Vision, vol. 96, no. 3, pp. 353–383, 2012.
[31] B. Cheng, B. Ni, S. Yan, and Q. Tian, “Learning to photograph,” in Proc. ACM Multimedia, 2010, pp. 291–300.
[32] H.-H. Su, T.-W. Chen, C.-C. Kao, W. H. Hsu, and S.-Y. Chien, “Preference-aware view recommendation system for scenic photos based on bag-of-aesthetics-preserving features,” IEEE Trans. Multimedia, vol. 14, no. 3, pp. 833–843, Jun. 2012.
[33] D. M. Chen, G. Baatz, K. Koser, S. S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, and J. Bach, “City-scale landmark identification on mobile devices,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 737–744.
[34] S. Bourke, K. McCarthy, and B. Smyth, “The social camera: A case-study in contextual image recommendation,” in Proc. IUI, 2011, pp. 13–22.
[35] W. Yin, T. Mei, and C. W. Chen, “Crowd-sourced learning to photograph via mobile devices,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2012.
[36] W. Yin, T. Mei, and C. W. Chen, “Assessing photo quality with geo-context and crowdsourced photos,” in Proc. VCIP, 2012.
[37] S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali, “Cluster-based landmark and event detection for tagged photo collections,” IEEE Multimedia, vol. 18, no. 1, pp. 52–63, Jan. 2011.
[38] M. D. Choudhury, M. Feldman, S. Amer-Yahia, N. Golbandi, R. Lempel, and C. Yu, “Automatic construction of travel itineraries using social breadcrumbs,” in Proc. 21st ACM Conf. Hypertext and Hypermedia, 2010, pp. 35–44.
[39] W. Yin, J. Luo, and C. W. Chen, “Event-based semantic image adaptation for user-centric mobile display devices,” IEEE Trans. Multimedia, vol. 13, no. 3, pp. 432–442, Jun. 2011.

Wenyuan Yin (S’10) received the B.E. degree from Nanjing University of Science and Technology in 2006. She is now pursuing the Ph.D. degree in the Department of Computer Science and Engineering, State University of New York at Buffalo. Her current research interests include image and video semantic understanding, media quality assessment, mobile media adaptation, video transcoding, image processing, machine learning and computer vision. She received the Best Student Paper Award at VCIP 2012.

Tao Mei (M’07–SM’11) is a Lead Researcher with Microsoft Research Asia, Beijing, China. He received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2001 and 2006, respectively. His current research interests include multimedia information retrieval and computer vision. He has authored or co-authored over 100 papers in journals and conferences, and holds eight U.S. granted patents. He was the recipient of several best paper awards, including the Best Paper Awards at ACM Multimedia in 2007 and 2009, and the IEEE Transactions on Multimedia Prize Paper Award 2013. He is an Associate Editor of Neurocomputing and the Journal of Multimedia.

Chang Wen Chen (F’04) is a Professor of Computer Science and Engineering at the State University of New York at Buffalo, USA. Previously, he was Allen S. Henry Endowed Chair Professor at Florida Institute of Technology from 2003 to 2007, and a faculty member at the University of Missouri-Columbia from 1996 to 2003 and at the University of Rochester, Rochester, NY, from 1992 to 1996. He served as the Editor-in-Chief of the IEEE Transactions on Circuits and Systems for Video Technology from January 2006 to December 2009 and as an Editor for Proceedings of the IEEE, IEEE T-MM, IEEE JSAC, IEEE JETCAS and IEEE Multimedia Magazine. He and his students have received six Best Paper Awards and have been placed among Best Paper Award finalists many times. He is a recipient of the Sigma Xi Excellence in Graduate Research Mentoring Award in 2003, the Alexander von Humboldt Research Award in 2009 and the SUNY-Buffalo Exceptional Scholar–Sustained Achievements Award in 2012. He is an IEEE Fellow and an SPIE Fellow.

Shipeng Li (F’10) joined Microsoft Research Asia (MSRA) in May 1999. He is now a principal researcher and research manager of the Media Computing group. He also serves as the research area manager coordinating the multimedia research activities at MSRA. From October 1996 to May 1999, he was with the Multimedia Technology Laboratory at Sarnoff Corporation (formerly David Sarnoff Research Center and RCA Laboratories) as a member of the technical staff. He has been actively involved in research and development in broad multimedia areas. He has made several major contributions adopted by the MPEG-4 and H.264 standards. He invented and developed the world's first cost-effective high-quality legacy HDTV decoder in 1998. He started P2P streaming research at MSRA as early as August 2000, and led the building of the first working scalable video streaming prototype across the Pacific Ocean in 2001. He has been an advocate of the scalable coding format and was instrumental in the SVC extension of the H.264/AVC standard. He first proposed the “Media 2.0” concepts that outlined the new directions of next-generation internet media research (2006). He has authored and coauthored more than 200 journal and conference papers and holds 90+ US patents in image/video processing, compression and communications, digital television, multimedia, and wireless communication.