
2018 IEEE 3rd International Conference on Big Data Analysis

An Effective Selecting Approach for Social Media Big Data Analysis -Taking
Commercial Hotspot Exploration with Weibo Check-in Data as An Example

Qingwu Hu
School of Remote Sensing and Information Engineering
Wuhan University
Wuhan, China
e-mail: huqw@whu.edu.cn

Yuan Zhang
School of Remote Sensing and Information Engineering
Wuhan University
Wuhan, China
e-mail: zhang.yuan.1221@qq.com

978-1-5386-4794-3/18/$31.00 ©2018 IEEE

Abstract—Efficient datasets cannot be obtained quickly from the social media big data of social networks during focused mining and analysis. To address this problem, an effective selection method for clustering mining with large spatiotemporal data is proposed. The method divides the large spatiotemporal data along the dimension of space, time, or attribute, and then applies exploratory spatial data analysis (ESDA) to the resulting subsets to quickly identify the datasets with potential for clustering mining. The proposed method is verified by mining commercial hotspots from Weibo check-in data collected in Wuhan between 2011 and 2015. The experimental results show that the method can quickly and effectively extract datasets from the Weibo check-in data that reflect the distribution of Wuhan's business circles, and that the extracted datasets are highly clustered, small in volume, and high in precision. The effective selection method of clustering mining for spatiotemporal data can provide fast and effective methods and ideas for processing crowdsourced geographic data.

Keywords- social media big data; spatiotemporal data; cluster mining; efficient selection; ESDA; commercial hotspots

I. INTRODUCTION

With the popularity of network information technology, social platforms such as Facebook, Weibo, and WeChat have created an endless stream of data. The information contained in these data is very complex, which requires effective ways of mining useful information from social media big data [1,2]. Crowdsourcing geographic data is open geographic data that is collected by the public and supplied to the public. The processing and mining of crowdsourcing data is a current research hotspot [3,4]. Chao Suo explored the distribution characteristics and influencing factors of the commercial space around high-speed rail stations by using point of interest (POI) data from Baidu Map [5]. That study points out that the commercial space around a station is affected by the station's own passenger flow scale and structure and by the relationship between the station and the urban space, and that it presents different distribution characteristics; it does not completely follow the circle model of land value theory. Juan Ding of Anhui Normal University used pictures with geographic attributes on social networking sites to analyze the public's propensity to select tourist attractions [6]. The data were first filtered by user residence; multiple photos produced by the same user at the same time and the same location were then filtered out, which greatly reduced the amount of data. The remaining data are more effective for exploring the problem. Both studies applied effective processing to the data before mining and obtained information of great practical value.
The study of crowdsourcing data mining from social networks is of great value. Crowdsourcing data contain a great deal of social, economic, and cultural information that can reflect the patterns and spatiotemporal distributions of urban and human activities [7,8]. Therefore, how to analyze crowdsourcing data is very important for the focused mining of spatiotemporal data produced by social networks. Taking Weibo check-in data as an example, this paper divides the large spatiotemporal data along the two dimensions of space and time to obtain small sample datasets, and tests the effectiveness of those small sample datasets with exploratory spatial data analysis (ESDA).

II. EFFICIENT SELECTION OF SPATIOTEMPORAL BIG DATA

A. Definition of Efficient Selection (ES)

Spatiotemporal big data include multidimensional information and are multi-source, massive, and rapidly updated. The value of spatiotemporal data is determined by the inner relationships among time, space, attributes, and objects. The complex relationships and dynamic evolution within large spatiotemporal data make these relations difficult to express and compute. The service value of large spatiotemporal data lies in the discovery and utilization of the hidden laws behind it [9,10]. To uncover the hidden rules and patterns, it is necessary to screen effective data from the massive data and avoid the uncertainty, low processing efficiency, low accuracy, and other issues brought by large and complex data.
Therefore, this paper explores the controversial scientific proposition that, when sampling large data in time and space, "more extensive" is not necessarily "better". The concept of efficient selection (ES) of big data is proposed to extract small sample datasets from a large dataset from the perspective of time and space. By analyzing the mining potential of the small sample datasets under a specific function model, representative small datasets are selected to replace the original data. This method reduces, to some extent, the blindness of traditional small-sample data acquisition, and provides a
powerful direction and method support for mining human behavior patterns against the background of large data.
The definition of ES can be described by Equation (1).

$$\exists\, D_e \subset D\{x, y, z, t, A\}, \qquad D_e = D(x_i, y_j, z_k, t, A) \xrightarrow{\;P\;} \cdots \tag{1}$$

In Equation (1), D(x, y, z, t, A) is the input large dataset, in which x, y, z, t, and A represent the 3D spatial coordinates, time, and attribute, respectively; i, j, and k each correspond to a particular scene; De is the more effective subset extracted from Ds; and P is the specific function model.
The ES of spatiotemporal big data aims to extract a more targeted data fragment, called the effective dataset, from the original dataset. With this effective dataset, more accurate and efficient results can be obtained in data mining under a specific pattern. ESDA is an important tool in the ES process. Both the original large dataset and the effective dataset must exhibit spatial autocorrelation when ESDA is applied; that is, the subset needs to inherit the spatial autocorrelation property of the original large dataset.

B. ES of Weibo Check-in Data
For the problem of detecting commercial hotspots with Weibo check-in data, this paper takes into account the time dimension and the spatial dimension of the data and uses ESDA as a tool for detecting mining potential, with the aim of selecting the right subset from the original Weibo check-in data efficiently and accurately. A hotspot distribution model is then used to analyze the validity of the subset for mining commercial districts. The technology roadmap of ES for Weibo check-in data is shown in Fig. 1.
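The ES idea described above (divide the large dataset along one dimension, score each subset under a function model, keep the subsets with mining potential) can be read as a partition-and-filter procedure. The following is a minimal sketch under that reading, not the authors' implementation; the function name and the parameters `partition_key`, `potential`, and `threshold` are hypothetical names introduced for illustration. In the paper's setting the score would be the Moran's I z-score and the threshold +1.96; here subset size stands in as a toy score.

```python
from collections import defaultdict


def efficient_selection(records, partition_key, potential, threshold):
    """Partition records and keep the subsets whose score exceeds threshold.

    All parameter names are illustrative and do not come from the paper.
    """
    subsets = defaultdict(list)
    for record in records:
        subsets[partition_key(record)].append(record)
    # Keep only the subsets that pass the screening test.
    return {key: subset for key, subset in subsets.items()
            if potential(subset) > threshold}


# Toy usage: partition by year and use subset size as a stand-in score.
checkins = [{"year": 2012}, {"year": 2012}, {"year": 2013}]
selected = efficient_selection(checkins, lambda r: r["year"], len, threshold=1)
```

Passing the scoring function as an argument keeps the selection step independent of the particular function model P, which matches the way the definition leaves P open.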
[Figure 1 flowchart: Weibo check-in data → remove noise → segment data from time and space → subsets → for each subset, do ESDA and estimate the clustering degree → potential subsets → data validation (hotspot map overlaid on real commercial circles) → result]

Figure 1. Technology roadmap

As shown in Fig. 1, check-in data not associated with a Weibo POI are first removed to avoid noise and interference. The data are then divided into subsets from the perspective of time and space. ESDA is applied to these subsets to estimate their degree of clustering and to discard the unqualified subsets. Hotspot maps are created from the remaining subsets and overlaid on the real trade circle map, and the validity of each subset is judged from the overlay result.
1) Data subset selecting.
In space, on the basis of Weibo POI data with geographical locations, find the points of interest that do not belong to a commercial area, such as the points of interest in colleges [11,12]. Then delete the check-in data at these points of interest to get sample datasets.
In time, extract data subsets based on the time attributes of the Weibo check-in data to get sample datasets.
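The spatial and temporal subset selection in step 1) could look like the sketch below. It assumes each check-in is a dictionary with a `poi_id` and a `time` field and that a lookup table maps POI IDs to a category; these field names, the category labels, and both function names are hypothetical, since the paper does not describe its data schema.

```python
from datetime import datetime


def select_spatial(checkins, poi_category, excluded=frozenset({"college"})):
    """Keep check-ins whose POI is known and not in an excluded category."""
    return [c for c in checkins
            if c["poi_id"] in poi_category
            and poi_category[c["poi_id"]] not in excluded]


def split_by_year(checkins):
    """Group check-ins by the year of their timestamp."""
    by_year = {}
    for c in checkins:
        by_year.setdefault(c["time"].year, []).append(c)
    return by_year


# Hypothetical toy data: two known POIs and four check-ins.
poi_category = {"p1": "mall", "p2": "college"}
checkins = [
    {"poi_id": "p1", "time": datetime(2012, 3, 1)},
    {"poi_id": "p2", "time": datetime(2012, 5, 9)},   # college: dropped
    {"poi_id": "p9", "time": datetime(2013, 1, 2)},   # unknown POI: dropped as noise
    {"poi_id": "p1", "time": datetime(2013, 7, 4)},
]
d_excol = select_spatial(checkins, poi_category)
d_excol_by_year = split_by_year(d_excol)
```

Filtering on POI membership first also covers the noise-removal step, because check-ins without an associated POI never reach the yearly split.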
2) Associate the sample datasets with the grids.
Divide the target area into grids of 300 m x 300 m and give an ID number to each grid. Overlay the sample datasets on the grids to associate the datasets with the grids, and calculate the number of check-ins in each grid separately. Fig. 2 shows the process of associating datasets with grids. The discrete check-in datasets are converted to grid datasets that hold the number of check-ins in each grid. This not only simplifies the discrete check-in data but also keeps their spatiotemporal characteristics, satisfying the requirements of ESDA and data mining.

[Figure 2 diagram: gridding converts the sample datasets into a table of Grid ID (P1, P2, P3, …) and Check-in Numbers (N1, N2, N3, …)]

Figure 2. Association of datasets and grids

3) ESDA analysis.
Spatial autocorrelation analysis is performed on the sample datasets associated with grids to confirm whether the datasets are clustered in space. This paper uses the Moran's I spatial autocorrelation analysis method, which is available as a tool in ArcGIS. For the Moran's I index, the standardized statistic z-score can be used to estimate the spatial autocorrelation between regions.
The distribution state of a dataset can be determined from the value of the z-score according to Equation (2).

$$\begin{cases} z\text{-score} > +1.96, & \text{clustering} \\ z\text{-score} < -1.96, & \text{dispersion} \\ -1.96 < z\text{-score} < +1.96, & \text{random} \end{cases} \tag{2}$$

4) Detect the commercial hotspot for verification of ES.
Create the hotspot maps based on the number of check-in records in each grid. The greater the check-in number, the more likely the grid is to appear on the map as a hotspot.
Then overlay the hotspot maps of the sample datasets on the real commercial circles of the targeted region. Hotspots that fall within the real commercial circles are called effective hotspots. The number of effective hotspots (Nv) can be counted by visual inspection. The validity of a subset is therefore mainly reflected by Nv.

III. EXPERIMENT RESULT AND ANALYSIS

A. Datasets and Evaluation Parameters
Hotspot detection with Weibo check-in data is taken as an example to validate the ES method proposed in this paper. The experimental data are Weibo check-in data of Wuhan from between 2011 and 2015. Fig. 3 shows the process of dataset validation.

[Figure 3 flowchart: sample datasets and grids → the overlaid data → spatial autocorrelation analysis → if z-score > +1.96 (yes) → hotspot map, overlaid with the map of commercial zones → result analysis]

Figure 3. Flow chart of testing validity

1) The ES datasets of spatial dimension.
The number of check-ins made in universities is as high as 1034982 because there are many colleges in Wuhan. However, the high mobility of students makes these check-in data less stable, so they are removed when detecting commercial hotspots. The remaining check-in data are taken as a sample dataset for hotspot detection, called the data except colleges and denoted by Dexcol. The total check-in data are denoted by Dall.
2) The ES datasets of temporal dimension.
For ES in time, we expect that a dataset extracted at random for a certain period can still be used as an effective dataset in practical applications.
Divide Dall and Dexcol into datasets by year, which yields Dall-2012, Dall-2013, Dall-2014, Dall-2015, Dexcol-2012, Dexcol-2013, Dexcol-2014, and Dexcol-2015.

TABLE I. INFORMATION OF EXPERIMENTAL DATASETS

Dimension   Datasets      Records number   Time        Proportion
Spatial     Dall          4772212          2012-2015   100%
Spatial     Dexcol        3737230          2012-2015   78.3%
Temporal    Dall-2012     605157           2012        12.7%
Temporal    Dall-2013     1523913          2013        31.9%
Temporal    Dall-2014     1646151          2014        34.5%
Temporal    Dall-2015     993583           2015        20.8%
Temporal    Dexcol-2012   516912           2012        10.8%
Temporal    Dexcol-2013   1224623          2013        25.7%
Temporal    Dexcol-2014   1246719          2014        26.1%
Temporal    Dexcol-2015   745881           2015        15.6%

3) Evaluation parameters.
ESDA and hotspot extraction are carried out on the sample datasets, and the value of the spatial autocorrelation parameter (z-score) is recorded. A z-score much larger than +1.96 means that the dataset follows a clustered distribution in space, which determines whether the dataset has the potential to mine commercial
circles. Hotspots are detected in those datasets that have the potential to mine commercial circles. Finally, 70 hotspots are extracted from each sample dataset. The distribution of the 70 hotspots is then compared with the real trade circle map of Wuhan, and statistics are collected on the number of hotspots inside the trade circles, namely the number of effective hotspots (Nv), and on the number of trade circles (Nb) in which at least one hotspot is located.
The real trade circles of Wuhan used for verification, 15 in total, are taken from Urban Hotspot and Commercial Area Exploration with Check-in Data [13].

B. Experiment Results of Datasets of Spatial Dimension
1) ESDA results of Dall and Dexcol.
The autocorrelation parameters obtained by spatial autocorrelation analysis are shown in Table 2. The z-scores of the two datasets are both much larger than +1.96. This means that both datasets are clustered in space and that the clustering is unlikely to have occurred by chance. Therefore, it is feasible to use these two datasets for mining.

TABLE II. ESDA RESULTS OF TWO DATASETS

Datasets   Records number   z-score
Dall       4772212          33.878647
Dexcol     3737230          43.084452

2) Hotspot detection result of Dall and Dexcol
Hotspot detection is performed on Dall and Dexcol, and the number of hotspots, the number of effective hotspots, and the number of trade circles covered by hotspots are calculated for each. Table 3 shows the results.

TABLE III. HOT SPOTS STATISTICS

Datasets   N    Nv   Nb
Dall       70   16   9
Dexcol     70   36   15

The statistical results show that although Dall is huge in volume, both its number of effective hotspots and the number of commercial zones covered by its hotspots are very small, and the regularity of its hotspot distribution is not strong. The volume of Dexcol is smaller than that of Dall, but Dexcol has more effective hotspots than Dall. Its hotspots cover almost all of the commercial circles, and the regularity of its hotspot distribution is consistent with the distribution of the commercial circles.

C. Experiment Results of Datasets of Temporal Dimension
1) ESDA results of the temporal datasets
ESDA results are shown in Table 4.

TABLE IV. ESDA RESULTS

Datasets      Records number   z-score
Dall-2012     605157           23.809920
Dall-2013     1523913          19.921971
Dall-2014     1646151          18.345539
Dall-2015     993583           8.063170
Dexcol-2012   516912           24.470909
Dexcol-2013   1224623          20.742209
Dexcol-2014   1246719          28.530130
Dexcol-2015   745881           4.446574

2) Hotspot detection results of the temporal datasets.
The statistical results of hotspot detection with the different ES datasets are shown in Table 5.

TABLE V. HOTSPOTS STATISTICS RESULT OF EIGHT DATASETS

Datasets      N    Nv   Nb
Dall-2012     70   27   12
Dall-2013     70   18   10
Dall-2014     70   13   10
Dall-2015     70   13   8
Dexcol-2012   70   41   15
Dexcol-2013   70   33   15
Dexcol-2014   70   30   14
Dexcol-2015   70   21   12

An interesting discovery can be made from the experimental results. The hotspot distributions of the datasets divided from Dall and Dexcol become more and more scattered in later years, and fewer hotspots lie within the real commercial zones of Wuhan. This suggests that the difference between the distribution of Weibo hotspots and the distribution of the real trade circles of Wuhan grows over time, and that the earlier a dataset was produced, the more reliable it is for mining the real trade circles.
3) ESDA analysis and hotspot detection to the datasets divided from Dall-2012, Dexcol-2012 by half a year.
According to the above experiments, in the time dimension the distribution of Weibo check-in data in 2012 coincides most closely with the distribution of the real commercial circles of Wuhan. Therefore, the datasets Dall-2012 and Dexcol-2012 are subdivided by half year to obtain four sample datasets. Their autocorrelation parameters are shown in Table 6.

TABLE VI. ESDA RESULTS OF 4 DATASETS

Datasets            Records number   z-score
Dall-2012first      212607           39.044380
Dall-2012second     392550           30.398639
Dexcol-2012first    189346           36.795859
Dexcol-2012second   327566           32.069904
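The z-scores in the ESDA tables come from global Moran's I computed on gridded check-in counts. The paper used the ArcGIS tool and does not state its settings, so the following is only a minimal sketch under stated assumptions: points are in projected coordinates in metres, cells are 300 m squares, spatial weights are binary rook contiguity (cells sharing an edge), and the variance is the standard formula under the normality assumption. The function names and parameters are hypothetical, and results will not match ArcGIS output if it uses different weights or a different variance formula.

```python
import math


def grid_counts(points, origin, shape, cell=300.0):
    """Count points per square cell; points outside the grid are ignored."""
    rows, cols = shape
    x0, y0 = origin
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        c = int((x - x0) // cell)
        r = int((y - y0) // cell)
        if 0 <= r < rows and 0 <= c < cols:
            grid[r][c] += 1
    return grid


def morans_i_zscore(grid):
    """Global Moran's I and its z-score with binary rook weights."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(map(sum, grid)) / n
    dev = [[v - mean for v in row] for row in grid]
    m2 = sum(d * d for row in dev for d in row)
    if m2 == 0:
        raise ValueError("all cells are equal; Moran's I is undefined")
    s0 = s2 = cross = 0.0
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            k = len(nbrs)
            s0 += k                 # sum of all weights
            s2 += (2 * k) ** 2      # (row sum + column sum)^2, symmetric weights
            cross += sum(dev[r][c] * dev[nr][nc] for nr, nc in nbrs)
    s1 = 2.0 * s0                   # 0.5 * sum (w_ij + w_ji)^2 for binary symmetric weights
    i_obs = (n / s0) * cross / m2
    expected = -1.0 / (n - 1)
    variance = ((n * n * s1 - n * s2 + 3 * s0 * s0)
                / ((n * n - 1) * s0 * s0)) - expected ** 2
    return i_obs, (i_obs - expected) / math.sqrt(variance)


def classify(z):
    """Apply the +/-1.96 thresholds used for the z-score."""
    if z > 1.96:
        return "clustering"
    if z < -1.96:
        return "dispersion"
    return "random"


# Toy grids: one half busy and one half empty, versus a checkerboard.
clustered = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
checker = [[10 if (r + c) % 2 == 0 else 0 for c in range(6)] for r in range(6)]
```

On the toy grids, the half-and-half layout is classified as clustering and the checkerboard as dispersion, which is the behaviour the thresholds are meant to capture.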

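The N, Nv, and Nb columns of the hotspot tables could be computed along the lines of the sketch below. The paper counts effective hotspots by visual inspection against a trade circle map, so this is a stand-in rather than the authors' procedure: it assumes a hotspot is simply one of the grid cells with the highest check-in counts, and it approximates each trade circle as a disc given by a centre and a radius in metres. The function name, the data layout, and the disc approximation are all hypothetical.

```python
import math


def hotspot_statistics(cell_counts, circles, n_hotspots=70):
    """Return (N, Nv, Nb) for the top-count cells against disc-shaped circles.

    cell_counts maps a cell centre (x, y) to its check-in count;
    circles maps a circle name to (centre_x, centre_y, radius).
    """
    top = sorted(cell_counts, key=cell_counts.get, reverse=True)[:n_hotspots]
    covered = set()
    n_effective = 0
    for x, y in top:
        hits = [name for name, (cx, cy, radius) in circles.items()
                if math.hypot(x - cx, y - cy) <= radius]
        if hits:
            n_effective += 1      # hotspot lies inside at least one circle
            covered.update(hits)  # circles containing at least one hotspot
    return len(top), n_effective, len(covered)


# Toy data: four cells, two trade circles, and the three busiest cells as hotspots.
cell_counts = {(0, 0): 50, (100, 0): 40, (5000, 5000): 30, (9000, 9000): 1}
circles = {"A": (0, 0, 500), "B": (9000, 9000, 500)}
stats = hotspot_statistics(cell_counts, circles, n_hotspots=3)
```

Real trade circles are irregular areas, so a polygon containment test would replace the distance check in practice.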
The hotspot detection results are shown in Table 7.

TABLE VII. HOT SPOTS STATISTICS RESULT

Datasets            N    Nv   Nb
Dall-2012first      70   46   15
Dall-2012second     70   28   12
Dexcol-2012first    70   52   15
Dexcol-2012second   70   42   15

According to the statistical results of the 4 datasets, the hotspot distributions of all 4 datasets are basically consistent with the distribution of the real trade circles of Wuhan.

D. Results Analysis
In general, by applying exploratory spatial analysis and hotspot detection to 14 datasets, we obtain the 3 most reliable datasets, which are shown in Table 8. These 3 datasets also have the fewest records of all the datasets. Their hotspot distributions reflect the distribution of the real trade circles of Wuhan almost completely without any further data processing, which greatly enhances the efficiency of mining commercial districts with Weibo check-in data. The accuracy of mining trade circles is also much higher.

TABLE VIII. THREE MOST RELIABLE DATASETS

Datasets            Records number   z-score     N    Nv   Nb
Dexcol-2012first    189346           36.795859   70   52   15
Dall-2012first      212607           39.044380   70   46   15
Dexcol-2012second   327566           32.069904   70   42   15

The experiment in the temporal dimension shows that the later the dataset, the more scattered its hotspot distribution and the greater the difference between the hotspot distribution and the distribution of the real trade circles of Wuhan. A possible reason is that, with the development of Wuhan's economy and the rapid development of its traffic, more and more people have come to Wuhan. The traditional business circle distribution pattern cannot meet the needs of such a large number of people, so many new business centers have sprung up in various places and share the flow of people of the traditional business circles, which makes the Weibo check-in data more and more dispersed.

IV. CONCLUSION
For the problem of mining information from social media big data, this paper proposes obtaining more accurate and more efficient datasets by comprehensively considering time, space, and other factors and by applying exploratory spatial analysis and hotspot detection to large spatiotemporal data. The ES experiments are carried out with Weibo location check-in data. The experimental results show that, for Weibo location check-in data, bigger is not necessarily better. By considering spatial and temporal factors, we can obtain datasets with smaller data size and higher reliability for mining commercial zones. This demonstrates the validity of the method for selecting effective datasets from large spatiotemporal data. In practical applications, the method has good prospects for addressing the large uncertainty, high redundancy, and low value density of large spatiotemporal data.

REFERENCES
[1] Guo, Z.B.; Li, Z.T.; Tu, H.; Xie, D. Weibo: An Information-Driven Online Social Network[7]. Springer Berlin Heidelberg. 2014, 8360:3-16.
[2] Jiang, C.J.; Ding, Z.J.; Wang, J.L.; Yan, C.G. Big data resource service platform for the internet financial industry[J]. Chinese Science Bulletin. 2014, Vol.59(35), pp.5051-5058.
[3] Heipke, C. Crowdsourcing Geospatial Data[J]. ISPRS Journal of Photogrammetry & Remote Sensing, 2010, 65(6), 550-557.
[4] Li, D.R; Ma, J.; Shao, Z.F. On Spatio-temporal Big Data and Its Application[J]. Satellite Application, 2015(9), 7-11.
[5] Suo, C.; Zhang, H. Influencing factors and development proposals of business space around HSR station – A case study of cities along Shanghai-Nanjing HSR with POI data [J]. City Planning Review, 2015, 39(7), 43-49.
[6] Ding, J.; Li, J. Spatial Patterns of Chinese Inbound Tourists POI: an Analysis of Geographic Information from Web Pictures[J]. Economic Geography, 2015, 35(6), 24-31.
[7] Yan, X. Real Applications of Big Data in The Real World[J]. CIO INSIGHT, 2013(7), 46-49.
[8] Dingguo, Y.U; Chen, N.; Ran, X.U. Computational modeling of Weibo user influence based on information interactive network[J]. Online Information Review. 2016, 40 (7):867-881.
[9] Gema, B.O.; Jason, J. J.; David, C. Social big data: Recent achievements and new challenges[J]. Information Fusion. 2016, 28:45-59.
[10] Wang, Y.C.; LeeAnn, K.; Terry A.B. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations[J]. Technological Forecasting & Social Change. 10.1016/j.techfore.2015.12.019.
[11] Cao, J.Z.; Wu, H.Y. POI Location Updating Method Based on Attendance Data[J]. Geospatial Information. 2013, 11(2), 15-18.
[12] Wang, S.; Li J. Analysis and Visualization of POI Distribution Density Based on Urban Network Space[J]. Urban Geotechnical Investigation & Surveying. 2015(1), 21-25.
[13] Hu, Q.W; Wang, M.; Li, Q.Q. Urban Hotspot and Commercial Area Exploration with Check-in Data[J]. Acta Geodaetica et Cartographica Sinica, 2014, 43(3), 314-321.
