
Analysis of entering flows in the congestion pricing Area C of Milan

Lorenzo Mussone
Politecnico di Milano
Department ABC
Milano, Italy
mussone@polimi.it

Abstract—This paper deals with the analysis of traffic flow, exploring whether intra-period and inter-period kinds of dynamics may occur in a hierarchical structure (week-to-week, within-week, and within-day dynamics), and whether the analysis of within-week dynamics may be restricted to distinguishing days of the week (or working days) vs. weekend days only, or whether each day should be analysed in detail. An in-depth statistical analysis of the data available for the real case of Area C in Milan (Italy) shows some particular structures in entry patterns, giving some insights for supporting transport policy and the assessment of external costs. In particular, lags between entries of each vehicle present a trend strongly linked to a weekly frequency.

Keywords—traffic flow evolution; congestion charge zone; controlled accesses; statistical analyses

I. INTRODUCTION

Analysis of traffic flow at different time scales, generally oriented to prediction, is a constant and crucial aim in transportation research and engineering. This task can be carried out in many ways according to the scope of the study. Significant progress in the field is due to the diffusion of Computational Intelligence and Data Mining approaches in analysing data: besides classical statistical parametric methods (e.g. ARIMA), non-parametric models such as Neural and Bayesian Networks, Radial Basis Functions, Fuzzy and Evolutionary algorithms, or the Kalman filter are also used [1]. Discussion about the advantages of statistics vs. computational intelligence is still ongoing [2].

Short-term forecasting, that is from a few seconds to a few hours, is perhaps the best known application field because of its potential and pervasive use in dynamic control and real-time applications [1]. Mid-term forecasting (from a few hours to one or more days) may also be of interest, since it can explain user dynamic choices over longer periods. This type of forecasting has opened the door to another wide field of research in day-to-day dynamics, where the adaptive nature of traveller choices and of the transport system over successive similar time periods is described [3].

Traffic flows vary greatly over time, and two levels of dynamics are often considered: intra-period dynamics, sometimes called within-day dynamics, occurring within each of a series of similar time periods; and inter-period dynamics, occurring between consecutive similar time periods.

This paper analyses whether and how intra-period and inter-period kinds of dynamics may occur in a hierarchical structure (week-to-week, within-week, and within-day dynamics), and whether the analysis of within-week dynamics may be restricted to distinguishing weekdays vs. weekend days only, or whether each single day should be analysed in detail.

This study is part of extensive research on the charging policy applied to vehicles entering the Milan area called Area C (but it is not limited to it, since the presented methodology can be applied to every monitored area); it aims at providing evidence about this policy from the point of view of demand choices and at developing tools and models useful for its description and for the assessment of external costs due to congestion and pollution. The opportunity to do this derives from the detection of the number plates of all vehicles entering this area, which makes it possible to know the actual universe of this process.

The proposed approach is suited to large amounts of data that were collected only for traffic control purposes, so the available data concern only entering flows which cross the border of the area; no data are available about exiting and circulating flows, apart from occasional surveys on circulating flow. The large amount of data that has become available in processes like this is a first issue to be faced; in this sense, the paper follows that by Lv et al. [4]. The dimensions of the data make computing challenging and stimulate the use of new paradigms and ways of analysing data.

Another question to be faced concerns the multivariate nature of this process, as in any process related to traffic. Unfortunately, from this point of view data are very scarce: no information is available about the socio-economic attributes of drivers and the reasons for their journey, nor about vehicle characteristics. It is worth underlining that this limitation hinders a wider generalization of results, since many links to overall potential demand cannot be stated.

The paper is organized into four further sections. Section 2 describes Area C and the data collection scenario; Sections 3 and 4 show the results of the statistical analyses and of the specific lag analysis worked out on the collected data. Finally, Section 5 draws conclusions.

978-1-5386-3917-7/17/$31.00 ©2017 IEEE

Authorized licensed use limited to: ANII. Downloaded on July 03,2023 at 20:03:21 UTC from IEEE Xplore. Restrictions apply.

II. THE TEST-BED: AREA C IN MILAN

The city of Milan and surrounding municipalities constitute a metropolitan area positioned in the centre of the Po valley, Northern Italy. Whilst forming a relevant destination in its own right, Milan also lies at a cross-roads for the main routes towards the south of the country and for traffic with destinations in the North. In the centre of Milan there is the area of “Cerchia dei Bastioni” (Fig. 1) that can be represented by a network of about 2700 links, 1800 nodes and 160 centroids. Area C is contained in “Cerchia dei Bastioni”, which also includes the ring roads surrounding Area C.
The “Cerchia dei Bastioni” was the subject of a charging policy from 2nd January 2008 to 31st December 2011, called Ecopass. From 16 January 2012 the same area became the subject of a different policy called Area C [5]. The differences between the two policies are the following:

• Ecopass aimed to reduce air pollution, while Area C aims to reduce congestion first and then pollution;

• the charge was 2 Euros for Ecopass and is 5 Euros for Area C; the tickets allow vehicles free circulation and unlimited entry/exit from the area;

• in Ecopass only high-polluting engines were charged, while in Area C all private vehicles must pay, although some exemptions or reductions may apply.

Area C has 43 access points, each controlled by a video system (Fig. 1). Seven of them are dedicated exclusively to public transport. Video cameras detect the passage of vehicles entering the area by reading license plates. A central system then recognizes the vehicle type, owner and charge due. It also provides information for fines or sanctions as needed. It works on all weekdays from 7:30 to 19:30, except on Thursday when it ends at 18:00. Available data and the relative number of records are described in Table I. All files are linked according to the license plate number field, properly recoded for privacy. A file with the calendar of what really happened (when the pricing was applied and when it was not) was also used.

Fig. 1. The “Cerchia dei Bastioni” area in Milan with locations of controlled points (blue circles for all vehicles, red ones only for public transport) for accessing Area C

TABLE I. AVAILABLE DATA ABOUT VEHICLES ENTERING AREA C

File name | Description | Number of records
Transiti | Vehicle entries by day and time of entries. Data were collected in January-February and September 2011 and for all 2012 | 64,770,300
Veicoli | Some MCTC (the Italian public automobile register) information about vehicles | 5,599,580
Residenti | Residents inside Area C (only those who required registration) | 40,180
Autorizzati | Authorized vehicles | 146,377
VDS | Service vehicles | 31,955
Telepass | Vehicles provided by Telepass RFID | 306,551
RID | Vehicles paying by RID (Italian direct withdrawal) | 80,034

The strength of this dataset is that it contains all records of each vehicle entering since 2011 (apart from a few detection errors), leading to databases of several million records. Its weaknesses are that we do not have information about vehicles (class, dimension, engine size, seats, etc.), about exits and circulation inside Area C, or about drivers and the number of passengers per vehicle.

III. PRELIMINARY STATISTICAL ANALYSES

Extensive analysis was carried out in order to characterize the data and to identify behaviour patterns of vehicles entering Area C, both when payment must be made (called Area C ON) and when it is not (Area C OFF). It must be underlined that Area C is OFF not only outside the charging hours but also whenever, for any reason, charging is not applied.

These analyses are reported in two sections; precisely, in this Section III:

• General descriptive statistics,

• Kolmogorov-Smirnov test on frequency of entries,

• Anova analysis on number of entries;

and in the following Section IV:

• Analysis of lag between entries,

• Cross-covariance and cross-correlation between entries,

• PCA on lags,

• Analysis of vehicles entering the first time.

Fig. 2. Number of vehicles entering Area C hourly in year 2011 and 2012 subdivided into Area C ON and OFF (all data)

A. General descriptive Statistics

These analyses report simple statistical results on data aggregated by the Area C operational state (ON and OFF): number and average number of entries, entries by day, and vehicle use.

In Fig. 2 the total number of entries subdivided into Area
C ON and OFF is plotted. Since it represents the total over a very large range (years 2011 and 2012), the effect of abnormal days (when Area C is not active for bureaucratic or administrative reasons) can be assumed to be of little relevance. The figure shows a very typical shape for Area C ON, with a peak in the early morning (8:00-9:00) and then a decreasing trend until the end of charging. Conversely, a higher peak on the curve of Area C OFF can be observed just after 19:30, presumably due to vehicles waiting to enter until the payment period expires. Fig. 3 and Fig. 4 show the number of entries for each recorded day, subdivided into Area C ON and OFF, for vehicles that entered on only one day and for vehicles that entered on more than 30 days (that is, very frequent vehicles).

Fig. 3. Number of vehicles entering Area C only one day, by day of year (x-axis, from 2011 to 2012) and subdivided into Area C ON and OFF

Fig. 4. Number of vehicles entering Area C more than 30 days, by day of year (x-axis, days from 2011 to 2012) and subdivided into Area C ON and OFF

There are significant differences: the total number of entries per day is about 3,000-4,000 veh/day for one entry vehicles against 90,000 veh/day for more than 30 entry vehicles. The number of entries per day when Area C is ON is appreciably higher for one entry vehicles; the trend is quite stationary for one entry vehicles, while it is rather variable for more than 30 entry vehicles. Vehicles with multiple entries (in the same day) are limited to a small fraction of the whole, though a vehicle entering during the payment period is more likely to enter again within the same day than one entering during the OFF period.

The distribution of known characteristics of vehicles is highly unbalanced, since one category (owner) accounts for about 79% of the whole set and, counting it together with the category ‘hire without driver’, which is similar to ‘owner’ in terms of travel choice, we reach about 90% of the total. A similar distribution also holds for the subsets Area C ON and OFF.

B. Kolmogorov-Smirnov tests

The Kolmogorov-Smirnov test was used to determine whether the frequency of the number of entries (over time), and the daily average number of entries (for each day), according to the day of week (for the two subsets, Area C ON and OFF) could be drawn from the same underlying continuous distribution (null hypothesis). From the point of view of the test, results are very similar whether the frequency or the average is considered. The highest p-values are found for Area C ON (on weekdays); the two weekend days are generally very different from each other and from weekdays. The null hypothesis can be rejected only for the relationships between weekdays and weekend days.

In some cases the null hypothesis is clearly not rejected at the 5% significance level (that is, there is no evidence that the data come from different continuous distributions), e.g. for Wednesday and Thursday, and Wednesday and Tuesday. There are some slight differences according to the number of days a vehicle enters Area C; we can conservatively state that, up to three entries, the commutative relationships between Tuesday, Wednesday and Thursday have a higher level of significance than the remaining ones. No relationship is found between data of Area C ON and Area C OFF.

C. Anova analysis on the number of entries

Another type of analysis concerns the analysis of variance of aggregated data of entries for weekdays (from Monday to Friday) and weeks when Area C is ON. Considering weeks containing no missing data or zeros (due to holidays, strikes, or non-operational days), a matrix with 30 rows (weeks) and 5 columns (weekdays) is analysed.

One-way and two-way Anova analyses are worked out with the null hypothesis that all samples in the reference matrix are drawn from populations with the same mean (a low value of p indicates that the hypothesis is unlikely to hold and that differences between means are not due to random fluctuations). From the p-value (p=(Prob>F)=0.2367) of one-way Anova it can be argued that there are no (great) differences between weekdays, while there are significant differences between weeks (p=(Prob>F)=0). Two-way Anova applied to weekdays and weeks (with and without interaction term) shows that:

• there are differences between weekdays (columns), p of columns=(Prob>F)=0.0001. This result overturns the result obtained by the one-way Anova and states that differences between weekdays are more relevant when differences between weeks are also taken into account;

• there are differences between weeks (rows), p of rows=(Prob>F)=0;

• there is no interaction between weekdays and weeks, (Prob>F)=0.

IV. LAG ANALYSES

This section reports detailed analyses of lags between entries, carried out to investigate particular relationships for vehicles with the same number of entries (one or more).

A. Analysis of lag between entries

These analyses consider the lag between entries, that is
distance (in days) between consecutive entries (on different days) of the same vehicle, and subdivide data into classes, called lags. This aims at finding some behavioural patterns allowing a better understanding of the process and some hints for modelling. In these analyses we considered the average lag for each vehicle (calculated as the ratio of the interval between the day of the first entry, d1, and the day of the last one, d2, to the number of entries, n, minus one: $(d_2 - d_1)/(n - 1)$).

Fig. 5 shows the frequency curves for the first three classes of number of entry days (2, 3, and 4) for Area C ON. The case lag=0 (meaning that a vehicle has entered only once) has a much higher value than the other cases and, being out of scale, is not reported.

Fig. 5. Frequency of average lag for Area C for the first three classes of total number of days of entries (2, 3 and 4)

An obvious result is that the curve for Area C OFF is higher than that of Area C ON, but both have a certain periodicity of lags equal to 7 days or its multiples, as observed in the correlation analyses (subsection IV.B): this means that successive entries are more likely to occur on the same day of successive weeks. Periodicity is also observed in the detailed curves drawn for a single lag with Area C ON (Fig. 5); in this case the presence of other periodicities is easily recognizable.

Three types of 3D diagrams (for number of entries and lag, respectively on the x- and y-axis) are drawn up depending on the meaning of the z-axis: frequency (for each lag), total number of entries (the product of frequency and the number of entries for each lag), and average number of entries per day (for each lag). In Fig. 6 the frequency type is shown for total entries, a), for Area C OFF, b), and for Area C ON, c). The “coral reef” effect observed in these figures (values lie on curves xy = const, with the constant bounded by the 365 days of the year) is due to the size of the data set, which is limited to the year 2012.

Fig. 6. 3D frequency plot of lag and number of entries for the whole year 2012, a); only for Area C OFF, b); only for Area C ON, c)

Also from these figures a periodicity (mainly of 7 days) emerges, especially for a lower number of entries. However, it must be underlined that the most relevant contribution to total entries is given by vehicles with a high number of entries. Hence, further studies on these subsets are also needed and are proposed in the following sub-section.

B. Cross-covariance and cross-correlation between entries

Cross-covariance $\varphi_{xy}$ is calculated by:

$$\varphi_{xy}(m) = E\{(x_{n+m} - \mu_x)(y_n - \mu_y)^*\} \qquad (1)$$

where $\mu_x$ and $\mu_y$ are the mean values of the two stationary random processes, $x_n$ and $y_n$, and cross-correlation is calculated by:

$$R_{xy}(m) = E\{x_{n+m}\, y_n^*\} = E\{x_n\, y_{n-m}^*\} \qquad (2)$$

where $x_n$ and $y_n$ are jointly stationary random processes with $-\infty < n < +\infty$; * denotes the complex conjugate, and E{·} is the expected value operator. Both formulas have outcome in the range (0,1). The difference between the two formulas is that cross-covariance subtracts the means of the two random processes before computing the correlation, and this allows us to know the interrelation between processes without the effect of the mean.

Fig. 7. Cross-covariance of the frequency of number of entries between 2-entry vehicles when Area C is ON and all other numbers of entries (in legend); the x-axis reports the covariance lag and the y-axis the cross-covariance (multiplied by a factor of 10^8)

The calculated curves are symmetrical only if the processes considered have the same shape. They are obviously sensitive to the sign of m and n when their differences are
relevant. In these analyses m and n represent the average distance between entry days, and the processes $x_n$ and $y_n$ refer to different numbers of entries (from 2 to 20). The two analyses are applied to three datasets, representing the time series of the frequency of number of entries, the total number of entries (equal to the product of frequency and number of entries), and the average number of entries by day, all split up for Area C ON and OFF, with the aim of mining their properties. In the proposed figures the y-axis scale changes according to the observed maximum value (in order to guarantee readability). Asymmetry is particularly relevant for diagrams involving the total number of entries, less so for the frequency of number of entries (Fig. 7), and much less for the average number of entries. This result highlights that, as the number of entries increases, the values of the total number of entries grow much larger than those of the other two datasets.

No relevant differences are revealed between the results of cross-covariance and cross-correlation, indicating that the mean of these processes has no particular effect on their interrelation. All curves have peaks at lags equal to seven and its multiples, and these peaks are more relevant for a lower number of entries.

C. PCA on lags

PCA (Principal Component Analysis) [6] is applied to find the most significant lags (variables in PCA) in data with Area C ON for year 2012. In these analyses lag represents the exact distance between two entries. We added this analysis to extract information effectively and to show it by highlighting the variables containing most of the variance in the data. Analyses are carried out using both all data and “modulo 7” aggregated data. “Modulo 7” means that data are aggregated by summing up the occurrences of one lag (in the range 1-7) with those of the lags that differ from it by multiples of 7 (e.g. cases with lag = 1 are summed up together with those of lag equal to 8, 15, 22, 29, ...).

Some results are reported in Fig. 8 only for NED (Number of Entry Days; no matter how many entries are made in a day) equal to 10, for conciseness, but similar figures were prepared for NED equal to 2 up to 150. Fig. 8 contains a mosaic (3x4) of figures. The first row reports PCA results for the original data set. The four diagrams concern vectors (of variables, in blue) and data (in red) for the first two components, the first and third component, the second and third component, and the fourth and fifth component, respectively for the first up to the fourth diagram. The second row reports the same diagrams obtained from PCA but applied to “modulo 7” data. The third row reports the cumulative curves of explained variance for PCA applied to all data (first diagram) and to “modulo 7” data (second diagram). The final two diagrams, (3,3) and (3,4), refer to the curves of normalized frequency for a certain lag occurrence and the same subdivided by weekdays.

Fig. 8. PCA for vehicles with 10 entries

The most significant variable is different according to which NED is considered. The most significant lags related to the 1st component of PCA are:

3, 4, and 7 for NED = 2,

1 and 3 for NED > 3.

The most significant lags related to the 2nd component of PCA are:

1 for NED = 2,

1 and 3 for NED = 4,

3 for NED = 6,

3 and 5 for NED = 10,

2, 3, 4, and 5 for NED = 30 up to 50,

1, 2, and 3 for NED = 100.

Lag=1 is the most significant variable when NED > 3. Lag=7 has a relevant role when NED = 2, and for the 4th and 5th component only when NED < 30. In general, the first component explains at least 40% of the variance present in the data. The first five PCA components explain about 80% of the total variance (diagram (3,1) in Fig. 8). For an equal number of principal components, the higher the NED, the higher the explained variance.

“Modulo 7” data give almost exactly the same results as the entire data when NED is above 25. For NED lower than 25 the most significant lags are a little different.

These figures confirm the results obtained from the previous analyses: the characteristics of entries are strongly correlated to how many entries will be made over time; there are clear differences between vehicles that make few and many entries, and between the weekdays when the first entry is made.

D. Analysis of vehicles entering for the first time in the year

Vehicles entering for the first time represent a specific process with no past history, which can be considered a random process, like a Markovian one. The analysis is based on data for 2012 only.

One entry vehicles (over the year) for each weekday are shown in Fig. 9 (bottom curves) together with the total of entered vehicles (top curves) by week and weekday (in legend). Fig. 9 shows that:

• there are no great differences between weekdays;

• the process takes some weeks to converge towards a value of about 5% of total entered vehicles.
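As a rough illustration of the first-entry analysis above, the sketch below counts, for each ISO week, how many distinct vehicles enter and how many of them are seen for the first time in the data, and returns their ratio. It is a simplification: the paper works by week and weekday, while this example aggregates by week only. The record layout (recoded plate, entry date) and the function name `first_entry_share_by_week` are assumptions made for the example; they are not the schema of the Area C files.

```python
from collections import defaultdict
from datetime import date


def first_entry_share_by_week(records):
    """Return {iso_week: (first_time_vehicles, entered_vehicles, share)}.

    `records` is an iterable of (plate, entry_date) pairs. This layout is a
    hypothetical stand-in for the entry file, not its real schema.
    Weeks are ISO week numbers, so the example assumes a single year of data
    and ignores the few days that ISO assigns to the neighbouring year.
    """
    first_seen = {}                    # plate -> date of its first entry
    weekly_plates = defaultdict(set)   # ISO week -> distinct plates entering
    for plate, day in records:
        if plate not in first_seen or day < first_seen[plate]:
            first_seen[plate] = day
        weekly_plates[day.isocalendar()[1]].add(plate)

    # Count, per week, the vehicles whose first entry falls in that week.
    weekly_first = defaultdict(int)
    for day in first_seen.values():
        weekly_first[day.isocalendar()[1]] += 1

    result = {}
    for week in sorted(weekly_plates):
        entered = len(weekly_plates[week])
        first = weekly_first[week]
        result[week] = (first, entered, first / entered)
    return result


# Toy data only: three vehicles over two weeks of January 2012.
toy = [
    ("A", date(2012, 1, 16)), ("B", date(2012, 1, 16)),
    ("A", date(2012, 1, 23)), ("B", date(2012, 1, 24)),
    ("C", date(2012, 1, 25)),
]
shares = first_entry_share_by_week(toy)
for week, (first, entered, share) in shares.items():
    print(week, first, entered, round(share, 3))
```

In the first week of any dataset every vehicle is necessarily a first entry, so the share starts at 1 and decays afterwards; this is consistent with the observation that the process needs some weeks to settle.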

Fig. 9. Number of vehicles at first entry (subscript F, bottom curves without line indicators) and total of vehicles entered (subscript M, top curves with line indicators) for each weekday.

In any case, the slope at the end of the year (51st week) is not completely flat, and a further decrease could be verified with data assembled in successive years. The tangent to the curve knee crosses the x-axis near the 11th week which means, considering that Area C was not active during the first two weeks of the year, a typical time of about 30 effective days. Some isolated positive spikes in the total curve are related to single days and do not affect first entry vehicles.

Fig. 9 also shows that there is a lower variability (in absolute values) of first entry vehicles along the year with respect to the total number. It seems reasonable to hypothesize that first entry vehicles represent a random process with constant mean and very low variance; on the other hand, we have shown previously that multiple entry vehicles are strongly correlated according to the days. This result suggests that the overall entry demand can be split into two components: the former (almost) fixed, due to first entries, and the latter strongly dependent on previous entries. Hence, we can assume that, once the transient phase is over, about 5% of total entries are due to vehicles entering for the first time (including those vehicles entering only once).

V. CONCLUSIONS

The preparatory analyses presented here represent a possible pattern of estimation methods that have real policy, and hence practical, implications for assessing externalities in this transportation system. They can help to understand what a change of fares, or actions directed at some categories of vehicles, imply for both congestion and pollution control. For example, if vehicles entering only once are less than 5% each day (as the case of Area C in Milan suggests for 2012 data), it does not make sense to devote special care to this category in order to reduce the total daily amount of entering vehicles.

The huge amount of data to analyse is today's challenge, and means that it is more than ever necessary both to invest computational effort and to devise efficient keys to manage and make good use of those data. A crucial point that arose while studying the case of Area C of Milan is that, while datasets collected only for administrative purposes (specifically for payment of charges) yielded some useful information regarding normal travel, conversely very little information is available about the travel process (e.g. trip motivations). This makes it difficult to investigate the relationships between driver behaviour and the number of entries into Area C (both within one day and from day to day). In order to come to terms with the many unknowns presented by this situation, a dedicated data survey will need to be planned if the overall process is to be studied. The lack of data about vehicles and drivers, as well as of data on exits and circulation inside Area C, seems to be crucial in the sense that it hinders or even prevents a thorough analysis of the problem, especially as regards driver choices.

For the above mentioned reasons, the many analyses worked out are all based on the vehicle plate number and on the date and time a vehicle entered Area C, with aggregated and disaggregated data. All analyses (Kolmogorov-Smirnov test, Anova, cross-correlation, PCA) arrived at the same conclusion, which pinpoints a strong connection between successive entries of the same vehicle. In particular, the lag distribution per vehicle (that is, the distance in days between two consecutive entries) is related to the number of entries (in one year) and follows patterns that have been well highlighted by PCA. By adopting a “modulo 7” data aggregation (section IV.C), PCA gives the same results obtainable from the original dataset when the number of entries (in one year) is greater than 25. This highlights that most of the process can be described by taking into account only very frequent vehicles.

By using the lag between entries as a key for analysing data, we highlighted a relevant difference between weekdays. A similar result is also shown in the paper by Crawford et al. [7], who applied a statistical analysis (Functional Data Analysis) to flow curves.

ACKNOWLEDGMENT

Thanks are due to the Mobility, Environment, Territory Agency of Milan (AMAT-MI), which provided the data for this research.

REFERENCES

[1] E.I. Vlahogianni, M.G. Karlaftis, and J.C. Golias, “Short-term traffic forecasting: Where we are and where we’re going,” Transportation Research Part C, vol. 43, pp. 3–19, 2014.
[2] M.G. Karlaftis and E.I. Vlahogianni, “Statistical methods versus neural networks in transportation research: Differences, similarities and some insights,” Transportation Research Part C, vol. 19, pp. 387–399, 2011.
[3] D.P. Watling and G.E. Cantarella, “Modelling sources of variation in transportation systems: theoretical foundations of day-to-day dynamic models,” Transportmetrica B: Transport Dynamics, vol. 1, no. 1, pp. 3–32, http://dx.doi.org/10.1080/21680566.2013.785372, 2013.
[4] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic Flow Prediction With Big Data: A Deep Learning Approach,” IEEE Transactions On Intelligent Transportation Systems, doi:10.1109/TITS.2014.2345663, vol. 16, no. 2, 2015.
[5] L. Mussone, S. Grant-Muller, and J. Laird, “Sensitivity analysis of traffic congestion costs in a network under a charging policy,” Case Studies on Transport Policy, ISSN: 2213-624X, vol. 3, pp. 44–54, 10.1016/j.cstp.2014.03.00, 2015.
[6] L. Lebart, A. Morineau, and N. Tabard, “Techniques de la description statistique: méthodes et logiciels pour l’analyse des grands tableaux,” Dunod, Paris, 1977.
[7] F. Crawford, D.P. Watling, and R.D. Connors, “A statistical method for estimating predictable differences between daily traffic flow profiles,” Transportation Research Part B, vol. 95, pp. 196–213, 2017.

