1. Descriptions: This dataset summarizes a heterogeneous set of features about articles
published by Mashable over a period of two years. The goal is to predict the number of shares in
social networks (popularity).

2. Number of Instances: 39797

3. Number of Variables: 61

4. Note: several of the variables may be correlated, so you should check for multicollinearity.

5. Variable Information:
0. url: URL of the article
1. timedelta: Days between the article publication and the dataset acquisition
2. n_tokens_title: Number of words in the title
3. n_tokens_content: Number of words in the content
4. n_unique_tokens: Rate of unique words in the content
5. n_non_stop_words: Rate of non-stop words in the content
6. n_non_stop_unique_tokens: Rate of unique non-stop words in the content
7. num_hrefs: Number of links
8. num_self_hrefs: Number of links to other articles published by Mashable
9. num_imgs: Number of images
10. num_videos: Number of videos
11. average_token_length: Average length of the words in the content
12. num_keywords: Number of keywords in the metadata
13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus: Is data channel 'Business'?
16. data_channel_is_socmed: Is data channel 'Social Media'?
17. data_channel_is_tech: Is data channel 'Tech'?
18. data_channel_is_world: Is data channel 'World'?
19. kw_min_min: Worst keyword (min. shares)
20. kw_max_min: Worst keyword (max. shares)
21. kw_avg_min: Worst keyword (avg. shares)
22. kw_min_max: Best keyword (min. shares)
23. kw_max_max: Best keyword (max. shares)
24. kw_avg_max: Best keyword (avg. shares)
25. kw_min_avg: Avg. keyword (min. shares)
26. kw_max_avg: Avg. keyword (max. shares)
27. kw_avg_avg: Avg. keyword (avg. shares)
28. self_reference_min_shares: Min. shares of referenced articles in Mashable
29. self_reference_max_shares: Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
31. published_day: Which day was the article published?
32. is_weekend: Was the article published on the weekend?
33. LDA_00: Closeness to LDA topic 0
34. LDA_01: Closeness to LDA topic 1
35. LDA_02: Closeness to LDA topic 2
36. LDA_03: Closeness to LDA topic 3
37. LDA_04: Closeness to LDA topic 4
38. global_subjectivity: Text subjectivity
39. global_sentiment_polarity: Text sentiment polarity
40. global_rate_positive_words: Rate of positive words in the content
41. global_rate_negative_words: Rate of negative words in the content
42. rate_positive_words: Rate of positive words among non-neutral tokens
43. rate_negative_words: Rate of negative words among non-neutral tokens
44. avg_positive_polarity: Avg. polarity of positive words
45. min_positive_polarity: Min. polarity of positive words
46. max_positive_polarity: Max. polarity of positive words
47. avg_negative_polarity: Avg. polarity of negative words
48. min_negative_polarity: Min. polarity of negative words
49. max_negative_polarity: Max. polarity of negative words
50. title_subjectivity: Title subjectivity
51. title_sentiment_polarity: Title polarity
52. abs_title_subjectivity: Absolute subjectivity level
53. abs_title_sentiment_polarity: Absolute polarity level
54. shares: Number of shares (target)

6. Missing Attribute Values: None

7. Citation Request: Please include this citation if you plan to use this database:

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for
Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese
Conference on Artificial Intelligence, September, Coimbra, Portugal.
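
Variables 13 to 18 in the list above describe the data channel through six separate indicators.
For exploration it can be convenient to collapse them into one categorical value. The sketch
below is only an illustration: it assumes the indicators are coded 0/1 and that at most one is
set per article, neither of which is stated above. The helper name data_channel and the label
"none" are hypothetical.

```python
# Suffixes of the six data_channel_is_* columns from the variable list.
CHANNELS = ["lifestyle", "entertainment", "bus", "socmed", "tech", "world"]

def data_channel(row):
    """Return the channel whose indicator equals 1, or "none" if no indicator is set.

    Assumes 0/1 coding and at most one indicator set per article.
    """
    for name in CHANNELS:
        if float(row.get(f"data_channel_is_{name}", 0)) == 1.0:
            return name
    return "none"

# Invented example row, not taken from the dataset.
example = {"data_channel_is_tech": 1.0, "data_channel_is_world": 0.0}
print(data_channel(example))
```

If some articles turn out to have more than one indicator set, the function returns the first
match in list order, so that assumption is worth checking on the real data first.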

1. Descriptions: The data are related to red Vinho Verde wine samples, from the north of
Portugal. The goal is to predict wine quality.

2. Number of Instances: 1599

3. Number of Variables: 11 + dependent variable

4. Note: several of the variables may be correlated, so you should check for multicollinearity.

5. Variable information:

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (score between 0 and 10)

6. Missing Attribute Values: None
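
One simple first pass at the multicollinearity note above is to look for strongly correlated
pairs among the 11 predictors. The sketch below uses only the Python standard library. The
function names, the 0.8 threshold, and the toy values are assumptions made for illustration,
and the toy values are invented rather than taken from the wine data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every predictor pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Invented toy values, NOT taken from the wine data.
toy = {
    "fixed acidity": [7.0, 8.0, 9.0, 10.0, 11.0],
    "pH": [3.6, 3.5, 3.3, 3.2, 3.0],
    "alcohol": [9.5, 12.0, 9.0, 11.0, 10.0],
}
flagged = correlated_pairs(toy)
print(flagged)
```

Pairwise correlation only catches relationships between two variables at a time. A predictor
that is nearly a combination of several others can still go unnoticed, so a variance inflation
factor check is a common complement.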


1. Descriptions: The data are related to white Vinho Verde wine samples, from the north of
Portugal. The goal is to predict wine quality.

2. Number of Instances: 4898

3. Number of Variables: 11 + dependent variable

4. Note: several of the variables may be correlated, so you should check for multicollinearity.

5. Variable information:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (score between 0 and 10)

6. Missing Attribute Values: None


1. Descriptions: This dataset is related to predicting the performance metrics of posts published
on brands’ Facebook pages. Multiple performance metrics are in the dataset.

2. Variable information:

1 – Type: Post type (Link, Photo, Status, Video).
2 – Category: action, product, inspiration.
3 – Paid: 1=yes, 0=no.
4 – LifetimePostTotalReach: The number of people who saw a page post (unique users).
5 – LifetimePostTotalImpressions: The number of times a post from a page is displayed,
whether the post is clicked or not. People may see multiple impressions of the same post. For
example, someone might see a Page update in News Feed once, and then a second time if a
friend shares it.
6 – LifetimeEngagedUsers: The number of people who clicked anywhere in a post (unique users).
7 – LifetimePostConsumers: The number of people who clicked anywhere in a post.
8 – LifetimePostConsumptions: The number of clicks anywhere in a post.
9 – LifetimePostImpressionsByLiked: The number of impressions just from people who have
liked a page.
10 – LifetimePostReachByLiked: The number of people who saw a page post because they
have liked that page (unique users).
11 – LifetimePeopleEngaged: The number of people who have liked a Page and clicked
anywhere in a post (unique users).
12 – Comments: Number of comments on the publication.
13 – Likes: Number of “Likes” on the publication.
14 – Shares: Number of times the publication was shared.
15 – TotalInteractions: The sum of “likes,” “comments,” and “shares” of the post.
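
Because TotalInteractions is defined as a sum of three other columns, that identity can be
checked row by row before modeling. The sketch below is a minimal illustration. The function
name and the example row are invented, and it assumes rows are available as dictionaries keyed
by the variable names listed above.

```python
def total_interactions(post):
    """Sum of likes, comments and shares, following the TotalInteractions definition."""
    return post["Likes"] + post["Comments"] + post["Shares"]

# Invented example row, not taken from the dataset.
post = {"Comments": 4, "Likes": 79, "Shares": 17, "TotalInteractions": 100}

if total_interactions(post) != post["TotalInteractions"]:
    print("row violates the TotalInteractions definition")
else:
    print("row is consistent")
```

The same identity means that using Likes, Comments, and Shares together as predictors of
TotalInteractions would reproduce the target exactly, which is rarely the intended analysis.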


1. Descriptions: Bike sharing systems are a new generation of traditional bike rentals in which
the whole process, from membership to rental and return, has become automatic. Through these
systems, a user can easily rent a bike at one location and return it at another. Currently,
there are over 500 bike-sharing programs around the world, comprising over 500 thousand
bicycles. Today, there is great interest in understanding the use of these systems due to their
growth, as well as their role in traffic, environmental, and health issues.

Due to the individualized and electronic nature of bike sharing systems, detailed information is
recorded, including the duration of travel and the departure and arrival positions. The dataset
is the two-year historical log for the years 2011 and 2012 from the Capital Bikeshare system,
Washington D.C.

2. Variable information:

1 instant: record index
2 dteday: date
3 season: season (spring, summer, fall, winter)
4 yr: year (2011; 2012)
5 mnth: month (1 to 12)
6 holiday: holiday; non_holiday
7 weekday: day of the week
8 workingday: workingday; non_workingday
9 weathersit:
- Clear, Few clouds, Partly cloudy
- Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
10 temp: Normalized temperature in Celsius. The values are divided by 41 (max)
11 atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
12 hum: Normalized humidity. The values are divided by 100 (max)
13 windspeed: Normalized wind speed. The values are divided by 67 (max)
14 casual: count of casual users
15 registered: count of registered users
16 cnt: count of total rental bikes including both casual and registered
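
The four weather variables above are stored after division by a stated maximum, so multiplying
by the same constant recovers the original scale. The sketch below illustrates this and also
checks that cnt equals casual plus registered. It takes the divisors exactly as listed above.
The function names and the example row are invented for illustration.

```python
# Divisors as stated in the variable list (each is the documented maximum).
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}

def denormalize(row):
    """Return a copy of row with the four normalized weather fields rescaled."""
    out = dict(row)
    for name, divisor in SCALE.items():
        out[name] = row[name] * divisor
    return out

def counts_consistent(row):
    """cnt is described as including both casual and registered users."""
    return row["cnt"] == row["casual"] + row["registered"]

# Invented example row, not taken from the dataset.
row = {"temp": 0.5, "atemp": 0.5, "hum": 0.75, "windspeed": 0.25,
       "casual": 300, "registered": 700, "cnt": 1000}
restored = denormalize(row)
print(restored["temp"], restored["atemp"], counts_consistent(row))
```

Since cnt is the sum of casual and registered, those two columns should not be used as
predictors when cnt is the target.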

1. Descriptions: Bike sharing systems are a new generation of traditional bike rentals in which
the whole process, from membership to rental and return, has become automatic. Through these
systems, a user can easily rent a bike at one location and return it at another. Currently,
there are over 500 bike-sharing programs around the world, comprising over 500 thousand
bicycles. Today, there is great interest in understanding the use of these systems due to their
growth, as well as their role in traffic, environmental, and health issues.

Due to the individualized and electronic nature of bike sharing systems, detailed information is
recorded, including the duration of travel and the departure and arrival positions. The dataset
is the two-year historical log for the years 2011 and 2012 from the Capital Bikeshare system,
Washington D.C.

2. Variable information:

1 instant: record index
2 dteday: date
3 season: season (spring, summer, fall, winter)
4 yr: year (0: 2011, 1: 2012)
5 mnth: month (1 to 12)
6 hr: hour (0 to 23)
7 holiday: holiday; non_holiday
8 weekday: day of the week
9 workingday: workingday; non_workingday
10 weathersit:
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
11 temp: Normalized temperature in Celsius. The values are divided by 41 (max)
12 atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
13 hum: Normalized humidity. The values are divided by 100 (max)
14 windspeed: Normalized wind speed. The values are divided by 67 (max)
15 casual: count of casual users
16 registered: count of registered users
17 cnt: count of total rental bikes including both casual and registered
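
Because this version of the log adds the hr variable, per-day totals can be obtained by summing
the hourly counts for each dteday. The sketch below shows one way to do that with the standard
library. The function name, the date strings, and the example rows are invented for
illustration.

```python
from collections import defaultdict

def daily_totals(hourly_rows):
    """Sum casual, registered and cnt over all hours of each dteday."""
    totals = defaultdict(lambda: {"casual": 0, "registered": 0, "cnt": 0})
    for row in hourly_rows:
        day = totals[row["dteday"]]
        for field in ("casual", "registered", "cnt"):
            day[field] += row[field]
    return dict(totals)

# Invented example rows, not taken from the dataset; the date format is an assumption.
hourly = [
    {"dteday": "2011-01-01", "hr": 0, "casual": 3, "registered": 13, "cnt": 16},
    {"dteday": "2011-01-01", "hr": 1, "casual": 8, "registered": 32, "cnt": 40},
    {"dteday": "2011-01-02", "hr": 0, "casual": 4, "registered": 13, "cnt": 17},
]
totals = daily_totals(hourly)
print(totals["2011-01-01"]["cnt"])
```

Averaging rather than summing would be the natural choice for temp, atemp, hum, and windspeed,
since those are measurements rather than counts.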

1. Descriptions: GoodBelly is trying to boost its sales at grocery stores like Whole Foods Market.
As a small start-up, GoodBelly must optimize the allocation of its limited marketing budget. It
currently promotes through in-person demonstrations in stores, but management is concerned
that these demonstrations are not effective enough to justify the cost. The main task is to
determine whether or not the company should continue its promotional programs.

2. Variable information:

1 – Date: Date.
2 – Region: Region.
3 – UnitsSold: The number of units sold per store per week.
4 – AverageRetailPrice: The average retail price for GoodBelly products per store per week.
5 – SalesRep: 1 if the store had a regional sales rep (face-to-face contact) and 0 if the store
had only the national sales rep (no face-to-face contact).
6 – Endcap: 1 if a store participated in an endcap promotion.
7 – Demo: 1 if the store had a demo on the corresponding.
8 – Demo1_3: 1 if the store had a demo 1-3 weeks ago.
9 – Demo4_5: 1 if the store had a demo 4-5 weeks ago.
10 – Natural: The number of other natural retailers within 5 miles of each store.
11 – Fitness: The number of fitness centers within 5 miles of each store.
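
The three demo indicators above are related: Demo1_3 and Demo4_5 look back over earlier weeks
of the same store. The sketch below shows how such lagged flags could be derived from a weekly
Demo series. It is an illustration only, not a description of how the dataset was built. It
assumes one row per store per consecutive week, ordered oldest first, and that Demo refers to
the current week. The function name and the example series are invented.

```python
def lagged_demo_flags(demo):
    """Given one store's weekly Demo flags (oldest first), return (Demo1_3, Demo4_5).

    Assumes consecutive weeks with no gaps.
    """
    demo1_3, demo4_5 = [], []
    for week in range(len(demo)):
        recent = demo[max(0, week - 3):week]              # weeks t-3 .. t-1
        older = demo[max(0, week - 5):max(0, week - 3)]   # weeks t-5 .. t-4
        demo1_3.append(int(any(recent)))
        demo4_5.append(int(any(older)))
    return demo1_3, demo4_5

# Invented series: a single demo in the first week, none afterwards.
demo = [1, 0, 0, 0, 0, 0, 0]
d13, d45 = lagged_demo_flags(demo)
print(d13)
print(d45)
```

Under these assumptions the three flags separate the immediate effect of a demo from its
carry-over in later weeks, which is the comparison the promotional question turns on.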


1. Descriptions: In 2014, the owner of a food truck based in Hamilton, Ontario, was looking over
the first year of her operations. In addition to working in Hamilton, she had tried to maximize
her revenues by driving to several other cities and charging various prices for each burger,
depending partly on the fresh ingredients available in each city. Besides location, the owner had
collected data on a few other factors (the weather, the day of the week, the city's population,
and whether a festival was going on) that had had an impact on the demand for her product.
She wondered whether analytics could help her decide where to sell and how much to charge
on a daily basis.

2. Variable information:

1 Date: Date
2 QuantitySold
3 City: Hamilton, Toronto, London, Waterloo
4 Precipitation: The precipitation probability
5 Temperature: in Celsius
6 Festival: 1 if there is a festival on that day and 0 otherwise.
7 Price: in dollars
8 Weekday: 1 if the day is a weekday and 0 otherwise.
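
Two quantities follow directly from the variables above: daily revenue, as Price times
QuantitySold, and the Weekday flag, which can be recomputed from Date as a consistency check.
The sketch below assumes that "weekday" means Monday through Friday and that Date can be parsed
into a Python date. Both are assumptions, and the function names are invented.

```python
from datetime import date

def revenue(price, quantity_sold):
    """Daily revenue in dollars: price per burger times burgers sold."""
    return price * quantity_sold

def weekday_flag(day):
    """Return 1 for Monday-Friday and 0 for Saturday/Sunday (assumed definition)."""
    return 1 if day.weekday() < 5 else 0

# Invented example values, not taken from the dataset.
print(revenue(8.5, 120))
print(weekday_flag(date(2014, 7, 4)), weekday_flag(date(2014, 7, 5)))
```

Revenue rather than QuantitySold is the quantity the owner wants to maximize, so a demand model
for QuantitySold would typically be combined with Price in this way when comparing options.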

