Machine Learning

Clustering Veneto cities in the proximity of DOCG areas

Posted on June 10, 2020

1. Introduction

In this project we will cluster cities located in the proximity of DOCG areas in the Veneto region, in Italy. In order to do so, we will first need to get familiar with some of the key concepts.

We will start with a brief introduction to Italian wine classification and to Veneto, and then we will dive into how we will tackle this project.

A. Italian wine classification

Wine is produced in every region of Italy, which is home to some of the oldest wine-producing areas in the world. Italy is the world's largest producer of wine, with an area of 702,000 hectares under vineyard cultivation.

In 1963, the first official Italian system of classification of wines was launched. Since then, several modifications and additions to the legislation have been made. The last modification established four basic wine categories. The categories, from the bottom to the top level, are:

  • Vino da Tavola: generic wines that are made either mostly from one kind of authorized 'international' grape variety or entirely from two or more of them.
  • Vini IGT: wines produced in a specific territory within Italy. These must follow a series of specific and precise regulations on authorized varieties, viticultural and vinification practices, labeling instructions, etc.
  • Vini DOP: This category includes two sub-categories:
    • Vini DOC: DOC wines generally come from smaller regions that are particularly well suited to winemaking thanks to their climatic and geological characteristics and to the quality and originality of their local winemaking traditions.
    • Vini DOCG: In addition to fulfilling the requisites for DOC wines, DOCG wines must pass stricter analyses prior to commercialization. They must also demonstrate superior commercial success.

DOCG wines are the top-level wines produced in Italy, so the areas where they are produced are generally popular tourist destinations.

B. Veneto

Veneto is a gem of a region in the northeast corner of Italy. Bound on the west by Lake Garda, on the north by the Dolomite Mountains and on the east by the Adriatic Sea, the landscape of the Veneto is rich and varied. From the grandeur of crumbly old Venice to the medieval flavor of Bassano del Grappa, and on to Belluno, a striking town that's a gateway for visiting the Dolomites, the Veneto makes a fascinating region to explore.

Veneto is one of the leading Italian regions in terms of both quantity and quality of grape production. The wines produced in this region are famous throughout the world: Prosecco, Amarone, Recioto, Soave, Valpolicella and Bardolino are only a few of the wines known internationally.

Veneto is also one of the Italian regions with the highest number of DOCG areas.

C. Our objective

Now that we are familiar with DOCG areas and the Veneto region we can define our objective.

There are many beautiful cities located within or close to DOCG areas. These cities offer a wide variety of activities, venues and places of interest. If a person is interested in visiting the DOCG vineyards and staying in a city nearby, it would be great if they were able to choose a destination based on their preferred type of activities. That is the purpose of our project.

2. Data

The data used in this project has been made freely available by the Veneto region through its geoportal. There we can download the datasets and prepare them for our analysis.

A. DOCG geographical areas

The geographical information regarding the DOCG appellation areas is contained in an SHP file.

An SHP (shapefile) is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas). In Python, we can access the shapefile with the pyshp library. We then need to convert the file to geojson format in order to use it with folium for plotting our maps.
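The conversion just described can be sketched as follows. This is a minimal sketch, assuming pyshp is installed; the helper names are illustrative and are not taken from the project's notebook.

```python
import json


def make_feature(properties, geometry):
    """Wrap one record's attributes and geometry as a GeoJSON Feature."""
    return {"type": "Feature", "properties": properties, "geometry": geometry}


def shapefile_to_geojson(shp_path):
    """Read a shapefile with pyshp and return a GeoJSON FeatureCollection dict."""
    import shapefile  # pyshp; imported lazily so make_feature needs no third-party package

    reader = shapefile.Reader(shp_path)
    # The first entry of reader.fields is the internal DeletionFlag, so skip it.
    field_names = [field[0] for field in reader.fields[1:]]
    features = [
        make_feature(dict(zip(field_names, sr.record)), sr.shape.__geo_interface__)
        for sr in reader.shapeRecords()
    ]
    return {"type": "FeatureCollection", "features": features}


def save_geojson(collection, out_path):
    """Write the FeatureCollection to disk so that folium can load it later."""
    with open(out_path, "w", encoding="utf-8") as handle:
        # default=str guards against attribute values (e.g. dates) that JSON cannot encode.
        json.dump(collection, handle, default=str)
```

Calling `save_geojson(shapefile_to_geojson("docg.shp"), "docg.geojson")` would then produce a file folium can read; both file names are placeholders.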

After having created the dataset we can see the first 5 rows:

   appellation                   code  zone  coords
0  RECIOTO SOAVE CLASSICO        A021  A     [(11.252029507865142, 45.41758433331832), (11....
1  RECIOTO SOAVE                 A021  X     [(11.207064614961942, 45.4507371295929), (11.2...
2  BARDOLINO SUPERIORE CLASSICO  A025  A     [(10.794778650134427, 45.518760038125784), (10...
3  BARDOLINO SUPERIORE           A025  X     [(10.843049258063267, 45.43160165449561), (10....
4  SOAVE SUPERIORE CLASSICO      A026  A     [(11.252029507865142, 45.41758433331832), (11....

There are four features:

  • appellation: the name of the DOCG
  • code: the area code
  • zone: the zone type
  • coords: a list of (longitude, latitude) coordinate pairs outlining the area

Here is the DOCG data plotted on a map.

In [3]:
IFrame('maps/docgs.html', width=1000, height=450)
Out[3]:

B. Veneto municipalities geographical areas

We also obtained another SHP file containing the geographical coordinates of every municipality in the Veneto region. As with the previous shapefile, we can store the information in both a DataFrame and a geojson file.

This is how the data looks:

   Comune  Prov  CODISTAT  NOMCOM       PROVINCIA  AREA          PERIMETER    ID1  coords
0  29033   29    29033     Occhiobello  RO         3.251909e+07  28900.15864  527  [(11.574961743249316, 44.95070722627497), (11....
1  29025   29    29025     Gaiba        RO         1.206460e+07  18468.00608  526  [(11.479809311494138, 44.97789519756071), (11....
2  29009   29    29009     Canaro       RO         3.266567e+07  33974.60289  525  [(11.661722276937734, 44.97455175786645), (11....
3  29021   29    29021     Ficarolo     RO         1.796072e+07  21152.56640  524  [(11.440782497382445, 44.98232147591316), (11....
4  29045   29    29045     Stienta      RO         2.408899e+07  24452.03201  523  [(11.559372185120054, 44.98162314511416), (11....

We do not need to focus on any of the fields here besides NOMCOM, which is the name of the municipality, and the coords that we will need for plotting.

This is the comunes data plotted on a folium map.

In [6]:
IFrame('maps/comunes.html', width=1000, height=500)
Out[6]:

This dataset will then need to be filtered to include only the cities relevant to our study.

C. Touristic cities

The data, stored in a CSV file, contains the number of tourists visiting each of Veneto's comunes in a given year. It was collected from 2003 to 2013. Here we can see the last 5 rows:

      year  comune        province  n_tourists
5572  2013  Taglio di Po  ROVIGO    4819.0
5573  2013  Trecenta      ROVIGO    482.0
5574  2013  Villadose     ROVIGO    584.0
5575  2013  Villamarzana  ROVIGO    NaN
5576  2013  Porto Viro    ROVIGO    2566.0

It contains information about:

  • year: the relevant year
  • comune: the name of the municipality
  • province: the name of the province
  • n_tourists: the number of tourists that visited the comune in that year

For our analysis we selected the top 32 cities in proximity of DOCG areas, ranked by total number of tourists. With the help of the Nominatim geolocator we found the latitude and longitude coordinates of all of the cities. The first five rows of our final dataset look like this:

comune              latitude   longitude
Abano Terme         45.360314  11.789783
Asiago              45.875377  11.510700
Bardolino           45.547559  10.724215
Bassano del Grappa  45.766911  11.734347
Brenzone            45.707599  10.765873
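The ranking and geocoding steps can be sketched as below. This is a minimal sketch under a few assumptions: the rows are already restricted to comunes near DOCG areas (the post does not show that filter), geopy is installed for the Nominatim lookup, and the function names and the `user_agent` string are made up for illustration.

```python
import math
import time
from collections import defaultdict


def top_comuni_by_tourists(rows, n=32):
    """Rank comunes by total tourists over all years and return the top n names.

    rows: iterable of (year, comune, province, n_tourists) tuples.
    Missing counts (None or NaN) are ignored.
    """
    totals = defaultdict(float)
    for _year, comune, _province, n_tourists in rows:
        if n_tourists is None or math.isnan(n_tourists):
            continue
        totals[comune] += n_tourists
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:n]


def geocode_comuni(names, region="Veneto, Italy", pause_s=1.0):
    """Look up (latitude, longitude) for each comune with geopy's Nominatim."""
    from geopy.geocoders import Nominatim  # third-party; needs network access

    geolocator = Nominatim(user_agent="docg-clustering-sketch")  # hypothetical app name
    coords = {}
    for name in names:
        location = geolocator.geocode(f"{name}, {region}")
        if location is not None:  # geocode returns None when nothing matches
            coords[name] = (location.latitude, location.longitude)
        time.sleep(pause_s)  # stay polite towards the public Nominatim service
    return coords
```

Summing over all years is one reading of "total number of tourists"; ranking on a single year would only require filtering the rows first.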

Here the selected cities are pinned to the map of the DOCG areas. You can hover over a marker or an area to see a tooltip with the name of the comune or of the DOCG appellation.

In [9]:
IFrame('maps/selected_cities.html', width=1000, height=450)
Out[9]:
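A map like the one above could be built roughly as follows. This is a sketch, assuming folium is installed, that the GeoJSON features carry an `appellation` property, and that the output directory already exists; `map_center` and `build_map` are illustrative names.

```python
def map_center(cities):
    """Return [lat, lon] averaged over a {name: (lat, lon)} mapping."""
    lats = [lat for lat, _lon in cities.values()]
    lons = [lon for _lat, lon in cities.values()]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]


def build_map(docg_geojson, cities, out_path="maps/selected_cities.html"):
    """Draw DOCG polygons and city markers, both with hover tooltips."""
    import folium  # third-party

    fmap = folium.Map(location=map_center(cities), zoom_start=8)
    # GeoJSON stores coordinates as (lon, lat); folium.GeoJson handles that itself.
    folium.GeoJson(
        docg_geojson,
        name="DOCG areas",
        tooltip=folium.GeoJsonTooltip(fields=["appellation"]),
    ).add_to(fmap)
    # Markers, on the other hand, expect [lat, lon].
    for name, (lat, lon) in cities.items():
        folium.Marker(location=[lat, lon], tooltip=name).add_to(fmap)
    fmap.save(out_path)
    return fmap
```

The saved HTML file is what the `IFrame(...)` cell then embeds in the notebook.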

D. Foursquare API

We used the Foursquare API in order to find venues and places of interest in each of the locations.

After querying the platform, our final result is a dataset containing information about venues within these municipalities. The search used a radius of 5 km around each location. Here are the first five rows.

   city         city_lat   city_lon   venue                           venue_lat  venue_lon  category
0  Abano Terme  45.360314  11.789783  L'ombra Che Conta               45.361623  11.790219  Trattoria/Osteria
1  Abano Terme  45.360314  11.789783  Abano Grand Hotel               45.354321  11.785206  Hotel
2  Abano Terme  45.360314  11.789783  Panoramic Hotel Plaza           45.354413  11.783820  Hotel
3  Abano Terme  45.360314  11.789783  Grand Hotel Trieste & Victoria  45.352713  11.781310  Hotel
4  Abano Terme  45.360314  11.789783  Parco Urbano Termale            45.351798  11.783535  Park

The dataset contains the following fields:

  • city: the municipality
  • city_lat: the latitude of the municipality
  • city_lon: the longitude of the municipality
  • venue: the venue name
  • venue_lat: the latitude of the venue
  • venue_lon: the longitude of the venue
  • category: the category of the venue
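The post does not say which endpoint was called. As a minimal sketch, assuming the legacy v2 `venues/explore` endpoint and its usual response layout (`response.groups[].items[].venue`), a query producing the fields listed above could look like this. The credentials, the `limit` of 100 and the version date are placeholders, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # assumed legacy endpoint


def build_explore_url(lat, lon, client_id, client_secret,
                      radius_m=5000, limit=100, version="20200601"):
    """Build the query URL for venues within radius_m metres of (lat, lon)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date, placeholder value
        "ll": f"{lat},{lon}",
        "radius": radius_m,    # 5 km, as stated in the post
        "limit": limit,        # assumed page size
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_venues(city, city_lat, city_lon, payload):
    """Flatten an explore response into one row per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{}]
            rows.append({
                "city": city, "city_lat": city_lat, "city_lon": city_lon,
                "venue": venue["name"],
                "venue_lat": venue["location"]["lat"],
                "venue_lon": venue["location"]["lng"],
                "category": categories[0].get("name"),
            })
    return rows


def fetch_venues(city, lat, lon, client_id, client_secret):
    """Call the API for one city and return its venue rows."""
    with urlopen(build_explore_url(lat, lon, client_id, client_secret)) as response:
        return parse_venues(city, lat, lon, json.load(response))
```

Keeping URL building and response parsing separate from the network call makes both easy to check against a saved response before spending API quota.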

Our query returned a total of 1925 venues in the areas of interest. We can see the first five rows of the table containing the total number of venues found for each category.

category                venues
Accessories Store       2
Agriturismo             3
American Restaurant     7
Argentinian Restaurant  1
Art Gallery             3

With all the datasets acquired, we are now ready to start the analysis.

3. Methodology

This section is the main component of our analysis. As a reminder, our purpose is to cluster cities that offer similar activities. As always, we start with some exploratory data analysis. In particular, our first question is whether these municipalities have venue categories in common.

We can plot the top 5 categories for each of the comunes.
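The counts behind such a plot can be computed with the standard library alone. This is a small sketch, assuming the venues are available as (city, category) pairs; the function name is illustrative and the plotting itself is left out.

```python
from collections import Counter, defaultdict


def top_categories(venues, n=5):
    """Map each city to its n most frequent venue categories, with counts.

    venues: iterable of (city, category) pairs.
    """
    counts = defaultdict(Counter)
    for city, category in venues:
        counts[city][category] += 1
    return {city: counter.most_common(n) for city, counter in counts.items()}
```

Each value is a list of (category, count) pairs, ready to be passed to a bar chart per city.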

We can see some categories in common amongst some of the municipalities. For example, there are places like Caorle or Cavallino-Treporti that have a more maritime kind of venue or activity. In another instance, Castelnuovo del Garda and Peschiera del Garda both have theme park attractions as their main category. We can also see that venues like restaurants are very popular in all of these locations.

We need to perform some transformations to our data so that we are able to cluster the municipalities. We will approach the clustering problem by implementing the k-means algorithm. k-means is a distance-based method that iteratively updates the location of k cluster centroids until convergence. The main user-defined "ingredients" of the k-means algorithm are the distance function (often Euclidean distance) and the number of clusters k. This parameter needs to be set according to the application or problem domain.

In a nutshell, k-means groups the data by minimizing the sum of squared distances between the data points and their respective closest centroid, a quantity known as inertia. It is particularly common in problems involving spatial data.
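To make the iteration concrete, here is a plain-Python teaching sketch of the standard (Lloyd's) k-means loop with Euclidean distance. It is not scikit-learn's implementation and not the code used in the project; it picks random starting centroids and is only meant for small inputs.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until the labels stop changing."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels, inertia(points, centroids, labels)
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)` separates the two pairs of nearby points.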

In Python we can use the KMeans class from scikit-learn. We analysed the inertia for different values of k and picked k = 5. Here we can see a plot of inertia values for different values of k.
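The inertia comparison can be sketched with scikit-learn as follows. This assumes `features` is a numeric matrix with one row per city (the post does not show how it was built); `n_init=10` and the fixed seed are illustrative choices, not the project's settings.

```python
def inertia_by_k(features, k_values, seed=0):
    """Fit scikit-learn's KMeans for each k and return {k: inertia}."""
    from sklearn.cluster import KMeans  # third-party

    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        results[k] = model.inertia_
    return results


def fit_clusters(features, k=5, seed=0):
    """Fit the final model and return one cluster label per row."""
    from sklearn.cluster import KMeans  # third-party

    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    return model.labels_.tolist()
```

Plotting the returned inertias against k gives the usual "elbow" curve; inertia always decreases as k grows, so the choice of k is a judgment about where the improvement levels off.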

After fitting our model we can assign a cluster to each municipality. The labels below are zero-based (0 to 4); the Results section numbers the same clusters from 1 to 5. Here are the first five rows of the municipalities with their assigned cluster.

    city                cluster
8   Abano Terme         2
29  Asiago              0
9   Bardolino           3
26  Bassano del Grappa  0
24  Brenzone            2

The code for the data acquisition and creating the model can be found in the exploration notebook, which is part of this repo. We are now ready to check the results of our clustering.

4. Results

All of our cities are now clustered into 5 different groups. As a first step we can visualize the different clusters.

In [14]:
IFrame('maps/cities_clustered.html', width=1000, height=550)
Out[14]:

Our next step is to see what distinguishes these groups. This could be useful, for example, for recommending to tourists visiting the DOCG areas which cities to visit or stay in, based on the activities they would like to do there. We will check what characterizes each cluster.

Grouping by cluster, we get a normalized table showing the share of each venue category in each cluster.

Accessories Store Agriturismo American Restaurant Argentinian Restaurant Art Gallery Art Museum Arts & Crafts Store Asian Restaurant Athletics & Sports BBQ Joint Bagel Shop Bakery Bar Basketball Court Basketball Stadium Bay Beach Beach Bar Bed & Breakfast Beer Bar Beer Garden Bistro Board Shop Boarding House Bookstore Boutique Bowling Alley Brazilian Restaurant Breakfast Spot Brewery Bridge Buffet Burger Joint Cafeteria Café Campground Canal Castle Cheese Shop Chinese Restaurant Chocolate Shop Church City Clothing Store Cocktail Bar Coffee Shop Comfort Food Restaurant Concert Hall Coworking Space Creperie Cupcake Shop Deli / Bodega Department Store Dessert Shop Diner Discount Store Dive Bar Dive Spot Donut Shop Eastern European Restaurant Electronics Store Event Space Fast Food Restaurant Fish Market Flea Market Flower Shop Food Food & Drink Shop Football Stadium Fried Chicken Joint Furniture / Home Store Gaming Cafe Garden Garden Center Gas Station Gastropub General Entertainment German Restaurant Golf Course Gourmet Shop Greek Restaurant Grocery Store Gym Gym / Fitness Center Gym Pool Harbor / Marina Hill Historic Site History Museum Hobby Shop Hockey Arena Hot Spring Hotel Hotel Bar Hotel Pool Ice Cream Shop Indian Restaurant Italian Restaurant Japanese Restaurant Kids Store Lake Lighthouse Liquor Store Lounge Market Mediterranean Restaurant Men's Store Mexican Restaurant Middle Eastern Restaurant Monument / Landmark Mountain Movie Theater Multiplex Museum Music Venue Neighborhood Nightclub Noodle House Nudist Beach Opera House Outdoors & Recreation Outlet Store Park Pastry Shop Pedestrian Plaza Performing Arts Venue Pharmacy Piadineria Pizza Place Plaza Pool Pub Public Art Racetrack Record Shop Resort Rest Area Restaurant River Road Rock Club Sandwich Place Scenic Lookout Science Museum Sculpture Garden Seafood Restaurant Shoe Store Shop & Service Shopping Mall Shopping Plaza Skating Rink Ski Area Snack Place Soccer Field Spa Sporting Goods Shop Stadium 
Steakhouse Supermarket Sushi Restaurant Tea Room Tennis Court Thai Restaurant Theater Theme Park Theme Park Ride / Attraction Toy / Game Store Trail Train Station Trattoria/Osteria University Used Bookstore Vacation Rental Vegetarian / Vegan Restaurant Veneto Restaurant Video Game Store Water Park Waterfront Wine Bar Wine Shop Winery Women's Store Zoo
cluster
1 0.012241 0.011494 0.021419 0.01 0.01 0.01 NaN 0.012463 0.010000 0.057844 0.076923 0.016616 0.032521 NaN 0.025641 NaN 0.090000 NaN 0.014519 0.012463 0.012463 0.010000 0.01 0.01 0.013247 0.051948 0.011494 0.033248 0.014311 0.030553 0.015 0.025641 0.015633 NaN 0.079821 0.013772 NaN 0.017195 0.047619 0.014563 0.012987 0.020 0.025641 0.034926 0.035887 0.014975 0.01 NaN NaN NaN 0.027035 NaN 0.034412 0.019370 0.023018 0.019658 NaN NaN NaN 0.011494 0.025466 0.076923 0.023901 NaN 0.02439 0.01321 0.034542 0.016043 NaN 0.011494 0.032484 0.011494 0.015 NaN NaN 0.030445 0.01 NaN 0.017755 0.014925 0.011494 0.016085 0.027854 0.018079 0.01 NaN 0.01 0.018772 0.023810 0.01 0.071429 NaN 0.059223 0.025974 NaN 0.026077 0.01 0.116428 0.026454 0.012821 NaN NaN 0.014925 0.020967 0.01 0.014925 0.012987 0.01641 NaN 0.02 NaN 0.017463 0.01214 0.01813 0.01 0.010000 0.018430 0.012637 NaN NaN 0.090909 0.018734 0.018445 0.01 0.01 0.012987 0.01 0.019658 0.105097 0.033375 0.013136 0.040340 0.025641 0.01 0.01 0.010000 0.013455 0.042192 0.050455 0.01094 0.017544 0.015705 0.013772 0.010000 NaN 0.041800 0.015949 0.012821 0.017024 0.012241 0.01 0.02381 0.015705 0.010498 0.028063 0.01593 0.012821 0.01527 0.031048 0.01112 0.01 0.017821 NaN 0.010000 0.011494 0.025641 0.014925 0.028810 0.030773 0.045753 0.01 NaN NaN 0.01 0.017508 0.014925 NaN NaN 0.024114 0.01 0.014832 0.012987 0.011494
2 NaN 0.018353 0.018353 NaN NaN NaN NaN NaN NaN 0.020833 NaN NaN NaN NaN NaN NaN 0.031746 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.060516 0.044643 NaN NaN NaN 0.020833 NaN NaN NaN NaN 0.015873 NaN NaN NaN NaN NaN NaN NaN NaN 0.041667 0.015873 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.020833 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.041667 0.052579 NaN NaN 0.015873 NaN 0.149802 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.020833 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.047123 NaN 0.015873 NaN NaN NaN NaN NaN 0.020833 0.026290 NaN NaN NaN NaN NaN NaN NaN 0.055060 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.018353 NaN NaN NaN NaN 0.018353 0.034226 0.283234 NaN NaN NaN 0.047619 NaN NaN NaN NaN NaN NaN NaN 0.015873 NaN NaN NaN NaN NaN
3 NaN NaN 0.012937 NaN 0.02 0.01 0.01 0.012937 0.033333 NaN NaN 0.030000 0.015000 0.02 NaN 0.01 0.043478 NaN NaN NaN NaN NaN NaN 0.02 NaN NaN NaN 0.010000 0.020000 NaN 0.010 NaN NaN NaN 0.047572 0.033333 0.01 NaN NaN 0.017937 0.010000 0.020 0.033333 NaN 0.023370 0.020000 NaN 0.01 NaN NaN 0.025000 0.01 0.010000 0.021958 0.045608 NaN NaN 0.012937 NaN NaN 0.020000 NaN NaN NaN NaN NaN 0.015291 0.015000 0.02 0.010000 NaN NaN NaN 0.010000 NaN 0.015873 NaN NaN 0.019735 0.010000 NaN NaN 0.020000 0.020000 NaN 0.046763 NaN 0.027937 0.012937 0.02 NaN 0.012937 0.227050 NaN 0.038333 0.029841 NaN 0.162131 NaN NaN NaN 0.01000 NaN 0.015873 NaN 0.010000 NaN NaN 0.015873 NaN 0.01 NaN NaN 0.02500 NaN NaN 0.020873 NaN NaN 0.01 0.012937 NaN 0.016468 0.01 NaN NaN NaN NaN 0.103801 0.080000 0.010000 0.025291 0.020000 NaN NaN 0.030317 NaN 0.061739 NaN NaN 0.012937 0.010000 0.030000 0.012937 0.01 0.031739 NaN NaN 0.010000 NaN NaN NaN 0.020000 NaN 0.010000 NaN NaN NaN NaN NaN NaN 0.015873 NaN 0.010000 NaN NaN NaN 0.033333 NaN 0.028360 NaN 0.01 0.033333 NaN 0.010000 NaN NaN NaN 0.018624 0.02 0.015000 NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.035208 NaN NaN NaN 0.035151 0.013611 0.010000 NaN 0.016022 0.037037 NaN NaN NaN NaN 0.037037 NaN NaN NaN NaN NaN 0.010000 0.037037 0.037636 0.030814 NaN 0.010000 0.037037 NaN NaN NaN 0.022045 NaN 0.025208 NaN NaN NaN 0.037037 0.01 NaN NaN 0.037037 NaN 0.014419 NaN 0.01 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.020000 0.020000 NaN NaN NaN NaN NaN 0.037037 NaN 0.010000 NaN 0.01 0.016628 NaN NaN NaN NaN 0.013611 NaN NaN NaN 0.010000 NaN NaN NaN 0.020000 0.125005 NaN 0.023256 0.037469 NaN 0.338304 NaN NaN 0.023256 NaN NaN NaN NaN 0.010000 NaN NaN NaN NaN NaN NaN NaN 0.02000 NaN 0.037037 0.010000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.054743 0.010000 0.010000 0.020188 NaN NaN NaN 0.010000 NaN 0.043709 NaN NaN NaN NaN 0.016818 NaN NaN 0.012708 NaN 0.010000 0.010000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.013611 NaN 0.016944 0.010000 NaN 0.020818 NaN 0.038453 NaN NaN NaN NaN NaN NaN 0.01 NaN 0.013333 NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.031426 NaN NaN NaN 0.187273 0.024695 NaN NaN 0.025000 0.024390 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.025000 NaN 0.036890 0.153846 NaN NaN NaN 0.025000 NaN 0.025 0.024390 NaN 0.031731 NaN NaN NaN NaN NaN NaN NaN NaN 0.025000 NaN NaN NaN NaN 0.02439 NaN NaN NaN NaN 0.02439 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.024390 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.024695 NaN NaN 0.049390 NaN 0.176313 NaN NaN NaN 0.02439 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.025000 NaN 0.025 NaN NaN NaN 0.024390 NaN NaN NaN NaN NaN 0.070341 NaN NaN 0.024695 NaN NaN NaN 0.069887 NaN 0.088462 NaN NaN NaN NaN 0.024390 NaN NaN 0.082958 NaN NaN 0.024390 NaN NaN NaN 0.031426 NaN NaN NaN NaN NaN 0.025000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.024695 NaN NaN NaN NaN
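The post does not show how this table was computed. One plausible reading, sketched here as an assumption, is to take the share of each category within every city and then average those shares over the cities of a cluster. In this sketch a category that no city in the cluster has is simply absent from the result, which corresponds to a NaN in the table above.

```python
from collections import Counter, defaultdict


def city_category_shares(venues):
    """Per city, the share of its venues that fall in each category."""
    counts = defaultdict(Counter)
    for city, category in venues:
        counts[city][category] += 1
    return {
        city: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for city, counter in counts.items()
    }


def cluster_profiles(shares, city_to_cluster):
    """Average the per-city shares over all cities in each cluster."""
    members = defaultdict(list)
    for city, cluster in city_to_cluster.items():
        members[cluster].append(city)
    profiles = {}
    for cluster, cities in members.items():
        totals = Counter()
        for city in cities:
            totals.update(shares.get(city, {}))  # adds the shares category by category
        profiles[cluster] = {cat: total / len(cities) for cat, total in totals.items()}
    return profiles
```

Sorting each profile by value, highest first, gives the dominant categories used to describe the clusters.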

Based on the venue categories we can differentiate and define the different clusters as:

  • Cluster 1:

    This cluster contains the following cities: Asiago, Bassano del Grappa, Bussolengo, Eraclea, Iesolo, Mira, Mogliano Veneto, Noventa di Piave, Padova, San Giovanni Lupatoto, San Michele al Tagliamento, Treviso, Verona, Vicenza and Villafranca di Verona. It mainly includes venues or activities like spas, river walks and cocktail bars.
  • Cluster 2:

    This is the smallest cluster and includes the cities of Castelnuovo del Garda and Peschiera del Garda. These locations are situated near an amusement park and share nearby theme park attractions.
  • Cluster 3:

    Venice, Abano Terme, Brenzone, Garda, Montegrotto Terme and Preganziol form our third cluster. This cluster offers plenty of hotels and all sorts of gastronomic venues.
  • Cluster 4:

    This cluster is almost entirely made up of towns near Lake Garda: Bardolino, Costermano, Lazise, Quarto d'Altino, San Zeno di Montagna and Torri del Benaco. The majority of the venues are wine and food related, with many restaurants and wine bars.
  • Cluster 5:

    This cluster includes the maritime areas situated on the Gulf of Venice and offers beaches, seafood restaurants and resorts. The municipalities included are Caorle, Cavallino-Treporti and Chioggia.

5. Discussion

We are able to propose different destinations, based on the type of activities, to a person visiting the DOCG areas.

We can ultimately summarize the different clusters into what they would be best suited for in the following table.

Cluster | Cities forming the cluster | Best suited for
1 | Asiago, Bassano del Grappa, Bussolengo, Eraclea, Iesolo, Mira, Mogliano Veneto, Noventa di Piave, Padova, San Giovanni Lupatoto, San Michele al Tagliamento, Treviso, Verona, Vicenza, Villafranca di Verona | Relaxing destinations
2 | Castelnuovo del Garda, Peschiera del Garda | Family destinations
3 | Venice, Abano Terme, Brenzone, Garda, Montegrotto Terme, Preganziol | City experience
4 | Bardolino, Costermano, Lazise, Quarto d'Altino, San Zeno di Montagna, Torri del Benaco | Gastronomical tour
5 | Caorle, Cavallino-Treporti, Chioggia | Beach destination

Finally, it is important to note that this classification is limited to the information retrieved through the Foursquare API. The number of venues taken into consideration is only a fraction of the actual number.

6. Conclusion

We analysed some beautiful cities located in the proximity of DOCG areas and are able to pick a destination, based on our preferred activities, for our holiday. It is now up to you to decide which location suits you best in order to visit those wonderful areas made of outstanding wines and food.

I hope you enjoyed this journey in the land of wines. You can find the code and all the assets by following this link. The repo includes all of the files used in this project, including the datasets with the geographical data.

If you are not able to view the maps on GitHub, you can read the notebook by following this link.