#


This notebook was written for a university project. </font>

#

Homework assignment notebook

General information:

  • The homework assignment can be conducted in groups of 1 to 2 people (1 homework submission per group).
  • The homework should be submitted via email to **rakers[at]pharm.kyoto-u.ac.jp** using the subject "homework - data analysis"
  • **Deadline: 1st of August (Thu)**

General evaluation criteria:

  • Quality of workflow execution
  • Documentation/ reporting of executed steps (data analysis procedures)
  • Interpretation of findings/ results
  • Bonus points for:
    • Creativity (e.g. in applying new statistical or technical methods, finding new ways to visualise results, ...)
    • Depth of analysis and/or interpretation

0. Import the modules necessary for your workflow

We start by importing the display utilities in order to print some markdown, format the dates, and then we import pandas, matplot, numpy and seaborn in order to gather the dataset, manipulate and print some information from it, and we import the sklearn modules to apply the K-means, linear and random forest regressions.

In [1]:
from IPython.display import display, Markdown, Image
from datetime import datetime

import pandas as pd
import matplotlib.pyplot
import numpy as np
import seaborn
import pydot

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz
In [2]:
%matplotlib inline

1. Pick a dataset and load it into jupyter

For this homework, I decided to use the regularity dataset provided by the SNCF for French regional trains.
The dataset is available here (link in french)

We first define the API endpoint in order to download the dataset. We also define useful variables such as the number of datapoints, the date of the first recorded,and the date of the last recorded.

We then convert the dates to readable formats, we fill NaN values with 0 in order to avoid any type of problem, and we convert some fields to their real types as it is not automatically done.

For debugging purposes, we also display the table to check what informations are available, and how they are organized.

In [3]:
url = "https://data.sncf.com/explore/dataset/regularite-mensuelle-ter/download/?format=csv&timezone=Asia/Tokyo"

# Utility variables : number of lines in the dataset, most recent and oldest data
countDatapoints = 0
first = ""
last = ""
In [4]:
test = pd.read_csv(url, sep=';',header=0)

# Setting the datetime type on date with the correct format
test.date = pd.to_datetime(test.date)
test.date = test.date.dt.strftime('%Y-%m')

# Filling NaN values with fake data
test.fillna(0, inplace=True)

# Converting fields to their real types
test.nombre_de_trains_programmes = test.nombre_de_trains_programmes.astype(int)
test.nombre_de_trains_ayant_circule = test.nombre_de_trains_ayant_circule.astype(int)
test.nombre_de_trains_annules = test.nombre_de_trains_annules.astype(int)
test.nombre_de_trains_en_retard_a_l_arrivee = test.nombre_de_trains_en_retard_a_l_arrivee.astype(int)

# Set utility variables
first = test.date.min()
last = test.date.max()
countDatapoints = test.id.count()

# Display the table with all the data
pd.DataFrame(test).sort_values(by='date')
Out[4]:
id date region nombre_de_trains_programmes nombre_de_trains_ayant_circule nombre_de_trains_annules nombre_de_trains_en_retard_a_l_arrivee taux_de_ponctualite nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee commentaires
0 TER_3 2013-01 Auvergne 5785 5732 53 431 92.5 12.3 Conditions météos défavorables.
861 TER_5 2013-01 Bourgogne 8400 8332 68 625 92.5 12.3 Un mois de janvier qui surpasse les six exerci...
862 TER_19 2013-01 Provence Alpes Côte d'Azur 13956 13219 737 1816 86.3 6.3 0
863 TER_20 2013-01 Rhône Alpes 31438 30779 659 3552 88.5 7.7 0
864 TER_13 2013-01 Lorraine 0 0 0 0 0.0 0.0 Le Président de la Région Lorraine s'est oppos...
576 TER_15 2013-01 Nord Pas de Calais 19227 18890 337 2332 87.7 7.1 Vol de câble à Lille Sud.
575 TER_12 2013-01 Limousin 3817 3770 47 210 94.4 17.0 0
574 TER_7 2013-01 Centre 9882 9687 195 812 91.6 10.9 Trois incidents caténaires lourds, dont deux p...
573 TER_6 2013-01 Bretagne 8776 8631 145 554 93.6 14.6 Fortes chutes de neige ayant entrainé des pert...
572 TER_1 2013-01 Alsace 20095 19874 221 897 95.5 21.2 Intempéries.
1150 TER_18 2013-01 Poitou Charentes 3269 3134 135 205 93.5 14.3 Mouvements sociaux des agents du service comme...
860 TER_4 2013-01 Basse Normandie 3331 3297 34 311 90.6 9.6 Grand froid et épisode neigeux les semaines 3 ...
279 TER_16 2013-01 Pays de la Loire 10407 10195 212 713 93.0 13.3 0
278 TER_11 2013-01 Languedoc Roussillon 5024 4897 127 377 92.3 12.0 Chute de neige sur le littoral.
277 TER_10 2013-01 Haute Normandie 5957 5878 79 488 91.7 11.0 Episodes neigeux et difficultés matériels.
276 TER_8 2013-01 Champagne Ardenne 6648 6595 53 334 94.9 18.7 Conditions météorologiques dégradées du 15 au ...
859 TER_2 2013-01 Aquitaine 8099 8014 85 731 90.9 10.0 Intempéries à partir du 20 janvier.
1 TER_9 2013-01 Franche Comté 5826 5744 82 553 90.4 9.4 0
2 TER_14 2013-01 Midi Pyrénées 8208 7941 267 903 88.6 7.8 0
3 TER_17 2013-01 Picardie 11754 11673 81 1245 89.3 8.4 Fortes chutes de neige ayant entrainé des pert...
4 TER_7 2013-02 Centre 8979 8830 149 629 92.9 13.0 0
284 TER_13 2013-02 Lorraine 0 0 0 0 0.0 0.0 Le Président de la Région Lorraine s'est oppos...
283 TER_20 2013-02 Rhône Alpes 28215 27910 305 3771 86.5 6.4 Météo difficile et intrusions.
282 TER_10 2013-02 Haute Normandie 5349 5301 48 211 96.0 24.1 0
281 TER_6 2013-02 Bretagne 7729 7644 85 448 94.1 16.1 Quelques journées perturbées par des épisodes ...
280 TER_4 2013-02 Basse Normandie 3013 2991 22 185 93.8 15.2 Nombreuses difficultés de circulation et plusi...
869 TER_12 2013-02 Limousin 3449 3406 43 219 93.6 14.6 0
868 TER_5 2013-02 Bourgogne 7418 7337 81 630 91.4 10.6 Un mois de février en retrait d'un point sur l...
867 TER_3 2013-02 Auvergne 5229 5166 63 427 91.7 11.1 Conditions météos et dérangement d'installatio...
871 TER_18 2013-02 Poitou Charentes 3014 2991 23 118 96.1 24.3 0
... ... ... ... ... ... ... ... ... ... ...
1455 TER_1 2019-12 Grand Est 15858 15341 517 878 94.3 16.5 0
1143 TER_14 2019-12 Occitanie 2777 2599 178 431 83.4 5.0 0
1142 TER_6 2019-12 Bretagne 3584 3545 39 387 89.1 8.2 0
854 TER_7 2019-12 Centre 4025 3788 237 505 86.7 6.5 0
855 TER_15 2019-12 Hauts de France 10426 9999 427 1419 85.8 6.0 0
1144 TER_19 2019-12 Provence Alpes Côte d'Azur 4336 3459 877 585 83.1 4.9 0
565 TER_5 2019-12 Bourgogne-Franche Comté 5673 5252 421 518 90.1 9.1 0
566 TER_16 2019-12 Pays de la Loire 3835 3726 109 664 82.2 4.6 0
1458 TER_2 2020-01 Aquitaine 10387 10011 376 1153 87.1 6.7 0
569 TER_19 2020-01 Provence Alpes Côte d'Azur 9905 9199 706 1065 85.5 5.9 0
1145 TER_6 2020-01 Bretagne 7664 7584 80 287 96.1 24.9 0
1146 TER_7 2020-01 Centre 7991 7820 171 690 89.9 8.9 0
1147 TER_16 2020-01 Pays de la Loire 9655 9441 214 795 91.4 10.6 0
567 TER_5 2020-01 Bourgogne-Franche Comté 12135 11848 287 879 92.2 11.9 0
568 TER_15 2020-01 Hauts de France 22125 21551 574 2476 87.8 7.2 0
856 TER_1 2020-01 Grand Est 33309 32737 572 1583 94.7 18.0 0
271 TER_3 2020-01 Auvergne-Rhône Alpes 24279 23486 793 2105 89.8 8.8 0
273 TER_14 2020-01 Occitanie 6307 5802 505 717 85.3 5.8 0
272 TER_4 2020-01 Normandie 8356 8102 254 1297 83.3 5.0 0
857 TER_4 2020-02 Normandie 10373 10000 373 1279 87.2 6.8 0
1459 TER_2 2020-02 Aquitaine 16304 16018 286 1162 92.8 12.8 0
275 TER_19 2020-02 Provence Alpes Côte d'Azur 13719 13297 422 1271 90.4 9.5 0
1149 TER_6 2020-02 Bretagne 8733 8580 153 405 95.3 20.2 0
1148 TER_5 2020-02 Bourgogne-Franche Comté 14686 14352 334 1241 91.3 10.6 0
570 TER_1 2020-02 Grand Est 40918 39278 1640 2518 93.6 14.6 0
571 TER_16 2020-02 Pays de la Loire 12174 11941 233 689 94.2 16.3 0
1460 TER_7 2020-02 Centre 9903 9617 286 792 91.8 11.1 0
858 TER_15 2020-02 Hauts de France 29503 27962 1541 3076 89.0 8.1 0
274 TER_3 2020-02 Auvergne-Rhône Alpes 36026 35182 844 2970 91.6 10.8 0
1461 TER_14 2020-02 Occitanie 11512 11172 340 1112 90.0 9.1 0

1462 rows × 10 columns

2. Data inspection

Describe the raw data (For example: What is the source and background of the data? What kind of descriptors does it include? How many data points? Which attribute(s) could be used for prediction tasks (e.g. classification)?).
The following aspects might help to assess the data:

  • Number and types of attributes
  • Number of instances per attribute
  • Numeric data ranges of attributes
  • Distribution of data values per attribute (e.g. through plotting histograms and/or applying kernel density estimator plotting function (KDE))
  • Use scatter plots to visualise the relationship between some of the descriptors
  • ...
In [5]:
display(Markdown("The data represents the monthly regularity rates for the french local trains between "+first+" and "+last+"."))
display(Markdown("There are "+str(countDatapoints)+" datapoints inside of this dataset. It contains the following "+str(len(test.dtypes))+" fields :"))
print(test.dtypes)
display(Markdown("The data distribution for each attribute is represented via the following boxplot :"))
test.plot.box(figsize=(24, 16))

The data represents the monthly regularity rates for the french local trains between 2013-01 and 2020-02.

There are 1462 datapoints inside of this dataset. It contains the following 10 fields :

id                                                               object
date                                                             object
region                                                           object
nombre_de_trains_programmes                                       int32
nombre_de_trains_ayant_circule                                    int32
nombre_de_trains_annules                                          int32
nombre_de_trains_en_retard_a_l_arrivee                            int32
taux_de_ponctualite                                             float64
nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee    float64
commentaires                                                     object
dtype: object

The data distribution for each attribute is represented via the following boxplot :

Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cbeeb57358>

Source: https://data.sncf.com/explore/dataset/regularite-mensuelle-ter/table/?disjunctive.region&sort=date

The header for the CSV file downloaded previously is the following :

ID;Date;Région;Nombre de trains programmés;Nombre de trains ayant circulé;Nombre de trains annulés;Nombre de trains en retard à l'arrivée;Taux de régularité;Nombre de trains à l'heure pour un train en retard à l'arrivée;Commentaires

A datapoint looks like this :

TER_3;2013-01;Auvergne;5785;5732;53;431;92.5;12.3;Conditions météos défavorables.

Each datapoint has the following 10 fields (the names are french in the dataset, the descriptions are translated for more convenience):

- id : id of the region concerned
- date : YYYY-MM Year and month for those results
- région : name of the concerned region
- nombre de trains programmés : Number of scheduled trains for the month - no range
- nombre de trains ayant circulé : number of scheduled trains that have effectively circulated - between 0 and the value of
nombre_de_trains_programmes
- nombre de trains annulés : number of scheduled trains that were cancelled - between 0 and the value of 
nombre_de_trains_programmes
- nombre de trains en retard à l'arrivée : Number of trains delayed among the scheduled trains - between 0 and the value of
nombre_de_trains_programmes
- taux de ponctualité : punctuality rate (percentage of trains on time) - Between -100 and 100. Some datapoints don't have this
value, so the default value will be 0
- nombre de trains à l'heure pour un train en retard à l'arrivée : number of trains that effectively arrived on time for 1 train 
cancelled (in the example, 10 trains were on time for one delayed.
- commentaires : various comments written by the managers about the reasons why trains were delayed or cancelled, and 
observations about the different indicators and results

Some of those fields can be absent, as one region did not disclosed it's punctuality results until 2015, and some of the datapoints may have been filled incorrectly by the transportation authority. In this case, the fields will contain a 0 when concerned by those restrictions.

For the prediction tasks, we can basically use all the attributes, expect for the Commentaires section which contains only informations meant to have a deeper understanding of the data and that are not formatted in a computer-readable format nor in english language. It would have been better to have a standardized system which classifies the situations depending of the context, but this is strictly a limitation of the dataset.

We now display several scatter plots to display the relationships between several attributes. Those relationships are the most present in the dataset, especially for the linear between the number of scheduled trains and any other attribute related to a number of trains.

In [6]:
display(Markdown("Relationship between the number of scheduled trains and the number of trains that have effectively run"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_ayant_circule")

Relationship between the number of scheduled trains and the number of trains that have effectively run

Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cbeef0a2e8>
In [8]:
display(Markdown("Relationship between the number of scheduled trains and the number of cancelled trains"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_annules")

Relationship between the number of scheduled trains and the number of cancelled trains

Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cbeefb0e48>
In [9]:
display(Markdown("Relationship between the number of scheduled trains and the punctuality rate"))
test.plot.scatter(x="nombre_de_trains_programmes", y="taux_de_ponctualite")

Relationship between the number of scheduled trains and the punctuality rate

Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cbeeffea58>
In [10]:
display(Markdown("Relationship between the number of scheduled trains and the number of trains on time for one delayed train"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee")

Relationship between the number of scheduled trains and the number of trains on time for one delayed train

Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cbef05ea90>
In [11]:
display(Markdown("Relationship between the punctuality rate and the number of trains on time for one delayed train"))
test.plot.scatter(x="taux_de_ponctualite", y="nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee")

Relationship between the punctuality rate and the number of trains on time for one delayed train

Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cbef0b9550>

3. Basic data exploration

Analyse your data in terms of measures of central tendency and dispersion (Calculation of statistics AND visualization).
(Examples: mean, median, standard deviation, variance, 25th/75th percentile, ...)

In [12]:
display(Markdown("The global data analysis gives us the following :"))
pd.DataFrame(test).describe()

The global data analysis gives us the following :

Out[12]:
nombre_de_trains_programmes nombre_de_trains_ayant_circule nombre_de_trains_annules nombre_de_trains_en_retard_a_l_arrivee taux_de_ponctualite nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee
count 1462.000000 1462.000000 1462.000000 1462.000000 1462.000000 1462.000000
mean 10824.285910 10607.298906 222.194938 947.478796 88.832353 12.330711
std 8333.604446 8169.137232 263.820233 899.507293 15.326816 6.387889
min 0.000000 0.000000 0.000000 0.000000 -38.400000 -0.300000
25% 5498.500000 5426.750000 70.000000 369.000000 88.800000 8.000000
50% 8412.500000 8281.500000 134.000000 654.000000 91.700000 11.000000
75% 13331.500000 12949.750000 280.000000 1106.500000 93.900000 15.500000
max 46329.000000 45569.000000 4024.000000 10996.000000 98.000000 49.700000

We then display plots of the punctuality rates for each region per month, and the means and medians for each set

In [13]:
print("Displaying punctuality rates per month for each region, along with means and medians\n")
for i in range(1,20):
    # Create the id to sort and filter the data
    id = "TER_"+str(i)
    
    # Filter to keep only the data for the selected region, sorted by date
    m = test.loc[test.id == id].sort_values(by='date')
    
    # Draw the plot with all the parameters
    matplotlib.pyplot.rcParams["figure.figsize"] = [16.0, 16.0]
    matplotlib.pyplot.title('Punctuality rate per month for '+m.region.all())
    print("Region : "+m.region.all())
    matplotlib.pyplot.xlabel('Date')
    matplotlib.pyplot.ylabel('Punctuality rate')
    matplotlib.pyplot.xticks(rotation=90)
    matplotlib.pyplot.plot(m.date,m.taux_de_ponctualite)
    
    # Compute the mean, display and draw it
    mean = m.taux_de_ponctualite.mean()
    print("Mean : "+str(mean))
    horiz_line_data = np.array([mean for i in range(0,len(m.date))])
    matplotlib.pyplot.plot(m.date, horiz_line_data)
    
    #compute the median, display and draw it
    median = m.taux_de_ponctualite.median()
    print("Median : "+str(median))
    horiz_line_data2 = np.array([median for i in range(0,len(m.date))])
    matplotlib.pyplot.plot(m.date, horiz_line_data2)
    
    matplotlib.pyplot.show()
    
    
#seaborn.heatmap(punct, cmap='hot')
#matplotlib.pyplot.boxplot(punct, date)
Displaying punctuality rates per month for each region, along with means and medians

Region : Grand Est
Mean : 95.47093023255812
Median : 95.6
Region : Aquitaine
Mean : 89.15813953488373
Median : 89.2
Region : Auvergne-Rhône Alpes
Mean : 91.72674418604652
Median : 92.35
Region : Normandie
Mean : 93.2906976744186
Median : 93.9
Region : Bourgogne-Franche Comté
Mean : 91.25232558139535
Median : 91.5
Region : Bretagne
Mean : 94.66860465116284
Median : 95.15