We start by importing the display utilities to print Markdown and format dates, then pandas, matplotlib, numpy and seaborn to load the dataset, manipulate it and display some information about it, and finally the scikit-learn modules needed to apply K-means clustering, linear regression and random forest regression.
from IPython.display import display, Markdown, Image
from datetime import datetime
import pandas as pd
import matplotlib.pyplot
import numpy as np
import seaborn
import pydot
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz
%matplotlib inline
For this homework, I decided to use the regularity dataset provided by the SNCF for French regional trains.
The dataset is available here (page in French).
We first define the API endpoint used to download the dataset. We also define useful variables: the number of datapoints, the date of the first record, and the date of the last record.
We then convert the dates to a readable format, fill NaN values with 0 to avoid type errors, and cast some fields to their actual types, since pandas does not infer them automatically.
For debugging purposes, we also display the table to check what information is available and how it is organized.
url = "https://data.sncf.com/explore/dataset/regularite-mensuelle-ter/download/?format=csv&timezone=Asia/Tokyo"
# Utility variables: number of lines in the dataset, most recent and oldest dates
countDatapoints = 0
first = ""
last = ""
test = pd.read_csv(url, sep=';',header=0)
# Setting the datetime type on date with the correct format
test.date = pd.to_datetime(test.date)
test.date = test.date.dt.strftime('%Y-%m')
# Filling NaN values with fake data
test.fillna(0, inplace=True)
# Converting fields to their real types
test.nombre_de_trains_programmes = test.nombre_de_trains_programmes.astype(int)
test.nombre_de_trains_ayant_circule = test.nombre_de_trains_ayant_circule.astype(int)
test.nombre_de_trains_annules = test.nombre_de_trains_annules.astype(int)
test.nombre_de_trains_en_retard_a_l_arrivee = test.nombre_de_trains_en_retard_a_l_arrivee.astype(int)
# Set utility variables
first = test.date.min()
last = test.date.max()
countDatapoints = test.id.count()
# Display the table with all the data
test.sort_values(by='date')
Describe the raw data (For example: What is the source and background of the data? What kind of descriptors does it include? How many data points? Which attribute(s) could be used for prediction tasks (e.g. classification)?).
The following aspects might help to assess the data:
display(Markdown("The data represents the monthly regularity rates for French regional trains between "+first+" and "+last+"."))
display(Markdown("There are "+str(countDatapoints)+" datapoints in this dataset. It contains the following "+str(len(test.dtypes))+" fields:"))
print(test.dtypes)
display(Markdown("The data distribution for each attribute is represented in the following boxplot:"))
test.plot.box(figsize=(24, 16))
Source: https://data.sncf.com/explore/dataset/regularite-mensuelle-ter/table/?disjunctive.region&sort=date
The header of the CSV file downloaded above is the following:
ID;Date;Région;Nombre de trains programmés;Nombre de trains ayant circulé;Nombre de trains annulés;Nombre de trains en retard à l'arrivée;Taux de régularité;Nombre de trains à l'heure pour un train en retard à l'arrivée;Commentaires
A datapoint looks like this :
TER_3;2013-01;Auvergne;5785;5732;53;431;92.5;12.3;Conditions météos défavorables.
Each datapoint has the following 10 fields (the names are French in the dataset; the descriptions are translated for convenience):
- id : identifier of the region concerned
- date : year and month (YYYY-MM) for these results
- région : name of the region concerned
- nombre de trains programmés : number of scheduled trains for the month - no range
- nombre de trains ayant circulé : number of scheduled trains that effectively ran - between 0 and the value of nombre_de_trains_programmes
- nombre de trains annulés : number of scheduled trains that were cancelled - between 0 and the value of nombre_de_trains_programmes
- nombre de trains en retard à l'arrivée : number of scheduled trains that arrived late - between 0 and the value of nombre_de_trains_programmes
- taux de ponctualité : punctuality rate (percentage of trains on time) - between 0 and 100. Some datapoints lack this value, so it defaults to 0
- nombre de trains à l'heure pour un train en retard à l'arrivée : number of trains that arrived on time for each train delayed at arrival (in the example above, 12.3 trains were on time for each delayed one)
- commentaires : free-text comments written by the managers about the reasons why trains were delayed or cancelled, and observations about the various indicators and results
Some of these fields can be absent: one region did not disclose its punctuality results until 2015, and some datapoints may have been filled in incorrectly by the transportation authority. In those cases, the affected fields contain a 0.
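Since absent values were replaced by 0 when loading, one quick way to gauge how much data is affected is to count the zero entries per region. A minimal sketch on a toy frame mimicking the dataset's schema (the values here are made up for illustration):

```python
import pandas as pd

# Toy frame with the dataset's schema (values are illustrative only;
# a 0.0 punctuality rate stands for an originally missing value)
df = pd.DataFrame({
    "id": ["TER_1", "TER_1", "TER_2"],
    "taux_de_ponctualite": [92.5, 0.0, 88.1],
})

# Count how many zero-filled punctuality rates each region has
zeros_per_region = (df.taux_de_ponctualite == 0).groupby(df.id).sum()
print(zeros_per_region)
```

On the real dataset the same two lines, applied to `test`, show which regions the non-disclosure affects.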
For the prediction tasks, we can use essentially all the attributes, except for the Commentaires field, which only contains information meant to give a deeper understanding of the data and is neither machine-readable nor in English. A standardized system classifying the situations by context would have been better, but this is strictly a limitation of the dataset.
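As a sketch of that selection, we can split a frame with the same columns into a numeric feature matrix and a target, dropping the free-text comments and the identifier columns (the toy values below are illustrative, and predicting `taux_de_ponctualite` is just one possible choice of target):

```python
import pandas as pd

# Toy frame with the same columns as the SNCF dataset (illustrative values)
df = pd.DataFrame({
    "id": ["TER_1", "TER_2"],
    "date": ["2013-01", "2013-01"],
    "region": ["Auvergne", "Bretagne"],
    "nombre_de_trains_programmes": [5785, 4200],
    "nombre_de_trains_ayant_circule": [5732, 4150],
    "nombre_de_trains_annules": [53, 50],
    "nombre_de_trains_en_retard_a_l_arrivee": [431, 300],
    "taux_de_ponctualite": [92.5, 92.8],
    "commentaires": ["Conditions météos défavorables.", ""],
})

# Keep only the numeric predictors: drop the target, the free text and the identifiers
target = df["taux_de_ponctualite"]
features = df.drop(columns=["id", "date", "region", "commentaires", "taux_de_ponctualite"])
print(list(features.columns))
```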
We now display several scatter plots showing the relationships between attributes. These are the strongest relationships in the dataset, especially the linear one between the number of scheduled trains and any other train-count attribute.
display(Markdown("Relationship between the number of scheduled trains and the number of trains that have effectively run"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_ayant_circule")
display(Markdown("Relationship between the number of scheduled trains and the number of cancelled trains"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_annules")
display(Markdown("Relationship between the number of scheduled trains and the punctuality rate"))
test.plot.scatter(x="nombre_de_trains_programmes", y="taux_de_ponctualite")
display(Markdown("Relationship between the number of scheduled trains and the number of trains on time for one delayed train"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee")
display(Markdown("Relationship between the punctuality rate and the number of trains on time for one delayed train"))
test.plot.scatter(x="taux_de_ponctualite", y="nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee")
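The linear relationship visible in the first scatter plot can also be quantified with the LinearRegression class imported earlier. A minimal sketch on synthetic data (the 99% ratio and the noise level are assumptions, chosen only to imitate the scheduled/run pattern, not taken from the dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the two columns: the number of trains that ran is
# roughly 99% of the number scheduled, plus a little noise
rng = np.random.RandomState(0)
scheduled = rng.randint(1000, 8000, size=100)
ran = (0.99 * scheduled + rng.normal(0, 10, size=100)).round()

# Fit a one-feature linear model and measure how linear the relation is
model = LinearRegression().fit(scheduled.reshape(-1, 1), ran)
r2 = model.score(scheduled.reshape(-1, 1), ran)
print(model.coef_[0], r2)
```

Replacing the synthetic arrays with `test.nombre_de_trains_programmes` and `test.nombre_de_trains_ayant_circule` gives the slope and R² for the real data.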
Analyse your data in terms of measures of central tendency and dispersion (Calculation of statistics AND visualization).
(Examples: mean, median, standard deviation, variance, 25th/75th percentile, ...)
display(Markdown("The global data analysis gives us the following:"))
pd.DataFrame(test).describe()
We then plot the punctuality rate of each region per month, along with the mean and median of each series.
print("Displaying punctuality rates per month for each region, along with means and medians\n")
for i in range(1, 20):
    # Build the id used to filter the data
    region_id = "TER_" + str(i)
    # Keep only the rows for the selected region, sorted by date
    m = test.loc[test.id == region_id].sort_values(by='date')
    if m.empty:
        continue
    region_name = m.region.iloc[0]
    # Draw the plot with all the parameters
    matplotlib.pyplot.rcParams["figure.figsize"] = [16.0, 16.0]
    matplotlib.pyplot.title('Punctuality rate per month for ' + region_name)
    print("Region: " + region_name)
    matplotlib.pyplot.xlabel('Date')
    matplotlib.pyplot.ylabel('Punctuality rate')
    matplotlib.pyplot.xticks(rotation=90)
    matplotlib.pyplot.plot(m.date, m.taux_de_ponctualite)
    # Compute the mean, display it and draw it as a horizontal line
    mean = m.taux_de_ponctualite.mean()
    print("Mean: " + str(mean))
    matplotlib.pyplot.plot(m.date, np.full(len(m.date), mean))
    # Compute the median, display it and draw it as a horizontal line
    median = m.taux_de_ponctualite.median()
    print("Median: " + str(median))
    matplotlib.pyplot.plot(m.date, np.full(len(m.date), median))
    matplotlib.pyplot.show()