We start by importing the display utilities to print Markdown and format dates, then pandas, matplotlib, numpy and seaborn to download, manipulate and visualize the dataset, and finally the scikit-learn modules used for the K-means clustering and for the linear and random forest regressions.
from IPython.display import display, Markdown, Image
from datetime import datetime
import pandas as pd
import matplotlib.pyplot
import numpy as np
import seaborn
import pydot
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz
%matplotlib inline
For this homework, I decided to use the regularity dataset provided by the SNCF for French regional trains.
The dataset is available here (link in French).
We first define the API endpoint used to download the dataset, along with a few utility variables: the number of datapoints, the date of the earliest record, and the date of the latest record.
We then convert the dates to a readable format, fill NaN values with 0 to avoid type problems later on, and cast some fields to their actual types, as this is not done automatically.
For debugging purposes, we also display the table to check what information is available and how it is organized.
url = "https://data.sncf.com/explore/dataset/regularite-mensuelle-ter/download/?format=csv&timezone=Asia/Tokyo"
# Utility variables : number of lines in the dataset, most recent and oldest data
countDatapoints = 0
first = ""
last = ""
test = pd.read_csv(url, sep=';',header=0)
# Setting the datetime type on date with the correct format
test.date = pd.to_datetime(test.date)
test.date = test.date.dt.strftime('%Y-%m')
# Filling NaN values with fake data
test.fillna(0, inplace=True)
# Converting fields to their real types
test.nombre_de_trains_programmes = test.nombre_de_trains_programmes.astype(int)
test.nombre_de_trains_ayant_circule = test.nombre_de_trains_ayant_circule.astype(int)
test.nombre_de_trains_annules = test.nombre_de_trains_annules.astype(int)
test.nombre_de_trains_en_retard_a_l_arrivee = test.nombre_de_trains_en_retard_a_l_arrivee.astype(int)
# Set utility variables
first = test.date.min()
last = test.date.max()
countDatapoints = test.id.count()
# Display the table with all the data
test.sort_values(by='date')
Describe the raw data (For example: What is the source and background of the data? What kind of descriptors does it include? How many data points? Which attribute(s) could be used for prediction tasks (e.g. classification)?).
The following aspects might help to assess the data:
display(Markdown("The data represents the monthly regularity rates for the french local trains between "+first+" and "+last+"."))
display(Markdown("There are "+str(countDatapoints)+" datapoints inside of this dataset. It contains the following "+str(len(test.dtypes))+" fields :"))
print(test.dtypes)
display(Markdown("The data distribution for each attribute is represented via the following boxplot :"))
test.plot.box(figsize=(24, 16))
Source: https://data.sncf.com/explore/dataset/regularite-mensuelle-ter/table/?disjunctive.region&sort=date
The header for the CSV file downloaded previously is the following :
ID;Date;Région;Nombre de trains programmés;Nombre de trains ayant circulé;Nombre de trains annulés;Nombre de trains en retard à l'arrivée;Taux de régularité;Nombre de trains à l'heure pour un train en retard à l'arrivée;Commentaires
A datapoint looks like this :
TER_3;2013-01;Auvergne;5785;5732;53;431;92.5;12.3;Conditions météos défavorables.
Each datapoint has the following 10 fields (the names are in French in the dataset; the descriptions are translated for convenience):
- id : id of the region concerned
- date : year and month (YYYY-MM) of the results
- région : name of the region concerned
- nombre de trains programmés : number of trains scheduled for the month - no range
- nombre de trains ayant circulé : number of scheduled trains that actually ran - between 0 and the value of nombre_de_trains_programmes
- nombre de trains annulés : number of scheduled trains that were cancelled - between 0 and the value of nombre_de_trains_programmes
- nombre de trains en retard à l'arrivée : number of scheduled trains that were delayed at arrival - between 0 and the value of nombre_de_trains_programmes
- taux de ponctualité : punctuality rate (percentage of trains on time) - between -100 and 100. Some datapoints do not have this value, so it defaults to 0
- nombre de trains à l'heure pour un train en retard à l'arrivée : number of trains that arrived on time for each train delayed at arrival (in the example above, about 12 trains were on time for every delayed train)
- commentaires : free-text comments written by the managers about the reasons why trains were delayed or cancelled, and observations about the various indicators and results
Some of those fields can be missing, as one region did not disclose its punctuality results until 2015, and some datapoints may have been filled in incorrectly by the transportation authority. In those cases, the affected fields contain a 0.
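As a quick check of how many datapoints are affected, the following short snippet (reusing the test dataframe already loaded above) counts the months whose punctuality rate was filled with 0:
# Count the datapoints whose punctuality rate was filled with 0 (missing value)
missing_rate = test.loc[test.taux_de_ponctualite == 0]
print("Datapoints without a reported punctuality rate: " + str(len(missing_rate)))
print(missing_rate.region.value_counts())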
For the prediction tasks, we can basically use all the attributes, except for the commentaires field, which only contains information meant to give a deeper understanding of the data and is neither formatted in a computer-readable way nor written in English. It would have been better to have a standardized system that classifies the situations depending on the context, but this is strictly a limitation of the dataset.
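As an illustration of what such a standardized system could look like, the following sketch tags each comment with a rough cause using simple keyword matching; the commentaires column name and the keyword categories are illustrative assumptions, not something defined by the dataset.
# Hypothetical keyword-based tagging of the free-text comments.
# The column name 'commentaires' and the keyword categories are assumptions for illustration.
def tag_comment(comment):
    comment = str(comment).lower()
    if "météo" in comment or "neige" in comment:
        return "weather"
    if "grève" in comment or "greve" in comment:
        return "strike"
    if "travaux" in comment:
        return "maintenance"
    return "other"
causes = test.commentaires.apply(tag_comment)
print(causes.value_counts())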
We now display several scatter plots to show the relationships between pairs of attributes. These are the most visible relationships in the dataset, in particular the linear relationship between the number of scheduled trains and any other attribute counting a number of trains.
display(Markdown("Relationship between the number of scheduled trains and the number of trains that have effectively run"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_ayant_circule")
display(Markdown("Relationship between the number of scheduled trains and the number of cancelled trains"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_annules")
display(Markdown("Relationship between the number of scheduled trains and the punctuality rate"))
test.plot.scatter(x="nombre_de_trains_programmes", y="taux_de_ponctualite")
display(Markdown("Relationship between the number of scheduled trains and the number of trains on time for one delayed train"))
test.plot.scatter(x="nombre_de_trains_programmes", y="nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee")
display(Markdown("Relationship between the punctuality rate and the number of trains on time for one delayed train"))
test.plot.scatter(x="taux_de_ponctualite", y="nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee")
Analyse your data in terms of measures of central tendency and dispersion (Calculation of statistics AND visualization).
(Examples: mean, median, standard deviation, variance, 25th/75th percentile, ...)
display(Markdown("The global data analysis gives us the following :"))
test.describe()
We then plot the punctuality rate per month for each region, along with its mean and median.
print("Displaying punctuality rates per month for each region, along with means and medians\n")
for i in range(1, 20):
    # Create the id used to filter the data
    region_id = "TER_" + str(i)
    # Keep only the data for the selected region, sorted by date
    m = test.loc[test.id == region_id].sort_values(by='date')
    region_name = m.region.iloc[0]
    # Draw the plot with all the parameters
    matplotlib.pyplot.rcParams["figure.figsize"] = [16.0, 16.0]
    matplotlib.pyplot.title('Punctuality rate per month for ' + region_name)
    print("Region : " + region_name)
    matplotlib.pyplot.xlabel('Date')
    matplotlib.pyplot.ylabel('Punctuality rate')
    matplotlib.pyplot.xticks(rotation=90)
    matplotlib.pyplot.plot(m.date, m.taux_de_ponctualite)
    # Compute the mean, display and draw it as a horizontal line
    mean = m.taux_de_ponctualite.mean()
    print("Mean : " + str(mean))
    matplotlib.pyplot.plot(m.date, np.full(len(m.date), mean))
    # Compute the median, display and draw it as a horizontal line
    median = m.taux_de_ponctualite.median()
    print("Median : " + str(median))
    matplotlib.pyplot.plot(m.date, np.full(len(m.date), median))
    matplotlib.pyplot.show()
display(Markdown("We then display the Boxplots for each region. They report, for each numerical attribute related to the number of trains (number of trains scheduled, that have circulated, delayed at final stop and cancelled), the quartiles, means, and the upper and lower bounds."))
for i in range(1, 20):
    # Create the id used to filter the data
    region_id = "TER_" + str(i)
    # Keep only the selected region, sorted by date, with only the attributes we want
    m = test.loc[test.id == region_id, ["date","region","nombre_de_trains_programmes","nombre_de_trains_ayant_circule","nombre_de_trains_en_retard_a_l_arrivee","nombre_de_trains_annules"]].sort_values(by='date')
    matplotlib.pyplot.rcParams["figure.figsize"] = [24.0, 16.0]
    m.plot.box()
    matplotlib.pyplot.title('Boxplot for ' + m.region.iloc[0])
    matplotlib.pyplot.show()
display(Markdown("We do the same for the punctuality rates and the number of trains on time for one delayed train, since their values are much smaller (generally between 0 and 100, with some exceptions)."))
for i in range(1, 20):
    # Create the id used to filter the data
    region_id = "TER_" + str(i)
    # Keep only the selected region, sorted by date, with only the attributes we want
    m = test.loc[test.id == region_id, ["date","region","taux_de_ponctualite","nombre_de_trains_a_lheure_pour_un_train_en_retard_a_larrivee"]].sort_values(by='date')
    matplotlib.pyplot.rcParams["figure.figsize"] = [24.0, 16.0]
    m.plot.box()
    matplotlib.pyplot.title('Boxplot for ' + m.region.iloc[0])
    matplotlib.pyplot.show()
Calculate covariance and correlation matrices.
Which attributes are highly (positively or negatively) correlated?
Create a heatmap of the correlation matrix.
We output the covariance and correlation matrices and the corresponding heatmaps for the whole dataset, then for each region individually.
print("Covariance and correlation matrices for all regions")
m = test.sort_values(by='date')
print(m.cov())
cl = m.corr()
print(cl)
seaborn.heatmap(cl,annot=True, cmap='hot')
matplotlib.pyplot.show()
for i in range(1, 20):
    # Create the id used to filter the data
    region_id = "TER_" + str(i)
    # Keep only the data for the selected region, sorted by date
    m = test.loc[test.id == region_id].sort_values(by='date')
    print("Covariance and correlation matrices for " + m.region.iloc[0])
    print(m.cov())
    cl = m.corr()
    print(cl)
    seaborn.heatmap(cl, annot=True, cmap='hot')
    matplotlib.pyplot.show()
For the dataset as a whole, two groups of attributes stand out: the number of scheduled trains, the number of trains that ran, the number of cancelled trains and the number of delayed trains have moderate correlations with each other, roughly between 0.6 and 0.75, while the punctuality rate and the number of trains on time per delayed train only reach a correlation of about 0.5 with each other. The two groups show no strong positive correlation between them: the values are weak (around 0.2 at most) or negative (down to about -0.36). The same tendency appears in each per-region analysis, but the actual values vary considerably from one region to another.
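To make these figures easier to check, a short sketch like the following can list the attribute pairs whose absolute correlation exceeds a chosen threshold (0.5 here, an arbitrary value) on the whole dataset:
# Recompute the correlation matrix on the whole dataset and list the pairs
# whose absolute correlation exceeds an (arbitrary) threshold of 0.5.
cl_all = test.corr()
mask = np.triu(np.ones(cl_all.shape, dtype=bool), k=1)  # keep each pair once, excluding the diagonal
pairs = cl_all.where(mask).stack().sort_values(ascending=False)
print(pairs[pairs.abs() > 0.5])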
We need to select the datapoints where the fields of interest are not equal to 0, since a 0 usually indicates the absence of relevant data. This means that we exclude from our model all reported months without data, to avoid any problem.
We also reshape our data to make the supervised learning steps easier.
ah = test.loc[test.taux_de_ponctualite > 0].sort_values(by='date')
test.describe()
ah.plot(x='nombre_de_trains_programmes', y='nombre_de_trains_en_retard_a_l_arrivee', style='o')
matplotlib.pyplot.title('Number of scheduled trains vs Number of delayed trains (before reshaping)')
matplotlib.pyplot.xlabel('Number of scheduled trains')
matplotlib.pyplot.ylabel('Number of delayed trains')
matplotlib.pyplot.show()
X = ah['nombre_de_trains_programmes'].values.reshape(-1,1)
y = ah['nombre_de_trains_en_retard_a_l_arrivee'].values.reshape(-1,1)
Apply a clustering (e.g. k-means, hierarchical, ...) or PCA algorithm to your dataset.
NOTE: This step can be skipped in case you have a dataset that is suitable for supervised learning!
(Note: If you have a dataset that is suitable for supervised learning but you still want to perform unsupervised learning:
In that case, the dependent variable (class attribute to predict) should be excluded from the data for clustering.
For example, for the iris dataset we would exclude the attribute "species" to see if clustering only based
on the independent variables (i.e. petal and sepal lengths/widths) allows partitioning of the data into
groups that relate to the plant species)
For the unsupervised learning part, we apply k-means clustering to our dataset in order to identify the main groups among some of our fields.
Here, we try to predict how many trains will be cancelled on average by selecting the cluster that best matches a given number of scheduled trains.
To obtain a reasonable clustering, we first use the elbow method to select a value of k with a low within-cluster sum of squared errors (SSE).
df = pd.DataFrame( test.sort_values(by='date'),columns=['nombre_de_trains_programmes','nombre_de_trains_annules'])
ssedata = {}
for k in range(2, 30):
    kmeans = KMeans(n_clusters=k).fit(df)
    centroids = kmeans.cluster_centers_
    ssedata[k] = kmeans.inertia_
matplotlib.pyplot.figure()
matplotlib.pyplot.plot(list(ssedata.keys()), list(ssedata.values()))
matplotlib.pyplot.xlabel("Number of clusters")
matplotlib.pyplot.ylabel("SSE")
With this method, we see that the SSE keeps decreasing as the number of clusters grows, but the gains become marginal beyond roughly 15 clusters. We therefore pick k = 15 as a good trade-off and compute the k-means clustering with this value.
kmeans = KMeans(n_clusters=15).fit(df)
centroids = kmeans.cluster_centers_
print(centroids)
matplotlib.pyplot.scatter(df['nombre_de_trains_programmes'], df['nombre_de_trains_annules'], c= kmeans.labels_.astype(float), s=50, alpha=0.5)
matplotlib.pyplot.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
According to the final clustering, if we decide in the future to schedule 9259 trains in a month, we can expect around 177 of them to be cancelled on average. Since the data is quite scattered, this is not the most reliable estimate available, but it gives a global, average outcome that can be used to check whether the month under consideration was normal (within the prediction) or not.
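As a sketch of how this lookup could be done with the clustering fitted above (the 9259 figure is just the example value quoted in the text, and the centroid coordinates depend on the k-means initialization):
# Find the centroid whose number of scheduled trains is closest to the value of interest,
# and read its second coordinate as the expected number of cancelled trains.
scheduled = 9259
closest = centroids[np.argmin(np.abs(centroids[:, 0] - scheduled))]
print("Expected cancellations for " + str(scheduled) + " scheduled trains: " + str(round(closest[1])))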
Depending on your dataset, there is an attribute (which can be categorical (classification task) or numeric (regression task)) that we would like to be able to predict.
(For example, for the iris dataset we had the attribute "species". Using the independent attributes of
sepal and petal lengths and widths, the goal would be to build a prediction model that allows us to predict
the plant species of new incoming plant data.)
Tasks:
We then create our training and test sets, fit the linear regression on the training set, and evaluate it on the test set to see how well it fits.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)
regressor = LinearRegression()
regressor.fit(X_train, y_train) #training the algorithm
#To retrieve the intercept:
print(regressor.intercept_)
#For retrieving the slope:
print(regressor.coef_)
y_pred = regressor.predict(X_test)
wow = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
wow
matplotlib.pyplot.scatter(X_test, y_test, color='gray')
matplotlib.pyplot.plot(X_test, y_pred, color='red', linewidth=2)
matplotlib.pyplot.show()
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
The prediction table, the graph and the error values show that the model's outputs on the test set are far from perfect. But we are trying to estimate an attribute that has no strong correlation with the input attribute, so this was to be expected. This model is better suited to giving us the average number of delayed trains to expect for a given number of scheduled trains, which can act as a control to detect whether an unusual situation occurred in a given month (by defining an interval around the model prediction within which a value is still considered normal).
Thus, the linear regression does not closely fit the real data; instead, it provides an average for constraint-free situations.
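To illustrate the kind of control described above, a small sketch can flag the test months whose number of delayed trains falls outside a band around the linear prediction; the band width of two RMSEs is an arbitrary illustrative choice.
# Flag observations falling outside a band of +/- 2 RMSE around the linear prediction.
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
outside = np.abs(y_test - y_pred) > 2 * rmse
print('Months flagged as unusual: ' + str(int(outside.sum())) + ' out of ' + str(len(y_test)))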
We then apply a random forest regression as a second supervised learning technique. The problem is that we obtain a negative accuracy, which is a sign of a poor-quality regression, so we cannot rely on it for accurate results. However, we can still build a model that outputs realistic values for a somewhat "normal" situation.
# Use the input feature (number of scheduled trains) as a naive baseline prediction
baseline_preds = X_test[:, 0]
# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - y_test.ravel())
print('Average baseline error: ', round(np.mean(baseline_errors), 2))
# Instantiate the model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model on the training data
rf.fit(X_train, y_train.ravel())
# Use the forest's predict method on the test data
predictions = rf.predict(X_test)
# Calculate the absolute errors
errors = abs(predictions - y_test.ravel())
# Print out the mean absolute error (MAE)
print('Mean Absolute Error:', round(np.mean(errors), 2))
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / y_test.ravel())
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
We output two different trees from our random forest regressor. The full one is saved to tree.dot and tree.png, while a (not so) "smaller" and somewhat more readable tree, limited to a depth of 6 levels, is saved to small_tree.dot and small_tree.png.
# Pull out one tree from the forest
tree = rf.estimators_[5]
# Export the tree to a dot file
export_graphviz(tree, out_file = 'tree.dot', feature_names = ['nombre_de_trains_programmes'], rounded = True)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
display(Image(filename='tree.png'))
# Limit depth of tree to 6 levels
rf_small = RandomForestRegressor(n_estimators=10, max_depth = 6)
rf_small.fit(X_train, y_train.ravel())
# Extract the small tree
tree_small = rf_small.estimators_[5]
# Save the tree as a png image
export_graphviz(tree_small, out_file = 'small_tree.dot', feature_names = ['nombre_de_trains_programmes'], rounded = True, precision = 1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
display(Image(filename='small_tree.png'))
This is where we run our model and check whether its predictions fit the actual data.
predictions_data = pd.DataFrame(data = {'nombre_de_trains_en_retard_a_l_arrivee': predictions})
matplotlib.pyplot.plot(ah['nombre_de_trains_programmes'], ah['nombre_de_trains_en_retard_a_l_arrivee'], 'go', label = 'actual')
matplotlib.pyplot.plot(X_test, predictions_data['nombre_de_trains_en_retard_a_l_arrivee'], 'ro', label = 'prediction')
matplotlib.pyplot.xticks(rotation = '60');
matplotlib.pyplot.legend()
We see that our predictions are actually pretty good given how scattered the data is, and that we can make reasonable average predictions to simulate a normal behaviour. I was expecting this model to output values that would be completely out of place and totally unusable, but it turned out better than expected.
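One possible way to quantify this impression (an extra check, not part of the original evaluation) is the coefficient of determination on the test set, using the metrics module already imported:
# R^2 of the random forest predictions on the test set (additional check).
print('R2 score: ' + str(round(metrics.r2_score(y_test.ravel(), predictions), 3)))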
In conclusion, we now have a model that, for any number of scheduled trains given as input, predicts the number of trains that will be delayed at arrival while following the same pattern as the actual data. This gives more realistic values, although they are less likely to be exact.
But since exactness is somewhat illusory for data measured in the real world and depending on countless factors, we can still take this model into consideration.
To sum up this part on supervised learning, we obtained two models (linear and random forest) that each work in their own way and output different values for a given input. To obtain more accurate predictions, one possible improvement would be to combine the results of both models, which would probably give better results. For the time being, the random forest seems closer to anticipating the real-life situation, while the linear model only considers a situation without external constraints.
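As a sketch of the combination mentioned above, the simplest option is to average the predictions of the two models on the test set; the equal weighting is an illustrative assumption rather than a tuned choice.
# Naive blend: average the linear regression and random forest predictions.
blended = (y_pred.ravel() + predictions) / 2
print('Blended Mean Absolute Error: ' + str(round(metrics.mean_absolute_error(y_test, blended), 2)))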