How to program a decision tree in Python from scratch
The decision tree is one of the most widely used machine learning algorithms due to its ease of interpretation. However, do you know how it works? In this post I am going to explain everything that you need about decision trees. To do this, we are going to create our own decision tree in Python from scratch. Does it sound interesting? Let's get to it!
Understanding how a decision tree works
A decision tree consists of creating different rules by which we make the prediction. For example, let's say we train an algorithm that predicts whether or not a person is obese based on their height and weight. To do this, we will use the following dataset.
import pandas as pd
import numpy as np

data = pd.read_csv("data.csv")
data.head()
   Gender  Height  Weight  Index
0    Male     174      96      4
1    Male     189      87      2
2  Female     185     110      4
3  Female     195     104      3
4    Male     149      61      3
Imagine that we want to predict whether or not the person is obese. Based on the description of the dataset (available on Kaggle), people with an index of 4 or 5 are obese, so we could create a variable that reflects this:
data['obese'] = (data.Index >= 4).astype('int')
data.drop('Index', axis = 1, inplace = True)
In that case, a decision tree would tell us different rules, such as that if the person's weight is greater than 100kg, it is most likely that the person is obese. However, that cut will not be precise: there will be people who weigh 100kg or more who are not obese. Thus, the decision tree continues to create more branches that generate new conditions to "refine" our predictions.
As you can see, decision trees usually have sub-trees that serve to fine-tune the prediction of the previous node. This continues until we get to a node that does not split. This last node is known as a leaf node (or terminal node). Let's see a graphic example:
Besides, decision trees can work for both regression problems and classification problems. In fact, we will code a decision tree from scratch that can do both.
Now you know the basics of this algorithm, but surely you have doubts. How does the algorithm decide which variable to use as the first cutoff? How does it choose the values? Let's see it little by little by programming our own decision tree from scratch in Python.
Impurity and cost functions of a decision tree
As in all algorithms, the cost function is the basis of the algorithm. In the case of decision trees, there are two main cost functions: the Gini index and entropy.
Every cost function we can use is based on measuring impurity. Impurity refers to how likely it is that, after we make a cut, the target variable will be classified incorrectly.
In the example above, impurity includes the percentage of people who weigh 100 kg or more and are not obese, and the percentage of people who weigh less than 100 kg and are obese. Every time we make a split and the classification is not perfect, the split is impure.
However, this does not mean that all cuts are equally good: surely the cut at 100kg classifies better than a split at 80kg. In fact, we can check it by counting the non-obese people at or above each threshold:
print(
    " Misclassified when cutting at 100kg:",
    data.loc[(data['Weight']>=100) & (data['obese']==0),:].shape[0],
    "\n",
    "Misclassified when cutting at 80kg:",
    data.loc[(data['Weight']>=80) & (data['obese']==0),:].shape[0]
)
Misclassified when cutting at 100kg: 18
Misclassified when cutting at 80kg: 63
In short, the cost function of a decision tree seeks to find those cuts that minimize impurity. Now, let's see what ways exist to calculate impurity:
Calculate impurity using the Gini index
The Gini index is the most widely used cost function in decision trees. It measures the probability that a randomly chosen observation would be classified incorrectly if it were labeled at random according to the class distribution of the node.
For a binary target, this index ranges from 0 (a pure cut) to 0.5 (a completely impure cut that divides the data equally between the two classes). The Gini index is calculated as follows:
\[ Gini = 1 - \sum^n_{i=1}(P_i)^2 \]
where \(P_i\) is the proportion of observations that belong to class \(i\).
Let's program the function, considering the input will be a Pandas series:
def gini_impurity(y):
    '''
    Given a Pandas Series, it calculates the Gini Impurity.
    y: variable with which calculate Gini Impurity.
    '''
    if isinstance(y, pd.Series):
        # Proportion of each distinct value
        p = y.value_counts()/y.shape[0]
        gini = 1-np.sum(p**2)
        return(gini)
    else:
        # raise needs an exception instance, not a bare string
        raise TypeError('Object must be a Pandas Series.')

gini_impurity(data.Gender)
0.4998
As we can see, the Gini index for the Gender variable is very close to 0.5. This indicates that the Gender variable is very impure, that is, its two values appear in almost the same proportion.
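To get a feel for the numbers, here is a minimal sketch in plain Python (no pandas) that applies the same formula to three small label lists. The helper name `gini_from_labels` and the toy labels are assumptions made for illustration; they are not part of the code of this post.

```python
from collections import Counter

def gini_from_labels(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# Hypothetical toy nodes
pure = ["obese"] * 10                              # one single class
balanced = ["obese"] * 5 + ["not obese"] * 5       # 50/50 mix
skewed = ["obese"] * 9 + ["not obese"] * 1         # 90/10 mix

gini_pure = gini_from_labels(pure)
gini_balanced = gini_from_labels(balanced)
gini_skewed = gini_from_labels(skewed)

print(gini_pure, gini_balanced, round(gini_skewed, 2))
```

The pure node gives 0, the balanced node gives the binary maximum of 0.5, and the 90/10 node lands in between (about 0.18), which is the behaviour we want from an impurity measure.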
Now that you know how the index works, let's see how entropy works.
Calculate impurity with entropy
Entropy is a way of measuring impurity or randomness in data points. Entropy is defined by the following formula:
\[ E(S) = \sum^c_{i=1} -p_i \log_2 p_i \]
Unlike the Gini index, whose range for a binary variable goes from 0 to 0.5, the entropy of a binary variable goes from 0 to 1 (with \(c\) classes the maximum is \(\log_2 c\)). In this way, values close to zero are less impure than those that approach 1.
Let's see how entropy works by calculating it for the same example that we have done with the Gini index:
def entropy(y):
    '''
    Given a Pandas Series, it calculates the entropy.
    y: variable with which calculate entropy.
    '''
    if isinstance(y, pd.Series):
        # Proportion of each distinct value
        a = y.value_counts()/y.shape[0]
        # The small constant avoids log2(0)
        entropy = np.sum(-a*np.log2(a+1e-9))
        return(entropy)
    else:
        # raise needs an exception instance, not a bare string
        raise TypeError('Object must be a Pandas Series.')

entropy(data.Gender)
0.9997114388674198
As we see, it gives us a value very close to 1, which denotes an impurity similar to that indicated by the Gini impurity, whose value is close to 0.5.
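One detail worth keeping in mind: the upper bound of 1 only holds when there are two classes. The following sketch, written in plain Python with a hypothetical helper called `entropy_from_labels`, shows that four equally likely classes already give an entropy of 2 (that is, log2 of 4).

```python
import math
from collections import Counter

def entropy_from_labels(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy label lists
single_class = ["a"] * 8
two_classes = ["a", "b"] * 4
four_classes = ["a", "b", "c", "d"] * 2

entropy_single = entropy_from_labels(single_class)
entropy_two = entropy_from_labels(two_classes)
entropy_four = entropy_from_labels(four_classes)

print(entropy_single, entropy_two, entropy_four)
```

This does not change how the tree is built, since only the differences in impurity between candidate splits matter, but it helps when reading entropy values for targets with more than two classes.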
With this, you already know the two main methods that can be used in a decision tree to calculate impurity. Perfect, we already know how to decide if a cut is good or not, but… between which splits do we choose? Let's see it!
How to choose the cuts for our decision tree
As we have seen, cuts are compared by impurity. Therefore, we are interested in choosing the cuts that generate the least impurity. For this, Information Gain is used. This metric indicates the improvement obtained when making different partitions and is usually used with entropy (it could also be used with the Gini index, although in that case it would not be called Information Gain).
The calculation of the Information Gain will depend on whether it is a classification or regression decision tree. There would be two options:
\[ Information Gain_{Classification} = E(d) - \sum \frac{|s|}{|d|}E(s) \]
\[ Information Gain_{Regression} = Variance(d) - \sum \frac{|s|}{|d|}Variance(s) \]
So the Information Gain will look like this:
def variance(y):
    '''
    Function to help calculate the variance avoiding nan.
    y: variable to calculate variance to. It should be a Pandas Series.
    '''
    if(len(y) == 1):
        return 0
    else:
        return y.var()

def information_gain(y, mask, func=entropy):
    '''
    It returns the Information Gain of a variable given a loss function.
    y: target variable.
    mask: split choice.
    func: function to be used to calculate Information Gain in case of classification.
    '''
    a = sum(mask)
    b = mask.shape[0] - a

    if(a == 0 or b ==0):
        ig = 0
    else:
        # Any non-object target (including an integer 0/1 column such as
        # 'obese') takes the variance branch; func is only used for object targets.
        if y.dtypes != 'O':
            # ~mask is the boolean negation of the mask
            ig = variance(y) - (a/(a+b)* variance(y[mask])) - (b/(a+b)*variance(y[~mask]))
        else:
            ig = func(y)-a/(a+b)*func(y[mask])-b/(a+b)*func(y[~mask])

    return ig
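The regression formula can be checked by hand on a tiny example. This is a minimal sketch in plain Python, assuming the sample variance (the same default that pandas uses in `var()`); the helper names and the six toy observations are hypothetical.

```python
from statistics import variance

def sample_variance(values):
    """Sample variance, returning 0 for a single value instead of failing."""
    return 0.0 if len(values) < 2 else variance(values)

def variance_reduction(targets, mask):
    """Variance(d) minus the size-weighted variance of both sides of the split."""
    left = [t for t, m in zip(targets, mask) if m]
    right = [t for t, m in zip(targets, mask) if not m]
    if not left or not right:
        return 0.0  # nothing was split
    n = len(targets)
    return (sample_variance(targets)
            - len(left) / n * sample_variance(left)
            - len(right) / n * sample_variance(right))

# Hypothetical toy data: life expectancy and GDP per capita
life_exp = [40.0, 42.0, 44.0, 70.0, 72.0, 74.0]
gdp = [500, 700, 900, 20000, 25000, 30000]

good_split = variance_reduction(life_exp, [g < 10000 for g in gdp])
weak_split = variance_reduction(life_exp, [g < 600 for g in gdp])

print(round(good_split, 2), round(weak_split, 2))
```

The cut at 10000 separates the two groups cleanly and removes almost all of the variance, while the cut at 600 isolates a single observation and gains much less.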
Now, we can calculate the information gain of a specific cut:
information_gain(data['obese'], data['Gender'] == 'Male')
0.0005506911187600494
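To see the classification formula at work with entropy, here is a small self-contained sketch in plain Python. The names `entropy_of` and `info_gain`, as well as the six toy observations, are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, mask):
    """E(d) minus the size-weighted entropy of both sides of the split."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    if not left or not right:
        return 0.0  # the mask did not split anything
    n = len(labels)
    return (entropy_of(labels)
            - len(left) / n * entropy_of(left)
            - len(right) / n * entropy_of(right))

# Hypothetical toy data
weights = [60, 70, 80, 105, 110, 120]
obese = ["no", "no", "no", "yes", "yes", "yes"]

perfect = info_gain(obese, [w < 100 for w in weights])   # separates both classes
partial = info_gain(obese, [w < 75 for w in weights])    # leaves one side mixed
useless = info_gain(obese, [w < 200 for w in weights])   # everything on one side

print(perfect, round(partial, 4), useless)
```

A split that separates the classes perfectly recovers all of the parent entropy, a split that leaves one side mixed recovers only part of it, and a mask that puts every row on the same side gains nothing.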
Knowing this, the steps that we need to follow in order to code a decision tree from scratch in Python are simple:
- Calculate the Information Gain for all variables.
- Choose the split that generates the highest Information Gain as a split.
- Repeat the process until one of the stopping conditions set by the hyperparameters of the algorithm is reached.
However, there is an added difficulty: how do we choose the best split for a numerical variable? And what about categorical variables?
How to calculate the best split for a variable
To calculate the best split of a numeric variable, we first obtain all the distinct values that the variable takes. Then, for each of those values, we calculate the Information Gain using as a filter whether the observation is less than that value. The smallest value is dropped, because no observation is below it and the split would leave one side empty.
For categorical variables the idea is the same, only that in this case we have to calculate the Information Gain for all possible combinations of categories, excluding the combination that contains every category (since it would not split anything). This is computationally costly when there are many categories, which is why decision tree algorithms usually only accept categorical variables with fewer than 20 categories.
So, once we have all the splits, we will stick with the split that generates the highest Information Gain.
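The two ways of generating candidate splits described above can be sketched in plain Python as follows. The function names and the toy values are hypothetical; the point is to show how quickly the number of categorical candidates grows.

```python
from itertools import combinations

def numeric_thresholds(values):
    """Sorted distinct values without the smallest one (x < min selects nothing)."""
    return sorted(set(values))[1:]

def categorical_subsets(values):
    """Every non-empty subset of the categories except the full set."""
    categories = sorted(set(values))
    return [list(subset)
            for size in range(1, len(categories))
            for subset in combinations(categories, size)]

thresholds = numeric_thresholds([70, 55, 70, 90])
subsets = categorical_subsets(["A", "B", "C", "A"])

print(thresholds)
print(subsets)

# With k categories there are 2**k - 2 candidate subsets
n_candidates_10 = len(categorical_subsets(list("abcdefghij")))
print(n_candidates_10)
```

Ten categories already produce 1022 candidates, and every extra category roughly doubles that number, which is why a cap on the number of categories is a sensible safeguard.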
import itertools

def categorical_options(a):
    '''
    Creates all possible combinations from a Pandas Series.
    a: Pandas Series from where to get all possible combinations.
    '''
    a = a.unique()

    opciones = []
    for L in range(0, len(a)+1):
        for subset in itertools.combinations(a, L):
            subset = list(subset)
            opciones.append(subset)

    # Drop the empty subset and the subset that contains every category
    return opciones[1:-1]

def max_information_gain_split(x, y, func=entropy):
    '''
    Given a predictor & target variable, returns the best split, the error and the type of variable based on a selected cost function.
    x: predictor variable as Pandas Series.
    y: target variable as Pandas Series.
    func: function to be used to calculate the best split.
    '''
    split_value = []
    ig = []

    numeric_variable = True if x.dtypes != 'O' else False

    # Create options according to variable type
    if numeric_variable:
        options = x.sort_values().unique()[1:]
    else:
        options = categorical_options(x)

    # Calculate ig for all values
    for val in options:
        mask = x < val if numeric_variable else x.isin(val)
        val_ig = information_gain(y, mask, func)
        # Append results
        ig.append(val_ig)
        split_value.append(val)

    # Check if there are more than 1 results if not, return False
    if len(ig) == 0:
        return(None,None,None, False)
    else:
        # Get results with highest IG
        best_ig = max(ig)
        best_ig_index = ig.index(best_ig)
        best_split = split_value[best_ig_index]
        return(best_ig,best_split,numeric_variable, True)

weight_ig, weight_split, _, _ = max_information_gain_split(data['Weight'], data['obese'],)

print(
    "The best split for Weight is when the variable is less than ",
    weight_split,"\nInformation Gain for that split is:", weight_ig
)
The best split for Weight is when the variable is less than 103
Information Gain for that split is: 0.3824541370911895
Now that we know how to calculate the split of a variable, let's see how to decide the best split.
How to choose the best split
As I have previously said, the best split will be the one that generates the highest Information Gain. To know which one it is, we simply have to calculate the Information Gain for each of the predictor variables of the model.
data.drop('obese', axis= 1).apply(max_information_gain_split, y = data['obese'])
        Gender     Height    Weight
0  0.000550691  0.0647483  0.382454
1       [Male]        174       103
2        False       True      True
3         True       True      True
As we can see, the variable with the highest Information Gain is Weight. Therefore, it will be the variable that we use first to do the split. In addition, we also have the value on which the split must be performed: 103.
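To make the selection step explicit, here is a minimal sketch in plain Python that picks the variable with the highest Information Gain from a dictionary of results. The first dictionary reuses the values of the table above; the second one, with `Age` and `Fare`, contains made-up numbers and is only there to show a common pitfall.

```python
# Best split per variable as (information gain, split value), taken from the table
split_results = {
    "Gender": (0.000550691, ["Male"]),
    "Height": (0.0647483, 174),
    "Weight": (0.382454, 103),
}

# Compare by information gain, not by variable name
best_variable = max(split_results, key=lambda name: split_results[name][0])
best_gain, best_value = split_results[best_variable]
print(best_variable, best_value, best_gain)

# Pitfall: max() without a key compares the names alphabetically.
# Hypothetical numbers, for illustration only.
toy_results = {"Age": (0.30, 18), "Fare": (0.10, 52.5)}
alphabetical_pick = max(toy_results)
gain_pick = max(toy_results, key=lambda name: toy_results[name][0])
print(alphabetical_pick, gain_pick)
```

In the obesity example both approaches happen to return `Weight`, because it is also the last name alphabetically, but with other column names only the version that compares the gains returns the right variable.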
With this, we already have the first split, which would generate two dataframes. If we apply this recursively, we will end up creating the entire decision tree (coded in Python from scratch). Let's do it!
How to train a decision tree in Python from scratch
Determining the depth of the tree
We already have all the ingredients to calculate our decision tree. Now, we must create a function that, given a mask, makes us a split.
In addition, we will include the different hyperparameters that a decision tree generally offers. Although we could include more, the most relevant are those that prevent the tree from growing too much, thus avoiding overfitting. These hyperparameters are as follows:
- max_depth: maximum depth of the tree. If we set it to None, the tree will grow until all the leaves are pure or the hyperparameter min_samples_split has been reached.
- min_samples_split: indicates the minimum number of observations a node must have to continue creating new nodes.
- min_information_gain: the minimum amount the Information Gain must increase for the tree to continue growing.
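As a rough illustration of how these three hyperparameters act as stopping rules, here is a minimal sketch in plain Python. The function names `should_try_split` and `accept_split` are hypothetical helpers; the comparisons are one reasonable reading of the descriptions above, not the only possible one.

```python
def should_try_split(n_rows, depth, max_depth=None, min_samples_split=None):
    """True when the node is shallow enough and large enough to look for a split.

    A value of None disables the corresponding check.
    """
    depth_ok = max_depth is None or depth < max_depth
    samples_ok = min_samples_split is None or n_rows > min_samples_split
    return depth_ok and samples_ok

def accept_split(info_gain, min_information_gain=1e-20):
    """True when a split was found and it improves enough to be kept."""
    return info_gain is not None and info_gain >= min_information_gain

# A node with 100 rows at depth 0 can be split; the same node at depth 5 cannot
print(should_try_split(100, 0, max_depth=5, min_samples_split=20))
print(should_try_split(100, 5, max_depth=5, min_samples_split=20))
# A gain below the threshold is rejected
print(accept_split(0.000001, min_information_gain=1e-5))
```

When either function returns False, the node becomes a leaf and the tree makes a prediction instead of splitting further.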
With this in mind, let's finish creating our decision tree from scratch in Python. To do this, we will:
- Make sure that the conditions established by min_samples_split and max_depth are being fulfilled.
- Make the split.
- Ensure that min_information_gain is fulfilled.
- Save the data of the split and repeat the process.
To do this, first of all, I will create three functions: one that, given some data, returns the best split with its corresponding information; another that, given some data and a split, makes the split and returns the two resulting datasets; and finally, a function that, given some data, makes a prediction.
Note: the prediction will only be given in the leaves and basically consists of returning the mean of the data in the case of regression or the mode in the case of classification.
def get_best_split(y, data):
    '''
    Given a data, select the best split and return the variable, the value, the variable type and the information gain.
    y: name of the target variable
    data: dataframe where to find the best split.
    '''
    masks = data.drop(y, axis= 1).apply(max_information_gain_split, y = data[y])
    if sum(masks.loc[3,:]) == 0:
        return(None, None, None, None)
    else:
        # Get only masks that can be splitted
        masks = masks.loc[:,masks.loc[3,:].astype(bool)]

        # Pick the variable whose best split has the highest IG (row 0).
        # max(masks) would return the alphabetically last column name instead,
        # so trees trained that way (such as those printed in this post) can differ.
        split_variable = masks.loc[0].astype(float).idxmax()
        split_value = masks[split_variable][1]
        split_ig = masks[split_variable][0]
        split_numeric = masks[split_variable][2]

        return(split_variable, split_value, split_ig, split_numeric)

def make_split(variable, value, data, is_numeric):
    '''
    Given a data and a split conditions, do the split.
    variable: variable with which make the split.
    value: value of the variable to make the split.
    data: data to be splitted.
    is_numeric: boolean considering if the variable to be splitted is numeric or not.
    '''
    if is_numeric:
        data_1 = data[data[variable] < value]
        data_2 = data[~(data[variable] < value)]
    else:
        data_1 = data[data[variable].isin(value)]
        data_2 = data[~data[variable].isin(value)]

    return(data_1,data_2)

def make_prediction(data, target_factor):
    '''
    Given the target variable, make a prediction.
    data: pandas series for target variable
    target_factor: boolean considering if the variable is a factor or not
    '''
    # Mode for classification, mean for regression
    if target_factor:
        pred = data.value_counts().idxmax()
    else:
        pred = data.mean()

    return pred
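The idea behind the prediction of a leaf can be shown with a tiny example in plain Python. The helper `leaf_prediction` and the toy targets are hypothetical; it mirrors the rule described in the note (mode for classification, mean for regression) without depending on pandas.

```python
from collections import Counter
from statistics import mean

def leaf_prediction(targets, is_classification):
    """Most frequent label for classification, average value for regression."""
    if is_classification:
        return Counter(targets).most_common(1)[0][0]
    return mean(targets)

# Hypothetical leaves
class_leaf = [0, 1, 1, 1, 0]          # three 1s against two 0s
regression_leaf = [60.0, 62.0, 70.0]  # life expectancies in one leaf

predicted_class = leaf_prediction(class_leaf, is_classification=True)
predicted_value = leaf_prediction(regression_leaf, is_classification=False)

print(predicted_class, predicted_value)
```

Note that using the mode on a continuous target would simply return one of the observed values, so it matters that the flag matches the kind of problem.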
Training our decision tree in Python
Now that we have these three functions, let's train the decision tree that we just programmed in Python:
- We ensure that both min_samples_split and max_depth are fulfilled.
- If they are fulfilled, we get the best split and obtain the Information Gain. If any of the conditions are not fulfilled, we make the prediction.
- We check that the Information Gain passes the minimum amount set by min_information_gain.
- If the condition above is fulfilled, we make the split and save the decision. If it is not fulfilled, then we make the prediction.
We will do this process recursively, that is, the function will call itself. The result of the function will be the rules you follow to make the decision:
def train_tree(data,y, target_factor, max_depth = None,min_samples_split = None, min_information_gain = 1e-20, counter=0, max_categories = 20):
    '''
    Trains a Decision Tree
    data: Data to be used to train the Decision Tree
    y: target variable column name
    target_factor: boolean to consider if target variable is factor or numeric.
    max_depth: maximum depth to stop splitting.
    min_samples_split: minimum number of observations to make a split.
    min_information_gain: minimum ig gain to consider a split to be valid.
    max_categories: maximum number of different values accepted for categorical values. High number of values will slow down learning process.
    '''
    # Check that max_categories is fulfilled (only once, at the root)
    if counter==0:
        types = data.dtypes
        check_columns = types[types == "object"].index
        for column in check_columns:
            var_length = len(data[column].value_counts())
            if var_length > max_categories:
                raise ValueError('The variable ' + column + ' has '+ str(var_length) + ' unique values, which is more than the accepted ones: ' + str(max_categories))

    # Check for depth conditions
    if max_depth is None:
        depth_cond = True
    else:
        depth_cond = counter < max_depth

    # Check for sample conditions
    if min_samples_split is None:
        sample_cond = True
    else:
        sample_cond = data.shape[0] > min_samples_split

    # Check for ig condition
    if depth_cond and sample_cond:
        var,val,ig,var_type = get_best_split(y, data)

        # If ig condition is fulfilled, make split
        if ig is not None and ig >= min_information_gain:
            counter += 1
            left,right = make_split(var, val, data,var_type)

            # Instantiate sub-tree.
            # The label reads "<=", although make_split sends the rows
            # strictly below val to the first branch.
            split_type = "<=" if var_type else "in"
            question = "{} {} {}".format(var,split_type,val)
            subtree = {question: []}

            # Find answers (recursion)
            yes_answer = train_tree(left,y, target_factor, max_depth,min_samples_split,min_information_gain, counter)
            no_answer = train_tree(right,y, target_factor, max_depth,min_samples_split,min_information_gain, counter)

            if yes_answer == no_answer:
                subtree = yes_answer
            else:
                subtree[question].append(yes_answer)
                subtree[question].append(no_answer)

        # If it doesn't match IG condition, make prediction
        else:
            pred = make_prediction(data[y],target_factor)
            return pred

    # Drop dataset if doesn't match depth or sample conditions
    else:
        pred = make_prediction(data[y],target_factor)
        return pred

    return subtree

max_depth = 5
min_samples_split = 20
min_information_gain = 1e-5

decisiones = train_tree(data,'obese',True, max_depth,min_samples_split,min_information_gain)
decisiones
{'Weight <= 103': [{'Weight <= 66': [0, {'Weight <= 84': [{'Weight <= 74': [0, {'Weight <= 75': [1, 0]}]}, {'Weight <= 98': [1, 0]}]}]}, 1]}
It is done! The decision tree we just coded in Python has created all the rules that it will use to make predictions.
Now, there would only be one thing left: convert those rules into concrete actions that the algorithm can use to classify new data. Let's go for it!
Predict using our decision tree in Python
To make the prediction, we are going to take an observation and the decision tree. These decisions can be converted into real conditions by splitting them.
So, to make the prediction we are going to:
- Break the decision into several chunks.
- Check the type of decision that it is (numerical or categorical).
- Considering the type of variable that it is, check the decision boundary. If the answer is a final value, return it; if it is another sub-tree, continue with the next decision.
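import ast

def clasificar_datos(observacion, arbol):
    question = list(arbol.keys())[0]
    # Split into variable, operator and value; the value may contain spaces
    # (for example a list of categories), so split at most twice
    variable, operator, value = question.split(maxsplit=2)

    if operator == '<=':
        # Numeric split: make_split sends the rows strictly below the
        # threshold to the first branch, so the same comparison is used here
        if observacion[variable] < float(value):
            answer = arbol[question][0]
        else:
            answer = arbol[question][1]
    else:
        # Categorical split: parse the list of categories stored in the question
        if observacion[variable] in ast.literal_eval(value):
            answer = arbol[question][0]
        else:
            answer = arbol[question][1]

    # If the answer is not a dictionary, it is the final prediction
    if not isinstance(answer, dict):
        return answer
    else:
        return clasificar_datos(observacion, answer)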
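To see the traversal in isolation, here is a self-contained sketch that walks a small hand-written tree. Everything in it is hypothetical: the function `classify`, the toy tree and the observations are made up, and the numeric questions use a strict `<` label so that the label and the comparison agree.

```python
import ast

def classify(observation, tree):
    """Follow the questions of a nested-dict tree until a leaf value is reached."""
    if not isinstance(tree, dict):
        return tree  # leaves are plain values
    question = next(iter(tree))
    variable, operator, raw_value = question.split(maxsplit=2)
    if operator == "<":
        goes_first = observation[variable] < float(raw_value)
    else:  # "in": the value is the text of a Python list of categories
        goes_first = observation[variable] in ast.literal_eval(raw_value)
    branch = tree[question][0] if goes_first else tree[question][1]
    return classify(observation, branch)

# Hand-written toy tree: [answer if the question is true, answer otherwise]
toy_tree = {"Weight < 100": [{"Gender in ['Female', 'Other']": [0, 1]}, 1]}

heavy = classify({"Weight": 120, "Gender": "Male"}, toy_tree)
light_female = classify({"Weight": 80, "Gender": "Female"}, toy_tree)
light_male = classify({"Weight": 80, "Gender": "Male"}, toy_tree)

print(heavy, light_female, light_male)
```

Splitting the question at most twice keeps a list such as `['Female', 'Other']` in one piece, and `ast.literal_eval` turns that text back into a real list, so membership is checked against whole category names rather than against a fragment of text.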
So, we can classify all the data with our algorithm to see how well the decision tree that we just programmed in Python has worked:
print("Algorithm Accuracy:", accuracy)
Algorithm Accuracy: 0.848
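The accuracy itself is just the fraction of observations whose prediction matches the real value. A minimal sketch in plain Python, with a hypothetical helper `fraction_correct` and made-up labels, could look like this:

```python
def fraction_correct(predicted, actual):
    """Share of positions where the prediction equals the real value."""
    if len(predicted) != len(actual):
        raise ValueError("Both sequences must have the same length.")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical predictions and real labels
predicted = [0, 1, 1, 0, 1]
actual = [0, 1, 0, 0, 1]

accuracy = fraction_correct(predicted, actual)
print("Algorithm Accuracy:", accuracy)
```

With the objects of this post, the predictions could presumably be obtained row by row, for example with `data.apply(clasificar_datos, axis=1, arbol=decisiones)`, and then compared with `data['obese']` in the same way.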
The decision tree we just coded in Python is almost 85% accurate! As we can see, it seems that it has trained well, although perhaps the hyperparameters that we have chosen are not the best (that's a topic for another post).
Finally, as we have coded our decision tree in Python to support different types of data and to be used for both regression and classification, we are going to test it with different use cases:
Decision tree prediction for regression
To do the regression, we are going to use the gapminder dataset, which contains the population, GDP per capita and life expectancy of different countries for different years.
So, we are going to use the algorithm to predict the life expectancy of a country taking into account its GDP per Capita and its population:
from gapminder import gapminder

gapminder_lite = gapminder.loc[:,['lifeExp','pop','gdpPercap']]

# target_factor=False because lifeExp is numeric: the leaves predict the mean
gapminder_decission = train_tree(gapminder_lite,'lifeExp',False, max_depth=5)
gapminder_decission
{'pop <= 6927772': [{'pop <= 2728150': [{'pop <= 1138101': [{'pop <= 551425': [{'pop <= 305991': [50.93899999999999, 61.448]}, {'pop <= 841934': [61.556999999999995, 60.396]}]}, {'pop <= 1865490': [{'pop <= 1489518': [53.754, 59.448]}, {'pop <= 2312802': [69.39, 61.271]}]}]}, {'pop <= 4318137': [{'pop <= 3495918': [{'pop <= 3080828': [74.712, 73.44]}, {'pop <= 3882229': [67.45, 39.486999999999995]}]}, {'pop <= 5283663': [{'pop <= 4730997': [56.604, 79.313]}, {'pop <= 6053193': [45.552, 48.437]}]}]}]}, {'pop <= 19314747': [{'pop <= 10191512': [{'pop <= 8519282': [{'pop <= 7661799': [23.599, 69.51]}, {'pop <= 9354120': [48.968999999999994, 73.042]}]}, {'pop <= 13329874': [{'pop <= 11000948': [69.58, 48.303000000000004]}, {'pop <= 16252726': [55.448, 66.8]}]}]}, {'pop <= 43997828': [{'pop <= 28227588': [{'pop <= 22662365': [72.396, 62.677]}, {'pop <= 34621254': [51.542, 37.802]}]}, {'pop <= 76511887': [{'pop <= 56667095': [72.76, 68.564]}, {'pop <= 135031164': [78.67, 64.062]}]}]}]}]}
Likewise, we can take advantage of this same Gapminder dataset to check how, if we pass a categorical variable with more levels than what is set by the max_categories parameter, it will return an error:
train_tree(gapminder,'lifeExp',True, max_depth=5)
ValueError: The variable country has 142 unique values, which is more than the accepted ones: 20
Besides, we can compute the predictions of our algorithm:
gapm_prediction = []
num_obs = 20

for i in range(num_obs):
    obs_pred = clasificar_datos(gapminder_lite.iloc[i,:], gapminder_decission)
    gapm_prediction.append(obs_pred)

print("Predictions: ",gapm_prediction,
      "\n\nReal Values:", gapminder_lite.lifeExp[:num_obs].to_numpy())
Predictions: [69.51, 48.968999999999994, 69.58, 48.303000000000004, 48.303000000000004, 55.448, 48.303000000000004, 55.448, 66.8, 72.396, 62.677, 51.542, 53.754, 53.754, 59.448, 69.39, 69.39, 61.271, 74.712, 74.712]

Real Values: [28.801 30.332 31.997 34.02 36.088 38.438 39.854 40.822 41.674 41.763 42.129 43.828 55.23 59.28 64.82 66.22 67.69 68.93 70.42 72. ]
Likewise, we could also use any type of categorical variable to make our predictions:
gapminder_cat = gapminder.loc[:,['lifeExp','continent']]
gapminder_cat_decission = train_tree(gapminder_cat,'lifeExp',False, max_depth=5)
gapminder_cat_decission
{"continent in ['Africa', 'Americas']": [{"continent in ['Americas']": [64.65873666666667, 48.86533012820513]}, {"continent in ['Asia']": [60.064903232323225, {"continent in ['Oceania']": [74.32620833333333, 71.9036861111111]}]}]}
Of course, we could also use the same prediction function to make predictions with trees that use categorical variables. Although, as you might expect, in this case the predictions will be very poor due to oversimplification of the predictor variables used:
gapm_cat_prediction = []
num_obs = 20

for i in range(num_obs):
    obs_pred = clasificar_datos(gapminder_cat.iloc[i,:], gapminder_cat_decission)
    gapm_cat_prediction.append(obs_pred)

print("Predictions: ",gapm_cat_prediction,
      "\n\nReal Values:", gapminder_cat.lifeExp[:num_obs].to_numpy())
Predictions: [60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 60.064903232323225, 71.9036861111111, 71.9036861111111, 71.9036861111111, 71.9036861111111, 71.9036861111111, 71.9036861111111, 71.9036861111111, 71.9036861111111]

Real Values: [28.801 30.332 31.997 34.02 36.088 38.438 39.854 40.822 41.674 41.763 42.129 43.828 55.23 59.28 64.82 66.22 67.69 68.93 70.42 72. ]
Likewise, we can also use the decision tree that we have programmed in Python from scratch for classification problems.
Decision tree prediction for classification
To test our decision tree with a classification problem, we are going to use the typical Titanic dataset, which can be downloaded from here.
In our case, we do not seek to achieve the best results, but to demonstrate how the decision tree that we have programmed in Python from scratch works. Therefore, I will simply keep some columns and all those observations that have the complete data:
titanic = pd.read_csv("https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/fa71405126017e6a37bea592440b4bee94bf7b9e/titanic.csv")

titanic_lite = titanic.loc[:,["Embarked", "Age", "Fare","Survived"]]
# Keep only the rows without missing values
titanic_lite = titanic_lite.loc[titanic_lite.isna().sum(axis = 1)==0,:]

titanic_tree = train_tree(titanic_lite,'Survived',True, max_depth=10, min_samples_split=10)
titanic_tree
{'Fare <= 52.5542': [{'Fare <= 10.5': [{'Fare <= 7.1417': [0, {'Fare <= 7.225': [1, {'Fare <= 9.8458': [{'Fare <= 9.8417': [{'Fare <= 9.825': [{'Fare <= 8.05': [{'Fare <= 7.925': [0, {'Fare <= 8.0292': [0, 1]}]}, 0]}, 0]}, 1]}, 0]}]}]}, {'Fare <= 39.6875': [{'Fare <= 39.0': [{'Fare <= 27.9': [{'Fare <= 18.75': [{'Fare <= 17.8': [{'Fare <= 16.7': [{'Fare <= 16.1': [{'Fare <= 15.7417': [0, 1]}, 0]}, 1]}, 0]}, {'Fare <= 20.2125': [1, {'Fare <= 25.9292': [{'Fare <= 24.15': [{'Fare <= 23.0': [0, 1]}, 0]}, 1]}]}]}, {'Fare <= 30.0': [0, {'Fare <= 31.275': [{'Fare <= 31.0': [{'Fare <= 30.6958': [1, 0]}, 1]}, {'Fare <= 31.3875': [0, {'Fare <= 33.5': [1, {'Fare <= 35.5': [0, 1]}]}]}]}]}]}, 1]}, {'Fare <= 49.5': [0, {'Fare <= 49.5042': [1, 0]}]}]}]}, {'Fare <= 83.1583': [{'Fare <= 61.175': [1, {'Fare <= 63.3583': [0, {'Fare <= 66.6': [1, {'Fare <= 82.1708': [{'Fare <= 75.25': [{'Fare <= 73.5': [1, 0]}, {'Fare <= 77.2875': [1, {'Fare <= 77.9583': [0, 1]}]}]}, 0]}]}]}]}, 1]}]}
Once again, we can use this same data to make the prediction:
titanic_prediction = []
num_obs = 20

for i in range(num_obs):
    obs_pred = clasificar_datos(titanic_lite.iloc[i,:], titanic_tree)
    titanic_prediction.append(obs_pred)

print("Predictions: ",titanic_prediction,
      "\n\nReal values:", titanic_lite.Survived[:num_obs].to_numpy())
Predictions: [0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1]

Real values: [0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1]
Conclusions
Personally, I think the best way to know an algorithm is to program it from scratch. In this case, we have coded a decision tree from scratch in Python and, without a doubt, it is useful to know how the algorithm works, the types of cost functions it can use, how they work and how the splits and the predictions are made.
If you have found it interesting, I recommend you see the rest of the posts in which I code other algorithms from scratch, including a neural network in Python and R, the K-means algorithm in Python and R, and linear regression in R.
If you would like to keep up to date with the posts I publish, I encourage you to subscribe to my newsletter. In any case, see you next time!
Source: https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/