When a bank considers whether to offer someone a loan, it works through a sequence of questions to decide if it's safe to approve the loan. The questions could begin with simple ones, such as: what is the individual's annual income? Depending on the answer (say, under $20,000, between $20,000 and $50,000, or over $50,000), the next questions could be:
- Is this income from a salary or from a business?
- How long has the individual been employed or been running the business?
- Does the person have any criminal record?
Based on the answers, the next set of questions could involve finding out whether the person has any existing loans, has defaulted on credit card payments, and so on. Assuming the person draws a salary of $30,000, has no existing loans or criminal record, and makes credit card payments on time, the bank may offer the loan. You can call this a basic form of a decision tree; the sketch below mirrors its logic in code.
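As a rough illustration, the bank's questions can be written as a chain of nested if/else checks. This is only a hand-coded sketch: the function name, thresholds, and rules are assumptions taken from the example above, not a real lending policy.

```python
def approve_loan(annual_income, income_source, years_in_role,
                 has_criminal_record, has_existing_loans, defaulted_on_cards):
    """Toy rule set mirroring the bank's questions above.
    Thresholds and rules are illustrative, not a real lending policy."""
    if annual_income < 20_000:
        return False
    if has_criminal_record:
        return False
    if annual_income <= 50_000:
        # Mid-income applicants get extra scrutiny.
        if income_source == "business" and years_in_role < 2:
            return False
        if has_existing_loans or defaulted_on_cards:
            return False
    return True

# The $30,000 salaried applicant from the example is approved.
print(approve_loan(30_000, "salary", 5, False, False, False))  # True
```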
What’s a decision tree?
It is an effective machine learning modeling technique for classification and regression problems. To find solutions or likely results of a series of related choices, a decision tree makes hierarchical, sequential decisions about the outcome variable based on the predictor data. Typically, a decision tree begins with a single node, the root node, which branches into possible outcomes. Nodes that branch into further sub-nodes are called decision nodes, while nodes with no sub-nodes are called leaf (or terminal) nodes. Each outcome can give rise to additional nodes that branch off to account for other possibilities, as in the bank loan example discussed at the start. This gives the entire structure a tree-like shape, hence the name. A decision tree helps identify the strategy most likely to reach the desired goal.
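One common way to represent this structure in code is a small node class, where a node with children is a decision node and a node without children is a leaf. The field names below are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A single node in a binary decision tree."""
    feature: Optional[str] = None      # question asked at this node (None at leaves)
    threshold: Optional[float] = None  # split value for the feature
    left: Optional["Node"] = None      # subtree where feature <= threshold
    right: Optional["Node"] = None     # subtree where feature > threshold
    prediction: Optional[str] = None   # outcome stored at a leaf node

    def is_leaf(self) -> bool:
        # A node with no sub-nodes is a leaf/terminal node.
        return self.left is None and self.right is None

# The root node splits on income; its two children are leaves here.
root = Node(feature="annual_income", threshold=20_000,
            left=Node(prediction="reject"),
            right=Node(prediction="review further"))
print(root.is_leaf())  # False: the root branches into sub-nodes
```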
Decision Trees and Machine Learning
In machine learning, models fall into two categories: classification and regression. You can apply decision trees to both. When the target variable is categorical, you use classification trees. Typical use cases include predicting whether a team will win a match, whether an email is spam, or whether the temperature will be low or high.
Regression trees are used when the target variable is continuous and quantitative. Typical use cases include predicting marks, revenue, an employee's salary, or rainfall. Both kinds are sketched below.
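For instance, scikit-learn ships both kinds of trees. The tiny datasets below are made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: categorical target (e.g. spam / not spam).
X_cls = [[0, 1], [1, 1], [1, 0], [0, 0]]   # toy feature matrix
y_cls = ["spam", "spam", "ham", "ham"]
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print(clf.predict([[1, 1]]))               # -> ['spam']

# Regression: continuous target (e.g. rainfall in mm).
X_reg = [[1], [2], [3], [4]]
y_reg = [10.0, 20.0, 30.0, 40.0]
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print(reg.predict([[2.5]]))                # a value between 20 and 30
```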
Handling variance in decision tree models
Even a slight variation in the training data can make a decision tree unstable, resulting in a tree that is completely different from the intended one. This sensitivity is called variance. You can reduce it with ensemble methods like bagging and boosting.
Bagging decision trees: Bagging builds multiple decision trees by repeatedly resampling the training data with replacement, then combines the predictions from the different trees (by averaging or majority vote), which delivers more robust results than a single decision tree.
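A minimal sketch using scikit-learn's BaggingClassifier (the `estimator` parameter name assumes scikit-learn 1.2 or later; the synthetic dataset is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 100 trees, each trained on a bootstrap resample of the training data;
# their predictions are combined by majority vote.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # resample training data with replacement
    random_state=0,
)
print(cross_val_score(bagged, X, y, cv=5).mean())
```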
Boosting decision trees: Boosting fits a weak tree to the data, then iteratively fits further weak learners (trees), each one analyzing and correcting the errors of the preceding model.
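A comparable sketch using gradient boosting, one common boosting method; the hyperparameter values here are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each shallow tree (max_depth=2, a "weak learner") is fit to the
# errors left over by the trees that came before it.
boosted = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
)
print(cross_val_score(boosted, X, y, cv=5).mean())
```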
Parting thoughts
If you work with Python, there is a popular library called scikit-learn that you can use to implement decision tree algorithms. It has a clean API that can get your model up and running with just a few lines of code, for example:
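A minimal end-to-end sketch, using the iris dataset bundled with scikit-learn and default settings apart from a capped tree depth:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a depth-limited tree and report held-out accuracy.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.2f}")
```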