A Beginner’s Guide to the Hoeffding Tree with a Python Implementation



A traditional machine learning workflow begins with data processing and ends with fitting a model, which is then evaluated until the results are satisfactory. The crucial problem with this batch procedure is its storage and time requirements. New approaches such as online machine learning therefore let us apply conventional models to data streams and still achieve satisfactory results. In this article, we discuss a model called the Hoeffding Tree, a variant of the conventional decision tree designed for use in online machine learning. It performs strongly on large data streams under the assumption that the distribution of the data does not change over time. The main points covered in this article are listed below.


  1. Online machine learning
  2. Incremental machine learning
  3. Incremental decision tree
  4. Data flow
  5. Hoeffding Tree
  6. Implementation of the Hoeffding tree
    1. Scikit-MultiFlow
    2. StreamDM

Let’s start by understanding online machine learning, which is the basis of the rest of the discussion.

Online machine learning

Online machine learning is required when data arrives at an algorithm sequentially, and the algorithm must use each new observation to update its best predictor so that the upcoming data can be processed sensibly at every step. Unlike batch learning programs, online machine learning algorithms primarily focus on settings where it is computationally infeasible to train a model over the entire dataset at once, so out-of-core (external-memory) algorithms are needed. Online machine learning algorithms can also adapt dynamically to new patterns in the data.

Most of the time, online machine learning algorithms are designed in such a way that they can forget previously learned information in order to be trained on new information. This online learning process can be realised through incremental machine learning. These algorithms are mainly used in areas like stock forecasting, weather forecasting, etc. In the next section of this article, we’ll discuss incremental machine learning.

Incremental machine learning

This is another type of machine learning that focuses on training the model on continuously arriving data. Using this learning method, we can also extend the knowledge of an existing model. More formally, it can be considered a dynamic technique of supervised or unsupervised learning, applied to data that becomes available gradually over time. Many conventional machine learning algorithms can be adapted to learn incrementally; we can also call these incremental algorithms.

The main objective of this type of algorithm is to adapt to new changes in the data while also keeping the old learning in the model’s memory, so that there is less chance of the model failing and greater accuracy can be achieved in both situations. Incremental machine learning techniques are frequently used in areas such as big data and data streams. The stock market trend is an example of a data stream.

Some examples of incremental algorithms are incremental decision trees (ID4, ID5R), incremental SVMs, RBF neural networks, etc. Since this article concerns the Hoeffding tree, which is also a decision tree, we will cover some basic knowledge of the incremental decision tree.
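As a minimal illustration of the incremental idea (a self-contained sketch, not tied to any library; the class and its names are hypothetical), here is a perceptron trained one sample at a time: each new observation updates the existing weights, and nothing is ever retrained from scratch.

```python
class IncrementalPerceptron:
    """A toy classifier that learns from one (x, y) pair at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else 0

    def partial_fit(self, x, y):
        # Update the existing model only when the new sample is misclassified,
        # keeping previous learning in the weights.
        error = y - self.predict(x)
        if error:
            self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * error

# Samples arrive one by one, as they would in a stream.
model = IncrementalPerceptron(n_features=2)
stream = [([0.0, 0.0], 0), ([1.0, 1.0], 1), ([0.2, 0.1], 0), ([0.9, 0.8], 1)]
for x, y in stream:
    model.partial_fit(x, y)
```

The same `partial_fit`-style interface appears later in scikit-multiflow's Hoeffding tree classifier.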

Incremental decision tree

In the field of online machine learning, any incremental machine learning algorithm that can generate a decision tree can be considered an incremental decision tree. Traditional decision tree methods learn from a fixed dataset, whereas incremental decision tree methods are primarily focused on updating an existing tree so that it can make predictions from the few instances of newly arrived data. They are designed for situations where a sufficient dataset is not available up front but the tree is continuously updated, or where the data is very large and varies over time.

These algorithms can be used, and frequently are, in various fields such as online learning, data streams, and concept drift. Beyond that, they can also be used in situations where the data cannot be modelled with a single batch model, or in hierarchical systems where the required output must be interpretable. The most frequent use of the incremental decision tree is on data streams.

Data stream

Data that is generated continuously and incrementally from different sources can be thought of as a data stream. Basically, this kind of data is produced gradually and circulates on different platforms designed to deliver information to users over time. Usually, data streams are discussed in the context of big data, where they are generated by different sources at stable or unstable high throughput. As this is not the subject of the article, we will not go into detail; a basic introduction is sufficient to understand the next section, which concerns the Hoeffding tree, an incremental decision tree.

Hoeffding Tree

A Hoeffding tree is an incremental decision tree capable of learning from data streams. The basic assumption that the distribution of the data does not change over time helps to build a Hoeffding tree. The algorithm grows a decision tree on the basis of the guarantees given by the additive Chernoff bound (often called the Hoeffding bound). This bound comes from probability theory and gives exponentially decreasing limits on the tail distribution of a sum of independent random variables. These bounds are sharper than other tail bounds such as Markov’s inequality.

Using this probabilistic result, a node of a Hoeffding tree is only split once sufficient evidence for an optimal split attribute is available. More formally, the Hoeffding bound lets the algorithm exploit the fact that a small sample can already provide enough evidence to choose an optimal split attribute: it tells us how many observations are needed to estimate a statistic to within a prescribed error range.

The Hoeffding bound can be viewed as the proof behind this incremental algorithm: it shows that the output and accuracy of the incremental tree are almost the same as those of a non-incremental (static) algorithm trained on all the data. There can be different ways to implement the Hoeffding tree. Some of the main methods are shown below.
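To make the bound concrete, here is a small sketch (the function name is illustrative, not from any library) that computes the Hoeffding bound ε = √(R² ln(1/δ) / 2n): with probability 1 − δ, the true mean of a random variable with range R lies within ε of the empirical mean of n observations. In the tree, a leaf is split once the observed gap between the two best candidate split attributes exceeds ε.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a random variable
    with the given range lies within this epsilon of the empirical
    mean of n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# The bound shrinks as more samples arrive, so a leaf eventually
# accumulates enough evidence to choose the best split attribute.
for n in (50, 500, 5000):
    print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because ε shrinks like 1/√n, only a modest number of samples per leaf is usually needed before a split decision becomes statistically safe.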

Implementation of the Hoeffding tree

Let’s implement the Hoeffding tree in Python, following the steps described below.

In the field of machine learning, scikit-learn is a library that gives us most of the algorithms for offline machine learning. When it comes to online or incremental machine learning, we can use the scikit-multiflow library. Let’s start with the installation of this library, which is specially designed for online machine learning.

We can install scikit-multiflow from GitHub using pip:

!pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow


Some of the basic requirements for using the library are Python 3.5+, NumPy, and Cython.

Importing libraries:

from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

Creating a data stream:

state = 1
stream = SEAGenerator(random_state=state)

The SEAGenerator module helps to generate a synthetic data stream.
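SEAGenerator implements the SEA concept: each sample has three numeric attributes drawn uniformly from [0, 10] (the third is irrelevant), and the label depends only on whether the sum of the first two exceeds a threshold. A rough pure-Python sketch of that idea (illustrative only, not scikit-multiflow’s actual implementation; the function name and threshold default are assumptions):

```python
import random

_rng = random.Random(1)

def sea_sample(threshold=8.0):
    """One sample of the SEA concept: three uniform attributes in
    [0, 10]; only the first two determine the class label."""
    x = [_rng.uniform(0.0, 10.0) for _ in range(3)]
    y = 1 if x[0] + x[1] <= threshold else 0
    return x, y

X, y = sea_sample()
```

scikit-multiflow’s generator additionally supports different threshold functions and optional label noise.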


Configuring some variables to control the procedure:

n_samples = 0
correct_cnt = 0
max_samples = 600

Making the Hoeffding tree classifier instance:

ht = HoeffdingTreeClassifier()

Training the model on the data stream (predict first, then train, so accuracy is measured on unseen samples):

while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1

Displaying the results:

print('{} samples for analysis.'.format(n_samples))
print('accuracy: {}'.format(correct_cnt / n_samples))


As we can see, we have created a Hoeffding tree model, and its accuracy is quite similar to the results of a non-incremental decision tree algorithm.

We can also implement the Hoeffding tree in Apache Spark using the StreamDM library, which is basically a library for mining large data streams using Spark Streaming. Using StreamDM, we can apply the following models to a data stream:

  • SGD and Perceptron learners
  • Naive Bayes
  • CluStream
  • Hoeffding decision trees
  • Bagging
  • streamKM++

StreamDM also provides some data generators that can be used to train the model:

  • HyperplaneGenerator
  • RandomTreeGenerator
  • RandomRBFGenerator
  • RandomRBFEventsGenerator

Let’s see how we can use StreamDM to create Hoeffding trees.

We can use the HoeffdingTree class, which is implemented in HoeffdingTreeModel. The class is controlled by the following major parameters:

  • -n: the numeric observer to use; at the moment only the Gaussian observer is supported.
  • -g: the number of examples a leaf should observe between split attempts (grace period).
  • -s: the split criterion to use.
  • -c: the allowable error in a split decision (split confidence).
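As a rough sketch of how these parameters are passed, StreamDM tasks are typically launched through its Spark wrapper script with the learner and its flags given as a task string. The file path and parameter values below are placeholders, not tested values; consult the StreamDM documentation for the exact task syntax.

```shell
# Hypothetical invocation: prequential evaluation of a Hoeffding tree,
# with a grace period (-g) of 200 examples between split attempts.
./spark.sh "EvaluatePrequential -l (trees.HoeffdingTree -g 200) \
    -s (FileReader -f mydata.arff)" 1> results.txt 2> log.txt
```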

Final words

In this article, we discussed online machine learning and incremental machine learning, as well as incremental decision trees, which are the foundations of the Hoeffding tree. We saw that this incremental tree can be applied to data streams, and we also looked at some of the main ways to implement a Hoeffding tree on a data stream.


