Skip to main content

RuleFit: When disassembled trees meet Lasso

The RuleFit algorithm from Friedman and Propescu is an interesting regression and classification approach that uses decision rules in a linear model.

RuleFit is not a completely new idea, but it combines a bunch of algorithms in a clever way. RuleFit consists of two components: The first component produces "rules" and the second component fits a linear model with these rules as input (hence the name "RuleFit"). The cool thing about the algorithm is that the produced model is highly interpretable, because the decision rules have an easy understandable format, but you still have a flexible enough approach to capture complex interactions and get a good fit.

Part I: Generate rules 

The rules that the algorithm generates have a simple form:

if  x2 < 3  and  x5 < 7  then  1  else  0

The rules are generated from the covariates matrix X. You can also see the rules simply as new features based on your original features.

The RuleFit paper uses the Boston housing data as example: The goal is to predict the median house value in the Boston neighborhood. One of the rules that is generated by RuleFit is:

if  (number of rooms > 6.64)  and  (concentration of nitric oxide < 0.67) then 1 else 0

The interesting part is how those rules are generated: They are derived from Decision Trees, by basically ripping them apart. Every path in a tree can be turned into a decision rule. You simply chain the binary decisions that lead to a certain node et voilà, you have a rule. It is desirable to generate a lot of diverse and meaningful rules. Gradient boosting is used to fit an ensemble of decision trees (by regressing/classifying y with your original features X). Each resulting trees is turned into multiple rules.


4 rules can be generated from a tree with 3 terminal nodes

Another way to see this step is a black box, that generates a new set of features X' out of your original features X. Those features are binary and can represent quite complex interactions of your original X. The rules are chosen to maximise the prediction/classification task at hand.

Part II: Fit the rules

You will get A LOT of rules from the first step (and that is what you want). Since the first step is only a feature transformation function on your original data set you are still not done with fitting a model and also you want to reduce the number of rules. Lasso or L1 regularised regression is good in this scenario. Next to the rules also all numeric variables from your original data set will be used in the Lasso linear model. Every rule and numeric variable gets a coefficient (beta). And thanks to the regularisation, a lot of those betas will be estimated to zero. The numeric variables are added because trees suck at representing simple linear relationships between y and x. The outcome is a linear model that has linear effects for all of the numeric variables and also linear terms for the rules.

The paper not only introduces RuleFit and evaluates it, but it also comes with a bunch of useful tools, (comparable to Random Forest): Measurement tools for variable importance, degree of relevance of original input variables and interaction effects between variables.

There are currently two implementations out there one from the authors in R and one in the TMVA toolkit.

tl;dr  The RuleFit regression and classification algorithm works by boosting an ensemble of decision trees, creating decision rules from those trees and throwing the rules (+ the numeric covariables) at L1 regularized regression (Lasso). Try it out.


Comments

  1. This looks interesting. Any thoughts on semi automatically turning a random forest into an ensemble of rule fit models and what the performance would be like? Seems like a natural fit to predicting continues variables.

    ReplyDelete
    Replies
    1. It is possible, but I don't know what is better performance wise. Instead of generating the trees with Gradient boosting, you can also generate the trees from Random Forests. In fact, the TMVA toolkit implements both using Gradient boosting or Random Forests for rule generation. From one Random Forest you generate all the rules and put them into Lasso.

      Delete
  2. Hello,
    In our project we've combined your python implementation for RuleFit. While this method works very well (R^2 ranges from 0.75 to 0.85) on our test group, the code outputs 10k+ rules in excel spreadsheet, all of which have a coeff of '0'.

    More over, out of these 10k+ rules, we did not see any indication to linear rules (i.e. always apply). In order to test the validity of the model in our project, we wish to attain 'X' most important rules and their respective coeff and predict the "Y expected" manually.

    Thanks for your help, we would very much like to also send our data to you so perhaps you'll see the issues for yourself.

    Will it be possible to do so?

    With much regards,

    Solar-Cell Research group

    ReplyDelete
    Replies
    1. Hello,

      thank you for your comment. I am glad someone is finding my implementation useful. Please be aware that the code is not used and tested thoroughly, so it might have bugs.

      Thanks to your input I made some necessary changes to the get_rules() method: Now it also outputs the coefficients for the linear terms and there is a 'type' column to distinguish linear and rule terms. Before the change, the linear terms were excluded. Also rules with an estimated coefficient of 0 are now filtered out by default.

      Delete

Post a Comment

Popular posts from this blog

My first deep learning steps with Google and Udacity

I did my first steps in deep learning by taking the deep learning course at Udacity.

Deep learning is a hot topic. Deep neural networks can classify images, describe scenes, translate text and do so much more. It's great that Google and Udacity offer this course which helped me getting started with learning about deep learning.



How does the course work? The course consists of dozens 1-2 minute videos and assignments accompanying the videos.

Well, actually it's the other way round: The assignments are the heart of the course and the videos just give you the basic understanding you need to get started building networks. There are no exams.

The course covers basic neural networks, softmax, stochastic gradient descent, backpropagation, ReLU units, hidden layers, regularization, dropout, convolutional networks, recurrent networks, LSTM cells and more. Building deep neural networks is a bit like playing Legos and the course shows you the building bricks and teaches you how to use th…

Statistical modeling: two ways to see the world.

This a machine-learning-vs-traditional-statistics kind of blog post inspired by Leo Breiman's "Statistical Modeling: The Two Cultures". If you're like: "I had enough of this machine learning vs. statistics discussion,  BUT I would love to see beautiful beamer-slides with an awesome font.", then jump to the bottom of the post and for my slides on this subject plus source code.

I prepared presentation slides about the paper for a university course. Leo Breiman basically argued, that there are two cultures of statistical modeling:
Data modeling culture: You assume to know the underlying data-generating process and model your data accordingly. For example if you choose to model your data with a linear regression model you assume that the outcome y is normally distributed given the covariates x. This is a typical procedure in traditional statistics. Algorithmic modeling culture:  You treat the true data-generating process as unkown and try to find a model that is…