Binarised Regression

Publication

The software and materials on this website are associated with the following paper:

Jose Hernandez-Orallo, Cèsar Ferri, Nicolas Lachiche, Adolfo Martínez-Usó, M. José Ramírez-Quintana, "Binarised Regression Tasks: Methods and Evaluation Metrics" (submitted to Data Mining and Knowledge Discovery).

Binarised Regression problems

This work focuses on supervised tasks whose output is numerical but whose decisions have to be made in a discrete, binarised way, according to a particular cutoff. In other words, the training dataset looks like that of a regression task, but the deployment situation is a binary classification task. The setting is easily understood by means of a real-life example:

An estate agent has a database of prospective customers who are interested in buying a house. The estate agent collects information about each customer and learns a model of the maximum mortgage that the customer can get from a bank. This is our regression model. Every day, several new properties enter the estate agent's portfolio, each with a different price. Naturally, the estate agent only offers a property to those customers who can afford it, i.e., those who can get a mortgage for at least the property price. Each property price therefore acts as a genuine cutoff in our binarisation setting.

This binarised regression task is a very common situation that requires its own analysis, distinct from that of regression, classification and ordinal regression. How to address this task and how it should be evaluated are two of the main questions that this work deals with.

Context plots and Cutoff Distributions for the Binarised Regression problems can be found here.

Solutions

The basic idea behind Binarised Regression problems is that, for many applications, we are interested in telling whether the predictions are above or below a given cutoff. This cutoff c can vary depending on the context and critically determines the overall performance.

We study two basic approaches to address this task:

The retraining approach, which discretises the training set whenever the cutoff is available and learns a new classifier from it.
The reframing approach, which learns a regression model once and applies the cutoff to its predictions when the cutoff becomes available during deployment.
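
The two approaches can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the income and mortgage figures are made up, the one-feature least-squares regressor and the threshold "stump" classifier are stand-ins chosen so that the example needs no libraries, and all function names are hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least squares with a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def reframe(regressor, x, cutoff):
    """Reframing: reuse the regression model and threshold its prediction."""
    slope, intercept = regressor
    return slope * x + intercept >= cutoff


def retrain(xs, ys, cutoff):
    """Retraining: binarise the training outputs at the cutoff, then learn a
    new classifier. The classifier here is a toy stump 'x >= t' (an assumption)."""
    labels = [y >= cutoff for y in ys]
    best_t, best_errors = None, len(xs) + 1
    # Candidate thresholds: every observed x, plus one above the maximum
    # (the latter classifies every training example as negative).
    for t in sorted(set(xs)) + [max(xs) + 1]:
        errors = sum((x >= t) != label for x, label in zip(xs, labels))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


# Made-up training data: customer income -> maximum mortgage (both in thousands).
incomes = [20, 30, 40, 50, 60, 70]
mortgages = [100, 150, 200, 250, 300, 350]

regressor = fit_linear(incomes, mortgages)  # learned once, before any cutoff is known

for price in (220, 320):  # each new property price is a new cutoff
    stump_threshold = retrain(incomes, mortgages, price)  # new classifier per cutoff
    print(price, reframe(regressor, 45, price), 45 >= stump_threshold)
```

In this sketch, reframing fits one model and reuses it for every property price, whereas retraining has to relabel the training set and fit a new classifier each time a cutoff arrives. The two can disagree on the same customer, which is why the choice between them depends on the operating context.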

Software and Datasets

This work performs a comprehensive evaluation of the retraining and reframing approaches. The newly introduced plots identify the regions in which one technique dominates the other, which allows us to discard approaches that are suboptimal for every possible operating context.

Up to 20 datasets from different repositories are presented as binarised regression problems, showing that these are common situations in many research fields.

Any use of this software (including non-profit or academic use) should take place only after contacting the authors (see the Contact section at the bottom of this page). We will most probably grant permission to use it freely and can point to newer versions if there are any.

Case Study Dataset

A running example using real mortgage data from Zillow (Zillow API 2013 (*)) and real cutoffs from the US Federal Financial Institutions Examination Council (Federal Financial Institutions Examination Council: Home Mortgage Disclosure Act, HMDA 2013) is used throughout the paper. This is a prototypical case of a binarised regression problem where the distributions of the output values and of the cutoffs are similar, as the next figure shows.

Zillow-HMDA running example: comparison of the true cutoff distribution (top) with the output distribution (bottom) for the running example presented in this work. The figure shows the distribution of mortgage amounts (HMDA) and house prices (Zillow) in the USA for the year 2013. As can be seen, mortgage amounts and house prices exhibit approximately the same distribution.

As stated in Context plots and Cutoff Distributions for the Binarised Regression problems, this example belongs to case B, is of typology 3 and corresponds to a 'supply-demand regulation' case.

Note (*): Zillow data have been partially anonymised.

Contact

Jose Hernandez-Orallo, Universitat Politècnica de Valencia, Spain (jorallo AT upv DOT es)
Cèsar Ferri, Universitat Politècnica de Valencia, Spain (cferri AT upv DOT es)
Nicolas Lachiche, University of Strasbourg, France (nicolas DOT lachiche AT unistra DOT fr)
Adolfo Martínez-Usó, Universitat Politècnica de Valencia, Spain (admarus AT upv DOT es)
María José Ramírez-Quintana, Universitat Politècnica de Valencia, Spain (mramirez AT upv DOT es)
