Saturday, 9 May 2015

What even is supervised/unsupervised disaggregation?

I've noticed a fair amount of disagreement regarding exactly what type of learning is being used by a specific energy disaggregation method. I think the confusion arises from a discrepancy between the definition of supervised learning in the general machine learning literature and the practical assumptions of energy disaggregation methods:

Machine learning definition

General purpose machine learning defines supervised learning methods as those which require labelled training data to train a model. Labelled data contains both the inputs and the answers to the problem, which in the case of energy disaggregation corresponds to both the household aggregate and the individual appliance energy consumption. Conversely, unsupervised learning refers to the use of only unlabelled (household-level) data to construct models.

Practical energy disaggregation

In the energy disaggregation field, a fundamental problem exists due to the variation in appliances between different houses. As a result, scalable methods must not require appliance-level data from the houses in which disaggregation is to be performed (test houses). As such, practical approaches can apply supervised learning to appliance-level data from houses other than the test house, but can only apply unsupervised learning to aggregate-level data from the test house.
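To make the split concrete, here is a minimal sketch of such a practical approach. It is only an illustration under my own assumptions, not any particular published method: the function names, the single two-state appliance, and the toy readings (in watts) are all hypothetical. The supervised step sees appliance-level labels, but only from non-test houses; the unsupervised step sees only the aggregate of the test house and uses a simple 1-D two-means clustering to adapt the learned power level.

```python
from statistics import mean


def learn_prior_power(labelled_houses):
    """Supervised step: uses appliance-level labels from NON-test houses only."""
    on_readings = [
        appliance
        for house in labelled_houses
        for _aggregate, appliance in house
        if appliance > 0
    ]
    return mean(on_readings)


def adapt_to_test_house(aggregate, prior_power, iterations=10):
    """Unsupervised step: 1-D two-means on the test house's aggregate only.

    The prior from the other houses is used just to initialise the 'on' level.
    """
    off_level = min(aggregate)
    on_level = off_level + prior_power
    for _ in range(iterations):
        off = [x for x in aggregate if abs(x - off_level) <= abs(x - on_level)]
        on = [x for x in aggregate if abs(x - off_level) > abs(x - on_level)]
        if not off or not on:
            break  # one cluster is empty, keep the previous levels
        off_level, on_level = mean(off), mean(on)
    return off_level, on_level


def disaggregate(aggregate, off_level, on_level):
    """Assign the adapted appliance power to readings above the midpoint."""
    midpoint = (off_level + on_level) / 2
    appliance_power = on_level - off_level
    return [appliance_power if x > midpoint else 0.0 for x in aggregate]


# Hypothetical toy data: (aggregate, appliance) pairs from two non-test houses.
non_test_houses = [
    [(100, 0), (2100, 2000), (110, 0)],
    [(200, 0), (2400, 2200), (2380, 2200)],
]
# Test house: aggregate readings only, no appliance-level labels.
test_aggregate = [150, 160, 2650, 2640, 155, 2660]

prior = learn_prior_power(non_test_houses)
off_level, on_level = adapt_to_test_house(test_aggregate, prior)
estimate = disaggregate(test_aggregate, off_level, on_level)
print(estimate)
```

Note that no appliance-level reading from the test house is ever touched: the labels only shape the starting point, and the test-house aggregate does the rest.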

Semi-supervised learning

General purpose machine learning defines semi-supervised learning as the combination of a small amount of labelled training data with a large amount of unlabelled training data. Although this sounds similar to the scenario described above, the crucial difference is that energy disaggregation requires the supervised and unsupervised learning to take place on data from different domains (buildings), while general purpose machine learning assumes that both the labelled and unlabelled training data are drawn from the same domain. Furthermore, the proportions can be reversed: energy disaggregation methods could make use of a large amount of labelled training data from non-test houses and only a small amount of unlabelled training data from the test house.


I've been reluctant to use the term semi-supervised learning to describe practical energy disaggregation methods due to the domain-specific requirements of the field. Instead, I generally refer to methods as unsupervised if the only appliance-level data they use comes from non-test houses, which often leads to confusion. I'd be interested to hear other people's opinions on the matter, and hopefully we can reach some consensus!