Thursday, 13 August 2015

Reusable hold out test sets for NILM

Overfitting is a well-cited problem in the field of Non-intrusive Load Monitoring (NILM). It refers to an algorithm achieving high accuracy on one small data set while generalising poorly to other data sets. In NILM, this often corresponds to algorithms which work well on data covering a short period of time or a small number of houses, but perform poorly on data covering longer time periods or other houses.

One solution to this problem is to run a competition, in which the organiser releases part of the data set for training while holding out the remaining data for testing. This approach works fine for one-off competitions where each participant can only submit one solution, but it weakens when participants are allowed to submit multiple entries. The reason is that participants can use the accuracy score they receive on the test set to inform their algorithm choice, so information about the test set gradually leaks into the design of the algorithm.
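To get a feel for how quickly this leak can inflate a score, here is a minimal simulation sketch. Everything in it is hypothetical: the function name `best_of_n_random_guessers`, the binary on/off labels and the set sizes are illustrative assumptions, not anything taken from a real NILM competition. Each "submission" is a pure random guess, so its true accuracy is 50%, yet picking the best one according to the test set makes it look better than that.

```python
import random


def best_of_n_random_guessers(n_submissions, n_test=100, seed=0):
    """Pick the random guesser that scores best on a hidden test set.

    Returns (accuracy on the test set, accuracy on fresh unseen data)
    for the selected submission. All data here is synthetic.
    """
    rng = random.Random(seed)
    # Hypothetical binary labels, e.g. whether an appliance is on or off.
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_test)]

    best_test_acc, best_fresh_acc = -1.0, 0.0
    for _ in range(n_submissions):
        # A submission with no real skill: one random guess per sample.
        preds = [rng.randint(0, 1) for _ in range(n_test)]
        test_acc = sum(p == y for p, y in zip(preds, test_labels)) / n_test
        if test_acc > best_test_acc:
            # The participant keeps whichever entry the leaderboard favours.
            best_test_acc = test_acc
            best_fresh_acc = (
                sum(p == y for p, y in zip(preds, fresh_labels)) / n_test
            )
    return best_test_acc, best_fresh_acc


if __name__ == "__main__":
    for n in (1, 10, 200):
        test_acc, fresh_acc = best_of_n_random_guessers(n)
        print(f"{n:>3} submissions: test={test_acc:.2f} fresh={fresh_acc:.2f}")
```

The test-set score of the selected entry can only go up as more entries are tried, while its accuracy on fresh data is unaffected by the selection and stays around chance.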

Belkin organised exactly this type of competition for NILM via the Kaggle platform a few years ago. The competition allowed two entries per day per participant, each of which was evaluated using half of a private hold out data set and displayed on a public leaderboard. The competition ended after 4 months, at which point the final standings were determined by each solution's performance on the other half of the private hold out data set, for which no scores had previously been released. With such a format, the final standings can only be calculated once and the competition cannot be re-run, as the final standings convey information about the private hold out data set which could inform the design of future algorithms.

A recent Google Research blog post describes this exact problem in a much more general sense than NILM. Most interestingly, the accompanying paper even proposes a solution to this problem through the use of a reusable holdout set. The approach is that the reusable holdout set is only accessed through a differentially private algorithm, which effectively answers each query with a slightly perturbed score rather than exposing the holdout set directly, so that repeated queries reveal little about the holdout data.
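One way such a mechanism can be realised is sketched below. It is loosely modelled on the thresholding idea in the reusable holdout paper as I understand it, heavily simplified: there is no query budget, and the class name `ReusableHoldout`, the default `threshold` and `noise_scale` values and the `score` method are all my own illustrative assumptions rather than the paper's exact algorithm. The idea is that a participant's training score is echoed back whenever it roughly agrees with the holdout score, and only when the two disagree is a noisy version of the holdout score returned.

```python
import random


class ReusableHoldout:
    """Simplified, illustrative gatekeeper for a holdout set.

    This is a sketch under stated assumptions, not a faithful or
    privacy-audited implementation of any published algorithm.
    """

    def __init__(self, threshold=0.04, noise_scale=0.01, seed=None):
        self.threshold = threshold      # tolerated train/holdout gap
        self.noise_scale = noise_scale  # scale of the Laplace noise
        self._rng = random.Random(seed)

    def _laplace(self, scale):
        # The difference of two exponentials with mean `scale`
        # follows a Laplace distribution centred on zero.
        if scale == 0:
            return 0.0
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def score(self, train_score, holdout_score):
        """Return the score that would be shown to the participant."""
        gap = abs(holdout_score - train_score)
        if gap > self.threshold + self._laplace(self.noise_scale):
            # Scores disagree: the entry has probably overfitted, so
            # report the holdout score, perturbed with fresh noise.
            return holdout_score + self._laplace(self.noise_scale)
        # Scores agree: echo the training score, which reveals
        # nothing new about the holdout set.
        return train_score


if __name__ == "__main__":
    holdout = ReusableHoldout(seed=1)
    print(holdout.score(train_score=0.80, holdout_score=0.81))
    print(holdout.score(train_score=0.95, holdout_score=0.60))
```

In a competition setting, the organiser would compute both scores privately and publish only the return value of `score`, so the leaderboard never exposes the exact holdout accuracy.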

At the 2015 European NILM workshop, an MSc group from Imperial supervised by Jack Kelly presented a platform which could potentially be used to host NILM competitions in the future. I’d be really interested to see whether this platform could use a reusable holdout set to allow a competition to run for much longer without compromising the results relative to the classical method of evaluation.