Chunk #25 — Methods — The OpenML database

Source: Random forest versus logistic regression: a large-scale benchmark experiment.
Embedded: yes

Text

So far we have stated that the benchmarking experiment uses a collection of M real datasets without further specifications. In practice, one often uses already formatted datasets from public databases. Some of these databases offer a user-friendly interface and good documentation which facilitate to some extent the preliminary steps of the benchmarking experiment (search for datasets, data download, preprocessing). One of the most well-known database is the UCI repository [24]. Specific scientific areas may have their own databases, such as ArrayExpress for molecular data from high-throughput experiments [25]. More recently, the OpenML database [26] has been initiated as an exchange platform allowing machine learning scientists to share their data and results. This database included as many as 19660 datasets in October 2016 when we selected datasets to initiate our study, a non-negligible proportion of which are relevant as example datasets for benchmarking classification methods.