Google Summer of Code 2026 | mlr3 + Hugging Face
Name: Aksh Kaushik
GitHub: @aksh08022006
Repository: github.com/aksh08022006/mlr3hf
Submission Date: February 5, 2026
The easy task focuses on validating OpenML integration with mlr3 through the mlr3oml::otsk() function. The goal is to demonstrate that we can reliably download, parse, and convert OpenML datasets into structured mlr3::Task objects.
otsk() → Data download → as_task() → mlr3::Task ✓
Why this approach?
Type: Classification | Samples: 150 | Features: 4 numeric + 1 factor target
A canonical ML benchmark. Tests that otsk() correctly fetches, parses, and converts numerical features.
Type: Classification | Samples: 6 | Features: 4 (mixed types) + 1 factor target
Tests mixed-type handling (numeric + factor columns) and checks that role inference and type coercion work correctly on a small edge-case dataset.
The implementation demonstrates two complementary approaches to task creation:
Using otsk(), fetch and convert an OpenML dataset directly:
library(mlr3)
library(mlr3oml)
# Download the iris task (OpenML task ID 59; otsk() takes a task ID) and convert it to an mlr3 Task
task_iris <- as_task(otsk(id = 59))
print(task_iris)
Demonstrate that as_task_classif() handles mixed data types and role assignment:
library(mlr3)
# Build small handcrafted dataset with mixed types
stationery_data <- data.frame(
  length_cm = c(14.5, 15.0, 13.8, 17.5, 18.2, 19.1),
  has_ink = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  has_eraser = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  body_material = factor(c("Plastic", "Plastic", "Metal", "Wood", "Wood", "Wood")),
  label = factor(c("Pen", "Pen", "Pencil", "Pencil", "Pen", "Pencil"))
)
# Convert to mlr3::Task with explicit target role
task_stationery <- as_task_classif(
  stationery_data,
  target = "label",
  id = "pen_vs_pencil"
)
print(task_stationery)
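To see what as_task_classif() actually inferred, the task's metadata can be inspected directly. The following is a minimal sketch, not part of the original submission: it rebuilds the same data frame so it runs on its own, and it relies only on standard mlr3 Task fields (feature_types, col_roles, class_names, properties). The variable name `types` is introduced here purely for illustration.

```r
library(mlr3)

# Same handcrafted dataset as above, repeated so this block is self-contained
stationery_data <- data.frame(
  length_cm = c(14.5, 15.0, 13.8, 17.5, 18.2, 19.1),
  has_ink = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  has_eraser = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  body_material = factor(c("Plastic", "Plastic", "Metal", "Wood", "Wood", "Wood")),
  label = factor(c("Pen", "Pen", "Pencil", "Pencil", "Pen", "Pencil"))
)
task_stationery <- as_task_classif(stationery_data, target = "label", id = "pen_vs_pencil")

# feature_types is a table with one row per feature (columns: id, type);
# turn it into a named vector so each type can be looked up by column name
feature_types <- task_stationery$feature_types
types <- setNames(feature_types$type, feature_types$id)
print(types)

# Column roles: which column is the target and which are features
print(task_stationery$col_roles$target)
print(task_stationery$col_roles$feature)

# Class labels and task properties derived from the target column
print(task_stationery$class_names)
print(task_stationery$properties)
```

Looking up types by name avoids depending on the row order of feature_types, which is not guaranteed to match the column order of the data frame.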
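When executed, both approaches produce structured mlr3::Task objects. The summaries below show the expected shape; the exact print layout depends on the mlr3 version, and the OpenML task uses the column names stored on OpenML, which may differ from R's built-in iris names: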
<TaskClassif: iris (150 x 5)>
Target: Species
Properties: multiclass
Features (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
<TaskClassif: pen_vs_pencil (6 x 5)>
Target: label
Properties: twoclass
Features (4): length_cm, has_ink, has_eraser, body_material
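Rather than comparing printed output by eye, the same expectations can be written as assertions. The sketch below uses testthat (already in the dependency list) and covers only the handcrafted task, since it needs no network access; the test description and the variable name `task` are illustrative choices, not part of the original submission.

```r
library(mlr3)
library(testthat)

# Same handcrafted dataset as in the implementation section
stationery_data <- data.frame(
  length_cm = c(14.5, 15.0, 13.8, 17.5, 18.2, 19.1),
  has_ink = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  has_eraser = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  body_material = factor(c("Plastic", "Plastic", "Metal", "Wood", "Wood", "Wood")),
  label = factor(c("Pen", "Pen", "Pencil", "Pencil", "Pen", "Pencil"))
)
task <- as_task_classif(stationery_data, target = "label", id = "pen_vs_pencil")

test_that("stationery task has the expected structure", {
  # Class and dimensions: 6 rows, 5 columns (4 features + 1 target)
  expect_s3_class(task, "TaskClassif")
  expect_equal(task$nrow, 6L)
  expect_equal(task$ncol, 5L)

  # Target and feature columns
  expect_equal(task$target_names, "label")
  expect_setequal(
    task$feature_names,
    c("length_cm", "has_ink", "has_eraser", "body_material")
  )
})
```

An equivalent check for the OpenML task would follow the same pattern, but it depends on a successful download and on the column names OpenML provides.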
This submission demonstrates end-to-end integration robustness across multiple scenarios:
By fetching datasets from OpenML, we show that the external data source integration works. A successful run exercises the whole pipeline: HTTP requests, data parsing, and format conversion.
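Because this pipeline depends on the network, a failed download should be distinguishable from a conversion bug. The following is a minimal sketch under stated assumptions: `fetch_oml_task` and its `fetch` argument are hypothetical helpers introduced here (not part of mlr3oml), and the row count of 150 is taken from the dataset description above. The `fetch` argument is injectable so the failure path can be exercised without going online.

```r
library(mlr3)
library(mlr3oml)

# Hypothetical helper: returns an mlr3 Task, or NULL if any stage
# (download, parsing, conversion) raises an error.
fetch_oml_task <- function(task_id,
                           fetch = function(id) as_task(otsk(id = id))) {
  tryCatch(
    fetch(task_id),
    error = function(e) {
      message("Could not load OpenML task ", task_id, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Failure path, simulated offline with a stub that always errors
offline_result <- fetch_oml_task(
  59,
  fetch = function(id) stop("simulated network failure")
)
stopifnot(is.null(offline_result))

# Real call: needs network access on first use
task_iris <- fetch_oml_task(59)
if (!is.null(task_iris)) {
  # Basic sanity checks on the converted task
  stopifnot(inherits(task_iris, "TaskClassif"), task_iris$nrow == 150)
}
```

Returning NULL instead of stopping keeps a scripted run alive when OpenML is unreachable, at the cost of requiring the caller to check the result.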
The handcrafted dataset exercises type handling. as_task_classif() must correctly infer the role of each column and preserve the numeric and factor types of the mixed features.
Unlike heavy integration frameworks, this approach isolates and validates the core functionality: data ingestion + Task construction. No unnecessary dependencies or complex workflows.
Reproduce the entire workflow on your machine:
# 1. Install dependencies
Rscript -e "install.packages(c('mlr3','mlr3oml','mlr3data','testthat'), repos='https://cloud.r-project.org')"
# 2. Run the R code from the Implementation section above
# (Paste into an R console or script file)
# 3. Verify output matches the expected results above
Expected runtime: ~10-30 seconds (first run may download OpenML data)
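Before running the workflow, it can help to confirm that step 1 succeeded. The snippet below is a small sketch using only base R; the package list mirrors the install command above, and the variable names are illustrative.

```r
# Packages named in the install step
required <- c("mlr3", "mlr3oml", "mlr3data", "testthat")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
  # Report versions, since printed task output can vary between mlr3 releases
  print(vapply(required, function(p) as.character(packageVersion(p)), character(1)))
}
```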