One-Line Sanity Checks After Every Transformation

Small habits that catch data bugs early

data-wrangling
debugging
workflows
R-Hacks N.5
Author

Federica Gazzelloni

Published

February 2, 2026

This hack is based on the RLadiesRome tutorial From Basics to Advanced Health Analytics: Exploring Diabetes Data:
🔗 https://rladiesrome.github.io/Principles-of-data-Analysis-in-R/


When working with data, most bugs don’t come from complex models.

They come from small transformations that quietly change your data in ways you didn’t expect:

  • a filter() that drops too many rows
  • a join (left_join() and friends) that duplicates observations
  • a mutate() that introduces impossible values

This R-Hack shows how to build a habit of running one-line sanity checks after every major transformation, using a concrete example from our Exploring Diabetes Data tutorial.

These checks are quick, cheap, and often enough to catch problems before plots or models hide them.

Why One-Line Checks Matter

Transformations are the most fragile part of a data workflow. They change structure, size, and meaning — often without throwing an error.

The goal here is not defensive programming or heavy validation. It’s about building small, automatic pauses that let you ask:

“Does the data still look the way I expect?”

Data Example: Diabetes Dataset (Before Clustering)

In the tutorial, the diabetes dataset is cleaned and prepared before moving to k-prototypes clustering.

At that point, the data typically goes through steps like:

  • selecting relevant variables
  • recoding categorical fields
  • filtering incomplete observations

That makes it an ideal place to pause and check assumptions. Let’s simulate a sample dataset similar to the one used in the tutorial:

library(dplyr)

set.seed(123)

df_before <- data.frame(
  id       = 1:300,
  age      = round(rnorm(300, mean = 55, sd = 10)),
  bmi      = round(rnorm(300, mean = 28, sd = 5), 1),
  glucose  = round(rnorm(300, mean = 120, sd = 30)),
  sex      = sample(c("Female", "Male"), 300, replace = TRUE),
  diabetes = sample(c("No", "Yes"), 300, replace = TRUE, prob = c(0.7, 0.3))
)

# Introduce realistic missingness (so filtering has an effect)
set.seed(456)
df_before$bmi[sample(seq_len(nrow(df_before)), size = 20)] <- NA
df_before$glucose[sample(seq_len(nrow(df_before)), size = 15)] <- NA


dim(df_before)
[1] 300   6
head(df_before)
  id age  bmi glucose    sex diabetes
1  1  49 24.4     152   Male       No
2  2  53 24.2     119   Male       No
3  3  71 23.3     119   Male       No
4  4  56 22.7      75 Female      Yes
5  5  56 25.8     144   Male       No
6  6  72 29.7     114 Female      Yes

After the transformations, we get a new dataframe df_after:

df_after <- df_before |>
  dplyr::filter(!is.na(glucose), !is.na(bmi)) |>
  dplyr::mutate(
    diabetes = factor(diabetes, levels = c("No", "Yes"))
  )

dim(df_after)
[1] 266   6
head(df_after)
  id age  bmi glucose    sex diabetes
1  1  49 24.4     152   Male       No
2  2  53 24.2     119   Male       No
3  3  71 23.3     119   Male       No
4  4  56 22.7      75 Female      Yes
5  5  56 25.8     144   Male       No
6  6  72 29.7     114 Female      Yes

Check 1 – Did the Row Count Change as Expected?

After any operation that could affect the number of rows, check it explicitly.

# One-line sanity check: did we drop rows as expected?
nrow(df_before)
[1] 300
nrow(df_after)
[1] 266

Notes (why this works)

  • We create missing values in bmi and glucose, which is common in real health datasets.
  • The filter step now drops incomplete rows, so df_after has fewer rows than df_before.
  • This makes the row-count check meaningful: the change is real, visible, and explainable.

In this context, a reduction is expected, but it should be understood and intentional.

Use this after:

  • filter()
  • joins (left_join(), inner_join(), etc.)
  • subsetting
  • removing duplicates

What this catches:

  • accidental row loss from over-filtering
  • silent duplication from many-to-many joins

If the number changes unexpectedly, stop and investigate. This single check prevents a large class of downstream errors.
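If you would rather have the script halt than rely on eyeballing the numbers, stopifnot() turns the row-count check into a one-line assertion. A minimal sketch with toy data — the bounds are assumptions you choose per dataset, not fixed rules:

```r
# Toy data standing in for the diabetes example.
df_before <- data.frame(glucose = c(110, NA, 130, 95),
                        bmi     = c(24, 27, NA, 31))
df_after  <- subset(df_before, !is.na(glucose) & !is.na(bmi))

# One-line assertions on the row count.
stopifnot(nrow(df_after) <= nrow(df_before))        # filtering never adds rows
stopifnot(nrow(df_after) >= 0.5 * nrow(df_before))  # and shouldn't halve them
```

If either condition fails, the script stops right at the transformation that broke it, instead of failing mysteriously at the modelling step.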

Check 2 – Do Key Variables Still Look Reasonable?

Whenever you create or transform a variable, inspect its range.

summary(df_after$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   32.0    49.0    55.0    55.4    61.0    87.0 
range(df_after$bmi, na.rm = TRUE)
[1] 14.0 40.9

Look for:

  • negative values where they shouldn’t exist
  • impossible values (e.g. BMI of 5)
  • suspicious defaults (e.g. all zeros)
  • transformations that didn’t behave as expected

These problems are much easier to fix immediately than after modelling.
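The same assertion trick works for ranges. The thresholds below are illustrative plausibility bounds, not medical reference values:

```r
# Toy data standing in for the cleaned diabetes dataset.
df <- data.frame(age = c(49, 56, 71),
                 bmi = c(24.4, 22.7, 23.3))

# One-line plausibility checks on key variables.
stopifnot(all(df$age >= 18 & df$age <= 110))  # adult cohort, no typo ages
stopifnot(all(df$bmi > 10 & df$bmi < 70))     # no impossible BMI values
```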

Check 3 – Did Missingness Increase?

Joins and reshaping often introduce NAs. Count them explicitly.

colSums(is.na(df_after))
      id      age      bmi  glucose      sex diabetes 
       0        0        0        0        0        0 

This gives a fast overview of:

  • columns that need cleaning
  • variables that may bias analysis
  • fields that should be dropped or imputed

Never assume “there aren’t many NAs”. Check!
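A left join is a classic way NAs sneak in. A small sketch with base merge() — the ids and values are made up for illustration:

```r
# Patient table and a lab table that only covers some ids.
patients <- data.frame(id = 1:4, bmi = c(24, NA, 27, 31))
labs     <- data.frame(id = c(1, 2, 5), glucose = c(110, 130, 95))

# Left join: ids 3 and 4 have no lab record, so glucose becomes NA.
df_joined <- merge(patients, labs, by = "id", all.x = TRUE)

colSums(is.na(df_joined))  # glucose picked up NAs from the unmatched ids
```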

Check 4 – Do Group Sizes Still Make Sense?

Before summarising, modelling, or clustering grouped data, look at the group counts: categorical variables with tiny levels can destabilise everything downstream.

df_after |> dplyr::count(diabetes)
  diabetes   n
1       No 188
2      Yes  78

This is especially important before:

  • rates or percentages
  • averages by group
  • comparisons across categories

This prevents:

  • unstable clusters driven by tiny or empty categories
  • unstable estimates from groups with too few observations
  • summaries that look precise but aren’t meaningful
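If a minimum viable group size matters for your analysis, the count can become an assertion too. The threshold of 30 below is an arbitrary illustration, not a statistical rule:

```r
library(dplyr)

# Toy data mirroring the diabetes group sizes from the example.
df <- data.frame(diabetes = c(rep("No", 188), rep("Yes", 78)))

# One-line check: every group must be large enough to summarise.
counts <- count(df, diabetes)
stopifnot(all(counts$n >= 30))
```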

Making This a Habit

These checks are not meant to clutter your script. They are meant to become automatic.

A good rule of thumb:

  • After anything that changes rows → check nrow()
  • After anything that changes values → check summary() or range()
  • After anything that combines data → check NAs
  • Before summarising → check group sizes

One line is often enough.
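One way to keep the habit without cluttering a script is a tiny pass-through helper that prints a one-line report and returns the data unchanged, so it can sit between pipeline steps. sanity() below is a hypothetical sketch, not a package function:

```r
# Hypothetical pass-through checker: prints one line per step,
# returns the data invisibly so it works inside a pipe.
sanity <- function(df, label = "step") {
  cat(label, "-", nrow(df), "rows,", sum(is.na(df)), "NAs\n")
  invisible(df)
}

df <- data.frame(x = c(1, NA, 3), g = c("a", "b", "a"))

df2 <- df |>
  sanity("raw") |>
  subset(!is.na(x)) |>
  sanity("filtered")
```

Because the helper returns its input, dropping it in or pulling it out never changes the result of the pipeline.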

In Short

  • Most data bugs enter during transformations
  • One-line checks catch problems early
  • They cost seconds and save hours
  • Make them a habit, not an afterthought

This is not about perfection. It’s about seeing problems while they’re still small.


Tip

If you want to stay up to date with the latest events and posts from the Rome R Users Group, follow us here:

👉 https://www.meetup.com/rome-r-users-group/
