Dealing with outliers the simple way using R

A decent portion of my new role entails getting my hands on data from various sources, “tidying” it up and putting it somewhere my team can make use of it. Along the way I’ve found that even the most curated data sources have issues such as data-typing mistakes, duplicated imports of the same rows and general cruft. One of the first things I do to gauge the quality of a data set is graph the number of rows per day. The data I’m interested in generally has a fairly narrow distribution of daily observation counts, so the number of rows shouldn’t vary wildly from day to day.

Something simple, like a count of rows per day, will get us started.
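As a minimal sketch of that count, assuming the raw rows are already sitting in an R data frame (the obs data frame, its created_at column and the sample timestamps are all made up for illustration), it might look like this. The same count could just as well be done with a GROUP BY in the database before the data ever reaches R.

```r
# Hypothetical raw rows: one timestamp per observation.
obs <- data.frame(
  created_at = as.POSIXct(c(
    "2015-03-01 09:15:00", "2015-03-01 17:40:00",
    "2015-03-02 08:05:00",
    "2015-03-03 11:30:00", "2015-03-03 12:00:00", "2015-03-03 23:59:00"
  ), tz = "UTC")
)

# Truncate each timestamp to its calendar day.
obs$day <- as.Date(obs$created_at, tz = "UTC")

# Count rows per day and turn the table into a data frame with columns day and n.
daily <- as.data.frame(table(day = obs$day), responseName = "n",
                       stringsAsFactors = FALSE)
daily$day <- as.Date(daily$day)

print(daily)
```

Note that days with no rows at all simply won't appear in the result, so a missing day has to be spotted separately.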

If we pull those results into R and plot them, we can visually assess whether there’s anything amiss.
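Here is a rough sketch of that plot using base R graphics. The daily data frame and its numbers are invented stand-ins for the real query results, with two days deliberately inflated.

```r
# Hypothetical daily counts standing in for the real query results.
daily <- data.frame(
  day = seq(as.Date("2015-03-01"), by = "day", length.out = 10),
  n   = c(1020, 998, 1011, 2034, 1005, 990, 1017, 2001, 1008, 995)
)

# Points joined by lines: spikes or dips in rows per day stand out right away.
plot(daily$day, daily$n, type = "b",
     xlab = "Day", ylab = "Rows per day", main = "Daily row counts")
```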

Hmm. A couple of days don’t look so good.

We can easily (i.e., lazily) identify the outlying data points using boxplot().
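A minimal example, again on invented daily counts (the daily data frame is hypothetical), could be as short as this. By default boxplot() draws as separate points any values lying more than 1.5 times the interquartile range beyond the box.

```r
# Hypothetical daily counts with two suspiciously large days.
daily <- data.frame(
  day = seq(as.Date("2015-03-01"), by = "day", length.out = 10),
  n   = c(1020, 998, 1011, 2034, 1005, 990, 1017, 2001, 1008, 995)
)

# Draw the boxplot and keep the list of statistics that boxplot() returns.
bp <- boxplot(daily$n, ylab = "Rows per day", main = "Daily row counts")
```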

Yep. There are certainly some outliers there.

boxplot() returns the values of the outliers it found (in the out component of its result), which you can use to subset the original data down to just the rows containing those values. You can then go back to your import or ETL process and spot check those dates. In my case, the import script had inadvertently been run twice, generating duplicate rows on those days.
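To sketch that subsetting step (the daily data frame and the variable names below are hypothetical, not my actual data), passing plot = FALSE gets the statistics without drawing anything:

```r
# Hypothetical daily counts with two suspiciously large days.
daily <- data.frame(
  day = seq(as.Date("2015-03-01"), by = "day", length.out = 10),
  n   = c(1020, 998, 1011, 2034, 1005, 990, 1017, 2001, 1008, 995)
)

# boxplot() returns a list; its out element holds the outlying values.
outlier_counts <- boxplot(daily$n, plot = FALSE)$out

# Keep only the days whose count is one of those outlying values.
suspect_days <- daily[daily$n %in% outlier_counts, ]

print(suspect_days)
```

One caveat with matching on values: if a perfectly ordinary day happened to have exactly the same count as an outlier it would be picked up too, which is unlikely here since outliers are by definition far from the rest.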

Hopefully that was helpful!

Until next time. Stay frosty.