Data Preparation for Data Science Using SQL, with Quick Reference Sheet

Stan Pugsley
2 min readMay 4, 2021

--

Where we like it or not, data prep is a major part of every data science project. Data preparation consists of tasks to prepare data in a repeatable process for use in business analytics, including data acquisition, data storage and handling, data cleaning, and early-stages of feature engineering.

A fundamental question is how and where to do the data prep. There are valid reasons to pick different tools and methods:

- Which tool or languages are your staff/peers most familiar with?

- Do you need a repeatable process, scheduled to run on a regular basis?

- Where is the bulk of your data currently located?

- What data volume and velocity are you dealing with?

- What current cloud services or on-prem resources exist to run the process?

- Which tools are most compatible with the current architecture in your organization?

- Does your dataset contain private elements (HIPA, FERPA, GDPR) that require special protection?

For many organizations, the answers to those questions are going to lead to SQL. Not only is SQL widely known and used in most organizations, but it also leverages existing database resources, security, and pipelines. If your raw data is in a SQL-based data lake, why spend the time and money to export the data into a new platform for data prep?

The following “Quick Reference” guide will give a sampling of SQL approaches to each of the steps in data preparation. This is not meant to be an exhaustive list of SQL functions or options, but rather a starting point.

A final word on creating an interface to your model. SQL views allow you to wrap up the complexity of many of the data prep steps in a clean, secure, modular format. Rather than embedding long, complex queries in your Python or R code, you can create a view that allows access to that code in a simple, reusable format. Views are also a great way to apply security on private data elements by masking or hiding those from model access.

If you are already investing in high-performant database licenses, why not leverage them for data science by doing data prep in SQL?

--

--

Stan Pugsley
Stan Pugsley

Written by Stan Pugsley

Lecturer at the University of Utah, freelance data engineer, consultant, data architect based in Salt Lake City, UT, USA