The Data Cleaning Toolkit: Plus, How to Prevent the Mess
PLAYBOOKSOPERATIONSDATA STRATEGYDATA MANAGEMENTAUTOMATION
6/24/20259 min read


Yes the Mess is real… and YES you have to clean it
Every growing company has a data problem. It’s a universal side effect of rapid expansion. Maybe different teams are uploading spreadsheets with slightly different formats. Maybe your CRM and accounting software are having a silent, bitter war with each other. (Or maybe it’s not so silent, and everyone knows they have beef over how to structure customer records… No I am totally not talking from experience ).
How many versions of "United States" do you have in your database? U.S., US, USA, United States?
As much as I’d love to say you can just build a new, perfect system and forget the past, you can’t. You have to clean the existing mess. I’m one of those weird people who actually enjoy getting into the weeds and cleaning data, but I know most people don't. So, if you need to, gamify it.
At one job, we had a mountain of "To Be Determined" contacts in our CRM. We made a game of it: for one month, the whole team updated as many as they could, putting a sticky note on a board for each one. At the end of the month, we got a team lunch, and the winner got bragging rights—the most coveted prize of all.
I also believe cleaning the data gives you critical insight into where the biggest problems are coming from, which you’ll need for the systems phase.
Your Quick and Dirty Cleaning Tool Kit
The goal here is to clean the data reliably without sinking all your time into it. Your tools will depend on your scale.
Excel: Your Go-to Quick Win
Excel has been my longest and most reliable work friend. Excel is a powerhouse to quickly scrub a list or fix a one off report.
Here are the common data problems you'll face and the Excel features I have on speed dial.
1. Finding & Matching Data (Lookups) This is your toolkit for cross-referencing lists to find what’s new, what’s missing, or what already exists.
The Classics (VLOOKUP, HLOOKUP): VLOOKUP is famous, but remember its critical flaw: it can only search for a value in the first column of a selection and look to the right.
The Modern Solution (XLOOKUP): This is the ideal function for modern spreadsheets. It’s simple to write and can look in any direction—left, right, up, or down.
The Pro Move (INDEX/MATCH): For years, this combination was the go-to solution to overcome VLOOKUP's limitations. It’s complex but incredibly flexible. A word of caution: if you use it on a table where columns might be added or removed, lock your ranges with $ or use named ranges to prevent it from breaking.
2. Applying Logic & Categories (IF Statements) The IF statement is your tool for categorizing records based on rules (e.g., IF(B2>250, "Enterprise", "SMB")). Be careful: Long, nested IF() chains can quickly grow out of control and become hard to debug.
3. Cleaning & Combining Text (CONCATENATE or &) Use the & symbol to combine text from cells (e.g., =A2 & " " & B2). Pro-tip: Nest TRIM() and PROPER() inside to remove extra spaces and enforce consistent capitalization all in one step.
4. Essential Quick-Cleaning Features
Data Validation: Use this to create dropdown lists that limit entries to a controlled list—your #1 defense against inconsistent categories. The pitfall: users can bypass validation by copy-pasting. For critical data, you need other checks.
Text-to-Columns: This splits one column into many (e.g., "Last, First" into two). The pitfall: it overwrites adjacent columns, so always insert blank columns to the right first.
Remove Duplicates: A powerful tool, but be specific. To dedupe by email alone, make sure you uncheck every other column in the dialogue box.
Google Apps Script (GAS): Your First Step into Automation
Before I was fully immersed in the Google Workspace, I'd never heard of Google Apps Script (GAS). Think of it as the engine that can automate tasks across Google Sheets, Docs, and Forms. If you're familiar with VBA in Excel, the concept is similar on the Google Sheets side. With a little up-front work, you can create powerful, repeatable automations. Think of this as Excel’s cool older sister who learned how to code.
I first used it for a repetitive task I absolutely dreaded: combining several differently formatted accounting reports for my budget dashboards. It wasn’t hard… It was time-consuming and annoying. With a bit of searching online, I translated the repetitive steps into a simple script that did the following:
Normalize values (e.g., convert negative entries to positives)
Standardize dates into a format compatible for downstream tools
Removed duplicates removed rows that were already on the sheet from last month
Now, a process that took an hour of manual copy-pasting, formula updates is now done with a single click of a "Run" button.
For your first project: Your goal isn't to become a programmer overnight. It's to find one task that you do every single week that takes 15 minutes. Is it deleting empty rows? Formatting dates? Manually copying the same data? That 15-minute task is your perfect candidate for your first script. The quick win will show you the immense power of automating your most annoying tasks.
The Heavy-Duty Toolkit: SQL and Power Query
When your data outgrows a spreadsheet and the manual cleanup is eating your entire week, it's time to level up. This is where we turn to SQL and Power Query.
SQL: Playing Detective at the Source
First, let's talk about SQL (Structured Query Language). This is the language used to talk directly to a database. Now, for this, you're almost certainly going to partner with your engineering team. As much as my engineers love me, I do not have "write" access to the production database. That’s how it should be—my role isn't to change source data. My role is to be a detective. I use "read-only" access to find the root cause of our data problems so I can advocate for a real fix.
Remember my example of how many ways United States could be listed in your database? That was a real question we had. We were able to run a query to group all the different ways the country was listed so that we could tackle the issue. It's no longer just a hunch; it's hard evidence.
Uue this to find the evidence - once you know the issues it’s easier to dive into the fix.
Side note: If you're constantly asking an engineer to run the same data-fixing script every month, you're just treating a symptom. Use SQL to get the evidence you need to build a case for fixing the underlying process.
Power Query: Your Personal, Repeatable Cleaning Machine
If SQL is for diagnosing problems, Power Query is your personal tool for building an automated cleaning workflow. It’s the engine inside of Power BI and modern versions of Excel (under the Data > Get & Transform Data tab), and it’s a tool you can use yourself.
Plus this is a tool that you can use yourself! No need to get outside help with this one.
Power Query Workflow: Power Query lets you record every cleaning step you take. You build the workflow once, and then it's automated forever.
You connect to the source data (an Excel file, a SharePoint folder, etc.).
A few Examples you can use it for
Transforming date - For that weird layout, you might use the "Unpivot" feature to turn columns of monthly data into two neat rows—one for "Month" and one for "Sales."
"Conditional Column
filter out irrelevant rows,
The Best Part: All these steps are saved. Next month, when you get the new horrible export, you just hit one button: Refresh. Your hour of manual work is now done in about 15 seconds.
While there is a learning curve to a lot of these tools, experimenting with it and trying to use it on real projects helps you build out your data chops and dive deeper into the data as you won’t be spending your time cleaning it.
From Cleaning to Preventing: A Framework
Alright Friends - we have moved on from the fun part of cleaning and now are moving to the proactive prevention phase. This shift is super valuable; there is even a classic data management principle called the “1 -10-100 Rule” that proves it: it costs $1 to prevent a data error, $10 to correct it later, and $100 if you fail to do anything at all. Our goal here is to focus on the $1 work.
Audit the Process (Ask "Why?")
Don't just fix the error; find its origin story. When you see a messy column your first question shouldn't be, "How do I fix this?" It should be, "How did this get here?" Try to trace it back to its source.
Is this a free-text field that should be a locked-down dropdown menu?
Is there a process gap where two departments have different definitions of the same field?
Are we letting people upload files without a clear template and no instructions?
Trace the problem back to its source. You’ll almost always find it’s a human or process issue, not a technical one.
Remember this is a judgement free zone! It’s a good reminder that no one is doing this maliciously and that if there is no process people will default to what makes sense from their perspective when there is no clear process in place.
Build the Guardrail, Not Roadblocks
Based on your audit, implement the simplest possible change to the process that prevents the error. The goal is to guide the users into inputting the data that aligns with your process, and not make it so hard that they can’t do their jobs.
If it's a free-text field: Change it to a dropdown. This is the #1 cure for inconsistent categories.
If it's a process issue: Create a simple, visual one-page guide for "How to Enter X Data" and then investigate deeper on can this process also be automated.
If it's a system limitation: Add a validation rule to your software or spreadsheet that flags incorrect entries before they get saved.
Document and Communicate the “ WHY”
This step is critical. A new rule nobody knows about doesn't exist.
Build a simple Standard Operating Procedure (SOP) or a dedicated "Data Dictionary" worksheet.
This document defines your rules—mapping old values to new categories, listing all valid entries, and defining data standard formats.
Make it Accessible: Put this guide where your team actually works!!!
If you use Confluence/Jira - Add a page dedicated to it and share it so the team can review and add comments/questions if the process is unclear
Sharing on Slack/Teams - Most teams have dedicated channels so adding links and asking for feedback is another way to do this.
Shared Drives - Is there a dedicated folder to operational processes or specific data processes- if not you might have to start one!
Frame the Change as a Win: When you announce the new process, explain why it helps everyone. If you don't get buy-in from the team, keeping the data clean will be a constant uphill battle.
Making It Stick: Your Ongoing Maintenance Plan
A clean data process won't stay clean on its own. You need a simple maintenance plan
Define Lifecycle & Archival Rules Not all data needs to live forever. Define when records become "stale" (e.g., leads untouched for 180 days) and create a process to either archive or purge them.
Build a Data Quality Dashboard You can't fix what you can't see. Build simple dashboards in Power BI ,Tableau or your CRM to track key data quality metrics. This dashboard should be your early warning system,
% of records missing critical fields
Number of duplicates created per week
Age distribution of “last updated” activity
Schedule Quarterly Audits Once a quarter, review your dashboard and do a root-cause analysis on any new problems. This proactive check-in keeps your processes from degrading.
The chaos of growth isn’t going anywhere. But with the right tools and a proactive mindset, you can make that chaos work for you, not against you.
A couple of years ago, I made a LinkedIn post about the importance of clean data. Around the same time, McKinsey was writing about “The data-driven enterprise of 2025” And here we are…. The prediction that data would be embedded in every decision, interaction, and process wasn't just right—it's now a race to integrate into every aspect of business.
But there’s a catch: This race is incredibly expensive when the data is dirty. Gartner estimates that poor data quality costs organizations an average of $12.9 million every year
Back then, my advice was mostly about cleaning up the mess with Excel tips and tricks. But as data has become more integrated, I've learned that just cleaning is not enough. If you don't fix the underlying architecture, you're just bailing water out of a boat filled with holes. You can work as hard as you want, but you’re still going to sink.
Before we start, let's get one thing straight: There is no judgement here! Dirty data isn’t a sign of failure; it’s a symptom of growth. It’s the natural, unavoidable byproduct of scaling quickly, ambitious experiments, and a team that’s moving fast.
But to turn that growth into a scalable solution, you need a plan. This isn't about chasing a mythical state of "perfectly clean" data. Trust me, I’ve tried to get there through team games, sheer force of will, and pure stubbornness. It's about building a system that's resilient to the chaos—one that prevents the mess from happening in the first place.
Here’s how you move from constantly reacting to being proactive to have more time to focus on strategy.