Data Cleaning

Why Does Data Need Cleaning?

Real-world data is messy. Data collected from surveys, sensors, or databases almost always contains problems such as missing values, duplicate records, inconsistent formats, or implausible entries. Left unfixed, these problems can lead to incorrect conclusions.

Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. It is one of the most important steps in data analysis.

Common real-world examples:

  • A survey response has "N/A" where a number is expected
  • The same person submitted a form twice
  • Dates are in different formats: "3/1/2024" vs "March 1, 2024"
  • Someone typed their age as "150" instead of "15"
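
The four problems above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a complete cleaning pipeline. The sample rows, the field names (name, age, signup), the cutoff MAX_PLAUSIBLE_AGE, and the helper functions are all hypothetical. The date parsing assumes US month/day order and English month names.

```python
from datetime import datetime

# Hypothetical survey rows, one for each problem in the list above.
raw_rows = [
    {"name": "Ana", "age": "15", "signup": "3/1/2024"},
    {"name": "Ana", "age": "15", "signup": "March 1, 2024"},  # same person twice
    {"name": "Ben", "age": "N/A", "signup": "3/2/2024"},      # missing number
    {"name": "Cy", "age": "150", "signup": "March 3, 2024"},  # implausible age
]

DATE_FORMATS = ("%m/%d/%Y", "%B %d, %Y")  # assumption: month comes before day
MAX_PLAUSIBLE_AGE = 120                   # assumption: illustrative cutoff


def parse_age(text):
    """Return the age as an int, or None if it is missing or implausible."""
    try:
        age = int(text)
    except ValueError:
        return None  # e.g. "N/A"
    return age if 0 <= age <= MAX_PLAUSIBLE_AGE else None


def parse_date(text):
    """Try each known format and return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize every row, then drop rows that become exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {
            "name": row["name"].strip(),
            "age": parse_age(row["age"]),
            "signup": parse_date(row["signup"]),
        }
        key = (record["name"], record["age"], record["signup"])
        if key in seen:
            continue  # duplicate submission
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned_rows = clean(raw_rows)
for record in cleaned_rows:
    print(record)
```

Two design choices are worth noting. The duplicate check runs after the dates are normalized, so "3/1/2024" and "March 1, 2024" are recognized as the same submission. The age "150" is marked as missing rather than changed to "15", because the data alone cannot tell us what the person meant to type.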