Skip to content

Introduction to Data

What is Data?

From the Latin datum — "given, fact." Data is raw, uninterpreted facts about the world, captured so they can be stored, processed, and reasoned about later.

Metadata: The Context Around Data

Metadata is data about data. Without it, a number is just a number. The five questions that define a dataset's context:

  • When was it collected?
  • Where was it collected?
  • How was it collected?
  • Who collected it?
  • Why was it collected?

Answering these up front prevents misinterpretation later.

Types of Data

By structure:

Type Description Examples
Structured Fits a fixed schema, usually tabular. SQL tables, CSVs
Semi-structured Self-describing, flexible schema. JSON, XML, YAML
Unstructured No predefined schema. Text, images, audio, video

By nature:

  • Quantitative — measurable numeric values (age, temperature, revenue).
  • Qualitative — descriptive, categorical (color, opinion, label).

The DIKW Pyramid

How raw data becomes decisions:

Data → Information → Knowledge → Wisdom
  • Data — raw, unprocessed facts.
  • Information — data with context. Organized, meaningful.
  • Knowledge — information connected into patterns that support decisions.
  • Wisdom — the ability to apply knowledge well. The hardest, most valuable layer.

From Data to Decision

Ask → Gather → Prepare → Analyze → Decide

A clear question before collection saves hours of rework. Most data projects fail not because of bad algorithms but because the goal was fuzzy.

Core Concepts

Concept What it means
Ingestion Collecting data from sources into your system (aka population).
Transformation Reshaping raw data into a form suitable for analysis.
Storage Keeping data accessible and durable.
Analysis Extracting insight via statistics or ML.
Visualization Communicating findings through charts and dashboards.
Normalization Splitting data into related tables (1NF, 2NF, 3NF, BCNF) to avoid redundancy.
Aggregation Summarizing many records into fewer (mean, sum, min, max, KPIs).
Management Unifying data across multiple flows.
Governance Ensuring data is consistent, trustworthy, and used appropriately.

Data Quality Dimensions

When you receive or produce a dataset, check it against these five criteria:

Dimension Question to ask
Accuracy Is it correct? Does it match reliable sources?
Validity Does it fit the intended use?
Completeness Are required fields missing?
Reliability Is the source credible?
Consistency Does it agree with itself over time and across datasets?

Ethics of Data

Five principles, one question each:

  • Permission — do subjects consent to this use?
  • Transparency — is the plan visible to them?
  • Privacy — is personal data protected?
  • Intent — are the goals legitimate?
  • Outcome — have you considered the downstream effects?

Data Lifecycle

Plan → Collect → Store → Manage → Clean & Process → Analyze & Visualize → Share → Archive / Destroy

Every stage has compliance and quality implications. Skipping "Plan" is the most common mistake.

Importing Data

"Importing" means moving data from a source into your storage system, usually reshaping it along the way. Common sources:

  • Relational databases (RDBMS)
  • Web / REST APIs
  • Flat files (CSV, JSON, XML)
  • Excel workbooks, BI tool exports
  • Public data directories

Choose the import strategy based on:

  • Volume of data
  • Frequency of updates (one-off, scheduled, continuous, real-time)
  • Required transformations
  • Source model vs. target model mismatch

Typical approaches, from simplest to most complex: one-off batch → scheduled batch → continuous pipeline → real-time streaming. Start simple and upgrade when a real constraint forces the change.

Common Mistakes

  • No clear question. Data projects without a goal produce reports nobody uses.
  • Wrong or biased data. Garbage in, garbage out — no amount of modeling fixes bad sources.
  • Inappropriate analysis. Right numbers, wrong metric, wrong conclusion.
  • No communication plan. An insight that isn't shared clearly might as well not exist.