Introduction to Data

Data is everywhere.

What is Data?

  • Derived from datum: given, fact.
  • Valuable resource in this digital era.

Data context

Information that provides meaning to data:

  • When was the data collected?
  • Where was the data collected?
  • How was the data collected?
  • Who collected the data?
  • Why was the data collected?

These characteristics of data are called metadata.
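As a minimal sketch, a data point can be stored together with its context. The class names, field names, and sample values below are hypothetical and only mirror the five questions above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Metadata:
    """Context for a data point; one field per question (illustrative names)."""
    collected_on: date   # when
    location: str        # where
    method: str          # how
    collector: str       # who
    purpose: str         # why


@dataclass
class Measurement:
    value: float
    unit: str
    metadata: Metadata


# Made-up sample values, for illustration only.
reading = Measurement(
    value=21.5,
    unit="Celsius",
    metadata=Metadata(
        collected_on=date(2024, 3, 1),
        location="Office, room 2",
        method="Wall thermometer",
        collector="Facilities team",
        purpose="Monitor heating costs",
    ),
)


def describe(m: Measurement) -> str:
    """Turn a bare number into a statement by attaching its context."""
    md = m.metadata
    return (f"{m.value} {m.unit} measured on {md.collected_on.isoformat()} "
            f"at {md.location} by {md.collector}")
```

Without the metadata, `21.5` is just a number; with it, the value can be interpreted and compared.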

Types of data

Classification of data based on structure:

  • Structured data
  • Semi-structured data
  • Unstructured data

Classification of data based on nature:

  • Quantitative data
  • Qualitative data
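The sketch below shows what each category can look like in practice. The sample values are made up, and CSV, JSON, and free text are only common representatives of each structure, not the only ones.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same fields (e.g. a CSV table).
structured = "name,age\nAda,36\nLin,41\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but fields may differ per record (e.g. JSON).
semi_structured = '[{"name": "Ada", "age": 36}, {"name": "Lin", "tags": ["new"]}]'
records = json.loads(semi_structured)

# Unstructured: free text; any structure has to be extracted first.
unstructured = "Ada is 36 years old and joined last spring."
match = re.search(r"(\d+) years old", unstructured)
age_from_text = int(match.group(1)) if match else None

# Quantitative data is numeric and can be computed with;
# qualitative data is descriptive (labels, names, categories).
quantitative = [int(r["age"]) for r in rows]
qualitative = [r["name"] for r in rows]
```

Note that the second JSON record has no `age` key at all, which a fixed-column table would not allow.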

Data storage is changing

Historical data storage:

  • Genetic information in DNA.
  • Cave and wall paintings.
  • Scrolls and books of papyrus or parchment.

19th and 20th century:

  • Punch cards.
  • Magnetic tape, floppy disks.

20th and 21st century:

  • More data on smaller media.
  • CDs and hard/solid-state drives (local).
  • Data centers (cloud).

The DIKW pyramid

Data -> Information -> Knowledge -> Wisdom

Raw data: unprocessed data.

Creating information: information is (organized) data with context.

Knowledge is power:

  • Information alone doesn't lead to decisions.
  • Knowledge comes from connecting all the dots of information.
  • Knowledge is information with meaning.

Achieving wisdom:

  • The hardest part.
  • Insights add more meaning to information by linking pieces.
  • Apply the knowledge to make better decisions.
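A toy walk-through of the four levels, as a sketch only. The temperature readings, room names, and the 23-degree limit are all assumptions made for illustration.

```python
from statistics import mean

# Data: raw, unprocessed values with no context.
raw = [18.2, 18.9, 24.5, 25.1]

# Information: the same values, organized and given context (where, when, what).
information = [
    {"room": "A", "hour": 9, "temp_c": 18.2},
    {"room": "A", "hour": 15, "temp_c": 18.9},
    {"room": "B", "hour": 9, "temp_c": 24.5},
    {"room": "B", "hour": 15, "temp_c": 25.1},
]

# Knowledge: connect the pieces of information to reveal a pattern.
by_room = {}
for row in information:
    by_room.setdefault(row["room"], []).append(row["temp_c"])
avg_by_room = {room: round(mean(temps), 2) for room, temps in by_room.items()}

# Wisdom: apply the knowledge to a decision.
LIMIT_C = 23.0  # assumed comfort limit, not taken from the notes
rooms_to_cool = [room for room, avg in avg_by_room.items() if avg > LIMIT_C]
```

The raw list alone cannot say which room is too warm; the decision only becomes possible once context and a pattern are added.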

From data to decision

Ask questions -> Gather data -> Prepare data -> Conduct Analysis -> Make decision

Data as a resource

Overwhelming data

  • Data is often too large in its raw form. Even "simple" analyses require large amounts of data.

  • More complex analyses can draw on millions or even billions of records.

Data aggregation

Aggregation is the process of summarizing a dataset into smaller pieces that are easier to understand. It is required to make informed decisions.

Common aggregations:

  • Simple average (mean)
  • Sum (totals)
  • Minimum or maximum
  • Modes
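These aggregations can be computed with Python's standard library, as in the sketch below. The order counts are hypothetical sample data.

```python
from statistics import mean, multimode

# Hypothetical daily order counts for one week.
orders = [12, 15, 12, 20, 18, 12, 15]

summary = {
    "mean": round(mean(orders), 2),  # simple average
    "sum": sum(orders),              # total
    "min": min(orders),
    "max": max(orders),
    "modes": multimode(orders),      # most frequent value(s), returned as a list
}
```

Seven raw values are reduced to five summary numbers. With millions of records the input grows, but the summary stays the same size.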

Aggregations appear in many forms throughout organizations:

  • Metrics
  • Benchmarks
  • Key performance indicators (KPIs)

Understanding how these aggregations are created is extremely helpful for many investigations.
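The sketch below illustrates why it helps to know how an aggregation is built. The ticket data, the 24-hour target, and the 80% benchmark are assumptions chosen for the example.

```python
# Hypothetical support tickets: (ticket id, hours until resolved).
tickets = [("T1", 4.0), ("T2", 30.0), ("T3", 10.0), ("T4", 52.0), ("T5", 8.0)]

# Metric: a plain aggregation of the raw records.
avg_hours = sum(hours for _, hours in tickets) / len(tickets)

# KPI: a metric tied to a goal, here the share of tickets resolved within the target.
TARGET_HOURS = 24.0  # assumed service target
within_target = sum(1 for _, hours in tickets if hours <= TARGET_HOURS)
kpi_pct = 100 * within_target / len(tickets)

# Benchmark: a reference value the KPI is compared against.
BENCHMARK_PCT = 80.0  # assumed reference value
meets_benchmark = kpi_pct >= BENCHMARK_PCT
```

Here the average resolution time is below the 24-hour target, yet only three of five tickets actually meet it. Two summaries of the same data can point in different directions, depending on how each was computed.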

Key concepts

Data flow

Data flow within organizations is often highly complex.

  • Data comes from many different source systems.
  • It is processed through other systems.
  • It is displayed and manipulated in yet other systems.

Data management

Data management is responsible for unifying and standardizing data from many different data flows.

Data governance

Ensure data is consistent, trustworthy, and isn't misused.

Data quality

Ensure data is accurate, complete, consistent, and up-to-date.
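The four quality dimensions can each be turned into a simple check, as sketched below. The record layout, the plausible age range, and the one-year freshness threshold are illustrative assumptions; real rules depend on the dataset.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative.
records = [
    {"id": 1, "email": "ada@example.com", "age": 36, "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "age": 41, "updated": date(2024, 4, 20)},
    {"id": 3, "email": "lin@example.com", "age": -5, "updated": date(2021, 1, 10)},
    {"id": 3, "email": "lin@example.com", "age": 29, "updated": date(2024, 5, 2)},
]


def quality_issues(rows, today, max_age_days=365):
    """Return (id, issue) pairs, one per failed check."""
    issues = []
    seen_ids = set()
    for row in rows:
        # Complete: no missing values.
        if any(value is None for value in row.values()):
            issues.append((row["id"], "incomplete"))
        # Accurate: value falls in a plausible range (assumed 0 to 120).
        if row["age"] is not None and not 0 <= row["age"] <= 120:
            issues.append((row["id"], "inaccurate"))
        # Consistent: an id should identify exactly one record.
        if row["id"] in seen_ids:
            issues.append((row["id"], "inconsistent"))
        seen_ids.add(row["id"])
        # Up-to-date: record was refreshed recently enough.
        if (today - row["updated"]).days > max_age_days:
            issues.append((row["id"], "stale"))
    return issues


# The reference date is passed in so the result is reproducible.
issues = quality_issues(records, today=date(2024, 6, 1))
```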

Data privacy and security

Ensure proper data access, use, and protection.
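One possible way to combine access control with protection is sketched below: direct identifiers are replaced by a keyed hash or masked, and each role only sees an allowed set of fields. The roles, field names, and key are hypothetical, and this is a simplified illustration, not a complete security design.

```python
import hashlib
import hmac

# Placeholder key for the example; a real key would be stored and managed securely.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed hash so records can be linked but not read."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_email(email: str) -> str:
    """Keep only the first character of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


# Access: which (hypothetical) role may see which fields.
VISIBLE_FIELDS = {
    "analyst": {"user_key", "age"},
    "support": {"user_key", "email_masked"},
}


def view_for(role: str, record: dict) -> dict:
    """Build a protected view of a record, limited to the fields the role may see."""
    safe = {
        "user_key": pseudonymize(record["email"]),
        "email_masked": mask_email(record["email"]),
        "age": record["age"],
    }
    allowed = VISIBLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {key: value for key, value in safe.items() if key in allowed}


customer = {"email": "ada@example.com", "age": 36}
analyst_view = view_for("analyst", customer)
```

The raw email address never appears in any view, and a role that is not listed receives an empty result.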

Principles of data ethics

  • Permission for
  • Transparency about the plan
  • Privacy of data
  • Good intentions
  • Consider the outcome

Data lifecycle

Planning -> Collection -> Storage -> Management -> Cleaning and processing -> Analysis and Visualization -> Sharing -> Archiving/destroying

Common mistakes about data

  • Not having a clear goal or question
  • Insufficient or wrong data (data bias, dirty data, etc.)
  • Lack of appropriate analysis (lack of context, wrong metrics, etc.)
  • No clear communication of results