Introduction to Data¶
Data is everywhere.
What is Data?¶
- Derived from the Latin "datum": given, fact.
- Valuable resource in this digital era.
Data context¶
Information that provides meaning to data:
- When was the data collected?
- Where was the data collected?
- How was the data collected?
- Who collected the data?
- Why was the data collected?
These characteristics of the data are called metadata.
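One way to picture metadata is as a small record that travels with each value. The sketch below is only an illustration: the class names, field names, and values are all hypothetical, chosen to mirror the when/where/how/who/why questions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Metadata:
    """Context that gives meaning to a raw value (hypothetical fields)."""
    collected_on: date   # when
    location: str        # where
    method: str          # how
    collector: str       # who
    purpose: str         # why


@dataclass
class Measurement:
    value: float
    unit: str
    metadata: Metadata


# Illustrative values only, not real data.
reading = Measurement(
    value=21.5,
    unit="C",
    metadata=Metadata(
        collected_on=date(2024, 1, 15),
        location="Office, room 2",
        method="Wall-mounted thermometer",
        collector="Facilities team",
        purpose="Monitor heating costs",
    ),
)


def describe(m: Measurement) -> str:
    """Turn a bare number into a statement a reader can interpret."""
    md = m.metadata
    return (f"{m.value} {m.unit} measured on {md.collected_on.isoformat()} "
            f"at {md.location} by {md.collector}")


print(describe(reading))
```

Without the metadata, `21.5` is just a number; with it, the same value can be interpreted and compared.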
Types of data¶
Classification of data based on their structure:
- Structured data
- Semi-structured data
- Unstructured data
Classification of data based on their nature:
- Quantitative data
- Qualitative data
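The two classifications can be made concrete with a small sketch. It assumes CSV as an example of structured data, JSON as an example of semi-structured data, and free text as unstructured data; the names and values are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = "name,age\nAda,36\nLin,29\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but fields may vary per record.
semi_structured = '[{"name": "Ada", "age": 36}, {"name": "Lin", "tags": ["new"]}]'
records = json.loads(semi_structured)

# Unstructured: free text with no predefined schema.
unstructured = "Ada, 36, joined last year. Lin is new."

# Quantitative values are numeric and can be used in calculations;
# qualitative values are descriptive labels.
ages = [int(r["age"]) for r in rows]   # quantitative
names = [r["name"] for r in rows]      # qualitative

print(ages, names)
print([sorted(r.keys()) for r in records])
```

Note that the second JSON record has no `age` field at all, which a fixed-column table would not allow.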
Data storage is changing¶
Historical data storage
- Genetic information in DNA.
- Cave and wall paintings.
- Scrolls and books of papyrus/parchment.
19th and 20th centuries
- Punch cards.
- Magnetic tape, floppy disks.
20th and 21st centuries
- More data on smaller media.
- CDs and hard/solid-state drives (local).
- Data centers (cloud).
The DIKW pyramid¶
Data -> Information -> Knowledge -> Wisdom
Raw data
- Unprocessed data.
Creating information
- Information is (organized) data with context.
Knowledge is power
- Information alone doesn't lead to decisions.
- Knowledge comes from connecting all the dots of information.
- Knowledge is information with meaning.
Achieving wisdom
- The hardest part.
- Insights: add more meaning to information by linking pieces.
- Apply the knowledge to make better decisions.
From data to decision¶
Ask questions -> Gather data -> Prepare data -> Conduct analysis -> Make a decision
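The five steps can be walked through on a toy example. This is a minimal sketch, assuming a hypothetical list of sales records and a deliberately simple question; real projects involve far more work at each step.

```python
from collections import defaultdict

# 1. Ask a question: which product brought in the most revenue?

# 2. Gather data (hypothetical hard-coded sales records).
raw_sales = [
    {"product": "tea", "revenue": 120.0},
    {"product": "coffee", "revenue": 200.0},
    {"product": "tea", "revenue": None},   # missing value
    {"product": "coffee", "revenue": 50.0},
    {"product": "juice", "revenue": 90.0},
]

# 3. Prepare data: drop records with missing revenue.
clean_sales = [s for s in raw_sales if s["revenue"] is not None]

# 4. Conduct analysis: total revenue per product.
totals = defaultdict(float)
for sale in clean_sales:
    totals[sale["product"]] += sale["revenue"]

# 5. Make a decision: pick the product with the highest total revenue.
best_product = max(totals, key=totals.get)
print(best_product, totals[best_product])
```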
Data as a resource¶
Overwhelming data
- Data is often too large in its raw form. Even "simple" analyses require large amounts of data.
- More complex analyses can draw on millions or even billions of records.
Data aggregation
Aggregation is the process of summarizing a dataset into smaller pieces that are easier to understand. These summaries are needed to make informed decisions.
Common aggregations:
- Simple average (mean)
- Sum (total)
- Minimum or maximum
- Mode
Aggregations appear in many forms throughout organizations:
- Metrics
- Benchmarks
- Key performance indicators (KPIs)
Understanding how these aggregations are created is extremely helpful for many investigations.
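As a minimal sketch of the common aggregations, the example below summarizes a hypothetical week of daily order counts using Python's standard `statistics` module. The numbers are made up for illustration.

```python
import statistics

# Hypothetical daily order counts for one week.
orders = [12, 15, 12, 20, 18, 12, 16]

summary = {
    "mean": statistics.mean(orders),   # simple average
    "sum": sum(orders),                # total
    "min": min(orders),
    "max": max(orders),
    "mode": statistics.mode(orders),   # most frequent value
}
print(summary)
```

Seven raw values collapse into five summary figures. A metric or KPI such as "average orders per day" is typically one of these aggregations tracked over time.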
Key concepts¶
Data flow
Data flow within organizations is often highly complex.
- Data comes from many different source systems.
- It is processed through other systems.
- It is displayed and manipulated in still other systems.
Data management
Data management is responsible for unifying and standardizing data from many different data flows.
Data governance
Ensures data is consistent, trustworthy, and not misused.
Data quality
Ensures data is accurate, complete, consistent, and up to date.
Data privacy and security
Ensures proper data access, use, and protection.
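Some data quality dimensions can be checked automatically. The sketch below is an illustration only: the records, field names, and the one-year freshness threshold are all assumptions, and it covers completeness, consistency, and timeliness. Accuracy usually requires comparison against a trusted source and is not checked here.

```python
from datetime import date

# Hypothetical customer records; field names are made up for illustration.
records = [
    {"id": 1, "email": "a@example.com", "country": "NL", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "country": "nl", "updated": date(2024, 5, 3)},
    {"id": 3, "email": "c@example.com", "country": "DE", "updated": date(2021, 1, 9)},
]


def quality_issues(record, today, max_age_days=365):
    """Return a list of simple quality problems found in one record."""
    issues = []
    # Completeness: no field may be missing.
    if any(value is None for value in record.values()):
        issues.append("incomplete")
    # Consistency: country codes are assumed to be upper case.
    if record["country"] != record["country"].upper():
        issues.append("inconsistent country code")
    # Timeliness: the record must have been updated recently enough.
    if (today - record["updated"]).days > max_age_days:
        issues.append("out of date")
    return issues


today = date(2024, 6, 1)  # fixed date so the example is reproducible
report = {r["id"]: quality_issues(r, today) for r in records}
print(report)
```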
Principles of data ethics¶
- Permission for
- Transparency about the plan
- Privacy of data
- Good intentions
- Consider the outcome
Data lifecycle¶
Planning -> Collection -> Storage -> Management -> Cleaning and processing -> Analysis and visualization -> Sharing -> Archiving/destroying
Common mistakes about data¶
- Not having a clear goal or question
- Insufficient or wrong data (data bias, dirty data, etc.)
- Lack of appropriate analysis (lack of context, wrong metrics, etc.)
- No clear communication of results