Did you know that data analysts often use measures of central tendency to summarize data and assess the symmetry and variation in the data? If you're wondering what measures of central tendency are, they refer to the ways in which we can represent the center or middle of a dataset.
Let's dive deeper into the task of using measures of central tendency to summarize data and assess symmetry and variation. Here are some steps to follow:
To use measures of central tendency, it's important to understand the variable types and measurement scales. Variables can be classified as either categorical or numerical.
Categorical variables are those that represent qualities or characteristics, such as gender, color, or nationality.
Numerical variables, on the other hand, represent quantities or amounts, such as age, weight, or income.
Numerical variables can be further divided into two categories: continuous or discrete.
Continuous variables can take any value within a certain range, such as temperature or height.
Discrete variables, on the other hand, can only take specific values, such as the number of children in a family or the number of pets a person owns.
Once you have a good understanding of the variable types and measurement scales, you can begin to calculate the measures of central tendency. The most common measures of central tendency are the mean, median, and mode.
The mean is the average value of a dataset and is calculated by summing up all the values and dividing by the number of observations.
The median is the middle value of a dataset and is calculated by arranging the values in ascending or descending order and finding the value that falls in the middle.
The mode is the value that occurs most frequently in a dataset.
It's important to choose the most appropriate measure of central tendency based on the variable type and distribution. For example, the mean is appropriate for numerical (interval or ratio) data that is roughly symmetric, while the median is better suited when the distribution is skewed or contains outliers.
In addition to measures of central tendency, it's also important to assess the variation in the data. One way to do this is by calculating the coefficient of variation (CV).
The CV is a measure of relative variability and is calculated by dividing the standard deviation by the mean and multiplying by 100.
A low CV indicates that the data is relatively consistent, while a high CV suggests that the data is more variable.
Another important aspect of exploratory data analysis is assessing the symmetry of the data. Skewness is a measure of the degree of asymmetry in a distribution.
A symmetrical distribution has a skewness of zero, while a positively skewed distribution has a skewness greater than zero and a negatively skewed distribution has a skewness less than zero.
Skewness can be calculated using the skewness() function from the moments package in R or the skew() function from scipy.stats in Python.
Let's take a look at some examples of using measures of central tendency to summarize data and assess symmetry and variation. Suppose we have a dataset of salaries for a company.
To calculate the mean salary in R, we can use the mean() function:
salaries <- c(50000, 60000, 70000, 80000, 90000)
mean(salaries)
The output will be: 70000
To calculate the median salary in Python, we can use the numpy library:
import numpy as np
salaries = [50000, 60000, 70000, 80000, 90000]
np.median(salaries)
The output will be: 70000.0
To calculate the coefficient of variation in R, we can divide the standard deviation by the mean using base R functions:
salaries <- c(50000, 60000, 70000, 80000, 90000)
sd(salaries) / mean(salaries) * 100
The output will be: 22.59 (note that R's sd() uses the sample standard deviation, which divides by n - 1)
To assess the symmetry of the data, we can use the skewness() function in R:
library(moments)
salaries <- c(50000, 60000, 70000, 80000, 90000)
skewness(salaries)
The output will be: 0
In this case, the data is symmetrical with a skewness of zero.
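For the Python route mentioned earlier, a minimal sketch using the skew() function from scipy.stats (assuming SciPy is installed) gives the same answer:
from scipy.stats import skew

salaries = [50000, 60000, 70000, 80000, 90000]
print(skew(salaries))  # 0.0 for this symmetric dataset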
In conclusion, using measures of central tendency to summarize data and assess symmetry and variation is an important aspect of exploratory data analysis. By understanding the variable types and measurement scales, calculating the appropriate measure of central tendency, and assessing the variation and symmetry of the data, data analysts can gain valuable insights and make informed decisions.
You might be wondering why identifying variable types and measurement scales is even necessary for data analysis. The truth is, understanding the nature of the data you are working with is critical for making valid analytical decisions. In fact, the choice of statistical tests and visualization techniques often depends on the type of data you're dealing with.
Let's dive deeper into the concept of variable types and measurement scales, and learn how to identify them in your dataset.
A variable is a characteristic or attribute that can take on different values, and there are two main types: qualitative (categorical) and quantitative (numerical) variables.
Qualitative variables: These variables describe non-numeric characteristics or categories. Examples include gender, hair color, or car brand.
Quantitative variables: These variables represent numerical values. Examples include age, income, or height.
Now, let's explore the different measurement scales that can be applied to these variable types:
Nominal scale: This scale is used for qualitative variables and assigns unique labels to different categories, with no inherent order. Example: Hair colors (red, black, blonde, brown).
Ordinal scale: This scale is used for qualitative variables that have a natural order or ranking. Example: Education level (elementary, high school, undergraduate, graduate).
Interval scale: This scale is used for quantitative variables that have equal intervals between values but no true zero point. Example: Temperature measured in Celsius or Fahrenheit.
Ratio scale: This scale is used for quantitative variables that have equal intervals between values and a true zero point. Example: Height or weight.
Now that you're familiar with the different variable types and measurement scales, let's practice identifying them using a hypothetical dataset containing information about employees in a company.
Data Sample:
Name       | Gender | Age | Education Level | Salary
-----------|--------|-----|-----------------|-------
John Doe   | Male   | 35  | Graduate        | 80000
Jane Smith | Female | 42  | Undergraduate   | 75000
...
To identify the variable types and measurement scales in this dataset, let's examine each column:
Name: This is a qualitative variable with a nominal scale, as it assigns a unique label to each individual without any inherent order.
Gender: This is also a qualitative variable with a nominal scale, as it classifies individuals into categories (male or female) without any ranking.
Age: This is a quantitative variable with a ratio scale, as it represents numerical values with equal intervals and a true zero point (i.e., age can be zero).
Education Level: This is a qualitative variable with an ordinal scale because it assigns labels with a natural order (elementary < high school < undergraduate < graduate).
Salary: Lastly, this is a quantitative variable with a ratio scale, as it represents monetary values with equal intervals and a true zero point (i.e., salary can be zero).
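To make this concrete, here is a minimal Python sketch (assuming the pandas library and the hypothetical employee columns from the sample above) that tells pandas which columns are categorical and that Education Level has a natural order:
import pandas as pd

# Hypothetical employee records matching the data sample above
df = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith"],
    "Gender": ["Male", "Female"],
    "Age": [35, 42],
    "Education Level": ["Graduate", "Undergraduate"],
    "Salary": [80000, 75000],
})

# Gender is a nominal categorical: categories with no inherent order
df["Gender"] = df["Gender"].astype("category")

# Education Level is ordinal: an ordered categorical from lowest to highest level
education_order = ["Elementary", "High School", "Undergraduate", "Graduate"]
df["Education Level"] = pd.Categorical(df["Education Level"],
                                       categories=education_order, ordered=True)

print(df.dtypes)  # Age and Salary stay numeric (ratio scale); the rest are object/category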
Identifying variable types and measurement scales is a crucial step in data analysis because it helps you determine the appropriate statistical tests and visualization methods to use. By understanding the nature of your data, you can make more informed decisions and draw accurate conclusions from your analysis.
Remember: The key is to always carefully examine your dataset and be mindful of the characteristics of each variable to ensure a successful analysis.
Selecting the right measure of central tendency is crucial for accurately describing the center or average value of a dataset. Different measures can give different insights depending on the variable type and distribution. Let's explore when to use the mean, median, or mode in detail.
The mean is the most common measure of central tendency, which is calculated by adding up all the values in the dataset and dividing the sum by the number of data points. It's especially suitable for interval and ratio data, where distances between values are meaningful. The mean is sensitive to outliers and can be affected by extreme values.
Example:
data = [2, 3, 4, 5, 6]
mean = sum(data) / len(data)
print("Mean:", mean)
Output:
Mean: 4.0
The median is the middle value in a dataset when the values are sorted in ascending or descending order. If the dataset has an even number of values, the median is the mean of the two middle values. The median is less sensitive to outliers and better suited for ordinal data or data with a skewed distribution.
Example:
data = [2, 3, 4, 5, 100]
sorted_data = sorted(data)
n = len(sorted_data)
if n % 2 == 0:
    # even number of values: average the two middle values
    median = (sorted_data[n//2 - 1] + sorted_data[n//2]) / 2
else:
    # odd number of values: take the middle value
    median = sorted_data[n//2]
print("Median:", median)
Output:
Median: 4
The mode is the value(s) that occur most frequently in a dataset. It is applicable to nominal data and can be used for ordinal, interval, or ratio data if the frequency of values is the primary consideration. Mode is not sensitive to outliers and can have multiple values in a dataset with several equally frequent values.
Example:
from collections import Counter
data = [2, 3, 4, 5, 5, 6, 6]
counted_data = Counter(data)
mode = [item for item, count in counted_data.items() if count == max(counted_data.values())]
print("Mode:", mode)
Output:
Mode: [5, 6]
Here's a quick rundown on when to use each measure of central tendency:
Nominal data: Use mode, as it represents the most frequent category.
Ordinal data: Use median, as it describes the middle rank without assuming equal intervals.
Interval & Ratio data: Use mean for normally distributed data; use median for skewed data or data with outliers.
Remember to assess the data distribution and consider using visualizations like histograms or box plots to better understand the underlying data structure. This will help you make an informed decision about the appropriate measure of central tendency.
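To make the last point concrete, here is a minimal Python sketch (with made-up income values) showing how a single outlier pulls the mean away from the median:
import statistics

# Made-up incomes with one extreme value
incomes = [30000, 32000, 35000, 36000, 40000, 250000]

print("Mean:  ", statistics.mean(incomes))    # 70500, dragged upward by the outlier
print("Median:", statistics.median(incomes))  # 35500.0, close to the typical value
The mean is nearly double the median here, which is exactly the situation where the median is the more faithful summary.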
Let's say you are an investor looking to invest in three different stocks - Stock A, Stock B, and Stock C. You have collected the past returns of these stocks and want to assess their variations to determine their riskiness. In order to do this, you'll use measures of dispersion such as range, interquartile range, and variance.
Range is the difference between the highest and the lowest data points in a dataset. It provides a quick idea of the spread of the data but is sensitive to outliers.
To calculate the range, follow these steps:
Find the maximum value in your dataset.
Find the minimum value in your dataset.
Subtract the minimum value from the maximum value.
For example, let's say the past returns for Stock A are:
5%, 8%, 12%, 15%, 19%, 22%, 25%, 30%
The range is calculated as:
Range = Max Value - Min Value
Range = 30% - 5% = 25%
Thus, the range of past returns for Stock A is 25%.
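As a quick sanity check, the range takes one line of Python; this sketch reuses the Stock A returns above as plain numbers in percent:
returns_a = [5, 8, 12, 15, 19, 22, 25, 30]  # Stock A past returns, in percent

# Range = maximum value - minimum value
print("Range:", max(returns_a) - min(returns_a))  # 25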
Interquartile Range (IQR) is the range of the middle 50% of the data, which is less sensitive to outliers as it doesn't include extreme values. It is the difference between the first quartile (Q1) and the third quartile (Q3).
To calculate the IQR, follow these steps:
Sort the dataset in ascending order.
Find the Q1 (the 25th percentile) by calculating the position: (n+1)/4, where n is the number of data points.
Find the Q3 (the 75th percentile) by calculating the position: 3*(n+1)/4, where n is the number of data points.
Subtract Q1 from Q3.
For the past returns of Stock A:
Sorted data: 5%, 8%, 12%, 15%, 19%, 22%, 25%, 30%
Position of Q1 = (8+1)/4 = 2.25 (between 8% and 12%)
Q1 = 8% + 0.25(12%-8%) = 9%
Position of Q3 = 3*(8+1)/4 = 6.75 (between 22% and 25%)
Q3 = 22% + 0.75(25%-22%) = 24.25%
IQR = Q3 - Q1 = 24.25% - 9% = 15.25%
The IQR for Stock A's past returns is 15.25%.
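The hand calculation above uses the (n+1)/4 positioning rule with linear interpolation. Here is a minimal Python sketch of that rule (note that libraries such as NumPy use a different interpolation method by default, so their quartiles may not match these values exactly):
def quantile_n_plus_1(sorted_values, fraction):
    # 1-based position under the (n+1) convention, e.g. 2.25 for Q1 when n = 8
    position = (len(sorted_values) + 1) * fraction
    if position <= 1:
        return sorted_values[0]
    if position >= len(sorted_values):
        return sorted_values[-1]
    lower = int(position) - 1              # 0-based index of the value just below the position
    remainder = position - int(position)   # fractional part used to interpolate
    return sorted_values[lower] + remainder * (sorted_values[lower + 1] - sorted_values[lower])

returns_a = sorted([5, 8, 12, 15, 19, 22, 25, 30])  # Stock A past returns, in percent
q1 = quantile_n_plus_1(returns_a, 0.25)  # 9.0
q3 = quantile_n_plus_1(returns_a, 0.75)  # 24.25
print("IQR:", q3 - q1)                   # 15.25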
Variance is a measure of the average squared deviation from the mean. It is useful for understanding the volatility of a dataset.
To calculate the variance, follow these steps:
Calculate the mean of the dataset.
Subtract the mean from each data point and square the result.
Calculate the average of the squared differences.
For Stock A's past returns:
Mean = (5% + 8% + 12% + 15% + 19% + 22% + 25% + 30%) / 8 = 17%
Squared deviations:
(5%-17%)^2 = 144
(8%-17%)^2 = 81
(12%-17%)^2 = 25
(15%-17%)^2 = 4
(19%-17%)^2 = 4
(22%-17%)^2 = 25
(25%-17%)^2 = 64
(30%-17%)^2 = 169
Variance = (144 + 81 + 25 + 4 + 4 + 25 + 64 + 169) / 8 = 516 / 8 = 64.5
Stock A's variance in past returns is 64.5 (in squared percentage points). Note that dividing by n gives the population variance; a sample variance would divide by n - 1.
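Python's statistics module can confirm this; a minimal sketch (pvariance divides by n, matching the population-style calculation above, whereas statistics.variance would divide by n - 1):
import statistics

returns_a = [5, 8, 12, 15, 19, 22, 25, 30]  # Stock A past returns, in percent

print("Variance:", statistics.pvariance(returns_a))          # 64.5
print("Standard deviation:", statistics.pstdev(returns_a))   # about 8.03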
Now, you can compare the range, IQR, and variance of all three stocks in order to make informed investment decisions. The stock with the least variation might be less risky, while the one with the highest variation might offer higher potential returns but with more risk.
The coefficient of variation (CV) is a valuable statistical measure that allows you to compare the variability of two datasets, regardless of their means or units of measurement. By calculating the CV, you can determine how the data is spread out in relation to the mean โ it's especially useful when comparing datasets with different measurement scales or units.
Consider two investment portfolios with different average returns and risk levels. You want to compare the variability of their returns to determine which portfolio is more stable. The mean value of Portfolio A is 15% while the mean value of Portfolio B is 10%. A simple comparison of standard deviations wouldn't be adequate, as it doesn't account for the differences in mean values. That's where the coefficient of variation comes in handy, as it normalizes the standard deviation relative to the mean.
To calculate the coefficient of variation for a dataset, you'll need to follow these steps:
Find the mean (average) of the dataset.
Calculate the standard deviation.
Divide the standard deviation by the mean.
Multiply the result by 100 to express the coefficient of variation as a percentage.
The formula for the coefficient of variation is:
CV = (Standard Deviation / Mean) x 100
Let's walk through an example to illustrate the process:
Suppose you own two stores, Store A and Store B, and you want to compare the variability in their monthly sales figures. The sales data for the past six months are as follows:
Store A: [1200, 1300, 1100, 1150, 1250, 1210]
Store B: [800, 900, 700, 750, 850, 820]
Step 1: Calculate the mean
Mean of Store A: (1200 + 1300 + 1100 + 1150 + 1250 + 1210) / 6 = 1201.67
Mean of Store B: (800 + 900 + 700 + 750 + 850 + 820) / 6 = 803.33
Step 2: Calculate the standard deviation
Standard Deviation of Store A: 64.66 (population standard deviation, consistent with the variance calculation earlier)
Standard Deviation of Store B: 64.98
Step 3: Divide the standard deviation by the mean
CV of Store A: 64.66 / 1201.67 = 0.0538
CV of Store B: 64.98 / 803.33 = 0.0809
Step 4: Multiply the result by 100 to express the CV as a percentage
CV of Store A: 0.0538 * 100 = 5.38%
CV of Store B: 0.0809 * 100 = 8.09%
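A minimal Python sketch that reproduces these figures (using the population standard deviation, as above):
import statistics

store_a = [1200, 1300, 1100, 1150, 1250, 1210]
store_b = [800, 900, 700, 750, 850, 820]

def coefficient_of_variation(values):
    # CV = (standard deviation / mean) x 100
    return statistics.pstdev(values) / statistics.mean(values) * 100

print("CV of Store A: {:.2f}%".format(coefficient_of_variation(store_a)))  # 5.38%
print("CV of Store B: {:.2f}%".format(coefficient_of_variation(store_b)))  # 8.09%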
Now that we have calculated the coefficient of variation for both stores, we can compare their variability. Store A has a CV of 5.38%, while Store B has a CV of 8.09%. This means that Store B's sales figures are more variable relative to their mean than Store A's, even though the two standard deviations are nearly identical.
In conclusion, using the coefficient of variation helps you compare the relative variability of different datasets, even if their means or units of measurement differ. It's an essential tool in data analysis, allowing you to make informed decisions based on variability comparisons.
Skewness is essential in data analysis because it helps in assessing the symmetry of the data. By calculating skewness, you can identify if the data is symmetric or if it is concentrated more on one side. This information helps in understanding the underlying structure of the data and can influence your decision-making process in various fields like finance, economics, and social sciences.
Skewness is the measure of asymmetry in a probability distribution. In simple terms, it tells us if the data points in a dataset are more concentrated on one side than the other. For example, if a dataset has a positive skew, it means that the data points are clustered more towards the lower end, with a longer tail on the right side of the mean. Conversely, a negative skew indicates that the data points are more concentrated towards the upper end, with a longer tail on the left side of the mean.
There are three types of skewness:
Positive skew: The tail on the right side of the distribution is longer than the left side.
Negative skew: The tail on the left side of the distribution is longer than the right side.
Zero skew: The distribution is symmetric, and the mean, median, and mode are equal.
There are several ways to calculate skewness, but a widely used method is Pearson's second coefficient of skewness (also known as Pearson's median skewness). This formula calculates skewness using the mean, median, and standard deviation of the dataset.
The formula for Pearson's second coefficient of skewness is:
Skewness = 3 x (Mean - Median) / Standard Deviation
Now let's dive into a step-by-step example to understand how to calculate skewness for a dataset using this formula.
Suppose we have the following dataset representing the exam scores of a group of students:
Scores: 45, 50, 55, 60, 65, 70, 75, 80, 85, 90
Calculate the mean: The mean is the average of all the data points. To calculate the mean, add up all the numbers and divide the sum by the total number of data points.
Mean = (45 + 50 + 55 + 60 + 65 + 70 + 75 + 80 + 85 + 90) / 10
Mean = 67.5
Calculate the median: The median is the middle value of the dataset. To find the median, sort the data in ascending order and find the middle number. If there are an even number of data points, take the average of the two middle numbers.
Sorted Scores: 45, 50, 55, 60, 65, 70, 75, 80, 85, 90
Median = (65 + 70) / 2
Median = 67.5
Calculate the standard deviation: The standard deviation measures the dispersion of the data points around the mean. To calculate the standard deviation, first find the difference between each data point and the mean, square those differences, find the mean of those squared differences, and then take the square root of that result.
Squared Differences: 506.25, 306.25, 156.25, 56.25, 6.25, 6.25, 56.25, 156.25, 306.25, 506.25
Mean of Squared Differences: 206.25
Standard Deviation = √206.25
Standard Deviation = 14.36
Calculate the skewness: Now that we have the mean, median, and standard deviation, we can plug these values into Pearson's second coefficient of skewness formula:
Skewness = 3 x (Mean - Median) / Standard Deviation
Skewness = 3 x (67.5 - 67.5) / 14.36
Skewness = 0
The skewness for this dataset is 0, which indicates that the data is symmetric, with the mean equal to the median.
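To double-check the arithmetic, here is a minimal Python sketch of Pearson's second coefficient of skewness applied to the same exam scores:
import statistics

scores = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90]

mean = statistics.mean(scores)      # 67.5
median = statistics.median(scores)  # 67.5
stdev = statistics.pstdev(scores)   # population standard deviation, about 14.36

# Pearson's second coefficient of skewness: 3 x (mean - median) / standard deviation
skewness = 3 * (mean - median) / stdev
print("Skewness:", skewness)  # 0.0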
Understanding skewness is vital for better data analysis as it gives insights into the structure of the data, which can be helpful in decision-making processes across various fields. By calculating skewness, you can assess the symmetry of the data and determine if the data is concentrated more on one side, allowing you to make better-informed decisions based on the dataset.