futurera | open data, time series, forecasting

SUMMARY

- Characteristic metrics

Characteristics of the data points (observations) of a time series:

- Central values: mean, median and mode.
- Measures of scatter: MAD and standard deviation.

Mean
The mean, also known as the average or arithmetic mean, is a statistical measure that represents the central value of a dataset. It is calculated by adding up all the values in the dataset and then dividing by the total number of values. The mean is the center of the deviations: it is the value about which the deviations balance, so the deviations above the mean and the deviations below it sum to zero. Unlike the median, it does not necessarily split the data into two halves of 50 percent each.

Mean = ∑X_t / n
Mean = (Sum of all values) / (Number of values)
Where X_t is the value observed at time t and n is the number of values

An example.
We have the values: 27, 23, 31, 45, 47, 42, 39, 45, 57, 59, 73 and 84

mean = (27 + 23 + ... + 84) / 12 = 572 / 12 = 47.67
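
The calculation above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are chosen for illustration.

```python
from statistics import fmean

# Values from the example above.
values = [27, 23, 31, 45, 47, 42, 39, 45, 57, 59, 73, 84]

total = sum(values)   # sum of all values
n = len(values)       # number of values
mean = total / n      # Mean = (sum of all values) / (number of values)

print(total, n, round(mean, 2))   # 572 12 47.67

# statistics.fmean computes the same quantity in one call.
print(round(fmean(values), 2))    # 47.67
```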

- It is sensitive to outliers (extreme values).
- It describes roughly symmetric data well, such as normally distributed data.
- It can be misleading for skewed data distributions, because it is pulled toward the long tail.

The mean is a simple and widely used measure of central tendency. However, it is important to consider the presence of outliers and the distribution of the data when using the mean to describe a dataset.

Median

The median is the center of the data: the middle value of a dataset when the data is arranged in ascending or descending order. It divides the data into two equal halves, with 50 percent of the values below the median and 50 percent above it.

- It is less affected by outliers than the mean (average).
- It is useful for skewed data distributions.

The median is a robust measure of central tendency that provides valuable information about the center of a dataset, especially when dealing with skewed distributions or outliers.
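
To illustrate the robustness to outliers, the sketch below compares the mean and the median before and after one extreme value is added. The values and the outlier of 500 are made up purely for illustration.

```python
from statistics import mean, median

# Illustrative values (an assumption, not real measurements).
values = [2, 4, 5, 6, 9, 11, 12]
with_outlier = values + [500]   # add a single extreme value

# Without the outlier: mean 7, median 6.
print(mean(values), median(values))

# With the outlier: mean 68.625, median 7.5.
# The mean jumps far away from the bulk of the data; the median barely moves.
print(mean(with_outlier), median(with_outlier))
```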

An example.

We have values of 2, 4, 5, 6, 9, 11 and 12.

The median = 6
The values 2, 4 and 5 are below it, and
the values 9, 11 and 12 are above it.
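
A minimal sketch of the same procedure in Python: sort the values and take the middle one. The handling of an even number of values (averaging the two middle values) is the usual convention and is not spelled out in the text above; the extra value 14 is added only to show that case.

```python
def median(values):
    """Return the middle value of the sorted data.

    For an even number of values, return the average of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [2, 4, 5, 6, 9, 11, 12]
print(median(values))          # 6   (odd count: the single middle value)
print(median(values + [14]))   # 7.5 (even count: average of 6 and 9)
```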

Mode

The mode is the most frequent value in a dataset.

- It can be used for both numerical and categorical data. For example, if you're looking at the most popular color of car, the mode would be the color that appears most frequently.
- A dataset can have one mode, multiple modes, or no mode at all. A dataset with one mode is called unimodal, a dataset with two modes is bimodal, and a dataset with more than two modes is multimodal. A dataset with no mode means that all values appear the same number of times.
- The mode is not always the best measure of central tendency. In some cases, the mean or median might be a more appropriate measure. For example, in continuous data where values rarely repeat, or when the most frequent value happens to lie far from the bulk of the data, the mode might not be a good representation of the center of the data.

An example.
We have values of 1, 2, 5, 6, 6, 7, 10, 11.
The mode = 6
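
The sketch below counts how often each value occurs and returns every value with the highest count, so it also covers the bimodal and no-mode cases described above. The helper name `modes` and the car-color data are hypothetical, chosen for illustration.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, or [] if every value occurs equally often."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # No mode: several distinct values that all appear the same number of times.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 5, 6, 6, 7, 10, 11]))                  # [6]
print(modes(["red", "blue", "red", "white", "blue"]))     # ['red', 'blue'] (bimodal)
print(modes([1, 2, 3]))                                   # [] (no mode)
```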

MAD

MAD can stand for either of two measures:

- 1) Mean Absolute Deviation, or
- 2) Median Absolute Deviation

In both cases, a larger MAD means the data points are more spread out, and a smaller MAD means they are closer together.

Mean Absolute Deviation
The mean absolute deviation measures the average distance between each data point and the mean of the dataset. It gives you an idea of how spread out the data points are.

MAD = ( absolute(x₁ - mean) + absolute(x₂ - mean) + ... + absolute(xₙ - mean) ) / n

An example.

We have the values 11, 4, 5, 12, 9, 2 and 6.
The number of values (data points), n, is 7
The mean is (11 + 4 + 5 + 12 + 9 + 2 + 6) / 7 = 49 / 7 = 7

MAD = ( absolute(11-7) + absolute(4-7) + ... + absolute(6-7) ) / 7 = (4 + 3 + 2 + 5 + 2 + 5 + 1) / 7 = 22 / 7 ≈ 3.14
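
The same steps can be checked in Python. This is a minimal sketch that follows the formula term by term; the variable names are illustrative.

```python
# Values from the example above.
values = [11, 4, 5, 12, 9, 2, 6]

n = len(values)
mean = sum(values) / n                       # 49 / 7 = 7.0

# Absolute distance of each value from the mean.
abs_devs = [abs(x - mean) for x in values]   # [4.0, 3.0, 2.0, 5.0, 2.0, 5.0, 1.0]

# Mean absolute deviation: the average of those distances.
mad = sum(abs_devs) / n                      # 22 / 7

print(mean, sum(abs_devs), round(mad, 2))    # 7.0 22.0 3.14
```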

Median Absolute Deviation
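The median absolute deviation measures the variability of a dataset by calculating the median of the absolute deviations from the data's median:

MAD = median( absolute(x₁ - median), absolute(x₂ - median), ..., absolute(xₙ - median) )

It is more robust to outliers than the mean absolute deviation, meaning it is less affected by extreme values.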
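
As a worked example, the sketch below computes the median absolute deviation for the same values used in the mean absolute deviation example, and then again with one value replaced by an extreme one. The helper name `median_abs_deviation` and the outlier value 120 are assumptions made for illustration.

```python
from statistics import median


def median_abs_deviation(values):
    """Median of the absolute deviations from the data's median."""
    center = median(values)
    return median(abs(x - center) for x in values)


values = [11, 4, 5, 12, 9, 2, 6]
# Median is 6; absolute deviations sorted: 0, 1, 2, 3, 4, 5, 6; their median is 3.
print(median_abs_deviation(values))         # 3

# Replace 12 with an extreme value: the result does not change.
with_outlier = [11, 4, 5, 120, 9, 2, 6]
print(median_abs_deviation(with_outlier))   # 3
```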

Standard Deviation

The standard deviation (or σ) is a statistical measure of how dispersed a set of data is in relation to its mean (average). A low standard deviation means the data points are clustered closely around the mean; a high standard deviation means they are spread out farther from the mean.

- It helps us understand the variability within a dataset.
- It allows us to compare different datasets and see which one has more variation.
- It's used in many statistical calculations and analyses.

S = √( ∑(X - X̄)² / (n - 1) )
Where:
S = Sample standard deviation
n = Number of observations
X = Actual value
X̄ = Sample mean

An example:
We have the following data set: 5, 7, 9, 11, 13

Mean: (5 + 7 + 9 + 11 + 13) / 5 = 9
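Sample variance: ((5-9)^2 + (7-9)^2 + (9-9)^2 + (11-9)^2 + (13-9)^2) / (5 - 1) = 40 / 4 = 10
Sample standard deviation: √10 ≈ 3.16

Dividing by n = 5 instead of n - 1 = 4 gives the population variance 40 / 5 = 8 and the population standard deviation √8 ≈ 2.83.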
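
The formula above divides by n - 1 (sample standard deviation); dividing by n instead gives the population standard deviation. The sketch below computes both for the example data and compares them with the standard library functions `statistics.stdev` and `statistics.pstdev`. The variable names are illustrative.

```python
import math
from statistics import pstdev, stdev

# Values from the example above.
values = [5, 7, 9, 11, 13]

n = len(values)
mean = sum(values) / n                                  # 9.0
squared_devs = sum((x - mean) ** 2 for x in values)     # 16 + 4 + 0 + 4 + 16 = 40.0

sample_sd = math.sqrt(squared_devs / (n - 1))           # divide by n - 1
population_sd = math.sqrt(squared_devs / n)             # divide by n

print(round(sample_sd, 2), round(population_sd, 2))     # 3.16 2.83

# The standard library gives the same results.
print(round(stdev(values), 2), round(pstdev(values), 2))   # 3.16 2.83
```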
