Univariate data - the basics#

This tutorial covers basic management and analysis of univariate data using plans.

Notebook setup#

For users running this tutorial as a Jupyter Notebook, this cell must be executed first:

import sys
from pathlib import Path
import pprint
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Install `plans` in `google.colab`.
# Use `pip install plans` for other environments.

if "google.colab" in sys.modules:
    import os
    os.system(f"{sys.executable} -m pip install -q plans")

# This avoids warnings related to uninstalled fonts
import logging
# Set the matplotlib font manager logger to only show errors (hides warnings)
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)

# define output folder
OUTPUT_DIR = Path("outputs/univariate")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Outputs will be saved to: ./{OUTPUT_DIR}")

Outputs will be saved to: ./outputs/univariate

The `Univar` object#

The Univar object is a basic class in the plans.analyst module. It is a child of the DataSet object, which lives in the plans.root module.

Univar provides the core methods for working with univariate data.

Import the Univar object:

from plans.analyst import Univar

Create an instance of the Univar:

unv = Univar(name="Testing Univar", alias="uni-tst")

Check out the variable type:

print(type(unv))

<class 'plans.analyst.Univar'>

Handle data attributes#

Inspect the state of the main attributes:

print(unv)

[Testing Univar (uni-tst)]
Univar (DS):	<class 'plans.analyst.Univar'>
      field          value
       name Testing Univar
      alias        uni-tst
       size            NaN
      color           blue
     source            NaN
description            NaN
      units          units
  file_data            NaN
Data:
None

Edit attributes directly and inspect again:

unv.description = "A testing univariate for a tutorial"
unv.units = "cm"
print(unv)

[Testing Univar (uni-tst)]
Univar (DS):	<class 'plans.analyst.Univar'>
      field                               value
       name                      Testing Univar
      alias                             uni-tst
       size                                 NaN
      color                                blue
     source                                 NaN
description A testing univariate for a tutorial
      units                                  cm
  file_data                                 NaN
Data:
None

Working with data#

Create synthetic gaussian dataset#

Let's first make a synthetic Gaussian dataset using numpy.random and save it to a CSV file using pandas.

# make synthetic gaussian data
np.random.seed(10) # ensure reproducibility
v = np.random.normal(loc=100, scale=10, size=1000)
df = pd.DataFrame({"variable": v})

# Export CSV file
file_csv = OUTPUT_DIR / "univariate_gauss.csv"
df.to_csv(file_csv, sep=";", index=False)  # index expects a boolean, not the string "False"
print(f"Saved to: {file_csv}")

Saved to: outputs/univariate/univariate_gauss.csv
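A quick way to confirm that an exported file has the expected structure is a round trip through pandas. The sketch below is independent of plans. It writes illustrative data to an in-memory buffer instead of a file and reads it back with the same separator. The data comes from a different random generator than the one used above, so the values differ from the tutorial's.

```python
import io

import numpy as np
import pandas as pd

# Illustrative data only (assumption: not the tutorial's exact values)
rng = np.random.default_rng(10)
df_out = pd.DataFrame({"variable": rng.normal(loc=100, scale=10, size=1000)})

# Write to an in-memory buffer with a semicolon separator
buffer = io.StringIO()
df_out.to_csv(buffer, sep=";", index=False)  # boolean False: the row index is not written

# Read it back and check that the structure survived the round trip
buffer.seek(0)
df_back = pd.read_csv(buffer, sep=";")
print(df_back.columns.tolist())
print(len(df_back))
```

With the row index left out, the file holds a single named column, which is the field name later passed to a loader.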

The whole dataset looks like this:

# get simple visualization
plt.plot(df.index, df['variable'])
plt.ylim([0, 200])
plt.show()

../_images/9e7a0c4919235a81affad6cdf9a267d3d74089541fcca72ded3fb4f900695c20.png

Loading data from the CSV file#

Call the .load_data() method to load data from the CSV file:

# reset the unv variable
unv = Univar(name="Testing Univar", alias="uni-tst")

unv.load_data(
    file_data=file_csv,  # file path
    input_varfield="variable",  # name of column/field
    in_sep=";",  # input separator
)

Data is stored in the .data attribute as a pandas.DataFrame:

unv.data

	v
0	113.315865
1	107.152790
2	84.545997
3	99.916162
4	106.213360
...	...
995	89.856608
996	96.681723
997	114.406974
998	96.097821
999	106.424506

1000 rows × 1 columns

Access computed metadata#

After loading data, some useful attributes are computed automatically.

Basic statistics are stored in .stats_df as a pandas.DataFrame:

unv.stats_df

	statistic	value
0	count	1000.000000
1	sum	99854.433644
2	mean	99.854434
3	sd	9.379578
4	min	67.955987
5	p01	78.678389
6	p05	84.521828
7	p10	87.959449
8	p20	92.072856
9	p25	93.609900
10	p40	97.618150
11	p50	99.765252
12	p60	102.116821
13	p75	106.010132
14	p80	107.549608
15	p90	111.678248
16	p95	114.846091
17	p99	122.426391
18	max	126.799103
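For readers who want to see how a table of this shape can be assembled, here is a minimal sketch using numpy and pandas only. It is not the plans implementation. The choice of the sample standard deviation (ddof=1) and numpy's default percentile interpolation are assumptions, and the data is illustrative, so the numbers will not match the table above.

```python
import numpy as np
import pandas as pd

# Illustrative data only (assumption: not the tutorial's exact values)
rng = np.random.default_rng(10)
values = rng.normal(loc=100, scale=10, size=1000)

# Same percentile levels as listed in the table above
percentiles = [1, 5, 10, 20, 25, 40, 50, 60, 75, 80, 90, 95, 99]

rows = [
    ("count", float(values.size)),
    ("sum", values.sum()),
    ("mean", values.mean()),
    ("sd", values.std(ddof=1)),  # assumption: sample standard deviation
    ("min", values.min()),
]
rows += [(f"p{p:02d}", np.percentile(values, p)) for p in percentiles]
rows.append(("max", values.max()))

stats_sketch = pd.DataFrame(rows, columns=["statistic", "value"])
print(stats_sketch)
```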

Data frequencies are stored in .freq_df as a pandas.DataFrame:

unv.freq_df

	Percentiles	Exceedance	Frequency	Empirical Probability	Values
0	0	100	1	0.001	67.955987
1	1	99	0	0.000	78.678389
2	2	98	0	0.000	80.584361
3	3	97	1	0.001	82.311344
4	4	96	0	0.000	83.409942
...	...	...	...	...	...
95	95	5	2	0.002	114.846091
96	96	4	2	0.002	116.115753
97	97	3	0	0.000	118.178726
98	98	2	0	0.000	120.658142
99	99	1	3	0.003	122.426391

100 rows × 5 columns

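The empirical CDF of the data is stored in .weibull_df as a pandas.DataFrame. The P(X) values shown are consistent with the Weibull plotting position, rank / (n + 1), with F(X) = 1 - P(X):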

unv.weibull_df

	Data	P(X)	F(X)
0	126.799103	0.000999	0.999001
1	126.716851	0.001998	0.998002
2	126.625775	0.002997	0.997003
3	124.676511	0.003996	0.996004
4	124.653251	0.004995	0.995005
...	...	...	...
995	74.885309	0.995005	0.004995
996	71.849387	0.996004	0.003996
997	71.839988	0.997003	0.002997
998	70.204032	0.998002	0.001998
999	67.955987	0.999001	0.000999

1000 rows × 3 columns
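The P(X) column above starts at 0.000999, which equals 1 / 1001, so the table appears to use the Weibull plotting position. The sketch below reproduces that construction with numpy on illustrative data. It is a reading of the output, not the plans source code.

```python
import numpy as np
import pandas as pd

# Illustrative data only (assumption: not the tutorial's exact values)
rng = np.random.default_rng(10)
values = rng.normal(loc=100, scale=10, size=1000)

# Sort from largest to smallest so that rank 1 is the maximum
data_sorted = np.sort(values)[::-1]
n = data_sorted.size
rank = np.arange(1, n + 1)

# Weibull plotting position: exceedance probability of each ranked value
p_exceed = rank / (n + 1)
# Non-exceedance probability (empirical CDF)
f_nonexceed = 1 - p_exceed

weibull_sketch = pd.DataFrame(
    {"Data": data_sorted, "P(X)": p_exceed, "F(X)": f_nonexceed}
)
print(weibull_sketch.head())
```

Dividing by n + 1 instead of n keeps both probabilities strictly between 0 and 1, so neither the largest nor the smallest observation is assigned an impossible or certain event.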

Visualizations#

Most plans objects come with built-in methods for producing visualizations, either inline or saved as figure files.

Standard visualization#

Get the standard visual using the .view() method:

unv.view()

../_images/6699cb7b508fd6e2bc7ef10c984ca58b0681bfd3b730fc1c814f6857de8de8c5.png

Fine-tuning visuals#

Use the .view_specs keys to fine-tune the visual. Some of the possibilities:

# Y axis range
unv.view_specs["range"] = [0, 200]
# Scatter
unv.view_specs["scatter_factor"] = 2 # occupy more space
# Color
unv.view_specs["color"] = "blue"
unv.view_specs["color_hist"] = "green"
# Decorations
unv.view_specs["plot_mean"] = False
unv.view_specs["plot_mean_data"] = True
unv.view_specs["title"] = "Hello! This is a tutorial!"
# call view again
unv.view()

../_images/d39ec0527f4bebe4e66541d460f08f63758bafbfce2f3d2bbd2272f49b342b10.png

Layouts available#

List available layout keys:

print(unv.layouts.keys())

dict_keys(['full', 'mini', 'simple', 'simple-shallow', 'default'])

unv.view_specs["layout"] = "mini"
unv.view()

../_images/6972e1825c47dd7eb4236fbe848ae9989b435939e6a3ff9cfe2b636f2933e4f4.png

unv.view_specs["layout"] = "simple"
unv.view()

../_images/407887453b221d4c09436c46a866b675fb0aa7aed661b7e0aecf62ad166650a4.png

unv.view_specs["layout"] = "simple-shallow"
unv.view_specs["title"] = None
unv.view()

../_images/78d5951100b26334d6c0369685e0949f0e6ba88bdffb3dbea33268cd8e5705ab.png

Analysis methods#

Normality assessment#

Assess whether the data is normally distributed via the .get_normality() method:

df = unv.get_normality(clevel=0.95)
df

	Test	Statistic	p-value	is_normal	Confidence
0	Kolmogorov-Smirnov	0.013838	0.989545	True	0.95
1	Shapiro-Wilk	0.998710	0.695039	True	0.95
2	D'Agostino-Pearson	0.477945	0.787436	True	0.95

Weibull distribution#

df = unv.get_cdf_weibull()
df

	Data	P(X)	F(X)
0	126.799103	0.000999	0.999001
1	126.716851	0.001998	0.998002
2	126.625775	0.002997	0.997003
3	124.676511	0.003996	0.996004
4	124.653251	0.004995	0.995005
...	...	...	...
995	74.885309	0.995005	0.004995
996	71.849387	0.996004	0.003996
997	71.839988	0.997003	0.002997
998	70.204032	0.998002	0.001998
999	67.955987	0.999001	0.000999

1000 rows × 3 columns

Gumbel distribution#

The method .get_cdf_gumbel() performs a full assessment for the Gumbel distribution.

Warning

This method assumes the data is a collection of maxima values.

dc_gumbel = unv.get_cdf_gumbel()

View model results:

df = dc_gumbel["Metadata"]
df

	Metadata	Value
0	N	1000
1	Gumbel a	95.633125
2	Gumbel b	7.313227
3	KS-test s-value	0.076734
4	KS-test p-value	0.000014
5	KS-test Is Gumbel (NullH)	False
6	QQ-plot c0	2.405764
7	QQ plot c1	0.975998
8	QQ-plot r2	0.970525
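The Gumbel a (location) and Gumbel b (scale) values above are consistent with a method-of-moments fit based on the mean and standard deviation reported earlier in .stats_df. The sketch below shows that calculation with the standard library only. Whether plans uses exactly this estimator is an inference from the numbers, not something the tutorial states.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

# Sample moments copied from the statistics table shown earlier
mean = 99.854434
sd = 9.379578

# Method-of-moments estimators for the Gumbel (maxima) distribution
b = sd * math.sqrt(6) / math.pi  # scale
a = mean - EULER_GAMMA * b       # location

print(f"a = {a:.4f}, b = {b:.4f}")
```

The results are approximately a = 95.6331 and b = 7.3132, in line with the metadata table. The very small KS-test p-value in that table is expected here, since the synthetic data is Gaussian and not a sample of maxima.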

View data:

df = dc_gumbel["Data"]
df

	v	Rank	P(X)_Empirical	P(X)_Weibull	T(X)_Weibull	P(X)_Gringorten	P(X)_Gumbel	T(X)_Gumbel	T(X)_Gumbel_SE90	T(X)_Gumbel_P05	T(X)_Gumbel_P95
0	126.799103	1	0.001	0.000999	1001.000000	0.000560	0.014001	71.423884	1.784416	56.068652	91.022575
1	126.716851	2	0.002	0.001998	500.500000	0.001560	0.014158	70.630695	1.780045	55.480053	89.956708
2	126.625775	3	0.003	0.002997	333.666667	0.002560	0.014334	69.762759	1.775206	54.835575	88.791140
3	124.676511	4	0.004	0.003996	250.250000	0.003560	0.018671	53.557703	1.671825	42.715790	67.184470
4	124.653251	5	0.005	0.004995	200.200000	0.004559	0.018730	53.389231	1.670594	42.588833	66.961535
...	...	...	...	...	...	...	...	...	...	...	...
995	74.885309	996	0.996	0.995005	1.005020	0.995441	1.000000	1.000000	1.172041	1.000000	1.000000
996	71.849387	997	0.997	0.996004	1.004012	0.996440	1.000000	1.000000	1.328611	1.000000	1.000000
997	71.839988	998	0.998	0.997003	1.003006	0.997440	1.000000	1.000000	1.329099	1.000000	1.000000
998	70.204032	999	0.999	0.998002	1.002002	0.998440	1.000000	1.000000	1.414199	1.000000	1.000000
999	67.955987	1000	1.000	0.999001	1.001000	0.999440	1.000000	1.000000	NaN	NaN	NaN

1000 rows × 11 columns
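Several columns in this table can be checked by hand for the first row (the largest value, rank 1). The sketch below assumes the usual formulas: the Weibull and Gringorten plotting positions for the empirical exceedance probabilities, and the Gumbel CDF for the modelled one. The formulas are inferred from the printed values and are not taken from the plans source.

```python
import math

# Gumbel parameters copied from the metadata table
a, b = 95.633125, 7.313227

n = 1000        # sample size
x = 126.799103  # largest value in the dataset
rank = 1        # rank of x when sorted in descending order

# Empirical exceedance probabilities from plotting positions
p_weibull = rank / (n + 1)
p_gringorten = (rank - 0.44) / (n + 0.12)

# Modelled exceedance probability: 1 - F(x) for the Gumbel CDF
z = (x - a) / b
p_gumbel = 1 - math.exp(-math.exp(-z))

# Return period is the inverse of the exceedance probability
t_weibull = 1 / p_weibull
t_gumbel = 1 / p_gumbel

print(f"P Weibull    = {p_weibull:.6f}, T = {t_weibull:.1f}")
print(f"P Gringorten = {p_gringorten:.6f}")
print(f"P Gumbel     = {p_gumbel:.6f}, T = {t_gumbel:.2f}")
```

The gap between the empirical return period (about 1001) and the Gumbel one (about 71) for the same value reflects the poor fit reported in the metadata table.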

View return periods T(X):

df = dc_gumbel["Data_T(X)"]
df

	T(X)_Gumbel	v	v_P25	v_P75	v_P05	v_P95
0	2	98.313631	98.129944	98.497319	97.865427	98.761836
1	3	102.235039	102.001629	102.468448	101.665511	102.804566
2	4	104.744783	104.469400	105.020167	104.072837	105.416730
3	5	106.602640	106.293272	106.912009	105.847769	107.357511
4	6	108.080229	107.742617	108.417842	107.256442	108.904016
...	...	...	...	...	...	...
994	996	146.118235	144.961888	147.274581	143.296704	148.939765
995	997	146.125577	144.969068	147.282087	143.303650	148.947505
996	998	146.132913	144.976241	147.289584	143.310589	148.955236
997	999	146.140240	144.983406	147.297074	143.317521	148.962960
998	1000	146.147561	144.990565	147.304557	143.324446	148.970676

999 rows × 6 columns
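The v column can be approximated by inverting the Gumbel CDF at the exceedance probability 1 / T. The sketch below uses the fitted parameters from the metadata table. The helper name gumbel_quantile is hypothetical and not part of plans, and the percentile bands (v_P05, v_P95 and so on) are not reproduced because the tutorial does not say how they are computed.

```python
import math

# Gumbel parameters copied from the metadata table
a, b = 95.633125, 7.313227


def gumbel_quantile(t, loc, scale):
    """Value with return period t under a Gumbel (maxima) model.

    Hypothetical helper for illustration only.
    """
    p_exceed = 1 / t
    return loc - scale * math.log(-math.log(1 - p_exceed))


for t in (2, 10, 100, 1000):
    print(t, round(gumbel_quantile(t, a, b), 3))
```

For T = 2 and T = 1000 the results fall within about 0.001 of the tabulated v values (98.313631 and 146.147561).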