Tutorial: inflating ensembles (pre-binned data format)
While ensembles are a useful data format for storing nested histograms, they are somewhat unwieldy and require many special functions to handle. The largest problem, however, is that they do not work well with vector-based operations and require recursive traversal by the convolution code.
To combat this we can inflate ensembles by changing their nested structure to a flattened format.
[1]:
import json
import pkg_resources
from syntheticstellarpopconvolve.ensemble_utils import convert_ensemble_to_dataframe
# load the data
example_ensemble_filename = pkg_resources.resource_filename(
    "syntheticstellarpopconvolve", "example_data/example_ensemble.json"
)
with open(example_ensemble_filename, "r") as f_ensemble:
    ensemble = json.loads(f_ensemble.read())
One example of binned data is the Ensemble data type generated by binary_c.
Ensemble-based data is stored as nested dictionaries. An example of ensemble-based data looks like this:
"Xyield": {
    "time": {
        "-0.1": {
            "source": {
                "Wind": {
                    "isotope": {
                        "Al27": 1.3202421292393783e-08,
                        "Ar36": 1.7624018781546946e-08,
                        "Ar38": 3.502033439864038e-09,
                        "Ar40": 5.758546201573555e-12,
                        "B10": 2.4295643555965993e-13,
                        "B11": 1.0109986571758494e-12,
                        "Be9": 3.7843822119497306e-14,
                        [...]
                    }
                }
                [...]
            }
        }
        [...]
    }
}
With binary_c-python this type of data can be generated through the options explained in the ensemble-data logging notebook.
To use this type of data, however, one must first transform it to a different shape. In particular, one must inflate the ensemble, turning it from a nested dictionary into a rectangular data format. How to do so is covered in XXX (TODO: refer to notebook).
We can then inflate this ensemble by using the convert_ensemble_to_dataframe function.
It is best to already know the structure of the ensemble, so you know exactly which subtree you want to take, and whether it contains named layers. If the ensemble contains named layers, the structure should be named_layer_1, value_layer_1, ... named_layer_n, value_layer_n, normalized_yield_layer_n.
inflated_ensemble = convert_ensemble_to_dataframe(
    ensemble_data,  # subtree of ensemble
    verbose=False,  # flag to show info while inflating
    contains_named_layers=True,  # flag to indicate whether the ensemble contains named layers (i.e. those that indicate what is in the next layer)
)
In particular, if you indicate that the ensemble contains named layers, the first layer must be a named layer. If it is not, or if the structure otherwise deviates from the expected pattern, the read-out will be misaligned. In that case, please double-check that you provided the correct subtree, or wrap it in {'ensemble': ensemble_data}.
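To make the named-layer structure concrete, here is a minimal pure-Python sketch of what inflating such a tree amounts to. It is not the implementation of convert_ensemble_to_dataframe; the helper name flatten_named_layers and the toy values are hypothetical and serve only as an illustration.

```python
def flatten_named_layers(tree, row=None):
    """Recursively flatten a named-layer nested dict into a list of row dicts.

    Hypothetical helper for illustration; assumes the tree strictly alternates
    named layers (e.g. "time") and value layers (e.g. "-0.1").
    """
    row = {} if row is None else row
    rows = []
    for layer_name, values in tree.items():
        for key, subtree in values.items():
            # The named layer becomes the column, the value-layer key the entry.
            new_row = {**row, layer_name: key}
            if isinstance(subtree, dict):
                # Deeper named layers follow: keep descending.
                rows.extend(flatten_named_layers(subtree, new_row))
            else:
                # Leaf reached: store the number in the final column.
                new_row["probability"] = subtree
                rows.append(new_row)
    return rows


# Toy ensemble with made-up numbers, shaped like the example above.
toy_ensemble = {
    "time": {
        "-0.1": {"source": {"Wind": {"isotope": {"Al27": 1.0e-08, "Ar36": 2.0e-08}}}},
        "0.0": {"source": {"Wind": {"isotope": {"Al27": 3.0e-08}}}},
    }
}

rows = flatten_named_layers(toy_ensemble)
for flat_row in rows:
    print(flat_row)
```

Each leaf of the tree becomes one row, and each named layer becomes one column. If the first layer passed in were a value layer instead of a named layer, the column names and entries would be swapped, which is the misalignment described above.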
[2]:
inflated_ensemble = convert_ensemble_to_dataframe(
    ensemble_data=ensemble["ensemble"]["Xyield"],
    verbose=False,
    contains_named_layers=True,
)
print(inflated_ensemble.head())
   time source isotope probability
0  -0.1   Wind    Al27         0.0
1  -0.1   Wind    Ar36         0.0
2  -0.1   Wind    Ar38         0.0
3  -0.1   Wind    Ar40         0.0
4  -0.1   Wind     B10         0.0
By default, the final column of the inflated ensemble is called 'probability', but the data it holds can of course be anything, depending on the pop-synth simulation output.
Moreover, all columns are string-based by default (except for the final layer). This is because, when reading out the ensemble, we do not want to impose any type on the data: everything can be converted to a string, but not everything can be converted to a numerical type.
[3]:
print(inflated_ensemble.dtypes)
time object
source object
isotope object
probability object
dtype: object
After inflating the ensemble, you should convert the columns to their actual types:
[4]:
inflated_ensemble = inflated_ensemble.astype({'time': 'float'})
print(inflated_ensemble.dtypes)
time float64
source object
isotope object
probability object
dtype: object
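The output above shows that only the 'time' column was converted. The sketch below applies the same kind of conversion to every numeric column of a small toy frame. The frame and its values are made up for illustration, and it assumes that every entry in the final column parses as a number.

```python
import pandas as pd

# Toy frame mimicking the inflated ensemble: every column starts as object dtype.
toy = pd.DataFrame(
    {
        "time": ["-0.1", "0.0"],
        "source": ["Wind", "Wind"],
        "isotope": ["Al27", "Ar36"],
        "probability": [1.0e-08, 3.0e-08],
    },
    dtype=object,
)

# Convert the string-valued time column, as in the cell above.
toy = toy.astype({"time": "float"})

# Assumption: the final column holds only numbers, so pd.to_numeric can convert it.
toy["probability"] = pd.to_numeric(toy["probability"])

print(toy.dtypes)
```

Columns such as 'source' and 'isotope' hold labels and can stay as they are.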
[5]:
print(10**inflated_ensemble['time'])
0 0.794328
1 0.794328
2 0.794328
3 0.794328
4 0.794328
...
102787 15848.931925
102788 15848.931925
102789 15848.931925
102790 15848.931925
102791 15848.931925
Name: time, Length: 102792, dtype: float64
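Once the data is flat and typed, whole-column operations like the one above replace recursive traversal. The following sketch uses a made-up toy frame (not the example ensemble) to show an element-wise operation and a grouped sum over the final column.

```python
import pandas as pd

# Toy frame with made-up values, shaped like a typed inflated ensemble.
toy = pd.DataFrame(
    {
        "time": [-0.1, -0.1, 0.0, 0.0],
        "isotope": ["Al27", "Ar36", "Al27", "Ar36"],
        "probability": [1.0, 2.0, 3.0, 4.0],
    }
)

# Element-wise operation on the whole column, as in the cell above.
exponentiated_time = 10 ** toy["time"]

# Sum the final column per isotope across all time bins in a single call.
total_per_isotope = toy.groupby("isotope")["probability"].sum()

print(exponentiated_time)
print(total_per_isotope)
```

With the nested dictionary format, the same sum would require walking every time and source branch by hand.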