Tutorial: Star-formation convolution

This notebook serves to explain how to use the basic features of the star-formation convolution code. We provide a step-by-step explanation of the convolution, including examples of code. More advanced features and fleshed-out use-cases are covered in other notebooks.

First, we cover some of the concepts of convolving star-formation rates with population-synthesis results in Background on convolution, and then we dive into the technical details of how to use the star-formation rate convolution code in Convolution with Synthetic Stellar Pop Convolve (SSPC).

Similar codes that are on the market are:

Cogsworth (Wagg) https://arxiv.org/abs/2409.04543
SynthPop Kluter et al https://arxiv.org/abs/2411.18821
Galaxia https://galaxia.sourceforge.net/Galaxia3pub.html

Background on convolution

With stellar population-synthesis codes one often calculates statistics of populations of stellar systems, but these results often are expressed in quantities ‘per formed stellar mass’, i.e. these results need to be combined with a star formation rate to get an actual predicted number of stellar systems. Some useful (recent) literature on the concept of convolution of starformation rates with population-synthesis results, analytic descriptions and formulae related to the convolution, or technical implementations, are:

Our convolution code aims to perform these calculations

\[\begin{split}\mathcal{R}_{\mathrm{\bhbh}}(z_{\mathrm{merge}},\ \zeta) = \int_{Z_{\mathrm{min}}}^{Z_{\mathrm{max}}} dZ \int_{0}^{t^{*}_{\mathrm{first\, SFR}}-t^{*}_{\mathrm{merge}}}dt_{\mathrm{delay}}\\ \mathcal{N}_{\mathrm{form}}(Z,\,t_{\mathrm{delay}},\,\zeta) \times \mathrm{SFR}(Z,\,z_{\mathrm{birth}}).\end{split}\]

When the starformation is expressed as a function of redshift, we perform a set of conversions between event-redshift and birth-redshift, using the delay time of the system/ensemble

\[R_{\mathrm{merge},\,i},\ z_{\mathrm{merge}} = p_{i}\,\times\,f_{\mathrm{bin}}\,\times\,\frac{\mathrm{SFR}(z_{\mathrm{birth},\,i},\ Z_{i})\,\times\,dZ_{i}}{\left<M\right>},\]

Convolution with Synthetic Stellar Pop Convolve (SSPC)

At a high level, the convolution code has three stages.

setup: sequential preparation of the hdf5 file. A set of steps including putting the event and ensemble (population) data in the correct structure and format, providing settings used to generate the population data, generating the star-formation rate grid, setting up the output file etc.
convolution: multi-processed multiplication of events or ensembles with star-formation rates (densities). The convolution is handled by finding the star-formation rate bin in which a system/ensemble falls if we want the system/ensemble with delay time \(t_{\mathrm{delay}}\) to have an event at some target lookback time/redshift \(t_{\mathrm{target\, lookback}}\).
storage: sequential storage of all the results into the output hdf5 file. The results of the convolution at each target time are stored in temporary files before they are written into the output hdf5 file.

To convolve population-synthesis results we need to go through the following steps and sections:

Construction of the input file: we first need to construct the convolution input file appropriately. This means storing the results in a certain structure in an HDF5 file.
Global configuration of the convolution: we then need to provide a global configuration for the convolution, which includes specifying the star-formation history, as well as specifics about the time-stepping. For a complete overview of the configuration options see this section
Specific convolution instructions: after the global configuration we need to provide specific convolution instructions, which contain specifics of which dataset is convolved, which column-names or ensemble-depths are associated with the necessary quantities for convolution (i.e. delay time, metallicity, yield), and other additional things like extra weights (e.g. detection probabilities).
Running the convolution: when the above steps are finished we can convolve the data.
Reading the output file: After the convolution, the data is stored in the output hdf5 file. There is a particular structure to this output data, which is explained in this section.

At the end of the notebook we have a section that describes all the global configuration options, as well as a section that contains a complete convolution script

The codebase for the convolution functionality resides in syntheticstellarpopconvolve/, and we suggest having a look through the code to make yourself familiar with how things work.

[1]:

import os
import json
import copy
import numpy as np
import astropy.units as u

from syntheticstellarpopconvolve import convolve, default_convolution_config, default_convolution_instruction
from syntheticstellarpopconvolve.general_functions import temp_dir


TMP_DIR = temp_dir("notebooks", "notebook_convolution", clean_path=True)
VERBOSITY = 0

Construction of the output file

To convolve the data we first need to construct an input file that contains the correct information, and stores the input data in a specific way, and includes some extra information like the population settings that were used to generate the data. The file format we use for the convolution output files are hdf5 files.

Lets construct the file using the utility function generate_boilerplate_outputfile, which creates the file as well as certain data groups within this file.

[2]:

import h5py
import pkg_resources
import pandas as pd
from syntheticstellarpopconvolve.general_functions import generate_boilerplate_outputfile

# create file
output_hdf5_filename = os.path.join(TMP_DIR, 'input_hdf5.h5')
generate_boilerplate_outputfile(output_hdf5_filename)

Supported data formats and storing data

After creating the input file itself we need to fill it with the data we want to convolve. We show how to do that in the following section.

While this code is agnostic as to where the data comes from, i.e. any population synthesis code that generates the data in the correct format would be suitable, we indicate methods to generate this data with binary_c / binary_c-python.

Note that since we do not provide any methods to filter the data during the convolution, what is provided will be convolved. If a set of events contains more than you need (e.g. you’re only interested in systems that form SN1a) then its best to filter out the rest beforehand.

event-based data

Event-based data is stored in pandas dataframes, and has the form of a rectangular, line-by-line, data format.

An example of event-based data looks like this:

uuid    event_type  zams_mass_1 zams_mass_2 zams_orbital_period [...]
B2D5742C-C96C-42EB-8A44-754E4D9E867F    RLOF    29.9112 3.8264  6.58795 [...]
A2339A8F-12A4-4F8F-B594-99F22B00360C    RLOF    11.8937 4.52266 6.58795 [...]
[...]

With binary_c-python this type of data can be generated through the options explained in the event-based logging notebook.

ensemble-based data

Ensemble-based data is stored as nested dictionaries. An example of ensemble-based data looks like this:

"Xyield": {
    "time": {
        "-0.1": {
            "source": {
                "Wind": {
                    "isotope": {
                        "Al27": 1.3202421292393783e-08,
                        "Ar36": 1.7624018781546946e-08,
                        "Ar38": 3.502033439864038e-09,
                        "Ar40": 5.758546201573555e-12,
                        "B10": 2.4295643555965993e-13,
                        "B11": 1.0109986571758494e-12,
                        "Be9": 3.7843822119497306e-14,
                        [...]
                        }
                    }
                [...]
                }
            }
        [...]
        }
    }

With binary_c-python this type of data can be generated through the options explained in the ensemble-data logging notebook.

To use this type of data, however, one must first transform it to a different shape. In particular one must inflate the ensemble, turning it from a nested dictionairy to a rectangular data format. How to do so is covered in XXX (TODO: refer to notebook).

Storing the data

Data is supposed to be stored in an hdf5 file under a particular key structure: input_data/<input_data_name> where <input_data_name> is the name associated with that particular set of event data (like RLOF events). <input_data_name> is something the user can decide for themselves, and will be used to retrieve the data during the convolution.

[3]:

# load as RLOF event-data as pandas df
example_RLOF_events_filename = pkg_resources.resource_filename(
    "syntheticstellarpopconvolve", 'example_data/example_RLOF_event_data.dat'
)
RLOF_df = pd.read_csv(example_RLOF_events_filename, header=0, sep="\s+")

# store the dataframe in the hdf5file
RLOF_df.to_hdf(
    output_hdf5_filename, key="input_data/{}".format("RLOF")
)

Configuration of the convolution

After setting up the input file for the convolution, we need to provide a configuration for the convolution.

for a complete list of configuration options, please see the descriptions of configuration options section and the online documentation for config options

[4]:

#
convolution_config = copy.copy(default_convolution_config)
convolution_config['output_filename'] = output_hdf5_filename
convolution_config['tmp_dir'] = TMP_DIR

The next step is to provide the code with a star-formation rate that is used to convolve the data with. We explain the details of this in a different notebook, to avoid this one from becoming too large

We remain rather agnostic as to where these come from and how they are constructed, but we do provide some distributions that you can use to construct them. These are available in syntheticstellarpopconvolve.starformation_rate_distributions and syntheticstellarpopconvolve.metallicity_distributions.

TODO: refer to other sectin for SFR dicts

[5]:

# set up SFR dict
convolution_config["SFR_info"] = {
    'lookback_time_bin_edges': np.arange(0, 10, 1) * u.yr,
    'starformation_rate_array': np.arange(0,  9) ** 2 * u.Msun / u.yr
}

Configuring convolution bins

[6]:

# set convolution time bins
convolution_config["convolution_lookback_time_bin_edges"] = np.arange(0, 3, 1) * u.yr

After configuring the global convolution behaviour, we need to instruct the code how to convolve specific datasets (possibly with different data structures). This is done through the convolution_instructions. This entry in the config needs to contain a list of dictionaries that each contain a set of instructions for the convolution. It is possible to provide several dictionaries, which will be handled sequentially in the code.

an example would be as such:

convolution_config['convolution_instructions'] = [
    {
        'input_data_name': 'RLOF',
        'output_data_name': 'RLOF_rate',
        ... (see below)
    },
]

Each convolution_instruction dict in that list needs to contain at least the following entries:

input_data_name: the name of the dataset that is to be convolved. See the construction of the input file
output_data_name: the output name for this convolution. This gives an additional name to output, allowing us to convolve the same input dataset multiple times with different configurations.
data_column_dict: Dictionary containing mapping of input-data column names to those required by the convolution code. See below for more details

And the following optional entries:

post_convolution_function: A function that will be called after the convolution step. TODO refer to specific section

Then, depending on the input type, several other entries must (or may) be provided to the convolution_instruction entry. Below we list these according to the input-data type. More importantly, there are some restrictions as well

convolution type	General support	Time-type support
integration	Yes	lookback time & redshift
sampling	Yes

event-based data

event-based data (i.e. line-based, with each line containing an event, see event-based logging documentation and notebook on how to generate that with binary_c-python) needs information about which columns to use. These should be provided through a dictionary called data_column_dict.

Within this dictionary, the following entries must be present

delay_time: column name associated with the delay time of the event (time between the birth of the system to the occurrence of the event). Assumed to be expressed in years.
normalized_yield: column name associated with the yield rate of the systems. This column gets multiplied by the star formation rate (density), and so is expected to already be converted into something per solar mass formed into stars.

optionally, one can add

metallicity: column name associated with the metallicity of the systems.

As well as other columns that can be used in the extra_weight calculation (see TODO)

The data_column_dict values should either be of string type or a dictionary that contains (some of) the following items:

column_name: integer indicating the depth of the layer that contains this data

As an example:

convolution_config['convolution_instructions'] = [
    {
        ...,
        'data_column_dict': {
            # required
            'delay_time': 'initial_time',
            'normalized_yield': 'probability',
            # optional*
            'metallicity': 'metallicity',
            # optional**
            'mass_1': 'RLOF_initial_mass_accretor',
            'mass_2': 'RLOF_initial_mass_donor',
        },
    },
]

additional configuration of the convolution-instructions are covered in the advanced-usage notebook.

Post-convolution processing function

The convolution instruction can be supplemented with a post-convolution function. This can be used for a range of reasons, e.g. filtering of the data to only include particular systems, providing additional detection-probability weighting to go from intrinsic rates to predicted observed rates, further evolution of sampled systems to the present-day through gravitational-wave radiation, orbit integration (including kicks), etc.

The user can provide the convolution_instructions dictionary with an additional function that performs some additional calculations. This is done as follows. The extra_weights_function should be stored in the convolution_instructions as convolution_instructions['post_convolution_function']. This function should return an array of additional weights.

The arguments to this function do not require any specific parameter, but the convolution code internally allows the following arguments to be passed to the function:

config: general configuration dictionary
time_value: the time value of the current time-bin (can be redshift as well, but must be named time_value.
convolution_instruction: convolution instruction dictionary.
job_dict: dictionary containing current job information
sfr_dict: dictionary containing the star formation rate dictionary
convolution_results: dictionary containing the results of the current convolution step. This one is rather important. TODO: likely is required.
data_dict: dictionary that contains the data. This will be filled with the keys and data that either data_layer_dict or data_column_dict instructs the convolution code to use.
**convolution_instruction["post_convolution_function_extra_parameters"]: additional parameters and values that the user can provide.

Which of these arguments gets passed depends on the arguments to the post_convolution_function, which are automatically recognized and passed to the function.

This function should return an updated result_dict dictionary, where the results are updated accordingly. Apart from updating the yield field, which contains the conventional convolution results, one (in some cases, see below) can add extra fields to this dictionary, like its position in space.

Depending on the type of convolution, and the data type, the shape (length) of the entries in this dictionary can have different requirements.

This is to maintain correct data-integrity.

convolution by integration: length of arrays has to be the same before as after
convolution by sampling: no restrictions

It is also possible to return several result_dicts, but these need to be supplied with a name to identify and store the data properly. An example of a use-case for returning several return_dicts is when convolving event-based data by sampling, integrating the sampled systems forward in time and evolving the orbit of the binary system. When a supernova kick unbinds the binary, the system will turn into 2 individual stars, which requires storing more data to track both stars. In this case it would make sense to return one result_dict that contains bound binary systems, and one result_dict containing information for both unbound stars. data-integrity requirements outlined above still count.

An example of a post-convolution function call is as follows:

def post_convolution_function_detection_probability(config, time_value, data_dict, min_detection_probability):
    # just an example, this is not how detection probability is calculated
    detection_probability = data_dict['mass_1'] * data_dict['mass_2'] * time_value

    detection_probability[detection_probability<min_detection_probability] = 0

    return detection_probability

#
convolution_config['convolution_instructions'] = [
    {
        'input_data_name': 'RLOF',
        'output_data_name': 'RLOF_rate',
        'data_column_dict': {
            # required
            'delay_time': 'initial_time',
            'normalized_yield': 'probability',
            # optional*
            'metallicity': 'metallicity',
            # optional**
            'mass_1': 'RLOF_initial_mass_accretor',
            'mass_2': 'RLOF_initial_mass_donor',
        },
        'post_convolution_function': post_convolution_function_detection_probability,
        'post_convolution_function_extra_parameters': {'min_detection_probability': 1e-6}
    },
]

PS: the use of post_convolution_function_extra_parameters is somewhat unnecessary, as one can either hardcode the additional parameters within the function, or make use of the partial method (see https://www.geeksforgeeks.org/partial-functions-python/)

Some useful resources:

https://arxiv.org/abs/2404.16930 for GW detection
https://gaiaunlimited.readthedocs.io/ for gaia selection functions

Running the convolution

Before convolve your data, please make sure to follow the above steps.

Then, to convolve the data, we should provide a correct configuration dict.

[7]:

convolution_config['convolution_instructions'] = [
    {
        **default_convolution_instruction,
        'input_data_name': 'RLOF',
        'output_data_name': 'RLOF_rate',
        'data_column_dict': {
            # required
            'delay_time': 'initial_time',
            'normalized_yield': 'probability',
            # # optional*
            # 'metallicity': 'metallicity',
            # optional**
            'mass_1': 'initial_mass_accretor',
            'mass_2': 'initial_mass_donor',
        },
    },
]

# convolve
convolve(config=convolution_config)

#
with h5py.File(output_hdf5_filename, 'r') as output_hdf5_file:
    print(output_hdf5_file.keys())

<KeysViewHDF5 ['config', 'input_data', 'output_data']>

Reading and understanding the output

The results of the convolution are stored in the output_data group. The tree structure of this group, for a given convolution instruction result, is as:

output_data/
    <optional: sfr-name>/
            <input_data_name>/
                <output_data_name>/
                    convolution_result/
                        <optional: convolution-result output name>/
                            <lookback time or redshift value>

Let us go over each of these to understand them:

output_data/: Main group-name for the output data.
<optional: sfr-name>/: When multiple sfr-dicts are supplied in config[‘SFR_info’], the results of the convolution for each of these are stored based on the name of that sfr-dict in this subtree. TODO: link to sfr-dict notebook/section.
<input_data_name>: Signifies the input_data_name, and depends on what is passed in in the convolution-instruction. TODO: link to convolution-instruction notebook/section.
<output_data_name>: Signifies the output_data_name, and depends on what is passed in in the convolution-instruction. TODO: link to convolution-instruction notebook/section.
convolution_result: main sub-group name for the output data.
<optional: convolution-result output name>/: Possible when the user returns multiple result dicts (combined in a list) as the output of the post-convolution function. TODO: link to post-convolution notebook section.
<lookback time or redshift value>: Time-type value of the results of the convolution.

Within this group the convolution results are stored, with fields containing the relevant data as subgroups. Any units that are associated to these fields are stored in the ‘attrs’ of that hdf5-group, and can be read out as follows:

import json
import astropy.units as u

groupname = ... # f"output_data/event/stochastic_example/stochastic_example/convolution_results/{formation_time_bin_key}"

# convert units
unit_dict = json.loads(
    output_hdf5_file[
        groupname
    ].attrs["units"]
)
unit_dict = {key: u.Unit(val) for key, val in unit_dict.items()}

Different types of input-data also each have somewhat different content stored in their output-group which can be read out as follows:

Output for this type of data, unless supplemented by a post-convolution function, contains just the ‘yield’ field (i.e. the normalized_yield of the systems times the star formation (rate). This array can directly be attached to the input data dataframes as a column, as it maintains the same order and length.

Output of this type of data, unless supplemented by a post-convolution function, contains the indices field, which can be used to retrieve the input-data, as well as a sampled_formation_lookback_times which contains assigned formation lookback-times. TODO: what about redshifts?

TODO: explain multiple sfr-structure TODO: explain multiple convolution-result structure TODO: explain how

Lets have a look in the example convolution file.

[12]:

with h5py.File(output_hdf5_filename, 'r') as output_hdf5_file:
    print(output_hdf5_file['output_data/'].keys())
    print(output_hdf5_file['output_data/RLOF/'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate/convolution_results'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate/convolution_results/0.5 yr'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate/convolution_results/0.5 yr/yield'][()])

<KeysViewHDF5 ['RLOF']>
<KeysViewHDF5 ['RLOF_rate']>
<KeysViewHDF5 ['convolution_results']>
<KeysViewHDF5 ['0.5 yr', '1.5 yr']>
<KeysViewHDF5 ['yield']>
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]

Descriptions of convolution configuration options and convolution instruction options

Both the convolution configuration and the convolution instructions have many options. These are all visible in default_convolution_config_descriptions and default_convolution_instruction_descriptions, and available (latest release) on https://synthetic-stellar-pop-convolve.readthedocs.io/en/latest/default_convolution_config.html

[18]:

from syntheticstellarpopconvolve.default_convolution_config import default_convolution_config_descriptions
from syntheticstellarpopconvolve.default_convolution_instruction import default_convolution_instruction_descriptions

# for key in sorted(default_convolution_config_descriptions.keys()):
#     print("Key: {}\nDescription: {}\n".format(key, default_convolution_config_descriptions[key]))

# for key in sorted(default_convolution_instruction_descriptions.keys()):
#     print("Key: {}\nDescription: {}\n".format(key, default_convolution_instruction_descriptions[key]))

Complete convolution script

An example of a complete script is provided below. We will make use of the example data and convolve both event-based data and ensemble-based data.

[20]:

import os
import json
import copy
import h5py
import pkg_resources

import numpy as np
import pandas as pd
import astropy.units as u

from syntheticstellarpopconvolve import convolve, default_convolution_config, default_convolution_instruction
from syntheticstellarpopconvolve.general_functions import temp_dir

TMP_DIR = temp_dir("notebooks", "notebook_convolution", clean_path=True)

# create file
output_hdf5_filename = os.path.join(TMP_DIR, "output_hdf5.h5")
generate_boilerplate_outputfile(output_hdf5_filename)

# load RLOF data as pandas dataframe
example_RLOF_events_filename = pkg_resources.resource_filename(
    "syntheticstellarpopconvolve", 'example_data/example_RLOF_event_data.dat'
)
RLOF_df = pd.read_csv(example_RLOF_events_filename, header=0, sep="\s+")

# store the dataframe in the hdf5file
RLOF_df.to_hdf(
    output_hdf5_filename, key="input_data/RLOF"
)

#################
# Global configuration
convolution_config = copy.copy(default_convolution_config)
convolution_config['output_filename'] = os.path.join(TMP_DIR, 'output_hdf5.h5')
convolution_config['tmp_dir'] = TMP_DIR
convolution_config["convolution_lookback_time_bin_edges"] = (
    np.array([0, 1, 2, 3, 4]) * u.yr
)

# convolution instructions
convolution_config['convolution_instructions'] = [
    {
        **default_convolution_instruction,
        'input_data_name': 'RLOF',
        'output_data_name': 'RLOF_rate',
        'data_column_dict': {
            # required
            'delay_time': 'initial_time',
            'normalized_yield': 'probability',
            # # optional*
            # 'metallicity': 'metallicity',
            # optional**
            'mass_1': 'initial_mass_accretor',
            'mass_2': 'initial_mass_donor',
        },
    }
]

# Set up SFR
convolution_config["SFR_info"] = {
    "lookback_time_bin_edges": np.array([0, 1, 2, 3, 4, 5]) * u.yr,
    "starformation_rate_array": np.array([1, 1, 1, 1, 1]) * u.Msun / u.yr,
}

# convolve
convolve(config=convolution_config)

# Show some of the content
with h5py.File(convolution_config['output_filename'], 'r') as output_hdf5_file:
    print(output_hdf5_file['output_data/'].keys())
    print(output_hdf5_file['output_data/RLOF/'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate/convolution_results'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate/convolution_results/0.5 yr'].keys())
    print(output_hdf5_file['output_data/RLOF/RLOF_rate/convolution_results/0.5 yr/yield'][()])

<KeysViewHDF5 ['RLOF']>
<KeysViewHDF5 ['RLOF_rate']>
<KeysViewHDF5 ['convolution_results']>
<KeysViewHDF5 ['0.5 yr', '1.5 yr', '2.5 yr', '3.5 yr']>
<KeysViewHDF5 ['yield']>
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.]