Utilities¶

Helper functions and utilities.

Module Reference¶

`ascicat/utils.py`: utility functions for the ASCICat package. Helper functions for data manipulation, validation, and formatting.

Functions¶

`format_catalyst_name(row)` ¶

Format catalyst name from composition

PARAMETER	DESCRIPTION
`row`	Data row containing composition info TYPE: `Series`

RETURNS	DESCRIPTION
`str`	Formatted catalyst name

`format_surface(row)` ¶

Format surface description

PARAMETER	DESCRIPTION
`row`	Data row containing surface info TYPE: `Series`

RETURNS	DESCRIPTION
`str`	Formatted surface description

`calculate_distance_from_optimal(delta_E, optimal_E)` ¶

Calculate deviation from optimal binding energy

PARAMETER	DESCRIPTION
`delta_E`	Adsorption energy (eV) TYPE: `float or array-like`
`optimal_E`	Optimal binding energy (eV) TYPE: `float`

RETURNS	DESCRIPTION
`float or ndarray`	Absolute deviation from optimum (eV)
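The description above suggests an elementwise absolute difference. A minimal sketch under that assumption follows; `distance_from_optimal_sketch` is a hypothetical stand-in rather than the package's implementation, and the energies are arbitrary illustrative values.

```python
import numpy as np

def distance_from_optimal_sketch(delta_E, optimal_E):
    """Hypothetical stand-in: absolute deviation |delta_E - optimal_E| in eV."""
    # np.asarray lets the same code accept scalars, lists, and arrays.
    return np.abs(np.asarray(delta_E, dtype=float) - optimal_E)

# Illustrative adsorption energies (eV) compared against an assumed optimum.
deviations = distance_from_optimal_sketch([-0.5, -0.25, 0.0], optimal_E=-0.25)
print(deviations)
```

A smaller deviation means the adsorption energy sits closer to the chosen optimum.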

`normalize_scores(scores)` ¶

Min-max normalize scores to [0, 1]

PARAMETER	DESCRIPTION
`scores`	Raw scores TYPE: `Series`

RETURNS	DESCRIPTION
`Series`	Normalized scores [0, 1]
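Min-max normalization maps the smallest value to 0 and the largest to 1 by computing (x - min) / (max - min). The sketch below illustrates this; `normalize_scores_sketch` is a hypothetical name, and returning zeros for a constant series is an assumption about edge-case handling, not documented behavior.

```python
import pandas as pd

def normalize_scores_sketch(scores: pd.Series) -> pd.Series:
    """Hypothetical stand-in: rescale values linearly so min -> 0 and max -> 1."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # Assumption: a constant series maps to all zeros to avoid dividing by zero.
        return pd.Series(0.0, index=scores.index)
    return (scores - lo) / (hi - lo)

raw = pd.Series([2.0, 4.0, 6.0])
normalized = normalize_scores_sketch(raw)
print(normalized.tolist())  # [0.0, 0.5, 1.0]
```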

`rank_by_column(df, column, ascending=False)` ¶

Rank DataFrame by specified column

PARAMETER	DESCRIPTION
`df`	Data to rank TYPE: `DataFrame`
`column`	Column to rank by TYPE: `str`
`ascending`	Rank in ascending order (default: False for descending) TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	Ranked DataFrame with 'rank' column
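One plausible reading of this function is "sort, then number the rows from 1". The sketch below follows that reading; `rank_by_column_sketch` is hypothetical, and how the real function treats ties or the original index is not stated in this reference.

```python
import pandas as pd

def rank_by_column_sketch(df, column, ascending=False):
    """Hypothetical stand-in: sort by `column` and add a 1-based 'rank' column."""
    ranked = df.sort_values(column, ascending=ascending).reset_index(drop=True)
    # Assumption: rank is the row position after sorting, so ties get distinct ranks.
    ranked["rank"] = range(1, len(ranked) + 1)
    return ranked

scores = pd.DataFrame({"symbol": ["Pt", "Cu", "Ni"], "ASCI": [0.9, 0.4, 0.7]})
ranked = rank_by_column_sketch(scores, "ASCI")
print(ranked)
```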

`filter_by_threshold(df, column, threshold, greater_than=True)` ¶

Filter DataFrame by threshold value

PARAMETER	DESCRIPTION
`df`	Data to filter TYPE: `DataFrame`
`column`	Column to filter on TYPE: `str`
`threshold`	Threshold value TYPE: `float`
`greater_than`	If True, keep values > threshold, else < threshold TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`DataFrame`	Filtered DataFrame
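The `greater_than` flag is described with strict inequalities, so rows exactly at the threshold would be dropped in either direction. A minimal sketch of that behavior, using the hypothetical name `filter_by_threshold_sketch`:

```python
import pandas as pd

def filter_by_threshold_sketch(df, column, threshold, greater_than=True):
    """Hypothetical stand-in: keep rows strictly above or strictly below threshold."""
    if greater_than:
        mask = df[column] > threshold
    else:
        mask = df[column] < threshold
    return df[mask]

scores = pd.DataFrame({"symbol": ["Pt", "Cu", "Ni"], "ASCI": [0.9, 0.4, 0.7]})
above = filter_by_threshold_sketch(scores, "ASCI", 0.7)
below = filter_by_threshold_sketch(scores, "ASCI", 0.7, greater_than=False)
print(above["symbol"].tolist(), below["symbol"].tolist())  # ['Pt'] ['Cu']
```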

`get_pareto_front(df, objectives, maximize)` ¶

Extract Pareto front from multi-objective data

A solution is Pareto optimal if no other solution is better in all objectives simultaneously.

PARAMETER	DESCRIPTION
`df`	Data with objective columns TYPE: `DataFrame`
`objectives`	List of objective column names TYPE: `List[str]`
`maximize`	Whether to maximize each objective (True) or minimize (False) TYPE: `List[bool]`

RETURNS	DESCRIPTION
`DataFrame`	Pareto optimal solutions

Examples:

>>> pareto = get_pareto_front(
...     df, 
...     objectives=['activity_score', 'cost_score'],
...     maximize=[True, True]
... )
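To make the selection rule concrete, here is a small sketch of a non-dominated filter. It assumes the common dominance rule (another row is at least as good in every objective and strictly better in at least one), which may differ in edge cases from what `get_pareto_front` does internally; `pareto_front_sketch` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def pareto_front_sketch(df, objectives, maximize):
    """Hypothetical stand-in: return rows that no other row dominates."""
    # Flip the sign of minimized objectives so that larger is always better.
    signs = np.where(maximize, 1.0, -1.0)
    values = df[objectives].to_numpy(dtype=float) * signs
    keep = np.ones(len(values), dtype=bool)
    for i, point in enumerate(values):
        # A row dominates `point` if it is >= in every column and > in at least one.
        dominated_by = np.all(values >= point, axis=1) & np.any(values > point, axis=1)
        keep[i] = not dominated_by.any()
    return df[keep]

candidates = pd.DataFrame({
    "symbol": ["A", "B", "C", "D"],
    "activity_score": [0.9, 0.5, 0.4, 0.2],
    "cost_score": [0.2, 0.5, 0.4, 0.9],
})
front = pareto_front_sketch(
    candidates, ["activity_score", "cost_score"], maximize=[True, True]
)
print(front["symbol"].tolist())  # ['A', 'B', 'D']
```

Row C is removed because B is better in both objectives. The loop is quadratic in the number of rows, which is fine for illustration but slow for very large tables.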

`calculate_correlation_matrix(df, columns)` ¶

Calculate correlation matrix for specified columns

PARAMETER	DESCRIPTION
`df`	Data TYPE: `DataFrame`
`columns`	Columns to include in correlation TYPE: `List[str]`

RETURNS	DESCRIPTION
`DataFrame`	Correlation matrix
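With pandas, a correlation matrix over selected columns is typically a one-liner. The sketch below assumes Pearson correlation, which is the `DataFrame.corr` default; the method actually used by the package is not stated here, and `correlation_matrix_sketch` is a hypothetical name.

```python
import pandas as pd

def correlation_matrix_sketch(df, columns):
    """Hypothetical stand-in: Pearson correlation between the selected columns."""
    return df[columns].corr()

scores = pd.DataFrame({
    "activity_score": [0.1, 0.4, 0.7, 0.9],
    "cost_score": [0.9, 0.6, 0.3, 0.1],  # exactly 1 - activity_score
})
corr = correlation_matrix_sketch(scores, ["activity_score", "cost_score"])
print(corr.round(3))
```

The off-diagonal entry is -1 here because the two illustrative columns are perfectly anti-correlated.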

`save_to_json(data, file_path)` ¶

Save dictionary to JSON file

PARAMETER	DESCRIPTION
`data`	Data to save TYPE: `dict`
`file_path`	Output file path TYPE: `str`

`load_from_json(file_path)` ¶

Load dictionary from JSON file

PARAMETER	DESCRIPTION
`file_path`	Input file path TYPE: `str`

RETURNS	DESCRIPTION
`dict`	Loaded data
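A save/load round trip can be sketched with the standard `json` module. The two helper names below are hypothetical stand-ins, and the dictionary contents are illustrative only.

```python
import json
import tempfile
from pathlib import Path

def save_to_json_sketch(data, file_path):
    """Hypothetical stand-in: write a dictionary as indented JSON."""
    with open(file_path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)

def load_from_json_sketch(file_path):
    """Hypothetical stand-in: read a JSON file back into a dictionary."""
    with open(file_path, "r", encoding="utf-8") as handle:
        return json.load(handle)

# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    original = {"reaction": "HER", "weights": [0.4, 0.3, 0.3]}
    save_to_json_sketch(original, path)
    restored = load_from_json_sketch(path)

print(restored == original)  # True
```

JSON has no tuple type, so a tuple stored in the dictionary comes back as a list after loading.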

`create_metadata(results_df, config, weights)` ¶

Create metadata for ASCI results

PARAMETER	DESCRIPTION
`results_df`	ASCI results TYPE: `DataFrame`
`config`	Reaction configuration TYPE: `dict`
`weights`	Weights (w_a, w_s, w_c) for the activity, stability, and cost scores TYPE: `tuple`

RETURNS	DESCRIPTION
`dict`	Metadata dictionary

`format_number(value, decimals=3)` ¶

Format number for display

PARAMETER	DESCRIPTION
`value`	Number to format TYPE: `float`
`decimals`	Number of decimal places TYPE: `int` DEFAULT: `3`

RETURNS	DESCRIPTION
`str`	Formatted string

`print_table(df, columns=None, max_rows=20)` ¶

Print DataFrame as formatted table

PARAMETER	DESCRIPTION
`df`	Data to print TYPE: `DataFrame`
`columns`	Columns to include (default: all) TYPE: `List[str]` DEFAULT: `None`
`max_rows`	Maximum rows to print TYPE: `int` DEFAULT: `20`

`validate_file_path(file_path, must_exist=False)` ¶

Validate and convert file path

PARAMETER	DESCRIPTION
`file_path`	File path to validate TYPE: `str`
`must_exist`	If True, raise error if file doesn't exist TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Path`	Validated Path object

RAISES	DESCRIPTION
`FileNotFoundError`	If file doesn't exist and must_exist=True
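The documented contract (return a `Path`, raise `FileNotFoundError` only when `must_exist=True`) can be sketched with `pathlib`. `validate_file_path_sketch` is a hypothetical stand-in, and the file names are made up for the example.

```python
from pathlib import Path

def validate_file_path_sketch(file_path, must_exist=False):
    """Hypothetical stand-in: convert to Path and optionally require existence."""
    path = Path(file_path)
    if must_exist and not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    return path

# A path that need not exist yet, such as an output file.
output_path = validate_file_path_sketch("output/HER_results.csv")

# A path that must exist; the missing file triggers the documented error.
try:
    validate_file_path_sketch("no_such_input_file.csv", must_exist=True)
    raised = False
except FileNotFoundError:
    raised = True

print(output_path.suffix, raised)  # .csv True
```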

`create_output_directory(dir_path)` ¶

Create output directory if it doesn't exist

PARAMETER	DESCRIPTION
`dir_path`	Directory path TYPE: `str`

RETURNS	DESCRIPTION
`Path`	Created directory path

`get_timestamp()` ¶

Get current timestamp as string

RETURNS	DESCRIPTION
`str`	Timestamp in ISO format

`load_catalyst_data(file_path)` ¶

Load catalyst data from file.

Convenience wrapper for loading CSV data.

PARAMETER	DESCRIPTION
`file_path`	Path to data file TYPE: `str`

RETURNS	DESCRIPTION
`DataFrame`	Loaded data

Examples:

>>> data = load_catalyst_data('data/HER_clean.csv')
>>> print(data.shape)
(200, 10)

`save_results(data, file_path)` ¶

Save results to file.

PARAMETER	DESCRIPTION
`data`	Results data TYPE: `DataFrame`
`file_path`	Output file path TYPE: `str`

Examples:

>>> save_results(results, 'output/HER_results.csv')

`calculate_element_cost(element, database=None)` ¶

Get cost for a single element.

PARAMETER	DESCRIPTION
`element`	Element symbol (e.g., 'Pt', 'Cu', 'Ni') TYPE: `str`
`database`	Custom cost database. If None, uses default values. TYPE: `dict` DEFAULT: `None`

RETURNS	DESCRIPTION
`float`	Cost in $/kg

Notes

Default costs based on USGS Commodity data (2024). Values are approximate and should be updated periodically.

Examples:

>>> cost_pt = calculate_element_cost('Pt')
>>> print(f"Platinum: ${cost_pt:,.0f}/kg")
Platinum: $30,000/kg

>>> cost_cu = calculate_element_cost('Cu')
>>> print(f"Copper: ${cost_cu:.2f}/kg")
Copper: $8.50/kg

`get_periodic_table_data()` ¶

Get periodic table data for common elements.

Returns comprehensive element information, including:

- Atomic number
- Atomic mass
- Element name
- Common oxidation states
- Electronegativity

RETURNS	DESCRIPTION
`dict`	Dictionary with element symbols as keys, properties as values

Examples:

>>> pt_data = get_periodic_table_data()
>>> pt_info = pt_data['Pt']
>>> print(f"{pt_info['name']}: Z={pt_info['number']}, M={pt_info['mass']:.3f}")
Platinum: Z=78, M=195.084

`calculate_composition_cost(composition, cost_database=None)` ¶

Calculate composition-weighted cost for alloys.

PARAMETER	DESCRIPTION
`composition`	Dictionary mapping element symbols to atomic fractions, e.g. {'Cu': 0.7, 'Ni': 0.3} TYPE: `dict`
`cost_database`	Custom cost database. If None, uses default values. TYPE: `dict` DEFAULT: `None`

RETURNS	DESCRIPTION
`float`	Composition-weighted cost in $/kg

Examples:

>>> # CuNi alloy (70% Cu, 30% Ni)
>>> cost = calculate_composition_cost({'Cu': 0.7, 'Ni': 0.3})
>>> print(f"CuNi alloy: ${cost:.2f}/kg")
CuNi alloy: $11.35/kg

>>> # PtRu alloy (50% Pt, 50% Ru)
>>> cost = calculate_composition_cost({'Pt': 0.5, 'Ru': 0.5})
>>> print(f"PtRu alloy: ${cost:,.0f}/kg")
PtRu alloy: $21,000/kg
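The examples above are consistent with a simple fraction-weighted average of per-element costs. The sketch below assumes exactly that; `composition_cost_sketch` is a hypothetical name, and the cost table holds illustrative placeholder values passed in explicitly, not the package's default database.

```python
def composition_cost_sketch(composition, cost_database):
    """Hypothetical stand-in: fraction-weighted average of element costs ($/kg)."""
    total_fraction = sum(composition.values())
    if total_fraction <= 0:
        raise ValueError("composition fractions must sum to a positive value")
    # Normalize so that {'Cu': 7, 'Ni': 3} behaves like {'Cu': 0.7, 'Ni': 0.3}.
    return sum(
        cost_database[element] * fraction / total_fraction
        for element, fraction in composition.items()
    )

# Illustrative placeholder costs in $/kg (assumed values, not authoritative data).
illustrative_costs = {"Cu": 8.5, "Ni": 18.0}
cost = composition_cost_sketch({"Cu": 0.7, "Ni": 0.3}, illustrative_costs)
print(f"CuNi alloy: ${cost:.2f}/kg")
```

Normalizing by the sum of fractions is a design choice in this sketch; whether the package normalizes or expects fractions that already sum to 1 is not documented here.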

`generate_unique_labels(df, label_col='display_label')` ¶

Generate unique display labels for catalysts in ranking plots.

Creates unambiguous labels by combining chemical formula with surface facet. If duplicates still exist, adds a numerical suffix.

Format: "CuZn(211)" or "CuZn(211)#2" if still not unique

PARAMETER	DESCRIPTION
`df`	DataFrame with catalyst data. Must contain 'symbol' column. Optionally contains 'slab_millers' for facet information. TYPE: `DataFrame`
`label_col`	Name of the new column for unique labels (default: 'display_label') TYPE: `str` DEFAULT: `'display_label'`

RETURNS	DESCRIPTION
`DataFrame`	DataFrame with added unique label column

Examples:

>>> df = generate_unique_labels(results)
>>> print(df[['symbol', 'slab_millers', 'display_label']].head())
    symbol  slab_millers  display_label
0     CuZn          211       CuZn(211)
1     CuZn          111       CuZn(111)
2     CuZn          100       CuZn(100)
3   Nb2Pt6          110     Nb2Pt6(110)
4   Nb2Pt6          110   Nb2Pt6(110)#2

Notes

This function is essential for ranking plots where the same chemical formula may appear multiple times with different surface facets or configurations.
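The labeling scheme described above (formula plus facet, then a `#k` suffix for repeats) can be sketched with a grouped running count. This is an illustration under assumptions, not the package's code: `unique_labels_sketch` is a hypothetical name, and the first occurrence is left without a suffix to match the example output shown earlier.

```python
import pandas as pd

def unique_labels_sketch(df, label_col="display_label"):
    """Hypothetical stand-in: build 'Symbol(facet)' labels and suffix repeats."""
    out = df.copy()
    base = out["symbol"].astype(str)
    if "slab_millers" in out.columns:
        base = base + "(" + out["slab_millers"].astype(str) + ")"
    # cumcount numbers repeated labels 0, 1, 2, ... in row order.
    repeat = base.groupby(base).cumcount()
    suffix = repeat.map(lambda k: "" if k == 0 else f"#{k + 1}")
    out[label_col] = base + suffix
    return out

results = pd.DataFrame({
    "symbol": ["CuZn", "Nb2Pt6", "Nb2Pt6"],
    "slab_millers": [211, 110, 110],
})
labeled = unique_labels_sketch(results)
print(labeled["display_label"].tolist())
# ['CuZn(211)', 'Nb2Pt6(110)', 'Nb2Pt6(110)#2']
```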

`get_display_labels(df, n_top=10)` ¶

Get unique display labels for top N catalysts.

Convenience function for visualization code.

PARAMETER	DESCRIPTION
`df`	DataFrame with catalyst data (should be sorted by ranking) TYPE: `DataFrame`
`n_top`	Number of top catalysts to label TYPE: `int` DEFAULT: `10`

RETURNS	DESCRIPTION
`List[str]`	List of unique display labels

Examples:

>>> labels = get_display_labels(results.head(10))
>>> print(labels)
['CuZn(211)', 'AgAu(111)', 'PdZn(100)', ...]

`format_scientific(value, precision=3)` ¶

Format number in scientific notation.

PARAMETER	DESCRIPTION
`value`	Number to format TYPE: `float`
`precision`	Number of significant figures TYPE: `int` DEFAULT: `3`

RETURNS	DESCRIPTION
`str`	Formatted scientific notation string

Examples:

>>> format_scientific(0.000123)
'1.23×10⁻⁴'
>>> format_scientific(1234567)
'1.23×10⁶'
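Output in this style can be produced from Python's `e` format by splitting off the exponent and translating its digits to Unicode superscripts. The sketch below is one way to do it; `format_scientific_sketch` is a hypothetical name, and returning `'0'` for zero is an assumption.

```python
# Translation table from ASCII sign/digits to Unicode superscript characters.
_SUPERSCRIPTS = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")

def format_scientific_sketch(value, precision=3):
    """Hypothetical stand-in: format as mantissa×10^exponent with superscripts."""
    if value == 0:
        return "0"  # Assumption: zero is shown without an exponent.
    # precision counts significant figures, so use precision - 1 decimals.
    mantissa, exponent = f"{value:.{precision - 1}e}".split("e")
    # int() drops the '+' sign and leading zeros from the exponent text.
    return f"{mantissa}×10{str(int(exponent)).translate(_SUPERSCRIPTS)}"

small = format_scientific_sketch(0.000123)
large = format_scientific_sketch(1234567)
print(small, large)  # 1.23×10⁻⁴ 1.23×10⁶
```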

Functions¶

generate_unique_labels¶

Generate unique labels for catalysts with duplicate symbols.

from ascicat.utils import generate_unique_labels

df = generate_unique_labels(df, label_col='display_label')

sample_stratified¶

Stratified sampling by ASCI score.

from ascicat.utils import sample_stratified

sampled = sample_stratified(
    df,
    n_samples=2000,
    strata_col='ASCI',
    n_strata=4
)
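How `sample_stratified` forms its strata is not described here. As a rough illustration only, the sketch below assumes equal-count quantile bins of `strata_col` and an equal number of draws per bin; the function name and the `seed` parameter are hypothetical.

```python
import pandas as pd

def sample_stratified_sketch(df, n_samples, strata_col, n_strata, seed=0):
    """Hypothetical stand-in: draw an equal number of rows from each quantile bin."""
    # Assumption: strata are equal-count quantile bins of strata_col.
    bins = pd.qcut(df[strata_col], q=n_strata, labels=False, duplicates="drop")
    per_stratum = n_samples // n_strata
    parts = [
        # Never ask for more rows than a bin actually contains.
        group.sample(n=min(per_stratum, len(group)), random_state=seed)
        for _, group in df.groupby(bins)
    ]
    return pd.concat(parts)

# 100 synthetic rows with evenly spaced scores between 0 and 1.
synthetic = pd.DataFrame({"ASCI": [i / 99 for i in range(100)]})
sampled = sample_stratified_sketch(synthetic, n_samples=20, strata_col="ASCI", n_strata=4)
print(len(sampled))  # 20
```

Sampling per bin keeps low-scoring and high-scoring catalysts represented, which plain random sampling does not guarantee for small samples.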

validate_data¶

Validate input data format.

from ascicat.utils import validate_data

is_valid, errors = validate_data(df)
if not is_valid:
    print("Validation errors:", errors)

compute_pareto_mask¶

Identify Pareto-optimal points.

from ascicat.utils import compute_pareto_mask
import numpy as np

# Objectives to minimize (1 - score)
objectives = np.column_stack([
    1 - df['activity_score'],
    1 - df['stability_score'],
    1 - df['cost_score']
])

pareto_mask = compute_pareto_mask(objectives)
pareto_catalysts = df[pareto_mask]

format_results_table¶

Format results for display.

from ascicat.utils import format_results_table

table = format_results_table(
    results.head(10),
    columns=['symbol', 'ASCI', 'activity_score']
)
print(table)

Data Utilities¶

load_example_data¶

Load built-in example datasets.

from ascicat.utils import load_example_data

# Load HER data
her_data = load_example_data('HER')

# Load CO2RR data
co2rr_data = load_example_data('CO2RR', pathway='CO')

export_results¶

Export results in various formats.

from ascicat.utils import export_results

export_results(
    results,
    output_path='results.csv',
    format='csv'  # or 'xlsx', 'json'
)

Visualization Helpers¶

setup_figure_style¶

Configure matplotlib for high-quality output.

from ascicat.utils import setup_figure_style

setup_figure_style(
    font_scale=1.2,
    dpi=600
)

get_colorblind_palette¶

Get colorblind-safe color palette.

from ascicat.utils import get_colorblind_palette

colors = get_colorblind_palette(n_colors=5)
