helper_functions module

The helper_functions module contains a class, HelperFunctions, that holds the helper functions used throughout the code.

class helper_functions.HelperFunctions(ConfigObj)

Bases: object

Holds helper functions.

Args:

ConfigObj (object): holds config variables

Attributes:

Aside from the config-variables object passed in as an argument, all attributes are methods, which are detailed below.

compute_metric_scores(true_labels, pred_labels)

Computes F1 score, accuracy, precision, and recall for each of the object classes (star, galaxy, qso) from consolidated labels. N.B. the metric scores are computed for each class one at a time.

Args:

true_labels (array): true labels

pred_labels (array): predicted labels

Returns:

results_str (string): string of metric scores for each object

metrics_list (list): summarizes/packages up metric scores
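A minimal sketch of the per-class, one-at-a-time scoring this method describes, using scikit-learn; the class names and the output packaging here are illustrative assumptions, not the module's actual code:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def per_class_scores(true_labels, pred_labels, classes=("star", "galaxy", "qso")):
    """Score each class one-vs-rest, as compute_metric_scores does per class."""
    results = {}
    for cls in classes:
        t = (np.asarray(true_labels) == cls).astype(int)  # binarise: this class vs rest
        p = (np.asarray(pred_labels) == cls).astype(int)
        prec, rec, f1, _ = precision_recall_fscore_support(
            t, p, average="binary", zero_division=0)
        results[cls] = {"f1": f1, "accuracy": accuracy_score(t, p),
                        "precision": prec, "recall": rec}
    return results

scores = per_class_scores(
    ["star", "qso", "galaxy", "star"],
    ["star", "qso", "star", "star"])
```

Binarising the labels per class is what makes the scores "for each of the classes one at a time".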

compute_performances(indata, kind)

Computes classification metrics for all labels from the hdbscan gridsearch and produces a file with the metrics for each setup.

Args:

indata (astropy table): catalogue data - output from get_data function

kind (str): object type (star, galaxy, or qso)

Returns:

Nothing is returned; the metrics for each setup are written to a file.

do_consolidation(predicted_labels_dict, list_of_indices, consolidation_type)

Uses the chosen consolidation method ('optimal' or 'alternative') to consolidate the binary classifiers' labels. Also writes some results to a file.

Args:

predicted_labels_dict (dict): dict of arrays with each binary classifier’s predicted labels (with 1 for positive classification, 0 otherwise)

list_of_indices (list of arrays): various arrays with different indices - see find_object_indices()

consolidation_type (string): ‘optimal’ or ‘alternative’ depending on which consolidation method to use

Returns:

cluster_labels (array): consolidated labels

after_cons_str (string): number of objects post-consolidation
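The precise 'optimal' and 'alternative' consolidation rules are not spelled out here; the following sketch shows one simple way such per-class binary labels could be merged (an object claimed by exactly one classifier keeps that class; everything else is left unclassified). The rule and all names are hypothetical:

```python
import numpy as np

def consolidate(predicted_labels_dict, ambiguous_label=-1):
    """Merge per-class binary labels into one array of class indices.

    Simplistic stand-in rule: an object flagged by exactly one binary
    classifier gets that classifier's class index; objects flagged by
    zero or several classifiers are marked ambiguous.
    """
    names = list(predicted_labels_dict)                 # e.g. ["star", "galaxy", "qso"]
    stacked = np.vstack([predicted_labels_dict[n] for n in names])
    n_claims = stacked.sum(axis=0)                      # how many classifiers claim each object
    cluster_labels = np.full(stacked.shape[1], ambiguous_label)
    unique = n_claims == 1
    cluster_labels[unique] = stacked[:, unique].argmax(axis=0)
    return cluster_labels

labels = consolidate({
    "star":   np.array([1, 0, 1, 0]),
    "galaxy": np.array([0, 1, 1, 0]),
    "qso":    np.array([0, 0, 0, 0]),
})
```

Here the third object is claimed by both the star and galaxy classifiers and the fourth by none, so both end up ambiguous.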

do_hdbscan_gridsearch(attribute_names, scaled_data, kind)

Runs an hdbscan hyperparameter gridsearch to find the best hyperparameter setup for later use in the final model.

Args:

attribute_names (list): list of attribute names

scaled_data (array): scaled attributes

kind (str): object type - star, galaxy, qso

Returns:

cluster_labels (array): output labels according to clustering result

clusterer (object): trained hdbscan clusterer

do_pca(data, ncomp=3, kind=None, save=False)

Runs PCA decomposition on the data.

Args:

data (array): data on which PCA is to be performed

ncomp (int): PCA reduction dimension

kind (str): object type - star, galaxy, qso (only needed if save is True)

save (bool): whether to save PCA reducer or not

Returns:

fitted_pca (object): fitted PCA reducer model
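do_pca presumably wraps something like scikit-learn's PCA; a minimal sketch with made-up data (the data shape and ncomp value are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))            # stand-in for scaled attribute data

ncomp = 3                                   # PCA reduction dimension
fitted_pca = PCA(n_components=ncomp).fit(data)
reduced = fitted_pca.transform(data)        # data projected onto the top 3 components
```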

do_random_forest(data, target, features, kind, gridsearch=True)

Runs a random forest gridsearch to obtain the best Random Forest hyperparameter setup (optional), then runs the best RF setup to produce a list of importances for each attribute.

Args:

data (array): input data of attributes and their values

target (array): binary labels for target

features (list): attribute names

kind (str): object type star, galaxy, qso

gridsearch (bool): whether to do gridsearch or not

Returns:

clf (object): RF trained classifier
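A sketch of the fit-then-rank-importances step (without the optional gridsearch), using scikit-learn on synthetic data; the feature names and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for attribute data and binary target labels.
data, target = make_classification(n_samples=200, n_features=5, random_state=0)
features = [f"colour_{i}" for i in range(5)]    # hypothetical attribute names

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data, target)

# Rank attributes by their RF importance, most important first.
importances = sorted(zip(features, clf.feature_importances_),
                     key=lambda kv: kv[1], reverse=True)
```

The ranked importances are what select_important_attributes later consumes from the importances file.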

do_scale(data, save=False)

Scales the attributes.

Args:

data (array): input unscaled attributes

save (bool): whether to save scaler or not

Returns:

scaled_data (array): scaled attributes
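The scaling method is not specified here; scikit-learn's StandardScaler (zero mean, unit variance per attribute) is a standard choice and is assumed in this sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two attributes on very different scales.
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

scaler = StandardScaler().fit(data)      # could be saved for the predict stage (save=True)
scaled_data = scaler.transform(data)     # each column now has mean 0, std 1
```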

find_best_hdbscan_setup(performance_dat_file)

Finds the best hdbscan setup for a given object type.

Args:

performance_dat_file (string): path and name of performance dat file

Returns:

best_name (string): column name/setup for best hdbscan

best_cluster (int/string): hdbscan cluster value

find_object_indices(predicted_labels_dict)

Finds indices in arrays for objects and for combinations/duplicates of positive classifications from the binary classifiers. Required for the do_consolidation (optimal or alternative) functions.

Args:

predicted_labels_dict (dict): dict of arrays with each binary classifier’s predicted labels (with 1 for positive classification, 0 otherwise)

Returns:

list_of_indices (list of arrays): various arrays with different indices

before_consolidation_str (string): number of objects pre-consolidation
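One plausible way to compute such index arrays (a sketch; the module's exact set of index arrays is not listed here, and the example labels are made up):

```python
import numpy as np

predicted_labels_dict = {
    "star":   np.array([1, 0, 1, 0, 1]),
    "galaxy": np.array([0, 1, 1, 0, 0]),
    "qso":    np.array([0, 0, 0, 0, 1]),
}

stacked = np.vstack(list(predicted_labels_dict.values()))
n_claims = stacked.sum(axis=0)                   # positive classifications per object

unique_idx = np.flatnonzero(n_claims == 1)       # claimed by exactly one classifier
duplicate_idx = np.flatnonzero(n_claims > 1)     # claimed by two or more (duplicates)
unclassified_idx = np.flatnonzero(n_claims == 0) # claimed by none
```

Index arrays like these are what do_consolidation would then use to resolve the duplicates.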

get_all_features(indata, photo_band_list, combine_type, targetname=None, target=None)

Creates attributes from catalogue data.

Args:

indata (astropy table): catalogue data - output from get_data function

photo_band_list (list): list of photometric bands to use

combine_type (str): subtract (for colours) or divide - how to make attributes from photometric bands

targetname (str): column name of true labels

target (str/int): numeric value corresponding to the object type in the true labels

Returns:

attribute_list (list): list of attribute values

attribute_names (list): names of attributes

attribute_target (array): binary labels for target type
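For combine_type "subtract", the attributes are colours, i.e. pairwise magnitude differences between photometric bands. A minimal sketch with hypothetical band names and random magnitudes:

```python
import itertools
import numpy as np

photo_band_list = ["u", "g", "r", "i"]          # hypothetical band names
rng = np.random.default_rng(0)
mags = {b: rng.normal(size=4) for b in photo_band_list}  # stand-in magnitudes

attribute_list, attribute_names = [], []
for b1, b2 in itertools.combinations(photo_band_list, 2):
    attribute_list.append(mags[b1] - mags[b2])  # combine_type == "subtract": a colour
    attribute_names.append(f"{b1}-{b2}")
```

Four bands give six pairwise colours; "divide" would take ratios of bands instead.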

get_data(filename, filetype)

Reads in catalogue data.

Args:

filename (str): name of catalogue

filetype (str): type of file (e.g. csv)

Returns:

incat (astropy table): table of catalogue data

plot_classification(indata, labels1, labels2, labels1_name, labels2_name, dict_labels1, dict_labels2)

Creates two side-by-side colour plots for two sets of labels for the same catalogue.

Args:

indata (astropy table): catalogue data - output from get_data function

labels1 (array): First set of labels

labels2 (array): Second set of labels

labels1_name (str): Name of first set of labels to appear in plot

labels2_name (str): Name of second set of labels to appear in plot

dict_labels1 (dict): Dict with keys as object name, value as numeric label value for first set of labels

dict_labels2 (dict): Dict with keys as object name, value as numeric label value for second set of labels

Returns:

fig, ax (objects): fig and ax objects of plot as from matplotlib

plot_confusion_matrix(y_true, y_pred, identifiers, normalize=True)

Plots a confusion matrix.

Args:

y_true (array): True labels

y_pred (array): Predicted labels

identifiers (list): List of names of the objects that the values in the label arrays refer to, in order of label value. If the list has length 1, the input labels are converted to binary form and a binary confusion matrix is plotted

normalize (bool): Whether to normalize the confusion matrix

Returns:

fig (object): Figure object of the plot

ax (object): Axes object of the plot
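The core computation behind such a plot, with normalization over the true-class rows (plotting itself omitted; the labels here are made up, and scikit-learn is an assumed dependency):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted
cm_norm = cm / cm.sum(axis=1, keepdims=True)     # each row sums to 1 (normalize=True)
```

The normalized matrix could then be drawn with, e.g., matplotlib's imshow, labelling the axes with the identifiers list.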

select_important_attributes(attribute_names, RF_importances_file, top=0)

Select attributes according to their importance (from Random Forest output).

Args:

attribute_names (list): attribute names

RF_importances_file (str): filepath for RF importances file (output from do_random_forest function)

top (int): number of top attributes to select

Returns:

index (list): indices of selected attributes
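The selection step presumably amounts to a top-k sort on the parsed importances; a sketch with made-up importance values standing in for the contents of the RF importances file:

```python
import numpy as np

attribute_names = ["u-g", "g-r", "r-i", "u-r", "g-i"]        # hypothetical names
importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])       # as if parsed from the file

top = 3
index = np.argsort(importances)[::-1][:top]   # indices of the `top` most important attributes
selected = [attribute_names[i] for i in index]
```

The returned indices can then be used to slice the attribute arrays before scaling and clustering.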

train_and_save_hdbscan(attribute_names, scaled_data, hdbscan_setup, kind)

Trains and saves an hdbscan clusterer object for use later in predict stage. Also saves the position in PCA space for each datapoint to a text file. Also saves a dendrogram plot from the trained hdbscan clusterer.

Args:

attribute_names (list): Names of all possible attributes

scaled_data (array): Scaled colour data

hdbscan_setup (str): Contains setup for hdbscan training in form of ‘{}_{}_{}’.format(RF_top, ncomp, min_cluster_size)

kind (str): object type (star, galaxy, or qso)

Returns:

Nothing is returned; the trained clusterer, the PCA positions, and the dendrogram plot are saved to files.

write_best_labels_binary(best_name, best_cluster, kind)

Writes the best labels to a file in binary format.

Args:

best_name (string): column name/setup for best hdbscan

best_cluster (int/string): hdbscan cluster value

kind (str): object type (star, galaxy, or qso)

Returns:

Nothing is returned; the labels are written to a file.