helper_functions module
The helper_functions module contains a class that holds the functions used throughout the code.
-
class helper_functions.HelperFunctions(ConfigObj)
Bases: object
Holds helper functions.
- Args:
ConfigObj (object): holds config variables
- Attributes:
Apart from the config variables object passed in as ConfigObj, all attributes are methods, detailed below.
-
compute_metric_scores(true_labels, pred_labels)
Computes the F1 score, accuracy, precision, and recall for each of the object classes (star, galaxy, qso) from the consolidated labels. N.B. the metric scores are computed for each class one at a time.
- Args:
true_labels (array): true labels
pred_labels (array): predicted labels
- Returns:
results_str (string): string of metric scores for each object
metrics_list (list): summarizes/packages up metric scores
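As a rough illustration of "one class at a time" scoring, the sketch below computes the four metrics for a single class treated as positive against all others. It is not the module's implementation: the function name `one_vs_rest_scores` and the numeric label encoding are hypothetical.

```python
import numpy as np


def one_vs_rest_scores(true_labels, pred_labels, class_value):
    """Return (accuracy, precision, recall, f1) with class_value as the positive class."""
    true_pos_mask = np.asarray(true_labels) == class_value
    pred_pos_mask = np.asarray(pred_labels) == class_value

    tp = np.sum(true_pos_mask & pred_pos_mask)    # correctly predicted positives
    fp = np.sum(~true_pos_mask & pred_pos_mask)   # predicted positive, actually negative
    fn = np.sum(true_pos_mask & ~pred_pos_mask)   # missed positives
    tn = np.sum(~true_pos_mask & ~pred_pos_mask)  # correctly predicted negatives

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(accuracy), float(precision), float(recall), float(f1)


# Hypothetical encoding, assumed for this example only: 1 = star, 2 = galaxy, 3 = qso
true_labels = [1, 1, 2, 2, 3, 3]
pred_labels = [1, 2, 2, 2, 3, 1]
star_scores = one_vs_rest_scores(true_labels, pred_labels, class_value=1)
```

Calling the function once per class value gives the per-class scores that the method is described as packaging into results_str and metrics_list.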
-
compute_performances(indata, kind)
Computes classification metrics for all labels from the hdbscan gridsearch, and writes a file with the metrics for each setup.
- Args:
indata (astropy table): catalogue data - output from get_data function
kind (str): object type (star, galaxy, or qso)
Returns:
-
do_consolidation(predicted_labels_dict, list_of_indices, consolidation_type)
Consolidates the binary classifiers' labels using the chosen consolidation method. Also writes some results to a file.
- Args:
predicted_labels_dict (dict): dict of arrays with each binary classifier’s predicted labels (with 1 for positive classification, 0 otherwise)
list_of_indices (list of arrays): various arrays with different indices - see find_object_indices()
consolidation_type (string): ‘optimal’ or ‘alternative’ depending on which consolidation method to use
- Returns:
cluster_labels (array): consolidated labels
after_cons_str (string): number of objs post-consolidation
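The rules behind the 'optimal' and 'alternative' methods are not given here, so the sketch below only shows the general shape of consolidating three binary label arrays into one. The rule used (keep a class only when exactly one classifier is positive) and the names `consolidate_simple` and `class_codes` are assumptions made for illustration.

```python
import numpy as np


def consolidate_simple(predicted_labels_dict, class_codes):
    """Merge binary label arrays into one array of class codes.

    Illustrative rule only (NOT the module's 'optimal' or 'alternative' method):
      - exactly one classifier positive -> that class code
      - no classifier positive          -> 0
      - more than one positive          -> -1 (conflict)
    """
    names = list(class_codes)
    # One column per binary classifier, one row per object
    votes = np.column_stack([np.asarray(predicted_labels_dict[n]) for n in names])
    n_positive = votes.sum(axis=1)
    codes = np.array([class_codes[n] for n in names])

    consolidated = np.zeros(votes.shape[0], dtype=int)
    single = n_positive == 1
    consolidated[single] = codes[votes[single].argmax(axis=1)]
    consolidated[n_positive > 1] = -1
    return consolidated


# Hypothetical inputs: 1 marks a positive classification, 0 otherwise
predicted_labels_dict = {
    "star": [1, 0, 0, 1],
    "galaxy": [0, 1, 0, 1],
    "qso": [0, 0, 0, 0],
}
class_codes = {"star": 1, "galaxy": 2, "qso": 3}  # assumed numeric codes
cluster_labels = consolidate_simple(predicted_labels_dict, class_codes)
```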
-
do_hdbscan_gridsearch(attribute_names, scaled_data, kind)
Runs an hdbscan hyperparameter gridsearch to find the best hyperparameter setup for later use in the final model.
- Args:
attribute_names (list): list of attribute names
scaled_data (array): scaled attributes
kind (str): object type - star, galaxy, qso
- Returns:
cluster_labels (array): output labels according to clustering result
clusterer (object): trained hdbscan clusterer
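A gridsearch of this kind typically loops over every combination of the hyperparameters. The sketch below enumerates such a grid and names each setup in the '{}_{}_{}' form described for hdbscan_setup under train_and_save_hdbscan. The grid values are hypothetical, and the clustering step itself is left out.

```python
from itertools import product

# Hypothetical grid values, chosen only for illustration
rf_top_values = [5, 10]            # number of top RF attributes to keep
ncomp_values = [2, 3]              # PCA reduction dimension
min_cluster_size_values = [20, 50]

# One setup name per combination, in the '{}_{}_{}' form
setups = [
    "{}_{}_{}".format(rf_top, ncomp, min_cluster_size)
    for rf_top, ncomp, min_cluster_size in product(
        rf_top_values, ncomp_values, min_cluster_size_values
    )
]
```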
-
do_pca(data, ncomp=3, kind=None, save=False)
Runs PCA decomposition on data.
- Args:
data (array): data on which PCA is to be performed
ncomp (int): PCA reduction dimension
kind (str): object type - star, galaxy, qso (only needed if save is True)
save (bool): whether to save PCA reducer or not
- Returns:
fitted_pca (object): fitted PCA reducer model
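To show what reducing data to ncomp dimensions involves, here is a minimal PCA sketch using a NumPy singular value decomposition. This is a swapped-in technique for illustration: the module may well use a library PCA object, and the name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(data, ncomp=3):
    """Project data onto its first ncomp principal components (SVD-based sketch)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centred = data - mean
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:ncomp]
    reduced = centred @ components.T
    return reduced, components, mean


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))  # synthetic data: 50 objects, 5 attributes
reduced, components, mean = pca_reduce(data, ncomp=3)
```

New data would be projected with the same mean and components, which is why a fitted reducer is worth saving.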
-
do_random_forest(data, target, features, kind, gridsearch=True)
Optionally runs a Random Forest (RF) gridsearch to obtain the best hyperparameter setup, then runs the best RF setup to give a list of importances for each attribute.
- Args:
data (array): input data of attributes and their values
target (array): binary labels for target
features (list): attribute names
kind (str): object type - star, galaxy, qso
gridsearch (bool): whether to do gridsearch or not
- Returns:
clf (object): RF trained classifier
-
do_scale(data, save=False)
Scales attributes.
- Args:
data (array): input unscaled attributes
save (bool): whether to save scaler or not
- Returns:
scaled_data (array): scaled attributes
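The scaling method is not stated here. Assuming standardisation (zero mean, unit variance per attribute), the step could be sketched as below. The name `standardise` is hypothetical, and the module may use a different scaler.

```python
import numpy as np


def standardise(data):
    """Scale each column to zero mean and unit variance (assumed scaling scheme)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (data - mean) / std, mean, std


unscaled = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled_data, mean, std = standardise(unscaled)
```

Keeping mean and std alongside the scaled data mirrors the purpose of the save option: the same transformation can then be applied to new data later.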
-
find_best_hdbscan_setup(performance_dat_file)
Finds the best hdbscan setup for a given object type.
- Args:
performance_dat_file (string): path and name of performance dat file
- Returns:
best_name (string): column name/setup for best hdbscan
best_cluster (int/string): hdbscan cluster value
-
find_object_indices(predicted_labels_dict)
Finds indices in arrays for objects and combinations / duplicates of positive classifications from the binary classifiers. Required for do_consolidation (optimal or alternative) functions.
- Args:
predicted_labels_dict (dict): dict of arrays with each binary classifier’s predicted labels (with 1 for positive classification, 0 otherwise)
- Returns:
list_of_indices (list of arrays): various arrays with different indices
before_consolidation_str (string): number of objs pre-consolidation
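One way to picture the index arrays is to group object indices by which classifiers gave a positive result. The sketch below does that with a dict keyed by the combination of positive classifiers. The function name and the returned structure are assumptions, since the real method returns a list of arrays whose layout is not described here.

```python
import numpy as np


def indices_by_combination(predicted_labels_dict):
    """Group object indices by the set of classifiers that flagged them as positive."""
    names = sorted(predicted_labels_dict)
    votes = np.column_stack([np.asarray(predicted_labels_dict[n]) for n in names])
    groups = {}
    for index, row in enumerate(votes):
        # e.g. ('galaxy', 'star') for an object claimed by both classifiers
        key = tuple(name for name, vote in zip(names, row) if vote == 1)
        groups.setdefault(key, []).append(index)
    return {key: np.array(idx) for key, idx in groups.items()}


# Hypothetical inputs: 1 marks a positive classification, 0 otherwise
predicted_labels_dict = {
    "star": [1, 0, 0, 1, 0],
    "galaxy": [0, 1, 0, 1, 0],
    "qso": [0, 0, 0, 0, 1],
}
groups = indices_by_combination(predicted_labels_dict)
```

The empty key `()` collects objects that no classifier claimed, and keys with more than one name collect the duplicates that consolidation has to resolve.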
-
get_all_features(indata, photo_band_list, combine_type, targetname=None, target=None)
Creates attributes from catalogue data.
- Args:
indata (astropy table): catalogue data - output from get_data function
photo_band_list (list): list of photometric bands to use
combine_type (str): subtract (for colours) or divide - how to make attributes from photometric bands
targetname (str): column name of true labels
target (str/int): corresponds to number of object type in true labels
- Returns:
attribute_list (list): list of attribute values
attribute_names (list): names of attributes
attribute_target (array): binary labels for target type
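As a sketch of how attributes might be built from photometric bands, the example below combines every pair of bands by subtraction (colours) or division, and builds binary target labels. Using all band pairs, the attribute naming scheme, the plain dict input (instead of an astropy table), and the band names are all assumptions.

```python
from itertools import combinations

import numpy as np


def make_attributes(band_values, photo_band_list, combine_type):
    """Build one attribute per pair of bands, by subtraction or division."""
    attribute_list, attribute_names = [], []
    for band_a, band_b in combinations(photo_band_list, 2):
        a = np.asarray(band_values[band_a], dtype=float)
        b = np.asarray(band_values[band_b], dtype=float)
        if combine_type == "subtract":
            attribute_list.append(a - b)  # a colour, e.g. u-g
            attribute_names.append(f"{band_a}-{band_b}")
        elif combine_type == "divide":
            attribute_list.append(a / b)
            attribute_names.append(f"{band_a}/{band_b}")
        else:
            raise ValueError("combine_type must be 'subtract' or 'divide'")
    return attribute_list, attribute_names


# Hypothetical catalogue columns (magnitudes in bands u, g, r) and true labels
band_values = {"u": [20.0, 19.0], "g": [19.0, 18.5], "r": [18.5, 18.25]}
true_labels = np.array([1, 2])
attribute_list, attribute_names = make_attributes(band_values, ["u", "g", "r"], "subtract")

# Binary labels for the target type: 1 where the true label equals target, else 0
target = 1
attribute_target = (true_labels == target).astype(int)
```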
-
get_data(filename, filetype)
Reads in catalogue data.
- Args:
filename (str): name of catalogue
filetype (str): type of file (e.g. csv)
- Returns:
incat (astropy table): table of catalogue data
-
plot_classification(indata, labels1, labels2, labels1_name, labels2_name, dict_labels1, dict_labels2)
Creates two side-by-side colour plots for two sets of labels for the same catalogue.
- Args:
indata (astropy table): catalogue data - output from get_data function
labels1 (array): First set of labels
labels2 (array): Second set of labels
labels1_name (str): Name of first set of labels to appear in plot
labels2_name (str): Name of second set of labels to appear in plot
dict_labels1 (dict): Dict with keys as object name, value as numeric label value for first set of labels
dict_labels2 (dict): Dict with keys as object name, value as numeric label value for second set of labels
- Returns:
fig, ax (objects): fig and ax objects of plot as from matplotlib
-
plot_confusion_matrix(y_true, y_pred, identifiers, normalize=True)
Plots a confusion matrix.
- Args:
y_true (array): True labels
y_pred (array): Predicted labels
identifiers (list): Names of the objects that the values in the label arrays refer to, ordered by label value. If the list has length 1, the input labels are converted to binary form and a binary confusion matrix is plotted.
normalize (bool): Sets whether we normalize confusion matrix or not
- Returns:
fig (object): Plot object
ax (object): Plot axes object
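The effect of the normalize flag can be illustrated without any plotting. The sketch below builds the count matrix and, when asked, divides each row by the number of true objects in that class. Row-wise normalisation and labels running from 0 to n_classes - 1 are assumptions, and the name `confusion_matrix_values` is hypothetical.

```python
import numpy as np


def confusion_matrix_values(y_true, y_pred, n_classes, normalize=True):
    """Return an n_classes x n_classes matrix: rows are true labels, columns predicted."""
    matrix = np.zeros((n_classes, n_classes), dtype=float)
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label, pred_label] += 1
    if normalize:
        # Assumed convention: each row sums to 1 (fractions of each true class)
        row_sums = matrix.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0  # avoid dividing by zero for empty classes
        matrix = matrix / row_sums
    return matrix


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
normalised = confusion_matrix_values(y_true, y_pred, n_classes=3)
counts = confusion_matrix_values(y_true, y_pred, n_classes=3, normalize=False)
```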
-
select_important_attributes(attribute_names, RF_importances_file, top=0)
Select attributes according to their importance (from Random Forest output).
- Args:
attribute_names (list): attribute names
RF_importances_file (str): filepath for RF importances file (output from do_random_forest function)
top (int): number of top attributes to select
- Returns:
index (list): indices of selected attributes
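Selecting the top attributes amounts to ranking them by importance and keeping the first few. The sketch below works on an in-memory array of importances, because the layout of the RF importances file is not described here. Treating top=0 as "keep everything" is an assumption about the default, and the name `top_attribute_indices` is hypothetical.

```python
import numpy as np


def top_attribute_indices(importances, top=0):
    """Return the indices of the `top` most important attributes, in ascending order."""
    order = np.argsort(importances)[::-1]  # most important first
    if top > 0:  # assumption: top=0 means no cut is applied
        order = order[:top]
    return sorted(order.tolist())


# Hypothetical importances for four attributes
importances = np.array([0.10, 0.40, 0.05, 0.45])
index = top_attribute_indices(importances, top=2)
all_index = top_attribute_indices(importances)
```

The returned indices can then be used to pick the matching entries of attribute_names and the matching columns of the data.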
-
train_and_save_hdbscan(attribute_names, scaled_data, hdbscan_setup, kind)
Trains and saves an hdbscan clusterer object for later use in the predict stage. Also saves each datapoint's position in PCA space to a text file, and a dendrogram plot from the trained hdbscan clusterer.
- Args:
attribute_names (list): Names of all possible attributes
scaled_data (array): Scaled colour data
hdbscan_setup (str): Contains setup for hdbscan training in form of ‘{}_{}_{}’.format(RF_top, ncomp, min_cluster_size)
kind (str): object type (star, galaxy, or qso)
Returns:
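Because hdbscan_setup packs three values into one string, it has to be split again before training. A minimal sketch of that round trip is given below, assuming all three values are integers. The helper name `parse_hdbscan_setup` and the example values are hypothetical.

```python
def parse_hdbscan_setup(hdbscan_setup):
    """Split a 'RF_top_ncomp_min_cluster_size' string into three integers."""
    rf_top, ncomp, min_cluster_size = (int(part) for part in hdbscan_setup.split("_"))
    return rf_top, ncomp, min_cluster_size


# Build a setup string in the documented form, then recover the values
hdbscan_setup = "{}_{}_{}".format(10, 3, 50)
rf_top, ncomp, min_cluster_size = parse_hdbscan_setup(hdbscan_setup)
```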
-
write_best_labels_binary(best_name, best_cluster, kind)
Writes best labels to file in a binary format.
- Args:
best_name (string): column name/setup for best hdbscan
best_cluster (int/string): hdbscan cluster value
kind (str): object type (star, galaxy, or qso)
Returns: