binary_classifier module

binary_classifier.run_binary(conf, lib)

Wrapper function for binary classifier code.

Args:

conf (object): contains config variables

lib (object): contains helper functions

Returns:

In short, the binary_classifier module takes an input catalogue with labels, and uses the photometric bands in the catalogue to create colours. These colours are then used as attributes to train a classifier to classify the datapoints.

Here we give a run-through of the code, taking the star classifier as an example. The bulk of this script is summarized in the image below.

_images/tikz_grid.png

Description of code

Setup: The training catalogue is read in with the get_data function, and we specify the filetype (here .csv format). The target is also defined: the hclass_dict is used to convert the target class (here, star) into its numeric representation (as defined in the config module).
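The setup step can be sketched as below. This is an illustrative sketch only: the real get_data and hclass_dict live in this package (the dict in the config module), and the column names and numeric codes used here are assumptions.

```python
import csv
import io

# Assumed mapping of class name to numeric code; the real values are
# defined in the config module.
hclass_dict = {"star": 0, "gal": 1, "qso": 2}

def get_data(csv_text):
    """Sketch of reading a training catalogue in .csv format."""
    return list(csv.DictReader(io.StringIO(csv_text)))

catalogue = get_data("id,hclass\n1,star\n2,gal\n")
target_name = "star"
target_value = hclass_dict[target_name]  # numeric representation of the target
```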

Prepare attributes: Then the get_all_features function is called. It is passed photo_band_list, which defines the filters considered; the function creates all colour combinations among them. The variable combine_type takes the value 'subtract' or 'divide' depending on whether the input data are magnitudes or fluxes, respectively. The function also takes the catalogue data, the target name, and the corresponding numeric representation.
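The colour-creation idea can be sketched as follows. The signature is an assumption (the real get_all_features also handles the target column); only the pairwise-combination logic is illustrated.

```python
from itertools import combinations

def get_all_features(rows, photo_band_list, combine_type):
    """Sketch: build every pairwise colour from the listed bands.

    combine_type is 'subtract' for magnitudes or 'divide' for fluxes.
    """
    colours = []
    for row in rows:
        feats = {}
        for b1, b2 in combinations(photo_band_list, 2):
            if combine_type == "subtract":
                feats[f"{b1}-{b2}"] = row[b1] - row[b2]
            elif combine_type == "divide":
                feats[f"{b1}/{b2}"] = row[b1] / row[b2]
        colours.append(feats)
    return colours

rows = [{"u": 21.0, "g": 20.5, "r": 20.0}]
# three colours per row: u-g, u-r, g-r
feature_rows = get_all_features(rows, ["u", "g", "r"], "subtract")
```

With n bands this yields n(n-1)/2 colours per datapoint.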

Scaling: The do_scale function is then called to scale the data (mean of 0 and variance of 1 for each attribute column). The scaler object is saved too, for use in the prediction stage later on.
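A minimal sketch of the scaling step, in pure Python for clarity; the production code would more likely use a fitted scaler object (e.g. scikit-learn's StandardScaler) persisted for the prediction stage, which is an assumption here.

```python
import statistics

def do_scale(columns):
    """Sketch: standardise each attribute column to mean 0, variance 1.

    Returns the scaled columns plus the per-column (mean, stdev) pairs,
    which stand in for the saved scaler object used at prediction time.
    """
    params = {}
    scaled = {}
    for name, values in columns.items():
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        params[name] = (mu, sigma)
        scaled[name] = [(v - mu) / sigma for v in values]
    return scaled, params

scaled, params = do_scale({"u-g": [1.0, 2.0, 3.0]})
```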

RF gridsearch: The do_random_forest function is then called. If gridsearch is set to True, it runs a hyperparameter gridsearch on a random forest (RF) classifier to obtain the best hyperparameter setup. It then runs this RF once using the labels star/non-star (in this example), which provides a list of attribute importances that is used later.
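The RF gridsearch and importance ranking might look roughly like this, assuming scikit-learn; the grid values and the synthetic data are illustrative, not the module's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled colour attributes: column 0 carries
# the star / non-star signal, the rest is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)  # binary star / non-star labels

# Illustrative hyperparameter grid (assumption, not the module's grid).
grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)

best_rf = search.best_estimator_
# Rank attributes by importance, for later use in the HDBSCAN gridsearch.
ranking = np.argsort(best_rf.feature_importances_)[::-1]
```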

HDBSCAN gridsearch: The do_hdbscan_gridsearch function is then called, which runs a gridsearch on HDBSCAN to find the best setup for the binary star classifier (in this example). It tries different numbers of the top RF attributes from the importance list produced by do_random_forest (RF_top), different numbers of dimensions to which these attributes are reduced using PCA (ncomp), and different values of the HDBSCAN hyperparameter min_cluster_size.
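The shape of that three-way gridsearch loop can be sketched as below. Only the loop structure is from the text; score_setup is a hypothetical stand-in for fitting HDBSCAN on the PCA-reduced top attributes and scoring against the binary labels, and the grid values are assumptions.

```python
from itertools import product

def score_setup(rf_top, ncomp, min_cluster_size):
    """Placeholder score, peaking at an arbitrary setup for illustration."""
    return -((rf_top - 10) ** 2 + (ncomp - 3) ** 2 + (min_cluster_size - 50) ** 2)

# Grid over (RF_top, ncomp, min_cluster_size).
best = max(
    product([5, 10, 20], [2, 3, 4], [25, 50, 100]),
    key=lambda setup: score_setup(*setup),
)
# -> (10, 3, 50)
```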

Process HDBSCAN gridsearch output: After the gridsearch completes, compute_performances is called to create a file with the metric scores for each classifier setup (each combination of number of top RF attributes, PCA dimensions, and min_cluster_size). The find_best_hdbscan_setup function is called to find the best setup for the classifier in question (e.g. star), and write_best_labels_binary writes these best labels to a separate csv file in binary form (i.e. 1 for e.g. star, 0 for e.g. non-star). The best setups (how many top RF attributes were used, to how many dimensions they were reduced using PCA, and what min_cluster_size was used for HDBSCAN) are also saved to a text file.
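Conceptually, this post-processing amounts to the following sketch; the performance values and the metric itself are hypothetical placeholders.

```python
# Metric score per setup, keyed by (RF_top, ncomp, min_cluster_size);
# the numbers are illustrative, not real results.
performances = {
    (10, 3, 50): 0.91,
    (20, 4, 25): 0.87,
}
best_setup = max(performances, key=performances.get)

# Recode predicted classes as binary labels: 1 for the target class
# (e.g. star), 0 for everything else.
predicted = ["star", "gal", "qso", "star"]
binary_labels = [1 if lab == "star" else 0 for lab in predicted]
```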

Save best HDBSCAN classifier: The train_and_save_hdbscan function is then called to train the HDBSCAN classifier and save the trained clusterer object (later used in the prediction stage). The PCA object is also saved, so it can be applied to new data in the prediction stage. The position of each datapoint in PCA space is saved to a text file, and a dendrogram plot of each trained HDBSCAN clusterer is saved as well.
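Persisting the fitted objects for the prediction stage could be as simple as the pickle round-trip below; the filename and the exact serialisation mechanism used by the module are assumptions.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted PCA object and trained HDBSCAN clusterer.
trained = {"pca": {"ncomp": 3}, "clusterer": {"min_cluster_size": 50}}

path = os.path.join(tempfile.mkdtemp(), "hdbscan_star.pkl")
with open(path, "wb") as fh:
    pickle.dump(trained, fh)

# Later, in the prediction stage, the objects are loaded back.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```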

Outputs: The only images output by this module are dendrograms of the HDBSCAN clusterer for each of the star/gal/QSO setups (see the example for the star setup below); they can be found in the data/output/hdbscan_gridsearch directory.

_images/CPz_HDBSCAN_dendrogram_star.png