.. _binary:

binary\_classifier module
=========================

.. automodule:: binary_classifier
   :members:
   :undoc-members:
   :show-inheritance:


In short, the binary\_classifier module takes an input catalogue with labels,
and uses the photometic bands in the catalogue to create colours. These colours are then used
as attributes to train a classifier to classify the datapoints.

Here we give a run-through of the code, taking as an example the star classifier.
The bulk of this script is summarized in the image below.

.. image:: images/tikz_grid.png
  :width: 150

*****
Description of code
*****
**Setup**: The training catalogue is read in with the ``get_data`` function, and
we specify the filetype (here in .csv  format). The target is also defined -
this is done using the ``hclass_dict``  to convert the star  into its
numeric representation (as defined in the :ref:`config`).

**Prepare attributes**: Then the ``get_all_features`` function is called. It is
passed ``photo_band_list``, which defines the filters considered, which are
later used in the function ``get_all_features`` to create all colour combinations
among them. The variable ``combine_type``  takes the values `subtract'
or `divide'  depending on the input data (magnitudes or fluxes respectively).
It also takes the catalogue data and target name and the corresponding numeric representation.

**Scaling**: The ``do_scale``  function is then called to scale the data (mean of 0
and variance of 1 for each attribute column). The scaler object is saved too, for
use in the prediction stage later on.

**RF gridsearch**: The ``do_random_forest``  function is then called which runs a
hyperparameter gridsearch on a RF classifier to obtain the best hyperparameter setup
(if ``gridsearch``  is set to True) to then run this RF once using the labels
star/non-star (in this example), which provides a list of important attributes,
which is then used later.

**HDBSCAN gridsearch**: The ``do_hdbscan_gridsearch``  function is then called,
which runs the gridsearch on HDBSCAN, finding the best setup for the binary star
classifier (in this example), trying different numbers of the top RF attributes
from the list of importances (``RF_top``) as found in the ``do_random_forest`` function call,
different numbers of dimensions to which to reduce these attributes to using PCA (``ncomp``),
and then a different value of ``min_cluster_size`` for the hdbscan hyperparameter itself.

**Process HDBSCAN gridsearch output**: After this gridsearch is completed,
``compute_performances`` is called to create a file with the associated
metric scores for each of the classifiers' setups (different numbers of top
RF attributes, different PCA dimensions, different ``min_cluster_size``).
The ``find_best_hdbscan_setup`` function is called to find the best setup
for the classifier in question (e.g. star), and then ``write_best_labels_binary``
writes these best labels to a separate csv file in terms of a binary setup
(i.e. 1 for e.g. star, 0 for e.g. non-star).  We also save the best setups to a
text file (i.e. how many top RF attributes used, to how many dimensions they were
reduced using PCA, and what ``min_cluster_size`` was used for HDBSCAN).

**Save best HDBSCAN classifier**: The function ``train_and_save_hdbscan`` is then
called to train and save the HDBSCAN classifier and save the trained HDBSCAN
classifier object (later used in the prediction stage). The PCA object is also
saved to be run on the new data in the prediction stage. The position of each datapoint
in PCA space is also saved to a text file. A dendrogram plot
of each trained HDBSCAN clusterer is also saved.

**Outputs**: The only images output from this module run are dendrograms of the HDBSCAN
clusterer for each of the star/gal/QSO setups (see the example for the star setup
below), and can be found in the data/output/hdbscan_gridsearch directory.

.. image:: images/CPz_HDBSCAN_dendrogram_star.png
  :width: 300