Configuration
=============

All **project parameters** (i.e. those pertaining to a specific dataset) are contained in a configuration file located in ``/configs/{configuration_file}.yaml``. This file, together with the input/output directories and other run parameters specified directly via the CLI (Command Line Interface) or via shell script (see also :ref:`files/how_to_use:Workflow scripts`), passes the necessary parameters to the scripts. 

The idea here is that the user can specify all necessary parameters for each project in this configuration file, so that one batch of images acquired in the same way (i.e. one project) always corresponds to one configuration file. See :ref:`files/best_practices:Keep your workspace organised` for project organisation tips. 

We also provide a complete configuration file, ``configs/mzb_example_config.yaml``, for the example project ``data/mzb_example_data``; it can be used as a template for your own projects (we recommend making a copy of it for each new project!). 

Parameters explanation
----------------------

This list is structured as follows: 

 - ``parameter_name``: ``[admissible_value_1, admissible_value_2]`` Description of parameter, suggested values and rationale. 

.. hint:: \ \ 

    Parameters appear in the same order as in the configuration file template for clarity; the order of parameters makes no difference to how the pipelines work. 

.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This first block contains some general parameters: 

    .. # Arguments not to be spec via CLI. 
 
 - ``glob_random_seed``: ``[int]`` this is just an arbitrary number used by model trainers, important for reproducibility. 
 - ``glob_root_folder``: ``[string]`` this is the root folder of the project, it could be for example ``/home/user/my_project/``. 
 - ``glob_blobs_folder``: ``[string]`` this is the location where you want the clips of the segmented organisms to be saved; we strongly recommend putting this inside of the main data folder, for example ``/data/shared/mzb-workflow/data/derived/blobs/``. 
 - ``glob_local_format``: ``[jpg, pdf, ...]`` what format do you want the plotting outputs to be saved in; acceptable values are: ``pdf``, ``jpg``, ``png`` and other common formats (see `matplotlib <https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html>`__ documentation for details).
 - ``model_logger``: ``[wandb, tensorboard]`` which data logger is used to track model training progress; for the moment, ``wandb`` (`Weights & Biases <https://wandb.ai/site>`__) and ``tensorboard`` (`TensorBoard <https://www.tensorflow.org/tensorboard>`__) are supported. Note that W&B requires an account and must be set up by the user, see :ref:`files/best_practices:Logging your model's training`. 

The second block of parameters is specific to image segmentation. If the segmentation results are not satisfactory (i.e. organisms incompletely clipped, debris or other noise segmented as organisms, etc), changing these values might produce better results: 

    .. # Image parsing specific

 - ``impa_image_format``: ``[jpg, png, ...]`` what format are the original images in? Matching should be case-insensitive, and common formats like ``jpg``, ``png`` and others should be supported. 
 - ``impa_clip_areas``: ``[int, int, int, int]`` it's common to place a reference scale and/or colour grid in images; here you can define the area where the scale is positioned and *exclude* it. This is specified as coordinates (``[x1, y1, x2, y2]``, in pixels, where ``x1, y1`` is the top-left corner and ``x2, y2`` is the bottom-right corner), so that the regions that fall inside this box are cropped out; for example, ``[2700, 4700, -1, -1]``, where ``-1`` indicates the end of the image. If you don't want to exclude any area, set this value to ``None``. 

    .. hint:: \ \ 

        In ``cv2``, coordinates are swapped with respect to many other implementations, so that x is columns and y is rows. If you are estimating the area to clip with an external program, for example `Microsoft Paint <https://apps.microsoft.com/detail/9PCFS5B6T72H?hl=en-US&gl=US>`__ or `GIMP <https://www.gimp.org/>`__, you probably need to flip x and y to obtain the expected result! 

 - ``impa_area_threshold``: ``[int]`` this is the minimum size (in pixels) that will be considered to be an organism; anything below this threshold will be discarded. When in doubt, start with a low threshold and increase until most noise is removed. 
 - ``impa_gaussian_blur``: ``[int, int]`` the size of the kernel that will be used to smooth the image before processing; you can think of this as the "radius" of the blur: the larger the radius, the stronger the smoothing effect, but also the greater the loss of detail in the image. This should not be changed much, except for very noisy images and/or organisms that are large relative to the full size of the image. 
 - ``impa_gaussian_blur_passes``: ``[int]`` How many times the Gaussian filter should be applied in sequence. 
 - ``impa_adaptive_threshold_block_size``: ``[int]`` Size of the square neighborhood used to collect values and statistics for automatic thresholding. 
 - ``impa_mask_postprocess_kernel``: ``[int, int]`` This is the size of the post-processing kernel that smooths out the segmentation masks; higher values correspond to smoother edges but less detail. 
 - ``impa_mask_postprocess_passes``: ``[int]`` Number of times the smoothing kernel is applied. 

    .. # impa_save_full_mask_dir: data/derived/project_portable_flume/full_image_masks
 
 - ``impa_bounding_box_buffer``: ``[int]`` how many pixels should be added on each side of the mask for buffer (this is useful to evaluate if masks are accurate, for example). 
 - ``impa_save_clips_plus_features``: ``[bool]`` Whether the features of each mask should be saved as CSV (``True``/``False``). 
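
The ``impa_clip_areas`` convention described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: the helper names ``resolve_clip_area`` and ``is_excluded`` and the image size are hypothetical, and treating ``x2, y2`` as exclusive bounds is an assumption. 

.. code-block:: python

   def resolve_clip_area(clip_area, image_width, image_height):
       """Replace -1 entries with the image bounds (hypothetical helper)."""
       if clip_area is None:
           return None  # no area is excluded
       x1, y1, x2, y2 = clip_area
       x2 = image_width if x2 == -1 else x2
       y2 = image_height if y2 == -1 else y2
       return (x1, y1, x2, y2)


   def is_excluded(x, y, box):
       """True if pixel (x, y) lies inside the excluded box (x = column, y = row)."""
       if box is None:
           return False
       x1, y1, x2, y2 = box
       # Assumption: x2 and y2 are exclusive upper bounds
       return x1 <= x < x2 and y1 <= y < y2


   # Hypothetical image of 4000 x 6000 px (width x height)
   box = resolve_clip_area([2700, 4700, -1, -1], image_width=4000, image_height=6000)
   print(box)
   print(is_excluded(3000, 5000, box))  # inside the bottom-right corner box
   print(is_excluded(100, 100, box))    # far from the box

With ``None`` as the clip area, ``is_excluded`` returns ``False`` for every pixel, matching the "do not exclude anything" setting. 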

This block contains parameters for model training and inference. 

    .. ## Run classification routine on image clips
    .. ## Preparation of learning sets (run once if output folder is not there)
    .. ## these data will need to be doctored, to move classes like errors
    .. ## and such into specific subfolders

 - ``lset_class_cut``: ``[kingdom, phylum, class, subclass, order, suborder, family, genus, species]`` This determines the taxonomic rank for cutoff, meaning that all lower taxonomic levels will be clumped together at the specified rank. Annotations at higher taxonomic level than the one specified will be excluded. 
 - ``lset_val_size``: ``[float]`` Which proportion of the annotated data should be set aside for validation? A common value is ``0.1``. 
 - ``lset_taxonomy``: ``[string]`` Full path to the location of the taxonomy file, for example: ``/data/mzb-workflow/data/MZB_taxonomy.csv``. This should be in ``.csv`` format, and contain a full taxonomy of the organisms in the annotated data (see `The taxonomy file`_). 
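
To give an idea of what ``lset_val_size`` and ``glob_random_seed`` mean in practice, here is a minimal sketch of a seeded random split. It is not the pipeline's implementation (which may, for example, split per class); the function ``split_train_val`` and the file names are hypothetical. 

.. code-block:: python

   import random


   def split_train_val(items, val_size, seed):
       """Shuffle reproducibly, then set aside a fraction for validation."""
       rng = random.Random(seed)  # same seed, same shuffle
       shuffled = list(items)
       rng.shuffle(shuffled)
       n_val = round(len(shuffled) * val_size)
       return shuffled[n_val:], shuffled[:n_val]


   # Hypothetical set of 200 annotated clips
   clips = [f"clip_{i:03d}.jpg" for i in range(200)]
   train, val = split_train_val(clips, val_size=0.1, seed=222)
   print(len(train), len(val))

Running the split twice with the same seed returns the same two lists, which is why the seed matters for reproducibility. 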

The following parameters relate to training the classification model. The proposed values will likely work for small datasets (<1'000 images) and a moderate number of classes (<20-30). Machine Learning (ML) model training is a complex topic; the explanations given here are very general and will likely be insufficient to fully grasp all the intricacies! 

    .. ## Finetuning / training config for classifier
 
 - ``trcl_learning_rate``: ``[float]`` This parameter controls the learning rate of the model; the higher the value the quicker it will adjust the weights, but also the quicker it will overfit. Suggested value: ``0.001``. 
 - ``trcl_batch_size``: ``[int]`` The number of images that will be used for training at each iteration. Higher numbers will use more memory and will achieve good accuracies faster, but small numbers will train the model faster. Suggested value: ``16``. 
 - ``trcl_weight_decay``: ``[float]`` How much should the weight of a node in the network decrease (i.e. decay) at each step (see ``trcl_step_size_decay``); decay combats overfitting but can slow down training. Suggested value: ``0``. 
 - ``trcl_step_size_decay``: ``[int]`` How many iterations before applying the weight decay factor. Suggested value: ``5``. 
 - ``trcl_number_epochs``: ``[int]`` How many iterations (i.e. epochs) should the model be trained for. Longer training cycles can potentially yield better accuracies, but they take longer to train and can quickly overfit. Suggested value: ``75``. 

    .. # trcl_gpu_ids: -1 
 
 - ``trcl_save_topk``: ``[int]`` How many of the best models should be saved? You can choose to retain the best 1, 2, 5, etc. models after training; this can be beneficial for evaluating overfitting and convergence. Suggested value: ``1``. 
 - ``trcl_num_classes``: ``[int]`` How many classes should the model be trained for? This needs to be defined by the user, and it corresponds to how many taxa are at the specified taxonomic rank. In our example we had ``8``. 
 - ``trcl_model_pretrarch``: ``[convnext-small, resenet50, efficientnet-b2, densenet161, mobilenet]`` Which model architecture should be used for training; the supported architectures are detailed in :ref:`files/project_structure:Models`. 
 - ``trcl_num_workers``: ``[int]`` How many processes (i.e. workers) do you want the dataloader to spawn? A good rule of thumb is to use the same number of workers as number of threads of your CPU. In our example the value is ``16``. 
 - ``trcl_wandb_project_name``: ``[string]`` Name of the Weights & Biases tracker for your project; you should change this to something meaningful for your project; in our case it was ``mzb-classifiers``. 
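
As a rough guide to how ``trcl_batch_size`` and ``trcl_number_epochs`` interact, the number of weight updates can be estimated as below. This is simple arithmetic rather than pipeline code: the function ``training_steps`` and the dataset size of 900 images are hypothetical, and we assume the last, incomplete batch of each epoch is kept. 

.. code-block:: python

   import math


   def training_steps(n_images, batch_size, n_epochs):
       """Return (batches per epoch, total batches over all epochs)."""
       # Assumption: the final partial batch is used rather than dropped
       steps_per_epoch = math.ceil(n_images / batch_size)
       return steps_per_epoch, steps_per_epoch * n_epochs


   # Hypothetical dataset of 900 training images, suggested batch size and epochs
   per_epoch, total = training_steps(900, batch_size=16, n_epochs=75)
   print(per_epoch, total)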

    .. # trai_model_save_append: "-v1"

This next block contains parameters for the supervised skeleton prediction model (see :ref:`files/scripts/processing_scripts:Supervised Skeleton Prediction`). The same considerations as for the previous block apply. 

    .. ## Finetuning / training config for skeleton prediction

 - ``trsk_learning_rate``: ``[float]`` This parameter controls the learning rate of the model; the higher the value the quicker it will adjust the weights, but also the quicker it will overfit. Suggested value: ``0.0001``.
 - ``trsk_batch_size``: ``[int]`` The number of images that will be used for training at each iteration. Higher numbers will use more memory and will achieve good accuracies faster, but small numbers will train the model faster. Suggested value: ``32``. 
 - ``trsk_weight_decay``: ``[float]`` How much should the weight of a node in the network decrease (i.e. decay) at each step (see ``trsk_step_size_decay``); decay combats overfitting but can slow down training. Suggested value: ``0``. 
 - ``trsk_step_size_decay``: ``[int]`` How many iterations before applying the weight decay factor. Suggested value: ``50``.
 - ``trsk_number_epochs``: ``[int]`` How many iterations (i.e. epochs) should the model be trained for. Longer training cycles can potentially yield better accuracies, but they take longer to train and can quickly overfit. Suggested value: ``750``. 

    .. # trsk_gpu_ids: -1

 - ``trsk_save_topk``: ``[int]`` How many of the best models should be saved? You can choose to retain the best 1, 2, 5, etc. models after training; this can be beneficial for evaluating overfitting and convergence. Suggested value: ``1``. 
 - ``trsk_num_classes``: ``[int]`` Since this is a binary classifier (i.e. pixels are either part of the predicted skeleton or they are not), this should be ``2``. In case of annotations referring to multiple features this can be changed according to the number of features. 
 - ``trsk_model_pretrarch``: ``[mit_b2, mit-b2, efficientnet-b2]`` Which model architecture should be used for training; the supported architectures are detailed in :ref:`files/project_structure:Models`. 
 - ``trsk_num_workers``: ``[int]`` How many processes (i.e. workers) do you want the dataloader to spawn? A good rule of thumb is to use the same number of workers as number of threads of your CPU. In our example the value is ``16``. 
 - ``trsk_wandb_project_name``: ``[string]`` Name of the Weights & Biases tracker for your project; you should change this to something meaningful for your project; in our case it was ``mzb-skeletons``. 

    .. # trsk_tversky_loss_w1: 
    .. # trai_model_save_append: "-v1"

This block contains further convenience parameters for inference using trained skeleton prediction models and outputs. 

    .. ## Inference config 
    .. # infe_model_folder: models/mzb-classifiers/ # likely not used to allow renku parse as input

 - ``infe_model_ckpt``: ``[last, best]`` Which model should be used? The ``last`` model is the newest training iteration, and ``best`` is the model that performed best on the validation set (available only if a validation set is specified). 
 - ``infe_num_classes``: ``[int]`` How many classes should the inference be carried out on? It should be the same number of classes the model has been trained on. In our example it was ``8``. 
 - ``infe_image_glob``: ``[string]`` What suffix and/or extension should be attached to output images? This should be placed in double quotes ``""`` and can be a glob pattern (shell-style wildcards, which are not the same as regular expressions; see `glob documentation <https://docs.python.org/3/library/glob.html>`__). In our case, we append a suffix and extension at the end of the original image name (using the wildcard ``*``): ``"*_rgb.jpg"``. 
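
To see which file names a wildcard pattern such as ``"*_rgb.jpg"`` selects, the standard-library ``fnmatch`` module can be used, as in the sketch below. The file names are made up for illustration, and this only demonstrates the matching rule, not how the pipeline itself consumes ``infe_image_glob``. 

.. code-block:: python

   from fnmatch import fnmatchcase

   pattern = "*_rgb.jpg"  # value of infe_image_glob in the example config

   # Hypothetical file names
   filenames = [
       "sample_01_rgb.jpg",
       "sample_01_mask.jpg",
       "sample_02_rgb.png",
       "notes.txt",
   ]

   # "*" matches any run of characters; the rest must match literally
   matches = [name for name in filenames if fnmatchcase(name, pattern)]
   print(matches)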

These parameters are related to the unsupervised skeletonization: 

    .. ## Skeletonization
    .. ## unsupervised skeletonization

 - ``skel_class_exclude``: ``[string]`` Should any class be excluded from the processing? For example, unidentifiable organisms or calibration images. In our case these images were labelled as ``errors``. 
 - ``skel_conv_rate``: ``[float]`` This is the pixel-to-millimetre conversion rate. It has to be provided by the user and is used for all images in the dataset (see :ref:`files/scripts/processing_scripts:Segmentation`). In our case this was ``131.6625``, obtained by averaging manual measurements over several images. 
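
The conversion rate can be derived and applied as sketched below. The measurements are the ones listed in the comment of the example configuration file; the helper ``px_to_mm`` and the 500 px length are hypothetical, and we assume the rate is expressed in pixels per millimetre, so that lengths in pixels are divided by it. 

.. code-block:: python

   from statistics import mean

   # Manual scale measurements from several images, in px per mm
   measurements_px_per_mm = [133.1, 136.6, 133.2, 133.2, 133.2, 118.6, 133.4, 132.0]
   conv_rate = mean(measurements_px_per_mm)


   def px_to_mm(length_px, rate_px_per_mm):
       """Convert a length in pixels to millimetres (hypothetical helper)."""
       return length_px / rate_px_per_mm


   print(round(conv_rate, 4))
   print(round(px_to_mm(500, conv_rate), 2))  # a 500 px long skeleton, in mm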

.. # skel_save_usnup_masks: data/derived/project_portable_flume/skeletons/automatic_skeletons/

These are additional parameters for supervised skeletonization model output: 

    .. ## supervised skeletonization
    ..  - ``skel_label_thickness``: How many pixels wide should be the line over the skeleton be? We used a value of ``3``. ### NOT USED ANYMORE? 
 
 - ``skel_label_buffer_on_preds``: ``[int]`` How many pixels wide should the line over the skeleton be? We used a value of ``25``. 
 - ``skel_label_clip_with_mask``: ``[bool]`` Are the clips of the organisms the same ones that skeletonization should be carried out on? In our case we had ``False``, since blobs and skeletonization training set do not have the same filenames. 


.. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Complete configuration file for ``mzb_example_data``
----------------------------------------------------
Below is a complete example of a configuration file for the example project ``mzb_example_data``. 

.. code-block:: yaml

   # Arguments not to be spec via CLI
   glob_random_seed: 222 
   glob_root_folder: /home/jovyan/work/mzb-workflow/
   glob_blobs_folder: /home/jovyan/work/mzb-workflow/data/derived/blobs/
   glob_local_format: pdf
   model_logger: wandb

   # Image parsing specific 
   impa_image_format: jpg
   impa_clip_areas: [2700, 4700, -1, -1] # x1, y1, x2, y2. Ignore areas inside this (-1 means until the end)
   impa_area_threshold: 5000 # ignore areas smaller than this
   impa_gaussian_blur: [21, 21]
   impa_gaussian_blur_passes: 3
   impa_adaptive_threshold_block_size: 351
   impa_mask_postprocess_kernel: [11, 11]
   impa_mask_postprocess_passes: 5
   # impa_save_full_mask_dir: data/derived/project_portable_flume/full_image_masks
   impa_bounding_box_buffer: 200
   impa_save_clips_plus_features: True

   # Run classification routine on image clips 
   ## Preparation of learning sets (run once if output folder is not there)
   ## these data will need to be doctored, to move classes like errors 
   ## and such into specific subfolders
   lset_class_cut: order
   lset_val_size: 0.1
   # moved to args of function
   # lset_taxonomy: /home/jovyan/work/mzb-workflow/data/MZB_taxonomy.csv

   ## Finetuning / training config for classifier
   trcl_learning_rate: 0.0001
   trcl_batch_size: 8
   trcl_weight_decay: 0
   trcl_step_size_decay: 5
   trcl_number_epochs: 75 # 75
   # trcl_gpu_ids: -1 
   trcl_save_topk: 1
   trcl_num_classes: 8
   trcl_model_pretrarch: convnext-small #resenet50 #efficientnet-b2 #convnext-small #densenet161 #mobilenet
   trcl_num_workers: 16
   trcl_wandb_project_name: mzb-classifiers # needed if using weight and biases logger
   trcl_logger: wandb # tensorboard | wandb -- select the logger for training

   ## Finetuning / training config for skeleton prediction
   trsk_learning_rate: 0.001
   trsk_batch_size: 32
   trsk_weight_decay: 0
   trsk_step_size_decay: 25
   trsk_number_epochs: 400
   # trsk_gpu_ids: -1
   trsk_save_topk: 1
   trsk_num_classes: 2
   trsk_model_pretrarch: mit_b2 #mit-b2 #efficientnet-b2
   trsk_num_workers: 16
   trsk_wandb_project_name: mzb-skeletons
   trsk_logger: wandb # tensorboard | wandb -- select the logger for training
   # trsk_tversky_loss_w1: 
   # trai_model_save_append: "-v1"

   ## Inference config 
   # infe_model_folder: models/mzb-classifiers/ # likely not used to allow renku parse as input
   infe_model_ckpt: last # best or last, best is on validation error
   infe_num_classes: 8
   infe_image_glob: "*_rgb.jpg" 

   ## Skeletonization
   # unsupervised skeletonization
   skel_class_exclude: errors
   skel_conv_rate: 131.6625 # px / mm; mean of [133.1, 136.6, 133.2, 133.2, 133.2, 118.6, 133.4, 132.0]
   # skel_save_usnup_masks: data/derived/project_portable_flume/skeletons/automatic_skeletons/

   # supervised skeletonization
   skel_label_thickness: 3
   skel_label_buffer_on_preds: 25
   skel_label_clip_with_mask: False # We need same set data (blobs and skeletonization training set are not the same filenames)


The taxonomy file
-----------------
This file contains information about the taxonomy of each class (e.g. species, genus, or other taxa) in the dataset. Its location is specified in the running parameters declared in the bash scripts, see :ref:`files/how_to_use:Working with the project`. 

The first column of the taxonomy file should be named ``query`` and should contain the name of the class ("class" here refers to the category of the object, in this case the organism identity, not to a specific phylogenetic rank); all the other columns should correspond to a taxonomic rank, and should contain the pertinent taxon for that class. This should be saved as a CSV file in an appropriate location (for instance, ``/data/MZB_taxonomy.csv``), structured like so: 

+---------------+---------+------------+---------+-----------+---------------+----------+---------------+----------+
| query         | kingdom | phylum     | class   | subclass  | order         | suborder | family        | genus    |
+===============+=========+============+=========+===========+===============+==========+===============+==========+
| ephemeroptera | Metazoa | Arthropoda | Insecta | Pterygota | Ephemeroptera | NA       | NA            | NA       |
+---------------+---------+------------+---------+-----------+---------------+----------+---------------+----------+
| heptageniidae | Metazoa | Arthropoda | Insecta | Pterygota | Ephemeroptera | Setisura | Heptageniidae | NA       |
+---------------+---------+------------+---------+-----------+---------------+----------+---------------+----------+
| isoperla      | Metazoa | Arthropoda | Insecta | Pterygota | Plecoptera    | NA       | Perlodidae    | Isoperla |
+---------------+---------+------------+---------+-----------+---------------+----------+---------------+----------+

Such a taxonomy file can be easily generated from a list of classes using utilities like the R package `taxize <https://github.com/ropensci/taxize>`__ or others. 

Please note that the selection of taxonomic ranks can be different (for instance, it could be ``class, family, genus, species``); the only constraint is that the requested taxonomic cutoff rank (parameter ``lset_class_cut``) must also exist in the taxonomy file. If for some classes the requested taxonomic cutoff has no value or is NA (because that level is not available, or the query is at a higher taxonomic rank), then that class is dropped and none of its instances will be considered for model training. 

For example, if our taxonomy file looks like the table above and we requested the taxonomic cutoff ``order``, we would obtain 2 classes (Ephemeroptera, rows 1 and 2; Plecoptera, row 3); if we requested the taxonomic cutoff ``family``, we would obtain 2 classes (Heptageniidae, row 2; Perlodidae, row 3); if we requested the taxonomic cutoff ``suborder``, we would obtain 1 class (Setisura, row 2). 
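
The cutoff logic in this example can be reproduced with a short sketch using only the Python standard library. This is an illustration of the rule as described here, not the function used by the workflow; the name ``classes_at_rank`` is hypothetical, and the CSV text is simply the table above. 

.. code-block:: python

   import csv
   import io

   TAXONOMY_CSV = """query,kingdom,phylum,class,subclass,order,suborder,family,genus
   ephemeroptera,Metazoa,Arthropoda,Insecta,Pterygota,Ephemeroptera,NA,NA,NA
   heptageniidae,Metazoa,Arthropoda,Insecta,Pterygota,Ephemeroptera,Setisura,Heptageniidae,NA
   isoperla,Metazoa,Arthropoda,Insecta,Pterygota,Plecoptera,NA,Perlodidae,Isoperla
   """


   def classes_at_rank(csv_text, rank):
       """Map each query to its taxon at `rank`, dropping empty or NA entries."""
       mapping = {}
       for row in csv.DictReader(io.StringIO(csv_text)):
           taxon = row[rank].strip()
           if taxon and taxon != "NA":
               mapping[row["query"]] = taxon
       return mapping


   for rank in ("order", "family", "suborder"):
       mapping = classes_at_rank(TAXONOMY_CSV, rank)
       print(rank, sorted(set(mapping.values())))

Queries that are missing from the returned mapping are the ones that would be dropped at that cutoff. 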

Please also see :ref:`files/scripts/processing_scripts:Preparing training data` for details on the function that prepares the training data using the taxonomy file.