lidar_platform.classification.feature_selection

Created on Thu Jul 28 10:33:50 2022

@author: Mathilde Letard

lidar_platform.classification.feature_selection.feature_clean(features)

Remove NaN and Inf values from the feature set. No normalization is applied; only NaN and Inf values are cleaned.

Parameters: features (numpy array) – input feature dataset (e.g., the “features” field of a dict obtained with load_sbf_features()).
Returns: dataset – a dataset containing no NaN or Inf values.
Return type: numpy array
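
The description above does not say whether cleaning drops points or substitutes values. The following is a minimal sketch that assumes row removal; the helper name drop_non_finite_rows is hypothetical and is not part of lidar_platform.

```python
import numpy as np

def drop_non_finite_rows(features):
    """Return only the rows of `features` whose values are all finite.

    Assumption: cleaning means discarding points, not replacing values.
    """
    features = np.asarray(features, dtype=float)
    keep = np.isfinite(features).all(axis=1)  # False for rows holding NaN or +/-Inf
    return features[keep]

features = np.array([[1.0, 2.0],
                     [np.nan, 3.0],
                     [4.0, np.inf],
                     [5.0, 6.0]])
cleaned = drop_non_finite_rows(features)
```

If labels are stored alongside the features, the same boolean mask would have to be applied to them so that rows stay aligned.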

lidar_platform.classification.feature_selection.filter_corr_with_selected_ft(all_ft, candidate_ft, selected_ft, threshold)

Check the compatibility of features based on their linear correlation with an existing set of features. This function evaluates whether a new predictor can be added to a set of previously selected features and scales without exceeding the maximum accepted inter-feature linear correlation coefficient.

Parameters:

all_ft (numpy array (n_points x n_predictors)) – array containing the value of each predictor for each point (“features” field of the dict obtained with load_sbf_features()).
candidate_ft (list(int)) – indices of the elements to consider for selection in the all_ft array (column index of the candidates to evaluate).
selected_ft (numpy array (n_points x n_selected)) – array containing the features that are already selected.
threshold (float) – maximum accepted correlation between predictors.

Returns:

valides – list of the indices of the selected features in the original ds[‘features’] array (obtained with load_sbf_features()).

Return type:

list(int)
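
To illustrate the compatibility check, here is a minimal sketch that keeps a candidate only if its correlation with every already-selected feature stays at or below the threshold. Using the absolute Pearson coefficient from numpy.corrcoef is an assumption, and compatible_candidates is a hypothetical name, not the library function.

```python
import numpy as np

def compatible_candidates(all_ft, candidate_ft, selected_ft, threshold):
    """Return candidate column indices whose |r| with all selected features is <= threshold."""
    valid = []
    for idx in candidate_ft:
        candidate = all_ft[:, idx]
        # Absolute Pearson correlation with each already-selected column.
        corr = [abs(np.corrcoef(candidate, selected_ft[:, j])[0, 1])
                for j in range(selected_ft.shape[1])]
        if max(corr) <= threshold:
            valid.append(idx)
    return valid

rng = np.random.default_rng(0)
base = rng.normal(size=200)
independent = rng.normal(size=200)
near_copy = 2.0 * base + 0.01 * rng.normal(size=200)  # almost perfectly correlated with base
all_ft = np.column_stack([base, near_copy, independent])
selected_ft = all_ft[:, [0]]
valid = compatible_candidates(all_ft, [1, 2], selected_ft, threshold=0.85)
```

In this synthetic example, column 1 is rejected because it nearly duplicates the selected column, while column 2 passes.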

lidar_platform.classification.feature_selection.get_acc_expe(trads, testds, plot=True, save=False, model=0)

Train a random forest model for point cloud feature classification and return metrics describing its performance.

Parameters:

trads (dictionary of numpy arrays) – training features dictionary.
testds (dict of numpy arrays) – test features dictionary.
save (bool) – whether the plot must be saved.
plot (bool) – whether the plot must be opened.
model (int (0 or 1)) – type of model. 0 = scikit-learn random forest, 1 = OpenCV random forest

Returns:

accuracy (float) – Overall Accuracy of the classifier.
fscore (float) – F1-score (averaged on all classes).
numpy.mean(confid_pred) (float) – mean prediction confidence.
recall (float) – recall (averaged on all classes).
precision (float) – precision (averaged on all classes).
uas (numpy.array(float)) – User’s accuracies (per class).
pas (numpy.array(float)) – Producer’s accuracies (per class).
fscores (numpy.array(float)) – F1-score per class.
confc (numpy.array(float)) – mean prediction confidence per class.
recalls (numpy.array(float)) – recall per class.
precisions (numpy.array(float)) – precision per class.
labels (numpy.array(float)) – labels.
feat_imptce (numpy.array(float)) – feature importance values.
classifier (sklearn RandomForestClassifier or OpenCV RTrees classifier)
labels_pred (np.array(int)) – model predictions.
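
As a reading aid for the metrics listed above, the sketch below derives Overall Accuracy and the per-class precision, recall and F1-score from a confusion matrix using NumPy only. It is not the code of get_acc_expe; the helper name per_class_metrics and the toy labels are hypothetical. In standard remote-sensing usage, User's accuracy corresponds to per-class precision and Producer's accuracy to per-class recall.

```python
import numpy as np

def per_class_metrics(y_true, y_pred):
    """Compute OA and per-class precision/recall/F1 from reference and predicted labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    rows = np.searchsorted(labels, y_true)   # reference class position
    cols = np.searchsorted(labels, y_pred)   # predicted class position
    cm = np.zeros((labels.size, labels.size), dtype=int)
    np.add.at(cm, (rows, cols), 1)           # rows: reference, columns: prediction
    tp = np.diag(cm).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        precisions = tp / cm.sum(axis=0)     # user's accuracy per class
        recalls = tp / cm.sum(axis=1)        # producer's accuracy per class
        fscores = 2 * precisions * recalls / (precisions + recalls)
    return {"accuracy": tp.sum() / cm.sum(), "precisions": precisions,
            "recalls": recalls, "fscores": fscores, "labels": labels}

metrics = per_class_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Classes that are never predicted, or never present in the reference, yield NaN in the corresponding per-class entries of this sketch.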

lidar_platform.classification.feature_selection.get_best_rf_select_iter(dictio_rf_select, trads, testds, wait, threshold)

Get an optimized set of features and scales by analyzing the variations of OA or oob-score when performing random forest feature importance-based iterative selection.

Parameters:

dictio_rf_select (dictionary) – obtained when performing rf_ft_selection.
trads (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
testds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
wait (int) – number of iterations to take into account for monitoring.
threshold (float) – accepted OA variance within the wait period.

Returns:

dictio_results (dictionary) – contains the resulting predictors set and associated parameters and metrics.
- ‘Best_it’: best iteration
- ‘Features’: optimized set of features
- ‘Scales’: scales related to the optimized features
- ‘Feat_names’: feature names (maybe redundant with ‘Features’)
- ‘Feat_imp_mean’: mean of feature importance
- ‘Scales_name’: scale names
- ‘Scales_freq’: scale frequency (per scale)
- ‘Scales_imp’: scale importance (per scale)
- ‘OA’: Overall Accuracy
- ‘Fscore’: F1-score
- ‘Confid’: confidence
- ‘Recall’: recall
- ‘Precision’: precision
- ‘UAs’: User’s accuracies (per class)
- ‘PAs’: Producer’s accuracies (per class)
- ‘Class_fscores’: F1-score per class
- ‘Class_conf’: confidence per class
- ‘labels’: labels
classifier (sklearn.ensemble.RandomForestClassifier) – theoretically optimal classifier (trained only with the selected features/scales).
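
The exact rule combining wait and threshold is not documented here. The sketch below shows only one plausible reading: scan the OA history with a sliding window of wait iterations and, in the first window whose variance does not exceed threshold, return the iteration with the highest OA. The function name first_stable_iteration and the OA values are hypothetical.

```python
import numpy as np

def first_stable_iteration(oa_history, wait, threshold):
    """Index of the best OA in the first `wait`-long window with variance <= threshold.

    Returns None when no window is stable. This is an assumed rule, not the
    documented behaviour of get_best_rf_select_iter.
    """
    oa = np.asarray(oa_history, dtype=float)
    for start in range(0, oa.size - wait + 1):
        window = oa[start:start + wait]
        if np.var(window) <= threshold:
            return start + int(np.argmax(window))
    return None

oa_history = [0.70, 0.80, 0.90, 0.91, 0.905, 0.85]
best_it = first_stable_iteration(oa_history, wait=3, threshold=1e-4)
```

A larger wait makes the criterion more conservative, because more consecutive iterations must agree before a plateau is accepted.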

lidar_platform.classification.feature_selection.get_n_optimal_sc_ft(train_ds, test_ds, n_scales, n_features, eval_sc, threshold)

Get the best n_features and n_scales for classification based on inter-feature correlation and information score.

Parameters:

train_ds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
test_ds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
n_scales (int) – number of different scales to select.
n_features (int) – number of different features to select.
eval_sc (float) – scale at which to evaluate each feature’s information score.
threshold (float) – maximum accepted correlation between predictors.

Returns:

dictio_ft – contains the resulting predictors set and associated parameters and metrics.
- ‘Feats’: feature list
- ‘Scales’: scale list
- ‘feat_imp’: feature importance values
- ‘Indices’: indices of the selected features in the initial array of features
- ‘Freq’: number of votes obtained by each selected scale
- ‘OA’: Overall Accuracy of classifier
- ‘Fscore’: F1-score (averaged on all classes)
- ‘Confidence’: Confidence (averaged on all classes)
- ‘Recall’: Recall (averaged on all classes)
- ‘Precision’: Precision (averaged on all classes)
- ‘Class_UA’: User’s accuracies (per class)
- ‘Class_PA’: Producer’s accuracies (per class)
- ‘Class_Fscore’: F1-score per class
- ‘Class_confidence’: class confidence
- ‘Class_recall’: Recall per class
- ‘Class_precision’: Precision per class
- ‘Labels’: labels

Return type:

dictionary

lidar_platform.classification.feature_selection.get_n_uncorr_ft(ft_all, ft_select, ft_score, nf, threshold)

Iteratively complete a partially filled set of uncorrelated features. The function looks for additional features to select until nf uncorrelated features are reached.

Parameters:

ft_all (numpy array (n_points x n_predictors)) – array containing the value of each predictor for each point (“features” field of the dict obtained with load_sbf_features()).
ft_select (numpy array (n_selected)) – index in ft_all of each predictor that already passed the selection.
ft_score (numpy array (n_predictors x 1)) – array containing the information score of each predictor.
nf (int) – number of different features to select.
threshold (float) – maximum accepted correlation between predictors.

Returns:

ft_select – list of the indices of the selected features in the original ds[‘features’] array (obtained with load_sbf_features()).

Return type:

list(int)
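
One way to picture the completion step is a greedy loop: visit the remaining predictors from highest to lowest information score and accept each one whose correlation with everything selected so far stays at or below the threshold, until nf features are reached. This sketch rests on assumptions (absolute Pearson correlation, score-ordered traversal), and complete_selection is a hypothetical name.

```python
import numpy as np

def complete_selection(ft_all, ft_select, ft_score, nf, threshold):
    """Greedily extend `ft_select` with uncorrelated, high-score columns up to `nf` entries."""
    selected = [int(i) for i in ft_select]
    corr = np.abs(np.corrcoef(ft_all, rowvar=False))  # (n_predictors x n_predictors)
    order = np.argsort(np.ravel(ft_score))[::-1]      # highest score first
    for idx in order:
        if len(selected) >= nf:
            break
        idx = int(idx)
        if idx in selected:
            continue
        if np.all(corr[idx, selected] <= threshold):
            selected.append(idx)
    return selected

rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 300))
ft_all = np.column_stack([a, a + 0.01 * rng.normal(size=300), b, c])
ft_score = np.array([[0.9], [0.8], [0.5], [0.7]])
result = complete_selection(ft_all, [0], ft_score, nf=3, threshold=0.85)
```

Column 1 has the second-highest score but is skipped because it nearly duplicates column 0; the loop then picks columns 3 and 2 in score order.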

lidar_platform.classification.feature_selection.get_scales_feats(ds)

Get the scales and features present in the dataset read by cc_3dmasc.load_sbf_features().

Parameters:

ds (dictionary) – data dictionary containing features, labels, names obtained using load_sbf_features().

Returns:

numpy array ((ns*nf) x 1) – list containing the scale of each descriptor.
numpy array (nf x 1) – list containing the feature name of each descriptor.
numpy array ((ns*nf) x 1) – list containing the complete name of each descriptor.

lidar_platform.classification.feature_selection.info_score(ds)

Get the mutual information score of each feature (computed with respect to the labels to predict). This metric is used in the classifier optimization procedure.

Parameters: ds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
Returns: dictio_ft – contains the name of each feature and the associated score value.
Return type: dictionary
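
The estimator used by info_score is not stated here. To give an intuition for what a mutual information score measures, the sketch below uses a simple histogram (plug-in) estimate written with NumPy; this is a swapped-in technique for illustration, not the library's method, and binned_mutual_information, the bin count and the synthetic data are all assumptions.

```python
import numpy as np

def binned_mutual_information(feature, labels, bins=10):
    """Histogram estimate (in nats) of the mutual information between a feature and class labels."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    edges = np.histogram_bin_edges(feature, bins=bins)
    x_bin = np.digitize(feature, edges[1:-1])        # bin index in [0, bins - 1]
    classes, y_idx = np.unique(labels, return_inverse=True)
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (x_bin, y_idx), 1.0)            # joint counts of (bin, class)
    joint /= joint.sum()                             # joint probabilities
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = {
    "informative": binned_mutual_information(labels + 0.1 * rng.normal(size=2000), labels),
    "noise": binned_mutual_information(rng.normal(size=2000), labels),
}
```

A feature that separates the classes scores close to the label entropy, while a feature unrelated to the labels scores near zero.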

lidar_platform.classification.feature_selection.inter_ft_corr_filter(features_set, features_score, threshold)

Prune a set of predictors by keeping only the most informative elements among correlated pairs. First, linear correlation between all provided features at all provided scales is computed. Then, when the correlation between two predictors exceeds a given threshold, only the one with the highest mutual information score is kept.

Parameters:

features_set (numpy array (n_points x n_predictors)) – array containing the value of each predictor evaluated for each point (e.g., “features” field of the dictionary obtained with load_sbf_features()).
features_score (numpy array (n_predictors x 1)) – array containing the information score of each predictor (obtained with info_score()).
threshold (float) – maximum accepted correlation between predictors.

Returns:

select – indices of the selected features in the original features_set array (column indices).

Return type:

list(int)
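
The pruning described above can be sketched as follows: compute the correlation matrix, then for every pair above the threshold discard the member with the lower information score. The pair traversal order and the use of absolute Pearson correlation are assumptions, and prune_correlated is a hypothetical name.

```python
import numpy as np

def prune_correlated(features_set, features_score, threshold):
    """Return column indices kept after removing the weaker member of each correlated pair."""
    scores = np.ravel(features_score)
    corr = np.abs(np.corrcoef(features_set, rowvar=False))
    keep = np.ones(scores.size, dtype=bool)
    for i in range(scores.size):
        for j in range(i + 1, scores.size):
            if keep[i] and keep[j] and corr[i, j] > threshold:
                # Drop the less informative predictor of the pair.
                loser = i if scores[i] < scores[j] else j
                keep[loser] = False
    return [int(k) for k in np.flatnonzero(keep)]

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
features_set = np.column_stack([a, a + 0.05 * rng.normal(size=300), b])
features_score = np.array([[0.3], [0.6], [0.4]])
select = prune_correlated(features_set, features_score, threshold=0.85)
```

Here columns 0 and 1 are strongly correlated, so only column 1 (the higher score) survives together with the independent column 2.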

lidar_platform.classification.feature_selection.n_best_uncorr_ft(ds, nf, corr_threshold)

Select nf uncorrelated features depending on their mutual information score and linear correlation.

Parameters:

ds (dictionary) – data dictionary containing features, labels, names obtained with cc_3dmasc.load_sbf_features().
nf (int) – number of different features to select.
corr_threshold (float) – maximum accepted correlation between predictors.

Returns:

select – list of the indices of the selected features in the original ds[‘features’] array (obtained with cc_3dmasc.load_sbf_features()).

Return type:

list(int)

lidar_platform.classification.feature_selection.n_best_uncorr_sc(ds, n_scales, corr_threshold)

Select n_scales uncorrelated scales depending on their mutual information score, linear correlation, and a voting process. For each feature, all available scales are evaluated and pruned depending on their correlations. The n_scales most frequently retained scales across all features are then kept as the final set of scales.

Parameters:

ds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
n_scales (int) – number of different scales to select.
corr_threshold (float) – maximum accepted correlation between predictors.

Returns:

optim_ok (list(float)) – list of selected scales
freq_optim (list(int)) – number of votes obtained by each selected scale.

lidar_platform.classification.feature_selection.nan_percentage(ds)

Get the percentage of NaN values for each feature. This can help explain why a feature at a given scale is not contributing, or identify relevant minimal scales to use. Reminder: 3DMASC outputs NaN for points where the feature could not be computed, e.g. because there are no neighbors within the sphere at the specified scale.

Parameters: ds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
Returns: dictio_ft – dictionary containing the name of each feature and the associated percentage of NaN.
Return type: dictionary
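
A minimal sketch of the same computation on a plain array, assuming the features and their names are passed separately; the helper name nan_percentage_per_feature and the feature names are hypothetical.

```python
import numpy as np

def nan_percentage_per_feature(features, names):
    """Map each feature name to the percentage of NaN values in its column."""
    features = np.asarray(features, dtype=float)
    pct = 100.0 * np.isnan(features).mean(axis=0)  # fraction of NaN per column
    return {name: float(p) for name, p in zip(names, pct)}

features = np.array([[1.0, np.nan],
                     [2.0, np.nan],
                     [np.nan, 3.0],
                     [4.0, 5.0]])
result = nan_percentage_per_feature(features, ["feat_a", "feat_b"])
```

A high percentage at small scales typically points to neighborhoods that are too small to contain any points.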

lidar_platform.classification.feature_selection.rf_ft_selection(trads, testds, n_scales, n_features, eval_sc, threshold=0.85, step=1)

Perform iterative feature selection using the random forest embedded feature importance as the criterion. First, n_scales scales and n_features features are selected based on their linear correlations and mutual information. Then, this set is iteratively reduced by discarding the feature with the lowest random forest feature importance. At each step, the model is trained again to update the feature importance ranking.

Parameters:

trads (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
testds (dictionary) – data dictionary containing features, labels, names obtained with load_sbf_features().
n_scales (int) – number of different scales to select at the beginning of the process.
n_features (int) – number of different features to select at the beginning of the process.
eval_sc (float) – scale at which to evaluate each feature’s information score at the beginning of the process.
threshold (float) – maximum accepted correlation between predictors.
step (int)

Returns:

dictio_ft – contains the resulting predictors set and associated parameters and metrics at each iteration.

Return type:

dictionary
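
The elimination loop described above can be sketched generically: recompute an importance value for every remaining feature, drop the least important, and repeat. To stay self-contained, the sketch replaces the random forest importance with a stand-in (absolute difference of class means, binary labels only); backward_elimination, class_separation and n_drop are hypothetical names, and n_drop is not meant to model the undocumented step parameter.

```python
import numpy as np

def backward_elimination(features, labels, importance_fn, n_drop=1, min_features=1):
    """Repeatedly drop the `n_drop` least important columns; return the feature set at each iteration."""
    remaining = list(range(features.shape[1]))
    history = [list(remaining)]
    while len(remaining) - n_drop >= min_features:
        # Importance is recomputed on the current subset at every iteration.
        importance = np.asarray(importance_fn(features[:, remaining], labels))
        drop = set(np.argsort(importance)[:n_drop].tolist())  # positions of the weakest columns
        remaining = [f for pos, f in enumerate(remaining) if pos not in drop]
        history.append(list(remaining))
    return history

def class_separation(x, y):
    """Stand-in importance (not a random forest): |mean of class 1 - mean of class 0| per column."""
    return np.abs(x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0))

labels = np.array([0, 0, 0, 1, 1, 1])
features = np.column_stack([labels * 3.0,
                            labels * 1.0,
                            np.array([0.1, -0.1, 0.0, 0.0, 0.1, -0.1])])
history = backward_elimination(features, labels, class_separation)
```

In practice, importance_fn would train a classifier on the current subset and return its feature importances, which is why the ranking is refreshed inside the loop instead of being computed once.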