Identification of morphologically cryptic species with computer vision models: wall lizards (Squamata: Lacertidae: Podarcis) as a case study
Author
Pinho, Catarina
Author
Kaliontzopoulou, Antigoni
Author
Ferreira, Carlos A
Author
Gama, João
text
Zoological Journal of the Linnean Society
2023
2023-05-01
198
1
184
201
https://academic.oup.com/zoolinnean/article/198/1/184/6777764
journal article
10.1093/zoolinnean/zlac087
0024-4082
7926780
DISCRIMINATION
BETWEEN
PODARCIS
BOCAGEI
AND
P.
LUSITANICUS
The overall performance of the three methods for image classification of
P. bocagei
and
P. lusitanicus
in the four different datasets is shown in
Table 2
. Detailed results, including training, validation and test-set evaluation for all cross-validation sets, are shown in the Supporting Information,
Tables S1–S
4
. Accuracy is generally high, ranging from 87.3% in the case of InceptionV
3 in
female dorsal images to 94.8% in male dorsal images when applying InceptionResNetV2. AUC ranges from 0.931 using InceptionV
3 in
female dorsal images to 0.984 using Inception-ResNetV
2 in
male dorsal and head lateral images. F1-scores show that, typically,
P. lusitanicus
is more frequently misclassified than
P. bocagei
, for both
types
of images and for both sexes. All three methods perform similarly in all datasets considering the three performance metrics. Identification of males is generally more accurate than that of females. Considering all five cross-validation replicates of the three models, the identification accuracy of males is significantly higher than that of females only when considering dorsal images (
P
= 0.048, Mann–Whitney–Wilcoxon test). The same result is obtained, but even more pronounced, using other metrics (
P
= 0.009 and
P
= 0.030 for AUC and F1-scores, respectively). With respect to head lateral images, the difference in identification accuracy between sexes also exists but it is significant only for differences in AUC (
P
= 0.046, Mann–Whitney– Wilcoxon test). There is no difference in performance using different image perspectives, neither in the case of males nor in that of females.
As an extension to this basic approach, we tested whether model ensembles (calculated by averaging predictions of different models) would increase classification success. Model ensembles within each of the four datasets do not always improve classification success compared to the best single model (see results in
Table 2
). For instance, in the case of head lateral images, prediction performance is worse with the model ensemble than when using the best-performing model alone. In the case of dorsal images, the improvement is slight for males and more substantial for females.
By contrast, combining the predictions from different views results in a much higher classification success in all cases, where accuracy reaches as high as 97.1% for males and 95.9% for females. These results are presented in
Table 3
and the corresponding confusion matrices in
Figure 2
.
Grad-CAM heatmaps were produced only for the model showing the highest accuracy in each case (Inception-ResNet V
2 in
the case of male dorsal and head lateral images, ResNet
50 in
the case of female dorsal images and Inception V
3 in
the case of female head lateral images). Visualization of the heatmaps confirms that the models are indeed considering the lizard images for classification and not external features (like human fingers, writings, shadows and other non-lizard elements that appear in some images).
Examples of heatmaps used to discriminate the two classes are shown in
Figure 3
. In dorsal images, the model often uses the middle area of the trunk to discriminate the two classes. Still, the head region is also used (and both regions combined). In female dorsal images, the head is not as frequently used as the trunk, but the portion of the trunk used for discrimination is generally more anterior than in males. In both male and female head lateral images, the area around the ear is the one most frequently used for classification, although this region could be more or less shifted towards the throat in both sexes.
Table 2.
Evaluation of the three tested architectures in the four datasets for the two-class experiment. Models with the highest accuracy are highlighted in bold
Sex |
View |
Metric* |
Inception V3 |
ResNet 50 |
Inception ResNetV2 |
Combined predictions |
Males |
Dorsal |
Accuracy |
0.935 |
0.922 |
0.948
|
0.955 |
AUC |
0.976 |
0.982 |
0.984 |
0.982 |
F1
Pboc
|
0.951 |
0.941 |
0.962 |
0.967 |
F1
Plus
|
0.905 |
0.887 |
0.919 |
0.930 |
Head lateral |
Accuracy |
0.926 |
0.929 |
0.936
|
0.930 |
AUC |
0.972 |
0.975 |
0.984 |
0.976 |
F1
Pboc
|
0.946 |
0.947 |
0.953 |
0.949 |
F1
Plus
|
0.882 |
0.895 |
0.889 |
0.885 |
Females |
Dorsal |
Accuracy |
0.873 |
0.906
|
0.905 |
0.943 |
AUC |
0.931 |
0.965 |
0.970 |
0.970 |
F1
Pboc
|
0.905 |
0.930 |
0.929 |
0.948 |
F1
Plus
|
0.810 |
0.857 |
0.859 |
0.908 |
Head lateral |
Accuracy |
0.935
|
0.919 |
0.927 |
0.911 |
AUC |
0.962 |
0.959 |
0.976 |
0.972 |
F1
Pboc
|
0.953 |
0.941 |
0.947 |
0.937 |
F1
Plus
|
0.897 |
0.87 |
0.872 |
0.847 |
*AUC refers to the area under the ROC curve.
Podarcis bocagei
was the positive class. F1, harmonic mean of precision and recall;
Pboc
,
Podarcis bocagei
;
Plus
,
Podarcis lusitanicus
.
Table 3.
Classification success of the six-model ensembles for the two-class experiments
Males |
Females |
Accuracy |
AUC |
F1 |
Precision |
Recall |
Accuracy |
AUC |
F1 |
Precision |
Recall |
Average/global |
0.971 |
0.997 |
0.966 |
0.970 |
0.962 |
0.959 |
0.992 |
0.952 |
0.952 |
0.952 |
P. bocagei
|
0.979 |
0.972 |
0.986 |
0.970 |
0.970 |
0.970 |
P. lusitanicus
|
0.953 |
0.968 |
0.939 |
0.934 |
0.934 |
0.934 |
AUC was calculated assuming
P. bocagei
as the positive case; F1, precision and recall were macro-averaged.
DISCRIMINATION
BETWEEN
THE
NINE
GROUPS
Overall, the performance of the different models for classification of the nine classes is worse than in the two-class case. Unlike the experiment involving only
P. bocagei
and
P. lusitanicus
, in all analyses considering nine classes there is some evidence of overfitting (see the Supporting Information,
Tables S5–S
9 for detailed training, validation and testing evaluation scores), which could not be completely overcome by varying the hyperparameters. A summary of the performance of each model is presented in
Table 4
.
In general, accuracy ranges from 76.3% for ResNet
50 in
female head perspectives to 85.3% for InceptionResNetV
2 in
male dorsal views. A striking result is the highly significant difference between male and female image identification accuracy, with consistently higher accuracies in male datasets, which holds for both
types
of images (
P
<0.0001 for all comparisons, both for accuracy and F1 score, Mann–Whitney–Wilcoxon test). On the other hand, there are no differences in performance between the
two types
of images, neither for males nor for females. There are also no major differences between models in classification success. The only significant difference is detected in female head lateral images, in which ResNet50 performs significantly worse than Inception-ResNet
V2
(
P
= 0.0325 for both accuracy and F1-score, Wilcoxon signed rank test). Unlike the two-class case, in which the utility of ensemble models is mostly restricted to the combination of predictions from different perspectives, without important improvements in the within-dataset case, in the nine-class experiment ensemble models combining predictions from the three architectures for each image perspective greatly improve classification accuracy when compared to the best single model (see
Table 4
)
.
Using estimates from different views by averaging across the six model predictions improves classification success even further. These results are shown in
Table 5
and the respective confusion matrices shown in
Figure 4
. In this case, prediction accuracy reaches as high as 93.5% for males and 91.2% for females.
The distribution of classification metrics according to the species is shown in
Table 5
. Taking a deeper look into these classification scores, it appears that several species are fairly well recognizable, with F1 scores above 0.90: this is the case for
P. bocagei
,
P. carbonelli
,
P. lusitanicus
,
P. Ʋaucheri
s.l.
,
P. Ʋaucheri
s.s.
and
P. Ʋirescens
in males, and for the same species except
P. lusitanicus
in females. The most problematic species is, in both sexes,
P. liolepis
. Considering confusion matrices, it is noticeable that individuals of this species are often misclassified as
P. Ʋirescens
(more so in the case of females than males). Noteworthy is that the misclassification between the cryptic
P. guadarramae
and
P. lusitanicus
is minimal (7.9% of
P. lusitanicus
females and 4.6% of males are classified as
P. guadarramae
and 0 and 1.6% of
P. guadarramae
females and males are classified as
P. lusitanicus
; see
Fig. 4
).
As for the two-class problem, Grad-CAM analyses show that, typically, the models use lizard – and not other – features for classification. However, even with the visualization tool available, it is not straightforward to understand what the model considers for discrimination. More precisely, the same regions seem to be used to classify distinct species, and it is not evident how differences in these regions are used. The most common patterns for each species are summarized in
Tables 6
and
7
(for males and females, respectively).