Multitask semantic change detection guided by spatiotemporal semantic interaction
ABSTRACT
Semantic Change Detection (SCD) aims to accurately identify the change areas and their categories in dual-time images, which is more complex and challenging than traditional binary
change detection tasks. Accurately capturing the change information of land cover types is crucial for remote sensing image analysis and subsequent decision-making applications. However,
existing SCD methods often neglect the spatial details and temporal dependencies of dual-time images, leading to problems such as change category imbalance and limited detection accuracy,
especially in capturing small target changes. To address this issue, this study proposes a network that guides multitask semantic change detection through spatiotemporal semantic interaction
(STGNet). STGNet enhances the ability to capture spatial details by introducing a Detail-Aware Path (DAP) and designs a Bidirectional Guidance Module for Spatial Detail and Semantic
Information for adaptive feature selection, improving feature extraction capabilities in complex scenes. Furthermore, to resolve the inconsistency between semantic information and change
areas, this paper designs a Cross-Temporal Refinement Interaction Module (CTIM), which enables cross-time scale feature fusion and interaction, constraining the consistency of detection
results and improving the recognition accuracy of unchanged areas. To further enhance detection performance, a dynamic depthwise separable convolution is designed in the CTIM module, which
can adaptively adjust convolution kernels to more precisely capture change features in different regions of the image. Experimental results on three SCD datasets show that the proposed
method outperforms other existing methods in various evaluation metrics. In particular, on the Landsat-SCD dataset, the F1 score (F1scd) reaches 91.64%, and the separation Kappa coefficient
improves by 17.68%. These experimental results fully demonstrate the significant advantages of STGNet in improving semantic change detection accuracy, robustness, and generalization
capability.
INTRODUCTION
Change detection refers to the identification of surface changes by comparing remote sensing images from different time periods1,2, and it is widely applied in fields such
as environmental monitoring, urban planning, disaster warning, emergency response, and resource management3,4,5,6. Common change detection methods can be divided into two categories: Binary
Change Detection (BCD) and Semantic Change Detection (SCD). BCD primarily focuses on whether surface changes have occurred, distinguishing between “change” and “no change” states, but it
fails to identify the specific types of changes. To address the limitations of BCD in change type identification, SCD was introduced7,8. SCD effectively identifies specific changes in land
use types, such as forest to farmland or water body to built-up land. This comprehensive change type analysis provides richer and more specific change information, helping to better
understand the underlying causes of changes and offering more comprehensive support for land use planning9,10. However, traditional change detection methods rely on manual visual
interpretation or simple image processing techniques, which are not only time-consuming and labor-intensive but also easily influenced by subjective factors, making it difficult to meet the
processing requirements of large-scale, high-resolution remote sensing data11,12,13. As a result, with the rapid development of deep learning technologies, particularly the successful
applications in fields such as object detection14,15 and semantic segmentation16,17, remote sensing change detection based on convolutional neural networks (CNNs) has also made significant
progress. CNNs can handle complex, high-dimensional high-resolution remote sensing images and accurately identify changes in land cover, land use, and other aspects through powerful feature
extraction capabilities18,19,20. Additionally, in recent years, the introduction of advanced models such as graph convolutional networks (GCNs) and Transformers has further driven the
development of change detection technology. GCNs21 can effectively capture spatial and structural relationships in remote sensing images, improving the ability to identify change areas. The
Transformer model22, through its self-attention mechanism, excels in capturing long-range dependencies and local features, which is particularly important for detecting subtle surface
changes. Therefore, these advanced deep learning models have not only improved the accuracy and efficiency of change detection but also provided more flexible and powerful technical support
for handling more complex and dynamic remote sensing data. In deep learning-driven change detection technology, significant progress has been made in the research of BCD, especially in the
identification of change areas23,24,25,26. For example, Ling et al.27 proposed an innovative deep learning architecture, IRA-MRSNet, which combines multiscale residual twin networks with
integrated residual attention. This architecture efficiently captures multiscale features, accurately refines the edges of changing regions, and fuses global semantic information to achieve
accurate localization. On the other hand, Peng et al.28 proposed a difference-enhanced dense-attention convolutional neural network (DDCNN), an end-to-end change detection method that
improves the accuracy of change detection in dual-temporal remote sensing images by introducing a dense-attention mechanism and difference-enhancing units. Chen et al.29 designed a
dual-attention fully convolutional twin network (DASNet), which, through the dual-attention mechanism and weighted double margin contrast loss, captures more discriminative features and
enhances the robustness of change detection for high-resolution satellite images. However, the above-mentioned studies still focus on the simpler BCD task, while deep
learning-based SCD research is still in its development stage. In existing research, three commonly used architectures for SCD include single-branch, dual-branch, and multi-task
architectures. The first structure is the single-branch architecture (Fig. 1a), which merges dual-time images (e.g., concatenation) as input to identify the category differences between
images from different time points and output the results30,31,32. However, this design increases the complexity of category learning in the encoder, requiring the model to have stronger
capabilities to handle more categories in segmentation tasks. Therefore, in the second dual-branch structure (Fig. 1b), a shared-weight encoder processes the image categories of different
time periods separately and then performs change detection, such as FC-Siam-conc, FC-Siam-diff33, etc. This architecture reduces the classification difficulty seen in Architecture 1, but
during the encoder phase, the SCD task is still treated as a single task, and the differences between change detection (CD) and semantic segmentation (SS) tasks often make it difficult for
the model to balance both, thus limiting its potential for each task. As a result, in the third multi-task architecture (Fig. 1c, d), the SCD task is decomposed into sub-tasks. In Fig. 1c,
the CD and SS tasks are handled as two independent sub-tasks, and the final result is obtained by masking the change areas with land cover types. A typical example is the HRSCD-str3 proposed
by Daudt et al.34, which detects land cover type changes through a BCD branch. However, such methods lack deep feature interaction between the two tasks in their handling of different
tasks, and the SCD results are easily influenced by the single-branch results. Therefore, in Fig. 1d, the input for BCD is changed to the dual-time semantic information extracted from the
semantic encoder for change detection, effectively integrating the SS sub-task. Most existing SCD research adopts the structure in Fig. 1d. For example, Chen et al.35 proposed a
feature-constrained change detection network (FCCDN) that applies feature constraints in semantic feature extraction and feature fusion, using these features for the BCD task. Experiments
show that the SS branch effectively improves the accuracy of the BCD task. Ding et al.36 further validated this architecture in their proposed BiSRNet, introducing global self-attention (SR)
and cross-time self-attention (Cot-SR) to enhance information interaction between images from different times, improving consistency between the BCD and SS results. Zhang et al.37 proposed
a multi-task architecture called ChangeMask, which decouples SCD into SS and BCD, then learns the change representation from semantic representations through the Transformer module. Jiang et
al.38 proposed a semantic change detection network based on hierarchical semantic graph interaction, which models dual-time correlations and uses graph learning to represent interactions
across different feature layers to accurately identify change areas and land cover types. Wang et al.39 proposed an agricultural geographic scene and plot-scale constrained semantic change
detection framework (AGSPNet), which optimizes plot extraction using multi-source geographic data products and a bidirectional cascading network (BDCN), combined with a cross-attention
network (CCNet) to extract semantic and change features of crops. Although many of the above methods adopt the multi-task framework shown in Fig. 1d as the main framework for SCD tasks, the key parts of this architecture mainly concern semantic feature extraction, feature interaction between different time periods, and balancing the SS and BCD tasks.
These three aspects are decisive factors for the accuracy of the SCD task. Below are the unresolved issues in these three aspects of the SCD task: * 1. _Loss of detailed information_: The
loss of detailed information leads to inaccurate detection of small targets and boundaries, as well as increased underreporting of minor changes (e.g., vegetation degradation) and false
alarms due to external factors (e.g., changes in appearance and illumination). * 2. _Lack of bi-temporal feature correlation_: Feature extraction for a single time period in bi-temporal
images lacks the ability to capture change information across time and does not account for the consistent correlation between bi-temporal features, which often leads to inaccurate
predictions of unchanged areas. * 3. _Imbalance problem between SS and BCD tasks_: In the final SCD results, there is often a situation where the detected changed regions in the BCD task are
inconsistent with the semantic regions in the SS task, leading to contradictory results. To address the shortcomings of the above SCD tasks in the field of remote sensing, this study
proposes a network that guides multi-task semantic change detection through spatiotemporal semantic interaction (STGNet). This network aims to improve the loss of edge information in change
areas and the imbalance between BCD tasks and semantic categories in SCD tasks. The main contributions of this article are as follows: * In the critical stage of feature extraction, we
introduced the Detail-Aware Path (DAP) and designed a Bidirectional Guidance Module for Spatial Detail and Semantic Information (BiDS). This module enhances the ability to extract detailed
features in the detail branch and deep semantic information in the context branch, thereby achieving comprehensive optimization of feature extraction and refining the edges of changing
regions. * We propose a Dynamic Depthwise Separable Convolution (DDConv) that can adaptively adjust the convolution kernels based on the features of the input data, thereby capturing the
change features in different regions of the image more precisely without increasing the computational burden. * We further propose a Cross-Temporal Refinement Interaction Module (CTIM),
which effectively enhances the information exchange capability between dual-temporal features. By fusing and interacting features across time scales, domain adaptation between dual-temporal
domains is achieved, capturing region change information based on the temporal dimension and improving the recognition accuracy of unchanged regions. * To verify the effectiveness and
robustness of STGNet, we conducted extensive experiments on three publicly available semantic change detection datasets and performed a comprehensive comparison with the state-of-the-art
methods. The experimental results show that our model has significant advantages in SCD tasks. The rest of this article is organized as follows: In section "The proposed method",
we provide a detailed explanation of the proposed methods; section "Datasets and experimental setup" describes the SCD datasets and experimental setup used in this article; section "Experimental comparison and analysis" presents the experimental results and provides a detailed analysis; and section "Conclusions" gives the conclusion and outlook.
THE PROPOSED METHOD
In this section, we provide a detailed introduction to the proposed SCD network for dual-temporal remote sensing images (i.e., STGNet), with the overall
structure shown in Fig. 2. The network adopts a multi-task learning architecture to separately handle semantic segmentation and change detection tasks, enabling comprehensive learning of
change regions and categories. The architecture is based on the Siamese network commonly used in BCD research, and constructs a Siamese-based Dual-Path Feature Extractor (SDPNet). This
feature extractor consists of two paths: the Detail-Aware Path (DAP) and the Context Path (CP). The CP path uses ResNet50 as the backbone network to extract deep semantic features. It
outputs resolutions at 1/2, 1/4, 1/8, 1/8, and 1/8 of the original resolution, progressively reducing spatial resolution to enhance the extraction of semantic information. The DAP path
compensates for the spatial detail information lost in the CP path, capturing fine spatial structures and edge information in remote sensing images. Its output resolutions are 1/2, 1/4, 1/4,
and 1/4 of the original resolution. It is important to note that during the dual-path extraction process, we designed a Bidirectional Guidance Module for Spatial Detail and Semantic
Information (BiDS) to facilitate the exchange of spatial detail information and deep semantic features between the two branches. This module enables each branch to selectively learn features
from the other, thereby enhancing the overall feature representation and refining the extraction of detailed and semantic information. To enhance the model’s ability to recognize unchanged regions, we propose a Cross-Temporal Interaction Module (CTIM). This module leverages cross-learning
principles to promote deep interactions and fusion of semantic information between dual-temporal images. By capturing region change information based on the temporal dimension, it helps
learn richer semantic features from single-time-period images. Additionally, we introduce an Attention Feature Fusion Module (AFF) to better integrate semantic and spatial feature
information, strengthening the spatial detail within the semantic features. Finally, we employ six residual modules to perform precise binary change detection on the tightly integrated
dual-temporal semantic features. These residual modules, leveraging their deep learning capabilities, effectively capture subtle differences between remote sensing images at two different
time points. The generated binary change detection map is then used to fine-tune the land cover classification results, thereby accurately obtaining the final results of SCD and achieving
high-precision detection of semantic changes in remote sensing images.
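To make the overall data flow concrete, the following is a minimal, runnable structural sketch of this pipeline in PyTorch. All module bodies are simplified stand-ins (plain convolution blocks in place of ResNet50, BiDS, CTIM, AFF, and the six residual blocks), and every name and channel width is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class STGNetSketch(nn.Module):
    def __init__(self, num_classes=7, ch=64):
        super().__init__()
        # Context Path stand-in: downsamples to 1/8 resolution for deep semantics.
        self.cp = nn.Sequential(conv_block(3, ch, 2), conv_block(ch, ch, 2), conv_block(ch, ch, 2))
        # Detail-Aware Path stand-in: stops at 1/4 resolution to keep spatial detail.
        self.dap = nn.Sequential(conv_block(3, ch, 2), conv_block(ch, ch, 2))
        self.fuse = conv_block(2 * ch, ch)            # stand-in for BiDS/AFF fusion
        self.ss_head = nn.Conv2d(ch, num_classes, 1)  # per-date land-cover logits
        self.bcd_head = nn.Conv2d(2 * ch, 1, 1)       # stand-in for the six residual blocks

    def encode(self, x):
        d = self.dap(x)                               # (B, ch, H/4, W/4) detail features
        s = self.cp(x)                                # (B, ch, H/8, W/8) semantic features
        s = F.interpolate(s, size=d.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([d, s], dim=1))

    def forward(self, t1, t2):                        # Siamese: both dates share weights
        z1, z2 = self.encode(t1), self.encode(t2)
        sem1, sem2 = self.ss_head(z1), self.ss_head(z2)
        change = self.bcd_head(torch.cat([z1, z2], dim=1))
        return sem1, sem2, change                     # change mask later selects the SCD map

sem1, sem2, change = STGNetSketch()(torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256))
```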
BIDIRECTIONAL GUIDANCE MODULE FOR SPATIAL DETAIL INFORMATION AND SEMANTICS (BIDS)
In SCD tasks, high-resolution remote sensing images have a wide imaging range, rich content, and high complexity. Traditional simple feature extraction strategies often struggle to comprehensively and effectively capture key
information in the images. Current methods typically combine local and global features during the single-branch feature extraction phase, attempting to capture both global context and local
details simultaneously. However, when dealing with complex backgrounds and subtle changes, these methods often fail to balance the extraction of both global context and local details,
leading to missed target areas or false positives. To address this issue, the dual-branch network architecture has been shown to effectively improve performance during feature extraction in
the encoder40,41. This architecture extracts context information and spatial detail information using different convolution layers, thereby capturing key change features in the image more
accurately. However, traditional dual-branch networks have a significant drawback: the two branches typically operate
independently, lacking mutual supplementation and collaboration. Specifically, while the context information branch excels at capturing global information, its ability to perceive subtle
local changes is weaker, especially when processing high-resolution remote sensing images, where the rich detailed information and complex contextual relationships in the image need to be
considered simultaneously. To address this issue, the BIDS module is proposed to optimize the feature extraction capability in the dual-branch network. The BIDS module draws inspiration from
the design concept in PIDNet40, enhancing the feature expression ability of each branch by guiding the feature exchange between branches. Specifically, the BIDS module allows each branch to
selectively learn and integrate feature information from the other branch, effectively combining global and local features. This cross-branch feature interaction not only improves the
model’s ability to extract detail and semantic features but also effectively overcomes the issue of detail information loss in traditional dual-branch structures. Compared to existing local-global feature extraction methods, the BIDS module, through its cross-branch feature fusion mechanism, enables each branch not only to focus on extracting global context or local details but
also selectively integrate feature information from the other branch, thereby enhancing the collaboration between branches. The detailed structure is shown in Fig. 3, where we define the
corresponding pixel vectors in the feature maps \(X\) and \(Y\) as \({\overrightarrow{v}}_{x}\) and \({\overrightarrow{v}}_{y}\), respectively, and perform dynamic convolution operations.
The process of dynamic convolution for \({\overrightarrow{v}}_{x}\) and \({\overrightarrow{v}}_{y}\) is shown below:
$$f\left(x\right)=g({\widetilde{W}}^{T}\left(x\right)x+\widetilde{b}(x))$$ (1) $$\widetilde{W}\left(x\right)=\sum_{k=1}^{K}{\pi }_{k}\left(x\right){\widetilde{W}}_{k},\quad \widetilde{b}\left(x\right)=\sum_{k=1}^{K}{\pi }_{k}\left(x\right){\widetilde{b}}_{k}$$ (2) where \({\pi }_{k}(x)\) \(\left(0\le {\pi }_{k}\left(x\right)\le 1,\ \sum_{k=1}^{K}{\pi }_{k}\left(x\right)=1\right)\) represents the attention weight of the \(k\)-th linear function \({\widetilde{W}}_{k}^{T}x+{\widetilde{b}}_{k}\), \(g\) is the activation function, \({\widetilde{W}}_{k}\) and \({\widetilde{b}}_{k}\) are the corresponding weight matrices and bias vectors, and \(x\) represents the input feature component \({\overrightarrow{v}}_{x}\) or \({\overrightarrow{v}}_{y}\).
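As a concrete reference for Eqs. (1)-(2), the sketch below implements dynamic convolution in PyTorch: K candidate kernels are mixed per input by softmax attention weights \({\pi }_{k}(x)\) (which satisfy the stated constraints), and the aggregated kernel is applied via a grouped convolution. The choice of K = 4, the attention head, and ReLU as \(g\) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """K parallel kernels W~_k, b~_k mixed by input-dependent weights pi_k(x)
    (softmax => 0 <= pi_k <= 1 and sum_k pi_k = 1), as in Eqs. (1)-(2)."""

    def __init__(self, cin, cout, k=3, K=4):
        super().__init__()
        self.cin, self.cout, self.k, self.K = cin, cout, k, K
        self.weight = nn.Parameter(0.02 * torch.randn(K, cout, cin, k, k))  # W~_k
        self.bias = nn.Parameter(torch.zeros(K, cout))                      # b~_k
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(cin, K))                        # pi_k logits from x

    def forward(self, x):
        B, _, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)                       # (B, K)
        w = torch.einsum("bk,kocij->bocij", pi, self.weight)      # per-sample kernel W~(x)
        b = torch.einsum("bk,ko->bo", pi, self.bias)              # per-sample bias b~(x)
        # A grouped conv applies each sample's own aggregated kernel in one call.
        out = F.conv2d(x.reshape(1, B * self.cin, H, W),
                       w.reshape(B * self.cout, self.cin, self.k, self.k),
                       b.reshape(B * self.cout), padding=self.k // 2, groups=B)
        return F.relu(out.reshape(B, self.cout, H, W))            # g(.) in Eq. (1), assumed ReLU

y = DynamicConv2d(64, 64)(torch.randn(2, 64, 32, 32))             # (2, 64, 32, 32)
```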
As shown in Fig. 4, dynamic convolution endows the convolution kernel with an attention mechanism. Due to the non-linear generation of \({\pi }_{k}\), dynamic convolution can adaptively
adjust the combination of convolution kernels for different inputs to focus on more critical features, resulting in a stronger feature representation capability when processing
high-resolution remote sensing images. Subsequently, we apply batch normalization to these features and perform element-wise multiplication and summation operations, effectively integrating
feature information from different branches. Finally, activation is performed using the Sigmoid function to obtain the probability \(\sigma\) that two pixels belong to the same object. If
\(\sigma\) is high, it is more likely to trust \({\overrightarrow{v}}_{x}\) from its own branch; otherwise, it is more likely to trust \({\overrightarrow{v}}_{y}\) from the other branch. The
detailed processing procedure is as follows: $$\sigma =Sig(Sum(BN\left(f\left({\overrightarrow{v}}_{x}\right)\right)\cdot BN(f({\overrightarrow{v}}_{y}))))$$ (3) $$Out=\sigma {\overrightarrow{v}}_{x}+(1-\sigma ){\overrightarrow{v}}_{y}$$ (4) where \(f\) denotes the dynamic convolution operation, \(BN\) denotes the batch normalization operation, \(Sum\) denotes the summation operation, and \(Sig\) denotes the sigmoid activation function.
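A compact sketch of the gate in Eqs. (3)-(4) follows. A plain 3 × 3 convolution stands in for the dynamic convolution \(f\) above, and the summation is taken over channels so that \(\sigma\) is a per-pixel probability; both choices are assumptions consistent with the description.

```python
import torch
import torch.nn as nn

class BiDSGate(nn.Module):
    """Per-pixel gate sigma choosing between the detail-branch feature X and
    the context-branch feature Y, following Eqs. (3)-(4)."""

    def __init__(self, ch):
        super().__init__()
        self.fx = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.fy = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x, y):
        # Eq. (3): sigma = Sig(Sum(BN(f(vx)) * BN(f(vy)))), summed over channels
        sigma = torch.sigmoid((self.fx(x) * self.fy(y)).sum(dim=1, keepdim=True))
        # Eq. (4): trust the branch's own feature where sigma is high, the other otherwise
        return sigma * x + (1.0 - sigma) * y

out = BiDSGate(64)(torch.randn(2, 64, 64, 64), torch.randn(2, 64, 64, 64))  # (2, 64, 64, 64)
```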
By using BIDS to achieve mutual complementation between the two branch paths, the detail-aware path can leverage contextual information to enhance its understanding of local details. Meanwhile, the contextual path benefits from the detail information to capture the global
structure more accurately. This bi-directional guidance between branches strengthens the extraction capability of each path, facilitating the extractor in obtaining richer feature maps
during the processing of high-resolution remote sensing images, which is crucial for subsequent SCD tasks.
CROSS-TEMPORAL INTERACTION MODULE (CTIM)
Change detection is based on dual-temporal images, yet features extracted from a single time period are limited: they lack the capability to capture change information along the temporal dimension.
Therefore, deeply exploring the intrinsic relationships and differences between dual-temporal images helps identify unchanged areas in change detection36. We employ a cross-learning strategy
to design a Cross-Temporal Interaction Module (CTIM) that facilitates feature fusion and interaction across time scales. CTIM takes the output of SDPNet as input, consisting of low-level
spatial detail features (DT1, DT2) and high-level semantic features (ST1, ST2) for the T1 and T2 time periods, respectively. The feature representations from the DAP and CP branches are
complementary. Using simple addition or concatenation to fuse them overlooks the diversity of the two types of information, which may degrade performance. Moreover, the information from
different time periods can vary significantly. Therefore, we designed a hybrid aggregation layer to combine information from different time periods, using the contextual information from the
CP branch of another time period to guide the feature responses of the detail branch. The overall structure is shown in Fig. 5. In this structure, we have designed a Dynamic Depthwise
Separable Convolution (DDConv), as shown in Fig. 6. Unlike traditional depthwise separable convolution, which uses fixed convolution kernels, the DDConv introduces a dynamic kernel mechanism
that can adaptively adjust the convolution kernels based on the features of the input data, allowing for more refined control over the convolution operation. This adaptive property enables
the DDConv to capture the change features in different regions of the image more precisely when processing complex remote sensing images. Subsequently, to promote deep interaction and fusion
of cross-temporal information, we adopt an interaction strategy. Taking the T1 time period as an example, we activate the high-level semantic features from the T2 period (ST2) using a
sigmoid function, then use them as weight factors to multiply the low-level detail features from the T1 period (DT1). This allows the high-level semantic features from T2 to guide the
spatial detail information in T1, thereby obtaining cross-temporal spatial detail information (\({F}_{T1}\)). The computation process is shown below: $${F}_{T1}=Sig({DDConv}_{3\times 3}(Up({S}_{T2})))\times {DDConv}_{3\times 3}\left({D}_{T1}\right)$$ (5) $${F}_{T2}=Sig({DDConv}_{3\times 3}(Up({S}_{T1})))\times {DDConv}_{3\times 3}({D}_{T2})$$ (6) where \({DDConv}_{3\times 3}\) refers to Dynamic Depthwise Separable Convolution with a kernel size of 3 × 3, \(Up\) denotes upsampling, and \(Sig\) denotes the sigmoid activation function.
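The sketch below illustrates one plausible reading of DDConv and the cross-temporal gating of Eqs. (5)-(6): K depthwise kernels mixed by input-dependent softmax weights, followed by a fixed pointwise convolution, with the upsampled semantics of the other date reweighting the current date's detail features. K, the gating head, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DDConvSketch(nn.Module):
    """Dynamic depthwise separable conv (sketch): K depthwise kernels mixed by
    softmax attention per input, then a fixed 1x1 pointwise convolution."""

    def __init__(self, ch, k=3, K=4):
        super().__init__()
        self.ch, self.k, self.K = ch, k, K
        self.weight = nn.Parameter(0.02 * torch.randn(K, ch, 1, k, k))  # K depthwise kernels
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, K))
        self.pw = nn.Conv2d(ch, ch, 1)                                  # pointwise step

    def forward(self, x):
        B, C, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)                      # (B, K) adaptive mixing weights
        w = torch.einsum("bk,kcoij->bcoij", pi, self.weight)     # (B, C, 1, k, k)
        out = F.conv2d(x.reshape(1, B * C, H, W),
                       w.reshape(B * C, 1, self.k, self.k),
                       padding=self.k // 2, groups=B * C)        # per-sample depthwise conv
        return self.pw(out.reshape(B, C, H, W))

def ctim_gate(ddc_s, ddc_d, s_other, d_own):
    """Eqs. (5)-(6): sigmoid-activated semantics of the *other* date, upsampled
    to the detail resolution, reweight the detail features of the current date."""
    s_up = F.interpolate(s_other, size=d_own.shape[-2:], mode="bilinear", align_corners=False)
    return torch.sigmoid(ddc_s(s_up)) * ddc_d(d_own)

ddc_s, ddc_d = DDConvSketch(64), DDConvSketch(64)
f_t1 = ctim_gate(ddc_s, ddc_d, torch.randn(2, 64, 32, 32), torch.randn(2, 64, 64, 64))
```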
In addition, to better fuse spatial detail features and deep semantic features, we introduce the AFF module41. During the feature fusion phase, the AFF module does not use a simple linear fusion method.
Instead, it employs a dual-branch structure with different scales to separately extract channel attention weights, and dynamic feature fusion is then performed based on the weight
information. The specific structure is shown in Fig. 7. For the input low-level spatial detail information \(X\) and high-level semantic information \(Y\), we first use linear addition for
feature fusion to obtain the preliminary fused result \(F\). The calculation formula is as follows: $$F=X+Y$$ (7) Subsequently, to more accurately capture the important information in
feature \(F\), we use global average pooling and pointwise convolution to extract the global and local channel attention from \(F\), respectively. This approach, which employs different
branches for feature processing, helps us better identify the location and spatial information of the changed areas, thereby improving the model’s localization and feature expression
capabilities. Specifically, the global average pooling branch performs a global average pooling operation on \(F\) to obtain a global feature vector, which is then processed using pointwise
convolution to extract global channel attention weights. The pointwise convolution branch directly applies pointwise convolution to \(F\) to extract local channel attention weights. Finally,
the extracted attention weights are normalized to the range [0, 1] using the sigmoid function and multiplied by the corresponding features. In this way, each feature is assigned a weight
based on its importance. The weighted features are then summed to obtain the final fused result. This fusion method allows the model to dynamically adjust during the feature fusion process,
improving its ability to extract complex feature sets from high-resolution remote sensing images. The specific calculation formula is as follows: $$L\left(F\right)=B({PWConv}_{2}(\delta (B({PWConv}_{1}(F)))))$$ (8) $$G\left(F\right)=B({PWConv}_{2}(\delta (B({PWConv}_{1}(GAP(F))))))$$ (9) $$Z=Sig\left(L+G\right)\times X+\left(1-Sig\left(L+G\right)\right)\times Y$$ (10) where \(F\) represents the initial feature fusion result, \({PWConv}_{1}\) and \({PWConv}_{2}\) both denote 1 × 1 pointwise convolutions, \(B\) denotes the BatchNorm layer, \(\delta\) denotes the ReLU activation function, and \(GAP\) stands for Global Average Pooling.
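A minimal sketch of Eqs. (7)-(10) follows; the channel-reduction ratio r and the exact branch layout follow the common AFF design41 and are assumptions where the text is silent.

```python
import torch
import torch.nn as nn

class AFFSketch(nn.Module):
    """Local (pointwise) and global (GAP) channel attention over F = X + Y,
    then a sigmoid gate mixing X and Y, following Eqs. (7)-(10)."""

    def __init__(self, ch, r=4):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.BatchNorm2d(ch // r),
                                 nn.ReLU(inplace=True), nn.Conv2d(ch // r, ch, 1),
                                 nn.BatchNorm2d(ch))
        self.local = branch()                       # L(F), Eq. (8)
        self.glob = branch()                        # G(F), Eq. (9), fed with GAP(F)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x, y):
        f = x + y                                   # Eq. (7): preliminary linear fusion
        w = torch.sigmoid(self.local(f) + self.glob(self.gap(f)))  # broadcast over HxW
        return w * x + (1.0 - w) * y                # Eq. (10): weighted dynamic fusion

z = AFFSketch(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```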
LOSS FUNCTION
We optimized the BCD and SS tasks in remote sensing semantic change detection using three loss functions: the semantic loss function (\({L}_{s}\)), the binary change loss function (\({L}_{c}\)), and the semantic change loss function (\({L}_{sc}\)) proposed by Ding et al.42. The semantic
loss function \({L}_{s}\) is designed for the semantic categories in subtask SS, and it calculates the multiclass cross-entropy loss between the predicted semantic categories in the SS task
and the true semantic categories in the semantic change ground truth map. The detailed calculation process is as follows:
$${L}_{s}=-\frac{1}{N}\sum_{i=1}^{N}{y}_{i}\text{log}({\widehat{y}}_{i})$$ (11) where \({y}_{i}\) and \({\widehat{y}}_{i}\) denote the true semantic label category and the probability of
being predicted as the \(i\)-th category, respectively. \(N\) denotes the number of land cover types in semantic change detection. The “no-change” category, denoted by 0, is excluded, as this
facilitates the model’s focus on extracting semantic features in the change regions. The binary change loss function \({L}_{c}\) computes the binary cross-entropy loss between the predicted and actual change maps in the BCD task and mitigates the class imbalance between changed and unchanged regions. The calculation formula is as
follows: $${L}_{c}=-\frac{1}{N}\sum_{i=1}^{N}{W}_{c}\times {y}_{c}\text{log}\left({\widehat{y}}_{c}\right)+{W}_{nc}\times \left(1-{y}_{c}\right)\text{log}\left(1-{\widehat{y}}_{c}\right)$$
(12) where \({y}_{c}\) represents the true value in the binary change label (i.e., 0 for unchanged and 1 for changed), and \({\widehat{y}}_{c}\) denotes the predicted probability that a pixel is changed. \(N\) denotes the number of image pixels, \({W}_{c}\) denotes the weight of the change region, and
\({W}_{nc}\) denotes the weight of the non-change region, with \({W}_{c}\) and \({W}_{nc}\) set to 0.25 and 0.75, respectively. \({L}_{sc}\) is a loss function based on contrastive learning,
which connects BCD tasks with SS tasks. Specifically, in the overall task of semantic change detection, the \({L}_{sc}\) loss function encourages the prediction of similar probability
distributions between unchanged regions, but penalizes the prediction of similar probability distributions in changed regions. The calculation formula is as follows:
$${L}_{sc}=\begin{cases}\text{cos}\left({x}_{1},{x}_{2}\right), & {y}_{c}=1\\ 1-\text{cos}\left({x}_{1},{x}_{2}\right), & {y}_{c}=0\end{cases}$$ (13) where \({x}_{1}\) and
\({x}_{2}\) are the corresponding pixel vectors in the two semantic segmentation results, and \({y}_{c}\) is the binary change label value at the same position. From this, we derive the total loss function
\({L}_{scd}\) by combining \({L}_{s}\), \({L}_{c}\), and \({L}_{sc}\), which is calculated as follows: $${L}_{scd}=\frac{1}{2}\left({L}_{s1}+{L}_{s2}\right)+{L}_{c}+{L}_{sc}$$ (14) where \({L}_{s1}\) and \({L}_{s2}\) denote the semantic segmentation losses for the images from the two time periods, respectively.
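The three losses and their combination in Eq. (14) can be sketched as below. Whether \({L}_{sc}\) is computed on softmax probabilities or raw semantic features, and the exact ignore-index handling, are assumptions; the 0.25/0.75 weights follow the text.

```python
import torch
import torch.nn.functional as F

def scd_loss(sem1, sem2, change_logit, y1, y2, y_c, w_c=0.25, w_nc=0.75):
    """Sketch of Eqs. (11)-(14). sem1/sem2: (B, C, H, W) per-date class logits
    (index 0 = no-change, ignored in L_s); change_logit: (B, 1, H, W);
    y1/y2: (B, H, W) semantic labels; y_c: (B, H, W) binary change labels."""
    # Eq. (11): multi-class cross-entropy per date, excluding category 0
    l_s1 = F.cross_entropy(sem1, y1, ignore_index=0)
    l_s2 = F.cross_entropy(sem2, y2, ignore_index=0)
    # Eq. (12): class-weighted binary cross-entropy for the BCD task
    p = torch.sigmoid(change_logit.squeeze(1)).clamp(1e-6, 1 - 1e-6)
    y = y_c.float()
    l_c = -(w_c * y * p.log() + w_nc * (1 - y) * (1 - p).log()).mean()
    # Eq. (13): pull the two semantic predictions together on unchanged pixels,
    # push them apart on changed pixels
    cos = F.cosine_similarity(sem1.softmax(1), sem2.softmax(1), dim=1)
    l_sc = torch.where(y_c.bool(), cos, 1.0 - cos).mean()
    # Eq. (14): total SCD loss
    return 0.5 * (l_s1 + l_s2) + l_c + l_sc

loss = scd_loss(torch.randn(2, 6, 64, 64), torch.randn(2, 6, 64, 64),
                torch.randn(2, 1, 64, 64), torch.randint(0, 6, (2, 64, 64)),
                torch.randint(0, 6, (2, 64, 64)), torch.randint(0, 2, (2, 64, 64)))
```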
DATASETS AND EXPERIMENTAL SETUP
DATASETS
To better validate the proposed network, we conduct experiments on three publicly available semantic change detection datasets: the SECOND dataset, the Landsat-SCD dataset, and the Hi-UCD min dataset.
SECOND
The SECOND dataset43
consists of 4,662 pairs of remote sensing images collected from multiple platforms and sensors, covering a number of important urban areas in China, including Hangzhou, Chengdu, and
Shanghai, and encompassing a rich variety of surface coverage types. Each image in this dataset has a size of 512 × 512 pixels, and its spatial resolution ranges from 0.5 to 3 m, capable of
capturing subtle changes in the ground surface. However, only 2,968 dual-time image pairs with ground truth labels are currently available, of which change pixels account for 19.87% of the total image pixels. As shown in the sample in Fig. 8, the dataset defines labels for seven categories, including one no-change category and six land cover change categories (unvegetated
surface, trees, low vegetation, water, buildings, and playgrounds). To evaluate model performance in a scientifically sound manner, we divided the SECOND dataset into a training set and a
test set in a ratio of 4:1.
LANDSAT-SCD
The Landsat-SCD dataset44 is a collection of images captured by Landsat in the Tumushuke area of Xinjiang, China, between 1990 and 2020. These images
include the three basic bands: red (R), green (G), and blue (B), and have a spatial resolution of 30 m. As shown in the sample in Fig. 9, the dataset defines five labeled categories,
including one no-change category and four land cover change categories (farmland, desert, buildings, and water bodies). The change pixels account for 18.89% of the total image pixels,
providing rich information on surface changes. The Landsat-SCD dataset contains 8,468 pairs of images, each sized 416 × 416 pixels. After removing augmented images, the dataset includes 2,385 original image pairs, which are divided into training, validation, and test sets in a 3:1:1 ratio.
HI-UCD MIN
The Hi-UCD min dataset45 consists of 745 image
pairs captured by the Leica ADS100-SH100 between 2017 and 2019 in parts of Tallinn, the capital of Estonia. Each image in the dataset has a size of 1024 × 1024 pixels, with a spatial
resolution of 0.1 m. As shown in Fig. 10, the dataset defines 10 label categories, including one “no change” category and 9 land cover change categories (water, grass, forest, greenhouse,
road, building, bare land, and other types). To scientifically evaluate the model’s performance, we cropped the original dataset into 512 × 512 pixel images and removed the unchanged images.
Ultimately, the numbers of images in the training, validation, and test sets were 571, 100, and 705, respectively.
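The cropping step described above might look like the following sketch; the array layouts and the keep/drop criterion (discard tiles whose change mask is empty) are assumptions based on the description.

```python
import numpy as np

def tile_pairs(img1, img2, label_change, size=512):
    """Crop co-registered 1024x1024 scenes into 512x512 tiles and drop
    tiles with no change pixels (hypothetical Hi-UCD min preprocessing)."""
    tiles = []
    h, w = label_change.shape[:2]
    for top in range(0, h, size):
        for left in range(0, w, size):
            sl = (slice(top, top + size), slice(left, left + size))
            if label_change[sl].any():             # keep only tiles containing change
                tiles.append((img1[sl], img2[sl], label_change[sl]))
    return tiles

pairs = tile_pairs(np.zeros((1024, 1024, 3)), np.zeros((1024, 1024, 3)),
                   np.random.randint(0, 2, (1024, 1024)))
```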
EVALUATION METRICS
In this study, to better perform a quantitative analysis of the experimental results, we used eight objective metrics to evaluate the network’s performance in SCD. These include SCD accuracy metrics: Overall Accuracy (OA), Separation
Kappa Coefficient (SeK)43, F1 score for SCD (F1scd), and the Composite Score (Score)36; BCD accuracy metrics: Mean Intersection over Union (mIoU) and F1 score (F1); and SS evaluation metric:
the Kappa coefficient (Kappa). The OA metric measures the proportion of correctly categorized pixels across all categories relative to the total number of pixels, providing a global,
intuitive assessment of accuracy. Let \(Q=\{{q}_{i,j}\}\) be the confusion matrix, where \({q}_{i,j}\) denotes the number of pixels classified into category \(i\) whose true category is \(j\) (\(i,j\in \{0,1,\cdots ,N\}\), with 0 for no change). The formula for OA is shown below: $$OA=\sum_{i=0}^{N}{q}_{ii}/\sum_{i=0}^{N}\sum_{j=0}^{N}{q}_{ij}$$ (15)
The segmentation accuracy of changed and unchanged regions in BCD tasks is evaluated using mIoU. In the BCD task, mIoU is computed from the unchanged region (IoUnc) and the changed region
(IoUc), calculated as follows: $$mIoU=({IoU}_{nc}+{IoU}_{c})/2$$ (16) $${IoU}_{nc}={q}_{00}/(\sum_{i=0}^{N}{q}_{i0}+\sum_{j=0}^{N}{q}_{0j}-{q}_{00})$$ (17) $${IoU}_{c}=\sum_{i=1}^{N}\sum_{j=1}^{N}{q}_{ij}/(\sum_{i=0}^{N}\sum_{j=0}^{N}{q}_{ij}-{q}_{00})$$ (18) SeK is used to assess the classification accuracy of different land cover types in the
SS task. Here, \(\widehat{Q}=\{{\widehat{q}}_{ij}={q}_{ij}\}\), but \({\widehat{q}}_{00}=0\), which is used to exclude no-change pixels. Its calculation formula is as follows: $$\rho
=\sum_{i=0}^{N}{\widehat{q}}_{ii}/\sum_{i=0}^{N}\sum_{j=0}^{N}{\widehat{q}}_{ij}$$ (19) $$\tau
=\sum_{i=0}^{N}\left(\sum_{j=0}^{N}{\widehat{q}}_{ij}*\sum_{j=0}^{N}{\widehat{q}}_{ji}\right)/{(\sum_{i=0}^{N}\sum_{j=0}^{N}{\widehat{q}}_{ij})}^{2}$$ (20) $$Kappa=(\rho -\tau )/(1-\tau )$$
(21) $$Sek={e}^{{IoU}_{c}-1}\bullet Kappa$$ (22) The composite score (Score) can be calculated based on mIoU and SeK as follows: $$Score=0.3\times mIoU+0.7\times Sek$$ (23) F1scd is used to
evaluate the segmentation precision of Land Use/Land Cover (LULC) classes within the change region. This metric is the F1 score computed from the precision (\({P}_{scd}\)) and recall (\({R}_{scd}\)) of the change region. Its calculation formula is as follows: $${P}_{scd}=\sum_{i=1}^{N}{q}_{ii}/\sum_{i=1}^{N}\sum_{j=0}^{N}{q}_{ij}$$ (24)
$${R}_{scd}=\sum_{i=1}^{N}{q}_{ii}/\sum_{i=0}^{N}\sum_{j=1}^{N}{q}_{ij}$$ (25) $${F1}_{scd}=\frac{2*{P}_{scd}*{R}_{scd}}{{P}_{scd}+{R}_{scd}}$$ (26) The formula for calculating F1 score is:
$$Recall=\sum_{i=1}^{N}\sum_{j=1}^{N}{q}_{ij}/\sum_{i=0}^{N}\sum_{j=1}^{N}{q}_{ij}$$ (27) $$Precision=\sum_{i=1}^{N}\sum_{j=1}^{N}{q}_{ij}/\sum_{i=1}^{N}\sum_{j=0}^{N}{q}_{ij}$$ (28)
$$F1=\frac{2*Recall*Precision}{Recall+Precision}$$ (29)
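For reference, Eqs. (15)-(29) can be computed directly from the confusion matrix, as in the sketch below (note that Eq. (16) takes the mean of the two IoUs).

```python
import numpy as np

def scd_metrics(q):
    """Metrics of Eqs. (15)-(29) from the (N+1)x(N+1) confusion matrix q,
    where entry (i, j) counts pixels predicted as category i with true
    category j, and index 0 is 'no change'."""
    q = q.astype(float)
    total = q.sum()
    oa = np.trace(q) / total                                        # Eq. (15)
    iou_nc = q[0, 0] / (q[:, 0].sum() + q[0, :].sum() - q[0, 0])    # Eq. (17)
    iou_c = q[1:, 1:].sum() / (total - q[0, 0])                     # Eq. (18)
    miou = (iou_nc + iou_c) / 2                                     # Eq. (16)
    qh = q.copy(); qh[0, 0] = 0                                     # exclude no-change pixels
    rho = np.trace(qh) / qh.sum()                                   # Eq. (19)
    tau = (qh.sum(1) * qh.sum(0)).sum() / qh.sum() ** 2             # Eq. (20)
    kappa = (rho - tau) / (1 - tau)                                 # Eq. (21)
    sek = np.exp(iou_c - 1) * kappa                                 # Eq. (22)
    score = 0.3 * miou + 0.7 * sek                                  # Eq. (23)
    diag_c = np.trace(q) - q[0, 0]
    p_scd = diag_c / q[1:, :].sum()                                 # Eq. (24)
    r_scd = diag_c / q[:, 1:].sum()                                 # Eq. (25)
    f1_scd = 2 * p_scd * r_scd / (p_scd + r_scd)                    # Eq. (26)
    recall = q[1:, 1:].sum() / q[:, 1:].sum()                       # Eq. (27)
    precision = q[1:, 1:].sum() / q[1:, :].sum()                    # Eq. (28)
    f1 = 2 * recall * precision / (recall + precision)              # Eq. (29)
    return dict(OA=oa, mIoU=miou, SeK=sek, Score=score, F1scd=f1_scd, F1=f1, Kappa=kappa)

print(scd_metrics(np.random.randint(1, 100, (7, 7))))
```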
EXPERIMENTAL SETTINGS
All experiments in this paper were conducted on an NVIDIA GPU (GeForce RTX 4060 Ti) with 16 GB of memory. In all experiments, we used the same parameter configuration: a batch size of 4, 80 training epochs, and an initial learning rate of 0.1.
Additionally, we employed a stochastic gradient descent (SGD) optimizer to iteratively update the model parameters and minimize the loss function.
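A minimal training-loop sketch matching these settings follows; the model, data, and loss here are dummies standing in for STGNet, the SCD datasets, and the losses above, and momentum, weight decay, and any learning-rate schedule are not specified in the text and are omitted.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)          # dummy stand-in for STGNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # SGD, initial lr 0.1
dummy_batches = [(torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64))] * 2  # batch size 4
for epoch in range(80):                               # 80 epochs as stated
    for x, y in dummy_batches:
        optimizer.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss.backward()
        optimizer.step()
```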
EXPERIMENTAL COMPARISON AND ANALYSIS
COMPARISON METHODS
To better compare the performance of our method in identifying change areas and land cover types in the SCD task, we compared it with six existing
state-of-the-art methods: HRSCD-str334, HRSCD-str434, BiSRNet36, SCanNet46, HGINet38, and STSP-Net47. To ensure a fair comparison, we did not use pre-trained weights in our training. * 1.
_HRSCD-str3_: By integrating multi-scale features and constructing a network with BCD branches that incorporate time-dependent information, land cover change detection is achieved. * 2.
_HRSCD-str4_: As an upgraded version of the HRSCD-str3 series, HRSCD-str4 performs skip connections while maintaining high-resolution semantic change detection capabilities. It connects the
twin encoder to the decoder of the CD branch to enhance the recognition of complex changing scenes. * 3. _BiSRNet_: It is a dual-temporal semantic inference network. This method enhances the
ability to identify change areas and land cover types by introducing cross-temporal SR (Cot SR) blocks to model temporal correlations. * 4. _SCanNet_: It is a multi-task-based semantic
change detection network. This method constructs a semantic change converter (SCanFormer) to explicitly model the "from-to" semantic transformation between dual temporal RSIs,
achieving effective recognition of change areas and land cover types in remote sensing images. * 5. _HGINet_: It is a semantic change detection network based on hierarchical semantic graph
interaction. It accurately identifies change areas and land cover types by modeling dual-temporal correlations and using graph learning to represent the interaction of different feature
layers. * 6. _STSP-Net_: It is a spatiotemporal semantic perception network that effectively captures spatiotemporal information in images by introducing spatiotemporal attention mechanisms,
thereby achieving recognition of changing areas and land cover types.
QUANTITATIVE AND QUALITATIVE ANALYSIS
EXPERIMENTAL RESULTS ON THE SECOND DATASET
Table 1 shows the quantitative
comparison results of our method with other methods on the SECOND dataset. Compared to the other methods, the proposed STGNet performs significantly better, ranking first in all evaluation
metrics. Specifically, the mIoU reached 72.83%, Sek was 22.45%, F1scd was 61.83%, and OA was 87.51%. Among these methods, HRSCD-str3 is a network that directly operates on dual-temporal
images for binary change detection. It lacks the connection between the BCD and SS sub-tasks, resulting in the lowest mIoU value on the high-resolution SECOND dataset, with a score of only
66.85%. In contrast, other methods that extract features through an encoder and then perform the BCD task (HRSCD-str4, BiSRNet, SCanNet, HGINet, STSP-Net) have achieved strong results across
various metrics. This demonstrates the importance of feature extraction before change detection in SCD tasks. The SDPNet extractor and BIDS module we proposed are designed to better extract
features from high-resolution remote sensing images and to deeply fuse dual-temporal image features through CTIM. In the ablation study (section "Ablation experiment"), we provide a detailed analysis of their specific advantages. Accordingly, for STSP-Net, which also focuses on the feature extraction stage, its mIoU
reaches second place at 72.31%. To visually demonstrate the advantages of our method on the SECOND dataset, as shown in Fig. 11, we selected 3 pairs of dual-temporal remote sensing images
for qualitative analysis, with detailed areas highlighted using red rectangles. It is clearly evident that, compared to other methods, HRSCD-str3 loses a significant amount of change
information in the results. Additionally, by observing the sixth and seventh rows in Fig. 11, we can see that BiSRNet, SCanNet, HGINet, STSP-Net, and our method, by focusing on the
interaction between features from different time periods, can better capture the changes between "non-vegetated surface" and "low vegetation." Among these, BiSRNet,
STSP-Net, and our method perform best in capturing change areas, accurately detecting the changed regions. However, due to BiSRNet not emphasizing semantic feature extraction sufficiently
during the feature extraction phase, errors in change category recognition appear in the final results, such as part of the "non-vegetated surface" incorrectly being identified as
“low vegetation” instead of "tree." Therefore, compared to other methods, the approach we propose demonstrates a significant advantage in identifying highly dense change areas, not
only accurately recognizing change regions and categories but also preserving clearer boundary information.
EXPERIMENTAL RESULTS ON THE LANDSAT-SCD DATASET
We also conducted both
quantitative and qualitative analyses on the lower-resolution Landsat-SCD dataset. As shown in Table 2, our method demonstrates clearer advantages for this low-resolution dataset, with all
metrics significantly outperforming those of other methods. Among the other methods, only SCanNet, which also uses the Landsat-SCD dataset, and STSP-Net, which employs a dual-branch feature
extraction approach, performed well. This further confirms that using a dual-branch feature extraction strategy can more comprehensively capture detailed information. Compared to the
second-best method, STSP-Net, our method improves F1scd, OA, Sek, mIoU, F1, Kappa, and Score by 6.92%, 2.45%, 17.68%, 6.16%, 5.89%, 4.31%, and 14.23%, respectively. The significant
improvement in the Sek metric is attributed to our method not only enhancing semantic feature acquisition for different time periods but also employing a cross-learning approach that
constrains the BCD and SS tasks along the time dimension, ensuring consistency. To more intuitively demonstrate the performance of our method on low-resolution imagery, we provide partial
visual results in Fig. 12. By comparison, it can be observed that other methods tend to lose many key boundary detail features when identifying change areas in low-resolution images. Among
them, HGINet shows the most significant issue. For example, in the third pair of images in Fig. 12, HGINet fails to effectively distinguish the boundaries of subtle change areas, resulting
in a block-like structure. Additionally, it misidentifies change categories: for example, a region whose true transition is “farmland” to “desert” is instead misclassified as “farmland” to “building.” This is primarily because HGINet is a lightweight semantic change detection model with limited robustness that cannot capture subtle
change information. In contrast, our method not only effectively identifies change regions but also accurately recognizes the change categories.
ANALYSIS OF MODEL ROBUSTNESS AND COMPUTATIONAL COMPLEXITY
To further verify the robustness of the proposed method, this study applied data degradation simulations such as noise, occlusion, and spectral distortion to the
SECOND and Landsat-SCD test datasets. Specifically, we added noise, simulated occlusion, and applied spectral distortion processing to the test data, each with a probability of 0.5. These
simulation operations effectively replicate potential data quality degradation in real-world environments, aiming to assess the model’s stability and performance when faced with complex and
imperfect data.
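A sketch of this degradation protocol is given below; the noise level, occlusion-patch size, and spectral-gain range are illustrative assumptions, as the text specifies only the three perturbation types and the 0.5 application probability.

```python
import torch

def degrade(img, p=0.5):
    """Apply Gaussian noise, a random occlusion patch, and per-channel
    spectral distortion, each independently with probability p."""
    img = img.clone()
    _, _, h, w = img.shape
    if torch.rand(1) < p:                                   # additive Gaussian noise
        img = img + 0.05 * torch.randn_like(img)
    if torch.rand(1) < p:                                   # square occlusion patch
        top = torch.randint(0, h // 2, (1,)).item()
        left = torch.randint(0, w // 2, (1,)).item()
        img[:, :, top:top + h // 4, left:left + w // 4] = 0
    if torch.rand(1) < p:                                   # per-channel spectral gain
        img = img * (1 + 0.1 * torch.randn(1, img.shape[1], 1, 1))
    return img.clamp(0, 1)

noisy = degrade(torch.rand(1, 3, 256, 256))
```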
Figure 13 shows radar charts of five key evaluation metrics, clearly presenting the detection performance of different methods under these simulated conditions. The experimental results indicate that other methods exhibit significant instability under these conditions. For example, on the SECOND dataset, although SCanNet maintained nearly optimal
detection performance in the noise environment, its performance was lower than that of HRSCD-str4 and STSP-Net in the “occlusion” environment. In contrast, our method maintained a leading
position in all test environments, demonstrating stronger robustness and further proving its superiority in practical applications. Meanwhile, we conducted experiments on the Hi-UCD min
dataset, which contains more detailed surface information. Specifically, as shown in Table 3, compared to the second-best methods, our approach still maintained a leading position across
various evaluation metrics, particularly with a significant improvement of 9.72% in F1scd and 5% in Sek. Additionally, we conducted a comprehensive performance analysis of the model on the
Hi-UCD min dataset, considering aspects such as parameter count, computational cost, and inference time. As shown in Table 3, although network structures like HRSCD-str3, HRSCD-str4, and
HGINet maintained relatively low computational cost due to their simple design, their performance on the SCD task was less satisfactory. In contrast, compared to SCanNet and STSP-Net, which
also focus on feature extraction, our method demonstrated superior performance in terms of parameter count, computational cost, and inference speed. This not only validates the good balance
between efficiency and performance in our model but also further highlights its superiority and practicality in complex surface-information processing tasks.
PARAMETER DISCUSSION
The two key hyperparameters in the binary change loss function \({L}_{c}\), the weight of the change region (\({W}_{c}\)) and the weight of the unchanged region (\({W}_{nc}\)), play a crucial role in addressing the class imbalance
issue in the BCD task. They ensure that the model effectively detects the change regions while also accurately identifying the unchanged regions. To explore the optimal configuration of
these two hyperparameters, we conducted several comparison experiments with different hyperparameter combinations on the Landsat-SCD and SECOND datasets. As shown in Table 4, the
experimental results clearly demonstrate the impact of different hyperparameter settings on model performance. Since the change regions in images are relatively small in real-world
scenarios, when Wc is greater than 0.5 and Wnc is less than 0.5, the model’s performance metrics show significant degradation. To further improve the model’s performance, we adopted a
strategy of gradually decreasing Wc and correspondingly increasing Wnc. Ultimately, when Wc was set to 0.25 and Wnc was set to 0.75, the model successfully achieved a good balance between
the proportions of changed and unchanged regions, thereby ensuring accurate change detection while significantly improving the identification accuracy of unchanged regions.
ABLATION EXPERIMENT
To further validate the effectiveness of the various methods proposed in this paper, we conducted comprehensive ablation experiments on three public datasets, covering the
following key aspects: the selection of the backbone network, the optimization of the dual-branch feature extraction path, the enhancement effect of the BiDS module on feature extraction in
the dual-branch structure, the balancing effect of the CTIM module on SS and CD tasks in remote sensing semantics, and the computational cost of the model. As shown in Table 5, we used the
SSCDL42 architecture proposed by Ding et al. as the base model. It can be seen that, with only a small increase in parameters and reduced computational complexity, SSCDL using ResNet50
significantly outperforms the model using ResNet34 across various indicators. This is because ResNet50 has a deeper network structure and stronger feature extraction capabilities, enabling
it to learn more complex and richer feature representations, making it better suited for processing complex and diverse image data. In addition, by incorporating the Detail-Aware Path (DAP),
we observed further improvements across various metrics. Compared to the baseline network based on ResNet50, the Sek metric increased by 0.29%, 12.23%, and 0.1% on the SECOND dataset, the
Landsat-SCD dataset, and the Hi-UCD min dataset, respectively. This clearly demonstrates that introducing the spatial detail branch can effectively enhance the model’s performance and
improve its ability to capture spatial detail information. BiDS plays a crucial bidirectional guidance role in the dual-branch structure. The introduction of BiDS enables the network to
better focus on the precise extraction of semantic and change information in remote sensing images. This improvement is reflected in the mIoU and Sek evaluation metrics across the three
public datasets, with increases of 0.6% and 1.16%, 1.3% and 3.91%, and 0.34% and 0.45%, respectively. With the addition of CTIM, our proposed method achieved optimal results across all
metrics on these three datasets, further validating that CTIM can deeply explore the intrinsic relationships and differences between images from different time periods, thus aiding in the
identification of unchanged regions in change detection tasks. Compared to the baseline network using ResNet34, the optimal results showed significant improvements across all metrics.
Specifically, on the SECOND dataset, mIoU increased by 1.57%, Sek by 4.14%, F1scd by 4.28%, and OA by 1.33%; on the Landsat-SCD dataset, mIoU increased by 7.03%, Sek by 19.81%, F1scd by
7.8%, and OA by 2.74%; on the Hi-UCD min dataset, mIoU increased by 1.15%, Sek by 1.45%, F1scd by 1.48%, and OA by 2.22%. It is worth noting that the addition of BiDS and CTIM only increased
the parameter count by 1% and 0.03%, respectively, indicating that with the introduction of only a small number of additional parameters, we achieved better performance in change detection
and land category recognition. To more intuitively verify the effectiveness of the proposed method, and given that the accuracy improvement is most pronounced on the Landsat-SCD dataset, we selected a pair of typical remote sensing images from that dataset for a visual analysis of the ablation experiments. As shown
in Fig. 14, we specifically selected key areas from the experimental results and enlarged them for easier observation of the details. The experimental results show that when ResNet50 is used
as the backbone network, its excellent feature extraction ability ensures that key features in the image are fully preserved. With the introduction of the Detail-Aware Path (DAP) and BiDS
modules, the network’s ability to capture subtle changes in regions and their edge information has been significantly improved. However, in the semantic change detection results, there were
still cases where areas with identical semantic categories at both dates were incorrectly identified as changed. To overcome this challenge, we introduced the CTIM module, which can
accurately capture the intrinsic relationships and differences between images from different time periods, thereby ensuring the result consistency between SS tasks and CD tasks. The final
experimental results show that the addition of the CTIM module greatly improves the accuracy of semantic change detection and successfully solves the problem of misjudging the semantic
categories in the change areas. This fully demonstrates the effectiveness of the proposed Cross-Temporal Refinement Interaction Module (CTIM).
CONCLUSIONS
In Semantic Change
Detection (SCD) tasks, accurate identification of change regions and types is crucial. In this study, we delve into several common challenges present in existing SCD tasks and, based on this
analysis, propose a network that guides multitask semantic change detection through spatiotemporal semantic interaction (STGNet). This network adopts a dual-path feature extractor based on
a Siamese structure, significantly enhancing its ability to extract complex feature sets from remote sensing images by cleverly incorporating spatial detail information. Building on this, we
further design an innovative bidirectional guidance module (BiDS), which establishes an effective connection between spatial detail information and high-level semantics. This strengthens
the feature extractor’s ability to capture key information, enriching the representation of deep semantic features and low-level spatial information. Additionally, to fully leverage the
temporal correlation between dual-temporal images (i.e., images from different time periods), we carefully design the Cross-Temporal Refinement Interaction Module (CTIM). This module deeply
explores the inherent connections and subtle differences between dual-temporal images, helping to more accurately identify unchanged areas in change detection, thereby improving the overall
accuracy of change detection. To comprehensively evaluate the performance of the proposed network, we conduct detailed experiments on three publicly available authoritative datasets. The
experimental results show that, compared with the most advanced change detection methods, the proposed STGNet achieves the highest accuracy with the introduction of a small number of
parameters, further verifying its effectiveness in processing complex remote sensing image change detection tasks. In future work, we will continue to focus on implementing semi-supervised
semantic change detection to further reduce excessive dependence on datasets, lower the cost of manual annotation, and enhance the model’s generalization ability with limited annotated data.
Specifically, we will explore effective semi-supervised learning algorithms by combining a large amount of unlabeled data with a small amount of high-quality labeled data. These algorithms
will aim to accurately capture subtle semantic changes while maintaining the efficiency and practicality of the model, providing stronger technical support for semantic change detection
tasks in remote sensing images.
DATA AVAILABILITY
The SECOND dataset used in this study is publicly available at https://captain-whu.github.io/SCD/, accessed on 18 October 2023. The Landsat-SCD dataset can be accessed at https://doi.org/10.6084/m9.figshare.19946135.v1, accessed on 18 October 2023. The Hi-UCD min dataset used in this study is publicly
REFERENCES
1. Song, X.-P. et al. Global land change from 1982 to 2016. _Nature_ 560(7720), 639–643 (2018).
2. Huang, X., Schneider, A. & Friedl, M. A. Mapping sub-pixel urban expansion in China using MODIS and DMSP/OLS nighttime lights. _Remote Sens. Environ._ 175, 92–108 (2016).
3. Lunetta, R. S. et al. Land-cover change detection using multi-temporal MODIS NDVI data. _Remote Sens. Environ._ 105(2), 142–154 (2006).
4. Huang, X. et al. Multi-level monitoring of subtle urban changes for the megacities of China using high-resolution multi-view satellite imagery. _Remote Sens. Environ._ 196, 56–75 (2017).
5. Bovolo, F. & Bruzzone, L. The time variable in data fusion: A change detection perspective. _IEEE Geosci. Remote Sens. Mag._ 3(3), 8–26 (2015).
6. Jin, S. et al. A land cover change detection and classification protocol for updating Alaska NLCD 2001 to 2011. _Remote Sens. Environ._ 195, 44–55 (2017).
7. Zhu, Q. et al. A review of multi-class change detection for satellite remote sensing imagery. _Geo-spatial Inf. Sci._ 27(1), 1–15 (2024).
8. Sakurada, K., Shibuya, M. & Wang, W. Weakly supervised silhouette-based semantic scene change detection. In _2020 IEEE International Conference on Robotics and Automation (ICRA)_ (IEEE, 2020).
9. Ru, L., Du, B. & Wu, C. Multi-temporal scene classification and scene change detection with correlation based fusion. _IEEE Trans. Image Process._ 30, 1382–1394 (2020).
10. Zheng, Z. et al. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. _Remote Sens. Environ._ 265, 112636 (2021).
11. El Amin, A. M., Liu, Q. & Wang, Y. Zoom out CNNs features for optical remote sensing change detection. In _2017 2nd International Conference on Image, Vision and Computing (ICIVC)_ (IEEE, 2017).
12. Tang, Y. et al. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. _Precision Agric._ 24(4), 1183–1219 (2023).
13. Shi, W. et al. Change detection based on artificial intelligence: State-of-the-art and challenges. _Remote Sens._ 12(10), 1688 (2020).
14. Wang, J. et al. Object detection based on adaptive feature-aware method in optical remote sensing images. _Remote Sens._ 14(15), 3616 (2022).
15. Dong, X. et al. Attention-based multi-level feature fusion for object detection in remote sensing images. _Remote Sens._ 14(15), 3735 (2022).
16. Dong, H. et al. Enhanced lightweight end-to-end semantic segmentation for high-resolution remote sensing images. _IEEE Access_ 10, 70947–70954 (2022).
17. Xiong, J. et al. CSRNet: Cascaded selective resolution network for real-time semantic segmentation. _Expert Syst. Appl._ 211, 118537 (2023).
18. Zhang, H. et al. ESCNet: An end-to-end superpixel-enhanced change detection network for very-high-resolution remote sensing images. _IEEE Trans. Neural Netw. Learn. Syst._ 34(1), 28–42 (2021).
19. Han, C. et al. Change guiding network: Incorporating change prior to guide change detection in remote sensing imagery. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ (2023).
20. Han, C. et al. HANet: A hierarchical attention network for change detection with bitemporal very-high-resolution remote sensing images. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ 16, 3867–3878 (2023).
21. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. Preprint at https://arxiv.org/abs/1609.02907 (2016).
22. Vaswani, A. et al. Attention is all you need. Preprint at https://arxiv.org/abs/1706.03762 (2017).
23. Ning, X. et al. Multi-stage progressive change detection on high resolution remote sensing imagery. _ISPRS J. Photogramm. Remote Sens._ 207, 231–244 (2024).
24. Zhou, M., Qian, W. & Ren, K. Multistage interaction network for remote sensing change detection. _Remote Sens._ 16(6), 1077 (2024).
25. Cai, Y. et al. CSANet: A channel-spatial attention network for remote sensing image change detection. _Int. J. Remote Sens._ 44(19), 5936–5959 (2023).
26. Larabi, M. E. A. et al. High-resolution optical remote sensing imagery change detection through deep transfer learning. _J. Appl. Remote Sens._ 13(4), 046512 (2019).
27. Ling, J. et al. IRA-MRSNet: A network model for change detection in high-resolution remote sensing images. _Remote Sens._ 14, 5598 (2022).
28. Peng, X. et al. Optical remote sensing image change detection based on attention mechanism and image difference. _IEEE Trans. Geosci. Remote Sens._ (2020).
29. Chen, J. et al. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ (2021).
30. Daudt, R. C., Le Saux, B. & Boulch, A. Fully convolutional Siamese networks for change detection. In _2018 25th IEEE International Conference on Image Processing (ICIP)_ (IEEE, 2018).
31. Zhou, Z. et al. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. _IEEE Trans. Med. Imaging_ 39(6), 1856–1867 (2019).
32. Liu, R. et al. Deep depthwise separable convolutional network for change detection in optical aerial images. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ 13, 1109–1118 (2020).
33. Daudt, R. C., Le Saux, B. & Boulch, A. Fully convolutional Siamese networks for change detection. In _2018 25th IEEE International Conference on Image Processing (ICIP)_ (IEEE, 2018).
34. Daudt, R. C. et al. Multitask learning for large-scale semantic change detection. _Comput. Vis. Image Underst._ 187, 102783 (2019).
35. Chen, P. et al. FCCDN: Feature constraint network for VHR image change detection. _ISPRS J. Photogramm. Remote Sens._ 187, 101–119 (2022).
36. Ding, L. et al. Bi-temporal semantic reasoning for the semantic change detection in HR remote sensing images. _IEEE Trans. Geosci. Remote Sens._ 60, 1–14 (2022).
37. Zheng, Z. et al. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. _ISPRS J. Photogramm. Remote Sens._ 183, 228–239 (2022).
38. Long, J. et al. Semantic change detection using a hierarchical semantic graph interaction network from high-resolution remote sensing images. _ISPRS J. Photogramm. Remote Sens._ 211, 318–335 (2024).
39. Li, S. et al. AGSPNet: A framework for parcel-scale crop fine-grained semantic change detection from UAV high-resolution imagery with agricultural geographic scene constraints. (2024).
40. Xu, J., Xiong, Z. & Bhattacharyya, S. P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_ (2023).
41. Dai, Y. et al. Attentional feature fusion. In _Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision_ (2021).
42. Ding, L. et al. Bi-temporal semantic reasoning for the semantic change detection of HR remote sensing images. (2021).
43. Yang, K. et al. Asymmetric Siamese networks for semantic change detection. (2020).
44. Yuan, P. et al. A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images. _Int. J. Digital Earth_ 15, 1506–1525 (2022).
45. Tian, S. et al. Hi-UCD: A large-scale dataset for urban semantic change detection in remote sensing imagery. (2020).
46. Ding, L. et al. Joint spatio-temporal modeling for semantic change detection in remote sensing images. _IEEE Trans. Geosci. Remote Sens._ (2024).
47. He, Y. et al. Spatial-temporal semantic perception network for remote sensing image semantic change detection. _Remote Sens._ 15(16), 4095 (2023).

ACKNOWLEDGEMENTS
This work was
supported in part by the Science and Technology Plan Project of Sichuan Province (No. 2023YFS0371), the Sichuan Key Provincial Research Base of Intelligent Tourism (No. ZHZJ24-01), and the Innovation Fund for Research on Complex Scene Landscape and Grassland Change Detection Based on Deep Learning (Y2024116).

AUTHOR INFORMATION
AUTHORS AND AFFILIATIONS
* Sichuan University of Science and Engineering, Yibin, 644000, China: Yinqing Wang, Liangjun Zhao & Yuanyang Zhang
* Sichuan Key Provincial Research Base of Intelligent Tourism, Yibin, 644000, China: Liangjun Zhao
* School of Tropical Agriculture and Forestry, Hainan University, Haikou, 570228, Hainan Province, China: Yueming Hu
* School of Information and Communication Engineering, Hainan University, Haikou, 570100, Hainan Province, China: Hui Dai
* Changsha City Planning Information Service Center, Changsha, 410006, Hunan Province, China: Hui Dai

CONTRIBUTIONS
Yinqing Wang was responsible for the design of research methods and
experimental procedures, supported manuscript writing and revision, supplied experimental equipment, and provided both technical and financial support. Liangjun Zhao participated in the research design, data collection, data processing, and analysis, provided financial support, and was involved in the implementation of the research methods. Yueming Hu and Hui Dai participated in data collection, data processing, and analysis. Yuanyang Zhang was involved in the implementation of the research methods and contributed to the revision of the manuscript. All authors reviewed the manuscript.

CORRESPONDING AUTHOR
Correspondence to Liangjun Zhao.

ETHICS DECLARATIONS
COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
PUBLISHER’S NOTE
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

RIGHTS AND PERMISSIONS
OPEN ACCESS
This
article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction
in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the
licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

ABOUT THIS ARTICLE
CITE THIS ARTICLE
Wang, Y., Zhao, L., Hu, Y. et al. Multitask semantic change detection guided by spatiotemporal semantic interaction. _Sci. Rep._ 15, 16003 (2025). https://doi.org/10.1038/s41598-025-00750-8
* Received: 03 January 2025
* Accepted: 30 April 2025
* Published: 08 May 2025
* DOI: https://doi.org/10.1038/s41598-025-00750-8

KEYWORDS
* Remote sensing images
* Semantic change detection
* Spatial–temporal semantic
* Multi-task network
* Deep learning