Multitask semantic change detection guided by spatiotemporal semantic interaction
ABSTRACT
Semantic Change Detection (SCD) aims to accurately identify the change areas and their categories in dual-time images, which is more complex and challenging than traditional binary
change detection tasks. Accurately capturing the change information of land cover types is crucial for remote sensing image analysis and subsequent decision-making applications. However,
existing SCD methods often neglect the spatial details and temporal dependencies of dual-time images, leading to problems such as change category imbalance and limited detection accuracy,
especially in capturing small target changes. To address this issue, this study proposes a network that guides multitask semantic change detection through spatiotemporal semantic interaction
(STGNet). STGNet enhances the ability to capture spatial details by introducing a Detail-Aware Path (DAP) and designs a Bidirectional Guidance Module for Spatial Detail and Semantic
Information for adaptive feature selection, improving feature extraction capabilities in complex scenes. Furthermore, to resolve the inconsistency between semantic information and change
areas, this paper designs a Cross-Temporal Refinement Interaction Module (CTIM), which enables cross-time scale feature fusion and interaction, constraining the consistency of detection
results and improving the recognition accuracy of unchanged areas. To further enhance detection performance, a dynamic depthwise separable convolution is designed in the CTIM module, which
can adaptively adjust convolution kernels to more precisely capture change features in different regions of the image. Experimental results on three SCD datasets show that the proposed
method outperforms other existing methods in various evaluation metrics. In particular, on the Landsat-SCD dataset, the F1 score (F1scd) reaches 91.64%, and the separation Kappa coefficient
improves by 17.68%. These experimental results fully demonstrate the significant advantages of STGNet in improving semantic change detection accuracy, robustness, and generalization
capability.
INTRODUCTION
Change detection refers to the identification of surface changes by comparing remote sensing images from different time periods1,2, and it is widely applied in fields such
as environmental monitoring, urban planning, disaster warning, emergency response, and resource management3,4,5,6. Common change detection methods can be divided into two categories: Binary
Change Detection (BCD) and Semantic Change Detection (SCD). BCD primarily focuses on whether surface changes have occurred, distinguishing between “change” and “no change” states, but it
fails to identify the specific types of changes. To address the limitations of BCD in change type identification, SCD was introduced7,8. SCD effectively identifies specific changes in land
use types, such as forest to farmland or water body to built-up land. This comprehensive change type analysis provides richer and more specific change information, helping to better
understand the underlying causes of changes and offering more comprehensive support for land use planning9,10. However, traditional change detection methods rely on manual visual
interpretation or simple image processing techniques, which are not only time-consuming and labor-intensive but also easily influenced by subjective factors, making it difficult to meet the
processing requirements of large-scale, high-resolution remote sensing data11,12,13. As a result, with the rapid development of deep learning technologies, particularly the successful
applications in fields such as object detection14,15 and semantic segmentation16,17, remote sensing change detection based on convolutional neural networks (CNNs) has also made significant
progress. CNNs can handle complex, high-dimensional high-resolution remote sensing images and accurately identify changes in land cover, land use, and other aspects through powerful feature
extraction capabilities18,19,20. Additionally, in recent years, the introduction of advanced models such as graph convolutional networks (GCNs) and Transformers has further driven the
development of change detection technology. GCNs21 can effectively capture spatial and structural relationships in remote sensing images, improving the ability to identify change areas. The
Transformer model22, through its self-attention mechanism, excels in capturing long-range dependencies and local features, which is particularly important for detecting subtle surface
changes. Therefore, these advanced deep learning models have not only improved the accuracy and efficiency of change detection but also provided more flexible and powerful technical support
for handling more complex and dynamic remote sensing data. In deep learning-driven change detection technology, significant progress has been made in the research of BCD, especially in the
identification of change areas23,24,25,26. For example, Ling et al.27 proposed an innovative deep learning architecture, IRA-MRSNet, which combines multiscale residual twin networks with
integrated residual attention. This architecture efficiently captures multiscale features, accurately refines the edges of changing regions, and fuses global semantic information to achieve
accurate localization. On the other hand, Peng et al.28 proposed a difference-enhanced dense-attention convolutional neural network (DDCNN), an end-to-end change detection method that
improves the accuracy of change detection in dual-temporal remote sensing images by introducing a dense-attention mechanism and difference-enhancing units. Chen et al.29 designed a
dual-attention fully convolutional twin network (DASNet), which, through the dual-attention mechanism and weighted double margin contrast loss, captures more discriminative features and
enhances the robustness of change detection for high-resolution satellite images. However, the above-mentioned studies still focus on the simpler BCD task, while deep
learning-based SCD research is still in its development stage. In existing research, three commonly used architectures for SCD include single-branch, dual-branch, and multi-task
architectures. The first structure is the single-branch architecture (Fig. 1a), which merges dual-time images (e.g., concatenation) as input to identify the category differences between
images from different time points and output the results30,31,32. However, this design increases the complexity of category learning in the encoder, requiring the model to have stronger
capabilities to handle more categories in segmentation tasks. Therefore, in the second dual-branch structure (Fig. 1b), a shared-weight encoder processes the image categories of different
time periods separately and then performs change detection, such as FC-Siam-conc, FC-Siam-diff33, etc. This architecture reduces the classification difficulty seen in Architecture 1, but
during the encoder phase, the SCD task is still treated as a single task, and the differences between change detection (CD) and semantic segmentation (SS) tasks often make it difficult for
the model to balance both, thus limiting its potential for each task. As a result, in the third multi-task architecture (Fig. 1c, d), the SCD task is decomposed into sub-tasks. In Fig. 1c,
the CD and SS tasks are handled as two independent sub-tasks, and the final result is obtained by masking the change areas with land cover types. A typical example is the HRSCD-str3 proposed
by Daudt et al.34, which detects land cover type changes through a BCD branch. However, such methods lack deep feature interaction between the two tasks in their handling of different
tasks, and the SCD results are easily influenced by the single-branch results. Therefore, in Fig. 1d, the input for BCD is changed to the dual-time semantic information extracted from the
semantic encoder for change detection, effectively integrating the SS sub-task. Most existing SCD research adopts the structure in Fig. 1d. For example, Chen et al.35 proposed a
feature-constrained change detection network (FCCDN) that applies feature constraints in semantic feature extraction and feature fusion, using these features for the BCD task. Experiments
show that the SS branch effectively improves the accuracy of the BCD task. Ding et al.36 further validated this architecture in their proposed BiSRNet, introducing global self-attention (SR)
and cross-time self-attention (Cot-SR) to enhance information interaction between images from different times, improving consistency between the BCD and SS results. Zhang et al.37 proposed
a multi-task architecture called ChangeMask, which decouples SCD into SS and BCD, then learns the change representation from semantic representations through the Transformer module. Jiang et
al.38 proposed a semantic change detection network based on hierarchical semantic graph interaction, which models dual-time correlations and uses graph learning to represent interactions
across different feature layers to accurately identify change areas and land cover types. Wang et al.39 proposed an agricultural geographic scene and plot-scale constrained semantic change
detection framework (AGSPNet), which optimizes plot extraction using multi-source geographic data products and a bidirectional cascading network (BDCN), combined with a cross-attention
network (CCNet) to extract semantic and change features of crops. Although many of the above methods adopt the multi-task framework shown in Fig. 1d as the main framework for SCD tasks, the key parts of this architecture mainly concern semantic feature extraction, feature interaction between different time periods, and balancing the SS and BCD tasks.
These three aspects are decisive factors for the accuracy of the SCD task. Below are the unresolved issues in these three aspects of the SCD task: * 1. _Loss of detailed information_: The
loss of detailed information leads to inaccurate detection of small targets and boundaries, as well as increased underreporting of minor changes (e.g., vegetation degradation) and false
alarms due to external factors (e.g., changes in appearance and illumination). * 2. _Lack of bi-temporal feature correlation_: Feature extraction for a single time period in bi-temporal
images lacks the ability to capture change information across time and does not account for the consistent correlation between bi-temporal features, which often leads to inaccurate
predictions of unchanged areas. * 3. _Imbalance problem between SS and BCD tasks_: In the final SCD results, there is often a situation where the detected changed regions in the BCD task are
inconsistent with the semantic regions in the SS task, leading to contradictory results. To address the shortcomings of the above SCD tasks in the field of remote sensing, this study
proposes a network that guides multi-task semantic change detection through spatiotemporal semantic interaction (STGNet). This network aims to improve the loss of edge information in change
areas and the imbalance between BCD tasks and semantic categories in SCD tasks. The main contributions of this article are as follows: * In the critical stage of feature extraction, we
introduced the Detail-Aware Path (DAP) and designed a Bidirectional Guidance Module for Spatial Detail and Semantic Information (BiDS). This module enhances the ability to extract detailed
features in the detail branch and deep semantic information in the context branch, thereby achieving comprehensive optimization of feature extraction and refining the edges of changing
regions. * We propose a Dynamic Depthwise Separable Convolution (DDConv) that can adaptively adjust the convolution kernels based on the features of the input data, thereby capturing the
change features in different regions of the image more precisely without increasing the computational burden. * We further propose a Cross-Temporal Refinement Interaction Module (CTIM),
which effectively enhances the information exchange capability between dual-temporal features. By fusing and interacting features across time scales, domain adaptation between dual-temporal
domains is achieved, capturing region change information based on the temporal dimension and improving the recognition accuracy of unchanged regions. * To verify the effectiveness and
robustness of STGNet, we conducted extensive experiments on three publicly available semantic change detection datasets and performed a comprehensive comparison with the state-of-the-art
methods. The experimental results show that our model has significant advantages in SCD tasks. The rest of this article is organized as follows: In section "The proposed method",
we provide a detailed explanation of the proposed methods; section "Datasets and experimental setup" describes the SCD datasets and experimental setup used in this article; section "Experimental comparison and analysis" presents the experimental results and provides a detailed analysis; and section "Conclusions" gives the conclusion and outlook.
THE PROPOSED METHOD
In this section, we provide a detailed introduction to the proposed SCD network for dual-temporal remote sensing images (i.e., STGNet), with the overall
structure shown in Fig. 2. The network adopts a multi-task learning architecture to separately handle semantic segmentation and change detection tasks, enabling comprehensive learning of
change regions and categories. The architecture is based on the Siamese network commonly used in BCD research, and constructs a Siamese-based Dual-Path Feature Extractor (SDPNet). This
feature extractor consists of two paths: the Detail-Aware Path (DAP) and the Context Path (CP). The CP path uses ResNet50 as the backbone network to extract deep semantic features. It
outputs resolutions at 1/2, 1/4, 1/8, 1/8, and 1/8 of the original resolution, progressively reducing spatial resolution to enhance the extraction of semantic information. The DAP path
compensates for the spatial detail information lost in the CP path, capturing fine spatial structures and edge information in remote sensing images. Its output resolutions are 1/2, 1/4, 1/4,
and 1/4 of the original resolution. It is important to note that during the dual-path extraction process, we designed a Bidirectional Guidance Module for Spatial Detail and Semantic
Information (BiDS) to facilitate the exchange of spatial detail information and deep semantic features between the two branches. This module enables each branch to selectively learn features
from the other, thereby enhancing the overall feature representation and refining the extraction of detailed and semantic information. To enhance the model’s ability to recognize unchanged regions, we propose a Cross-Temporal Interaction Module (CTIM). This module leverages cross-learning
principles to promote deep interactions and fusion of semantic information between dual-temporal images. By capturing region change information based on the temporal dimension, it helps
learn richer semantic features from single-time-period images. Additionally, we introduce an Attention Feature Fusion Module (AFF) to better integrate semantic and spatial feature
information, strengthening the spatial detail within the semantic features. Finally, we employ six residual modules to perform precise binary change detection on the tightly integrated
dual-temporal semantic features. These residual modules, leveraging their deep learning capabilities, effectively capture subtle differences between remote sensing images at two different
time points. The generated binary change detection map is then used to fine-tune the land cover classification results, thereby accurately obtaining the final results of SCD and achieving
high-precision detection of semantic changes in remote sensing images.
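To make the overall data flow concrete, the following is a minimal, runnable structural sketch of this pipeline in PyTorch. All module bodies are simplified stand-ins (plain convolution blocks in place of ResNet50, BiDS, CTIM, AFF, and the six residual blocks), and every name and channel width is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class STGNetSketch(nn.Module):
    def __init__(self, num_classes=7, ch=64):
        super().__init__()
        # Context Path stand-in: downsamples to 1/8 resolution for deep semantics.
        self.cp = nn.Sequential(conv_block(3, ch, 2), conv_block(ch, ch, 2), conv_block(ch, ch, 2))
        # Detail-Aware Path stand-in: stops at 1/4 resolution to keep spatial detail.
        self.dap = nn.Sequential(conv_block(3, ch, 2), conv_block(ch, ch, 2))
        self.fuse = conv_block(2 * ch, ch)            # stand-in for BiDS/AFF fusion
        self.ss_head = nn.Conv2d(ch, num_classes, 1)  # per-date land-cover logits
        self.bcd_head = nn.Conv2d(2 * ch, 1, 1)       # stand-in for the six residual blocks

    def encode(self, x):
        d = self.dap(x)                               # (B, ch, H/4, W/4) detail features
        s = self.cp(x)                                # (B, ch, H/8, W/8) semantic features
        s = F.interpolate(s, size=d.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([d, s], dim=1))

    def forward(self, t1, t2):                        # Siamese: both dates share weights
        z1, z2 = self.encode(t1), self.encode(t2)
        sem1, sem2 = self.ss_head(z1), self.ss_head(z2)
        change = self.bcd_head(torch.cat([z1, z2], dim=1))
        return sem1, sem2, change                     # change mask later selects the SCD map

sem1, sem2, change = STGNetSketch()(torch.randn(2, 3, 256, 256), torch.randn(2, 3, 256, 256))
```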
BIDIRECTIONAL GUIDANCE MODULE FOR SPATIAL DETAIL INFORMATION AND SEMANTICS (BIDS)
In SCD tasks, high-resolution remote sensing images have a wide imaging range, rich content, and high complexity. Traditional simple feature extraction strategies often struggle to comprehensively and effectively capture key
information in the images. Current methods typically combine local and global features during the single-branch feature extraction phase, attempting to capture both global context and local
details simultaneously. However, when dealing with complex backgrounds and subtle changes, these methods often fail to balance the extraction of both global context and local details,
leading to missed target areas or false positives. To address this issue, the dual-branch network architecture has been shown to effectively improve performance during feature extraction in
the encoder40,41. This architecture extracts context information and spatial detail information using different convolution layers, thereby capturing key change features in the image more
accurately. However, traditional dual-branch networks have a significant drawback: the two branches typically operate
independently, lacking mutual supplementation and collaboration. Specifically, while the context information branch excels at capturing global information, its ability to perceive subtle
local changes is weaker, especially when processing high-resolution remote sensing images, where the rich detailed information and complex contextual relationships in the image need to be
considered simultaneously. To address this issue, the BIDS module is proposed to optimize the feature extraction capability in the dual-branch network. The BIDS module draws inspiration from
the design concept in PIDNet40, enhancing the feature expression ability of each branch by guiding the feature exchange between branches. Specifically, the BIDS module allows each branch to
selectively learn and integrate feature information from the other branch, effectively combining global and local features. This cross-branch feature interaction not only improves the
model’s ability to extract detail and semantic features but also effectively overcomes the issue of detail information loss in traditional dual-branch structures. Compared to existing local-global feature extraction methods, the BIDS module, through its cross-branch feature fusion mechanism, enables each branch not only to focus on extracting global context or local details but
also selectively integrate feature information from the other branch, thereby enhancing the collaboration between branches. The detailed structure is shown in Fig. 3, where we define the
corresponding pixel vectors in the feature maps \(X\) and \(Y\) as \({\overrightarrow{v}}_{x}\) and \({\overrightarrow{v}}_{y}\), respectively, and perform dynamic convolution operations.
The process of dynamic convolution for \({\overrightarrow{v}}_{x}\) and \({\overrightarrow{v}}_{y}\) is shown below:
$$f\left(x\right)=g({\widetilde{W}}^{T}\left(x\right)x+\widetilde{b}(x))$$ (1) $$\widetilde{W}\left(x\right)=\sum_{k=1}^{K}{\pi }_{k}\left(x\right){\widetilde{W}}_{k},\quad \widetilde{b}\left(x\right)=\sum_{k=1}^{K}{\pi }_{k}\left(x\right){\widetilde{b}}_{k}$$ (2) where \({\pi }_{k}(x)\) \(\left(0\le {\pi }_{k}\left(x\right)\le 1,\ \sum_{k=1}^{K}{\pi }_{k}\left(x\right)=1\right)\) represents the attention weight of the \(k\)-th linear function \({\widetilde{W}}_{k}^{T}x+{\widetilde{b}}_{k}\), \(g\) is the activation function, \({\widetilde{W}}_{k}\) and \({\widetilde{b}}_{k}\) are the corresponding weight matrices and bias vectors, and \(x\) represents the input feature component \({\overrightarrow{v}}_{x}\) or \({\overrightarrow{v}}_{y}\).
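As a concrete reference for Eqs. (1)-(2), the sketch below implements dynamic convolution in PyTorch: K candidate kernels are mixed per input by softmax attention weights \({\pi }_{k}(x)\) (which satisfy the stated constraints), and the aggregated kernel is applied via a grouped convolution. The choice of K = 4, the attention head, and ReLU as \(g\) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """K parallel kernels W~_k, b~_k mixed by input-dependent weights pi_k(x)
    (softmax => 0 <= pi_k <= 1 and sum_k pi_k = 1), as in Eqs. (1)-(2)."""

    def __init__(self, cin, cout, k=3, K=4):
        super().__init__()
        self.cin, self.cout, self.k, self.K = cin, cout, k, K
        self.weight = nn.Parameter(0.02 * torch.randn(K, cout, cin, k, k))  # W~_k
        self.bias = nn.Parameter(torch.zeros(K, cout))                      # b~_k
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(cin, K))                        # pi_k logits from x

    def forward(self, x):
        B, _, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)                       # (B, K)
        w = torch.einsum("bk,kocij->bocij", pi, self.weight)      # per-sample kernel W~(x)
        b = torch.einsum("bk,ko->bo", pi, self.bias)              # per-sample bias b~(x)
        # A grouped conv applies each sample's own aggregated kernel in one call.
        out = F.conv2d(x.reshape(1, B * self.cin, H, W),
                       w.reshape(B * self.cout, self.cin, self.k, self.k),
                       b.reshape(B * self.cout), padding=self.k // 2, groups=B)
        return F.relu(out.reshape(B, self.cout, H, W))            # g(.) in Eq. (1), assumed ReLU

y = DynamicConv2d(64, 64)(torch.randn(2, 64, 32, 32))             # (2, 64, 32, 32)
```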
As shown in Fig. 4, dynamic convolution endows the convolution kernel with an attention mechanism. Due to the non-linear generation of \({\pi }_{k}\), dynamic convolution can adaptively
adjust the combination of convolution kernels for different inputs to focus on more critical features, resulting in a stronger feature representation capability when processing
high-resolution remote sensing images. Subsequently, we apply batch normalization to these features and perform element-wise multiplication and summation operations, effectively integrating
feature information from different branches. Finally, activation is performed using the Sigmoid function to obtain the probability \(\sigma\) that two pixels belong to the same object. If
\(\sigma\) is high, it is more likely to trust \({\overrightarrow{v}}_{x}\) from its own branch; otherwise, it is more likely to trust \({\overrightarrow{v}}_{y}\) from the other branch. The
detailed processing procedure is as follows: $$\sigma =Sig(Sum(BN\left(f\left({\overrightarrow{v}}_{x}\right)\right)\cdot BN(f({\overrightarrow{v}}_{y}))))$$ (3) $$Out=\sigma {\overrightarrow{v}}_{x}+(1-\sigma ){\overrightarrow{v}}_{y}$$ (4) where \(f\) denotes the dynamic convolution operation, \(BN\) denotes the batch normalization operation, \(Sum\) denotes the summation operation, and \(Sig\) denotes the sigmoid activation function.
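A compact sketch of the gate in Eqs. (3)-(4) follows. A plain 3 × 3 convolution stands in for the dynamic convolution \(f\) above, and the summation is taken over channels so that \(\sigma\) is a per-pixel probability; both choices are assumptions consistent with the description.

```python
import torch
import torch.nn as nn

class BiDSGate(nn.Module):
    """Per-pixel gate sigma choosing between the detail-branch feature X and
    the context-branch feature Y, following Eqs. (3)-(4)."""

    def __init__(self, ch):
        super().__init__()
        self.fx = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.fy = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x, y):
        # Eq. (3): sigma = Sig(Sum(BN(f(vx)) * BN(f(vy)))), summed over channels
        sigma = torch.sigmoid((self.fx(x) * self.fy(y)).sum(dim=1, keepdim=True))
        # Eq. (4): trust the branch's own feature where sigma is high, the other otherwise
        return sigma * x + (1.0 - sigma) * y

out = BiDSGate(64)(torch.randn(2, 64, 64, 64), torch.randn(2, 64, 64, 64))  # (2, 64, 64, 64)
```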
By using BIDS to achieve mutual complementation between the two branch paths, the detail-aware path can leverage contextual information to enhance its understanding of local details. Meanwhile, the contextual path benefits from the detail information to capture the global
structure more accurately. This bi-directional guidance between branches strengthens the extraction capability of each path, facilitating the extractor in obtaining richer feature maps
during the processing of high-resolution remote sensing images, which is crucial for subsequent SCD tasks.
CROSS-TEMPORAL INTERACTION MODULE (CTIM)
Change detection is based on dual-temporal images, yet features extracted from a single time period are limited: they lack the capability to capture change information along the temporal dimension.
Therefore, deeply exploring the intrinsic relationships and differences between dual-temporal images helps identify unchanged areas in change detection36. We employ a cross-learning strategy
to design a Cross-Temporal Interaction Module (CTIM) that facilitates feature fusion and interaction across time scales. CTIM takes the output of SDPNet as input, consisting of low-level
spatial detail features (DT1, DT2) and high-level semantic features (ST1, ST2) for the T1 and T2 time periods, respectively. The feature representations from the DAP and CP branches are
complementary. Using simple addition or concatenation to fuse them overlooks the diversity of the two types of information, which may degrade performance. Moreover, the information from
different time periods can vary significantly. Therefore, we designed a hybrid aggregation layer to combine information from different time periods, using the contextual information from the
CP branch of another time period to guide the feature responses of the detail branch. The overall structure is shown in Fig. 5. In this structure, we have designed a Dynamic Depthwise
Separable Convolution (DDConv), as shown in Fig. 6. Unlike traditional depthwise separable convolution, which uses fixed convolution kernels, the DDConv introduces a dynamic kernel mechanism
that can adaptively adjust the convolution kernels based on the features of the input data, allowing for more refined control over the convolution operation. This adaptive property enables
the DDConv to capture the change features in different regions of the image more precisely when processing complex remote sensing images. Subsequently, to promote deep interaction and fusion
of cross-temporal information, we adopt an interaction strategy. Taking the T1 time period as an example, we activate the high-level semantic features from the T2 period (ST2) using a
sigmoid function, then use them as weight factors to multiply the low-level detail features from the T1 period (DT1). This allows the high-level semantic features from T2 to guide the
spatial detail information in T1, thereby obtaining cross-temporal spatial detail information (\({F}_{T1}\)). The computation process is shown below: $${F}_{T1}=Sig({DDConv}_{3\times 3}(Up({S}_{T2})))\times {DDConv}_{3\times 3}\left({D}_{T1}\right)$$ (5) $${F}_{T2}=Sig({DDConv}_{3\times 3}(Up({S}_{T1})))\times {DDConv}_{3\times 3}({D}_{T2})$$ (6) where \({DDConv}_{3\times 3}\) refers to Dynamic Depthwise Separable Convolution with a kernel size of 3 × 3, \(Up\) denotes upsampling, and \(Sig\) denotes the sigmoid activation function.
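The sketch below illustrates one plausible reading of DDConv and the cross-temporal gating of Eqs. (5)-(6): K depthwise kernels mixed by input-dependent softmax weights, followed by a fixed pointwise convolution, with the upsampled semantics of the other date reweighting the current date's detail features. K, the gating head, and all sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DDConvSketch(nn.Module):
    """Dynamic depthwise separable conv (sketch): K depthwise kernels mixed by
    softmax attention per input, then a fixed 1x1 pointwise convolution."""

    def __init__(self, ch, k=3, K=4):
        super().__init__()
        self.ch, self.k, self.K = ch, k, K
        self.weight = nn.Parameter(0.02 * torch.randn(K, ch, 1, k, k))  # K depthwise kernels
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, K))
        self.pw = nn.Conv2d(ch, ch, 1)                                  # pointwise step

    def forward(self, x):
        B, C, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)                      # (B, K) adaptive mixing weights
        w = torch.einsum("bk,kcoij->bcoij", pi, self.weight)     # (B, C, 1, k, k)
        out = F.conv2d(x.reshape(1, B * C, H, W),
                       w.reshape(B * C, 1, self.k, self.k),
                       padding=self.k // 2, groups=B * C)        # per-sample depthwise conv
        return self.pw(out.reshape(B, C, H, W))

def ctim_gate(ddc_s, ddc_d, s_other, d_own):
    """Eqs. (5)-(6): sigmoid-activated semantics of the *other* date, upsampled
    to the detail resolution, reweight the detail features of the current date."""
    s_up = F.interpolate(s_other, size=d_own.shape[-2:], mode="bilinear", align_corners=False)
    return torch.sigmoid(ddc_s(s_up)) * ddc_d(d_own)

ddc_s, ddc_d = DDConvSketch(64), DDConvSketch(64)
f_t1 = ctim_gate(ddc_s, ddc_d, torch.randn(2, 64, 32, 32), torch.randn(2, 64, 64, 64))
```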
In addition, to better fuse spatial detail features and deep semantic features, we introduce the AFF module41. During the feature fusion phase, the AFF module does not use a simple linear fusion method.
Instead, it employs a dual-branch structure with different scales to separately extract channel attention weights, and dynamic feature fusion is then performed based on the weight
information. The specific structure is shown in Fig. 7. For the input low-level spatial detail information \(X\) and high-level semantic information \(Y\), we first use linear addition for
feature fusion to obtain the preliminary fused result \(F\). The calculation formula is as follows: $$F=X+Y$$ (7) Subsequently, to more accurately capture the important information in
feature \(F\), we use global average pooling and pointwise convolution to extract the global and local channel attention from \(F\), respectively. This approach, which employs different
branches for feature processing, helps us better identify the location and spatial information of the changed areas, thereby improving the model’s localization and feature expression
capabilities. Specifically, the global average pooling branch performs a global average pooling operation on \(F\) to obtain a global feature vector, which is then processed using pointwise
convolution to extract global channel attention weights. The pointwise convolution branch directly applies pointwise convolution to \(F\) to extract local channel attention weights. Finally,
the extracted attention weights are normalized to the range [0, 1] using the sigmoid function and multiplied by the corresponding features. In this way, each feature is assigned a weight
based on its importance. The weighted features are then summed to obtain the final fused result. This fusion method allows the model to dynamically adjust during the feature fusion process,
improving its ability to extract complex feature sets from high-resolution remote sensing images. The specific calculation formula is as follows: $$L\left(F\right)=B({PWConv}_{2}(\delta (B({PWConv}_{1}(F)))))$$ (8) $$G\left(F\right)=B({PWConv}_{2}(\delta (B({PWConv}_{1}(GAP(F))))))$$ (9) $$Z=Sig\left(L+G\right)\times X+\left(1-Sig\left(L+G\right)\right)\times Y$$ (10) where \(F\) represents the initial feature fusion result, \({PWConv}_{1}\) and \({PWConv}_{2}\) both denote 1 × 1 pointwise convolutions, \(B\) denotes the BatchNorm layer, \(\delta\) denotes the ReLU activation function, and \(GAP\) stands for Global Average Pooling.
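A minimal sketch of Eqs. (7)-(10) follows; the channel-reduction ratio r and the exact branch layout follow the common AFF design41 and are assumptions where the text is silent.

```python
import torch
import torch.nn as nn

class AFFSketch(nn.Module):
    """Local (pointwise) and global (GAP) channel attention over F = X + Y,
    then a sigmoid gate mixing X and Y, following Eqs. (7)-(10)."""

    def __init__(self, ch, r=4):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.BatchNorm2d(ch // r),
                                 nn.ReLU(inplace=True), nn.Conv2d(ch // r, ch, 1),
                                 nn.BatchNorm2d(ch))
        self.local = branch()                       # L(F), Eq. (8)
        self.glob = branch()                        # G(F), Eq. (9), fed with GAP(F)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x, y):
        f = x + y                                   # Eq. (7): preliminary linear fusion
        w = torch.sigmoid(self.local(f) + self.glob(self.gap(f)))  # broadcast over HxW
        return w * x + (1.0 - w) * y                # Eq. (10): weighted dynamic fusion

z = AFFSketch(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))  # (2, 64, 32, 32)
```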
LOSS FUNCTION
We optimized the BCD and SS tasks in remote sensing semantic change detection using three loss functions: the semantic loss function (\({L}_{s}\)), the binary change loss function (\({L}_{c}\)), and the semantic change loss function (\({L}_{sc}\)) proposed by Ding et al.42. The semantic
loss function \({L}_{s}\) is designed for the semantic categories in subtask SS, and it calculates the multiclass cross-entropy loss between the predicted semantic categories in the SS task
and the true semantic categories in the semantic change ground truth map. The detailed calculation process is as follows:
$${L}_{s}=-\frac{1}{N}\sum_{i=1}^{N}{y}_{i}\text{log}({\widehat{y}}_{i})$$ (11) where \({y}_{i}\) and \({\widehat{y}}_{i}\) denote the true semantic label category and the probability of
being predicted as the \(i\)-th category, respectively. \(N\) denotes the number of land cover types in semantic change detection. The “no-change” category, denoted by 0, is excluded, as this
facilitates the model’s focus on extracting semantic features in the change regions. The binary change loss function \({L}_{c}\) computes the binary cross-entropy loss between the predicted and actual change maps in the BCD task and mitigates the class imbalance between changed and unchanged regions. The calculation formula is as
follows: $${L}_{c}=-\frac{1}{N}\sum_{i=1}^{N}{W}_{c}\times {y}_{c}\text{log}\left({\widehat{y}}_{c}\right)+{W}_{nc}\times \left(1-{y}_{c}\right)\text{log}\left(1-{\widehat{y}}_{c}\right)$$
(12) where \({y}_{c}\) represents the true value in the binary change label (i.e., 0 for unchanged and 1 for changed), and \({\widehat{y}}_{c}\) denotes the predicted probability that a pixel is changed. \(N\) denotes the number of image pixels, \({W}_{c}\) denotes the weight of the change region, and
\({W}_{nc}\) denotes the weight of the non-change region, with \({W}_{c}\) and \({W}_{nc}\) set to 0.25 and 0.75, respectively. \({L}_{sc}\) is a loss function based on contrastive learning,
which connects BCD tasks with SS tasks. Specifically, in the overall task of semantic change detection, the \({L}_{sc}\) loss function encourages the prediction of similar probability
distributions between unchanged regions, but penalizes the prediction of similar probability distributions in changed regions. The calculation formula is as follows:
$${L}_{sc}=\begin{cases}\text{cos}\left({x}_{1},{x}_{2}\right), & {y}_{c}=1\\ 1-\text{cos}\left({x}_{1},{x}_{2}\right), & {y}_{c}=0\end{cases}$$ (13) where \({x}_{1}\) and
\({x}_{2}\) are the corresponding pixel vectors in the two semantic segmentation results, and \({y}_{c}\) is the binary change label value at the same position. From this, we derive the total loss function
\({L}_{scd}\) by combining \({L}_{s}\), \({L}_{c}\), and \({L}_{sc}\), which is calculated as follows: $${L}_{scd}=\frac{1}{2}\left({L}_{s1}+{L}_{s2}\right)+{L}_{c}+{L}_{sc}$$ (14) where \({L}_{s1}\) and \({L}_{s2}\) denote the semantic segmentation losses for the images from the two time periods, respectively.
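The three losses and their combination in Eq. (14) can be sketched as below. Whether \({L}_{sc}\) is computed on softmax probabilities or raw semantic features, and the exact ignore-index handling, are assumptions; the 0.25/0.75 weights follow the text.

```python
import torch
import torch.nn.functional as F

def scd_loss(sem1, sem2, change_logit, y1, y2, y_c, w_c=0.25, w_nc=0.75):
    """Sketch of Eqs. (11)-(14). sem1/sem2: (B, C, H, W) per-date class logits
    (index 0 = no-change, ignored in L_s); change_logit: (B, 1, H, W);
    y1/y2: (B, H, W) semantic labels; y_c: (B, H, W) binary change labels."""
    # Eq. (11): multi-class cross-entropy per date, excluding category 0
    l_s1 = F.cross_entropy(sem1, y1, ignore_index=0)
    l_s2 = F.cross_entropy(sem2, y2, ignore_index=0)
    # Eq. (12): class-weighted binary cross-entropy for the BCD task
    p = torch.sigmoid(change_logit.squeeze(1)).clamp(1e-6, 1 - 1e-6)
    y = y_c.float()
    l_c = -(w_c * y * p.log() + w_nc * (1 - y) * (1 - p).log()).mean()
    # Eq. (13): pull the two semantic predictions together on unchanged pixels,
    # push them apart on changed pixels
    cos = F.cosine_similarity(sem1.softmax(1), sem2.softmax(1), dim=1)
    l_sc = torch.where(y_c.bool(), cos, 1.0 - cos).mean()
    # Eq. (14): total SCD loss
    return 0.5 * (l_s1 + l_s2) + l_c + l_sc

loss = scd_loss(torch.randn(2, 6, 64, 64), torch.randn(2, 6, 64, 64),
                torch.randn(2, 1, 64, 64), torch.randint(0, 6, (2, 64, 64)),
                torch.randint(0, 6, (2, 64, 64)), torch.randint(0, 2, (2, 64, 64)))
```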
DATASETS AND EXPERIMENTAL SETUP
DATASETS
To better validate the proposed network, we conduct experiments on three publicly available semantic change detection datasets: the SECOND dataset, the Landsat-SCD dataset, and the Hi-UCD min dataset.
SECOND
The SECOND dataset43
consists of 4,662 pairs of remote sensing images collected from multiple platforms and sensors, covering a number of important urban areas in China, including Hangzhou, Chengdu, and
Shanghai, and encompassing a rich variety of surface coverage types. Each image in this dataset has a size of 512 × 512 pixels, and its spatial resolution ranges from 0.5 to 3 m, capable of
capturing subtle changes in the ground surface. However, only 2,968 dual-time image pairs with ground truth labels are currently available, of which change pixels account for 19.87% of the total image pixels. As shown in the sample in Fig. 8, the dataset defines labels for seven categories, including one no-change category and six land cover change categories (unvegetated
surface, trees, low vegetation, water, buildings, and playgrounds). To evaluate model performance in a scientifically sound manner, we divided the SECOND dataset into a training set and a
test set in a ratio of 4:1.
LANDSAT-SCD
The Landsat-SCD dataset44 is a collection of images captured by Landsat in the Tumushuke area of Xinjiang, China, between 1990 and 2020. These images
include the three basic bands: red (R), green (G), and blue (B), and have a spatial resolution of 30 m. As shown in the sample in Fig. 9, the dataset defines five labeled categories,
including one no-change category and four land cover change categories (farmland, desert, buildings, and water bodies). The change pixels account for 18.89% of the total image pixels,
providing rich information on surface changes. The Landsat-SCD dataset contains 8,468 pairs of images, each sized 416 × 416 pixels. After removing augmented images, the dataset includes 2,385 original image pairs, which are divided into training, validation, and test sets in a 3:1:1 ratio.
HI-UCD MIN
The Hi-UCD min dataset45 consists of 745 image
pairs captured by the Leica ADS100-SH100 between 2017 and 2019 in parts of Tallinn, the capital of Estonia. Each image in the dataset has a size of 1024 × 1024 pixels, with a spatial
resolution of 0.1 m. As shown in Fig. 10, the dataset defines 10 label categories, including one “no change” category and 9 land cover change categories (water, grass, forest, greenhouse,
road, building, bare land, and other types). To scientifically evaluate the model’s performance, we cropped the original dataset into 512 × 512 pixel images and removed the unchanged images.
Ultimately, the numbers of images in the training, validation, and test sets were 571, 100, and 705, respectively.
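The cropping step described above might look like the following sketch; the array layouts and the keep/drop criterion (discard tiles whose change mask is empty) are assumptions based on the description.

```python
import numpy as np

def tile_pairs(img1, img2, label_change, size=512):
    """Crop co-registered 1024x1024 scenes into 512x512 tiles and drop
    tiles with no change pixels (hypothetical Hi-UCD min preprocessing)."""
    tiles = []
    h, w = label_change.shape[:2]
    for top in range(0, h, size):
        for left in range(0, w, size):
            sl = (slice(top, top + size), slice(left, left + size))
            if label_change[sl].any():             # keep only tiles containing change
                tiles.append((img1[sl], img2[sl], label_change[sl]))
    return tiles

pairs = tile_pairs(np.zeros((1024, 1024, 3)), np.zeros((1024, 1024, 3)),
                   np.random.randint(0, 2, (1024, 1024)))
```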
EVALUATION METRICS
In this study, to better perform a quantitative analysis of the experimental results, we used eight objective metrics to evaluate the network’s performance in SCD. These include SCD accuracy metrics: Overall Accuracy (OA), Separation
Kappa Coefficient (SeK)43, F1 score for SCD (F1scd), and the Composite Score (Score)36; BCD accuracy metrics: Mean Intersection over Union (mIoU) and F1 score (F1); and SS evaluation metric:
the Kappa coefficient (Kappa). The OA metric measures the proportion of correctly categorized pixels across all categories relative to the total number of pixels, providing a global,
intuitive assessment of accuracy. Let \(Q=\{{q}_{i,j}\}\) be the confusion matrix, where \({q}_{i,j}\) denotes the number of pixels classified into category \(i\) whose true category is \(j\) (\(i,j\in \{0,1,\cdots ,N\}\), with 0 for no change). The formula for OA is shown below: $$OA=\sum_{i=0}^{N}{q}_{ii}/\sum_{i=0}^{N}\sum_{j=0}^{N}{q}_{ij}$$ (15)
The segmentation accuracy of changed and unchanged regions in BCD tasks is evaluated using mIoU. In the BCD task, mIoU is computed from the unchanged region (IoUnc) and the changed region
(IoUc), calculated as follows: $$mIoU=({IoU}_{nc}+{IoU}_{c})/2$$ (16) $${IoU}_{nc}={q}_{00}/(\sum_{i=0}^{N}{q}_{i0}+\sum_{j=0}^{N}{q}_{0j}-{q}_{00})$$ (17) $${IoU}_{c}=\sum_{i=1}^{N}\sum_{j=1}^{N}{q}_{ij}/(\sum_{i=0}^{N}\sum_{j=0}^{N}{q}_{ij}-{q}_{00})$$ (18) SeK is used to assess the classification accuracy of different land cover types in the
SS task. Here, \(\widehat{Q}=\{{\widehat{q}}_{ij}={q}_{ij}\}\), but \({\widehat{q}}_{00}=0\), which is used to exclude no-change pixels. Its calculation formula is as follows: $$\rho
=\sum_{i=0}^{N}{\widehat{q}}_{ii}/\sum_{i=0}^{N}\sum_{j=0}^{N}{\widehat{q}}_{ij}$$ (19) $$\tau
=\sum_{i=0}^{N}\left(\sum_{j=0}^{N}{\widehat{q}}_{ij}*\sum_{j=0}^{N}{\widehat{q}}_{ji}\right)/{(\sum_{i=0}^{N}\sum_{j=0}^{N}{\widehat{q}}_{ij})}^{2}$$ (20) $$Kappa=(\rho -\tau )/(1-\tau )$$
(21) $$Sek={e}^{{IoU}_{c}-1}\bullet Kappa$$ (22) The composite score (Score) can be calculated based on mIoU and SeK as follows: $$Score=0.3\times mIoU+0.7\times Sek$$ (23) F1scd is used to
evaluate the segmentation precision of Land Use/Land Cover (LULC) classes within the change region. This metric is the F1 score computed from the precision (\({P}_{scd}\)) and recall (\({R}_{scd}\)) of the change region. Its calculation formula is as follows: $${P}_{scd}=\sum_{i=1}^{N}{q}_{ii}/\sum_{i=1}^{N}\sum_{j=0}^{N}{q}_{ij}$$ (24)
$${R}_{scd}=\sum_{i=1}^{N}{q}_{ii}/\sum_{i=0}^{N}\sum_{j=1}^{N}{q}_{ij}$$ (25) $${F1}_{scd}=\frac{2*{P}_{scd}*{R}_{scd}}{{P}_{scd}+{R}_{scd}}$$ (26) The formula for calculating F1 score is:
$$Recall=\sum_{i=1}^{N}\sum_{j=1}^{N}{q}_{ij}/\sum_{i=0}^{N}\sum_{j=1}^{N}{q}_{ij}$$ (27) $$Precision=\sum_{i=1}^{N}\sum_{j=1}^{N}{q}_{ij}/\sum_{i=1}^{N}\sum_{j=0}^{N}{q}_{ij}$$ (28)
$$F1=\frac{2*Recall*Precision}{Recall+Precision}$$ (29)
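For reference, Eqs. (15)-(29) can be computed directly from the confusion matrix, as in the sketch below (note that Eq. (16) takes the mean of the two IoUs).

```python
import numpy as np

def scd_metrics(q):
    """Metrics of Eqs. (15)-(29) from the (N+1)x(N+1) confusion matrix q,
    where entry (i, j) counts pixels predicted as category i with true
    category j, and index 0 is 'no change'."""
    q = q.astype(float)
    total = q.sum()
    oa = np.trace(q) / total                                        # Eq. (15)
    iou_nc = q[0, 0] / (q[:, 0].sum() + q[0, :].sum() - q[0, 0])    # Eq. (17)
    iou_c = q[1:, 1:].sum() / (total - q[0, 0])                     # Eq. (18)
    miou = (iou_nc + iou_c) / 2                                     # Eq. (16)
    qh = q.copy(); qh[0, 0] = 0                                     # exclude no-change pixels
    rho = np.trace(qh) / qh.sum()                                   # Eq. (19)
    tau = (qh.sum(1) * qh.sum(0)).sum() / qh.sum() ** 2             # Eq. (20)
    kappa = (rho - tau) / (1 - tau)                                 # Eq. (21)
    sek = np.exp(iou_c - 1) * kappa                                 # Eq. (22)
    score = 0.3 * miou + 0.7 * sek                                  # Eq. (23)
    diag_c = np.trace(q) - q[0, 0]
    p_scd = diag_c / q[1:, :].sum()                                 # Eq. (24)
    r_scd = diag_c / q[:, 1:].sum()                                 # Eq. (25)
    f1_scd = 2 * p_scd * r_scd / (p_scd + r_scd)                    # Eq. (26)
    recall = q[1:, 1:].sum() / q[:, 1:].sum()                       # Eq. (27)
    precision = q[1:, 1:].sum() / q[1:, :].sum()                    # Eq. (28)
    f1 = 2 * recall * precision / (recall + precision)              # Eq. (29)
    return dict(OA=oa, mIoU=miou, SeK=sek, Score=score, F1scd=f1_scd, F1=f1, Kappa=kappa)

print(scd_metrics(np.random.randint(1, 100, (7, 7))))
```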
EXPERIMENTAL SETTINGS
All experiments in this paper were conducted on an NVIDIA GPU (GeForce RTX 4060 Ti) with 16 GB of memory. In all experiments, we used the same parameter configuration: a batch size of 4, 80 training epochs, and an initial learning rate of 0.1.
Additionally, we employed a stochastic gradient descent (SGD) optimizer to iteratively update the model parameters and minimize the loss function.
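A minimal training-loop sketch matching these settings follows; the model, data, and loss here are dummies standing in for STGNet, the SCD datasets, and the losses above, and momentum, weight decay, and any learning-rate schedule are not specified in the text and are omitted.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)          # dummy stand-in for STGNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # SGD, initial lr 0.1
dummy_batches = [(torch.randn(4, 3, 64, 64), torch.rand(4, 1, 64, 64))] * 2  # batch size 4
for epoch in range(80):                               # 80 epochs as stated
    for x, y in dummy_batches:
        optimizer.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss.backward()
        optimizer.step()
```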
EXPERIMENTAL COMPARISON AND ANALYSIS
COMPARISON METHODS
To better compare the performance of our method in identifying change areas and land cover types in the SCD task, we compared it with six existing
state-of-the-art methods: HRSCD-str334, HRSCD-str434, BiSRNet36, SCanNet46, HGINet38, and STSP-Net47. To ensure a fair comparison, we did not use pre-trained weights in our training. * 1.
_HRSCD-str3_: By integrating multi-scale features and constructing a network with BCD branches that incorporate time-dependent information, land cover change detection is achieved. * 2.
_HRSCD-str4_: As an upgraded version of the HRSCD-str3 series, HRSCD-str4 performs skip connections while maintaining high-resolution semantic change detection capabilities. It connects the
twin encoder to the decoder of the CD branch to enhance the recognition of complex changing scenes. * 3. _BiSRNet_: It is a dual-temporal semantic inference network. This method enhances the
ability to identify change areas and land cover types by introducing cross-temporal SR (Cot SR) blocks to model temporal correlations. * 4. _SCanNet_: It is a multi-task-based semantic
change detection network. This method constructs a semantic change converter (SCanFormer) to explicitly model the "from-to" semantic transformation between dual temporal RSIs,
achieving effective recognition of change areas and land cover types in remote sensing images. * 5. _HGINet_: It is a semantic change detection network based on hierarchical semantic graph
interaction. It accurately identifies change areas and land cover types by modeling dual-temporal correlations and using graph learning to represent the interaction of different feature
layers. * 6. _STSP-Net_: It is a spatiotemporal semantic perception network that effectively captures spatiotemporal information in images by introducing spatiotemporal attention mechanisms,
thereby achieving recognition of changing areas and land cover types.
QUANTITATIVE AND QUALITATIVE ANALYSIS
EXPERIMENTAL RESULTS ON THE SECOND DATASET
Table 1 shows the quantitative
comparison results of our method with other methods on the SECOND dataset. Compared to the other methods, the proposed STGNet performs significantly better, ranking first in all evaluation
metrics. Specifically, the mIoU reached 72.83%, Sek was 22.45%, F1scd was 61.83%, and OA was 87.51%. Among these methods, HRSCD-str3 is a network that directly operates on dual-temporal
images for binary change detection. It lacks the connection between the BCD and SS sub-tasks, resulting in the lowest mIoU value on the high-resolution SECOND dataset, with a score of only
66.85%. In contrast, other methods that extract features through an encoder and then perform the BCD task (HRSCD-str4, BiSRNet, SCanNet, HGINet, STSP-Net) have achieved strong results across
various metrics. This demonstrates the importance of feature extraction before change detection in SCD tasks. The SDPNet extractor and BIDS module we proposed are designed to better extract
features from high-resolution remote sensing images and to deeply fuse dual-temporal image features through CTIM. In the ablation study (section "Ablation experiment"), we provide a detailed analysis of their specific advantages. Accordingly, for STSP-Net, which also focuses on the feature extraction stage, its mIoU
reaches second place at 72.31%. To visually demonstrate the advantages of our method on the SECOND dataset, as shown in Fig. 11, we selected 3 pairs of dual-temporal remote sensing images
for qualitative analysis, with detailed areas highlighted using red rectangles. It is clearly evident that, compared to other methods, HRSCD-str3 loses a significant amount of change
information in the results. Additionally, by observing the sixth and seventh rows in Fig. 11, we can see that BiSRNet, SCanNet, HGINet, STSP-Net, and our method, by focusing on the
interaction between features from different time periods, can better capture the changes between "non-vegetated surface" and "low vegetation." Among these, BiSRNet,
STSP-Net, and our method perform best in capturing change areas, accurately detecting the changed regions. However, due to BiSRNet not emphasizing semantic feature extraction sufficiently
during the feature extraction phase, errors in change category recognition appear in the final results, such as part of the "non-vegetated surface" incorrectly being identified as
“low vegetation” instead of "tree." Therefore, compared to other methods, the approach we propose demonstrates a significant advantage in identifying highly dense change areas, not
only accurately recognizing change regions and categories but also preserving clearer boundary information.
EXPERIMENTAL RESULTS ON THE LANDSAT-SCD DATASET
We also conducted both
quantitative and qualitative analyses on the lower-resolution Landsat-SCD dataset. As shown in Table 2, our method demonstrates clearer advantages for this low-resolution dataset, with all
metrics significantly outperforming those of other methods. Among the other methods, only SCanNet, which also uses the Landsat-SCD dataset, and STSP-Net, which employs a dual-branch feature
extraction approach, performed well. This further confirms that using a dual-branch feature extraction strategy can more comprehensively capture detailed information. Compared to the
second-best method, STSP-Net, our method improves F1scd, OA, Sek, mIoU, F1, Kappa, and Score by 6.92%, 2.45%, 17.68%, 6.16%, 5.89%, 4.31%, and 14.23%, respectively. The significant
improvement in the Sek metric is attributed to our method not only enhancing semantic feature acquisition for different time periods but also employing a cross-learning approach that
constrains the BCD and SS tasks along the time dimension, ensuring consistency. To more intuitively demonstrate the performance of our method on low-resolution imagery, we provide partial
visual results in Fig. 12. By comparison, it can be observed that other methods tend to lose many key boundary detail features when identifying change areas in low-resolution images. Among
them, HGINet shows the most significant issue. For example, in the third pair of images in Fig. 12, HGINet fails to effectively distinguish the boundaries of subtle change areas, resulting
in a block-like structure. Additionally, it misidentifies change categories: for example, a region whose true transition is “farmland” to “desert” is instead misclassified as “farmland” to “building.” This is primarily because HGINet is a lightweight semantic change detection model with limited robustness that cannot capture subtle
change information. In contrast, our method not only effectively identifies change regions but also accurately recognizes the change categories.
ANALYSIS OF MODEL ROBUSTNESS AND COMPUTATIONAL COMPLEXITY
To further verify the robustness of the proposed method, this study applied data degradation simulations such as noise, occlusion, and spectral distortion to the
SECOND and Landsat-SCD test datasets. Specifically, we added noise, simulated occlusion, and applied spectral distortion processing to the test data, each with a probability of 0.5. These
simulation operations effectively replicate potential data quality degradation in real-world environments, aiming to assess the model’s stability and performance when faced with complex and
imperfect data.
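A sketch of this degradation protocol is given below; the noise level, occlusion-patch size, and spectral-gain range are illustrative assumptions, as the text specifies only the three perturbation types and the 0.5 application probability.

```python
import torch

def degrade(img, p=0.5):
    """Apply Gaussian noise, a random occlusion patch, and per-channel
    spectral distortion, each independently with probability p."""
    img = img.clone()
    _, _, h, w = img.shape
    if torch.rand(1) < p:                                   # additive Gaussian noise
        img = img + 0.05 * torch.randn_like(img)
    if torch.rand(1) < p:                                   # square occlusion patch
        top = torch.randint(0, h // 2, (1,)).item()
        left = torch.randint(0, w // 2, (1,)).item()
        img[:, :, top:top + h // 4, left:left + w // 4] = 0
    if torch.rand(1) < p:                                   # per-channel spectral gain
        img = img * (1 + 0.1 * torch.randn(1, img.shape[1], 1, 1))
    return img.clamp(0, 1)

noisy = degrade(torch.rand(1, 3, 256, 256))
```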
Figure 13 shows radar charts of five key evaluation metrics, clearly presenting the detection performance of different methods under these simulated conditions. The experimental results indicate that other methods exhibit significant instability under these conditions. For example, on the SECOND dataset, although SCanNet maintained nearly optimal
detection performance in the noise environment, its performance was lower than that of HRSCD-str4 and STSP-Net in the “occlusion” environment. In contrast, our method maintained a leading
position in all test environments, demonstrating stronger robustness and further proving its superiority in practical applications. Meanwhile, we conducted experiments on the Hi-UCD min
dataset, which contains more detailed surface information. Specifically, as shown in Table 3, compared to the second-best methods, our approach still maintained a leading position across
various evaluation metrics, particularly with a significant improvement of 9.72% in F1scd and 5% in Sek. Additionally, we conducted a comprehensive performance analysis of the model on the
Hi-UCD min dataset, considering aspects such as parameter count, computational cost, and inference time. As shown in Table 3, although network structures like HRSCD-str3, HRSCD-str4, and
HGINet maintained relatively low computational cost due to their simple design, their performance on the SCD task was less satisfactory. In contrast, compared to SCanNet and STSP-Net, which
also focus on feature extraction, our method demonstrated superior performance in terms of parameter count, computational cost, and inference speed. This not only validates the good balance
between efficiency and performance in our model but also further highlights its superiority and practicality in complex surface-information processing tasks.
PARAMETER DISCUSSION
The two key hyperparameters in the binary change loss function \({L}_{c}\), the weight of the change region (\({W}_{c}\)) and the weight of the unchanged region (\({W}_{nc}\)), play a crucial role in addressing the class imbalance
issue in the BCD task. They ensure that the model effectively detects the change regions while also accurately identifying the unchanged regions. To explore the optimal configuration of
these two hyperparameters, we conducted several comparison experiments with different hyperparameter combinations on the Landsat-SCD and SECOND datasets. As shown in Table 4, the
experimental results clearly demonstrate the impact of different hyperparameter settings on model performance. Since the change regions in images are relatively small in real-world
scenarios, when Wc is greater than 0.5 and Wnc is less than 0.5, the model’s performance metrics show significant degradation. To further improve the model’s performance, we adopted a
strategy of gradually decreasing Wc and correspondingly increasing Wnc. Ultimately, when Wc was set to 0.25 and Wnc was set to 0.75, the model successfully achieved a good balance between
the proportions of changed and unchanged regions, thereby ensuring accurate change detection while significantly improving the identification accuracy of unchanged regions.
ABLATION EXPERIMENT
To further validate the effectiveness of the various methods proposed in this paper, we conducted comprehensive ablation experiments on three public datasets, covering the
following key aspects: the selection of the backbone network, the optimization of the dual-branch feature extraction path, the enhancement effect of the BiDS module on feature extraction in
the dual-branch structure, the balancing effect of the CTIM module on SS and CD tasks in remote sensing semantics, and the computational cost of the model. As shown in Table 5, we used the
SSCDL42 architecture proposed by Ding et al. as the base model. It can be seen that, with only a small increase in parameters and reduced computational complexity, SSCDL using ResNet50
significantly outperforms the model using ResNet34 across various indicators. This is because ResNet50 has a deeper network structure and stronger feature extraction capabilities, enabling
it to learn more complex and richer feature representations, making it better suited for processing complex and diverse image data. In addition, by incorporating the Detail-Aware Path (DAP),
we observed further improvements across various metrics. Compared to the baseline network based on ResNet50, the Sek metric increased by 0.29%, 12.23%, and 0.1% on the SECOND dataset, the
Landsat-SCD dataset, and the Hi-UCD min dataset, respectively. This clearly demonstrates that introducing the spatial detail branch can effectively enhance the model’s performance and
improve its ability to capture spatial detail information. BiDS plays a crucial bidirectional guidance role in the dual-branch structure. The introduction of BiDS enables the network to
better focus on the precise extraction of semantic and change information in remote sensing images. This improvement is reflected in the mIoU and Sek evaluation metrics across the three
public datasets, with increases of 0.6% and 1.16%, 1.3% and 3.91%, and 0.34% and 0.45%, respectively. With the addition of CTIM, our proposed method achieved optimal results across all
metrics on these three datasets, further validating that CTIM can deeply explore the intrinsic relationships and differences between images from different time periods, thus aiding in the
identification of unchanged regions in change detection tasks. Compared to the baseline network using ResNet34, the optimal results showed significant improvements across all metrics.
Specifically, on the SECOND dataset, mIoU increased by 1.57%, Sek by 4.14%, F1scd by 4.28%, and OA by 1.33%; on the Landsat-SCD dataset, mIoU increased by 7.03%, Sek by 19.81%, F1scd by
7.8%, and OA by 2.74%; on the Hi-UCD min dataset, mIoU increased by 1.15%, Sek by 1.45%, F1scd by 1.48%, and OA by 2.22%. It is worth noting that the addition of BiDS and CTIM only increased
the parameter count by 1% and 0.03%, respectively, indicating that with the introduction of only a small number of additional parameters, we achieved better performance in change detection
and land category recognition. To more intuitively verify the effectiveness of the proposed method, and given that the accuracy improvement is most pronounced on the Landsat-SCD dataset, we selected a pair of typical remote sensing images from that dataset for a visual analysis of the ablation experiments. As shown
in Fig. 14, we specifically selected key areas from the experimental results and enlarged them for easier observation of the details. The experimental results show that when ResNet50 is used
as the backbone network, its excellent feature extraction ability ensures that key features in the image are fully preserved. With the introduction of the Detail-Aware Path (DAP) and BiDS
modules, the network’s ability to capture subtle changes in regions and their edge information has been significantly improved. However, in the semantic change detection results, there were
still cases where areas with identical semantic categories at both dates were incorrectly identified as changed. To overcome this challenge, we introduced the CTIM module, which can
accurately capture the intrinsic relationships and differences between images from different time periods, thereby ensuring the result consistency between SS tasks and CD tasks. The final
experimental results show that the addition of the CTIM module greatly improves the accuracy of semantic change detection and successfully solves the problem of misjudging the semantic
categories in the change areas. This fully demonstrates the effectiveness of the proposed Cross-Temporal Refinement Interaction Module (CTIM).
CONCLUSIONS
In Semantic Change
Detection (SCD) tasks, accurate identification of change regions and types is crucial. In this study, we delve into several common challenges present in existing SCD tasks and, based on this
analysis, propose a network that guides multitask semantic change detection through spatiotemporal semantic interaction (STGNet). This network adopts a dual-path feature extractor based on
a Siamese structure, significantly enhancing its ability to extract complex feature sets from remote sensing images by cleverly incorporating spatial detail information. Building on this, we
further design an innovative bidirectional guidance module (BiDS), which establishes an effective connection between spatial detail information and high-level semantics. This strengthens
the feature extractor’s ability to capture key information, enriching the representation of deep semantic features and low-level spatial information. Additionally, to fully leverage the
temporal correlation between dual-temporal images (i.e., images from different time periods), we carefully design the Cross-Temporal Refinement Interaction Module (CTIM). This module deeply
explores the inherent connections and subtle differences between dual-temporal images, helping to more accurately identify unchanged areas in change detection, thereby improving the overall
accuracy of change detection. To comprehensively evaluate the performance of the proposed network, we conduct detailed experiments on three publicly available authoritative datasets. The
experimental results show that, compared with the most advanced change detection methods, the proposed STGNet achieves the highest accuracy with the introduction of a small number of
parameters, further verifying its effectiveness in processing complex remote sensing image change detection tasks. In future work, we will continue to focus on implementing semi-supervised
semantic change detection to further reduce excessive dependence on datasets, lower the cost of manual annotation, and enhance the model’s generalization ability with limited annotated data.
Specifically, we will explore effective semi-supervised learning algorithms by combining a large amount of unlabeled data with a small amount of high-quality labeled data. These algorithms
will aim to accurately capture subtle semantic changes while maintaining the efficiency and practicality of the model, providing stronger technical support for semantic change detection
tasks in remote sensing images.
DATA AVAILABILITY
The SECOND dataset used in this study is publicly available at https://captain-whu.github.io/SCD/, accessed on 18 October 2023. The Landsat-SCD dataset can be accessed at https://doi.org/10.6084/m9.figshare.19946135.v1, accessed on 18 October 2023. The Hi-UCD min dataset used in this study is publicly
REFERENCES
1. Song, X.-P. et al. Global land change from 1982 to 2016. _Nature_ 560(7720), 639–643 (2018).
2. Huang, X., Schneider, A. & Friedl, M. A. Mapping sub-pixel urban expansion in China using MODIS and DMSP/OLS nighttime lights. _Remote Sens. Environ._ 175, 92–108 (2016).
3. Lunetta, R. S. et al. Land-cover change detection using multi-temporal MODIS NDVI data. _Remote Sens. Environ._ 105(2), 142–154 (2006).
4. Huang, X. et al. Multi-level monitoring of subtle urban changes for the megacities of China using high-resolution multi-view satellite imagery. _Remote Sens. Environ._ 196, 56–75 (2017).
5. Bovolo, F. & Bruzzone, L. The time variable in data fusion: A change detection perspective. _IEEE Geosci. Remote Sens. Mag._ 3(3), 8–26 (2015).
6. Jin, S. et al. A land cover change detection and classification protocol for updating Alaska NLCD 2001 to 2011. _Remote Sens. Environ._ 195, 44–55 (2017).
7. Zhu, Q. et al. A review of multi-class change detection for satellite remote sensing imagery. _Geo-spatial Inf. Sci._ 27(1), 1–15 (2024).
8. Sakurada, K., Shibuya, M. & Wang, W. Weakly supervised silhouette-based semantic scene change detection. In _2020 IEEE International Conference on Robotics and Automation (ICRA)_ (IEEE, 2020).
9. Ru, L., Du, B. & Wu, C. Multi-temporal scene classification and scene change detection with correlation based fusion. _IEEE Trans. Image Process._ 30, 1382–1394 (2020).
10. Zheng, Z. et al. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. _Remote Sens. Environ._ 265, 112636 (2021).
11. El Amin, A. M., Liu, Q. & Wang, Y. Zoom out CNNs features for optical remote sensing change detection. In _2017 2nd International Conference on Image, Vision and Computing (ICIVC)_ (IEEE, 2017).
12. Tang, Y. et al. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. _Precision Agric._ 24(4), 1183–1219 (2023).
13. Shi, W. et al. Change detection based on artificial intelligence: State-of-the-art and challenges. _Remote Sens._ 12(10), 1688 (2020).
14. Wang, J. et al. Object detection based on adaptive feature-aware method in optical remote sensing images. _Remote Sens._ 14(15), 3616 (2022).
15. Dong, X. et al. Attention-based multi-level feature fusion for object detection in remote sensing images. _Remote Sens._ 14(15), 3735 (2022).
16. Dong, H. et al. Enhanced lightweight end-to-end semantic segmentation for high-resolution remote sensing images. _IEEE Access_ 10, 70947–70954 (2022).
17. Xiong, J. et al. CSRNet: Cascaded selective resolution network for real-time semantic segmentation. _Expert Syst. Appl._ 211, 118537 (2023).
18. Zhang, H. et al. ESCNet: An end-to-end superpixel-enhanced change detection network for very-high-resolution remote sensing images. _IEEE Trans. Neural Netw. Learn. Syst._ 34(1), 28–42 (2021).
19. Han, C. et al. Change guiding network: Incorporating change prior to guide change detection in remote sensing imagery. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ (2023).
20. Han, C. et al. HANet: A hierarchical attention network for change detection with bitemporal very-high-resolution remote sensing images. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ 16, 3867–3878 (2023).
21. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. Preprint at https://arxiv.org/abs/1609.02907 (2016).
22. Vaswani, A. et al. Attention is all you need. Preprint at https://arxiv.org/abs/1706.03762 (2017).
23. Ning, X. et al. Multi-stage progressive change detection on high resolution remote sensing imagery. _ISPRS J. Photogramm. Remote Sens._ 207, 231–244 (2024).
24. Zhou, M., Qian, W. & Ren, K. Multistage interaction network for remote sensing change detection. _Remote Sens._ 16(6), 1077 (2024).
25. Cai, Y. et al. CSANet: A channel-spatial attention network for remote sensing image change detection. _Int. J. Remote Sens._ 44(19), 5936–5959 (2023).
26. Larabi, M. E. A. et al. High-resolution optical remote sensing imagery change detection through deep transfer learning. _J. Appl. Remote Sens._ 13(4), 046512 (2019).
27. Ling, J. et al. IRA-MRSNet: A network model for change detection in high-resolution remote sensing images. _Remote Sens._ 14, 5598 (2022).
28. Peng, X. et al. Optical remote sensing image change detection based on attention mechanism and image difference. _IEEE Trans. Geosci. Remote Sens._ (2020).
29. Chen, J. et al. DASNet: Dual attentive fully convolutional Siamese networks for change detection in high-resolution satellite images. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ (2021).
30. Daudt, R. C., Le Saux, B. & Boulch, A. Fully convolutional Siamese networks for change detection. In _2018 25th IEEE International Conference on Image Processing (ICIP)_ (IEEE, 2018).
31. Zhou, Z. et al. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. _IEEE Trans. Med. Imaging_ 39(6), 1856–1867 (2019).
32. Liu, R. et al. Deep depthwise separable convolutional network for change detection in optical aerial images. _IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens._ 13, 1109–1118 (2020).
33. Daudt, R. C., Le Saux, B. & Boulch, A. Fully convolutional Siamese networks for change detection. In _2018 25th IEEE International Conference on Image Processing (ICIP)_ (IEEE, 2018).
34. Daudt, R. C. et al. Multitask learning for large-scale semantic change detection. _Comput. Vis. Image Underst._ 187, 102783 (2019).
35. Chen, P. et al. FCCDN: Feature constraint network for VHR image change detection. _ISPRS J. Photogramm. Remote Sens._ 187, 101–119 (2022).
36. Ding, L. et al. Bi-temporal semantic reasoning for the semantic change detection in HR remote sensing images. _IEEE Trans. Geosci. Remote Sens._ 60, 1–14 (2022).
37. Zheng, Z. et al. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. _ISPRS J. Photogramm. Remote Sens._ 183, 228–239 (2022).
38. Long, J. et al. Semantic change detection using a hierarchical semantic graph interaction network from high-resolution remote sensing images. _ISPRS J. Photogramm. Remote Sens._ 211, 318–335 (2024).
39. Li, S. et al. AGSPNet: A framework for parcel-scale crop fine-grained semantic change detection from UAV high-resolution imagery with agricultural geographic scene constraints. (2024).
40. Xu, J., Xiong, Z. & Bhattacharyya, S. P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_ (2023).
41. Dai, Y. et al. Attentional feature fusion. In _Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision_ (2021).
42. Ding, L. et al. Bi-temporal semantic reasoning for the semantic change detection of HR remote sensing images. (2021).
43. Yang, K. et al. Asymmetric Siamese networks for semantic change detection. (2020).
44. Yuan, P. et al. A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images. _Int. J. Digital Earth_ 15, 1506–1525 (2022).
45. Tian, S. et al. Hi-UCD: A large-scale dataset for urban semantic change detection in remote sensing imagery. (2020).
46. Ding, L. et al. Joint spatio-temporal modeling for semantic change detection in remote sensing images. _IEEE Trans. Geosci. Remote Sens._ (2024).
47. He, Y. et al. Spatial-temporal semantic perception network for remote sensing image semantic change detection. _Remote Sens._ 15(16), 4095 (2023).

ACKNOWLEDGEMENTS
This work was
supported in part by the Science and Technology Plan Project of Sichuan Province (No. 2023YFS0371), the Sichuan Key Provincial Research Base of Intelligent Tourism (No. ZHZJ24-01), and the Innovation Fund for Research on Complex Scene Landscape and Grassland Change Detection Based on Deep Learning (Y2024116).

AUTHOR INFORMATION
AUTHORS AND AFFILIATIONS
* Sichuan University of Science and Engineering, Yibin, 644000, China: Yinqing Wang, Liangjun Zhao & Yuanyang Zhang
* Sichuan Key Provincial Research Base of Intelligent Tourism, Yibin, 644000, China: Liangjun Zhao
* School of Tropical Agriculture and Forestry, Hainan University, Haikou, 570228, Hainan Province, China: Yueming Hu
* School of Information and Communication Engineering, Hainan University, Haikou, 570100, Hainan Province, China: Hui Dai
* Changsha City Planning Information Service Center, Changsha, 410006, Hunan Province, China: Hui Dai

CONTRIBUTIONS
Yinqing Wang was responsible for the design of research methods and
experimental procedures, supported manuscript writing and revision, supplied experimental equipment, and provided both technical and financial support. Liangjun Zhao participated in the research design, data collection, data processing, and analysis, provided financial support, and was involved in the implementation of the research methods. Yueming Hu and Hui Dai participated in data collection, data processing, and analysis. Yuanyang Zhang was involved in the implementation of the research methods and contributed to the revision of the manuscript. All authors reviewed the manuscript.

CORRESPONDING AUTHOR
Correspondence to Liangjun Zhao.

ETHICS DECLARATIONS
COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
PUBLISHER’S NOTE
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

RIGHTS AND PERMISSIONS
OPEN ACCESS
This
article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction
in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the
licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

ABOUT THIS ARTICLE
CITE THIS ARTICLE
Wang, Y., Zhao, L., Hu, Y. et al. Multitask semantic change detection guided by spatiotemporal semantic interaction. _Sci. Rep._ 15, 16003 (2025). https://doi.org/10.1038/s41598-025-00750-8
* Received: 03 January 2025
* Accepted: 30 April 2025
* Published: 08 May 2025
* DOI: https://doi.org/10.1038/s41598-025-00750-8

KEYWORDS
* Remote sensing images
* Semantic change detection
* Spatial–temporal semantic
* Multi-task network
* Deep learning