3.1. Overview
This study proposes a systematic pipeline for detecting drug-related coded language on social media platforms. As illustrated in Figure 1, the overall workflow consists of four primary stages:
Data Acquisition and Annotation: Raw text data are systematically collected from mainstream Chinese social media platforms (e.g., Weibo, Douyin, Bilibili, and Xiaohongshu) using automated web scrapers. Following rigorous noise reduction and data cleaning, the samples are manually annotated to distinguish illicit coded language from benign text, yielding a high-quality labeled dataset.
Feature Preprocessing: The annotated dataset is transformed into structured feature representations suitable for deep learning architectures. This stage encompasses tokenization, stop-word removal, and the generation of dense word embeddings.
Model Construction and Training: An optimized TextCNN-based classification framework is deployed. By applying multi-scale convolutional filters, the model efficiently extracts and aggregates local semantic features from the short, highly noisy texts characteristic of social media.
Performance Evaluation: The trained model is rigorously evaluated against established baseline methods using standard classification metrics to validate its robustness, scalability, and overall detection efficacy.
3.2. Construction and Annotation of Drug-Related Slang Corpus
This section details the construction of the drug-related slang corpus, including data sources, data preprocessing, data annotation, and statistical analysis, laying the foundation for subsequent model training and evaluation. The flow chart of the corpus construction is shown in Figure 2.
3.2.1. Data Sources
This study employs a multi-source collaborative strategy for data collection, using two complementary approaches to obtain text data containing drug-related code language: collecting real-world text and automatically generating samples from a code language lexicon. For the first approach, Python web crawling techniques are used to collect publicly available interactive content from mainstream social platforms. Sample data obtained by crawling are shown in Table 1.
The collected data come primarily from three sources: judicial documents from the Supreme People’s Court Judgment Documents Network and drug-related case files from the Ministry of Public Security, parsed to extract drug-related code language patterns from real criminal cases; authoritative texts such as government work reports and anti-drug bulletins from public security agencies, mined to capture policy-oriented drug-related semantic features; and public drug-related posts from social platforms such as Weibo, Douyin, and WeChat, crawled dynamically and then passed through sensitive-information filtering and vectorization to build a corpus that tracks changes over time.
To augment the training data, this study automatically generated two categories of samples based on a drug-related code language lexicon: one category consists of sentences directly conveying hidden drug-related meanings (e.g., “A new batch of express tea has arrived”), and the other consists of contrastive sentences embedding code language vocabulary in benign contexts (e.g., “Community outreach on new tea varieties to prevent fraud”). By randomly combining different grammatical structures and contextual scenarios, more than 10,000 texts were generated (over 5,000 drug-related sentences and over 5,000 normal sentences); they maintain linguistic naturalness while covering expressive variation across diverse real-world scenarios. Partial examples of sentences randomly generated from the code language lexicon are shown in Table 2.
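The template-based generation described above can be illustrated with a minimal sketch. The lexicon entries, templates, and the function name `generate_samples` are hypothetical placeholders modeled on the two example sentences in the text; the lexicon and sentence patterns actually used in this study are not reproduced here.

```python
import random

# Hypothetical lexicon and templates, for illustration only.
CODE_WORDS = ["express tea", "new tea"]
ILLICIT_TEMPLATES = [
    "A new batch of {w} has arrived",
    "Fresh {w} in stock, message me privately",
]
BENIGN_TEMPLATES = [
    "Community outreach on {w} varieties to prevent fraud",
    "The tea expo introduces {w} to visitors",
]


def generate_samples(n_per_class, seed=0):
    """Return shuffled (sentence, label) pairs; label 1 = drug-related, 0 = normal."""
    rng = random.Random(seed)
    samples = []
    for label, templates in ((1, ILLICIT_TEMPLATES), (0, BENIGN_TEMPLATES)):
        for _ in range(n_per_class):
            template = rng.choice(templates)
            word = rng.choice(CODE_WORDS)
            samples.append((template.format(w=word), label))
    rng.shuffle(samples)
    return samples


samples = generate_samples(5)
```

Because the same code words appear in both classes, a classifier trained on such pairs cannot rely on the vocabulary alone and has to use the surrounding context.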
3.2.2. Data Preprocessing
Raw text data collected from various sources often contains a large amount of noise and unstructured information that cannot be directly fed into a model for training. Therefore, a series of preprocessing operations must be applied to transform the raw data into standardized, clean text corpora. The preprocessing pipeline consists of three main steps: text cleaning, word segmentation, and code-language-specific augmentation.
(1) Text Cleaning: Text cleaning is the first preprocessing step and removes noise unrelated to semantics. Social media texts frequently contain URL links, HTML tags, emoticons, and various special symbols. This study uses regular expressions to filter out such content: for example, pattern matching is applied to remove URLs beginning with http or www, topic hashtag symbols are removed while the topic text is preserved, and punctuation marks are eliminated. Since some emoticons may be related to drug-related code language, they are converted to textual descriptions during cleaning to preserve potential semantic cues.
(2) Word Segmentation: Word segmentation is the process of dividing a continuous sequence of Chinese characters into word units with independent semantics, and can be understood as a mapping from a character sequence to a word sequence. Let the raw text be represented as a character sequence:
$$S = (c_1, c_2, \ldots, c_m)$$
where $c_i$ denotes the $i$-th Chinese character. The goal of word segmentation is to convert this into a word sequence:
$$W = (w_1, w_2, \ldots, w_n)$$
where $w_j$ denotes a word composed of one or more consecutive characters. This process can be formally expressed as:
$$W = f_{\text{seg}}(S; D)$$
where $f_{\text{seg}}$ denotes the segmentation function and $D$ is the segmentation dictionary. This study uses the Jieba tokenizer to implement this mapping.
To address inaccurate segmentation of domain-specific vocabulary by general-purpose tokenizers, a domain-specific custom dictionary is introduced to extend the original dictionary:
$$D' = D \cup D_{\text{drug}}$$
where $D_{\text{drug}}$ contains the set of drug-related code language terms (e.g., “liubing,” “zhurou,” “linghao jiaonang,” “xiaoqi,” etc.). Under the extended dictionary constraint, the segmentation function is updated to:
$$W = f_{\text{seg}}(S; D')$$
By this means, words satisfying $w \in D_{\text{drug}}$ are kept intact during segmentation, thus avoiding semantic fragmentation and improving recognition accuracy for domain-specific expressions. The custom dictionary is built upon the code language lexicon collected in the preliminary phase and is continuously expanded and refined throughout the annotation process.
(3) Augmentation Processing: To handle variant expressions of drug-related code language on social media (such as homophone substitution, pinyin abbreviations, and numeric encoding), this study applies semantic normalization to the segmentation results, mapping non-standard expressions to canonical forms [28]. Given the segmented word sequence $W = (w_1, w_2, \ldots, w_n)$, a code language mapping function is constructed:
$$W' = g(W) = (M(w_1), M(w_2), \ldots, M(w_n))$$
where $W'$ is the augmented word sequence. The mapping dictionary $M$ transforms words one by one, defined as follows:
$$M(w) = \begin{cases} \tilde{w}, & w \in \operatorname{dom}(M) \\ w, & \text{otherwise} \end{cases}$$
where $\operatorname{dom}(M)$ denotes the domain of the mapping dictionary and $\tilde{w}$ is the normalized expression corresponding to word $w$. For example, homophones or abbreviated forms are mapped to their standard code language equivalents. This process essentially implements a mapping from the original expression space to a canonical semantic space. Because several variants can share one canonical form, the mapping is many-to-one rather than injective, and because language is polysemous and context-dependent, some mapped words may still carry semantic ambiguity. Nevertheless, in a statistical sense, this augmentation improves the model’s ability to recognize variant code language expressions, with the gain most evident in recall.
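The cleaning and normalization steps above can be sketched with the Python standard library alone. This is an illustrative approximation rather than the study's implementation: the regular expressions, the function names `clean_text` and `normalize`, and the entries of `VARIANT_MAP` are assumptions, emoticon-to-text conversion is omitted, and a plain whitespace split stands in for Jieba segmentation with the extended dictionary.

```python
import re

HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#([^#\s]+)#?")   # keep the topic text, drop the '#'
PUNCT_RE = re.compile(r"[^\w\s]")          # \w also matches CJK characters in Python 3

# Hypothetical variant -> canonical entries (pinyin abbreviation, numeric encoding).
VARIANT_MAP = {"lb": "liubing", "6bing": "liubing"}


def clean_text(text):
    """Strip HTML tags, URLs, hashtag symbols and punctuation; collapse whitespace."""
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.split())


def normalize(tokens, mapping=VARIANT_MAP):
    """Replace each token found in the mapping dictionary; leave the rest unchanged."""
    return [mapping.get(tok, tok) for tok in tokens]


raw = "Check <b>this</b> out http://t.cn/abc #lb# arrived!!"
tokens = clean_text(raw).split()   # whitespace split stands in for Jieba here
normalized = normalize(tokens)
```

HTML tags are removed before URLs so that a URL inside an attribute does not swallow the closing bracket of its tag.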
3.2.3. Data Annotation
Data annotation is the key step in converting raw text into usable samples, and annotation quality directly affects model learning outcomes. This study adopts a binary annotation scheme, classifying sentences as either non-drug-related (labeled 0) or drug-related (labeled 1). To ensure annotation accuracy and consistency, three annotators with relevant backgrounds carried out the labeling. The annotation workflow is as follows:
(1) Pre-annotation phase: A random sample of 200 instances is selected for pre-annotation. Inter-annotator agreement is calculated, disagreements are discussed, and consistent standards are established.
(2) Formal annotation phase: A double-blind annotation scheme is adopted, with each sample independently labeled by two annotators. Annotators do not communicate during the process to avoid mutual influence.
(3) Cross-review phase: Annotation results are compared; samples with consistent labels are directly accepted. For samples with inconsistent labels, a third senior annotator adjudicates to determine the final label.
To quantify the reliability of annotation results, this study uses Cohen’s Kappa coefficient to assess inter-annotator agreement. The Kappa coefficient is calculated as:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is the observed agreement rate (the proportion of samples labeled identically by both annotators), and $p_e$ is the expected agreement rate (the proportion of agreement attributable to chance). Kappa can in principle range from −1 to 1, although in practice it usually falls between 0 and 1: values above 0.75 indicate good agreement, values between 0.40 and 0.75 indicate moderate agreement, and values below 0.40 indicate poor agreement. The Kappa coefficients at each annotation stage in this study are shown in Table 3.
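As a worked illustration of the agreement measure, the sketch below computes Cohen's Kappa for two annotators from the observed and chance agreement rates. The function name `cohens_kappa` and the toy labels are illustrative and are not taken from the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of samples given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n)
              for c in set(count_a) | set(count_b))
    if p_e == 1.0:
        # Kappa is undefined (0/0) when both annotators use one identical label;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example: 3 of 4 labels agree, chance agreement is 0.5, so Kappa is 0.5.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```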
The Kappa coefficients for the pre-annotation and formal annotation stages are 0.82 and 0.91, respectively. Both exceed the 0.75 threshold for good agreement, which supports the reliability of the annotation results. Sample annotations from the dataset are shown in Table 4.
3.3. Model Architecture
3.3.1. Model Overview
This study proposes a domain-customized TextCNN framework tailored for drug-related code language detection. The overall model architecture is shown in Figure 3, comprising an embedding layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. The input consists of preprocessed short social media texts, and the output is a probability distribution indicating whether the text belongs to the drug-related or non-drug-related category.
The text sequence is first mapped through the embedding layer into 100-dimensional dense vectors, forming a word vector matrix. Subsequently, convolutional kernels of three sizes (3, 4, and 5) slide in parallel over the matrix to extract local semantic features of varying lengths, generating multiple sets of feature maps. A global max-pooling operation is then applied to each set of feature maps to retain the most salient feature responses, and all pooling results are concatenated into a fixed-length feature vector. Finally, this vector undergoes nonlinear transformation through the fully connected layer and is passed through a Softmax function to produce the classification output. During training, the Adam optimizer is used to update parameters, with the cross-entropy loss function employed to optimize the classification objective.
3.3.2. Input Layer and Embedding Layer
The input layer receives the preprocessed sentence texts. Since neural networks require inputs of fixed dimensions, this study uniformly adjusts all text sequences to a length of 100 tokens. Sentences shorter than 100 tokens are zero-padded at the end; for sentences longer than 100 tokens, only the first 100 tokens are retained. This setting is based on statistical analysis of the dataset and is sufficient to cover the vast majority of short social media texts.
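The length adjustment can be sketched in a few lines. The function name `pad_or_truncate` and the use of index 0 as the padding token are assumptions made for illustration.

```python
MAX_LEN = 100
PAD_ID = 0  # assumed vocabulary index reserved for padding


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Keep the first max_len ids and right-pad shorter sequences with pad_id."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


short = pad_or_truncate([5, 6, 7])
long_ = pad_or_truncate(list(range(1, 151)))
```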
The embedding layer maps discrete word indices to continuous dense vectors. Suppose the input sentence contains $n$ words ($x_1, x_2, \ldots, x_n$); the embedding layer converts it into a matrix $X \in \mathbb{R}^{n \times d}$, where $d$ is the word vector dimensionality, and the $i$-th row of the matrix corresponds to the $d$-dimensional vector representation of the $i$-th word in the sentence. This study sets the word vector dimensionality to $d = 100$ and initializes the embedding matrix using pretrained Word2Vec word vectors [29]. Pretrained word vectors provide the model with rich semantic priors, enabling basic language understanding ability from the early stages of training. During training, the embedding layer parameters are set to a fine-tunable state, allowing the word vectors to adaptively adjust toward the semantic space of the drug-related domain.
3.3.3. Convolutional Layer Design and Domain-Specific Adaptation
The convolutional layer is designed not merely for generic feature extraction, but as a specialized mechanism to penetrate the morphological camouflage of evasive illicit text. To effectively capture the dynamic and context-dependent patterns of coded expressions, this study employs parallel multi-scale convolutional filters. Specifically, the kernel heights are configured to $h \in \{3, 4, 5\}$.
The architectural decision to employ this specific configuration is deeply rooted in the adversarial nature of Chinese drug-related coded language. In online illicit markets, offenders deliberately utilize highly condensed, obfuscated terminology, ranging from two-character euphemisms (e.g., “ice smoking”, “pork”) to multi-character compound metaphors (e.g., “Foxy Methoxy”, “Spice pen”), to circumvent static platform surveillance. During Chinese word segmentation, core two-character illicit phrases are frequently condensed into a single token. Consequently, setting the minimum kernel size to 3 serves as an optimal baseline: it captures not only the core illicit token itself but also its immediate syntactical anchors (e.g., surrounding verbs or prepositions). This tri-gram contextualization is critical for disambiguating benign usage from illicit intent in noisy environments.
Concurrently, larger convolutional filters ($h = 4$ and $h = 5$) are strategically deployed to capture longer, syntactically complex semantic obfuscations. These extended receptive fields are adept at recognizing fragmented expressions or multi-word idioms where criminals deliberately insert noise characters to bypass rule-based detection. By operating these three kernel sizes in parallel, the framework dynamically learns n-gram representations across varying granularities, establishing a robust defense against the continuous evolution of code-word length and structure.
Mechanically, each of the three kernel sizes comprises 128 independent filters. Given that the width of each filter strictly corresponds to the word vector dimensionality $d$, the weight matrix of an individual kernel is defined as $K \in \mathbb{R}^{h \times d}$. During the forward pass, each kernel slides vertically across the input embedding matrix, processing $h$ consecutive word vectors at each step. This operation computes a localized feature value via a dot product, followed immediately by a ReLU non-linear activation. Each kernel thus generates a feature map whose length is jointly determined by the input length $n$ and the kernel height $h$ (for unit stride without padding, $n - h + 1$), mapping the raw adversarial text into a dense semantic feature space.
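The sliding-window computation can be made concrete with a dependency-free sketch for a single kernel. Unit stride and no padding are assumptions (the text does not state them), and the function name `conv_feature_map` and the toy values are illustrative.

```python
def conv_feature_map(X, kernel, bias=0.0):
    """Slide an h x d kernel down an n x d matrix; return n - h + 1 ReLU activations."""
    h, n = len(kernel), len(X)
    feature_map = []
    for i in range(n - h + 1):
        # Dot product between the kernel and the window of h consecutive word vectors.
        s = bias
        for r in range(h):
            s += sum(x * w for x, w in zip(X[i + r], kernel[r]))
        feature_map.append(max(0.0, s))  # ReLU
    return feature_map


# Toy input: n = 5 tokens with d = 2 dimensions, one kernel of height h = 3.
X = [[1, 0], [0, 1], [1, 1], [0, 0], [2, -1]]
kernel = [[1, 0], [0, 1], [-1, 0]]
fmap = conv_feature_map(X, kernel)

# With n = 100 tokens, the three kernel heights give feature maps of these lengths.
lengths = [100 - h + 1 for h in (3, 4, 5)]
```

In practice a deep learning framework would run all 128 filters of each size in a single batched operation; the loop above only spells out what one filter computes.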
3.3.4. Pooling Layer and Feature Fusion
The pooling layer reduces the dimensionality of the feature maps output by the convolutional layer, extracting the most representative features. This study employs a global max-pooling strategy, selecting the maximum value from the feature map generated by each convolutional kernel as that kernel’s final output.
The core idea behind global max-pooling is that, for the drug-related code language recognition task, a key feature need only appear once in a sentence to support a judgment—its exact position is irrelevant. Regardless of whether “ice smoking” appears at the beginning, middle, or end of a sentence, it will be captured by the corresponding convolutional kernel, and the pooling layer retains its maximum activation value, thereby ensuring that critical information is not lost.
After the pooling layer, each convolutional kernel outputs a scalar value. With 128 kernels for each of the three sizes, a total of 384 scalar values are produced. These scalars are concatenated into a single feature vector that aggregates local semantic information at different granularities from the input sentence, providing a feature basis for subsequent classification.
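The pooling and concatenation arithmetic can be checked with a small sketch. The feature maps here are dummy values generated only to exercise the shapes; `global_max_pool` is an illustrative name, and the feature-map lengths assume unit stride without padding.

```python
KERNEL_SIZES = (3, 4, 5)
FILTERS_PER_SIZE = 128
SEQ_LEN = 100


def global_max_pool(feature_maps):
    """Reduce each feature map (a list of activations) to its maximum value."""
    return [max(fm) for fm in feature_maps]


pooled = []
for h in KERNEL_SIZES:
    # Dummy feature maps of length SEQ_LEN - h + 1, one per filter.
    maps = [[float((f + t) % 7) for t in range(SEQ_LEN - h + 1)]
            for f in range(FILTERS_PER_SIZE)]
    pooled.extend(global_max_pool(maps))

# Position invariance: the same peak gives the same pooled value wherever it occurs.
same_peak = max([0.0, 0.0, 5.0]) == max([5.0, 0.0, 0.0])
```

The concatenated vector has 3 × 128 = 384 entries regardless of the input length, which is what allows a fixed-size fully connected layer to follow.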
3.3.5. Output Layer
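The output layer consists of a fully connected layer and a Softmax classifier. The feature vector output by the pooling layer first undergoes nonlinear transformation and dimensionality compression through the fully connected layer. The fully connected layer has 128 neurons and uses the ReLU activation function:
$$\mathbf{u} = \operatorname{ReLU}(W_{fc}\,\mathbf{v} + \mathbf{b}_{fc})$$
where $\mathbf{v} \in \mathbb{R}^{384}$ is the concatenated pooled feature vector, $W_{fc} \in \mathbb{R}^{128 \times 384}$ is the weight matrix and $\mathbf{b}_{fc} \in \mathbb{R}^{128}$ is the bias term.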
To prevent overfitting, a Dropout mechanism is introduced after the fully connected layer. During training, Dropout randomly drops a fraction p of neuron outputs, forcing the model to avoid dependence on specific neurons. Dropout is active only during training; during inference, all neurons participate in computation.
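The output layer applies the Softmax function to map the fully connected layer’s output into binary classification probabilities:
$$\hat{\mathbf{y}} = \operatorname{Softmax}(W_o\,\mathbf{u} + \mathbf{b}_o)$$
where $\mathbf{u}$ denotes the fully connected layer’s output and $W_o$, $\mathbf{b}_o$ are the weights and bias of the output layer. The Softmax function is defined as $\operatorname{Softmax}(\mathbf{z})_k = e^{z_k} / \sum_{j} e^{z_j}$. The two output components $\hat{y}_0$ and $\hat{y}_1$ correspond to the predicted probabilities for the non-drug-related (label 0) and drug-related (label 1) categories, respectively, and satisfy $\hat{y}_0 + \hat{y}_1 = 1$. The final prediction is the category with the higher probability.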
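A minimal sketch of the Softmax step and the final decision is given below. Subtracting the maximum logit before exponentiating is a standard numerical-stability measure added here and not something the text specifies; the logit values are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two logits: index 0 = non-drug-related, index 1 = drug-related.
probs = softmax([0.3, 1.7])
predicted_label = max(range(len(probs)), key=probs.__getitem__)
```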
3.3.6. Model Training Settings
After annotation, the dataset is randomly shuffled and split into training, validation, and test sets in an 8:1:1 ratio. The training set is used for learning and updating model parameters; the validation set monitors model performance during training; the test set is used for final evaluation of model performance.
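One way to realize the shuffled 8:1:1 split is sketched below. The fixed seed and the function name `split_dataset` are assumptions; the text does not say whether the split is seeded or stratified by label.

```python
import random


def split_dataset(samples, seed=42):
    """Shuffle a copy of the samples and split it 8:1:1 into train/val/test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # remainder, so no sample is dropped
    return train, val, test


train, val, test = split_dataset(range(1000))
```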
Model training uses the Adam optimizer with an initial learning rate of 0.001. Adam combines the concepts of momentum and adaptive learning rates, dynamically adjusting the learning rate of each parameter based on estimates of the first and second moments of the gradients. It offers the advantages of fast convergence and low sensitivity to hyperparameters.
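To make the moment-based update concrete, the following sketch performs a single Adam step for one scalar parameter. The learning rate 0.001 comes from the text; `beta1 = 0.9`, `beta2 = 0.999`, and `eps = 1e-8` are the commonly used defaults and are assumptions here, since the text does not report them.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# On the first step the update magnitude is close to lr regardless of gradient scale.
theta, m, v = adam_step(theta=1.0, grad=4.0, m=0.0, v=0.0, t=1)
```

The first-step behavior shows the adaptive scaling: a gradient of 4.0 and a gradient of 0.04 both move the parameter by roughly the learning rate.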
The loss function is cross-entropy loss, defined as:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]$$
where $N$ is the batch size, $y_i$ is the true label of the $i$-th sample, and $\hat{p}_i$ is the model’s predicted probability of the drug-related category. Cross-entropy loss effectively measures the discrepancy between the predicted probability distribution and the true label distribution.
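The loss can be evaluated by hand for a small batch using the standard binary cross-entropy form. The clipping constant `eps`, which guards against taking the logarithm of zero, is an implementation detail assumed here, and the function name is illustrative.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; y_prob holds predicted drug-related probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])   # equals log 2
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A model that always outputs 0.5 scores log 2 ≈ 0.693, which is a useful reference point when reading training curves.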
The training batch size is set to 64, meaning 64 samples are processed simultaneously per iteration. The maximum number of training epochs is set to 10. After each epoch, the loss and accuracy on the validation set are computed to monitor the training process. If validation performance ceases to improve, an early-stopping strategy is applied to prevent overfitting. The model with the best performance on the validation set is ultimately selected for evaluation on the test set. The main hyperparameters of this model are listed in Table 5.
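The early-stopping and model-selection logic can be sketched independently of any training framework. The patience value of 2, the use of validation loss as the monitored quantity, and the function name `select_best_epoch` are assumptions, since the text does not report the stopping criterion in detail; the loss values are invented for illustration.

```python
def select_best_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss, stopped_at) under patience-based early stopping."""
    best_epoch, best_loss = 0, float("inf")
    waited, stopped_at = 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        stopped_at = epoch
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0  # new best checkpoint
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, stopped_at


# Hypothetical validation losses: the best model is from epoch 3, training stops at 5.
result = select_best_epoch([0.60, 0.45, 0.40, 0.42, 0.43, 0.39])
```

Note that with this patience the later improvement at epoch 6 is never reached, which is the usual trade-off between training cost and the risk of stopping too early.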