This section provides a detailed overview of our proposed method, as illustrated in
Figure 1. Our methodology involves several key steps. Initially, we conduct a preliminary dataset pre-processing step, which involves removing duplicate passwords. Subsequently, the dataset is partitioned into two distinct sets: the training set and the test set. The primary focus of the pre-processing step is on the training set, where two sets of features are computed. The first set is based on BBPE, while the second set is derived from characters extracted from the leaked passwords. These feature sets are then used as inputs for the subsequent stages of our model. Following this, we employ the AE architecture to generate novel passwords based on the provided feature sets. These generated passwords are then evaluated by comparing them against the unseen test data.
The following subsections delve into each step in greater detail, illuminating the various processes involved and providing a more comprehensive understanding of our proposed method.
3.1. Data Pre-Processing
The initial step in our data cleaning process involves the removal of duplicate passwords from the dataset. Once this collection of unique passwords is obtained, it is further divided into two distinct sets: the training set, which comprises 80% of the cleaned data, and the testing set, which accounts for the remaining 20%. It is important to note that there are no specific constraints on the maximum length of passwords during this division process.
Moving on to the subsequent phase, we employ two distinct methods for processing the input data: character-based and token-based, also known as BBPE-based. These methods serve as the basis for training two separate models: one utilizing character-based processing, and the other employing BBPE-based processing. By utilizing these two different approaches to process the input data, we can explore and evaluate the effectiveness of each technique in training separate models. This approach allows for comprehensive analysis and comparison of the performance and outcomes of the character-based and BBPE-based models.
The BBPE algorithm, which is a variant of byte pair encoding (BPE), is utilized to handle character sequences that have unique representations at the byte level. This approach finds widespread application in subword encoding for various natural language processing tasks [
30]. The BBPE algorithm begins by taking an initial word composed of individual bytes and subsequently combines pairs of frequently occurring bytes to form longer tokens. This process involves iteratively merging the most common pairs of bytes until a predetermined number of combinations is reached. By employing this technique, the BBPE algorithm effectively captures the recurring patterns and structures present in the character sequences, allowing for the creation of meaningful and compact representations.
In our approach, we employed a BBPE-based tokenizer to extract meaningful subword units from the text of passwords. By analyzing patterns across the entire dataset, which consisted of leaked passwords from the training set, the tokenizer identified subword units that represented either individual characters or combinations of characters commonly seen in passwords (byte-level sequences). To determine when to stop the repetition of the BBPE algorithm for each chunk, we introduced a parameter called min_frequency. This parameter served as a criterion, defining the minimum number of times a subword needed to appear in the training data to be included in the vocabulary set. In our experiments, we set this parameter to 10. Consequently, the tokenizer identified and indexed subwords that appeared at least 10 times in the training data. These unique subwords and their corresponding indexes were stored in a Vocabulary file. Based on the assembled vocabulary, our tokenizer generated tokens for each password in the dataset. These tokens were then recorded in the Tokens file, along with their respective indexes. Additionally, we extracted all distinct characters from the training data and stored them in the Characters file, alongside their respective identifiers or indexes. By employing this approach, we were able to effectively tokenize passwords, extracting meaningful subword units and capturing their relationships. The resulting Vocabulary, Tokens, and Characters files serve as valuable resources for subsequent stages of our password generation guessing models.
In order to further enhance the stored tokens, we introduced additional tokens, namely <START>, <END>, <OOV> (Out-of-Vocabulary), and <PADDING>. These tokens, along with their corresponding indexes, were incorporated into the Vocabulary and Characters files. In the Tokens file, the <START> token was appended at the beginning of each encrypted password, while the <END> token was appended at the end. This served to mark the start and end points of each password sequence.
Figure 2 provides a visual representation of the tokenization process, offering clarity and insight into the steps involved in transforming passwords into tokenized sequences. By integrating these additional tokens, we expanded the token vocabulary and enriched the representation of passwords in the tokenized form. This augmentation aids in subsequent stages of analysis, modeling, and generation, providing a more comprehensive and effective approach to password processing.
Following the previous steps, the index data then underwent a windowing process with a window size of 5 and a step size of 1. This windowing process involved traversing the input data and generating three distinct forms of data:
X1: This represents the token indexes based on the window. It captures the indexes of tokens within each windowed segment of the input data.
X2: This denotes the character indexes. It records the indexes of individual characters within the windowed segments.
Y: This represents the target data, where the indexes are shifted one step forward from the X1 indexes. It captures the next token index in the sequence following the windowed segments.
By applying this windowing process, we create these three forms of data, X1, X2, and Y, which serve as inputs for subsequent stages of our model. This methodology facilitates the analysis and modeling of password sequences, enabling the exploration of relationships and patterns within the data.
The Y data, representing the target values, was encoded using the One-Hot encoding technique. One-Hot encoding is commonly utilized to convert categorical data, such as class labels, into binary vectors. It assigns a unique binary representation to each category, with one bit set to one to indicate the presence of the category and the remaining bits set to zero. In addition, we employed the crawl-300d-2M-subword model to encode the pre-processed tokens. This particular model provides word embeddings, which are dense vector representations of words used to capture semantic associations between them. The crawl-300d-2M-subword model is trained on a large corpus of text and can encode not only complete words but also subword units such as word roots or morphemes. This expanded coverage allows for a more precise representation of semantic information, including OOV terms. By following these procedures, we obtained the embedding matrix from the crawl-300d-2M-subword model. This embedding matrix serves as a mapping between the tokens and their corresponding embeddings. It enables the model to access and utilize the semantic information stored in the pre-trained word embeddings throughout the training process, enhancing its ability to capture and leverage the underlying meaning and context of the tokenized data.
3.2. AE-Based Password Generator
Figure 3 illustrates the architecture of the password generator module, which utilizes an AE framework with LSTM networks and a multi-head attention mechanism to generate passwords. The AE network consists of encoders and decoders that play distinct roles in the password generation process. The encoders are responsible for encoding and compressing the input passwords, effectively reducing their dimensionality and enabling the model to capture essential features and patterns from the input data. On the other hand, the decoders focus on restoring the data to its original representation space, reconstructing it based on the encoded information. During the training phase of the AE model, the objective is to minimize the discrepancy between the reconstructed output and the original input. By doing so, the model learns to extract significant features and patterns from the input data, enhancing its ability to generate passwords that align with the characteristics of the training set.
The AE network is structured using LSTM models, which are well-suited for representing and capturing sequential and long-term dependencies within. To further enhance the capabilities of the AE network, we incorporate the multi-head attention mechanism. This mechanism allows the model to selectively focus on different segments of the input data, assigning varying levels of importance to each segment. By leveraging this attention mechanism, the model gains the ability to prioritize and emphasize specific elements of the password input, giving greater significance to relevant features and patterns, and enhancing the efficiency and effectiveness of the model for password guessing. By selectively attending to crucial aspects of the input data, the model can more accurately capture the underlying patterns and generate passwords that align with the learned patterns. This approach enables the model to generate passwords that exhibit desirable characteristics and meet the requirements of password guessing tasks. Overall, the combination of LSTM models and the multi-head attention mechanism empowers the AE network to effectively capture sequential dependencies, assign importance to relevant elements, and leverage learned patterns for efficient password guessing. The details of the model settings is provided in
Table 1 (
Section 4.2).
3.2.1. Training AE
For training purposes, each token-based and character-based input undergoes training using separate AE networks. These networks share identical structures and parameter values, with the only difference being the input length they handle. As part of the pre-processing stage, we determine the length of the input data to appropriately configure the AE networks. Specifically, we assign a value of 5 to represent the input length for the first AE network, which operates on BBPE-based inputs. Similarly, for the second AE network, we assign the length of the max sequence of the character-based input. This ensures that each AE network is tailored to handle the specific input length requirements. Following the determination of input lengths, we transform the input sequences into tensors. This transformation enables efficient processing and computation within the network, as tensors provide a structured and optimized format for data representation. By converting the input sequences into tensors, we facilitate seamless and effective information flow through the AE networks during both training and inference stages. In summary, by employing separate AE networks for token-based and character-based inputs and configuring them based on the appropriate input lengths, we ensure that each network can effectively capture the distinctive properties of the input data.
In the model, both inputs are encoded using two LSTM layers. In the initial layer, we set the “return_sequences” option to True, allowing the network to generate sequences at each time step. This configuration ensures that the model can capture sequential dependencies and generate outputs accordingly. To obtain the final internal state, which comprises the hidden state and the cell state, for each input in the second layer, we enable the “return_state” parameter and set it to True. These modes are crucial for effectively decoding the model’s outputs and maintaining the information flow through the network. Moving on to the decoder implementation, we utilize not only the initial states acquired from the encoders but also the inputs provided to each encoder. By incorporating this strategy and utilizing the information transmitted from the encoders, we enhance the precision and effectiveness of our model.
The decoder consists of an LSTM layer with both the “return_sequences” and “return_state” parameters set to True. This configuration allows the decoder to generate sequences and retain the final internal state for subsequent steps. In order to incorporate the multi-head attention mechanism and consolidate the decoder outputs, we initially flatten the output of the LSTM layer. This flattening process prepares the output for further processing. Subsequently, the flattened outputs are passed through the attention mechanism, enabling the model to selectively focus on different segments of the input based on their significance. This attention mechanism enhances the model’s ability to capture important patterns and features during the decoding process. After the attention mechanism, we combine the results obtained for each AE and utilize a dense layer to reduce the dimensions. The dimensionality reduction is achieved by multiplying the window size with the number of signs acquired for the Y data. Finally, we apply the Softmax activation function to obtain the probability distribution over the different classes or categories, allowing the model to generate the final outputs with proper normalization. By following these steps, our model leverages LSTM layers, attention mechanisms, and dimensionality reduction techniques to enhance precision and effectiveness in generating accurate and meaningful outputs. The generated passwords will be evaluated based on provided metrics to reflect the strength of our proposed model in password guessing.
3.2.2. Password Generation Using AE
After the model has been trained, we save the obtained weights to preserve its learned knowledge and enable future use. To generate passwords using the trained model, we rely on several input files from the training data: Vocabulary, Tokens, Indexes, and Characters files. We initiate the password generation process by making a sequence of calls in the following order:
Specify the desired number of passwords to be generated.
Provide the trained model.
Include the Vocabulary file.
Include the Characters file.
Specify the window length.
During this step, we utilize the inverse dictionary, denoted as "iVocabulary", which allows us to map token indexes back to their corresponding tokens in the Vocabulary file. This mapping is essential for generating coherent and meaningful passwords. Based on the specified number of passwords to generate, the following phases are repeated for each password. We begin by creating a list of size 5, representing the moving window of the event and serving as the initial input for the token-based model. This list is initialized with the "<PADDING>" tag, which signifies an empty or null value. This initialization sets the foundation for the password generation process, ensuring that the model starts with an appropriate context and can generate passwords that align with the learned patterns and structures. By following this sequence of calls and initializing the moving window with the "<PADDING>" tag, we establish the necessary framework for generating passwords using the trained model and associated input files.
In addition to the token-based input, we have a second input in the model that consists of characters. To process this input, we convert the list of tokens into a character-based representation, which is then stored in a new list of the same length as the second longest input. This conversion allows us to effectively handle character-level information within the model. After the conversion, the pre-determined values are transmitted to the model as primary inputs, enabling it to make predictions about the probability of the subsequent token. To ensure consistency and comparability, we standardize the obtained probabilities. The next token is selected from the standardized probabilities using a random selection technique. This randomness adds a degree of variation and unpredictability to the generated passwords, enhancing their security and uniqueness. The selected token is then appended to the list of length 5, which serves as the input for the model in the subsequent iteration. This procedure continues iteratively until the model's prediction reaches the "<END>" token, which serves as a stopping criterion for the password generation process and allows us to generate passwords of variable lengths. By following this iterative process, the model generates passwords by progressively predicting and appending tokens based on their probabilities until an "<END>" token is reached, indicating the completion of a password. This approach ensures that the generated passwords are coherent and consistent with the learned patterns and structures of the training data.