3.2. Data Extraction
A Python script titled CSV-splitting.py (available at https://github.com/JoshuaBryan02/Personality-Prediction-System/blob/main/JoshuaBryan_Final_Code/CSV-splitting.py) was developed to automate the processing of raw data collected via Microsoft Forms. The script separates the consent form from the personality assessment data and computes individual personality trait scores. Initially, file path variables are defined to locate the questionnaire responses. The program checks whether it has been executed previously and, if so, prompts the user to confirm whether a reload is necessary. If the script is running for the first time or a reload is approved, it sequentially executes three custom-defined functions. The first function, convert_xlsx(), uses the pandas library to convert the Excel-formatted dataset into comma-separated values (CSV) format for streamlined processing.
The second function, split_csv(), parses the converted CSV file and separates it into two distinct files, Consent_Form.csv and Personality_Test.csv, each containing the respective data. The third function, Personality_calculation(), reads the Personality_Test.csv file and initializes data structures to store personality scores. These include five lists corresponding to the Big Five Inventory (BFI) traits [20], with lengths equal to the number of valid participants. Algorithm 1 presents the personality score calculation approach. As shown in Figure 2, a systematic sequence of preprocessing and implementation steps was employed to ensure the effective development of the prediction model.
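The column-wise split performed by split_csv() can be illustrated with a minimal sketch using only the standard library. The column names, the sample row, and the boundary of three consent columns are hypothetical; the actual layout of the Microsoft Forms export is not given in the text.

```python
import csv
import io


def split_columns(rows, n_consent_cols):
    """Split every row into a consent part and a questionnaire part."""
    consent = [row[:n_consent_cols] for row in rows]
    personality = [row[n_consent_cols:] for row in rows]
    return consent, personality


# Hypothetical export: three consent columns followed by two answers.
raw = "name,username,dob,q1,q2\nA,@a,01/01/2000,3,5\n"
rows = list(csv.reader(io.StringIO(raw)))

consent_rows, personality_rows = split_columns(rows, n_consent_cols=3)
```

In a real run, each of the two lists would be written to its own file (Consent_Form.csv and Personality_Test.csv) with csv.writer.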
A nested loop is implemented in which the outer loop iterates through each participant (data point) and the inner loop processes responses to the 44 BFI questions. This procedure enables trait-wise score calculation [21] by aggregating and interpreting individual responses according to the BFI scoring scheme. Each response was scored in accordance with the Big Five Inventory (BFI) guidelines [22], with individual data points allocated to one of the five personality traits based on the corresponding questionnaire item. During each iteration of the outer loop, the counter advanced to the next index, effectively separating responses by participant. To accommodate items requiring reverse scoring, a custom function was implemented. This function transforms each response x on the 1 to 5 scale into 6 − x, so that a score of 5 is converted to 1, 4 to 2, and so on, while 3 remains unchanged. This aligns with the BFI’s reversed scoring mechanism [23]. The function is invoked conditionally based on the reverse-coded nature of specific questions, as presented in Algorithm 2.
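The nested scoring loop and the reverse-coding rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the four-item key below is a hypothetical toy, whereas the real script would use the 44-item key from the BFI scoring guidelines.

```python
def reverse_code(score):
    """Map a 1-5 response to its reversed value (5 -> 1, 4 -> 2, ...)."""
    if not 1 <= score <= 5:
        raise ValueError("score must lie in the closed interval [1, 5]")
    return 6 - score


# Hypothetical key: item number -> (trait, is_reverse_coded).
TOY_KEY = {
    1: ("extraversion", False),
    2: ("agreeableness", True),
    3: ("extraversion", True),
    4: ("agreeableness", False),
}


def score_participants(responses, key):
    """Return one dict of raw trait totals per participant."""
    results = []
    for participant in responses:  # outer loop: one participant per pass
        totals = {}
        for item_no, answer in enumerate(participant, start=1):  # inner loop
            trait, is_reversed = key[item_no]
            value = reverse_code(answer) if is_reversed else answer
            totals[trait] = totals.get(trait, 0) + value
        results.append(totals)
    return results


scores = score_participants([[5, 1, 2, 4], [3, 3, 3, 3]], TOY_KEY)
```

Keeping the key as data rather than as a chain of conditionals makes it straightforward to check against the published scoring sheet.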
Following the assignment and adjustment of trait scores, the raw values were normalized to facilitate comparative analysis. Normalization was achieved using min-max scaling, which transformed the values to the [0, 1] range; the scaled values were subsequently multiplied by 100 to yield percentage-based scores for interpretability. The minimum possible score for a trait corresponds to the number of questions associated with that trait (assuming a score of 1 per item), while the maximum is calculated as the number of items multiplied by the maximum score of 5. After normalization, the resulting scores for all five traits were compiled into a list and appended as a new row to a CSV file named Normalized_Traits.csv to support model analysis and visualization.
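The normalization described above amounts to a one-line formula. The sketch below assumes only what the text states (minimum answer 1, maximum answer 5); the function and parameter names are illustrative.

```python
def normalize_trait(raw_score, n_items, min_answer=1, max_answer=5):
    """Min-max scale a raw trait total to a 0-100 percentage."""
    lowest = n_items * min_answer    # every item answered with 1
    highest = n_items * max_answer   # every item answered with 5
    return (raw_score - lowest) / (highest - lowest) * 100


# A trait measured by 8 items can range from 8 to 40.
midpoint = normalize_trait(24, n_items=8)
```

For a trait with 8 items, a raw total of 24 sits halfway between the bounds 8 and 40, so it maps to 50.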
Algorithm 1 Personality Scores Calculation
Input: Personality_Scores_Dataset
Output: Lists of calculated scores for each of the five personality traits
1. dataset ← Load_Personality_Test()
2. convert loaded_data into list of values()
3. set counters()
4. fix variable
6. with
7. compute_scores_and_update_initialised_lists
8. return display_results_for_lists_containing_scores(extraversion, agreeableness, conscientiousness, neuroticism, openness)
Algorithm 2 Reverse Code
Input: Number_to_reverse [1 to 5]
Output: Reversed_score
1. define closed interval ()
2. check and reverse score:
3. if 1 <= number_to_reverse <= 5:
4. endif
5. output = 6 − number_to_reverse
3.3. Instagram API Instaloader
A Python script, Data_collection.py, was created using the Instaloader library to automate the extraction of user data from Instagram. Initially, variables are declared to specify the file paths of both the pre-existing Consent_Form.csv file and a new output file (Instagram_data.csv), which is generated during the data collection process. For training the personality prediction system, usernames and dates of birth are extracted from the Consent_Form.csv file, which is generated by the preceding script CSV-splitting.py. These values are then transformed into a list for processing. When the system is employed to predict the personality of a new individual, the corresponding Instagram username and date of birth are manually provided by the user and appended to the input list.
An instance of the Instaloader class is initialized to handle data acquisition from Instagram with parameters configured to restrict retrieval to user posts and video thumbnails. An overview of the automated Instagram data extraction pipeline is presented in Algorithm 3. The script iterates over each retrieved post using a for loop. For each post, the posting date is compared against the volunteer’s eighteenth birthday. Posts made prior to this age threshold are excluded from further analysis. Posts made on or after this date are downloaded using the initialized Instaloader instance. The retrieved content is saved in a directory named after the user's Instagram handle.
To mitigate the risk of surpassing Instagram’s API rate limits, a randomized delay is introduced between post retrieval iterations. This delay, which ranges from 20 to 50 seconds, is drawn from a pseudo-random number generator and applied through a sleep function. The program prompts the user to input their Instagram username, which is used by a pre-initialized instance of the Instaloader class to retrieve the corresponding session cookie. This session is then imported into the program, allowing the instance to operate within an authenticated environment. A loop is then implemented to iterate through a predefined list of usernames. During each iteration, a temporary list is instantiated to store the data associated with the current user. The username is extracted and sanitized by removing any leading '@' characters, which were occasionally prepended by volunteers when submitting their Instagram identifiers.
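The username clean-up and the randomized delay can be sketched with the standard library alone. The function names and the injectable sleep argument are illustrative assumptions; the text states only that a random delay of 20 to 50 seconds is applied through a sleep function.

```python
import random
import time


def clean_username(raw):
    """Strip surrounding whitespace and any leading '@' characters."""
    return raw.strip().lstrip("@")


def polite_pause(low=20.0, high=50.0, sleep=time.sleep):
    """Sleep for a random duration in [low, high] seconds and return it.

    The sleep callable is injectable so that the function can be
    exercised without actually waiting.
    """
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


handle = clean_username("@example_user")
```

Calling polite_pause() between downloads spaces out the requests; passing sleep=lambda seconds: None is convenient when trying the function out.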
Algorithm 3 Automated Extraction of Instagram User Data
Input: Instagram_user_profile_object
Output: Username_Identifier_used_with_downloader_posts
1. get_posts(from_the_user_profile)
2. check if a folder with the name_username exists:
3.   if it does not exist,
4.     iterate through the_user’s_posts using
5.       dropwhile skip posts created before the user’s_eighteenth_birthdate
6.       takewhile process posts created after the eighteenth_birthday
7.     create folder_path by joining output_folder and username.
8.     for each qualifying post:
9.       download the post into the folder named after the_user
10.      randomise delay
11.      if counts >150:
12.      end if
13.    end for
14. return display_results (file_paths, username, dates_of_birth, posts)
In each iteration of the loop, the birthdate of the volunteer is extracted and converted into a Python datetime object using the strptime function. To compute the date on which the volunteer attained the age of eighteen, a timedelta object equivalent to 18 years is added to the original birthdate. The volunteer’s Instagram profile is then accessed by invoking the Profile.from_username method, which requires the Instagram username and uses the pre-configured Instaloader instance to authenticate the session and enforce download constraints. Once the profile object is instantiated, essential account metadata such as the username, total number of media posts, follower count, and number of followees are extracted. These data are appended as a new entry to the output CSV file, Instagram_data.csv. The get_posts method is then called to retrieve metadata for all posts associated with the profile, including download paths. Prior to initiating the download process, the script checks whether a directory for the current user's posts already exists, to avoid redundant downloads.
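The age-threshold logic can be sketched without any network access by treating each post as its posting date. Several details here are assumptions rather than facts from the text: the date format string, the approximation of 18 years as 18 × 365.25 days, and the ordering of posts from newest to oldest (which is what makes takewhile suitable).

```python
from datetime import datetime, timedelta
from itertools import takewhile

DATE_FORMAT = "%d/%m/%Y"  # assumed format of the date-of-birth column


def eighteenth_birthday(dob_text, fmt=DATE_FORMAT):
    """Parse a date of birth and add roughly eighteen years to it."""
    born = datetime.strptime(dob_text, fmt)
    return born + timedelta(days=18 * 365.25)  # approximation of 18 years


def posts_since(post_dates, cutoff):
    """Keep dates on or after the cutoff; expects newest-first order."""
    return list(takewhile(lambda posted: posted >= cutoff, post_dates))


cutoff = eighteenth_birthday("01/01/2000")
dates = [
    datetime(2020, 5, 1),
    datetime(2018, 6, 1),
    datetime(2017, 1, 1),
    datetime(2015, 3, 3),
]
kept = posts_since(dates, cutoff)
```

Because takewhile stops at the first date that fails the test, older posts are never inspected, which saves requests when the input is a lazy iterator. If the ordering cannot be guaranteed, a plain filter over all posts is the safer choice.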
3.4. Model Description
The Places_365 convolutional neural network (CNN) [24] was employed to extract contextual features from the background of social media images, while the YOLOv8 model [25, 26] was used to estimate the number of individuals depicted in each image. A CSV file named Image_Paths.csv was generated to store the file paths of the downloaded Instagram posts. A for loop iterated over a list of Instagram usernames, each of which was preprocessed to remove any leading '@' symbol. For each user, a variable was defined to specify the path to the directory containing their downloaded images. Another for loop was used to traverse all files within the user's directory, appending the file paths of individual posts to a list. The collected paths were then written as new rows into the Image_Paths.csv file, as illustrated in Algorithm 4. Following this, the image paths were loaded from the Image_Paths.csv file and converted into a Python list for further processing. The file labels_sunattribute.txt contains the class labels required by the model; it was read line by line, with each line appended to a list.
A conditional check was implemented to verify the existence of the label_encoder_CNN1.pkl file. If the file was not found, a new instance of the LabelEncoder class was initialized and fitted using the compiled list of labels. The fitted encoder was then serialized and saved to label_encoder_CNN1.pkl using the pickle module. Otherwise, if the file already existed, it was deserialized and loaded into the program. The script CNN2.py begins by defining a variable to store the output path for the People_Count_features.csv file, which contains the features extracted by the CNN. An empty list was initialized to store the extracted features during the execution cycle, and the corresponding image file paths were retrieved for processing.
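The load-or-fit pattern used for the label encoder can be sketched as below. To keep the example free of third-party dependencies, a plain dictionary that maps each sorted unique label to an integer stands in for scikit-learn's LabelEncoder; this is a substitution for illustration, and the helper names are hypothetical.

```python
import os
import pickle


def build_label_index(labels):
    """Stand-in for a fitted encoder: sorted unique label -> integer."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def load_or_create(path, build):
    """Return the pickled object at path, creating it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    obj = build()
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return obj
```

A call such as load_or_create("label_encoder_CNN1.pkl", lambda: build_label_index(labels)) fits the mapping on the first run and reuses the saved copy afterwards, so the label-to-integer assignment stays identical between training and prediction.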
Algorithm 4 Data Preprocessing
Input: Username_list
Output: Store_image_paths
1. if Image_paths file exists:
2.   delete the file
3. else,
4.   open the Image_paths
5. for each username in username_list:
6.   if username starts with ‘@’, remove ‘@’ from the beginning
7.   create folder_path by joining output_folder and username.
8.   for each file in the directory folder_path:
9.     create img_path by joining folder_path and file
10.    append img_path to user_paths
11. return user_paths as Image_paths
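The steps of Algorithm 4 can be sketched with the standard library as follows. The function names are illustrative, and skipping users whose folder is missing is an added assumption that the text does not address.

```python
import csv
import os


def collect_image_paths(usernames, output_folder):
    """Gather the paths of all files stored under each user's folder."""
    paths = []
    for username in usernames:
        handle = username.lstrip("@")  # drop any leading '@'
        folder = os.path.join(output_folder, handle)
        if not os.path.isdir(folder):
            continue  # assumption: ignore users without downloads
        for name in sorted(os.listdir(folder)):
            paths.append(os.path.join(folder, name))
    return paths


def write_paths(paths, csv_path):
    """Write one path per row; mode 'w' replaces any earlier file."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for path in paths:
            writer.writerow([path])
```

Opening the output in write mode has the same effect as deleting an existing Image_Paths.csv before writing, so stale paths from an earlier run cannot linger.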
A nested while loop was created to iterate through all image paths and extract the associated features. The outer loop concluded by computing the frequency of each individual's appearance across all volunteer posts. This frequency was subsequently normalized by the total number of images, and the resulting normalized frequency was appended to the feature list. Thereafter, the set of normalized frequencies for each volunteer was further scaled to the range [0, 1], with a small constant added to mitigate division-by-zero errors. In training mode, as specified by the user prompt, the maximum and minimum values used in the normalization process were stored in a pickle file. In prediction mode, these values were retrieved from the pickle file to ensure consistent normalization of the input data. Finally, the normalized feature list was transformed into a Pandas data frame and exported as a CSV file titled People_Count_features.csv.
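The two-stage normalization can be sketched as below. The value of the small constant and its placement in the denominator are assumptions, since the text says only that a constant is added to avoid division by zero; the bounds are round-tripped through pickle in memory to mirror the training and prediction modes.

```python
import pickle

EPSILON = 1e-8  # assumed value of the "small constant"


def normalized_frequencies(counts, n_images):
    """Divide each raw count by the total number of images."""
    return [count / n_images for count in counts]


def fit_bounds(values):
    """Return the (minimum, maximum) pair to be saved in training mode."""
    return min(values), max(values)


def scale(values, bounds, eps=EPSILON):
    """Min-max scale values to roughly [0, 1] using stored bounds."""
    lowest, highest = bounds
    return [(v - lowest) / (highest - lowest + eps) for v in values]


freqs = normalized_frequencies([0, 2, 4], n_images=4)

# Training mode: fit the bounds and serialize them.
saved = pickle.dumps(fit_bounds(freqs))

# Prediction mode: restore the same bounds before scaling new data.
scaled = scale(freqs, pickle.loads(saved))
```

With the constant in the denominator, a feature whose minimum equals its maximum scales to 0 instead of raising ZeroDivisionError.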
A Python-based implementation titled Random_Forest.py was developed to construct a personality prediction pipeline utilizing environmental and social media-derived features based on the Instagram metadata. The initial stage involved defining file paths for all requisite datasets, including Final_Dataset.csv. Two feature sets, namely Environment_features.csv and People_Count_features.csv, were loaded independently into Pandas data frames and subsequently concatenated to form a unified feature matrix. In parallel, Instagram_data.csv was loaded into a separate data frame, and its columns were iteratively normalized using min-max scaling within a while-loop construct. The maximum and minimum values employed during normalization were serialized using the pickle module to ensure consistency during future inference. Normalized_Traits.csv was imported and integrated with the composite feature matrix to produce the final training dataset, which was saved as Final_Dataset.csv. This concluded the data preprocessing phase.
A function named Random_Forest_Tuning was executed. This function ingested Final_Dataset.csv and converted its contents into a NumPy array. The features were defined as all columns excluding the final five, which represented the target personality traits. Given the presence of low-variance features potentially detrimental to model performance, Principal Component Analysis (PCA) [27] was employed to retain components explaining 95% of the total variance. The trained PCA model was serialized for reuse in subsequent stages. A Random Forest Regressor was instantiated, and hyperparameter optimization was performed via exhaustive grid search. This procedure evaluated multiple hyperparameter configurations using a 4-fold cross-validation scheme, where each fold held out one subset for validation and used the remaining data for training. For each configuration, the Mean Absolute Error (MAE) [28], Mean Squared Error (MSE) [29], and Root Mean Squared Error (RMSE) [30] were calculated. The average of these metrics across all folds was used to identify the optimal hyperparameter set. The results were compiled into a CSV file, sorted by MAE for ease of interpretation. To further assess the influence of individual hyperparameters, a separate analysis module, Ranking_Parameters.py, was implemented. This script parsed the grid search output, extracting and analyzing hyperparameter-specific columns. A defaultdict structure was utilized to associate parameter values with their corresponding performance ranks. For each parameter, the average rank and standard deviation were computed to assess both effectiveness and stability. These results were presented in a tabular format, enabling informed selection of consistently high-performing configurations. Additionally, to account for stochastic effects, the random_state parameter was varied across 20 random seeds (ranging from 0 to 10,000), ensuring robust evaluation.
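A minimal sketch of the tuning stage, assuming scikit-learn and NumPy are installed, is given below. It departs from the description in labeled ways: the data are synthetic random numbers standing in for Final_Dataset.csv, the hyperparameter grid is a deliberately tiny illustration, and PCA is placed inside a Pipeline so that it is refitted within each fold rather than fitted once beforehand.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 40 rows, 10 feature columns, 5 trait columns.
rng = np.random.default_rng(0)
data = rng.random((40, 15))
X, y = data[:, :-5], data[:, -5:]  # last five columns are the targets

pipeline = Pipeline([
    # Keep enough components to explain 95% of the variance.
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("rf", RandomForestRegressor(random_state=0)),
])

# Illustrative grid only; the grid used in the study is not stated here.
param_grid = {"rf__n_estimators": [10, 20], "rf__max_depth": [2, None]}

scoring = {
    "mae": "neg_mean_absolute_error",
    "mse": "neg_mean_squared_error",
    "rmse": "neg_root_mean_squared_error",
}

search = GridSearchCV(pipeline, param_grid, scoring=scoring,
                      refit="mae", cv=4)
search.fit(X, y)

# scikit-learn reports negated errors; flip the sign and rank by MAE.
mean_mae = -search.cv_results_["mean_test_mae"]
ranking = np.argsort(mean_mae)
```

The cv_results_ dictionary can be passed to pandas.DataFrame and sorted by the MAE column to reproduce the kind of results table described above.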
Following hyperparameter selection, final model training was conducted using the Random_Forest.py script. The Final_Dataset.csv was reloaded, and identical preprocessing steps were applied, including PCA transformation. For each of the Big Five personality traits, the corresponding labels were extracted. The dataset was partitioned into training and testing subsets using Scikit-learn's train_test_split function. A Random Forest model was then trained using the optimal hyperparameters, and its performance was assessed on the test set using standard regression metrics: standard error, standard deviation, MAE, MSE, and RMSE. Additionally, scatter plots of predicted versus actual values were generated for each trait, with trait labels displayed in the plot titles to facilitate interpretation. Each trained model was serialized and saved with filenames reflecting the personality trait being predicted.
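For reference, the three error metrics reported on the test set can be computed as in the following sketch. The sample values are made up for illustration and are not results from the study.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE and RMSE for paired actual/predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up trait percentages: errors are +2, -2 and +4.
metrics = regression_metrics([50.0, 60.0, 70.0], [52.0, 58.0, 74.0])
```

Because RMSE squares the errors before averaging, the single error of 4 weighs more heavily in it than in MAE, which is why the two are usually reported together.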
A graphical user interface-based application named Personality_Prediction_System (https://github.com/JoshuaBryan02/Personality-Prediction-System/blob/main/JoshuaBryan_Final_Code/Personality_Prediction_System.py) was created to facilitate personality trait inference. Upon initialization, the system defines file paths and loads the machine learning models serialized in pickle format. Users are guided through a step-by-step process beginning with a brief overview of usage instructions. The application contains two primary functions. The first, Create_Dataset, replicates the data preparation workflow from Random_Forest.py but excludes label extraction, and saves the processed features to Predict_Dataset.csv for inference. The second, Predictions, transforms the dataset into a NumPy array and then applies PCA using the preloaded model. It then uses the five independent RF models to predict each Big Five trait. The raw outputs are converted to percentages and rounded for ease of interpretation. Apart from the CSV files and functions that contain PII, the rest of the code and functions used in the model development can be found in the data availability statement section.