3.4. Model Description
The Places_365 convolutional neural network (CNN) [24] was employed to extract contextual features from the background of social media images, while the YOLOv8 model [25,26] was used to estimate the number of individuals depicted in each image. A CSV file named Image_Paths.csv was generated to store the file paths of the downloaded Instagram posts. A for loop iterated over a list of Instagram usernames, each of which was preprocessed to remove any leading '@' symbol. For each user, a variable was defined to specify the path to the directory containing their downloaded images. Another for loop traversed all files within the user's directory, appending the file path of each post to a list. The collected paths were then written as new rows into the Image_Paths.csv file, as illustrated in Algorithm 4. Following this, the image paths were loaded from Image_Paths.csv and converted into a Python list for further processing. The file labels_sunattribute.txt, which contains the class labels required by the model, was read line by line, with each line appended to a list.
A conditional check was implemented to verify the existence of the label_encoder_CNN1.pkl file. If the file was not found, a new instance of the LabelEncoder class was initialized and fitted using the compiled list of labels; the fitted encoder was then serialized and saved to label_encoder_CNN1.pkl using the pickle module. If the file already existed, it was instead deserialized and loaded into the program. The script CNN2.py begins by defining a variable to store the output path for the People_Count_features.csv file, which contains the features extracted by the CNN. An empty list was initialized to store the extracted features during execution, and the corresponding image file paths were retrieved for processing.
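The label-encoder caching step can be summarized by the following minimal sketch; the file names mirror those in the text, while the variable names (e.g., labels, encoder) are illustrative assumptions.

```python
import os
import pickle
from sklearn.preprocessing import LabelEncoder

ENCODER_PATH = "label_encoder_CNN1.pkl"

# Class labels read line by line from labels_sunattribute.txt.
with open("labels_sunattribute.txt") as f:
    labels = [line.strip() for line in f]

if not os.path.exists(ENCODER_PATH):
    # First run: fit a new encoder on the label list and cache it.
    encoder = LabelEncoder()
    encoder.fit(labels)
    with open(ENCODER_PATH, "wb") as f:
        pickle.dump(encoder, f)
else:
    # Subsequent runs: reuse the previously fitted encoder.
    with open(ENCODER_PATH, "rb") as f:
        encoder = pickle.load(f)
```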
Algorithm 4 Data Preprocessing
Input: Username_list
Output: Store_image_paths
1. if Image_paths file exists:
2.    delete the file
3. else:
4.    open the Image_paths file
5. for each username in username_list:
6.    if username starts with '@', remove '@' from the beginning
7.    create folder_path by joining output_folder and username
8.    for each file in the directory folder_path:
9.       create img_path by joining folder_path and file
10.      append img_path to user_paths
11. return user_paths as Image_paths
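A minimal Python sketch of Algorithm 4 is given below; the directory layout and the use of the csv module are assumptions consistent with the description above.

```python
import csv
import os

def collect_image_paths(username_list, output_folder, csv_path="Image_Paths.csv"):
    """Collect the file paths of all downloaded posts and write them to a CSV file."""
    # Recreate the CSV file on every run (Algorithm 4, steps 1-4).
    if os.path.exists(csv_path):
        os.remove(csv_path)

    user_paths = []
    for username in username_list:
        # Strip a leading '@' if present (step 6).
        username = username.lstrip("@")
        folder_path = os.path.join(output_folder, username)
        for file in os.listdir(folder_path):
            img_path = os.path.join(folder_path, file)
            user_paths.append(img_path)

    # Write each collected path as a new row (steps 9-11).
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for path in user_paths:
            writer.writerow([path])
    return user_paths
```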
A nested while loop was created to iterate through all image paths and extract the associated features. The outer loop concluded by computing the frequency of each individual's appearance across all volunteer posts. This frequency was subsequently normalized by the total number of images, and the resulting normalized frequency was appended to the feature list. Thereafter, the set of normalized frequencies for each volunteer was further scaled to the range [0, 1], with a small constant added to mitigate division-by-zero errors. In training mode, as specified by the user prompt, the maximum and minimum values used in the normalization process were stored in a pickle file; during inference, these values were retrieved from the pickle file to ensure consistent normalization of the input data. Finally, the normalized feature list was transformed into a Pandas data frame and exported as a CSV file titled People_Count_features.csv.
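The frequency scaling described above can be sketched as follows; the epsilon value, the pickle file name (people_count_minmax.pkl), and the variable names are assumptions rather than the exact implementation.

```python
import pickle
import numpy as np

EPS = 1e-8  # small constant to avoid division by zero (assumed value)

def scale_people_counts(freqs, training=True, scaler_path="people_count_minmax.pkl"):
    """Scale per-volunteer appearance frequencies to the range [0, 1]."""
    freqs = np.asarray(freqs, dtype=float)
    if training:
        # Store the min/max so inference uses the same scaling.
        f_min, f_max = freqs.min(), freqs.max()
        with open(scaler_path, "wb") as f:
            pickle.dump((f_min, f_max), f)
    else:
        with open(scaler_path, "rb") as f:
            f_min, f_max = pickle.load(f)
    return (freqs - f_min) / (f_max - f_min + EPS)
```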
A Python-based implementation titled Random_Forest.py was developed to construct a personality prediction pipeline utilizing environmental and social media-derived features based on the Instagram metadata. The initial stage involved defining file paths for all requisite datasets, including Final_Dataset.csv. Two feature sets, Environment_features.csv and People_Count_features.csv, were loaded independently into Pandas data frames and subsequently concatenated to form a unified feature matrix. In parallel, Instagram_data.csv was loaded into a separate data frame, and its columns were iteratively normalized using min–max scaling within a while-loop construct. The maximum and minimum values employed during normalization were serialized using the pickle module to ensure consistency during future inference. Normalized_Traits.csv was then imported and integrated with the composite feature matrix to produce the final training dataset, which was saved as Final_Dataset.csv. This concluded the data preprocessing phase.
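A condensed sketch of this dataset assembly stage is shown below; a for loop is used in place of the while-loop construct for brevity, and the pickle file name instagram_minmax.pkl is an assumption.

```python
import pickle
import pandas as pd

# Load and concatenate the two image-derived feature sets column-wise.
env = pd.read_csv("Environment_features.csv")
people = pd.read_csv("People_Count_features.csv")
features = pd.concat([env, people], axis=1)

# Min-max scale each Instagram metadata column and record the bounds.
insta = pd.read_csv("Instagram_data.csv")
bounds = {}
for col in insta.columns:
    col_min, col_max = insta[col].min(), insta[col].max()
    bounds[col] = (col_min, col_max)
    insta[col] = (insta[col] - col_min) / (col_max - col_min)

with open("instagram_minmax.pkl", "wb") as f:
    pickle.dump(bounds, f)

# Combine the features, the scaled metadata, and the normalized trait labels.
traits = pd.read_csv("Normalized_Traits.csv")
final = pd.concat([features, insta, traits], axis=1)
final.to_csv("Final_Dataset.csv", index=False)
```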
A function named Random_Forest_Tuning was executed. This function ingested the Final_Dataset.csv, converting its contents into a NumPy array. The features were defined as all columns excluding the final five, which represented the target personality traits. Given the presence of low-variance features potentially detrimental to model performance, Principal Component Analysis (PCA) [27] was employed to retain components explaining 95% of the total variance. The trained PCA model was serialized for reuse in subsequent stages. A Random Forest Regressor was instantiated, and hyperparameter optimization was performed via exhaustive grid search. This procedure evaluated multiple hyperparameter configurations using a 4-fold cross-validation scheme, where each fold held out one subset for validation and used the remaining data for training. For each configuration, the Mean Absolute Error (MAE) [28], Mean Squared Error (MSE) [29], and Root Mean Squared Error (RMSE) [30] were calculated. The average of these metrics across all folds was used to identify the optimal hyperparameter set. The results were compiled into a CSV file, sorted by MAE for ease of interpretability. To further assess the influence of individual hyperparameters, a separate analysis module, Ranking_Parameters.py, was implemented. This script parsed the grid search output, extracting and analyzing hyperparameter-specific columns. A defaultdict structure was utilized to associate parameter values with their corresponding performance ranks. For each parameter, the average rank and standard deviation were computed to assess both effectiveness and stability. These results were presented in a tabular format, enabling informed selection of consistently high-performing configurations. Additionally, to account for stochastic effects, the random_state parameter was varied across 20 random seeds (ranging from 0 to 10,000), ensuring robust evaluation.
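The tuning stage can be illustrated with the following sketch; the hyperparameter grid and output file names are examples only, and Scikit-learn's GridSearchCV is used here in place of the manual cross-validation loop described above.

```python
import pickle
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("Final_Dataset.csv").to_numpy()
X, y = data[:, :-5], data[:, -5:]  # last five columns are the Big Five traits

# Retain the components explaining 95% of the variance and cache the PCA model.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
with open("pca_model.pkl", "wb") as f:
    pickle.dump(pca, f)

# Example grid only; the actual grid used in the study may differ.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}
scoring = {
    "MAE": "neg_mean_absolute_error",
    "MSE": "neg_mean_squared_error",
    "RMSE": "neg_root_mean_squared_error",
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring=scoring,
    refit="MAE",  # select the configuration with the best mean MAE
    cv=4,         # 4-fold cross-validation
)
search.fit(X_pca, y)

# Export the per-configuration results sorted by MAE rank for inspection.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_MAE")
results.to_csv("Grid_Search_Results.csv", index=False)
```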
Following hyperparameter selection, final model training was conducted using the Random_Forest.py script. The Final_Dataset.csv was reloaded, and identical preprocessing steps were applied, including PCA transformation. For each of the Big Five personality traits, the corresponding labels were extracted. The dataset was partitioned into training and testing subsets using Scikit-learn's train_test_split function. A Random Forest model was then trained using the optimal hyperparameters, and its performance was assessed on the test set using standard regression metrics: standard error, standard deviation, MAE, MSE, and RMSE. Additionally, scatter plots of predicted versus actual values were generated for each trait, with trait labels displayed in the plot titles to facilitate interpretation. Each trained model was serialized and saved with filenames reflecting the personality trait being predicted.
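A sketch of the per-trait training loop is given below; the train/test split ratio, the hyperparameter values, and the output file names are illustrative assumptions standing in for the tuned configuration.

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

TRAITS = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]

data = pd.read_csv("Final_Dataset.csv").to_numpy()
X, Y = data[:, :-5], data[:, -5:]

# Reuse the PCA model fitted during tuning.
with open("pca_model.pkl", "rb") as f:
    pca = pickle.load(f)
X = pca.transform(X)

for i, trait in enumerate(TRAITS):
    y = Y[:, i]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Hyperparameters below are placeholders for the tuned values.
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print(f"{trait}: MAE={mae:.4f}, MSE={mse:.4f}, RMSE={rmse:.4f}")

    # Save each trait-specific model under a descriptive file name.
    with open(f"RF_{trait}.pkl", "wb") as f:
        pickle.dump(model, f)
```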
A graphical user interface-based application named Personality_Prediction_System (https://github.com/JoshuaBryan02/Personality-Prediction-System/blob/main/JoshuaBryan_Final_Code/Personality_Prediction_System.py) was created to facilitate personality trait inference. Upon initialization, the system defines file paths and loads the machine learning models serialized in pickle format. Users are guided through a step-by-step process, beginning with a brief overview of usage instructions. The application contains two primary functions. The first, Create_Dataset, replicates the data preparation workflow from Random_Forest.py but excludes label extraction, saving the processed features to Predict_Dataset.csv for inference. The second, Predictions, transforms the dataset into a NumPy array and applies PCA using the preloaded model. It then uses the five independent Random Forest models to predict each Big Five trait, and the raw outputs are converted to percentages and rounded for ease of interpretation. Apart from the CSV files and functions that contain personally identifiable information (PII), the rest of the code and functions included in the model development can be found in the Data Availability Statement section.
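The inference step of the Predictions function can be approximated as follows; the trait-specific model file names, the aggregation over rows, and the rounding precision are assumptions consistent with the description above.

```python
import pickle
import pandas as pd

TRAITS = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]

def predictions(dataset_path="Predict_Dataset.csv", pca_path="pca_model.pkl"):
    """Predict the Big Five traits for the features prepared by Create_Dataset."""
    X = pd.read_csv(dataset_path).to_numpy()

    # Apply the PCA transformation fitted during training.
    with open(pca_path, "rb") as f:
        pca = pickle.load(f)
    X = pca.transform(X)

    results = {}
    for trait in TRAITS:
        # One independent Random Forest model per trait (assumed file names).
        with open(f"RF_{trait}.pkl", "rb") as f:
            model = pickle.load(f)
        # Aggregate across rows in case more than one feature row is supplied.
        score = float(model.predict(X).mean())
        # Convert the raw output to a rounded percentage.
        results[trait] = round(score * 100, 2)
    return results
```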