1. Introduction: Predicting Professions from Text Using NLP and ML
In this research project, we apply recent advances in natural language processing (NLP) and machine learning (ML) techniques to personality prediction. In particular, we try to predict professions from transcripts of YouTube videos created by members of these professions. Using video transcripts, we develop models that infer professional roles with high accuracy. Combining different methods, including unsupervised clustering techniques, Support Vector Machines (SVM), BERT, and Bi-LSTM, allows us to categorize professionals into distinct career classes based on the words they use in these videos.
Furthermore, we demonstrate how the combination of deep learning models and personality extraction techniques [3] provides a robust framework for identifying personality attributes of professions. Key performance metrics, including 85% precision, 80% recall, and an F1 score of 0.77, highlight the efficacy of our approach. Alongside this, we introduce "personas", groups of personality characteristics inspired by real-world behaviors (e.g., "Beeflow," "Antflow," and "Leechflow") [4], to further understand the intersection of personality and profession.
In the following sections, we will first outline the classical models explored, detail the application of unsupervised and supervised learning techniques, and introduce our novel frameworks and tools for personality analysis and profession classification. This will provide readers with information about both the methodology and the broader implications of profession prediction from text.
4. Dataset
To develop a robust model for profession prediction, we constructed a diverse and comprehensive dataset by scraping audio content from YouTube videos and podcasts across seven distinct professional categories: Lawyer, Medicine, Sports, Engineer, Creative, MBA, and Accountant. The audio content was transcribed into text using Whisper AI, an advanced automatic speech recognition (ASR) tool [28]. This transcription process yielded a total dataset of approximately 70,000 words.
4.1. Data Collection Details
The data collection involved a variety of recordings across different professions. For the Lawyer category, 110 videos and podcasts were collected, covering legal arguments, courtroom practices, and career paths in law. The Medicine category consisted of 95 recordings, which included medical lectures, interviews with doctors, and discussions on healthcare trends. In the Sports category, 100 videos were gathered, featuring athlete interviews, sports commentary, and motivational content. The Engineer category included 85 recordings, which focused on engineering innovations, career advice, and technical discussions. For the Creative field, 75 podcasts were collected, discussing creative professions such as writing, music, and design. The MBA category had 90 discussions on business strategies, case studies, and management careers. Lastly, 60 videos related to the Accountant field were included, covering financial practices, accounting principles, and career paths in finance.
The transcriptions of these recordings varied in length, with each averaging between 500 and 700 words, depending on the content’s duration. In total, the transcriptions amounted to over 70,000 words of structured text.
The sources for these recordings included major platforms like YouTube, Spotify, Apple Podcasts, and other public podcast platforms. The topics discussed in each category were specific to the respective field: for Lawyers, they covered legal case studies, law school experiences, and courtroom strategies; for Medicine, they included advances in medicine, patient care, and medical education; Sports recordings focused on personal experiences of athletes, training routines, and mental preparation; Engineers discussed industry trends, project management, and technical innovations; Creatives explored inspirations, processes, and challenges in creative fields; MBA discussions delved into business operations, market strategies, and leadership skills; and Accountants covered financial planning, tax strategies, and accounting tools.
4.2. Data Structuring
The collected text data was organized into seven separate datasets, each corresponding to one of the predefined professional categories. These datasets contained chunks of text paired with their respective labels, effectively capturing the nuances and specificities associated with each profession.
4.3. Thematic Headers and Extended Analysis
After training the model and generating predictions, we extended our analysis by calculating the probabilities of belonging to a certain tribe using Griffin [29].
The resulting secondary dataset provided a broad spectrum of features categorized into thematic dimensions such as Ideology, Recreation, Lifestyles, Personality, Emotions, Groupflow, and Alternative Realities. These thematic dimensions allowed us to explore deeper patterns and correlations between text-based personality features and professional classifications (see Table 1).
IDEOLOGY: Capitalism, Complainers, Liberalism, Socialism
RECREATION: Arts, Fashion, Sport, Travel
LIFESTYLES: Fitness, Sedentary, Vegan, Yolo
PERSONALITY: Journalist, Politician, Risk-taker, Stock-trader
EMOTIONS: Anger, Fear, Happy, Sad
GROUPFLOW: Antflow, Beeflow, Leechflow
ALTERNATIVE REALITIES: Fatherlander, Nerd, Spiritualism, Treehugger
This comprehensive dataset, combined with the subsequent analysis, forms the foundation of our research, providing the necessary data to train and evaluate our model for profession prediction from textual content.
5. Machine Learning Methodology
This section outlines the various steps involved in the process of text preprocessing, model training, and evaluation. We also provide a detailed comparison of different models employed, highlighting their performance based on several key metrics.
5.1. Text Preprocessing
The text data collected for this study underwent a thorough preprocessing pipeline to ensure consistency and quality. These processes aimed to clean and normalize the text, reducing noise and retaining meaningful features for model training.
Initially, the text was tokenized using the nltk library, splitting it into individual words or tokens. This was followed by the removal of common stopwords, such as "the" and "and," which helped eliminate uninformative words that could introduce noise into the data set. Then, lemmatization was applied to reduce words to their root forms, ensuring uniformity between variations of the same word. Finally, all punctuation marks were removed to prevent them from influencing the model’s understanding of the text. This systematic preprocessing ensured that the data were clean, standardized, and ready for further analysis.
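The four preprocessing steps can be sketched without external downloads as follows. This is only an illustration: the study used the nltk library, whereas the `STOPWORDS` set and the `naive_lemma` suffix rule here are toy stand-ins introduced for this example, and punctuation is dropped during tokenization rather than in a separate final pass.

```python
import re

# Toy stand-in for nltk's English stopword list (assumption, not the real list).
STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "are", "was", "were"}


def naive_lemma(token: str) -> str:
    """Rough plural stripping; a placeholder for a real lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    # Tokenize: lowercase and keep alphabetic runs, which also removes punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Reduce each remaining token to an approximate base form.
    return [naive_lemma(t) for t in tokens]


print(preprocess("The lawyers argued the cases, and the court was silent."))
```

In a real pipeline the tokenizer, stopword list, and lemmatizer would come from nltk, which handles irregular forms that the suffix rule above does not.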
5.2. Data Augmentation: Merging Sentences
To enrich the data set and increase the context within each example, multiple short sentences were merged into longer ones. This merging technique allowed the model to capture richer contextual information, as longer sequences provide more detailed insights into the topic.
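One possible way to implement such merging is sketched below. The exact merging rule is not specified in the text, so the `min_words` threshold and the handling of leftover sentences are assumptions made for illustration.

```python
def merge_short_sentences(sentences: list[str], min_words: int = 20) -> list[str]:
    """Greedily concatenate consecutive sentences until each chunk has at least min_words words."""
    merged: list[str] = []
    buffer: list[str] = []
    count = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        buffer.append(sentence)
        count += len(sentence.split())
        if count >= min_words:
            merged.append(" ".join(buffer))
            buffer, count = [], 0
    if buffer:
        # Leftover sentences are attached to the last chunk, or kept alone if no chunk exists yet.
        if merged:
            merged[-1] = merged[-1] + " " + " ".join(buffer)
        else:
            merged.append(" ".join(buffer))
    return merged


print(merge_short_sentences(["I love law.", "Courts are busy.", "Justice matters."], min_words=5))
```

Merging only consecutive sentences from the same recording keeps each chunk tied to a single profession label.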
5.3. Model Training and Evaluation
Multiple models were trained to predict professions based on textual data, ranging from classical models such as Random Forests to more advanced deep learning models such as BERT and BiLSTM.
5.3.1. Random Forest Classifier
The Random Forest model was applied as a baseline. Before training, we used a TF-IDF vectorizer to convert the text into numerical representations. Feature importance was derived from Gini impurity, and the model's predicted class probabilities were scored with the categorical cross-entropy:
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
where $y_i$ represents the true label and $\hat{y}_i$ the predicted probability for class $i$.
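To make the TF-IDF step concrete, the following sketch computes the textbook weighting (term frequency times the logarithm of inverse document frequency) on pre-tokenized documents. It is an assumption that this variant is close enough for illustration; library vectorizers typically apply additional smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (c / total) * math.log(n_docs / df[term])
            for term, c in counts.items()
        })
    return vectors


docs = [["court", "law", "court"], ["game", "law"]]
for vector in tfidf(docs):
    print(vector)
```

Terms that occur in every document (here "law") receive a weight of zero, while terms specific to one profession's transcripts are emphasized.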
5.3.2. DistilBERT Model
DistilBERT was used to capture the deep semantic relationships within the text. Each token in a sentence was embedded, and the final output of the DistilBERT model was a sequence of hidden states $H$. These hidden states were then passed through a softmax layer to predict the profession [30].
$$H = \mathrm{DistilBERT}(T)$$
where $T$ represents the tokenized text and $H$ is the set of hidden states.
5.3.3. BERT with BiLSTM
The BERT model was combined with a BiLSTM layer to capture sequential dependencies alongside contextual information. The final embeddings from BERT were passed to a bidirectional LSTM, which produced an enhanced understanding of the sequence before the dense layer for classification.
5.4. Analysis and Feature Importance
5.4.1. Correlation Analysis and Clustering
We performed correlation analysis to uncover relationships between features and professional categories. The correlation matrix C was calculated, and features with correlations above a threshold were identified as significant for each profession. These features were visualized using heatmaps and hierarchical clustering techniques.
K-means clustering was used to group professions based on their feature similarities. After scaling the data, the K-means algorithm partitioned the data into 7 clusters, corresponding to the 7 professional categories.
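The thresholding step could look like the sketch below, which correlates each feature with a one-hot indicator for a profession. The Pearson coefficient, the `threshold=0.3` default, and the toy feature values are all assumptions for illustration; the text does not state which coefficient or threshold was used.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def significant_features(features, labels, profession, threshold=0.3):
    """Keep features whose correlation with the profession indicator exceeds the threshold."""
    indicator = [1.0 if label == profession else 0.0 for label in labels]
    selected = {}
    for name, values in features.items():
        r = pearson(values, indicator)
        if r > threshold:
            selected[name] = r
    return selected


# Hypothetical feature scores for four transcripts.
features = {"sport": [0.9, 0.8, 0.1, 0.2], "travel": [0.5, 0.5, 0.5, 0.5]}
labels = ["Sports", "Sports", "Lawyer", "Lawyer"]
print(significant_features(features, labels, "Sports"))
```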
5.4.2. SHAP Values for Model Interpretability
After forming the dummy network and passing it through Griffin, we obtain values of the different features for each entry. To better understand the predictions made by our models, we used SHAP (SHapley Additive exPlanations) values. SHAP values explain the contribution of each feature to the prediction, allowing us to assess the importance of individual features. The SHAP value $\phi_i$ for feature $i$ is computed as:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]$$
where $N$ is the set of all features, and $f(S)$ is the model’s prediction with feature subset $S$.
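For a handful of features, the Shapley value can be evaluated exactly by enumerating every feature subset, as in the sketch below. The `toy_model` function and its weights are hypothetical and exist only to exercise the formula; practical SHAP implementations rely on approximations because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley values: weighted marginal contribution of each feature over all subsets."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


# Hypothetical additive model with one interaction between two features.
WEIGHTS = {"Sport": 0.5, "Antflow": 0.2, "Travel": 0.1}


def toy_model(subset):
    score = sum(WEIGHTS[f] for f in subset)
    if {"Sport", "Antflow"} <= subset:
        score += 0.3  # interaction bonus, shared equally by the two features
    return score


phi = shapley_values(list(WEIGHTS), toy_model)
print(phi)
```

The values sum to the difference between the prediction with all features and the prediction with none, which is the efficiency property that makes SHAP attributions additive.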
5.4.3. Feature Importance using Random Forests
Feature importance was also derived from the Random Forest model, where the Gini importance score indicated the significance of each feature in predicting the profession. Top features were visualized using bar plots, providing insights into the driving factors for each profession [31].
$$I(f) = \sum_{t \,:\, t \text{ splits on } f} \Delta G_t$$
where $\Delta G_t$ represents the reduction in Gini impurity at tree node $t$ that splits on feature $f$.
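The reduction in Gini impurity at a single node can be computed as in this minimal sketch, assuming a binary split into a left and a right child. The label lists are toy data; tree libraries commonly also weight each node by the share of training samples that reach it before summing over nodes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = ["Lawyer", "Lawyer", "Sports", "Sports"]
print(gini_reduction(parent, parent[:2], parent[2:]))
```

Summing these reductions over every node that splits on a given feature yields that feature's importance score.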
5.4.4. Radar Charts for Feature Insights
To visualize the top features for each profession, radar charts were created. These charts highlighted the top 10 most important features for each profession, based on the Random Forest feature importance scores.
5.5. Evaluation Metrics
To evaluate the performance of the models, several metrics were employed. Accuracy was used to measure the proportion of correct predictions out of all predictions made. It is calculated as $\frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively.
Precision, defined as $\frac{TP}{TP + FP}$, quantifies the number of correct positive predictions among all positive predictions made by the model. Recall (or sensitivity), calculated as $\frac{TP}{TP + FN}$, measures the number of correct positive predictions out of all actual positive instances.
Lastly, the F1 Score, which provides a balance between precision and recall, is computed as the harmonic mean of the two using the formula $F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. These metrics collectively ensure a comprehensive evaluation of the model’s predictive performance across various dimensions.
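These definitions translate directly into code. The sketch below treats one profession as the positive class and all others as negative (a one-vs-rest view, which is an assumption about how the multi-class scores were obtained); the label lists are toy data.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = ["Lawyer", "Lawyer", "Sports", "Sports", "Lawyer"]
y_pred = ["Lawyer", "Sports", "Sports", "Lawyer", "Lawyer"]
print(one_vs_rest_metrics(y_true, y_pred, positive="Lawyer"))
```

Averaging the per-class values over all seven professions would give macro-averaged scores.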
6. Model Architecture
The model architecture used in this study is designed to effectively capture both the semantic and sequential nature of textual data. To achieve this, we employed a hybrid architecture consisting of the BERT (Bidirectional Encoder Representations from Transformers) model and a Bidirectional Long Short-Term Memory (BiLSTM) layer. This combination allows us to leverage BERT’s contextual understanding alongside BiLSTM’s ability to capture temporal dependencies.
6.1. BERT: Contextual Embeddings
The BERT model is the core component of our architecture, responsible for generating rich, contextual embeddings of the input text. BERT, or Bidirectional Encoder Representations from Transformers, processes input sentences as sequences of tokens and leverages a transformer-based architecture to capture the relationships between these tokens. One of the key advantages of BERT is its ability to model bidirectional context, meaning it considers both the preceding and succeeding tokens when encoding each word, allowing for more nuanced representations of the language [13].
BERT uses a subword tokenization method, such as WordPiece, which splits words into smaller units and assigns each unit a unique identifier. For example, a sentence $S = (w_1, w_2, \dots, w_n)$, where each $w_i$ is a token, is tokenized into subword units that are then embedded into vectors.
Once the sentence is tokenized, BERT processes the input and produces a sequence of hidden states $H = (h_1, h_2, \dots, h_n)$, where each $h_i$ represents the context-aware embedding of token $w_i$. These hidden states encapsulate the meaning of each token in relation to the entire sentence and are computed as:
$$H = \mathrm{BERT}(T)$$
where $T$ is the tokenized input sequence, and $H$ is the corresponding set of embeddings. The embeddings serve as the foundational representations that are further refined by the subsequent layers of the model.
6.2. BiLSTM: Sequential Modeling
While BERT captures context on a global scale, we introduced a Bidirectional LSTM (BiLSTM) layer to model the sequential dependencies in the token embeddings. The BiLSTM processes the sequence of embeddings in both forward and backward directions, allowing the model to capture information from both past and future tokens [32].
The hidden states from BERT, $H$, are passed into the BiLSTM, which generates the transformed hidden states $H'$. These transformed hidden states encapsulate the temporal relationships between tokens, providing a more robust representation of the input.
6.3. Classification Layer
The output from the BiLSTM is passed into a fully connected classification layer with a softmax activation function. This layer produces the final probabilities for each profession class. Given the transformed hidden states $H'$, the classification layer computes the probability distribution over the profession classes:
$$y = \mathrm{softmax}(W H' + b)$$
where $W$ is the weight matrix, $b$ is the bias term, and $y$ is the predicted profession class.
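A dense layer followed by softmax can be sketched in plain Python as below. The two-dimensional `features` vector, the weight matrix `W`, and the zero bias are toy values chosen for illustration, standing in for a pooled BiLSTM output and learned parameters; only the class names come from the paper.

```python
import math

CLASSES = ["Lawyer", "Medicine and Academics", "Sports", "Engineer",
           "Creative", "MBA", "Accountant"]


def dense(weights, bias, features):
    """One logit per class: dot product of a weight row with the features, plus bias."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=2):
    """Return the k most probable (class, probability) pairs."""
    ranked = sorted(zip(CLASSES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical parameters: 7 classes, 2 input features.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0],
     [-1.0, 0.0], [0.0, -1.0], [-0.5, -0.5]]
b = [0.0] * 7
features = [2.0, 0.5]

probs = softmax(dense(W, b, features))
print(top_k(probs))
```

Reporting the two highest-probability classes, as `top_k` does, mirrors the kind of top-2 summary used when inspecting individual predictions.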
6.4. Number of Classification Labels
The model is trained to classify text into one of seven distinct professional categories, each corresponding to a specific profession. These categories include Lawyer (Class 0), Medicine and Academics (Class 1), Sports (Class 2), Engineer (Class 3), Creative (Class 4), MBA (Class 5), and Accountant (Class 6). The classification task is implemented as a 7-way classification problem, where the softmax layer outputs the probability distribution across these classes for a given input. The highest probability value determines the predicted profession, effectively leveraging the model’s understanding of linguistic and contextual features to categorize the text.
6.5. Model Training
The model is trained using the categorical cross-entropy loss function, defined as:
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
where $y_i$ is the true label for class $i$ and $\hat{y}_i$ is the predicted probability for class $i$. The model is optimized using the Adam optimizer, with a learning rate of
, and trained for 3 epochs with a batch size of 16.
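The categorical cross-entropy for a single example can be computed as in the following sketch. The predicted distribution is a made-up seven-class vector, and the small `eps` clamp is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative sum over classes of true label times log of predicted probability."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot true label for class 0 and a hypothetical predicted distribution.
y_true = [1, 0, 0, 0, 0, 0, 0]
y_pred = [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]

loss = categorical_cross_entropy(y_true, y_pred)
print(loss)
```

With a one-hot label, only the probability assigned to the true class contributes, so the loss shrinks toward zero as that probability approaches one.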
6.6. Summary of the Architecture
The final architecture of the model comprises several layers, each designed to perform a specific role in processing and classifying the input text. The first component is the Input Layer, which takes tokenized input sequences prepared during preprocessing. These tokenized inputs are passed to the BERT Layer, which generates rich, contextual embeddings for each token by leveraging bidirectional context from the entire input sequence. The contextual embeddings are then processed by the BiLSTM Layer, which captures sequential dependencies and relationships between tokens, enhancing the model’s understanding of the temporal structure within the text. Finally, a Dense Layer with Softmax computes the probability distribution over the seven profession classes, allowing the model to assign the most appropriate profession to the input based on its learned features.
Figure 1. Final model architecture.
Table 2. Top 2 Predicted Probabilities for Example Sentences.
| Example Sentence (Shortened) | Top Class (Probability) | Second Class (Probability) |
|---|---|---|
| "Singing is my passion. I love drawing and painting." | Creative (0.6758) | Medicine (0.1564) |
| "Court overturned lower court citing procedural errors." | Lawyer (0.9407) | Accountant (0.0191) |
| "Team showed resilience to secure victory in the final moments." | Sports (0.6649) | Engineer (0.0993) |
| "I excel in budgeting, auditing, and financial management." | MBA (0.7305) | Accountant (0.1756) |
7. Results
This section presents a detailed analysis of how various features contribute to professional classifications, based on correlations, hierarchical clustering, feature importance, and model interpretability through SHAP values.
7.1. Unsupervised Trend Detection of Dataset
In our study, we explored the application of unsupervised learning methods to detect trends across different professions. The core idea was to utilize sentence embeddings generated by the Sentence Transformer model and apply clustering techniques to uncover hidden patterns in the text data [33]. This approach allowed us to group similar professions based on the textual content and identify the most frequent topics or trends in each cluster.
7.2. Data Preparation and Embedding Generation
The text data was sourced from multiple professions, including Lawyers, Medicine and Academics, Sports, Engineers, Creatives, MBA, and Accountants. After preprocessing the text, which included tokenization, stopword removal, and lemmatization, we used the Sentence Transformer model, specifically `paraphrase-MiniLM-L6-v2`, to generate embeddings for each sentence.
$$E = \mathrm{SentenceTransformer}(X)$$
where $X$ is the preprocessed text and $E$ is the corresponding sentence embedding. These embeddings serve as a dense representation of the text, capturing both semantic meaning and contextual information.
7.3. Clustering Approach
We applied K-means clustering on the embeddings to detect underlying trends within the text data. The choice of the number of clusters $K$ was determined empirically, with $K = 7$ representing the seven major professional categories. The K-means algorithm grouped similar sentences into distinct clusters, with each cluster representing a collection of professions or topics with high similarity in text content.
$$J = \sum_{k=1}^{K} \sum_{e_i \in C_k} \lVert e_i - \mu_k \rVert^2$$
where $C_k$ represents the $k$-th cluster, $e_i$ is the embedding, and $\mu_k$ is the cluster centroid.
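A compact version of the K-means procedure (Lloyd's algorithm) is sketched below on toy two-dimensional points with two clusters. In the study the inputs were sentence embeddings and seven clusters were used; the random initialization, the fixed `seed`, and the iteration cap are assumptions of this example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignments, centroids


def inertia(points, assignments, centroids):
    """Within-cluster sum of squared distances, the quantity K-means minimizes."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignments, centroids = kmeans(points, k=2)
print(assignments, inertia(points, assignments, centroids))
```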
Figure 2. Clusters using K-means.
7.4. Model Comparison
The performance of the models is summarized in Table 3. Each model was evaluated based on its accuracy, precision, recall, and F1 score on the test set.
Evaluating the different machine learning and deep learning models, we found that advanced models such as BERT with BiLSTM layers significantly outperformed traditional models such as Random Forest in the task of profession prediction. The best-performing model, BERT + BiLSTM, achieved an F1 score of 0.82, highlighting its ability to capture both contextual and sequential dependencies in text data.
7.5. Trend Detection in Clusters
After clustering the text, we analyzed the most frequent words in each cluster to uncover key topics and trends. The law-based cluster prominently featured terms such as "court," "law," and "justice," reflecting discussions centered around legal concepts and practices. Similarly, the MBA and accountant clusters were dominated by words like "stock market," "investment," and "profit," highlighting themes of finance and business. These clusters effectively captured the professional focus and vocabulary relevant to these fields.
The sports-based group emphasized terms such as "competition," "game," and "training," highlighting physical activity and performance-related themes. On the other hand, the creatives-based cluster highlighted words like "design," "art," and "innovation," aligning with discussions on artistic and creative pursuits. This cluster analysis provided a comprehensive understanding of the dominant topics in different professions, offering insights into how language reflects professional contexts.
7.6. Visualization and Trend Analysis of Final Dataset
To visualize the clusters, we reduced the dimensionality of the sentence embeddings using Principal Component Analysis (PCA). The two-dimensional representation of the clusters helped in understanding the relationships between different professions. Additionally, the cosine similarity between cluster centroids and predefined profession embeddings helped further refine our understanding of which clusters aligned with which professions.
The top trends for each cluster were analyzed based on the cosine similarity between cluster centroids and professional embeddings, showing a strong correlation between text features and profession classification.
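Matching a cluster centroid to its closest profession by cosine similarity can be sketched as follows. The three-dimensional vectors are toy values; in practice both the centroid and the profession embeddings would come from the sentence embedding model.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of vector norms; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_profession(centroid, profession_embeddings):
    """Return the profession whose embedding is most similar to the centroid."""
    return max(
        profession_embeddings,
        key=lambda name: cosine_similarity(centroid, profession_embeddings[name]),
    )


# Hypothetical embeddings for two professions and one cluster centroid.
profession_embeddings = {"Lawyer": [1.0, 0.0, 0.0], "Sports": [0.0, 1.0, 0.0]}
centroid = [0.9, 0.1, 0.0]
print(closest_profession(centroid, profession_embeddings))
```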
7.6.1. Correlation Insights
Our correlation analysis revealed meaningful relationships between personal traits and professional identities. We focus on highly correlated characteristics (threshold ) for each profession:
These correlations highlight strong links between individual traits such as recreation, ideology, and lifestyle choices with specific professions.
7.6.2. Hierarchical Clustering
We applied hierarchical clustering to understand how different professional categories share common feature profiles [34]. The resulting dendrogram revealed several meaningful clusters.
Cluster 1 includes Medicine, MBA, and Accountant, which are professions that share traits related to financial and managerial skills. Cluster 4 groups Lawyer with traits such as Politician Persona and Anger Emotions, suggesting that lawyers often manage emotionally charged situations with structured, political thinking. In Cluster 6, Engineers and Creatives are clustered together, reflecting a shared tendency for ideological dissent, described as Complainers Ideology. Finally, Cluster 9 focuses on Sports Professionals, who are grouped based on their engagement in physical activity (Recreation Sports) and their collaborative approach (Antflow).
The clusters reveal shared personal traits between seemingly different professions, driven by common ideologies and lifestyle preferences.
7.6.3. Feature Importance
We analyzed feature importance using a Random Forest model to rank the most influential features for each profession [35]. This analysis provided insights into the primary characteristics that define each professional category. For the Lawyer profession, the top feature was Alternative Realities Fatherlander (importance: ), indicating a connection to traditional, structured thinking. Other important features included Ideology Liberalism, Recreation Arts, and Lifestyles Vegan, highlighting both creative and lifestyle preferences within the legal profession.
For Medicine and Academics, the most significant feature was Recreation Travel (importance: ), emphasizing the intellectual curiosity and global mindset common in academia. Additional features like Groupflow Beeflow and Lifestyles Yolo suggested exploratory and spontaneous behaviors prevalent in academic and medical professionals.
In the Sports profession, Recreation Sport dominated the feature importance (importance: ), followed by Groupflow Antflow and Personality Risk-taker, which emphasized the physical and competitive nature of athletes.
For Creative professionals, the most significant feature was Sedentary Lifestyle (importance: ), reflecting stationary work environments often seen in design and writing, while Ideology Complainers (importance: ) highlighted the critical and introspective nature of creative professionals.
Finally, the MBA and Accountant professions were strongly influenced by features such as Groupflow Leechflow and Personality Stock-trader, indicating competitive, financially driven behaviors that are characteristic of these fields.
This feature importance analysis provided a clear understanding of how various traits, such as lifestyle, recreation, and personality, contribute to the likelihood of belonging to a particular profession.
Figure 3. Feature Importance bar plot for Lawyer
Figure 4. Feature Importance bar plot for Medicine and Academics
Figure 5. Feature Importance bar plot for Sports
Figure 6. Feature Importance bar plot for Engineer
Figure 7. Feature Importance bar plot for Creative
Figure 8. Feature Importance bar plot for MBA
Figure 9. Feature Importance bar plot for Accountant
7.6.4. Radar Charts for Feature Visualization
To better visualize the distribution of features across professions, we generated radar charts for each professional category. These charts allowed for a comparative analysis of the most influential features in each profession. The insights derived from the radar charts include the following:
IDEOLOGY: A diverse range of ideological preferences was observed across professions. For instance, Liberalism dominated in Lawyer-0, while Capitalism was prominent in both MBA-5 and Engineer-3, and Socialism was more common in Medicine and Academic-1. The Complainers ideology was particularly evident in Creative-4 and was also present in MBA-5, indicating a critical mindset. Interestingly, the Fatherlander ideology was uniquely strong in Lawyer-0, suggesting a patriotic inclination.
RECREATION: Travel emerged as a common recreational interest, particularly prominent in Medicine and Academic-1, but also present in several other profiles. Sports was overwhelmingly dominant in Sports-2, as expected, and moderately present in MBA-5. Arts was uniquely strong in Lawyer-0, suggesting a cultural interest within the legal profession.
PERSONALITY: The Stock-trader personality was common across multiple professions, with a particularly strong presence in Accountant-6. The Journalist trait was notable in both Lawyer-0 and MBA-5, highlighting strong communication skills. The Risk-taker personality was uniquely present in Sports-2, aligning with the competitive nature of athletes.
EMOTIONS: Emotional traits were not consistently represented across all professions. Anger was particularly prominent in Lawyer-0, potentially reflecting the adversarial nature of the profession. The Happy emotion was uniquely strong in Medicine and Academic-1, suggesting job satisfaction in these fields. Sadness appeared in Accountant-6, albeit at a low level.
GROUPFLOW: Leechflow was notably dominant in Accountant-6 and significant in MBA-5, possibly indicating a tendency to leverage others’ work. Beeflow was prominent in Medicine and Academic-1 and present in Creative-4, suggesting a tendency for collaboration. Antflow appeared in both Creative-4 and Sports-2, which may point to individualistic or contrarian behaviors in these professions.
ALTERNATIVE REALITIES: Treehugger was strongly present in Engineer-3, uniquely combining with a capitalist ideology. Fatherlander ideology was also significant in Lawyer-0, but not prominently represented in other professions. This category was not consistently represented across all professional profiles.
Figure 10. Radar Chart for Lawyer
Figure 11. Radar Chart for Medicine and Academics
Figure 12. Radar Chart for Sports
Figure 13. Radar Chart for Engineer
Figure 14. Radar Chart for Creatives
Figure 15. Radar Chart for MBA
Figure 16. Radar Chart for Accountant
7.6.5. Unique Inferences
Several unique inferences were drawn from the analysis of the professional profiles. The Lawyer-0 profile exhibits an interesting combination of liberal ideology, artistic interests, and strong patriotic tendencies, particularly with a Fatherlander inclination. The Engineer-3 profile is unique for blending a capitalist ideology with strong environmentalist traits, notably a Treehugger mentality. The Creative-4 profile shows a higher tendency toward complaining and a sedentary lifestyle, which may reflect the nature of creative work that often involves long periods of solitary, focused effort. In the case of Accountant-6, an extremely high Leechflow score stands out, suggesting a strong tendency to rely on others’ work or resources, much higher than any other trait across the profiles. The Sports-2 profile is highly focused on sports-related traits, with little variation in other areas, indicating a highly specialized focus. Lastly, the MBA-5 profile strikes a balance between capitalist ideology, a keen interest in sports, and a combination of traits such as journalist and politician, which likely reflects the diverse skill set required for success in business administration.
7.6.6. SHAP Value Interpretation
To improve interpretability, we employed SHAP (SHapley Additive exPlanations) values [36], which allowed us to break down the contributions of each feature to individual predictions. SHAP values revealed the specific traits that influenced the model’s decision-making process for each profession. For Sports, SHAP values indicated that Recreation Sport and Groupflow Antflow had a positive contribution to the predictions, confirming that physical activity and collaboration are central to athletes. In the case of Medicine and Academics, Recreation Travel and Personality Stock-trader were key contributors, suggesting a mix of exploratory and managerial traits that align with academia and healthcare professions. For Lawyer, SHAP values highlighted the importance of Alternative Realities Fatherlander and Personality Politician, aligning with the structured, politically oriented behavior often seen in legal professionals. By leveraging SHAP values, we provided a transparent view of how individual features influenced the model’s predictions, enhancing interpretability and trust in the decision-making process.
Table 5. Top SHAP Feature Contributions for Each Profession.
| Profession | Feature 1 (Value) | Feature 2 (Value) | Feature 3 (Value) |
|---|---|---|---|
| Lawyer | Arts (0.0299) | Fatherlander (0.0268) | Anger (0.0245) |
| Medicine & Academic | Travel (0.0895) | Beeflow (0.0507) | Stock-trader (0.0195) |
| Sports | Sport (0.1501) | Antflow (0.0672) | Complainers (0.0368) |
| Engineer | Capitalism (0.0228) | Beeflow (0.0215) | Fashion (0.0177) |
| Creative | Sedentary (0.0942) | Complainers (0.0426) | Yolo (0.0271) |
| MBA | Politician (0.0342) | Leechflow (0.0323) | Complainers (0.0310) |
| Accountant | Leechflow (0.1320) | Stock-trader (0.0263) | Sad (0.0165) |
8. Discussion
This study examined how natural language processing (NLP) and machine learning (ML) models can predict professions by analyzing text, specifically focusing on how different personality traits, lifestyle choices, and value systems, which we term Alternative Realities, are reflected in professional personas. By leveraging deep learning models such as BERT and BiLSTM, we were able to categorize professions like Lawyers, Engineers, and Sports professionals with a high degree of accuracy.
8.1. Alternative Realities: Personalities and Values
The concept of Alternative Realities plays a crucial role in understanding the intersection between personality traits and professional identity. We classified individuals into four broad categories, each representing different worldviews, values, and behaviors.
The Fatherlander category includes individuals who exhibit a deep belief in tradition, nation, and family. These individuals uphold the values of the "good old times" and view their fatherland as superior. Such personalities tend to align with professions that emphasize order, authority, and the preservation of cultural values, such as lawyers, military leaders, or politicians. Our models detected this strong traditionalism in legal professions, suggesting a correlation between conservative values and structured, rule-driven jobs.
The Nerd category is characterized by a belief in progress, science, and technology as forces for good. Nerds often aspire to transcend human limitations and are enthusiasts of global connectivity and advancements like space exploration. These individuals tend to thrive in professions related to engineering, technology, and scientific research. The BERT + BiLSTM model effectively captured these traits in Engineers and other tech-centric professions.
The Spiritualist group seeks meaning through subjective experiences of the sacred. Their behaviors are driven by a quest for spiritual fulfillment and contemplation. Professions that align with these values may include religious leaders, yoga instructors, or philosophers. While this group was more challenging to capture with traditional NLP models, future work could focus on refining models to detect the subtle language of spiritual guidance and contemplation.
The Treehugger category includes individuals who advocate for sustainability and environmental preservation. They challenge certain technological advancements like genetic manipulation while supporting others, such as alternative energy sources. Their value system often conflicts with industrial or corporate norms, leading them to professions in environmental activism, sustainability consulting, or conservation. Our model detected some alignment between these values and professions in academia or NGOs focused on sustainability.
8.2. Groupflow: Collaborative and Competitive Dynamics
In addition to Alternative Realities, we introduced the concept of Groupflow, which describes how individuals engage in teamwork and their behavioral dynamics in professional settings. These categories are crucial for understanding professional performance and interpersonal relationships within organizations.
Beeflow refers to individuals who are collaborative creators. They focus on creating value for both themselves and society. This profile aligns with professions in creative industries, research, and innovation. Beeflow members experience a state of flow while working, indicating high levels of engagement and satisfaction in collaborative environments.
Antflow describes competitive, disciplined individuals who are driven by personal goals and hard work. Professions such as athletes or business executives often display these traits, thriving in environments where success is measured by individual accomplishments and victories. Our models found strong correlations between Antflow behaviors and professions in sports and competitive business environments.
Leechflow individuals are characterized by exploitative tendencies, often benefiting themselves at the expense of others. They may be found in roles that allow for opportunism or manipulation of systems for personal gain, such as high-risk stock trading or certain management positions. This group was more difficult to define within traditional professional categories but could be inferred through language patterns associated with competitive and self-serving behaviors.
8.3. Lifestyle Categories: Influences on Professional Identity
Lifestyle choices were another significant factor in predicting professions, with certain lifestyles aligning closely with professional categories.
The Fitness Lifestyle is marked by individuals addicted to physical training and sports, predominantly found in athletic professions, where discipline, physical performance, and competition are paramount. The model accurately predicted professions in sports based on this lifestyle.
The Sedentary Lifestyle is characterized by limited physical activity, with individuals more frequently associated with desk-based professions such as accounting or administrative roles. The absence of physical engagement in their work was a notable characteristic detected by the model.
The Vegan Lifestyle aligns with individuals adhering to a plant-based diet, often found in environmentally conscious professions. Their ethical stance on avoiding animal products suggests a broader alignment with sustainability-focused roles, such as environmental advocacy or health-related fields.
The Yolo Lifestyle reflects the philosophy of living in the moment and maximizing present opportunities, and is often accompanied by impulsive behavior. Yolo individuals may be drawn to risk-heavy professions such as entrepreneurship, creative industries, or high-stakes finance. The model’s predictions suggested a correlation between Yolo behaviors and dynamic, risk-taking professions.
8.4. Recreation and Personality Traits: Enhancing Profession Classification
In addition to the Alternative Realities and Groupflow models, we explored how recreational interests and personality traits influence professional identities.
Recreational interests in Art, Fashion, Sport, and Travel reveal much about an individual’s professional alignment. For instance, professionals in the creative industries were frequently linked with interests in art and fashion, while individuals in travel-related professions showed a strong interest in global exploration and cross-cultural experiences.
Personality Traits such as being a Risk-taker, Journalist, or Politician also correlated with professional categories. For example, risk-takers were prevalent in high-risk, high-reward professions like sports and stock trading, while journalists and politicians exhibited traits aligned with professions involving communication and influence.
8.5. Model Performance and Limitations
The BERT + BiLSTM model showed robust performance in predicting professions from text data, particularly for professions that aligned with distinct lifestyle and personality categories. Professions like Sports, Engineering, and Law were consistently predicted with high accuracy, showcasing the model’s ability to capture both contextual and sequential information. However, some professional categories, such as MBAs and accountants, exhibited overlapping language patterns, which reduced the model’s ability to distinguish between them. The model also faced challenges in capturing the subtle nuances associated with personality traits or alternative realities such as Spiritualist behaviors, which are often implicit and context-dependent. Additionally, the dataset, while diverse, may not fully capture the nuances of all professional categories, particularly those with overlapping traits.
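Overlap between two classes, such as MBA and Accountant, can be quantified from a confusion count: the share of samples belonging to either class that the model assigned to the other one. The sketch below is a minimal illustration under assumed inputs. The helper names `confusion_counts` and `pair_confusion_rate` are hypothetical, and the labels are toy data rather than predictions from the study.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def pair_confusion_rate(counts, a, b):
    """Share of samples truly in class a or b that were predicted as the other."""
    swapped = counts[(a, b)] + counts[(b, a)]
    total = sum(n for (true, _), n in counts.items() if true in (a, b))
    return swapped / total if total else 0.0


# Toy labels, invented for illustration only (not results from the study).
y_true = ["MBA", "MBA", "MBA", "Accountant", "Accountant", "Sports", "Sports"]
y_pred = ["MBA", "Accountant", "MBA", "MBA", "Accountant", "Sports", "Sports"]

counts = confusion_counts(y_true, y_pred)
print(pair_confusion_rate(counts, "MBA", "Accountant"))
print(pair_confusion_rate(counts, "MBA", "Sports"))
```

Comparing this rate across all class pairs shows which professions the classifier tends to merge, which helps separate genuine language overlap from general model weakness.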
8.6. Future Directions
This study opens the door to further research on the relationship between personality, lifestyle, and professional identity. Future work could explore the incorporation of more advanced personality models and extend the dataset to include a wider range of professions. Additionally, exploring more refined group behaviors within organizations through Groupflow could provide deeper insights into how personality traits influence team dynamics and professional success.
Overall, our findings highlight the complex interplay between individual traits, societal values, and professional identities, offering a new lens through which to view career prediction and development.