Submitted:
12 March 2024
Posted:
13 March 2024
Abstract
Keywords:
1. Introduction
- We introduce a framework that uses Zero-Shot Instance Retrieval (ZSIR) to study the cognitive alignment of large visual language models. The framework lets us evaluate how an AI model interprets and processes visual information relative to human cognition, particularly when the model encounters data it has not been explicitly trained to recognise.
- We develop a unified similarity function designed to quantify the level of cognitive alignment in AI systems. The function provides a metric of how closely the AI’s interpretations correspond to human cognition.
- We test the proposed similarity function through extensive experiments on the CUB and SUN datasets. The results show that the function is robust across different forms of knowledge representation, including visual attributes and free text generated by large AI models. This versatility is critical, as it reflects the level of cognition alignment between humans and AI.
- Our experiments establish the validity of the proposed similarity function and show that our model outperforms the compared state-of-the-art models on ZSIR tasks.
2. Related Work
2.1. Cognition-Alignment AI
2.2. Zero-Shot Learning
3. Methodology
3.1. Cognition Representation
- Automated Attribute Generation: Using LLMs to automatically generate descriptive attributes for visual data, thereby providing a structured and detailed attribute set that mirrors human perception.
- Free-Text Description Synthesis: LLMs can be employed to create comprehensive free-text descriptions of visual stimuli. These narratives offer a deeper, more contextual understanding of the images, akin to how humans might describe them.
- Class Embedding: Automatic description provided by AI for each given class.
- AI-Revised Human Attributes: By incorporating the class names and human-designed attributes, AI revises the attribute list and makes the words more related to visual perception for the image retrieval task.
- AI-Generated Attributes: AI creates attributes associated with the class names without any constraints.
- ZSL-Contextualised AI Attributes: Starting from the AI-Generated Attributes, the prompt further constrains the task for ZSL purposes, focusing on improving the visual perception association and the generalisation to unseen classes and instances.
3.2. Latent Instance Attributes Discovery
3.3. Cognition Alignment via ZSIR
| Algorithm 1: LIAD Optimisation for ZSIR Cognition Alignment |
| Input: Visual feature of training images ; Attributes of seen classes ; |
| Test images from unseen classes with the attributes ; |
| Output: Gallery and query instances and ; |
| 1. Initialise: and ; |
| 2. While and not converge: |
| 3. ; |
| 4. for iter ∈ 0, 1, ..., MaxIter: |
| 5. ; |
| 6. ; |
| 7. for iter ∈ 0, 1, ..., MaxIter: |
| 8. ; |
| 9. Return: and according to Eq. 4. |
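
The control flow of Algorithm 1 (an outer loop that runs until convergence, with two inner loops capped at MaxIter) can be illustrated with a minimal sketch. The quantities being optimised are not recoverable from the listing above, so the sketch below substitutes a toy two-variable objective; `alternating_minimise`, `toy_loss`, the learning rate, and the tolerance are all illustrative assumptions and not the LIAD objective itself.

```python
def alternating_minimise(loss, grad_a, grad_b, a, b,
                         lr=0.1, max_iter=20, tol=1e-10, max_outer=100):
    """Block-wise gradient descent: update a with b fixed, then b with a fixed."""
    previous = loss(a, b)
    for _ in range(max_outer):          # outer loop: repeat until convergence
        for _ in range(max_iter):       # inner loop 1: refine a, b held fixed
            a = a - lr * grad_a(a, b)
        for _ in range(max_iter):       # inner loop 2: refine b, a held fixed
            b = b - lr * grad_b(a, b)
        current = loss(a, b)
        if abs(previous - current) < tol:   # stop when the loss no longer changes
            break
        previous = current
    return a, b


# Toy stand-in objective (an assumption, NOT the objective of Algorithm 1).
def toy_loss(a, b):
    return (a - 1.0) ** 2 + (b - 3.0) ** 2 + (a - b) ** 2


def toy_grad_a(a, b):
    return 2.0 * (a - 1.0) + 2.0 * (a - b)


def toy_grad_b(a, b):
    return 2.0 * (b - 3.0) - 2.0 * (a - b)


a_opt, b_opt = alternating_minimise(toy_loss, toy_grad_a, toy_grad_b, a=0.0, b=0.0)
```

The toy objective is strictly convex with its minimum at a = 5/3, b = 7/3, so the alternating updates approach that point. In a real setting, the two blocks would be the variables that Algorithm 1 updates in turn.
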
4. Experimental Results
4.1. Experiment Setup
| Human Attributes | AI-Revised Human Attributes | AI-Generated Attributes | ZSL-Contextualised AI Attributes |
|---|---|---|---|
| ’sailing/ boating’ | Open Space | Natural | Natural |
| ’driving’ | Enclosed Space | Man-made | Man-Made |
| ’biking’ | Natural Landscape | Indoor | Indoor |
| ’transporting things or people’ | Man-made Structures | Outdoor | Outdoor |
| ’sunbathing’ | Urban Environment | Urban | Urban |
| ’vacationing/ touring’ | Rural Setting | Rural | Rural |
| ’hiking’ | Water Presence | Modern | Bright |
| ’climbing’ | Vegetation Density | Historical | Dim |
| ’camping’ | Color Palette | Spacious | Colorful |
| ’reading’ | Textural Qualities | Cramped | Monochrome |
| ’studying/ learning’ | Lighting Conditions | Bright | Spacious |
| ’teaching/ training’ | Weather Elements | Dim | Cramped |
| ’research’ | Architectural Style | Colorful | Populated |
| ’diving’ | Historical Context | Monochrome | Deserted |
| ’swimming’ | Modern Elements | Busy | Vegetated |
| ’bathing’ | Artistic Features | Tranquil | Barren |
| ’eating’ | Functional Aspects | Populated | Watery |
| ’cleaning’ | State of Maintenance | Deserted | Dry |
| ’socializing’ | Population Density | Greenery | Mountainous |
| ’congregating’ | Noise Level | Barren | Flat |
| ’waiting in line/ queuing’ | Movement Dynamics | Waterbody | Forested |
| ’competing’ | Activity Presence | Dry | Open |
| ’sports’ | Cultural Significance | Mountainous | Enclosed |
| ’exercise’ | Geographical Features | Flat | Architectural |
| ’playing’ | Seasonal Characteristics | Forested | Naturalistic |
| ’gaming’ | Time of Day | Open | Ornate |
| ’spectating/ being in an audience’ | Material Dominance | Enclosed | Simple |
| ’farming’ | Symmetry | Architectural | Cluttered |
| ’constructing/ building’ | Asymmetry | Naturalistic | Minimalistic |
| ’shopping’ | Spaciousness | Ornate | Artistic |
| ’medical activity’ | Clutter | Simple | Functional |
| ’working’ | Tranquility | Cluttered | Symmetrical |
| ’using tools’ | Bustle | Minimalistic | Asymmetrical |
| ’digging’ | Accessibility | Artistic | Traditional |
| ’conducting business’ | Seclusion | Functional | Contemporary |
| ’praying’ | Safety Perception | Symmetrical | Luxurious |
| ’fencing’ | Risk Elements | Asymmetrical | Modest |
| ’railing’ | Sensory Stimuli | Traditional | Cultivated |
| ’wire’ | Emotional Atmosphere | Contemporary | Wild |
| ’railroad’ | Privacy Level | Luxurious | Paved |
| ’trees’ | Connectivity | Modest | Unpaved |
| ’grass’ | Isolation | Cultivated | Vibrant |
| ’vegetation’ | Ecological Elements | Wild | Muted |
| ’shrubbery’ | Industrial Presence | Paved | Textured |
| ’foliage’ | Commercial Features | Unpaved | Smooth |
| ’leaves’ | Educational Aspects | Vibrant | Reflective |
| ’flowers’ | Recreational Facilities | Muted | Matte |
| ’asphalt’ | Religious Symbols | Textured | Elevated |
| ’pavement’ | Cultural Diversity | Smooth | Ground-level |
| ’shingles’ | Historical Monuments | Reflective | Aerial |

| Attribute | Value | Association with the class Abbey (example) |
|---|---|---|
| Natural | 0 | (Abbeys are man-made structures, though they may be situated in natural settings.) |
| Indoor | 1 | (Abbeys typically have significant indoor areas.) |
| Outdoor | 1 | (They also have outdoor components like courtyards.) |
| Urban | 0 | (Generally, abbeys are in rural or secluded settings, but some may be in urban areas.) |
| Bright | 0 | (Traditional abbeys might have a dimmer, more solemn interior.) |
| Colorful | 0 | (Abbeys often have a more muted, monastic color scheme.) |
| Spacious | 1 | (They usually have spacious interiors like chapels and halls.) |
| Populated | 0 | (Abbeys are often associated with tranquility and seclusion.) |
| Vegetated | 1 | (Many abbeys have gardens or are located in green settings.) |
| Watery | 0 | (Unless located near a water body, which is not typical for all abbeys.) |
| Mountainous | 0 | (This is location-dependent.) |
| Forested | 0 | (Again, location-dependent.) |
| Open | 1 | (They often have open courtyards.) |
| Enclosed | 1 | (Enclosed structures like cloisters are common.) |
| Architectural | 1 | (Abbeys are known for their distinctive architecture.) |
| Ornate | 1 | (Many abbeys are ornately decorated, especially older ones.) |
| Simple | 0 | (Abbeys are typically not simple in design.) |
| Artistic | 1 | (Abbeys often contain artistic elements like stained glass.) |
| Symmetrical | 1 | (Many have symmetrical architectural designs.) |
| Modest | 1 | (Abbeys are often modest in terms of luxury.) |
| Cultivated | 1 | (Gardens or cultivated lands are common.) |
| Paved | 1 | (Pathways and internal floors are typically paved.) |
| Textured | 1 | (Stone walls, woodwork, etc.) |
| Elevated | 0 | (Dependent on the specific location.) |
| Underground | 0 | (Some abbeys may have crypts or basements.) |
| Foggy | 0 | (Location-specific.) |
| Daytime | 1 | (Abbeys are typically functional during the day.) |
| Weathered | 1 | (Many abbeys are old and show signs of weathering.) |
| Secluded | 1 | (Abbeys are often in secluded locations.) |
| Quiet | 1 | (Associated with quietude and peace.) |
| Cool | 1 | (Stone buildings often have a cool interior.) |
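
To show how a binary association table like the one above can be used, the sketch below turns a few of the Abbey rows into a class signature vector and scores it against attribute predictions for one image. Cosine similarity is used here only as a familiar stand-in, not as the unified similarity function proposed in this paper; the `predicted` scores and the helper names `to_vector` and `cosine` are hypothetical.

```python
from math import sqrt

# A few rows copied from the Abbey table above (attribute -> binary value).
ABBEY = {
    "Natural": 0, "Indoor": 1, "Outdoor": 1, "Urban": 0,
    "Bright": 0, "Spacious": 1, "Ornate": 1, "Quiet": 1,
}


def to_vector(signature, vocabulary):
    """Order the values by a fixed attribute vocabulary (missing attributes -> 0)."""
    return [float(signature.get(name, 0)) for name in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


vocabulary = sorted(ABBEY)
abbey_vec = to_vector(ABBEY, vocabulary)

# Hypothetical attribute scores predicted for one image (illustrative values only).
predicted = {"Indoor": 0.9, "Spacious": 0.8, "Ornate": 0.7, "Quiet": 0.6, "Urban": 0.2}
score = cosine(abbey_vec, to_vector(predicted, vocabulary))
```

A higher `score` means the predicted attributes agree more closely with the class signature. Any other similarity measure can be swapped in by replacing `cosine`.
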
4.2. ZSIR Main Results
4.3. Ablation Study
4.4. Cognition Alignment Analysis
4.5. Observations and Discussion
5. Conclusions
Author Contributions
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| LLM | Large Language Model |
| ZSL | Zero-Shot Learning |
| ZSIR | Zero-Shot Instance Retrieval |
| LIAD | Latent Instance Attributes Discovery |
| PG | Prototype Grouping |
| CA | Cognition Alignment |
References
- Long, Y.; Liu, L.; Shen, Y.; Shao, L. Towards affordable semantic searching: Zero-shot retrieval via dominant attributes. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, Vol. 32.
- Xu, S.; others. Taming Simulators: Challenges, Pathways and Vision for the Alignment of Large Language Models. arXiv preprint arXiv:2308.01317 2023.
- Wang, X.; others. Emotional Intelligence of Large Language Models. arXiv preprint arXiv:2307.09042 2023.
- Lester, B.; others. CogAlign: Learning to Align Textual Neural Representations to Cognitive Language Processing Signals. arXiv preprint arXiv:2107.06354 2021.
- Xu, G.; others. CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. arXiv preprint arXiv:2307.09705 2023.
- Sengupta, N.; others. Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models. arXiv preprint arXiv:2308.16149 2023.
- Zhang, S.; others. BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models. arXiv preprint arXiv:2306.10968 2023.
- Wang, P.; others. Making Large Language Models Better Reasoners with Alignment. arXiv preprint arXiv:2309.02144 2023.
- Bhardwaj, R.; others. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. arXiv preprint arXiv:2308.09662 2023.
- Liu, Y.; others. Large Language Model Alignment: A Survey. arXiv preprint arXiv:2308.05374 2023.
- Petroni, F.; others. Language Models as Knowledge Bases? arXiv preprint arXiv:1909.01066 2019.
- Gu, Z.; others. Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation. arXiv preprint arXiv:2306.08891 2023.
- Kirk, H.R.; others. Personalisation within Bounds: A Risk Taxonomy and Policy Framework for the Alignment of Large Language Models with Personalised Feedback. arXiv preprint arXiv:2303.05453 2023.
- Larochelle, H.; others. Zero-data learning of new tasks. AAAI, 2009.
- Palatucci, M.; others. Zero-shot learning with semantic output codes. Neural Information Processing Systems 2009.
- Lampert, C.H.; others. Learning to detect unseen object classes by between-class attribute transfer. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- Frome, A.; others. DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems, 2013.
- Norouzi, M.; others. Zero-shot learning by convex combination of semantic embeddings. International Conference on Learning Representations, 2014.
- Changpinyo, S.; others. Synthesized classifiers for zero-shot learning. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Xian, Y.; others. Zero-shot learning - A comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017.
- Liu, W.; others. Generalized zero-shot learning with deep calibration network. Advances in Neural Information Processing Systems, 2018.
- Wang, W.; others. A survey on zero-shot learning. IEEE Transactions on Neural Networks and Learning Systems 2019.
- Khandelwal, S.; others. Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline. arXiv preprint arXiv:2302.07319 2023.
- Chen, C.; others. Automatic vision-based calculation of excavator earthmoving productivity using zero-shot learning activity recognition. Automation in Construction 2023, 104, 104702.
- Díaz, G.; others. CSI-Based Cross-Domain Activity Recognition via Zero-Shot Prototypical Networks. arXiv preprint arXiv:2312.07076 2023.
- Nag, S.; others. Semantics Guided Contrastive Learning of Transformers for Zero-shot Temporal Activity Detection. IEEE Winter Conference on Applications of Computer Vision (WACV), 2023.
- Zhang, Z.; Saligrama, V. Zero-shot learning via semantic similarity embedding. International Conference on Computer Vision (ICCV), 2015.
- Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- Akata, Z.; Perronnin, F.; Harchaoui, Z.; Schmid, C. Label-embedding for attribute-based classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
- Romera-Paredes, B.; Torr, P.H.S. An embarrassingly simple approach to zero-shot learning. International Conference on Machine Learning (ICML), 2015.
- Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, Q.; Hein, M.; Schiele, B. Latent embeddings for zero-shot classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.





| Dataset | SUN | CUB |
|---|---|---|
| # instances | 14,340 | 11,788 |
| # attributes | 102 | 312 |
| seen/unseen splits | 707/10 | 150/50 |
| attribute type | ins.+ cont. | ins.+ bin. |
| # total concepts | 819 | 512 |
| unseen gallery size | 200 | 2933 |

| Methods | SUN @Rank1 | SUN @Rank5 | SUN @Rank10 | SUN @Rank20 | SUN @Rank50 | CUB @Rank1 | CUB @Rank5 | CUB @Rank10 | CUB @Rank20 | CUB @Rank50 |
|---|---|---|---|---|---|---|---|---|---|---|
| DAP [28] | 7.5 | 18.8 | 34.9 | 48.5 | 61.2 | 3.80 | 5.82 | 12.61 | 17.92 | 24.25 |
| ALE [29] | 14.8 | 29.6 | 47.5 | 64.2 | 78.4 | 7.81 | 18.23 | 22.52 | 30.74 | 38.72 |
| ESZSL [30] | 19.9 | 38.8 | 56.2 | 69.7 | 82.8 | 15.28 | 20.34 | 25.88 | 38.21 | 40.72 |
| LatEm [31] | 25.3 | 38.4 | 62.8 | 70.1 | 85.2 | 17.42 | 24.82 | 32.48 | 40.96 | 46.81 |
| LIAD [1] | 28.7 | 42.2 | 68.5 | 72.8 | 86.2 | 19.82 | 27.53 | 36.20 | 44.12 | 48.83 |
| CCA | 8.3 | 18.2 | 33.2 | 56.2 | 63.2 | 7.63 | 11.32 | 18.89 | 27.53 | 28.76 |
| Siamese Network | 12.8 | 22.5 | 40.2 | 57.2 | 69.8 | 8.52 | 12.42 | 18.92 | 28.42 | 30.79 |
| Ours (orthogonal only) | 26.6 | 38.2 | 58.8 | 65.2 | 79.9 | 17.72 | 26.85 | 29.97 | 37.72 | 40.12 |
| Ours (PG only) | 28.9 | 44.6 | 69.7 | 74.4 | 87.7 | 20.28 | 28.82 | 38.83 | 46.62 | 50.53 |
| Ours | 35.5 | 49.8 | 71.0 | 79.9 | 92.8 | 25.52 | 32.74 | 48.85 | 52.88 | 59.92 |
| DAP [28] | 8.8 | 19.2 | 32.6 | 44.7 | 52.5 | 5.42 | 8.82 | 14.27 | 16.82 | 22.36 |
| ALE [29] | 12.2 | 26.7 | 43.0 | 61.5 | 72.2 | 12.87 | 16.43 | 24.50 | 29.98 | 34.71 |
| ESZSL [30] | 18.8 | 34.2 | 49.1 | 66.2 | 76.9 | 14.31 | 17.40 | 23.65 | 36.48 | 39.22 |
| LatEm [31] | 17.3 | 36.4 | 58.8 | 67.6 | 80.8 | 15.82 | 20.26 | 29.48 | 36.25 | 43.82 |
| LIAD [1] | 18.7 | 37.7 | 61.9 | 70.2 | 78.8 | 18.61 | 26.62 | 32.81 | 39.42 | 44.28 |
| CCA | 13.8 | 27.4 | 44.5 | 62.8 | 70.7 | 10.43 | 14.52 | 18.85 | 25.58 | 30.76 |
| Siamese Network | 15.5 | 30.2 | 49.9 | 58.8 | 69.4 | 11.13 | 18.82 | 24.95 | 31.10 | 37.74 |
| Ours (orthogonal only) | 17.2 | 35.2 | 58.8 | 64.9 | 72.2 | 17.72 | 24.32 | 28.81 | 35.52 | 39.98 |
| Ours (PG only) | 18.9 | 38.1 | 63.2 | 73.2 | 79.1 | 19.21 | 27.78 | 37.75 | 42.29 | 46.62 |
| Ours | 20.5 | 40.2 | 65.5 | 75.8 | 82.2 | 28.21 | 30.87 | 39.92 | 44.97 | 48.82 |
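
The @RankK columns above report retrieval accuracy at several cut-offs. The exact definition is not restated in this section, so the following minimal sketch assumes the common hit-rate reading: a query counts as correct at rank K if at least one of its K most similar gallery items shares its label. The function name `rank_k_accuracy` and the toy similarity scores are illustrative assumptions.

```python
def rank_k_accuracy(similarity, query_labels, gallery_labels, ks=(1, 5, 10, 20, 50)):
    """Percentage of queries with a correct match among the top-k gallery items.

    similarity[q][g] is the score between query q and gallery item g
    (higher means more similar).
    """
    hits = {k: 0 for k in ks}
    for q, row in enumerate(similarity):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(row)), key=lambda g: row[g], reverse=True)
        for k in ks:
            if any(gallery_labels[g] == query_labels[q] for g in order[:k]):
                hits[k] += 1
    n_queries = len(similarity)
    return {k: 100.0 * hits[k] / n_queries for k in ks}


# Toy example: 2 queries, 3 gallery items (scores are made up for illustration).
toy_similarity = [
    [0.9, 0.1, 0.3],   # query 0 is closest to gallery item 0
    [0.2, 0.4, 0.8],   # query 1 is closest to gallery item 2
]
toy_result = rank_k_accuracy(toy_similarity, ["a", "b"], ["a", "b", "c"], ks=(1, 2))
```

In the toy example, query 0 is matched at rank 1 and query 1 only at rank 2, so `toy_result` is `{1: 50.0, 2: 100.0}`.
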
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).