4. Whole/Part Compositionality in Vision, Mereotopology, and Operational Constraints
I return to elaborate on the whole/part compositionality of visual iconic representations, starting from an example of what I mean by part/whole compositionality, as it reveals itself in the phenomenology of visual perception. 2-D surfaces are composed of edges that, in turn, are composed of line-segments. When one perceives in realistic conditions a line segment, one sees it as part of an edge, which is also a part of a surface, despite the fact that the perceptual system constructs first line segments, then edges, and then 2-D surfaces. In other words, segments of an edge are perceived by perceiving the edge and, thus, are perceived as edge parts and not as independent perceptual units. The same holds for properties such as size and brightness. Burns (1987) shows that size and brightness are not encoded at first as independent features of objects but are perceived in terms of holistic objects.
This bears directly on why iconic representations having whole/part compositionality do not satisfy the principle of semantic constituency that characterizes, or even defines, symbolic representations. According to this principle, if a representational vehicle s is a syntactic part of a representational vehicle p, the meaning of s is a semantic part of the meaning of p. Even though syntactically line segments are parts of edges, which in turn are parts of 2-D surfaces, because this is how the neural system constructs the representations, at the semantic, phenomenological level, line segments are not perceived as semantic parts of edges, in the sense that to perceive phenomenologically an edge one need not phenomenologically perceive a line segment; in fact, one does not. To put it metaphorically in conceptual terms, one understands an edge without understanding line segments, even though the latter are syntactic constituents of the former. As a matter of course, iconic representations satisfy semantic constituency at some level. If iconic representations holistically bind parts together, some of these parts, say the objects in a visual scene, are individuated as distinct patterns in the representation, and the meaning of the whole representation depends on the meaning of these parts.
The difference between semantic constituency as it applies to iconic representations and the same principle as it applies to symbolic representations lies in the fact that in iconic representations the meaning of the whole is not determined by combining parts at the syntactic level following the rules of a formal syntax, as happens in symbolic representations. In symbolic representations, one composes syntactically well-formed expressions following a set of syntactic rules, and it is the syntactic composition, along with the semantics of the atomic parts, that determines the meaning of the whole. In iconic representations, in contrast, one does not rely on formal rules to compile parts to form a whole. Visual iconic representations, for example, are constructed by the processes of the visual system, and the processes that form parts and combine them to form the whole (that is, the percept) are not guided by, and do not underwrite, a set of formal rules; they are driven by the stimulus and guided by a set of constraints that express the physical and geometrical regularities of the environment in which the perceiving organism lives. In this sense, it is the semantics and pragmatics of the system formed by the representation and the represented entity that guide the formation of the whole, rather than syntax.
A characteristic of many iconic representations is the use of space or some geometrical structure in representing the representatum; this spatial or geometrical structure maps onto the spatial structure of the representatum. Visual representations, a clear-cut case of iconic representations, have a two-dimensional or three-dimensional form. This is the spatial form that grounds the iconic structure of the perceptual representation; spatial representations preserve the spatial relationships of the represented entities owing to the mapping of the spatial relationships between the elements of the representation onto those in the representatum. Visual perceptual representations, being spatial representations, inherit this property and, thus, preserve the spatial relationships of the entities (be they objects or properties) in the visual scene. This is, of course, one aspect of the iconicity of visual representations, the other being that some of the elements of the representational content of the visual representation map onto the visible features of the objects in the represented visual scene.
The whole/part compositionality constrains the way the parts of an iconic representation are put together to form the whole: ‘iconic representations acquire accuracy conditions from the way features are holistically bound in each part, together with the spatial arrangement of those parts.’ (Quilty-Dunn, 2020). I said above that line segments are put together to form edges, which in turn, put together in appropriate ways, form 2-D surfaces. Moreover, one does not compile a surface from edges by following some logical, or syntactic in general, rules of construction of complex representations. Owing to the whole/part compositionality of the representation, the parts are perceived as parts of the whole rather than as distinct entities; one does not perceive line segments as distinct entities but as parts of edges. In discursive representations such as ‘(Φa)’, in contrast, ‘Φ’ retains its autonomy and can be seen independently of the whole (Φa); this is the result of the compositionality of discursive representations.
The part/whole compositionality seems to entail that since line segments are parts of edges and edges are parts of 2-D surfaces, the line segments are parts of the 2-D surface, a conclusion that seems warranted. However, even though the handles of a door are parts of the door, and the door is a part of a house, the handles are not, properly speaking, parts of the house. How is the latter case different from the former? The answer to questions like these should be sought in the mathematical models of Mereology and Topology, the two disciplines that examine part/whole compositionality.
All models in mereology start by defining, at a first pass, the ‘is part of’ relation by means of three axioms, namely, (P1) Pxx; (P2) Pxy ∧ Pyx → x=y; (P3) Pxy ∧ Pyz → Pxz, where ‘P’ is a two-place predicate to be interpreted as the parthood relation. In other words, the parthood relation is reflexive, antisymmetric, and transitive.
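To make the axioms concrete, here is a minimal sketch, not from the source, that checks (P1)–(P3) on a finite model in Python; the domain of parts (line segments, edges, surfaces) and the relation P are illustrative assumptions of mine:

```python
# Hedged sketch: verifying the parthood axioms (P1)-(P3) on a finite model.
from itertools import product

parts = ["segment", "edge", "surface"]
# (x, y) in P means "x is part of y"; reflexive pairs included for (P1).
P = {
    ("segment", "segment"), ("edge", "edge"), ("surface", "surface"),
    ("segment", "edge"), ("edge", "surface"), ("segment", "surface"),
}

def reflexive():      # (P1) Pxx
    return all((x, x) in P for x in parts)

def antisymmetric():  # (P2) Pxy and Pyx imply x = y
    return all(not ((x, y) in P and (y, x) in P) or x == y
               for x, y in product(parts, repeat=2))

def transitive():     # (P3) Pxy and Pyz imply Pxz
    return all(not ((x, y) in P and (y, z) in P) or (x, z) in P
               for x, y, z in product(parts, repeat=3))

assert reflexive() and antisymmetric() and transitive()
```

On this toy model, parthood is simply a partial order, which is all that (P1)–(P3) demand.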
I said ‘at a first pass’ because this definition has a serious shortcoming. If parthood is transitive, since the handle is a part of a door and the door is a part of a house, it follows that the handle is a part of the house too. To avoid this conclusion, the general intended interpretation of “P” by the three axioms should be narrowed by introducing some additional conditions restricting its application. Such a condition involves functionality, which requires that x be a part of y only if x makes a direct contribution to the functioning of the whole of which x is a part. In this case, the handle is a functional part of a door and, thus, it is a proper part of the door; and although the door is a functional part of the house, the handle is not a functional part of the house and, thus, the handle is not a proper part of the house. Mathematically put, where ‘φ’ is any formula in the language, the implication (1) (Pxy ∧ φ[x,y]) ∧ (Pyz ∧ φ[y,z]) → (Pxz ∧ φ[x,z]) may well fail to be a theorem of the mereological theory if x is not functionally related to z. Note that the functional restriction has its own problems. A spot on the door that is painted differently from the rest of the door is a part of the door, but it adds directly nothing to the functioning of the whole entity of which it is a part, that is, the door. Be that as it may, this discussion clearly bears on the problem of which parts of an iconic representation can be construed as parts of it properly speaking. I have argued that the part consisting of a part of the back of an object conjoined to a part of the background has no functionality in perceptual processing given the nature of our perceptual system; hence it is not, representationally speaking, a proper part of the representation, according to the restriction introduced in Mereology.
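The failure of restricted transitivity can be illustrated computationally. In the following hedged sketch, `phi` is a hypothetical predicate standing for “contributes directly to the functioning of”; the door/handle/house example is the one from the text:

```python
# Hedged sketch: bare parthood is transitive, functionally restricted
# parthood need not be. phi is an invented illustrative predicate.
P = {("handle", "door"), ("door", "house"), ("handle", "house")}  # transitive closure
phi = {("handle", "door"), ("door", "house")}  # direct functional contribution

def proper_functional_part(x, y):
    """x is a proper part of y only if it is a part AND contributes functionally."""
    return (x, y) in P and (x, y) in phi

print(proper_functional_part("handle", "door"))   # True
print(proper_functional_part("door", "house"))    # True
print(proper_functional_part("handle", "house"))  # False: transitivity fails
```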
To address the aforementioned and other problems, the meaning of “P” is further clarified by a set of additional definitions and axioms that differ from one mereological theory to another. Here I consider only those extensions that will be used in defining the sum or fusion of two parts to form a whole. These are the following (a computational sketch follows the list):
(i) PPxy =df Pxy ∧ ¬Pyx (Proper Part), that is, x is a proper part of y iff x is a part of y and y is not a part of x;
(ii) Oxy =df ∃z(Pzx ∧ Pzy) (Overlap), that is, x and y overlap iff there exists a z such that z is a part of x and z is a part of y; and
(iii) Uxy =df ∃z(Pxz ∧ Pyz) (Underlap), that is, x and y underlap iff there exists a z such that x is a part of z and y is a part of z.
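The following sketch computes the three defined relations over a small finite domain; the domain and the parthood relation are chosen purely for illustration:

```python
# Hedged sketch of definitions (i)-(iii): "ab" is a whole with parts "a" and "b".
domain = ["a", "b", "ab", "c"]
P = {(x, x) for x in domain} | {("a", "ab"), ("b", "ab")}

def PP(x, y):  # proper part: Pxy and not Pyx
    return (x, y) in P and (y, x) not in P

def O(x, y):   # overlap: some z is part of both x and y
    return any((z, x) in P and (z, y) in P for z in domain)

def U(x, y):   # underlap: x and y are both parts of some z
    return any((x, z) in P and (y, z) in P for z in domain)

print(PP("a", "ab"), O("a", "b"), U("a", "b"))  # True False True
```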
To define the sum of two parts the theory is extended so as to constrain further the meaning of parthood. The first extension is:
(P4) ¬Pxy → ∃z(Pzx ∧ ¬Ozy), that is, if an individual has a proper part, it has more than one. (P4) entails the following theorems:
(P4a) ⊢ PPxy → ∃z(PPzy ∧ ¬Ozx), that is, if x is a proper part of y, then there exists a z such that z is a proper part of y and it is not the case that z and x overlap;
(P4b) EM ⊢ (∃zPPzx ∧ ∀z(PPzx → PPzy)) → Pxy; and
(P4c) EM ⊢ (∃zPPzx ∧ ∀z(PPzx ↔ PPzy)) → x=y, which means that no two distinct objects can share the same proper parts.
The second extension is:
(P5) Uxy → ∃z∀w(Owz ↔ (Owx ∨ Owy)), that is, if x and y underlap, then there exists a z such that, for every w, w and z overlap iff either w and x overlap or w and y overlap.
In a Mereological theory that holds (P1) to (P5), the following “sum” definition can be supported:
(Sum Definition) x+y =df ιz∀w(Owz ↔ (Owx ∨ Owy)), where ‘ι’ is a description operator in the language. This says that the sum of two parts x and y is the unique z such that, for every w, w and z overlap iff either w and x overlap or w and y overlap. Adopting the ‘+’ sum operator allows us to recast (P5) as follows:
(P5′) Uxy → ∃z(z = x+y), that is, if x and y underlap, then there exists a z such that z is the sum of x and y.
The sum definition determines the sum of two parts that are combined or fused to form a whole. This definition can be generalized to yield the notion of unrestricted sum: ∃wφw → ∃zSizφw, where φ is any formula in the language (which picks out the components or parts that form the sum, say the grains of sand in a pile of sand). This definition stipulates that if there are things that satisfy φ, then there is an object z that is the sum of the φ-ers.
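A toy model may make the permissiveness of unrestricted sums vivid. In the sketch below I assume, purely for illustration, that parts are sets of atoms, parthood is the subset relation, and the sum is set union; nothing in the definition blocks scattered fusions:

```python
# Hedged sketch: unrestricted fusion over a set-theoretic model of parthood.
from functools import reduce

atoms = {"handle", "door-panel", "cloud", "pebble"}

def fusion(parts):
    """Mereological sum of a family of parts (here: set union)."""
    return reduce(set.union, parts, set())

# phi picks out a scattered pair of parts of different "objects":
phi_ers = [{"handle"}, {"pebble"}]
print(fusion(phi_ers))  # {'handle', 'pebble'}: a sum exists, but no natural whole
```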
The most pressing problem with Mereological theories, a problem that pertains directly to the compositionality of iconic representations, is that, given the way the “sum” operation is defined, Mereology by itself cannot capture some of the basic properties we attribute to wholes (for example, that a whole is a one-piece, self-connected entity, such as an object in a visual scene, as opposed to a scattered entity made up of several disconnected parts, such as parts of different objects). Parthood is a relational concept, whereas wholeness is a global property (Varzi 1996, 269), and, thus, the former, and the ensuing definition of “sum”, cannot capture completely the meaning of “wholeness”. This shows immediately if one considers that in Mereology for every whole there is a set of parts, and that for every specifiable set of parts (for example, arbitrary objects) there is in principle a complete whole, i.e., its mereological sum, or fusion. As the “Sum Definition” shows, the only restrictions concern the overlapping relations between the combined parts, but this is hardly enough to constrain the definition so that only those combinations of parts are admitted that result in a whole qua object satisfying our basic understanding of what should count as an object. Thus, within Mereology itself, there is no way to draw a distinction between “good” and “bad” wholes, and, thus, there is no way to distinguish between an integral whole and a scattered sum of disparate parts (yes, this is the Gavagai problem).
The problem is that it is not possible, on pure Mereological grounds, to determine the appropriate restrictions that would permit the fusion of parts in ways that allow only the formation of integral or natural wholes (such as objects or whole visual scenes), and would exclude the formation of sums of disparate parts or concrete heap-like composites (such as a pile of bricks, or sums of disparate parts of objects or background). Mereology does not say what constitutes a natural whole, since the existence of a sum x+y is conditional on the existence of an object z containing both x and y, in the sense that if x and y underlap, there exists a z such that z is the sum of x and y. This allows parts of different objects to be combined to form a whole, which thus consists in a conglomeration of disparate parts and is scattered all over the place. Consequently, basic spatio-temporal relations, such as the relationship between an object and its surface, or the relation of something being inside or around something else, which are among the relations that any theory concerned with spatio-temporal entities should supply, cannot be defined directly in terms of mereological primitives alone. This is not the only misgiving about the unrestricted form of fusion, as there are arguments that it
does not sit well with certain fundamental intuitions about persistence through time . . . that it is incompatible with certain plausible theories of space . . . or that it leads to paradoxes similar to the ones afflicting naïve set theory. (Varzi 2016, 38)
Many theoreticians accept this limitation of summation but argue that there is no structured way to restrict the notion of fusion or sum and that the standard definition of fusion in mereology is the only plausible option (Varzi 2016, section 4.5).
It is possible, of course, to restrict the definition of “sum” so as to output only sums that satisfy certain conditions, for example, that a sum of concrete parts must be a non-heap-like entity. In this case, if φ is any formula in the language (which picks out the components or parts that form the sum, say the grains of sand in a pile of sand), and ψ is a condition that the sum must satisfy (for example, that the sum of some material parts must be a natural whole), the set of all φ-ers has a sum if and only if every φ-er is ψ, that is, it is part of a natural whole. This yields the following definition of restricted sum: (∃wφw ∧ ∀w(φw → ψw)) → ∃zSizφw.
Accordingly, many theoreticians propose that mereology should be supplemented with some topological theory (Pianesi & Varzi 1996; Norton 2011). Pianesi & Varzi (1996) and Fletcher & Lackey (2022, 15) call the resulting amalgam of Mereology and Topology “Mereotopology”. A topological theory provides a primitive predicate that is essential in solving the abovementioned problems, namely the relation of “connectedness”, by introducing the connection predicate ‘C’. According to C, in order for two parts to form a sum it is necessary that they be connected or joined to each other, a demand that introduces the notions of contiguity and of interval relations, which promptly show up in discussions of iconic representations as indispensable ingredients for forming natural wholes. ‘C’ is defined as follows: (C1) Cxx; (C2) Cxy → Cyx; (C3) Pxy → ∀z(Czx → Czy). In other words, x is connected to itself; if x is connected to y, y is connected to x; and if x is a part of y, then for every z, if z is connected to x then z is connected to y. This last axiom ensures that putting together disparate parts does not provide an acceptable sum.
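A rough computational reading of how ‘C’ constrains fusion: in the sketch below (the grid model and all names are my assumptions, not part of any of the cited theories), cells of a grid count as connected when edge-adjacent, and a candidate sum is admitted only if it is self-connected:

```python
# Hedged sketch: restricting fusion by a connection predicate C on a grid.
def C(c1, c2):
    """Connection between grid cells: identity or edge-adjacency.
    (C1) Cxx and (C2) symmetry hold by construction."""
    (x1, y1), (x2, y2) = c1, c2
    return abs(x1 - x2) + abs(y1 - y2) <= 1

def self_connected(cells):
    """A candidate whole is admissible only if any two cells are linked
    by a chain of pairwise-connected cells (BFS over the C-graph)."""
    cells = set(cells)
    if not cells:
        return False
    seen, frontier = set(), [next(iter(cells))]
    while frontier:
        c = frontier.pop()
        if c in seen:
            continue
        seen.add(c)
        frontier.extend(d for d in cells if C(c, d))
    return seen == cells

print(self_connected([(0, 0), (0, 1), (1, 1)]))  # True: an integral whole
print(self_connected([(0, 0), (5, 5)]))          # False: a scattered sum
```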
There are three ways to accommodate the fusion of Topology with Mereology. The first is to accept that the two theories together provide an adequate framework to explain the part/whole compositionality; the second is an attempt to fuse Mereology into Topology by defining the parthood relation P of Mereology in terms of the connection predicate C of Topology; and the third is an attempt to subsume Topology under Mereology by defining the connection predicate C in terms of P and the vocabulary of Mereology. The details need not concern us, because what is important is that determining the conditions under which parts can be summed or fused to form natural, that is, integral, wholes, while excluding the formation of sums of disparate parts, requires both the notion of ‘parthood’ and that of ‘connectedness’.
My analysis of the whole/parts compositionality befitting iconic representations reflects Werning's (2012) work. Werning employs neurobiological findings concerning topologically structured cortical feature maps and the mechanism of object-related binding by neuronal synchronization to argue that iconic representations do compose, but their compositionality does not follow the principle of semantic constituency. The use of neuronal synchronization mechanisms underlying object-related binding can be used to complement my discussion of the neuronal implementation of the principles of Mereotopology underlying the formation of well-formed whole objects out of parts. Synchronization is the preferable mechanism nowadays invoked to explain how feature binding takes place to produce the representation of whole objects, and is used by a class of models that purport to explain how different attributes that are registered and processed in different visual areas can be bound together to form a visual object. The main characteristic of these models is that

oscillators with neighbouring receptive fields and similar feature selectivities tend to synchronize . . . whereas oscillators with neighbouring receptive fields and different feature selectivities tend to desynchronize. As a consequence, oscillators selective for proximal stimulus elements with like properties tend to form a synchronous oscillation when stimulated simultaneously. This oscillation can be regarded as one object representation. In contrast, inputs that contain proximal elements with unlike properties tend to cause anti-synchronous oscillations, that is different object representations. This result is in line with the findings of object-related neural synchronization. (Werning 2012, 642)
Synchronization, thus, could be the mechanism, or one among others, underlying the compositionality of parts to form wholes in vision. As the quotation makes clear, neighborhood relations are crucial in guiding synchronization, which brings in the role of Mereotopology in explaining the binding of parts of objects to produce natural objects. Werning shows that the network of top-down and bottom-up signals as implemented by neuronal synchronization can be modelled by means of neural networks that, famously, do not employ symbolic representations. This means that object-binding may be a case of composing non-symbolic, iconic representations following the principles of Mereotopology.
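The oscillator behavior described in the quotation can be approximated by a small Kuramoto-style simulation. The following sketch is illustrative only: the coupling constant, frequency, and the two feature classes are invented, and the model is far simpler than the networks Werning discusses:

```python
# Hedged sketch: like-feature oscillators couple positively (synchronize),
# unlike-feature oscillators couple negatively (desynchronize).
import numpy as np

rng = np.random.default_rng(0)
features = np.array([0, 0, 0, 1, 1, 1])        # two invented feature classes
phase = rng.uniform(0, 2 * np.pi, len(features))
omega = 2 * np.pi * 40.0                        # ~40 Hz, a gamma-range frequency
K, dt = 8.0, 1e-3

# Coupling matrix: +K for like features, -K for unlike ones.
coupling = np.where(features[:, None] == features[None, :], K, -K)

for _ in range(2000):                           # simulate 2 s
    diff = phase[None, :] - phase[:, None]      # pairwise phase differences
    phase += dt * (omega + (coupling * np.sin(diff)).sum(axis=1))

# Oscillators 0-2 and 3-5 tend to settle into two internally synchronous,
# mutually anti-phase groups: two candidate "object representations".
print(np.round(np.mod(phase - phase[0], 2 * np.pi), 2))
```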
Before I turn to examine in some detail the way neural synchronization is involved in perceptual groupings, let me say that the discussion of iconic compositionality in terms of Mereotopology is relevant to various problems plaguing the application of the Picture Principle to visual representations (see Burge (2022) for a thorough discussion). One of the problems of the principle is that it entails that any part of a picture is a representation of a part of what the picture represents. This is a problem for applying the principle to visual perception, because if one combines the back part of an object in a visual image with a part of the immediate background in the image and forms a whole, this whole is irrelevant to what is computationally significant in perception, and, thus, it is highly unlikely that this complex part of the image represents anything in perception. In terms of Mereology and Topology, this combination does not result in a natural, integral whole. In this sense it is not true that any part of the representation represents a part of the domain that the representation represents; only parts that are admissible as components of perceptual processes are admitted. Which parts these are is an empirical issue, which means that it is the perceptual system itself that solves the problem of which combinations of parts of an image are admissible as natural wholes.
We saw before that oscillating neurons with neighbouring receptive fields and similar feature selectivities tend to synchronize. It is thus plausible to suggest that synchronization is the mechanism underlying the neural implementation of the constraints of “local proximity” and “feature similarity”. Since such neurons are usually grouped together in the brain, Local Field Potentials (LFPs) are very useful for studying the activity of these neurons and their synchronization, because LFPs are transient electrical signals generated in groups of neurons by the summed and synchronous electrical activity of individual neurons. They express the aggregate activity of small populations of neighboring neurons represented by their extracellular potentials. Unlike action potentials, which are generated by individual neurons, LFPs measure synaptic potentials pooled across groups of neurons near the recording electrode.
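As a rough illustration of the kind of measure such studies rely on, the following sketch estimates band-limited power of a synthetic LFP-like signal; the signal composition and band limits are assumptions for demonstration, not recorded data:

```python
# Hedged sketch: band-limited power of a synthetic LFP via the FFT.
import numpy as np

fs = 1000.0                                   # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
lfp = (np.sin(2 * np.pi * 10 * t)             # invented alpha component
       + 0.5 * np.sin(2 * np.pi * 65 * t)     # invented gamma component
       + 0.2 * np.random.default_rng(1).standard_normal(t.size))

freqs = np.fft.rfftfreq(t.size, 1 / fs)
power = np.abs(np.fft.rfft(lfp)) ** 2

def band_power(lo, hi):
    """Total spectral power between lo and hi Hz."""
    return power[(freqs >= lo) & (freqs <= hi)].sum()

print("alpha (8-12 Hz):", band_power(8, 12))
print("gamma (30-90 Hz):", band_power(30, 90))
```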
Studies (Baldauf & Desimone 2014; Bastos et al. 2015; Fries et al. 2001; Gregoriou et al. 2009; 2015) illuminate the role of LFPs in tasks involving top-down attention and VWM, since these two play a pivotal role in grouping behaviorally relevant stimuli. These studies suggest that attention increases gamma-frequency synchronization, increases low-frequency alpha-band synchronization for distractors, reduces low-frequency alpha-band synchronization of V4 neurons representing behaviorally relevant stimuli, increases theta-band frequency, and increases low-frequency beta synchronization for attended stimuli. As we shall shortly see, top-down visual attention originates in the Prefrontal Cortex (PFC), which seems to modulate through low-frequency waves the activity in visual areas from V1 to V4, a modulation that results in increased synchronization at gamma-band high frequencies between the visual areas and either the Frontal Eye Fields (FEF), in the case of spatial attention, or the Inferior Frontal Junction (IFJ), in the case of object/feature-based attention.
Gregoriou et al. (2009; 2015) found that the distributions of the latencies of attentional effects in LFP gamma power in both FEF and V4 were significantly later than the distribution of the latencies for attentional effects on the firing rates in FEF, and significantly earlier than the distribution of latencies for attentional effects on V4 firing rates. These results indicate that significant attentional effects on LFP gamma power in either FEF or V4 occur later than the earliest attentional effects on firing rates in the FEF. Thus, rather than being caused by enhanced gamma oscillations, increases in firing rates in FEF with attention may initiate the coupled oscillations within and across areas. In contrast, firing rate changes in area V4 occur later and might result at least in part from enhanced gamma oscillations. Fries et al. (2001) found that the coupled oscillations remain even during the delay period where the firing activity in visual areas is subthreshold. More generally, increases in gamma synchrony are found among cells that decrease, or show no change in, their firing activity (Brunet et al. 2014).
During the delay period of VWM tasks, cells in PFC show increased activation rates, whereas it is likely that the activations in visual areas are subthreshold. According to the evidence discussed in the previous paragraphs, the increased firing-rate activity in PFC, and the increased activity in visual areas, especially in FEF, due to top-down attention, induce coupled oscillations at the gamma-band frequency both within PFC and across visual areas, starting from V4, where the attention effects are more pronounced, and extending to other mid-level and high-level visual areas. The attentional effects on mid-level and high-level visual areas during the delay period, thus, manifest themselves in the LFP gamma power and not in the firing rates of the neurons in mid- or high-level visual areas. It should be noted that reports concerning the impact of attention on neuronal synchronization in early visual areas are conflicting (Gregoriou et al. 2015). Chalk et al. (2010) report that attention reduced gamma-synchronization in V1, probably due to a decrease in the inhibitory drive that controls surround suppression.
We examined evidence showing the role of attention in increasing gamma-band high-frequency synchronization both between visual areas, and between visual areas and higher cortical areas. Is this an attentional effect, or is this the way attention affects perceptual processing? In other words, how exactly does top-down attention affect the oscillations in the inter-communicating areas? The answer is that top-down attentional effects on visual areas are transmitted through low frequencies in the alpha- and beta-bands and modulate the gamma-band activity in the modulated visual areas. Several studies in monkeys and humans show that spatial attention reduces local low-frequency alpha-band synchronization in visual areas, including V4 (Gregoriou et al. 2015; Fries et al. 2001), in contradistinction to the gamma-band activity that is increased by attended stimuli in V4. Specifically, distractors increase alpha-band activity, whereas attended stimuli reduce alpha-band activity. Beta-band activity, on the other hand, increases for attended stimuli. As Anderson et al. (2011) show, FEF, which plays a predominant role in directing top-down spatial attention, excites inhibitory neurons in target areas of the visual cortex during attentional modulation. Similarly, Gregoriou et al. (2015) suggest that alpha-band waves may play a role in suppressing irrelevant stimuli, thus enhancing the activations of neuronal assemblies representing the attended stimuli.
Research (Bastos et al. 2015; Michalareas et al. 2016; Fries et al. 2001; Fries 2015) suggests that gamma-band waves subserve feedforward signalling, whereas the alpha-beta-band waves subserve the top-down, feedback flow of information. When top-down attention affects visual processing, the predominantly bottom-up directed gamma-band (high-frequency, 30-90 Hz) influences are controlled by predominantly top-down directed alpha-beta-band (8-20 Hz) influences. Plomp et al. (2014) and van Kerkoerle et al. (2014) show that stimulation of V1 induces enhanced gamma-band activity in V4 (V1-to-V4 feedforward projections), while stimulation of V4 under visual stimulation with a background stimulus induces enhanced alpha-beta activity in V1 (V4-to-V1 feedback projections), which likely suppresses the background stimulus. In general, attentional top-down influences carried by low-band frequency waves are thought to modify gamma-band synchronization in the lower areas that receive the attentional feedback, enhancing the feedforward signals emanating from these areas. Top-down signals increase both the synchronization strength, as measured by LFP power, and the synchronization frequency of the gamma-band synchronization.
Fries et al. (2001) found that synchronization or coherence was modulated by spatial attention very early in the response to the stimulus. The Spike-Triggered Averages (STAs) for the 100-ms period after response onset (starting 50 ms after stimulus onset, because it takes about 50 ms for the brain to start responding to the incoming stimuli) contained large low-frequency modulations with superimposed gamma-frequency modulations. The low-frequency alpha-band (10 Hz) synchronization was reduced by attention, whereas there was a smaller gamma-frequency peak at around 65 Hz that was enhanced by attention. Both the visual evoked potential (VEP) and the spike histogram contained strong stimulus-locked gamma-frequency oscillations in the first 100 ms of the response. Since gamma-band synchronization is related to bottom-up processing, this finding shows the bottom-up, stimulus-locked activity in visual areas. However, oscillatory synchronization during the later sustained visual response was not stimulus-locked, which shows the effect of spatial attention in modulating perceptual processing. This is shown by the fact that the low-frequency synchronization in the alpha-band was reduced by attention (alpha-band synchronization is reduced for attended stimuli and increased for distractors). Since attention strengthens the representation of the attended stimulus in V4, it facilitates the bottom-up signals from V4 to IT and other cortical areas, which explains the increase in gamma-band frequency. Thus, attention increased gamma high-frequency synchronization and reduced low-frequency synchronization of V4 neurons representing the behaviorally relevant stimulus. This was observed even during the delay period and in the first few hundred milliseconds after response onset, when firing rates were not affected, because spatial attention fixated at a certain location modulates the preparatory activity of the neurons whose receptive fields fall within the attended location before the presentation of the stimulus, and this effect carries over after stimulus onset, before the attentional cue is presented.
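For readers unfamiliar with STAs, the following minimal sketch computes one from synthetic data; the spike train is artificially locked to a gamma-range oscillation so that the STA comes out oscillatory, in the manner of the attended condition Fries et al. describe, but none of the numbers correspond to their recordings:

```python
# Hedged sketch: a spike-triggered average (STA) of a synthetic LFP.
import numpy as np

fs = 1000                                     # sampling rate (Hz)
gamma_hz = 62.5                               # gamma band; period = 16 samples
rng = np.random.default_rng(2)
lfp = np.sin(2 * np.pi * gamma_hz * np.arange(5 * fs) / fs)
lfp += 0.3 * rng.standard_normal(lfp.size)    # noise on top of the oscillation

# Invented spikes locked to the gamma cycle (one spike per period):
spike_times = np.arange(200, lfp.size - 200, int(fs / gamma_hz))

window = 100                                  # +/-100 samples (= +/-100 ms)
segments = [lfp[s - window: s + window] for s in spike_times]
sta = np.mean(segments, axis=0)               # oscillatory STA => spike-LFP locking

print(sta.shape)                              # (200,): average LFP around spikes
```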
In contradistinction to alpha-band waves, which suppress irrelevant stimuli (distracting stimuli enhance alpha-band oscillatory activity), beta-band wave activity may enhance the activation of attended stimuli by inducing stronger synchrony in lower frequency bands, which enhances the top-down signals. Bastos et al. (2015) argue that top-down signals that facilitate the processing of attended stimuli are carried by beta-band (14-18 Hz) synchronization. As Bastos et al. (2015) suggest, cognitive tasks enhance top-down beta-band influences. Moreover, when attention selects a stimulus and enhanced top-down signals (carried by beta-band waves) reach the representation of the attended stimulus in visual cortical areas, this may lead to enhanced bottom-up signalling of that stimulus carried by gamma-band waves. Bosman et al. (2012) show that bottom-up causal influences from V1 to V4 are enhanced when they carry information about the attended stimulus, in accordance with Fries et al.'s (2001) finding that neurons activated by attended stimuli show increased gamma-frequency synchronization.
I claimed above that the top-down flow of information is subserved by alpha- and beta-band synchronizations and the bottom-up flow of information by gamma- (and theta-) band synchronizations. We also saw that there is substantial evidence that gamma-band causal influences between FEF and V4 predominate in the FEF-to-V4 direction after an attentional cue (post-cue condition), but subsequently predominate in the V4-to-FEF direction. In view of the fact that FEF is anatomically higher than V4 and, thus, the connections from FEF to V4 seem to be feedback/top-down projections, one might think that gamma-band synchronization subserves the feedback flow of information, which contradicts the evidence that feedback is carried by low-frequency waves. However, there is no discrepancy, since FEF may be anatomically higher in the hierarchy than V4, but functionally things are different.
FEF is situated in the PFC at a site heavily interconnected with the parietal cortex and is a part of the dorsal visual system. The mean activation latency of FEF neurons is 70 ms poststimulus. Signals arrive at FEF with a slight (if any) time delay with respect to the signals arriving at V1 (50-80 ms) and V2 (85 ms), and much earlier than they arrive at V4, despite the fact that FEF is anatomically higher than V4. Thus, functionally, FEF is at a lower level than V4 and, therefore, the projections from FEF to V4 are feedforward. This describes prestimulus conditions, since it indicates the potential functional relations of FEF and V4. FEF contains visual and movement neurons (O'Shea et al. 2004).
Bastos et al. (2015) show that in the early post-cue period, that is, before the spatial attention triggered by the cue intervenes, area 8L of FEF is lower than V4, which means that the projections from area 8L of FEF to V4 are feedforward. Thus, Gregoriou et al.'s (2009) finding that gamma-band causal influences between macaque FEF and V4 predominate in the FEF-to-V4 direction after an attentional cue reinforces rather than contradicts the view that feedforward influences are subserved by gamma-band synchronizations. Notice that this description purports to explain the increase in gamma synchronization between FEF and V4 and the fact that the causal influence is from FEF to V4, in other words, that FEF projects feedforward signals to V4. It does not account for the top-down attentional effect of FEF, and especially of area 8M, on V4, because this is carried by low-frequency waves, as top-down signal propagation predominantly is, from FEF to V4.
Recall that Fries et al. (2001) found that spatial attention modulates synchronization in V4 at 150 ms after stimulus onset, when large low-frequency modulations were superimposed with gamma-frequency synchronizations. The low-frequency waves on V4 carry the top-down attentional modulation from area 8M of FEF to V4. Bastos et al. (2015) show that area 8M is slightly higher than V4. The gamma-band synchronization in the FEF-to-V4 direction before the cue probably shows the effect of the processing of the visual stimulus in area 8L of FEF, in which the visual neurons of FEF are clustered; these neurons process the stimulus independently of attention. Moreover, once the representation in V4 of an item at the attended location has been boosted by the top-down attentional effects carried by low-frequency waves, the feedforward projections from V4 to FEF dominate, and enhanced bottom-up signalling of that stimulus carried by gamma-band waves prevails. This explains Gregoriou et al.'s (2009) finding that gamma-band causal influences between macaque FEF and V4 predominate in the V4-to-FEF direction later in the post-cue period.
One might be puzzled by the fact that the same areas (for example, V4 and FEF) can have both feedforward and feedback projections between them, which leads to the functional hierarchy exhibiting dynamic changes. Fries (2015) argues that when two neuronal areas are bidirectionally connected, unidirectional entrainment (that is the causal influence of one on the other) occurs separately in both directions of the bidirectional link. Anatomical data show that for each direction of communication, the linked brain areas have specialized neuronal groups for sending and receiving signals; that is, a specific brain area has neurons receiving inputs and different neurons sending outputs (Fries 2015; Markov et al. 2014).
Jensen et al. (2014) examine the role of alpha oscillations in relation to gamma oscillations in attentional effects on the visual areas. Irrespective of the exact nature of the signals that multivariate pattern analysis (MVPA) decodes, it is clear that VWM involves visual areas. When the test display appears, the perceptual areas also receive bottom-up, stimulus-driven activation, and neuronal assemblies in these areas encode the perceptual information in the test display. This information is compared to the perceptual information concerning the sample item that is stored in the connection weights of neuronal assemblies in visual areas, which are reactivated both by the bottom-up signal from the test item, owing to the distributed nature of representations, and by the top-down signals from PFC.
Work by Nakatani and Leeuwen (2006) elucidates the role of attention in perceiving ambiguous figures, relating attentional activity to the synchronized activity in the right parietal areas that are responsible for perceptual awareness, and in the right frontal areas that correlate with perceptual flexibility and, hence, with perceptual switching between the two possible percepts of the ambiguous or bistable figure. Note that the same areas are also involved in top-down selective attention (Corbetta and Shulman 2002). Nakatani and Leeuwen's (2006) research shows two cycles of synchrony in the gamma band; the first occurs 800–600 ms, and the second 400–200 ms, before button pressing. The first period of synchronicity coincides with a drastic suppression of eye blinks that is related to attentional demands, as these demands make viewers focus closely, postponing saccades (Ito et al. 2003). The second period of synchronicity in the observed activity patterns in PFC coincides with the maximum saccade frequency, which reaches its peak at about 250 ms before the switch response. Since saccade frequency is associated with shifts of attention (Leopold and Logothetis 1999), the second period of synchronicity probably reflects the final focus of spatial attention after a series of attentional shifts, which, by determining the critical points on the image, also determines which interpretation of the ambiguous figure will be perceived.
Nakatani and Leeuwen (2006) also explored the role of the activity in the frontal and occipital cortex during switching episodes. They found that the theta activity in the frontal cortex is a general characteristic of the processing activity of viewers who perform frequent switches, but is not specifically related to perceptual switching. Increased theta-band activity in the frontal cortex is related to the concentration of attention on a task and to the inhibition of eye blinks (Yamada 1998), as is the activity in the first period of synchrony in the gamma band. The alpha-band activity observed in the occipital cortex is related to frequent perceptual switches. Increased alpha activity in the occipital cortex is related to attention to the stimulus, which enhances the efficiency of information processing (Yamagishi et al. 2003). Thus, the frontal and occipital cortex activity during perceptual switches signifies the crucial role of attentional modulation of the perception of ambiguous figures and its effects on the rate of perceptual switches.
The previous studies concern the role of synchronization in attention tasks. Attention is closely related to tasks involving VWM as well, and so it would be interesting to see what role synchronization plays in VWM tasks. Examining the role of synchronic activity among brain regions in memory tasks, Roux & Uhlhaas (2014) proposed that gamma-band oscillations are specifically involved in the active maintenance of VWM information. Theta-band oscillations are specifically involved in the temporal organization of VWM items, a view that generalizes Nakatani and Leeuwen's (2006) proposal. Finally, alpha-band oscillations are involved in the inhibition of task-irrelevant information, which results in enhancing the efficiency of information processing, as Yamagishi et al. (2003) proposed. However, other factors may contribute to the enhancement of the efficiency of information processing in occipital areas, and the increased alpha activity may reflect these factors as well. Such factors may be a direct enhancement of task-relevant information (Miller & Cohen 2001), or the sharpening of the representations of different object categories in the extrastriate cortex by an increase in the distinctiveness of their distributed neural representations (Fuster et al. 1985).
These results are based on studies demonstrating amplitude modulation of neural oscillations presumably emanating from particular brain regions involved in WM. Recording human EEG during a delayed match-to-sample task, Tallon-Baudry et al. (1999) observed that occipital gamma and frontal beta oscillations were sustained across the retention interval. Moreover, as the delay interval increased, these oscillations decreased in parallel with decreased performance on the task. Anderson et al. (2014) showed that the spatial distribution of power in the alpha frequency band (8–12 Hz) tracked both the content and the quality of the representations stored in visual working memory. Recall that in memory tasks there is increased power in the alpha frequency band in the occipital cortex, related to the enhancement of the efficiency of information processing by attention to the stimulus (Yamagishi et al. 2003). These empirical findings together support both the view that neural oscillations are critical for VWM maintenance processes, and the view that in VWM tasks posterior visual processing areas play a critical role in sustaining the representations held in VWM. Finally, Lee et al. (2005) found evidence of enhanced local field potentials (4–10 Hz) in area V4 of the monkey during a visual working memory task.
Long-range synchronization of the oscillations between brain regions likely also plays an important role in VWM function (Crespo-Garcia et al. 2013). In a human MEG study, synchronized oscillations in the alpha, beta, and gamma bands were observed between frontoparietal and visual areas during the retention interval of a delayed match-to-sample visual working memory task. These synchronized oscillations were sustained and stable throughout the delay period of the task, were memory-load dependent, and were correlated with an individual's VWM capacity (Palva et al. 2010).
These studies bring to the fore the crucial role of synchronous oscillations in the alpha-, beta-, theta-, and gamma-frequency bands in top-down attention and in VWM tasks. Top-down attention effects are carried by low-frequency oscillations that synchronize the LFP oscillations between the affecting and affected brain areas, while bottom-up signals are carried by high-frequency gamma-band oscillations.
The foregoing discussion shows the close collaboration between cognitive centers and visual areas in the brain in VWM, since the higher-level cognitive centers guide attention through which they sustain the perceptual representations in visual areas during memory tasks. This suggests that the perceptual information used in memory tasks is most likely represented in visual areas and, in this sense, memory recruits the representations in these areas to achieve its goals. This is the basic premise of sensorimotor recruitment models of VWM, a class of models that hold that the systems and representations engaged to perceive information can also contribute to the short-term retention of that information in VWM (D’Esposito & Postle 2015).
We can bring in, now, the upshot of the discussion about Mereology and Topology, which is that to solve the problem of how to constrain the fusion of parts so that only natural objects be accepted as proper wholes, one must consider both “parthood” and “connectedness”. Since the visual system has solved this problem, it is plausible to assume that its computations directly implement, in one form or another, the Mereotopological principles of, say, “parthood” and “connectedness” and combine them in such a way that only natural wholes are represented in visual perception under normal conditions. The reader has perhaps made this connection as a result of our discussion of the neural mechanisms of synchronicity that underlie the compositionality of visual representations. If iconic compositionality is realized by neural mechanisms and the ways parts are composed are hardwired in the visual system, and if these compositions are expressed by the principles of Mereotopology, it follows that the principles of Mereotopology express the functioning of the relevant neural mechanisms.
At the same time, the operation of the visual system, as we have seen, is characterized by a set of operational constraints. Thus, these constraints must be realized by the neural circuits in our visual system. This is what Han et al.'s (2002) studies show. Han et al. (2002) studied the neural mechanisms underpinning the operational constraints of “local proximity” and “similarity”. The findings of their study suggest that proximity grouping resulted in short-latency modulations of medial occipital activity that were followed by longer-latency modulations in the occipito-parietal cortex. Proximity grouping, which relies on “local proximity”, induced these medial occipital modulations at 110 ms, which suggests that it depends mainly on representations of spatial relationships between local elements and is independent of visual features. Grouping by color similarity, which relies on feature similarity, produced only long-latency occipito-temporal modulations.
In view of these considerations, it follows that the operational constraints in vision should reflect the principles of Mereotopology. I shall argue now that they do so, starting with a brief account of the operational constraints in vision. As I have repeatedly said (Author), perceptual computations are constrained by a set of what I have called operational constraints. Burge (2010) calls them “formation principles”, Echeverri (2017) calls them “object constraints”, and some among them are also known as Spelke's (1990) principles of “object perception”. These principles also figure in Haugeland's account of perception, where he claims (Haugeland 1998, 261) that we and non-concept-possessing creatures share various innate “object-constancy” and “object-tracking” mechanisms that automatically ‘lock onto’ medium-sized lumps of matter. These mechanisms implement the operational principles. The operational constraints reflect general or higher-order physical regularities that govern the behavior of objects in our world and the geometry of our environment, and that have been ingrained in the perceptual system through causal interaction with the environment over the evolution of our species. This means that they reflect generalities in the world given our physical constitution and needs for survival.
Empirical studies by Spelke (1990), Spelke et al. (1995), and Karmiloff-Smith (1992) strongly support the assumption that the infant, from the beginning of life, is constrained by a number of domain-specific principles about material objects and some of their properties. These constraints involve ‘attention biases toward particular inputs and a certain number of principled predispositions constraining the computation of those inputs.’ (Karmiloff-Smith 1992, 15) Among these predispositions are the conception of object persistence, and four basic principles: boundness, cohesion, rigidity, and no action at a distance.
The operational constraints function at almost all levels of visual processing, are hardwired in the brain, and do not entail that perception is cognitively penetrated, since they do not constitute cognitive states that affect perceptual processing (Raftopoulos 2009; 2019, chapter 3). One of these principles is cohesion (Bloom, 2000). ‘Objects are connected and bounded bodies that maintain both their connectedness and their boundaries as they move freely’ (Spelke et al. 1995, 45). That is, the cohesion principle dictates that two surface points lie on the same object only if the points are linked by a path of connected surface points. This entails that if some relative motion alters the adjacency relations among points at their borders, the surfaces lie on distinct objects, and that all points on an object move on connected paths over space and time. When surface points appear at different places and times such that no connected path could unite their appearances, the surface points do not lie on the same object. This constraint reflects the principle of “connectedness” of Topology that, as we have seen, supplements the rules of Mereology so as to restrict the definition of “sum” so that only the compiling of parts that yields natural whole objects counts as an acceptable “sum”. This is so because if two surface points lie on the same object only if the points are linked by a path of connected surface points, then the parts on which the two points lie are connected. The importance of the cohesion principle is manifested by the finding that some violations of cohesion seem to destroy infants' representations of enduring objects (Chiang & Wynn, 2000; Huntley-Fenner et al., 2002). Mitroff et al. (2004a, b) and vanMarle & Scholl (2003) show that even adults' visual processing is critically affected by cohesion violations.
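The cohesion test lends itself to a direct computational reading. In this hedged sketch, surface points are cells of a binary mask (an assumption of mine, not Spelke's formulation), and two points count as lying on the same object iff a path of 4-adjacent surface points links them:

```python
# Hedged sketch: the cohesion principle as path-connectedness over surface points.
from collections import deque

surface = {(0, 0), (0, 1), (1, 1), (4, 4)}    # illustrative occupied points

def same_object(p, q):
    """BFS over 4-adjacent surface points from p; cohesion holds iff q is reached."""
    frontier, seen = deque([p]), {p}
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == q:
            return True
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in surface and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return False

print(same_object((0, 0), (1, 1)))  # True: linked by connected surface points
print(same_object((0, 0), (4, 4)))  # False: no path, hence distinct objects
```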
Another principle, closely related to “cohesion”, is the principle of solidity: ‘objects move only on unobstructed paths: no parts of two distinct objects coincide in space and time’ (Spelke, et al., 1992, 606). This is a clear expression of Theorem (P4c) of Mereology presented above, which states that no two distinct objects can share the same proper parts, and shows how the perceptual system has hardwired basic regularities of the environment, in this case a basic Mereological property of solid objects.
Another constraint is the boundness principle, according to which two surface points lie on distinct objects only if no path of connected surface points links them. This principle determines the set of those points that define an object boundary, and entails that two distinct objects cannot interpenetrate, because two distinct bodies cannot occupy the same place at the same time. Finally, the rigidity and no-action-at-a-distance principles specify that bodies move rigidly (unless the other mechanisms show that a seemingly unique body is, in fact, a set of two distinct bodies) and that they move independently of one another (unless the mechanisms show that two seemingly separate objects are in fact connected). These constraints guide the perception of the motions of objects, of the layout of adjacent objects, of object boundaries, and of object segmentation by both adults and infants, and play a crucial role in the segmentation processes that take place in the visual system upon viewing a scene. These constraints relate both to “connectedness” and to Theorem (P4a) of Mereology. The former because if there is no path of connected surface points linking the two surface points, they belong to unconnected parts that form distinct objects, since they cannot be combined to form a natural whole. Theorem (P4a) applies because it rules out the possibility that two object parts that combine to form a whole object have any common parts (they cannot overlap); if that were the case, two distinct object wholes formed of such parts could penetrate each other at the points of overlap.
There are more constraints at work in perception than those mentioned above. The formation of the full primal sketch in Marr's (1982) theory, for instance, which involves the grouping of the edge fragments formed in the raw primal sketch, relies on the principles of “local proximity” (adjacent elements are combined), which shows the Topological principle of “connectivity” at work, and of “similarity” (elements with similar features are combined). All these principles are parts of “perceptual grouping”, which refers to the function of the human visual system of organizing discrete entities in the visual field into chunks or perceptual objects. The principle of local proximity states that spatially close objects or object parts tend to be grouped together, thus constraining which parts can be combined to form natural wholes. The principle of similarity claims that elements with similar features in the field tend to be grouped together. Grouping processes have been assumed to take place at an early stage in the visual processing stream. Perceptual grouping also relies on the more general principle of “closure” (two edge segments could be joined even though their contrasts differ because of illumination effects) (Bruce and Green 1993, 131–132). Other assumptions that are brought to bear upon early visual processing to solve the problem of the underdetermination of perception by the retinal image are those of “continuity” (the shapes of natural objects tend to vary smoothly and usually do not have abrupt discontinuities), “proximity” (since matter is cohesive, adjacent regions usually belong together and remain so even when the object moves), and “similarity” (since the same kind of surface absorbs and reflects light in the same way, the different subregions of an object are likely to look similar).
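A minimal sketch of grouping by “local proximity” and “similarity” together: elements are merged with a union-find structure when they are both spatially close and feature-similar; the thresholds and the data are invented for illustration:

```python
# Hedged sketch: perceptual grouping by joint proximity and feature similarity.
import math

elements = [  # (x, y, feature value, e.g. hue); all values invented
    (0.0, 0.0, 0.10), (0.5, 0.2, 0.12), (0.9, 0.1, 0.11),  # close and similar
    (5.0, 5.0, 0.80), (5.3, 5.1, 0.82),                    # a second group
    (0.4, 0.3, 0.90),                                      # close but dissimilar
]
parent = list(range(len(elements)))

def find(i):
    """Union-find root with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

for i in range(len(elements)):
    for j in range(i + 1, len(elements)):
        (x1, y1, f1), (x2, y2, f2) = elements[i], elements[j]
        if math.hypot(x1 - x2, y1 - y2) < 1.0 and abs(f1 - f2) < 0.1:
            union(i, j)  # grouped: proximal AND similar

print([find(i) for i in range(len(elements))])  # three groups emerge
```

Note that the last element is spatially proximal to the first group yet is not merged with it, because it fails the similarity test: both constraints must be satisfied, as in the grouping principles described above.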
The formation of the 2½-D sketch is similarly underdetermined, in that there is a great deal of ambiguity in matching features between the two images formed in the retinas of the two eyes, since there is usually more than one possible match. Stereopsis requires a unique matching, which means that the matching process must be constrained. The formation of the 2½-D sketch, therefore, relies upon a different set of operational constraints that guide stereopsis: ‘A given point on a physical surface has a unique position in space at some time’ (Marr 1982, 112), matter is cohesive, and surfaces are generally smooth. These operational constraints give rise to the general constraints of “compatibility” (a pair of image elements are matched together if they are physically similar, since they originate from the same point of the surface of an object), “uniqueness” (an item from one image matches with only one item from the other image), and “continuity” (disparities must vary smoothly).
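These three constraints can be rendered as a toy matcher. In the sketch below, compatibility is feature agreement, uniqueness is enforced by requiring a bijection between the two images' elements, and continuity is a penalty on disparity jumps; the feature values and weights are assumptions, and a realistic implementation of Marr's proposal would be far more involved:

```python
# Hedged sketch: stereo matching under compatibility, uniqueness, continuity.
from itertools import permutations

left = [0.2, 0.5, 0.8]      # invented feature values of left-image elements
right = [0.21, 0.52, 0.79]  # invented feature values of right-image elements

def cost(matching):
    """Compatibility cost plus a continuity penalty on disparity jumps."""
    compat = sum(abs(left[i] - right[j]) for i, j in matching)
    disparities = [j - i for i, j in matching]
    smoothness = sum(abs(d1 - d2) for d1, d2 in zip(disparities, disparities[1:]))
    return compat + 0.5 * smoothness

# Uniqueness: each left element matches exactly one right element (a bijection),
# enforced by searching over permutations.
candidates = (list(zip(range(len(left)), perm))
              for perm in permutations(range(len(right))))
print(min(candidates, key=cost))  # [(0, 0), (1, 1), (2, 2)]
```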
I think that it is not hard to see, and I have given some examples, that almost all of these operational constraints are directly or indirectly related to the connectedness and parthood Mereotopological relations among elements in perceptual computations, and that their role is exactly to ensure the formation, and thereby the visual representation, of natural wholes. All this suggests that the Picture Principle, as it applies to perception, is severely constrained, and that it is not the case that any way you cut a visual representation of a visual scene you get a representation of some part of this scene.
We know that adjacent regions in a visual scene are registered by adjacent regions in the retina and, through retinotopic projections, are represented by adjacent regions in most of the brain. Thus, neighborhood relations, a topological notion strongly associated with ‘connectedness’, are retained in the brain. Moreover, parthood relations in a visual scene are also retained in the brain, because a neural representation of, say, a whole object or scene consists, in the sense of containment, of the neural representations of its parts, the same way a picture of a scene contains as parts pictures of the constituents of the scene. Thus, mereological relations are also retained in the brain. It follows that even though mental states are not spatially arrayed in space as pictures are, mereotopology is still applicable and can be put to use to show how the brain distinguishes natural wholes from mere collections of parts (recall that mereology by itself cannot solve this problem but requires the topological notion of ‘connectedness’). This means that the mathematical tools of mereotopology can be used to describe not only the compositionality of visual perceptual contents, but also the compositionality of their neural vehicles.