It is often not until we are faced with an unfamiliar musical style that we fully realize the importance of the musical mental schemata gradually acquired through our past listening experience. These cognitive structures automatically intervene as music is heard, and they are necessary to build integrated and organized perceptions from acoustic sensations: without them, as happens when listening to a piece in a musical style foreign to our experience, a flow of notes seems like a flow of words in a foreign language, incoherent and unintelligible. The impression is that all pieces or phrases sound more or less the same, and musical styles such as Indian ragas, Chinese guqin or Balinese gamelan are often described as monotonous by Western listeners new to these kinds of music. This happens to experienced, musically trained listeners as well as to listeners without any musical experience other than listening itself. Thus it is clear that the mental schemata required to interpret a certain kind of music can be acquired through gradual acculturation (Francès, 1988), which is the result of passive listening in the sense that it does not require any conscious effort or attention directed towards learning. This is not to say that formal training has no influence, only that it is not necessary and that exposure to the music is sufficient.
Becoming familiar with a particular musical style usually implies two things:
1) The memorization of particular melodies
2) An intuitive sense of the prototypicality of musical sequences relative to that style (i.e., the sense of tonality in the context of Western music).
These underlie two kinds of expectancies: melodic and stylistic, respectively. Melodic (also called 'veridical') expectancies rely on the listener's familiarity with a particular melody, and refer to his knowledge of which notes will come next after hearing part of it. Stylistic expectancies rely on the listener's familiarity with a particular musical style, and refer to his sense of which notes should, or will probably, follow a passage for the piece to fit well within that style.
These expectancies can be probed in different ways, for instance with Dowling's (1973) recognition task of familiar melodies interleaved with distractor notes, and Krumhansl and Shepard's (1979) probe-tone technique, respectively.
Some connectionist models of tonality have been proposed before, but they are rarely realistic in that they often build in a priori knowledge from the musical domain (e.g., octave equivalence) or are constructed without going through learning (Bharucha, 1987; extended by Laden, 1995). This paper presents an Artificial Neural Network (ANN), based on a simplified version of Grossberg's (1982) Adaptive Resonance Theory (ART), to model the tonal acculturation process. The model presupposes no musical knowledge except the categorical perception of pitch for its input, which is a research problem in itself (Sano and Jenkins, 1989) and beyond the scope of this paper. The model develops gradually through unsupervised learning. That is, it needs no information other than that present in the music to generate the schemata, just as humans do not need a teacher. Gjerdingen (1990) used a similar model for the categorization of musical patterns, but did not aim at checking the cognitive reality of these musical categories. Page (1999) also successfully applied ART2 networks to the perception of musical sequences. The goal of the present paper is to show that this simple and realistic model is cognitively pertinent, by comparing its behaviour directly with humans' on the same tasks. As mentioned in the previous section, these tasks were chosen because they are robust, having stood the test of time, and because they reflect broad and fundamental aspects of music cognition.
The ART2 self-organizing ANN (Carpenter and Grossberg, 1987) was developed for the classification of analogue input patterns and is well suited to music processing. It is somewhat more complex than what is needed here, so a few simplifications were made to build the present model, ARTIST (Adaptive Resonance Theory to Internalize the Structure of Tonality). ARTIST is made up of two layers (or fields) of neurons, the input field (F1) and the category field (F2), connected by synaptic weights that play the role of both Bottom-Up and Top-Down connections. Learning occurs through the modification of these weights, which progressively tune the 'category units' in F2 to be most responsive to a certain input pattern (the 'prototype' for that category). The weights store the long-term memory of the model.
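As a purely structural illustration, the following Python sketch shows one way these two fields and their shared weights might be represented; the class and attribute names are ours, not taken from the original implementation.

```python
# A minimal structural sketch of ARTIST, under the simplifications
# described in the text; names are illustrative only.
import numpy as np

class Artist:
    def __init__(self, n_input=72):
        self.f1 = np.zeros(n_input)  # input field (F1) activation
        self.weights = []            # one prototype vector per F2 category;
                                     # the same weights serve as Bottom-Up
                                     # and Top-Down connections, and they
                                     # constitute the model's long-term memory
```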
The neurons in F1 represent the notes played. For now the model is tested only with conventional Western music, so an acoustic resolution of one neuron per semitone is sufficient to code the musical pieces used. This is the only constraint imposed on the model by the assumption of Western music, and it can easily be overridden simply by changing the number of input nodes. Bach's 24 preludes from the Well-Tempered Clavier were used for learning. The notes they contain span 6 octaves; with 12 notes per octave, 72 nodes are needed to code the inputs. The activation of the inputs is updated at the end of every measure. Each note played within the last measure activates its corresponding input neuron proportionally to its loudness (or velocity; notes falling on beats 1 and 3 were accentuated) and according to an exponential temporal decay (activation is halved every measure). Before the activation is propagated to F2, the activation in F1 is normalized. Each prelude was transposed into the 12 possible keys, so 288 pieces were available for training ARTIST.
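A minimal sketch of this input coding, in Python with NumPy; the function name and the note-event format are assumptions for illustration, not the paper's actual code.

```python
# Sketch of ARTIST's F1 input coding, assuming MIDI-style note events.
import numpy as np

N_INPUT = 72   # 6 octaves x 12 semitones, one F1 node per semitone
DECAY = 0.5    # activation is halved at every new measure

def encode_measure(f1, notes):
    """Update the F1 activation vector at the end of a measure.

    f1    : current F1 activation (length 72)
    notes : list of (pitch_index, velocity) pairs played in the measure,
            pitch_index in [0, 71]; velocities are assumed to be already
            accentuated for notes falling on beats 1 and 3
    """
    f1 = f1 * DECAY                       # exponential temporal decay
    for pitch, velocity in notes:
        f1[pitch] += velocity             # activation proportional to loudness
    norm = np.linalg.norm(f1)
    return f1 / norm if norm > 0 else f1  # normalize before propagation to F2
```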
Upon presentation of an input to F1, the activation of the F2 nodes is computed. The degree of activation (or match) of each category depends on the similarity between its prototype (stored in the weights) and the input. The fuzzy AND operator (i.e., min) is used as the similarity measure, which is equivalent to computing the proportion of features common to both the input and the prototype. The most activated category is then chosen as the winner, to simulate lateral inhibition, and the other categories' activations are reset to zero. In learning mode, the weights of the winner are updated at this point (see next paragraph). Top-Down activation from the winner then propagates back to F1 as the new input measure is presented; the average of the two constitutes the new F1 activation, and the cycle starts again.
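The Bottom-Up pass, the winner-take-all step and the Top-Down averaging can be sketched as follows; normalizing the fuzzy AND by the input's total activation is our reading of "proportion of common features", and should be taken as an assumption.

```python
import numpy as np

def match(prototype, x):
    """Fuzzy AND (pointwise min) similarity: the proportion of
    features shared by the input and the prototype (assumed norm)."""
    return np.minimum(prototype, x).sum() / x.sum()

def f2_activations(weights, x):
    """Bottom-Up pass: match of every category prototype with the input."""
    return np.array([match(w, x) for w in weights])

def winner_take_all(scores):
    """Simulated lateral inhibition: only the best-matching F2 node
    keeps its activation; all others are reset to zero."""
    out = np.zeros_like(scores)
    winner = int(np.argmax(scores))
    out[winner] = scores[winner]
    return winner, out

def next_f1(winner_prototype, new_measure):
    """Top-Down propagation: the winner's prototype is averaged with
    the incoming measure to form the new F1 activation."""
    return (winner_prototype + new_measure) / 2.0
```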
During learning, the winner category is subjected to the vigilance test, a way of testing the hypothesis that the stimulus belongs to that category: if the match is higher than the set vigilance parameter, resonance occurs. That is, the input is considered to belong to the category, and the category's prototype is modified to reflect the new input. The new weights are a linear combination of the old weights and the input pattern being newly integrated into the category. They are then normalized, to avoid the problem of synaptic erosion (the weights decaying towards 0) and the resulting category proliferation and classification instability. If the vigilance test fails, that is, if the input is too different from even the best-fitting prototype, a new category is created and its prototype set equal to the input pattern. The vigilance parameter thus controls the generality of the categories. A low value (close to 0) generates a few broad categories with rather abstract prototypes, each representing many exemplars. Conversely, a high vigilance value (close to 1) creates many narrow categories containing few exemplars, or only one in the extreme case, which amounts to perfect memorization (but poor abstraction).
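The corresponding learning step might look like the sketch below; the learning-rate value shown is illustrative, as the text does not specify it at this point.

```python
import numpy as np

VIGILANCE = 0.55       # the value used in the simulations reported below
LEARNING_RATE = 0.1    # illustrative; the paper sets it to 1 for rote memorization

def learn_step(weights, x):
    """One learning cycle: vigilance test, then resonance or creation."""
    scores = [np.minimum(w, x).sum() / x.sum() for w in weights]
    if scores and max(scores) >= VIGILANCE:
        # Resonance: new prototype = linear combination of the old
        # prototype and the input, renormalized to prevent synaptic erosion
        i = int(np.argmax(scores))
        w = (1 - LEARNING_RATE) * weights[i] + LEARNING_RATE * x
        weights[i] = w / np.linalg.norm(w)
    else:
        # Vigilance failure: recruit a new F2 node whose prototype
        # is set equal to the input pattern
        weights.append(x / np.linalg.norm(x))
    return weights
```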
ARTIST's behaviour is quite robust with respect to the vigilance value, at least within the middle range. With the chosen value of 0.55, the presentation of the 41,004 input measures constituting the 288 pieces resulted in the creation of 709 categories.
When we are very familiar with a melody, we can usually still recognize it after various transformations such as transposition, rhythmic or tonal variations, etc. This is not the case when distractor (random) notes are added in between the melody notes: even the most familiar tunes become unrecognizable as long as the distractors 'fit in' (i.e., as long as no primary acoustic cue such as frequency range, timbre or loudness segregates them; Bregman, 1990). However, when given a few possibilities regarding the identity of the melody, listeners can positively identify it (Dowling, 1973). This means that Top-Down knowledge can be used to test hypotheses and categorize stimuli. For melodies, this knowledge takes the form of a pitch-time window within which the next note should occur, and enables auditory attention to be directed (Dowling, Lung & Herrbold, 1987; Dowling, 1990). As the number of possibilities offered to the subject increases, his ability to name that tune decreases: when Top-Down knowledge becomes less focused, categorization gets more difficult. With its built-in mechanism of Top-Down activation propagation, ARTIST can be subjected to the same task.
To get ARTIST to become very familiar with the first 2 measures of 'Twinkle twinkle little star', the learning rate and vigilance were set to their maximum (equal to 1 for both), so that the learning procedure would create two new F2 nodes to memorize those two exemplars and act as labels for the tune. Had the vigilance level been too low, the tune would have been assimilated into an already existing category, whose activation could not have been interpreted as recognition of the tune. After learning the tune, the activation in F2 was recorded under 5 conditions corresponding to the presentation of:
1) the melody alone;
2) the melody interleaved with distractor notes, while testing only the 'Twinkle' hypothesis (Top-Down activation from its label nodes);
3) the melody interleaved with distractors, while testing multiple hypotheses;
4) the melody interleaved with distractors, without any Top-Down activation;
5) distractors alone, while testing the 'Twinkle' hypothesis (control condition).
The control condition is necessary to make sure that testing the hypothesis that the tune is 'Twinkle twinkle...' by activating the label nodes does not always provoke false alarms.
For each condition, the activation ranks of the 2 label nodes were computed (1 for the most activated node, 2 for the next, and so on) and summed. A low summed rank indicates few categories competing and interfering with the recognition, and a probable "Yes" response to the question "Is this Twinkle twinkle?". As the sum of the ranks increases, meaning the label nodes are overpowered by other categories, the response moves towards "No".
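The summed-rank measure itself is straightforward; here is a sketch, assuming `f2` holds the recorded F2 activations and `label_nodes` the indices of the two label nodes.

```python
import numpy as np

def summed_ranks(f2, label_nodes):
    """Rank 1 = most activated F2 node. The sum of the two label
    nodes' ranks is low when few categories compete with the
    recognition of the tune (minimum possible sum: 1 + 2 = 3)."""
    order = np.argsort(-np.asarray(f2))   # indices by descending activation
    rank = {int(node): r + 1 for r, node in enumerate(order)}
    return sum(rank[n] for n in label_nodes)
```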
Results appear in Figure 1. The straight presentation of the tune results in the smallest possible summed rank, equal to 3: ARTIST recognizes the melody unambiguously. When the melody is presented with distractors, the ranks are higher, indicating some difficulty in recognition. Among these 3 conditions, the lowest ranks are found when testing the 'Twinkle' hypothesis and that one only. The label nodes are then amongst the top 5 most activated, which suggests a strong possibility of identifying the melody. Identification becomes much more difficult when testing multiple hypotheses, about as difficult as without Top-Down activation (no explicit hypothesis being tested), exactly as when human subjects are given no clue about the possible identity of the melody. Finally, the control condition shows that ARTIST does not imagine recognizing the melody amongst distractors when it is not there, even after priming the activation of its notes through Top-Down propagation.
Equivalent results can be obtained by summing the activations of the label nodes instead of computing their ranks. However, given the large number of categories, small differences in activation (especially in the middle range) translate into strong differences in rank, and the latter measure was therefore preferred to bring out the contrast between conditions. In any case, the ordering of the conditions by likelihood of recognizing the familiar melody is the same for ARTIST and humans, and the effects of melodic expectancies are readily observed in ARTIST.
The most general and concise characterization of tonality, and therefore of most Western music, probably comes from the work of Krumhansl (1990). With the probe-tone technique, she empirically quantified the relative importance of pitches within the context of any major or minor key, in what is known as the 'tonal hierarchies'. These findings are closely related to nearly every aspect of tonality and of pitch use: frequency of occurrence, accumulated durations, aesthetic judgements of all sorts (e.g., of pitch occurrences, chord changes or harmonizations), chord substitutions, resolutions, etc. Many studies support the cognitive reality of the tonal hierarchies (Jarvinen, 1995; Cuddy, 1993; Repp, 1996; Sloboda, 1985; Janata and Reisberg, 1988). All this suggests that subjecting ARTIST to the probe-tone technique is a good way to test whether it has extracted a notion of tonality (or its usage rules) from the music it was exposed to, or at least elements that enable a reconstruction of what tonality is.
The principle of the probe-tone technique is quite simple. A prototypical sequence of chords or notes is used as a musical context, to establish a sense of key. The context is followed by a note, the probe tone, which subjects have to rate on a scale reflecting how well the tone fits within this context. Repeating this procedure for all 12 possible probe notes generates the tone profile of the given key. Of the many types of contexts used by Krumhansl et al. over several experiments, the 3 standard ones were retained to test ARTIST: for each key and mode (major and minor), the corresponding chord, the ascending scale and the descending scale were used as contexts. Several keys are used so that the results do not depend on the choice of a particular reference pitch. Here all 12 keys are used (as opposed to 4 by Krumhansl), and ARTIST's profile is obtained by averaging the profiles obtained with the 3 contexts for each key, after transposition to a common tonic. Thus the tone profile obtained for each mode is the result of 432 trials (3 contexts × 12 keys × 12 probes). After each trial, the activations of the F2 nodes were recorded. Following Katz' (1999) idea that a network's total activation directly relates to pleasantness, the sum of all activations in F2 is taken as ARTIST's response to the stimulus, the index of its receptiveness/aesthetic judgement towards the musical sequence.
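A sketch of how the 432 trials could be combined into one profile follows; the helper `respond`, returning the total F2 activation after a context-plus-probe sequence, is an assumed interface, not a function from the paper.

```python
import numpy as np

def tone_profile(contexts_by_key, respond):
    """contexts_by_key maps each key (pitch class 0-11) to its 3
    contexts (chord, ascending scale, descending scale). `respond`
    returns the summed F2 activation after a (context, probe) trial."""
    profile, n = np.zeros(12), 0
    for key, contexts in contexts_by_key.items():        # 12 keys
        for context in contexts:                         # 3 contexts
            ratings = np.array([respond(context, probe)
                                for probe in range(12)]) # 12 probes
            profile += np.roll(ratings, -key)  # transpose to a common tonic
            n += 1
    return profile / n   # each of the 12 points averages 36 trials (432 in all)
```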
In the first studies using the probe-tone technique, it appeared that subjects judged the fitness of the probe tone more as a function of its distance in acoustic frequency from the last note played than as a function of tonal salience. This follows naturally from the usual structure of melodies, which favours small steps between consecutive notes. The problem was circumvented by using Shepard tones (Shepard, 1964). These are designed to preserve the pitch identity of a tone while removing the cues pertaining to its height, and their use to generate ever-ascending scale illusions proves that they indeed possess this property. Shepard tones are produced by generating all the harmonics of a note, filtered through a bell-shaped amplitude envelope. To simulate Shepard tones for ARTIST, each note is played simultaneously in all 6 octaves, with velocities (loudness) set according to the amplitude filter: high velocities for the middle octaves, decreasing as the notes approach the boundaries of the frequency range.
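A sketch of this simulated Shepard-tone input is given below; the exact bell shape is not specified in the text, so a raised-sine envelope is assumed here for illustration.

```python
import numpy as np

def shepard_input(pitch_class, n_octaves=6):
    """Activate one pitch class in every octave, with a bell-shaped
    velocity envelope: loud in the middle octaves, fading towards
    the boundaries of the frequency range."""
    f1 = np.zeros(12 * n_octaves)
    for octave in range(n_octaves):
        # assumed envelope: peaks mid-range, near zero at the edges
        velocity = np.sin(np.pi * (octave + 0.5) / n_octaves)
        f1[12 * octave + pitch_class] = velocity
    return f1 / np.linalg.norm(f1)   # normalized like any F1 input
```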
Figures 2 and 3 allow a direct comparison of the tone profiles obtained with human data (Krumhansl and Kessler, 1982) and with ARTIST, for major and minor keys respectively. Both Pearson correlation coefficients between the human and ARTIST profiles are significant: -.95 and -.91 respectively, p<.01 (2-tailed). Surprisingly, the correlations are negative, so ARTIST's profiles are inverted in the figures for easier comparison with the human data. This is discussed in the next section.
Once the tone profile for a particular key is available, the profiles of all other keys can be deduced by transposition. The correlation between two key profiles can then be computed as a measure of the distance between the two keys. This is the procedure used by Krumhansl to obtain all the inter-key distances, and the same was done with the ARTIST data, to check whether its notion of key distances conforms to that of humans. The graphs of the distances between C major and all minor keys are shown for both in Figure 4. Keys on the X-axis appear in the same order as around the circle of fifths. It is immediately apparent that the two profiles are close to identical. This is even more true of the key distances between C major and all major keys, as well as of the distances between C minor and all minor keys: the correlations between human and ARTIST data for major-major, minor-minor and major-minor key distances are .988, .974 and .972 respectively, all significant at p<.01.
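This transposition-and-correlation procedure is compact enough to sketch directly; `profile_a` and `profile_b` are 12-point tone profiles such as those obtained above.

```python
import numpy as np

def inter_key_distances(profile_a, profile_b):
    """Correlate profile_a with profile_b transposed to each of the
    12 keys; a higher correlation means two closer keys
    (Krumhansl's procedure)."""
    return np.array([np.corrcoef(profile_a, np.roll(profile_b, k))[0, 1]
                     for k in range(12)])
```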
Thus ARTIST clearly emulates human responses on the probe-tone task, and therefore can be said to have developed a notion of tonality, with the tonal invariants extracted directly from the musical environment.
The two simulations above show that it is easy to subject ARTIST, in a natural way, to the same musical tasks as are given to humans, and that it approximates human behaviour very closely on these tasks. When probed with the standard techniques, it shows both melodic and stylistic expectancies, the two main aspects of musical acculturation. ARTIST learns unsupervised, and its knowledge is acquired solely from exposure to music, so it is a realistic model of how musical mental schemata can be formed. The implication is that all that is needed to accomplish such complex musical processing and to develop mental schemata is a memory system capable of storing information according to similarity and of abstracting prototypes from similar inputs, while constantly interpreting the inputs through the filter of Top-Down (already acquired) knowledge. The joint action of these mental schemata results in musical processing that is sensitive to tonality. This property emerges from the internal organisation of the neural network; it is distributed over its whole architecture. Thus it can be said that the structure of tonality has been internalized. Testing ARTIST with other musical styles could further establish it as a general model of music perception.
In the simulation of the probe-tone task, ARTIST's response has to be recorded before any lateral inhibition occurs in F2. Otherwise, the sum of all activations in F2 would simply be that of the winner, all others being null, and much information regarding ARTIST's reaction would be lost. This takes one step further Gjerdingen's (1990) argument for using ANNs, namely that cognitive musical phenomena are probably too complex to be represented through the tidiness of a set of rules. In the present example, the simulation of complex human behaviour is achieved through the 'chaos' of the activation of hundreds of prototypes, each activated to a degree reflecting its resemblance to the input. This explains why the correlations between the human and ARTIST tone profiles are negative, and resolves the apparent contradiction with Katz' (1999) theory of the aesthetic ideal as 'unity in diversity': the global activation in F2 is a measure of diversity, not of unity. Lateral inhibition is the key element of that theory, but it is deliberately not used here, in order to preserve all aspects of the complexity of the abstract representation of a stimulus.
The major limitation of ARTIST in its current state is that it cannot account for transpositional invariance. Whether the perception of invariance under transposition can be acquired through learning at all is not obvious, as the question of how humans come to possess this ability is still open.