Main changes in Manulex Morpho ver.2
Manulex-Morpho Version 2 introduces major changes relative to version 1.
Changes are mostly the result of differences in (a) the
segmentations of G-Ph and Ph-G, (b) the positional coding of the
associations (initial, final or internal association), and (c)
the coding of Ph-G consistency when silent graphemes correspond
to nominal inflections (gender or number), as well as to
derivational / gender inflectional supports (e.g., the final 't'
in 'petit' vs. 'petite', the 'd' in 'bavard' vs. 'bavarde', see
'morphology codes' tab).
• G-Ph (reading direction) and Ph-G (writing direction)
segmentations are different. In version 1 of Manulex_Morpho,
associations were conceived to reflect reading processes, from
grapheme to phoneme. We used the same associations to reflect
writing, from phoneme to grapheme. This led to problems in
calculating consistency, notably when words include final silent
letters. For example, the word 'nid' ends with a silent
'd'. The G-Ph association on the final grapheme is highly
consistent because final 'd' in most words is silent. By
contrast, the final 'd' of the word 'sud' is pronounced,
therefore the consistency of this association (grapheme 'd'
associated with the phoneme /d/) in word final position is low.
When considering writing, Ph-G associations for word final
silent letters are difficult to predict, because they are not
pronounced. The solution adopted in Manulex_Morpho v.2 is to
combine them with the last pronounced phoneme. For example, the
word 'renard' ends in /R/, which is associated with the spelling
'-rd', whereas the final phoneme /R/ in 'gare' and 'terre' is
associated with '-re' and '-rre', respectively. The
pronunciation of 'renard' is thus compatible with the invented
spelling errors '*renarre' or '*renare'. The same coding
principle applies to silent letters that follow vowels:
'dans' is coded with /ã/-ans, and 'étang' with /ã/-ang.
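The merging of word-final silent letters into the last pronounced association can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to build the database: the function name `ph_g_from_g_ph`, the use of an empty string to mark a silent grapheme, and the phoneme symbols in the examples are all hypothetical.

```python
def ph_g_from_g_ph(g_ph_pairs):
    """Derive Ph-G associations from G-Ph pairs by attaching word-final
    silent graphemes to the last pronounced phoneme.

    g_ph_pairs: list of (grapheme, phoneme) tuples; phoneme == "" marks a
    silent grapheme (an assumed convention for this sketch).
    """
    pairs = list(g_ph_pairs)
    # Collect the run of silent graphemes at the end of the word.
    tail = []
    while len(pairs) > 1 and pairs[-1][1] == "":
        tail.insert(0, pairs.pop()[0])
    # Attach that run to the last pronounced association.
    last_grapheme, last_phoneme = pairs[-1]
    pairs[-1] = (last_grapheme + "".join(tail), last_phoneme)
    # Return the pairs in writing direction: (phoneme, grapheme).
    return [(ph, g) for g, ph in pairs]


# Illustrative segmentations; the phoneme symbols are assumptions.
renard = [("r", "R"), ("e", "°"), ("n", "n"), ("a", "a"), ("r", "R"), ("d", "")]
etang = [("é", "e"), ("t", "t"), ("an", "ã"), ("g", "")]

print(ph_g_from_g_ph(renard)[-1])  # ('R', 'rd')
print(ph_g_from_g_ph(etang)[-1])   # ('ã', 'ang')
```

Silent graphemes in internal position are left untouched by this sketch.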
• Positional coding of associations. 1) Noun gender inflections
are coded as final graphemes whether or not they are followed by
a number inflection. 2) Similarly, verbal inflections are coded
as final graphemes, whether or not they are followed by a gender
or number inflection (e.g. the final -ées of plural feminine
past participles). However, the verbal inflections -ant, -it,
-is, and -t, which mark the present and past participles, are
coded as internal when followed by a gender inflection. The
gender inflection modifies the pronunciation of the final consonant of
the verbal inflection (t, s) and places it in internal
position within the word. This exception is necessary to
preserve the consistency of pronunciations for verbal
inflections (e.g., -ant in 'glissant' and in 'glissante'). 3) In
order to process consonants and vowels the same way whether the
word is singular or plural, graphemes followed by a nominal
number inflection are coded as final, as in the final 'd' in
'bavards' or 'renards', the final 'et' in 'bouquets', and the
final 'a' in 'caméras'. This does not apply when graphemes are
followed by a vowel (as in the nominal gender inflection 'e').
• Ph-G consistency (writing direction) when silent graphemes
correspond to nominal inflections (gender or number) or
derivation/flexion supports: nominal inflections are indicated
with code '3' and derivation/flexion supports with code '6' (see
tab 'morphological codes'). The consistency of the Ph-G
associations for nominal inflections is established at 100%
since the probability that the word ends with an 'e' is 100% if
the word is gender inflected, and the probability that the word
ends in 's' or 'x' is 100% if the word is number inflected. The
consistency of derivation/flexion supports for silent consonants
(e.g., 't' in 'aliment', 'd' in 'grand') is also 100% since the
silent grapheme can be guessed based on derived words
('alimentation', 'alimentaire', 'grandeur') or inflected words
('grande'). Guessing a silent grapheme is generally
straightforward for endings in -b (plomb), -d (grand), -g (long,
sang), -l (cristal, gentil), -p (camp), -t (absent) even though
the pronunciation of the grapheme may differ between
derived/inflected forms (e.g., 'g' pronounced /g/ in 'longueur'
and /ʒ/ in 'longer'). Final -f changes to /v/ in
derived/inflected forms (sportif - sportive, neuf -
neuve), as a written -v in word-final position is
not legal in French spelling. Final -s (gros, gris,
frais), -x (choix, doux), and -z (riz) are associated with
the phonemes /s/, /z/, and /S/, and the Ph-G consistencies are thus
estimated according to the probability of each silent
grapheme given the phoneme. For example, the /z/ in 'choisir'
/SwaziR/ is compatible with 'x' in the word 'choix', but 's'
(*chois) or 'z' (*choiz) are also possible because /z/ can occur
in derived/inflected forms of words ending in -z or -s
(riz-rizière, gris-grise).
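As a rough illustration of how such a Ph-G consistency can be estimated, the sketch below treats consistency as the share of a phoneme's occurrences that are spelled with a given grapheme, expressed as a percentage. The function name and the counts are hypothetical; the toy observations are invented and are not taken from the database.

```python
from collections import Counter


def ph_g_consistency(pairs):
    """Return {(phoneme, grapheme): percentage}, where the percentage is the
    share of the phoneme's occurrences spelled with that grapheme."""
    pair_counts = Counter(pairs)
    phoneme_counts = Counter(ph for ph, _ in pairs)
    return {
        (ph, g): 100.0 * n / phoneme_counts[ph]
        for (ph, g), n in pair_counts.items()
    }


# Invented observations for illustration only (not database values):
# silent final graphemes that surface as /z/ in derived/inflected forms.
toy = [("z", "s")] * 6 + [("z", "x")] * 3 + [("z", "z")] * 1
scores = ph_g_consistency(toy)
print(scores[("z", "s")])  # 60.0
print(scores[("z", "x")])  # 30.0
```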
• For each word, the database provides the least consistent G-Ph
or Ph-G association, and the least frequent association. Note
that the least consistent association is not necessarily the
least frequent, and vice versa.
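A minimal sketch of this selection for a single word is given below. The tuple layout (label, consistency, frequency) and the values are hypothetical and only serve to show that the two minima can fall on different associations.

```python
def least_consistent_and_least_frequent(associations):
    """associations: list of (label, consistency_percent, frequency) tuples
    for one word. Returns the labels of the least consistent and of the
    least frequent association."""
    least_consistent = min(associations, key=lambda a: a[1])
    least_frequent = min(associations, key=lambda a: a[2])
    return least_consistent[0], least_frequent[0]


# Hypothetical values for one word, for illustration only.
word = [("A", 95.0, 12000), ("B", 40.0, 800), ("C", 70.0, 150)]
print(least_consistent_and_least_frequent(word))  # ('B', 'C')
```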
• Information theory measures (surprisal, entropy) are computed
on G-Ph and Ph-G associations. Entropy measures the level of
uncertainty associated with a probability distribution. It is
measured in bits (binary units of information). Applied to G-Ph (or Ph-G)
mappings, entropy measures the uncertainty associated with the
pronunciation of a given grapheme (or the spelling associated
with a given phoneme). For example, the pronunciation
uncertainty of the grapheme 'v' (as in 'ville') is zero because
'v' is always pronounced /v/. Conversely, the pronunciation of
the grapheme 'eu' is uncertain because that grapheme is
sometimes pronounced as in the word 'deux' and sometimes as in
the word 'neuf'. The entropy of a grapheme (or phoneme) is a
function of both the number of possible pronunciations of the
grapheme (or spellings for the same phoneme) and the probability
(consistency) of each of the G-Ph (or Ph-G) mappings. The
minimum entropy value is 0 (no uncertainty), as is the case for
the grapheme 'v'. For a given number of alternatives, entropy is
highest (maximum uncertainty) when all mappings are equally
probable, and this maximum grows with the number of alternatives. The higher the entropy
value, the higher the uncertainty. For example, the entropy of
the phoneme /ɑ̃/ at the end of words is very high, as there are
a dozen possible spellings (en, an, aon, emps, ang…). To
calculate entropy, we first calculate the surprisal
associated with each G-Ph (or Ph-G) association. It corresponds
to the negative base-2 logarithm of the
probability (consistency) of each G-Ph (or Ph-G) association.
The more likely the association, the less surprising it is.
Entropy is then the sum of these surprisals, each weighted by the
probability of its association.
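The two measures can be sketched with the standard Shannon definitions, taking each consistency as a probability between 0 and 1. The function names are illustrative, and the probabilities given for 'eu' are invented for the example rather than taken from the database.

```python
import math


def surprisal(p):
    """Surprisal in bits of an association with probability p (0 < p <= 1)."""
    return -math.log2(p)


def entropy(probabilities):
    """Entropy in bits: probability-weighted sum of surprisals."""
    return sum(p * surprisal(p) for p in probabilities if p > 0)


# A grapheme with a single pronunciation, such as 'v': no uncertainty.
print(entropy([1.0]))        # 0.0
# Two equally likely pronunciations: one bit of uncertainty.
print(entropy([0.5, 0.5]))   # 1.0
# Invented probabilities for the two pronunciations of 'eu'.
print(entropy([0.7, 0.3]))
```

With four equally probable alternatives the entropy is 2 bits, which shows that entropy can exceed 1 as the number of alternatives grows.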
• Modifications and corrections of several phonological codes
and segmentations into graphemes and phonemes
• The distinction between the two 'a' (/a/ of 'patte' and /ɑ/ of
'pâte') is removed from consistency calculation. They are
considered as the same phoneme.
• Words including the grapheme 'ai' ('maison', 'laine') can be
transcribed with /E/ or /e/. The consistency calculation
therefore treats both transcriptions as the same G-Ph association.
• Introduction of differences between always pronounced, always
silent, and optional 'e' (see 'phonetic codes' tab)
• The G-Ph consistency for the grapheme 'e' whose schwa is
optional ('gare', 'parle') is set to 100% since the 'e' may or
may not be pronounced.
• In the case of Ph-G associations only, the few rare silent
consonants in internal position (e.g. 'm' in 'automne', 'p' in
'baptême') are not present in the speech signal, and their Ph-G
consistency is therefore 0%.
• Cases where 'e' is followed by two identical consonants. The
coding of G-Ph associations was standardized by labelling 'e',
when followed by two identical consonants, as .e[CC]. (with CC
indicating 2 identical consonants). For example, the word 'femme'
is now coded as 'f.e[CC].mm.e'. This change in coding now
gives the word 'femme' a low consistency score because
'e' followed by a doublet is usually pronounced /e/ or /E/,
whereas it is pronounced /a/ in 'femme'.
However, this only happens when the 'e' is not included in a
morphologically coded group (subscript '6'; derivation/flexion
support) such as in 'ancienne' where 'enn' is coded '6enn'. This
coding of 'e[CC]' is only done for G-Ph associations but not for
Ph-G associations since, in spoken French, double consonants are
not distinguishable from single consonants. The coding of G-Ph
associations with 'e[CC]' applies to all words in order to
highlight inconsistencies (thus also to words such as 'ennui'
coded 'e[CC]-@.nn-n.ui-8i').
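The orthographic context that triggers the 'e[CC]' label can be detected with a simple pattern, as in the sketch below. This is an assumption-laden illustration: the consonant set and the function name are hypothetical, and the sketch ignores the morphological exception described above (it would also flag 'ancienne', where 'enn' is coded as a morphological group).

```python
import re

# 'e' followed by two identical consonant letters (assumed consonant set).
E_CC = re.compile(r"e([bcdfghjklmnpqrstvwxz])\1")


def has_e_cc(word):
    """True if the spelling contains 'e' followed by a doubled consonant."""
    return E_CC.search(word.lower()) is not None


for w in ["femme", "ennui", "terre", "gare"]:
    print(w, has_e_cc(w))
```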
• Coding has been modified when -eill or -eil are not preceded
by 'u' ('abeille', 'bienveillant', 'sommeil'). 'eil' and 'eill'
are now single blocks where 'il' or 'ill' are always associated
with the semi-vowel /j/, never with the consonant /l/.
• In order to eliminate some rare G-Ph or Ph-G associations,
proper names are excluded from the analyses
• The phonological CV structure and the identity of consonant
clusters have been added.
• From ver. 2.4, by-token values are computed using a log transform
of word frequency, log10(frequency+1).
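One way such a log-transformed frequency could enter a by-token consistency is sketched below, where each word contributes a weight of log10(frequency+1) instead of a count of 1. Only the transform itself comes from the description above; how the weights are aggregated, the function name, and the toy frequencies are assumptions made for illustration.

```python
import math
from collections import defaultdict


def by_token_consistency(entries):
    """entries: iterable of (phoneme, grapheme, word_frequency).
    Returns {(phoneme, grapheme): percentage}, weighting each word by
    log10(frequency + 1)."""
    pair_weight = defaultdict(float)
    phoneme_weight = defaultdict(float)
    for ph, g, freq in entries:
        w = math.log10(freq + 1)
        pair_weight[(ph, g)] += w
        phoneme_weight[ph] += w
    return {
        key: 100.0 * w / phoneme_weight[key[0]]
        for key, w in pair_weight.items()
    }


# Invented frequencies: the weights are log10(100)=2, log10(10)=1, log10(10)=1.
toy_tokens = [("z", "s", 99), ("z", "x", 9), ("z", "z", 9)]
token_scores = by_token_consistency(toy_tokens)
print(token_scores[("z", "s")])  # 50.0
```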