

Orthographic, grapho-phonological, and morphological characteristics
of written words from French elementary textbooks







Manulex-Morpho



Main changes in Manulex Morpho ver.2

Manulex-Morpho version 2 introduces major changes relative to version 1. Most of them result from differences in (a) the G-Ph and Ph-G segmentations, (b) the positional coding of associations (initial, internal, or final), and (c) the coding of Ph-G consistency when silent graphemes correspond to nominal inflections (gender or number) or to derivational or gender-inflectional supports (e.g., the final 't' in 'petit' vs. 'petite', the 'd' in 'bavard' vs. 'bavarde'; see the 'morphology codes' tab).

• G-Ph (reading direction) and Ph-G (writing direction) segmentations now differ. In version 1 of Manulex_Morpho, associations were designed to reflect reading processes, from grapheme to phoneme, and the same associations were reused to reflect writing, from phoneme to grapheme. This led to problems in calculating consistency, notably for words with final silent letters. For example, the word 'nid' ends with a silent 'd'. In the reading direction, this G-Ph association is highly consistent because final 'd' is silent in most words. By contrast, the final 'd' of 'sud' is pronounced, so the consistency of that association (grapheme 'd' with phoneme /d/) in word-final position is low. In the writing direction, Ph-G associations for word-final silent letters are difficult to predict, precisely because these letters are not pronounced. The solution adopted in Manulex_Morpho v.2 is to combine them with the last pronounced phoneme. For example, 'renard' ends in /R/, which is associated with the spelling '-rd', whereas the final /R/ in 'gare' and 'terre' is associated with '-re' and '-rre', respectively. The pronunciation of 'renard' is thus compatible with invented misspellings such as '*renarre' or '*renare'. The same coding principle applies to silent letters that follow vowels: 'dans' is coded with /ã/-ans, and 'étang' with /ã/-ang.

• Positional coding of associations. 1) Noun gender inflections are coded as final graphemes whether or not they are followed by a number inflection. 2) Similarly, verbal inflections are coded as final graphemes whether or not they are followed by a gender or number inflection (e.g., the final -ées of feminine plural past participles). However, the verbal inflections -ant, -it, -is, and -t, which mark the present and past participles, are coded as internal when followed by a gender inflection. The gender inflection modifies the pronunciation of the final consonant of the verbal inflection (t, s) and places it in internal position within the word. This exception is necessary to preserve the consistency of pronunciations for verbal inflections (e.g., -ant in 'glissant' and in 'glissante'). 3) So that consonants and vowels are processed the same way whether the word is singular or plural, graphemes followed by a nominal number inflection are coded as final, as in the final 'd' in 'bavards' or 'renards', the final 'et' in 'bouquets', and the final 'a' in 'caméras'. This does not apply when graphemes are followed by a vowel (as in the nominal gender inflection 'e').

• Ph-G consistency (writing direction) when silent graphemes correspond to nominal inflections (gender or number) or derivation/flexion supports: nominal inflections are indicated with code '3' and derivation/flexion supports with code '6' (see tab 'morphological codes'). The consistency of the Ph-G associations for nominal inflections is set to 100%, since the probability that the word ends with an 'e' is 100% if the word is gender inflected, and the probability that the word ends in 's' or 'x' is 100% if the word is number inflected. The consistency of derivation/flexion supports for silent consonants (e.g., 't' in 'aliment', 'd' in 'grand') is also 100%, since the silent grapheme can be guessed from derived words ('alimentation', 'alimentaire', 'grandeur') or inflected words ('grande'). Guessing a silent grapheme is generally straightforward for endings in -b (plomb), -d (grand), -g (long, sang), -l (cristal, gentil), -p (camp), and -t (absent), even though the pronunciation of the grapheme may differ between derived/inflected forms (e.g., 'g' pronounced /g/ in 'longueur' and /ʒ/ in 'longer'). A final -f corresponds to /v/ in derived/inflected forms (sportif - sportive, neuf - neuve), because a word-final 'v' is not a legal spelling in French. Final -s (gros, gris, frais), -x (choix, doux), and -z (riz) are associated with the phonemes /s/, /z/, and /S/ in derived/inflected forms, so their Ph-G consistencies are estimated from the probability of each silent grapheme given the phoneme. For example, the /z/ in 'choisir' /SwaziR/ is compatible with 'x' in the word 'choix', but 's' (*chois) or 'z' (*choiz) are also possible, because /z/ can occur in derived/inflected forms of words ending in -z or -s (riz-rizière, gris-grise).

• For each word, the database provides the least consistent G-Ph or Ph-G association, and the least frequent association. Note that the least consistent association is not necessarily the least frequent, and vice-versa.

• Information theory measures (surprisal, entropy) are computed on G-Ph and Ph-G associations. Entropy measures the level of uncertainty associated with a probability distribution. It is measured in bits (binary units of information). Applied to G-Ph (or Ph-G) mappings, entropy measures the uncertainty associated with the pronunciation of a given grapheme (or the spelling associated with a given phoneme). For example, the pronunciation uncertainty of the grapheme 'v' (as in 'ville') is zero because 'v' is always pronounced /v/. Conversely, the pronunciation of the grapheme 'eu' is uncertain because that grapheme is sometimes pronounced as in the word 'deux' and sometimes as in the word 'neuf'. The entropy of a grapheme (or phoneme) is a function of both the number of possible pronunciations of the grapheme (or spellings of the phoneme) and the probability (consistency) of each of the G-Ph (or Ph-G) mappings. The minimum entropy value is 0 (no uncertainty), as is the case for the grapheme 'v'. The maximum entropy value (maximum uncertainty) depends on both the number of alternatives and the probability distribution of the mappings: for a given number of alternatives, it is reached when all mappings are equally probable. The higher the entropy value, the higher the uncertainty. For example, the entropy of the phoneme /ɑ̃/ at the end of words is very high, as there are a dozen possible spellings (en, an, aon, emps, ang…). To calculate entropy, we first need the surprisal ('surprise') associated with each G-Ph (or Ph-G) association. It is the negative base-2 logarithm of the probability (consistency) of that association. The more likely the association, the less surprising it is. Entropy is then the probability-weighted average of the surprisals of all alternatives.

• Modifications and corrections of several phonological codes and segmentations into graphemes and phonemes

• The distinction between the two 'a' (/a/ of 'patte' and /ɑ/ of 'pâte') is removed from consistency calculation. They are considered as the same phoneme.

• Words including the grapheme 'ai' ('maison', 'laine') can be transcribed with /E/ or /e/. Consistency calculations therefore treat both transcriptions as the same G-Ph association.

• Introduction of differences between always pronounced, always silent, and optional 'e' (see 'phonetic codes' tab)

• The G-Ph consistency for the grapheme 'e' whose schwa is optional ('gare', 'parle') is set to 100%, since the 'e' may or may not be pronounced.

• In the case of Ph-G associations only, the few rare silent consonants in internal position (e.g. 'm' in 'automne', 'p' in 'baptême') are not present in the speech signal, and their Ph-G consistency is therefore 0%.

• Cases where 'e' is followed by two identical consonants. The coding of G-Ph associations was standardized by coding 'e' followed by two identical consonants as .e[CC]. (with CC indicating two identical consonants). For example, the word 'femme' is now coded as 'f.e[CC].mm.e'. With this coding, the word 'femme' receives a low consistency score, because 'e' followed by a double consonant is usually pronounced /e/ or /E/. However, this coding applies only when the 'e' is not part of a morphologically coded group (subscript '6'; derivation/flexion support), as in 'ancienne', where 'enn' is coded '6enn'. The 'e[CC]' coding is used for G-Ph associations only, not for Ph-G associations, since double consonants are not distinguishable from single consonants in spoken French. The 'e[CC]' coding of G-Ph associations applies to all words in order to highlight inconsistencies (thus also to words such as 'ennui', coded 'e[CC]-@.nn-n.ui-8i').

• Coding has been modified when -eill or -eil are not preceded by 'u' ('abeille', 'bienveillant', 'sommeil'). 'eil' and 'eill' are now single blocks where 'il' or 'ill' are always associated with the semi-vowel /j/, never with the consonant /l/.

• In order to eliminate some rare G-Ph or Ph-G associations, proper names are excluded from the analyses.

• The phonological CV structure and the identity of consonant clusters have been added.

• From ver. 2.4, by-token values are computed using a log transform of word frequency, log10(frequency + 1).
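
The consistency, surprisal, and entropy measures described above can be illustrated with a minimal sketch. It is not the code used to build the database: the function names, the input format (one triple per word containing the association), and the toy frequencies are all hypothetical. In particular, it assumes that a by-token value is obtained by weighting each word by log10(frequency + 1), which is one possible reading of the ver. 2.4 note.

```python
import math
from collections import defaultdict


def consistency_table(observations, by_token=False):
    """Estimate P(target | source) from (source, target, word_frequency) triples.

    Each triple stands for one word containing the association.
    by_token=False: every word counts once (by-type).
    by_token=True: every word counts log10(frequency + 1) (assumed weighting).
    """
    weights = defaultdict(lambda: defaultdict(float))
    for source, target, frequency in observations:
        weight = math.log10(frequency + 1) if by_token else 1.0
        weights[source][target] += weight
    table = {}
    for source, targets in weights.items():
        total = sum(targets.values())
        if total == 0:
            continue  # all words had frequency 0: nothing to estimate
        table[source] = {t: w / total for t, w in targets.items()}
    return table


def surprisal(probability):
    """Surprisal in bits: negative base-2 logarithm of the probability."""
    return -math.log2(probability)


def entropy(distribution):
    """Entropy in bits: probability-weighted average surprisal."""
    return sum(p * surprisal(p) for p in distribution.values() if p > 0)


# Toy G-Ph observations with invented frequencies (not Manulex data).
TOY_OBSERVATIONS = [
    ("v", "/v/", 120),
    ("v", "/v/", 30),
    ("eu", "as in 'deux'", 99),
    ("eu", "as in 'neuf'", 9),
]

by_type = consistency_table(TOY_OBSERVATIONS)
by_token = consistency_table(TOY_OBSERVATIONS, by_token=True)
```

With these toy values, `entropy(by_type["v"])` is 0 bits because 'v' has a single pronunciation, while `entropy(by_type["eu"])` is 1 bit because the two pronunciations are equally likely by type. The by-token table gives the more frequent toy word a larger weight, so the entropy of 'eu' drops below 1 bit.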