Orthographic, grapho-phonological, and morphological characteristics
of written words from French elementary textbooks


Main changes in Manulex_Infra version 2

• Separate G-Ph and Ph-G segmentations. In version 1 of Manulex_Infra, the description of grapho-phonological associations focused mainly on grapheme-to-phoneme mappings (reading direction). The same associations were then reused to analyze phoneme-to-grapheme mappings (writing direction). However, this procedure causes problems, especially (though not exclusively) when words include silent letters. Ph-G consistency is defined as the probability of writing a particular grapheme given the pronounced phoneme. For a silent grapheme, no phoneme is produced, so the letter to be written cannot be predicted with certainty (unless one knows the exact spelling of the word). Associations must therefore be described differently for reading (G-Ph) and writing (Ph-G). Both descriptions are now available in version 2 (see tab 'Understanding Manulex_Infra').

• Analyses of the consistency and frequency of grapho-phonological associations on the final rime unit of words.

• Information theory measures (surprisal, entropy, informational gain) are now computed on G-Ph and Ph-G associations.

• Analyses of lexemes (lemmas)

• For each word, the least consistent and the least frequent G-Ph or Ph-G associations are reported. Note that the least consistent association is not always the least frequent, and vice versa.

• Phonological codes and segmentations into graphemes and phonemes have been modified.

• The distinction between the two 'a' (/a/ of 'patte' and /ɑ/ of 'pâte') has been removed from the consistency calculation. Both are treated as the same phoneme.

• Words including the grapheme 'ai' ('maison', 'laine') can be transcribed as /E/ or /e/. The consistency calculation therefore treats both transcriptions as the same G-Ph association.

• The distinction between 'e' that is obligatorily pronounced, obligatorily silent, or realized as an optional schwa (see 'phonetic codes' tab) is now included in the analyses.

• The syllabic segmentation of words, in accordance with the coding of silent or non-silent 'e', is now included in the analyses.

• The G-Ph consistency of the grapheme 'e' whose schwa is optional ('gare', 'parle') is set to 100%, since the 'e' may or may not be pronounced.

• For Ph-G associations only, the rare silent consonants in internal position (e.g. 'm' in 'automne', 'p' in 'baptême') are not present in the speech signal, and their Ph-G consistency is therefore 0%.

• Case of 'e' followed by two identical consonants. In version 1 of Manulex_Infra, the orthographic sequences 'emm' and 'enn' were coded as a single graphemic group, while 'e' followed by other doublets ('err', 'ett') was coded as two units (e.g. 'e.rr' in 'terre'). The coding of 'emm' and 'enn' followed the segmentation principle of highlighting inconsistencies in word pronunciation, since these two sequences are pronounced differently in 'antenne' and 'flemme' than in 'solennel' and 'patiemment'. However, given the high number of adverbs in '-emment' sharing the association 'emm'-/am/ ('patiemment', 'évidemment', 'récemment'), a word like 'femme' was described as consistent. The coding for reading has been standardized: 'e' followed by two identical consonants is now coded as .e[CC]. (where CC stands for two identical consonants), so the word 'femme' is coded 'f.e[CC].mm.e'. With this coding, 'femme' receives a low consistency score, because 'e' followed by a doublet is usually pronounced /e/ or /E/. The 'e[CC]' coding is used only for G-Ph associations, not for Ph-G associations, since in French nothing in the spoken form of a word signals the presence of the doublet. For G-Ph associations, it applies to all words in order to highlight inconsistencies, including words such as 'ennui', coded 'e[CC]-@.nn-n.u-8.i-i'.

• The coding has been modified for cases where '-eill' or '-eil' is not preceded by 'u' ('abeille', 'bienveillant', 'sommeil'). 'eil' and 'eill' are now single blocks in which 'il' or 'ill' is always associated with the semi-vowel /j/, never with the consonant /l/.

• When reading is considered (G-Ph), verb endings in '-ent' are segmented 'en.t'. This is comparable to the description of verb endings in '-ant', '-ont', '-ons', etc. ('an.t', 'on.t', 'on.s').

• By-token values are computed using a log transform of word frequency, log10(frequency+1) (ver. 2.4).

• In order to eliminate some rare G-Ph or Ph-G associations, proper names are excluded from the analyses.
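
As an illustration only, the measures mentioned above (consistency as a conditional probability, surprisal, entropy, and by-token weighting with log10(frequency+1)) can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure actually used to build Manulex_Infra: the toy word list, its frequencies, the function names, and the use of base-2 logarithms for surprisal and entropy are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (grapheme, phoneme, word frequency).
# Frequencies are made up for the example; they are NOT Manulex values.
TOY_ASSOCIATIONS = [
    ("e[CC]", "E", 99.0),   # e.g. 'terre'
    ("e[CC]", "E", 9.0),    # e.g. 'antenne'
    ("e[CC]", "E", 9.0),    # e.g. 'flemme'
    ("e[CC]", "a", 999.0),  # e.g. 'femme'
    ("ou", "u", 99.0),      # unrelated grapheme, ignored when querying 'e[CC]'
]

def token_weight(frequency):
    """By-token weight of one word: log10(frequency + 1)."""
    return math.log10(frequency + 1)

def gph_distribution(associations, grapheme, by_token=True):
    """Return P(phoneme | grapheme) as a dict.

    by_token=True weights each word by log10(frequency + 1);
    by_token=False counts each word once (by-type).
    """
    totals = defaultdict(float)
    for g, ph, freq in associations:
        if g == grapheme:
            totals[ph] += token_weight(freq) if by_token else 1.0
    total = sum(totals.values())
    if total == 0:
        return {}  # grapheme not found in the data
    return {ph: w / total for ph, w in totals.items()}

def consistency_percent(dist, phoneme):
    """Consistency of one association, on a 0-100 scale."""
    return 100.0 * dist.get(phoneme, 0.0)

def surprisal(p):
    """Surprisal in bits (base-2 logarithm is an assumption here)."""
    return -math.log2(p)

def entropy(dist):
    """Shannon entropy in bits over the alternative pronunciations."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

if __name__ == "__main__":
    for by_token in (False, True):
        dist = gph_distribution(TOY_ASSOCIATIONS, "e[CC]", by_token=by_token)
        label = "by-token" if by_token else "by-type"
        print(label, {ph: round(p, 3) for ph, p in dist.items()},
              "entropy =", round(entropy(dist), 3))
```

With this toy data, the association 'e[CC]'-/a/ has a by-type consistency of 25% (one word out of four). Its by-token consistency is higher, about 42.9%, only because the single /a/ word was given a large made-up frequency. Swapping the roles of grapheme and phoneme in the tuples gives the corresponding Ph-G sketch.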