Orthographic, grapho-phonological, and morphological characteristics
of written words from French elementary textbooks


Understanding Manulex_infra

Consistency and Frequency of the grapho-phonological associations: The predictability of a word’s pronunciation or spelling from grapheme-phoneme (G-Ph) or phoneme-grapheme (Ph-G) associations is typically estimated by a consistency index. The consistency of G-Ph and Ph-G associations is a critical factor in learning to read and write in an alphabetic script. The term grapheme refers to a letter or a group of letters that corresponds to a phoneme. In French, graphemes include groups of letters such as 'ou', 'an', 'un', 'in', 'eu', 'ch', and 'gn'. The G-Ph consistency index is equal to the frequency of occurrence of a particular G-Ph association in words, divided by the total frequency of the grapheme regardless of its pronunciation, and multiplied by 100. Its maximum value is therefore 100. The consistency of a G-Ph association thus reflects the probability of associating a particular phoneme with a given grapheme. For example, the G-Ph consistency index of the association 'ch'->/S/ (as in the word 'chat' /Sa/) is obtained by dividing the frequency of occurrence of the association 'ch'->/S/ by the frequency of the grapheme 'ch', whatever its pronunciation (including /S/, but also /k/, for instance, as in 'choral' /koRal/), and multiplying the result by 100. Similarly, the Ph-G consistency index is equal to the frequency of occurrence of a particular Ph-G association, divided by the total frequency of the phoneme regardless of its spelling, and multiplied by 100.
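The two indices can be illustrated with a minimal Python sketch. The toy word list, its segmentations, and the function name consistency are hypothetical and are not taken from Manulex_infra; the sketch also assumes that every occurrence of an association inside a word is counted.

```python
from collections import Counter

# Hypothetical toy data: each word is a list of (grapheme, phoneme) pairs.
# '#' marks a silent letter. Segmentations are illustrative only.
words = {
    "chat":   [("ch", "S"), ("a", "a"), ("t", "#")],
    "chou":   [("ch", "S"), ("ou", "u")],
    "choral": [("ch", "k"), ("o", "o"), ("r", "R"), ("a", "a"), ("l", "l")],
    "kilo":   [("k", "k"), ("i", "i"), ("l", "l"), ("o", "o")],
}

def consistency(segmented_words, key_index):
    """Return {(grapheme, phoneme): consistency index in percent}.

    key_index=0: G-Ph consistency (denominator = frequency of the grapheme).
    key_index=1: Ph-G consistency (denominator = frequency of the phoneme).
    """
    pair_counts = Counter()
    unit_counts = Counter()
    for pairs in segmented_words.values():
        for pair in pairs:
            pair_counts[pair] += 1
            unit_counts[pair[key_index]] += 1
    return {
        pair: 100 * count / unit_counts[pair[key_index]]
        for pair, count in pair_counts.items()
    }

gph = consistency(words, key_index=0)
phg = consistency(words, key_index=1)

print(round(gph[("ch", "S")], 1))  # 66.7: 'ch' is read /S/ in 2 of 3 cases
print(phg[("ch", "k")])            # 50.0: /k/ is spelled 'ch' in 1 of 2 cases
```

In this toy list, 'ch' is fully consistent in the spelling direction for /S/ but only partly consistent in the reading direction, which shows why the two indices are reported separately.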

Consistency can vary considerably depending on the position of a grapheme or phoneme within a word. Due to the evolution of inflectional and derivational morphology rules in French, word endings are often silent, which reduces their orthographic consistency. Thus, to better characterize grapheme-phoneme correspondences in the database, frequency and consistency were computed separately for each position in the word: initial (first grapheme/phoneme), final (last grapheme/phoneme), or intermediate (graphemes/phonemes in the middle of the word). Manulex_infra also provides two types of consistency and frequency data: lexical and textual statistics. Lexical frequency (i.e., count by type) reflects the number of different words in the database that include a given G-Ph or Ph-G correspondence, with each word counted only once. Textual frequency (i.e., count by token) reflects the number of words in the texts that include the correspondence, with each word counted as often as it appears in the corpus. Thus, in a count by type, the frequency and consistency values of the G-Ph and Ph-G correspondences do not depend on how often words occur in the texts (rare or frequent words), whereas in a count by token they are weighted by the frequency of occurrence of the words.
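As a rough illustration of the difference between the two counts, the sketch below computes position-specific G-Ph consistency with either weighting. The mini-corpus, its token counts, and the function names are hypothetical; treating a one-grapheme word as "initial" is an assumption of this sketch, not a documented Manulex_infra rule.

```python
from collections import Counter

# Hypothetical mini-corpus: word -> (G-Ph segmentation, token count in texts).
corpus = {
    "chat":   ([("ch", "S"), ("a", "a"), ("t", "#")], 50),
    "chou":   ([("ch", "S"), ("ou", "u")], 40),
    "choral": ([("ch", "k"), ("o", "o"), ("r", "R"), ("a", "a"), ("l", "l")], 1),
}

def position_of(index, length):
    """Classify a segment as initial, final, or intermediate."""
    if index == 0:
        return "initial"  # assumption: a one-segment word counts as initial
    if index == length - 1:
        return "final"
    return "intermediate"

def gph_consistency_by_position(corpus, weighting):
    """weighting='type' counts each word once; 'token' weights by corpus count."""
    pair_counts = Counter()
    grapheme_counts = Counter()
    for segmentation, tokens in corpus.values():
        weight = tokens if weighting == "token" else 1
        for i, (grapheme, phoneme) in enumerate(segmentation):
            pos = position_of(i, len(segmentation))
            pair_counts[(pos, grapheme, phoneme)] += weight
            grapheme_counts[(pos, grapheme)] += weight
    return {
        key: 100 * count / grapheme_counts[(key[0], key[1])]
        for key, count in pair_counts.items()
    }

by_type = gph_consistency_by_position(corpus, "type")
by_token = gph_consistency_by_position(corpus, "token")

print(round(by_type[("initial", "ch", "S")], 1))   # 66.7
print(round(by_token[("initial", "ch", "S")], 1))  # 98.9
```

Because the hypothetical word 'choral' occurs only once in the texts, it lowers the type-based consistency of initial 'ch'->/S/ far more than the token-based one.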

Graphemic segmentation of words tends to be easy in French. As far as possible, each segment matches a single phoneme. When word segmentation was ambiguous, the decision was based on a second principle that segmentation should highlight inconsistencies in each word’s pronunciation and writing.

Silent letters at the end of many words required different approaches for analysing G-Ph associations and Ph-G associations. For example, the word 'nid' ends with a silent 'd', and this final G-Ph association is highly consistent when reading French, because 'd' in word-final position is almost always silent. Conversely, the final 'd' of the word 'sud' is pronounced, and the consistency of the rare G-Ph association 'd'->/d/ is low. When representing reading-related (G-Ph) statistics in Manulex_infra v.2, silent letters are coded with the character '#' (e.g., 'd' in 'lourd', 'p' in 'loup', nominal gender inflection 'e' in 'amie', nominal number inflection 's' in 'tables'). However, word-final silent letters were segmented differently when spelling was considered, because silent letters are not pronounced and therefore cannot be reported as standalone units (as in the word 'foulard', pronounced /fulaR/, where the final 'd' is silent and cannot be matched to a phoneme). In this case, the final silent letters were merged with the last pronounced grapheme so that accurate statistics could be reported. For example, the silent final 'd' in the word 'renard' is coded with the final 'r' (/R/-rd), in the word 'terre' the silent final 'e' is coded with the final 'rr' (/R/-rre), and in the word 'gare' the silent final 'e' is coded with the final 'r' (/R/-re). The same logic was applied to code other silent letters, such as 's' in 'jamais' and 't' in 'lait' (/E/-ais and /E/-ait, respectively).
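The merging of word-final silent letters for the spelling direction can be sketched as follows. This is a minimal illustration under stated assumptions: the function name merge_silent_finals and the input segmentations are hypothetical, and the actual Manulex_infra coding procedure may differ in its details.

```python
SILENT = "#"  # code used for silent letters in the reading (G-Ph) direction

def merge_silent_finals(gph_pairs):
    """Turn a reading-oriented segmentation into a spelling-oriented one.

    Word-final silent letters are attached to the last pronounced grapheme,
    and each pair is returned as (phoneme, grapheme).
    """
    pairs = list(gph_pairs)
    silent_tail = ""
    # Collect trailing silent letters, keeping at least one segment.
    while len(pairs) > 1 and pairs[-1][1] == SILENT:
        silent_tail = pairs.pop()[0] + silent_tail
    if silent_tail:
        grapheme, phoneme = pairs[-1]
        pairs[-1] = (grapheme + silent_tail, phoneme)
    return [(phoneme, grapheme) for grapheme, phoneme in pairs]

# Illustrative segmentations (phoneme codes follow the examples in the text).
lait = [("l", "l"), ("ai", "E"), ("t", "#")]
terre = [("t", "t"), ("e", "E"), ("rr", "R"), ("e", "#")]
gare = [("g", "g"), ("a", "a"), ("r", "R"), ("e", "#")]

print(merge_silent_finals(lait)[-1])   # ('E', 'ait')
print(merge_silent_finals(terre)[-1])  # ('R', 'rre')
print(merge_silent_finals(gare)[-1])   # ('R', 're')
```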

Final rime of words. Additional analyses are provided by considering the broader phonological context corresponding to the final phonological rime of the word. These additional analyses are also motivated by the observation that, in orthographic production, orthographic choices are partially a function of the rime context. In this case, a silent grapheme such as 'd' in the word 'renard' is no longer integrated into the Ph-G association /R/-'rd' but is treated as part of the final rime of the word, /aR/-'ard'. In Manulex_infra v.2, the phonological rime is defined as the last vowel of the word (other than a schwa), the possible semi-vowels preceding it, and the possible semi-vowels, consonants, or schwa following it. Note that a semi-vowel placed before a vowel is part of the rime because the rime’s spelling pattern partially depends on whether a semi-vowel is present. For example, nouns ending in /ɔ͂/ ('on') can be spelled in multiple ways (-on, -ons, -ond, -om, -ont, -onc, -omb, -ong...), whereas words ending in /jɔ͂/ are never spelled with a silent final consonant (-ion, -yon, -illon in 'nation', 'rayon', 'bouillon'). Similarly, the rime /aR/ can be spelled in multiple ways (-ard, -art, -are, -ar, -arre, -ars), but when paired with /w/, /aR/ can only be spelled -oir or -oire.
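The rime definition above can be sketched in Python. The phoneme codes, the vowel and semi-vowel sets, and the function name final_rime are assumptions made for illustration; they are not the coding scheme of Manulex_infra.

```python
# Hypothetical phoneme inventory (illustrative codes, schwa excluded from VOWELS).
VOWELS = {"a", "i", "u", "y", "e", "E", "o", "O", "a~", "o~", "e~"}
SEMIVOWELS = {"j", "w", "H"}
SCHWA = "@"
SILENT = "#"

def final_rime(gph_pairs):
    """Return (phonological rime, orthographic rime) or None if no full vowel.

    The rime starts at the last non-schwa vowel, extended leftwards over any
    semi-vowels that precede it, and runs to the end of the word.
    """
    vowel_positions = [i for i, (_, p) in enumerate(gph_pairs) if p in VOWELS]
    if not vowel_positions:
        return None
    start = vowel_positions[-1]
    while start > 0 and gph_pairs[start - 1][1] in SEMIVOWELS:
        start -= 1
    tail = gph_pairs[start:]
    phonemes = "".join(p for _, p in tail if p != SILENT)
    spelling = "".join(g for g, _ in tail)
    return phonemes, spelling

# Illustrative reading-oriented segmentations.
renard = [("r", "R"), ("e", SCHWA), ("n", "n"), ("a", "a"), ("r", "R"), ("d", SILENT)]
nation = [("n", "n"), ("a", "a"), ("t", "s"), ("i", "j"), ("on", "o~")]

print(final_rime(renard))  # ('aR', 'ard')
print(final_rime(nation))  # ('jo~', 'ion')
```

Working from the reading-oriented segmentation means that silent final letters fall into the orthographic rime automatically, as in /aR/-'ard'.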

Estimating the degree of difficulty of a word requires considering its consistency and frequency at the level of its final rime, but also at the level of G-Ph or Ph-G correspondences. Some words may be very consistent at the rime level but very inconsistent at the G-Ph or Ph-G level. For example, the word 'femme' is very consistent when considering the rime but inconsistent when considering G-Ph or Ph-G associations.

Analyses including or excluding nominal gender/number inflections and verb inflections. Grapho-phonological relations (consistency and frequency of G-Ph, Ph-G, and rime) are analyzed by including either:
• all the orthographic forms encountered in the textbooks
• or the orthographic forms that correspond to the associated lemma (or lexeme) only.
This second analysis describes grapho-phonological relations while excluding nominal gender inflections (final -e for feminine), nominal number inflections (final -s or -x for plural), and verb inflections (person, tense, mood). Note that words that appear in the textbooks only in an inflected form are not included in this second analysis.

Other variables coded in Manulex_infra v.2. The Manulex_infra v.2 database provides additional information on the lexical entries (a more detailed description is available under the 'download' tab):
• Least consistent or least frequent G-Ph or Ph-G association of the word (the association is provided along with the consistency or frequency values)
• Orthographic, phonological, graphemic, and syllabic length of the word
• Textbook word frequency at grade 1, grade 2, and grade 1 to 5 based on Manulex
• Syllabic segmentation of the phonological code
• G-Ph and Ph-G segmentation, phonological rime (and its orthographic equivalent)
• Orthographic neighborhood of the word according to the 'n-count' index or the 'Levenshtein distance' index. The n-count index provides the number of words that can be generated by substituting a single letter of the word with another one (e.g., 'rire' has as orthographic neighbors 'lire', 'rare', and 'rime'). The higher the value, the denser the orthographic neighborhood. The Levenshtein distance index (OLD20) provides the average number of orthographic modifications (letter substitution, letter transposition, letter deletion or addition) needed to turn the word into each of its 20 closest orthographic neighbors. The lower the value, the denser the orthographic neighborhood.
• Number of non-homograph homophones (e.g., port-porc-pore) for adjectives and nouns.
• Average frequency of bigrams (groups of two adjacent letters)
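The two orthographic neighborhood indices listed above can be sketched as follows. This is an illustration only: the tiny lexicon and the function names are hypothetical, and the sketch uses the standard Levenshtein distance (substitution, insertion, deletion), so a transposition costs two operations here, which may differ from the way the database counts it.

```python
def substitution_neighbors(word, lexicon):
    """Words of the same length that differ from `word` by exactly one letter."""
    return sorted(
        w for w in lexicon
        if len(w) == len(word) and sum(a != b for a, b in zip(w, word)) == 1
    )

def levenshtein(a, b):
    """Standard edit distance: substitutions, insertions, and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]

def old_n(word, lexicon, n=20):
    """Mean edit distance from `word` to its n closest words in the lexicon."""
    distances = sorted(levenshtein(word, w) for w in lexicon if w != word)
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon contains no other word")
    return sum(closest) / len(closest)

# Hypothetical toy lexicon; a real computation would use the full word list.
lexicon = ["rire", "lire", "rare", "rime", "chat"]

print(substitution_neighbors("rire", lexicon))  # ['lire', 'rare', 'rime']
print(old_n("rire", lexicon, n=3))              # 1.0
```

With a lexicon this small, n is set to 3 rather than 20; the n-count of 'rire' is simply the length of the list returned by substitution_neighbors.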