
Orthographic, grapho-phonological, and morphological characteristics
of written words from French elementary textbooks



Two types of files (.xlsx format) are available for download (detailed description below):

1. Lexical databases including the words and their characteristics (ver. 2.4.2). The ManuAll file contains all the lexical entries, whereas the ManuLemme file contains only the orthographic forms corresponding to the lemmas. The grapho-phonological statistics are computed independently for each file. The ManuLemme file makes it possible to characterize the grapho-phonological properties of words independently of gender/number inflections and verbal inflections. Note that words that appear in schoolbooks only in an inflected form are not included in this second analysis.

Filters are provided at the top of the word lists to help with selection.

2. General statistics derived from the lexical databases. Several files are available:

• Consistency and frequency of G-Ph, Ph-G, and word rime associations. Statistics are generated from all lexical entries in the ManuAll-Associations file and from the lemmas in the ManuLemme-Associations file.

• Other orthographic statistics computed from the lexical corpus of ManuAll. These statistics, gathered in the file ManuAll-OrthoStat, are a) the frequency of letters and b) the frequency of bigrams and trigrams. These data are identical to those described in Manulex_Infra version 1.

Note: Google Sheets allows you to browse the files from your Google Drive. To import files directly into your Google Drive, use Chrome and the "Save to Google Drive" extension available on the Chrome Web Store. Then right-click on the file link to save it to your Google Drive.
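
As an illustration of the letter, bigram, and trigram frequencies gathered in ManuAll-OrthoStat, the sketch below counts letter n-grams per type and per token. It is a minimal example, not the procedure used to build the files: the function name ngram_counts, the toy lexicon, and its frequency values are all hypothetical.

```python
from collections import Counter


def ngram_counts(lexicon, n):
    """Count letter n-grams in a {word: frequency} lexicon.

    Per type: every occurrence of the n-gram in a lexical entry adds 1.
    Per token: every occurrence adds the frequency of the word.
    """
    by_type, by_token = Counter(), Counter()
    for word, freq in lexicon.items():
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            by_type[gram] += 1
            by_token[gram] += freq
    return by_type, by_token


# Toy lexicon with made-up frequencies (not values from the database)
toy_lexicon = {"port": 10.0, "porte": 30.0, "sport": 5.0}

bigrams_type, bigrams_token = ngram_counts(toy_lexicon, 2)
trigrams_type, trigrams_token = ngram_counts(toy_lexicon, 3)
print(bigrams_type["or"], bigrams_token["or"])
```

Passing n=1 to the same function gives letter counts.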


• Orthographic and phonological codes
• Grammatical category
• Number of letters, phonemes, graphemes, syllables
• Graphemic complexity (n of letters / n of phonemes)
• Syllabification (phonological)
• Word frequency in Grade 1 (CP), Grade 2 (CE1), and Grade 1 to Grade 5 (cp-cm2) according to the Manulex database (U values taking into account the frequency dispersion of words in textbooks)
• Number of heterographic homophones (e.g., port-porc-pore) for singular adjectives and nouns
• Orthographic neighborhood (N-Count and Levenshtein OLD20 index)
• Average bigram frequency (values per type and per token), and bigram frequency as a function of position (initial bigram, internal bigram(s), final bigram)
• G-Ph segmentation and Ph-G segmentation
• Phonological rime and orthographic counterpart
• Frequency and consistency of G-Ph associations (values per type and per token) as a function of the position within the word (initial, internal, final)
• Frequency and consistency of Ph-G associations (values per type and per token) as a function of the position within the word (initial, internal, final)
• Least frequent and least consistent G-Ph and Ph-G associations in the word
• Consistency and frequency of orthography-to-phonology (reading direction) and phonology-to-orthography (spelling direction) associations for the phonological rime of words (values per type and per token)

(note: token values are based on word frequency from Grade 1 to Grade 5)
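
The two orthographic neighborhood measures listed above can be sketched as follows. This assumes that N-Count follows the usual definition of Coltheart's N (same-length words differing by one substituted letter) and that OLD20 is the mean Levenshtein distance to the 20 closest words in the lexicon; the function names and the toy lexicon are hypothetical, and the database's own computation may differ in its details.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def n_count(word, lexicon):
    """Number of same-length words differing from `word` by one letter."""
    return sum(
        1
        for other in lexicon
        if len(other) == len(word)
        and sum(x != y for x, y in zip(word, other)) == 1
    )


def old_k(word, lexicon, k=20):
    """Mean Levenshtein distance to the k closest words (OLD20 when k=20)."""
    distances = sorted(levenshtein(word, o) for o in lexicon if o != word)
    nearest = distances[:k]
    return sum(nearest) / len(nearest) if nearest else None


# Toy lexicon (illustration only; a real lexicon has far more than 20 entries)
toy_words = ["port", "porc", "pore", "sort", "mort", "porte", "sport", "sorte"]
print(n_count("port", toy_words), old_k("port", toy_words, k=3))
```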

• Orthographic and phonological codes
• Grammatical category
• Number of letters, phonemes, graphemes, syllables
• Graphemic complexity (n of letters / n of phonemes)
• Syllabification (phonological)
• Word frequency in Grade 1 (CP), Grade 2 (CE1), and Grade 1 to Grade 5 (cp-cm2) according to the Manulex database (U values taking into account the frequency dispersion of words in textbooks)
• G-Ph segmentation and Ph-G segmentation
• Phonological rime and orthographic counterpart
• Frequency and consistency of G-Ph associations (values per type and per token) as a function of the position within the word (initial, internal, final)
• Frequency and consistency of Ph-G associations (values per type and per token) as a function of the position within the word (initial, internal, final)
• Least frequent or least consistent G-Ph and Ph-G associations in the word
• Consistency and frequency of orthography-to-phonology (reading direction) and phonology-to-orthography (spelling direction) associations for the phonological rime of words (values per type and per token)

(note: token values are based on word frequency from Grade 1 to Grade 5)
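
To make the type/token distinction for consistency concrete, here is a minimal sketch. It assumes the common definition in which the consistency of a G-Ph association is its frequency divided by the summed frequency of all associations sharing the same grapheme (and the same phoneme for Ph-G). The function name, the input layout, and the toy values are hypothetical, and position within the word is ignored for brevity.

```python
from collections import defaultdict


def consistency(associations, direction="G-Ph"):
    """Return {(grapheme, phoneme): (type_consistency, token_consistency)}.

    associations: iterable of (grapheme, phoneme, word_frequency), one item
    per occurrence of the association in a lexical entry.
    direction: "G-Ph" groups by grapheme, "Ph-G" groups by phoneme.
    """
    pair_type, pair_token = defaultdict(int), defaultdict(float)
    unit_type, unit_token = defaultdict(int), defaultdict(float)
    for g, p, freq in associations:
        unit = g if direction == "G-Ph" else p
        pair_type[(g, p)] += 1
        pair_token[(g, p)] += freq
        unit_type[unit] += 1
        unit_token[unit] += freq
    result = {}
    for (g, p), n in pair_type.items():
        unit = g if direction == "G-Ph" else p
        result[(g, p)] = (n / unit_type[unit],
                          pair_token[(g, p)] / unit_token[unit])
    return result


# Toy data: plain labels stand in for the database's phonological codes,
# and the frequencies are made up.
toy_assoc = [("c", "k", 8.0), ("c", "k", 2.0), ("c", "s", 10.0), ("s", "s", 30.0)]
gph = consistency(toy_assoc, "G-Ph")
phg = consistency(toy_assoc, "Ph-G")
print(gph[("c", "k")], phg[("c", "s")])
```

Adding the position (initial, internal, final) to the grouping key would give position-specific values.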

ManuAll-Associations and ManuLemme-Associations
• G-Ph and Ph-G associations, rime (orthography-to-phonology; phonology-to-orthography)
• Frequency and consistency of associations (type and token values) as a function of the position within the word (initial, internal, final)
• Entropy and ‘surprisal’ of associations (type and token values) as a function of the position within the word (initial, internal, final)

(note: token values are based on word frequency from Grade 1 to Grade 5)
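
The description above does not state how entropy and surprisal are computed. The sketch below assumes the standard Shannon definitions applied to G-Ph associations: the surprisal of an association is -log2 of its probability given the grapheme, and the entropy of a grapheme is the expected surprisal over its possible phonemes. Names and counts are hypothetical; using summed word frequencies instead of counts would give token values.

```python
import math
from collections import defaultdict


def entropy_and_surprisal(counts):
    """counts: {(grapheme, phoneme): count or summed frequency}.

    Returns (entropy per grapheme, surprisal per association), in bits.
    Associations with a non-positive count are skipped.
    """
    totals = defaultdict(float)
    for (g, _), c in counts.items():
        if c > 0:
            totals[g] += c
    entropy = defaultdict(float)
    surprisal = {}
    for (g, p), c in counts.items():
        if c <= 0:
            continue
        prob = c / totals[g]
        surprisal[(g, p)] = -math.log2(prob)
        entropy[g] -= prob * math.log2(prob)
    return dict(entropy), surprisal


# Toy type counts (made up)
toy_counts = {("c", "k"): 3, ("c", "s"): 1, ("s", "s"): 4}
toy_entropy, toy_surprisal = entropy_and_surprisal(toy_counts)
print(toy_entropy["c"], toy_surprisal[("c", "s")])
```

A fully consistent grapheme has an entropy of 0 bits; the entropy rises as its pronunciations become more numerous and more evenly distributed.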