Workshop Materials
✨To the Collaborative Guidelines✨
Case Studies
#1 - HTR for Codex Palatinus Graecus 23
👥 Speakers
Mathilde Verstraete, Maxime Guénette (with the help of Alix Chagué and Marcello Vitali-Rosati)
📌 Overview
This case study focused on the development of an HTR model for the Codex Palatinus Graecus 23, the primary witness of the Greek Anthology.
📖 Sources
A single manuscript, divided into two parts: Palatinus graec. 23, pp. 1-614 and Parisinus Suppl. graec. 384, pp. 615-709
📋 Characteristics:
- Script: Xth century Byzantine round minuscule
- Hands: at least 4
- Readability and abbreviations: clear, with few abbreviations
- Layout: One column & a few scholia
🎯 Project Context
This project emerged from the Greek Anthology Project held at the Canada Research Chair on Digital Textualities.
🔬 Results
HTR models trained using Kraken and eScriptorium:
Meleagre-NFD-finetuned (91.05%), Meleagre-NFD (90.85%), Meleagre-NFC (91.00%)
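The "NFC" and "NFD" suffixes in the model names presumably refer to the Unicode normalization form applied to the ground-truth transcriptions (an assumption; the source does not spell this out). A minimal sketch of how the two forms differ for polytonic Greek, using Python's standard `unicodedata` module and a sample word chosen only for illustration:

```python
import unicodedata

# Sample polytonic word "anthropos", written with escapes so the starting
# form is unambiguous: U+1F04 is alpha with psili and oxia (precomposed).
word = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

# NFC keeps (or builds) precomposed characters where they exist.
nfc = unicodedata.normalize("NFC", word)

# NFD splits each accented letter into a base letter plus combining marks.
nfd = unicodedata.normalize("NFD", word)

print(len(nfc), len(nfd))  # 8 code points in NFC, 10 in NFD
print([f"U+{ord(c):04X}" for c in nfd[:3]])  # base alpha + two combining marks
```

The same visible text therefore yields a different character inventory depending on the form: NFD gives a recognition model fewer distinct classes but longer sequences, while NFC gives the opposite.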
📑 Repository & Guidelines
#2 - HTR for Byzantine manuscripts: recognising Ioannes Chrysostomus, Maximus Planudes, and Cyril of Alexandria
(→Slides←)
👥 Speaker
Elpida Perdiki, PhD Candidate, Democritus University of Thrace
📌 Overview
This is a presentation of three case studies, all concerning the training of HTR models on data from Byzantine manuscripts. The manuscripts in question transmit texts of a) Ioannes Chrysostomus, b) Maximus Planudes, and c) Cyril of Alexandria. The results are three separate models, not a single unified one, since these are three distinct projects.
📖 Sources & 📋 Characteristics
Based on the previously mentioned list, the manuscripts used are the following:
a) Ioannes Chrysostomus:
- (Q) Athos, Dionysiou 70, ff. 380v-387r & ff. 404v-411r;
- 10th c., minuscule, clear readability, minimal ligatures and abbreviations, two columns, minimal marginalia
- (H) Athos, Vatopedi 328, ff. 004r-010r & ff. 027r-033v;
- 14th c., minuscule, clear readability, few ligatures and abbreviations, two columns, minimal marginalia
- (A) Athens, Nat. Libr. 263, ff. 155r-159r & ff. 171v-176v;
- 10th c., minuscule, clear readability, few ligatures, abbreviations of nomina sacra, two columns, minimal marginalia
- (I) Alexandria, Patr. Libr. 34, ff. 97v-101v & ff. 112v-116v;
- 10th c., minuscule, clear readability, no ligatures, abbreviations only on nomina sacra, two columns, no marginalia;
- (D) Venice, ONB theol. gr.14, ff. 120v-127r;
- 10th/11th c., minuscule, clear readability, minimal ligatures and abbreviations, two columns, minimal marginalia
- (E) Paris, Bibl. Nat., Gr. 745, ff. 12r-18r;
- 12th c., minuscule, clear readability, almost no ligatures but nomina sacra and minimal abbreviated ending syllables, two columns, minimal marginalia
- (K) Munich, Gr. 377, ff. 134r-137v & ff. 146r-149r;
- 10th/11th c., minuscule, clear readability, minimal ligatures and abbreviations, two columns, minimal marginalia
- (L) Munich, Gr. 353, ff. 200v-205v & ff. 225v-234v;
- 10th c., minuscule, clear readability, minimal ligatures and abbreviations, two columns, almost no marginalia.
b) Maximus Planudes:
- (V) Vat. Urb. gr. 125, ff. 215r-223v;
- 13th c., minuscule – Maximus Planudes’ autograph, clear readability, few ligatures and abbreviations, one column, minimal marginalia
- (A) National Library of Scotland Adv.MS.18.7.15, ff. 55r-60r;
- 13th c., minuscule – Maximus Planudes’ autograph, somewhat clear readability, heavy on ligatures and abbreviations, one column, no marginalia.
c) Cyril of Alexandria:
- (M) Mytilene, Monastery of Leimon 228, ff. 4-10v;
- 13th c., minuscule, clear readability, minimal ligatures and abbreviations, one column, no marginalia
- (A) Athens, National Library of Greece 1082, ff. 10v-14v;
- 15th c., minuscule, clear readability, minimal ligatures and abbreviations, two columns, minimal marginalia
- (Z) Deskate (Zaborda), Saint Nikanor Monastery 95, ff. 57v-62r;
- 13th c., minuscule, somewhat clear readability, heavy on abbreviations mostly on ending syllables, two columns, minimal marginalia
- (X) Mount Athos, Xiropotamou Monastery 93 Lambros 2426, Item 6, ff. 438r-450v;
- 16th c., minuscule, clear readability, one column, abbreviated ending syllables and nomina sacra, no marginalia.
🎯 Project Context
a) Ioannes Chrysostomus: The model was curated at the Department of Greek Philology, Democritus University of Thrace by the author, for the purposes of her PhD dissertation.
b) Maximus Planudes: The model was curated at the Department of Greek Philology, Democritus University of Thrace by the following team: Maria Konstantinidou, Assistant Professor (team supervisor); Elpida Perdiki, PhD Candidate (data curation); Athanasia Kiorapostolou, Postgraduate Student (transcriber); Irene Mpogdanou, Postgraduate Student (transcriber); Athanasios Papadopoulos, Postgraduate Student (transcriber); Maria Tsikouraki, Postgraduate Student (transcriber). The project was conducted for the purposes of the postgraduate course “Palaeography I”.
c) Cyril of Alexandria: A sample of the Greek manuscripts transmitting the lexicon of Cyril of Alexandria. The model is curated by Maria Konstantinidou, Assistant Professor (Principal Investigator and team supervisor); Elpida Perdiki, PhD Candidate (data curation); Ioannis Kouroudis, PhD Candidate (transcriber); Nikolaos Tsoukatos, Postgraduate Researcher (transcriber). The project is under the scope of the DMC – Lexi research, implemented within the framework of H.F.R.I. call “Basic Research Financing (Horizontal support of all Sciences)” under the National Recovery and Resilience Plan “Greece 2.0” funded by the European Union – NextGenerationEU (H.F.R.I. Number: KE 014890).
🔬 Results
a) Ioannes Chrysostomus: The resulting model, “Chrysostomicus I” (ID: 44872), achieves a 3.90% CER. It was trained on the combined data of all the manuscripts listed above. A sample of the data is already available in the Zenodo repository and will be gradually updated with the full dataset.
b) Maximus Planudes: Four models were trained in total: two with data from ms. V alone and two with data from both mss. V and A. The results were 16%, 8.50%, 13.1%, and 8.9% respectively. They will be discussed in more detail during the presentation.
c) Cyril of Alexandria: The model is currently under development. Results to be announced.
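For readers unfamiliar with the metric, the CER (character error rate) quoted above is conventionally the character-level edit distance between the reference transcription and the model output, divided by the length of the reference. The sketch below illustrates that definition only; it is not the evaluation code used by any of these projects, and the function names `levenshtein` and `cer` are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Both strings should be in the same Unicode normalization form first,
# otherwise identical-looking accented letters are counted as errors.
print(f"{cer('kitten', 'sitting'):.2%}")
```

Under this definition, a 3.90% CER corresponds to roughly four character edits per hundred reference characters.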
#3 - HTR for Codex Genavensis Graecus 44
(→Slides←)
👥 Speakers
Ariane Jambé (postdoctoral researcher, Université de Lausanne)
📌 Overview
The proposed case study, which focuses on the development of an HTR model on eScriptorium for the Genavensis graecus 44, aims to raise two issues that have received relatively little attention in existing HTR projects dealing with Greek codices:
- the challenge of analysing the layout of exegetical manuscripts;
- the need to anticipate transcription problems likely to arise from the complexity of such layouts.
Accordingly, this case study seeks to shed light on some of the difficulties inherent in manuscripts that have been heavily used and annotated over time—a condition that characterises the Genavensis graecus 44, which was in use for nearly four centuries, first in Constantinople and later by the Genevan humanist Henri II Estienne. The study will explore how such challenges relate (or fail to relate) to the broader methodological reflection necessary for the creation of standards and guidelines within our field.
📖 Sources
The case study will draw on the Genavensis graecus 44, a thirteenth-century Byzantine manuscript of the Iliad. Its main point of interest lies in its rich paratext, which includes a prose paraphrase in Greek inserted between each verse (Α 1 to Μ 454, p. 1-526), as well as numerous scholia and interlinear glosses. This manuscript comprises 802 pages.
📋 Characteristics:
- Script: XIIIth century “scholarly” Byzantine round minuscule.
- Hands: One main hand (Gen I); a second hand responsible for numerous corrections (Gen II); and several additional hands whose identification is still debated (for example, Gen *II, possibly intervening between Gen I and Gen II, and Gen III, clearly of a later date).
- Readability and abbreviations: A distinction should be made here between legibility and readability. While legibility—understood as the ease with which one can decipher a string of characters—is generally quite high in the Genavensis, this assessment could be tempered in the case of the scholia and glosses, as certain writing modules are rather small and (standard) abbreviations are regularly employed. Readability, on the other hand—defined as the ability to trace the relationship between the poem and its paratext—is challenging.
- Layout: In the first part of the manuscript (p. 1-526), which contains the paraphrase, the main scribe’s editorial project appears to be as follows: in the principal text block, each Homeric verse is immediately followed by its prose paraphrase, while the margins (particularly the lateral ones) are occupied by scholia seemingly aligned with the corresponding verse. In the second half of the manuscript (p. 527-802), the main text block is divided into two unequal columns, the first reserved for the poem and the second for the scholia. However, and this is a central point of discussion in the context of an HTR project, the very category of “layout” may prove inadequate for documents that have been continuously annotated and reused. Disruptions to the original layout are frequent and do not always follow a systematic logic, since each successive owner added glosses and scholia wherever space was available.
🎯 Project Context
The development of an HTR model for the Genavensis graecus 44 forms part of a postdoctoral research project (August 2024 to July 2029), whose aim is to produce a digital edition capable of connecting the poem with its paratextual apparatus (paraphrase, scholia, and glosses).
🔬 Results
As the project is still in its early stages and the model is under development, no results have yet been published.
#4 - OCR for Patrologia Graeca: Recognizing Noisy 19th-Century Greek Editions
👥 Speakers
Chahan Vidal-Gorène (Calfa)
📌 Overview
This talk presents a case study on the development of a specialized OCR model for the Patrologia Graeca (PG), a large 19th-century printed collection of Christian Greek texts. It highlights the challenges of working with typographically complex, poorly digitized, and linguistically rich documents, and demonstrates the effectiveness of an active learning strategy based on iterative fine-tuning.
📖 Sources
The corpus consists of 161 volumes of the Patrologia Graeca, published by J.-P. Migne (1857–1866), encompassing works by Greek Church Fathers and Byzantine authors from the 1st to the 15th century.
📋 Characteristics:
- Script: XIXth century Greek minuscule print (variable quality)
- Hands: N/A (printed text, but significant variation across volumes)
- Readability and abbreviations: High use of diacritics, significant noise and print degradation; footnote markers excluded from transcription
- Layout: Dense dual-column layout (Greek and Latin), with marginalia, running titles, interlinear content, and footnotes
🎯 Project Context
The project is led by Calfa and the GREgORI initiative, under the academic supervision of Prof. Jean-Marie Auwers. It is part of the Calfa GREgORI Patrologia Graeca (CGPG) project, which aims to produce a machine-readable, enriched digital edition of the PG.
🔬 Results
Models were trained using Calfa Vision’s iterative fine-tuning approach, building on an initial HTR model developed for Codex Genavensis Graecus 44. Following an automatic classification phase to identify pages with low recognition performance, we show that transcribing and correcting just 10 pages reduced the Character Error Rate (CER) to 4.19%. With 50 pages, the CER dropped to 1.1% on a target document. The layout model reached a 95% mean IoU for Greek zone detection. A major challenge remains the accurate treatment of transversal text spanning multiple columns.
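The mean IoU (intersection over union) reported for Greek zone detection measures how closely predicted regions overlap the annotated ones. As a rough illustration, the sketch below computes IoU for axis-aligned rectangles and averages it over matched pairs. This is a simplifying assumption: the project's layout zones may well be polygons or pixel masks, and the `Box`, `iou` and `mean_iou` names are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle given by its top-left and bottom-right corners."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(b: Box) -> float:
    return max(0.0, b.x1 - b.x0) * max(0.0, b.y1 - b.y0)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area (0.0 if both boxes are empty)."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs: Sequence[Tuple[Box, Box]]) -> float:
    """Average IoU over already-matched (predicted, ground-truth) pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(iou(p, g) for p, g in pairs) / len(pairs)


predicted = Box(0, 0, 2, 2)
annotated = Box(1, 1, 3, 3)
print(round(iou(predicted, annotated), 3))
```

A mean IoU of 95% would thus indicate that, on average, predicted and annotated zones share 95% of their combined area.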
📑 Repository & Guidelines:
- Main GitHub;
- Intermediary models and dataset are available on Zenodo;
- Layout analysis model available for free on Calfa Vision under: Greek printed (Patrologia Graeca) type of project;
- Webpage of the project
#5 - A variant to HTR: Detection and recognition of Greek characters on papyri
(→Slides←)
👥 Speakers
Isabelle Marthot-Santaniello, University of Basel
📌 Overview
This talk will briefly explain why a character-based approach, rather than the line-based level of HTR, is more promising in the case of paleographic research on Greek papyri. It will present the work on the detection and recognition of Greek letters on papyrus done in the scope of the project EGRAPSA. It will also show how the project invites “humans in the loop” to curate the automatically generated output.
📖 Sources
The corpus studied in the project is constantly expanding. A first dataset served as material for the ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri (Seuret et al. 2023). It is composed of 185 images from 136 different manuscripts of Homer’s Iliad, covering the millennium between the 3rd c. BCE and the 7th c. CE.
📋 Characteristics:
- Script: Various kinds used to pen Homer’s Iliad over a millennium (from book hands to semi-cursive), variable quality of execution and preservation.
- Hands: Extremely varied, anonymous
- Readability and abbreviations: Mostly detached characters, a few are semi-detached (connected but not really ligatured, meaning in contact with but not distorted by the neighboring letters)
- Layout: mostly columns from scrolls or pages from codices, very rare interlinear interventions, some diacritics in a few manuscripts.
🎯 Project Context
The project entitled “EGRAPSA: Retracing the evolutions of handwritings in Graeco-Roman Egypt thanks to digital palaeography” is a Starting Grant funded by the Swiss National Science Foundation in Basel between June 2023 and May 2028. A team of Papyrologists and Computer Scientists joins forces to improve computer-assisted paleography, especially on the topic of script typology, dating and writer identification.
🔬 Results
Building on the ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri, the project reports a Mean Average Precision (mAP) of 52.43 for detection and 45.18 for recognition. However, recent evaluations based on human curation indicate that, on average, only 20% of the automatically generated output needs to be corrected by experts.
📑 Repository & Guidelines:
- Glyfix tool to curate automatically detected and recognized Greek characters available on Giuseppe de Gregorio’s GitHub: https://github.com/giuseppedeg/Annotation_correction_tool
- ICDAR 2023 competition dataset available on Zenodo
- Webpage of the project
📚 References on HTR and Ancient Greek