THE DIGITISED HOLLE LIST PROJECT: BUILDING A DATABASE FROM LEGACY MATERIALS FOR CONSERVING INDIGENOUS INDONESIAN LANGUAGES
Abstract
Advances in cloud computing and in computational tools for extracting text from images offer an opportunity to scale up the development of digital databases for Indigenous languages. This paper reports on the application of these advances to the digitisation of old, paper-based lexical items from over a hundred Indigenous languages of Indonesia; these items are part of the so-called Holle List (HL). After introducing the HL and its structure, the paper outlines the motivation for the HL digitisation project. It then provides an overview of Google Colab as a free cloud-computing platform for running a series of optical character recognition (OCR) operations on hundreds of scanned pages of the HL, using pytesseract, a Python interface for Google’s Tesseract-OCR engine. The paper discusses both the advantages of the plain-text OCR outputs (e.g., they can be searched and manipulated computationally) and their problems (especially typos and unrecognised characters). In conclusion, the paper highlights the importance of digital technology in conserving Indigenous languages, despite some unavoidable challenges that require human intervention.
Keywords: Holle List; Indigenous Indonesian languages; Digital Humanities; Lexical databases
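The OCR workflow summarised in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names, the 300 dpi setting, the "eng" language code, and the set of expected characters are all assumptions made for illustration. It assumes the scanned pages are available as a PDF and that pdf2image (with poppler) and pytesseract (with the Tesseract engine) are installed, as they can be on Google Colab.

```python
import string

# Hypothetical set of characters expected in a clean transcription;
# a real project would extend it with the diacritics used in the lists.
EXPECTED_CHARS = set(string.ascii_letters + string.digits + ".,;:()'-/")


def ocr_pdf_pages(pdf_path, lang="eng", dpi=300):
    """Return one plain-text string per page of a scanned PDF."""
    # Imported lazily so the helpers below run without the OCR stack.
    from pdf2image import convert_from_path  # requires poppler
    import pytesseract  # requires the Tesseract binary

    images = convert_from_path(str(pdf_path), dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def search_pages(pages, term):
    """Return (page_number, line) pairs whose line contains the term."""
    term = term.casefold()
    return [
        (page_number, line.strip())
        for page_number, page in enumerate(pages, start=1)
        for line in page.splitlines()
        if term in line.casefold()
    ]


def flag_suspect_lines(text, expected_chars=EXPECTED_CHARS):
    """Return (line_number, line, suspects) for lines with unexpected characters."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        suspects = sorted(
            {ch for ch in line if ch not in expected_chars and not ch.isspace()}
        )
        if suspects:
            flagged.append((line_number, line, suspects))
    return flagged


# Example usage (the file name is hypothetical):
# pages = ocr_pdf_pages("holle_list_scan.pdf")
# hits = search_pages(pages, "head")
# to_check = flag_suspect_lines(pages[0])
```

The two helpers mirror the trade-off described in the abstract: plain-text output can be searched programmatically, while lines containing unexpected characters are only flagged for a human to check rather than corrected automatically.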
References
Belval, E. (2024). Pdf2image: A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list. (Version 1.17.0) [Computer software]. https://badge.fury.io/py/pdf2image
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. https://doi.org/10.1080/10618600.2017.1384734
Drucker, J. (2021). The Digital Humanities coursebook: An introduction to digital methods for research and scholarship. Routledge. https://doi.org/10.4324/9781003106531
Fomin, M., & Toner, G. (2006). Digitizing a dictionary of Medieval Irish: The eDIL Project. Literary and Linguistic Computing, 21(1), 83–90. https://doi.org/10.1093/llc/fqh050
François, A. (2008). Semantic maps and the typology of colexification: Intertwining polysemous networks across languages. In M. Vanhove (Ed.), From polysemy to semantic change: Towards a typology of lexical semantic associations (pp. 163–215). John Benjamins Publishing Company. https://doi.org/10.1075/slcs.106.09fra
François, A. (2022). Lexical tectonics: Mapping structural change in patterns of lexification. Zeitschrift für Sprachwissenschaft, 41(1), 89–123. https://doi.org/10.1515/zfs-2021-2041
Google. (2026). Google Colaboratory. https://colab.google/
Hoffstaetter, S. (2024). pytesseract: A python wrapper for Google’s Tesseract-OCR (Version 0.3.13). https://pypi.org/project/pytesseract/
Holle, K. F. (1894). Blanco woordenlijst. Landsdrukkerij. https://hdl.handle.net/2027/coo.31924023363215
Krauße, D., Rajeg, G. P. W., Pramartha, C. R. A., Zobel, E., Nothofer, B., Hemmings, C., Ogilvie, S., Arka, I. W., & Dalrymple, M. (2024). EnoLEX: A diachronic lexical database for the Enggano language. https://doi.org/10.25446/oxford.28282169.v1
Lai, Y., & List, J.-M. (2023). Lexical data for the historical comparison of Rgyalrongic languages. Open Research Europe, 3, 99. https://doi.org/10.12688/openreseurope.16017.2
Llanes-Ortiz, G. (2023). Digital initiatives for indigenous languages. United Nations Educational, Scientific and Cultural Organization (UNESCO) & Stichting Global Voices. https://unesdoc.unesco.org/ark:/48223/pf0000387186
Rajeg, G. P. W. (2023a). CLDF dataset of the Enggano word list from 1895 in Stokhof and Almanar’s (1987) Holle List. https://doi.org/10.25446/oxford.23515788
Rajeg, G. P. W. (2023b). Digitised, searchable Holle List in Stokhof (1980). https://doi.org/10.25446/oxford.23205173
Rajeg, G. P. W., & Arka, I. W. (2025). Group Work in the Lexicography Class for the Holle List of the Barrier Islands Languages of Indonesia (Version 0.0.1) [Dataset]. https://doi.org/10.17605/OSF.IO/7TQG6
Rajeg, G. P. W., & Arka, I. W. (2025). The digitised and annotated Holle List of the Barrier Islands languages, off the west coast of Sumatra, Indonesia [Dataset]. Open Science Framework (OSF). https://doi.org/10.17605/OSF.IO/P8A3R (Original work published 2024)
Rajeg, G. P. W., Arka, I. W., Pramartha, C. R. A., & Sangian, E. Z. (2025). The data science behind the curation of the Holle List: A case study from the Enggano Holle List and its neighbouring Barrier Islands Languages [Presentation]. Oceanic and Southeast Asian Navigators (OCSEAN) Conference, Faculty of Humanities, Udayana University. University of Oxford. https://doi.org/10.25446/oxford.29625407.v1
Rajeg, G. P. W., Krauße, D., & Pramartha, C. (2024). EnoLEX: A diachronic lexical database for the Enggano language. In A. Inoue, N. Kawamoto, & M. Sumiyoshi (Eds.), AsiaLex 2024 Proceedings: Asian Lexicography - Merging cutting-edge and established approaches (pp. 123–132). https://doi.org/10.25446/oxford.27013864
Rzymski, C., Tresoldi, T., Greenhill, S. J., Wu, M.-S., Schweikhard, N. E., Koptjevskaja-Tamm, M., Gast, V., Bodt, T. A., Hantgan, A., Kaiping, G. A., Chang, S., Lai, Y., Morozova, N., Arjava, H., Hübler, N., Koile, E., Pepper, S., Proos, M., Van Epps, B., … List, J.-M. (2020). The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. Scientific Data, 7(1), 13. https://doi.org/10.1038/s41597-019-0341-x
Smith, R. (2007). An Overview of the Tesseract OCR Engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 2, 629–633. https://doi.org/10.1109/ICDAR.2007.4376991
Stokhof, W. A. L. (Ed.). (1980). Holle lists: Vocabularies in languages of Indonesia, Vol. 1: Introductory volume (Materials in Languages of Indonesia). Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. https://doi.org/10.15144/PL-D17
Stokhof, W. A. L., & Almanar, A. E. (Eds.). (1986). Holle lists: Vocabularies in languages of Indonesia, Vol. 8: Kalimantan (Borneo). Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. https://doi.org/10.15144/PL-D69
Stokhof, W. A. L., & Almanar, A. E. (Eds.). (1987). Holle lists: Vocabularies in languages of Indonesia, Vol. 10/3: Islands off the west coast of Sumatra (Materials in Languages of Indonesia). Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. http://hdl.handle.net/1885/144589
Stokhof, W. A. L., Saleh-Bronckhorst, L., & Almanar, A. E. (Eds.). (1982). Holle lists: Vocabularies in languages of Indonesia, Vol. 5/1: Irian Jaya: Austronesian languages; Papuan languages, Digul area (Materials in Languages of Indonesia). Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. http://hdl.handle.net/1885/144577
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (Second edition). O’Reilly. https://r4ds.hadley.nz/