THE DIGITISED HOLLE LIST PROJECT: BUILDING A DATABASE FROM LEGACY MATERIALS FOR CONSERVING INDIGENOUS INDONESIAN LANGUAGES

Authors

  • Gede Primahadi Wijaya Rajeg, Universitas Udayana

Abstract

Advances in cloud computing and in computational tools for extracting text from images offer an opportunity to scale up the development of digital databases for Indigenous languages. This paper reports on the application of these advances to the digitisation of old, paper-based lexical items from over a hundred Indigenous languages of Indonesia; these items form part of the so-called Holle List (HL). After introducing the HL and its structure, the paper sets out the motivation for the HL digitisation project. It then provides an overview of Google Colab as a free cloud-computing platform for running a series of optical character recognition (OCR) operations on hundreds of scanned pages of the HL, using pytesseract, a Python interface to Google’s Tesseract-OCR engine. Advantages of the plain-text OCR outputs (e.g., computational searchability and manipulability), as well as issues with them (especially typos and unrecognised characters), are discussed. In conclusion, the paper highlights the importance of digital technology in conserving Indigenous languages via digital platforms, despite some unavoidable challenges that require human intervention.
Keywords: Holle List; Indigenous Indonesian languages; Digital Humanities; Lexical databases
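
As a rough illustration of the workflow summarised in the abstract, the Python sketch below runs OCR over a sequence of page images and then checks the plain-text output for the two issues mentioned (unrecognised characters and searchability). It is a minimal sketch under stated assumptions, not the project's actual code: the helper names (`ocr_pages`, `flag_suspect_lines`, `search_pages`) and the set of "expected" characters are hypothetical, and only `pytesseract.image_to_string` is a call provided by the library itself.

```python
import re
from typing import Callable, Iterable, List, Tuple


def ocr_image(image) -> str:
    """OCR one page image (a PIL image or an image file path) with pytesseract."""
    # Imported lazily so the helpers below also run where Tesseract is not installed.
    import pytesseract
    return pytesseract.image_to_string(image)


def ocr_pages(pages: Iterable, ocr: Callable[..., str] = ocr_image) -> List[str]:
    """Run the OCR callable over every page and return one plain-text string per page."""
    return [ocr(page) for page in pages]


# Assumption: characters outside this set are treated as possibly misrecognised.
# A real word list with special phonetic symbols or combining marks might need a wider set.
SUSPECT = re.compile(r"[^\w\s.,;:'()/-]")


def flag_suspect_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that contain a suspect character."""
    return [(n, line) for n, line in enumerate(text.splitlines(), start=1)
            if SUSPECT.search(line)]


def search_pages(page_texts: List[str], term: str) -> List[int]:
    """Return 1-based page numbers whose OCR text contains the term, ignoring case."""
    return [n for n, text in enumerate(page_texts, start=1)
            if term.lower() in text.lower()]
```

In a Colab notebook, the page images could plausibly come from `pdf2image.convert_from_path`, which the reference list cites, although the abstract does not say how the scans were loaded. Lines returned by `flag_suspect_lines` would still need human checking, in line with the abstract's point that some challenges require human intervention.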

Author Biography

  • Gede Primahadi Wijaya Rajeg, Universitas Udayana

    Computer-assisted Lexicology and Lexicography (CompLexico) Research Group; Centre for Interdisciplinary Research on the Humanities and Social Sciences (CIRHSS); Bachelor of English Literature, Faculty of Humanities, Udayana University

References

Belval, E. (2024). Pdf2image: A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list. (Version 1.17.0) [Computer software]. https://badge.fury.io/py/pdf2image

Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. https://doi.org/10.1080/10618600.2017.1384734

Drucker, J. (2021). The Digital Humanities coursebook: An introduction to digital methods for research and scholarship. Routledge. https://doi.org/10.4324/9781003106531

Fomin, M., & Toner, G. (2006). Digitizing a dictionary of Medieval Irish: The eDIL Project. Literary and Linguistic Computing, 21(1), 83–90. https://doi.org/10.1093/llc/fqh050

François, A. (2008). Semantic maps and the typology of colexification: Intertwining polysemous networks across languages. In M. Vanhove (Ed.), From polysemy to semantic change: Towards a typology of lexical semantic associations (pp. 163–215). John Benjamins Publishing Company. https://doi.org/10.1075/slcs.106.09fra

François, A. (2022). Lexical tectonics: Mapping structural change in patterns of lexification. Zeitschrift für Sprachwissenschaft, 41(1), 89–123. https://doi.org/10.1515/zfs-2021-2041

Google. (2026). Google Colaboratory. https://colab.google/

Hoffstaetter, S. (2024). pytesseract: A python wrapper for Google’s Tesseract-OCR (Version 0.3.13). https://pypi.org/project/pytesseract/

Holle, K. F. (1894). Blanco woordenlijst. Landsdrukkerij. https://hdl.handle.net/2027/coo.31924023363215

Krauße, D., Rajeg, G. P. W., Pramartha, C. R. A., Zobel, E., Nothofer, B., Hemmings, C., Ogilvie, S., Arka, I. W., & Dalrymple, M. (2024). EnoLEX: A diachronic lexical database for the Enggano language. https://doi.org/10.25446/oxford.28282169.v1

Lai, Y., & List, J.-M. (2023). Lexical data for the historical comparison of Rgyalrongic languages. Open Research Europe, 3, 99. https://doi.org/10.12688/openreseurope.16017.2

Llanes-Ortiz, G. (2023). Digital initiatives for indigenous languages. United Nations Educational, Scientific and Cultural Organization (UNESCO) & Stichting Global Voices. https://unesdoc.unesco.org/ark:/48223/pf0000387186

Rajeg, G. P. W. (2023a). CLDF dataset of the Enggano word list from 1895 in Stokhof and Almanar’s (1987) Holle List. https://doi.org/10.25446/oxford.23515788

Rajeg, G. P. W. (2023b). Digitised, searchable Holle List in Stokhof (1980). https://doi.org/10.25446/oxford.23205173

Rajeg, G. P. W., & Arka, I. W. (2025). Group Work in the Lexicography Class for the Holle List of the Barrier Islands Languages of Indonesia (Version 0.0.1) [Dataset]. https://doi.org/10.17605/OSF.IO/7TQG6

Rajeg, G. P. W., & Arka, I. W. (2025). The digitised and annotated Holle List of the Barrier Islands languages, off the west coast of Sumatra, Indonesia [Dataset]. Open Science Framework (OSF). https://doi.org/10.17605/OSF.IO/P8A3R (Original work published 2024)

Rajeg, G. P. W., Arka, I. W., Pramartha, C. R. A., & Sangian, E. Z. (2025). The data science behind the curation of the Holle List: A case study from the Enggano Holle List and its neighbouring Barrier Islands Languages [Presentation]. Oceanic and Southeast Asian Navigators (OCSEAN) Conference, Faculty of Humanities, Udayana University. University of Oxford. https://doi.org/10.25446/oxford.29625407.v1

Rajeg, G. P. W., Krauße, D., & Pramartha, C. (2024). EnoLEX: A diachronic lexical database for the Enggano language. In A. Inoue, N. Kawamoto, & M. Sumiyoshi (Eds.), AsiaLex 2024 Proceedings: Asian Lexicography - Merging cutting-edge and established approaches (pp. 123–132). https://doi.org/10.25446/oxford.27013864

Rzymski, C., Tresoldi, T., Greenhill, S. J., Wu, M.-S., Schweikhard, N. E., Koptjevskaja-Tamm, M., Gast, V., Bodt, T. A., Hantgan, A., Kaiping, G. A., Chang, S., Lai, Y., Morozova, N., Arjava, H., Hübler, N., Koile, E., Pepper, S., Proos, M., Van Epps, B., … List, J.-M. (2020). The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies. Scientific Data, 7(1), 13. https://doi.org/10.1038/s41597-019-0341-x

Smith, R. (2007). An Overview of the Tesseract OCR Engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), 2, 629–633. https://doi.org/10.1109/ICDAR.2007.4376991

Stokhof, W. A. L. (Ed.). (1980). Holle lists: Vocabularies in languages of Indonesia, Vol. 1: Introductory volume. Materials in Languages of Indonesia. Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. https://doi.org/10.15144/PL-D17

Stokhof, W. A. L., & Almanar, A. E. (Eds.). (1986). Holle lists: Vocabularies in languages of Indonesia, Vol. 8: Kalimantan (Borneo). Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. https://doi.org/10.15144/PL-D69

Stokhof, W. A. L., & Almanar, A. E. (Eds.). (1987). Holle lists: Vocabularies in languages of Indonesia, Vol. 10/3: Islands off the west coast of Sumatra. Materials in Languages of Indonesia. Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. http://hdl.handle.net/1885/144589

Stokhof, W. A. L., Saleh-Bronckhorst, L., & Almanar, A. E. (Eds.). (1982). Holle lists: Vocabularies in languages of Indonesia, Vol. 5/1: Irian Jaya: Austronesian languages; Papuan languages, Digul area. Materials in Languages of Indonesia. Dept. of Linguistics, Research School of Pacific Studies, The Australian National University. http://hdl.handle.net/1885/144577

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (Second edition). O’Reilly. https://r4ds.hadley.nz/

Published

2026-05-17