latin-lemmatizer

Latin Lemmatizer

Custom tool for Latin text lemmatisation. Uses an extracted copy of the CLTK Latin lemmata data, but does not explicitly depend on CLTK.

Getting started

Usage

An --input-parameters option pointing to a YAML file is required.

uv run latin_lemmatizer --input-parameters ./input_parameters.yaml

The input parameters YAML file must specify text_path and output_path values:

text_path: "./data/test.txt"
output_path: "./outputs/test.csv"
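As a rough illustration of the requirement above, the sketch below checks a parsed parameters mapping for the two required keys. It assumes the YAML file has already been loaded into a plain dict (for instance with PyYAML's yaml.safe_load); REQUIRED_KEYS and missing_required are hypothetical names, not part of this tool.

```python
# Hypothetical helper: not part of latin_lemmatizer itself.
REQUIRED_KEYS = ("text_path", "output_path")


def missing_required(params):
    """Return the required keys that are absent or empty in the parsed parameters."""
    return [key for key in REQUIRED_KEYS if not params.get(key)]


# A parameters mapping that forgot output_path.
params = {"text_path": "./data/test.txt"}
print(missing_required(params))  # ['output_path']
```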

Features

This tool lemmatizes a Latin text file with a naive dictionary lookup, then writes lemma frequency information to a CSV file. Several user overrides can be enabled by adding file paths to the input parameters YAML file. For example:

text_path: "./data/test.txt"
output_path: "./outputs/test.csv"
word_lemma_overrides_path: "./data/word_lemma_overrides.csv"
word_word_overrides_path: "./data/word_word_overrides.csv"
lemma_lemma_overrides_path: "./data/lemma_lemma_overrides.csv"
proper_nouns_path: "./data/proper_names.txt"
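To make the "naive dictionary lookup" concrete, here is a minimal sketch of the idea rather than the tool's actual code. TOY_LEMMATA is a hypothetical three-entry stand-in for the extracted CLTK lemmata data, count_lemmas and to_csv are illustrative names, and the fallback for unknown words (keeping the word as its own lemma) is an assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the extracted CLTK lemmata data.
TOY_LEMMATA = {"arma": "armo", "virum": "vir", "cano": "cano"}


def count_lemmas(words, lemmata):
    """Group words under the lemma found by a plain dictionary lookup."""
    hits = defaultdict(list)
    for word in words:
        # Assumption: a word missing from the dictionary is kept as its own lemma.
        lemma = lemmata.get(word, word)
        hits[lemma].append(word)
    return dict(hits)


def to_csv(hits):
    """Render one row per lemma: the lemma, its frequency, then every hit."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["lemma", "frequency", "hits..."])
    for lemma, words in hits.items():
        writer.writerow([lemma, len(words), *words])
    return buffer.getvalue()


print(to_csv(count_lemmas(["arma", "virum", "cano"], TOY_LEMMATA)))
```

Because the lookup has no grammatical context, an ambiguous form such as arma simply receives whichever lemma the dictionary stores for it, which is what the overrides described below are for.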

proper nouns

A text file containing proper names (one per line) can be provided. Any word beginning with an upper-case letter is converted to lower case unless it appears in the proper nouns file. This step runs before any other override. For example, the word Arma will fail lemmatization unless converted to lower case, whereas Iuppiter will fail unless it keeps its capital letter. The default behaviour is therefore to convert every word to lower case unless it is found in the proper nouns file.
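The case rule described above can be sketched in a few lines. This is an illustration under the assumption that the proper nouns file has already been read into a set; normalize_case is a hypothetical name.

```python
def normalize_case(word, proper_nouns):
    """Lower-case a capitalised word unless it is a listed proper noun."""
    if word[:1].isupper() and word not in proper_nouns:
        return word.lower()
    return word


# Assumed contents of the proper nouns file, one name per line.
proper_nouns = {"Iuppiter"}

print([normalize_case(w, proper_nouns) for w in ["Arma", "Iuppiter", "virum"]])
# ['arma', 'Iuppiter', 'virum']
```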

word-lemma overrides

word-lemma overrides are used to bypass the lookup entirely. Consider the following Latin sentence:

arma virumque canō

The default output of the lemmatizer is

lemma,frequency,hits...
armo,1,arma
vir,1,virum
cano,1,cano

We can see that arma has been lemmatized as the verb armo instead of the desired noun arma (arma is a plural-only noun, so arma is already its dictionary form). To correct this, we include a row arma, arma in word_lemma_overrides.csv, with the word on the left and the correct lemma on the right. Note that all user overrides are trusted: if we add arma, potato as a word-lemma override, the output will be

lemma,frequency,hits...
potato,1,arma
vir,1,virum
cano,1,cano
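A sketch of how such an override might be read and applied follows. It assumes the overrides file is a two-column CSV with no header row; read_pairs and lemmatize are hypothetical names, and the small lemmata dict stands in for the real lookup data.

```python
import csv
import io


def read_pairs(handle):
    """Read two-column rows into a dict, skipping blank or short rows."""
    reader = csv.reader(handle, skipinitialspace=True)
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def lemmatize(word, lemmata, word_lemma_overrides):
    """Return the override if one exists, otherwise fall back to the lookup."""
    # The override is checked first, so the dictionary is never consulted for it.
    if word in word_lemma_overrides:
        return word_lemma_overrides[word]
    # Assumption: unknown words are returned unchanged.
    return lemmata.get(word, word)


# io.StringIO stands in for an opened word_lemma_overrides.csv.
overrides = read_pairs(io.StringIO("arma, arma\n"))
lemmata = {"arma": "armo", "virum": "vir", "cano": "cano"}

print([lemmatize(w, lemmata, overrides) for w in ["arma", "virum", "cano"]])
# ['arma', 'vir', 'cano']
```

Since the override value is returned without validation, a row such as arma, potato would yield potato, matching the "trusted" behaviour described above.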

word-word overrides

word-word overrides allow us to replace one word with another before lemmatization.

lemma-lemma overrides

lemma-lemma overrides allow us to replace one lemma with another after lemmatization.

Summary

Consider two words with two default lemmas:

word1: lemma1
word2: lemma2

With no overrides, the sentence word1 word2 will lemmatize to

lemma, frequency, hits...
lemma1, 1, word1
lemma2, 1, word2

With a word-lemma override word1, lemma3 the sentence will instead lemmatize to

lemma, frequency, hits...
lemma3, 1, word1
lemma2, 1, word2

With a word-word override word1, word2 the sentence will instead lemmatize to

lemma, frequency, hits...
lemma2, 2, word1, word2

With a lemma-lemma override lemma2, lemma3 the sentence will instead lemmatize to

lemma, frequency, hits...
lemma1, 1, word1
lemma3, 1, word2
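The four cases above can be reproduced with a short sketch that applies the overrides in the order described: word-word before lemmatization, word-lemma in place of the lookup, and lemma-lemma afterwards. This is an illustration, not the tool's implementation. lemmatize_text is a hypothetical name, and two details are assumptions: hits record the original word (as the word-word example suggests), and a word-lemma override is matched against the word after any word-word replacement.

```python
from collections import defaultdict


def lemmatize_text(words, lemmata, word_word=None, word_lemma=None, lemma_lemma=None):
    """Apply the three override stages and group original words by final lemma."""
    word_word = word_word or {}
    word_lemma = word_lemma or {}
    lemma_lemma = lemma_lemma or {}
    hits = defaultdict(list)
    for original in words:
        # Stage 1: word-word replacement happens before lemmatization.
        word = word_word.get(original, original)
        # Stage 2: a word-lemma override bypasses the dictionary lookup.
        if word in word_lemma:
            lemma = word_lemma[word]
        else:
            # Assumption: unknown words are kept as their own lemma.
            lemma = lemmata.get(word, word)
        # Stage 3: lemma-lemma replacement happens after lemmatization.
        lemma = lemma_lemma.get(lemma, lemma)
        hits[lemma].append(original)
    return dict(hits)


lemmata = {"word1": "lemma1", "word2": "lemma2"}
sentence = ["word1", "word2"]

print(lemmatize_text(sentence, lemmata))
print(lemmatize_text(sentence, lemmata, word_lemma={"word1": "lemma3"}))
print(lemmatize_text(sentence, lemmata, word_word={"word1": "word2"}))
print(lemmatize_text(sentence, lemmata, lemma_lemma={"lemma2": "lemma3"}))
```

Each printed dict maps a lemma to its hits, so the frequency in the corresponding CSV row is simply the length of that list.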