nefnir package

Submodules

nefnir.nefnir module

class nefnir.nefnir.Nefnir[source]

Bases: object

A rule-based lemmatizer.

lemmatize(form, tag)[source]

Lemmatize a word form given its part-of-speech tag.

Parameters:
  • form – A word form.
  • tag – The word form’s part-of-speech tag.
Returns:

The word form’s lemma.

recase(form, tag, lemma)[source]

Determine how to properly case a lemma given the word form and part-of-speech tag it was derived from.

Nefnir transforms words into lowercase prior to lemmatization. Some words, such as proper nouns, abbreviations and foreign words, therefore need to be re-capitalized or changed back to uppercase.

Parameters:
  • form – A word form, cased as it was written.
  • tag – The word form’s part-of-speech tag.
  • lemma – The word form’s lemma, in lowercase.
Returns:

A properly cased lemma.
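
The recasing idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Nefnir's actual logic: the function name recase_sketch, the caller-supplied is_proper_noun predicate and the "PROPN" tag are all hypothetical, and the real rules for abbreviations and foreign words may differ.

```python
from typing import Callable


def recase_sketch(
    form: str,
    tag: str,
    lemma: str,
    is_proper_noun: Callable[[str], bool],
) -> str:
    """Restore casing on a lowercase lemma (hypothetical sketch)."""
    # Assumption: a fully uppercase form of more than one character
    # (for example an abbreviation) yields an uppercase lemma.
    if len(form) > 1 and form.isupper():
        return lemma.upper()
    # Assumption: the caller decides which tags mark proper nouns;
    # only those lemmas get their first letter capitalized.
    if is_proper_noun(tag) and form[:1].isupper():
        return lemma[:1].upper() + lemma[1:]
    # Everything else keeps the lowercase lemma, including
    # sentence-initial common nouns.
    return lemma


# "PROPN" is a placeholder tag, not part of any real tagset used by Nefnir.
def is_propn(tag: str) -> bool:
    return tag == "PROPN"


cased = recase_sketch("Reykjavíkur", "PROPN", "reykjavík", is_propn)
```

Passing the proper-noun test in as a predicate keeps the sketch independent of any particular tagset.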

nefnir.nefnir.get_suffixes(s)[source]

Return an iterator yielding a string’s suffixes, from the longest to the shortest.

Parameters:
  • s – A text string.
Returns:

An iterator for the string’s suffixes.
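
As a rough illustration of longest-first suffix iteration, here is a minimal sketch. It assumes the whole string counts as the longest suffix and that the empty string is not yielded; the source does not confirm either detail. The names iter_suffixes, TOY_RULES and apply_longest_rule are hypothetical, and the toy English rules are not Nefnir's rule set.

```python
from typing import Dict, Iterator


def iter_suffixes(s: str) -> Iterator[str]:
    """Yield the suffixes of s, longest first (hypothetical stand-in)."""
    for start in range(len(s)):
        yield s[start:]


# Toy English rules for illustration only; not Nefnir's rules.
TOY_RULES: Dict[str, str] = {"ies": "y", "s": ""}


def apply_longest_rule(word: str, rules: Dict[str, str]) -> str:
    """Rewrite the longest suffix of word that has a rule, if any."""
    for suffix in iter_suffixes(word):
        if suffix in rules:
            stem = word[: len(word) - len(suffix)]
            return stem + rules[suffix]
    # No rule matched: return the word unchanged.
    return word


suffixes = list(iter_suffixes("abc"))
```

Iterating from the longest suffix means the most specific matching rule is tried first, which is one common reason to order suffixes this way.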
nefnir.nefnir.main()[source]

nefnir.wrapper module

nefnir.wrapper.init() → None[source]

Read configuration files.

nefnir.wrapper.lemmatize(form: str, tag: str) → str[source]

Lemmatize a word form given its part-of-speech tag.

Parameters:
  • form – A word form.
  • tag – The word form’s part-of-speech tag.
Returns:

The word form’s lemma.

nefnir.wrapper.lemmatize_line(line: str, separator: str = '\t') → Tuple[Optional[str], Optional[str], Optional[str]][source]

Lemmatize a line containing a word form and its part-of-speech tag.

Parameters:
  • line – A line with form and tag separated by separator.
  • separator – The token separator.
Returns:

A tuple of form, tag and lemma; any element can be None if the data is invalid.
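
The line-based variant can be pictured with the sketch below. It is an assumption-laden illustration, not the library's implementation: lemmatize_line_sketch and the stub lemmatizer are hypothetical, "TAG" is a placeholder tag, and this sketch returns None for all three elements on malformed input, whereas the documented function only states that any element can be None.

```python
from typing import Callable, Optional, Tuple

Triple = Tuple[Optional[str], Optional[str], Optional[str]]


def lemmatize_line_sketch(
    line: str,
    lemmatize_fn: Callable[[str, str], str],
    separator: str = "\t",
) -> Triple:
    """Split a 'form<separator>tag' line and lemmatize it (sketch)."""
    parts = line.rstrip("\n").split(separator)
    # Assumption: exactly two non-empty fields are required.
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return None, None, None
    form, tag = parts
    return form, tag, lemmatize_fn(form, tag)


# Stub lemmatizer that only lowercases; it stands in for a real one.
def stub_lemmatize(form: str, tag: str) -> str:
    return form.lower()


result = lemmatize_line_sketch("Hestar\tTAG\n", stub_lemmatize)
```

Returning a fixed-size tuple lets callers unpack the result unconditionally and check for None afterwards.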

nefnir.wrapper.recase(form: str, tag: str, lemma: str) → str[source]

Determine how to properly case a lemma given the word form and part-of-speech tag it was derived from.

Nefnir transforms words into lowercase prior to lemmatization. Some words, such as proper nouns, abbreviations and foreign words, therefore need to be re-capitalized or changed back to uppercase.

Parameters:
  • form – A word form, cased as it was written.
  • tag – The word form’s part-of-speech tag.
  • lemma – The word form’s lemma, in lowercase.
Returns:

A properly cased lemma.

Module contents

Top-level package for nefnir (nefnir-package).

nefnir.init() → None[source]

Read configuration files.

nefnir.lemmatize(form: str, tag: str) → str[source]

Lemmatize a word form given its part-of-speech tag.

Parameters:
  • form – A word form.
  • tag – The word form’s part-of-speech tag.
Returns:

The word form’s lemma.

nefnir.lemmatize_line(line: str, separator: str = '\t') → Tuple[Optional[str], Optional[str], Optional[str]][source]

Lemmatize a line containing a word form and its part-of-speech tag.

Parameters:
  • line – A line with form and tag separated by separator.
  • separator – The token separator.
Returns:

A tuple of form, tag and lemma; any element can be None if the data is invalid.

nefnir.recase(form: str, tag: str, lemma: str) → str[source]

Determine how to properly case a lemma given the word form and part-of-speech tag it was derived from.

Nefnir transforms words into lowercase prior to lemmatization. Some words, such as proper nouns, abbreviations and foreign words, therefore need to be re-capitalized or changed back to uppercase.

Parameters:
  • form – A word form, cased as it was written.
  • tag – The word form’s part-of-speech tag.
  • lemma – The word form’s lemma, in lowercase.
Returns:

A properly cased lemma.