How to parse the Aleppo Codex and analyze its content in Python

Sat Jun 4, 2022

These notes accompany the blog post I wrote about my quest to learn Hebrew. You may want to read that first for proper context.

Some important notes to keep in mind:

I use literate programming for the Lisp code I write to maintain my Doom Emacs configuration.
I am using pysword, a Python library for reading SWORD Project Bible modules, so I already have the Bible modules correctly installed on my computer.
I am using several other libs: nltk for natural language processing, plus matplotlib and wordcloud to plot the word cloud.
I use Arch Linux as my OS. At first I thought of using Jupyter Notebook to be able to execute blocks. Influenced by my love for Smalltalk, I figured there should be a MELPA package to execute chunks of code and, lo and behold, I discovered org-babel-eval-in-repl, which does exactly that… So I installed it in Emacs.

With that in mind… let's start…

This chunk is needed to install jupyter-notebook, which is not used here at this time, but I am looking forward to integrating it with Emacs as well. So take this section as a minor and irrelevant parenthesis for the time being.

sudo pacman -S jupyter-notebook python-ipykernel python-ipython-genutils

Now let's get it done… First of all, create a virtual environment and install the libs:

virtualenv ~/tmp/venv-geek-hebrew
cd ~/tmp/venv-geek-hebrew
source ./bin/activate.fish
pip install pysword nltk matplotlib wordcloud

Within Emacs, you need to load the virtual environment by using pyvenv to create a session:

(pyvenv-activate "/home/av/tmp/venv-geek-hebrew")
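
Before running the Python blocks, it can help to confirm that the REPL started from Emacs really uses the virtual environment and can see the four libraries installed above. The following is a minimal sketch using only the standard library. The helper names `in_virtualenv` and `missing_packages` are made up for this illustration, and the check assumes a recent virtualenv or venv, which sets `sys.base_prefix`.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the system installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("missing packages:",
          missing_packages(["pysword", "nltk", "matplotlib", "wordcloud"]))
```

If the second line prints an empty list, the imports below should work. Note that `word_tokenize` additionally needs NLTK's tokenizer data, which is usually fetched once with `nltk.download('punkt')`; newer NLTK releases may ask for `'punkt_tab'` instead.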

from pysword.modules import SwordModules
import nltk
from nltk import word_tokenize
from nltk.probability import FreqDist
import urllib.request
from matplotlib import pyplot as plt
from wordcloud import WordCloud

modules = SwordModules()
found_modules = modules.parse_modules()
bible = modules.get_bible_from_module(u'Aleppo')

pentateuch_books = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
historical_books = ['Joshua', 'Judges', 'Ruth', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings', 'I Chronicles', 'II Chronicles', 'Ezra', 'Nehemiah', 'Esther']
wisdom_books = ['Job', 'Psalms', 'Proverbs', 'Ecclesiastes', 'Song of Solomon']
major_prophets_books = ['Isaiah', 'Jeremiah', 'Lamentations', 'Ezekiel', 'Daniel']
minor_prophets_books = ['Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
christian_books = pentateuch_books + historical_books + wisdom_books + major_prophets_books + minor_prophets_books

torah_books = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
neviim_former_books = ['Joshua', 'Judges', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings']
neviim_later_books = ['Isaiah', 'Jeremiah', 'Ezekiel']
neviim_minor_books = ['Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
ketuvim_poetic_books = ['Psalms', 'Proverbs', 'Job']
ketuvim_five_megillot = ['Song of Solomon', 'Ruth', 'Lamentations', 'Ecclesiastes', 'Esther']
ketuvim_historical_books = ['Daniel', 'Ezra', 'Nehemiah', 'I Chronicles', 'II Chronicles']
judaism_books = torah_books + neviim_former_books + neviim_later_books + neviim_minor_books + ketuvim_poetic_books + ketuvim_five_megillot + ketuvim_historical_books

def analyze_words(analysis_books, analysis_word_count, analysis_ignore_words):
    """Return the most frequent words of the given books, skipping ignored words."""
    text = bible.get(books=analysis_books)
    words = word_tokenize(text)

    print('Analyzing the books of: {}'.format(analysis_books))
    print('Number of words to consider: {}'.format(analysis_word_count))
    print('Ignoring the following words: {}'.format(analysis_ignore_words))
    print('Sample of text: {}'.format(text[:10]))
    print('Sample of tokens: {}'.format(words[:10]))
    print('Total number of words in the text is: {}'.format(len(words)))

    fdist = FreqDist(words)
    ignored = set(analysis_ignore_words)
    # most_common() returns (word, count) pairs sorted from most to least frequent
    ranked_words = [word for word, _ in fdist.most_common() if word not in ignored]
    # Keep only the requested number of words
    return ranked_words[:analysis_word_count]

def generate_word_cloud(all_words):
    all_words_string = " ".join(all_words)
    wordcloud = WordCloud(
        #font_path = '/usr/share/fonts/TTF/Cardo104s.ttf',
        font_path = '/usr/share/fonts/TTF/Cardob101.ttf',
        background_color="white",
        relative_scaling = 1.0,
        scale=3,
        random_state=1
    ).generate(all_words_string)
    plt.figure(figsize = (12, 12))
    plt.imshow(wordcloud)

    plt.axis("off")
    plt.show()

print('block loaded')
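
The two groupings defined above list the same books in different orders: the Christian Old Testament arrangement and the Jewish Tanakh arrangement (Torah, Nevi'im, Ketuvim). A quick sanity check, sketched below with the lists repeated so that it runs on its own, can confirm that neither list has a missing or duplicated book. The helper name `same_books` is hypothetical.

```python
christian_books = (
    ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
    + ['Joshua', 'Judges', 'Ruth', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings',
       'I Chronicles', 'II Chronicles', 'Ezra', 'Nehemiah', 'Esther']
    + ['Job', 'Psalms', 'Proverbs', 'Ecclesiastes', 'Song of Solomon']
    + ['Isaiah', 'Jeremiah', 'Lamentations', 'Ezekiel', 'Daniel']
    + ['Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum',
       'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
)

judaism_books = (
    ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
    + ['Joshua', 'Judges', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings']
    + ['Isaiah', 'Jeremiah', 'Ezekiel']
    + ['Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum',
       'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
    + ['Psalms', 'Proverbs', 'Job']
    + ['Song of Solomon', 'Ruth', 'Lamentations', 'Ecclesiastes', 'Esther']
    + ['Daniel', 'Ezra', 'Nehemiah', 'I Chronicles', 'II Chronicles']
)


def same_books(first, second):
    """Return True when both lists hold the same books, each exactly once."""
    no_duplicates = (len(set(first)) == len(first)
                     and len(set(second)) == len(second))
    return no_duplicates and set(first) == set(second)


if __name__ == "__main__":
    print(len(christian_books), len(judaism_books),
          same_books(christian_books, judaism_books))
```

Because only the order differs, either list can be passed to `analyze_words` and should yield the same overall word frequencies.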

analysis_books = ['Genesis']
analysis_word_count = 300
all_words = analyze_words(analysis_books, analysis_word_count, [])
generate_word_cloud(all_words)
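
One thing to be aware of: as far as I can tell, the string handed to `WordCloud.generate` contains each selected word only once, so the cloud has no frequency information to work with and word sizes cannot reflect how often a word occurs. A possible alternative, sketched below, is to keep the counts and pass them to `WordCloud.generate_from_frequencies`, which accepts a word-to-count mapping. `top_frequencies` and `cloud_from_frequencies` are hypothetical helpers, and the counting uses `collections.Counter` (the base class of NLTK's `FreqDist`) so the example needs nothing beyond wordcloud itself.

```python
from collections import Counter


def top_frequencies(tokens, top_n, ignore_words=()):
    """Return a {word: count} dict for the top_n most frequent tokens."""
    ignored = set(ignore_words)
    counts = Counter(token for token in tokens if token not in ignored)
    return dict(counts.most_common(top_n))


def cloud_from_frequencies(frequencies, font_path=None):
    """Build a WordCloud whose word sizes follow the given counts."""
    # Imported here so that top_frequencies works without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, background_color="white", random_state=1)
    return cloud.generate_from_frequencies(frequencies)


if __name__ == "__main__":
    sample_tokens = ["a", "b", "a", "c", "a", "b", "."]  # stand-in tokens, not real data
    print(top_frequencies(sample_tokens, top_n=2, ignore_words=["."]))
```

The object returned by `cloud_from_frequencies` can be displayed with `plt.imshow` just like in `generate_word_cloud`. For Hebrew text, pass the same `font_path` used there, since the default font may not cover Hebrew glyphs.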