How to parse the Aleppo Codex and analyze its content in Python
— Albert De La Fuente Vigliotti

These notes are part of the blog post I wrote while trying to learn Hebrew. You may want to check that out first for proper context.
Some important notes to keep in mind:
- I am applying literate programming to the Lisp code I write to maintain my Doom Emacs configuration.
- I am using pysword, which is a Python wrapper around the SWORD project, so I already have the Bible modules correctly installed on my computer.
- I am using several other libraries, such as nltk for natural language processing, and others to generate the word cloud plots.
- I use Arch Linux as my OS.

I thought at first of using a Jupyter notebook to be able to execute blocks of code. Influenced by my love for Smalltalk, I figured there should be a MELPA package to execute chunks of code and, lo and behold, I discovered org-babel-eval-in-repl, which does exactly that. So I installed it in Emacs.
With that in mind… let's start…
This chunk installs jupyter-notebook, which is not used here at this time, but I am looking forward to integrating it with Emacs as well. So take this section as a minor and irrelevant parenthesis for the time being.
sudo pacman -S jupyter-notebook python-ipykernel python-ipython-genutils
Now let's get it done. First of all, create a virtual environment and install the libraries:
virtualenv ~/tmp/venv-geek-hebrew
cd ~/tmp/venv-geek-hebrew
source ./bin/activate.fish
pip install pysword nltk matplotlib wordcloud
Within Emacs, you need to load the virtual environment by using pyvenv to create a session:
(pyvenv-activate "/home/av/tmp/venv-geek-hebrew")
from pysword.modules import SwordModules
import nltk
from nltk import word_tokenize
from nltk.probability import FreqDist
import urllib.request
from matplotlib import pyplot as plt
from wordcloud import WordCloud
# Scan the installed SWORD modules and load the Aleppo Codex text.
modules = SwordModules()
found_modules = modules.parse_modules()
bible = modules.get_bible_from_module(u'Aleppo')
pentateuch_books = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
historical_books = ['Joshua', 'Judges', 'Ruth', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings', 'I Chronicles', 'II Chronicles', 'Ezra', 'Nehemiah', 'Esther']
wisdom_books = ['Job', 'Psalms', 'Proverbs', 'Ecclesiastes', 'Song of Solomon']
major_prophets_books = ['Isaiah', 'Jeremiah', 'Lamentations', 'Ezekiel', 'Daniel']
minor_prophets_books = ['Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
christian_books = pentateuch_books + historical_books + wisdom_books + major_prophets_books + minor_prophets_books
torah_books = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
neviim_former_books = ['Joshua', 'Judges', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings']
neviim_later_books = ['Isaiah', 'Jeremiah', 'Ezekiel']
neviim_minor_books = ['Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
ketuvim_poetic_books = ['Psalms', 'Proverbs', 'Job']
ketuvim_five_megillot = ['Song of Solomon', 'Ruth', 'Lamentations', 'Ecclesiastes', 'Esther']
ketuvim_historical_books = ['Daniel', 'Ezra', 'Nehemiah', 'I Chronicles', 'II Chronicles']
judaism_books = torah_books + neviim_former_books + neviim_later_books + neviim_minor_books + ketuvim_poetic_books + ketuvim_five_megillot + ketuvim_historical_books
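Both orderings above cover the same 39 books, just grouped differently; in the session, `set(christian_books) == set(judaism_books)` confirms this with the variables already defined. The snippet below is a standalone version of that sanity check, with the lists inlined so it runs on its own:

```python
# The three divisions of the Tanakh, in the traditional Jewish order.
torah = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy']
neviim = ['Joshua', 'Judges', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings',
          'Isaiah', 'Jeremiah', 'Ezekiel',
          'Hosea', 'Joel', 'Amos', 'Obadiah', 'Jonah', 'Micah', 'Nahum',
          'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah', 'Malachi']
ketuvim = ['Psalms', 'Proverbs', 'Job',
           'Song of Solomon', 'Ruth', 'Lamentations', 'Ecclesiastes', 'Esther',
           'Daniel', 'Ezra', 'Nehemiah', 'I Chronicles', 'II Chronicles']
tanakh = torah + neviim + ketuvim

# 39 books in total, with no duplicates across the three divisions.
print(len(tanakh))       # 39
print(len(set(tanakh)))  # 39
```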
def analyze_words(analysis_books, analysis_word_count, analysis_ignore_words):
    text = bible.get(books=analysis_books)
    words = word_tokenize(text)
    # Drop the words we were asked to ignore.
    words = [word for word in words if word not in analysis_ignore_words]
    print('Analyzing the books of: {}'.format(analysis_books))
    print('Number of words to consider: {}'.format(analysis_word_count))
    print('Ignoring the following words: {}'.format(analysis_ignore_words))
    print('Sample of text: {}'.format(text[:10]))
    print('Sample of tokens: {}'.format(words[:10]))
    print('Total number of words in the text: {}'.format(len(words)))
    fdist = FreqDist(words)
    # most_common() already returns the top entries sorted by descending
    # frequency, so no manual counter loop is needed.
    return [word for word, count in fdist.most_common(analysis_word_count)]
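A side note on FreqDist: it is a subclass of the standard library's collections.Counter, so its most_common() behaves exactly like Counter's. A tiny standalone illustration, using made-up tokens in place of real word_tokenize() output (no SWORD module or nltk download needed):

```python
from collections import Counter

# Toy token list standing in for word_tokenize() output.
tokens = ['light', 'and', 'darkness', 'and', 'light', 'and', 'water']
fdist = Counter(tokens)

# most_common(n) returns the n highest-frequency (word, count) pairs,
# sorted by descending count.
top_two = fdist.most_common(2)
print(top_two)  # [('and', 3), ('light', 2)]

# Keeping only the words, as analyze_words() does:
top_words = [word for word, count in top_two]
print(top_words)  # ['and', 'light']
```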
def generate_word_cloud(all_words):
    all_words_string = " ".join(all_words)
    wordcloud = WordCloud(
        # The font must provide Hebrew glyphs; adjust the path for your system.
        #font_path = '/usr/share/fonts/TTF/Cardo104s.ttf',
        font_path = '/usr/share/fonts/TTF/Cardob101.ttf',
        background_color = "white",
        relative_scaling = 1.0,
        scale = 3,
        random_state = 1
    ).generate(all_words_string)
    plt.figure(figsize = (12, 12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
print('block loaded')
analysis_books = ['Genesis']
analysis_word_count = 300
all_words = analyze_words(analysis_books, analysis_word_count, [])
generate_word_cloud(all_words)