Learning Hebrew like a geek

Albert De La Fuente Vigliotti

The other day I was thinking about a practical way of learning Hebrew, something dynamic that tickles my scientific curiosity. And I thought: wouldn’t it be cool if I could parse one of the Hebrew codices and get the top 20 most frequent words of a chapter or a book? Well, this is an attempt to answer that type of question.

How? #

So I rolled up my sleeves and thought: I will google the Aleppo codex in plain text so I can download it and use it later for analysis. The problem is, I couldn’t find it…

The next challenge, then, is: how can I produce a plain text version of the Aleppo codex? I thought about parsing a MySword module, which is in reality a SQLite database, so that should be easy… but then, while I was doing something totally unrelated (probably around the bees or my first batch of mead), it hit me… I already have several Bibles from the SWORD project installed on my notebook, and it is all OSS, so… there is probably a Python wrapper around diatheke or similar… and sure there is =)
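As a rough illustration of that idea, the sketch below shells out to the diatheke command-line tool and parses what comes back. The module name `Aleppo`, the helper names, and the assumed `Book chapter:verse: text` line layout are assumptions made for this example, not necessarily what the companion note does.

```python
import re
import subprocess

# Assumed layout of one output line: "Genesis 1:1: <verse text>"
VERSE_LINE = re.compile(
    r"^(?P<book>.+?) (?P<chapter>\d+):(?P<verse>\d+): (?P<text>.*)$"
)


def parse_verse_lines(output):
    """Turn diatheke-style text into (book, chapter, verse, text) tuples."""
    verses = []
    for line in output.splitlines():
        match = VERSE_LINE.match(line.strip())
        if match:  # blank lines and the trailing "(Module)" line are skipped
            verses.append((match["book"], int(match["chapter"]),
                           int(match["verse"]), match["text"]))
    return verses


def fetch_passage(key, module="Aleppo"):
    """Ask diatheke for a passage; needs the SWORD tools and the module installed."""
    result = subprocess.run(
        ["diatheke", "-b", module, "-k", key],
        capture_output=True, text=True, check=True,
    )
    return parse_verse_lines(result.stdout)
```

With the tools installed, something like `fetch_passage("Gen 1")` would return one tuple per verse, ready for counting.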

If you want the technicalities, go read my note on How to parse the Aleppo codex and analyze its content in python.

But if you are not a geeky human, you are probably interested only in the results rather than the bits and bytes. This is why I split the whole thing into two notes, one more IT related and one more Hebrew related. So here we go; brace yourselves, this is probably going to be long.

I am not sure how to “slice” the codex so that the chunks make sense for an analysis. For instance: per chapter? Per book? Per “story”? Or some other slicing, like Torah, Neviim and Ketuvim? Since it is not clear, I will start by slicing by book.

Assumptions and limitations #

Hebrew has some peculiarities, one of them being the vowel pointing (niqqud). This brings a ton of challenges. For the sake of simplicity I had to stick to a codex that does not include niqqud whatsoever. I am not sure whether this is a good approach, because two different words can be written identically without niqqud yet have totally different meanings.
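The ambiguity is easy to see with a small sketch. It assumes a pointed text as the starting point (which is not the route taken in this note) and strips the points by dropping Unicode nonspacing marks; the function name is made up for the example.

```python
import unicodedata


def strip_niqqud(text):
    """Drop Hebrew vowel points and cantillation marks, keeping the consonants."""
    decomposed = unicodedata.normalize("NFD", text)
    # Niqqud and accents are Unicode nonspacing marks (category "Mn")
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


el_god = "\u05d0\u05b5\u05dc"  # alef + tsere + lamed, "God"
el_to = "\u05d0\u05b6\u05dc"   # alef + segol + lamed, "to, toward"

# Both pointed words collapse to the same two consonants
print(strip_niqqud(el_god) == strip_niqqud(el_to))
```

Once the points are gone, a frequency count can no longer tell these two words apart.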

Another problem is that there are words that are trivially known, like לא (h3809) or אל (h410), or words that are not translated yet are used very often, like את (h854). I have implemented a filter that makes it possible to skip these words. Not because they are not relevant: the Aleph-Tav has a ton of secrets and importance. But mixed in with them are some irrelevant tokens that get counted as words, like opening and closing brackets, parentheses and similar. So I implemented a list to ignore words that are either well known or really irrelevant.
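A minimal sketch of such a filter, assuming the text has already been split into tokens. The contents of the ignore list and the names used here are illustrative only.

```python
from collections import Counter

# Hypothetical ignore list: very common words plus stray punctuation tokens
IGNORED = {"את", "לא", "אל", "(", ")", "[", "]"}


def count_words(tokens, ignored=frozenset(), skip_ignored=True):
    """Count tokens, optionally skipping anything on the ignore list."""
    if skip_ignored:
        tokens = (tok for tok in tokens if tok not in ignored)
    return Counter(tokens)


tokens = ["את", "משה", "(", "לא", "משה", ")", "דבר"]
filtered = count_words(tokens, IGNORED)
unfiltered = count_words(tokens, IGNORED, skip_ignored=False)
print(filtered.most_common(3))
```

Keeping the filter optional makes it possible to produce both the unfiltered tables and a filtered view from the same tokens.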

Interesting findings #

It was rather interesting to me how fast the repetitions decline among unfiltered words. For instance, the most used word is את (h854) with 7073 occurrences. Yet 30 words further down the list there is a 90% decline in occurrences: משה (h4872) with 704 occurrences.
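The 90% figure can be checked with a couple of lines; the function name is just for illustration.

```python
def percent_decline(top_count, later_count):
    """Percentage drop from the top count to a later count."""
    return (top_count - later_count) / top_count * 100


# Counts taken from the unfiltered full-Tanakh tables in this note
print(round(percent_decline(7073, 704), 1))  # 90.0
```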

Experiments organized by sections #

Full Tanakh - top 20 words - unfiltered #

Word Count Strong
את 7073 h854
יהוה 5611 h3068
אשר 4629 h834
אל 3882 h410
כי 3553 h3588
על 3140 h5921
לא 2834 h3809
כל 2757 h3605
ואת 2190 h854
ישראל 2085 h3479
ויאמר 2043 h559 Related
בני 1650 h1123
בן 1607 h1121
ולא 1447 h3809 Related
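Several rows in the table share a Strong number, for example את and ואת. Assuming such rows are forms of the same word, their counts can be merged; this is a sketch using a few rows copied from the table, not part of the original analysis.

```python
from collections import Counter

# (word, count, Strong number) rows copied from the table above
rows = [
    ("את", 7073, "h854"),
    ("ואת", 2190, "h854"),
    ("לא", 2834, "h3809"),
    ("ולא", 1447, "h3809"),
    ("יהוה", 5611, "h3068"),
]

totals = Counter()
for _word, count, strong in rows:
    totals[strong] += count  # merge all forms that share a Strong number

print(totals.most_common())
```

Merged this way, h854 reaches 9263 occurrences and h3809 reaches 4281, so grouping by Strong number changes the ranking compared with counting surface forms.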

Full Tanakh - Top 50 words - unfiltered #

The words from the top 20 table above, plus the following:

Word Count Strong
לו 1045 h3863
איש 1027 h376
המלך 1014 h4428
בית 1003 h1004
מלך 1000 h4427
הוא 910 h1931
עד 904 h5704
לאמר 897 h559 / 564
לך 871 h
הארץ 856 h776
ויהי 808 h
אמר 797 h559
דבר 787 h1697
העם 724 h5971 Related
וכל 712 h3606 Related
משה 704 h4872
שם 681 h8043
מן 661 h4478
לי 660 h
הזה 650 h1957
אני 635 h589
יהודה 632 h3063
לפני 615 h3942
להם 607 h3859
אם 607 h518
אלהים 597 h430
אדני 587 h136
דוד 583 h1730
אתה 582 h857
עם 582 h5973

Torah #

Word Count Strong
את 2569 h
אשר 1617 h
יהוה 1493 h
אל 1241 h
על 949 h
כל 921 h
כי 895 h
לא 861 h
ואת 809 h
בני 620 h
ויאמר 619 h
משה 598 h
ישראל 508 h
ס 491 h
הוא 420 h
הארץ 352 h
ולא 345 h
לו 330 h

Experiments organized by books #

Genesis #

Word Count Strong
את 658 h854
אשר 351 h834
ויאמר 337 h559
אל 335 h410
כי 263 h3588
ואת 205 h854
על 203 h5921
כל 199 h3605
אלהים 150 h430
יוסף 144 h3130
יעקב 142 h3290
יהוה 141 h3068
לא 137 h3809
הארץ 126 h776
לו 126 h3863
ויהי 125 h1961
בני 118 h1123
הוא 110 h1931
אברהם 108 h85
שנה 102 h8141

Conclusion #

The more I try to know, the less I feel I know… This coding-linguistic area is fascinating and intriguing. I feel like a taxonomy of this area is needed.

This will probably take a lot of time, which I don’t have. So I want to balance being effective with being efficient. I cannot afford to spend much time on this project, but on the other hand the Hebrew language is really, really appealing to me. I have yet to find a balance on how to proceed.

As part of documenting things I am going to include some very interesting resources that I have found.

From the time I spent researching I would probably go with the BHSA DB. This is a proposed roadmap:

  • Check the ETCBC/course materials for an introduction
  • Check the BHSA/bigTables for Pandas examples and learn how to use it
  • Reevaluate then

Also, if the focus is more on learning Hebrew, check the Parabible website, which looks really lean and straightforward.

Bible in Hebrew Demo #

See, Hear, and Read the Bible in Hebrew - BibleinHebrew.com #

Parabible | Genesis 1 #

GitHub - jcuenod/awesome-bible-data: 😎 A curated list of generously licensed Bible data. #

Text-Fabric versus SHEBANQ - BHSA #

GitHub - ETCBC/course_materials: Contains scripts to learn to work with the ETCBC database #

bhsa/bigTablesP.ipynb at master · ETCBC/bhsa · GitHub #

Shebanq / Words Words #

Shebanq / text [2017] Genesis 30:1 #

Jupyter Notebook Viewer #

OSHB Read #

OSHB Lexicon #

Bible Online Learner #

  • Source: https://booge.eu/
  • Title: Bible Online Learner
  • Captured on: [2022-06-05 Sun]

J. Ted Blakley — Online Hebrew Resources #

Tanach.us text files webservice Text files #

Removing Vowels from Hebrew Unicode Text · GitHub #

shoroshim.pdf #

unfoldingWord® Hebrew Grammar — unfoldingWord® Hebrew Grammar 1 documentation #