Learning Hebrew like a geek
By Albert De La Fuente Vigliotti
The other day I was thinking about a practical way of learning Hebrew, something dynamic that tickles my scientific curiosity. And I thought, wouldn’t it be cool if I could parse one of the Hebrew codices and get the top 20 most frequent words of a chapter or a book? Well, this is an attempt to answer that type of question.
How? #
So I rolled up my sleeves and thought: I will google the Aleppo Codex in plain text so I can download it and later use it for analysis. The problem is, I couldn’t find it…
The next challenge, then, is: how can I produce a plain text version of the Aleppo Codex? I thought about parsing a MySword module, which is in reality a SQLite database, so that should be easy… But then, while I was doing something totally unrelated (probably around the bees or my first batch of mead), it hit me… I have several Bibles from the SWORD Project already installed on my notebook and it is all OSS, so… there is probably a Python wrapper around diatheke or similar… and sure enough there is =)
If you want the technicalities, go read my note on How to parse the Aleppo codex and analyze its content in python.
But if you are not a geeky human, you are probably interested only in the results rather than the bits and bytes. This is why I split the whole thing into two notes, one more IT related and one more Hebrew related. So here we go; brace yourselves, this is probably going to be long.
I am not sure how to “slice” the codex so that the chunks make sense for an analysis. For instance: per chapter? Per book? Per “story”? Or even another slicing like Torah, Neviim and Ketuvim? Since it is not clear, I will start by slicing by book.
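The per-book slicing could be sketched roughly like this in Python. It is only an illustration: it assumes the text is already available as (book, verse text) pairs, which is not necessarily what diatheke or a Sword module returns, and the names `SAMPLE_VERSES` and `top_words_by_book` are hypothetical. The sample verses are a few opening verses typed by hand, consonants only.

```python
from collections import Counter, defaultdict

# Hypothetical sample input: (book, verse text) pairs without niqqud.
SAMPLE_VERSES = [
    ("Genesis", "בראשית ברא אלהים את השמים ואת הארץ"),
    ("Genesis", "ויאמר אלהים יהי אור ויהי אור"),
    ("Genesis", "וירא אלהים את האור כי טוב ויבדל אלהים בין האור ובין החשך"),
    ("Exodus", "ואלה שמות בני ישראל הבאים מצרימה"),
]


def top_words_by_book(verses, n=20):
    """Return {book: [(word, count), ...]} with the n most frequent words per book."""
    counts = defaultdict(Counter)
    for book, text in verses:
        # Whitespace splitting is a simplification; real text may need more cleanup.
        counts[book].update(text.split())
    return {book: counter.most_common(n) for book, counter in counts.items()}


top = top_words_by_book(SAMPLE_VERSES, n=3)
for book, words in top.items():
    print(book, words)
```

Slicing by chapter, or by Torah, Neviim and Ketuvim, would only change the key used for grouping; the counting stays the same.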
Assumptions and limitations #
Hebrew has some peculiarities, one of them being the vowel pointing (niqqud). This brings a ton of challenges. For the sake of simplicity I had to stick to a codex that does not include niqqud whatsoever. I am not sure whether this is a good approach, because two different words can be written identically without niqqud yet have totally different meanings.
Another problem is that some words are trivially known, like לא (h3809) or אל (h410), and some are not translated yet used very often, like את (h854). I have implemented a filter that makes it possible to skip these words. Not because they are irrelevant: the Aleph-Tav has a ton of secrets and importance. But some irrelevant tokens, like opening and closing brackets and parentheses, are mixed in and counted as words. So I implemented a list to ignore words that are either well known or really irrelevant.
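To make the ambiguity concrete, here is a small Python sketch of what happens when niqqud is removed. It is not the method used to produce the codex text analyzed here; it simply drops every Unicode combining mark (category `Mn`), which covers Hebrew vowel points and cantillation accents. The function name `strip_niqqud` is hypothetical.

```python
import unicodedata


def strip_niqqud(text):
    """Remove combining marks (Unicode category Mn) and keep the base letters.

    For Hebrew this drops vowel points and cantillation accents. Note that it
    would also strip accents from non-Hebrew text.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Two differently pointed words, written with escapes so the code points are explicit.
EL_WITH_TSERE = "\u05d0\u05b5\u05dc"  # alef + tsere + lamed
EL_WITH_SEGOL = "\u05d0\u05b6\u05dc"  # alef + segol + lamed

# The pointed forms differ, but both collapse to the same two consonants.
print(EL_WITH_TSERE != EL_WITH_SEGOL)
print(strip_niqqud(EL_WITH_TSERE) == strip_niqqud(EL_WITH_SEGOL))
```

Once the points are gone, a frequency count cannot tell such words apart, so a single row in the tables may mix several distinct words.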
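A filter of this kind might look like the following Python sketch. The ignore list shown is a made-up example rather than the actual list used for the tables, and `count_words` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical ignore list: very common words plus non-word tokens.
IGNORED = {"את", "ואת", "לא", "אל", "(", ")", "[", "]"}


def count_words(tokens, ignored=frozenset()):
    """Count tokens, skipping anything found in the ignore list."""
    return Counter(token for token in tokens if token not in ignored)


# Illustrative token stream with two stray parentheses mixed in.
tokens = ["את", "השמים", "ואת", "הארץ", "(", "לא", ")", "אל", "הארץ"]

unfiltered = count_words(tokens)
filtered = count_words(tokens, IGNORED)

print(unfiltered.most_common(3))
print(filtered.most_common(3))
```

Keeping the ignore list as a parameter means the same counting code can produce both the unfiltered and the filtered tables.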
Interesting findings #
It was rather interesting to me how fast the repetitions decline among unfiltered words. For instance, the most used word is את (h854) with 7073 occurrences. Yet about 30 words further down the list there is a 90% decline in occurrences: משה (h4872) has 704 occurrences.
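The 90% figure follows directly from the two counts quoted above, as this short Python check shows (the helper name `relative_decline` is just for illustration):

```python
def relative_decline(top_count, later_count):
    """Fraction by which later_count falls below top_count."""
    return 1 - later_count / top_count


# Counts taken from the unfiltered full Tanakh tables: את and משה.
decline = relative_decline(7073, 704)
print(f"{decline:.1%}")  # prints 90.0%
```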
Experiments organized by sections #
Full Tanakh - top 20 words - unfiltered #
Word | Count | Strong | |
---|---|---|---|
את | 7073 | h854 | |
יהוה | 5611 | h3068 | |
אשר | 4629 | h834 | |
אל | 3882 | h410 | |
כי | 3553 | h3588 | |
על | 3140 | h5921 | |
לא | 2834 | h3809 | |
כל | 2757 | h3605 | |
ואת | 2190 | h854 | |
ישראל | 2085 | h3479 | |
ויאמר | 2043 | h559 | Related |
בני | 1650 | h1123 | |
בן | 1607 | h1121 | |
ולא | 1447 | h3809 | Related |
Full Tanakh - Top 50 words - unfiltered #
The top 20 words from above, plus the following words:
Word | Count | Strong | |
---|---|---|---|
לו | 1045 | h3863 | |
איש | 1027 | h376 | |
המלך | 1014 | h4428 | |
בית | 1003 | h1004 | |
מלך | 1000 | h4427 | |
הוא | 910 | h1931 | |
עד | 904 | h5704 | |
לאמר | 897 | h559 / 564 | |
לך | 871 | h | |
הארץ | 856 | h776 | |
ויהי | 808 | h | |
אמר | 797 | h559 | |
דבר | 787 | h1697 | |
העם | 724 | h5971 | Related |
וכל | 712 | h3606 | Related |
משה | 704 | h4872 | |
שם | 681 | h8043 | |
מן | 661 | h4478 | |
לי | 660 | h | |
הזה | 650 | h1957 | |
אני | 635 | h589 | |
יהודה | 632 | h3063 | |
לפני | 615 | h3942 | |
להם | 607 | h3859 | |
אם | 607 | h518 | |
אלהים | 597 | h430 | |
אדני | 587 | h136 | |
דוד | 583 | h1730 | |
אתה | 582 | h857 | |
עם | 582 | h5973 |
Torah #
Word | Count | Strong |
---|---|---|
את | 2569 | h |
אשר | 1617 | h |
יהוה | 1493 | h |
אל | 1241 | h |
על | 949 | h |
כל | 921 | h |
כי | 895 | h |
לא | 861 | h |
ואת | 809 | h |
בני | 620 | h |
ויאמר | 619 | h |
משה | 598 | h |
ישראל | 508 | h |
ס | 491 | h |
הוא | 420 | h |
הארץ | 352 | h |
ולא | 345 | h |
לו | 330 | h |
Experiments organized by books #
Genesis #
Word | Count | Strong |
---|---|---|
את | 658 | h854 |
אשר | 351 | h834 |
ויאמר | 337 | h559 |
אל | 335 | h410 |
כי | 263 | h3588 |
ואת | 205 | h854 |
על | 203 | h5921 |
כל | 199 | h3605 |
אלהים | 150 | h430 |
יוסף | 144 | h3130 |
יעקב | 142 | h3290 |
יהוה | 141 | h3068 |
לא | 137 | h3809 |
הארץ | 126 | h776 |
לו | 126 | h3863 |
ויהי | 125 | h1961 |
בני | 118 | h1123 |
הוא | 110 | h1931 |
אברהם | 108 | h85 |
שנה | 102 | h8141 |
Conclusion #
The more I try to know, the less I feel I know… This coding-linguistic area is fascinating and intriguing. I feel like a taxonomy of this area is needed.
This will probably take a lot of time, which I don’t have. So I want to strike a balance between being effective and being efficient. I cannot afford to spend much time on this project, but on the other hand the Hebrew language is really, really appealing to me. I have yet to find a balance on how to proceed.
As part of documenting things I am going to include some very interesting resources that I have found.
Based on the time I spent researching, I would probably go with the BHSA DB. This is a proposed roadmap:
- Check the ETCBC/course materials for an introduction
- Check the BHSA/bigTables for Pandas examples and learn how to use it
- Reevaluate then
Also, if the focus is more on learning Hebrew, check the Parabible website, which looks really lean and straightforward.
Bible in Hebrew Demo #
- Source: https://www.bibleinhebrew.com/bih/BiH_demo.php?uid=10
- Title: Bible in Hebrew Demo
See, Hear, and Read the Bible in Hebrew - BibleinHebrew.com #
- Source: https://www.bibleinhebrew.com/bih/
- Title: See, Hear, and Read the Bible in Hebrew - BibleinHebrew.com
Parabible | Genesis 1 #
- Source: https://parabible.com/Genesis/1
- Title: Parabible | Genesis 1
GitHub - jcuenod/awesome-bible-data: 😎 A curated list of generously licensed Bible data. #
- Source: https://github.com/jcuenod/awesome-bible-data
- Title: GitHub - jcuenod/awesome-bible-data: 😎 A curated list of generously licensed Bible data.
Text-Fabric versus SHEBANQ - BHSA #
- Source: https://etcbc.github.io/bhsa/mql/
- Title: Text-Fabric versus SHEBANQ - BHSA
GitHub - ETCBC/course_materials: Contains scripts to learn to work with the ETCBC database #
- Source: https://github.com/ETCBC/course_materials
- Title: GitHub - ETCBC/course_materials: Contains scripts to learn to work with the ETCBC database
bhsa/bigTablesP.ipynb at master · ETCBC/bhsa · GitHub #
- Source: https://github.com/ETCBC/bhsa/blob/master/programs/bigTablesP.ipynb
- Title: bhsa/bigTablesP.ipynb at master · ETCBC/bhsa · GitHub
Shebanq / Words Words #
- Source: https://shebanq.ancient-data.org/hebrew/words
- Title: Words
Shebanq / text [2017] Genesis 30:1 #
- Source: https://shebanq.ancient-data.org/hebrew/text
- Title: [2017] Genesis 30:1
Jupyter Notebook Viewer #
- Source: https://nbviewer.org/github/etcbc/bhsa/blob/master/tutorial/search.ipynb
- Title: Jupyter Notebook Viewer
OSHB Read #
- Source: https://hb.openscriptures.org/read/
- Title: OSHB Read
OSHB Lexicon #
- Source: http://openscriptures.github.io/HebrewLexicon/HomeFiles/Lexicon.html
- Title: OSHB Lexicon
Bible Online Learner #
- Source: https://booge.eu/
- Title: Bible Online Learner
J. Ted Blakley — Online Hebrew Resources #
- Source: https://www.blakleycreative.com/jtb/HebrewOnline.htm
- Title: J. Ted Blakley — Online Hebrew Resources
Tanach.us text files webservice Text files #
- Source: https://www.tanach.us/Pages/TextFiles.html
- Title: Text files
- Command: wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" "http://tanach.us/Server.txt?Deuteronomy26:1-1&layout=Text-only&content=Consonants"
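For use from a script, the same request could be prepared in Python along these lines. This is a hedged sketch: the URL pattern is copied from the wget command above and generalized on the assumption that other books and verse ranges follow the same shape. The names `build_request`, `BASE_URL` and `USER_AGENT` are hypothetical, and the request is only built, not sent.

```python
from urllib.request import Request

BASE_URL = "http://tanach.us/Server.txt"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)


def build_request(book, chapter, first_verse, last_verse):
    """Build (but do not send) a request for consonant-only text of a verse range."""
    url = (
        f"{BASE_URL}?{book}{chapter}:{first_verse}-{last_verse}"
        "&layout=Text-only&content=Consonants"
    )
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("Deuteronomy", 26, 1, 1)
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req).read()
# (how the response bytes should be decoded is not verified here).
```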
Removing Vowels from Hebrew Unicode Text · GitHub #
- Source: https://gist.github.com/yakovsh/345a71d841871cc3d375
- Title: Removing Vowels from Hebrew Unicode Text · GitHub
shoroshim.pdf #
- Source: https://halakhah.com/rst/shoroshim.pdf
unfoldingWord® Hebrew Grammar — unfoldingWord® Hebrew Grammar 1 documentation #
- Source: https://uhg.readthedocs.io/en/latest/front.html
- Title: unfoldingWord® Hebrew Grammar — unfoldingWord® Hebrew Grammar 1 documentation