It all began in 1947, when a Bedouin shepherd stumbled upon the first part of a treasure trove of 15,000 ancient Jewish texts in a cave near the shores of the Dead Sea. The 2,000-year-old texts became known as the Dead Sea Scrolls, a series of manuscripts written primarily in Hebrew that detail life in the Holy Land at that time. Although some of the manuscripts are complete texts, there are thousands of fragments whose poor condition means that they cannot be deciphered.
But a new artificial intelligence system developed at Ben-Gurion University of the Negev (BGU) could offer a solution, both for texts that have not yet been deciphered and for those whose poor condition leaves us with more questions than answers.
The new system is the work of four undergraduate students from BGU's Department of Software Engineering and Information Systems, who produced it as part of their final project. It employs masked language modeling (MLM), a technique that uses context to predict hidden words in a phrase or sentence, and thus to decipher the text in Hebrew and Aramaic inscriptions.
The process created by Itay Asraf, Niv Fono, Eldar Karol and Harel Moshayof is similar to large language models (artificial intelligence platforms that process enormous amounts of written text to understand and create human language). The main difference between standard masked language modeling and the newly developed platform is the way missing text is presented.
In MLM, the unit of text to be hidden and predicted is chosen beforehand, whether it is a word, a phrase or a sentence. But there is no such luxury when trying to decipher fragmented ancient manuscripts. “In the case of a damaged ancient inscription, the missing parts may be different,” explains Mike Last, leader of the group. “Sometimes they include one word, sometimes they include a partial word, sometimes they include several words.”
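To make the contrast concrete, here is a minimal sketch of how damage of varying extent might be simulated on a known sentence. It is an illustration only, not the team's code: the `damage` function, the `?` placeholder and the English stand-in sentence are all assumptions made for the example, and keeping spaces visible is a simplification that real fragments would not guarantee.

```python
MASK = "?"  # hypothetical placeholder for one unreadable character


def damage(text: str, start: int, length: int) -> str:
    """Hide `length` characters of `text` beginning at `start`.

    Spaces are left visible (a simplifying assumption), so the gap may
    cover part of a word, a whole word, or several words.
    """
    end = min(len(text), start + length)
    hidden = "".join(c if c == " " else MASK for c in text[start:end])
    return text[:start] + hidden + text[end:]


# English stand-in sentence; the project itself worked with Hebrew and Aramaic.
sentence = "in the beginning God created the heaven and the earth"
start = sentence.index("beginning")

print(damage(sentence, start, 4))   # part of a word is missing
print(damage(sentence, start, 9))   # exactly one word is missing
print(damage(sentence, start, 13))  # two consecutive words are missing
```

The same call produces all three kinds of gap that Last describes; only the length of the hidden span changes.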
The entire project was completed in one year. First, the four students found large language models and masked language models that were compatible with Modern Hebrew. Then they began to accumulate text so that the algorithm could understand what was being asked. Once the Modern Hebrew data was incorporated into the models, they used it to create a model based on Ancient Hebrew.
Last explains that, due to the scarcity of Aramaic texts to feed the model, emphasis was placed on Hebrew. So the four students used the biblical texts of the Old Testament (mostly in Hebrew, with some passages in Aramaic) to train the platform. In total, the team used 22,144 Old Testament phrases.
The Old Testament was chosen not just for its language, but because its wording is known very precisely. That way, if the students hid some words and then looked at the model's prediction, they could tell how close the AI's prediction was to the original.
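The idea of hiding words whose true value is known and then scoring the prediction can be sketched in a few lines. The example below is a toy under stated assumptions: a simple word-pair counter stands in for the students' language model, the function names (`train_bigram`, `predict`, `masked_accuracy`) are hypothetical, and the English sentences are invented stand-ins rather than biblical text.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Toy stand-in for a language model: count which word follows which."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def predict(follows, prev_word):
    """Return the word seen most often after `prev_word`, or None if unseen."""
    if prev_word not in follows:
        return None
    return follows[prev_word].most_common(1)[0][0]


def masked_accuracy(follows, sentences):
    """Hide each non-initial word in turn and compare the guess to the known text."""
    hits = total = 0
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words)):
            total += 1
            if predict(follows, words[i - 1]) == words[i]:
                hits += 1
    return hits / total if total else 0.0


# Invented stand-in sentences; the real project trained on Old Testament text.
training_text = [
    "the king went to the city",
    "the king spoke to the people",
    "the king went to the river",
]
known_text = ["the king went to the city"]

model = train_bigram(training_text)
print(f"share of hidden words recovered: {masked_accuracy(model, known_text):.2f}")
```

Because the hidden words come from a text whose wording is known, every prediction can be marked right or wrong without expert judgment, which is what makes this kind of check possible.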
“Thanks to this new model, we can help historians who have dedicated their lives to recreating these ancient texts as accurately as possible,” concludes Last.