Lesk’s Algorithm: A Method for Word Sense Disambiguation in Text Analytics



Photo by Amador Loureiro on Unsplash

In the English language, more than 38% of words are polysemous, meaning that a single word may carry multiple definitions, or “senses” (Edmonds, 2006). The word “set”, for instance, which can be used as a noun, verb, and adjective, has dozens of distinct definitions, making it one of the most polysemous words recorded in the English dictionary. Thus, if we were asked to “please set the set of cutlery on the table”, how would we instinctively know the difference between the two uses of “set”? The answer is all in the context. The human brain’s neural networks perform semantic processing, storage, and retrieval in a way that makes us naturally gifted at word-sense disambiguation (WSD): we are quite proficient at determining the meaning of a word with multiple definitions based on what makes the most sense in a given context. Natural language, in turn, has developed in a way that reflects this innate ability to perform complex contextual associations.

Word-sense disambiguation in NLP

While our brains can play these games with great success, words with multiple senses pose a significant problem in Natural Language Processing (NLP), the branch of artificial intelligence that studies the ability of computers to interpret and “understand” human language. The sense of a word depends on its context, but that context is difficult for computers to capture, complicated as it is by metaphors, modifiers, sentence negations, and countless other intricacies that make language hard for a machine to learn.

Due to the significant value of word sense disambiguation in practical applications of NLP, several methods for performing it have emerged over the years. Current machine learning approaches include supervised methods, whereby a collection of manually sense-tagged words is used to train classification algorithms. However, these training sets can be expensive, time-consuming to procure, and imperfect, since even human annotators agree on a word sense only 70–95% of the time (Edmonds, 2006). Given the manual effort required, unsupervised methods are also used, many of which aim to cluster words based on some measure of contextual similarity.

Lesk’s Algorithm: A simple method for word-sense disambiguation

Perhaps the earliest, and still one of the most commonly used, methods for word-sense disambiguation is Lesk’s algorithm, proposed by Michael E. Lesk in 1986. It is based on the idea that words appearing together in text are somehow related, and that this relationship and the corresponding context can be extracted from the dictionary definitions of the word of interest and of the words used around it. Developed long before modern machine-learning approaches to NLP, Lesk’s algorithm disambiguates a word of interest, usually appearing within a short phrase or sentence, by finding the pair of dictionary “senses” (i.e. definitions) with the greatest word overlap between them.

In the example Lesk uses, he references the words “pine” and “cone”, noting that the Oxford English Dictionary returns definitions along the following lines (paraphrased from Lesk’s 1986 paper):

pine: 1) kinds of evergreen tree with needle-shaped leaves; 2) waste away through sorrow or illness

cone: 1) solid body which narrows to a point; 2) something of this shape whether solid or hollow; 3) fruit of certain evergreen trees

In the simplest terms, Lesk’s algorithm counts the number of overlapping words, excluding stop words (words such as “the”, “a”, and “and”), between each dictionary definition of a word of interest and the dictionary definitions of the words surrounding it, known as its “context window”. It then takes the definition with the highest number of overlaps and infers it to be the word’s “sense”. If we consider “pine” to be the word of interest and “cone” the only word in its context window, comparing their dictionary definitions reveals that “evergreen” appears in a definition of each term. We can therefore infer that the word of interest “pine” refers to an evergreen tree, rather than to its alternate sense of wasting away.
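To make the overlap counting concrete, below is a minimal sketch of this simplified, dictionary-overlap version of Lesk in Python. The toy DICTIONARY, its sense labels, and the small stop-word list are illustrative stand-ins rather than a real lexicon; only the counting logic mirrors the procedure described above.

```python
# A toy illustration of simplified Lesk: pick the sense of the target
# word whose definition shares the most non-stop-words with the
# definitions of the words in its context window.

STOP_WORDS = {"the", "a", "an", "and", "of", "or", "to", "in", "is",
              "which", "with", "this", "whether", "through"}

# Hypothetical mini-dictionary: each word maps to (sense label, definition).
DICTIONARY = {
    "pine": [
        ("pine#1", "kinds of evergreen tree with needle-shaped leaves"),
        ("pine#2", "waste away through sorrow or illness"),
    ],
    "cone": [
        ("cone#1", "solid body which narrows to a point"),
        ("cone#2", "fruit of certain evergreen trees"),
    ],
}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {t for t in text.lower().split() if t not in STOP_WORDS}

def simple_lesk(target, context_words):
    """Return the sense of `target` whose definition overlaps most with
    the definitions of the words in its context window."""
    best_sense, best_overlap = None, 0
    for sense, definition in DICTIONARY[target]:
        signature = tokenize(definition)
        overlap = sum(
            len(signature & tokenize(context_def))
            for word in context_words
            for _, context_def in DICTIONARY.get(word, [])
        )
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simple_lesk("pine", ["cone"]))  # -> pine#1, via the shared word "evergreen"
```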

Advantages and Disadvantages

There are numerous advantages to Lesk’s algorithm, the primary one being that its simplicity makes it easy to implement, applicable in a variety of contexts, and thus easily generalizable. Lesk notes that the algorithm does not depend on global information: since the same word may be referenced many times throughout a text, with its meaning changing each time, the sense of a word is derived only from the immediate supporting words in its context window rather than from the entire text. In addition, Lesk’s algorithm is non-syntactic, meaning that it does not depend on the arrangement of words or the structure of a sentence, since all associations are made by dictionary definition only. This makes it a good approach to use in tandem with syntax-based text analytics solutions. For instance, a part-of-speech tagger might identify the use of “mole” as a noun, but would fail to differentiate between the animal (mole), the skin growth (mole), and the unit of measurement (mole), since all three are nouns. In cases such as this, the benefits of Lesk’s algorithm shine through.

Despite its simplicity and power, the biggest drawback of Lesk’s original algorithm is its performance: Lesk estimated its accuracy to be only around 50–70%, and it has been shown to be much lower when experimentally validated against sense-tagged texts (Viveros-Jiménez et al., 2013). The algorithm also notably suffers from low recall: it cannot provide a contextual definition for many words, either because there is no overlap at all between dictionary definitions, or because several definitions tie for the same number of overlaps. Furthermore, Lesk leaves several questions unanswered: which dictionary is best used, whether all matched terms should be weighted equally or by the length of the dictionary definition itself, and how wide the context window should be (Lesk proposes around 10 words but suggests that this is quite flexible).

Moving forward: more recent advancements on Lesk’s work

While Lesk left several questions and weaknesses in his approach unaddressed, significant efforts have been made to build upon the logic of his original work. In 2002, researchers proposed an adaptation of Lesk’s dictionary-based algorithm that replaces the dictionary with the lexical database WordNet (Banerjee & Pedersen, 2002). Unlike a dictionary, in which words and definitions are arranged alphabetically, WordNet is an online database that arranges words semantically, creating groups of nouns, verbs, adjectives, and adverbs. Words that are synonymous with each other are grouped into a relation known as a “synset”, and polysemous words can be identified in WordNet by their occurrence in multiple synsets, since each synset represents a distinct definition or “sense” of a word. WordNet also encodes word relationships such as hyponymy (a word that is part of a broader class, e.g. a tulip is a flower, a flower is a plant) and metonymy (a figure of speech in which something is referred to by the name of something closely related, e.g. the “crown” refers to the monarchy), and is therefore able to capture a much wider range of word definitions and relationships than the original Lesk algorithm. The structure of WordNet also allows for more targeted searching of related words: if the part of speech of the word of interest is known, only relations and synsets within that part of speech need to be searched.
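Since WordNet is central to this adapted approach, it helps to see its structure directly. The short sketch below uses NLTK’s WordNet interface (pip install nltk, then a one-time nltk.download("wordnet")); the exact synsets and definitions printed depend on the installed WordNet version.

```python
# Browsing WordNet synsets: each synset is one distinct "sense".
from nltk.corpus import wordnet as wn

# A polysemous word appears in several synsets, one per sense.
for synset in wn.synsets("bank", pos=wn.NOUN)[:3]:
    print(synset.name(), "->", synset.definition())

# Relations such as hyponymy/hypernymy are explicit edges in the graph:
tulip = wn.synsets("tulip")[0]
print(tulip.hypernyms())  # the broader class(es) that "tulip" belongs to
```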

The low-recall problem of Lesk’s algorithm has also been an area of improvement. In 2013, researchers published a paper comparing several experiments in which they adjusted the size of the context window in an effort to improve the algorithm’s performance (Viveros-Jiménez et al., 2013). A novel proposition was to restrict the context window to four words that each have at least one overlap with the target word’s senses, and to remove the target word itself from consideration in sense-matching. Since dictionary definitions often include the word of interest in an example sentence (e.g. “the pine tree is a type of evergreen tree” for the word “pine”), false overlaps can arise on “pine” itself, in which case the top-matched word across all senses ends up being “pine” rather than “evergreen”. The researchers found that combining these two approaches significantly improved both the precision and the recall of the Lesk algorithm, benchmarked against the alternative of simply choosing a context window of four words.
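A rough sketch of this window-selection tweak, reusing the toy DICTIONARY and tokenize helper from the earlier snippet (an illustration of the idea, not the paper’s exact procedure):

```python
def select_window(target, candidate_words, max_size=4):
    """Keep at most `max_size` context words that share at least one
    definition word with some sense of `target`, excluding the target
    word itself from every comparison."""
    # Definition words for each sense of the target, minus the target itself.
    signatures = [tokenize(d) - {target} for _, d in DICTIONARY[target]]
    window = []
    for word in candidate_words:
        if word == target:
            continue  # never let the target word disambiguate itself
        tokens = set()
        for _, definition in DICTIONARY.get(word, []):
            tokens |= tokenize(definition) - {target}
        if any(sig & tokens for sig in signatures):
            window.append(word)
        if len(window) == max_size:
            break
    return window

print(select_window("pine", ["pine", "cone", "table"]))  # -> ['cone']
```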

Applying Lesk’s Algorithm in Python

The following is a simple example of how Lesk’s algorithm can be implemented using the pywsd package for word-sense disambiguation in Python. The function used below implements the adapted Lesk algorithm described by Banerjee & Pedersen in 2002, which replaces the dictionary with WordNet.

Example of the adapted Lesk algorithm implemented in Python:
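As a hedged reconstruction of that example, the sketch below assumes pywsd’s adapted_lesk function (pip install pywsd) and two illustrative sentences; the exact synsets returned depend on the installed pywsd and WordNet versions.

```python
# Disambiguating "bank" with pywsd's WordNet-based adapted Lesk
# implementation (Banerjee & Pedersen, 2002).
from pywsd.lesk import adapted_lesk

sent1 = "I sat on the bank of the river and watched the fish"
sent2 = "I went to the bank to deposit my money"

sense1 = adapted_lesk(sent1, "bank")  # returns a WordNet Synset
sense2 = adapted_lesk(sent2, "bank")

print(sense1, "->", sense1.definition())  # expected: sloping-land sense
print(sense2, "->", sense2.definition())  # expected: financial-institution sense
```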

We can see here that, given the word “bank”, Lesk’s algorithm is able to infer that in one context “bank” refers to the financial institution, while in the other it refers to a type of sloping land. Again, when making such associations, the syntactic structure does not matter; rather, the success of Lesk’s algorithm depends on the availability of contextual cues, such as “river” and “fish” when referring to a river bank, and “deposit” and “money” when referring to a financial bank.

This example shows that Lesk is robust to sentence structure and can therefore be used to extract word senses from a variety of unstructured text data types, whether excerpts from a classical novel or a Twitter comment. However, because Lesk’s algorithm is sensitive to the size of the context window, this constraint is perhaps what most limits its implementation: the data must be structured and tokenized into appropriately sized windows so that enough contextual information is provided while the true meaning of the words of interest can still be captured. Since Lesk’s algorithm is best applied to short phrases and groupings of related terms with contextual information, it suits query-and-retrieve situations, such as an automated chatbot or search box, where some contextual information is input and interpreted, and a response or result is returned.

Though simple in intuition, Lesk’s algorithm and similar word sense disambiguation methods have served as foundational stepping stones for current work in developing intuitive everyday tools where word-sense disambiguation is necessary. Most notably, the intuition behind such algorithms can be practically applied to develop new algorithms to improve the relevance of information retrieval by search engines used by billions of people around the world today. After all, when we search for an ambiguous term, we instinctively know to include a “contextual window” of terms to increase the likelihood of an accurate hit.

And why? Because context is everything.

Thanks for reading! Please feel free to reach out to me on Linkedin if you want to share any thoughts!

https://www.linkedin.com/in/duncan-w/

Resources

Banerjee, S. & Pedersen, T. (2002). An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science, vol 2276, pp. 136–145. https://doi.org/10.1007/3-540-45715-1_11

Edmonds, P. (2006). Disambiguation, Lexical. Encyclopedia of Language and Linguistics (second edition), Elsevier, pp. 607–623.

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries. Proceedings of the 5th Annual International Conference on Systems Documentation (SIGDOC).

Tan, L. (2014). Pywsd: Python Implementations of Word Sense Disambiguation (WSD) Technologies [software]. Retrieved from https://github.com/alvations/pywsd

SAP Conversational AI. (2015, November 20). From context to user understanding. Retrieved February 11, 2021, from https://medium.com/@SAPCAI/from-context-to-user-understanding-a692b11d95aa

Viveros-Jiménez, F., Gelbukh, A., & Sidorov, G. (2013). Simple Window Selection Strategies for the Simplified Lesk Algorithm for Word Sense Disambiguation. In: Castro F., Gelbukh A., González M. (eds) Advances in Artificial Intelligence and Its Applications. MICAI 2013. Lecture Notes in Computer Science, vol 8265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45114-0_17


Lesk’s Algorithm: A Method for Word Sense Disambiguation in Text Analytics was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
