Deep Dive into Preprocessing Techniques in NLP using Python - Part 1


Source: dev.to

Photo by Patrick Tomasso on Unsplash

Language and speech data we encounter in the real world are usually messy and disorganized. This makes them hard for machines to understand, so we need to preprocess them before we can make informed decisions during analysis and modelling.

Without a systematic way to start and keep data clean, bad data will happen - Donato Diorio

Consider the sentence below :

Lifeeee is such a painnn :(

And another sentence :

Life is such a pain :(

The two sentences carry the same semantic meaning but the former requires a bit of preprocessing to remove the extra characters at the end of some of the words.

Preprocessing is an essential part of NLP tasks and an important skill for Machine Learning Engineers and Data Scientists as they move on to building ML models.

To become good at data preprocessing, especially in NLP, you need to be able to detect and extract patterns in data.

As a result, knowing how to manipulate strings with regex should be a priority. In this tutorial we will dive deep into how to use regular expressions in Python.
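
As a taste of where this leads, the elongated sentence above can be cleaned with a single substitution. This is only a rough sketch: the helper name squeeze_repeats and the threshold of three repeated characters are assumptions made for illustration, and the rule would also shorten legitimate runs of three or more identical characters.

```python
import re

def squeeze_repeats(text):
    # (.) captures one character; \1{2,} requires that same character
    # at least two more times, so only runs of 3+ are collapsed to one.
    return re.sub(r'(.)\1{2,}', r'\1', text)

cleaned = squeeze_repeats('Lifeeee is such a painnn :(')
print(cleaned)  # Life is such a pain :(
```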

Regular Expressions

Regular expressions are strings of characters and symbols (literals and metacharacters) that are used to detect patterns in text.
Suppose we have the following text:

The numbers are : 022-236-1823, 0554-236-172, 055 345 17584, and 0234456812

We might want to extract only the numbers from the text.

The following regex can help us achieve this :

/\d+[\s-]?\d+[\s-]?\d+/
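
The slashes around the pattern are only delimiters for writing a regex down; they are not part of the pattern itself. As a quick sketch, using Python's re module (introduced further down), the pattern can be tried against the sample text:

```python
import re

text = 'The numbers are : 022-236-1823, 0554-236-172, 055 345 17584, and 0234456812'

# digits, optional space or hyphen, digits, optional space or hyphen, digits
numbers = re.findall(r'\d+[\s-]?\d+[\s-]?\d+', text)
print(numbers)
# ['022-236-1823', '0554-236-172', '055 345 17584', '0234456812']
```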

Another example :

Hello everyone, Helium, hectic, help me!

Considering the text above we might be interested in words that start with He or he.
We can type the following expression :

/[Hh]e[a-z]+/
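
A minimal sketch of this pattern in Python, again using the re module covered later in the tutorial:

```python
import re

text = 'Hello everyone, Helium, hectic, help me!'

# [Hh] accepts either case of the first letter, then a literal "e",
# then one or more lowercase letters.
words = re.findall(r'[Hh]e[a-z]+', text)
print(words)  # ['Hello', 'Helium', 'hectic', 'help']
```

Note that the pattern is not anchored to the start of a word, so in other text it would also match inside a word such as there (yielding here). Prefixing the pattern with \b is one way to avoid that.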

We can also match a specific word by typing it out as literals. For example, the regex below:

/happy/

will match happy in the sentence I am very happy.

To understand regex, we have to know the difference between metacharacters and literals.

Let's consider the pattern we used earlier:

/\d+[\s-]?\d+[\s-]?\d+/

The metacharacters are :

\, +, \d, [ ], \s, ?

The only literal we have is -.

In our second example :

/[Hh]e[a-z]+/

Our literals are H, h, e, a, z.

There are a lot of metacharacters in regex and each of them has its specific use case. Let's explore them and learn when to use them:

Metacharacter   Description
\               Escapes the character after it: it either gives that character a special meaning (e.g. \d) or makes a metacharacter literal (e.g. \.)
^               Matches the start of the input
$               Matches the end of the input
.               Matches any single character except a newline
|               Matches either of the given alternatives. E.g. x|y will match either x or y
?               Matches the character before it zero or one time. E.g. s?it will match sit or it
+               Matches the character before it one or more times. E.g. a+ will match the a in bag and the aaaa in baaaag
[ ]             Matches any one of the characters inside it. E.g. [A-Z] will match any uppercase letter from A to Z
\w              Matches any word character, including underscore. For ASCII text it is equivalent to [A-Za-z0-9_]
\W              Matches any non-word character. For ASCII text it is equivalent to [^A-Za-z0-9_]
\d              Matches any digit, i.e. 0-9
\D              Matches any non-digit character
\s              Matches any whitespace character
\S              Matches any non-whitespace character
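
A few of these metacharacters can be tried out directly. The sample strings below are made up for illustration, and the snippet relies on the re module that is introduced in the next section:

```python
import re

# ? makes the preceding character optional (zero or one occurrence)
print(re.findall(r's?it', 'sit it'))            # ['sit', 'it']

# + requires one or more of the preceding character
print(re.findall(r'a+', 'bag baaaag'))          # ['a', 'aaaa']

# | chooses between alternatives
print(re.findall(r'cat|dog', 'cat, dog, cow'))  # ['cat', 'dog']

# \w+ grabs runs of word characters, \D+ grabs runs of non-digits
print(re.findall(r'\w+', 'user_1 logged-in'))   # ['user_1', 'logged', 'in']
print(re.findall(r'\D+', '12ab34'))             # ['ab']

# . matches any character except a newline
print(re.findall(r'b.g', 'bag big b\ng'))       # ['bag', 'big']
```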

For more information on metacharacters, check this resource.

We will be using Python's built-in re module for regex matching operations.

Let's import the re module:

import re

To create a pattern you can use the compile() function, which takes the pattern you want to match and, optionally, one or more flags.

re.compile(pattern, flags=0)

Let's say we want to create a pattern to extract some number from the text : I am 25 years old. We can type :

re.compile(r'\d+')

The r prefix marks the pattern as a raw string. Python itself treats \ as an escape character in ordinary strings, so writing the pattern as a raw string ensures the backslashes reach the regex engine unchanged.

We will be using the following functions to match a pattern:
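
The difference is easiest to see with \b, which means backspace to Python but word boundary to the regex engine. The sample text below is an assumption chosen for illustration:

```python
import re

# In a normal string, '\b' is the single backspace character;
# in a raw string, r'\b' stays as backslash + b, which re reads as a word boundary.
print(len('\b'), len(r'\b'))  # 1 2

text = 'a cat in a category'
plain = re.findall('\bcat\b', text)   # searches for backspace characters
raw = re.findall(r'\bcat\b', text)    # searches for the whole word "cat"
print(plain)  # []
print(raw)    # ['cat']
```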

  • re.match() -> checks for a match only at the beginning of the string
  • re.search() -> checks for a match anywhere in the string
  • re.findall() -> checks for all occurrences of the match

Suppose we want to check whether Coming is at the beginning of the text below :

Coming is a verb

We will first create our pattern :

# Create our pattern
pattern = re.compile(r'Coming')

Let's use match() to match our pattern :

text = 'Coming is a verb'

# Creating our Match Object
match = pattern.match(text)
print(match)

## Output:
<re.Match object; span=(0, 6), match='Coming'>

Alternatively, we can use re.match() directly :

match = re.match(pattern, text)
print(match)

## Output:
<re.Match object; span=(0, 6), match='Coming'>

NOTE: When using match(), if the pattern isn't found at the beginning of the text, there will be no match.

Let's verify that with an example below :

pattern = re.compile(r'Coming')
text = 'Is Coming a verb?'

# Creating our Match Object
match = pattern.match(text)
print(match)

## Output:
None

As illustrated above, because the text begins with Is, there will be no match.

We can fix this by using the search() function instead:

pattern = re.compile(r'Coming')
text = 'Is Coming a verb?'

# Creating our Match Object
match = pattern.search(text)
print(match)

##Output:
<re.Match object; span=(3, 9), match='Coming'>

Yes! We have been able to match Coming. This is because the search() function matches anywhere within the text.

Now, what if a particular pattern exists multiple times within a text and we would like to detect all the instances of that pattern?

Example: Say, we want to detect all occurrences of a number within the string below :

These are four-digit numbers : 1245, 1220, 9028. 

Using the search() function will only match the first occurrence of the number :

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.search(text)
print(match)

## Output
<re.Match object; span=(31, 35), match='1245'>

Intuition behind the above code :

  • Our pattern \d+ has two components: \d and +.
  • \d matches any single digit such as 1, 2, ...
  • + is a quantifier. Added to \d, it matches one or more digits in a row and stops at the first non-digit character, such as whitespace or a letter. E.g. 1245
  • search() scans the text and, as soon as it finds the first stretch that fits the pattern, returns it. In this case it matches only 1245.

NOTE: search() only returns a single occurrence of the match.

We can use findall() to match all occurrences of the pattern in our text:

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.findall(text) # -> Returns a list
print(match)

## Output:
['1245', '1220', '9028']  

Suppose you have a large chunk of data and you don't want to collect all the matches at once. In that case you can retrieve the matches one at a time.

finditer() can help us achieve that.

Let's get the four-digit numbers in sequence:

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.finditer(text) # -> Returns a callable_iterator

# Let's check the type of the match
print(type(match))

## Output
<class 'callable_iterator'>

To get the next item from the iterator object, we can use Python's built-in next() function.

Let's get the matches in sequence:

print(next(match))  # -> Outputs the first match 

print(next(match))  # -> Outputs the second match 

print(next(match))  # -> Outputs the last match 

## Output
<re.Match object; span=(31, 35), match='1245'>
<re.Match object; span=(37, 41), match='1220'>
<re.Match object; span=(43, 47), match='9028'>
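
Calling next() a fourth time here would raise StopIteration, because the iterator is exhausted. In practice it is often simpler to loop over the iterator. The sketch below does that and uses the Match object's group() and span() methods to pull out the matched text and its position:

```python
import re

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+')

# Each item yielded by finditer() is a Match object;
# group() gives the matched text and span() its (start, end) position.
found = [(m.group(), m.span()) for m in pattern.finditer(text)]
for value, position in found:
    print(value, position)
# 1245 (31, 35)
# 1220 (37, 41)
# 9028 (43, 47)
```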

Using the ^ and $ metacharacters

^ is placed before a pattern to match it only at the beginning of a text. E.g. we can check whether the word The is at the beginning of a line by typing ^The.

Let's illustrate that with an example.
We can detect whether The is at the beginning of the text below:

The work is super easy. 

We can achieve that as illustrated:

text = 'The work is super easy.'
pattern = re.compile(r'^The')

match = pattern.search(text)

print(match)

## Output
<re.Match object; span=(0, 3), match='The'>

In the same way, $ is used to match whether a character or some set of characters is at the end of a line.
Let's check if cool is at the end of the sentence in the text below :

Regex is super cool

Code :

text = 'Regex is super cool'
pattern = re.compile(r'cool$')
match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(15, 19), match='cool'>

NOTE: By default, ^ matches only at the very start of the text and $ only at the very end (or just before a final newline), so neither can see line boundaries inside a multi-line string. In NLP and other applications, you will often work with multi-line documents that have to be preprocessed to extract patterns.

Let's consider an example.

Suppose we want to extract the first user id (24ga-d34) from the string:

'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

which contains user ids, each at the beginning of a new line.

Using the search() function alone wouldn't work:

pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}')
text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

match = pattern.search(text)
print(match)

## Output:
None

We can fix this by passing the re.MULTILINE (or re.M) flag to the compile() function.

You can check all the available flags in the re module.

The re.MULTILINE flag makes ^ match at the beginning of every line and $ match at the end of every line, instead of only at the beginning and end of the whole text.

Code :

import re
pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}', re.MULTILINE)
text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(9, 17), match='24ga-d34'> 

Intuition behind the above code :

  • ^ -> matches the pattern at the beginning of a line
  • \d{2} -> matches exactly two digits
  • [a-z]{2} -> matches exactly two lowercase letters
  • - -> matches a hyphen
  • [a-z] -> matches any single lowercase letter
  • \d{2} -> matches exactly two digits
  • re.MULTILINE -> overrides the default behavior of ^, which matches only at the beginning of the whole text.
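
The same flag also works with findall(). As a small sketch that goes one step beyond the example above, adding $ to the end of the pattern and using findall() collects every user id that occupies a whole line:

```python
import re

text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

# With re.MULTILINE, ^ and $ match at the start and end of each line.
pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}$', re.MULTILINE)

user_ids = pattern.findall(text)
print(user_ids)  # ['24ga-d34', '87bx-f60', '47nd-q21']
```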

{n} is a quantifier that matches the element before it exactly n times, where n is a non-negative integer.
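
A short sketch of {n} on its own, using made-up inputs. re.fullmatch() is used here so that the whole string has to fit the pattern:

```python
import re

# \d{4} needs exactly four digits in a row; fullmatch() requires
# the entire string to fit the pattern.
print(bool(re.fullmatch(r'\d{4}', '1245')))   # True
print(bool(re.fullmatch(r'\d{4}', '124')))    # False
print(bool(re.fullmatch(r'\d{4}', '12456')))  # False

# Without anchoring, {4} simply consumes four digits at a time.
chunks = re.findall(r'\d{4}', '12456789')
print(chunks)  # ['1245', '6789']
```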

To be continued later...

Conclusion

In this tutorial, you learnt the difference between literals and metacharacters in regex. You also learnt how to use these metacharacters to match patterns in text using the re module in Python. In the next part of the tutorial, we will delve into other preprocessing techniques in NLP.

Follow me for more of this content. Let's connect on LinkedIn!



