Normalization
Normalization is the process of transforming text into a standard format. This involves converting Arabic characters and numbers to Persian equivalents, replacing spaces with ZERO WIDTH NON-JOINER (half-space) where appropriate, and removing or unifying special characters. Normalization is an essential step in Persian natural language processing (NLP) as it reduces textual variation and improves the performance of downstream models such as search engines, classifiers, or information extraction tools.
Normalizer
The Normalizer
is a composite tool that standardizes input Persian text by applying a sequence of text transformations.
Example Usage
from shekar import Normalizer
normalizer = Normalizer()
text = "ۿدف ما ػمګ بۀ ێڪډيڱڕ أښټ"
normalized = normalizer.normalize(text)
print(normalized) # Output: "هدف ما کمک به یکدیگر است"
Batch and Decorator Support
# Apply pipeline to multiple strings
texts = [
"یادته گل رز قرمز 🌹 به تو دادم؟",
"بگو یهویی از کجا پیدات شد؟"
]
outputs = normalizer.fit_transform(texts)
print(outputs)
# ["یادته گل رز قرمز به تو دادم", "بگو یهویی از کجا پیدات شد"]
# Use decorator to apply pipeline on specific arguments
@normalizer.on_args("text")
def process_text(text):
return text
print(process_text("تو را من چشم👀 در راهم!"))
# Output: "تو را من چشم در راهم"
Custom Normalization Pipeline
You can also build a custom pipeline by selecting specific preprocessors using Pipeline
from shekar.preprocessing
.
Example Pipeline
from shekar import Pipeline
from shekar.preprocessing import EmojiRemover, PunctuationRemover
steps = [
("removeEmoji", EmojiRemover()),
("removePunct", PunctuationRemover()),
]
pipeline = Pipeline(steps)
text = "پرندههای 🐔 قفسی، عادت دارن به بیکسی!"
output = pipeline.fit_transform(text)
print(output) # Output: "پرندههای قفسی عادت دارن به بیکسی"
Batch and Decorator Support
# Apply pipeline to multiple strings
texts = [
"یادته گل رز قرمز 🌹 به تو دادم؟",
"بگو یهویی از کجا پیدات شد؟"
]
outputs = pipeline.fit_transform(texts)
print(outputs)
# ["یادته گل رز قرمز به تو دادم", "بگو یهویی از کجا پیدات شد"]
# Use decorator to apply pipeline on specific arguments
@pipeline.on_args("text")
def process_text(text):
return text
print(process_text("تو را من چشم👀 در راهم!"))
# Output: "تو را من چشم در راهم"
Notes
Normalizer
class internally uses transformation steps listed in order as following:
steps = [
("AlphaNumericUnifier", AlphabetNormalizer()),
("ArabicUnicodeNormalizer", ArabicUnicodeNormalizer()),
("NumericNormalizer", NumericNormalizer()),
("PunctuationUnifier", PunctuationNormalizer()),
("EmailMasker", EmailMasker(mask="")),
("URLMasker", URLMasker(mask="")),
("EmojiRemover", EmojiRemover()),
("HTMLTagRemover", HTMLTagRemover()),
("DiacriticsRemover", DiacriticsRemover()),
("RedundantCharacterRemover", RedundantCharacterRemover()),
("NonPersianRemover", NonPersianRemover()),
("SpacingStandardizer", SpacingStandardizer()),
]