Preprocessing
The shekar.preprocessing
module offers a suite of tools designed to clean and standardize Persian (and mixed) text for NLP tasks. These tools include removers, normalizers, and maskers. Below is a detailed guide to each class.
1. SpacingStandardizer
Purpose: Cleans extra spaces and newlines, and fixes spacing around punctuation and ZWNJ.
from shekar.preprocessing import SpacingStandardizer
text = " این یک متن تستی است. "
standardizer = SpacingStandardizer()
print(standardizer(text)) # Output: "این یک متن تستی است."
2. AlphabetNormalizer
Purpose: Unifies variant or Arabic forms of Persian characters (e.g., "ۀ" to "ه").
from shekar.preprocessing import AlphabetNormalizer
text = "نشاندهندة سایة"
normalizer = AlphabetNormalizer()
print(normalizer(text)) # Output: "نشاندهنده سایه"
3. NumericNormalizer
Purpose: Converts English, Arabic, and circled numerals into Persian digits.
from shekar.preprocessing import NumericNormalizer
text = "٠١٢٣ ⒈ 1"
normalizer = NumericNormalizer()
print(normalizer(text)) # Output: "۰۱۲۳ ۱ ۱"
4. PunctuationNormalizer
Purpose: Converts various forms of punctuation to their Persian equivalents.
from shekar.preprocessing import PunctuationNormalizer
text = "؟?،٬!%:؛"
normalizer = PunctuationNormalizer()
print(normalizer(text)) # Output: "؟؟،،!٪:؛"
5. EmojiRemover
Purpose: Removes all emoji characters from the text.
from shekar.preprocessing import EmojiRemover
text = "سلام 😊🌹🎉"
remover = EmojiRemover()
print(remover(text)) # Output: "سلام"
6. EmailMasker
Purpose: Masks or removes email addresses.
from shekar.preprocessing import EmailMasker
text = "تماس با ما: [email protected]"
masker = EmailMasker(mask="")
print(masker(text)) # Output: "تماس با ما: "
7. URLMasker
Purpose: Masks or removes URLs.
from shekar.preprocessing import URLMasker
text = "وبسایت ما: https://example.com"
masker = URLMasker(mask="")
print(masker(text)) # Output: "وبسایت ما: "
8. DiacriticsRemover
Purpose: Removes diacritical marks (e.g., َ ,ِ ,ُ ) from Persian/Arabic text.
from shekar.preprocessing import DiacriticsRemover
text = "کُجا نِشانِ قَدَم"
remover = DiacriticsRemover()
print(remover(text)) # Output: "کجا نشان قدم"
9. PunctuationRemover
Purpose: Removes all punctuation symbols.
from shekar.preprocessing import PunctuationRemover
text = "سلام، دنیا!"
remover = PunctuationRemover()
print(remover(text)) # Output: "سلام دنیا"
10. RedundantCharacterRemover
Purpose: Reduces sequences of repeated characters (like stretched letters).
from shekar.preprocessing import RedundantCharacterRemover
text = "سلاممممممممم"
remover = RedundantCharacterRemover()
print(remover(text)) # Output: "سلامم"
11. ArabicUnicodeNormalizer
Purpose: Converts Arabic presentation forms and symbols into Persian equivalents or full phrases.
from shekar.preprocessing import ArabicUnicodeNormalizer
text = "﷽ پنجاه هزار ﷼"
normalizer = ArabicUnicodeNormalizer()
print(normalizer(text)) # Output: "بسم الله الرحمن الرحیم پنجاه هزار ریال"
12. StopwordRemover
Purpose: Removes common Persian stopwords (e.g., "این", "است", "به").
from shekar.preprocessing import StopwordRemover
text = "این یک جملهٔ نمونه است"
remover = StopwordRemover()
print(remover(text)) # Output: "جملهٔ نمونه"
13. NonPersianRemover
Purpose: Removes all non-Persian characters (can keep English/diacritics if configured).
from shekar.preprocessing import NonPersianRemover
text = "This is یک متن ترکیبی!"
remover = NonPersianRemover()
print(remover(text)) # Output: " یک متن ترکیبی!"
With English support:
remover = NonPersianRemover(keep_english=True)
print(remover("Test در کنار تست")) # Output: "Test در کنار تست"
14. HTMLTagRemover
Purpose: Removes HTML tags while keeping the content.
from shekar.preprocessing import HTMLTagRemover
text = "<p>سلام دنیا</p>"
remover = HTMLTagRemover()
print(remover(text)) # Output: "سلام دنیا"
Notes on Usage
- All preprocessors implement
__call__
andfit_transform()
for pipeline compatibility. - You can pass a single string or an iterable of strings to all classes.
- Raise
ValueError
if input is invalid (e.g., not a string or list of strings).