Preprocessing

The `shekar.preprocessing` module provides tools for cleaning and standardizing Persian and mixed-language text for NLP tasks. These tools include removers, normalizers, and maskers. Each class is described below with a short example.

1. `SpacingStandardizer`

Purpose: Cleans extra spaces and newlines, and fixes spacing around punctuation and ZWNJ.

from shekar.preprocessing import SpacingStandardizer

text = "   این یک   متن   تستی   است. "
standardizer = SpacingStandardizer()
print(standardizer(text))  # Output: "این یک متن تستی است."
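To make the idea concrete, the following is a minimal standard-library sketch of whitespace standardization. It is not shekar's implementation: the function name `standardize_spacing` and the exact rules (collapse spaces, collapse blank lines, drop spaces before punctuation) are assumptions chosen for illustration, and ZWNJ handling is left out.

```python
import re

def standardize_spacing(text: str) -> str:
    """Illustrative stand-in for a spacing standardizer (hypothetical helper)."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse repeated newlines (and spaces around them) into one newline.
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    # Remove spaces before Latin and Persian punctuation marks.
    text = re.sub(r" +([.,!?:;\u060C\u061B\u061F])", r"\1", text)
    # Trim leading and trailing whitespace.
    return text.strip()

print(standardize_spacing("  a   b  .  "))
```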

2. `AlphabetNormalizer`

Purpose: Unifies variant or Arabic forms of Persian characters (e.g., "ۀ" to "ه").

from shekar.preprocessing import AlphabetNormalizer

text = "نشان‌دهندة سایة"
normalizer = AlphabetNormalizer()
print(normalizer(text))  # Output: "نشان‌دهنده سایه"

3. `NumericNormalizer`

Purpose: Converts English, Arabic, and circled numerals into Persian digits.

from shekar.preprocessing import NumericNormalizer

text = "٠١٢٣ ⒈ 1"
normalizer = NumericNormalizer()
print(normalizer(text))  # Output: "۰۱۲۳ ۱ ۱"
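Digit normalization of this kind can be pictured as a character-to-character translation table. The sketch below uses only the standard library and covers ASCII and Arabic-Indic digits; the name `to_persian_digits` is hypothetical, and the library's own mapping (which also handles other numeral forms) may differ.

```python
# Build the digit sets from code points to avoid look-alike characters.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # U+06F0..U+06F9
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))   # U+0660..U+0669
ASCII_DIGITS = "0123456789"

# One translation table mapping both source sets onto Persian digits.
DIGIT_TABLE = str.maketrans(
    ASCII_DIGITS + ARABIC_DIGITS,
    PERSIAN_DIGITS + PERSIAN_DIGITS,
)

def to_persian_digits(text: str) -> str:
    """Illustrative helper: map ASCII and Arabic-Indic digits to Persian digits."""
    return text.translate(DIGIT_TABLE)

print(to_persian_digits("2024"))
```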

4. `PunctuationNormalizer`

Purpose: Converts various forms of punctuation to their Persian equivalents.

from shekar.preprocessing import PunctuationNormalizer

text = "؟?،٬!%:؛"
normalizer = PunctuationNormalizer()
print(normalizer(text))  # Output: "؟؟،،!٪:؛"

5. `EmojiRemover`

Purpose: Removes all emoji characters from the text.

from shekar.preprocessing import EmojiRemover

text = "سلام 😊🌹🎉"
remover = EmojiRemover()
print(remover(text))  # Output: "سلام"

6. `EmailMasker`

Purpose: Masks or removes email addresses.

from shekar.preprocessing import EmailMasker

text = "تماس با ما: info@example.com"  # placeholder address; the original was obfuscated in the source
masker = EmailMasker(mask="")
print(masker(text))  # Output: "تماس با ما: "

7. `URLMasker`

Purpose: Masks or removes URLs.

from shekar.preprocessing import URLMasker

text = "وب‌سایت ما: https://example.com"
masker = URLMasker(mask="")
print(masker(text))  # Output: "وب‌سایت ما: "
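One way to picture what a masker does is a regular-expression substitution in which every match is replaced by the `mask` string. The sketch below is an assumption-laden illustration, not the library's code: the patterns `EMAIL_RE` and `URL_RE` are deliberately simple, and `mask_pattern` and the `<EMAIL>` placeholder are hypothetical.

```python
import re

# Deliberately simple patterns; real-world matching needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")

def mask_pattern(text: str, pattern: re.Pattern, mask: str) -> str:
    """Replace every match of `pattern` with `mask` (empty mask removes it)."""
    return pattern.sub(mask, text)

print(mask_pattern("site: https://example.com", URL_RE, ""))
print(mask_pattern("mail info@example.com now", EMAIL_RE, "<EMAIL>"))
```

Passing an empty mask removes the match entirely, which mirrors the `mask=""` usage in the examples above.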

8. `DiacriticsRemover`

Purpose: Removes diacritical marks (harakat such as fatha, kasra, and damma) from Persian/Arabic text.

from shekar.preprocessing import DiacriticsRemover

text = "کُجا نِشانِ قَدَم"
remover = DiacriticsRemover()
print(remover(text))  # Output: "کجا نشان قدم"
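As a rough sketch of the underlying idea, the common Arabic-script diacritics occupy the contiguous range U+064B (fathatan) through U+0652 (sukun), so they can be stripped with one character class. The helper name `remove_diacritics` is hypothetical, and the set of marks the library actually removes may be broader.

```python
import re

# Fathatan (U+064B) through sukun (U+0652): tanwin, short vowels, shadda, sukun.
DIACRITICS_RE = re.compile("[\u064B-\u0652]")

def remove_diacritics(text: str) -> str:
    """Illustrative helper: drop common Arabic-script diacritical marks."""
    return DIACRITICS_RE.sub("", text)

# "kaf + damma + jim + alef" becomes "kaf + jim + alef".
print(remove_diacritics("\u06a9\u064f\u062c\u0627"))
```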

9. `PunctuationRemover`

Purpose: Removes all punctuation symbols.

from shekar.preprocessing import PunctuationRemover

text = "سلام، دنیا!"
remover = PunctuationRemover()
print(remover(text))  # Output: "سلام دنیا"

10. `RedundantCharacterRemover`

Purpose: Reduces sequences of repeated characters (like stretched letters).

from shekar.preprocessing import RedundantCharacterRemover

text = "سلاممممممممم"
remover = RedundantCharacterRemover()
print(remover(text))  # Output: "سلامم"
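The output above keeps two copies of the stretched letter, which is consistent with capping each run of a repeated character at a fixed length. A minimal sketch of that behaviour with a backreference follows; `squeeze_repeats` and its `max_run` parameter are hypothetical names, and the library's actual rules may differ.

```python
import re

def squeeze_repeats(text: str, max_run: int = 2) -> str:
    """Illustrative helper: cap any run of one repeated character at `max_run`."""
    # (.) captures a character; \1{max_run,} requires at least max_run more copies,
    # so only runs longer than max_run are rewritten.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)

print(squeeze_repeats("helloooooo"))  # Output: "helloo"
```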

11. `ArabicUnicodeNormalizer`

Purpose: Converts Arabic presentation forms and symbols into Persian equivalents or full phrases.

from shekar.preprocessing import ArabicUnicodeNormalizer

text = "﷽ پنجاه هزار ﷼"
normalizer = ArabicUnicodeNormalizer()
print(normalizer(text))  # Output: "بسم الله الرحمن الرحیم پنجاه هزار ریال"

12. `StopwordRemover`

Purpose: Removes common Persian stopwords (e.g., "این", "است", "به").

from shekar.preprocessing import StopwordRemover

text = "این یک جملهٔ نمونه است"
remover = StopwordRemover()
print(remover(text))  # Output: "جملهٔ نمونه"
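Conceptually, stopword removal is a filter over tokens. The sketch below assumes plain whitespace tokenization and a caller-supplied stopword set; both are simplifications, `remove_stopwords` is a hypothetical helper, and the library ships its own stopword list and tokenization.

```python
def remove_stopwords(text: str, stopwords: set[str]) -> str:
    """Illustrative helper: drop whitespace-separated tokens found in `stopwords`."""
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept)

# A tiny, hand-picked stopword set used only for this demonstration.
demo_stopwords = {"this", "is", "a"}
print(remove_stopwords("this is a sample sentence", demo_stopwords))  # Output: "sample sentence"
```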

13. `NonPersianRemover`

Purpose: Removes all non-Persian characters. English characters and diacritics can optionally be kept.

from shekar.preprocessing import NonPersianRemover

text = "This is یک متن ترکیبی!"
remover = NonPersianRemover()
print(remover(text))  # Output: " یک متن ترکیبی!"

To keep English characters, pass keep_english=True:

remover = NonPersianRemover(keep_english=True)
print(remover("Test در کنار تست"))  # Output: "Test در کنار تست"

14. `HTMLTagRemover`

Purpose: Removes HTML tags while keeping the content.

from shekar.preprocessing import HTMLTagRemover

text = "<p>سلام دنیا</p>"
remover = HTMLTagRemover()
print(remover(text))  # Output: "سلام دنیا"
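For simple markup like the example above, tag stripping can be approximated with a single regular expression. This is only a sketch under that assumption: `strip_tags` is a hypothetical helper, and a regex does not handle comments, scripts, or a `>` inside attribute values, where a real parser such as the standard library's `html.parser` is the safer choice.

```python
import re

# Matches anything that looks like <...>; adequate for simple, well-formed tags only.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(html: str) -> str:
    """Illustrative helper: remove tags and keep the text between them."""
    return TAG_RE.sub("", html)

print(strip_tags("<p>hello <b>world</b></p>"))  # Output: "hello world"
```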

Notes on Usage

All preprocessors implement `__call__` and `fit_transform()` for pipeline compatibility.
Every class accepts either a single string or an iterable of strings.
A `ValueError` is raised if the input is invalid (e.g., neither a string nor a list of strings).
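The calling conventions described in these notes can be sketched with a small stand-alone class. `WhitespaceCollapser` is a hypothetical preprocessor written for this illustration, not part of shekar, and returning a list for iterable input is an assumption; the library may return a different container.

```python
from collections.abc import Iterable

class WhitespaceCollapser:
    """Hypothetical preprocessor showing the __call__ / fit_transform contract."""

    def _transform_one(self, text: str) -> str:
        # Collapse all whitespace runs and trim the ends.
        return " ".join(text.split())

    def fit_transform(self, X):
        # A single string is transformed and returned as a string.
        if isinstance(X, str):
            return self._transform_one(X)
        # An iterable is accepted only if every item is a string.
        if isinstance(X, Iterable):
            items = list(X)
            if all(isinstance(item, str) for item in items):
                return [self._transform_one(item) for item in items]
        raise ValueError("Input must be a string or an iterable of strings.")

    def __call__(self, X):
        # Calling the object is shorthand for fit_transform().
        return self.fit_transform(X)

collapser = WhitespaceCollapser()
print(collapser("  a   b "))                        # Output: "a b"
print(collapser.fit_transform(["a  b", " c "]))     # Output: ['a b', 'c']
```

Passing anything else, such as an integer or a list containing non-strings, raises `ValueError` in this sketch.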
