Skip to content

Preprocessing

Notebook Open In Colab

Normalizer

The built-in Normalizer class provides a ready-to-use, opinionated normalization pipeline for Persian text. It combines the most common and error-prone normalization steps into a single component, covering the majority of real-world use cases such as web text, social media, OCR output, and mixed informal–formal writing.

Most importantly, the normalization rules in Shekar strictly follow the official guidelines of Academy of Persian Language and Literature (فرهنگستان زبان و ادب فارسی). This makes the output suitable not only for NLP pipelines, but also for linguistically correct and publishable Persian text.

from shekar import Normalizer

normalizer = Normalizer()

text = "«فارسی شِکَر است» نام داستان ڪوتاه طنز    آمێزی از محمد علی جمالــــــــزاده ی گرامی می   باشد که در سال 1921 منتشر  شده است و آغاز   ڱر تحول بزرگی در ادَبێات معاصر ایران بۃ شمار میرود."
print(normalizer(text))

# نرمال‌سازی نویسه‌های گفتاری و روزمره
text = normalizer("می دونی که نمیخاستم ناراحتت کنم.اما خونه هاشون خیلی گرون تر شده")
print(text)

# نرمال‌سازی واژه‌های مرکب و افعال پیشوندی! 
text = normalizer("یک کار آفرین نمونه و سخت کوش ، پیروز مندانه از پس دشواری ها برخواهدآمد.")
print(text) 

Batch and Decorator Support

texts = [
    "یادته گل رز قرمز 🌹 به تو دادم؟",
    "بگو یهویی از کجا پیدات شد؟"
]
outputs = normalizer.fit_transform(texts)
print(outputs)

@normalizer.on_args("text")
def process_text(text):
    return text

print(process_text("تو را من چشم👀 در راهم!"))

Customization

Shekar is built around a modular and composable preprocessing framework that allows fine-grained control over each step of text processing. Preprocessing is implemented as small, independent operators such as filters, normalizers, and maskers, which can be used on their own or combined into flexible pipelines.

Pipelines are constructed using the Pipeline abstraction and composed with the | operator, making preprocessing logic explicit, readable, and easy to customize.

For example, the following pipeline is functionally equivalent to the default normalizer:

from shekar.preprocessing import (
    PunctuationNormalizer,
    AlphabetNormalizer,
    DigitNormalizer,
    SpacingNormalizer,
    RemoveDiacritics,
    RepeatedLetterNormalizer,
    ArabicUnicodeNormalizer,
    YaNormalizer,
)

normalizer = (
            AlphabetNormalizer()
            | ArabicUnicodeNormalizer()
            | DigitNormalizer()
            | PunctuationNormalizer()
            | RemoveDiacritics()
            | RepeatedLetterNormalizer()
            | SpacingNormalizer()
            | YaNormalizer(style="joda")
        )

Operators can also be composed for lightweight, task-specific preprocessing. For example, removing emojis and punctuation:

from shekar.preprocessing import EmojiRemover, PunctuationRemover

text = "ز ایران دلش یاد کرد و بسوخت! 🌍🇮🇷"
pipeline = EmojiRemover() | PunctuationRemover()
output = pipeline(text)
print(output)

Pipeline Features

  • Composable steps in a defined order
  • | operator chaining for concise definitions
  • Single-string and batch input support
  • Callable pipeline object (pipeline(text))
  • Decorator support via .on_args(...)
  • Clear errors for invalid inputs or configuration

Modular Preprocessing Components

The shekar.preprocessing package provides reusable blocks that can be used standalone or inside Pipeline.

1) Normalizers

Component Aliases Description
AlphabetNormalizer NormalizeAlphabets Converts Arabic characters to Persian equivalents
ArabicUnicodeNormalizer NormalizeArabicUnicodes Replaces Arabic presentation forms (e.g., ﷽) with Persian equivalents
DigitNormalizer NormalizeDigits Converts English/Arabic digits to Persian
PunctuationNormalizer NormalizePunctuations Standardizes punctuation symbols
RepeatedLetterNormalizer NormalizeRepeatedLetters Normalizes repeated letters (e.g., “سسسلام” → “سسلام”)
SpacingNormalizer NormalizeSpacings Fixes spacing, ZWNJ placement, and punctuation spacing
YaNormalizer NormalizeYas Normalizes Persian “یـا” in either standard or joda style
from shekar.preprocessing import AlphabetNormalizer, PunctuationNormalizer, SpacingNormalizer

print(AlphabetNormalizer()("نشان‌دهندة"))
print(PunctuationNormalizer()("سلام!چطوری?"))
print(SpacingNormalizer()("اینجا کجاست؟تو میدانی؟نمیدانم!"))

2) Filters / Removers

Component Aliases Description
DiacriticFilter DiacriticRemover, RemoveDiacritics Removes Persian/Arabic diacritics
EmojiFilter EmojiRemover, RemoveEmojis Removes emojis
NonPersianLetterFilter NonPersianRemover, RemoveNonPersianLetters Removes non-Persian content (optionally keeps English)
PunctuationFilter PunctuationRemover, RemovePunctuations Removes punctuation characters
StopWordFilter StopWordRemover, RemoveStopWords Removes frequent Persian stopwords
DigitFilter DigitRemover, RemoveDigits Removes digit characters
RepeatedLetterFilter RepeatedLetterRemover, RemoveRepeatedLetters Shrinks repeated letters
HTMLTagFilter HTMLRemover, RemoveHTMLTags Removes HTML tags while keeping content
HashtagFilter HashtagRemover, RemoveHashtags Removes hashtags
MentionFilter MentionRemover, RemoveMentions Removes @mentions
from shekar.preprocessing import EmojiFilter, DiacriticFilter

print(EmojiFilter()("😊🇮🇷سلام گلای تو خونه!🎉🎉🎊🎈"))
print(DiacriticFilter()("مَنْ"))

3) Maskers

Component Aliases Description
EmailMasker MaskEmails Masks or removes email addresses
URLMasker MaskURLs Masks or removes URLs
from shekar.preprocessing import URLMasker

print(URLMasker(mask="")("وب‌سایت ما: https://example.com"))

4) Utility Transforms

Component Purpose
NGramExtractor Extracts n-grams from text
Flatten Flattens nested lists into one list
from shekar.transforms import NGramExtractor, Flatten

ngrams = NGramExtractor(range=(1, 2))("سلام دنیا")
print(ngrams)

nested = [["سلام", "دنیا"], ["خوبی؟", "چطوری؟"]]
print(list(Flatten()(nested)))