# Quick Start Guide
Welcome to Shekar, a Python library for Persian Natural Language Processing. This guide will walk you through the most essential components so you can get started quickly with preprocessing, tokenization, pipelines, normalization, and embeddings.
## 1. Normalize Your Text

Use the `Normalizer` to standardize noisy or non-standard Persian text by applying a built-in pipeline of transformations.

```python
from shekar import Normalizer

normalizer = Normalizer()
text = "ۿدف ما ػمګ بۀ ێڪډيڱڕ أښټ"
print(normalizer.normalize(text))  # Output: "هدف ما کمک به یکدیگر است"
```
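Under the hood, normalizers of this kind typically map visually similar Arabic-script variants to their standard Persian code points. Here is a minimal, self-contained sketch of that general technique — an illustration only, not Shekar's actual implementation:

```python
# Minimal sketch of character-level Persian normalization.
# This mapping table is illustrative; a real normalizer covers many more
# characters (digits, spacing, diacritics, etc.).
CHAR_MAP = {
    "ي": "ی",  # Arabic yeh  -> Persian yeh
    "ك": "ک",  # Arabic kaf  -> Persian kaf
    "ة": "ه",  # teh marbuta -> heh
    "أ": "ا",  # alef with hamza above -> alef
    "إ": "ا",  # alef with hamza below -> alef
}

def normalize_chars(text: str) -> str:
    """Replace non-standard characters using the mapping table."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

print(normalize_chars("كتاب علي"))  # -> "کتاب علی"
```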
## 2. Use Preprocessing Components

Import and use individual text cleaners such as `EmojiRemover`, `DiacriticsRemover`, or `URLMasker`.

```python
from shekar.preprocessing import EmojiRemover

text = "سلام 🌹😊"
print(EmojiRemover()(text))  # Output: "سلام"
```

See the full list of components in `shekar.preprocessing`.
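Components like `EmojiRemover` are commonly built on Unicode-range regular expressions. The sketch below shows the general idea with a deliberately small pattern; Shekar's own pattern may be different and more complete:

```python
import re

# Illustrative emoji removal via a Unicode-range character class.
# The ranges below cover common emoji blocks but are not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoticons
    "\U00002700-\U000027BF"  # dingbats
    "]+"
)

def remove_emoji(text: str) -> str:
    """Strip emoji, then trim any whitespace left behind."""
    return EMOJI_PATTERN.sub("", text).strip()

print(remove_emoji("سلام 🌹😊"))  # -> "سلام"
```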
## 3. Build Custom Pipelines

Create your own pipeline by chaining any number of preprocessing steps:

```python
from shekar import Pipeline
from shekar.preprocessing import EmojiRemover, PunctuationRemover

pipeline = Pipeline([
    ("emoji", EmojiRemover()),
    ("punct", PunctuationRemover()),
])

text = "پرنده‌های 🐔 قفسی، عادت دارن به بی‌کسی!"
print(pipeline(text))  # Output: "پرنده‌های قفسی عادت دارن به بی‌کسی"
```

Pipelines support:

- Single strings or batches of strings
- Function decorators for auto-cleaning input arguments
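The decorator support can be pictured with a generic, self-contained sketch: a pipeline object that cleans named arguments before the wrapped function runs. The class and the `on_args` method below are illustrative stand-ins, not necessarily Shekar's exact API:

```python
import functools

# Generic sketch of the pipeline-as-decorator pattern: the named keyword
# arguments are passed through the pipeline before the function sees them.
class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, callable) pairs

    def __call__(self, text):
        for _, step in self.steps:
            text = step(text)
        return text

    def on_args(self, *arg_names):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(**kwargs):
                for name in arg_names:
                    kwargs[name] = self(kwargs[name])
                return func(**kwargs)
            return wrapper
        return decorator

pipeline = SimplePipeline([("strip", str.strip), ("lower", str.lower)])

@pipeline.on_args("text")
def shout(text):
    return text + "!"

print(shout(text="  Hello  "))  # -> "hello!"
```

The decorator simply reuses the pipeline's `__call__`, so the same cleaning logic serves both direct calls and decorated functions.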
## 4. Tokenize Text into Sentences

Use `SentenceTokenizer` to split text into sentences:

```python
from shekar.tokenizers import SentenceTokenizer

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
sentences = SentenceTokenizer().tokenize(text)
for s in sentences:
    print(s)
```
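Conceptually, sentence tokenization splits on terminal punctuation, which for Persian includes the Persian question mark `؟` alongside `.`, `!`, and `?`. A minimal regex sketch of the idea (not Shekar's implementation, which handles many more edge cases):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split on whitespace that follows terminal punctuation,
    # keeping the punctuation attached to each sentence.
    parts = re.split(r"(?<=[.!؟?])\s+", text.strip())
    return [p for p in parts if p]

text = "هدف ما کمک به یکدیگر است! ما می‌توانیم با هم کار کنیم."
for s in split_sentences(text):
    print(s)
# -> هدف ما کمک به یکدیگر است!
# -> ما می‌توانیم با هم کار کنیم.
```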
## Summary

| Task | Tool / Class |
|---|---|
| Normalize text | `Normalizer` |
| Clean text | `shekar.preprocessing` |
| Create pipeline | `Pipeline` |
| Tokenize | `SentenceTokenizer` |
| Word cloud | `WordCloud` |
All components work with both single strings and lists. Pipelines are composable and can be reused, tested, or applied dynamically via decorators.