Normalizer Module
The Normalizer module is a collection of static methods for normalizing Turkish text. It includes methods for lowercasing, deasciifying, removing punctuation and accent marks, and converting numbers to words. The methods rely on a combination of built-in Python functions and regular expressions. Each method takes a string as input and returns the normalized version of that string. Helper functions for converting numbers to Turkish words live in the _builtin module. Overall, the module is useful for preparing text for further processing, such as natural language understanding and machine learning tasks.
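To make the "built-in functions plus regular expressions" approach concrete, here is a minimal sketch of punctuation removal and whitespace cleanup. The helper names `strip_punctuation` and `collapse_spaces` are hypothetical and chosen for illustration only; they are not the module's actual API, and the module's own implementation may differ.

```python
import re
import string

# Hypothetical helpers for illustration; not the Normalizer module's API.
# Character class matching any single ASCII punctuation character.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")
# One or more whitespace characters (spaces, tabs, newlines).
_SPACE_RE = re.compile(r"\s+")


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character from the text."""
    return _PUNCT_RE.sub("", text)


def collapse_spaces(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return _SPACE_RE.sub(" ", text).strip()


cleaned = collapse_spaces(strip_punctuation("Merhaba,   dünya!  Nasılsın?"))
print(cleaned)  # Merhaba dünya Nasılsın
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or other Unicode punctuation would need to be added to the character class separately.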
How the Normalizer Module Improves Text Preprocessing
Advanced Context-aware Replacements: The class can replace a character at a given position based on the characters surrounding it. This helps the output remain valid Turkish and makes the text more natural and readable.
Flexibility: The class can be easily integrated into pipelines for NLP tasks such as text classification, sentiment analysis, and text summarization.
Language Specific: The class is designed specifically for Turkish text and handles the unique characteristics and complexities of the language. This makes it more accurate and efficient than general-purpose normalization methods that ignore language-specific rules.
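Lowercasing is a good illustration of why language-specific handling matters. Python's generic `str.lower()` maps `I` to `i` and turns `İ` (U+0130) into `i` followed by a combining dot (U+0307), neither of which is correct for Turkish, where `I` lowercases to `ı` and `İ` to `i`. The sketch below shows one common workaround; `turkish_lower` is a hypothetical name used for illustration, not necessarily how the module implements it.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless I rules.

    Hypothetical helper for illustration; not the Normalizer module's API.
    """
    # Handle the two uppercase I letters explicitly, then let the generic
    # lower() take care of everything else (Ç, Ğ, Ö, Ş, Ü and ASCII).
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


sample = "ISPARTA \u0130L\u0130"  # "ISPARTA İLİ"

# Generic lowercasing: wrong first letter, stray combining dots after each i.
print(sample.lower())

# Turkish-aware lowercasing.
print(turkish_lower(sample))  # ısparta ili
```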
| Function Name | Description |
|---|---|
| | Converts numbers in a text to words in Turkish language. |
| | Deasciifies the given text for Turkish. |
| | Removes punctuations from the given string. |
| | Removes accent marks from the given string. |
| | Converts a string of text to lowercase for Turkish language. |
| | Normalizes Turkish characters in the given text. |
| | Removes extra spaces from the given text. |
| | Removes stop words from the given text. |
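As a rough illustration of the number-to-words conversion listed above, the following sketch spells out non-negative integers below one million in Turkish and substitutes them inside a sentence. The names `number_to_turkish` and `replace_numbers` and the supported range are assumptions made for this example; the module's own functions and coverage may differ.

```python
import re

# Hypothetical helpers for illustration; not the Normalizer module's API.
ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]


def _below_thousand(n: int) -> list:
    """Return the Turkish words for 1..999 as a list (empty list for 0)."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        # Turkish says "yüz" for 100, not "bir yüz".
        if hundreds > 1:
            words.append(ONES[hundreds])
        words.append("yüz")
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def number_to_turkish(n: int) -> str:
    """Spell out an integer in the range 0..999999 in Turkish."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only supports 0..999999")
    if n == 0:
        return "sıfır"
    thousands, rest = divmod(n, 1000)
    words = []
    if thousands:
        # Likewise "bin" for 1000, not "bir bin".
        if thousands > 1:
            words.extend(_below_thousand(thousands))
        words.append("bin")
    words.extend(_below_thousand(rest))
    return " ".join(words)


def replace_numbers(text: str) -> str:
    """Replace each run of ASCII digits with its Turkish spelling."""
    def _spell(match):
        value = int(match.group())
        # Leave numbers outside the supported range untouched.
        return number_to_turkish(value) if value < 1_000_000 else match.group()
    return re.sub(r"[0-9]+", _spell, text)


print(number_to_turkish(1923))                # bin dokuz yüz yirmi üç
print(replace_numbers("3 elma ve 15 armut"))  # üç elma ve on beş armut
```

Decimals, negative numbers, ordinals, and values of one million or more are deliberately left out to keep the sketch short.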