--- description: "Identify document languages accurately using FastText models supporting 176 languages for multilingual text processing" categories: ["how-to-guides"] tags: ["language-identification", "fasttext", "multilingual", "176-languages", "detection", "classification"] personas: ["data-scientist-focused", "mle-focused"] difficulty: "intermediate" content_type: "how-to" modality: "text-only" --- # Language Identification Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets. ## How it Works NeMo Curator's language identification system works through a three-step process: 1. **Text Preprocessing**: For FastText classification, normalize input text by stripping whitespace and converting newlines to spaces. 2. **FastText Language Detection**: The pre-trained FastText language identification model ([`lid.176.bin`](https://fasttext.cc/docs/en/language-identification.html)) analyzes the preprocessed text and returns: - A confidence score (0.0 to 1.0) indicating certainty of the prediction - A language code (for example, "EN", "ES", "FR") in FastText's two-letter uppercase format 3. **Filtering and Scoring**: The pipeline filters documents based on a configurable confidence threshold (`min_langid_score`) and stores both the confidence score and language code as metadata. ### Language Detection Process The `FastTextLangId` filter implements this workflow by: - Loading the FastText language identification model on worker initialization - Processing text through `model.predict()` with `k=1` to get the top language prediction - Extracting the language code from FastText labels (for example, `__label__en` becomes "EN") - Comparing confidence scores against the threshold to determine document retention - Returning results as `[confidence_score, language_code]` for downstream processing This approach supports **176 languages** with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical. ## Usage The following example demonstrates how to create a language identification pipeline using Curator with distributed processing. ```python """Language identification using Curator.""" from nemo_curator.pipeline import Pipeline from nemo_curator.stages.text.filters import ScoreFilter from nemo_curator.stages.text.filters.fasttext import FastTextLangId from nemo_curator.stages.text.io.reader import JsonlReader def create_language_identification_pipeline(data_dir: str) -> Pipeline: """Create a pipeline for language identification.""" # Define pipeline pipeline = Pipeline( name="language_identification", description="Identify document languages using FastText" ) # Add stages # 1. Reader stage - creates tasks from JSONL files pipeline.add_stage( JsonlReader( file_paths=data_dir, files_per_partition=2, # Each task processes 2 files ) ) # 2. Language identification with filtering # IMPORTANT: Download lid.176.bin or lid.176.ftz from https://fasttext.cc/docs/en/language-identification.html fasttext_model_path = "/path/to/lid.176.bin" # or lid.176.ftz (compressed) pipeline.add_stage( ScoreFilter( FastTextLangId(model_path=fasttext_model_path, min_langid_score=0.3), score_field="language" ) ) return pipeline def main(): # Create pipeline pipeline = create_language_identification_pipeline("./data") # Print pipeline description print(pipeline.describe()) # Create executor and run results = pipeline.run() # Process results total_documents = sum(task.num_items for task in results) if results else 0 print(f"Total documents processed: {total_documents}") # Access language scores for i, batch in enumerate(results): if batch.num_items >0: df = batch.to_pandas() print(f"Batch {i} columns: {list(df.columns)}") # Language scores are now in the 'language' field if __name__ == "__main__": main() ``` ## Understanding Results The language identification process adds a score field to each document batch: 1. **`language` field**: Contains the FastText language identification results as a string representation of a list with two elements (for backend compatibility): - Element 0: The confidence score (between 0 and 1) - Element 1: The language code in FastText format (for example, "EN" for English, "ES" for Spanish) 2. **Task-based processing**: Curator processes documents in batches (tasks), and results are available through the task's Pandas DataFrame: ```python # Access results from pipeline execution for batch in results: df = batch.to_pandas() # Language scores are in the 'language' column print(df[['text', 'language']].head()) ``` For quick exploratory inspection, converting a `DocumentBatch` to a Pandas DataFrame is fine. For performance and scalability, write transformations as `ProcessingStage`s (or with the `@processing_stage` decorator) and run them inside a `Pipeline` with an executor. Curator’s parallelism and resource scheduling apply when code runs as pipeline stages; ad‑hoc Pandas code executes on the driver and will not scale.