---
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

name: python-development
description: Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/.
---

# Python Development in Apache Beam

## Project Structure

### Key Directories
- `sdks/python/` - Python SDK root
  - `apache_beam/` - Main Beam package
    - `transforms/` - Core transforms (ParDo, GroupByKey, etc.)
    - `io/` - I/O connectors
    - `ml/` - Beam ML code (RunInference, etc.)
    - `runners/` - Runner implementations and wrappers
    - `runners/worker/` - SDK worker harness
  - `container/` - Docker container configuration
  - `test-suites/` - Test configurations
  - `scripts/` - Utility scripts

### Configuration Files
- `setup.py` - Package configuration
- `pyproject.toml` - Build configuration
- `tox.ini` - Test automation
- `pytest.ini` - Pytest configuration
- `.pylintrc` - Linting rules
- `.isort.cfg` - Import sorting
- `mypy.ini` - Type checking

## Environment Setup

### Using pyenv (Recommended)
```bash
# Install Python
pyenv install 3.X  # Use supported version from gradle.properties

# Create virtual environment
pyenv virtualenv 3.X beam-dev
pyenv activate beam-dev
```

### Install in Editable Mode
```bash
cd sdks/python
pip install -e .[gcp,test]
```

### Enable Pre-commit Hooks
```bash
pip install pre-commit
pre-commit install

# To disable
pre-commit uninstall
```

## Running Tests

### Unit Tests (filename: `*_test.py`)
```bash
# Run all tests in a file
pytest -v apache_beam/io/textio_test.py

# Run tests in a class
pytest -v apache_beam/io/textio_test.py::TextSourceTest

# Run a specific test
pytest -v apache_beam/io/textio_test.py::TextSourceTest::test_progress
```

### Integration Tests (filename: `*_it_test.py`)

#### On Direct Runner
```bash
python -m pytest -o log_cli=True -o log_level=Info \
  apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference \
  --test-pipeline-options='--runner=TestDirectRunner'
```

#### On Dataflow Runner
```bash
# First build SDK tarball
pip install build && python -m build --sdist

# Run integration test
python -m pytest -o log_cli=True -o log_level=Info \
  apache_beam/ml/inference/pytorch_inference_it_test.py::PyTorchInference \
  --test-pipeline-options='--runner=TestDataflowRunner --project= --temp_location=gs:///tmp --sdk_location=dist/apache-beam-2.XX.0.dev0.tar.gz --region=us-central1'
```

## Building Python SDK

### Build Source Distribution
```bash
cd sdks/python
pip install build && python -m build --sdist
# Output: sdks/python/dist/apache-beam-X.XX.0.dev0.tar.gz
```

### Build Wheel (faster installation)
```bash
./gradlew :sdks:python:bdistPy311linux  # For Python 3.11 on Linux
```

### Build and Push SDK Container Image
```bash
./gradlew :sdks:python:container:py311:docker \
  -Pdocker-repository-root=gcr.io/your-project/your-name \
  -Pdocker-tag=custom \
  -Ppush-containers

# Container image will be pushed to:
# gcr.io/your-project/your-name/beam_python3.11_sdk:custom
```

To use this container image, supply it via `--sdk_container_image`.

## Running Pipelines with Modified Code

```bash
# Install modified SDK
pip install /path/to/apache-beam.tar.gz[gcp]

# Run pipeline
python my_pipeline.py \
  --runner=DataflowRunner \
  --sdk_location=/path/to/apache-beam.tar.gz \
  --project=my_project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp
```

## Common Issues

### `NameError` when running DoFn
Global imports, functions, and variables in the main pipeline module are not serialized by default. Use:
```bash
--save_main_session
```

### Specifying Additional Dependencies
Use `--requirements_file=requirements.txt` or custom containers.

## Test Markers
- `@pytest.mark.it_postcommit` - Include in PostCommit test suite

## Gradle Commands for Python
```bash
# Run WordCount
./gradlew :sdks:python:wordCount

# Check environment
./gradlew :checkSetup
```

## Code Quality Tools
```bash
# Linting
pylint apache_beam/

# Type checking
mypy apache_beam/

# Formatting (via yapf)
yapf -i apache_beam/file.py

# Import sorting
isort apache_beam/file.py
```
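
The four code quality tools above can be chained into a single local check. The following is a minimal sketch, not part of the Beam repository: the function names `quality_commands` and `run_all` are hypothetical, and the fix-then-check ordering is an assumption about a convenient workflow. It only wraps the commands already listed above and assumes the tools are installed and that it is run from `sdks/python`.

```python
import shlex
import subprocess
import sys
from typing import List, Sequence


def quality_commands(files: Sequence[str],
                     package: str = "apache_beam/") -> List[List[str]]:
    """Build the tool invocations listed above (hypothetical helper)."""
    commands: List[List[str]] = []
    if files:
        # yapf -i and isort both rewrite files in place, so they run
        # before the checkers (assumed ordering, not a Beam convention).
        commands.append(["yapf", "-i", *files])
        commands.append(["isort", *files])
    commands.append(["pylint", package])
    commands.append(["mypy", package])
    return commands


def run_all(commands: Sequence[Sequence[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in commands:
        # Echo the command to stderr so failures are easy to attribute.
        print("+ " + shlex.join(command), file=sys.stderr)
        result = subprocess.run(command, check=False)
        if result.returncode != 0:
            return result.returncode
    return 0
```

For example, `run_all(quality_commands(["apache_beam/io/textio.py"]))` would format and sort imports in that one file, then lint and type-check the whole package, stopping at the first tool that fails.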