---
title: Row Format
sidebar_position: 11
id: row_format
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Apache Fory™ provides a random-access row format that enables reading nested fields from binary data without full deserialization.

## Overview

Row format drastically reduces overhead when working with large objects where only partial data access is needed. It also supports memory-mapped files for ultra-low memory footprint.
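
To make the random-access idea concrete, here is a minimal standard-library sketch. It does not use pyfory and does not reproduce the Fory row layout; the toy layout (a count header followed by fixed-width integers) and the helper names `write_toy_array`, `read_element`, and `read_element_from_file` are assumptions made purely for illustration. It shows why fixed-width slots allow a reader to jump straight to one value in a memory-mapped file instead of decoding everything before it.

```python
import mmap
import os
import struct
import tempfile

# Toy layout (NOT the Fory row format): an 8-byte little-endian element
# count, followed by one fixed-width 8-byte signed integer per element.
HEADER = struct.Struct("<q")
SLOT = struct.Struct("<q")


def write_toy_array(path, values):
    """Write values as a count header plus fixed-width int64 slots."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(values)))
        for v in values:
            f.write(SLOT.pack(v))


def read_element(buf, index):
    """Read one element by computing its byte offset; no other slot is decoded."""
    (count,) = HEADER.unpack_from(buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    (value,) = SLOT.unpack_from(buf, HEADER.size + index * SLOT.size)
    return value


def read_element_from_file(path, index):
    """Memory-map the file read-only and read a single element from the mapping."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return read_element(mm, index)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.bin")
    write_toy_array(path, [i * i for i in range(100_000)])
    print(read_element_from_file(path, 12_345))  # 152399025
```

The cost of `read_element` does not depend on the number of elements, which is the property the row format relies on for partial access. Nested and variable-length fields need additional offset bookkeeping that this sketch leaves out.
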

**Key Benefits:**

| Feature                 | Description                                            |
| ----------------------- | ------------------------------------------------------ |
| Zero-Copy Access        | Read nested fields without deserializing entire object |
| Memory Efficiency       | Memory-map large datasets directly from disk           |
| Cross-Language          | Binary format compatible between Python, Java, C++     |
| Partial Deserialization | Deserialize only specific elements you need            |
| High Performance        | Skip unnecessary data parsing for analytics workloads  |

## Basic Usage

```python
from dataclasses import dataclass
from typing import Dict, List

import pyfory


@dataclass
class Bar:
    f1: str
    f2: List[pyfory.Int64]


@dataclass
class Foo:
    f1: pyfory.Int32
    f2: List[pyfory.Int32]
    f3: Dict[str, pyfory.Int32]
    f4: List[Bar]


# Create encoder for row format
encoder = pyfory.encoder(Foo)

# Create large dataset
foo = Foo(
    f1=10,
    f2=list(range(1_000_000)),
    f3={f"k{i}": i for i in range(1_000_000)},
    f4=[Bar(f1=f"s{i}", f2=list(range(10))) for i in range(1_000_000)]
)

# Encode to row format
binary: bytes = encoder.to_row(foo).to_bytes()

# Zero-copy access - no full deserialization needed!
foo_row = pyfory.RowData(encoder.schema, binary)
print(foo_row.f2[100000])        # Access the element at index 100000 directly
print(foo_row.f4[100000].f1)     # Access nested field directly
print(foo_row.f4[200000].f2[5])  # Access deeply nested field directly
```

## PyArrow Schema Conversion

Row format can convert PyArrow schemas through `pyfory.format` when the `format` optional dependency is installed:

```python
import pyarrow as pa
from pyfory.format import from_arrow_schema, to_arrow_schema

arrow_schema = pa.schema(
    [
        pa.field("id", pa.int32(), nullable=False),
        pa.field("scores", pa.list_(pa.float64())),
    ]
)

fory_schema = from_arrow_schema(arrow_schema)
roundtrip_arrow_schema = to_arrow_schema(fory_schema)
```

This PyArrow conversion surface is separate from cross-language dense-array field annotations.
In object serialization, `pyfory.PyArray[T]` means the standard-library Python `array.array` carrier, not PyArrow.

## Cross-Language Compatibility

Row format works seamlessly across languages. The same binary data can be accessed from Java and C++.

### Java

```java
public class Bar {
  String f1;
  List<Long> f2;
}

public class Foo {
  int f1;
  List<Integer> f2;
  Map<String, Integer> f3;
  List<Bar> f4;
}

RowEncoder<Foo> encoder = Encoders.bean(Foo.class);

// Encode to row format (cross-language compatible with Python)
BinaryRow binaryRow = encoder.toRow(foo);

// Zero-copy random access without full deserialization
BinaryArray f2Array = binaryRow.getArray(1);  // Access f2 list
BinaryArray f4Array = binaryRow.getArray(3);  // Access f4 list
BinaryRow bar10 = f4Array.getStruct(10);      // Access 11th Bar
long value = bar10.getArray(1).getInt64(5);   // Access 6th element of bar.f2

// Partial deserialization - only deserialize what you need
RowEncoder<Bar> barEncoder = Encoders.bean(Bar.class);
Bar bar1 = barEncoder.fromRow(f4Array.getStruct(10));  // Deserialize 11th Bar only
Bar bar2 = barEncoder.fromRow(f4Array.getStruct(20));  // Deserialize 21st Bar only
```

### C++

```cpp
#include "fory/encoder/row_encoder.h"
#include "fory/row/writer.h"

struct Bar {
  std::string f1;
  std::vector<int64_t> f2;
  FORY_STRUCT(Bar, f1, f2);
};

struct Foo {
  int32_t f1;
  std::vector<int32_t> f2;
  std::map<std::string, int32_t> f3;
  std::vector<Bar> f4;
  FORY_STRUCT(Foo, f1, f2, f3, f4);
};

fory::encoder::RowEncoder<Foo> encoder;
encoder.Encode(foo);
auto row = encoder.GetWriter().ToRow();

// Zero-copy random access without full deserialization
auto f2_array = row->GetArray(1);                  // Access f2 list
auto f4_array = row->GetArray(3);                  // Access f4 list
auto bar10 = f4_array->GetStruct(10);              // Access 11th Bar
int64_t value = bar10->GetArray(1)->GetInt64(5);   // Access 6th element of bar.f2
std::string str = bar10->GetString(0);             // Access bar.f1
```

## Installation

Row format requires Apache Arrow:

```bash
pip install "pyfory[format]"
```

## When to Use Row Format

- **Analytics workloads**: When you only need to
  access specific fields
- **Large datasets**: When full deserialization is too expensive
- **Memory-mapped files**: Working with data larger than RAM
- **Data pipelines**: Processing data without full object reconstruction
- **Cross-language data sharing**: When data needs to be accessed from multiple languages

## Related Topics

- [Cross-Language Serialization](cross-language.md) - XLANG mode
- [Basic Serialization](basic-serialization.md) - Object serialization
- [Row Format Specification](https://fory.apache.org/docs/specification/row_format_spec) - Protocol details