API

avroc

This module holds the main public API.

avroc.compile_encoder(schema)

Construct a callable which encodes Python objects to bytes according to an Avro schema.

Parameters

schema (dict (see Schema Types)) – The schema to use when encoding data. Usually, this is a dict.

Return type

function encoder(msg) -> bytes
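
As a rough illustration of the call above, the sketch below compiles an encoder once and reuses it for several messages. The USER_SCHEMA record, the USERS list, and the encode_users helper are invented for this example, and it assumes that record messages are passed as plain dicts (see Message Types).

```python
# Hypothetical schema, invented for this example.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# Hypothetical messages; assumes Avro records are represented as dicts.
USERS = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": 45},
]


def encode_users(users):
    """Encode each user dict to Avro bytes with a single compiled encoder."""
    # Imported here so the definitions above load even without avroc installed.
    import avroc

    # Compile once; the returned callable is reused for every message.
    encode = avroc.compile_encoder(USER_SCHEMA)
    return [encode(user) for user in users]
```

Calling encode_users(USERS) should return one bytes object per message. Compiling outside the loop keeps the per-message work down to the encoding itself.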

avroc.compile_decoder(writer_schema, reader_schema=None)

Construct a callable which decodes Python objects from a bytes reader.

Parameters
  • writer_schema (dict (see Schema Types)) – The schema used by the writer when encoding data. Usually, this is a dict.

  • reader_schema (Optional[dict] (see Schema Types)) – An optional schema to transform messages into when decoding data. The schema must be compatible with the writer’s schema, in the sense described by the Avro spec; see Schema Resolution for details.

Return type

function decoder(fp) -> msg
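
The following sketch shows one way the two schema arguments might be used together: data is written under one schema and decoded into a slightly wider one. Both schemas and the roundtrip helper are made up for illustration. It assumes records are dicts, and it relies on the Avro rule that a reader field missing from the writer's data is filled from its default.

```python
import io

# Hypothetical writer schema: what the data was encoded with.
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}

# Hypothetical reader schema: adds a field, with a default so that
# older data can still be resolved against it.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}


def roundtrip(msg):
    """Encode msg with the writer schema, then decode it as the reader schema."""
    # Imported here so the schemas above load even without avroc installed.
    import avroc

    encode = avroc.compile_encoder(WRITER_SCHEMA)
    decode = avroc.compile_decoder(WRITER_SCHEMA, READER_SCHEMA)
    # The decoder reads from a file-like object, so wrap the bytes.
    return decode(io.BytesIO(encode(msg)))
```

Under those assumptions, roundtrip({"name": "Ada"}) would be expected to come back with an extra country field holding the default.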

avroc.read_file(fo, schema)

Read a file containing Avro messages. The file should already be open in binary mode (for example, open(path, "rb")).

The messages are provided as an iterator. To collect all the messages into a list, use list(read_file(fo)).

The optional schema parameter can be used to read data into a different shape than the writer used; see Schema Resolution for more.

Note that the writer’s schema is always included in an Avro data file, so the schema parameter is optional: you only need to pass it if you want to decode with a different schema than the one the writer used.

Parameters
  • fo (IO[bytes]) – A handle to a file-like bytes source to read.

  • schema (Optional[dict] (see Schema Types)) – An optional schema to transform messages into when decoding data. The schema must be compatible with the writer’s schema, in the sense described by the Avro spec; see Schema Resolution for details. If no schema is provided, then the writer’s schema is used.

Returns

An iterator of the messages in the file. The messages’ types depend on the schema used when decoding, as laid out in Message Types.

Return type

Iterable[msg]
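
A minimal sketch of reading a file follows. The load_messages helper and its parameter names are hypothetical. It passes None explicitly when no reader schema is wanted, on the assumption that this is equivalent to omitting the argument.

```python
def load_messages(path, reader_schema=None):
    """Return every message in the Avro file at path as a list."""
    # Imported here so this helper can be defined without avroc installed.
    import avroc

    with open(path, "rb") as fo:
        # read_file returns an iterator over the open file, so consume it
        # before the file is closed.
        return list(avroc.read_file(fo, reader_schema))
```

For large files it may be preferable to loop over the iterator inside the with block instead of building a list.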

avroc.write_file(fo, schema, messages)

Write messages to an open file according to a given Avro schema.

All messages in the iterable will be consumed and written.

Parameters
  • fo (IO[bytes]) – A handle to a file-like bytes destination to write to.

  • schema (dict (see Schema Types)) – The schema to use when encoding data.

  • messages (Iterable[msg]) – An iterable of the messages to write into the file. The messages must be encodable under the given schema; see Message Types for details.
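
Because messages can be any iterable, a generator can be passed so that records are produced on demand. The sketch below assumes that records are dicts. EVENT_SCHEMA, event_stream, and write_events are hypothetical names chosen for this example.

```python
# Hypothetical schema, invented for this example.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "kind", "type": "string"},
    ],
}


def event_stream(n):
    """Yield n made-up event records lazily."""
    for i in range(n):
        yield {"id": i, "kind": "click" if i % 2 == 0 else "view"}


def write_events(fo, n):
    """Write n generated events to fo, a file opened in binary mode."""
    # Imported here so the definitions above load even without avroc installed.
    import avroc

    # write_file consumes the whole generator before returning.
    avroc.write_file(fo, EVENT_SCHEMA, event_stream(n))
```

A typical call would look like write_events(open("events.avro", "wb"), 1000), ideally inside a with block so the file is closed afterwards.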

class avroc.AvroFileWriter(fo, schema, codec=NullCodec, block_size=1000)

A low-level class for writing Avro data to a file, complete with all persnickety details. Most users should use write_file.

AvroFileWriter provides these additional capabilities on top of write_file:
  • You can write messages one-by-one, rather than passing an entire iterator of messages.

  • You can choose a compression codec to apply to all data bytes written to the file; the codec is stored in the Avro header so other readers will know how to read the data automatically.

  • You can pick a block size and choose exactly when flushes occur.

Writes are buffered, and written in blocks of the given block size. As a result, it is important to call flush() to ensure that all writes are actually persisted to the underlying file.

The simplest way to do this is to use the AvroFileWriter as a context manager, for example:

from avroc import AvroFileWriter

with open("data.avro", "wb") as f:
    with AvroFileWriter(f, schema) as w:
        w.write(msg1)
        w.write(msg2)
        w.write(msg3)

# When the inner 'with' block is exited, all writes will be
# flushed, so this is safe.

Parameters
  • fo (File-like in bytes mode) – A file-like object that can be written to in binary mode.

  • schema (dict (see Schema Types)) – The schema to use when encoding data.

  • codec (avroc.codec.Codec) – A compression codec to use when encoding data. The valid options are all the classes in avroc.codec. Pass an instance of the codec, not the class itself.

write(msg)

Write a single message to the Avro file. Writes are batched into large blocks; call flush() to flush the current block.

Parameters

msg – A message conforming to the writer’s schema.

flush()

Flush any outstanding writes to the underlying file.

__enter__()

Returns self, allowing the writer to be used as a context manager.

__exit__(exc_type, exc_value, exc_traceback)

Flushes any buffered writes and exits the context-managed block.
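
The sketch below combines the capabilities listed above: it writes messages one at a time, selects a codec, and flushes at chosen points. The chunked and write_in_batches helpers and the batch_size parameter are hypothetical, and DeflateCodec is an arbitrary choice of codec.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_in_batches(path, schema, messages, batch_size=500):
    """Write messages to path, flushing after every batch_size messages."""
    # Imported here so chunked() can be used without avroc installed.
    from avroc import AvroFileWriter
    from avroc.codec import DeflateCodec

    with open(path, "wb") as f:
        # Pass a codec instance, not the class.
        with AvroFileWriter(f, schema, codec=DeflateCodec()) as w:
            for batch in chunked(messages, batch_size):
                for msg in batch:
                    w.write(msg)
                # Explicit flush at each batch boundary.
                w.flush()
```

Flushing at batch boundaries is only one possible policy; leaving it to the context manager is usually enough when a single flush at the end is acceptable.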

avroc.codec

Avro has some officially-endorsed codecs which can be used when writing files (and are automatically selected when reading encoded files). Using these can help you save some space, at the cost of a bit of CPU time for compression and decompression.

Avroc implements all the codecs from the Avro specification.
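
To get a feel for the space trade-off, this sketch compresses a sample payload with the Python standard library modules for three of the algorithms. It does not call avroc, so the sizes only approximate what the corresponding codecs would produce. Snappy and zstandard are left out because they need third-party packages. The compressed_sizes helper and the sample data are made up for this example.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data):
    """Return the size in bytes of data under several stdlib compressors."""
    return {
        "null": len(data),                    # no compression
        "deflate": len(zlib.compress(data)),  # zlib-wrapped DEFLATE
        "bzip2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),
    }


# Hypothetical, highly repetitive payload; real data will compress less well.
sample = b'{"name": "Ada", "age": 36}\n' * 1000
sizes = compressed_sizes(sample)

for name, size in sizes.items():
    print(f"{name:8s} {size:6d} bytes")
```

Timing each call on representative data gives the other half of the picture, since the better-compressing algorithms tend to cost more CPU.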

class avroc.codec.Codec

Abstract base class, implemented by the other classes in this module. Those classes are:

Class             Description

NullCodec         No compression
DeflateCodec      Compress with DEFLATE, similar to gzip
SnappyCodec       Compress with snappy
Bzip2Codec        Compress with bzip2
XZCodec           Compress with xz, from the lzma family
ZstandardCodec    Compress with zstandard

class avroc.codec.NullCodec

A NullCodec does no compression. It just passes data through.

class avroc.codec.DeflateCodec(compression_level=None)

A DeflateCodec uses the deflate algorithm from RFC 1951.

Parameters

compression_level (int) – The Deflate compression level to use. Higher is more compressed. 0 is no compression, and 9 is max compression. Defaults to the default compression level of Python’s zlib (currently 6).
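
For readers unfamiliar with raw DEFLATE, the following sketch shows how an RFC 1951 stream can be produced with Python's zlib at a chosen level. It is not taken from avroc, and whether the codec does this internally is an assumption. The raw_deflate and raw_inflate helpers are hypothetical.

```python
import zlib


def raw_deflate(data, level=zlib.Z_DEFAULT_COMPRESSION):
    """Compress data to a raw DEFLATE stream (no zlib header or checksum)."""
    # Negative wbits selects raw DEFLATE rather than the zlib container.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(blob):
    """Decompress a raw DEFLATE stream produced by raw_deflate."""
    return zlib.decompress(blob, -zlib.MAX_WBITS)


# Hypothetical payload used to compare levels.
payload = b"avro " * 2000
stored = raw_deflate(payload, 0)   # level 0: no compression
packed = raw_deflate(payload, 9)   # level 9: maximum compression
```

Both outputs decompress to the same payload; only their size and the time spent producing them differ.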

class avroc.codec.SnappyCodec

A SnappyCodec uses Google’s snappy compression algorithm, followed by a 4-byte CRC32 checksum.
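
The checksum trailer can be sketched with the standard library alone. In this example zlib stands in for snappy, which is not in the standard library, so only the framing is representative. The big-endian byte order and the choice to checksum the uncompressed data follow the Avro specification; frame_block and unframe_block are hypothetical helpers, not avroc functions.

```python
import struct
import zlib


def frame_block(uncompressed, compress):
    """Return the compressed block followed by a 4-byte big-endian CRC32."""
    # The checksum covers the uncompressed data.
    checksum = zlib.crc32(uncompressed)
    return compress(uncompressed) + struct.pack(">I", checksum)


def unframe_block(framed, decompress):
    """Split off the 4-byte trailer, decompress, and verify the checksum."""
    body, trailer = framed[:-4], framed[-4:]
    data = decompress(body)
    (expected,) = struct.unpack(">I", trailer)
    if zlib.crc32(data) != expected:
        raise ValueError("CRC32 mismatch: block is corrupt")
    return data


# zlib is a stand-in compressor here; a real snappy codec would be used instead.
block = b"some avro block bytes" * 10
framed = frame_block(block, zlib.compress)
restored = unframe_block(framed, zlib.decompress)
```

The trailer lets a reader detect corruption that the compressed format alone might not reveal.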

class avroc.codec.Bzip2Codec

A Bzip2Codec uses the bzip2 compression algorithm.

class avroc.codec.XZCodec

An XZCodec uses the lzma compression algorithm.

class avroc.codec.ZstandardCodec(compressor=None)

A ZstandardCodec uses the zstandard compression algorithm.

Parameters

compressor (zstandard.ZstdCompressor) – The compressor to use when compressing data. It may already have been trained on other data. If unset, a compressor with all the default values is used.