Fury Java Serialization Format
Spec overview
Fury Java Serialization is an automatic object serialization framework that supports reference and polymorphism. Fury will convert an object from/to fury java serialization binary format. Fury has two core concepts for java serialization:
- Fury Java Binary format
- Framework to convert object to/from Fury Java Binary format
The serialization format is a dynamic binary format. Its dynamic nature and its support for references and polymorphism make Fury flexible and much easier to use, but they also make the format more complex than those of static serialization frameworks.
Here is the overall format:
| fury header | object ref meta | object class meta | object value data |
The data is serialized using little endian byte order overall. If byte swapping is costly for some object, Fury will write the byte order for that object into the data instead of converting it to little endian.
Fury header
Fury header starts with one byte, optionally followed by a 4-byte meta start offset:
| 4 bits | 1 bit | 1 bit | 1 bit | 1 bit | optional 4 bytes |
+---------------+-------+-------+--------+-------+------------------------------------+
| reserved bits | oob | xlang | endian | null | unsigned int for meta start offset |
- null flag: 1 when object is null, 0 otherwise. If an object is null, other bits won't be set.
- endian flag: 1 when data is encoded by little endian, 0 for big endian.
- xlang flag: 1 when serialization uses xlang format, 0 when serialization uses Fury java format.
- oob flag: 1 when the passed BufferCallback is not null, 0 otherwise.
If meta share mode is enabled, an uncompressed unsigned int is appended to indicate the start offset of metadata.
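The flag byte described above can be sketched in Java as follows. This is a minimal illustration, not Fury's implementation: the class name `FuryHeaderSketch` and its methods are hypothetical, and the bit positions assume the table lists bits from most significant to least significant, so that the null flag is bit 0 and the oob flag is bit 3.

```java
final class FuryHeaderSketch {
    // Assumed bit positions (table read from most to least significant bit).
    static final int NULL_FLAG = 1;               // bit 0
    static final int LITTLE_ENDIAN_FLAG = 1 << 1; // bit 1
    static final int XLANG_FLAG = 1 << 2;         // bit 2
    static final int OOB_FLAG = 1 << 3;           // bit 3

    /** Builds the one-byte header from the four flags. */
    static byte encode(boolean isNull, boolean littleEndian, boolean xlang, boolean oob) {
        if (isNull) {
            // For a null object the other bits are left unset.
            return (byte) NULL_FLAG;
        }
        int header = 0;
        if (littleEndian) header |= LITTLE_ENDIAN_FLAG;
        if (xlang) header |= XLANG_FLAG;
        if (oob) header |= OOB_FLAG;
        return (byte) header;
    }

    /** Tests whether a given flag is set in the header byte. */
    static boolean isSet(byte header, int flag) {
        return (header & flag) != 0;
    }
}
```

Under these assumptions, a non-null object serialized in little endian with the Fury java format and no BufferCallback would produce the header byte `0b00000010`.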
Reference Meta
Reference tracking handles whether the object is null, and whether to track reference for the object by writing corresponding flags and maintaining internal state.
Reference flags:
Flag | Byte Value | Description |
---|---|---|
NULL FLAG | -3 | This flag indicates the object is a null value. We don't use another byte to indicate REF, so that we can save one byte. |
REF FLAG | -2 | This flag indicates the object has already been serialized previously, so fury will write a ref id in unsigned varint format instead of serializing it again. |
NOT_NULL VALUE FLAG | -1 | This flag indicates the object is a non-null value and fury doesn't track ref for this type of object. |
REF VALUE FLAG | 0 | This flag indicates the object is referenceable and is being serialized for the first time. |
When reference tracking is disabled globally, for specific types, or for certain types within a particular context (e.g., a field of a class), only the NULL and NOT_NULL VALUE flags will be used for reference meta.
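The flag selection above can be sketched as a small writer. The class `RefWriterSketch` is hypothetical; assigning ref ids sequentially from 0 and the 7-bits-per-byte varint layout are assumptions made for illustration, since this section does not define them.

```java
import java.io.ByteArrayOutputStream;
import java.util.IdentityHashMap;
import java.util.Map;

final class RefWriterSketch {
    static final byte NULL_FLAG = -3;
    static final byte REF_FLAG = -2;
    static final byte NOT_NULL_VALUE_FLAG = -1;
    static final byte REF_VALUE_FLAG = 0;

    // Identity-based map: two equal but distinct objects get distinct ids.
    private final Map<Object, Integer> writtenObjects = new IdentityHashMap<>();
    final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Writes the ref meta; returns true when the caller must still write the value. */
    boolean writeRefOrNull(Object obj, boolean trackRef) {
        if (obj == null) {
            out.write(NULL_FLAG);
            return false;
        }
        if (!trackRef) {
            out.write(NOT_NULL_VALUE_FLAG);
            return true;
        }
        Integer id = writtenObjects.get(obj);
        if (id != null) {
            out.write(REF_FLAG);
            writeVarUint(id); // ref id of the earlier occurrence
            return false;
        }
        // Assumption: ids are assigned in first-seen order starting at 0.
        writtenObjects.put(obj, writtenObjects.size());
        out.write(REF_VALUE_FLAG);
        return true;
    }

    // Assumed varint layout: 7 data bits per byte, high bit marks continuation.
    private void writeVarUint(int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }
}
```

Writing the same tracked object twice emits REF VALUE FLAG the first time and REF FLAG plus its id the second time, so the value bytes appear only once in the output.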
Class Meta
Fury supports registering a class with an optional id. The registration can be used for security checks and class identification.
If a class is registered, it will have a user-provided or auto-growing unsigned int, i.e. class_id.
Depending on whether meta share mode and registration are enabled for the current class, Fury will write class meta differently.
Schema consistent
If schema consistent mode is enabled globally or enabled for current class, class meta will be written as follows:
- If class is registered, it will be written as a fury unsigned varint: class_id << 1.
- If class is not registered:
  - If class is not an array, fury will write one byte 0bxxxxxxx1 first, then write class name.
    - The first (lowest) bit is 1, which differs from the first bit 0 of an encoded class id. Fury can use this information to determine whether to read class by class id for deserialization.
  - If class is not registered and class is an array, fury will write one byte dimensions << 1 | 1 first, then write the component class subsequently. This can reduce array class name cost if the component class is or will be serialized.
  - Class will be written as two enumerated fury unsigned by default: package name and class name. If meta share mode is enabled, class will be written as an unsigned varint which points to an index in MetaContext.
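The low-bit convention in the list above can be illustrated with a short sketch. `ClassMetaSketch` and its methods are hypothetical names, and the varint layout (7 data bits per byte, least significant group first) is an assumption; with that layout the lowest bit of the first byte is the lowest bit of the encoded value, which is what lets a reader distinguish the two cases.

```java
import java.io.ByteArrayOutputStream;

final class ClassMetaSketch {
    // Assumed varint layout: 7 data bits per byte, high bit marks continuation.
    static void writeVarUint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Registered class: class_id << 1, so the lowest bit is 0. */
    static byte[] writeRegistered(int classId) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarUint(out, classId << 1);
        return out.toByteArray();
    }

    /** Unregistered array class: one byte holding dimensions << 1 | 1. */
    static byte arrayMarker(int dimensions) {
        return (byte) ((dimensions << 1) | 1);
    }

    /** A reader can branch on the lowest bit of the first byte. */
    static boolean isRegistered(byte firstByte) {
        return (firstByte & 1) == 0;
    }
}
```

For example, a registered class with id 5 is written as the single byte 10, while a two-dimensional unregistered array starts with the byte 5.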
Schema evolution
If schema evolution mode is enabled globally or enabled for current class, class meta will be written as follows:
- If meta share mode is not enabled, class meta will be written as in schema consistent mode. Additionally, field meta such as field type and name will be written with the field value using a key-value like layout.
- If meta share mode is enabled, class meta will be written as a meta-share encoded binary if the class hasn't been written before; otherwise an unsigned varint id which references the previously written class meta will be written.
Meta share
This mode forbids streaming writes, since Fury needs to look back and update the start offset after the whole object graph has been written and meta collection is finished. Only in this way can we ensure that a deserialization failure doesn't lose shared meta. Meta streamline will be supported in the future for enclosed meta sharing which doesn't cross multiple serializations of different objects.
For Schema consistent mode, class will be encoded as an enumerated string by full class name. Here we mainly describe the meta layout for schema evolution mode:
| 8 bytes meta header | meta size | variable bytes | variable bytes | variable bytes |
+-------------------------------+-----------+--------------------+-------------------+----------------+
| 7 bytes hash + 1 byte header | 1~2 bytes | current class meta | parent class meta | ... |
Class meta are encoded from parent class to leaf class; only classes with serializable fields will be encoded.
Meta header
Meta header is a 64 bits number value encoded in little endian order.
- Lowest 4 bits 0b0000~0b1110 are used to record num classes. 0b1111 is reserved to indicate that Fury needs to read more bytes for the length using Fury unsigned int encoding. If the current class doesn't have a parent class, or the parent class doesn't have fields to serialize, or we're in a context which serializes fields of the current class only (ObjectStreamSerializer#SlotInfo is an example), num classes will be 1.
- 5th bit is used to indicate whether this class needs schema evolution.
- 6th bit is used to indicate whether the size sum of all layers meta is less than 256.
- Other 56 bits are used to store the unique hash of flags + all layers class meta.
Meta size
- If the size sum of all layers meta is less than 256, then one byte is written next to indicate the length of meta.
- Otherwise, write size as two bytes in little endian.
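The meta header and meta size rules above could be decoded along the following lines. This is only a sketch: `MetaHeaderSketch` is a hypothetical name, and placing the hash in the upper 56 bits (so that the lowest byte carries the flags) is an assumption drawn from the "7 bytes hash + 1 byte header" layout rather than something the text states bit by bit.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MetaHeaderSketch {
    static final long NUM_CLASSES_MASK = 0b1111;      // lowest 4 bits
    static final long SCHEMA_EVOLUTION_BIT = 1L << 4; // "5th bit"
    static final long SMALL_SIZE_BIT = 1L << 5;       // "6th bit"

    /** Packs the fields into a 64-bit header; hash placement is an assumption. */
    static long build(int numClasses, boolean evolution, boolean smallSize, long hash56) {
        long header = Math.min(numClasses, 0b1111); // 0b1111 means "read more bytes"
        if (evolution) header |= SCHEMA_EVOLUTION_BIT;
        if (smallSize) header |= SMALL_SIZE_BIT;
        return header | (hash56 << 8);
    }

    static int numClasses(long header) {
        return (int) (header & NUM_CLASSES_MASK);
    }

    static boolean needsSchemaEvolution(long header) {
        return (header & SCHEMA_EVOLUTION_BIT) != 0;
    }

    static boolean sizeFitsOneByte(long header) {
        return (header & SMALL_SIZE_BIT) != 0;
    }

    static long hash(long header) {
        return header >>> 8;
    }

    /** Reads the meta size that follows the header: one byte, or two bytes little endian. */
    static int readMetaSize(long header, ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        return sizeFitsOneByte(header) ? (buf.get() & 0xFF) : (buf.getShort() & 0xFFFF);
    }
}
```

The size flag in the header tells the reader how many size bytes to consume before it reaches the first layer of class meta.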
Single layer class meta
| unsigned varint | meta string | meta string | field info: variable bytes | variable bytes | ... |
+----------------------------+-----------------------+---------------------+-------------------------------+-----------------+-----+
| num fields + register flag | header + package name | header + class name | header + type id + field name | next field info | ... |
- num fields: encode num fields << 1 | register flag(1 when class registered) as unsigned varint.
  - If class is registered, then an unsigned varint class id will be written next, and package and class name will be omitted.
  - If current class is schema consistent, then num fields will be 0 to flag it.
  - If current class isn't schema consistent, then num fields will be the number of compatible fields. For example, users can use tag id to mark some fields as compatible fields in a schema consistent context. In such cases, schema consistent fields will be serialized first, then compatible fields will be serialized next. At deserialization, Fury will use the fields info of those fields which aren't annotated by tag id for deserializing schema consistent fields, then use the fields info in meta for deserializing compatible fields.
- Package name encoding(omitted when class is registered):
  - encoding algorithm: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL
  - Header: 6 bits size | 2 bits encoding flags. The 6 bits size: 0~63 will be used to indicate size 0~63; the value 63 means more bytes must be read for the size, and the encoding will encode size - 63 as a varint next.
- Class name encoding(omitted when class is registered):
  - encoding algorithm: UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL
  - header: 6 bits size | 2 bits encoding flags. The 6 bits size: 0~63 will be used to indicate size 0~63; the value 63 means more bytes must be read for the size, and the encoding will encode size - 63 as a varint next.
- Field info:
  - header(8 bits): 3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag. Users can use annotations to provide this info.
    - 2 bits field name encoding:
      - encoding: UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID
      - If tag id is used, i.e. field name is written as an unsigned varint tag id, the 2 bits encoding will be 11.
    - size of field name:
      - The 3 bits size: 0~7 will be used to indicate length 1~7; the value 6 means more bytes must be read for the size, and the encoding will encode size - 7 as a varint next.
      - If encoding is TAG_ID, then num_bytes of field name will be used to store tag id.
    - ref tracking: when set to 1, ref tracking will be enabled for this field.
    - nullability: when set to 1, this field can be null.
    - polymorphism: when set to 1, the actual type of field will be the declared field type even if the type is not final.
  - type id:
    - For registered type-consistent classes, it will be the registered class id.
    - Otherwise it will be encoded as OBJECT_ID if it isn't final and FINAL_OBJECT_ID if it's final. The meta for such types is written separately instead of being inlined here, which reduces meta space cost if objects of this type are serialized multiple times in the current object graph, and the field value may be null too.
  - Field name: If tag id is set, tag id will be used instead. Otherwise meta string encoding length and data will be written instead.
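The size overflow rule used by the package name and class name headers above can be sketched as follows. `NameHeaderSketch` is a hypothetical name; storing the size in the upper 6 bits of the header byte and the 7-bits-per-byte varint layout are both assumptions, since the text only gives the field widths.

```java
import java.io.ByteArrayOutputStream;

final class NameHeaderSketch {
    /** Writes "6 bits size | 2 bits encoding flags", spilling large sizes into a varint. */
    static byte[] writeHeader(int size, int encodingFlags) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int sizeBits = Math.min(size, 63);
        // Assumption: size occupies the upper 6 bits, flags the lower 2 bits.
        out.write((sizeBits << 2) | (encodingFlags & 0b11));
        if (sizeBits == 63) {
            int rest = size - 63; // remainder goes into a varint
            while ((rest & ~0x7F) != 0) {
                out.write((rest & 0x7F) | 0x80);
                rest >>>= 7;
            }
            out.write(rest);
        }
        return out.toByteArray();
    }

    /** Reads the size back, following the varint when the 6-bit field is saturated. */
    static int readSize(byte[] data) {
        int sizeBits = (data[0] & 0xFF) >>> 2;
        if (sizeBits < 63) {
            return sizeBits;
        }
        int result = 0;
        int shift = 0;
        int pos = 1;
        int b;
        do {
            b = data[pos++] & 0xFF;
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return 63 + result;
    }

    static int readEncodingFlags(byte[] data) {
        return data[0] & 0b11;
    }
}
```

Short names therefore cost a single header byte, and only names of 63 bytes or more pay for the extra varint.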
Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to re-sort fields based on the Fury field comparator. In this way, fury can compute statistics for field names or types and use a more compact encoding.
Other layers class meta
Same encoding algorithm as the previous layer except:
- header + package name:
  - Header:
    - If package name has been written before: varint index + sharing flag(set) will be written.
    - If package name hasn't been written before:
      - If meta string encoding is LOWER_SPECIAL and the length of encoded string <= 64, then header will be 6 bits size + encoding flag(set) + sharing flag(unset).
      - Otherwise, header will be 3 bits unset + 3 bits encoding flags + encoding flag(unset) + sharing flag(unset).
Meta String
Meta string is mainly used to encode meta strings such as class name and field names.
Encoding Algorithms
String binary encoding algorithm:
Algorithm | Pattern | Description |
---|---|---|
LOWER_SPECIAL | a-z._$\| | every char is written using 5 bits, a-z : 0b00000~0b11001 , ._$\| : 0b11010~0b11101 , prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) |
LOWER_UPPER_DIGIT_SPECIAL | a-zA-Z0~9._ | every char is written using 6 bits, a-z : 0b000000~0b011001 , A-Z : 0b011010~0b110011 , 0~9 : 0b110100~0b111101 , ._ : 0b111110~0b111111 , prepend one bit at the start to indicate whether strip last char since last byte may have 7 redundant bits(1 indicates strip last char) |
UTF-8 | any chars | UTF-8 encoding |
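The LOWER_SPECIAL row can be made concrete with a small round-trip sketch. `LowerSpecialSketch` is a hypothetical class, and packing bits from the most significant bit of each byte, with the strip flag in the very first bit, is an assumption about details the table leaves open. The strip flag is set when the padding at the end of the last byte is wide enough (5 bits or more) to look like one extra char.

```java
final class LowerSpecialSketch {
    private static final String SPECIAL = "._$|"; // codes 26..29

    static int charToCode(char c) {
        if (c >= 'a' && c <= 'z') return c - 'a';
        int idx = SPECIAL.indexOf(c);
        if (idx < 0) throw new IllegalArgumentException("unsupported char: " + c);
        return 26 + idx;
    }

    static char codeToChar(int code) {
        if (code < 26) return (char) ('a' + code);
        if (code < 30) return SPECIAL.charAt(code - 26);
        throw new IllegalArgumentException("invalid code: " + code);
    }

    static byte[] encode(String s) {
        int totalBits = s.length() * 5 + 1; // 1 leading flag bit + 5 bits per char
        byte[] out = new byte[(totalBits + 7) / 8];
        int bit = 1; // bit 0 is reserved for the strip flag
        for (int i = 0; i < s.length(); i++) {
            int code = charToCode(s.charAt(i));
            for (int b = 4; b >= 0; b--, bit++) {
                if ((code & (1 << b)) != 0) {
                    out[bit / 8] |= (byte) (1 << (7 - bit % 8));
                }
            }
        }
        // Padding of 5+ bits would decode as a spurious trailing char: flag it.
        if (out.length * 8 - totalBits >= 5) {
            out[0] |= (byte) 0x80;
        }
        return out;
    }

    static String decode(byte[] data) {
        if (data.length == 0) return "";
        boolean stripLast = (data[0] & 0x80) != 0;
        int count = (data.length * 8 - 1) / 5; // number of whole 5-bit slots
        if (stripLast) count--;
        StringBuilder sb = new StringBuilder();
        int bit = 1;
        for (int i = 0; i < count; i++) {
            int code = 0;
            for (int b = 0; b < 5; b++, bit++) {
                code = (code << 1) | ((data[bit / 8] >> (7 - bit % 8)) & 1);
            }
            sb.append(codeToChar(code));
        }
        return sb.toString();
    }
}
```

With this packing, the 15-char string `org.apache.fury` takes 10 bytes instead of the 15 bytes UTF-8 would need.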
Encoding flags:
Encoding Flag | Pattern | Encoding Algorithm |
---|---|---|
LOWER_SPECIAL | every char is in a-z._$\| | LOWER_SPECIAL |
FIRST_TO_LOWER_SPECIAL | every char is in a-z[c1,c2] except first char is upper case | replace first upper case char with lower case, then use LOWER_SPECIAL |
ALL_TO_LOWER_SPECIAL | every char is in a-zA-Z[c1,c2] | replace every upper case char by \| + lower case , then use LOWER_SPECIAL , use this encoding if it's smaller than Encoding LOWER_UPPER_DIGIT_SPECIAL |
LOWER_UPPER_DIGIT_SPECIAL | every char is in a-zA-Z[c1,c2] | use LOWER_UPPER_DIGIT_SPECIAL encoding if it's smaller than Encoding FIRST_TO_LOWER_SPECIAL |
UTF8 | any utf-8 char | use UTF-8 encoding |
Compression | any utf-8 char | lossless compression |
Notes:
- For package name encoding, c1,c2 should be ._; for field/type name encoding, c1,c2 should be _$.
- Depending on cases, one can choose to encode flags + data jointly, using 3 bits of the first byte for flags and the other bytes for data.
Shared meta string
The shared meta string format consists of header and encoded string binary. Header of encoded string binary will be inlined in shared meta header.
Header is written using little endian order, so Fury can read this flag first to determine how to deserialize the data.
Write by data
If string hasn't been written before, the data will be written as follows:
| unsigned varint: string binary size + 1 bit: not written before | 56 bits: unique hash | 3 bits encoding flags + string binary |
If string binary size is less than 16 bytes, the hash will be omitted to save space. Unique hash can be omitted too if the caller passes a flag to disable it. In such cases, the format will be:
| unsigned varint: string binary size + 1 bit: not written before | 3 bits encoding flags + string binary |
Write by ref
If string has been written before, the data will be written as follows:
| unsigned varint: written string id + 1 bit: written before |
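The two write paths above can be sketched for the hash-omitted layout only. Everything here is illustrative: `SharedMetaStringSketch` is a hypothetical class, the "written before" bit is assumed to be the lowest bit of the varint (1 for a reference, 0 for data), string ids are assumed to be assigned in first-seen order from 0, and the 3 encoding flag bits are assumed to occupy their own byte.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

final class SharedMetaStringSketch {
    private final Map<String, Integer> writtenIds = new HashMap<>();
    final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Writes a meta string by data the first time and by ref afterwards. */
    void write(String key, int encodingFlags, byte[] encoded) {
        Integer id = writtenIds.get(key);
        if (id != null) {
            // Write by ref: string id plus the "written before" bit (assumed lowest bit).
            writeVarUint((id << 1) | 1);
            return;
        }
        writtenIds.put(key, writtenIds.size());
        // Write by data: binary size plus the "not written before" bit, hash omitted.
        writeVarUint(encoded.length << 1);
        out.write(encodingFlags & 0b111); // assumption: flags stored in their own byte
        out.write(encoded, 0, encoded.length);
    }

    // Assumed varint layout: 7 data bits per byte, high bit marks continuation.
    private void writeVarUint(int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }
}
```

A repeated string then costs one varint instead of its full encoded bytes, which is the saving the shared format is designed for.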