protozero
1.6.3
Minimalistic protocol buffer decoder and encoder in C++.
Protozero is a very low-level library. You need to know some of the internals of Protocol Buffers to work with it!
So before reading any further in this document, read the following from the Protocol Buffer documentation:
Make sure you understand the basic types of values supported by Protocol Buffers. Refer to this handy table and the cheat sheet if you are getting lost.
You need a C++11-capable compiler for Protozero to work. Copy the files in the include/protozero
directory somewhere where your build system can find them. Keep the protozero
directory and include the files in the form
pbf_reader
To use the pbf_reader
class, add this include to your C++ program:
The pbf_reader
class contains asserts that will detect some programming errors. We encourage you to compile with asserts enabled in your debug builds.
Let's say you have a protocol description in a .proto
file like this:
To read messages created according to that description, you will have code that looks somewhat like this:
You always have to call next()
and then either one of the accessor functions (like get_uint32()
or get_string()
) to get the field value or skip()
to ignore this field. Then call next()
again, and so forth. Never call next()
twice in a row or any of the accessor or skip functions twice in a row.
Because the pbf_reader
class doesn't know the .proto
file, it doesn't know which field names or tags there are, and it doesn't know the types of the fields. You have to make sure to call the right get_...()
function for each tag. Some assert()s
are done to check you are calling the right functions, but not all errors can be detected.
Note that it doesn't matter whether a field is defined as required
, optional
, or repeated
. You always have to be prepared to get zero, one, or more instances of a field and you always have to be prepared to get other fields, too, unless you want your program to break if somebody adds a new field.
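The calling pattern described above (call next(), then exactly one accessor or skip(), and always tolerate unknown fields) can be sketched without the library. The following is a minimal, self-contained illustration, not Protozero's implementation: mini_reader, example1, and parse_example1 are hypothetical names, only wire types 0 (varint) and 2 (length-delimited) are handled, and the field layout (uint32 x with tag 1, string s with tag 2) is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a low-level reader; NOT Protozero's pbf_reader.
// It only understands wire type 0 (varint) and wire type 2 (length-delimited).
class mini_reader {
    const char* m_pos;
    const char* m_end;
    std::uint32_t m_tag = 0;
    std::uint32_t m_wire_type = 0;

    // Reads one varint of at most 10 bytes.
    std::uint64_t read_varint() {
        std::uint64_t value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (m_pos == m_end) throw std::runtime_error("end of buffer");
            const auto byte = static_cast<unsigned char>(*m_pos++);
            value |= static_cast<std::uint64_t>(byte & 0x7fu) << shift;
            if ((byte & 0x80u) == 0) return value;
        }
        throw std::runtime_error("varint too long");
    }

    // Reads a length prefix and checks that the payload fits in the buffer.
    std::size_t read_length() {
        const std::uint64_t len = read_varint();
        if (len > static_cast<std::uint64_t>(m_end - m_pos)) {
            throw std::runtime_error("end of buffer");
        }
        return static_cast<std::size_t>(len);
    }

public:
    // The reader only stores pointers, so `data` must outlive the reader.
    explicit mini_reader(const std::string& data)
        : m_pos(data.data()), m_end(data.data() + data.size()) {}

    // Advances to the next field; returns false at the end of the message.
    bool next() {
        if (m_pos == m_end) return false;
        const std::uint64_t key = read_varint();
        m_tag = static_cast<std::uint32_t>(key >> 3);
        m_wire_type = static_cast<std::uint32_t>(key & 0x7u);
        return true;
    }

    std::uint32_t tag() const { return m_tag; }

    std::uint32_t get_uint32() { return static_cast<std::uint32_t>(read_varint()); }

    std::string get_string() {
        const std::size_t len = read_length();
        std::string result(m_pos, len);
        m_pos += len;
        return result;
    }

    // Ignores the current field, whatever its (supported) wire type is.
    void skip() {
        if (m_wire_type == 0) {
            read_varint();
        } else if (m_wire_type == 2) {
            m_pos += read_length();
        } else {
            throw std::runtime_error("unsupported wire type");
        }
    }
};

struct example1 {
    std::uint32_t x = 0;
    std::string s;
    int skipped = 0;  // number of fields this code did not know about
};

// Assumed layout: uint32 x = 1; string s = 2; everything else is skipped.
example1 parse_example1(const std::string& input) {
    example1 result;
    mini_reader message(input);
    while (message.next()) {
        switch (message.tag()) {
            case 1: result.x = message.get_uint32(); break;
            case 2: result.s = message.get_string(); break;
            default: message.skip(); ++result.skipped; break;
        }
    }
    return result;
}
```

The default branch is what keeps the parser working when somebody adds a new field to the message later.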
If, out of a protocol buffer message, you only need the value of a single field, you can use the version of the next()
function with a parameter:
As you saw in the example, handling scalar field types is reasonably easy. You just check the .proto
file for the type of a field and call the corresponding function called get_
+ field type.
For string
and bytes
types the internal handling is exactly the same, but both get_string()
and get_bytes()
are provided to make the code self-documenting. Both of these calls allocate and return a std::string
which can add some overhead. You can call the get_view()
function instead which returns a data_view
containing a pointer into the data (access with data()
) and the length of the data (access with size()
).
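To make the difference between a copy and a view concrete, here is a minimal sketch using only the standard library. simple_view, view_of, and copy_of are hypothetical names for illustration; simple_view is not Protozero's data_view, it merely has the same data()/size() shape described above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical non-owning view: a pointer into someone else's buffer plus a length.
class simple_view {
    const char* m_data;
    std::size_t m_size;

public:
    simple_view(const char* data, std::size_t size) : m_data(data), m_size(size) {}
    const char* data() const { return m_data; }
    std::size_t size() const { return m_size; }
};

// Returns a view of `length` bytes starting at `offset`. No bytes are copied.
// The caller must ensure offset + length <= buffer.size() and that `buffer`
// outlives the returned view.
simple_view view_of(const std::string& buffer, std::size_t offset, std::size_t length) {
    return simple_view(buffer.data() + offset, length);
}

// Returns an owning copy of the same bytes; this may allocate.
std::string copy_of(const std::string& buffer, std::size_t offset, std::size_t length) {
    return buffer.substr(offset, length);
}
```

The view is only valid while the underlying buffer is alive and unmodified; the copy stays valid on its own, at the price of the allocation.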
Fields that are marked as [packed=true]
in the .proto
file are handled somewhat differently. get_packed_...()
functions returning an iterator range are used to access the data.
So, for example, if you have a protocol description in a .proto
file like this:
You can get to the data like this:
Or, with a range-based for-loop:
So you are getting a pair of normal forward iterators wrapped in an iterator range object. The iterators can be used with any STL algorithm.
Note that this only applies to packed repeated fields; normal repeated fields are handled in the usual way for scalar fields.
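It can help to see why packed fields need different handling: on the wire, a packed repeated field is a single length-delimited field whose payload is the element values back to back, with no tags in between. The sketch below decodes such a payload for a varint type using only the standard library. decode_packed_uint32 is a hypothetical helper, not part of Protozero, and it copies into a vector rather than returning an iterator range.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes the payload of a packed repeated uint32 field, i.e. the bytes that
// follow the field's length prefix. The payload is a plain sequence of varints.
std::vector<std::uint32_t> decode_packed_uint32(const std::string& payload) {
    std::vector<std::uint32_t> values;
    std::size_t pos = 0;
    while (pos < payload.size()) {
        std::uint64_t value = 0;
        int shift = 0;
        for (;;) {
            if (pos == payload.size()) throw std::runtime_error("truncated varint");
            if (shift > 63) throw std::runtime_error("varint too long");
            const auto byte = static_cast<unsigned char>(payload[pos++]);
            value |= static_cast<std::uint64_t>(byte & 0x7fu) << shift;
            if ((byte & 0x80u) == 0) break;  // high bit clear: last byte of this varint
            shift += 7;
        }
        values.push_back(static_cast<std::uint32_t>(value));
    }
    return values;
}
```

The number of elements is not stored anywhere in the message; it only becomes known once the whole payload has been walked, which is why an iterator-style interface fits this kind of field well.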
Protocol Buffers can embed any message inside another message. To access an embedded message use the get_message()
function. So for this description:
you can parse with this code:
Enums are stored as varints and they can't be differentiated from them. Use the get_enum()
function to get the value of the enum; you have to translate it into the symbolic name yourself. See the enum
test case for an example.
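Translating the numeric value back into a symbolic name could look roughly like the following sketch. The Color enum and color_name function are hypothetical and assume a .proto declaration along the lines of `enum Color { BLACK = 0; RED = 1; GREEN = 2; }`; the raw integer is whatever your code obtained from the message.

```cpp
#include <cstdint>
#include <string>

// Hypothetical C++ mirror of an enum declared in a .proto file.
enum class Color : std::int32_t { BLACK = 0, RED = 1, GREEN = 2 };

// Maps the raw integer read from the message to a symbolic name.
// Values that this code does not know about are reported as "UNKNOWN"
// instead of being treated as an error, so newer writers keep working.
std::string color_name(std::int32_t raw) {
    switch (static_cast<Color>(raw)) {
        case Color::BLACK: return "BLACK";
        case Color::RED: return "RED";
        case Color::GREEN: return "GREEN";
    }
    return "UNKNOWN";
}
```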
Protozero uses assert()
liberally to help you find bugs in your own code when compiled in debug mode (i.e. with NDEBUG
not set). If such an assert "fires", this is a very strong indication that there is a bug in your code somewhere.
(Protozero will disable those asserts and "convert" them into exceptions in its own test code. This is done to make sure the asserts actually work as intended. Your test code will not need this!)
Exceptions, on the other hand, are thrown by Protozero if some kind of data corruption is detected while it is parsing the data. This could also indicate a bug in the user code, but because it can also happen if the data has been tampered with (intentionally or not), it is reported to the user code using exceptions.
Most of the functions on the writer side can throw a std::bad_alloc
exception if there is no space to grow a buffer. Other than that no exceptions can occur on the writer side.
All exceptions thrown by the reader side derive from protozero::exception
.
Note that any of these exceptions can also occur if you are expecting a data field of a certain type in your code but the field actually has a different type. In that case the pbf_reader
class might interpret the bytes in the buffer in the wrong way and anything can happen.
end_of_buffer_exception
This will be thrown whenever any of the functions "runs out of input data". It means you either have an incomplete message in your input or some other data corruption has taken place.
unknown_pbf_wire_type_exception
This will be thrown if an unsupported wire type is encountered. Either your input data is corrupted or it was written with an unsupported version of a Protocol Buffers implementation.
varint_too_long_exception
This exception indicates an illegal encoding of a varint. It means your input data is corrupted in some way.
invalid_tag_exception
This exception is thrown when a tag has an invalid value. Tags must be unsigned integers between 1 and 2^29-1. Tags between 19000 and 19999 are not allowed. See https://developers.google.com/protocol-buffers/docs/proto#assigning-tags
invalid_length_exception
This exception is thrown when a length field of a packed repeated field is invalid. For fixed size types the length must be a multiple of the size of the type.
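Two of the rules stated in this list are simple enough to express directly. The sketch below shows the tag range check and the length check for packed fields with fixed-size elements as plain functions. is_valid_tag and is_valid_packed_length are hypothetical helpers written for illustration; they are not Protozero functions and return a bool where the library would throw.

```cpp
#include <cstddef>
#include <cstdint>

// A tag is valid if it lies in 1 .. 2^29-1 and is outside the reserved
// range 19000 .. 19999.
bool is_valid_tag(std::uint32_t tag) {
    const std::uint32_t max_tag = (1u << 29) - 1;
    if (tag == 0 || tag > max_tag) return false;
    return tag < 19000 || tag > 19999;
}

// For packed fields of fixed-size elements (for instance 4 bytes for fixed32,
// 8 bytes for double) the payload length must be a multiple of the element size.
bool is_valid_packed_length(std::size_t length, std::size_t element_size) {
    return element_size != 0 && length % element_size == 0;
}
```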
pbf_reader
class
The pbf_reader
class behaves like a value type. Objects are reasonably small (two pointers and two uint32_t
, so 24 bytes on a 64-bit system) and they can be copied and moved around trivially.
pbf_reader
objects can be constructed from a std::string
or a const char*
and a length field (either supplied as separate arguments or as a std::pair
). In all cases objects of the pbf_reader
class store a pointer into the input data that was given to the constructor. You have to make sure this pointer stays valid for the duration of the object's lifetime.
pbf_message
One problem with the code above is the "magic numbers" used as tags for the different fields, which you got from the .proto
file. Instead of spreading these magic numbers around your code you can define them once in an enum class
and then use the pbf_message
template class instead of the pbf_reader
class.
Here is the first example again, this time using this new technique. So you have the following in a .proto
file:
Add the following declaration in one of your header files:
The message name becomes the name of the enum class
which is always built on top of the protozero::pbf_tag_type
type. Each field in the message becomes one value of the enum. In this case the name is created from the type (including the modifiers like required
or optional
) and the name of the field. You can use any name you want, but this convention makes it easier to get everything right later.
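As an illustration of this naming convention, a declaration could look like the sketch below. It is self-contained, so it uses a local alias tag_type in place of protozero::pbf_tag_type (assumed here to be a 32-bit unsigned integer). Apart from required_uint32_x, the field names and tag numbers are assumptions made for the example, and to_tag is a hypothetical helper.

```cpp
#include <cstdint>

// Assumption: local stand-in for protozero::pbf_tag_type so that this sketch
// compiles without the library.
using tag_type = std::uint32_t;

// One enum class per message; one enumerator per field.
// Enumerator name = modifier + type + field name, value = tag from the .proto file.
enum class Example1 : tag_type {
    required_uint32_x  = 1,
    optional_string_s  = 2,
    repeated_fixed64_r = 17
};

// Converts an enumerator back to the bare tag number.
constexpr tag_type to_tag(Example1 field) {
    return static_cast<tag_type>(field);
}
```

Because the type is part of the enumerator name, a mismatch such as reading required_uint32_x with a string accessor is easy to spot in code review.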
To read messages created according to that description, you will have code that looks somewhat like this, this time using pbf_message
instead of pbf_reader
:
Note the correspondence between the enum value (for instance required_uint32_x
) and the name of the getter function (for instance get_uint32()
). This makes it easier to get the correct types. Also the naming makes it easier to keep different message types apart if you have multiple (or embedded) messages.
See the test/t/complex
test case for a complete example using this interface.
Using pbf_message
in preference to pbf_reader
is recommended for all code. Note that pbf_message
derives from pbf_reader
, so you can always fall back to the more generic interface if necessary.
One problem you might run into is the following: The enum class lists all possible values you know about and you'll have lots of switch
statements checking those values. Some compilers will know that your switch
covers all possible cases and warn you if you have a default
case that looks unnecessary to the compiler. But you still want that default
case to allow for future extension of those messages (and maybe also to detect corrupted data). You can switch off this warning with -Wno-covered-switch-default
.
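The situation looks roughly like this sketch. The enum and the describe_field function are hypothetical and only use the standard library; the point is that the switch names every enumerator and still keeps a default branch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical tag enum for a message with two known fields.
enum class Example1 : std::uint32_t {
    required_uint32_x = 1,
    optional_string_s = 2
};

// Every enumerator has its own case, so a compiler may consider the default
// branch redundant. It is still reachable at run time, because the tag comes
// from input data and may carry a value that this enum does not list.
std::string describe_field(Example1 tag) {
    switch (tag) {
        case Example1::required_uint32_x: return "x";
        case Example1::optional_string_s: return "s";
        default: return "unknown";  // new field or corrupted data
    }
}
```

A tag read from a message written by a newer version of the schema, for example 99, ends up in the default branch instead of falling off the end of the switch.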
pbf_writer
To use the pbf_writer
class, add this include to your C++ program:
The pbf_writer
class contains asserts that will detect some programming errors. We encourage you to compile with asserts enabled in your debug builds.
Let's say you have a protocol description in a .proto
file like this:
To write messages created according to that description, you will have code that looks somewhat like this:
First you need a string which will be used as buffer to assemble the protobuf-formatted message. The pbf_writer
object contains a reference to this string buffer and through it you add data to that buffer piece by piece. The buffer doesn't have to be empty; the pbf_writer
will simply append its data to whatever is there already.
As you could see in the introductory example, handling any kind of scalar field is easy. The type of field doesn't matter and it doesn't matter whether it is optional, required or repeated. You always call one of the add_TYPE()
methods on the pbf writer object.
The first parameter of these methods is always the tag of the field (the field number) from the .proto
file. The second parameter is the value you want to set. For the bytes
and string
types several versions of the add method are available taking a const std::string&
or a const char*
and a length.
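What such an add call has to append to the buffer can be sketched with the standard library alone. The functions below are hypothetical stand-ins, not Protozero's methods: each takes the buffer, the tag, and the value, and appends the encoded field to whatever the buffer already contains.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends `value` as a varint: 7 bits per byte, high bit set on all but the last byte.
void append_varint(std::string& buffer, std::uint64_t value) {
    while (value >= 0x80u) {
        buffer.push_back(static_cast<char>((value & 0x7fu) | 0x80u));
        value >>= 7;
    }
    buffer.push_back(static_cast<char>(value));
}

// Appends a uint32 field: key (tag and wire type 0), then the value as a varint.
void append_uint32_field(std::string& buffer, std::uint32_t tag, std::uint32_t value) {
    append_varint(buffer, (static_cast<std::uint64_t>(tag) << 3) | 0u);
    append_varint(buffer, value);
}

// Appends a string or bytes field from a pointer and a length:
// key (tag and wire type 2), then the length as a varint, then the raw bytes.
void append_string_field(std::string& buffer, std::uint32_t tag,
                         const char* data, std::size_t size) {
    append_varint(buffer, (static_cast<std::uint64_t>(tag) << 3) | 2u);
    append_varint(buffer, size);
    buffer.append(data, size);
}
```

Nothing in these functions looks at the existing buffer content, which is why a non-empty buffer is simply extended.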
For enum
types you have to use the numeric value as the symbolic names from the .proto
file are not available.
Repeated packed fields can easily be set from a pair of iterators:
If you don't have an iterator you can use the alternative form:
Of course you can add as many elements as you want. If you add no elements at all, this code will still work: Protozero detects this special case and pretends you never even initialized this field.
The nested scope is important in this case, because the destructor of the field
object will make sure the length stored inside the field is set to the right value. You must close that scope before adding other fields to the pw
pbf writer.
If you know how many elements you will add to the field and your field contains fixed length elements, you can tell Protozero and it can optimize this case:
In this case you have to supply exactly as many elements as you promised, otherwise you will get a broken protobuf message.
This works for packed_field_fixed32
, packed_field_sfixed32
, packed_field_fixed64
, packed_field_sfixed64
, packed_field_float
, and packed_field_double
.
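The reason a known element count helps is that, for fixed-size elements, the payload length is simply the count times the element size, so the length prefix can be written up front and never needs to be patched later. The following sketch shows this for fixed32 values using only the standard library. append_packed_fixed32 is a hypothetical helper, not Protozero's packed_field_fixed32.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends `value` as a varint: 7 bits per byte, high bit set on all but the last byte.
void append_varint(std::string& buffer, std::uint64_t value) {
    while (value >= 0x80u) {
        buffer.push_back(static_cast<char>((value & 0x7fu) | 0x80u));
        value >>= 7;
    }
    buffer.push_back(static_cast<char>(value));
}

// Appends a packed repeated fixed32 field. Because every element takes exactly
// 4 bytes, the length prefix is known before any element is written.
void append_packed_fixed32(std::string& buffer, std::uint32_t tag,
                           const std::vector<std::uint32_t>& values) {
    append_varint(buffer, (static_cast<std::uint64_t>(tag) << 3) | 2u);  // wire type 2
    append_varint(buffer, values.size() * sizeof(std::uint32_t));        // payload length
    for (const std::uint32_t v : values) {
        // fixed32 values are stored in little-endian byte order
        for (int shift = 0; shift < 32; shift += 8) {
            buffer.push_back(static_cast<char>((v >> shift) & 0xffu));
        }
    }
}
```

If fewer or more elements were written than the length prefix announces, a reader would misinterpret the following bytes, which is the broken message the text warns about.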
You can abandon writing of the packed field if this becomes necessary by calling rollback()
:
The result is the same as if the lines inside the nested brackets had never been called. Do not try to call add_element()
after a rollback.
Nested sub-messages can be handled by first creating the submessage and then adding it to the parent message:
This is easy to do but it has the drawback of needing a separate std::string
buffer. If this concerns you (and why would you use protozero and not the Google protobuf library if it doesn't?) there is another way:
This can be nested arbitrarily deep.
Internally the sub-message writer re-uses the buffer from the parent. It reserves enough space in the buffer to later write the length of the submessage into it. It then adds the contents of the submessage to the buffer. When the pbf_sub
writer is destructed the length of the submessage is calculated and written in the reserved space. If less space was needed for the length field than was available, the rest of the buffer is moved over a few bytes.
You can abandon writing of a submessage if this becomes necessary by calling rollback()
:
The result is the same as if the lines inside the nested brackets had never been called. Do not try to call any of the add_*
functions on the submessage after a rollback.
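The reserve-then-patch scheme and the rollback behaviour described above can be sketched as a small scoped class. This is an illustration using only the standard library, not Protozero's code: sub_writer and add_bytes are hypothetical names, and reserving 5 bytes (enough for a 32-bit length as a varint) is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumption: 5 bytes are enough for the varint encoding of any 32-bit length.
constexpr std::size_t reserved_length_bytes = 5;

// Hypothetical scoped sub-message writer sharing the parent's buffer.
class sub_writer {
    std::string& m_buffer;
    std::size_t m_start;        // buffer size before this field was started
    std::size_t m_len_pos = 0;  // position of the reserved length bytes
    bool m_rolled_back = false;

    static void append_varint(std::string& out, std::uint64_t value) {
        while (value >= 0x80u) {
            out.push_back(static_cast<char>((value & 0x7fu) | 0x80u));
            value >>= 7;
        }
        out.push_back(static_cast<char>(value));
    }

public:
    sub_writer(std::string& buffer, std::uint32_t tag)
        : m_buffer(buffer), m_start(buffer.size()) {
        append_varint(m_buffer, (static_cast<std::uint64_t>(tag) << 3) | 2u);
        m_len_pos = m_buffer.size();
        m_buffer.append(reserved_length_bytes, '\0');  // placeholder for the length
    }

    sub_writer(const sub_writer&) = delete;
    sub_writer& operator=(const sub_writer&) = delete;

    // Appends already-encoded submessage content.
    void add_bytes(const std::string& data) { m_buffer.append(data); }

    // Drops everything written since construction, including the key.
    void rollback() {
        m_buffer.resize(m_start);
        m_rolled_back = true;
    }

    // On scope exit, replaces the placeholder with the real length. If the
    // length needs fewer bytes than reserved, the content moves towards the front.
    ~sub_writer() {
        if (m_rolled_back) return;
        const std::size_t payload = m_buffer.size() - (m_len_pos + reserved_length_bytes);
        std::string length;
        append_varint(length, payload);
        m_buffer.replace(m_len_pos, reserved_length_bytes, length);
    }
};
```

The length is only correct once the destructor has run, which is why the nested scope has to be closed before anything else is added to the parent buffer.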
pbf_builder
Just like the pbf_message
template class wraps the pbf_reader
class, there is a pbf_builder
template class wrapping the pbf_writer
class. It is instantiated using the same enum class
described above and used exactly like the pbf_writer
class but using the values of the enum instead of bare integers.
See the test/t/complex
test case for a complete example using this interface.