mirror of git://git.psyc.eu/libpsyc synced 2024-08-15 03:19:02 +00:00
libpsyc/bench/benchmark.org
2012-05-06 15:39:02 +02:00


libpsyc Performance Benchmarks

In this document we present the results of performance benchmarks of libpsyc compared to json-c, libjson-glib, rapidxml and libxml2.

PSYC, JSON, XML Syntax Benchmarks

First we look at the mere performance of the PSYC syntax compared to equivalent XML and JSON encodings. We'll look at actual XMPP messaging later.

User Profile

In this test we compare how efficiently the three syntaxes serialize typical user profile data as it would be kept in a user database. Let's start with XML:

In JSON this could look like this:

Here's a way to model this in PSYC (verbose mode):

A message with JSON-unfriendly characters

This message contains some characters which are impractical to encode in JSON. We should probably put a lot more inside to actually see an impact on performance. TODO

A message with XML-unfriendly characters

Same test with characters which aren't practical in the XML syntax, yet we should put more of them inside. TODO

A message with PSYC-unfriendly strings

PSYC prefixes data with length as soon as it exceeds certain sizes or contains certain strings. In the case of short messages this is less efficient than scanning the values without lengths. Also, lengths are harder to edit by hand.
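To make the trade-off concrete, here is a minimal sketch of such a rendering decision. It is not libpsyc's actual logic: the threshold `MAX_SCANNED_LEN`, the function names, the newline check and the `name SP length TAB value` layout are assumptions chosen for illustration. The PSYC syntax specification has the real rules.

```python
# Illustrative sketch only: threshold, names and layout are assumptions,
# not taken from libpsyc.
MAX_SCANNED_LEN = 400  # hypothetical size above which a length is emitted


def needs_length(value: bytes) -> bool:
    """Decide whether a value is length-prefixed instead of scanned."""
    return len(value) > MAX_SCANNED_LEN or b"\n" in value


def render_value(name: bytes, value: bytes) -> bytes:
    """Render a variable with or without a length prefix (assumed layout)."""
    if needs_length(value):
        # The parser can jump over the value without looking at its bytes.
        return name + b" " + str(len(value)).encode("ascii") + b"\t" + value
    # Short, harmless value: the parser scans for the end of the line.
    return name + b"\t" + value
```

A short value such as `render_value(b"_nick", b"lynX")` stays free of lengths and remains easy to edit by hand, while a value containing a newline gets a byte count that has to be kept correct when editing.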

Packets containing binary data

We'll use a generator of random binary data to see how well the formats cope with different payload sizes. We'll consider 7000 bytes as a possible size of an icon, 70000 for an avatar, 700000 for a photograph, 7000000 for a piece of music, 70000000 for a large project and 700000000 for the contents of a CD.

PSYC vs XMPP Protocol Benchmarks

These tests use typical XMPP messages ("stanzas" in Jabber lingo) and compare them with equivalent JSON encodings and PSYC formats.

A presence packet

Since presence packets are by far the dominant messaging content in the XMPP network, we'll start with one of them. Here's an example from section 4.4.2 of RFC 6121.

And here's the same information in a JSON rendition:

Here's the equivalent PSYC packet in verbose mode (since it is a multicast, the single recipients do not need to be mentioned):

And this is the same message in PSYC's compact form. Since compact mode has been neither implemented nor deployed yet, you should only consider it for future projects:

An average chat message

A small difference: PSYC by default doesn't mention a "resource" in XMPP terms; instead it allows for more addressing schemes than just PSYC.

A new status updated activity

Example taken from http://onesocialweb.org/spec/1.0/osw-activities.html (you could call this XML namespace hell :-)

http://activitystrea.ms/head/json-activity.html proposes a JSON encoding of this. We'll have to add a routing header to it.

http://about.psyc.eu/Activity suggests a PSYC mapping for activity streams. Should a "status post" be considered equivalent to a presence description announcement or just a message in the "microblogging" channel? We'll use the latter here:

The nice thing about XML namespaces is that by definition they can never collide, but this degree of engineering perfection causes us a lot of overhead. The PSYC approach is to just extend the name of the method: as long as people use differing method names, protocol extensions can happily exist next to each other. Uniqueness of method names cannot be ensured mathematically, but appending your company name is enough to make it unlikely that anyone else on earth picks the same name. How this kind of safety is delivered when using the JSON syntax of ActivityStreams is unclear. Apparently it was no longer an important design criterion.

Results

Parsing time of 1 000 000 packets, in milliseconds. A simple strlen() scan of the respective message is provided for comparison. These tests were performed on a 2.53 GHz Intel(R) Core(TM)2 Duo P9500 CPU.

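|                 | strlen | libpsyc | json-c | json-glib | libxml sax | libxml | rapidxml |
|-----------------+--------+---------+--------+-----------+------------+--------+----------|
| user profile    |     55 |     608 |   4715 |     16503 |       7350 |  12377 |     2477 |
| psyc-unfriendly |     70 |     286 |   2892 |     12567 |       5538 |   8659 |     1896 |
| json-unfriendly |     49 |     430 |   2328 |     10006 |       5141 |   7875 |     1751 |
| xml-unfriendly  |     37 |     296 |   2156 |      9591 |       5571 |   8769 |     1765 |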

Pure syntax comparisons above, protocol performance comparisons below:

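|          | strlen | libpsyc | libpsyc compact | json-c | json-glib | libxml sax | libxml | rapidxml |
|----------+--------+---------+-----------------+--------+-----------+------------+--------+----------|
| presence |     30 |     236 |             122 |   2463 |     10016 |       4997 |   7557 |     1719 |
| chat msg |     40 |     295 |             258 |   2147 |      9526 |       5911 |   8999 |     1850 |
| activity |     42 |     353 |             279 |   4666 |     16327 |      13357 |  28858 |     4356 |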
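To read the protocol numbers as relative factors rather than raw milliseconds, the following small sketch divides each parser's time by the libpsyc (verbose) time of the same row. The figures are copied from the protocol table; the names `COLUMNS`, `TIMES_MS` and `slowdown_vs_libpsyc` are made up for this illustration.

```python
# Parsing times in milliseconds for 1 000 000 packets, copied from the
# protocol benchmark table in this document.
COLUMNS = ("libpsyc", "libpsyc compact", "json-c", "json-glib",
           "libxml sax", "libxml", "rapidxml")
TIMES_MS = {
    "presence": (236, 122, 2463, 10016, 4997, 7557, 1719),
    "chat msg": (295, 258, 2147, 9526, 5911, 8999, 1850),
    "activity": (353, 279, 4666, 16327, 13357, 28858, 4356),
}


def slowdown_vs_libpsyc(row: str) -> dict:
    """Return each parser's time as a multiple of the libpsyc time."""
    times = dict(zip(COLUMNS, TIMES_MS[row]))
    base = times["libpsyc"]
    return {name: round(t / base, 1) for name, t in times.items()}


if __name__ == "__main__":
    for row in TIMES_MS:
        print(row, slowdown_vs_libpsyc(row))
```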

Parsing large amounts of binary data. For JSON and XML, base64 encoding was used. Note that the results below include only the parsing time; base64 decoding was not performed.

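|      |   strlen | libpsyc |    json-c |  json-glib | libxml sax |    libxml | rapidxml |
|------+----------+---------+-----------+------------+------------+-----------+----------|
| 7K   |      978 |      77 |     18609 |      98000 |      11445 |     19299 |     8701 |
| 70K  |     9613 |      77 |    187540 |    1003900 |      96209 |    167738 |    74296 |
| 700K |    95888 |      77 |   1883500 |   10616000 |     842025 |   1909428 |   729419 |
| 7M   |  1347300 |      78 |  26359000 |  120810000 |   12466610 |  16751363 |  7581169 |
| 70M  | 14414000 |      80 | 357010000 | 1241000000 |  169622110 | 296017820 | 75308906 |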
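Since JSON and XML carry the binary payload as base64, their parsers also have more bytes to look at: padded base64 turns every 3 input bytes into 4 output characters. The sketch below computes the encoded size for the payload sizes used here and checks the formula against Python's `base64` module on the smallest one. The helper name `base64_len` is ours, and line wrapping (which some encoders add) is not accounted for.

```python
import base64


def base64_len(n: int) -> int:
    """Length of the padded base64 encoding of n bytes: 4 * ceil(n / 3)."""
    return 4 * ((n + 2) // 3)


# Sanity check against the standard library on the smallest payload.
assert base64_len(7000) == len(base64.b64encode(bytes(7000)))

if __name__ == "__main__":
    for size in (7000, 70000, 700000, 7000000, 70000000):
        encoded = base64_len(size)
        print(f"{size:>9} bytes -> {encoded:>9} base64 characters "
              f"(+{100 * (encoded - size) / size:.1f}%)")
```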

In each case we compared the performance of parsing and re-rendering these messages. Consider also that application-level processing of an XML DOM tree is more complicated than just accessing certain elements in a JSON data structure or a PSYC variable mapping.

Explanations

As you can tell, the PSYC data format outpaces its rivals in every test above. The gap is extreme when delivering binary data: PSYC simply returns the starting point and the length of the given buffer, while the other parsers have to scan for the end of the transmission. It also shows in many simpler operations, where PSYC quickly figures out where the data starts and ends and passes that information back to the application, while the other formats are forced to generate a copy of the data in order to process possibly embedded special character sequences. PSYC essentially operates like a binary data protocol even though it is actually text-based.
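The difference between the two strategies can be sketched as follows. This is not code from libpsyc or from any of the other parsers: both functions, their names and the simplified escape handling are assumptions for illustration. The first hands back a view on the existing buffer given a known length; the second has to inspect every byte and build a copy because escapes change the content.

```python
from typing import Tuple


def slice_length_prefixed(buf: bytes, start: int, length: int) -> memoryview:
    """Known length: return a view on the value, no scanning, no copying."""
    return memoryview(buf)[start:start + length]


def unescape_quoted(buf: bytes, start: int) -> Tuple[bytes, int]:
    """Scan a JSON-like quoted string; start is just past the opening quote.

    Every byte is inspected and the result is a fresh copy. Only
    backslash-escaped quotes and backslashes are handled in this sketch,
    and an unterminated string raises IndexError.
    """
    out = bytearray()
    i = start
    while buf[i] != ord('"'):
        if buf[i] == ord("\\"):
            i += 1               # skip the backslash, keep the escaped byte
        out.append(buf[i])
        i += 1
    return bytes(out), i + 1     # value and index just past the closing quote
```

The cost of the first function does not depend on the size of the value, which matches the nearly constant libpsyc column in the binary data table; the cost of the second grows with every byte of payload.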

Criticism

Are we comparing apples and oranges? Yes and no; it depends on what you need. XML is a syntax best suited for complex structured data in well-defined formats, and especially good for text mark-up. JSON is a syntax intended to hold arbitrarily structured data suitable for immediate inclusion in JavaScript source code. The PSYC syntax is an evolved derivative of RFC 822, the syntax used by HTTP and e-mail. It is currently limited in the kind and depth of data structures that can be represented with it, but it is highly efficient in exchange.

In fact we are currently looking into suitable syntax extensions to represent generic structures and semantic signatures, but for now PSYC only provides for simple typed values and lists of typed values.

Ease of Implementation

Another aspect is the availability of these formats for spontaneous use. You could generate and parse JSON yourself, but you have to be careful about escaping. XML can be rendered manually if you know your data will not break the syntax, but you shouldn't dare to parse it without a bulletproof parser. PSYC is easy to render and parse yourself for simple tasks, as long as the body does not contain "\n|\n" and your variables do not contain newlines.
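The two conditions for hand-made PSYC can be checked before rendering. A minimal sketch, assuming variables are held in a plain mapping of names to string values; the function and constant names are hypothetical and not part of libpsyc:

```python
from typing import Mapping

PACKET_DELIMITER = "\n|\n"  # the sequence named in the text above


def safe_for_naive_rendering(variables: Mapping[str, str], body: str) -> bool:
    """Check the two conditions under which hand-made rendering is safe."""
    if PACKET_DELIMITER in body:
        return False  # the body would be cut short at the delimiter
    # Variable values must stay on a single line.
    return all("\n" not in value for value in variables.values())
```

If the check fails, it is time to use a real renderer instead of string concatenation.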

Conclusions

In the end it is up to you to find out which format fulfils your requirements best. We use PSYC for the majority of messaging, where JSON and XMPP aren't efficient and opaque enough, but we employ XML and JSON as payloads within PSYC for data that doesn't fit the PSYC model. For some reason all three formats are being used for messaging, although only PSYC was actually designed for that purpose.

The Internet has developed two major breeds of protocol formats. The binary ones are extremely efficient, but in most cases you have to recompile all instances each time you change something. The plain-text ones strive for perfection in data representation while leaving the path of efficiency. Some protocols such as HTTP and SIP sit in between these two schools, offering both a text-based extensible syntax (it's actually easier to add a header to HTTP than to come up with a namespace for XMPP…) and the ability to deliver binary data. But these protocols do not come with native data structure support. PSYC is a protocol that combines the compactness and efficiency of binary protocols with the extensibility of text-based protocols, and still provides enough data structuring to rarely require the use of other data formats.

Futures

After a month of development libpsyc already performs quite well, but we presume that various optimizations, such as rewriting parts in assembler, are possible.

Related Work

If this didn't help, you can also look into:

  • Adobe AMF
  • ASN.1
  • BSON
  • Cisco Etch
  • Efficient XML
  • Facebook Thrift
  • Google Protocol Buffers

The drawback of these binary formats is that, unlike PSYC, JSON and XML, you can't edit them manually and you can't produce valid messages by replacing variables in a simple text template. You depend on specialized parsers and renderers being provided.
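Here is what rendering from a simple text template looks like in practice, using JSON as the target format and Python's `string.Template`. The field names and values are made up for the example. It also shows the escaping caveat: a value containing a double quote produces text that no longer parses.

```python
import json
from string import Template

# Hypothetical message layout; the field names are illustrative only.
PRESENCE_TEMPLATE = Template('{"from": "$sender", "show": "$show"}')


def render(sender: str, show: str) -> str:
    """Produce a message by plain variable replacement, without escaping."""
    return PRESENCE_TEMPLATE.substitute(sender=sender, show=show)


def is_valid_json(text: str) -> bool:
    """Report whether the rendered text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_json(render("juliet@example.com", "away")))  # harmless values
    print(is_valid_json(render('ro"meo', "away")))              # unescaped quote
```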

There's also

  • Bittorrent's bencode

This format is formally text-based, but it is not easy to read as it has no visual separators, and not easy to edit as everything is prefixed by lengths, even very short items.

Further Reading

http://about.psyc.eu/Spec:Syntax provides you with the ABNF grammar of the PSYC 1.0 syntax. You may also be interested in PSYC's decentralized state mechanism provided by the +/-/= operators.

See http://about.psyc.eu/XML and http://about.psyc.eu/JSON for more biased information on the respective formats.

Appendix

Tools used

This document and its benchmarks are distributed with libpsyc. See http://about.psyc.eu/libpsyc on how to obtain it.

The benchmarks can be run with the following command (xmlbench is needed for the xml tests):

make bench