Thursday, January 12, 2017

increasing latency using Google RE2 regex library

Recently after building a text processing service using Google RE2 regex library which is assumed to be thread friendly, we benched it with multiple threads which share the same regex instances. We are surprised that the latency of our service is increasing as we bench the service again and again. Initially we could get something like 200 microsecond latency, but later we got 200 millisecond latency which is 1000 times slower.

According to Reference [1], RE2 uses a fixed amount of memory cache for each regex instance, and if we run out of the cache memory. We searched online to find RE2 documents about the memory cache, but found nothing about it. We finally had to look into the source codes of RE2, and found that there is a class re2::RE2::Options which could take a parameter called max_mem. The Options could be used to initialize an RE2 regex instance.
Moreover, inside RE2, there are two methods to do regex matching: one is DFA which is much faster than the other one NFA. However, DFA needs more memory and could run out of memory. After DFA runs out of memory for many times, RE2 would use NFA instead, which becomes much slower, as shown in the RE2 code comments (budget here means its configured memory):
Once a DFA fills its budget, it flushes its cache and starts over.
If this happens too often, RE2 falls back on the NFA implementation.
For more details about the max_mem option, please refer to Reference [2].

RE2's NFA interface could be found here, while its DFA interface is here.




References:

[1] https://swtch.com/~rsc/regexp/regexp3.html

[2] https://github.com/google/re2/blob/7bab3dc83df6a838cc004cc7a7f51d5fe1a427d5/re2/re2.h#L556
    // The max_mem option controls how much memory can be used
    // to hold the compiled form of the regexp (the Prog) and
    // its cached DFA graphs.  Code Search placed limits on the number
    // of Prog instructions and DFA states: 10,000 for both.
    // In RE2, those limits would translate to about 240 KB per Prog
    // and perhaps 2.5 MB per DFA (DFA state sizes vary by regexp; RE2 does a
    // better job of keeping them small than Code Search did).
    // Each RE2 has two Progs (one forward, one reverse), and each Prog
    // can have two DFAs (one first match, one longest match).
    // That makes 4 DFAs:
    //
    //   forward, first-match    - used for UNANCHORED or ANCHOR_LEFT searches
    //                               if opt.longest_match() == false
    //   forward, longest-match  - used for all ANCHOR_BOTH searches,
    //                               and the other two kinds if
    //                               opt.longest_match() == true
    //   reverse, first-match    - never used
    //   reverse, longest-match  - used as second phase for unanchored searches
    //
    // The RE2 memory budget is statically divided between the two
    // Progs and then the DFAs: two thirds to the forward Prog
    // and one third to the reverse Prog.  The forward Prog gives half
    // of what it has left over to each of its DFAs.  The reverse Prog
    // gives it all to its longest-match DFA.
    //
    // Once a DFA fills its budget, it flushes its cache and starts over.
    // If this happens too often, RE2 falls back on the NFA implementation.

    // For now, make the default budget something close to Code Search.
    static const int kDefaultMaxMem = 8<<20;

Monday, January 9, 2017

The magic 40 milliseconds deplay in TCP socket programming

Recently we have programmed a protocol based on TCP socket programming, and then built a service based on the protocol.

During the benchmarking, we found that the service itself is very fast at about 100 microseconds per request, while after benched with the protocol, the speed is much much slower at about 80 milliseconds per request.
After some investigation, we narrowed the problem down to the TCP network layer. In our protocol, we are using a so-called pack4 TCP, which means we are sending a 4-byte integrate before every chuck of data to indicate the number of bytes of the chuck. We thus need to call write() twice to send any data, i.e., we need to do write-write-read on the client side to send a request. This pattern actually causes the 40ms delay, as the default TCP implementation has a method to improve the performance by acknowledging consecutive packets in one packet with a timeout limit 40ms, if the received packets are small enough. As a result, the first packet int he pack4 TCP protocol would not be acknowledged within 40ms, as the first packet is only 4 bytes. This could slow the service badly.

The fix is simple enough that you just need to merge the two write calls together, i.e., sending the data length and the data in one TCP packet.

The Reference also has explained a similar problem in more details.






References:

http://jerrypeng.me/2013/08/mythical-40ms-delay-and-tcp-nodelay/