Friday, September 15, 2017

TensorFlow learning notes

This page shows a good path to learn TensorFlow from its online tutorials:

1. To install TensorFlow, follow:
https://www.tensorflow.org/install/
Note: installing TensorFlow with GPU support is complicated, but the CPU-only installation is straightforward and good enough for all the tutorials.

2. To learn the basic pieces of TensorFlow, the following is a good, easy start:
https://www.tensorflow.org/get_started/get_started

3. To get to the first ML level and then go a little deeper:
https://www.tensorflow.org/get_started/mnist/pros


Friday, March 3, 2017

How to mount NFS on Ubuntu 12.04 with caching support


1 Install packages

sudo apt-get install nfs-common 
sudo apt-get install cachefilesd

2 File system setup

Your filesystem needs extended-attribute support. If you're using EXT4 you're fine; if you're using EXT3, make sure the filesystem is mounted with the user_xattr option. To check your filesystem type, you can use:
$ df -h -T
Filesystem         Type      Size  Used Avail Use% Mounted on
/dev/sda1          ext4      912G  816G   50G  95% /
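
If your filesystem is EXT3, a sketch of enabling user_xattr via /etc/fstab might look like the following (the device name and mount point here are illustrative, not taken from your system):

```
# /etc/fstab -- add user_xattr to the mount options
/dev/sda1  /  ext3  defaults,user_xattr  0  1
```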
Then you need to create a local directory which will be mounted with the NFS export.
$ mkdir /media/test
$ chmod 777 /media/test

3 Enable cachefilesd 

Edit /etc/default/cachefilesd and change the RUN line to RUN=yes.

4 Ensure your NFS mount uses the fsc option

The NFS mount in /etc/fstab needs the fsc option. For example:
192.168.110.150:/nsf/test /media/test nfs rw,hard,intr,fsc


5 Remount

Remount the filesystems with "mount -a".

6 Start the cachefilesd service

sudo /etc/init.d/cachefilesd start

7 Edit the cachefilesd configuration

The configuration file is /etc/cachefilesd.conf.
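
As a rough sketch, the shipped defaults in /etc/cachefilesd.conf look like the following; the cache directory and the disk-space culling thresholds are the main knobs to tune:

```
# /etc/cachefilesd.conf (typical defaults)
dir /var/cache/fscache   # where cached file data is stored
tag mycache              # cache identifier
brun 10%                 # culling stops when free space rises above this
bcull 7%                 # culling starts when free space drops below this
bstop 3%                 # caching stops entirely below this
```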

Reference

http://askubuntu.com/questions/4572/how-can-i-cache-nfs-shares-on-a-local-disk
http://xmodulo.com/how-to-enable-local-file-caching-for-nfs-share-on-linux.html

Thursday, January 12, 2017

Increasing latency with the Google RE2 regex library

Recently, after building a text-processing service on top of the Google RE2 regex library, which is supposed to be thread-friendly, we benchmarked it with multiple threads sharing the same regex instances. We were surprised to find that the latency of our service kept increasing as we ran the benchmark again and again. Initially we measured about 200 microseconds per request, but later it degraded to about 200 milliseconds, which is 1000 times slower.

According to Reference [1], RE2 uses a fixed amount of cache memory for each regex instance, and performance degrades once that cache is exhausted. We searched online for RE2 documentation about this memory cache but found nothing, so we finally looked into RE2's source code and found a class, re2::RE2::Options, which takes a parameter called max_mem. An Options object can be passed in when constructing an RE2 regex instance.
Moreover, RE2 has two internal regex-matching engines: a DFA, which is fast but memory-hungry, and an NFA, which is slower but needs less memory. The DFA can run out of its memory budget; after this happens too many times, RE2 falls back to the NFA, which is much slower, as the RE2 code comments explain (the budget here is the configured memory):
Once a DFA fills its budget, it flushes its cache and starts over.
If this happens too often, RE2 falls back on the NFA implementation.
For more details about the max_mem option, please refer to Reference [2].

RE2's NFA and DFA implementations can be found in its source tree, in re2/nfa.cc and re2/dfa.cc respectively.




References:

[1] https://swtch.com/~rsc/regexp/regexp3.html

[2] https://github.com/google/re2/blob/7bab3dc83df6a838cc004cc7a7f51d5fe1a427d5/re2/re2.h#L556
    // The max_mem option controls how much memory can be used
    // to hold the compiled form of the regexp (the Prog) and
    // its cached DFA graphs.  Code Search placed limits on the number
    // of Prog instructions and DFA states: 10,000 for both.
    // In RE2, those limits would translate to about 240 KB per Prog
    // and perhaps 2.5 MB per DFA (DFA state sizes vary by regexp; RE2 does a
    // better job of keeping them small than Code Search did).
    // Each RE2 has two Progs (one forward, one reverse), and each Prog
    // can have two DFAs (one first match, one longest match).
    // That makes 4 DFAs:
    //
    //   forward, first-match    - used for UNANCHORED or ANCHOR_LEFT searches
    //                               if opt.longest_match() == false
    //   forward, longest-match  - used for all ANCHOR_BOTH searches,
    //                               and the other two kinds if
    //                               opt.longest_match() == true
    //   reverse, first-match    - never used
    //   reverse, longest-match  - used as second phase for unanchored searches
    //
    // The RE2 memory budget is statically divided between the two
    // Progs and then the DFAs: two thirds to the forward Prog
    // and one third to the reverse Prog.  The forward Prog gives half
    // of what it has left over to each of its DFAs.  The reverse Prog
    // gives it all to its longest-match DFA.
    //
    // Once a DFA fills its budget, it flushes its cache and starts over.
    // If this happens too often, RE2 falls back on the NFA implementation.

    // For now, make the default budget something close to Code Search.
    static const int kDefaultMaxMem = 8<<20;

Monday, January 9, 2017

The magic 40-millisecond delay in TCP socket programming

Recently we implemented a custom protocol on top of TCP sockets, and then built a service based on that protocol.

During benchmarking, we found that the service itself is very fast, at about 100 microseconds per request, but when benchmarked through the protocol it is much slower, at about 80 milliseconds per request.
After some investigation, we narrowed the problem down to the TCP layer. Our protocol uses so-called pack4 framing: we send a 4-byte integer before every chunk of data to indicate the chunk's length in bytes. We therefore call write() twice to send any message, i.e., the client does write-write-read for each request. This pattern triggers the 40 ms delay: with Nagle's algorithm enabled (the default), the sender holds back the second small segment until the first one is acknowledged, while the receiver's delayed-ACK mechanism postpones that acknowledgment for up to 40 ms when the received segments are small. Since the first packet in the pack4 protocol is only 4 bytes, nearly every request pays this penalty, which slows the service badly.

The fix is simple: merge the two write calls, i.e., send the data length and the data together in one write() so they go out in a single TCP segment. (Alternatively, disabling Nagle's algorithm with the TCP_NODELAY socket option avoids the interaction.)

The reference below explains a similar problem in more detail.






References:

http://jerrypeng.me/2013/08/mythical-40ms-delay-and-tcp-nodelay/