Friday, October 7, 2016

How to install latest tmux on Ubuntu 12.04


(1) Install add-apt-repository
sudo apt-get install python-software-properties


(2) Install tmux
sudo add-apt-repository ppa:pi-rho/dev
sudo apt-get update
sudo apt-get install tmux

(3) Check tmux version
$ tmux -V
tmux 1.9a
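
If you want a script to check the installed version rather than reading it by eye, the following is a minimal sketch. The function names version_ge and tmux_version_ok are made up for this example, and it assumes a sort command that supports -V (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# Succeeds when version $1 is greater than or equal to version $2.
# Assumption: "sort -V" (version sort) is available, e.g. from GNU coreutils.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# $1: the output of "tmux -V" (e.g. "tmux 1.9a"), $2: the minimum version wanted.
tmux_version_ok() {
    version_ge "${1#tmux }" "$2"
}

# Usage (requires tmux to be installed):
#   tmux_version_ok "$(tmux -V)" 1.8 && echo "tmux is new enough"
```

The version string is passed in as an argument so that the comparison can be tried without tmux installed.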

Friday, September 30, 2016

How to install new chrome on Ubuntu 12.04

The official Google Chrome website only offers the latest Chrome release, which is currently version 53. Chrome 53 cannot be installed on Ubuntu 12.04, however, because 12.04 ships an old g++ (4.6) while Chrome needs at least g++ 4.8.

We can therefore only install an older version of Chrome, which can be downloaded from:
http://www.slimjet.com/chrome/google-chrome-old-version.php

For example, Chrome 48 works on Ubuntu 12.04.

How to migrate git repo from one server to another

Your company may have moved to a new git server without migrating your repository, so you need to move it to the new server yourself.

A simple way to do this:
(1) clone the latest copy of your repo from the old git server;
(2) point your cloned copy at the new git server:
git remote set-url origin ssh://git@bitbucket.com/your-repo-name.git
(3) push the master branch to the new server:
git push -u origin master

After these steps, your master branch is available in the new repo. Note that step (3) pushes only master; other branches and tags have to be pushed separately if you need them.
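
If you need every branch and tag rather than only master, one option is a mirror clone followed by a mirror push. This is a minimal sketch, not the method used above: the function name migrate_repo is made up, and the URLs in the usage comment are placeholders.

```shell
#!/bin/sh
# Copy all refs (branches and tags) from old_url to new_url
# through a temporary bare mirror clone.
migrate_repo() {
    old_url=$1
    new_url=$2
    workdir=$(mktemp -d) || return 1
    git clone --quiet --mirror "$old_url" "$workdir/repo.git" &&
        git -C "$workdir/repo.git" push --quiet --mirror "$new_url"
    status=$?
    rm -rf "$workdir"   # the mirror clone is only needed during the copy
    return $status
}

# Usage (placeholder URLs):
#   migrate_repo ssh://git@old-server/your-repo-name.git \
#                ssh://git@bitbucket.com/your-repo-name.git
```

Be aware that "git push --mirror" makes the target match the source exactly, so it also deletes refs that exist only on the new server. It is best used on a new, empty repository.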

Install Ubuntu 12.04 in VirtualBox on Mac OS X



Installation environment



(1) Macbook Pro:


  • Macbook: Retina
  • CPU: Intel Core i7
  • Memory: 16GB 1600MHz DDR3
  • Video card: NVIDIA GeForce GT 650M 1024 MB
  • Host OS: OS X 10.9.5

(2) Virtual machine:

  • Provider: VirtualBox 4.2.20 or 5.0.22 (I also tried VirtualBox 4.3, the latest one at that time, but failed to change the resolution of the guest OS (Ubuntu))
  • Guest OS: Ubuntu 12.04.3 desktop version, 64-bit

Steps:


(1) Download VirtualBox 4.2.20 and install it on the MacBook;
(2) Download the Ubuntu 12.04 AMD64 desktop image from the official Ubuntu website;
(3) Create a virtual machine in VirtualBox with more than 10GB of disk space (note that you need to set the "Type" to "Linux" and the "Version" to "Ubuntu (64 bit)");
(4) Mount the downloaded Ubuntu image in the CD-ROM drive of the new virtual machine;
(5) Start the virtual machine and install Ubuntu on it;
(6) After the installation is done, start the virtual machine and log in to the newly installed Ubuntu;
(7) The default screen resolution is 1024*768, which is too small. To fix this, click "Devices -> Install Guest Additions" in the menu of the VirtualBox VM window. A pop-up window in Ubuntu then asks whether to install the new package; just follow its instructions to install the guest additions. Please do not install the guest additions with apt-get in Ubuntu, which may crash the whole Ubuntu installation;
(8) Restart the virtual machine; the screen size now changes automatically to fit the window of the virtual machine;
(9) If you need Ubuntu to display emojis correctly, you have to install an additional font: download Symbola.ttf from http://users.teilar.gr/~g1951d/ in the Ubuntu virtual machine, then double-click the ttf file and click "Install" to add the font to Ubuntu.


References:

http://askubuntu.com/questions/22743/how-do-i-install-guest-additions-in-a-virtualbox-vm




Thursday, September 29, 2016

How to quickly benchmark mysql with mysqlslap

The MySQL client package also provides a benchmarking tool called mysqlslap, which can be installed on CentOS with:
sudo yum install mysql

If you just want to quickly benchmark some database with a simple query, you could try something like:
mysqlslap --user=[YourUsername] --password=[YourPassword] --host=[YourHostname] --verbose --concurrency 10 --iterations 10 --query 'SELECT * from [YourDatabaseName].lookup_en' --create-schema=[YourDatabaseName]

For the other options, please refer to the Reference.


Reference:
http://dev.mysql.com/doc/refman/5.7/en/mysqlslap.html

Wednesday, September 21, 2016

How to use Eclipse with an existing sbt project

Assume you have an existing sbt project and now want to use Eclipse to maintain it.

(1) Install the sbteclipse plugin in your sbt project to generate the Eclipse project configs
Follow the instructions at (use "sbt sbtVersion" to get your sbt version):
https://github.com/typesafehub/sbteclipse
I did not manage to install the sbteclipse plugin globally for all sbt projects, so I had to install it per project, i.e., by adding
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")
to PROJECT_DIR/project/plugins.sbt in your sbt project.
After sbteclipse is installed, you could run:
sudo sbt eclipse
to generate project config files for Eclipse:
    .classpath
    .project
    .settings/
(2) Import the project into Eclipse
Use menu:

File → Import → Existing Projects into Workspace

(3) How to use it
On the command line, you should run:
sudo sbt ~compile
in the root directory of your sbt project, so that sbt recompiles automatically whenever you make a change.
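
The per-project installation in step (1) can also be scripted. This is a minimal sketch: the function name add_sbteclipse_plugin is made up, and the plugin version is the 4.0.0 used above, which may need adjusting for your sbt version.

```shell
#!/bin/sh
# Append the sbteclipse plugin line to <project>/project/plugins.sbt
# unless the same line is already present.
add_sbteclipse_plugin() {
    project_dir=$1
    plugin_line='addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "4.0.0")'
    plugins_file="$project_dir/project/plugins.sbt"
    mkdir -p "$project_dir/project" || return 1
    if [ -f "$plugins_file" ] && grep -qF "$plugin_line" "$plugins_file"; then
        return 0  # already configured, nothing to do
    fi
    # The leading blank line keeps the entry separated from any existing settings.
    printf '\n%s\n' "$plugin_line" >> "$plugins_file"
}

# Usage: add_sbteclipse_plugin /path/to/your/sbt/project
```

Running the function twice leaves a single plugin line in the file, so it is safe to call from a setup script.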


Saturday, August 20, 2016

How to find ranking of conferences and journals

(1) conference rankings

(1.1) Google Scholar offers conference rankings in various fields, e.g., the rankings of conferences and journals in the Computational Linguistics field could be found here (other fields could be found on the left category tree)

(1.2) Microsoft Academic Search also offers conference rankings in different fields, e.g., ranking of conferences in the Natural Language & Speech field could be found here.

(1.3) conference ranking in the Computer Science field could also be found on the Computing Research & Education Conference Portal


(2) journals

(2.1) free journal rankings
Scimago Journal & Country Rank

(2.2) non-free ones
Thomson Reuters Journal Citation Reports® (most of the universities have this)

How to use sqoop to copy MySQL tables to Hive

sqoop is useful when you need to copy MySQL tables to Hive.

Here is an example that copies a MySQL table from different shards into one Hive table:


 #!/bin/bash  
   
 set -x  
 set -e  
   
 game=gameexample  
   
 mysql_host_prefix=userdb  
 mysql_host_suffix=myhostname.com  
 mysql_tab=user  
 mysql_database=mydatabase  
 mysql_user=myusername  
 mysql_pwd=xxxxx  
   
 hive_tab=user  
   
 echo "Log: `date` dropping Hive table $game.${hive_tab}"  
 hive -S -e "drop table \`$game.${hive_tab}\`;"   
   
 for SHARD_ID in {1..200}; do  
   
   mysql_host=${mysql_host_prefix}-${SHARD_ID}-${mysql_host_suffix}  
   mysql_conn="mysql --user=$mysql_user --password=$mysql_pwd -D${mysql_database} --host=${mysql_host} -s --skip-column-names"  
   hive_shard_tab=${hive_tab}_shard${SHARD_ID}  
   hdfs_dir=/user/mapr/${mysql_tab}_shard${SHARD_ID}  
   if ping -c 1 -W 1 "$mysql_host"; then  
     echo "Log: `date` $mysql_host is alive"  
   else  
     echo "Log: `date` $mysql_host is not alive"  
     exit 0  
   fi   
   
   echo "Log: `date` dropping Hive table $game.${hive_shard_tab}"  
   hive -S -e "drop table $game.${hive_shard_tab}"   
   echo "Log: `date` removing ${hdfs_dir} on HDFS"  
   hadoop fs -rm -r -f ${hdfs_dir}  
     
   sql="select count(*) from \`${mysql_tab}\`"  
   mysql_row_cnt=`echo "$sql" | $mysql_conn`  
   echo "Log: `date` found ${mysql_row_cnt} rows in the MySQL table ${mysql_tab} with query: $sql"  
     
   sqoop import \  
    --connect jdbc:mysql://$mysql_host/${mysql_database} \  
    --table "${mysql_tab}" \  
    --username $mysql_user \  
    --password $mysql_pwd \  
    --num-mappers 1 \  
    --hive-overwrite \  
    --hive-table $game.${hive_shard_tab} \  
    --hive-import \  
    --target-dir ${hdfs_dir} \  
    --hive-delims-replacement ' '   
     
   hive_row_cnt=`hive -S -e "select count(*) from $game.${hive_shard_tab}"`  
   echo "Log: `date` ended up with ${hive_row_cnt} rows in the Hive table $game.${hive_shard_tab} which are copied from the MySQL table ${mysql_tab} (${mysql_row_cnt} rows)"  
     
   # merging  
   if [ $SHARD_ID = 1 ]; then  
      sql_str="create table $game.\`$hive_tab\` as select * from $game.${hive_shard_tab};"  
      echo "Log: `date` creating the Hive table $game.${hive_tab} with the data from the first Shard with sql: $sql_str"  
      hive -S -e "$sql_str"   
   else  
      sql_str="insert into table $game.\`$hive_tab\` select * from $game.${hive_shard_tab};"  
      echo "Log: `date` merging into the Hive table $game.${hive_tab} the data from Shard $SHARD_ID with sql: $sql_str"  
      hive -S -e "$sql_str"   
   fi  
 done  
 exit 0  
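
The script above logs the MySQL and Hive row counts of every shard but does not compare them. The following is a minimal sketch of a helper that could be called after each import to fail on a mismatch; the function name check_row_counts is made up, and it assumes both counts are plain integers.

```shell
#!/bin/bash
# Compare the row count read from MySQL with the one read from Hive.
# $1: MySQL row count, $2: Hive row count, $3: a label for the log line.
check_row_counts() {
    local mysql_cnt=$1 hive_cnt=$2 label=$3
    if [ "$mysql_cnt" -eq "$hive_cnt" ]; then
        echo "Log: row counts match for $label ($mysql_cnt rows)"
    else
        echo "Log: row count mismatch for $label: MySQL=$mysql_cnt Hive=$hive_cnt" >&2
        return 1
    fi
}

# Usage inside the shard loop, after hive_row_cnt has been computed:
#   check_row_counts "$mysql_row_cnt" "$hive_row_cnt" "$game.${hive_shard_tab}"
```

Because the script runs with "set -e", a non-zero return from this helper would stop the run at the first shard whose counts differ.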
   

Wednesday, August 17, 2016

How to use ngrep

ngrep is a very useful Linux tool for capturing TCP packets for a given host, a given port number, or a given keyword.

(1) to capture packets (printed in hex format) on port 1234 that contain the keyword "my-word", using the network device bond0 (run ifconfig to pick a device):
sudo ngrep -l -t -d bond0 -q -x my-word port 1234

(2) to capture packets to or from the host my.hostname.com:
sudo ngrep -l -t -d bond0 -q -W byline host my.hostname.com

Wednesday, July 27, 2016

How to find your citations

As a researcher, you may want to find all the papers that cite your papers. Some ways to do so are listed here:
(1) Google Scholar: a good place to start, but it may miss some minor sources, e.g., some workshop papers;
(2) ResearchGate: you can upload your papers to it, and it then finds the papers that cite them;
(3) Microsoft Academic: you can search for your papers first, and it then shows you which papers cite them;
(4) CiteSeerX

How to split terminal on remote server with background sessions

Screen

screen is a useful command for leaving processes running in the background on Linux.
We can also split the screen terminal into sub-terminals with the following shortcuts:
(1) to split a terminal horizontally (top/bottom): press Ctrl+A, release, then press Shift+S;
(2) to split a terminal vertically (left/right): press Ctrl+A, release, then press Shift+\ (i.e., |);
(3) to switch among sub-terminals: press Ctrl+A, release, then press Tab.
Please note that the sub-terminals are converted to background windows in screen if you detach your screen session (Ctrl+A, then D) and attach it again. To avoid this, you could use tmux instead.

Tmux

tmux is more convenient than screen for this.
Type "tmux new -s sessionname" to create a new session.
In the session, you could use:
(1) to split a terminal horizontally (left/right panes): press Ctrl+B, release, then press % (Shift+5);
(2) to split a terminal vertically (top/bottom panes): press Ctrl+B, release, then press " (Shift+');
(3) to detach the current session: press Ctrl+B, release, then press D;
(4) to attach to a session: "tmux attach -t sessionname";
(5) to switch to a pane: press Ctrl+B, release, then press q (this shows the pane numbers; while the numbers are displayed, type the number of the pane you want to go to).


Reference:

http://fosshelp.blogspot.com/2014/02/how-to-linux-terminal-split-screen-with.html
https://www.youtube.com/watch?v=BHhA_ZKjyxo
https://gist.github.com/MohamedAlaa/2961058

Thursday, June 9, 2016

Enable core dumps for systemd services on CentOS 7

Core dumps are very useful for debugging critical crashes of C++ programs, such as segfaults.

How to enable core dumps for systemd services on CentOS 7?

(1) change the core_pattern to a location that you can write to, e.g., as root:
echo '/home/your-user-name/coredumps/core-%e-sig%s-user%u-group%g-pid%p-time%t' > /proc/sys/kernel/core_pattern
and then verify it:
$ cat /proc/sys/kernel/core_pattern
/home/your-user-name/coredumps/core-%e-sig%s-user%u-group%g-pid%p-time%t
Note: your-user-name is the user that runs the program which crashes and generates the core dump. The coredumps directory must already exist.


(2) create a new file:
/etc/security/limits.d/core.conf
with the following content to enable core dumps for all users:

*       hard        core        unlimited
*       soft        core        unlimited



(3) modify /etc/systemd/system.conf
to add:
DefaultLimitCORE=infinity

(4) modify your systemd service conf
e.g., /etc/systemd/system/your-service.service
to add:
LimitCORE=infinity
in the "[Service]" section.

(5) reload the new systemd conf and restart your service
systemctl daemon-reexec
systemctl stop your-service
systemctl start your-service
(6) how to test it
You could kill your service process by sending it signal 11 (SIGSEGV), e.g., with "kill -11 <pid>". You should then see a new core dump whose file name follows the pattern below, with the % specifiers replaced by actual values:
/home/your-user-name/coredumps/core-%e-sig%s-user%u-group%g-pid%p-time%t
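
The kernel expands the % specifiers when it writes the dump: %e is the executable name, %s the signal number, %u and %g the numeric user and group IDs, %p the process ID, and %t the time of the dump in seconds since the Epoch. To get a feeling for the resulting file name, here is a minimal sketch that mimics the expansion for these six specifiers only. The function name expand_core_pattern and the sample values are made up, and it assumes the values contain no "/" or "&" characters.

```shell
#!/bin/sh
# Mimic the kernel's expansion of a core_pattern for a handful of specifiers.
# $1: pattern, $2: executable name, $3: signal, $4: uid, $5: gid, $6: pid, $7: timestamp
expand_core_pattern() {
    pattern=$1; exe=$2; sig=$3; uid=$4; gid=$5; pid=$6; ts=$7
    printf '%s\n' "$pattern" | sed \
        -e "s/%e/$exe/g" \
        -e "s/%s/$sig/g" \
        -e "s/%u/$uid/g" \
        -e "s/%g/$gid/g" \
        -e "s/%p/$pid/g" \
        -e "s/%t/$ts/g"
}

# Sample values: a process "your-service" (pid 4242, uid/gid 1000) killed by signal 11.
expand_core_pattern \
    '/home/your-user-name/coredumps/core-%e-sig%s-user%u-group%g-pid%p-time%t' \
    your-service 11 1000 1000 4242 1465430400
```

With these sample values the printed name is /home/your-user-name/coredumps/core-your-service-sig11-user1000-group1000-pid4242-time1465430400, which is the kind of file to look for after the test in step (6).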


References:
http://www.kibinlabs.com/re-enabling-core-dumps-redhat-7/

yum update meets "No packages marked for update"

When you run "yum update", you may see "No packages marked for update" even though you are sure that some package updates are available.

In this case, you could try:
yum clean all
yum update your-package-name

How to print std::shared_ptr in GDB

GDB is a very useful debugging tool for C++, especially when your program crashes due to unknown errors. In such cases, you could enable core dumps and then debug your program with the core dump generated when it crashes.

One thing to discuss here is how to print std::shared_ptr variables.
By default, if we print a std::shared_ptr variable "msg", we only see the reference counts and the address that "msg" points to, not the contents:
(gdb) print msg
$1 = std::shared_ptr (count 1, weak 0) 0x7f2ba8002740

To print the contents instead, dereference the internal pointer (_M_ptr is the member used by GCC's libstdc++):
(gdb) print (*msg._M_ptr)


References:
(1) print shared_ptr
http://stackoverflow.com/questions/24917556/how-to-access-target-of-stdtr1shared-ptr-in-gdb
(2) print variables:
http://ftp.gnu.org/old-gnu/Manuals/gdb/html_chapter/gdb_9.html
(3) change call stack frame:
http://www.unknownroad.com/rtfm/gdbtut/gdbstack.html

Monday, May 30, 2016

Thread safety issues of openssl when used with curl

When you use libcurl for SSL connections such as HTTPS or FTPS, you need to look at the underlying SSL library, because libcurl has no native SSL support and relies on that library.

Per:
https://curl.haxx.se/libcurl/c/threadsafe.html
and
https://github.com/openssl/openssl/blob/OpenSSL_1_1_0-pre5/CHANGES
if you are using openssl (libssl) with a version lower than 1.1.0, openssl is not thread safe by itself. You thus have to add thread locks to the openssl layer used by libcurl, following:
https://curl.haxx.se/libcurl/c/threaded-ssl.html

Alternatively, you could start using openssl 1.1.0, though it does not have a stable release at this time. libcurl 7.49.0 and later can be compiled with openssl 1.1.0, as shown on:
https://curl.haxx.se/changes.html


Other notes: the latest MySQL C++ connector (1.1.7) could not be compiled with openssl 1.1.0, as some symbols that the connector needs have been deprecated in openssl 1.1.0.

Sunday, May 29, 2016

A C++ Pool class to reuse connections or pointers


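#include <ctime>
#include <functional>
#include <iostream>
#include <list>
#include <memory>
#include <mutex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unistd.h> // sleep()

// Exception thrown by Pool. Its original definition is not shown in this post,
// so it is assumed here to be a thin wrapper around std::runtime_error.
class FailureException : public std::runtime_error
{
public:
    explicit FailureException(const std::string &msg) : std::runtime_error(msg) {}
};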

/**
 * A pool for caching connections for HTTP requests and MySQL queries.
 * This class has been adapted from http://www.codeproject.com/Articles/8108/Template-based-Generic-Pool-using-C
 */
template <class T>  
class Pool
{
private:
    typedef std::shared_ptr<T> ObjHolder_t;///< typedef for ObjHolder_t which is actually an std::shared_ptr
    typedef std::list<ObjHolder_t> ObjList_t;///< typedef for ObjList_t

    unsigned m_size;///< pool size : default 0
    unsigned m_waitTimeSec;///< wait time: How long calling function can wait to find object
    bool m_isTempObjAllowed;///< if pool is full, is temp object allowed
    ObjList_t m_reservedList;///< reserved object list
    ObjList_t m_freeList;///< free object list
    std::mutex m_dataMutex;///< mutex for Pool data
    std::shared_ptr<T> m_nullptr;///< a convenient nullptr std::shared_ptr
    long m_checkAbandonedIntervalSec;///< how often we should check the abandoned objects because some borrowers may fail to checkin the objects they borrowed (default: 3600 seconds)
    long m_lastCheckTimestampForAbandonedObjs;///< the last timestamp when we checked for abandoned objects
    std::function<std::shared_ptr<T>()> m_constructFunc;///< creates a new object
    std::function<bool(std::shared_ptr<T>&)> m_checkHealthFunc;///< returns true if the object is usable
    std::function<bool(std::shared_ptr<T>&)> m_reactiveFunc;///< tries to make an unusable object usable again
    std::function<void(std::shared_ptr<T>&)> m_destructFunc;///< releases the object

    /**
     * Initialize this instance with default member variables.
     */
    void initialize()
    {
        std::lock_guard<std::mutex> scopelock(m_dataMutex);
        for (auto &it: m_freeList)
        {
            m_destructFunc(it);
            it.reset();
        }
        for (auto &it: m_reservedList)
        {
            m_destructFunc(it);
            it.reset();
        }
        m_reservedList.clear();
        m_freeList.clear();
        m_size = 0;
        m_isTempObjAllowed = true;
        m_waitTimeSec = 3;
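        m_checkAbandonedIntervalSec=3600;
        // start the interval now; otherwise the timestamp would be read uninitialized in checkout()
        m_lastCheckTimestampForAbandonedObjs=(long)time(NULL);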
    }
public:
    /**
     * A default constructor.
     */
    Pool()
    {
        initialize();
    }
    /**
     * A destructor which releases all the pooled objects.
     */
    ~Pool()
    {
        initialize();
    }
    /**
     * Reset the Pool.
     */
    void reset()
    {
        initialize();
    }
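    /**
     * Initialize the pool with specific parameters.
     * This method can only be called once per instance (until reset() is called).
     * @param nPoolSize the maximum number of pooled objects (must be larger than 0)
     * @param constructFunc creates a new object
     * @param checkHealthFunc returns true if an object is still usable
     * @param reactiveFunc tries to make an unusable object usable again and returns true on success
     * @param destructFunc releases an object
     * @param bTempObjAllowed whether extra objects may be created when the pool is full
     * @param nWaitTime how many seconds to wait before retrying when no object is available
     */
    void initialize(const unsigned nPoolSize,
            std::function<std::shared_ptr<T>()> constructFunc,
            std::function<bool(std::shared_ptr<T>&)> checkHealthFunc,
            std::function<bool(std::shared_ptr<T>&)> reactiveFunc,
            std::function<void(std::shared_ptr<T>&)> destructFunc,
            const bool bTempObjAllowed=true,
            const unsigned nWaitTime = 3)
    {
        std::lock_guard<std::mutex> scopelock(m_dataMutex);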
        if (m_size == 0)
        {
            m_size = nPoolSize;
            m_isTempObjAllowed = bTempObjAllowed;
            m_waitTimeSec = nWaitTime;
            m_constructFunc=constructFunc;
            m_checkHealthFunc=checkHealthFunc;
            m_reactiveFunc=reactiveFunc;
            m_destructFunc=destructFunc;
        }
        else
            throw FailureException("can't Initialize the pool again");
    }

    /**
     * Borrow an object from the Pool.
     * This method keeps retrying until it finds or creates an object.
     * @return the object pointer
     */
    std::shared_ptr<T>& checkout()
    {
        while (true)
        {
            {
                std::lock_guard<std::mutex> scopelock(m_dataMutex);
                std::shared_ptr<T> &pObj=findFreeObject();
                if (pObj!=nullptr)
                {
                    return pObj;
                }
                // did not find a free one
                if (m_freeList.size() + m_reservedList.size() < m_size)
                    return createObject();
                else if ((long)time(NULL) - m_lastCheckTimestampForAbandonedObjs > m_checkAbandonedIntervalSec)
                {
                    collectAbandonedObjects();
                    std::shared_ptr<T> &pCollected = findFreeObject();
                    if (pCollected!=nullptr)
                        return pCollected;
                }
                else if (m_isTempObjAllowed)
                    return createObject();
                collectAbandonedObjects();
                {
                    std::shared_ptr<T> &pCollected = findFreeObject();
                    if (pCollected!=nullptr)
                        return pCollected;
                }
            }
            // the lock is released here, so other threads can checkin while we wait
            sleep(m_waitTimeSec);
        }
    }
    /**
     * Return an object to this Pool.
     * This method first removes the object from the reserved object list, then validates it,
     * and finally puts it in the free object list (or destroys it if it is no longer usable).
     * @param pObj the object to return
     */
    void checkin(std::shared_ptr<T>& pObj)
    {
        std::lock_guard<std::mutex> scopelock(m_dataMutex);
        // keep a copy: pObj may refer to the list element which is erased below
        ObjHolder_t obj=pObj;
        // remove the object from the reserved list
        for (typename ObjList_t::iterator i=m_reservedList.begin(); i!=m_reservedList.end(); ++i)
        {
            if (*i==obj)
            {
                m_reservedList.erase(i);
                break;
            }
        }
        if (validateObject(obj))
        {
            m_freeList.push_back(obj);
        }
        else
        {// the object is bad, so destroy it
            m_destructFunc(obj);
        }
    }

private:
    /**
     * Create a new object and add it to the reserved object list.
     * @return the newly created object
     */
    std::shared_ptr<T>& createObject()
    {
        std::shared_ptr<T> newObj=m_constructFunc();
        if (newObj!=nullptr && m_checkHealthFunc(newObj))
        {
            m_reservedList.push_back(newObj);
            return m_reservedList.back();
        }
        else
        {
            throw FailureException("could not create Object");
        }
    }
    /**
     * Move abandoned objects from the reserved object list to the free object list
     * if they are still usable, and destroy them otherwise.
     */
    void collectAbandonedObjects()
    {
        for (typename ObjList_t::iterator it=m_reservedList.begin(); it!=m_reservedList.end(); )
        {
            ObjHolder_t &oHolder = *it;
            if (oHolder.use_count()==1)
            {// only the pool still holds this object, i.e., the borrower dropped it without checkin
                if (validateObject(oHolder))
                {
                    m_freeList.push_back(oHolder);
                }
                else
                {
                    m_destructFunc(oHolder);
                }
                it = m_reservedList.erase(it);// erase() already returns the next iterator
            }
            else
            {
                ++it;
            }
        }
        m_lastCheckTimestampForAbandonedObjs=(long)time(NULL);
    }

    /**
     * Validate object if it is still usable.
     * If not, try to make it usable.
     * @param obj the pointer to the object which needs check
     * @return true if obj is good; false otherwise
     */
    bool validateObject(std::shared_ptr<T> &obj)
    {
        if (obj==nullptr)
        {
            return false;
        }
        else if (m_checkHealthFunc(obj) || m_reactiveFunc(obj))
        {
            return true;
        }
        return false;
    }

    /**
     * Find a free object which could be active from the free object list.
     * The free object list is checked. If any free object is inactive, we will try to reactivate it.
     * If reactivating it fails, we will drop the object.
     * @return nullptr if no free object is available which could be active
     */
    std::shared_ptr<T> &findFreeObject()
    {
        // find existing free Object
        while (!m_freeList.empty())
        {
            ObjHolder_t &obj=m_freeList.front();
            if (validateObject(obj))
            {
                m_reservedList.push_back(obj);
                m_freeList.pop_front();
                return m_reservedList.back();
            }
            else// delete the Object
            {
                // destroy it before pop_front(), which invalidates the reference obj
                m_destructFunc(obj);
                m_freeList.pop_front();
            }
        }
        return m_nullptr;
    }

public:
    /**
     * Print the info of this Pool to a string.
     * @return the string representation of this Pool
     */
    std::string toString() const
    {
        std::stringstream ss;
        ss << "Pool(size=" << m_size
                << " isTempObjAllowed=" << m_isTempObjAllowed
                << " reservedList=" << m_reservedList.size()
                << " freeList=" << m_freeList.size();
        ss << ")";
        return ss.str();
    }
};




/////////////////////////////////////////////////////////////////////////////////////////////////////////
// How to use Pool

 #include <iostream> 
 #include <curl/curl.h> 

int main(int argc, char *argv[])
{

    typedef CURL T;
    const unsigned nPoolSize=2;
    std::string url="https://datamarket.accesscontrol.windows.net/v2/OAuth2-13/";
    std::function<std::shared_ptr<T>()> constructFunc(
    [url]() -> std::shared_ptr<T>
    {
        CURL *curl=curl_easy_init();
        if (curl==NULL)
            throw FailureException("could not get a curl handle in "
                    +std::string(__FILE__)+"("+std::string(__FUNCTION__)+") on line "+std::to_string(__LINE__));

        /* First set the URL that is about to receive our POST. This URL can
           just as well be a https:// URL if that is what should receive the
           data. */
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());

        // NOTE: this disables the certificate verification, which is only acceptable for testing
        curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0L);
        // we will clean up the handle by ourselves in destructFunc. Otherwise, shared_ptr would call delete
        std::shared_ptr<T> p(curl, [](T*){});
        return p;
    });
    std::function<bool(std::shared_ptr<T>&)> checkHealthFunc=
    [](std::shared_ptr<T>& curlConn) -> bool
    {
        return true;
    };
    std::function<bool(std::shared_ptr<T>&)> reactiveFunc=
    [](std::shared_ptr<T>& curlConn) -> bool
    {
        return true;
    };
    std::function<void(std::shared_ptr<T>&)> destructFunc=
    [](std::shared_ptr<T>& curlConn)
    {
        curl_easy_cleanup(curlConn.get());
        curlConn.reset();
    };
    const bool bTempObjAllowed=true;
    const unsigned nWaitTime = 1;
    // get the pool instance
    Pool<T> pool;
    // initialize the pool
    pool.initialize(nPoolSize,
                constructFunc,
                checkHealthFunc,
                reactiveFunc,
                destructFunc,
                bTempObjAllowed,
                nWaitTime);
    // checkout the object
    std::shared_ptr<T> pObj=pool.checkout();
    std::cerr << "after checkout, the pool is: " << pool.toString() << std::endl;
    pObj=pool.checkout();
    std::cerr << "after checkout, the pool is: " << pool.toString() << std::endl;
    pObj=pool.checkout();
    std::cerr << "after checkout, the pool is: " << pool.toString() << std::endl;
    pObj=pool.checkout();
    std::cerr << "after checkout, the pool is: " << pool.toString() << std::endl;
    pObj=pool.checkout();
    std::cerr << "after checkout, the pool is: " << pool.toString() << std::endl;
    if(pObj!=nullptr)
    {
        std::cerr << "got an object which is not nullptr" << std::endl;
        pool.checkin(pObj); // checkin the object
        std::cerr << "after checkin, the pool is: " << pool.toString() << std::endl;
    }
    else
        std::cerr << "got an object which is nullptr" << std::endl;
    // reset the pool
    pool.reset();
    std::cerr << "after reset, the pool is: " << pool.toString() << std::endl;
}

Wednesday, May 25, 2016

How to let MeCab library use a given dictionary directory

MeCab is a well-known morphological analysis tool that supports a few languages. I use it in my project to tokenize Japanese sentences into words. I installed it in its default system directories and everything worked well.

Recently I had to make it use a specific dictionary directory from my code, without installing it on the target machine. I ran into an issue: MeCab just did not use the dictionary directory I had given and threw errors.

After reading the MeCab source code, I found that mecab-0.996/src/utils.cpp looks for the dictionary files using the code in Reference (3). The function is called load_dictionary_resource(), and it has to find mecabrc first before loading the real dictionaries. mecabrc is a configuration file installed by MeCab to record the dictionary path etc., and it looks like:
;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
dicdir =  /home/your-name/local/lib/mecab/dic/ipadic

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
Here dicdir could point to a wrong path, so we need to tell MeCab to use a given dictionary directory instead.

The mecabrc file can be specified via the option "--rcfile" and the dictionary directory via "--dicdir":
#include <iostream>
#include <mecab.h>
MeCab::Tagger *m_mecabTagger;
        m_mecabTagger=MeCab::createTagger("--rcfile /path/to/dummy/mecabrc -O wakati --dicdir /path/to/your/dictionary/dir");
        if (!m_mecabTagger)
       {
             // createTagger() returned NULL, so ask the library for the last error
             const char *e = MeCab::getLastError();
             std::cerr << "ERROR: " << e << std::endl;
       }

References
(1) MeCab: http://taku910.github.io/mecab/libmecab.html
(2) MeCab API: http://taku910.github.io/mecab/doxygen/classMeCab_1_1Tagger.html
(3) piece of mecab-0.996/src/utils.cpp:
bool load_dictionary_resource(Param *param) {
  std::string rcfile = param->get<std::string>("rcfile");

#ifdef HAVE_GETENV
  if (rcfile.empty()) {
    const char *homedir = getenv("HOME");
    if (homedir) {
      const std::string s = MeCab::create_filename(std::string(homedir),
                                                   ".mecabrc");
      std::ifstream ifs(WPATH(s.c_str()));
      if (ifs) {
        rcfile = s;
      }
    }
  }

  if (rcfile.empty()) {
    const char *rcenv = getenv("MECABRC");
    if (rcenv) {
      rcfile = rcenv;
    }
  }
#endif

#if defined (HAVE_GETENV) && defined(_WIN32) && !defined(__CYGWIN__)
  if (rcfile.empty()) {
    scoped_fixed_array<wchar_t, BUF_SIZE> buf;
    const DWORD len = ::GetEnvironmentVariableW(L"MECABRC",
                                                buf.get(),
                                                buf.size());
    if (len < buf.size() && len > 0) {
      rcfile = WideToUtf8(buf.get());
    }
  }
#endif

#if defined(_WIN32) && !defined(__CYGWIN__)
  HKEY hKey;
  scoped_fixed_array<wchar_t, BUF_SIZE> v;
  DWORD vt;
  DWORD size = v.size() * sizeof(v[0]);

  if (rcfile.empty()) {
    ::RegOpenKeyExW(HKEY_LOCAL_MACHINE, L"software\\mecab", 0, KEY_READ, &hKey);
    ::RegQueryValueExW(hKey, L"mecabrc", 0, &vt,
                       reinterpret_cast<BYTE *>(v.get()), &size);
    ::RegCloseKey(hKey);
    if (vt == REG_SZ) {
      rcfile = WideToUtf8(v.get());
    }
  }

  if (rcfile.empty()) {
    ::RegOpenKeyExW(HKEY_CURRENT_USER, L"software\\mecab", 0, KEY_READ, &hKey);
    ::RegQueryValueExW(hKey, L"mecabrc", 0, &vt,
                       reinterpret_cast<BYTE *>(v.get()), &size);
    ::RegCloseKey(hKey);
    if (vt == REG_SZ) {
      rcfile = WideToUtf8(v.get());
    }
  }

  if (rcfile.empty()) {
    vt = ::GetModuleFileNameW(DllInstance, v.get(), size);
    if (vt != 0) {
      scoped_fixed_array<wchar_t, _MAX_DRIVE> drive;
      scoped_fixed_array<wchar_t, _MAX_DIR> dir;
      _wsplitpath(v.get(), drive.get(), dir.get(), NULL, NULL);
      const std::wstring path =
          std::wstring(drive.get()) + std::wstring(dir.get()) + L"mecabrc";
      if (::GetFileAttributesW(path.c_str()) != -1) {
        rcfile = WideToUtf8(path);
      }
    }
  }
#endif

  if (rcfile.empty()) {
    rcfile = MECAB_DEFAULT_RC;
  }

  if (!param->load(rcfile.c_str())) {
    rcfile = "mecab_etc/mecabrc";
    if (!param->load(rcfile.c_str())) {
        return false;
    }
  }

  std::string dicdir = param->get<std::string>("dicdir");
  if (dicdir.empty()) {
    dicdir = ".";  // current
  }
  remove_filename(&rcfile);
  replace_string(&dicdir, "$(rcpath)", rcfile);
  param->set<std::string>("dicdir", dicdir, true);
  dicdir = create_filename(dicdir, DICRC);

  if (!param->load(dicdir.c_str())) {
    return false;
  }

  return true;
}

Tuesday, May 17, 2016

Openssl segfault bug and building Curl with new openssl libraries

Recently, I hit a bug in the openssl library which results in a segfault when sending HTTPS requests with the Curl library (which uses openssl).
All the methods in this post have been tested on Ubuntu 12.04.

When using GDB to backtrace the segfault, the backtrace looks like:
(gdb) bt
#0  0x00007fbaf69a54cb in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#1  0xca62c1d6ca62c1d6 in ?? ()
#2  0xca62c1d6ca62c1d6 in ?? ()
#3  0xca62c1d6ca62c1d6 in ?? ()
#4  0xca62c1d6ca62c1d6 in ?? ()
#5  0xca62c1d6ca62c1d6 in ?? ()
#6  0xca62c1d6ca62c1d6 in ?? ()
#7  0xca62c1d6ca62c1d6 in ?? ()
#8  0xca62c1d6ca62c1d6 in ?? ()
#9  0x00007fbaf6d10935 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#10 0x00007fba3c12cb70 in ?? ()
#11 0x000000000000000a in ?? ()
#12 0x00007fbaf69a1900 in SHA1_Update () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#13 0x00007fbaf6a23def in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#14 0x00007fbaf69d75e5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#15 0x00007fbaf69d73c8 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#16 0x00007fbaf69edf9b in EC_KEY_generate_key () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
#17 0x00007fbaf6d2f2a4 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#18 0x00007fbaf6d30c03 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#19 0x00007fbaf6d3a373 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.0.0
#20 0x000000000049a015 in ossl_connect_common ()
#21 0x000000000046a428 in Curl_ssl_connect_nonblocking ()
#22 0x000000000046ee9e in https_connecting ()
#23 0x00000000004671ee in multi_runsingle ()
#24 0x0000000000467cd5 in curl_multi_perform ()
#25 0x0000000000462e9e in curl_easy_perform ()
If you look at the libraries that curl depends on, you can see which ssl libraries it is using:
$ ldd ./curl
    linux-vdso.so.1 =>  (0x00007fffd6039000)
    libidn.so.11 => /usr/lib/x86_64-linux-gnu/libidn.so.11 (0x00007f48cf957000)
    libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007f48cf6f9000)
    libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f48cf31d000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f48cf106000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f48ceefe000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f48ceb3f000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f48ce93b000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f48ce71e000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f48cfb9d000)


After googling for a while, I found that this is most likely due to a bug in the openssl library that was fixed in 1.0.1c, as mentioned in Reference (1). The openssl 1.0.1 series is versioned as 1.0.1, 1.0.1a, 1.0.1b, and so on, and Ubuntu 12.04 ships a buggy 1.0.1 build by default.

The next question is how to build Curl against a newer openssl that does not have the segfault bug. After more googling, I found it is not as easy as expected, and many people resort to hacks. I ended up with a fairly simple way to do it:

# $LOCAL_DIR is the installation prefix; set it to a directory you can write to before running these commands
# install openssl, as the default openssl on Ubuntu 12.04 has a bug in HTTPS connections that can result in a segfault
git clone https://github.com/openssl/openssl.git
cd openssl
git checkout OpenSSL_1_0_2g
./config --prefix=$LOCAL_DIR no-shared
make -j
make install

# install curl
git clone https://github.com/curl/curl.git
cd curl
git checkout curl-7_48_0
autoreconf -iv
CPPFLAGS="-I$LOCAL_DIR/include" \
LDFLAGS="-L$LOCAL_DIR/lib" \
LIBS="-ldl" \
./configure --disable-shared --prefix=$LOCAL_DIR --without-ldap-lib --without-librtmp --with-ssl
make -j 
make install

After installation, the new curl binary has the new libssl (OpenSSL/1.0.2g) statically linked in, so libssl and libcrypto no longer appear in the ldd output:
$ ldd curl
    linux-vdso.so.1 =>  (0x00007fff8f9fe000)
    libidn.so.11 => /usr/lib/x86_64-linux-gnu/libidn.so.11 (0x00007f5fdae42000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f5fdac2b000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f5fdaa22000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f5fda81e000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5fda460000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5fda242000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f5fdb080000)
$ ./curl --version
curl 7.48.0-DEV (x86_64-unknown-linux-gnu) libcurl/7.48.0-DEV OpenSSL/1.0.2g zlib/1.2.3.4 libidn/1.23
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP UnixSockets
Important note for developers: when you build your own program against the curl library (libcurl.a), you may end up with a binary that still requires the buggy openssl libraries, like:
    libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007f0338f94000)
    libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f0338bb8000)
But no worries: I found this is probably because curl-config makes you link your program against the openssl libraries by default when you use it to generate your g++ arguments:
$ ./curl-config --libs
-L$LOCAL_DIR/lib -lcurl -lidn -lssl -lcrypto -lssl -lcrypto -lz -lrt -ldl
even if you follow the methods in this post. However, your program won't actually use the buggy openssl libraries on your Ubuntu 12.04 (I have tested my program, and the segfault no longer happens).









References
(1) https://bugs.launchpad.net/ubuntu/+source/s3cmd/+bug/973741

(2) https://curl.haxx.se/mail/lib-2014-12/0053.html

(3) https://github.com/openssl/openssl/tree/OpenSSL_1_0_2g

(4) https://github.com/curl/curl/tree/curl-7_48_0

Tuesday, April 26, 2016

Increase the number of open files limit on Ubuntu 12.04

On Ubuntu 12.04, each process can by default open at most 1024 files, counting socket handles, file handles, etc.
When you develop scalable programs, this limit can hinder their throughput. Many people try to increase the number, but apparently it is not straightforward.

Here are the steps that worked for me:
(1) Change /etc/security/limits.conf by adding the following lines:
your-user-name soft nofile 4096
your-user-name hard nofile 4096

(2) Change /etc/pam.d/common-session* by adding the following line:
session required pam_limits.so

(3) Log out and log in again if you use ssh. You can then check the new limit with ulimit -n.
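
To confirm the limit from inside a program rather than from the shell, something like the following minimal sketch could be used. It relies on the POSIX getrlimit() call with RLIMIT_NOFILE; the helper name readOpenFileLimits is my own and only for illustration.

```cpp
#include <sys/resource.h>  // getrlimit, struct rlimit, RLIMIT_NOFILE

#include <cassert>
#include <cstdio>

// Hypothetical helper (name chosen for this sketch): reads the soft and hard
// limits on open file descriptors for the current process.
// Returns false if getrlimit() fails.
bool readOpenFileLimits(rlim_t *softLimit, rlim_t *hardLimit) {
    assert(softLimit != nullptr && hardLimit != nullptr);
    struct rlimit limits;
    if (getrlimit(RLIMIT_NOFILE, &limits) != 0) {
        std::perror("getrlimit");
        return false;
    }
    *softLimit = limits.rlim_cur;  // the limit actually enforced
    *hardLimit = limits.rlim_max;  // ceiling the soft limit may be raised to
    return true;
}
```

After the steps above and a fresh login, both values should be 4096 for the configured user if the change took effect.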



References

http://askubuntu.com/questions/162229/how-do-i-increase-the-open-files-limit-for-a-non-root-user

Tuesday, April 12, 2016

Memory leak issue in MySQL C++ connector 1.1.7

I recently used the official C++ connector for MySQL (version 1.1.7) and found a strange memory leak, even though I was following the official documentation exactly. The connector can be found here.

What does the memory leak look like?

I used Valgrind to check my program for memory leaks and found that the memory used by the MySQL thread was never released:
==20882== 8,000 bytes in 40 blocks are definitely lost in loss record 1,707 of 1,791
==20882==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==20882==    by 0x620F83E: my_thread_init (in /usr/lib/libmysqlcppconn.so.7.1.1.7)
==20882==    by 0x61F7384: mysql_server_init (in /usr/lib/libmysqlcppconn.so.7.1.1.7)
==20882==    by 0x61FD1C6: mysql_init (in /usr/lib/libmysqlcppconn.so.7.1.1.7)
==20882==    by 0x61F21F3: sql::mysql::NativeAPI::LibmysqlStaticProxy::init(st_mysql*) (in /usr/lib/libmysqlcppconn.so.7.1.1.7)
==20882==    by 0x61F369E: sql::mysql::NativeAPI::MySQL_NativeConnectionWrapper::MySQL_NativeConnectionWrapper(boost::shared_ptr) (in /usr/lib/libmysqlcppconn.so.7.1.1.7)
==20882==    by 0x61F341D: sql::mysql::NativeAPI::MySQL_NativeDriverWrapper::conn_init() (in /usr/lib/libmysqlcppconn.so.7.1.1.7)
==20882==    by 0x61ABE63: sql::mysql::MySQL_Driver::connect(sql::SQLString const&, sql::SQLString const&, sql::SQLString const&) (in /usr/lib/libmysqlcppconn.so.7.1.1.7)

Why does it happen?

After some investigation, I found the post listed in Reference (1). It seems that the official documentation does not tell users to do anything with the MySQL_Driver pointer (e.g., this sample code), but calling threadEnd() on it is actually necessary in order to release the memory allocated for the thread used by the MySQL connector.
My final code looks like:
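    // callback signature inferred from the call below: it receives the result set
    // and sets the boolean that this function returns
    bool runQueryWithResult(const std::string &query,
            std::function<void(sql::ResultSet *, bool &)> callbackFunction)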
    {
        sql::mysql::MySQL_Driver *sqlDriver=NULL;
        sql::Connection *connection=NULL;
        sql::Statement *stmt=NULL;
        sql::ResultSet *res=NULL;
        try
        {
            // get a driver
            sqlDriver = sql::mysql::get_driver_instance();
            // create connection
            connection=sqlDriver->connect(m_db_tcpAddress, m_db_username, m_db_password);
            // run a statement
            stmt=connection->createStatement();
            res=stmt->executeQuery(query);
            bool returnBool=false;
            callbackFunction(res, returnBool);
            delete res;
            res=NULL;
            delete stmt;
            stmt=NULL;
            connection->close();
            delete connection;
            connection=NULL;
            // this step is necessary to avoid memory leak, though it is not mentioned in the document
            sqlDriver->threadEnd();

            sqlDriver=NULL;
            return returnBool;
        }
        catch (sql::SQLException &e)
        {
            std::string warning="SQLException in "+std::string(__FILE__)+"("+std::string(__FUNCTION__)+") on line "+std::to_string(__LINE__)
                +"\n# ERR: "+std::string(e.what())
                +" (MySQL error code: "+std::to_string(e.getErrorCode())+", SQLState: "+e.getSQLState()+") with query: "+query;
            Tools::error(warning);
            // release resources in the same order as on the success path
            delete res;
            delete stmt;
            if (connection!=NULL)
            {
                connection->close();
                delete connection;
            }
            // threadEnd() comes last, after the connection is gone
            if (sqlDriver!=NULL)
                sqlDriver->threadEnd();
            return false;
        }
    }
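
Since threadEnd() has to be called on every exit path, one possible alternative is a small scope guard that calls it from its destructor. The following is only a sketch of that idea: FakeDriver is a hypothetical stand-in so that the code compiles without the connector, and ThreadEndGuard and runGuarded are names I made up. With the real connector, the guard would wrap the pointer returned by sql::mysql::get_driver_instance().

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for sql::mysql::MySQL_Driver; only threadEnd()
// mirrors the real interface. It counts the calls so they can be checked.
struct FakeDriver {
    int threadEndCalls = 0;
    void threadEnd() { ++threadEndCalls; }
};

// Calls threadEnd() when the guard goes out of scope, on both the normal
// and the exceptional exit path.
template <typename Driver>
class ThreadEndGuard {
public:
    explicit ThreadEndGuard(Driver *driver) : driver_(driver) {
        assert(driver_ != nullptr);
    }
    ~ThreadEndGuard() { driver_->threadEnd(); }
    ThreadEndGuard(const ThreadEndGuard &) = delete;
    ThreadEndGuard &operator=(const ThreadEndGuard &) = delete;

private:
    Driver *driver_;
};

// Hypothetical query function: throws on request to exercise both paths.
bool runGuarded(FakeDriver &driver, bool fail) {
    ThreadEndGuard<FakeDriver> guard(&driver);
    if (fail)
        throw std::runtime_error("simulated query failure");
    return true;
}
```

The guard should be declared before the connection, statement, and result set objects, so that it is destroyed after them and threadEnd() still runs last.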




Reference

(1) Post about memory leak in MySQL C++ connector:
http://stackoverflow.com/questions/13082389/memory-leak-in-mysql-c-connector

(2) Official document of MySQL C++ connector:
https://dev.mysql.com/doc/connector-cpp/en/connector-cpp-examples-connecting.html


Wednesday, March 30, 2016

Fix cURL CURLOPT_TIMEOUT_MS bug

cURL is a very useful library written in C that handles many kinds of network tasks, such as calling HTTP REST APIs.

Recently, I needed to add a timeout to a program of mine that uses cURL, and I found a convenient option called CURLOPT_TIMEOUT_MS:
https://curl.haxx.se/libcurl/c/CURLOPT_TIMEOUT_MS.html
which sets a timeout in milliseconds for a cURL transfer.

But I found that whenever I set the timeout to less than 1000 milliseconds, the cURL connection times out immediately after it is performed. After googling for a while, I found that the issue is caused by the following:
If libcurl is built to use the standard system name resolver, that portion of the transfer will still use full-second resolution for timeouts with a minimum timeout allowed of one second. The problem is that on Linux/Unix, when libcurl uses the standard name resolver, a SIGALRM is raised during name resolution which libcurl thinks is the timeout alarm.
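One quick fix for this problem is to disable signals using CURLOPT_NOSIGNAL. Note that with the standard name resolver this also means that no timeout is applied while the host name is being resolved. For example: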
// both options expect a long argument
curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1L);
// timeout in milliseconds
// https://curl.haxx.se/libcurl/c/CURLOPT_TIMEOUT_MS.html
curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, (long)timeoutMilliseconds);
// Perform the request, res will get the return code
res = curl_easy_perform(curl);
// Check for errors (std::runtime_error needs <stdexcept>)
if (res==CURLE_OPERATION_TIMEDOUT)
       throw std::runtime_error("timeout");
else if(res != CURLE_OK)
       throw std::runtime_error("error info: "+std::string(curl_easy_strerror(res)));




Reference
https://ravidhavlesha.wordpress.com/2012/01/08/curl-timeout-problem-and-solution/

How to add timeout to an XMLRPC-C client

XMLRPC-C is very efficient if you need a remote procedure call protocol.

I needed to add a timeout to an XMLRPC client based on C, but I ran into a problem: the timeout did not work as expected.

I started with an example in the xmlrpc-c package located at:
xmlrpc-c-1.42.99/src/examples/cpp/sample_add_client_complex.cpp (as shown in the Reference at the end of this post),
but found that when I set the timeout to less than 1000, it basically does not work at all, i.e., no timeout happens.

After looking into the xmlrpc-c code, I found the cause. xmlrpc-c has a source file at:
xmlrpc-c-1.42.99/lib/curl_transport/curltransaction.c
which sets the timeout in the following function:
static void
setCurlTimeout(CURL *       const curlSessionP ATTR_UNUSED,
               unsigned int const timeoutMs ATTR_UNUSED) {

#if HAVE_CURL_NOSIGNAL
    unsigned int const timeoutSec = (timeoutMs + 999)/1000;

    assert((long)timeoutSec == (int)timeoutSec);
        /* Calling requirement */
    curl_easy_setopt(curlSessionP, CURLOPT_TIMEOUT, (long)timeoutSec);
#else
    /* Caller should not have called us */
    abort();
#endif
}
You can see that it actually sets the CURLOPT_TIMEOUT option of the curl connection, and CURLOPT_TIMEOUT is in seconds, not milliseconds. If I set the timeout of the xmlrpc client to 600ms, this function rounds it up to 1 second, which is definitely not what I want.
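
To make the rounding concrete, here is a tiny sketch that reproduces the (timeoutMs + 999)/1000 expression from setCurlTimeout() on its own. The function names are mine and are not part of xmlrpc-c, and the sketch assumes timeoutMs is small enough that adding 999 does not overflow.

```cpp
#include <cassert>

// Same expression as in setCurlTimeout(): ceiling division from
// milliseconds to whole seconds.
unsigned int roundUpToSeconds(unsigned int timeoutMs) {
    return (timeoutMs + 999) / 1000;
}

// Hypothetical helper: how many extra milliseconds the rounding adds
// on top of the timeout that was requested.
unsigned int roundingOverheadMs(unsigned int timeoutMs) {
    unsigned int effectiveMs = roundUpToSeconds(timeoutMs) * 1000;
    assert(effectiveMs >= timeoutMs);  // ceiling division never rounds down
    return effectiveMs - timeoutMs;
}
```

For example, roundUpToSeconds(600) is 1 and roundingOverheadMs(600) is 400, so a requested 600ms timeout only fires after a full second.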

One quick fix is to apply the approach from my other post (Fix cURL CURLOPT_TIMEOUT_MS bug) as follows:
static void
setCurlTimeout(CURL *       const curlSessionP ATTR_UNUSED,
               unsigned int const timeoutMs ATTR_UNUSED) {

#if HAVE_CURL_NOSIGNAL
    /* curl_easy_setopt() expects a long for both options */
    curl_easy_setopt(curlSessionP, CURLOPT_NOSIGNAL, 1L);
    curl_easy_setopt(curlSessionP, CURLOPT_TIMEOUT_MS, (long)timeoutMs);
#else
    /* Caller should not have called us */
    abort();
#endif
}



Reference:

// taken from xmlrpc-c-1.42.99/examples/cpp/sample_add_client_complex.cpp
/*=============================================================================
                        sample_add_client_complex.cpp
===============================================================================
  This is an example of an XML-RPC client that uses XML-RPC for C/C++
  (Xmlrpc-c).

  In particular, it uses the complex lower-level interface that gives you
  lots of flexibility but requires lots of code.  Also see
  xmlrpc_sample_add_server, which does the same thing as this program,
  but with much simpler code because it uses a simpler facility of
  Xmlrpc-c.

  This program actually gains nothing from using the more difficult
  facility.  It is for demonstration purposes.
=============================================================================*/

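#include <cassert>
#include <cstdlib>
#include <string>
#include <iostream>

using namespace std;

#include <xmlrpc-c/girerr.hpp>
#include <xmlrpc-c/base.hpp>
#include <xmlrpc-c/client.hpp>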

int
main(int argc, char **) {

    if (argc-1 > 0) {
        cerr << "This program has no arguments" << endl;
        exit(1);
    }

    try {
        xmlrpc_c::clientXmlTransport_curl myTransport(
            xmlrpc_c::clientXmlTransport_curl::constrOpt()
            .timeout(10000)  // milliseconds
            .user_agent("sample_add/1.0"));

        xmlrpc_c::client_xml myClient(&myTransport);

        string const methodName("sample.add");

        xmlrpc_c::paramList sampleAddParms;
        sampleAddParms.add(xmlrpc_c::value_int(5));
        sampleAddParms.add(xmlrpc_c::value_int(7));

        xmlrpc_c::rpcPtr myRpcP(methodName, sampleAddParms);

        string const serverUrl("http://localhost:8080/RPC2");

        xmlrpc_c::carriageParm_curl0 myCarriageParm(serverUrl);

        myRpcP->call(&myClient, &myCarriageParm);

        assert(myRpcP->isFinished());

        int const sum(xmlrpc_c::value_int(myRpcP->getResult()));
            // Assume the method returned an integer; throws error if not

        cout << "Result of RPC (sum of 5 and 7): " << sum << endl;

    } catch (exception const& e) {
        cerr << "Client threw error: " << e.what() << endl;
    } catch (...) {
        cerr << "Client threw unexpected error." << endl;
    }

    return 0;
}