Friday, October 19, 2012

Adventures of a backend developer

Recently I was given a rather interesting task which consisted of loading a CSV formatted file into a memcached bucket. Our original loader, which was a rough cli script written in (my beloved) PHP, just wasn't cutting it any longer. Error rates were rather high and the loading speed *suboptimal* to say the least, and lets not even talk about CPU/Mem usage. Back to the drawing board, I decided to get away from PHP and look at what the unix toolbox proposed as solutions. If you're not aware of memcached please check it out, it's a blazing fast and very high performant key-value store with a very simple ASCII protocol. I won't go into the details here but you can learn more about memcached's merits here and the protocol specification here. Out of that spec, the only piece which I was interested in was the "set" command, the command used for inserting/updating a key/value pair.
set <key> <flags> <exptime> <bytes> [noreply]\r\n
<value>\r\n
So what I have to work with is a linux server, the default toolbox, and a set of CSV files where the 3 columns represent: key, value, expiration time. Parsing CVS files with linux? That's a job for awk! Awk is amazing for parsing CSV files, it reads line by line and allows you to manipulate/conditionally compute on the values of the columns. So I decided to let awk tranform these CSV file into the memcached ASCII protocol. Half the problem is solved, I can transform a CSV file into a suite of memcached commands.. Now for the second half: how to pipe this into memcached? As soon as I hear network pipe, I think of another awesome linux tool: netcat. Netcat is a very simple tool which allows you to connect a socket to a remote machine and pipe whatever you like over the socket. All that said, here is a shortened demo of the final working solution, looking at the different steps I had taken: A snippet of an exmaple flat data file:
key_00001###value_00001###60
key_00002###value_00002###64
key_00003###value_00003###69
key_00004###value_00004_different_length_value###30
And the bash script using awk/netcat magic:
#!/bin/bash

function warmup_memcached_from_csv {
 local FILE="${1}";
 local HOST="${2}";
 local PORT="${3}";

 awk 'BEGIN { FS="###" } {
  printf "set %s 0 %i %i\r\n%s\r\n", $1, $3, length($2), $2
 } END { printf "quit\r\n" }' $FILE | netcat $HOST $PORT
}
So for each line in the csv file, 2 lines of output are generated, the memcached set commands. The first line of the example csv file would become:
set key_00001 0 60 11\r\n
value_00001\r\n
To execute the function on a set of csv files to warmup a bucket running on localhost:11211 would look something like this
for F in $(find /var/lib/csv_files/ -name '*.csv' -type -f); do
 warmup_memcached_from_csv $F 127.0.0.1 11211
done
That worked like a charm.. but there was more to the requirements. I needed to know the number of errors, if any, that occured during the sets. The memcached host will respond to the command with "ERROR\r\n" in the case of an error on a set command so all I needed was to count them. I know the tools for the job, grep and wc. Grep is a regex pattern matching filter which can work on streams and wc is the "word counter" which can also count lines.. So putting the 2 together, I'll filter out everything except error messages and then count them. Because netcat is a bidirection network pipe, this was very simple to tack on to the last implementation:
#!/bin/bash

function warmup_memcached_from_csv {
 local FILE="${1}";
 local HOST="${2}";
 local PORT="${3}";

 awk 'BEGIN { FS="###" } {
  printf "set %s 0 %i %i\r\n%s\r\n", $1, $3, length($2), $2
 } END { printf "quit\r\n" }' $FILE | netcat $HOST $PORT | grep 'ERROR' | wc -l
}
Now this function no longer spews a long list of memcached commands, it only returns a number, the error count. Using the function changes slightly to actually make use of the error counter:
for F in $(find /var/lib/csv_files/ -name '*.csv' -type -f); do
 ERROR_COUNT=$(warmup_memcached_from_csv $F 127.0.0.1 11211);
 if [ "$ERROR_COUNT" -gt 0 ]; then
  # .. error reporting! retry loops! everything is possible ..
 fi
done
Up till now everything is nice and simple.. The last requirement was bit of a mystery to me about how I would achieve at first: the values columns of these files were't actually plain text, but serialized php objects. The webservers using them were using the php-memcache extension. This is where the second parameter of the memcached set command comes into play, the records "flag". The memcached server has an extra interger flag field which is stored with the set command and retrieved with the get command but isn't actually used by the memcached server, instead it's only purpose it to have "metadata" for the memcached clients. Using the php-memcache extension it's possible to do something like so:
$m = new memcache();
$m->connect('127.0.0.1', 11211);
$m->set('key_00001', array('this', 'is', 'a', 'php', 'array'));

echo var_export( $m->get('key_00001'), true );

# which will display:
#
# array (
#   0 => 'this',
#   1 => 'is',
#   2 => 'a',
#   3 => 'php',
#   4 => 'array',
# )
So if the CSV file contained:
key_00001###a:5:{i:0;s:4:"this";i:1;s:2:"is";i:2;s:1:"a";i:3;s:3:"php";i:4;s:5:"array";}###60
After loading the file the frontend servers should be able to:
echo var_export( $m->get('key_00001'), true );
# and get the same result:
# array (
#   0 => 'this',
#   1 => 'is',
#   2 => 'a',
#   3 => 'php',
#   4 => 'array',
# )

So somehow, the php-memcache extension must be using this extra flag field internally to know what's to be unserialized after a get() and whats to be considered non-serialized data. A quick look into the inside of the extension told me exactly what I expected:
$ php --re memcache # this is an EXTREMELY usefull parameter ( --re ) of php for viewing which ini options/methods/functions/classes/interfaces/constants any extension installed exposes
And sure enough, in the constant definitions I found the one I was looking for: MEMCACHE_HAVE_SESSION. Admittingly the constants name wasn't very clear to me at first but then I realized that the method php-memcache uses for serializing/unserializing is the same that's defined for session storage .. so maybe not un/serialize(), it could just as well be defined as json encoding or maybe even some exotic XML format.. joy.
$ php --re memcache
... snippet ...
    Constant [ integer MEMCACHE_COMPRESSED ] { 2 }
    Constant [ integer MEMCACHE_HAVE_SESSION ] { 1 }
... snippet ...
Now I was ready to make the last and final change to the loader, simply setting the keys flag to 1 so that the php-memcache extension would automatically unserialize the value after issuing a get():
#!/bin/bash

function warmup_memcached_from_csv {
 local FILE="${1}";
 local HOST="${2}";
 local PORT="${3}";

 awk 'BEGIN { FS="###" } {
  printf "set %s 1 %i %i\r\n%s\r\n", $1, $3, length($2), $2
 } END { printf "quit\r\n" } ' $FILE | netcat $HOST $PORT | grep 'ERROR' | wc -l
}
Problem solved. We can now load the CSV files using only the linux toolbox and still get the values from memcached from php without having to make any change to the frontend webservers. Comments? Bug reports? Critics? They're all welcome!