Merry Christmas and Happy New Year everyone. Looking forward to the new year as I expect to be a father in January :)
Let's now talk a little bit about the parenting and the past: Since 2006 ADOdb has supported Active Record, the object-oriented paradigm for processing records using SQL. One of the most powerful features of Active Record is the ability to define parent-child relationships. The old way was:
$person = new person();
$person->HasMany('children', 'person_id');
$person->Load("id=1");
Where "persons" is the parent table, "children" is the child table and "children.person_id" is a field in "children" pointing to "persons". All the children of person with id=1 would be dynamically loaded into the array $person->children when the property was accessed (lazy loading).
This was confusing for the programmer and had many limitations, as was pointed out by Arialdo Martini in this post.
Firstly, it was confusing to the programmer. Should HasMany() be called every time you create a new person()? The answer was no: the relationship is global, but the implementation made it look as if it were local to the instance. HasMany() really should have been defined statically, before new person() was used.
Another problem was that you could not override the class of the child objects, so you couldn't easily modify the behaviour of the child objects.
My objective was to fix all this, and still keep backward compatibility so your old code continued to work. The good news is that all the metadata to keep track of all the object-table relationships could still be reused. The problem was one of a weak API, but the internals were sound. The solution implemented in ADOdb 5.07 was to create a new set of static functions that override the default behaviour:
The new way defines the relationship in a static function, which makes it clearer that it only needs to be called once in your init php code:
class person extends ADODB_Active_Record{}
ADODB_Active_Record::ClassHasMany('person', 'children','person_id');
$person = new person();
$person->Load("id=1");
One of the things that I try to do in ADOdb is maintain backward compatibility. You are also able to override the defaults of Active Record (id is the primary key, the name of the table is the plural version of the class name). So if the table name of the parent is not "persons", but "people", you can use:
ADODB_Active_Record::TableHasMany('people', 'children','person_id');
$person = new person();
$person->Load("id=1");
The default primary key name is "id". You can override it (say "pid" is used) using
ADODB_Active_Record::TableKeyHasMany('people', 'pid', 'children', 'person_id');
Formerly, the child class was always an ADODB_Active_Record instance. Now you can derive the class of the children like this:
class childclass extends ADODB_Active_Record {... };
ADODB_Active_Record::ClassHasMany('person', 'children','person_id', 'childclass');
Works the same way with TableHasMany().
Analogously, there are functions ClassBelongsTo, TableBelongsTo, TableKeyBelongsTo for defining child pointing to parent.
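To see why a static, class-level definition reads more clearly than a per-instance call, here is a toy sketch. It is not ADOdb's implementation: MiniRecord, its in-memory $tables array and classHasMany() are hypothetical names invented for this illustration. The relationship is registered once, is visible to every instance, and the child rows are only looked up when the property is first read (lazy loading).

```php
<?php
// Toy illustration only: MiniRecord and its in-memory "tables" are hypothetical
// and do not reproduce ADOdb's internals.
class MiniRecord
{
    /** Relationship metadata shared by every instance, keyed by class name. */
    private static $hasMany = array();
    /** Fake database: table name => list of rows. */
    public static $tables = array();

    public $id;
    private $loaded = array();

    // Register once, at init time: $class has many rows in $childTable via $foreignKey.
    public static function classHasMany($class, $childTable, $foreignKey)
    {
        self::$hasMany[$class][$childTable] = $foreignKey;
    }

    // Lazy loading: child rows are collected only when the property is first read.
    public function __get($name)
    {
        $class = get_class($this);
        if (!isset(self::$hasMany[$class][$name])) {
            return null;
        }
        if (!isset($this->loaded[$name])) {
            $fk = self::$hasMany[$class][$name];
            $rows = array();
            foreach (self::$tables[$name] as $row) {
                if ($row[$fk] == $this->id) $rows[] = $row;
            }
            $this->loaded[$name] = $rows;
        }
        return $this->loaded[$name];
    }
}

class person extends MiniRecord {}

MiniRecord::$tables['children'] = array(
    array('id' => 10, 'person_id' => 1, 'name' => 'Ann'),
    array('id' => 11, 'person_id' => 2, 'name' => 'Bob'),
    array('id' => 12, 'person_id' => 1, 'name' => 'Cid'),
);
MiniRecord::classHasMany('person', 'children', 'person_id'); // called once only

$p = new person();
$p->id = 1;
echo count($p->children), "\n"; // number of child rows whose person_id is 1
```

Because the metadata lives in a static array, a second new person() sees the same relationship without any further setup.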
In my previous post Easy Parallel Processing in PHP, I showed you how to implement parallel batch processing using PHP and a web server. In this post, I want to discuss partitioning your tasks so that they become easily parallelized.
The strategy I prefer is divide-and-conquer. This works by recursively breaking down a problem into two or more sub-problems of the same type, until these become simple (and fast) enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.
To illustrate with an example, let's say you have millions of financial payment data records in a database you want to process in parallel using PHP:
To find the median of a set of records in a database, I have extended ADOdb, the popular open source PHP database abstraction library I maintain, with the following function defined in the ADOConnection class:
function GetMedian($table, $field, $where = '')
{
    // Count the qualifying rows so we know where the middle row is.
    $total = $this->GetOne("select count(*) from $table $where");
    if (!$total) return false;

    $midrow = (int) ($total/2);
    // Fetch exactly one row, starting at offset $midrow of the sorted result.
    $rs = $this->SelectLimit("select $field from $table $where order by 1",1,$midrow);
    if ($rs && !$rs->EOF) return reset($rs->fields);
    return false;
}
If you have a quad-core CPU, you can call GetMedian 3 times to break up the data into 4 approximately equal parts, and pass them to 4 child processes:
$mid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO');
if ($mid === false) return 'Error'; // a median of 0 is a valid result, so test for false explicitly
$lomid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO', "where ACCOUNTNO < $mid");
$himid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO', "where ACCOUNTNO >= $mid");
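Once the three split points are known, each child process needs its own range. A minimal sketch of how the four WHERE clauses could be built follows; partitionClauses() is a hypothetical helper (not part of ADOdb), and it assumes the field is numeric so the values can be placed in the SQL unquoted.

```php
<?php
// Hypothetical helper: turns three split points into four non-overlapping
// WHERE clauses, one per child process. Assumes a numeric field.
function partitionClauses($field, $lomid, $mid, $himid)
{
    return array(
        "where $field < $lomid",
        "where $field >= $lomid and $field < $mid",
        "where $field >= $mid and $field < $himid",
        "where $field >= $himid",
    );
}

// Example split points; in practice these come from the GetMedian() calls.
$clauses = partitionClauses('ACCOUNTNO', 1000, 2000, 3000);
foreach ($clauses as $i => $where) {
    echo "job ", $i + 1, ": select * from PAYMENTS $where\n";
}
```

Using >= on the lower bound and < on the upper bound means every row falls into exactly one range.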
The above GetMedian function is not particularly optimal when you need to run it multiple times on the same dataset. Improvements are left to the reader (or to a future blog entry).
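One possible direction for such an improvement, sketched here on an in-memory array rather than a database table: sort once and read off all the split points in a single pass, instead of counting and sorting three times. splitPoints() is a hypothetical name and the sample account numbers are made up.

```php
<?php
// Hypothetical helper, not part of ADOdb: returns the ($parts - 1) values that
// split the sorted input into roughly equal ranges.
function splitPoints(array $values, $parts)
{
    $n = count($values);
    if ($n === 0) return array();
    sort($values);
    $points = array();
    for ($i = 1; $i < $parts; $i++) {
        // Same row-offset idea as GetMedian: offset n/2 for the median,
        // n/4 and 3n/4 for the quartiles.
        $points[] = $values[(int) ($n * $i / $parts)];
    }
    return $points;
}

$accounts = array(17, 3, 42, 8, 25, 11, 30, 5);
list($lomid, $mid, $himid) = splitPoints($accounts, 4);
echo "$lomid $mid $himid\n"; // prints "8 17 30"
```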
PS: Another strategy for parallelization popularised by Google is Map Reduce.
The proliferation of multicore CPUs and the inability of our learned CPU vendors to squeeze many more GHz into their designs means that often the only way to get additional performance is by writing clever parallel software.
One problem we were having is that some of our batch processing jobs were taking too long to run. In order to speed up the processing, we tried to split the processing file in half, and let a separate PHP process run each half. Given that we were using a dual-core server, each process would be able to run at close to full speed (subject to I/O constraints).
Here is our technique for running multiple parallel jobs in PHP. In this example, we have two job files: j1.php and j2.php we want to run. The sample jobs don't do anything fancy. The file j1.php looks like this:
$jobname = 'j1';
set_time_limit(0);
$secs = 60;
while ($secs) {
    echo $jobname,'::',$secs,"\n";
    flush(); @ob_flush(); ## make sure that all output is sent in real-time
    $secs -= 1;
    $t = time();
    sleep(1); // pause
}
The reason why we flush(); @ob_flush(); is that when we echo or print, the strings are sometimes buffered by PHP and not sent until later. These two functions ensure that all data is sent immediately.
We then have a 3rd file, control.php, which does the real work. This script will call j1.php and j2.php asynchronously using fsockopen in JobStartAsync(), so we are able to run j1.php and j2.php in parallel. The output from j1.php and j2.php is returned to control.php using JobPollAsync().
#
# control.php
#
function JobStartAsync($server, $url, $port=80,$conn_timeout=30, $rw_timeout=86400)
{
    $errno = '';
    $errstr = '';
    set_time_limit(0);
    $fp = fsockopen($server, $port, $errno, $errstr, $conn_timeout);
    if (!$fp) {
        echo "$errstr ($errno)<br />\n";
        return false;
    }
    $out = "GET $url HTTP/1.1\r\n";
    $out .= "Host: $server\r\n";
    $out .= "Connection: Close\r\n\r\n";
    stream_set_blocking($fp, false);
    stream_set_timeout($fp, $rw_timeout);
    fwrite($fp, $out);
    return $fp;
}
// returns false if HTTP disconnect (EOF), or a string (could be empty string) if still connected
function JobPollAsync(&$fp)
{
    if ($fp === false) return false;
    if (feof($fp)) {
        fclose($fp);
        $fp = false;
        return false;
    }
    return fread($fp, 10000);
}
###########################################################################################
if (1) { /* SAMPLE USAGE BELOW */
    $fp1 = JobStartAsync('localhost','/jobs/j1.php');
    $fp2 = JobStartAsync('localhost','/jobs/j2.php');
    while (true) {
        sleep(1);
        $r1 = JobPollAsync($fp1);
        $r2 = JobPollAsync($fp2);
        if ($r1 === false && $r2 === false) break;
        echo "<b>r1 = </b>$r1<br>";
        echo "<b>r2 = </b>$r2<hr>";
        flush(); @ob_flush();
    }
    echo "<h3>Jobs Complete</h3>";
}
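The same polling idea extends to any number of jobs. The sketch below is a hedged generalisation, not part of control.php: collectAll() is a hypothetical name, and php://memory streams stand in for the sockets so the example runs without a web server. With real non-blocking sockets, fread() may return an empty string on many passes before the data arrives.

```php
<?php
// Sketch: poll any number of job streams until every one reaches EOF.
// collectAll() is a hypothetical helper; it accepts any readable stream resources.
function collectAll(array $streams)
{
    $output = array_fill_keys(array_keys($streams), '');
    while ($streams) {
        foreach ($streams as $name => $fp) {
            $chunk = fread($fp, 10000);
            if ($chunk !== false) $output[$name] .= $chunk;
            if (feof($fp)) {            // job finished: close it and stop polling it
                fclose($fp);
                unset($streams[$name]);
            }
        }
        if ($streams) usleep(100000);   // avoid a busy loop while jobs are running
    }
    return $output;
}

// Stand-ins for the sockets returned by JobStartAsync().
$jobs = array();
foreach (array('j1', 'j2', 'j3') as $name) {
    $fp = fopen('php://memory', 'w+');
    fwrite($fp, "$name::done\n");
    rewind($fp);
    $jobs[$name] = $fp;
}

$results = collectAll($jobs);
foreach ($results as $name => $text) {
    echo $name, ' => ', $text;
}
```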
And the output could look like this:
r1 = HTTP/1.1 200 OK Date: Wed, 03 Sep 2008 07:20:20 GMT Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8d X-Powered-By: Zend Core/2.5.0 PHP/5.2.5 Connection: close Transfer-Encoding: chunked Content-Type: text/html 7 j1::60
r2 = HTTP/1.1 200 OK Date: Wed, 03 Sep 2008 07:20:20 GMT Server: Apache/2.2.4 (Unix) mod_ssl/2.2.4 OpenSSL/0.9.8d X-Powered-By: Zend Core/2.5.0 PHP/5.2.5 Connection: close Transfer-Encoding: chunked Content-Type: text/html 7 j2::60
----
r1 = 7 j1::59
r2 = 7 j2::59
----
r1 = 7 j1::58
r2 = 7 j2::58
----
Note that "7 j2::60" is returned by JobPollAsync(). The reason is that the response uses HTTP/1.1 chunked transfer encoding (see the Transfer-Encoding: chunked header), in which each chunk is preceded by a line giving its length in hexadecimal. Here the payload "j2::60" plus the newline is 7 bytes.
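If you want the job output without the chunk-length lines, the body has to be de-chunked. Here is a minimal sketch, assuming the complete body has already been accumulated and the HTTP headers removed; decodeChunked() is a hypothetical helper, and since JobPollAsync() can return a partial chunk, it should not be applied to each poll result on its own.

```php
<?php
// Hypothetical helper: decodes a complete chunked HTTP body (headers already removed).
function decodeChunked($body)
{
    $out = '';
    $pos = 0;
    while (true) {
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) break;
        // The size line is hexadecimal and may carry ";extensions", which we ignore.
        $size = hexdec(strtok(substr($body, $pos, $eol - $pos), ';'));
        if ($size == 0) break;            // a zero-length chunk ends the body
        $out .= substr($body, $eol + 2, $size);
        $pos = $eol + 2 + $size + 2;      // skip the chunk data and its trailing CRLF
    }
    return $out;
}

// Two 7-byte chunks followed by the terminating zero-length chunk.
$raw = "7\r\nj1::60\n\r\n7\r\nj1::59\n\r\n0\r\n\r\n";
echo decodeChunked($raw);
```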
I hope this was helpful. Have fun!
PS: Also see Divide-and-conquer and parallel processing in PHP.
Last week, I got an email from Garrett Serack, M'soft Open Source Community Developer. Microsoft have been kind enough to donate a set of ADOdb drivers for the new MSSQL Native Extension for PHP. You can download the extension here and the ADOdb drivers here.
Garrett also mentions that ADOdb is the first LGPL project that Microsoft has ever contributed to. I quote from his email to me:
ADODB is actually the first LGPL Open Source project that Microsoft has ever contributed to. We've got a dozen or so others lined up and ready to go to other open source PHP projects (GPL, BSD and others), But ADODB was the *FIRST*. You could say that contributing to ADODB is Microsoft going from zero to one. We announced it at OSCON, (see the post at http://port25.technet.com/archive/2008/07/25/oscon2008.aspx ) along with Microsoft becoming a platinum sponsor of the Apache Software Foundation. Either of these two steps is such a good move for Microsoft, and both together, is a good sign that the Company is learning.
Thanks Garrett.
Story in The Register.
PS: ADOdb is dual licensed as LGPL and BSD. Choose which license you want.
Jeff Atwood writes:
If you've used Windows Vista, you've probably noticed that Vista's file copy performance is noticeably worse than Windows XP. I know it's one of the first things I noticed. Here's the irony-- Vista's file copy is based on an improved algorithm and actually performs better in most cases than XP. So how come it seems so darn slow?
PS: Jeff adds that Vista SP1 has switched back to XP's algorithm. Duhh!
In my previous post I asked what the output of the following code would be:
echo 09," => (09) <br>"; echo 9," => (9) <br>";
The answer is:
0 => (09)
9 => (9)
That's because any number preceded by 0 is treated as an octal number, and 9 is an invalid octal number. Octal numbers are base 8, e.g.:
| Octal Value | Decimal Value |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 10 | 8 |
| 11 | 9 |
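The table can be checked from PHP itself. The short sketch below uses only valid octal literals, because from PHP 7.0 onwards an invalid one such as 09 is a parse error rather than being silently read as 0.

```php
<?php
var_dump(011 === 9);            // leading zero: the literal is read as octal
var_dump(octdec('11') === 9);   // explicit octal-to-decimal conversion
var_dump(decoct(9) === '11');   // and back again
var_dump((int) '09' === 9);     // strings, unlike literals, are converted as decimal
```

All four lines print bool(true). The last one shows that the gotcha applies only to numeric literals in source code, not to numeric strings read from a form or a database.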
The silly thing is that hardly anyone uses octal nowadays, but it continues to be part of the C, C++, Java and PHP standards. The mistake is also very common. C-style languages pride themselves on their terse and minimalist syntax, but this is one scenario where a language design error was probably made. Perhaps 0c should have been used to represent octal in analogy to 0x for hexadecimal, but this suggestion is sadly 35 years too late. 0 for octal is too deeply imprinted in modern compiler DNA.
PS: Here's the mistaken ADOdb bug report that started it.
Someone reported a bug in ADOdb, the open source db library I maintain. I went crazy for half an hour until I realised the problem. Here's a little gotcha you can try:
echo 09," => (09) <br>"; echo 9," => (9) <br>";
If you expect the above code to produce the same values, you are sadly mistaken. Try it. Also see the followup.
Steve Yegge talks about choosing the right programming language in the face of Code's Worst Enemy:
I'll give you the capsule synopsis, the one-sentence summary of the learnings I had from the Bad Thing that happened to me while writing my game in Java: if you begin with the assumption that you need to shrink your code base, you will eventually be forced to conclude that you cannot continue to use Java. Conversely, if you begin with the assumption that you must use Java, then you will eventually be forced to conclude that you will have millions of lines of code.
There's a lot of truth in what he says. You can't fault his taste: He prefers Mozilla Rhino, a Javascript/Ecmascript implementation.
When we coded in ASP, we were a JScript shop. When we were looking for something that ran cross-platform, the closest thing that fit the bill was PHP. Javascript with perl-style $variables. Neither language is perfect. But good enough - or as some prefer to put it: worse is better.
Happy New Year Folks! Let's keep on blogging.
This posting by Ka-Ping Yee on why PHP should never be taught is precisely why PHP should be taught. If something is popular but hard to understand then we need an education process. To just shake our heads and give up is simply immature (or trolling). Otherwise we might as well say that English (or any other spoken language for that matter) should not be taught, because spoken languages are illogical, imprecise and therefore ... useless :)
Most programming languages have similar gotchas. Oracle's PL/SQL has "" being equivalent to null. Javascript believes that 0 and "" are equivalent. C has non-zero being equivalent to boolean true. Lisp's gotcha begins with ( and ends with ) -- (just a joke). Java's gotcha begins with J2EE and the obsolete baggage that comes with it.
PS: The confusion arises because === is the real equality operator and should be used here. In PHP, == is equality after type conversion, where 0 == "0" and 0 == "" evaluating to true are accepted conventions. They are useful constructs in many situations, in a similar vein to C's convention of 0 being boolean false and non-zero being boolean true, which the critics cleverly ignore. See the PHP Manual on type comparisons.
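A few comparisons make the difference between == and === concrete. The first four results below are the same on PHP 5, 7 and 8. The last line is printed without an expected value because it depends on the version: comparing the integer 0 with a non-numeric string changed in PHP 8.0, where 0 == "" became false.

```php
<?php
var_dump(0 == "0");    // bool(true): the string is converted to a number first
var_dump(0 === "0");   // bool(false): different types, so never identical
var_dump("" == "0");   // bool(false): both are strings and they differ
var_dump("1" == "01"); // bool(true): two numeric strings are compared as numbers

var_dump(0 == "");     // version dependent, see the note above
```

When in doubt, use === and convert types explicitly.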
In my last blog entry on mod_backhand, I mentioned that you could implement redundancy in a web cluster by having multiple load balancers and using round robin DNS pointing to the load balancers. This technique is mentioned in the mod_backhand presentation notes by Theo Schlossnagle, one of the authors of mod_backhand.
But if you think about it carefully, in my opinion, this solution doesn't really work well:
I can think of several solutions, but the best one as far as I can see (if you still want to use mod_backhand of course) is to run mod_backhand on a high availability hardware solution (or if a few minutes downtime is acceptable, keep a spare box configured with mod_backhand around to swap with any balancer that goes down), and not push the hard problem of high availability and redundancy to DNS.
Ahmad Amran Kapi from Melaka, Malaysia, points out that you can use Wackamole:
Wackamole is an application that helps with making a cluster highly available. It manages a bunch of virtual IPs, that should be available to the outside world at all times. Wackamole ensures that a single machine within a cluster is listening on each virtual IP address that Wackamole manages. If it discovers that particular machines within the cluster are not alive, it will almost immediately ensure that other machines acquire these public IPs. At no time will more than one machine listen on any virtual IP. Wackamole also works toward achieving a balanced distribution of number IPs on the machine within the cluster it manages.
Since I got married last year, I haven't had much motivation to blog. The good news is that over the past year I've accumulated a list of accomplishments and experiences that I feel are worth sharing.
Recently, my company deployed our first mod_backhand installation. Mod_backhand is a load-balancing and clustering solution that runs on Apache. Let's say you have 3 web servers that you need to load balance in a cluster. When a server goes down, mod_backhand auto-detects that the server is down and routes subsequent HTTP requests to the other servers. You can buy a load-balancing box such as Cisco Redirector, or roll your own package using Linux, Apache 1.3 and mod_backhand.
The mod_backhand load balancer is basically a Linux system with Apache 1.3 (httpds) with the mod_backhand patches installed. The load-balancer also supports redirects and https. Load balancing uses the Apache virtual directory mechanism, so you can configure different load balancing behaviour for different applications on a directory basis.
The nice thing about mod_backhand is that it autodetects servers going up and down in the cluster, with no additional configuration required. All web servers in the cluster need to broadcast that they are alive and available on the cluster at regular intervals. So if you want to add another web server to the cluster, you just need to install the backhand broadcasting service and start it on the server, and the load balancer will pick it up. CPU and load information is also broadcast to the load balancer, so the load balancer can make an intelligent guess as to which web server to pass the http request to.
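The idea of routing on broadcast load information can be modelled in a few lines. This is a toy model only: pickServer() and the heartbeat array layout are hypothetical and do not reproduce mod_backhand's actual decision functions. A server whose last broadcast is too old is treated as down, and the least-loaded live server wins.

```php
<?php
// Toy model only: pickServer() and the heartbeat layout are hypothetical and
// do not reproduce mod_backhand's real algorithm.
function pickServer(array $heartbeats, $now, $maxAge = 5)
{
    $best = null;
    foreach ($heartbeats as $host => $hb) {
        // A server that has stopped broadcasting is treated as down.
        if ($now - $hb['seen'] > $maxAge) continue;
        if ($best === null || $hb['load'] < $heartbeats[$best]['load']) {
            $best = $host;
        }
    }
    return $best; // null when no server is alive
}

$now = 1000;
$heartbeats = array(
    'web1' => array('seen' => 999, 'load' => 0.80),
    'web2' => array('seen' => 998, 'load' => 0.25),
    'web3' => array('seen' => 900, 'load' => 0.05), // stale: last seen 100s ago
);
echo pickServer($heartbeats, $now), "\n"; // prints "web2"
```

Adding a server to this model is just a matter of a new heartbeat entry appearing, which mirrors why no extra configuration is needed on the balancer.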
There are some issues that you need to be aware of:
Additional Windows Notes
The Windows Registry settings for Backhand Broadcaster are a bit obscure. Here is a sample config:
[HKEY_LOCAL_MACHINE\SOFTWARE\CerebraSoft]
[HKEY_LOCAL_MACHINE\SOFTWARE\CerebraSoft\Backhand_Broadcast]
"Arriba"=dword:28213950
"numCPU"=dword:00000002
[HKEY_LOCAL_MACHINE\SOFTWARE\CerebraSoft\Backhand_Broadcast\BroadcastParams]
"HostName"="Windows Server 1 BHB"
"ContactIP"="192.168.0.121"
"SendIP"="192.168.0.255"
"SendPort"=dword:0000115d
"ContactPort"=dword:00000050
"SendTTL"=dword:00000001
The ContactIP is the IP address of my Windows server running BHB: 192.168.0.121. The SendIP is the broadcast IP for the subnet (not the IP address of the load-balancer).
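One detail that is easy to miss: dword values in a registry export are written in hexadecimal. A quick way to read the port numbers and other values from the sample config above as decimal:

```php
<?php
// Values copied from the sample registry export above.
$dwords = array(
    'SendPort'    => '0000115d',
    'ContactPort' => '00000050',
    'SendTTL'     => '00000001',
    'numCPU'      => '00000002',
);
foreach ($dwords as $name => $hex) {
    echo $name, ' = ', hexdec($hex), "\n";
}
```

So the broadcaster sends on port 4445 and tells the balancer to contact this server on port 80.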
Sometimes the first time you run NT Backhand Broadcaster, it will fail to calculate Arriba (a benchmark measuring the power of your server) and just die. You need to manually enter the Arriba value yourself in the registry. Just use the above example value if you aren't sure what to do.
The load-balancer provides a page to check out the status of all servers in the cluster, typically http://load-balancing-server/backhand/
Lastly, if you still cannot get it working, check out the Windows Event Log (Applications), as errors and status messages are logged there.
Matt of WordPress fame gives his opinion on PHP4 and the transition to PHP5. As he says:
None of the most requested features for WordPress would be any easier (or harder) if they were written for PHP 4 or 5 or Python. They’d just be different. The hard part usually has little to do with the underlying server-side language.
Very true. Most of our code continues to run fine on both PHP 4 and 5, with hardly any checking of PHP_VERSION. Migrating from PHP4 to PHP5 has been relatively painless (each time we port an application, we spend at most half a day fixing warnings that didn't appear in PHP4).
Matt asks why the take-up of PHP5 has been so low, and is quite disparaging to the PHP internals devs. I don't see it in such black and white terms. PHP5 never had a feature that was must-have or to-die-for. In fact, if you look at PHP's recent changes, most of them are performance improvements, or fixing past mistakes (adding proper date support for example in 5.2.1), or feature tweaks (iterators, etc). Given that most PHP4 developers have found workarounds to things fixed in PHP5, migrating to PHP5 is probably a low priority.
Also, some fixes in PHP5 can cause serious problems. For example, in PHP 5.2.0, time-zone calculations started to support epochs, the points in time at which a time zone's offset changed. Now my timezone was +7.30 GMT until 1980, when we got a new Prime Minister who decreed that the country would combine its 2 time-zones into 1, so Malaysia standardised on +8.00 GMT. A timestamp such as "12.10am Nov 12, 1979 MYT" in PHP 5.1 would be displayed as "11.40pm Nov 11, 1979 MYT" in PHP 5.2.
On a more optimistic note, as long as the market share of PHP remains strong, I think the take-up rate of PHP6 will probably be higher than PHP5 in the non-English world, simply because managing multilingual sites, and sites whose native script is not ASCII, is much easier in Unicode. Now that's a compelling reason!
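The 30-minute shift can be observed with the DateTime classes introduced in PHP 5.2. This is a small sketch that assumes the timezone database bundled with your PHP build records the historical offset for Asia/Kuala_Lumpur; the exact strings printed depend on that database.

```php
<?php
// Compare the UTC offset PHP applies to a 1979 date and to a 2008 date.
$tz = new DateTimeZone('Asia/Kuala_Lumpur');
$before = new DateTime('1979-11-12 00:10:00', $tz);
$after  = new DateTime('2008-11-12 00:10:00', $tz);

echo $before->format('Y-m-d H:i P'), "\n"; // P prints the offset, e.g. +07:30
echo $after->format('Y-m-d H:i P'), "\n";

// Difference between the two offsets, in minutes.
$shift = ($after->getOffset() - $before->getOffset()) / 60;
echo $shift, " minutes\n";
```

Code that stored old timestamps assuming a fixed +8.00 offset will therefore display them 30 minutes earlier once PHP applies the historical offset.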
Peter Gutmann, a noted security expert, talks about Windows Vista Content Protection. Quite disturbing and saddening. I'm about to buy a notebook for my mum, and all the ones I'm looking at have Vista preinstalled. Pfutt.
Beyond the obvious playback-quality implications of deliberately degraded output, this measure can have serious repercussions in applications where high-quality reproduction of content is vital. Vista's content-protection means that video images of premium content can be subtly altered, and there's no safe way around this — Vista will silently modify displayed content under certain (almost impossible-to-predict in advance) situations discernable only to Vista's built-in content-protection subsystem.
Stefan Esser, one of the foremost PHP security gurus in the world, is interviewed in Security Focus. He's also well known for disagreeing with the PHP Group (which oversees PHP Core development) about the way PHP security issues are treated. Disturbing in more ways than one.
I read this article on CPU trends, Converging Design Features in CPUs and GPUs. Matthew Papakipos writes:
Where are both CPU and GPU designs converging?
Both processors will be massively multi-core -- think hundreds of cores -- within a five-year period. Both processors will have complex memory hierarchies, with programmer managed core-local memories and core-local hardware-managed cache. (My own belief is that hardware-managed cache will decrease substantially in importance.) Memories will be strongly non-uniform with significant latency and throughput differences between local and non-local memory. Accelerators that can offer substantial speedups for specific tasks, either integrated on-chip or available via a HyperTransport-type interconnect, will be ubiquitous.
I'm more interested in modern CPU trends and their relation to PHP than in GPUs. Here are some of my thoughts:
Well PHP running in pre-fork mode on Apache or FastCGI on IIS/Apache should have no problems handling massively multi-core architectures, assuming the cores are uniform in design.
As to complex memory hierarchies, we already have to handle the different latencies in hard disk -> hard disk cache -> CPU data/instruction caches. We always had the option of caching data on a hard disk or RAM disk, and some PHP accelerators already give you the option of caching data in shared memory -- I just see it as more of the same for PHP developers. Things get more interesting for PHP compiler and opcode cache designers, as they will have more options for caching PHP opcodes and data.
What is interesting is the possibility of hardware acceleration of PHP. To me, it's not likely that any CPU vendor will come up with a hardware accelerator for PHP, but a CPU accelerator for .NET or Java opcodes is a strong possibility. Thus in the long run, .NET or Java compilers for PHP (and Python and Perl) could become mainstream.