Simple URL caching while running experiments on huge data in batch mode

I am running experiments on tweets. For each tweet, or each set of tweets, several APIs need to be called over HTTP, and this takes time. I often repeat the same experiment, which means issuing the same HTTP requests over and over again. Each request takes at least half a second. For a million tweets, one experiment takes at least half a million seconds, which is more than 5 days. And that is with only one request per tweet.

To overcome this, I cache requests and responses. I wrote a small library for PHP; the code is given below. The library issues requests with curl. Before issuing a request, if the cache option is set, it first checks whether there is cached content for that request. If there is, it simply returns the cached content. Otherwise, it requests the content over the network and caches it for later requests.

Caching is done by saving the response to a file named after the md5 hash of the request URL. Since the number of files in the cache directory gets very high, I use a two-level storage mechanism: it takes the first two characters of the md5 hash and saves the file in a directory named with those two characters. Two hexadecimal characters give at most 256 subdirectories, so this reduces the number of files in any one directory and gives a two-level look-up in the file system.

Below is the simple code. It certainly needs further improvement but, for now, it works for me.

function curl_get($url, $cache = false)
{
    // Path of the cache file for this URL (two-level layout, see getFileName).
    $md5filename = getFileName("urlcache", $url);

    // Serve the response from the cache when it is already there.
    if ($cache && file_exists($md5filename)) {
        return file_get_contents($md5filename);
    }

    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => 0,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 0, // 0 means no time limit on the transfer
    );

    $ch = curl_init();
    curl_setopt_array($ch, $options);

    // Try once, then retry up to 10 more times, waiting 10 seconds between attempts.
    $result = curl_exec($ch);
    $retry = 0;
    while (!$result && $retry < 10) {
        trigger_error(curl_error($ch));
        sleep(10);
        $result = curl_exec($ch);
        $retry++;
    }
    curl_close($ch);

    // Cache only successful responses, so that a failure is not replayed later.
    if ($cache && $result) {
        file_put_contents($md5filename, $result);
    }

    return $result;
}

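function getFileName($infix, $data)
{
    // The first two hex characters of the md5 hash name the subdirectory.
    $md = md5($data);
    $st = substr($md, 0, 2);
    $dir = __DIR__ . "/../caches/$infix/$st";

    // Create the subdirectory (and any missing parent directories) on first use.
    if (!is_dir($dir)) {
        @mkdir($dir, 0777, true);
    }

    return $dir . "/" . $md;
}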
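
To try the caching logic without touching the network, the fetch step can be passed in as a function. The following is only a minimal sketch of that idea, not part of the library above: the names cache_path and cached_fetch, the temporary base directory and the example URL are all made up for illustration.

```php
<?php
// Hypothetical helper: maps a key to <base>/<first two hex chars>/<md5 hash>.
function cache_path(string $baseDir, string $key): string
{
    $hash = md5($key);
    return $baseDir . "/" . substr($hash, 0, 2) . "/" . $hash;
}

// Hypothetical helper: returns the cached value for $key if there is one;
// otherwise calls $fetch, stores its result in the cache and returns it.
function cached_fetch(string $baseDir, string $key, callable $fetch)
{
    $path = cache_path($baseDir, $key);
    if (is_file($path)) {
        return file_get_contents($path);
    }

    $value = $fetch($key);
    if ($value !== false && $value !== null) {
        // Create the two-character subdirectory on first use.
        if (!is_dir(dirname($path))) {
            mkdir(dirname($path), 0777, true);
        }
        file_put_contents($path, $value);
    }
    return $value;
}

// Stand-in for the real HTTP call; it counts how often it is invoked.
$base = sys_get_temp_dir() . "/urlcache_demo_" . uniqid();
$calls = 0;
$fetch = function ($url) use (&$calls) {
    $calls++;
    return "response for " . $url;
};

$first  = cached_fetch($base, "http://example.com/a", $fetch); // goes through $fetch
$second = cached_fetch($base, "http://example.com/a", $fetch); // read from the cache file
```

After the two calls, $first and $second hold the same string and the stand-in fetcher has run only once. In a real run, $fetch would wrap the curl request.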

Ahmet Yıldırım
