
PHP get Google index for domain/site. Indexed by Google pages count

February 19th, 2010

The function below lets you get the number of pages Google has indexed for a domain or site.

function getGoogleIndexCount($url) {
   $html = "http://www.google.com/search?hl=en&safe=off&q=site%3A".$url."&btnG=Search"; // I strongly recommend you to use www.google.<your country zone> version of Google

   $content = getRemoteFile($html);
   if (preg_match('/<div id=resultStats>(\d+) results/i',$content,$arr)) {
       $t = $arr[1];
       $t = str_replace(' ','',$t);
       $t = str_replace(',','',$t);
       return (int)$t;
   } else return 0;
}

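function getRemoteFile($url) {
   $ch = curl_init($url);
   curl_setopt($ch, CURLOPT_HEADER, 1);         // include the response headers in the output
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it

   $data = curl_exec($ch);

   if(curl_errno($ch)) {
      echo "Error!  Error code:".curl_errno($ch)." Error:".curl_error($ch);
      curl_close($ch);
      return false;
   }

   curl_close($ch); // release the handle on the success path as well

   // Split the headers from the body at the first blank line and keep only the body
   list($header, $data) = preg_split("/\r?\n\r?\n/", $data, 2);

   return $data;
}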
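
The search URL above is built by plain string concatenation, so a `$url` containing spaces or other special characters is passed through unescaped. As a minimal sketch of one way to handle that, and to make the country-specific Google host configurable, something like the following could be used. Note that `buildSiteQueryUrl` and its `$googleHost` parameter are hypothetical names introduced here for illustration; they are not part of the original function.

```php
<?php
// Hypothetical helper (not part of the original post): builds the "site:" query URL.
// $googleHost is an assumed parameter for choosing a country-specific Google domain.
function buildSiteQueryUrl($domain, $googleHost = 'www.google.com') {
    $params = array(
        'hl'   => 'en',
        'safe' => 'off',
        'q'    => 'site:' . $domain,
        'btnG' => 'Search',
    );
    // http_build_query() URL-encodes every value, so "site:" becomes "site%3A".
    return 'http://' . $googleHost . '/search?' . http_build_query($params, '', '&');
}

echo buildSiteQueryUrl('example.com'), "\n";
// http://www.google.com/search?hl=en&safe=off&q=site%3Aexample.com&btnG=Search
```

The resulting string can be passed to `getRemoteFile()` in place of the hand-built `$html` value.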
Categories: PHP
  1. November 9th, 2010 at 06:57 | #1

    Hi Alex,

    While updating a similar script of mine to match Google's latest output, I came across your blog post. Your regular expression does not seem correct/up to date to me; check the diff below (I hope the 'pre' element is handled correctly):

    diff -pruN getGoogleIndexCount.php.orig getGoogleIndexCount.php
    --- getGoogleIndexCount.php.orig 2010-11-09 14:46:39.000000000 +0000
    +++ getGoogleIndexCount.php 2010-11-09 14:47:19.000000000 +0000
    @@ -3,9 +3,8 @@ function getGoogleIndexCount($url) {
        $html = "http://www.google.com/search?hl=en&safe=off&q=site%3A".$url."&btnG=Search"; // I strongly recommend you to use www.google.<your country zone> version of Google

        $content = getRemoteFile($html);
    -   if (preg_match('/(\d+) results/i',$content,$arr)) {
    -       $t = $arr[1];
    -       $t = str_replace(' ','',$t);
    +   if (preg_match('/(About )?([0-9,]+) results/',$content,$arr)) {
    +       $t = $arr[2];
            $t = str_replace(',','',$t);
            return (int)$t;
        } else return 0;

    The comma handling is needed for numbers of 1,000 or more (as in 2,400); the "About" is optional so that exact result counts are handled too (that happens when there are fewer than 10 results).

    Can I ask you under what terms I can take, reuse, republish your code? (Giving credit to you is the obvious part of that).

    Regards,
    Antonio
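
To illustrate the pattern proposed in the comment above, here is a small sketch that applies it to a few sample strings instead of a live Google response. The helper name `parseResultCount` and the sample inputs are assumptions made for this example; the actual markup Google returns may differ and changes over time.

```php
<?php
// Hypothetical helper: extracts the count from a result-stats string
// using the pattern suggested in the comment.
function parseResultCount($content) {
    // "About " is optional; digits may be grouped with commas (e.g. 2,400).
    if (preg_match('/(About )?([0-9,]+) results/', $content, $arr)) {
        // Group 2 holds the number; strip the thousands separators before casting.
        return (int) str_replace(',', '', $arr[2]);
    }
    return 0; // no count found
}

var_dump(parseResultCount('About 2,400 results')); // int(2400)
var_dump(parseResultCount('7 results'));           // int(7)
var_dump(parseResultCount('no match here'));       // int(0)
```

Because the count is always read from the second capture group, the same code path covers both the approximate ("About ...") and the exact form.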
