Avoiding IFRAMES via PHP and cURL

Posted: November 23, 2009

A current project requires integration with a certain third party that provides a “Web service” to allow data integration into member websites. Unfortunately for me, this service revolves around plopping an IFRAME into your page where you’d like the data to appear. Not great.

In an ideal world, we’d be able to pull said content via (the real) AJAX, but due to security (particularly cross-domain) issues, that’s not a possibility. All is not lost, however. Enter cURL.

A brief introduction to cURL

cURL is defined in the PHP manual as:

PHP supports libcurl, a library created by Daniel Stenberg, that allows you to connect and communicate to many different types of servers with many different types of protocols. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (this can also be done with PHP’s ftp extension), HTTP form based upload, proxies, cookies, and user+password authentication.

These functions have been added in PHP 4.0.2.

In summary, cURL allows you to have PHP fetch a page for you to do with what you will.
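
To get a feel for the basic pattern before the more complete function below, a bare-bones request might look like this (a minimal sketch; the URL is just a placeholder):

// A minimal cURL GET, just to show the moving parts (placeholder URL)
$ch = curl_init( "http://example.com/some/page" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ); // hand the response back instead of printing it
$html = curl_exec( $ch );

if ( $html === false )
{
    echo "cURL error: " . curl_error( $ch );
}

curl_close( $ch );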

Setting up cURL

There’s a bit of a learning curve when using cURL, so you’ll want to review the manual. If you’re looking to set something up quick and dirty, the function I’ve come to use is (via):

function get_url( $url, $javascript_loop = 0, $timeout = 5 )
{
    // Normalize the URL and decode any HTML-encoded ampersands
    $url = str_replace( "&amp;", "&", urldecode( trim( $url ) ) );

    // Store cookies in a temp file so redirects that set cookies still work
    $cookie = tempnam( "/tmp", "CURLCOOKIE" );

    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # skips certificate verification for https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );

    $content  = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close( $ch );

    // If cURL reported a redirect it didn't follow, grab the Location header and try again
    if ( $response['http_code'] == 301 || $response['http_code'] == 302 )
    {
        ini_set( "user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );

        if ( $headers = get_headers( $response['url'] ) )
        {
            foreach ( $headers as $value )
            {
                if ( substr( strtolower( $value ), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9, strlen( $value ) ) ) );
            }
        }
    }

    // Follow JavaScript redirects (window.location) up to five levels deep
    if (    ( preg_match( "/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value ) || preg_match( "/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value ) ) &&
            $javascript_loop < 5
    )
    {
        return get_url( $value[1], $javascript_loop + 1 );
    }
    else
    {
        return array( $content, $response );
    }
}

This function lets me pass in a URL and get the page contents back as the first index of an array; the second index holds the response details from curl_getinfo() as another array.
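
For instance, a call might look like this (the URL here is purely hypothetical):

// Hypothetical URL; get_url() hands back array( $content, $response )
list( $content, $response ) = get_url( "http://example.com/service?var1=X" );

if ( $response['http_code'] == 200 )
{
    echo $content;
}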

Replacing an IFRAME with cURL

The particular service I'm working with uses GET variables to filter the data presented. I can literally use the same URL string in my cURL function and work with the data straight away. For example:

// Build the request URL from the base, the GET filters, and the API key
$service_url  = $service_base_url;
$service_url .= "&var1=X";
$service_url .= "&var2=Y";
$service_url .= "&api_key=" . $service_api_key;

$request_results = get_url($service_url);

// Pull out just the body of the returned document
preg_match("/<body.*<\/body>/s", $request_results[0], $pagecontent);

$pagecontent = $pagecontent[0];

// Strip the body tags themselves (the opening tag may carry attributes)
$pagecontent = preg_replace('/<body[^>]*>/i', '', $pagecontent);
$pagecontent = str_replace('</body>', '', $pagecontent);

// I'd like to resize the images, so point each img src at phpThumb
$pattern = '/<img[^>]*src\s*=\s*["\']?([^"\'\s>]*)/i';
$replacement = '<img src="' . $imgpath . '/phpthumb/phpThumb.php?src=' . '$1' . '&w=160&h=110&zc=1';
$pagecontent = preg_replace($pattern, $replacement, $pagecontent);

echo $pagecontent;

What's happening there: first I build the request URL (the GET variables will change based on a few things), then fire my get_url() function with the final URL. That's a great start, but of course the cURL request returns a full HTML document (including the head), which we don't need. A quick preg_match pulls out everything within the body of the document, and finally the body tags themselves are stripped out as well.

That leaves you with the remote page content just as it would have appeared inside the IFRAME itself. You can write applicable CSS and do what you will with the markup. You can even go a step further and continue to refine the markup returned. In my case, I'd like to resize the returned images to fit the design I'm trying to implement. I've come to use phpThumb for all of my resizing needs, and a quick preg_replace rewrites each img src to run through phpThumb and better match the design.
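
To illustrate, assuming $imgpath is set to /images, an image tag coming back from the service would be rewritten roughly like this (example markup only):

Before: <img src="http://example.com/photo.jpg" alt="Photo" />
After:  <img src="/images/phpthumb/phpThumb.php?src=http://example.com/photo.jpg&w=160&h=110&zc=1" alt="Photo" />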

Keep in mind the terms of service

I'm currently waiting to hear back from the third party in an effort to follow their terms of service. With the official documentation revolving around the inclusion of an IFRAME, I'd like to make sure this alternative method is acceptable before I put the remaining hours into customizing the output.


Comments

  1. Nice write up. A good suggestion, if it makes sense for the data, would be to cache the data in a local file and only update it if it has been longer than some set amount of time since the last request (10 minutes or whatever works). That way, you’re not sending a lot of requests unless it’s necessary. Something like this should work (there’s a fuller sketch after these comments):
    filemtime($cachedfile) > (time()-$offsettime)

  2. @Chad: Definitely a great suggestion, and something to do should it work out with the third party. Trying not to put too much time into it until I can confirm the third party will allow it haha. Billable hours!

  3. I would extend Chad’s suggestion to the thumbnails too. And if the TOS allow it, you can download and cache the images locally.

    Have you thought about using DOM or other XML library to manipulate the HTML, instead of regular expressions?

  4. I see that you’ve run a regexp to follow JavaScript redirects, but if I remember my PHP documentation correctly, the normal redirects (30x, etc.) will NOT be followed directly by PHP cURL for security reasons when safe_mode is enabled or open_basedir is set.

    Theoretically speaking, you could end up following a redirect to a file:// URL. You might want to parse the header contents yourself so you can follow those redirects too while checking the protocol of each redirect. There’s a good discussion as well as some code available here: http://in.php.net/manual/en/function.curl-setopt.php#71313 .

    Apart from that, you lose all the stylesheet information too — any links in the header, etc. are lost. You might want to modify that as well.

  5. @Gonzo: phpThumb actually takes care of thumbnail caching behind the scenes so that’s already taken care of. I would love to use a proper library to manipulate the document, but the source of the page I’m pulling is far from valid. Regular expressions were the safest way to go about it this time around.

    @Kunal Bhalla: I’ll have to look into your first point. As mentioned, I’m using the function offered here but will definitely look into that. Thanks for the note! I’m purposely omitting the stylesheets and other bits simply because I’m looking to import the IFRAME content and style it myself.

  6. Hi Jonathan,

    I have a few clients that get IFRAME links to MLS for their websites. Not all of the clients’ brokers want them to access the FTP data, for various reasons. So I am looking for a clean way to convert their IFRAME content and pass it into the site pages we built. It seems that what you wrote here will do that. What do you think? My PHP skills amount to hacking open source projects and PHP includes.

    Thanks

  7. I’ve used this implementation a few times when I ran into cross domain issues and really didn’t have the time to get other service providers on board for the client.

    Great article, love your blog, and the Pods series is awesome. Have you thought about contributing to the core? I think if developers got together around Pods and contributed, it’d take it to new levels. I’ve got a couple of crazy ideas…

  8. Glad to hear it’s worked out for you. At this point I’m not able to dedicate an appropriate amount of time to core Pods development, but I hope that once things calm down in my personal life it may be an option. Right now I like contributing via the walkthroughs, tutorials, and Helpers I’ve put out there thus far.

  9. Hello, I used your method with some modifications and it worked, even after hours of wasting my time with things like file_get_contents(), which wasn’t the way to go. Anyway, the only problem is that the site I’m pulling information off of checks whether JavaScript is enabled and hides the content if it isn’t. I guess using this method makes it think that JavaScript is disabled, so I got all the code but it’s hidden. Should I take out all the checks, or is there a way to set something that tricks it into thinking JavaScript is enabled?
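
Following up on the caching suggestion in the first comment, a minimal sketch of wrapping get_url() in a file cache might look like this (the cache path and ten-minute window are placeholder values; $service_url comes from the example earlier in the post):

// Rough cache wrapper around get_url() -- path and window are placeholder values
$cachedfile = "/tmp/service-cache.html";
$offsettime = 600; // ten minutes

if ( file_exists( $cachedfile ) && filemtime( $cachedfile ) > ( time() - $offsettime ) )
{
    // Cache is still fresh; skip the remote request entirely
    $pagecontent = file_get_contents( $cachedfile );
}
else
{
    // Cache is stale (or missing); hit the service and store the result
    $request_results = get_url( $service_url );
    $pagecontent     = $request_results[0];
    file_put_contents( $cachedfile, $pagecontent );
}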
