Downloading with cURL - with custom user-agent and referer

Bill

84.***.***.***
428 days ago

cURL (http://curl.haxx.se) is a very powerful library for working with files (and headers) over the web, and its PHP extension is included in most popular PHP setups.

cURL can be a little intimidating at first, so I decided to write a to-the-point PHP function for basic file retrieval with a custom user-agent and referer. It is pretty well commented, and I will add PHP code with example uses in the posts below.

PHP code:


// Defines the default values for the arguments the function can take. Note that the time_out option is in seconds.
function getFile($url, $time_out = 3, $user_agent = FALSE, $referer = FALSE)
{
    // Returns FALSE if no URL is provided.
    if (!$url)
    {
        return FALSE;
    }

    // Initializes cURL.
    $ch = curl_init();

    // Sets connection options. Use the function call to override default values rather than editing anything here.
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Makes curl_exec() return the response body instead of printing it.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Allows cURL to follow header Location: redirects. Remove or comment out if you do not desire this behaviour.

    if (!empty($time_out))
    {
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $time_out);
        curl_setopt($ch, CURLOPT_TIMEOUT, $time_out);
    }

    if (!empty($user_agent))
    {
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    }

    if (!empty($referer))
    {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    $return = array();

    // Executes the connection and places the source of the target file in the ['data'] part of the returned array.
    // If the transfer fails, curl_exec() returns FALSE, so ['data'] will be FALSE.
    $return['data'] = curl_exec($ch);

    // Stores all the information we have on the connection (returned HTTP code, etc.) in the ['info'] part of the returned array.
    $return['info'] = curl_getinfo($ch);

    // Closes the connection.
    curl_close($ch);

    // Returns the array we built.
    return $return;
}
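
Since cURL is included in most, but not all, PHP setups, it can be worth checking for it before calling getFile(). The following is only a minimal sketch; curlAvailable() is a hypothetical helper name and is not part of the function above.

```php
<?php
// Hypothetical helper (not part of the original function): reports whether
// the cURL extension is loaded, so a missing extension can be handled
// gracefully instead of causing a fatal "undefined function" error.
function curlAvailable()
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (!curlAvailable())
{
    echo "The cURL extension is not enabled in this PHP setup.\n";
}
```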

Bill

84.***.***.***
428 days ago
Downloading and echoing the HTML source of a web-site

This example code uses the above getFile() function to download http://example.com/ with a faked user-agent and the URL of this thread as the referer. If everything goes well, it prints the retrieved HTML source along with an HTML comment containing some stats about the connection we made.

PHP code:


// getFile() returns an array whenever a URL was given, so also check that the transfer itself did not fail (['data'] is FALSE in that case).
$woohoo = getFile('http://example.com/', 3, 'PHPCentral Custom User-Agent/1.0 (http://www.phpcentral.com/)', 'http://www.phpcentral.com/947-downloading-curl-custom-user-agent-referer.html');

if ($woohoo && $woohoo['data'] !== FALSE)
{
    echo "<!--\n";
    print_r($woohoo['info']); // print_r() prints by itself; echoing its return value would add a stray "1".
    echo "\n-->\n\n";
    echo $woohoo['data'];
}
else
{
    echo 'Failure.';
}

The above will output:

Code:


<!--
Array
(
[url] => http://example.com/
[content_type] => text/html; charset=UTF-8
[http_code] => 200
[header_size] => 263
[request_size] => 125
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 1.489471
[namelookup_time] => 1.379533
[connect_time] => 1.434085
[pretransfer_time] => 1.434099
[size_upload] => 0
[size_download] => 438
[speed_download] => 294
[speed_upload] => 0
[download_content_length] => 438
[upload_content_length] => 0
[starttransfer_time] => 1.489294
[redirect_time] => 0
)

-->

<HTML>
<HEAD>
<TITLE>Example Web Page</TITLE>
</HEAD>
<body>
<p>You have reached this web page by typing "example.com",
"example.net",
or "example.org" into your web browser.</p>
<p>These domain names are reserved for use in documentation and are not available
for registration. See <a href="http://www.rfc-editor.org/rfc/rfc2606.txt">RFC
2606</a>, Section 3.</p>
</BODY>
</HTML>
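
The connection stats in the HTML comment are easier to scan when condensed into one line. Below is a rough sketch of one way to do that; summarizeInfo() is a hypothetical helper name, and the choice of fields is just an assumption about what is most useful. It only relies on keys that appear in the output above.

```php
<?php
// Hypothetical helper (not from the original post): condenses the array
// returned by curl_getinfo() into a single human-readable line.
function summarizeInfo(array $info)
{
    // Seconds between the request being ready to send and the first byte arriving.
    $firstByte = $info['starttransfer_time'] - $info['pretransfer_time'];

    // %F formats floats without locale-specific decimal separators.
    return sprintf(
        'HTTP %d, %s, %d bytes in %.3Fs (DNS %.3Fs, first byte after %.3Fs)',
        $info['http_code'],
        $info['content_type'],
        $info['size_download'],
        $info['total_time'],
        $info['namelookup_time'],
        $firstByte
    );
}

// Values copied from the sample output above.
$info = array(
    'content_type'       => 'text/html; charset=UTF-8',
    'http_code'          => 200,
    'total_time'         => 1.489471,
    'namelookup_time'    => 1.379533,
    'pretransfer_time'   => 1.434099,
    'size_download'      => 438,
    'starttransfer_time' => 1.489294,
);

echo summarizeInfo($info), "\n";
```

With the sample values, the summary makes it obvious that nearly all of the time went into the DNS lookup rather than the transfer itself.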

Bill

84.***.***.***
427 days ago
A couple of notes and ideas

- For more information about nifty cURL options you can add to the function, check out http://www.tuxradar.com/practicalphp/15/10/3.

- It is generally a good idea to check that the returned http_code is 200 before assuming that the data is the content you are after and not an error page from the server.

- You could also validate that the returned content_type is text/html (if you are scraping website HTML), or treat the data differently depending on the type of file.
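
The two checks suggested above (http_code and content_type) could be combined into one small helper. This is only a sketch under the assumption that it receives the array returned by getFile(); looksLikeHtml() is a hypothetical name, and requiring exactly 200 is a deliberately strict choice you may want to loosen.

```php
<?php
// Hypothetical helper (not from the original posts): decides whether a
// getFile() result looks like a successfully retrieved HTML page.
// Expects the array shape returned by getFile():
// ['data' => response body or FALSE, 'info' => curl_getinfo() output].
function looksLikeHtml($result)
{
    // getFile() returns FALSE when no URL was given.
    if (!is_array($result) || !isset($result['data'], $result['info']))
    {
        return false;
    }

    // curl_exec() returns FALSE when the transfer itself failed.
    if ($result['data'] === false)
    {
        return false;
    }

    $info = $result['info'];

    // Anything other than 200 is treated as "not the content we wanted".
    if (!isset($info['http_code']) || (int) $info['http_code'] !== 200)
    {
        return false;
    }

    // content_type may be missing or NULL; it may also carry a charset suffix,
    // so only the beginning of the value is compared.
    $type = isset($info['content_type']) ? strtolower((string) $info['content_type']) : '';

    return strpos($type, 'text/html') === 0;
}
```

You would call it on the return value of getFile() and only echo or parse ['data'] when it returns true.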