php - Receiving "502 Bad Gateway" from CloudFront when scraping a website?


I've built a scraping script in PHP to gather information from a particular website. I have tested the script thoroughly against downloaded HTML files from the target website, and all the XPath queries are correct. Yesterday I tried the script for the first time locally, targeting the actual website, and it worked. So I took the script, placed it on my server farm, and turned it on.

This morning I awoke to 90 emails from 1 particular server telling me there's been an error. Another server sent 10 emails, while the other 2 seem to be working away fine. I've checked the logs that I keep in a database, and all of the errors encountered have been "502 Bad Gateway". I've tried the URL through a normal web browser and it loads fine, and I've also tried the URL via wget on the same server. wget returns this error:

ERROR: The certificate of http://www.targetwebsite.com is not trusted.

ERROR: The certificate of http://www.targetwebsite.com hasn't got a known issuer.

Using the "--no-check-certificate" flag still produces the error, but it downloads the HTML file anyway.

So anyway, in my script I have the following code:

// Assign cURL options
$curloptions = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER         => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_ENCODING       => "",
    CURLOPT_AUTOREFERER    => true,
    CURLOPT_CONNECTTIMEOUT => 120,
    CURLOPT_TIMEOUT        => 120,
    CURLOPT_MAXREDIRS      => 10,
    CURLINFO_HEADER_OUT    => true,
    CURLOPT_SSL_VERIFYPEER => false,
    // Must be the constant, not a quoted string
    CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
    CURLOPT_COOKIE         => $cookiesjar,
    CURLOPT_USERAGENT      => $useragent,
);

// URL to scrape
$url = "https://www.targetsite.com/specific/page/";

// Build the cURL handle
$ch = curl_init($url);
curl_setopt_array($ch, $curloptions);

$content      = curl_exec($ch);
$err          = curl_errno($ch);
$errmsg       = curl_error($ch);
$header       = curl_getinfo($ch);
$responsecode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

This tends to work, but it isn't 100% reliable at the moment due to CloudFront returning 502 Bad Gateway errors. But, at the same time, I've never created a web scraping script before, and while I'm fairly sure those are all the options I need to make the website think I'm a legitimate user, maybe I'm missing something!
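For illustration, intermittent 502s like the ones described above are often handled by retrying with a growing delay. The following is only a minimal sketch under assumptions: the function names (`backoffDelay`, `curlFetch`, `fetchWithRetry`), the retry count, and the delays are hypothetical and not part of my script, and retrying does not explain why CloudFront returns the error in the first place.

```php
<?php
// Hypothetical helpers; names and default values are illustrative only.

/**
 * Delay in seconds before retrying after the given attempt (1-based).
 * Doubles each time and is capped at $maxSeconds.
 */
function backoffDelay(int $attempt, int $baseSeconds = 2, int $maxSeconds = 60): int
{
    return min($maxSeconds, $baseSeconds * (2 ** ($attempt - 1)));
}

/**
 * Perform one cURL request and return [body, HTTP status, cURL error number].
 */
function curlFetch(string $url, array $options): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    return array($content, curl_getinfo($ch, CURLINFO_HTTP_CODE), curl_errno($ch));
}

/**
 * Retry while the response is a 502/503/504 or a transport error occurred.
 * $fetch and $sleep are injectable so the loop can be exercised without a network.
 */
function fetchWithRetry(
    string $url,
    array $options,
    int $maxAttempts = 4,
    ?callable $fetch = null,
    ?callable $sleep = null
): array {
    $fetch = $fetch ?? 'curlFetch';
    $sleep = $sleep ?? 'sleep';
    $result = array(false, 0, 0);

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch($url, $options);
        list(, $status, $errno) = $result;

        $retryable = $errno !== 0 || in_array($status, array(502, 503, 504), true);
        if (!$retryable) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            $sleep(backoffDelay($attempt));
        }
    }

    return $result; // result of the last failed attempt
}
```

With the options array from my script this would be called as `list($content, $responsecode, $err) = fetchWithRetry($url, $curloptions);`.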

From the reading I've done, there is talk about how you need to pass a list of ciphers through to the target site. So I added this to the cURL options:

CURLOPT_SSL_CIPHER_LIST => 'ECDHE-RSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-SHA256,ECDHE-RSA-AES128-SHA,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES256-SHA384,ECDHE-RSA-AES256-SHA,AES128-GCM-SHA256,AES256-GCM-SHA384,AES128-SHA256,AES256-SHA,AES128-SHA,RC4-MD5',

I've noticed an improvement, but I'm still getting a fair number of 502s.

If anyone could help me, that would be amazing.
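To narrow down the remaining 502s, it may help to log the response headers alongside the status code, since CloudFront error responses normally carry headers such as `X-Cache` and `X-Amz-Cf-Id`. The sketch below assumes `CURLOPT_HEADER` is enabled (as it is in my options) and that the header size is read with `curl_getinfo($ch, CURLINFO_HEADER_SIZE)`. The function names are hypothetical and the list of headers worth logging is just a guess.

```php
<?php
// Illustrative helpers; function names are hypothetical.

/**
 * Split a raw cURL response (CURLOPT_HEADER enabled) into the final
 * header block and the body. $headerSize is CURLINFO_HEADER_SIZE.
 */
function splitResponse(string $raw, int $headerSize): array
{
    $headerText = substr($raw, 0, $headerSize);
    $body = (string) substr($raw, $headerSize);

    // With redirects there can be several header blocks; keep the last one.
    $blocks = preg_split("/\r\n\r\n/", trim($headerText));

    return array(end($blocks), $body);
}

/**
 * Parse one header block into an array keyed by lowercase header name.
 */
function parseHeaders(string $headerBlock): array
{
    $headers = array();
    foreach (explode("\r\n", $headerBlock) as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip the status line, which has no colon
        }
        list($name, $value) = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return $headers;
}

/**
 * Keep only the headers that seem useful when logging a failed request.
 */
function cloudfrontDiagnostics(array $headers): array
{
    $wanted = array('x-cache', 'x-amz-cf-id', 'via', 'server');
    return array_intersect_key($headers, array_flip($wanted));
}
```

After `curl_exec`, something like `list($head, $body) = splitResponse($content, curl_getinfo($ch, CURLINFO_HEADER_SIZE));` followed by `cloudfrontDiagnostics(parseHeaders($head))` would give a small array to store with each logged error.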

