java - Parallelizing many GET requests
Is there an efficient way to parallelize a large number of requests in Java? I have a file with 200,000 lines, each one needing a request to Wikimedia, and I then have to write part of the response to a common file. I've pasted the main part of my code below for reference.
while ((line = br.readLine()) != null) {
    count++;
    if ((count % 1000) == 0) {
        System.out.println(count + " tags parsed");
        fbw.flush();
        bw.flush();
    }
    //System.out.println(line);
    String target = new String(line);
    if (target.startsWith("\"") && (target.endsWith("\""))) {
        target = target.replaceAll("\"", "");
    }
    String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=";
    url = url + URLEncoder.encode(target, "UTF-8");
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    // optional, default is GET
    con.setRequestMethod("GET");
    //add request header
    //con.setRequestProperty("User-Agent", USER_AGENT);
    int responseCode = con.getResponseCode();
    //System.out.println("Sending 'GET' request to URL: " + url);
    BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    Document doc = loadXMLFromString(response.toString());
    NodeList x = doc.getElementsByTagName("revisions");
    if (x.getLength() == 1) {
        String time = x.item(0).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else if (x.getLength() == 2) {
        String time = x.item(1).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else {
        fbw.write(line + "\t" + "null" + "\n");
    }
}
I googled around, and it seems there are two options: one is to create threads myself, and the other is to use something called an Executor. Could someone provide a little guidance on which one is more appropriate for this task?
If you really do need to do it via GET requests, I recommend using a ThreadPoolExecutor with a small thread pool (2 or 3) to avoid overloading the Wikipedia servers. That will avoid a lot of coding ...
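Here is a minimal sketch of that approach. The file names (titles.txt, out.tsv) are placeholders, and the XML parsing from your question is elided; the point is the small fixed pool and the synchronized writes to the shared output file:

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelFetcher {
        public static void main(String[] args) throws Exception {
            // Small fixed pool: 2-3 threads keeps the load on Wikipedia polite.
            ExecutorService pool = Executors.newFixedThreadPool(3);

            BufferedReader br = new BufferedReader(new FileReader("titles.txt"));
            // Shared by all workers; writes must be synchronized.
            final BufferedWriter bw = new BufferedWriter(new FileWriter("out.tsv"));

            String line;
            while ((line = br.readLine()) != null) {
                final String target = line;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            String url = "http://en.wikipedia.org/w/api.php?action=query"
                                    + "&prop=revisions&format=xml&rvprop=timestamp"
                                    + "&rvlimit=1&rvdir=newer&titles="
                                    + URLEncoder.encode(target, "UTF-8");
                            HttpURLConnection con =
                                    (HttpURLConnection) new URL(url).openConnection();
                            con.setRequestMethod("GET");
                            BufferedReader in = new BufferedReader(
                                    new InputStreamReader(con.getInputStream()));
                            StringBuilder response = new StringBuilder();
                            String inputLine;
                            while ((inputLine = in.readLine()) != null) {
                                response.append(inputLine);
                            }
                            in.close();
                            // Parse response.toString() as in the question, then write
                            // the real result; a placeholder is written here. Synchronize
                            // so output lines from different threads don't interleave.
                            synchronized (bw) {
                                bw.write(target + "\t" + response.length() + "\n");
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            br.close();

            pool.shutdown();                          // stop accepting new tasks
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for in-flight requests
            bw.close();
        }
    }

Note that this submits all 200,000 tasks up front, which is fine for small strings; if memory were a concern you could use a bounded queue instead.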
Also consider using the Apache HttpClient library (with persistent connections!).
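For example, with HttpClient 4.x (a sketch, assuming the httpclient jar is on your classpath), a pooling connection manager keeps connections to en.wikipedia.org alive and reuses them across requests, and the resulting client is safe to share across the worker threads above:

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.util.EntityUtils;

    public class WikiClient {
        private final CloseableHttpClient client;

        public WikiClient(int maxConnections) {
            // The pooling manager holds persistent connections open and
            // hands them out again instead of reconnecting per request.
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(maxConnections);
            cm.setDefaultMaxPerRoute(maxConnections); // all traffic goes to one host
            this.client = HttpClients.custom().setConnectionManager(cm).build();
        }

        // Fetch one URL and return the body; safe to call from multiple threads.
        public String fetch(String url) throws java.io.IOException {
            HttpGet get = new HttpGet(url);
            CloseableHttpResponse response = client.execute(get);
            try {
                return EntityUtils.toString(response.getEntity());
            } finally {
                response.close(); // returns the connection to the pool
            }
        }
    }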
But a better idea would be to use the database download option. Depending on what you are doing, you may be able to choose one of the smaller downloads. This page discusses the various options.
Note: Wikipedia prefers that people download the database dumps (etcetera) rather than pounding on its web servers.