java - Parallelizing many GET requests
Is there an efficient way to parallelize a large number of requests in Java? I have a file with 200,000 lines, each one needing a request to Wikimedia, and I then have to write part of the response to a common file. I've pasted the main part of my code below for reference.
while ((line = br.readLine()) != null) {
    count++;
    if ((count % 1000) == 0) {
        System.out.println(count + " tags parsed");
        fbw.flush();
        bw.flush();
    }
    //System.out.println(line);
    String target = new String(line);
    if (target.startsWith("\"") && (target.endsWith("\""))) {
        target = target.replaceAll("\"", "");
    }
    String url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=xml&rvprop=timestamp&rvlimit=1&rvdir=newer&titles=";
    url = url + URLEncoder.encode(target, "UTF-8");
    URL obj = new URL(url);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    // optional, default is GET
    con.setRequestMethod("GET");
    //add request header
    //con.setRequestProperty("User-Agent", USER_AGENT);
    int responseCode = con.getResponseCode();
    //System.out.println("Sending 'GET' request to URL: " + url);
    BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    Document doc = loadXMLFromString(response.toString());
    NodeList x = doc.getElementsByTagName("revisions");
    if (x.getLength() == 1) {
        String time = x.item(0).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else if (x.getLength() == 2) {
        String time = x.item(1).getFirstChild().getAttributes().item(0).getTextContent().substring(0, 10).replaceAll("-", "");
        bw.write(line + "\t" + time + "\n");
    } else {
        fbw.write(line + "\t" + "null" + "\n");
    }
}
I googled around, and it seems there are two options: one is to create threads myself, and the other is to use something called an Executor. Could someone provide a little guidance on which one is more appropriate for this task?
If you really do need to do it via GET requests, I recommend using a ThreadPoolExecutor with a small thread pool (2 or 3) to avoid overloading the Wikipedia servers. That will avoid a lot of coding ...
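Here is a minimal sketch of that approach. The file names (titles.txt, out.tsv) are placeholders, and the XML parsing from your question is elided; the point is the small fixed pool and the synchronized writes to the shared output file:

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelFetcher {
        public static void main(String[] args) throws Exception {
            // Small fixed pool: 2-3 threads keeps the load on Wikipedia polite.
            ExecutorService pool = Executors.newFixedThreadPool(3);

            BufferedReader br = new BufferedReader(new FileReader("titles.txt"));
            // Shared by all workers; writes must be synchronized.
            final BufferedWriter bw = new BufferedWriter(new FileWriter("out.tsv"));

            String line;
            while ((line = br.readLine()) != null) {
                final String target = line;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            String url = "http://en.wikipedia.org/w/api.php?action=query"
                                    + "&prop=revisions&format=xml&rvprop=timestamp"
                                    + "&rvlimit=1&rvdir=newer&titles="
                                    + URLEncoder.encode(target, "UTF-8");
                            HttpURLConnection con =
                                    (HttpURLConnection) new URL(url).openConnection();
                            con.setRequestMethod("GET");
                            BufferedReader in = new BufferedReader(
                                    new InputStreamReader(con.getInputStream()));
                            StringBuilder response = new StringBuilder();
                            String inputLine;
                            while ((inputLine = in.readLine()) != null) {
                                response.append(inputLine);
                            }
                            in.close();
                            // Parse response.toString() as in the question, then write
                            // the real result; a placeholder is written here. Synchronize
                            // so output lines from different threads don't interleave.
                            synchronized (bw) {
                                bw.write(target + "\t" + response.length() + "\n");
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            br.close();

            pool.shutdown();                          // stop accepting new tasks
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for in-flight requests
            bw.close();
        }
    }

Note that this submits all 200,000 tasks up front, which is fine for small strings; if memory were a concern you could use a bounded queue instead.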
Also consider using the Apache HttpClient library (with persistent connections!).
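For example, with HttpClient 4.x (a sketch, assuming the httpclient jar is on your classpath), a pooling connection manager keeps connections to en.wikipedia.org alive and reuses them across requests, and the resulting client is safe to share across the worker threads above:

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.util.EntityUtils;

    public class WikiClient {
        private final CloseableHttpClient client;

        public WikiClient(int maxConnections) {
            // The pooling manager holds persistent connections open and
            // hands them out again instead of reconnecting per request.
            PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
            cm.setMaxTotal(maxConnections);
            cm.setDefaultMaxPerRoute(maxConnections); // all traffic goes to one host
            this.client = HttpClients.custom().setConnectionManager(cm).build();
        }

        // Fetch one URL and return the body; safe to call from multiple threads.
        public String fetch(String url) throws java.io.IOException {
            HttpGet get = new HttpGet(url);
            CloseableHttpResponse response = client.execute(get);
            try {
                return EntityUtils.toString(response.getEntity());
            } finally {
                response.close(); // returns the connection to the pool
            }
        }
    }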
But a better idea would be to use the database download option. Depending on what you are doing, you may be able to choose one of the smaller downloads. This page discusses the various options.
Note: Wikipedia prefers that people download the database dumps (etcetera) rather than pounding on its web servers.