Scrapy duplicate filter with CSV file
I'm trying to avoid scraping the same information more than once. I run the spider every morning to scrape jobs from a job board, then I copy them into Excel and press Remove Duplicates on the list, using the URL. I would like to do this in Scrapy (I can change the txt file to a CSV). I'm happy to implement a middleware.
This is the pipeline I am trying to use:
    class CraigslistSamplePipeline(object):

        def find_row_by_id(item):
            with open('urllog.txt', 'r') as f:                 # open the txt file with URLs from previous scrapes
                urlx = [url.strip() for url in f.readlines()]  # extract each URL
                if urlx == item["website_url"]:                # compare the old URLs to the URL being scraped
                    raise DropItem('item is in the db')        # skip the record if it is in the URL list
            return
I'm sure this code is wrong. Can someone please suggest how I can do this? I'm new to this, so explaining each line would help me a lot. I hope the question makes sense and someone can help me.
I've looked at these posts for help, but was not able to solve my problem:
How to filter a CSV file using a Python script
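If I do change the txt file to a CSV, I think reading the URLs back in would look roughly like this (just a sketch; I'm assuming a one-column file, and the name urllog.csv is only an example):

    import csv

    # Sketch: read previously scraped URLs from a one-column CSV file
    # (urllog.csv is an assumed name, not from the original code).
    with open('urllog.csv', newline='') as f:
        urlx = [row[0].strip() for row in csv.reader(f) if row]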
Use the "in" keyword. So:

    if item['website_url'] in urlx:
        raise DropItem('item is in the db')

You loaded urlx from the file, where each line is a URL, so it is a list. The "in" keyword checks to see if the website URL is in the list urlx. If it is, it returns True. Keep in mind that the comparison is case sensitive in this example; you may want to call .lower() on the website URL and on the URLs loaded from the file.
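Putting that together, here is a minimal sketch of the pipeline, assuming the item field website_url and the log file urllog.txt from your question. Note that Scrapy calls process_item on a pipeline, not find_row_by_id, and that appending newly seen URLs to the log is my addition so that later runs skip them as well:

    from scrapy.exceptions import DropItem

    class CraigslistSamplePipeline(object):
        def process_item(self, item, spider):
            # Read the URLs logged by previous runs (one URL per line).
            with open('urllog.txt', 'r') as f:
                urlx = [url.strip().lower() for url in f.readlines()]

            url = item['website_url'].strip().lower()
            if url in urlx:
                # Already scraped on an earlier run: drop the item.
                raise DropItem('item is in the db')

            # New URL: append it to the log so later runs skip it, then keep the item.
            with open('urllog.txt', 'a') as f:
                f.write(url + '\n')
            return item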
There are more efficient ways of doing this, but I assume you just want something that works.
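For example, here is a sketch that reads the log once per crawl into a set, using the standard open_spider and close_spider pipeline hooks, instead of re-reading the file for every item (the class name UrlDedupePipeline is just a placeholder):

    from scrapy.exceptions import DropItem

    class UrlDedupePipeline(object):
        def open_spider(self, spider):
            # Load previously seen URLs once, into a set for fast membership checks.
            try:
                with open('urllog.txt', 'r') as f:
                    self.seen = {line.strip().lower() for line in f}
            except FileNotFoundError:
                self.seen = set()
            # Keep the log open in append mode so new URLs are recorded as we go.
            self.logfile = open('urllog.txt', 'a')

        def close_spider(self, spider):
            self.logfile.close()

        def process_item(self, item, spider):
            url = item['website_url'].strip().lower()
            if url in self.seen:
                raise DropItem('item is in the db')
            self.seen.add(url)
            self.logfile.write(url + '\n')
            return item

Checking membership in a set is constant time, so the crawl does not slow down as the URL log grows.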