Scrapy duplicate filter with csv file


I'm trying to avoid scraping the same information more than once. I run the spider every morning to scrape jobs from a job board, then copy them into Excel and press Remove Duplicates on the list, using the url. I would like to do this in Scrapy instead (I can change the txt file to a csv). I'm also happy to implement this as middleware.

This is the pipeline I'm trying to use:

    class CraigslistSamplePipeline(object):

        def find_row_by_id(item):
            with open('urllog.txt', 'r') as f:                 # open the txt file with urls from previous scrapes
                urlx = [url.strip() for url in f.readlines()]  # extract each url
                if urlx == item["website_url"]:                # compare the old urls to the url being scraped
                    raise DropItem('item in db')               # skip the record if it is in the url list
            return

I'm sure this code is wrong. Can you please suggest how I can do this? I'm new to this, so explaining each line would help me a lot. I hope the question makes sense and someone can help me.

I've looked at these posts for help, but was not able to solve my problem:

How to filter a csv file using a python script

Scrapy - Spider crawls duplicate urls

How to filter duplicate requests based on url in scrapy

Use the in keyword. So:

    if item['website_url'] in urlx:
        raise DropItem('item in db')

You loaded urlx from the file, where each line is a url, so it is a list. The in keyword checks whether the website url is in the list urlx; if it is, the expression returns True. Keep in mind that the comparison is case sensitive in this example. You may want to call .lower() on the website url and on the urls loaded from the file.
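For example, a minimal sketch of that case-insensitive check, reusing the website_url field and urllog.txt file from your code:

    with open('urllog.txt', 'r') as f:
        # lowercase each logged url so the comparison ignores case
        urlx = [url.strip().lower() for url in f.readlines()]

    if item['website_url'].lower() in urlx:
        raise DropItem('item in db')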

There are more efficient ways of doing this, but I assume you just want something that works.
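If it helps, here is a rough sketch of how the whole pipeline could look once the in check is in place. This is only one way to do it: it reuses the website_url field and the urllog.txt file from your code, loads the log into a set in open_spider, and appends newly seen urls back to the file so the next morning's run can skip them.

    from scrapy.exceptions import DropItem

    class CraigslistSamplePipeline(object):

        def open_spider(self, spider):
            # load urls from previous scrapes into a set for fast lookups
            try:
                with open('urllog.txt', 'r') as f:
                    self.seen_urls = set(url.strip().lower() for url in f)
            except IOError:
                # first run: no log file yet
                self.seen_urls = set()
            # keep the log open for appending new urls as they are scraped
            self.logfile = open('urllog.txt', 'a')

        def close_spider(self, spider):
            self.logfile.close()

        def process_item(self, item, spider):
            url = item['website_url'].strip().lower()
            if url in self.seen_urls:
                # already scraped on a previous run, so skip it
                raise DropItem('item in db')
            # new url: remember it and log it for future runs
            self.seen_urls.add(url)
            self.logfile.write(url + '\n')
            return item

Remember that the pipeline also has to be enabled under ITEM_PIPELINES in settings.py for Scrapy to call it.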

