r - Reading in text file with unmatched quotes


I have a large (>1 GB) CSV file that I'm trying to read into a data frame in R.

The non-numeric fields are enclosed in double quotes so that internal commas are not interpreted as delimiters. That's all well and good. However, there are sometimes unmatched double quotes inside an entry, such as "2" Nails".

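What is the best way to work around this? My current plan is to use a text processor such as awk to relabel the quoting character from the double quote " to a non-conflicting character such as the pipe |. My heuristic for finding quoting characters is to look for double quotes next to a comma (or at the start or end of a line):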

gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt  
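As a rough illustration of the heuristic above, here is a minimal Python sketch that applies the same three substitutions as the gawk command to a single line and then parses the result with the pipe as the quote character. Python is not part of the original question; the function name `relabel_quotes` and the sample line are hypothetical, chosen only for illustration.

```python
import csv
import io
import re


def relabel_quotes(line: str) -> str:
    """Swap delimiting double quotes for pipes, mirroring the gawk command."""
    # Quote at the very start or very end of the line.
    line = re.sub(r'^"|"$', "|", line)
    # Quote right after a comma (opening quote of a field).
    line = line.replace(',"', ",|")
    # Quote right before a comma (closing quote of a field).
    line = line.replace('",', "|,")
    return line


# Hypothetical sample line containing the unmatched quote from the question.
sample = '"2" Nails","Smith, John",3'
relabeled = relabel_quotes(sample)

# Parse with the pipe as the quote character; the stray " is now ordinary text.
row = next(csv.reader(io.StringIO(relabeled), quotechar="|"))
print(relabeled)
print(row)
```

In R, the relabeled file could then presumably be read with something like read.csv("newfile.txt", quote = "|"), assuming no field contains a literal pipe.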

There is a related question, but its solution (passing quote="" to read.csv) is not viable for me because my file has non-delimiting commas enclosed in quotation marks.

Your idea of looking for quotes next to a comma is probably the best thing you can do; you could, however, try to turn it around and have the regex escape all quotes that are not next to a comma (or at the start/end of a line):

Search for

(?<!^|,)"(?!,|$) 

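and replace all matches with "" (a doubled double quote, which CSV readers treat as one literal quote inside a quoted field).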

R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it's a one-liner:

$subject =~ s/(?<!^|,)"(?!,|$)/""/mg; 
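For readers who would rather try the same idea outside Perl, here is a minimal Python sketch, offered under stated assumptions rather than as the answer's own method. Python's re module only accepts fixed-width lookbehinds, so the (?<!^|,) from the pattern above is split into two separate lookbehinds. The names `INNER_QUOTE`, `escape_inner_quotes`, and `clean_file` and the sample line are hypothetical. The file is processed line by line, so no multiline mode is needed and a large file never has to fit in memory.

```python
import csv
import io
import re

# A double quote that is not at the start/end of the line and not next to a comma.
# Equivalent in intent to (?<!^|,)"(?!,|$), rewritten for Python's fixed-width rule.
INNER_QUOTE = re.compile(r'(?<!^)(?<!,)"(?!,|$)')


def escape_inner_quotes(line: str) -> str:
    """Double every quote the heuristic classifies as non-delimiting."""
    return INNER_QUOTE.sub('""', line)


def clean_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path, escaping inner quotes one line at a time."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in src:
            body = line.rstrip("\r\n")   # keep the line ending out of the match
            ending = line[len(body):]
            dst.write(escape_inner_quotes(body) + ending)


# Hypothetical sample line containing the unmatched quote from the question.
sample = '1,"2" Nails","Smith, John"'
fixed = escape_inner_quotes(sample)
row = next(csv.reader(io.StringIO(fixed)))
print(fixed)
print(row)
```

The heuristic shares the limitation of the original idea: a stray quote that happens to sit directly next to a comma is still taken for a delimiter.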
