R - Reading in a text file with unmatched quotes
I have a large (>1 GB) CSV file that I'm trying to read into a data frame in R.
The non-numeric fields are enclosed in double quotes so that internal commas are not interpreted as delimiters. That's well and good. However, there are sometimes unmatched double quotes in an entry, like "2" nails".
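What is the best way to work around this? My current plan is to use a text processor like awk to relabel the quoting character from the double quote " to a non-conflicting character like the pipe |. My heuristic for finding quoting characters would be double quotes next to a comma (or at the start or end of a line):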
gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt
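To make the heuristic concrete, here is a minimal Python sketch of the same three substitutions applied to a single line. The function name `relabel_quotes` and the sample line are made up for illustration; it assumes the pipe character never occurs in the data.

```python
import re

def relabel_quotes(line: str, new_quote: str = "|") -> str:
    """Replace double quotes at field boundaries with new_quote.

    Hypothetical helper mirroring the three gsub calls in the gawk command.
    """
    line = re.sub(r'^"|"$', new_quote, line)      # quote at start or end of line
    line = re.sub(r',"', "," + new_quote, line)   # opening quote right after a comma
    line = re.sub(r'",', new_quote + ",", line)   # closing quote right before a comma
    return line

# Made-up sample line containing the stray quote from the question.
sample = '"widget","2" nails",3.5'
print(relabel_quotes(sample))  # |widget|,|2" nails|,3.5
```

The stray quote after the 2 is left alone because it touches neither a comma nor a line boundary. The relabeled file could then presumably be read with the pipe as the quoting character, for example via the `quote` argument of read.csv.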
This question is related, but the solution (the argument quote="" in read.csv) is not viable for me because my file has non-delimiting commas enclosed in quotation marks.
Your idea of looking for quotes next to a comma is probably the best thing you can do; I would, however, try to turn it around and have the regex escape all the quotes that are not next to a comma (or the start/end of a line):
Search for
(?<!^|,)"(?!,|$)
and replace all matches with "".
R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it's a one-liner:
$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;
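As a rough illustration of the same idea outside Perl, here is a Python sketch that doubles the stray quotes and then checks the result with the standard csv module. Python's `re` only accepts fixed-width lookbehinds, so the `(?<!^|,)` part is split into two separate lookbehinds; the names `STRAY_QUOTE` and `escape_stray_quotes` and the sample data are made up for this example.

```python
import csv
import io
import re

# A quote that is NOT at the start of a line, NOT right after a comma,
# and NOT right before a comma or the end of a line.
# The lookbehind is split in two because Python requires fixed-width lookbehinds.
STRAY_QUOTE = re.compile(r'(?<!^)(?<!,)"(?!,|$)', re.MULTILINE)

def escape_stray_quotes(text: str) -> str:
    """Double every stray quote so a CSV parser treats it as a literal quote."""
    return STRAY_QUOTE.sub('""', text)

# Made-up sample: one line with a stray quote, one with an embedded comma.
sample = '"widget","2" nails",3.5\n"bolt, hex","plain",1.25\n'
fixed = escape_stray_quotes(sample)

rows = list(csv.reader(io.StringIO(fixed)))
print(rows)
# [['widget', '2" nails', '3.5'], ['bolt, hex', 'plain', '1.25']]
```

Like the original heuristic, this still misreads a stray quote that happens to sit directly next to a comma inside a field, so it is worth spot-checking the output on real data.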