Problem Statement
I have a comma delimited csv file that contained quoted strings that could contain commas and newlines. Unfortunately the program needed to process this file could not handle quoted strings, so it was treating commas as new fields and newlines as creating a new record.
Solution
I came up with the below code that uses awk to convert the csv to pipe delimited, replace the newlines with a space, and strip out the quotes. You can change the delimiter to anything, but since my data did not contain pipes, I used that.
Disclaimer
This does not handle escaped quotes, but since my data did not have any it works for me.
Code
awk -F"\"" '
BEGIN{
j=0; OFS="\"";
}
{
if (NF%2==1 && j == 0){
ORS="\n";
for(i=1;i<=NF;i++){
if(i%2){gsub(",","|",$i);}
}
}
else if(NF%2==0 && j == 0){
ORS=" ";
for(i=1;i<=NF;i++){
if(i%2){gsub(",","|",$i);}
}
j=1;
}
else if(NF%2==0 && j == 1){
ORS="\n";
for(i=1;i<=NF;i++){
if(i+1%2){gsub(",","|",$i);}
}
j=0;
}
else if(NF%2==1 && j == 1){
ORS=" ";
for(i=1;i<=NF;i++){
if(i+1%2){gsub(",","|",$i);}
}
j=1;
}
print $0;}
' file.csv | tr -d '"' > file.pipe
No comments:
Post a Comment