blog community

Welcome to blog community Sign in | Join | Help
in Search

Raimond Brookman

Architecture and .NET Development

Regex fun with CSV

I recently needed to do some escaping of a semicolon separated file. This was because a lot of the fields contained data that contained semicolons, and they had forgot to escape those fields within quotes.

I was lucky that the bad behaving fields contained structured data, SOAP requests in this case. So I came up with an expression to match the this. Because multiple SOAP datafields existed, I needed to do a non-greedy match to find the first occurence of the closing tag.

I also needed to do something with different parts of the match, so I created some namend groups for that

This resulted in the expression below:

(?<START><\?xml)(?<DATA>.*?)(?<END></soap:Envelope>;)

Then I noticed some of the fields were incorreclty matched. The reason was that in the file, a field had a maximum size of 4000 characters. So in that case I was'nt guaranteed to find a match on the end tag when a field was overflowed.

I needed an expression to match either on the semicolon after 4000 characters after start or on the end tag in case the match was less than 400 characters.
I came up with the following solution:

(?<START><\?xml)
(?:
   (?<DATA>.{3995})(?<END>;)
  |
   (?<DATA>.{0,3994}?)(?<END></soap:Envelope>;)
)

The 3995 is of course 5 less than 4000, because 5 characters have already been matched for the start part.

The next trick was to find an expression that I could use to find out if my row was properly escaped, i.e. contained the correct number of columns. Luckily, I found this expression on the Internet, as it is not so simple to also deal with fields escaped by quotes that also contain quotes and semicolons. For example the following string:

1 ; “2;3” ; “4;””;““5”

Here only the semicolons in red should be treated as column separators. This proved to be a difficult task for most expressions I found. However, I did find a working one (after a tiny tweak, the blue part was'nt there) that I'd like to share with you:

;(?=(?:[^"]*"[^"]*")*(?![^"]*"))

This expression matches all field separators, so simply perform a split using the regex and you end up with an array of all fields in your row. Checking the number of columns is a piece of cake then :-).

So why was the blue part needed? It turns out that performing a Regex.Split in .NET wil also return captured groups, als well as the splitted parts of the strings. So making the group non-capturing solved the problem.

To conclude, I think regex is fun, but sometimes the drawback is that you need a lot of comment lines to explain what that regex actually does...:-)

Published Wednesday, April 27, 2005 9:02 PM by Raimondb
Filed under:

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Michael said:

Thanks, you saved me some time writing my own expression

October 15, 2007 4:04 AM
 

Franz said:

You saved me from writting my own expression, too.

Thanks a lot!

October 24, 2007 9:26 AM
 

andy said:

this slight modification will remove the quotes from the returned values:

"?;"?(?=(?:[^"]*"[^"]*")*(?![^"]*"))"?

thanks!! =)

February 21, 2008 10:34 PM

Leave a Comment

(required) 
(optional)
(required) 
Submit

This Blog

Syndication

News

Subscribe in NewsGator Online

Subscribe in Bloglines

Add to Google


 Visitors since June 2005




Powered by Community Server, by Telligent Systems