python - Regex for CSV split including multiple double quotes
I have CSV column data containing text. Each row is delimited by double quotes ".
The sample text in a row looks similar to this (note: the new lines and the spaces before each line are intended):
"lorem ipsum dolor sit amet, consectetur adipisicing elit, sed eiusmod tempor incididunt ut labore et dolore magna aliqua. ut ""enim ad"" minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat ""nulla pariatu""" "ex ea commodo consequat. duis aute irure ""dolor in"" reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
The above represents 2 subsequent rows.
I want to select the separate groups of text contained between every first double quote "
(at the start of a line) and every last double quote ".
As you can see, though, there are line breaks in the text, along with subsequent escaped double quotes "",
which are part of the text I need to select.
I came up with this:
(?s)(?!")[^\s](.+?)(?=")
But the multiple double quotes are breaking the desired match.
I'm a real novice with regex, so I think maybe I'm missing something basic. I don't know if it's relevant, but I'm using Sublime Text 3, so it should be Python, I think.
How can I achieve what I need?
You can use the following regex:
"[^"]*(?:""[^"]*)*"
See the demo.
This regex matches either a non-quote character or 2 consecutive double quotes inside double quotation marks.
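Since the question mentions Python, here is a minimal sketch of how the regex above might be applied with the standard `re` module. The function name `extract_fields` and the short `sample` string are made up for illustration; stripping the outer quotes and turning `""` back into `"` is an extra step the answer does not ask for, so treat it as an assumption about what you want to do with the matches.

```python
import re

# Pattern from the answer: an opening quote, a run of non-quote characters,
# then any number of escaped quotes ("") each followed by non-quote
# characters, and finally the closing quote.
FIELD_RE = re.compile(r'"[^"]*(?:""[^"]*)*"')


def extract_fields(text):
    """Return each quoted field with outer quotes removed and "" unescaped.

    Hypothetical helper, not part of the original answer.
    """
    fields = []
    for match in FIELD_RE.finditer(text):
        inner = match.group(0)[1:-1]             # drop the outer quotes
        fields.append(inner.replace('""', '"'))  # unescape doubled quotes
    return fields


# Two rows: the first spans two lines, both contain escaped quotes.
sample = '"first ""quoted"" row,\n  second line"\n"another ""row"""'

for field in extract_fields(sample):
    print(repr(field))
```

No flag such as `re.DOTALL` is needed here, because the negated character class `[^"]` already matches newlines.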
How does it work? Let me share a graphic from debuggex.com:
With this regex, we match:
"
- (1) - a literal quote
[^"]*
- (2, 3) - 0 or more characters other than a quote (yes, including a newline, since this is a negated character class); if there are none, the regex searches for the final literal quote (6)
(?:""[^"]*)*
- (4, 5) - 0 or more sequences of:
""
- (4) - two consecutive double quotation marks
[^"]*
- (5) - 0 or more characters other than a quote
"
- (6) - the final literal quote.
This regex works faster than "(?:[^"]|"")*"
(although both yield the same results), because processing of the former is linear and involves much less backtracking.
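To check the "same results" claim on your own data, a rough sketch along these lines may help. The names `UNROLLED`, `ALTERNATION`, `same_matches` and `compare_timing` are hypothetical, and the timing helper only reports what `timeit` measures on your machine; it makes no promise about the size of the difference.

```python
import re
import timeit

# The two patterns compared in the answer.
UNROLLED = re.compile(r'"[^"]*(?:""[^"]*)*"')
ALTERNATION = re.compile(r'"(?:[^"]|"")*"')


def same_matches(text):
    """Return True if both patterns find exactly the same matches in text."""
    return UNROLLED.findall(text) == ALTERNATION.findall(text)


def compare_timing(text, number=1000):
    """Return (unrolled_seconds, alternation_seconds) for `number` runs each."""
    unrolled = timeit.timeit(lambda: UNROLLED.findall(text), number=number)
    alternation = timeit.timeit(lambda: ALTERNATION.findall(text), number=number)
    return unrolled, alternation


sample = '"first ""quoted"" row,\n  second line"\n"another ""row"""'
print(same_matches(sample))
```

Calling `compare_timing(sample)` on a larger input taken from your real file should give a better picture than this tiny sample.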