python - Regex for CSV split including multiple double quotes


I have a CSV column containing text. Each row is delimited by double quotes (").

The text in a row looks like this (note: the newlines and the spaces before each line are intentional):

"lorem ipsum dolor sit amet,   consectetur adipisicing elit, sed eiusmod  tempor incididunt ut labore et dolore magna   aliqua. ut ""enim ad"" minim veniam,  quis nostrud exercitation ullamco laboris nisi   ut aliquip ex ea commodo  consequat. duis aute irure dolor in reprehenderit in voluptate velit esse  cillum dolore eu fugiat ""nulla pariatu""" "ex ea commodo  consequat. duis aute irure ""dolor in"" reprehenderit   in voluptate velit esse  cillum dolore eu fugiat nulla pariatur.   excepteur sint occaecat cupidatat non  proident, sunt in culpa qui officia deserunt   mollit anim id est laborum." 

The above represents two consecutive rows.

I want to select the separate groups of text contained between every first double quote " (at the start of a row) and every last double quote ".

As you can see, though, there are line breaks in the text, along with escaped double quotes "" which are part of the text I need to select.

I came up with this:

(?s)(?!")[^\s](.+?)(?=") 

But the doubled double quotes break the desired match.
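To make the failure concrete, here is a minimal Python sketch that runs the pattern above against a short row. The `row` string and the names `ATTEMPT` and `first` are hypothetical stand-ins, not part of the original data.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT = re.compile(r'(?s)(?!")[^\s](.+?)(?=")')

# Hypothetical one-row sample with an escaped quote pair inside the field.
row = '"ab ""cd"" ef"'

# The lazy (.+?) stops at the first quote it can see, which here is the
# first character of the escaped pair "" rather than the closing quote.
first = ATTEMPT.search(row).group(0)
print(repr(first))
```

The first match ends just before the escaped pair, so the field is cut short instead of running to its closing quote.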

I'm a real novice with regex, so I think I may be missing something basic. I don't know if it's relevant, but I'm using Sublime Text 3, which I think means the regex flavor should be Python's.

How can I achieve what I need?

You can use the following regex:

"[^"]*(?:""[^"]*)*" 


This regex matches either non-quote characters or two consecutive double quotes inside a pair of double quotation marks.
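As a rough illustration, the pattern can be tried in Python with `re.findall`. The `sample` string below is a shortened, hypothetical stand-in for the two rows in the question, and `unquote` is a helper name introduced only for this sketch.

```python
import re

# Pattern from the answer: a quoted field in which "" is an escaped quote.
FIELD = re.compile(r'"[^"]*(?:""[^"]*)*"')

# Shortened, hypothetical stand-in for the two rows in the question.
sample = (
    '"lorem ipsum,\n   ut ""enim ad"" minim ""nulla pariatu"""\n'
    '"ex ea ""dolor in"" reprehenderit\n   est laborum."'
)

# Each match is one whole row, outer quotes included.
rows = FIELD.findall(sample)


def unquote(field):
    # Strip the outer quotes, then collapse each escaped pair into one quote.
    return field[1:-1].replace('""', '"')


values = [unquote(r) for r in rows]
print(len(rows))
print(values)
```

Dropping the outer quotes and collapsing the pairs is an optional extra step. The question only asks for the groups to be selected.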

How does it work? The original answer shared a diagram from debuggex.com here; the numbers in parentheses below refer to the nodes in that diagram.

With this regex, we match:

  • " - (1) - a literal opening quote
  • [^"]* - (2, 3) - 0 or more characters other than a quote (yes, including newlines, because this is a negated character class); if there are none, the regex moves on toward the final literal quote (6)
  • (?:""[^"]*)* - (4, 5) - 0 or more sequences of:
    • "" - (4) - two consecutive double quotation marks
    • [^"]* - (5) - 0 or more characters other than a quote
  • " - (6) - the final literal quote.
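The breakdown above can also be written into the pattern itself. This sketch uses Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. The name `FIELD_VERBOSE` is an assumption for illustration, and the numbers in the comments mirror the list.

```python
import re

# Same pattern as in the answer, spread out and annotated.
FIELD_VERBOSE = re.compile(r'''
    "            # (1) opening literal quote
    [^"]*        # (2, 3) zero or more non-quote characters, newlines included
    (?:          # (4, 5) zero or more repetitions of:
        ""       #     (4) an escaped quote, written as two quotes
        [^"]*    #     (5) zero or more non-quote characters
    )*
    "            # (6) closing literal quote
''', re.VERBOSE)

print(FIELD_VERBOSE.findall('"x" "y ""z"""'))
```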

This works faster than "(?:[^"]|"")*" (although both yield the same results), because the former is processed linearly and involves less backtracking.
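One way to check both claims on your own machine is a small `timeit` comparison. This is only a sketch: the test data is made up, the names `UNROLLED` and `ALTERNATION` are labels chosen here (the first form is often described as an unrolled loop), and timings depend on the machine and the regex engine, so none are quoted.

```python
import re
import timeit

UNROLLED = re.compile(r'"[^"]*(?:""[^"]*)*"')
ALTERNATION = re.compile(r'"(?:[^"]|"")*"')

# Hypothetical test data: long fields with many escaped quotes and newlines.
field = '"' + 'lorem ""ipsum"" dolor\n' * 200 + '"'
text = (field + '\n') * 20

# Both patterns should select exactly the same fields.
same = UNROLLED.findall(text) == ALTERNATION.findall(text)

# Measure each pattern over the same input; results vary by machine.
t_unrolled = timeit.timeit(lambda: UNROLLED.findall(text), number=20)
t_alternation = timeit.timeit(lambda: ALTERNATION.findall(text), number=20)

print(f"same results: {same}")
print(f"unrolled: {t_unrolled:.4f}s  alternation: {t_alternation:.4f}s")
```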

