Python - parsing user input using a verbose regex


I am trying to design a regex to parse user input in the form of full sentences. I am struggling to get my expression to work fully. I know it is not well coded, but I am trying hard to learn. I am trying to parse a percentage as one string; see the code below.

My test "sentence" = how i'm 15.5% wholesome-looking u.s.a. we radar () [] {} you -- are, ... you?

    import re

    text = input("please type coherently: ")

    pattern = r'''(?x)              # set flag to allow verbose regexps
        (?:[a-z]\.)+                # abbreviations, e.g. u.s.a.
        |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
        |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
        |\S\w*                      # any sequence of word characters
        # |[\d+(\.\d+)?%]           # percentages, 82%
        |[][\{\}.,;"'?():-_`]       # these are separate tokens
        '''

    parsed = re.findall(pattern, text)
    print(parsed)

My output = ['how', "i'm", '15', '.', '5', '%', 'wholesome-looking', 'u.s.a.', 'we', 'radar', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']

I am looking to have '15', '.', '5', '%' parsed as '15.5%'. The line that is commented out should do it, but when I comment it back in it does absolutely nothing. I have searched for resources but have not found an answer.

Thank you for your time.

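If you want a percentage to match as a whole entity, you should be aware that the regex engine analyzes both the input string and the pattern from left to right. If you have an alternation, the leftmost alternative that matches the input string at the current position is chosen, and the rest won't be tested.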
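A minimal sketch of that ordering effect, using a cut-down two-alternative pattern (the variable names here are illustrative, not from the question):

```python
import re

sample = "15.5%"

# Generic word alternative first: at position 0 it matches "15", so the
# percentage alternative is never tried there.
word_first = re.findall(r"\w+|\d+(?:\.\d+)?%", sample)

# Percentage alternative first: the whole "15.5%" is consumed as one token.
percent_first = re.findall(r"\d+(?:\.\d+)?%|\w+", sample)

print(word_first)     # ['15', '5']
print(percent_first)  # ['15.5%']
```

The same two alternatives give different tokens purely because of their order.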

Thus, you need to pull the alternative \d+(?:\.\d+)?% up, and the capturing group should be turned into a non-capturing one, or findall will yield strange results:

    (?x)                        # set flag to allow verbose regexps
    (?:[a-z]\.)+                # abbreviations, e.g. u.s.a.
    |\d+(?:\.\d+)?%             # percentages, 82%  <-- pulled up over here
    |\w+(?:[-']\w+)*            # permit word-internal hyphens and apostrophes
    |[-.(]+                     # double hyphen, ellipsis, and open parenthesis
    |\S\w*                      # any sequence of word characters
    |[][{}.,;"'?():_`-]         # these are separate tokens
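As a rough check, the reordered pattern can be run like this. The sample sentence is an assumption reconstructed from the input and output shown in the question, and it is hard-coded instead of read with input() so the snippet runs unattended:

```python
import re

pattern = r'''(?x)              # verbose mode: whitespace and comments ignored
    (?:[a-z]\.)+                # abbreviations, e.g. u.s.a.
    |\d+(?:\.\d+)?%             # percentages, tried before plain words
    |\w+(?:[-']\w+)*            # words with internal hyphens or apostrophes
    |[-.(]+                     # double hyphen, ellipsis, open parenthesis
    |\S\w*                      # any other non-space char plus word chars
    |[][{}.,;"'?():_`-]         # separate punctuation tokens
    '''

# Assumed sample input, hard-coded for the sketch.
sentence = ("how i'm 15.5% wholesome-looking u.s.a. we radar "
            "() [] {} you -- are, ... you?")

tokens = re.findall(pattern, sentence)
print(tokens)  # '15.5%' now appears as a single token
```

Because the pattern contains no capturing groups, findall returns the full text of each match.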

See the regex demo.
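The remark about capturing groups can be checked in isolation. In this small sketch (the sample string is made up for illustration), findall returns only the group's contents when the pattern has a capturing group:

```python
import re

sample = "15.5% and 82%"

# With a capturing group, findall returns what the group captured,
# and an empty string where the optional group did not take part.
with_group = re.findall(r"\d+(\.\d+)?%", sample)

# With a non-capturing group, findall returns the whole match.
without_group = re.findall(r"\d+(?:\.\d+)?%", sample)

print(with_group)     # ['.5', '']
print(without_group)  # ['15.5%', '82%']
```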

Also, please note that I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped, and the - was forming an unnecessary range between the colon (decimal code 58) and the underscore (decimal 95), matching ;, <, =, >, ?, @, uppercase Latin letters, [, \, ] and ^.
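To see what the accidental range covered, here is a short sketch that lists the characters between the colon and the underscore and compares the two character classes (the variable names are illustrative):

```python
import re

# Characters covered by the accidental range ":-_" (code points 58 to 95).
accidental = "".join(chr(c) for c in range(ord(":"), ord("_") + 1))
print(accidental)  # :;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

original_class = re.compile(r'''[][\{\}.,;"'?():-_`]''')
fixed_class = re.compile(r'''[][{}.,;"'?():_`-]''')

# The original class matches an uppercase letter but not a literal hyphen.
print(bool(original_class.fullmatch("A")))  # True
print(bool(original_class.fullmatch("-")))  # False

# With the hyphen moved to the end, it is a literal and the range is gone.
print(bool(fixed_class.fullmatch("A")))     # False
print(bool(fixed_class.fullmatch("-")))     # True
```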

