06-22-2020 12:25 AM
Hi all,
I have a problem using regexes with word boundaries.
First of all, my goal: I want to remove all words in my text that are shorter than, let's say, 5 characters. As usual, I started by playing with DEMO_REGEX_TOY. First try: \<\S{1,5}\>. This should match every word with 1 to 5 characters. But in the sample program (text: Cathy's black cat…) it matches Cathy with the option FIRST OCCURRENCE, and Cathy and s with all occurrences. The documentation for \< and \> says that "Words are defined as uninterrupted strings of alphanumeric characters". Alphanumeric characters are letters and digits, so that was my mistake. But if I change the maximum length to 7 or more, then, all of a sudden, the whole word (Cathy's) is matched. How is that possible? What am I missing here? It works the same way when I change \S to [[:graph:]].
I would very much appreciate any clues. Many thanks in advance!
Best regards
Michał
06-22-2020 7:53 AM
I would say that the only question is about the text "Words are defined as uninterrupted strings of alphanumeric characters" in the documentation, which is completely wrong in my opinion. I think \< and \> just imply a condition on the left and right characters of the match: they must be alphanumeric characters. All characters in between can be anything. You may use the following to exclude single quotes:
\<[[:alpha:]]{1,7}\>
Concerning [[:graph:]], the documentation says "Set of all displayable characters except for blanks and horizontal tabs". So the result you get is very logical, I don't know why you mention it.
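To illustrate the point about \< and \>, here is a minimal sketch, assuming the sample text from DEMO_REGEX_TOY and the REGEX addition of FIND. The variable names and the use of cl_demo_output are only chosen for the example, and the results in the comments are the ones reported in this thread.

```abap
DATA(text) = `Cathy's black cat`.

" \< and \> are only checked at the two ends of the match,
" so \S may still consume the apostrophe in between.
FIND FIRST OCCURRENCE OF REGEX `\<\S{1,5}\>` IN text
  MATCH OFFSET DATA(off_5) MATCH LENGTH DATA(len_5).
" Reported in the thread: Cathy (a word end is found between y and ')
cl_demo_output=>write( substring( val = text off = off_5 len = len_5 ) ).

FIND FIRST OCCURRENCE OF REGEX `\<\S{1,7}\>` IN text
  MATCH OFFSET DATA(off_7) MATCH LENGTH DATA(len_7).
" Reported in the thread: Cathy's (7 characters fit, the word end follows the s)
cl_demo_output=>write( substring( val = text off = off_7 len = len_7 ) ).

cl_demo_output=>display( ).
```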
06-22-2020 10:05 AM
Thank you sandra.rossi for your answer! It is an interesting point of view, which would explain the strange behaviour, but would also mean that the documentation is wrong, or even completely wrong, as you stated.
I've just tried a different approach: for the sample text in DEMO_REGEX_TOY, the first occurrence of \<.{1,13}\> matches the text Cathy's black, so it seems to be exactly as you described!
I don't want to exclude single quotes. I want them to count as part of the word. I want to exclude all words that are up to 5 characters long. Cathy's is 7 characters long. But when I use the regex as described, I get Cathy as a match, which is wrong for my use case.
I mentioned [[:graph:]] because it was another approach I tried to achieve my goal. It shows once again that the problem is not the expression between the word boundaries, but the word boundaries themselves. Or rather their ABAP documentation, as you already noticed.
06-22-2020 11:16 AM
So you want the single quote to not be considered as a word boundary. So you can't use \< and \>.
Something close could be to use the following regular expression plus some additional ABAP logic to refine:
regex = `(?:^|[^[:alnum:]'])` " start of text or any character neither alphanum nor single quote
     && `[[:alnum:]']{1,5}`   " 1 to 5 alphanum or single quote
     && `(?![[:alnum:]'])`.   " followed by neither alphanum nor single quote
After the search, you may have to refine the matches because they contain an extra character at the beginning which corresponds to the character before each word.
06-22-2020 11:40 AM
06-22-2020 11:59 AM
sandra.rossi - how about enclosing the second part of the regex in a capturing group? The group would then return the words of 1 to 5 characters (without the leading character).
Mateusz
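One possible reading of the capturing-group idea, as a sketch only: the middle part of the regex above is wrapped in parentheses, so that submatch 1 describes the word without the extra leading character. The variable names and the output via cl_demo_output are assumptions made for the example.

```abap
DATA(text) = `Cathy's black cat`.
" Same regex as above, with the word itself in capturing group 1
DATA(regex) = `(?:^|[^[:alnum:]'])([[:alnum:]']{1,5})(?![[:alnum:]'])`.

FIND ALL OCCURRENCES OF REGEX regex IN text RESULTS DATA(matches).

LOOP AT matches INTO DATA(match).
  " match-offset/length include the leading character,
  " match-submatches[ 1 ] covers only the word in group 1
  DATA(word) = substring( val = text
                          off = match-submatches[ 1 ]-offset
                          len = match-submatches[ 1 ]-length ).
  cl_demo_output=>write( word ).
ENDLOOP.

cl_demo_output=>display( ).
```

Whether this is more readable than trimming the first character of each match afterwards is a matter of taste.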
06-22-2020 5:04 PM
Mateusz Adamus It's up to the developer. I learnt that regex is cool, but don't go too far because not everything can be done with it, and if insisting, the regex becomes too complex with "subtleties". The one above is the maximum I think that people can tolerate. They are scared by regex. It's the same issue with constructor expressions, keep the expressions at an "acceptable" size. I think that if you use submatches (capturing group), the code won't be more readable than without submatches.
Your solution with the simple regex is even better I think... 🙂 maybe just use [[:alnum:]'] instead of [^\s] (worth an answer probably!)
06-22-2020 5:11 PM
Thanks, but the question was about regex, which your answer answers best. My comment is just a workaround of the issue.
06-22-2020 8:58 PM
Thank you both for your answers and comments, sandra.rossi and mateuszadamus. Maybe I didn't make myself clear, but I have the answer I needed! The ABAP documentation is misleading and the regex proposed by Sandra is just beautiful! In my use case it's not just single quotes that count as part of a word, but any character that is not blank. So the regex would be:
DATA(regex) = `(?:^|[^[:graph:]])[[:graph:]]{1,5}(?![[:graph:]])`.
This one can be used in one go, unlike mine (^\S{1,5}\s|\s\S{1,5}\s|\s\S{1,5}$), for which I would have to implement a DO-loop, to replace all occurrences.
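A minimal sketch of the "one go" replacement with the regex above, assuming a made-up sample sentence and an empty replacement string. Note that each match includes the character in front of the word, so that character is removed together with the word.

```abap
DATA(text) = `Cathy's black cat sat on a windowsill`.
DATA(regex) = `(?:^|[^[:graph:]])[[:graph:]]{1,5}(?![[:graph:]])`.

" The lookahead (?!...) does not consume the blank after a short word,
" so that blank can start the next match and a single pass is enough.
REPLACE ALL OCCURRENCES OF REGEX regex IN text WITH ``.

" Expected: Cathy's windowsill
cl_demo_output=>display( text ).
```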
06-22-2020 8:23 AM
Hello michalbadura
I agree, there seems to be an issue with the regex handling you described. Why is the apostrophe treated as a non-alphanumeric character if you set the length to 1 to 5? Why is it treated as alphanumeric if you set the length to 1 to 7?
Looking at your task, how about a different approach? Find all words separated by spaces and then remove the ones that are longer than the specified width (something like: [^\s]+).
Kind regards,
Mateusz
06-22-2020 10:18 AM
Thank you mateuszadamus for your answer! I think Sandra Rossi has got the right solution.
I want to get rid of the words that are shorter. Since it won't work with word boundaries, I tried another approach: ^\S{1,5}\s|\s\S{1,5}\s|\s\S{1,5}$. But for this, I would have to run the replacement in a loop until there is nothing more to replace.
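A sketch of what such a loop could look like. The sample text and the choice to replace each match with a single blank are assumptions, not something stated in the thread. The loop ends because every match is at least two characters long and is replaced by one, so the text gets shorter on each successful pass.

```abap
DATA(text) = `a b c d Cathy's`.
DATA(regex) = `^\S{1,5}\s|\s\S{1,5}\s|\s\S{1,5}$`.

DO.
  " Each match also consumes the whitespace after the word, so the next
  " short word has lost its leading \s and is skipped in this pass.
  REPLACE ALL OCCURRENCES OF REGEX regex IN text WITH ` `.
  IF sy-subrc <> 0.
    EXIT. " nothing was replaced in this pass
  ENDIF.
ENDDO.

cl_demo_output=>display( text ).
```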
06-22-2020 11:14 AM
Hi Michał
I'm not sure if you have already got the answer you needed or not. You mention Sandra got the right solution.
If not, then here's my ABAP solution for the issue.
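" Sample text as a fixed-length character field, so that parts of it can be overwritten
DATA(lv_haystack) = 'Cathy''s black cat'.
" A word is any run of non-whitespace characters
DATA(lv_regex) = '[^\s]+'.
FIND ALL OCCURRENCES OF REGEX lv_regex IN lv_haystack IN CHARACTER MODE RESULTS DATA(lt_results).
" Keep only the matches for short words (up to 5 characters)
DELETE lt_results WHERE length > 5.
" Overwrite each remaining short word with blanks
LOOP AT lt_results REFERENCE INTO DATA(ld_result).
  lv_haystack+ld_result->offset(ld_result->length) = ''.
ENDLOOP.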