Problem using regex with word boundaries on text with non-alphanumeric characters

michał_badura
Participant

Hi all,

I have a problem using regexes with word boundaries.

First of all, my goal: I want to remove all words in my text that are shorter than, let's say, 5 characters. As usual, I started by playing with DEMO_REGEX_TOY. First try: \<\S{1,5}\>. This should match every word with 1 to 5 characters. But in the sample program (text: Cathy's black cat…), it matches Cathy with the option FIRST OCCURRENCE, and Cathy and s with all occurrences. The documentation for \< and \> says that words are defined as uninterrupted strings of alphanumeric characters. Alphanumeric characters are letters and digits, so that was my mistake. But if I change the length to 7 or more, then, all of a sudden, the whole word is matched (Cathy's). How is that possible? What am I missing here? It works the same way when I change \S to [[:graph:]].

I would very much appreciate any clues. Many thanks in advance!

Best regards

Michał

1 ACCEPTED SOLUTION

Sandra_Rossi
Active Contributor

I would say that the only question is about the text "Words are defined as uninterrupted strings of alphanumeric characters" in the documentation, which is completely wrong in my opinion. I think \< and \> just impose a condition on the characters at the left and right ends of the match: they must be alphanumeric. All characters in between can be anything. You may use the following to exclude single quotes:

\<[[:alpha:]]{1,7}\>

Concerning [[:graph:]], the documentation says "Set of all displayable characters except for blanks and horizontal tabs". So the result you get is logical; I don't know why you mention it.
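For anyone who wants to compare the patterns outside DEMO_REGEX_TOY, here is a minimal report-style sketch. The variable names and the shortened sample text are assumptions made for illustration. No output is asserted here; the point is only to list what each pattern matches on your own system.

```abap
" Illustrative sketch: all names below are made up for this example.
DATA(lv_text) = `Cathy's black cat`.

" Patterns discussed above: the original attempt, the longer variant,
" and the letters-only variant from this answer.
DATA(lt_patterns) = VALUE string_table(
  ( `\<\S{1,5}\>` )
  ( `\<\S{1,7}\>` )
  ( `\<[[:alpha:]]{1,7}\>` ) ).

LOOP AT lt_patterns INTO DATA(lv_pattern).
  FIND ALL OCCURRENCES OF REGEX lv_pattern IN lv_text
       RESULTS DATA(lt_matches).
  WRITE: / 'Pattern:', lv_pattern.
  LOOP AT lt_matches INTO DATA(ls_match).
    " Cut the matched substring out of the text via offset and length.
    DATA(lv_word) = substring( val = lv_text
                               off = ls_match-offset
                               len = ls_match-length ).
    WRITE: / '  match:', lv_word.
  ENDLOOP.
ENDLOOP.
```

If \< and \> only constrain the characters at the two ends of the match, the second pattern should be able to span the apostrophe, while the third cannot.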

11 REPLIES 11

Thank you, sandra.rossi, for your answer! It is an interesting point of view, which would explain the strange behaviour, but it would also mean that the documentation is wrong, or even completely wrong, as you stated.

I've just tried a different approach: for the sample text in DEMO_REGEX_TOY, the first occurrence of \<.{1,13}\> matches the text Cathy's black, so it seems to be exactly as you described!

I don't want to exclude single quotes. I want them to count as part of the word. I want to exclude all words that are up to 5 characters long. Cathy's is 7 characters long. But when I use the regex as described, I get Cathy as a match, which is wrong for my use case.

I mentioned [[:graph:]] because it was another approach I tried to achieve my goal. It shows once again that the problem is not the expression between the word boundaries, but the word boundaries themselves. Or rather their ABAP documentation, as you already noticed.

So you want the single quote to not be considered as a word boundary. So you can't use \< and \>.

Something close would be to use the following regular expression, plus some additional ABAP logic to refine the result:

regex = `(?:^|[^[:alnum:]'])` " start of text or any character neither alphanum nor single quote
     && `[[:alnum:]']{1,5}`   " 1 to 5 alphanum or single quote
     && `(?![[:alnum:]'])`.   " followed by neither alphanum nor single quote

After the search, you may have to refine the matches because they contain an extra character at the beginning which corresponds to the character before each word.

That's a nice regex there!


Kind regards,
Mateusz

sandra.rossi - how about enclosing the second part of the regex in a capturing group? The group would then return the words of 1 to 5 characters (without the surrounding spaces).


Kind regards,

Mateusz

Mateusz Adamus, it's up to the developer. I learnt that regex is cool, but you shouldn't go too far: not everything can be done with it, and if you insist, the regex becomes too complex and full of "subtleties". The one above is the maximum I think people can tolerate. They are scared by regex. It's the same issue with constructor expressions: keep the expressions at an "acceptable" size. I think that if you use submatches (capturing groups), the code won't be more readable than without submatches.

Your solution with the simple regex is even better, I think... 🙂 Maybe just use [[:alnum:]'] instead of [^\s] (probably worth an answer!)

Thanks, but the question was about regex, which your answer answers best. My comment is just a workaround of the issue.


Kind regards,
Mateusz

Thank you both for your answers and comments, sandra.rossi and mateuszadamus. Maybe I didn't make myself clear, but I have the answer I needed! The ABAP documentation is misleading, and the regex proposed by Sandra is just beautiful! In my use case it's not just single quotes but any non-whitespace character that counts as part of a word. So the regex would be:

DATA(regex) = `(?:^|[^[:graph:]])[[:graph:]]{1,5}(?![[:graph:]])`.

This one can be used in one go, unlike mine (^\S{1,5}\s|\s\S{1,5}\s|\s\S{1,5}$), for which I would have to implement a DO-loop, to replace all occurrences.
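A minimal sketch of how the single-pass removal could look. The variable names, the sample text, and the switch from a non-capturing to a capturing first group are assumptions added for illustration: capturing the leading separator lets the replacement write it back with $1, so only the short word itself is dropped.

```abap
" Illustrative sketch: lv_text, lv_regex and the capturing group are
" additions for this example, not part of the original post.
DATA(lv_text) = `Cathy's black cat`.

" Same pattern as above, but the leading separator (or the start of the
" text) is captured as group 1 so the replacement can put it back.
DATA(lv_regex) = `(^|[^[:graph:]])[[:graph:]]{1,5}(?![[:graph:]])`.

" One pass is enough: the lookahead does not consume the separator that
" follows a short word, so the next short word can still match.
REPLACE ALL OCCURRENCES OF REGEX lv_regex IN lv_text WITH `$1`.

" The separators in front of the removed words are still in the text;
" CONDENSE collapses the leftover blanks.
CONDENSE lv_text.
```

With this sample text, only Cathy's should remain after CONDENSE. Note that CONDENSE only handles blanks, so other whitespace such as tabs would need separate treatment.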

mateuszadamus
Active Contributor

Hello michalbadura

I agree, there seems to be an issue with the regex handling you explained. Why is the apostrophe a non-alphanumeric character if you set the length to 1 to 5? Why is it alphanumeric if you set the length to 1 to 7?

Looking at your task, how about a different approach? Find all words separated by spaces and then remove the ones that are longer than the specified width (something like: [^\s]+).

Kind regards,

Mateusz

michał_badura
Participant

Thank you, mateuszadamus, for your answer! I think Sandra Rossi has got the right solution.

I want to get rid of words that are shorter. Since it won't work with word boundaries, I tried another approach: ^\S{1,5}\s|\s\S{1,5}\s|\s\S{1,5}$. But for this, I would have to run the replacement in a loop, until there is nothing more to replace.
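A minimal sketch of that loop, to show why a single pass is not enough. The variable name, the sample text, and the choice of a single blank as the replacement are assumptions made for illustration.

```abap
" Illustrative sketch: lv_text and the single-blank replacement are
" assumptions, not part of the original post.
DATA(lv_text) = `Cathy's black cat`.

DO.
  " Each match consumes the whitespace on both sides of a short word, so
  " a directly following short word cannot match in the same pass.
  REPLACE ALL OCCURRENCES OF REGEX `^\S{1,5}\s|\s\S{1,5}\s|\s\S{1,5}$`
          IN lv_text WITH ` `.
  IF sy-subrc <> 0.
    EXIT. " nothing left to replace
  ENDIF.
ENDDO.
```

Every replacement makes the text shorter, so the loop is guaranteed to end. In this example the first pass removes black, and cat only goes in the second pass, because the blank in front of it was already consumed by the first match.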

mateuszadamus
Active Contributor

Hi Michał

I'm not sure if you have already got the answer you needed or not. You mention Sandra got the right solution.

If not, then here's my ABAP solution for the issue.

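" Character literal, so lv_haystack is a fixed-length field of type c
DATA(lv_haystack) = 'Cathy''s black cat'.
" One or more non-whitespace characters, i.e. one word
DATA(lv_regex) = '[^\s]+'.

FIND ALL OCCURRENCES OF REGEX lv_regex IN lv_haystack IN CHARACTER MODE RESULTS DATA(lt_results).
" Keep only the matches of up to 5 characters
DELETE lt_results WHERE length > 5.

LOOP AT lt_results REFERENCE INTO DATA(ld_result).
  " Overwrite each remaining short word with blanks
  lv_haystack+ld_result->offset(ld_result->length) = ''.
ENDLOOP.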

Kind regards,
Mateusz