ahocorasick: improve matching with subdomains (#2331)

The basic idea is to have the following logic:
* pattern "DOMAIN" matches the domain itself (i.e exact match) *and* any
subdomains (i.e. "ANYTHING.DOMAIN")
* pattern "DOMAIN." matches *also* any strings for which is a prefix
[please, note that this kind of match is handy but it is quite
dangerous...]
* pattern "-DOMAIN" matches *also* any strings for which is a postfix

Examples:
* pattern "wikipedia.it":
  * "wikipiedia.it" -> OK
  * "foo.wikipedia.it -> OK
  * "foowikipedia.it -> NO MATCH
  * "wikipedia.it.com -> NO MATCH
* pattern "wikipedia.":
  * "wikipedia.it" -> OK
  * "foo.wikipedia.it -> OK
  * "foowikipedia.it -> NO MATCH
  * "wikipedia.it.com -> OK
* pattern "-wikipedia.it":
  * "wikipedia.it" -> NO MATCH
  * "foo.wikipedia.it -> NO MATCH
  * "0001-wikipedia.it -> OK
  * "foo.0001-wikipedia.it -> OK

Bottom line:
* exact match
* prefix with "." (always, implicit)
* prefix with "-" (only if esplicitly set)
* postfix with "." (only if esplicitly set)

That means that the patterns cannot start with '.' anymore.

Close #2330
This commit is contained in:
Ivan Nardi 2024-03-06 19:25:59 +01:00 committed by GitHub
parent 8f63a11735
commit 21da53d3a0
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
13 changed files with 527 additions and 464 deletions

Binary file not shown.