1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776
/*! Teddy is a simd accelerated multiple substring matching algorithm. The name and the core ideas in the algorithm were learned from the [Hyperscan][1_u] project. Background ---------- The key idea of Teddy is to do *packed* substring matching. In the literature, packed substring matching is the idea of examing multiple bytes in a haystack at a time to detect matches. Implementations of, for example, memchr (which detects matches of a single byte) have been doing this for years. Only recently, with the introduction of various SIMD instructions, has this been extended to substring matching. The PCMPESTRI instruction (and its relatives), for example, implements substring matching in hardware. It is, however, limited to substrings of length 16 bytes or fewer, but this restriction is fine in a regex engine, since we rarely care about the performance difference between searching for a 16 byte literal and a 16 + N literal; 16 is already long enough. The key downside of the PCMPESTRI instruction, on current (2016) CPUs at least, is its latency and throughput. As a result, it is often faster to do substring search with a Boyer-Moore variant and a well placed memchr to quickly skip through the haystack. There are fewer results from the literature on packed substring matching, and even fewer for packed multiple substring matching. Ben-Kiki et al. [2] describes use of PCMPESTRI for substring matching, but is mostly theoretical and hand-waves performance. There is other theoretical work done by Bille [3] as well. The rest of the work in the field, as far as I'm aware, is by Faro and Kulekci and is generally focused on multiple pattern search. Their first paper [4a] introduces the concept of a fingerprint, which is computed for every block of N bytes in every pattern. The haystack is then scanned N bytes at a time and a fingerprint is computed in the same way it was computed for blocks in the patterns. If the fingerprint corresponds to one that was found in a pattern, then a verification step follows to confirm that one of the substrings with the corresponding fingerprint actually matches at the current location. Various implementation tricks are employed to make sure the fingerprint lookup is fast; typically by truncating the fingerprint. (This may, of course, provoke more steps in the verification process, so a balance must be struck.) The main downside of [4a] is that the minimum substring length is 32 bytes, presumably because of how the algorithm uses certain SIMD instructions. This essentially makes it useless for general purpose regex matching, where a small number of short patterns is far more likely. Faro and Kulekci published another paper [4b] that is conceptually very similar to [4a]. The key difference is that it uses the CRC32 instruction (introduced as part of SSE 4.2) to compute fingerprint values. This also enables the algorithm to work effectively on substrings as short as 7 bytes with 4 byte windows. 7 bytes is unfortunately still too long. The window could be technically shrunk to 2 bytes, thereby reducing minimum length to 3, but the small window size ends up negating most performance benefits—and it's likely the common case in a general purpose regex engine. Faro and Kulekci also published [4c] that appears to be intended as a replacement to using PCMPESTRI. In particular, it is specifically motivated by the high throughput/latency time of PCMPESTRI and therefore chooses other SIMD instructions that are faster. While this approach works for short substrings, I personally couldn't see a way to generalize it to multiple substring search. Faro and Kulekci have another paper [4d] that I haven't been able to read because it is behind a paywall. Teddy ----- Finally, we get to Teddy. If the above literature review is complete, then it appears that Teddy is a novel algorithm. More than that, in my experience, it completely blows away the competition for short substrings, which is exactly what we want in a general purpose regex engine. Again, the algorithm appears to be developed by the authors of [Hyperscan][1_u]. Hyperscan was open sourced late 2015, and no earlier history could be found. Therefore, tracking the exact provenance of the algorithm with respect to the published literature seems difficult. DISCLAIMER: My understanding of Teddy is limited to reading auto-generated C code, its disassembly and observing its runtime behavior. At a high level, Teddy works somewhat similarly to the fingerprint algorithms published by Faro and Kulekci, but Teddy does it in a way that scales a bit better. Namely: 1. Teddy's core algorithm scans the haystack in 16 byte chunks. 16 is significant because it corresponds to the number of bytes in a SIMD vector. If one used AVX2 instructions, then we could scan the haystack in 32 byte chunks. Similarly, if one used AVX512 instructions, we could scan the haystack in 64 byte chunks. Hyperscan implements SSE + AVX2, we only implement SSE for the moment. 2. Bitwise operations are performed on each chunk to discover if any region of it matches a set of precomputed fingerprints from the patterns. If there are matches, then a verification step is performed. In this implementation, our verification step is naive. This can be improved upon. The details to make this work are quite clever. First, we must choose how to pick our fingerprints. In Hyperscan's implementation, I *believe* they use the last N bytes of each substring, where N must be at least the minimum length of any substring in the set being searched. In this implementation, we use the first N bytes of each substring. (The tradeoffs between these choices aren't yet clear to me.) We then must figure out how to quickly test whether an occurrence of any fingerprint from the set of patterns appears in a 16 byte block from the haystack. To keep things simple, let's assume N = 1 and examine some examples to motivate the approach. Here are our patterns: ```ignore foo bar baz ``` The corresponding fingerprints, for N = 1, are `f`, `b` and `b`. Now let's set our 16 byte block to: ```ignore bat cat foo bump xxxxxxxxxxxxxxxx ``` To cut to the chase, Teddy works by using bitsets. In particular, Teddy creates a mask that allows us to quickly compute membership of a fingerprint in a 16 byte block that also tells which pattern the fingerprint corresponds to. In this case, our fingerprint is a single byte, so an appropriate abstraction is a map from a single byte to a list of patterns that contain that fingerprint: ```ignore f |--> foo b |--> bar, baz ``` Now, all we need to do is figure out how to represent this map in vector space and use normal SIMD operations to perform a lookup. The first simplification we can make is to represent our patterns as bit fields occupying a single byte. This is important, because a single SIMD vector can store 16 bytes. ```ignore f |--> 00000001 b |--> 00000010, 00000100 ``` How do we perform lookup though? It turns out that SSSE3 introduced a very cool instruction called PSHUFB. The instruction takes two SIMD vectors, `A` and `B`, and returns a third vector `C`. All vectors are treated as 16 8-bit integers. `C` is formed by `C[i] = A[B[i]]`. (This is a bit of a simplification, but true for the purposes of this algorithm. For full details, see [Intel's Intrinsics Guide][5_u].) This essentially lets us use the values in `B` to lookup values in `A`. If we could somehow cause `B` to contain our 16 byte block from the haystack, and if `A` could contain our bitmasks, then we'd end up with something like this for `A`: ```ignore 0x00 0x01 ... 0x62 ... 0x66 ... 0xFF A = 0 0 00000110 00000001 0 ``` And if `B` contains our window from our haystack, we could use shuffle to take the values from `B` and use them to look up our bitsets in `A`. But of course, we can't do this because `A` in the above example contains 256 bytes, which is much larger than the size of a SIMD vector. Nybbles to the rescue! A nybble is 4 bits. Instead of one mask to hold all of our bitsets, we can use two masks, where one mask corresponds to the lower four bits of our fingerprint and the other mask corresponds to the upper four bits. So our map now looks like: ```ignore 'f' & 0xF = 0x6 |--> 00000001 'f' >> 4 = 0x6 |--> 00000111 'b' & 0xF = 0x2 |--> 00000110 'b' >> 4 = 0x6 |--> 00000111 ``` Notice that the bitsets for each nybble correspond to the union of all fingerprints that contain that nybble. For example, both `f` and `b` have the same upper 4 bits but differ on the lower 4 bits. Putting this together, we have `A0`, `A1` and `B`, where `A0` is our mask for the lower nybble, `A1` is our mask for the upper nybble and `B` is our 16 byte block from the haystack: ```ignore 0x00 0x01 0x02 0x03 ... 0x06 ... 0xF A0 = 0 0 00000110 0 00000001 0 A1 = 0 0 0 0 00000111 0 B = b a t _ t p B = 0x62 0x61 0x74 0x20 0x74 0x70 ``` But of course, we can't use `B` with `PSHUFB` yet, since its values are 8 bits, and we need indexes that are at most 4 bits (corresponding to one of 16 values). We can apply the same transformation to split `B` into lower and upper nybbles as we did `A`. As before, `B0` corresponds to the lower nybbles and `B1` corresponds to the upper nybbles: ```ignore b a t _ c a t _ f o o _ b u m p B0 = 0x2 0x1 0x4 0x0 0x3 0x1 0x4 0x0 0x6 0xF 0xF 0x0 0x2 0x5 0xD 0x0 B1 = 0x6 0x6 0x7 0x2 0x6 0x6 0x7 0x2 0x6 0x6 0x6 0x2 0x6 0x7 0x6 0x7 ``` And now we have a nice correspondence. `B0` can index `A0` and `B1` can index `A1`. Here's what we get when we apply `C0 = PSHUFB(A0, B0)`: ```ignore b a ... f o ... p A0[0x2] A0[0x1] A0[0x6] A0[0xF] A0[0x0] C0 = 00000110 0 00000001 0 0 ``` And `C1 = PSHUFB(A1, B1)`: ```ignore b a ... f o ... p A1[0x6] A1[0x6] A1[0x6] A1[0x6] A1[0x7] C1 = 00000111 00000111 00000111 00000111 0 ``` Notice how neither one of `C0` or `C1` is guaranteed to report fully correct results all on its own. For example, `C1` claims that `b` is a fingerprint for the pattern `foo` (since `A1[0x6] = 00000111`), and that `o` is a fingerprint for all of our patterns. But if we combined `C0` and `C1` with an `AND` operation: ```ignore b a ... f o ... p C = 00000110 0 00000001 0 0 ``` Then we now have that `C[i]` contains a bitset corresponding to the matching fingerprints in a haystack's 16 byte block, where `i` is the `ith` byte in that block. Once we have that, we can look for the position of the least significant bit in `C`. That position, modulo `8`, gives us the pattern that the fingerprint matches. That position, integer divided by `8`, also gives us the byte offset that the fingerprint occurs in inside the 16 byte haystack block. Using those two pieces of information, we can run a verification procedure that tries to match all substrings containing that fingerprint at that position in the haystack. Implementation notes -------------------- The problem with the algorithm as described above is that it uses a single byte for a fingerprint. This will work well if the fingerprints are rare in the haystack (e.g., capital letters or special characters in normal English text), but if the fingerprints are common, you'll wind up spending too much time in the verification step, which effectively negate the performance benefits of scanning 16 bytes at a time. Remember, the key to the performance of this algorithm is to do as little work as possible per 16 bytes. This algorithm can be extrapolated in a relatively straight-forward way to use larger fingerprints. That is, instead of a single byte prefix, we might use a three byte prefix. The implementation below implements N = {1, 2, 3} and always picks the largest N possible. The rationale is that the bigger the fingerprint, the fewer verification steps we'll do. Of course, if N is too large, then we'll end up doing too much on each step. The way to extend it is: 1. Add a mask for each byte in the fingerprint. (Remember that each mask is composed of two SIMD vectors.) This results in a value of `C` for each byte in the fingerprint while searching. 2. When testing each 16 byte block, each value of `C` must be shifted so that they are aligned. Once aligned, they should all be `AND`'d together. This will give you only the bitsets corresponding to the full match of the fingerprint. The implementation below is commented to fill in the nitty gritty details. References ---------- - **[1]** [Hyperscan on GitHub](https://github.com/01org/hyperscan), [webpage](https://01.org/hyperscan) - **[2a]** Ben-Kiki, O., Bille, P., Breslauer, D., Gasieniec, L., Grossi, R., & Weimann, O. (2011). _Optimal packed string matching_. In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 13). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. DOI: 10.4230/LIPIcs.FSTTCS.2011.423. [PDF](http://drops.dagstuhl.de/opus/volltexte/2011/3355/pdf/37.pdf). - **[2b]** Ben-Kiki, O., Bille, P., Breslauer, D., Ga̧sieniec, L., Grossi, R., & Weimann, O. (2014). _Towards optimal packed string matching_. Theoretical Computer Science, 525, 111-129. DOI: 10.1016/j.tcs.2013.06.013. [PDF](http://www.cs.haifa.ac.il/~oren/Publications/bpsm.pdf). - **[3]** Bille, P. (2011). _Fast searching in packed strings_. Journal of Discrete Algorithms, 9(1), 49-56. DOI: 10.1016/j.jda.2010.09.003. [PDF](http://www.sciencedirect.com/science/article/pii/S1570866710000353). - **[4a]** Faro, S., & Külekci, M. O. (2012, October). _Fast multiple string matching using streaming SIMD extensions technology_. In String Processing and Information Retrieval (pp. 217-228). Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-34109-0_23. [PDF](http://www.dmi.unict.it/~faro/papers/conference/faro32.pdf). - **[4b]** Faro, S., & Külekci, M. O. (2013, September). _Towards a Very Fast Multiple String Matching Algorithm for Short Patterns_. In Stringology (pp. 78-91). [PDF](http://www.dmi.unict.it/~faro/papers/conference/faro36.pdf). - **[4c]** Faro, S., & Külekci, M. O. (2013, January). _Fast packed string matching for short patterns_. In Proceedings of the Meeting on Algorithm Engineering & Expermiments (pp. 113-121). Society for Industrial and Applied Mathematics. [PDF](http://arxiv.org/pdf/1209.6449.pdf). - **[4d]** Faro, S., & Külekci, M. O. (2014). _Fast and flexible packed string matching_. Journal of Discrete Algorithms, 28, 61-72. DOI: 10.1016/j.jda.2014.07.003. [1_u]: https://github.com/01org/hyperscan [5_u]: https://software.intel.com/sites/landingpage/IntrinsicsGuide */ use std::cmp; use aho_corasick::{Automaton, AcAutomaton, FullAcAutomaton}; use syntax::hir::literal::Literals; use vector::ssse3::{SSSE3VectorBuilder, u8x16}; /// Corresponds to the number of bytes read at a time in the haystack. const BLOCK_SIZE: usize = 16; /// Match reports match information. #[derive(Debug, Clone)] pub struct Match { /// The index of the pattern that matched. The index is in correspondence /// with the order of the patterns given at construction. pub pat: usize, /// The start byte offset of the match. pub start: usize, /// The end byte offset of the match. This is always `start + pat.len()`. pub end: usize, } /// A SIMD accelerated multi substring searcher. #[derive(Debug, Clone)] pub struct Teddy { /// A builder for SSSE3 empowered vectors. vb: SSSE3VectorBuilder, /// A list of substrings to match. pats: Vec<Vec<u8>>, /// An Aho-Corasick automaton of the patterns. We use this when we need to /// search pieces smaller than the Teddy block size. ac: FullAcAutomaton<Vec<u8>>, /// A set of 8 buckets. Each bucket corresponds to a single member of a /// bitset. A bucket contains zero or more substrings. This is useful /// when the number of substrings exceeds 8, since our bitsets cannot have /// more than 8 members. buckets: Vec<Vec<usize>>, /// Our set of masks. There's one mask for each byte in the fingerprint. masks: Masks, } impl Teddy { /// Returns true if and only if Teddy is supported on this platform. /// /// If this returns `false`, then `Teddy::new(...)` is guaranteed to /// return `None`. pub fn available() -> bool { SSSE3VectorBuilder::new().is_some() } /// Create a new `Teddy` multi substring matcher. /// /// If a `Teddy` matcher could not be created (e.g., `pats` is empty or has /// an empty substring), then `None` is returned. pub fn new(pats: &Literals) -> Option<Teddy> { let vb = match SSSE3VectorBuilder::new() { None => return None, Some(vb) => vb, }; if !Teddy::available() { return None; } let pats: Vec<_> = pats.literals().iter().map(|p|p.to_vec()).collect(); let min_len = pats.iter().map(|p| p.len()).min().unwrap_or(0); // Don't allow any empty patterns and require that we have at // least one pattern. if min_len < 1 { return None; } // Pick the largest mask possible, but no larger than 3. let nmasks = cmp::min(3, min_len); let mut masks = Masks::new(vb, nmasks); let mut buckets = vec![vec![]; 8]; // Assign a substring to each bucket, and add the bucket's bitfield to // the appropriate position in the mask. for (pati, pat) in pats.iter().enumerate() { let bucket = pati % 8; buckets[bucket].push(pati); masks.add(bucket as u8, pat); } Some(Teddy { vb: vb, pats: pats.to_vec(), ac: AcAutomaton::new(pats.to_vec()).into_full(), buckets: buckets, masks: masks, }) } /// Returns all of the substrings matched by this `Teddy`. pub fn patterns(&self) -> &[Vec<u8>] { &self.pats } /// Returns the number of substrings in this matcher. pub fn len(&self) -> usize { self.pats.len() } /// Returns the approximate size on the heap used by this matcher. pub fn approximate_size(&self) -> usize { self.pats.iter().fold(0, |a, b| a + b.len()) } /// Searches `haystack` for the substrings in this `Teddy`. If a match was /// found, then it is returned. Otherwise, `None` is returned. pub fn find(&self, haystack: &[u8]) -> Option<Match> { // This is safe because the only way we can construct a Teddy type // is if SSSE3 is available. unsafe { self.find_impl(haystack) } } #[allow(unused_attributes)] #[target_feature(enable = "ssse3")] unsafe fn find_impl(&self, haystack: &[u8]) -> Option<Match> { // If our haystack is smaller than the block size, then fall back to // a naive brute force search. if haystack.is_empty() || haystack.len() < (BLOCK_SIZE + 2) { return self.slow(haystack, 0); } match self.masks.len() { 0 => None, 1 => self.find1(haystack), 2 => self.find2(haystack), 3 => self.find3(haystack), _ => unreachable!(), } } /// `find1` is used when there is only 1 mask. This is the easy case and is /// pretty much as described in the module documentation. #[inline(always)] fn find1(&self, haystack: &[u8]) -> Option<Match> { let mut pos = 0; let zero = self.vb.u8x16_splat(0); let len = haystack.len(); debug_assert!(len >= BLOCK_SIZE); while pos <= len - BLOCK_SIZE { let h = unsafe { // I tried and failed to eliminate bounds checks in safe code. // This is safe because of our loop invariant: pos is always // <= len-16. let p = haystack.get_unchecked(pos..); self.vb.u8x16_load_unchecked_unaligned(p) }; // N.B. `res0` is our `C` in the module documentation. let res0 = self.masks.members1(h); // Only do expensive verification if there are any non-zero bits. let bitfield = res0.ne(zero).movemask(); if bitfield != 0 { if let Some(m) = self.verify(haystack, pos, res0, bitfield) { return Some(m); } } pos += BLOCK_SIZE; } self.slow(haystack, pos) } /// `find2` is used when there are 2 masks, e.g., the fingerprint is 2 bytes /// long. #[inline(always)] fn find2(&self, haystack: &[u8]) -> Option<Match> { // This is an exotic way to right shift a SIMD vector across lanes. // See below at use for more details. let zero = self.vb.u8x16_splat(0); let len = haystack.len(); // The previous value of `C` (from the module documentation) for the // *first* byte in the fingerprint. On subsequent iterations, we take // the last bitset from the previous `C` and insert it into the first // position of the current `C`, shifting all other bitsets to the right // one lane. This causes `C` for the first byte to line up with `C` for // the second byte, so that they can be `AND`'d together. let mut prev0 = self.vb.u8x16_splat(0xFF); let mut pos = 1; debug_assert!(len >= BLOCK_SIZE); while pos <= len - BLOCK_SIZE { let h = unsafe { // I tried and failed to eliminate bounds checks in safe code. // This is safe because of our loop invariant: pos is always // <= len-16. let p = haystack.get_unchecked(pos..); self.vb.u8x16_load_unchecked_unaligned(p) }; let (res0, res1) = self.masks.members2(h); // Do this: // // (prev0 << 15) | (res0 >> 1) // // This lets us line up our C values for each byte. let res0prev0 = res0.alignr_15(prev0); // `AND`'s our `C` values together. let res = res0prev0.and(res1); prev0 = res0; let bitfield = res.ne(zero).movemask(); if bitfield != 0 { let pos = pos.checked_sub(1).unwrap(); if let Some(m) = self.verify(haystack, pos, res, bitfield) { return Some(m); } } pos += BLOCK_SIZE; } // The windowing above doesn't check the last byte in the last // window, so start the slow search at the last byte of the last // window. self.slow(haystack, pos.checked_sub(1).unwrap()) } /// `find3` is used when there are 3 masks, e.g., the fingerprint is 3 bytes /// long. /// /// N.B. This is a straight-forward extrapolation of `find2`. The only /// difference is that we need to keep track of two previous values of `C`, /// since we now need to align for three bytes. #[inline(always)] fn find3(&self, haystack: &[u8]) -> Option<Match> { let zero = self.vb.u8x16_splat(0); let len = haystack.len(); let mut prev0 = self.vb.u8x16_splat(0xFF); let mut prev1 = self.vb.u8x16_splat(0xFF); let mut pos = 2; while pos <= len - BLOCK_SIZE { let h = unsafe { // I tried and failed to eliminate bounds checks in safe code. // This is safe because of our loop invariant: pos is always // <= len-16. let p = haystack.get_unchecked(pos..); self.vb.u8x16_load_unchecked_unaligned(p) }; let (res0, res1, res2) = self.masks.members3(h); let res0prev0 = res0.alignr_14(prev0); let res1prev1 = res1.alignr_15(prev1); let res = res0prev0.and(res1prev1).and(res2); prev0 = res0; prev1 = res1; let bitfield = res.ne(zero).movemask(); if bitfield != 0 { let pos = pos.checked_sub(2).unwrap(); if let Some(m) = self.verify(haystack, pos, res, bitfield) { return Some(m); } } pos += BLOCK_SIZE; } // The windowing above doesn't check the last two bytes in the last // window, so start the slow search at the penultimate byte of the // last window. // self.slow(haystack, pos.saturating_sub(2)) self.slow(haystack, pos.checked_sub(2).unwrap()) } /// Runs the verification procedure on `res` (i.e., `C` from the module /// documentation), where the haystack block starts at `pos` in /// `haystack`. `bitfield` has ones in the bit positions that `res` has /// non-zero bytes. /// /// If a match exists, it returns the first one. #[inline(always)] fn verify( &self, haystack: &[u8], pos: usize, res: u8x16, mut bitfield: u32, ) -> Option<Match> { while bitfield != 0 { // The next offset, relative to pos, where some fingerprint // matched. let byte_pos = bitfield.trailing_zeros() as usize; bitfield &= !(1 << byte_pos); // Offset relative to the beginning of the haystack. let start = pos + byte_pos; // The bitfield telling us which patterns had fingerprints that // match at this starting position. let mut patterns = res.extract(byte_pos); while patterns != 0 { let bucket = patterns.trailing_zeros() as usize; patterns &= !(1 << bucket); // Actual substring search verification. if let Some(m) = self.verify_bucket(haystack, bucket, start) { return Some(m); } } } None } /// Verifies whether any substring in the given bucket matches in haystack /// at the given starting position. #[inline(always)] fn verify_bucket( &self, haystack: &[u8], bucket: usize, start: usize, ) -> Option<Match> { // This cycles through the patterns in the bucket in the order that // the patterns were given. Therefore, we guarantee leftmost-first // semantics. for &pati in &self.buckets[bucket] { let pat = &*self.pats[pati]; if start + pat.len() > haystack.len() { continue; } if pat == &haystack[start..start + pat.len()] { return Some(Match { pat: pati, start: start, end: start + pat.len(), }); } } None } /// Slow substring search through all patterns in this matcher. /// /// This is used when we don't have enough bytes in the haystack for our /// block based approach. #[inline(never)] fn slow(&self, haystack: &[u8], pos: usize) -> Option<Match> { self.ac.find(&haystack[pos..]).next().map(|m| { Match { pat: m.pati, start: pos + m.start, end: pos + m.end, } }) } } /// A list of masks. This has length equal to the length of the fingerprint. /// The length of the fingerprint is always `min(3, len(smallest_substring))`. #[derive(Debug, Clone)] struct Masks { vb: SSSE3VectorBuilder, masks: [Mask; 3], size: usize, } impl Masks { /// Create a new set of masks of size `n`, where `n` corresponds to the /// number of bytes in a fingerprint. fn new(vb: SSSE3VectorBuilder, n: usize) -> Masks { Masks { vb: vb, masks: [Mask::new(vb), Mask::new(vb), Mask::new(vb)], size: n, } } /// Returns the number of masks. fn len(&self) -> usize { self.size } /// Adds the given pattern to the given bucket. The bucket should be a /// power of `2 <= 2^7`. fn add(&mut self, bucket: u8, pat: &[u8]) { for i in 0..self.len() { self.masks[i].add(bucket, pat[i]); } } /// Finds the fingerprints that are in the given haystack block. i.e., this /// returns `C` as described in the module documentation. /// /// More specifically, `for i in 0..16` and `j in 0..8, C[i][j] == 1` if and /// only if `haystack_block[i]` corresponds to a fingerprint that is part /// of a pattern in bucket `j`. #[inline(always)] fn members1(&self, haystack_block: u8x16) -> u8x16 { let masklo = self.vb.u8x16_splat(0xF); let hlo = haystack_block.and(masklo); let hhi = haystack_block.bit_shift_right_4().and(masklo); self.masks[0].lo.shuffle(hlo).and(self.masks[0].hi.shuffle(hhi)) } /// Like members1, but computes C for the first and second bytes in the /// fingerprint. #[inline(always)] fn members2(&self, haystack_block: u8x16) -> (u8x16, u8x16) { let masklo = self.vb.u8x16_splat(0xF); let hlo = haystack_block.and(masklo); let hhi = haystack_block.bit_shift_right_4().and(masklo); let res0 = self.masks[0].lo.shuffle(hlo).and(self.masks[0].hi.shuffle(hhi)); let res1 = self.masks[1].lo.shuffle(hlo).and(self.masks[1].hi.shuffle(hhi)); (res0, res1) } /// Like `members1`, but computes `C` for the first, second and third bytes /// in the fingerprint. #[inline(always)] fn members3(&self, haystack_block: u8x16) -> (u8x16, u8x16, u8x16) { let masklo = self.vb.u8x16_splat(0xF); let hlo = haystack_block.and(masklo); let hhi = haystack_block.bit_shift_right_4().and(masklo); let res0 = self.masks[0].lo.shuffle(hlo).and(self.masks[0].hi.shuffle(hhi)); let res1 = self.masks[1].lo.shuffle(hlo).and(self.masks[1].hi.shuffle(hhi)); let res2 = self.masks[2].lo.shuffle(hlo).and(self.masks[2].hi.shuffle(hhi)); (res0, res1, res2) } } /// A single mask. #[derive(Debug, Clone, Copy)] struct Mask { /// Bitsets for the low nybbles in a fingerprint. lo: u8x16, /// Bitsets for the high nybbles in a fingerprint. hi: u8x16, } impl Mask { /// Create a new mask with no members. fn new(vb: SSSE3VectorBuilder) -> Mask { Mask { lo: vb.u8x16_splat(0), hi: vb.u8x16_splat(0), } } /// Adds the given byte to the given bucket. fn add(&mut self, bucket: u8, byte: u8) { // Split our byte into two nybbles, and add each nybble to our // mask. let byte_lo = (byte & 0xF) as usize; let byte_hi = (byte >> 4) as usize; let lo = self.lo.extract(byte_lo); self.lo.replace(byte_lo, ((1 << bucket) as u8) | lo); let hi = self.hi.extract(byte_hi); self.hi.replace(byte_hi, ((1 << bucket) as u8) | hi); } }