The goal of this is to provide a string deduplication pool that can be used as a step between storing String
everywhere and storing a usize
and requiring all of the users to know about the interner if they want to figure out what string we’re talking about.
The comments in the code should explain what’s going on (as unsafe
should take a proof of correctness), but the basic idea is that the interner loans out &'a str
where 'a
guarantees that the loan dies with the interner. So long as the interner is alive, the backing String
is immutable, so the heap-allocated string slice will never move, and the reference is sound.
This is fully documented code, so there’s not much more I can say that’s not repeating myself.
//! A very simplistic string interning interface based around giving out `&str` references //! rather than some placeholder symbol. This means that strings can be interned in systems //! based around `&str` without rewriting to support a new `Symbol` type. //! //! The typical use case for something like this is text processing chunks, where chunks are very //! likely to be repeated. For example, when parsing source code, identifiers are likely to come up //! multiple times. Rather than have a `Token::Identifier(String)` and allocate every occurrence of //! those identifiers separately, interners allow you to store `Token::Identifier(Symbol)`, and //! compare identifier equality by the interned symbol. //! //! This crate provides the option of using the `&str` directly as the `Symbol` type rather than //! have another layer of indirection to getting the backing slice. This is good for overlaying //! on top of an existing system that doesn't need to know about the interning going on behind the //! scenes. However, this means that comparison of interned strings is still `O(len)` when it could //! be a simple pointer compare, and interned symbols cannot be persisted across serialization. //! //! If it doesn't make sense for you to give up the benefits of using dedicated symbols in order to //! get the niche benefit of just using `&str`, you should not use this crate. Consider instead //! [string-interner](https://crates.io/crates/string-interner), which is based off of the Rust //! compiler's string interner. #![forbid(missing_debug_implementations, unconditional_recursion, future_incompatible)] #![deny(bad_style, missing_docs, unsafe_code, unused)] #![warn(unreachable_pub)] #[macro_use] extern crate serde_derive; use std::collections::HashSet; use std::collections::hash_map::RandomState; use std::hash::BuildHasher; use std::marker::PhantomData; use std::mem; // The `StringInterner` contains a lifetime `'a` and loans out string references with that lifetime. // This guarantees that for as long as the interner is alive, so will the loan. // Because a `String`'s data lives on the heap and we don't mutate them, // their data will live until they are freed, and will not move, even as our set grows. // They will not be freed until we are, as we are an append-only collection of `String`s. /// A string interner based on a `HashSet`. See the crate-level docs for more. #[derive(Clone, Debug, Eq, PartialEq)] #[derive(Serialize, Deserialize)] pub struct StringInterner<'a, H: BuildHasher = RandomState> { #[serde(bound(deserialize = "H: Default"))] // HashSet: Serialize arena: HashSet<Box<str>, H>, marker: PhantomData<&'a str>, } // Cannot be derived with the BuildHasher generic impl<'a> Default for StringInterner<'a> { fn default() -> Self { StringInterner { arena: HashSet::default(), marker: PhantomData, } } } #[inline(always)] fn coerce<T>(t: T) -> T { t } #[allow(unsafe_code)] /// The string interner interface impl<'a, H: BuildHasher> StringInterner<'a, H> { /// Get an interned string slice out of this interner, or insert if it doesn't exist. /// Takes borrowed or owned strings. If given a new borrowed string, it will be boxed /// and saved into the interner. If given an owned string, no new allocation will /// happen for the string. /// /// Note that the interner may need to reallocate to make space for the new reference, /// just the same as a `Vec<String>` would. This cost is amortized to `O(1)` as it is /// in other standard library collections. /// /// If you have an owned string and no longer need the ownership, pass it in directly. /// Otherwise, just pass in a string slice. /// /// See `get` for more about the interned `&str`. #[inline] pub fn get_or_insert<S>(&mut self, s: S) -> &'a str where S: AsRef<str> + Into<Box<str>>, { if self.arena.contains(s.as_ref()) { self.get(s.as_ref()).expect("Just entered") } else { let s: Box<str> = s.into(); // Get the reference to loan out _after_ boxing up our data let _s: &'a str = unsafe { mem::transmute(coerce::<&str>(&s)) }; self.arena.insert(s); _s } } /// Get an interned string slice out of this interner. /// /// The returned string slice is `&'a str`. This guarantees that the returned slice /// will live at least as long as this interner does. All strings in the interner are /// never mutated, so the heap-allocated string slice is never going to move, which /// makes loaning these references out sound. #[inline] pub fn get(&self, s: &str) -> Option<&'a str> { self.arena .get(s) .map(|s| unsafe { mem::transmute(coerce::<&str>(s)) }) } } /// Constructors impl<'a> StringInterner<'a, RandomState> { /// Create an empty string interner. /// /// The backing set is initially created with a capacity of 0, /// so it will not allocate until it is first inserted into. pub fn new() -> Self { StringInterner { arena: HashSet::new(), marker: PhantomData, } } /// Create an empty string interner with the specified capacity. /// /// The interner will be able to hold at least `capacity` strings without reallocating. /// If `capacity` is 0, the interner will not initially allocate. pub fn with_capacity(capacity: usize) -> Self { StringInterner { arena: HashSet::with_capacity(capacity), marker: PhantomData, } } } /// Constructors to control the backing `HashSet`'s hash function impl<'a, H: BuildHasher> StringInterner<'a, H> { /// Create an empty string interner which will use the given hasher to hash the strings. /// /// The string interner is also created with the default capacity. pub fn with_hasher(hasher: H) -> Self { StringInterner { arena: HashSet::with_hasher(hasher), marker: PhantomData, } } /// Create an empty interner with the specified capacity, using `hasher` to hash the strings. /// /// The interner will be able to hold at least `capacity` strings without reallocating. /// If `capacity` is 0, the interner will not initially allocate. pub fn with_capacity_and_hasher(capacity: usize, hasher: H) -> Self { StringInterner { arena: HashSet::with_capacity_and_hasher(capacity, hasher), marker: PhantomData, } } } #[cfg(test)] mod tests { use super::*; #[test] fn basic_usage() { // Create the interner let mut interner = StringInterner::default(); // Intern some strings let a1 = interner.get_or_insert(Box::<str>::from("a")); let b1 = interner.get_or_insert(Box::<str>::from("b")); let c1 = interner.get_or_insert("c"); // Get the interned strings let a2 = interner.get_or_insert("a"); let b2 = interner.get_or_insert("b"); let c2 = interner.get_or_insert("c"); // Force the interner to move onto the heap let interner = Box::new(interner); // Get the interned strings from the new location let a3 = interner.get("a").unwrap(); let b3 = interner.get("b").unwrap(); let c3 = interner.get("c").unwrap(); // The same strings better be the same pointers or it's broken assert_eq!(a1.as_ptr(), a2.as_ptr()); assert_eq!(a2.as_ptr(), a3.as_ptr()); assert_eq!(b1.as_ptr(), b2.as_ptr()); assert_eq!(b2.as_ptr(), b3.as_ptr()); assert_eq!(c1.as_ptr(), c2.as_ptr()); assert_eq!(c2.as_ptr(), c3.as_ptr()); } }
Oh, and I still agree that everyone should prefer to use string-interner when possible, this is just a convenience for the small niche where storing a bunch of String
is too wasteful and yet we still want easy access to the backing &str
, such as when dealing with an existing system where revamping to use string-interner would take more work, and this provides a middle ground.