A darn good write up! Its clarity is refreshing. Very well done. Thanks for posting.
stuhood 11 hours ago [-]
An interesting exercise would be to compare this with the (confusingly similarly named) `fsst` string compression strategy: https://github.com/cwida/fsst
VivaTechnics 11 hours ago [-]
Impressive! This approach could be applied to designing a NoSQL database. The flow might look something like this, right?
- The client queries for "alice123".
- The Query Engine checks the FST Index for an exact or prefix match.
- The FST Index returns a pointer to the location in Data Storage.
- Data Storage retrieves and returns the full document to the Query Engine.
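The four steps above can be sketched in a few lines of Python. This is a hedged illustration, not a real FST: a sorted key list stands in for the FST index (which, like a real FST, supports exact and prefix lookups over sorted keys), and a plain list stands in for data storage. All names (`TinyIndex`, etc.) are made up for this example.

```python
import bisect

class TinyIndex:
    """Stand-in for an FST index: maps keys to pointers into data storage."""

    def __init__(self, entries):
        # entries: (key, document) pairs with unique keys
        ordered = sorted(entries, key=lambda kv: kv[0])
        self.keys = [k for k, _ in ordered]       # the "index"
        self.storage = [doc for _, doc in ordered]  # the "data storage"

    def get(self, key):
        # Exact match: the index yields a pointer (offset); storage resolves it.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.storage[i]
        return None

    def prefix(self, prefix):
        # Prefix match: a real FST does this by walking the automaton;
        # here we scan the sorted keys starting at the prefix position.
        i = bisect.bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield self.keys[i], self.storage[i]
            i += 1

idx = TinyIndex([("alice123", {"name": "Alice"}),
                 ("alicia9", {"name": "Alicia"}),
                 ("bob42", {"name": "Bob"})])
print(idx.get("alice123"))      # exact lookup for the client's query
print(list(idx.prefix("ali")))  # prefix lookup
```

The key property this mimics is that the index stores only keys plus pointers, while the full documents live in a separate storage layer.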
yazaddaruvala 5 hours ago [-]
What you’ve described is the foundation of Lucene, and as such the foundation of Elasticsearch.
FSTs are “expensive” to rebuild, so rebuilding is typically done “without writes”. The database would therefore need some workaround for the resulting low write throughput.
To save you the time thinking about it: The only extra parts you’re missing are what Lucene calls segments and merge operations. Those decisions obviously have some tradeoffs (in Lucene’s case the tradeoff is CRUD).
There are easily another 100 ways to be creative with these tradeoffs depending on your specific needs. However, I wouldn’t be surprised if the vast majority of databases’ indexing implementations are roughly similar.
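The segments-and-merges workaround can be sketched roughly like this. A hedged illustration only: each "segment" is an immutable sorted dict standing in for a frozen FST, writes accumulate in a mutable buffer, and a merge periodically compacts segments. The class and method names are invented for this sketch and are not Lucene's API.

```python
class SegmentedIndex:
    def __init__(self):
        self.buffer = {}      # mutable write buffer (cheap to update)
        self.segments = []    # immutable sorted segments, newest last

    def put(self, key, value):
        self.buffer[key] = value

    def flush(self):
        # Freeze the buffer into a new immutable segment
        # (this is where an FST would actually be built).
        if self.buffer:
            self.segments.append(dict(sorted(self.buffer.items())))
            self.buffer = {}

    def merge(self):
        # Compact all segments into one; newer segments win on conflicts.
        merged = {}
        for seg in self.segments:
            merged.update(seg)
        self.segments = [dict(sorted(merged.items()))] if merged else []

    def get(self, key):
        # Reads check the buffer first, then segments newest-first.
        if key in self.buffer:
            return self.buffer[key]
        for seg in reversed(self.segments):
            if key in seg:
                return seg[key]
        return None
```

The tradeoff mentioned above shows up here: writes are cheap because they never touch the frozen segments, but reads must consult every segment until a merge compacts them, and updates/deletes only take effect through newer segments shadowing older ones.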
sa-code 4 hours ago [-]
Lucene's WFST is an insanely good and underappreciated in-process key-value store, assuming you're okay with a one-hour lag on your data.
Keyvi is also interesting in this regard