I've decided to add an autocomplete select field to my sample project. It is quite a common requirement, even for small applications. I wanted to find out how to do it in a serverless way while keeping my budget within limits.
I downloaded a list of all Polish streets with 244,226 entries, which I think is an appropriate size for testing response times. Functionality is limited to search only: the user must enter a minimum of 3 characters to trigger a search request, and the results are limited to 10 entries per search term.
My backend is hosted on the AWS cloud. To implement autocomplete there, I would have to add ElasticSearch to my stack. Unfortunately, I am no longer eligible for the free tier for this service. The minimal cost of running the autocomplete app on production and development machines for 1 month is around $60 (micro instances). A proper setup, according to the AWS guidelines, with medium VM instances and 3 availability zones, would cost around $500. This is for PROD only and varies depending on multiple factors. For my sample application, $60 would do it. However, since it would run for only a few minutes per month, it would simply be a waste of money.
I've decided to make it simpler and stay away from the ElasticSearch AWS offering. For street search, it is enough to index all unique prefix terms, starting from 3-character prefixes and going up to 97 characters (the maximum street name length in the data set). Combined with storing keys in lower case and replacing special characters, this should give pretty decent results.
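To illustrate the idea, here is a minimal sketch of how such prefix keys could be built. The lower-casing, replacement of Polish special characters, and the 10-entry limit follow the description above; the exact key/value shape and function names are my assumptions, not the original build script.

```js
// Sketch: build a map of prefix -> up to 10 matching street names.
// `streets` is assumed to be the decompressed list of street names from the data set.
function buildAutocompleteIndex(streets, maxResults = 10) {
  const index = new Map();

  for (const street of streets) {
    // Normalize: lower case and replace Polish special characters.
    const normalized = street
      .toLowerCase()
      .replace(/ą/g, 'a').replace(/ć/g, 'c').replace(/ę/g, 'e')
      .replace(/ł/g, 'l').replace(/ń/g, 'n').replace(/ó/g, 'o')
      .replace(/ś/g, 's').replace(/ź/g, 'z').replace(/ż/g, 'z');

    // Index every prefix from 3 characters up to the full name (max 97 chars).
    for (let len = 3; len <= normalized.length; len++) {
      const key = normalized.slice(0, len);
      const matches = index.get(key) || [];
      if (matches.length < maxResults) {
        matches.push(street);
        index.set(key, matches);
      }
    }
  }

  return index;
}
```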
I was thinking of storing all such terms in DynamoDB and querying against it. That would leave the option of enabling DAX in the future if performance turned out to be too low. Additionally, I could enable caching on the Cloudflare side for all requests going through the autocomplete endpoints. When a request reaches my application, it goes over Cloudflare proxy -> API Gateway -> Lambda -> DynamoDB. None of these steps can be skipped; every keystroke has to go through all application layers. That is both too slow and too expensive, even considering caching on the Cloudflare edge.
Instead, I decided to use Cloudflare Workers KV storage for all my terms. I'm already using Cloudflare Workers to host my web application, so all I needed was a build/deploy setup to migrate data from my data set (a zipped file) into KV storage. To upload the data to KV, I run the wrangler CLI with the bulk upload parameters. More details are available on the official documentation site:
https://developers.cloudflare.com/workers/cli-wrangler

The process is quite simple. First, I run a wrangler CLI command to create a namespace, e.g. with the name AUTOCOMPLETE_KV. The wrangler command output includes a namespace id that can be used to upload data into KV. This configuration must also be included in wrangler.toml (sample below); it is required so that your worker can access KV via the binding name, in my case AUTOCOMPLETE_KV.
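A rough sketch of this setup with the wrangler CLI referenced above; the namespace id and file name are placeholders, and the bulk-upload file layout (an array of key/value objects with a 1-year TTL) is my assumption based on the description in this post:

```sh
# Create the KV namespace; wrangler prints the generated namespace id.
wrangler kv:namespace create "AUTOCOMPLETE_KV"

# Bulk upload the prepared terms. The file is a JSON array of
# { "key", "value", "expiration_ttl" } objects.
wrangler kv:bulk put ./autocomplete-terms.json --namespace-id "<namespace-id>"
```

The wrangler.toml entry binding the namespace to the worker could look roughly like this:

```toml
# wrangler.toml — expose the namespace to the worker under the AUTOCOMPLETE_KV binding.
kv_namespaces = [
  { binding = "AUTOCOMPLETE_KV", id = "<namespace-id-from-wrangler-output>" }
]
```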
Here is a sample of the worker code that looks up the term provided in the URL query param in the AUTOCOMPLETE_KV storage.
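A minimal sketch of such a worker, assuming the query parameter is called `term` and the value stored under each key is a JSON array of up to 10 street names (the parameter name and response shape are my assumptions):

```js
addEventListener('fetch', (event) => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  const url = new URL(request.url);
  // Normalize the term the same way the keys were normalized during upload.
  const term = (url.searchParams.get('term') || '').trim().toLowerCase();

  // Require at least 3 characters before touching the KV store.
  if (term.length < 3) {
    return new Response(JSON.stringify([]), {
      headers: { 'Content-Type': 'application/json' },
    });
  }

  // Each key in AUTOCOMPLETE_KV holds a JSON array of up to 10 street names.
  const matches = await AUTOCOMPLETE_KV.get(term, 'json');

  return new Response(JSON.stringify(matches || []), {
    headers: { 'Content-Type': 'application/json' },
  });
}
```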
Screenshots on this page show the UI design; a live version is available after login. I get quite consistent response times. When a term is frequently accessed, it is available in memory on the Cloudflare edge server and response times are less than 60 ms. In other cases, when a term is not used very often, it takes a maximum of 250 ms from my location. This is quite good for my use case. I didn't notice any slowdowns and the search experience is fine for me. Most importantly, it is a very cheap solution: writing 1M entries fits within the limit of the Workers Bundled plan ($5), and additional usage is very cheap. I set the TTL for my key entries to 1 year, as this is data that doesn't change frequently.
One important note: I have rate limiting enabled for all my /api calls. This also applies here; any user abusing the /api resource will be banned for a fixed amount of time. For obvious reasons, I've increased the number of allowed requests for the autocomplete resource.
To summarize, I'm satisfied with the results. I think I will also use KV storage for general configuration data. Of course, it is a no-go for sensitive data.