How To Build A Scalable Search Engine For A Python Application With Manticore
There are a lot of excellent search tools available, but sometimes you need something a bit more customizable. In this Python tutorial, learn how to add Manticore as a customized search engine for your app.
What Is Manticore?
Manticore is an open source full-text search engine. It has been optimized for speed and comes packed with features for customization and ease of use. You can set it up on most operating systems and cloud platforms quickly, and a default installation needs very little configuration. Because it speaks both the MySQL wire protocol and a JSON-over-HTTP API, it works with any programming language that has a MySQL client or an HTTP library. For example, if you’re using Python and Django, a library such as manticore_django can simplify integration.
It can be used from multiple programming languages such as PHP, Perl and Python, and works alongside common databases such as MySQL, PostgreSQL and MongoDB.
Because Manticore is an open source project that’s completely free to use, you can tailor your deployment to fit your exact needs. In this tutorial, we’ll go through how to build a search engine for a Python app using Django, but you should feel free to follow along if you’re working with another framework or language of your choice. For example, this post shows how to integrate it into the Flask ecosystem.
What You’ll Need
To follow along with this tutorial, you will need the following:
Python 3.5 or later installed on your machine (along with pip). You can check your version with python --version. If needed, install Python from the official Python website.
(Optional) An IDE or code editor of your choice. I recommend Visual Studio Code, but any other code editor should work just as well. For more recommendations, see my article “The Best Code Editors for Linux, Mac & Windows.” Alternatively, you can use a beginner-friendly environment such as Thonny to experiment with Python interactively. For following a tutorial like this one, though, a full IDE or editor is usually the better choice.
Manticore installed on your machine. You can find instructions for installation here, and you can test that it is working by connecting to the running daemon with a MySQL client, for example mysql -h0 -P9306 -e "SHOW TABLES" (9306 is the default port of Manticore's MySQL-protocol listener). If this command does not return an error, then Manticore is up and running properly on your machine.
A local or remote MySQL database (if using a remote one, make sure the connection parameters are set up properly). For more information about setting up MySQL, check out “How To Install A Local MySQL Server On Ubuntu 18.” If you rely on strict-mode SQL queries, make sure sql_mode is set accordingly in /etc/mysql/my.cnf. Also make sure you have a user account with permission to create tables and insert data in that database.
Finally, you’ll need a Django project (or any other project) ready on your machine. If you don’t have one yet, follow these steps:
Install Django with pip install Django (or download the latest release from the official website and install it from the unpacked directory). To start a new project, run django-admin startproject my_project, where my_project is the name of your project directory. Then, from inside my_project, run python manage.py runserver 0.0.0.0:8000, which launches a development server on port 8000, listening on all network interfaces of this machine. You can now reach it in your browser at localhost:8000.
How It Works
When using Manticore, you don’t need any special tools or libraries for indexing. Searching is carried out via a simple HTTP request (typically a POST with a JSON body to the /search endpoint) that returns JSON results. That means you can easily use it together with frameworks such as Django and Flask (as mentioned before), or as part of an endpoint in your microservice architecture. Manticore is designed to be usable from any programming language, not just Python, so the steps we take here apply equally well elsewhere. We are going to explain things in terms of the Python ecosystem and Django because they are relatively easy to follow along with.
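As a rough illustration of that request/response flow, the sketch below builds a search request using only the Python standard library and sends the query as a JSON body in a POST request. It assumes Manticore's HTTP interface is listening on localhost:9308 (the usual default) and that a table called articles exists; the table name and both helper functions are hypothetical names chosen for this example.

```python
import json
import urllib.request

# Assumption: Manticore's HTTP interface listens here (9308 is the usual default).
MANTICORE_URL = "http://localhost:9308/search"


def build_search_request(text, table="articles", limit=10):
    """Build (but do not send) a JSON search request.

    "articles" is a hypothetical table name used only for illustration.
    """
    payload = {
        "index": table,
        "query": {"match": {"*": text}},  # match `text` against all full-text fields
        "limit": limit,
    }
    return urllib.request.Request(
        MANTICORE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text):
    """Send the request to a running Manticore instance and decode the JSON reply."""
    with urllib.request.urlopen(build_search_request(text)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test without a running server; search("cat") only works once Manticore is up and the table has been created.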
To understand how it works, you only need to understand two concepts:
The Query Parser, which breaks up the query string into tokens.
The Analyzer, which groups those tokens into an inverted index for fast search.
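To make the idea of an inverted index concrete, here is a toy version in plain Python. It is only a sketch of the general technique, not how Manticore implements it internally, and the function names and sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def lookup(index, query):
    """Return the ids of documents that contain every token of the query."""
    postings = [set(index.get(token, ())) for token in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {1: "The cat in the hat", 2: "A hat for the dog", 3: "Cat and dog"}
index = build_inverted_index(docs)
print(lookup(index, "cat hat"))  # [1]
print(lookup(index, "dog"))      # [2, 3]
```

Because each token points straight at the documents that contain it, answering a query is a matter of intersecting a few small lists rather than scanning every document.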
Let’s begin by looking at which pieces of the system are involved in making searches happen:
In this article we mainly focus on using Manticore as a daemon with its web interface exposed via HTTP (the figure above). However, if you would like more information about using Manticore within your own code, then have a look at this page. There is also a list of language bindings.
Writing A Query Parser Yourself Is Not Easy!
One of the challenges of using Manticore directly from your code, or from the Python REPL (or any other language), is writing properly structured queries. Handling arbitrary user input well means writing a parser, which can be overwhelming if you are new to programming. For our purposes it does not have to be: we are going to stick to two basic types of queries, simple keyword searches and exact phrase matches. To that end, let’s look at an example query and see how it should be formatted before moving on to create the parser itself. For this section we are going to use HTTP requests via cURL:
If you want more information about cURL, take a look at this guide.
First, extract the query string from the request, which is done by writing some code similar to this:
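A minimal sketch of that step, using only the standard library. The parameter name q and the example URL are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def extract_query(url, param="q"):
    """Return the decoded value of `param` from a request URL, or "" if absent."""
    params = parse_qs(urlsplit(url).query)
    values = params.get(param, [""])
    return values[0].strip()


print(extract_query("http://localhost:8000/search?q=cat+in+the+hat"))
# cat in the hat
```

Inside a Django view the same value is available as request.GET.get("q", ""), so a helper like this is mainly useful when you are handling raw URLs yourself.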
Then you need to break it into tokens, which can be achieved with something like this:
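One possible way to do this is with a single regular expression that recognizes two token types: a double-quoted phrase and a bare word. The token type labels ("phrase" and "word") and the function name are choices made for this sketch.

```python
import re

# A token is either a double-quoted phrase or a run of characters
# that contains neither whitespace nor a double quote.
TOKEN_PATTERN = re.compile(r'"([^"]+)"|([^\s"]+)')


def tokenize_query(query):
    """Split a query into ("phrase", text) and ("word", text) tuples."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(query):
        phrase, word = match.groups()
        if phrase is not None:
            # Collapse repeated whitespace inside the phrase.
            tokens.append(("phrase", " ".join(phrase.split())))
        else:
            tokens.append(("word", word))
    return tokens


print(tokenize_query('"cat in the hat" seuss'))
# [('phrase', 'cat in the hat'), ('word', 'seuss')]
```

An unbalanced quote is simply skipped, so '"cat' yields the single word token cat instead of raising an error.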
Adding quotes and spaces as supported token types adds support for exact phrase matches as well (see Chapter 3 of the “Search Full Text of PDF and HTML in Manticore” guide for more details). The question remains how to check whether a sequence of tokens is an exact phrase match or not. This can be done by comparing the positions at which the tokens occur in the indexed text. For example, let’s say we have a very simple vocabulary consisting of only two words (a and b):
Now you can check whether a query is an exact phrase match by measuring the distance between the positions of its consecutive tokens in the index. If any of those distances is not exactly 1, the tokens are not adjacent and in order, so we know it’s not a phrase match:
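The following sketch shows one way to read this: record the position of every token in a document, then require that each token of the phrase sits exactly one position after the previous one. The sample document and function names are invented for the example, and this is a toy model rather than Manticore's own algorithm.

```python
def positions(tokens):
    """Map each token to the list of positions where it occurs."""
    index = {}
    for pos, token in enumerate(tokens):
        index.setdefault(token, []).append(pos)
    return index


def is_phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur consecutively, in order, in doc_tokens."""
    if not phrase_tokens:
        return False
    index = positions(doc_tokens)
    for start in index.get(phrase_tokens[0], []):
        # Token number `offset` of the phrase must sit at position start + offset,
        # i.e. exactly one position after the token before it.
        if all(start + offset in index.get(token, [])
               for offset, token in enumerate(phrase_tokens)):
            return True
    return False


document = ["a", "b", "a", "a"]  # built from the two-word vocabulary
print(is_phrase_match(document, ["a", "b"]))  # True
print(is_phrase_match(document, ["b", "b"]))  # False
print(is_phrase_match(document, ["a", "a"]))  # True
```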
In this article we are going to implement support for exact phrase matches ourselves. However, I highly recommend that you use Manticore’s built-in support for this instead, as described in Chapter 7 of the “Search Full Text of PDF and HTML in Manticore” guide. There you will also learn how to configure your analyzer to achieve specific results such as stop word removal and stemming, which are important optimization techniques when building search engines. We can start by defining the code to build a query parser as follows:
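The original listing is not reproduced here, so what follows is a minimal sketch of what such a class could look like. The names QueryParser, tokenize and get_token come from the text; build_query, the optional vocabulary filter, and the choice to lowercase everything are assumptions of this sketch.

```python
import re


class QueryParser:
    """Turn raw user input into a query string with double-quoted phrases."""

    # A quoted phrase, or a run of non-space, non-quote characters.
    TOKEN_PATTERN = re.compile(r'"([^"]+)"|([^\s"]+)')

    def __init__(self, vocabulary=None):
        # Optional set of known words; None means every word is accepted.
        self.vocabulary = set(vocabulary) if vocabulary is not None else None

    def get_token(self, phrase, word):
        """Build one token: a quoted, normalized phrase or a lowercased word."""
        if phrase is not None:
            return '"' + " ".join(phrase.lower().split()) + '"'
        return word.lower()

    def tokenize(self, text):
        """Split text into tokens, dropping words outside the vocabulary."""
        tokens = []
        for match in self.TOKEN_PATTERN.finditer(text):
            phrase, word = match.groups()
            if (word is not None and self.vocabulary is not None
                    and word.lower() not in self.vocabulary):
                continue
            tokens.append(self.get_token(phrase, word))
        return tokens

    def build_query(self, text):
        """Join the tokens with spaces; phrases stay wrapped in double quotes."""
        return " ".join(self.tokenize(text))


parser = QueryParser()
print(parser.build_query('Seuss "Cat in  the Hat"'))
# seuss "cat in the hat"
```

Double quotes are the phrase operator in Manticore's full-text query syntax, which is why phrases are emitted that way. A production parser would also need to escape other operator characters in user input, which this sketch does not attempt.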
You’ll notice that I used the class name QueryParser instead of Tokenizer because this one actually parses text into tokens (with help from our vocabulary) and then builds the search query. Let’s go through each method in turn, starting with tokenize():
Tokens can be built from our vocabulary or from individual characters by using the function get_token. The latter is useful for exact phrase matches, which we are going to look at now. If you want more information about how to implement different types of queries such as prefix matching, take a look at Chapter 4 of the “Search Full Text of PDF and HTML in Manticore” guide. For now, let’s look at exact phrase matches:
In this case we are using a simple regular expression to match the tokens we need and then either replace them with themselves or add quotes, depending on whether they contain whitespace. In other words, a single-word keyword is left as it is (e.g. ‘cat’), while an exact phrase that includes spaces gets quotes around it (e.g. “cat in the hat”). This is an extremely simplistic implementation of what is a very important feature in full-text search engines, and I highly recommend you read more about it and implement a proper query parser for your own needs before moving on!
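As a sketch of that replace-or-quote idea, the snippet below uses re.sub with a callback. It assumes the search terms arrive as one comma-separated string, which is a convention chosen for this example only, and the function names are likewise hypothetical.

```python
import re

# Assumption: search terms arrive as one comma-separated string.
TERM_PATTERN = re.compile(r"[^,]+")


def quote_if_phrase(match):
    """Return a single word unchanged; wrap a multi-word term in double quotes."""
    term = " ".join(match.group(0).split())  # trim and collapse whitespace
    return '"{}"'.format(term) if " " in term else term


def format_terms(raw):
    """Rewrite every term in place, then turn the separators into spaces."""
    rewritten = TERM_PATTERN.sub(quote_if_phrase, raw)
    return " ".join(rewritten.replace(",", " ").split())


print(format_terms("cat, cat in the hat"))
# cat "cat in the hat"
```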
You can now start writing queries against your index again:
And you should see the results you expect! Great job!
The next step would be to create a search API. You can achieve this by using one of many popular frameworks out there such as Django, Flask, or Rails. This is an entirely separate topic, so stay tuned for more articles on it in the future! Check my other posts to learn how to do full-text search with Elasticsearch and Manticore. Also check out our new cookbook, which shows examples of usage for real use cases.