Spelling-based Search Engine for Quran. Deployed it, and routed with nginx!

~/blog$ render faiz.blog.spelling-based-search-engine-for-quran-deployed-it-and-routed-with-nginx

Spelling-based Search Engine for Quran. Deployed it, and routed with nginx!

Tuesday, 12 November 2024 03:03:11 WIB | tags: python, search-engine, web | 496 hits | 0 comment(s)

Back in college, I managed to finish my final project in less than two weeks leaving me plenty of free time. Once my friends found out about this, they started knocking at my door, desperate for help with their own projects (okay, a bit dramatic, but you get the idea!). Intrigued with the opportunity to learn about many unknown, scientifically-unique projects, I officially opened a workshop in my house.

I helped both in technical things (re: code) as well as report writing. Disclaimer: as far as I understood, I didn’t cross any ethical lines as most of the time I’d guide them with suggestions, sometimes contribute a bit of code, but never do everything for them, and foremost I made sure my friend understood what they created & wrote. This was a labor of friendship, not profit—my “payment” was often cakes or, more commonly, a mandatory FIFA match as part of the consultation deal. One of the most interesting projects is my wife’s (then girlfriend), which is to build a search engine for Quran verses, but the queries and the verses need to be normalized with Indonesian-spelling first. Made with PHP 6 years ago together with my wife, now, I have decided to re-write it into Python as this is my most starred (re: 2 stars) GitHub repo yet!

The Concept

Arabic Transliteration

The Concept

As mentioned above, the search engine was meant so Indonesian-spelling folks could easily search Quran verses based on how they usually spell it. As Quran verse is in Arabic, (at least in my circle) most of us have different ways of writing the verse in latin-text (say for the Prophet name Mohammad, some have Muhammad, others Muhamad, even Muchamad, but all of them are spelt the almost-same way). Apparently, there are several algorithms to normalize them, one of them is Soundex (which I will discuss in the following section), while for comparing the similarity between the query and the verses, cosine similarity was picked.

Arabic Transliteration

To ensure Arabic transliteration is aligned with the goal of the project, a guideline is followed, which are:

String	Substitute
apostrophe (‘) from ain	ignored
hamzah	not symbolized
sukun hamzah at the end of a word	not symbolized
ya tasyid with prefix letter (kasrah)	ignored
“al” article	ignored/connected
sh & ts	s
dz	z
zh	z
dh	d
th	t
f	p
q	k
sukun ain	k
sukun hamzah	k
diphthongs	u or i
Ikhfa	remove G from the vague NG
Iqlab	substitute NB into MB
Idgham	dissolve N

Soundex Algorithm

Soundex algorithm converts a string into phonetic code to make queries based on spelling possible. Before converting into phonetic code, the string is normalized first. The phonetic codes are mapped below:

Character	Phonetic Code
B, F, P V	1
C, G, J, K, Q, S, X, Z	2
D, T	3
L	4
M, N	5
R	6

Apart from the codes, there are some other rules:

Take the first character as the beginning of phonetic code.
Remove these characters during normalization: A, E, H, I, O, U, W, Y
Remove consecutive characters
Output the first four digits of the codes, if the codes length is less than four, then put 0 as the padding

The implementation sample is:

String: Faiz

Phonetic code: 1200

The problem with the above algorithm is, it is tailored for spelling in the United States, surveyed around 1890 until 1920. Fortunately, Sukma Rahadian (et al.) customized this algorithm for Indonesian spelling:

Character	Phonetic Code
A, E, H, I, O, U, W, Y	/
B, P	0
C, J, S, X, Z	1
D	2
F (V)	3
G, K (Q)	4
L	5
M	6
N	7
R	8
T	9
W	w
Special character	*

Same with the original one, there are also additional rules:

The string has to be normalized first
Take the lowercase first character as the beginning of phonetic code.
Remove consecutive phonetic codes
Ignore character with phonetic code of /
Output the first four digits of the codes, if the codes length is less than four, then put * as the padding

As mentioned, the string has to be normalized once again, this is due to Indonesia's spelling evolution (Indonesian: ejaan lama ke EYD). The normalizations are:

Old Spelling	New Spelling (substituted into this)
AY	AI
AW	AU
OY	OI
DJ	JJ
TJ	CC
SJ	SY
V	F
OE	UU
CK, CQ	KK
J	Y
DZ	ZZ
PH	FF
CH	KF
IE	II
X	KS

Not all of the normalizations are implemented in the project, as it is focused on the Arabic-spelling related.

Cosine Similarity

Cosine similarity is a measurement method to compare a similarity rate between two strings, and in this case, our query and Quran verses. The formula is:

Similarity(x,y): \[\frac{\sum_{i}^{n} x_i y_i }{\sqrt{\sum_{i}^{n} x_i^{2}} * \sqrt{\sum_{i}^{n} y_i^{2}}}\]

With:

x: query

y: Quran verse

In the actual calculation, each unique letter in query and verse is merged into one new list of letters (let’s call this merged_letters). Then, each unique letter in the query is compared with the merged_letters, this is the x value. The same method is repeated for Quran verse and this becomes the y value. Don’t worry if you’re confused, I believe it will be better once we do some examples:

Query: rohim

Unique letters: r, o, h, i, m

Verse: rahiim

Unique letters: r, a, h, i, m

merged_letters: r, o, h, i, m, a

merged_letters	Occurrence in Query	Occurrence in Verse
r	1	1
o	1	0
h	1	1
i	1	2
m	1	1
a	0	1

From the above table, we see that the x value is [1, 1, 1, 1, 1, 0] and y value is [1, 0, 1, 2, 1, 1]. Now, put this into the cosine similarity equation:

Similarity(query,verse): \[\frac{(1*1) + (1*0) + (1*1) + (1*2) + (1*1) + (0*1)}{\sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2} * \sqrt{1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2}}\]

Now we got the similarity value of 79.06%.

The Code

In this rewrite project, I use Python Flask as I am currently in the process of being more fluent in Python, while utilizing Poetry for the dependency management. Re-writing this code by following the legacy code is a pain in the bottom, because my wife and I were terrible coders back then and we made spaghetti code on this project. What made things worse is we put duplicate codes in multiple files and it is very difficult to trace the flow. So instead of just re-writing what we did, I focus on writing codes that follow the intended logic. I follow Flask's best practice for the project structure to make it easier to maintain, while also easier for others to read the code.

    
    py-cari-ayat/
    |-- instance/
    |   `-- cariayat.db
    |-- py_cari_ayat/
    |   |-- ayat/
    |   |   |-- __init__.py
    |   |   |-- cosine_similarity.py
    |   |   `-- routes.py
    |   |-- home/
    |   |   |-- __init__.py
    |   |   `-- routes.py
    |   |-- models/
    |   |   |-- ayat.py
    |   |   |-- base.py
    |   |   |-- kata.py
    |   |   `-- surat.py
    |   |-- static/
    |   |   |-- images/
    |   |   |-- styles/
    |   |   `-- vendor/
    |   |-- surat/
    |   |   |-- __init__.py
    |   |   `-- routes.py
    |   |-- templates/
    |   `-- __init__.py
    |-- config.py
    |-- config.yml
    |-- poetry.lock
    |-- pyproject.toml
    |-- README.md
    `-- requirements.txt

Models

Since the dataset is static, I used SQLite and Flask-SQL Alchemy for the ORM. I separated this into 3 models:

Kata: the most atomic class, has normalized and phonetic_code property
Ayat: the Quran verse class which has a list of Kata(s)
Surat: Quran surah, back-referenced by Ayat

Routers

List of the routers in the project:

Home: handle the home page where user can query directly or access list of Surah page
Surat: handle index of Surah and the individual Surah page
Ayat: handle query and individual verse page

The Search Engine

Concepts from the previous sections are implemented in separate files in the project. The Arabic transliteration and Soundex algorithm, it is handled by Kata model, embedded in the property function:

normalized property: normalize for the Arabic transliteration as well as the Indonesian Soundex
phonetic_code property: return the four digits phonetic code generated by Soundex algorithm

So, how is the process under the hood when a user queries something?

User sends query
Query string is instantiated into Kata class, separated by each word. Since Kata class has normalized and phonetic_code as its property, no further action is required for now.

...
...
...
@property
    def normalized(self) -> str:
        normalized: str = self.kata.lower()
        # Normalisasi huruf ain dan hamzah mati menjadi 'k'
        normalized = re.sub(r"\b'([^aiu])", r'k\1', normalized)
        normalized = re.sub(r'\b`([^aiu])', r'k\1', normalized)
        
        # Penghilangan 'al` hamzah hidup
        normalized = re.sub(r'\bal`', '', normalized)
        # Penghilangan petik
        normalized = normalized.replace("'", '').replace('`', '')
        
        # Alif Lam Syamsiah and tasdid removal
        normalized = re.sub(r'\b(a)l([t,s,d,z,r,d,l,n])', r'\1\2', normalized)
        normalized = re.sub(r'\b([^aiu][aiu])l([t,s,d,z,r,d,l,n])', r'\1\2', normalized)

        # Remove all symbols
        normalized = re.sub(r'[()-`]', '', normalized)
        
        # Other substitutions based on specific phonetic mappings
        substitutions: dict[str, str] = {
            'iyy': 'i', 
            'kh': 'h', 
            'sh': 's', 
            'ts': 's', 
            'sy': 's',
            'dz': 'z', 
            'zh': 'z', 
            'dh': 'd', 
            'th': 't', 
            'q': 'k',
            'aw': 'au', 
            'ay': 'ai', 
            'v': 'f', 
            'p': 'f', 
            'j': 'z',
            'ng': 'n', 
            'nb': 'mb', 
            'ny': 'y', 
            'nw': 'w', 
            'nm': 'm', 
            'nn': 'n', 
            'nl': 'l', 
            'nr': 'r'
        }
        
        for pattern, replacement in substitutions.items():
            normalized = normalized.replace(pattern, replacement)

        return normalized
    
    @property
    def phonetic_code(self) -> str:
        # Reference: TA Sukma Rahadian string matching soundex dan metaphone 113030012 (2007)
        encoding_map: dict[str, str] = {
            'a': '/',
            'e': '/',
            'i': '/',
            'o': '/',
            'u': '/',
            'h': '/',
            'y': '/',
            'b': '0',
            'p': '0',
            'c': '1',
            'j': '1',
            's': '1',
            'x': '1',
            'z': '1',
            'd': '2',
            'f': '3',
            'v': '3',
            'g': '4',
            'k': '4',
            'q': '4',
            'l': '5',
            'm': '6',
            'n': '7',
            'r': '8',
            't': '9',
            'w': 'w'
        }
        
        encoded: str = ''.join(encoding_map.get(char, '') for char in self.normalized)
        result: list[str] = [encoded[0]]
        for i in range(1, len(encoded)):
            if encoded[i] != encoded[i - 1]:
                result.append(encoded[i])
        encoded: str = ''.join(result)

        phonetic_code: str = self.normalized[0] + encoded[1:]
        phonetic_code: str = phonetic_code.replace('/', '') + '***'
        return phonetic_code[:4]

Now the actual similarity checks are starting, in ayat/cosine_similarity.py file.
All Quran verses are fetched by querying all Ayats.

Normalized query’s phonetic code is compared to all Ayat’s Kata, all Ayat’s Kata with similar phonetic code to the query is then stored in a separate list of Kata named similar_ayat_katas.

def get_similar(queries: str) -> list[Any]:
    kata_queries: list[Kata] = [Kata(query) for query in queries.split(' ')]
    queries_normalized: str = ''.join(kata.normalized for kata in kata_queries)
    ayats: list[Ayat] = Ayat.query.all()
    similar_ayat_katas: list[Kata] = []
    results: list[Kata] = []
    for ayat in ayats:
        similar_ayat_katas.extend([
            kata_ayat
            for kata_ayat in ayat.katas
            for kata_query in kata_queries
            if kata_ayat.phonetic_code == kata_query.phonetic_code
        ])
    ...
    ...

The similarity of similar Ayat’s Kata actual word (not phonetic code) is compared to the actual query (also not phonetic code), this is where Cosine Similarity comes up. Ayat’s Kata with a similarity rate equal to or higher than 45% then inserted into results list.

    query_ayat_sets = set(queries_normalized)
       ...
       ...
    for idx, kata in enumerate(similar_ayat_katas):
       ...
       ...
        query_ayat_sets.update(set(kata.normalized))
        similarity_kata: float = calculate_cosine_similarity(list(query_ayat_sets), queries_normalized, kata.normalized)
        if similarity_kata > 45:
            kata.similarity = similarity_kata
            results.append(kata)
       ...
       ...
       
def calculate_cosine_similarity(query_ayat_sets: list[str], query: str, ayat: str) -> float:
    query_similar: list[int] = [
        sum(1 for query_char in query if query_char == merged_char) 
        for merged_char in query_ayat_sets
    ]
    ayat_similar: list[int] = [
        sum(1 for ayat_char in ayat if ayat_char == merged_char) 
        for merged_char in query_ayat_sets
    ]

    bottom_left_formula: float = math.sqrt(sum(val**2 for val in query_similar))
    bottom_right_formula: float = math.sqrt(sum(val**2 for val in ayat_similar))
    bottom_formula: float = bottom_left_formula * bottom_right_formula

    if bottom_formula == 0:
        return 0.00
    top_formula: float = sum(q * a for q, a in zip(query_similar, ayat_similar))

    return round((top_formula / bottom_formula) * 100, 2)

Another action is conducted to check if there a two or more words connected and have a similarity rate equal to or higher than 45%, if yes, include this in the results as well.

    merged_sets = set(queries_normalized)
    merged_words: str = ''
    merged_words_index_start: str = ''
    is_verse_merged: bool = False
    for idx, kata in enumerate(similar_ayat_katas):
        if is_verse_merged == False and (kata.surat_id == similar_ayat_katas[idx -1].surat_id) and (kata.ayat == similar_ayat_katas[idx -1].ayat) and ((int(kata.index)-1) == int(similar_ayat_katas[idx -1].index)):
            merged_words_index_start = str(similar_ayat_katas[idx -1].index)
            merged_words = similar_ayat_katas[idx -1].kata
            is_verse_merged = True
        else:
            merged_sets = set(queries_normalized)
            merged_words: str = ''
            merged_words_index_start: str = ''
            is_verse_merged = False

        if is_verse_merged:
            merged_sets.update(set(kata.normalized))
            merged_words = f'{merged_words} {kata.kata}'
            kata_merged_words: Kata = Kata(merged_words, f'{merged_words_index_start}-{kata.index}', kata.surat, kata.surat_id, kata.ayat)
            similarity_merged: float = calculate_cosine_similarity(list(merged_sets), queries_normalized, kata_merged_words.normalized)
            if similarity_merged > 45:
                kata_merged_words.similarity = similarity_merged
                results.append(kata_merged_words)

Apart from the search engine, I also put a list of Quran Surah features so users can access them directly. Moreover, during the rewrite I found some algorithm errors from the code from six years ago where on the PHP version the xy (or in the legacy code, they are defined as ayatKeyCampur, kataPecahCampur & ayatPecah) keeps appending for all matched words (it should be reset once checking the next word).

Deployment

Since I am not a billionaire at the moment, I currently need to efficiently deploy the search engine under my apps subdomain, in a sub path, on a shared nginx server. However, this limitation apparently became a learning opportunity for me to learn how to configure nginx that serves multiple domains, with each subpath routing to a different process. Before jumping into the nginx configuration, I’d like to share how I serve the Flask web server using gunicorn and run as a service:

[Unit]
Description=Py Cari Ayat
After=network.target

[Service]
User=
Group=
WorkingDirectory=/[private path]/py-cari-ayat
Environment="PATH=/[private path]/py-cari-ayat/bin"
ExecStart=/[private path]/py-cari-ayat/bin/gunicorn --workers 1 --bind 0.0.0.0:[private port] 'py_cari_ayat:create_app()'
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=mixed
TimeoutStopSec=5
PrivateTmp=true

[Install]
WantedBy=multi-user.target

To ensure I sandboxed Python environment, I installed virtualenv on the project directory and set that as the path of the service. Then, install the requirements on the Python on that virtualenv, including the gunicorn. For this project, I only used 1 worker. Once the service is configured and running, I configure my existing nginx (which hosts PHP projects) by adding an additional location block:

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name apps.rahiemy.id www.apps.rahiemy.id;
    root [private path];

    access_log /var/log/[private path].log;
    ssl_certificate /[private path].pem;
    ssl_certificate_key /[private path].pem;

    index index.html index.htm index.php;

    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    location /py-cari-ayat/ {
        proxy_pass http://127.0.0.1:[private port]/py-cari-ayat/;
        proxy_redirect http://127.0.0.1:[private port]/py-cari-ayat http://$host/py-cari-ayat;
        proxy_set_header SCRIPT_NAME /py-cari-ayat;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout  90;
    }

    location ~ ^/hashedvirtual\/([a-zA-Z\/0-9]*)$ {
        rewrite ^/hashedvirtual/([a-zA-Z]*)/?([a-zA-Z]*)?/?([^/.]*)?/?$ /hashedvirtual/index.php?controller=$1&action=$2&id=$3 last;
    }

    ...
    ...
    ...

    location ~ \.php$ {
        include snippets/fastcgi-php.conf;
        fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
     }

    location ~ /\.ht {
        deny all;
    }
}

The search engine is located on location /py-cari-ayat/ block:

proxy_pass directive is used to proxy the request matched with /py-cari-ayat/ to this defined IP and port.
proxy_redirect directive is to use public-facing domain (apps.rahiemy.my.id) on the Location/Refresh header, instead of the local IP and port. This directive also prevent non-stop redirect.
proxy_set_header directive supports Flask deployment on the subpath, with this, `/py-cari-ayat/` is added as url prefix for all Flask routes. Without this the routes will route to root path, for example, the /static will be router to https://apps.rahiemy.id/static instead of https://apps.rahiemy.id/py-cari-ayat/static

As you may see on my configuration file, the nginx contains not only the Flask service proxy but also serve PHP project on my hashedvirtual project (by the way, this is the implementation of my second paper, definitely need to write about this later). After configuring the nginx, voila, my Flask running in a subpath of a subdomain, from a shared nginx server! Pay a visit and try the search engine here https://apps.rahiemy.id/py-cari-ayat!

References

Zulfa Umi Wardani, Moch Arif Bijaksana, and Said Al-Faraby. Analisis Aplikasi Pencarian Berdasarkan Ucapan Ayat Al-Qur’an Berbasis Web Menggunakan Algoritma Soundex dan Metode Cosine Similarity. Tugas Akhir:Universitas Telkom, 2018.
Sukma Rahadian, Rimba Whidiana, and Adiwijaya. Perbandingan pencocokan string nama dalam Bahasa Indonesia berdasarkan kemiripan ucapan (phonetic string matching) soundex dan doublemetaphone. Tugas Akhir:Universitas Telkom, 2007. - openlibrary.telkomuniversity.ac.id

Comments

Be the first to comment!

Give Comments

Name*

Email*

Website

Comment*

* required fields