Spelling-based Search Engine for Quran. Deployed it, and routed with nginx!
Tuesday, 12 November 2024 03:03:11 WIB | tags: python, search-engine, web

Back in college, I managed to finish my final project in less than two weeks, leaving me plenty of free time. Once my friends found out about this, they started knocking at my door, desperate for help with their own projects (okay, a bit dramatic, but you get the idea!). Intrigued by the opportunity to learn about many unfamiliar, scientifically unique projects, I officially opened a workshop in my house.
I helped both with technical things (re: code) and with report writing. Disclaimer: as far as I understood, I didn't cross any ethical lines; most of the time I'd guide them with suggestions, sometimes contribute a bit of code, but never do everything for them, and above all I made sure my friends understood what they created and wrote. This was a labor of friendship, not profit; my "payment" was often cakes or, more commonly, a mandatory FIFA match as part of the consultation deal. One of the most interesting projects was my wife's (then girlfriend's): a search engine for Quran verses, where both the queries and the verses first need to be normalized to Indonesian spelling. We built it in PHP six years ago; now I have decided to rewrite it in Python, as this is my most starred (re: 2 stars) GitHub repo yet!
The Concept
As mentioned above, the search engine was meant to let Indonesian-spelling folks easily search Quran verses based on how they usually spell them. Since Quran verses are in Arabic, most of us (at least in my circle) have different ways of writing a verse in Latin script (take the Prophet's name Mohammad: some write Muhammad, others Muhamad, even Muchamad, yet all of them are pronounced almost the same way). It turns out there are several algorithms to normalize them; one of them is Soundex (which I will discuss in the following section), while for comparing the similarity between the query and the verses, cosine similarity was picked.
Arabic Transliteration
To ensure the Arabic transliteration is aligned with the goal of the project, a set of guidelines is followed:
| String | Substitute |
| --- | --- |
| apostrophe (') from ain | ignored |
| hamzah | not symbolized |
| sukun hamzah at the end of a word | not symbolized |
| ya tasydid with prefix letter (kasrah) | ignored |
| "al" article | ignored/connected |
| sh & ts | s |
| dz | z |
| zh | z |
| dh | d |
| th | t |
| f | p |
| q | k |
| sukun ain | k |
| sukun hamzah | k |
| diphthongs | u or i |
| Ikhfa | remove G from the vague NG |
| Iqlab | substitute NB with MB |
| Idgham | dissolve N |
Soundex Algorithm
The Soundex algorithm converts a string into a phonetic code, making spelling-based queries possible. Before being converted into a phonetic code, the string is normalized first. The phonetic codes are mapped below:
| Character | Phonetic Code |
| --- | --- |
| B, F, P, V | 1 |
| C, G, J, K, Q, S, X, Z | 2 |
| D, T | 3 |
| L | 4 |
| M, N | 5 |
| R | 6 |
Apart from the codes, there are some other rules:
- Take the first character as the beginning of the phonetic code.
- Remove these characters during normalization: A, E, H, I, O, U, W, Y.
- Collapse consecutive duplicate characters.
- Output the first four digits of the code; if the code is shorter than four digits, pad it with 0.
A sample run:

String: Faiz → Phonetic code: 1200
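The rules above can be sketched in a few lines of Python. This is a minimal sketch following the listed rules; note that, as in the sample, the first letter is encoded as a digit rather than kept verbatim:

```python
def soundex(word: str) -> str:
    # Phonetic code table; A, E, H, I, O, U, W, Y are simply dropped
    mapping = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
               'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
               's': '2', 'x': '2', 'z': '2',
               'd': '3', 't': '3', 'l': '4', 'm': '5', 'n': '5', 'r': '6'}
    codes = [mapping[c] for c in word.lower() if c in mapping]
    # Collapse consecutive duplicate codes
    deduped = [c for i, c in enumerate(codes) if i == 0 or c != codes[i - 1]]
    # Pad with 0 and keep the first four digits
    return (''.join(deduped) + '0000')[:4]

print(soundex('Faiz'))       # 1200, matching the sample above
print(soundex('Muhammad') == soundex('Muhamad'))  # True: variant spellings collide
```

The collision between Muhammad and Muhamad is exactly the behavior the search engine relies on.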
The problem with the algorithm above is that it is tailored to spelling in the United States, based on data surveyed from around 1890 until 1920. Fortunately, Sukma Rahadian et al. adapted this algorithm for Indonesian spelling:
| Character | Phonetic Code |
| --- | --- |
| A, E, H, I, O, U, Y | / |
| B, P | 0 |
| C, J, S, X, Z | 1 |
| D | 2 |
| F (V) | 3 |
| G, K (Q) | 4 |
| L | 5 |
| M | 6 |
| N | 7 |
| R | 8 |
| T | 9 |
| W | w |
| Special character | * |
As with the original, there are additional rules:
- The string has to be normalized first.
- Take the lowercase first character as the beginning of the phonetic code.
- Collapse consecutive duplicate phonetic codes.
- Ignore characters with the phonetic code /.
- Output the first four digits of the code; if the code is shorter than four digits, pad it with *.
As mentioned, the string has to be normalized once again; this is due to the evolution of Indonesian spelling (Indonesian: ejaan lama ke EYD, i.e. from the old spelling to the Enhanced Spelling System). The normalizations are:
| Old Spelling | New Spelling (substituted into this) |
| --- | --- |
| AY | AI |
| AW | AU |
| OY | OI |
| DJ | J |
| TJ | C |
| SJ | SY |
| V | F |
| OE | U |
| CK, CQ | K |
| J | Y |
| DZ | Z |
| PH | F |
| CH | KH |
| IE | I |
| X | KS |
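A minimal sketch of how this old-to-new substitution could work. `normalize_old_spelling` is a hypothetical helper, not the project's code, and the mapping covers only a subset of the table; a single-pass regex is used so that, for example, a DJ already substituted into J is not then re-matched by the J-to-Y rule:

```python
import re

# Assumed subset of the old-to-new spelling table
OLD_TO_NEW = {'dj': 'j', 'tj': 'c', 'sj': 'sy', 'ch': 'kh', 'oe': 'u',
              'j': 'y', 'v': 'f', 'x': 'ks'}
# Longer patterns first so 'dj' wins over bare 'j'
_PATTERN = re.compile('|'.join(sorted(OLD_TO_NEW, key=len, reverse=True)))

def normalize_old_spelling(text: str) -> str:
    # Replace every match in a single left-to-right pass
    return _PATTERN.sub(lambda m: OLD_TO_NEW[m.group(0)], text.lower())

print(normalize_old_spelling('Soerabaja'))  # surabaya
print(normalize_old_spelling('tjahaja'))    # cahaya
```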
Not all of these normalizations are implemented in the project, as it focuses on Arabic-related spelling.
Cosine Similarity
Cosine similarity is a method for measuring the similarity between two strings; in this case, our query and the Quran verses. The formula is:

\[Similarity(x,y) = \frac{\sum_{i}^{n} x_i y_i}{\sqrt{\sum_{i}^{n} x_i^{2}} \cdot \sqrt{\sum_{i}^{n} y_i^{2}}}\]
with:
- x: query
- y: Quran verse
In the actual calculation, the unique letters of the query and the verse are merged into one new list of letters (let's call this `merged_letters`). Then the occurrence of each letter of `merged_letters` in the query is counted; this gives the x values. The same method is repeated for the Quran verse, which gives the y values. Don't worry if you're confused; it will be clearer once we do an example:
- Query: rohim
- Unique letters: r, o, h, i, m
- Verse: rahiim
- Unique letters: r, a, h, i, m
- merged_letters: r, o, h, i, m, a
| merged_letters | Occurrence in Query | Occurrence in Verse |
| --- | --- | --- |
| r | 1 | 1 |
| o | 1 | 0 |
| h | 1 | 1 |
| i | 1 | 2 |
| m | 1 | 1 |
| a | 0 | 1 |
From the table above, we see that x is [1, 1, 1, 1, 1, 0] and y is [1, 0, 1, 2, 1, 1]. Now, put this into the cosine similarity equation:

Similarity(query,verse): \[\frac{(1 \cdot 1) + (1 \cdot 0) + (1 \cdot 1) + (1 \cdot 2) + (1 \cdot 1) + (0 \cdot 1)}{\sqrt{1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2} \cdot \sqrt{1^2 + 0^2 + 1^2 + 2^2 + 1^2 + 1^2}}\]
This gives a similarity value of 79.06%.
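The worked example above can be checked with a few lines of Python:

```python
import math

# Occurrence vectors over merged_letters [r, o, h, i, m, a]
x = [1, 1, 1, 1, 1, 0]  # query "rohim"
y = [1, 0, 1, 2, 1, 1]  # verse "rahiim"

dot = sum(a * b for a, b in zip(x, y))  # numerator: 5
norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
similarity = round(dot / norm * 100, 2)
print(similarity)  # 79.06
```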
The Code
For this rewrite, I use Python Flask, as I am currently in the process of becoming more fluent in Python, with Poetry for dependency management. Rewriting this code by following the legacy code was a pain in the backside, because my wife and I were terrible coders back then and turned this project into spaghetti code. What made things worse is that we duplicated code across multiple files, so the flow is very difficult to trace. So instead of just rewriting what we did, I focused on writing code that follows the intended logic. I follow Flask's best practices for the project structure, making it easier to maintain and easier for others to read the code.
py-cari-ayat/
|-- instance/
| `-- cariayat.db
|-- py_cari_ayat/
| |-- ayat/
| | |-- __init__.py
| | |-- cosine_similarity.py
| | `-- routes.py
| |-- home/
| | |-- __init__.py
| | `-- routes.py
| |-- models/
| | |-- ayat.py
| | |-- base.py
| | |-- kata.py
| | `-- surat.py
| |-- static/
| | |-- images/
| | |-- styles/
| | `-- vendor/
| |-- surat/
| | |-- __init__.py
| | `-- routes.py
| |-- templates/
| `-- __init__.py
|-- config.py
|-- config.yml
|-- poetry.lock
|-- pyproject.toml
|-- README.md
`-- requirements.txt
Models
Since the dataset is static, I used SQLite with Flask-SQLAlchemy as the ORM. I separated this into three models:
- Kata: the most atomic class; has `normalized` and `phonetic_code` properties
- Ayat: the Quran verse class, which holds a list of Kata(s)
- Surat: the Quran surah, back-referenced by Ayat
Routers
List of the routers in the project:
- Home: handles the home page, where users can query directly or access the list-of-Surah page
- Surat: handles the Surah index and the individual Surah pages
- Ayat: handles queries and the individual verse pages
The Search Engine
Concepts from the previous sections are implemented in separate files in the project. The Arabic transliteration and the Soundex algorithm are handled by the `Kata` model, embedded in its property functions:
- `normalized` property: normalizes for the Arabic transliteration as well as the Indonesian Soundex
- `phonetic_code` property: returns the four-digit phonetic code generated by the Soundex algorithm
So, how is the process under the hood when a user queries something?
- The user sends a query.
- The query string is split into words, and each word is instantiated as a `Kata`. Since the `Kata` class exposes `normalized` and `phonetic_code` as properties, no further action is required for now.
...
...
...
@property
def normalized(self) -> str:
    normalized: str = self.kata.lower()
    # Normalize sukun ain and sukun hamzah into 'k'
    normalized = re.sub(r"\b'([^aiu])", r'k\1', normalized)
    normalized = re.sub(r'\b`([^aiu])', r'k\1', normalized)
    # Remove the 'al' article followed by a living hamzah
    normalized = re.sub(r'\bal`', '', normalized)
    # Remove quote marks
    normalized = normalized.replace("'", '').replace('`', '')
    # Alif Lam Syamsiah and tasydid removal
    normalized = re.sub(r'\b(a)l([tsdzrln])', r'\1\2', normalized)
    normalized = re.sub(r'\b([^aiu][aiu])l([tsdzrln])', r'\1\2', normalized)
    # Remove all symbols (the dash is placed last so it is not read as a range)
    normalized = re.sub(r'[()`-]', '', normalized)
    # Other substitutions based on specific phonetic mappings
    substitutions: dict[str, str] = {
        'iyy': 'i',
        'kh': 'h',
        'sh': 's',
        'ts': 's',
        'sy': 's',
        'dz': 'z',
        'zh': 'z',
        'dh': 'd',
        'th': 't',
        'q': 'k',
        'aw': 'au',
        'ay': 'ai',
        'v': 'f',
        'p': 'f',
        'j': 'z',
        'ng': 'n',
        'nb': 'mb',
        'ny': 'y',
        'nw': 'w',
        'nm': 'm',
        'nn': 'n',
        'nl': 'l',
        'nr': 'r'
    }
    for pattern, replacement in substitutions.items():
        normalized = normalized.replace(pattern, replacement)
    return normalized
@property
def phonetic_code(self) -> str:
    # Reference: TA Sukma Rahadian string matching soundex dan metaphone 113030012 (2007)
    encoding_map: dict[str, str] = {
        'a': '/',
        'e': '/',
        'i': '/',
        'o': '/',
        'u': '/',
        'h': '/',
        'y': '/',
        'b': '0',
        'p': '0',
        'c': '1',
        'j': '1',
        's': '1',
        'x': '1',
        'z': '1',
        'd': '2',
        'f': '3',
        'v': '3',
        'g': '4',
        'k': '4',
        'q': '4',
        'l': '5',
        'm': '6',
        'n': '7',
        'r': '8',
        't': '9',
        'w': 'w'
    }
    encoded: str = ''.join(encoding_map.get(char, '') for char in self.normalized)
    # Collapse consecutive duplicate codes
    result: list[str] = [encoded[0]]
    for i in range(1, len(encoded)):
        if encoded[i] != encoded[i - 1]:
            result.append(encoded[i])
    encoded = ''.join(result)
    # Keep the first letter itself, drop the '/' codes, then pad with * up to four characters
    phonetic_code: str = self.normalized[0] + encoded[1:]
    phonetic_code = phonetic_code.replace('/', '') + '***'
    return phonetic_code[:4]
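To see the effect, here is the same logic as a standalone function — a hypothetical `indonesian_soundex` mirroring the property above, for words that need no further normalization:

```python
def indonesian_soundex(normalized: str) -> str:
    # Indonesian Soundex code table (Sukma Rahadian et al.)
    encoding_map = {'a': '/', 'e': '/', 'i': '/', 'o': '/', 'u': '/', 'h': '/', 'y': '/',
                    'b': '0', 'p': '0', 'c': '1', 'j': '1', 's': '1', 'x': '1', 'z': '1',
                    'd': '2', 'f': '3', 'v': '3', 'g': '4', 'k': '4', 'q': '4',
                    'l': '5', 'm': '6', 'n': '7', 'r': '8', 't': '9', 'w': 'w'}
    encoded = ''.join(encoding_map.get(c, '') for c in normalized)
    # Collapse consecutive duplicate codes
    deduped = encoded[0] + ''.join(encoded[i] for i in range(1, len(encoded))
                                   if encoded[i] != encoded[i - 1])
    # Keep the first letter itself, drop '/' codes, pad with *
    return ((normalized[0] + deduped[1:]).replace('/', '') + '***')[:4]

print(indonesian_soundex('rohim'))   # r6**
print(indonesian_soundex('rahiim'))  # r6**: both spellings collide, so they match
```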
- Now the actual similarity check starts, in the `ayat/cosine_similarity.py` file.
- All Quran verses are fetched by querying all `Ayat`s.
- The normalized query's phonetic code is compared with every `Kata` of every `Ayat`; all `Kata`s whose phonetic code matches the query's are stored in a separate list of `Kata`s named `similar_ayat_katas`.
def get_similar(queries: str) -> list[Any]:
kata_queries: list[Kata] = [Kata(query) for query in queries.split(' ')]
queries_normalized: str = ''.join(kata.normalized for kata in kata_queries)
ayats: list[Ayat] = Ayat.query.all()
similar_ayat_katas: list[Kata] = []
results: list[Kata] = []
for ayat in ayats:
similar_ayat_katas.extend([
kata_ayat
for kata_ayat in ayat.katas
for kata_query in kata_queries
if kata_ayat.phonetic_code == kata_query.phonetic_code
])
...
...
- The actual words (not the phonetic codes) of the similar `Kata`s are then compared with the actual query (also not its phonetic code); this is where cosine similarity comes in. `Kata`s with a similarity rate above 45% are inserted into the `results` list.
query_ayat_sets = set(queries_normalized)
...
...
for idx, kata in enumerate(similar_ayat_katas):
...
...
query_ayat_sets.update(set(kata.normalized))
similarity_kata: float = calculate_cosine_similarity(list(query_ayat_sets), queries_normalized, kata.normalized)
if similarity_kata > 45:
kata.similarity = similarity_kata
results.append(kata)
...
...
def calculate_cosine_similarity(query_ayat_sets: list[str], query: str, ayat: str) -> float:
query_similar: list[int] = [
sum(1 for query_char in query if query_char == merged_char)
for merged_char in query_ayat_sets
]
ayat_similar: list[int] = [
sum(1 for ayat_char in ayat if ayat_char == merged_char)
for merged_char in query_ayat_sets
]
bottom_left_formula: float = math.sqrt(sum(val**2 for val in query_similar))
bottom_right_formula: float = math.sqrt(sum(val**2 for val in ayat_similar))
bottom_formula: float = bottom_left_formula * bottom_right_formula
if bottom_formula == 0:
return 0.00
top_formula: float = sum(q * a for q, a in zip(query_similar, ayat_similar))
return round((top_formula / bottom_formula) * 100, 2)
- A further check is conducted for two or more connected words whose combination has a similarity rate above 45%; if so, the combination is included in `results` as well.
merged_sets = set(queries_normalized)
merged_words: str = ''
merged_words_index_start: str = ''
is_verse_merged: bool = False
for idx, kata in enumerate(similar_ayat_katas):
    previous: Kata = similar_ayat_katas[idx - 1]
    # Start a merge when the current kata directly follows the previous one in the
    # same surah and verse (idx > 0 guards against wrapping around to the last element)
    if (not is_verse_merged and idx > 0
            and kata.surat_id == previous.surat_id
            and kata.ayat == previous.ayat
            and int(kata.index) - 1 == int(previous.index)):
        merged_words_index_start = str(previous.index)
        merged_words = previous.kata
        is_verse_merged = True
    else:
        merged_sets = set(queries_normalized)
        merged_words = ''
        merged_words_index_start = ''
        is_verse_merged = False
    if is_verse_merged:
        merged_sets.update(set(kata.normalized))
        merged_words = f'{merged_words} {kata.kata}'
        kata_merged_words: Kata = Kata(merged_words, f'{merged_words_index_start}-{kata.index}', kata.surat, kata.surat_id, kata.ayat)
        similarity_merged: float = calculate_cosine_similarity(list(merged_sets), queries_normalized, kata_merged_words.normalized)
        if similarity_merged > 45:
            kata_merged_words.similarity = similarity_merged
            results.append(kata_merged_words)
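The merge above walks `similar_ayat_katas` in order and glues together words whose indices are consecutive within the same verse. A simplified standalone sketch of that idea, using hypothetical `(index, word)` tuples instead of `Kata` objects:

```python
def merge_consecutive(matches: list[tuple[int, str]]) -> list[str]:
    """Group matched words whose verse indices are consecutive, returning merged phrases."""
    if not matches:
        return []
    groups: list[list[tuple[int, str]]] = [[matches[0]]]
    for prev, cur in zip(matches, matches[1:]):
        if cur[0] == prev[0] + 1:
            groups[-1].append(cur)   # continues the current run
        else:
            groups.append([cur])     # run broken: start a new group
    # Only runs of two or more words form a merged candidate
    return [' '.join(word for _, word in group) for group in groups if len(group) > 1]

print(merge_consecutive([(3, 'bismi'), (4, 'allahi'), (7, 'rahim')]))  # ['bismi allahi']
```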
Apart from the search engine, I also added a list-of-Surah feature so users can access verses directly. Moreover, during the rewrite I found some algorithmic errors in the code from six years ago: in the PHP version, the xy values (defined in the legacy code as `ayatKeyCampur`, `kataPecahCampur` & `ayatPecah`) keep appending across all matched words, when they should be reset before checking the next word.
Deployment
Since I am not a billionaire at the moment, I currently need to deploy the search engine efficiently under my `apps` subdomain, on a sub-path, on a shared nginx server. However, this limitation became a learning opportunity: configuring nginx to serve multiple domains, with each sub-path routed to a different process. Before jumping into the nginx configuration, I'd like to share how I serve the Flask web server using gunicorn, run as a systemd service:
[Unit]
Description=Py Cari Ayat
After=network.target
[Service]
User=
Group=
WorkingDirectory=/[private path]/py-cari-ayat
Environment="PATH=/[private path]/py-cari-ayat/bin"
ExecStart=/[private path]/py-cari-ayat/bin/gunicorn --workers 1 --bind 0.0.0.0:[private port] 'py_cari_ayat:create_app()'
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=mixed
TimeoutStopSec=5
PrivateTmp=true
[Install]
WantedBy=multi-user.target
To sandbox the Python environment, I created a virtualenv in the project directory and set it as the PATH of the service. Then I installed the requirements into that virtualenv, including gunicorn. For this project, I only use one worker. Once the service was configured and running, I configured my existing nginx (which hosts PHP projects) by adding an additional location block:
server {
listen 443 ssl;
listen [::]:443 ssl;
server_name apps.rahiemy.id www.apps.rahiemy.id;
root [private path];
access_log /var/log/[private path].log;
ssl_certificate /[private path].pem;
ssl_certificate_key /[private path].pem;
index index.html index.htm index.php;
location / {
try_files $uri $uri/ /index.php?$args;
}
location /py-cari-ayat/ {
proxy_pass http://127.0.0.1:[private port]/py-cari-ayat/;
proxy_redirect http://127.0.0.1:[private port]/py-cari-ayat http://$host/py-cari-ayat;
proxy_set_header SCRIPT_NAME /py-cari-ayat;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 90;
}
location ~ ^/hashedvirtual\/([a-zA-Z\/0-9]*)$ {
rewrite ^/hashedvirtual/([a-zA-Z]*)/?([a-zA-Z]*)?/?([^/.]*)?/?$ /hashedvirtual/index.php?controller=$1&action=$2&id=$3 last;
}
...
...
...
location ~ \.php$ {
include snippets/fastcgi-php.conf;
fastcgi_pass unix:/var/run/php/php8.1-fpm.sock;
}
location ~ /\.ht {
deny all;
}
}
The search engine lives in the `location /py-cari-ayat/` block:
- The `proxy_pass` directive proxies requests matching `/py-cari-ayat/` to the defined IP and port.
- The `proxy_redirect` directive makes the `Location`/`Refresh` headers use the public-facing domain (apps.rahiemy.id) instead of the local IP and port. This directive also prevents endless redirects.
- The `proxy_set_header SCRIPT_NAME` directive supports deploying Flask on a sub-path: with it, `/py-cari-ayat` is added as the URL prefix for all Flask routes. Without it, the routes would resolve against the root path; for example, `/static` would route to https://apps.rahiemy.id/static instead of https://apps.rahiemy.id/py-cari-ayat/static.
As you may see in the configuration file, nginx serves not only the Flask proxy but also a PHP project, `hashedvirtual` (by the way, this is the implementation of my second paper; I definitely need to write about it later). After configuring nginx, voila: my Flask app runs on a sub-path of a subdomain, behind a shared nginx server! Pay a visit and try the search engine here: https://apps.rahiemy.id/py-cari-ayat
References
- Zulfa Umi Wardani, Moch Arif Bijaksana, and Said Al-Faraby. Analisis Aplikasi Pencarian Berdasarkan Ucapan Ayat Al-Qur'an Berbasis Web Menggunakan Algoritma Soundex dan Metode Cosine Similarity. Tugas Akhir, Universitas Telkom, 2018.
- Sukma Rahadian, Rimba Whidiana, and Adiwijaya. Perbandingan pencocokan string nama dalam Bahasa Indonesia berdasarkan kemiripan ucapan (phonetic string matching) soundex dan double metaphone. Tugas Akhir, Universitas Telkom, 2007. openlibrary.telkomuniversity.ac.id