1. Index design

Elasticsearch is a document-oriented database that uses an inverted index. An inverted index is an annotated data structure that stores a mapping from content, such as words, numbers, or phrases, to its locations in a document or a collection. Inverted indexing stores a list of documents per token and retrieve all documents containing the query token in a fast time. There are two common indexing strategies: BSBI, which is faster and adds terms directly to the postings list, and SPIMI, which is slower and collects term-document ID pairs, sorts them, and then creates postings list. Elasticsearch is also a non-relational database, which is unlike the SQL database. It stores data without functional dependencies or any unique constraints such as primary key and foreign key, but each document will have an _id that identifies the document. The _id will be generated automatically for each document by Elasticsearch

In Elasticsearch, an index is similar to a database in a relational database where we should load the data and a type is like a table in a relational database, which contains fields (column in a relational database) and documents (row in a relational database). According to the question, we could create a ‘companies’ index and the fields are divided into several parts: ‘company’, ‘city’, ‘state’, ‘country’, ‘description’, ’employees’, ‘founded’, ‘annual revenue’, etc. The fields are key-value pairs, where the value can be any basic data type (e.g., integer, string) or a nested data structure (e.g., array, dictionary, object, etc.). The ‘company’, ‘city’, ‘state’, ‘country, and ‘description’ fields are strings, the ’employee’, ‘founded’, and ‘annual revenue’ fields are numbers. The documents contain eight company JSON objects. Mapping is like a ‘Schema’ in a relational database, which defines each type within the index. Each document is stored in a primary shard and by default, an index has 5 primary shards. Each primary shard can have zero or more replica shards, which are copies of the primary shard. The mapping of the index we created could be generated automatically when our loaded documents are indexed.

1. Index creation

We use elasticsearch_loader to create a new index called companies and we import the company JSON object file which is a local JSON file to this Elasticsearch index.

!elasticsearch_loader --index companies json companies.json

!elasticsearch_loader --index companies json companies.json {'index': 'companies', 'bulk_size': 500, 'es_host': ('http://localhost:9200',), 'verify_certs': False, 'use_ssl': False, 'ca_certs': None, 'http_auth': None, 'delete': False, 'update': False, 'progress': False, 'type': '_doc', 'id_field': None, 'as_child': False, 'with_retry': False, 'index_settings_file': None, 'timeout': 10.0, 'encoding': 'utf-8', 'keys': [], 'es_conn': <Elasticsearch([{'host': 'localhost', 'port': 9200}])>} /Users/janetlin/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/base.py:200: ElasticsearchWarning: [types removal] Specifying types in bulk requests is deprecated. warnings.warn(message, category=ElasticsearchWarning)

We could see in Kibana that data has been loaded successfully.

2. Index connection

We establish a connection using elasticsearch-dsl and we build the client utilizing elasticsearch-dsl, which creates the connection between elasticsearch and jupyter notebook.

from elasticsearch_dsl import connections
client = connections.create_connection()

3. Elasticsearch Queries

1. Which company is the oldest?

We search the companies index and in order to find the oldest company, we use the ‘founded’ field to find the minimum value. We use the aggregation command ‘.aggs.metric’ and ‘min’ to find the minimum value in the ‘founded’ field, and then we use a query with ‘match’ to find the company which has the minimum founded year. Finally, the code prints only the company names and the founded year meeting this criteria

from elasticsearch_dsl import Search, Q

s = Search(using = client, index = "companies")
s.aggs.metric("earliest_found", "min", field = 'founded')
response = s.execute()
s = Search(using = client, index = "companies").query("match", founded = response.aggregations.earliest_found.value)

response = s.execute() # by default you get 10 hits
for hit in response:
    print("The oldest company is: " + str(hit.company) + ". The founded year for this company is: " + str(hit.founded))

The oldest company is: Daimler AG. The founded year for this company is: 1926

We sort all the companies by their founded years in ascending order, the first company which is also the oldest one is Daimler AG.

2. Which company’s product is a search engine?

We search the companies index and use the query method ‘match’ to find companies whose description field contains the word “search engine”. The code will return all results and only print the company names meeting this criteria

from elasticsearch_dsl import Search

s = Search(using = client, index = "companies").query("match", description = "search engine")  
response = s.execute() # by default you get 10 hits

for hit in response:
    print("The company whose products contain a search engine is:", hit.company)

The company whose products contain a search engine is: Google

We find the description field contains the word “search engine” and the company meeting this criteria is Google.

3. Which companies are headquartered in the USA?

We search the companies index and use the query method ‘match’ to find companies whose country field contains the word “USA”. The code will return all results and only print only the company names meeting this criteria

from elasticsearch_dsl import Search

s = Search(using = client, index = "companies").query("match", country = "USA")  
response = s.execute() # by default you get 10 hits

for hit in response:
    print(hit.company, "is headquatered in the USA")

Apple is headquartered in the USA
Microsoft is headquartered in the USA
Amazon is headquartered in the USA
Google is headquartered in the USA

We find the country field contains the word “USA” and the companies meeting this criteria are Apple, Microsoft, Amazon, and Google.

4. Which companies have been founded in the 1990s?

We use the Elasticsearch client to search the companies index and return the companies whose founded year is greater than or equal to 1990 and less than or equal to 1999. Then the code print only the companies’ name and the founded year.

TIME_FROM = 1990
TIME_TO   = 1999
s = Search(using = client, index = "companies") \
        .filter("range", founded ={'gte': TIME_FROM, 'lte': TIME_TO})
#By default we get 10 records as a response
response = s.execute()
for hit in response:
    print("The compnay's name is: " + str(hit.company) + ". The founded year is: "+ str(hit.founded))

The company’s name: Amazon. The founded year is: 1994
The company’s name: Google. The founded year is: 1998
The company’s name: Alibaba. The founded year is: 1999

We find companies whose founded year is greater than or equal to 1990 and less than or equal to 1999. The companies meeting this criteria are Amazon, Google, and Alibaba.

5. What is the total annual revenue of the three companies in Asia?

First, we use the Elasticsearch client to search the companies index, and then we use ‘Q’ and ‘or’ command to return the companies whose country is Asia country (China, Japan, and South Korea). Then we use the aggregation command ‘.aggs.metric’ and ‘sum’ to get the sum of these companies’ annual revenue and assign this value to the variable “total_revenue”. Finally, the code print the total annual revenue of the three companies in Asia.

from elasticsearch_dsl import Search, Q

total_revenue = 0
s = Search(using=client, index="companies")
s.query = Q("match", country = 'China') | Q("match", country = 'Japan') | Q("match", country = 'South Korea')
s.aggs.metric("total_revenue", "sum", field = "annual_revenue")
response = s.execute() # by default you get 10 hits

total_revenue = response.aggregations.total_revenue.value
print("The total annual revenue of the three companies in Asia is", total_revenue, "billion dollars")

The total annual revenue of the three companies in Asia is 358.0 billion dollars

We find three Asia companies which are Alibaba, Sony, and Samsung Electronics. Their annual revenues are 72, 76, and 210 and the total revenue is 358 billion dollars.

4. Drop the companies index

client.indices.delete(index='companies', ignore=[400, 404])
print("Companies index is dropped!")

Companies index is dropped!

HTML File Download: NoSQL Elastic Search FULL FILE

Category
other data projects

NoSQL Elasticsearch