
[Practical Drill] Learning SPARQL by Doing: Mastering the Wikidata Query Service

About this article

Target audience: Beginners to SPARQL and the Wikidata Query Service
Objective: To be able to solve practice problems on your own by referring to examples
Site used: Wikidata Query Service (https://query.wikidata.org/)


Introduction

Hello! In this article, I have created a drill to help you learn the basics of "SPARQL," the standard query language for the Semantic Web, by actually running queries on the Wikidata Query Service.

If you feel that "learning through theory alone doesn't quite click," try looking at the examples first, then write and execute your own queries in the subsequent practice problems. You should be able to get a feel for how to handle graph databases. Please try to complete it to the end!

Review of Prerequisites: Basics of SPARQL and RDF

Before learning SPARQL, let's briefly review RDF, the data model that our queries run against.

While the relational databases (RDBs) we use every day manage data in tables, graph databases like Wikidata manage data in a model called RDF (Resource Description Framework).

The basis of RDF is expressing data as a combination of three elements (a Triple): "Subject," "Predicate," and "Object."

  • Subject: The thing being described (e.g., Japan)

  • Predicate: The attribute or relationship of the subject (e.g., has capital)

  • Object: The value of the attribute, or the related thing (e.g., Tokyo)

Writing a SPARQL query is like solving a "Subject-Predicate-Object" puzzle: you write the triple with the unknown parts replaced by variables, and the query service finds the values that fit.


SPARQL Core Syntax 1: SELECT and WHERE (Basic Graph Pattern)

The most fundamental syntax consists of SELECT and WHERE.

  • SELECT: Lists the variables to display in the final result. A variable is written with a ? before its name.

  • WHERE: Describes, inside { }, the triple patterns that the data must match.

Prerequisite knowledge: In Wikidata, entities (subjects and objects) have unique IDs of the form Qxxx, written with the wd: prefix (wd:Qxxx), and properties (predicates) have unique IDs of the form Pxxx, written with the wdt: prefix (wdt:Pxxx).

Example)

  • wd:Q6256 = Country

  • wd:Q50337 = Prefectures of Japan

  • wdt:P31 = instance of

These IDs can be looked up on the Wikidata website (https://www.wikidata.org/?uselang=en), but the IDs needed for this drill are provided with each problem.
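As a small illustration, the "Japan / capital / Tokyo" triple from the RDF review can be turned into a query by fixing the subject and predicate and leaving the object as a variable. This is only a sketch: the IDs wd:Q17 (Japan) and wdt:P36 (capital) are not part of this drill's ID list and are assumed here from Wikidata.

```sparql
# Assumed IDs (not listed in this drill): wd:Q17 = Japan, wdt:P36 = capital
# Subject and predicate are fixed; the object is left open as ?capital.
SELECT ?capital WHERE {
  wd:Q17 wdt:P36 ?capital .
}
```

Whatever entity fills the object position of a matching triple is returned as ?capital.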

[Query Example 1]

To retrieve a list of "countries of the world," write it as follows.

Prerequisite knowledge: wd:Q6256 = country, wdt:P31 = instance of

SELECT ?country WHERE {
  ?country wdt:P31 wd:Q6256 .
}

The condition reads: "?country (subject) is wdt:P31 (an instance of) wd:Q6256 (country)." The period (.) at the end marks the end of the triple pattern.

Executing this returns a list of entity IDs, one per country; wd:Q16 is one of them.

If you open Q16 on Wikidata, you will find the data for Canada. As of May 29, 2026, I was able to retrieve 215 country IDs.
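The wd: and wdt: prefixes are abbreviations for full IRIs. The Wikidata Query Service predefines them, which is why the queries in this drill never declare them. A minimal sketch of the country query with the prefixes written out explicitly, as standard SPARQL requires when prefixes are not predefined, might look like this:

```sparql
# Explicit declarations of prefixes that the Wikidata Query Service predefines.
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?country WHERE {
  ?country wdt:P31 wd:Q6256 .   # ?country is an instance of "country"
}
```

On the Wikidata Query Service the two PREFIX lines are optional, so this form and the short form should behave the same.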

Practice Problem 1

Using the Wikidata Query Service, please write a query to retrieve a list of "(QIDs of) Japanese prefectures".

* Please use ?pref as the output variable.
wd:Q50337 = Prefectures of Japan, wdt:P31 = instance of

Answer 1

SELECT ?pref WHERE {
  ?pref wdt:P31 wd:Q50337 .
}

Executing this returned 126 results. (Since there are only 47 prefectures today, I believe the results also include prefectures that no longer exist in the current administrative divisions.)


SPARQL Core Syntax 2: Automatic Label Retrieval (SERVICE Syntax)

When I ran the query from Practice Problem 1, I noticed that unneeded data was mixed in. I want to check which results are the extra ones, but that query returns only a column of QIDs, which makes it hard for a human to tell which item is which.

This is where SERVICE wikibase:label comes in, a convenient extension provided by the Wikidata Query Service. Including it in a query automatically retrieves labels in the specified language (in this case, Japanese, "ja").

For a variable ?xxx, the label is automatically bound to a variable called ?xxxLabel, which you then list in the SELECT clause.

[Query Example 2]

To add and display Japanese labels to the previous "countries of the world" query, write it as follows.

SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
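Some items have no Japanese label, in which case the label variable falls back to showing the bare QID. As a hedged variation on the query above, the language parameter can take a comma-separated list, so that English is tried when Japanese is missing. The choice of "ja,en" here is just an illustration.

```sparql
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  # Languages are tried in order: Japanese first, then English as a fallback.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja,en" . }
}
```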

Practice Problem 2

Add a label retrieval service to the query from Practice Problem 1, and write a query that displays the 'QID' and 'Japanese name' of Japanese prefectures side-by-side.

* You need to list two variables you want to display in the SELECT clause.

Answer 2

SELECT ?pref ?prefLabel WHERE {
  ?pref wdt:P31 wd:Q50337 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}

With the labels displayed, we can confirm that unneeded data, such as Amagasaki Prefecture, is included.

SPARQL Core Syntax 3: Data Combination and Sorting (ORDER BY / LIMIT)

Next, let's learn the syntax for combining even more data.

When adding triple conditions, you can use a semicolon (;) to write multiple predicates and objects for the same subject in succession.

  • wdt:P1082 = population

Also, use ORDER BY to sort the retrieved results and LIMIT to restrict the number of results. To sort in descending order, wrap the variable in DESC(), as in ORDER BY DESC(?variable).

[Query Example 3]

To retrieve the top 5 'countries of the world' by population, write it as follows.

SELECT ?country ?countryLabel ?population WHERE {
  ?country wdt:P31 wd:Q6256 ;
           wdt:P1082 ?population .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
ORDER BY DESC(?population)
LIMIT 5

The semicolon after ?country wdt:P31 wd:Q6256 keeps ?country as the subject, so the next line adds the population condition to the same subject.
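For comparison, here is a sketch of the same query written without the semicolon shorthand. The subject ?country is simply repeated in each triple pattern, and the two forms should be equivalent.

```sparql
SELECT ?country ?countryLabel ?population WHERE {
  ?country wdt:P31 wd:Q6256 .       # first pattern: instance of country
  ?country wdt:P1082 ?population .  # second pattern: same subject, written out again
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
ORDER BY DESC(?population)
LIMIT 5
```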

Practice Problem 3

For Japanese prefectures, write a query that retrieves the 'top 48 by population' along with three pieces of information: the prefecture's QID, Japanese name, and population.

* Please use ?population for the population variable.

Answer 3

SELECT ?pref ?prefLabel ?population WHERE {
  ?pref wdt:P31 wd:Q50337 ;
        wdt:P1082 ?population .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
ORDER BY DESC(?population)
LIMIT 48


If Shinagawa Prefecture ever overtakes Tottori Prefecture in population, this ranking might stop working.

SPARQL Core Syntax 4: OPTIONAL (Allowing for Missing Data)

When working with graph databases, it is no exaggeration to say that OPTIONAL is the most important element. Unlike an RDB (relational database), RDF has no concept of "Null": a missing value is simply a triple that does not exist. If you write a condition as an ordinary triple pattern in the WHERE clause and the data does not exist, the pattern fails to match and that row disappears from the search results entirely.

Use OPTIONAL { } to wrap conditions when you want to "retrieve data if it exists, but keep the subject data even if it doesn't."

[Query Example 4]

Retrieve "countries of the world (Q6256)" and their "official website (P856)." Ensure that countries without a registered website are not removed from the list.

SELECT ?country ?countryLabel ?website WHERE {
  ?country wdt:P31 wd:Q6256 .
  OPTIONAL { ?country wdt:P856 ?website . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 5
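To see what OPTIONAL changes, it can help to run the same pattern without it. In the sketch below the website pattern is a required condition, so any country lacking a P856 value fails to match and is dropped. LIMIT is left out so that the number of results can be compared with the OPTIONAL version run without LIMIT; that comparison is a suggestion, not something measured for this article.

```sparql
# Variation on Query Example 4: the website is a required pattern, and LIMIT is removed.
SELECT ?country ?countryLabel ?website WHERE {
  ?country wdt:P31 wd:Q6256 ;
           wdt:P856 ?website .   # no OPTIONAL: countries without a website are dropped
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
```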


Practice Problem 4

Wikidata is also rich in medical and drug discovery data. Write a query to retrieve a list of "infectious diseases" (QID: Q18123741), and if data for "treatment/medication used for treatment" (PID: P2176) is registered, display it as well. *Please use ?infectiousDisease and ?treatment as the variables.

Answer 4

SELECT ?infectiousDisease ?infectiousDiseaseLabel ?treatment ?treatmentLabel WHERE {
  ?infectiousDisease wdt:P31 wd:Q18123741 .
  OPTIONAL { ?infectiousDisease wdt:P2176 ?treatment . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}

[Explanation] If you do not use OPTIONAL, diseases for which no treatment has been registered (or developed) yet will be completely dropped from the results. This syntax is essential for research that looks for unexplored areas.


SPARQL Core Syntax 5: Property Paths (Recursive Hierarchy Exploration)

Wikidata has a hierarchical structure built from statements of the form "A is a subclass of B." For example, "Random Forest" is a subclass of "Ensemble Learning," and "Ensemble Learning" is a subclass of "Machine Learning."

By adding * (zero or more repetitions) after a property, or by joining properties with / (path concatenation), you can automatically traverse this hierarchy all the way to the bottom to perform a search.

  • wdt:P279 = is a subclass of

[Query Example 5]

Extract items that fall under the subclass of "animal (Q729)" (including mammals, birds, and everything in the hierarchies below them).

SELECT ?animal ?animalLabel WHERE {
  ?animal wdt:P279* wd:Q729 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 20

Practice Problem 5

Write a query to extract all algorithms and models that belong to the subclasses of "machine learning" (QID: Q2539). *Please use ?ml_model as the variable.

Answer 5

SELECT ?ml_model ?ml_modelLabel WHERE {
  ?ml_model wdt:P279* wd:Q2539 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 50

[Explanation] By adding an asterisk, as in wdt:P279*, you can recursively search not only for "direct subclasses" but also for "grandchild classes" and "great-grandchild classes." Because * also matches zero repetitions, the starting class itself is included in the results. This is a very powerful feature of graph query languages.
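The / operator mentioned at the start of this section joins two properties into one path. A common pattern, sketched here under the assumption that the items of interest are recorded as instances (P31) rather than as classes, is to follow one instance-of step and then any number of subclass-of steps. The variable name ?item is made up for this example.

```sparql
SELECT ?item ?itemLabel WHERE {
  # One P31 step, then zero or more P279 steps up to "machine learning".
  ?item wdt:P31/wdt:P279* wd:Q2539 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 50
```

Separately, replacing * with + (one or more repetitions) in Answer 5 would normally exclude the starting class wd:Q2539 itself, since + does not match a path of length zero.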


SPARQL Core Syntax 6: UNION (Combining Multiple Conditions)

If you want to retrieve items that match either "Condition A or Condition B," use UNION to connect the blocks.

[Query Example 6]

Retrieve items that correspond to either "art museum (Q207694)" or "museum (Q33506)."

SELECT ?place ?placeLabel WHERE {
  { ?place wdt:P31 wd:Q207694 . }
  UNION
  { ?place wdt:P31 wd:Q33506 . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 10

Practice Problem 6

Let's look into Italian cuisine. Write a query to retrieve food items whose classification is either "pizza" (QID: Q177) or "pasta" (QID: Q178). *Please use ?food as the variable.

Answer 6

SELECT ?food ?foodLabel WHERE {
  { ?food wdt:P31 wd:Q177 . }
  UNION
  { ?food wdt:P31 wd:Q178 . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 20

[Explanation] Enclose each condition in { } and place UNION between them. The key is to keep the variable name (?food) consistent in both blocks.
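The blocks joined by UNION do not have to use the same predicate. As a sketch, the following collects items that are recorded either as an instance of "pizza" or as a subclass of it. Whether Wikidata models a given pizza variety one way or the other is not something this article verifies.

```sparql
SELECT ?food ?foodLabel WHERE {
  { ?food wdt:P31 wd:Q177 . }    # recorded as an instance of pizza
  UNION
  { ?food wdt:P279 wd:Q177 . }   # recorded as a subclass of pizza
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
LIMIT 20
```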


SPARQL Core Syntax 7: GROUP BY and COUNT (Aggregating Data)

Finally, here is the aggregation syntax for getting an overview of the retrieved data. Just like in SQL, you can use GROUP BY to group rows by specific variables and COUNT to count them.

[Query Example 7]

Count the number of registered World Heritage sites (Q9259) for each country (P17) and display them in descending order. *It is a bit complex, but the rule is to write the aggregate as (COUNT(?heritage) AS ?count) within the SELECT clause.

SELECT ?countryLabel (COUNT(?heritage) AS ?count) WHERE {
  ?heritage wdt:P31 wd:Q9259 ;
            wdt:P17 ?country .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
GROUP BY ?countryLabel
ORDER BY DESC(?count)
LIMIT 5


Practice Problem 7

This is an application of Practice Problem 4. Write a query to count the number of registered "treatments (P2176)" for each infectious disease (Q18123741) and list them in descending order of the number of treatment types. *The variable should be ?ntdLabel, and the count result should be ?drug_count. (The result will change depending on whether you use OPTIONAL or not, but for this exercise, only count items that have treatments).

Answer 7

SELECT ?ntdLabel (COUNT(?treatment) AS ?drug_count) WHERE {
  ?ntd wdt:P31 wd:Q18123741 ;
       wdt:P2176 ?treatment .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
GROUP BY ?ntdLabel
ORDER BY DESC(?drug_count)

[Explanation] We group by disease (GROUP BY ?ntdLabel) and count the associated therapeutic drugs. This is useful for quantitatively understanding 'how many existing approaches exist in which area,' such as when narrowing down drug discovery targets.
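The problem statement notes that the result changes when OPTIONAL is used. A sketch of that variant is shown below: because COUNT(?treatment) only counts rows where ?treatment is bound, diseases with no registered treatment should appear with a count of 0 instead of disappearing. Grouping by both ?ntd and ?ntdLabel is a choice made here so that two diseases sharing a label are not merged.

```sparql
SELECT ?ntd ?ntdLabel (COUNT(?treatment) AS ?drug_count) WHERE {
  ?ntd wdt:P31 wd:Q18123741 .
  OPTIONAL { ?ntd wdt:P2176 ?treatment . }   # keep diseases without treatments
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ja" . }
}
GROUP BY ?ntd ?ntdLabel
ORDER BY DESC(?drug_count)
```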


Wrap-up

We have now covered all 7 core syntax elements.

  1. SELECT / WHERE: Basic matching

  2. SERVICE: Retrieving labels

  3. ORDER BY / LIMIT: Sorting and limiting

  4. OPTIONAL: Allowing for missing data

  5. Property paths: Deep diving into hierarchies

  6. UNION: Combining multiple conditions

  7. GROUP BY / COUNT: Aggregating data

By combining these like blocks, you can perform advanced data mining, such as 'aggregating data that has a certain property and is a subclass of something else, while matching specific conditions.'

SPARQL is a powerful query language for integrating scattered open data and discovering new insights. It is reportedly also being used for AI reasoning these days, so its use may grow even further in the future.
