Page MenuHomePhabricator

when analyzing a Wikifunctions dump, parent_id in page creation revisions is sometimes 0 and sometimes None
Open, Needs TriagePublicBUG REPORT

Description

Out of curiosity, I'm analyzing a Wikifunctions XML dump. (The code, if anyone wants to see it.)

I use the mwxml Python library for it. I've noticed a slightly strange thing: most of the time, page creation revisions have a parent_id value of None, but several hundreds of them have a parent_id value of 0 (int).

I'd expect it to be consistently 0, as in the database, or maybe consistently None, but not a mix.

@xcollazo suggested reporting a bug.

Event Timeline

Here are my findings:

-- I want to see the status of the source tables `event.mediawiki_page_content_change_v1` and `event.mediawiki_content_history_reconcile_enriched_v1` therefore I executed the following query

spark.sql("""
WITH page_change AS (
  SELECT
    COALESCE(revision.rev_parent_id, -1) AS rev_parent_id, -- to allow a proper join
    page.page_id AS page_id,
    revision.rev_id AS revision_id,
    TO_TIMESTAMP(revision.rev_dt) AS rev_ts,
    TO_TIMESTAMP(meta.dt) AS meta_ts
  FROM event.mediawiki_page_content_change_v1
  WHERE wiki_id = 'simplewiki'
    AND year = 2026
    AND month = 3
    AND (revision.rev_parent_id = 0 OR revision.rev_parent_id IS NULL)
),
enrich AS (
  SELECT
    COALESCE(revision.rev_parent_id, -1) AS rev_parent_id, -- to allow a proper join
    page.page_id AS page_id,
    revision.rev_id AS revision_id,
    TO_TIMESTAMP(revision.rev_dt) AS rev_ts,
    TO_TIMESTAMP(meta.dt) AS meta_ts
  FROM event.mediawiki_content_history_reconcile_enriched_v1
  WHERE wiki_id = 'simplewiki'
    AND year = 2026
    AND month = 3
    AND (revision.rev_parent_id = 0 OR revision.rev_parent_id IS NULL)
)
SELECT
  pc.page_id,
  pc.revision_id,
  en.rev_parent_id AS rev_parent_en,
  pc.rev_parent_id AS rev_parent_pc,
  en.rev_ts,
  en.meta_ts,
  pc.rev_ts,
  pc.meta_ts
FROM page_change pc
JOIN enrich en
  ON en.page_id = pc.page_id
  AND en.revision_id = pc.revision_id
WHERE en.rev_parent_id != pc.rev_parent_id
  AND en.rev_ts < TIMESTAMP '2026-03-28' -- this to have data that has sat couple days in our tables not the freshest.
ORDER BY en.rev_ts DESC
""").show(truncate=False)

+-------+-----------+-------------+-------------+-------------------+--------------------------+-------------------+--------------------------+
|page_id|revision_id|rev_parent_en|rev_parent_pc|rev_ts             |meta_ts                   |rev_ts             |meta_ts                   |
+-------+-----------+-------------+-------------+-------------------+--------------------------+-------------------+--------------------------+
|1260271|10797545   |0            |-1           |2026-03-27 19:09:52|2026-03-28 08:40:57.369379|2026-03-27 19:09:52|2026-03-27 19:09:54.633982|
|1260266|10797513   |0            |-1           |2026-03-27 18:43:26|2026-03-28 08:40:57.168967|2026-03-27 18:43:26|2026-03-27 18:43:28.874415|
|1260247|10797391   |0            |-1           |2026-03-27 17:35:11|2026-03-28 08:40:58.370919|2026-03-27 17:35:11|2026-03-27 17:35:14.809395|
|1260207|10797189   |0            |-1           |2026-03-27 15:00:49|2026-03-28 08:40:57.351437|2026-03-27 15:00:49|2026-03-27 15:00:54.292207|
|1260197|10797137   |0            |-1           |2026-03-27 14:05:15|2026-03-28 08:39:49.973912|2026-03-27 14:05:15|2026-03-27 14:05:18.870012|
|1260195|10797125   |0            |-1           |2026-03-27 13:56:08|2026-03-28 08:40:57.351877|2026-03-27 13:56:08|2026-03-27 13:56:11.04269 |
|1260069|10796301   |0            |-1           |2026-03-26 20:11:33|2026-03-27 09:01:41.574528|2026-03-26 20:11:33|2026-03-26 20:11:37.866196|
|1260017|10796009   |0            |-1           |2026-03-26 17:07:33|2026-03-27 09:01:41.115706|2026-03-26 17:07:33|2026-03-26 17:07:37.669232|
|1259949|10795583   |0            |-1           |2026-03-26 10:40:54|2026-03-27 08:59:49.917442|2026-03-26 10:40:54|2026-03-26 10:40:59.04031 |
|1259893|10795075   |0            |-1           |2026-03-26 01:41:28|2026-03-27 09:01:41.573412|2026-03-26 01:41:28|2026-03-26 01:41:33.032589|
|1259890|10794997   |0            |-1           |2026-03-26 00:09:51|2026-03-27 09:01:41.465823|2026-03-26 00:09:51|2026-03-26 00:09:54.65921 |
|1259587|10792972   |0            |-1           |2026-03-24 10:49:34|2026-03-25 08:09:07.842913|2026-03-24 10:49:34|2026-03-24 10:49:38.148943|
|1259586|10792970   |0            |-1           |2026-03-24 10:49:16|2026-03-25 08:08:47.529085|2026-03-24 10:49:16|2026-03-24 10:49:19.303587|
|1259585|10792968   |0            |-1           |2026-03-24 10:48:46|2026-03-25 08:07:07.618147|2026-03-24 10:48:46|2026-03-24 10:48:49.278432|
|1259567|10792814   |0            |-1           |2026-03-24 06:18:54|2026-03-25 08:07:07.61865 |2026-03-24 06:18:54|2026-03-24 06:18:57.858593|
|1259553|10792688   |0            |-1           |2026-03-24 04:04:48|2026-03-25 08:09:07.619311|2026-03-24 04:04:48|2026-03-24 04:04:49.213436|
|1259552|10792686   |0            |-1           |2026-03-24 04:03:40|2026-03-25 08:09:07.683451|2026-03-24 04:03:40|2026-03-24 04:03:43.851667|
|1259550|10792674   |0            |-1           |2026-03-24 03:51:32|2026-03-25 08:09:07.61862 |2026-03-24 03:51:32|2026-03-24 03:51:35.850462|
|1259549|10792673   |0            |-1           |2026-03-24 03:50:20|2026-03-25 08:09:07.731899|2026-03-24 03:50:20|2026-03-24 03:50:23.15421 |
|1259548|10792671   |0            |-1           |2026-03-24 03:45:23|2026-03-25 08:08:47.810214|2026-03-24 03:45:23|2026-03-24 03:45:26.153001|
+-------+-----------+-------------+-------------+-------------------+--------------------------+-------------------+--------------------------+

spark.sql("""
SELECT
  revision.rev_parent_id,
  TO_TIMESTAMP(revision.rev_dt) AS rev_ts,
  TO_TIMESTAMP(meta.dt) AS meta_ts
FROM event.mediawiki_content_history_reconcile_enriched_v1
WHERE wiki_id = 'simplewiki'
  AND page.page_id = 1260271
  AND revision.rev_id = 10797545
  AND year = 2026
  AND month = 3
""").show(truncate=False)

+-------------+-----------------------------+--------------------------+
|rev_parent_id|to_timestamp(revision.rev_dt)|to_timestamp(meta.dt)     |
+-------------+-----------------------------+--------------------------+
|0            |2026-03-27 19:09:52          |2026-03-28 08:40:57.369379|
+-------------+-----------------------------+--------------------------+

spark.sql("""
SELECT
  revision.rev_parent_id,
  TO_TIMESTAMP(revision.rev_dt) AS rev_ts,
  TO_TIMESTAMP(meta.dt) AS meta_ts
FROM event.mediawiki_page_content_change_v1
WHERE wiki_id = 'simplewiki'
  AND page.page_id = 1260271
  AND revision.rev_id = 10797545
  AND year = 2026
  AND month = 3
""").show(truncate=False)

+-------------+-----------------------------+--------------------------+
|rev_parent_id|to_timestamp(revision.rev_dt)|to_timestamp(meta.dt)     |
+-------------+-----------------------------+--------------------------+
|null         |2026-03-27 19:09:52          |2026-03-27 19:09:54.633982|
+-------------+-----------------------------+--------------------------+

-- Therefore we receive null in the `mediawiki_page_content_change_v1` and 0 in the `mediawiki_content_history_reconcile_enriched_v1` for the same page_id, revision_id 

spark.sql("""
SELECT
  revision_parent_id,
  revision_dt,
  row_content_update_dt,
  row_visibility_update_dt,
  row_move_update_dt
FROM wmf_content.mediawiki_content_history_v1
WHERE wiki_id = 'simplewiki'
  AND page_id = 1260271 
  AND revision_id = 10797545 
  AND (revision_parent_id IS NULL OR revision_parent_id = 0)
""").show(truncate=False)

+------------------+-------------------+--------------------------+--------------------------+--------------------------+
|revision_parent_id|revision_dt        |row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+------------------+-------------------+--------------------------+--------------------------+--------------------------+
|0                 |2026-03-27 19:09:52|2026-03-28 08:40:57.369379|2026-03-27 19:09:54.633982|2026-03-27 19:09:54.633982|
+------------------+-------------------+--------------------------+--------------------------+--------------------------+

spark.sql("""
SELECT
  revision_parent_id,
  revision_dt,
  row_content_update_dt,
  row_visibility_update_dt,
  row_move_update_dt
FROM wmf_content.mediawiki_content_history_v1 TIMESTAMP AS OF '2026-03-29 00:00:00'
WHERE wiki_id = 'simplewiki'
  AND page_id = 1260271
  AND revision_id = 10797545
  AND (revision_parent_id IS NULL OR revision_parent_id = 0)
""").show(truncate=False)

+------------------+-------------------+--------------------------+--------------------------+--------------------------+
|revision_parent_id|revision_dt        |row_content_update_dt     |row_visibility_update_dt  |row_move_update_dt        |
+------------------+-------------------+--------------------------+--------------------------+--------------------------+
|null              |2026-03-27 19:09:52|2026-03-27 19:09:54.633982|2026-03-27 19:09:54.633982|2026-03-27 19:09:54.633982|
+------------------+-------------------+--------------------------+--------------------------+--------------------------+

-- depending from the moment we query the table we can either have null or 0 for the same.

My idea would be to force 0 when we write on the MWCH table. But, another solution would also be to force the 0 in the event.mediawiki_page_content_change_v1.

What do you think @xcollazo, @JAllemandou ?

That's interesting!
@Ottomata could you have a look at the event side of thing? This could mean a bug, right?

I concur that this looks like a page_change event bug.

In the source of truth database we indeed keep 0:

MariaDB [simplewiki]> select rev_id, rev_parent_id from revision where rev_id = 10797545;
+----------+---------------+
| rev_id   | rev_parent_id |
+----------+---------------+
| 10797545 |             0 |
+----------+---------------+
1 row in set (0.000 sec)

Revision table documentation states that 0 is a valid value:

rev_parent_id
The revision ID of the previous revision to the page. Corresponds to rc_last_oldid. For edits which are new page creations, rev_parent_id = 0.

Further, the revision fragment (currently at v1.1.0) that we use to define a revision for page_change events (currently at 1.3.0) states that a value of 0 is valid:

rev_parent_id:
  description: This revision's parent rev_id.
  type: integer
  maximum: 9007199254740991
  minimum: 0

Thus I think the fix here is to:

  1. Emit 0 instead of null on page_change events for rev_parent_id when there is no parent.
  2. Run a query on our datalake tables to correct data so that any null values on revison_parent_id are changed to 0.
  1. Emit 0 instead of null on page_change events for rev_parent_id when there is no parent.
  2. Run a query on our datalake tables to correct data so that any null values on revison_parent_id are changed to 0.

+1 on this plan!

Hm! Interesting, TIL. A comment in RevisionStoreRecord.php:

		// NOTE: rev_parent_id = 0 indicates that there is no parent revision, while null
		// indicates that the parent revision is unknown. As per MW 1.31, the database schema
		// allows rev_parent_id to be NULL.

But even so, in EventBus, we try to always omit rev_parent_id.

I suppose the right thing to do would be to do what MediaWiki does? Only omit rev_parent_id if $revisionRecord->getParentId() == null.

I suppose the right thing to do would be to do what MediaWiki does? Only omit rev_parent_id if $revisionRecord->getParentId() == null.

Current docs at https://www.mediawiki.org/wiki/Manual:Revision_table#rev_parent_id do not reflect those semantics, but I agree that code semantics win.

Change #1282442 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] RevisionEntitySerializer - allow rev_parent_id of 0

https://gerrit.wikimedia.org/r/1282442

Change #1282442 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] RevisionEntitySerializer - allow rev_parent_id of 0

https://gerrit.wikimedia.org/r/1282442