User profiles
Format
User profiles are stored in data/profiles.pickle. The pickled dict has the following format:
{
    '<TARGET_ARTICLE>': [
        '<PROFILE_ARTICLE_1>',
        ...,
        '<PROFILE_ARTICLE_N>'
    ]
}
A user is identified by a target article ID, which is the key in the dictionary above. The target is the article the user is presumed to be searching for, so it should feature in the results.
The value is a list of Wikipedia article IDs that have been sampled to resemble a reading history with topical focus. The plaintext articles corresponding to these IDs are stored in data/_cleaned/<ARTICLE_ID> and can be accessed from there.
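Before building on the profiles, it may help to confirm that every referenced article is actually present on disk. The following is a minimal sketch under the layout described above; the helper names `load_profiles` and `missing_articles` are hypothetical and not part of the project, and article IDs are assumed to be usable as file names.

```python
import pickle
from pathlib import Path


def load_profiles(path="data/profiles.pickle"):
    """Load the pickled dict mapping target article ID -> list of profile article IDs."""
    with open(path, "rb") as f:
        return pickle.load(f)


def missing_articles(profiles, cleaned_dir="data/_cleaned"):
    """Return the profile article IDs that have no plaintext file in cleaned_dir."""
    cleaned_dir = Path(cleaned_dir)
    missing = set()
    for history in profiles.values():
        for article_id in history:
            # Assumption: the file name equals the article ID.
            if not (cleaned_dir / str(article_id)).is_file():
                missing.add(article_id)
    return sorted(missing, key=str)
```

Calling `missing_articles(load_profiles())` from the repository root should return an empty list if the cleaned dumps are complete.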
A note on model construction
The embedding builder currently has a code path where it attempts to extract textual data from Wikipedia via the Wikimedia API (ll. 30-41).
This step is not necessary if you have been provided with the already cleaned article dumps, which will likely be the case.
Any additional profile handling should likely start by reading the pickled profiles and fetching the associated articles, before performing further processing:
import pickle
from collections import defaultdict
from pathlib import Path

# Load the mapping of target article ID -> list of profile article IDs.
with open("data/profiles.pickle", "rb") as f:
    profiles = pickle.load(f)

# Collect the plaintext of every article in each profile.
profiles_with_text = defaultdict(list)
for key, docs in profiles.items():
    for doc in docs:
        with open(Path("data/_cleaned") / doc) as f:
            profiles_with_text[key].append(f.read())
At this point, profiles_with_text should map each target article ID to a list of plaintext strings, one per article in that user's reading history.
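As one possible next step, each reading history can be collapsed into a single string per user, for example as input to a downstream embedding or indexing step. This is only an illustrative sketch: `merge_histories` is a hypothetical helper, and nothing in the project requires this particular representation.

```python
def merge_histories(profiles_with_text, separator="\n\n"):
    """Join each user's article texts into one string, keyed by target article ID."""
    return {
        target: separator.join(texts)
        for target, texts in profiles_with_text.items()
    }


# Tiny stand-in for the real profiles_with_text mapping.
example = {"target_a": ["first article", "second article"]}
merged = merge_histories(example)
```

Joining with a blank line keeps article boundaries visible in the merged text; a different separator can be passed if the downstream step expects one.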