
Type: Improvement
Resolution: Fixed
Priority: Minor
Fix Version/s: 4.0
Affects Version/s: 3.10.7, 4.0
Component/s: Caching, Libraries
Labels:
- dev_docs_required
- triaged

Affected Branches:
MOODLE_310_STABLE, MOODLE_400_STABLE
Fixed Branches:
MOODLE_400_STABLE
Pull from Repository:
https://github.com/sammarshallou/moodle.git
Pull Main Branch:
MDL-72837-master
Pull Main Diff URL:
https://github.com/sammarshallou/moodle/compare/master...MDL-72837-master
Testing Instructions:

To test this for real you would need a system with multiple servers, using a local cache on each server and a shared cache across all servers. That would let you replicate the problem: when there are multiple requests, some of them have infeasibly long delays because they wait for multiple other users to rebuild the cache.

However, if we only need to test that it works to get the data from shared cache when there is older data in the local cache (which is the fundamental change here) then we can simulate this within a single test server if we use a 'local cache' that is a filesystem cache, and manually change the directory to simulate being on another server.

First, so as to make testing easier, hack a library function so that building a course cache always takes at least 20 seconds. (Use a test course that builds quickly so that the time is pretty much just 20 seconds.) Open lib/modinfolib.php and insert the line

sleep(20);

immediately after this line (currently line 700):

protected static function inner_build_course_cache($course, \core\lock\lock $lock) {

Select a suitable location on your web server computer to store this cache, for example it could be /tmp - anywhere that your web server can access.

If you are using Windows, it must be stored somewhere on the default disk drive i.e. drive C.

In that location, create two folders, 'localmodinfo1' and 'localmodinfo2'.

Go to the cache configuration screen (admin menu / Plugins / Caching / Configuration).

Click 'Add instance' in the 'File cache' line of the top table.

Give it the store name 'LOCAL modinfo'.

Set the cache path to the full path of the 'localmodinfo1' folder that you created earlier.

If you are using Windows, you can't type colons into this path, and you should probably also not use backslashes, so instead of c:\temp\localmodinfo1, type /temp/localmodinfo1.

Save changes.

Under 'Known cache definitions', find the line for coursemodinfo (usually first line). Click 'Edit mappings'.

Set it so that 'LOCAL modinfo' is the first entry, and a suitable shared cache (e.g. the default one) is the final store entry.

Keep this screen open in a tab because you're going to need it again soon.

In another tab, view any course page. (There may or may not be a 20-second delay at this point; it doesn't matter.)

Back in the first tab, find 'LOCAL modinfo' in the 'Configured store instances' table and click the 'Edit store' button. Change the end of the path from 'localmodinfo1' to 'localmodinfo2' and save changes.

Reload the course page.

EXPECTED: It should not take 20 seconds.

This guarantees that both local caches ('localmodinfo1' and 'localmodinfo2') and the shared cache all now contain the current version of the course cache.

Now edit settings for the course, and save changes. (You don't actually have to change anything.)

EXPECTED: There should be two 20-second delays at this point. (This is because editing settings rebuilds the course cache twice, which seems silly, but that's an unrelated issue.)

Now the new version of the course cache will be in the shared cache and in the 'localmodinfo2' local cache, but not in the 'localmodinfo1' local cache, which should still have the previous version.

In the first tab, edit the store again and change the path back to 'localmodinfo1' and save changes.

Reload the course page.

EXPECTED: It should not take 20 seconds.

This test proves that the course cache rebuild now only happens once in the local+shared setup, even if an older version is stored on local cache.

If you want you can also turn on performance information and watch the cache hits/misses/sets values for modinfo (this is a bit advanced - it is hard to read, and you have to bear in mind that it always loads the site course as well as the one being looked at):

After the edit, when the course page first loads (so that it actually rebuilds course cache) there is 1 set on the local cache and 1 set on the shared cache.

If you load the course page again at that point, you'll see it has all hits on local and no reference to the shared cache.

After you change the cache setting to the other directory (so that it needs to get the newer one from shared cache) you'll see the shared cache appears again, but only with a hit; the local cache now has a set as it's updating the value.

Reloading it goes back to having only hits on local cache.

In order to support the way modinfo works across multiple-layer caches, we need to make an API that allows versioned data in the cache. Currently modinfo cache is kind of versioned because the system checks the 'cacherev' number from the course table, but we need the cache API to be aware of the versioning.

Why the current situation causes a bug

It is possible (and a good idea for performance in a large system) to configure the modinfo cache to have two levels, i.e. a local cache + a shared cache. In general the way this works is that modinfo will normally be loaded from local; if unavailable locally, it will be loaded from shared.

The problem arises when the modinfo is available locally but out of date compared to the shared version. In that case, the cache will be rebuilt (even though it has already been built and the shared version is current). In cases where course cache rebuild takes a long time (e.g. 10 seconds) and you have frequent requests to the course and many server instances (e.g. 20), some requests after a cache rebuild can take up to 200 seconds (rebuild time * server/container instances, give or take).

In our system we have a front-end server timeout after 60 seconds, so lots of people see error pages briefly, every time somebody edits a course.

To explain how this works, consider a situation with three containers (C1, C2, and C3). Each has a local cache in addition to the global shared cache. Initially, all the containers have a current version (V1) of the modinfo cache for a particular course.

A user causes the cache to be cleared; their request is handled by C1. This will clear the modinfo cache for the course on C1 local cache and the shared cache.
A user requests the course; the request is handled by C3. This will get the lock for building the course and start building it, which takes a while (N seconds).
During this time another two requests come in, from C1 and C2. Both these requests also try to get the lock to build the cache because the cached data is missing (C1) or out of date (C2). They wait for the lock.
C3 finishes building modinfo cache V2 and saves it to local (C3) and shared cache, then releases the lock. C3's request was answered in about N seconds.
C1 now gets the lock. It retries requesting from cache (this code is supposed to stop this sort of duplication happening). There is nothing in C1 local cache, so it now requests it from shared cache, which already has V2. C1's request is also answered in about N seconds - no problem.
C2 now gets the lock. It retries requesting from cache, which uses C2 local cache, which still has V1. As a result, it decides to build the modinfo cache, which takes another N seconds. This request takes 2N seconds to answer because it waited (for database lock and for building the cache) twice...
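
The sequence above can be replayed in a small simulation. This is only a toy model written for illustration (the function name `simulate` and the `versioned_lookup` flag are made up here, and it is not Moodle code): it grants the lock in the order C3, C1, C2, and compares the current behaviour with the proposed one, where a stale local entry falls through to the shared cache.

```python
def simulate(build_seconds, versioned_lookup):
    """Replay the C1/C2/C3 sequence; return the time at which each request is answered."""
    shared = None                            # shared cache was cleared by the edit on C1
    local = {"C1": None, "C2": 1, "C3": 1}   # C1 was cleared; C2 and C3 still hold V1
    current = 2                              # version the course now requires
    clock = 0                                # time at which the lock next becomes free
    finished = {}
    for name in ["C3", "C1", "C2"]:          # order in which the lock is granted
        # After getting the lock, each request retries the cache first.
        if local[name] == current:
            pass                             # local cache already current
        elif local[name] is None and shared == current:
            local[name] = shared             # plain local miss: load from shared
        elif versioned_lookup and shared == current:
            local[name] = shared             # stale local entry: load from shared
        else:
            clock += build_seconds           # rebuild while holding the lock
            shared = local[name] = current
        finished[name] = clock
    return finished


print(simulate(10, versioned_lookup=False))
print(simulate(10, versioned_lookup=True))
```

With a 10-second build, the first call reports C2 finishing at 20 seconds (the 2N case), while the second reports all three requests finishing at 10 seconds.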

I'll attach a diagram which may or may not help:

If there were more than 3 containers you can see that multiple containers can be in the same position as C2. They can all be waiting for the database lock, so you can rebuild it as many times as you have containers.

The solution to this is that instead of finding outdated data in the local cache, and then deciding to rebuild course cache, it should instead get the current data from the shared cache. Achieving this requires an API change.

New API

The new API allows for versioned data to be stored in a cache by using set_versioned() and get_versioned() functions instead of set and get.

These functions accept an integer version number. The get function will return a cached value if there is something in the cache with either the requested version, or a higher version. (The result is a cache_version_wrapper object so you can find out the actual version returned if you need it.) It automatically handles retrieving it from higher-level caches if necessary.
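
As a rough sketch of the behaviour described above, the following toy two-layer cache accepts any stored version that is at least the requested one and otherwise falls through to the next layer. The class `VersionedLayer` and its internals are hypothetical and exist only for this illustration; the method names mirror the ones in this issue, but the signatures and the `(version, value)` tuple (standing in for the wrapper object) are assumptions, not the Moodle API.

```python
class VersionedLayer:
    """Toy cache layer; `parent` is the next (shared) layer, if any."""

    def __init__(self, parent=None):
        self.parent = parent
        self.data = {}  # key -> (version, value)

    def set_versioned(self, key, version, value):
        # Write through to every layer so the shared cache sees new versions.
        self.data[key] = (version, value)
        if self.parent is not None:
            self.parent.set_versioned(key, version, value)

    def get_versioned(self, key, required_version):
        # Accept the stored entry if its version is the requested one or higher.
        entry = self.data.get(key)
        if entry is not None and entry[0] >= required_version:
            return entry
        # Missing or too old here: ask the next layer, then refresh this one.
        if self.parent is not None:
            entry = self.parent.get_versioned(key, required_version)
            if entry is not None:
                self.data[key] = entry
                return entry
        return None


shared = VersionedLayer()
c2 = VersionedLayer(parent=shared)
c3 = VersionedLayer(parent=shared)
c2.set_versioned("course42", 1, "modinfo V1")   # C2 cached the old version
c3.set_versioned("course42", 2, "modinfo V2")   # C3 rebuilt after an edit
print(c2.get_versioned("course42", 2))          # C2 gets V2 from shared, no rebuild
```

In this model a `None` result is the only case where the caller would need to rebuild the data.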

After a lot of consideration (and unit tests) I think this system is robust and suitable for use for modinfo. It fixes the problem described above.

Note: In addition to this new API, there are other ways to safely store data in a 'localisable' (multi-layer) cache which are suitable for other situations, for example if the version identifier can't be represented as a monotonically increasing integer. The main one is to incorporate a version identifier into the cache key; in that case, unless we are certain that there will be a strictly limited number of versions between cache clears, TTL should be enabled for the cache or it could grow infinitely large.
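
To illustrate the alternative in the note, here is a minimal sketch of putting a version identifier into the cache key, with a TTL so that entries for old versions eventually disappear. Everything in it (`TtlCache`, `versioned_key`, `purge_expired`, the injectable clock) is hypothetical and only models the idea; it does not reflect how any particular cache store implements TTL.

```python
import time


class TtlCache:
    """Toy key/value cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so the behaviour can be tested
        self.entries = {}           # key -> (expiry time, value)

    def set(self, key, value):
        self.entries[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or self.clock() >= entry[0]:
            self.entries.pop(key, None)
            return None
        return entry[1]

    def purge_expired(self):
        # Without this, keys for old versions that are never read again would stay forever.
        now = self.clock()
        for key in [k for k, (expiry, _) in self.entries.items() if now >= expiry]:
            del self.entries[key]


def versioned_key(name, version_id):
    """Build a cache key that changes whenever the version identifier changes."""
    return f"{name}:{version_id}"


cache = TtlCache(ttl_seconds=60)
cache.set(versioned_key("purifier", "a1b2"), "data for a1b2")
cache.set(versioned_key("purifier", "c3d4"), "data for c3d4")  # new version, new key
print(cache.get(versioned_key("purifier", "c3d4")))
```

Because each version gets its own key, a stale local entry can never be mistaken for the current one; the cost is that old keys pile up until the TTL removes them.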

Attachments:
- explanatory-diagram.png (19 kB), 18/Oct/21 7:27 PM, Sam Marshall
- changes-2022-01-19.patch (54 kB), 20/Jan/22 12:48 AM, Sam Marshall
- image-2022-02-11-15-25-17-943.png (571 kB), 11/Feb/22 4:25 PM, Huong Nguyen
- MDL-72837-step-12.png (13 kB), 25/Feb/22 5:28 PM, Michael Hawkins
- MDL-72837-step-13.png (28 kB), 25/Feb/22 5:28 PM, Michael Hawkins
- MDL-72837-step-15.png (11 kB), 25/Feb/22 5:28 PM, Michael Hawkins
- MDL-72837-step-9.png (33 kB), 25/Feb/22 5:28 PM, Michael Hawkins
- MDL-72837-step-14.png (138 kB), 25/Feb/22 5:28 PM, Michael Hawkins
- MDL-72837-step-16.png (27 kB), 25/Feb/22 5:28 PM, Michael Hawkins

Issue Links:

blocks:
- MDL-72991 Regression from partial cache rebuild (Closed)

caused a regression:
- MDL-78466 A static cache value returning an empty array will be fetched from the real cache (Closed)
- MDL-74020 unit test failures in master branch (Closed)
- MDL-74032 One-shot coding error happening on first request to site (versioned-caches) (Closed)

has a non-specific relationship to:
- MDL-67020 The coursemodinfo cache item doesn't scale when localized due to global locking (Closed)
- MDL-55231 Partial course cache rebuild (Closed)

is duplicated by:
- MDL-68456 Create cache localization helper (Closed)

will help resolve:
- MDL-73382 Localize htmlpurifier cache using value versions instead of key versions (Open)
