2021-08-24 Repositories, Paginator, Abstraction

Main takeaways:

Get rid of Paginator.

New topics

Alan: gain experience with best practices for ORM performance
Alan: looking to migrate away from Doctrine ORM towards a more abstract layer where we hydrate from other stores
Alan: trying Apache Unomi - low priority according to Sikandar
Alan: trying to move away iteratively
Jan: using repositories as subscriber dependencies can slow down the kernel

Jan: using repositories as subscriber dependencies can slow down the kernel

$doctrine = makeMeADoctrine();

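function makeMeADoctrine() {
    // Each $dic->get() call below instantiates the service eagerly, unless that service is declared lazy.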
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService1::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService2::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService3::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService4::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService5::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService6::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService8::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService9::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService10::class)));
    $eventSubscriber->addListener(new MyListener($dic->get(OtherService11::class)));
}

OtherService* must be lazy (https://symfony.com/doc/current/service_container/lazy_services.html)

Jan: listener fetched at runtime

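$lazyEntityManager = new class implements EntityManagerInterface
{
    public ?EntityManagerInterface $inner = null; // assigned at runtime, before first use
    public function flush() { $this->inner->flush(); }
    // ... the remaining EntityManagerInterface methods delegate to $this->inner the same way
};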
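
The "fetched at runtime" idea can also be sketched without a container. This is a minimal illustration, not Mautic or Symfony code: `ExpensiveService`, `LazyListener`, and `onEvent()` are hypothetical names, and the factory closure stands in for whatever the container would do to build the real dependency.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for an expensive dependency such as a repository.
final class ExpensiveService
{
    public static int $instances = 0;

    public function __construct()
    {
        self::$instances++;
    }

    public function handle(string $event): string
    {
        return "handled {$event}";
    }
}

// Hypothetical listener: it receives a factory instead of the service itself,
// so the service is only built when the first event actually arrives.
final class LazyListener
{
    /** @var callable(): ExpensiveService */
    private $factory;

    private ?ExpensiveService $service = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    public function onEvent(string $event): string
    {
        // Build the dependency on first use, then reuse it.
        $this->service ??= ($this->factory)();

        return $this->service->handle($event);
    }
}

// Registering the listener costs nothing; no ExpensiveService exists yet.
$listener = new LazyListener(static fn (): ExpensiveService => new ExpensiveService());
```

The trade-off is the one raised in the discussion: the construction cost does not disappear, it moves from bootstrap to the first event.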

Marco: by upgrading Symfony, you get a lazy EntityManager by default, because Symfony needs
it to reset the service (background workers).
Alan: https://github.com/mautic/mautic/blob/340f3440c23fbd48f34fc26b35e45170ebdfcc87/app/bundles/UserBundle/Config/config.php#L364-L371
Marco: that already breaks laziness, but we can mark the repository as lazy, perhaps starting with
mautic.user.repository.user_token.
Marco: putting laziness in hot paths gains nothing, because those services get initialized anyway.
Sikandar: laziness will move initialization time into the runtime. Bootstrap is not such a big issue, so we
need to be selective.
Marco: we need more information about a performance profile.
Sikandar: problem is not really at application-side (memory/cpu/latency).
Alan: clearly not a major concern. It may help in background processing.
Marco: are the background processes spawned once per task, or kept alive?
Alan: goes back to multi-tenancy.
Marco: maybe we can reboot individual services (EntityManager), worked fine for some integration test suite
in the past.
Marco: https://symfony.com/doc/current/reference/dic_tags.html#kernel-reset
Marco, Sikandar: only about stateful services
Sikandar: are connections pooled?
Marco: no, and resetting services would probably also reset a connection pool, if we had one
Alan: we don't have connection pooling
Marco: Xdebug profiler output (cachegrind.out.* file)
Marco: problem probably not here
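
The service-reset idea discussed above (only stateful services need it) could look roughly like this for a worker that is kept alive between tasks. This is a sketch under assumptions: `Resettable`, `IdentityMap`, and `runWorker()` are hypothetical names that only mirror the idea behind Symfony's `kernel.reset` tag, not its actual API.

```php
<?php

declare(strict_types=1);

// Hypothetical interface for services that hold state between tasks.
interface Resettable
{
    public function reset(): void;
}

// Hypothetical stateful service, loosely comparable to an identity map.
final class IdentityMap implements Resettable
{
    /** @var array<int, string> */
    private array $loaded = [];

    public function remember(int $id, string $value): void
    {
        $this->loaded[$id] = $value;
    }

    public function size(): int
    {
        return count($this->loaded);
    }

    public function reset(): void
    {
        $this->loaded = [];
    }
}

/**
 * Runs each task, then resets every stateful service, even if the task threw.
 *
 * @param iterable<callable> $tasks
 * @param Resettable[]       $statefulServices
 */
function runWorker(iterable $tasks, array $statefulServices): void
{
    foreach ($tasks as $task) {
        try {
            $task();
        } finally {
            foreach ($statefulServices as $service) {
                $service->reset();
            }
        }
    }
}
```

Each task then starts from a clean state without rebooting the whole process.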

Alan: trouble with the paginator

Marco: paginator - as soon as you have issues, move away from it
Marco: it answers two things: "how many results in total" and "give me one page"
Marco: explaining pagination abstraction - it's high level, work with every page
Marco: move to split methods if you can, write custom SQL/DQL if you have performance problems
Alan: explaining that InnoDB is slow at counting
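Marco: pagination works like this (the original query, then the count query, the identifier query,
and the final fetch restricted to those identifiers):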

SELECT a, b
FROM MyUsers a 
JOIN a.posts b

SELECT COUNT(DISTINCT a)
FROM MyUsers a 
JOIN a.posts b

SELECT DISTINCT a.id
FROM MyUsers a 
JOIN a.posts b

SELECT a, b
FROM MyUsers a 
JOIN a.posts b
WHERE a.id IN (:ids)

Broken query: assume 2 users with 1000 posts each.
The following query gives you 1 user with only 100 posts hydrated: the result is wrong, and the in-memory state is wrong too.

SELECT a, b
FROM MyUsers a 
JOIN a.posts b
LIMIT 100
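
The arithmetic behind the broken query can be checked with plain arrays. This is only a simulation of the joined result set, assuming the database returns all rows of the first user before those of the second; the array keys are made up for illustration.

```php
<?php

declare(strict_types=1);

// Simulate the joined result set: one row per (user, post) pair.
$rows = [];
$postId = 0;
foreach ([1, 2] as $userId) {
    for ($i = 0; $i < 1000; $i++) {
        $rows[] = ['user_id' => $userId, 'post_id' => ++$postId];
    }
}

// LIMIT 100 applies to joined rows, not to users.
$limited = array_slice($rows, 0, 100);

// Hydration groups the remaining rows by root entity.
$hydrated = [];
foreach ($limited as $row) {
    $hydrated[$row['user_id']][] = $row['post_id'];
}

// $hydrated now holds a single user whose posts collection
// contains 100 of the 1000 posts that user really has.
```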

Simpler query does not need paginator:

SELECT a, p
FROM MyUsers a
LEFT JOIN a.profile p # this is a *-to-one association

Jan: large numeric offsets seem to become problematic
Marco: could force it to make a range query by using identifiers (find first identifier after X)
Jan: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
Jan: asking about a tool/library that implements this
Marco: IMO avoid more tools here, write SQL. Explaining OLTP (OnLine Transaction Processing) vs reporting
Marco: suggesting to do more SQL
Alan: not afraid of writing more SQL
Marco: avoid SQL generators, write SQL by hand; avoiding magic also avoids unpredictable performance
Alan: segmentation is the biggest issue
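
The range-query idea from the offset discussion (find the first identifier after X) can be sketched as a small hand-written query builder. The table `contacts`, the columns `id` and `email`, and the function name `nextPageQuery()` are assumptions made up for this example.

```php
<?php

declare(strict_types=1);

/**
 * Builds the SQL and parameters for the page that follows $lastSeenId.
 * Passing null requests the first page.
 *
 * @return array{0: string, 1: array<string, int>}
 */
function nextPageQuery(?int $lastSeenId, int $pageSize): array
{
    if ($pageSize < 1) {
        throw new InvalidArgumentException('Page size must be at least 1.');
    }

    $sql = 'SELECT id, email FROM contacts';
    $params = [];

    if ($lastSeenId !== null) {
        // A range condition on the indexed identifier replaces OFFSET,
        // so the database does not have to walk and discard earlier rows.
        $sql .= ' WHERE id > :lastSeenId';
        $params['lastSeenId'] = $lastSeenId;
    }

    // $pageSize is an int, so concatenating it cannot inject SQL.
    $sql .= ' ORDER BY id ASC LIMIT ' . $pageSize;

    return [$sql, $params];
}
```

The caller remembers the last `id` of the page it just rendered and passes it in to fetch the next one. This gives "next page" navigation only; jumping to an arbitrary page number is not possible this way.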

Schema change -> migration to other stores

Marco: suggesting using different schema for transactional and reporting data.
Sikandar: use a new data store (column storage) for this, but it's in the pipeline and won't happen soon.
Alan: that's also the problem - Doctrine kinda forced us to stick to MySQL
Marco: explaining a simple example of an ES (event-sourced) repository:

<?php

final class ContactInformationRepository
{
    public function get(ContactId $id): Contact
    {
        $events = $this->connection->query('SELECT * FROM EVENTS .... WHERE ...');

        $contact = Contact::bare();

        foreach ($events as $e) {
            $contact->applyEvent($e);
        }

        return $contact;
    }
}
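
For the repository above to work, the aggregate has to rebuild its state from events. The notes do not show that side, so the following is a hypothetical sketch: the event shapes (`email_changed`, `page_visited`) and the accessors are invented for illustration.

```php
<?php

declare(strict_types=1);

// Hypothetical aggregate that is rebuilt purely by replaying events.
final class Contact
{
    private ?string $email = null;

    /** @var string[] */
    private array $visitedUrls = [];

    private function __construct()
    {
    }

    // Starts from an empty state; all data comes from applied events.
    public static function bare(): self
    {
        return new self();
    }

    /** @param array<string, string> $event */
    public function applyEvent(array $event): void
    {
        switch ($event['type']) {
            case 'email_changed':
                $this->email = $event['email'];
                break;
            case 'page_visited':
                $this->visitedUrls[] = $event['url'];
                break;
            default:
                throw new InvalidArgumentException('Unknown event type: ' . $event['type']);
        }
    }

    public function email(): ?string
    {
        return $this->email;
    }

    /** @return string[] */
    public function visitedUrls(): array
    {
        return $this->visitedUrls;
    }
}
```

Replaying the same events in the same order always produces the same contact, which is what makes the event log usable as a single source of truth.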

Sikandar: what about an entity that has a column with JSON?
Marco: doesn't need to be a repository
Alan: we're looking at a way to get a single source of truth (event-sourcing potentially), and it's managed by the API.
Alan: then we have queries to perform, like segmentation, like "who has visited X in the last Y days"
Alan: we could store in unstructured JSON table, and allow searching
Alan: it's possible to index JSON columns now - https://stackoverflow.com/a/61040738
Marco: again suggesting a split into two different schemas, one for reading and one for writing
Sikandar: we attempted using replication (1:1 schema too)
Marco: referring to CQRS, avoid it until really necessary
Marco: start with query objects

<?php

final class GetCountOfContactsInState
{
    public function __invoke(ContactState $state): int
    {
        // ...
    }
}

Queries can then be made swappable (domain has definition, infrastructure has implementation):

<?php

namespace Mautic\SomeComponent\Infrastructure;

final class GetCountOfContactsInSegment implements \Mautic\SomeComponent\Domain\ContactsInSegment
{
    public function __invoke(SegmentDefinition $segment): int
    {
        // ...
    }
}
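
What "swappable" buys in practice can be shown with a consumer that depends only on the query interface. All names here (`CountContactsInState`, `InMemoryCountContactsInState`, `DashboardReport`) are hypothetical, namespaces are left out to keep the sketch in one file, and the state is a plain string rather than a `ContactState` object.

```php
<?php

declare(strict_types=1);

// Domain side: the definition of the query.
interface CountContactsInState
{
    public function __invoke(string $state): int;
}

// Infrastructure side: one possible implementation. A SQL-backed or
// column-store-backed class could implement the same interface.
final class InMemoryCountContactsInState implements CountContactsInState
{
    /** @var array<int, string> contact ID => state */
    private array $statesByContactId;

    /** @param array<int, string> $statesByContactId */
    public function __construct(array $statesByContactId)
    {
        $this->statesByContactId = $statesByContactId;
    }

    public function __invoke(string $state): int
    {
        // array_keys() with a search value returns only the matching keys.
        return count(array_keys($this->statesByContactId, $state, true));
    }
}

// Consumer: knows the interface only, not where the data lives.
final class DashboardReport
{
    private CountContactsInState $countContacts;

    public function __construct(CountContactsInState $countContacts)
    {
        $this->countContacts = $countContacts;
    }

    public function summary(string $state): string
    {
        return sprintf('%d contacts in state "%s"', ($this->countContacts)($state), $state);
    }
}
```

Moving the query to another store then means writing a new implementation and changing the wiring, while `DashboardReport` stays untouched.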

Alan: so suggestion is to move from repositories to more granular queries
Marco: suggesting to use the ORM for storing/modifying information (OLTP), and move to query objects that perhaps
avoid the ORM overall for larger batch tasks

Next week

Perf profile - Xdebug output
Managing object relationships without enforcing FK constraints
ORM-generated queries vs native SQL queries: will it make any difference for performance?
Add a link to the Zoom call directly to the calendar entry