2021-09-07 Flushing Entities

Main Takeways:

Flushing every 20 entities and not clearing means that previous entities are still checked by Doctrine which can be a performance issue. Mautic doesn’t clear entities until after it is done with them since listeners my still manipulate the object. But we should consider some refactoring to improve this performance.

\Mautic\CoreBundle\Entity\CommonRepository::saveEntities performance

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 /** * Persist an array of entities. * * @param array|ArrayCollection $entities */ public function saveEntities($entities) { //iterate over the results so the events are dispatched on each delete $batchSize = 20; $i = 0; foreach ($entities as $entity) { $this->saveEntity($entity, false); if (0 === ++$i % $batchSize) { // saved 20, 20 in memory // saved 20, 40 in memory // saved 20, 60 in memory $this->getEntityManager()->flush(); $this->getEntityManager()->clear(); } } $this->getEntityManager()->flush(); }

Marco: discussed the above: ORM will have to consider 20 entities in the first flush, then 40, then 60...
Alan: do we use saveEntities() for large datasets (>200 records)
Jan: perhaps we use them on CSV imports
Marco: easy-pick on \Doctrine\ORM\Mapping\Builder\ClassMetadataBuilder::setReadOnly - will skip updates on entities
Marco: another good one could be to throw logs out UDP-style (logstash & co)
Alan: segments sometimes defined on logs
Jan: we call them logs, but they're mostly events
Marco: which segments do we lose if we throw logs somewhere else?
Alan: "did someone read an email within the last 30 days?"
Alan: "did they visit a specific page?"
Marco: sounds like missing "projector"

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # run-projector-1.php foreach ($db->query('SELECT * FROM logs OFFSET $startFrom') as $entry) { if ($entry['type'] === 'page_visited') { $db->insertIgnore('pages_visited_v1', ['page' => $entry['url'], 'lead' => $entry['lead_id']]); } if ($entry['type'] === 'page_visited') { $db->insertIgnore('pages_visited_v1', ['page' => $entry['url'], 'lead' => $entry['lead_id']]); } } # run-projector-2.php foreach ($db->query('SELECT * FROM logs OFFSET $startFrom') as $entry) { if ($entry['type'] === 'page_visited') { $db->insertIgnore('pages_visited_v2', [ 'page' => $entry['url'], 'another_field' => $entry['other'], 'lead' => $entry['lead_id'] ]); } }
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 CREATE TABLE lead (...); CREATE TABLE page_visited ( lead_id INT(11) NOT NULL, mail_id INT(11) NOT NULL, page_url VARCHAR(1024) NOT NULL ); CREATE TABLE page_visited_count ( lead_id INT(11) NOT NULL, mail_id INT(11) NOT NULL, page_url VARCHAR(1024) NOT NULL, counter UNSIGNED INT(11) NOT NULL ); CREATE TABLE page_visited_delay ( lead_id INT(11) NOT NULL, mail_id INT(11) NOT NULL, page_url VARCHAR(1024) NOT NULL, mail_sent_at DATETIME NOT NULL, page_visited_AT DATETIME NOT NULL, time_elapsed INT(11) NOT NULL ); SELECT * FROM lead l JOIN page_visited p ON p.lead_id = l.id WHERE p.mail_id = :mailId;

Marco: projectors well understood in event-sourcing
Marco: pitfalls: deployments can lead to new projectors, and projectors can always lag
Sikandar: one loop and all if in it?
Marco: one process per projection table - all ifs are specific to the projection. One process per projection (parallel)
Sikandar: can we use materialized views for this?
Marco: materialized view often very limited
Alan: materialized views gone from MySQL 8 :-( https://dev.mysql.com/doc/refman/8.0/en/faqs-views.html#faq-mysql-have-materialized-views
Marco: can certainly try stored procedures
Marco: another detail is that we can add the SELECT on the replica, and INSERT/UPDATE on the leader
Alan: planning to do this with the existing events too - centralizing events more
Alan: problem is that current queries use EXISTS() and NOT EXISTS() and subqueries sometimes
Jan: we may have changed some queries from JOIN to EXISTS() for performance reasons in the past
Marco: EXISTS() in WHERE can be a disaster in MySQL, because the optimizer can decide to execute one subquery per resultset candidate
Alan: we use Percona's PMM for performance monitoring

Transactions in the ORM

Marco: example

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 /** * Persist an array of entities. * * @param array|ArrayCollection $entities */ public function saveEntities($entities) { $this->gimmeSomething(); // SELECT (outside transaction) $this->getEntityManager()->transactional(function () use ($entities) { // START TRANSACTION; //iterate over the results so the events are dispatched on each delete $batchSize = 20; $i = 0; foreach ($entities as $entity) { $this->saveEntity($entity, false); if (0 === ++$i % $batchSize) { $this->getEntityManager()->flush(); } } $this->getEntityManager()->flush(); // redundant (already happens) }); // START TRANSACTION; }
1 $em->flush(); // START TRANSACTION; ... COMMIT;

Marco: discussed high consistency?
Alan: minutes OK, hours not OK - because of things like repeated operations
Sikandar: tried to relax from REPEATABLE READ - no performance gains
Marco: on query performance, did we try asking Percona directly?
Alan: already asked some DBAs - we're fine on that front.

What about further questions around the ORM?

  • FKs - discussed

  • TX isolation - discussed

EventSourcing intro

https://github.com/ShittySoft/symfony-live-berlin-2018-cqrs-es-workshop/pull/2

Marco: discussed rough intro of ES - full architectural overview deferred