Engineering data pipelines, but not code
Time: Thursday 18th May 14:45-15:15
Room: Main Theatre
Track: Drupal Development
The NSW Government is undergoing a large project to consolidate many *.nsw.gov.au websites into www.nsw.gov.au. We have a team of 10+ developers at any one time and this brings with it unique challenges.
For example, many developers end up solving similar problems in different ways. This adds time to the development lifecycle, makes it difficult for others to pick up the implementation (and the code), and reduces reuse across teams and developers.
One of these problems is that we pull data from a tonne of different sources (JSON and CSV files over HTTP, APIs, push/pull, etc.) and from a bunch of different providers. Given the scale of our site, using the data directly from the source wasn’t an option. We always imported the data into our site and either wrapped it in a controller or imported it into a custom entity.
We reviewed the issues with that approach and how to solve them. Data Pipelines (https://www.drupal.org/project/data_pipelines) is the result of that review.
It solved these problems by:
- Identifying that we could store most of the data in unstructured NoSQL storage.
- Removing the need to create a bunch of custom entities, saving time and improving database performance.
- Removing the custom controllers that implemented custom APIs.
- Removing requests to our application server altogether by allowing the data to be accessed via Elasticsearch.
- Removing the security issue of ‘trusting’ the data from the remote source.
- Mitigating our risk of invalid data causing flow-on issues.
- Making the backend and the frontend problem the same for all developers.
- Making sure we don’t inadvertently take down our data providers with too many requests.
- Enabling reuse of frontend React applications.
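To make the validation and 'trust' points above concrete, here is a minimal sketch of a pipeline that parses a CSV source, drops invalid rows and escapes values before the records are handed to a document store. It is not the Data Pipelines module's API: the names parse_csv, require_fields, escape_values and run_pipeline, along with the sample data, are hypothetical and chosen only for illustration.

```python
import csv
import html
import io
from typing import Callable, Iterable, Iterator

Record = dict
Step = Callable[[Iterable[Record]], Iterator[Record]]


def parse_csv(text: str) -> Iterator[Record]:
    """Yield one dict per CSV row, keyed by the header row."""
    yield from csv.DictReader(io.StringIO(text))


def require_fields(*names: str) -> Step:
    """Build a step that drops rows missing any of the named fields."""
    def step(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if all((record.get(name) or "").strip() for name in names):
                yield record
    return step


def escape_values(records: Iterable[Record]) -> Iterator[Record]:
    """Escape HTML in every value so remote data is never trusted as markup."""
    for record in records:
        yield {
            key: html.escape(value.strip())
            for key, value in record.items()
            if key is not None and isinstance(value, str)
        }


def run_pipeline(source: Iterable[Record], steps: list) -> list:
    """Chain the steps lazily, then collect the documents ready for indexing."""
    for step in steps:
        source = step(source)
    return list(source)


# Hypothetical sample data: the second row has no name, so it is dropped.
raw = "name,suburb\nService NSW <b>Centre</b>,Parramatta\n,Orphan row\n"
docs = run_pipeline(parse_csv(raw), [require_fields("name", "suburb"), escape_values])
```

In a setup like the one described, the resulting docs list would be bulk-indexed into the search backend rather than saved as custom entities. Because each step is a plain function over records, the same steps can be reused across sources and providers.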
I'm not actually sure if 'Drupal Development' is the correct category for this. If you have some insight, I'd be happy to discuss it.
Speakers
Nathan ter Bogt
I've been working with Drupal since version 4 and have held a number of roles at different places. Recently I worked for Flight Centre, where I built a framework that was the basis for 20 sites, upgrading from one Drupal 7 multi-site to separate Drupal 8 instances. I now work for the Department of Customer Service, helping to consolidate 700 sites across government into one site.
Additional speakers: No additional speakers