Fieldwork
January 26, 2026

Under the Hood of openJII: Cloud Infrastructure, IoT Integration, and Data Pipelines

Dominik Vrbic, Tech Lead (openJII)

Building an open science platform for real-time sensor data and global collaboration means tackling significant infrastructure challenges. In this second part of the openJII launch series, we dive into the cloud infrastructure and data engineering behind the platform. We’ll cover how openJII connects IoT sensors from the field to the cloud, how data flows through an event-driven pipeline, and how we manage and transform that data using a modern medallion architecture (Bronze–Silver–Gold layers). We’ll also highlight the tools and approaches – from AWS IoT Core to Databricks and Infrastructure-as-Code – that make this all possible.

Connecting Field Sensors to the Cloud (AWS IoT Core)

At the heart of openJII’s infrastructure is the ability to ingest data from distributed IoT sensors that measure photosynthesis. The Jan IngenHousz Institute and its partners have developed advanced sensors (such as the MultispeQ device) that continuously capture data on chlorophyll fluorescence, CO₂ levels, light intensity, and more[9]. These devices might be deployed in research plots, greenhouses, or even labs worldwide. How do we gather all that data in one place, in real time?

We leverage AWS IoT Core as the bridge between physical sensors and the cloud. Each authorized device is provisioned in IoT Core, which provides secure communication (via MQTT or HTTPS) so the sensor can stream its data. AWS IoT Core acts as a scalable ingestion point – it can handle thousands of devices sending frequent readings. We set up IoT Rules in AWS: when a device publishes a new measurement, a rule triggers a pipeline of actions[10]. For example, a rule might forward the raw data payload to an AWS Lambda function for preprocessing[11] or push it to a streaming queue. This event-driven architecture means that the moment a sensor reading arrives, it is automatically picked up for processing without waiting on a batch job. As AWS documentation notes, IoT events can invoke services like Lambda to pass along data and trigger workflows in real time[12].
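To make this concrete, here is a minimal sketch of what such a preprocessing Lambda could look like. The bucket name, object key layout, and payload fields are illustrative assumptions, not openJII's actual configuration.

```python
# Hypothetical sketch of a preprocessing Lambda invoked by an AWS IoT rule.
# Bucket name, key layout, and payload fields are illustrative assumptions.
import json
import os
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BRONZE_BUCKET = os.environ.get("BRONZE_BUCKET", "openjii-bronze-landing")  # assumed name


def handler(event, context):
    """Receive one sensor message (as selected by the IoT rule) and land it raw in Bronze."""
    device_id = event.get("device_id", "unknown")
    received_at = datetime.now(timezone.utc)

    # Key the object by device and time so raw data stays easy to partition later.
    key = (
        f"multispeq/{device_id}/"
        f"{received_at:%Y/%m/%d}/{received_at:%Y%m%dT%H%M%S%fZ}.json"
    )

    # Store the payload exactly as received; all cleaning happens downstream (Silver).
    s3.put_object(
        Bucket=BRONZE_BUCKET,
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
        ContentType="application/json",
    )
    return {"stored": key}
```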

By designing around events, we ensure that openJII can react almost instantly to incoming data. If a sensor in Kenya records a drop in photosynthetic efficiency during a drought experiment, that data is in our cloud and being analyzed within seconds. Researchers can even configure real-time alerts (for example, sending an email or SMS when photosynthetic efficiency drops below a threshold). The loosely coupled, event-driven approach also improves scalability – adding more devices or data types doesn't slow down a central loop; instead, each event is handled independently and in parallel. This aligns with modern best practices where systems are composed of small services reacting to streams of events[13][14]. In openJII, we utilize AWS's serverless infrastructure (Lambda, EventBridge, etc.) to orchestrate these events, which keeps operational overhead low and allows the system to scale automatically with load.
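As an example of how such an alert could be wired in, the sketch below checks a reading against a threshold and publishes to an Amazon SNS topic, which can fan out to email or SMS. The metric name, threshold value, and topic ARN are assumptions for illustration.

```python
# Hedged sketch: a threshold alert check that could run inside the same event path.
# The metric name, threshold, and SNS topic ARN are illustrative placeholders.
import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:openjii-alerts"  # placeholder ARN
PHI2_THRESHOLD = 0.4  # hypothetical lower bound for photosynthetic efficiency


def maybe_alert(reading: dict) -> None:
    """Publish an alert if a reading falls below the configured efficiency threshold."""
    phi2 = reading.get("phi2")
    if phi2 is not None and phi2 < PHI2_THRESHOLD:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="openJII alert: low photosynthetic efficiency",
            Message=(
                f"Device {reading.get('device_id')} reported phi2={phi2}, "
                f"below threshold {PHI2_THRESHOLD}."
            ),
        )
```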

Data Pipeline: From Raw Streams to Insightful Data Lake

Once data from sensors is securely in the cloud, the next challenge is to store it and transform it into useful insights. openJII adopts a medallion architecture for its data pipeline, inspired by the best practices of data lakehouses[15]. In essence, every piece of data goes through staged layers: Bronze (raw), Silver (filtered/processed), and Gold (aggregated/analysis-ready)[16]. We've actually implemented two parallel medallion pipelines (a “dual medallion architecture”) to handle different data streams in the platform – one for the time-series sensor readings and another for higher-level experimental or genomic data. This ensures each domain of data is curated through appropriate steps while maintaining full lineage.

Medallion architecture organizes data into Bronze, Silver, and Gold layers. In openJII’s pipeline, raw IoT sensor data is stored in the Bronze layer, then cleaned and combined into Silver, and finally aggregated or analyzed in Gold. This approach maintains a transparent chain of custody from original measurements to final insights[16][17].

Bronze Layer – Raw Data: All incoming sensor measurements are first stored in their raw form (Bronze). This might be a cloud storage bucket or a raw Delta Lake table where each record is exactly what the device sent, timestamped and immutable. By keeping this original data, we preserve a source-of-truth that can always be referred back to[18]. For example, if a sensor sends a reading every minute, those JSON payloads or CSV rows are stored as-is in the Bronze layer, complete with any noise or irregularities. This layer is invaluable when verifying results later; one can always trace back to see “what did the device actually observe at that moment.”
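As an illustration, the following PySpark sketch appends raw landed payloads to a Bronze Delta table without touching their content; the landing path and table name are assumptions rather than openJII's actual configuration.

```python
# Illustrative PySpark/Databricks sketch: append raw landed files to a Bronze Delta table.
# Paths and table names are assumptions; the payload is kept as an untouched string.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "s3://openjii-bronze-landing/multispeq/"  # assumed landing location

bronze = (
    spark.read.text(raw_path)                      # one row per payload, exactly as received
    .withColumnRenamed("value", "raw_payload")
    .withColumn("ingested_at", F.current_timestamp())
    .withColumn("source_file", F.input_file_name())
)

# Append only: Bronze is immutable, so records are never updated or overwritten here.
bronze.write.format("delta").mode("append").saveAsTable("bronze.multispeq_raw")
```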

Silver Layer – Processed Data: From Bronze, data flows to Silver, where we apply transformations, cleaning, and integration. Here, the platform's processing pipeline (powered by Databricks Spark jobs, AWS Glue jobs, or similar) takes raw readings and, for instance, calibrates values, filters out erroneous readings, and attaches contextual metadata. At this stage we also combine data from multiple sensors where needed, or integrate external sources such as weather data. The result is a refined dataset that's ready for analysis but still detailed – nothing critical is aggregated away yet. Silver tables might include computed photosynthesis metrics like electron transport rate, with invalid measurements removed or flagged. We also apply validation rules (defined by scientists or by our data standards) here, and these rules are versioned and documented. Because we keep the transformations transparent, anyone can see exactly how Silver data is derived from Bronze[19]. This ensures reproducibility: other researchers could take the Bronze data and our transformation code and get the same Silver results[17].
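A minimal Bronze-to-Silver sketch, assuming hypothetical field names and a simple validity rule (the real calibration logic and metadata joins live in openJII's pipeline code):

```python
# Sketch of one Bronze-to-Silver transformation with assumed field names and a
# simple plausibility check; real calibration and joins are more involved.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

payload_schema = StructType([
    StructField("device_id", StringType()),
    StructField("measured_at", TimestampType()),
    StructField("phi2", DoubleType()),            # photosystem II efficiency (assumed field)
    StructField("light_intensity", DoubleType()),
])

silver = (
    spark.read.table("bronze.multispeq_raw")
    .withColumn("payload", F.from_json("raw_payload", payload_schema))
    .select("ingested_at", "payload.*")
    # Flag readings outside a physically plausible range instead of silently dropping them.
    .withColumn("is_valid", F.col("phi2").between(0.0, 1.0))
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver.multispeq_readings")
```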

Gold Layer – Aggregated Insights: Finally, key results are distilled into the Gold layer. Gold data is what you’d use for visualizations, publications, or sharing with a wider community. In Gold, we might have daily summary statistics of photosynthetic performance for each plant, or the results of a model that predicts crop yield from the sensor data. The focus here is on being analysis-ready and easy to query. Because Gold is built from Silver, and Silver from Bronze, any derived insight comes with a full provenance trail. If someone sees an interesting pattern in a Gold dataset, they can trace it back through Silver to the original raw data point that contributed to it – all within openJII. This chain-of-custody enhances trust in the data: nothing is a black box. As the Info.nl development team described, this transparent pipeline means external researchers or reviewers can drill down into any result and see its origins, which is critical for scientific validity[16][17].
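A corresponding Silver-to-Gold sketch, again using the assumed table and column names from above, computing daily summaries per device:

```python
# Sketch of a Silver-to-Gold aggregation: daily photosynthesis summaries per device.
# Table and column names follow the assumptions made in the Silver sketch.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

gold = (
    spark.read.table("silver.multispeq_readings")
    .where(F.col("is_valid"))                          # only trusted readings feed Gold
    .groupBy("device_id", F.to_date("measured_at").alias("day"))
    .agg(
        F.avg("phi2").alias("mean_phi2"),
        F.min("phi2").alias("min_phi2"),
        F.max("light_intensity").alias("max_light_intensity"),
        F.count("*").alias("n_readings"),
    )
)

gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_photosynthesis_summary")
```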

Implementing this medallion architecture was made easier by using Databricks on our infrastructure[20]. Databricks, with its Delta Lake technology, is well suited to medallion patterns – it supports scalable processing of large datasets and versioned management of tables. We use Databricks notebooks and jobs to define the transformations from Bronze to Silver to Gold. The “dual” aspect of our medallion approach refers to the fact that we maintain multiple Delta Lake schemas: one for the core sensor telemetry, and one for related datasets (like genetic data or user-uploaded manual measurements). Both follow Bronze/Silver/Gold principles, but they may have different transformations. For example, genetic data might go through its own pipeline (cleaning genotype information, for instance) separate from the sensor pipeline, until the Gold layer, where insights from both are joined for final analysis. The medallion framework handles this complexity by keeping data organized and incremental – we can update Bronze continuously as new data streams in, and run scheduled jobs to incrementally update Silver and Gold.

An important design choice we made is favoring ELT (Extract, Load, Transform) over traditional ETL for these pipelines. In classic ETL, data would be transformed before loading – but with the scale and variety of openJII data, it’s more flexible to load everything raw first, then transform inside the data platform. This ELT approach takes advantage of modern cloud storage and compute power to handle transformations on-demand[21][22]. We extract from sources (IoT devices, external data APIs), load into our Bronze storage immediately, and then transform within our Databricks environment to produce Silver/Gold. The benefit is twofold: (1) We don’t lose or alter any data on ingestion – everything is saved, and transformations can be refined or re-run if needed. (2) We can easily onboard new data sources without having to rewrite a complex pre-processing pipeline; just dump to Bronze and worry about transformation logic separately. As the Databricks team notes, ELT aligns well with data lake architectures that handle both structured and unstructured data, offering flexibility for analysts to define transformations after loading all the raw data[22].

Event-Driven Orchestration and Workflow Automation

Coordinating all these moving parts – ingesting data, triggering transformations, updating dashboards – is handled by an event-driven orchestration model in openJII. We touched on this with AWS IoT events kicking off Lambdas. Beyond that, we use AWS EventBridge and/or Step Functions to schedule and orchestrate data pipeline jobs. For example, when new Bronze data arrives (an event), it can trigger a Databricks job to update the Silver layer for that experiment. Or we might have a nightly schedule to recompute certain Gold aggregates, but even those runs can be triggered by a day-end event. The advantage of an event-driven approach is that components of the system remain loosely coupled and scalable[14]. If tomorrow we integrate a new sensor type that sends a completely new kind of data, we can write a new rule and function to process it without disturbing the existing flow – it will simply produce a new Bronze table and a corresponding Silver/Gold pipeline. This architecture naturally supports real-time analytics: we are essentially streaming data into Bronze and can stream-process it into Silver for certain use cases (using Spark Structured Streaming or AWS Kinesis, for instance). Researchers thus get a low-latency loop from field measurement to cloud to insight. In the future, this could enable features like automated feedback to experiments (imagine an IoT device changing a parameter in the field when the data indicates a certain threshold has been reached – an IoT-driven experimental control loop).
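To illustrate the orchestration hop, here is a hedged sketch of a small Lambda that an EventBridge rule could invoke to start a Databricks job through the Jobs API. The workspace host, token handling, and job ID are placeholders, not openJII's actual setup.

```python
# Hedged sketch: a Lambda that an EventBridge rule could invoke to kick off a
# Databricks job when new Bronze data lands. Host, token handling, and job ID
# are illustrative; the Databricks Jobs API "run-now" endpoint does the work.
import os

import requests

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # in practice, fetched from a secrets manager
SILVER_JOB_ID = int(os.environ.get("SILVER_JOB_ID", "0"))  # assumed job that refreshes Silver


def handler(event, context):
    """Trigger the Silver refresh job, passing the experiment ID from the event."""
    response = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        json={
            "job_id": SILVER_JOB_ID,
            "notebook_params": {"experiment_id": event.get("experiment_id", "")},
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # contains the run_id of the triggered job
```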

Additionally, we incorporated sensor integration tools to simplify bringing new devices into the platform. One such tool is Node-RED, a visual flow programming environment for IoT. We have Node-RED available in our dev stack[23] to prototype how a new sensor’s data might be ingested and processed. A researcher or developer can whip up a Node-RED flow to subscribe to a device’s MQTT feed, apply some custom logic, and then send data into openJII’s API – all without writing traditional code. While our core pipeline is robust and automated, this flexibility is important for an evolving research field, where today’s “new device” might not have full integration code written yet. It invites community contributions on the integration front as well.

Modern Infrastructure as Code and Deployment

Running a complex platform like openJII requires managing cloud resources in a reproducible way. We’ve embraced Infrastructure as Code (IaC) to configure and deploy everything from IoT policies to databases to analytics clusters. Specifically, we use OpenTofu for our IaC needs[20]. OpenTofu, an open-source fork of Terraform, lets us define cloud infrastructure declaratively. Every component of openJII’s AWS setup is defined in code (in HCL syntax via OpenTofu), which means we can rebuild the entire environment or deploy to a new region with minimal effort. This approach also aligns with our open-source philosophy: our infrastructure definitions themselves can be open and collaborative.

Why OpenTofu and not Terraform? HashiCorp’s Terraform had long been the go-to tool, but it moved to a non-open-source license in 2023; by using the community-driven fork OpenTofu, we ensure that even our deployment toolchain remains fully open source. This avoids any licensing complications and reinforces the message that openJII is open at every level – from code to cloud provisioning. Contributors can see our IaC scripts in the repo and understand how we set up AWS IoT Core, how we secure S3 buckets for data, how we configure Databricks workspaces, and so on, and they can suggest improvements. For instance, if there’s a more cost-efficient way to handle our storage, a contributor could propose a change to the OpenTofu scripts.

On the deployment side, we use CI/CD pipelines (GitHub Actions) to test and deploy changes. When new code is merged, automated tests run (we maintain a battery of tests for the NestJS API, for example) and if all is well, the changes can be rolled out to our staging or production environment via the IaC pipeline. This minimizes downtime and regressions. Since researchers depend on the platform for ongoing experiments, we have to be careful with updates – infrastructure as code helps by making deployments predictable and revertible if something goes wrong.

Security and data privacy are also critical in the infrastructure. AWS IoT Core ensures secure device authentication and communication. Data in transit and at rest is encrypted. We enforce least privilege principles in our cloud roles (also managed via OpenTofu definitions). For example, the Lambda that processes sensor data only has permission to read from the IoT message queue and write to the specific Bronze storage location, nothing more. These measures protect against both external breaches and accidental misuse. Being an open science platform doesn’t mean open to attackers – it means open in collaboration and transparency, while rigorously safeguarding the integrity of the data.

Sensor Integrations and Future-Proofing

openJII launched with full support for the MultispeQ device – a popular handheld photosynthesis meter that many in the community use (originating from the PhotosynQ project)[24]. Researchers can connect their MultispeQ to openJII and have its measurements flow into the platform seamlessly. We achieved this by working closely with device firmware developers so the platform speaks the same data protocols as the device, and by providing users with easy pairing flows (via the mobile app or over USB in the browser). The platform is built to integrate new sensors as they emerge[24]. This could include novel IoT sensors measuring soil moisture, environmental conditions, or even camera-based systems assessing plant health. Thanks to our modular, event-driven design, adding support for a new sensor can be as straightforward as writing a new parser for its data format and plugging it into the ingestion pipeline, as sketched below.
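To give a flavor of what “writing a new parser” might look like, here is a hypothetical example that maps a fictional soil-moisture sensor payload onto a common ingestion shape. The field names on both sides are invented for illustration only.

```python
# Illustration of what a parser for a future sensor could look like.
# Both the incoming payload shape and the normalized target fields are hypothetical.
from datetime import datetime, timezone


def parse_soil_moisture_payload(payload: dict) -> dict:
    """Map a hypothetical soil-moisture sensor message onto a common ingestion shape."""
    return {
        "device_id": payload["sensor"]["serial"],
        "measured_at": datetime.fromtimestamp(payload["ts"], tz=timezone.utc).isoformat(),
        "metric": "soil_moisture_vwc",   # volumetric water content
        "value": payload["readings"]["vwc_percent"],
        "unit": "%",
        "raw": payload,                   # keep the original for the Bronze layer
    }
```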

We also anticipate integrating external data sources. The platform could ingest satellite weather data, for instance, or accept data from other open science initiatives. Using industry-standard APIs and open data formats will ease these integrations. Our use of JSON and CSV in the raw layer, and Parquet/Delta in processed layers, means we can interface with numerous tools and languages. It’s all about being future-proof – ensuring that in 5 or 10 years, as technology advances, openJII can adapt without a complete rewrite.

In summary, the infrastructure of openJII is cloud-native and scalable by design. By marrying AWS IoT for device connectivity, event-driven serverless processing for responsiveness, and a robust data lakehouse architecture for analytics, we provide a platform that can grow with the science. Researchers get a reliable, fast system that abstracts away the complexity of big data and cloud, yet is built on cutting-edge tech under the hood. And importantly, all of this is configured and run in an open, transparent manner, inviting the community to peek under the hood, learn, and contribute.

Stay tuned for the final article in our series, where we focus on openJII’s open-source ethos and community – how this platform continues the legacy of PhotosynQ’s open research spirit, and how developers and scientists worldwide can get involved.