When several components of a product exchange data, they all need to agree on a consistent data model. For ContextSDK, this is especially important since data is generated and processed by applications written in multiple programming languages. This blog post explores our approach to maintaining a shared understanding of data across our entire stack, from our iOS SDK to our ingest servers, machine learning tools, and beyond.
Introduction to ContextSDK
ContextSDK is designed to empower developers by helping them leverage their users' real-world context. To achieve this, we first collect context signals on the user's iOS device, store them in our data warehouse, build a machine learning model from the data, and ship the new model back in our iOS SDK for the customer to integrate.
Data must flow through a system composed of different components, each developed in the language most suitable for its task. For example, our machine learning code is written in Python, our ingest server is a Node.js application, and our iOS SDK is, of course, written in Swift.
The challenge lies in ensuring that as data passes through these components, its integrity and structure remain intact, and that each component understands how to read and work with the data.
Centralized Data Definition
At the core of our strategy to achieve data consistency is a centralized tool developed in TypeScript. This tool serves as the single source of truth for all data definitions within ContextSDK. It outlines the signals we track, specifying their data type, name, and unique ID. Whenever a new signal is introduced, it's defined here first, ensuring that every aspect of ContextSDK has a consistent understanding of the data from the get-go.
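To make this concrete, here is a minimal sketch of what such a central definition could look like. The field names, the `SignalType` union, the example signals, and the `findDuplicates` helper are all assumptions made for illustration, not the actual ContextSDK code.

```typescript
// Hypothetical shape of a signal definition (field names are assumptions).
type SignalType = "int" | "float" | "bool" | "string";

interface SignalDefinition {
  id: number; // unique ID of the signal
  name: string; // name used in generated code
  type: SignalType; // data type of the signal's value
}

// Illustrative entries only, not the real ContextSDK signal list.
const signals: SignalDefinition[] = [
  { id: 1, name: "batteryLevel", type: "float" },
  { id: 2, name: "isCharging", type: "bool" },
  { id: 3, name: "screenBrightness", type: "float" },
];

// Returns a list of human-readable problems; an empty list means
// every ID and every name is unique.
function findDuplicates(defs: SignalDefinition[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<number>();
  const seenNames = new Set<string>();
  for (const def of defs) {
    if (seenIds.has(def.id)) {
      problems.push(`duplicate id ${def.id} (${def.name})`);
    }
    if (seenNames.has(def.name)) {
      problems.push(`duplicate name ${def.name}`);
    }
    seenIds.add(def.id);
    seenNames.add(def.name);
  }
  return problems;
}

console.log(findDuplicates(signals));
```

Because every signal lives in one list, a check like this can run before any code is generated, so a clashing ID is caught in one place rather than in each service.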
The beauty of this approach is that it allows us to automatically generate strongly typed bindings for all signals for all our services. Code generation need not be complicated either; we can achieve everything we need by using simple string templating.
We don’t have a fully automatic solution for running DB migrations in our ClickHouse DB yet, but this tool also validates that all defined signals have a matching column in our table. This helps ensure that at release time, all necessary migrations have been written and executed and that the DB is in the shape we expect.
Introducing this tool has really streamlined the process of adding new signals to ContextSDK, as in the past it would have required modifications to the code of multiple services, always with the risk of overlooking one.
Technical Details
This post is intentionally kept at a surface level, as the actual code is not very interesting. It essentially boils down to variants of a simple string template, one for each target application. For example, the output for our Swift codebase looks just like any other Swift enum would.
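As a rough illustration of the string templating described here, the following sketch turns a list of signal definitions into the source of a Swift enum. The `SignalDefinition` shape, the example signals, the enum name `Signal`, and the `generateSwiftEnum` function are hypothetical; the real templates may look different.

```typescript
// Hypothetical signal definition; field names are assumptions.
interface SignalDefinition {
  id: number;
  name: string;
  type: "int" | "float" | "bool" | "string";
}

// Illustrative entries only.
const signals: SignalDefinition[] = [
  { id: 1, name: "batteryLevel", type: "float" },
  { id: 2, name: "isCharging", type: "bool" },
];

// Builds Swift source code with plain template strings.
// Each signal becomes one enum case whose raw value is its unique ID.
function generateSwiftEnum(enumName: string, defs: SignalDefinition[]): string {
  const cases = defs.map((d) => `    case ${d.name} = ${d.id}`).join("\n");
  return `enum ${enumName}: Int {\n${cases}\n}\n`;
}

console.log(generateSwiftEnum("Signal", signals));
// Prints:
// enum Signal: Int {
//     case batteryLevel = 1
//     case isCharging = 2
// }
```

A generator for another target, such as Python or the Node.js ingest server, would be a second function of the same shape that reads the same list.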
For the DB validation, we use a simple SQL command that lists the columns of our table together with their data types.
We then compare the output with our listing and print any missing columns, or ones where the data type doesn’t match.
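The comparison step could be sketched as follows. The query string is one possible way to list columns in ClickHouse, using its `system.columns` table; the database name `default`, the table name `signals`, the `columnType` field, and the `validateColumns` function are assumptions for illustration. Running the query is left out so that the sketch stays self-contained.

```typescript
// Hypothetical definition: each signal names the column type it expects.
interface SignalDefinition {
  name: string;
  columnType: string;
}

// One row of the column listing returned by the database.
interface ColumnInfo {
  name: string;
  type: string;
}

// Assumed query; database and table names are placeholders.
const LIST_COLUMNS_SQL =
  "SELECT name, type FROM system.columns " +
  "WHERE database = 'default' AND table = 'signals'";

// Compares the definitions with the actual columns and returns one
// message per missing column or mismatching data type.
function validateColumns(defs: SignalDefinition[], columns: ColumnInfo[]): string[] {
  const actual = new Map<string, string>();
  for (const column of columns) {
    actual.set(column.name, column.type);
  }
  const problems: string[] = [];
  for (const def of defs) {
    const actualType = actual.get(def.name);
    if (actualType === undefined) {
      problems.push(`missing column: ${def.name}`);
    } else if (actualType !== def.columnType) {
      problems.push(
        `type mismatch for ${def.name}: expected ${def.columnType}, found ${actualType}`
      );
    }
  }
  return problems;
}

// Example with hand-written rows standing in for the query result.
const problems = validateColumns(
  [
    { name: "batteryLevel", columnType: "Float32" },
    { name: "isCharging", columnType: "Bool" },
  ],
  [{ name: "batteryLevel", type: "Float64" }]
);
problems.forEach((p) => console.log(p));
```

Returning the problems as a list rather than throwing on the first one means a single run reports every migration that is still outstanding.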
The point here is that sometimes you shouldn’t get bogged down in trying to find the perfect solution, but simply build something that solves the problem at hand. If it serves the current requirements and doesn’t take an excessive amount of time to build, it is probably a good investment.
Conclusion
Keeping multiple independent pieces of code in sync doesn’t need to be complicated. We chose to generate code based on some type definitions in TypeScript, but the same could be achieved by sharing a JSON file with all the signals, or anything else, really. The main point is that there should be only a single place where this information, in our case, all the signals we work with, is recorded. By centralizing them, processes around adding and removing signals become clear, and other services can easily be kept in sync with any changes made.