This is a development version of the documentation and may contain inaccuracies! Please find the official documentation at https://opendatahub.readthedocs.io/en/latest/
Platform Guidelines - Full Version¶
- 2018-05-28 version 1.0
- 2018-03-30 version 1.0-beta
This document represents Part 1 of the guidelines and presents the preferred programming languages, databases, and protocols to be used, data exchange and exposition methods, coding conventions, and regulates the use of third-party libraries.
There are scenarios where an exemption from the guidelines is acceptable. The following is a non-exhaustive list of such scenarios.
- Use of foreign technologies. The development of an Open Data Hub component requires the use of platforms, languages, or other technologies that differ from the ones listed in the guidelines. An example might be a component that depends on an already developed custom library written in a programming language not listed in the guidelines.
- Use of technologies that are not mentioned in the guidelines. A future Open Data Hub component might require technology that is not listed at all in the guidelines. An example is a component that must be hosted on specific hardware needed for machine learning platforms.
An Open Data Hub contributor who runs into such a scenario must contact the Open Data Hub team to discuss that specific scenario. If the exemption is reasonable and can be justified, the Open Data Hub team will agree and allow it. To avoid misunderstandings, contributors should expect to receive a written statement about such a decision.
If you cannot find an answer to your question or doubt in this document, please contact the Open Data Hub team or open an issue in the GitHub repository of this document.
Platforms and Architectural Considerations¶
Java server applications running in Apache Tomcat¶
Apache Tomcat is a well-established, lightweight FOSS web server that implements, among others, the Java Servlet specification.
The Open Data Hub team generally uses the latest or second-to-last release of Tomcat to run Java server applications in the following contexts:
- API/REST end points.
- Web applications.
The desired design is that only API/REST end points directly access the database server, while web applications just talk to the API/REST end points.
Each Tomcat instance normally runs a few web applications, so expect an Open Data Hub web application's WAR file to be bundled together with other WAR files to run on a given instance.
The automatic build system takes care of this bundling and deploying. It is therefore very important that all WARs can be built automatically, as mentioned in the section about Java.
No File System Persistence¶
Currently, the Open Data Hub team uses Amazon Web Services for Tomcat hosting, in particular the managed service known as Elastic Beanstalk. There is no hard dependency on this provider, which could change at any point in the future, but the architectural design of Elastic Beanstalk has partly shaped the engineering choices of the Open Data Hub team in the design of its web applications.
First and foremost, servers are considered volatile. This means an Open Data Hub component running in Tomcat cannot expect to see a persistent file system!
All web applications must therefore be developed with the database as the only persistent storage layer. This architectural choice has a few advantages:
- Web applications can be distributed over more than one web server (horizontal scaling), increasing availability and performance.
- Backup and disaster recovery are greatly simplified: a failing instance can just be replaced by a new instance and the application can be deployed again.
Developers must pay particular attention to this point: There is no persistent file system. Hence there are no changeable configuration files and no application-specific log files. Everything is stored in the database.
One subtle point is the question “Where is the JDBC data source and password stored?”. It cannot be stored in a file and it must not be stored in the source code or context files. The recommended way to store this information is in Java environment properties.
The system sets these variables when launching Tomcat, and the developer can then read them from within the application at runtime.
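A minimal sketch of how an application might read this setting, under stated assumptions: the property name `JDBC_CONNECTION_STRING` and the helper class `DataSourceConfig` are hypothetical and not confirmed by these guidelines. The sketch assumes the value is passed as a JVM system property (for example with a `-D` option when launching Tomcat) and, as a further assumption, falls back to an operating-system environment variable of the same name.

```java
import java.util.Optional;

/** Hypothetical helper: reads the JDBC connection string without touching the file system. */
final class DataSourceConfig {

    // Assumed name, for illustration only; the real name must be agreed with the Open Data Hub team.
    static final String PROPERTY = "JDBC_CONNECTION_STRING";

    private DataSourceConfig() {}

    /** Looks up the JVM system property first, then an environment variable of the same name. */
    static Optional<String> jdbcUrl() {
        String value = System.getProperty(PROPERTY);
        if (value == null || value.trim().isEmpty()) {
            value = System.getenv(PROPERTY);
        }
        return Optional.ofNullable(value).filter(v -> !v.trim().isEmpty());
    }

    /** Fails fast at startup when the setting is missing, instead of falling back to a file. */
    static String requireJdbcUrl() {
        return jdbcUrl().orElseThrow(
            () -> new IllegalStateException("Missing setting: " + PROPERTY));
    }
}
```

Calling `DataSourceConfig.requireJdbcUrl()` once during application startup keeps the credentials out of the source code and context files, and makes a missing setting visible immediately.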
The Open Data Hub encompasses a considerable number of web applications that are bundled together to run on a few Tomcat server instances. Contrary to popular belief, RAM is not an infinite resource. Contributors are kindly reminded to pay attention to the RAM usage of their web applications, since load testing is expected.
Java standalone applications, running headless¶
These are meant for special use cases, such as compute-intensive jobs or batch processing, and are set up upon request.
Almost everything said in the previous section about Tomcat applies here as well.
Again, the preferred way to run these applications is in an environment where servers are volatile and the only persistence layer is the database.
PostgreSQL is one of the most established RDBMS on the market and is generally described as being by far the most advanced FOSS RDBMS. It has therefore been chosen as the primary database system for Open Data Hub.
There is a new major release of PostgreSQL per year and each release is supported for 5 years, according to the versioning policy. Contrary to the case of the other products mentioned in these guidelines, the Open Data Hub team generally will not run the latest or even previous version of PostgreSQL. Expect the version available for Open Data Hub to lag about 2-3 years behind the latest available release.
Other extensions are very likely not available, so ask the Open Data Hub team if in doubt.
Accessing the Database¶
The data source strings must be parsed from the environment variables (see section Java server applications running in Apache Tomcat).
The maximum number of concurrent database sessions will generally be limited per role. Each developer must therefore clarify with the Open Data Hub team what an acceptable number is, depending on the application.
Since PostgreSQL will refuse a connection if that number is exceeded, developers must take this number into account, whether they configure a connection pool or not.
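As a hedged illustration of taking the session limit into account without a connection pool, the following sketch caps concurrent database work with a semaphore. The class `SessionLimiter` and its method names are hypothetical; when a connection pool is used, its maximum pool size setting would normally play the same role.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical guard that caps how many database sessions one application uses at once. */
final class SessionLimiter {

    private final Semaphore permits;

    /** maxSessions is the per-role limit agreed with the Open Data Hub team. */
    SessionLimiter(int maxSessions) {
        if (maxSessions < 1) {
            throw new IllegalArgumentException("maxSessions must be >= 1");
        }
        this.permits = new Semaphore(maxSessions, true);
    }

    /** Runs the task while holding one permit; gives up if no permit frees up in time. */
    <T> T withSession(long timeoutMillis, Supplier<T> task) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a database session", e);
        }
        if (!acquired) {
            throw new IllegalStateException("No database session available within timeout");
        }
        try {
            return task.get(); // open, use, and close the connection inside the task
        } finally {
            permits.release();
        }
    }

    /** Number of sessions that could still be opened right now. */
    int availableSessions() {
        return permits.availablePermits();
    }
}
```

Failing on the application side with a clear message is a design choice of this sketch: it avoids hitting the server-side limit, at which point PostgreSQL would refuse the connection.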
Open Data Hub databases are generally configured to accept connections only from the known hosts where the application servers are deployed.
- When processing large datasets, consider setting smaller values of fetchsize or an equivalent parameter to avoid buffering huge result sets in memory and running out of RAM.
- When performing a huge number of DML statements, consider switching off any client-side autocommit feature and bundling statements into transactions instead.
- Do not open transactions without closing them; in other words, do not leave sessions idle in transaction!
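The three recommendations above can be sketched with plain JDBC as follows. This is an illustrative sketch, not a prescribed implementation: the class `BulkAccess`, its method names, and the SQL passed by the caller are hypothetical. It relies on the documented behavior of the PostgreSQL JDBC driver, which only fetches rows in chunks when autocommit is off and a positive fetch size is set.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical helper illustrating fetch size, batched DML, and explicit transaction ends. */
final class BulkAccess {

    private BulkAccess() {}

    /** Counts the rows of a query while fetching them in chunks of fetchSize rows. */
    static long countRows(Connection conn, String query, int fetchSize) throws SQLException {
        boolean previous = conn.getAutoCommit();
        conn.setAutoCommit(false); // chunked fetching needs an open transaction
        try (PreparedStatement st = conn.prepareStatement(query)) {
            st.setFetchSize(fetchSize);
            long rows = 0;
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    rows++; // process the current row here
                }
            }
            conn.commit(); // end the transaction explicitly
            return rows;
        } catch (SQLException e) {
            rollbackQuietly(conn, e);
            throw e;
        } finally {
            conn.setAutoCommit(previous);
        }
    }

    /** Inserts all values with one statement, sent in batches inside a single transaction. */
    static int insertAll(Connection conn, String insertSql, List<String> values, int batchSize)
            throws SQLException {
        boolean previous = conn.getAutoCommit();
        conn.setAutoCommit(false); // one commit at the end instead of one per statement
        try (PreparedStatement st = conn.prepareStatement(insertSql)) {
            int pending = 0;
            for (String value : values) {
                st.setString(1, value);
                st.addBatch();
                if (++pending == batchSize) {
                    st.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                st.executeBatch();
            }
            conn.commit();
            return values.size();
        } catch (SQLException e) {
            rollbackQuietly(conn, e);
            throw e;
        } finally {
            conn.setAutoCommit(previous);
        }
    }

    /** Rolls back so the session is never left in an open transaction. */
    private static void rollbackQuietly(Connection conn, SQLException cause) {
        try {
            conn.rollback();
        } catch (SQLException rollbackFailure) {
            cause.addSuppressed(rollbackFailure);
        }
    }
}
```

Every code path ends in either `commit()` or `rollback()`, so the session does not stay in an open transaction even when a statement fails.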