*Scaling Geospatial Techniques to Cloud-Native Platforms* is a talk [[Diego Vicente|I]] gave at [[GeoPython 2025]] with different tips that help when porting traditional [[GIS]] methods to the [[Cloud Computing|cloud]]. Below is a full transcript of the talk, which can be complemented with [these slides](https://docs.google.com/presentation/d/1oZXS0ywJSZbR9Fz6V_UXpIwkhrl6HxUkWsf77zyHjT8/preview).
-----
# A quick intro to [[CARTO]]
We aim to democratize location intelligence and we have clients in insurance, telco, health, the public sector, retail and supply – this is not flexing, it is simply the way we typically illustrate that geospatial data is everywhere (though you all probably know that already).
You also certainly know that it is not only everywhere, it is also represented in a huge spectrum of data types and formats: from tabular data to vector or raster data, including remote sensing. With this incredible array of data, [[CARTO]] aims to be [[Cloud Computing|cloud]]-native: we rely on the different cloud providers where companies are already storing their data.
There are different pieces in the [[CARTO]] ecosystem:
- Since we connect to cloud providers, probably the initial piece to mention is the [[CARTO Data Explorer]], which lets users explore geospatial data in a very simple way, using visualizations that are usually not included in cloud environments by default.
- For more advanced visualizations we provide [[CARTO Builder]], which enables users to perform [[GIS]] analysis using data stored in the cloud and represent the results on a map;
- And we also provide [[CARTO Workflows]], which is a low-code tool that lets the users perform advanced analysis (we will come back to it in a bit).
- Special mentions also go to the [[CARTO Data Observatory]], a data marketplace of free and premium data that our users can ingest into their clouds on demand, and to the different SDKs that allow integrating [[CARTO]] into custom apps.
The first thing we decided at [[CARTO]] to fulfill that goal of becoming [[Cloud Computing|cloud]]-native was to focus on [[SQL]] when developing, to ensure that the analysis was scalable and somewhat [[Cloud Computing|cloud]]-agnostic. This is not always possible, since each cloud uses a slightly different flavor of [[SQL]], but it is certainly the one *lingua franca* available in all data warehouses. It also enables each data warehouse to apply its own optimizations to the physical plan as it sees fit, which is a very useful abstraction to have in our code.
But as a lot of you probably agree, the only thing better than writing [[SQL]] is *not* writing [[SQL]]; that is why we developed [[CARTO Workflows]], a no-code or low-code (depending on the user, actually) tool that lets the user connect different sources and components to perform spatial analysis. [[CARTO Workflows|Workflows]] is designed as a way to bring analysis and data manipulation closer to non-technical users, but each of the components includes caching and intermediate results that are also useful to technical users. These Workflows are then compiled to [[SQL]] code that runs in the background, without the user having to think about it unless needed.
There are currently more than a hundred components available in Workflows, and we just released the [[CARTO Workflows Extension Packages|Extension Packages]], which let users install extra components into Workflows.
-----
# Some general advice
Apart from this reliance on [[SQL]] and each provider's own optimizations, there are some other methods that come close to being a silver bullet and are important to take into account.
The first of them is promoting the use of [[Spatial Index|spatial indexes]] among users. [[Spatial Index|Spatial indexes]] are grid systems with relatively uniform cells that can geocode a given portion of land with a unique index. They are also hierarchical, so one can easily fetch the IDs of the cells contained within another cell, as well as the IDs of the cells surrounding a given cell. These kinds of operations are super valuable for most of the methods I'll show later. Using these indexes, we can refer to these portions of land by their IDs instead of a geometry or geography type, which needs to explicitly declare the shape it defines.
There are also several different [[Spatial Index|spatial indexes]], like [[H3]], which is based on hexagons, or [[Quadbin]] and [[S2]], which are based on squares. Using one or the other depends on the exact use case: do you need accuracy when computing distances? You can use [[H3]], since in a hexagonal grid the centroids of all surrounding cells are equidistant from a given cell. Do you need to perform hierarchical compositions? You can use [[Quadbin]], since a square can be perfectly divided into the cells of the level below. Do you need to take into account extreme latitudes? Use [[S2]].
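As a quick illustration, here is a minimal sketch of those operations using the `h3` Python package (v4 API); the coordinates and resolutions are just example values:

```python
# Minimal sketch of the spatial index operations mentioned above,
# using the h3 Python package (v4 API).
import h3

# Geocode a point into a cell ID at resolution 9 (example point in Prague)
cell = h3.latlng_to_cell(50.0755, 14.4378, 9)

# Hierarchy: the parent cell one level up, and its children one level down
parent = h3.cell_to_parent(cell, 8)
children = h3.cell_to_children(parent, 9)

# Neighborhood: all cells within one ring of the original cell
neighbors = h3.grid_disk(cell, 1)

print(cell, parent, len(children), len(neighbors))
```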
There are several advantages to using [[Spatial Index|spatial indexes]]:
- As we mentioned before, they are smaller to store: instead of a WKT definition or some other representation, we only need a unique ID that refers to the piece of land. That way it is not only easier and smaller to store, it also allows us to perform classic database operations like joins, which are more efficient than their geometry counterparts (see the sketch after this list).
- They can serve as common ground onto which to project all the data, allowing easier combination of different data sources on a uniform grid.
- They remove the sampling bias that can come with administrative geographies, and even though they are not a solution to the [[Modifiable Area Unit Problem]], they are certainly more resilient to it, since all cells at a single zoom level are roughly the same size.
- Not every analysis can switch to a grid, but if we are dealing with continuous data, this is certainly an alternative to take into account.
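Here is the sketch mentioned above: a minimal example, with hypothetical column names and toy values, of aggregating two point datasets onto the same [[H3]] grid with `h3` and `pandas` so they can be combined with a plain join on the cell ID:

```python
# Minimal sketch of using a spatial index as common ground: two point
# datasets (hypothetical column names) are aggregated to the same H3 grid
# and then combined with a regular join on the cell ID, no geometry involved.
import h3
import pandas as pd

def to_h3_grid(df: pd.DataFrame, value_col: str, res: int = 8) -> pd.DataFrame:
    df = df.copy()
    df["h3"] = [h3.latlng_to_cell(lat, lng, res) for lat, lng in zip(df["lat"], df["lng"])]
    return df.groupby("h3", as_index=False)[value_col].sum()

sales = to_h3_grid(
    pd.DataFrame({"lat": [40.42, 40.43], "lng": [-3.70, -3.69], "sales": [10, 5]}), "sales"
)
population = to_h3_grid(
    pd.DataFrame({"lat": [40.42, 40.44], "lng": [-3.70, -3.68], "pop": [300, 120]}), "pop"
)

# A plain join on the cell ID replaces a much more expensive spatial join
combined = sales.merge(population, on="h3", how="outer").fillna(0)
print(combined)
```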
Another important aspect to take into account when scaling to the cloud is data formats, in case we need to move data from one place to another. [[CARTO]] is part of the Open Geospatial Consortium and has been promoting the use of [[GeoParquet]], taking an active part in its standardization. That means that we now have a [[Cloud Computing|cloud]]-native format that allows for interoperability between different backends. In a similar vein, we have recently been pushing for [[RaQuet]], which is the marketing name for storing raster data in the [[Apache Parquet]] file format. This is a much less mature effort than [[GeoParquet]], but we believe it is worth pushing for.
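As a small illustration, this is roughly what round-tripping vector data through [[GeoParquet]] looks like with `geopandas` (the file path is just a placeholder):

```python
# Minimal sketch of writing and reading GeoParquet with geopandas.
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[Point(-3.70, 40.42), Point(14.44, 50.08)],
    crs="EPSG:4326",
)

# to_parquet writes GeoParquet metadata alongside the Parquet columns, so the
# file can be loaded back by any GeoParquet-aware reader or warehouse.
gdf.to_parquet("example.parquet")
round_tripped = gpd.read_parquet("example.parquet")
print(round_tripped.crs, len(round_tripped))
```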
-----
# Some success stories
Now that we have gone through this general advice, let's review some of the methods we have included in the platform and some of the decisions we took when doing so. This part of the slideshow will be a mixture of anecdotal advice and a quick recap of different advanced methods that may be interesting for some of you.
Let's start with a method that is yet another turn of the screw on a very widespread method some of you are probably very familiar with: [[Getis-Ord Gi Star|Getis-Ord]]. Regular [[Getis-Ord Gi Star|Getis-Ord]] is an autocorrelation metric that allows the analyst to find hot and cold spots: for a cell to have a high $G_i$ value it needs not only a high value itself but also high values in its neighboring cells. That way we can represent a "smoother" version of the measurement, easier to understand visually, that highlights different patterns in the data. The screenshot shows an example of space-time Getis-Ord, which not only smooths in two dimensions but also includes time, enabling us to understand space-time patterns in a similar fashion. It was possible to scale this to the cloud by relying heavily on spatial indexes to get the neighboring cells, which is the step prior to the actual smoothing, as well as on time indexing to perform time-based operations (which are native in most data warehouses by now).
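To give a feel for the computation, here is a minimal single-machine sketch of the plain Gi* statistic on an [[H3]] grid with binary 1-ring weights; the cloud version expresses the same neighbor lookup and aggregation in [[SQL]], and the space-time variant extends the neighborhood with time lags:

```python
# Minimal, single-machine sketch of the Getis-Ord Gi* statistic on an H3 grid,
# with binary weights over each cell's 1-ring. `values` is a hypothetical
# mapping of H3 cell ID -> observed value.
import math
import h3

def getis_ord_gi_star(values: dict[str, float], k: int = 1) -> dict[str, float]:
    n = len(values)
    xs = list(values.values())
    x_bar = sum(xs) / n
    s = math.sqrt(sum(x * x for x in xs) / n - x_bar ** 2)
    gi = {}
    for cell in values:
        # grid_disk(cell, k) includes the cell itself, as Gi* requires
        neigh = [c for c in h3.grid_disk(cell, k) if c in values]
        w = len(neigh)
        num = sum(values[c] for c in neigh) - x_bar * w
        den = s * math.sqrt((n * w - w ** 2) / (n - 1))
        gi[cell] = num / den if den > 0 else 0.0
    return gi

# Toy usage: a handful of cells with made-up values
values = {h3.latlng_to_cell(50.07 + 0.01 * i, 14.43, 9): float(i) for i in range(20)}
scores = getis_ord_gi_star(values)
print(list(scores.items())[:2])
```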
Another method is hotspot classification, which digests the information of the previous method into a single dimension by classifying the trends into a set of given categories. These basically depend on whether a cell is a hot spot, a cold spot or neither, and then on its dynamics: whether it is strengthening, declining, occasional... To make it scale to the cloud, we rely mostly on how natural spatial indexes are to work with in this kind of relational database: grouping by the cell ID and then applying the tests for each category is fairly simple and scalable.
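As a heavily simplified sketch (the real component distinguishes more categories, and the thresholds here are purely illustrative), the classification of a cell from its time series of Gi* z-scores could look like this:

```python
# Heavily simplified, illustrative classification of a cell from its Gi*
# time series; `series` is a list of Gi* z-scores ordered by time.
def classify_hotspot(series: list[float], z: float = 1.96) -> str:
    hot = [g > z for g in series]
    cold = [g < -z for g in series]
    if not any(hot) and not any(cold):
        return "no pattern"
    if any(cold) and not any(hot):
        return "cold spot"
    if all(hot):
        return "persistent hot spot"
    if hot[-1] and not hot[0]:
        return "intensifying hot spot"   # became hot towards the end
    if hot[0] and not hot[-1]:
        return "declining hot spot"      # was hot, no longer significant
    return "sporadic hot spot"

print(classify_hotspot([0.3, 2.1, 2.5, 3.0]))   # -> "intensifying hot spot"
```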
Last in the time-series toolbox, we have time series clustering, which can classify time series based on two different methods: value, which clusters series together based on step-by-step distance (the closer the points, the closer the series), or profile, which clusters series together based on step-by-step correlation (the more similarly the series behave, the closer they are). That way we can group together areas based on temperature, or stores based on sales. It is very useful to spot emerging patterns. To make it scale properly in the cloud, we mostly rely on the [[Machine Learning|ML]] services provided by each of the different providers, like BigQuery ML or Snowpark models: we mostly take care of preprocessing the data and then delegate.
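A toy sketch of the two distances, using `scipy` hierarchical clustering on a matrix where each row is one series (in production the actual model fitting is delegated to the warehouse's ML service):

```python
# Toy illustration of "value" vs "profile" clustering of time series.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

series = np.array([
    [1.0, 2.0, 3.0, 4.0],      # rising
    [10.0, 20.0, 30.0, 40.0],  # rising, but on a different scale
    [4.0, 3.0, 2.0, 1.0],      # falling
])

# "Value" clustering: step-by-step distance, scale matters
value_labels = fcluster(
    linkage(pdist(series, metric="euclidean"), method="average"), t=2, criterion="maxclust"
)

# "Profile" clustering: step-by-step correlation, only the shape matters
profile_labels = fcluster(
    linkage(pdist(series, metric="correlation"), method="average"), t=2, criterion="maxclust"
)

# Rows 0 and 2 end up together by value, rows 0 and 1 together by profile
print(value_labels, profile_labels)
```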
We also include anomaly detection in the same vein, which lets us detect anomalous series by computing the drift of the real data with respect to the expected values forecast by a trained model. As before, the insight provided by this method is super valuable for spotting series that suddenly lose explainability.
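A minimal sketch of the idea, where the "model" is just a trailing rolling mean standing in for the forecast produced by the trained model:

```python
# Minimal sketch of flagging anomalies as drift from an expected value;
# here the forecast is just a trailing rolling mean, standing in for the
# model trained in the warehouse's ML service.
import pandas as pd

observed = pd.Series([10, 11, 10, 12, 11, 10, 30, 11, 10, 12], dtype=float)

expected = observed.rolling(window=3).mean().shift(1)   # forecast = recent average
residual = observed - expected
threshold = 2 * residual.std()                          # crude drift threshold

anomalies = observed[residual.abs() > threshold]
print(anomalies)   # the spike of 30 stands out
```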
Another way to scale this kind of [[Machine Learning|ML]] consists of making available in the cloud a model that has been previously trained elsewhere. In this use case, we use a convolutional model that is able to tag the risk of wildfires based on different geographic and land cover variables. It is a [[PyTorch]] model that uses a hand-made architecture to perform the convolution and is trained to predict whether a cell is on fire or not on a given day using the 10 days prior. With this probability of wildfires, we use the predictions to check which antennas are at risk and let the users act accordingly, either to prevent any loss of service in case a fire happens or to locate new antennas.
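This is not the actual architecture, but a shape-compatible [[PyTorch]] sketch of the idea: a stack of the 10 prior days of variables goes in, and a per-cell fire probability comes out:

```python
# Shape-compatible sketch of a convolutional wildfire-risk model, NOT the
# actual architecture used: it only shows the idea of turning a stack of the
# 10 prior days of geographic / land-cover variables into a per-cell
# probability of fire.
import torch
import torch.nn as nn

class WildfireRiskNet(nn.Module):
    def __init__(self, n_variables: int, n_days: int = 10):
        super().__init__()
        in_channels = n_variables * n_days          # stack days as channels
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),        # one logit per cell
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_variables * n_days, height, width)
        return torch.sigmoid(self.net(x))           # fire probability per cell

model = WildfireRiskNet(n_variables=4)
risk = model(torch.randn(1, 4 * 10, 64, 64))        # -> (1, 1, 64, 64) probabilities
print(risk.shape)
```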
Finally, one more use case, provided more as an alternative way to tackle the problem of communicating a message: generating a label based on a score. One of my favorite functions that we provide in [[CARTO]] is spatial scoring, which allows us to combine as many variables as needed into a single numeric value. A wonderful use case we had for it was scoring how much each building in the city of Prague needed more trees planted nearby, based on several variables, including temperature, air quality, population at risk and other urbanity metrics like how close the nearest green area was. This gives us a great overview because suddenly we have a single value to plot, and we can simply use color to show it. However, if we want a deeper look at the data, maybe at a single building, the score may fall short: why is the score high? We have to check each individual variable, and not only are there plenty of them, but most may not even be relevant for the score at a given point.
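A minimal sketch of the gist of the scoring, with hypothetical variables and weights: normalize each column and take a weighted sum (the actual CARTO function offers more options than this):

```python
# Minimal sketch of the scoring idea: combine several (hypothetical)
# per-building variables into one value by normalizing each of them to
# [0, 1] and taking a weighted sum.
import pandas as pd

buildings = pd.DataFrame({
    "temperature": [28.0, 31.0, 30.0],
    "air_quality_index": [40.0, 90.0, 70.0],
    "population_at_risk": [120, 800, 300],
})
weights = {"temperature": 0.3, "air_quality_index": 0.3, "population_at_risk": 0.4}

normalised = (buildings - buildings.min()) / (buildings.max() - buildings.min())
buildings["score"] = sum(normalised[col] * w for col, w in weights.items())
print(buildings["score"])   # a single value per building, easy to map with a color ramp
```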
Now, what if we were able to create a text label for each building that explains, in plain English, what its score is and which variables are relevant for it?
We have leveraged [[GenAI]] models that are available in the cloud, for example the Google Gemini models, calling them from Workflows and providing them with a prompt that explains what the score measures and what each individual column means, followed by three values per column: the first two are the same for all rows (the mean and the standard deviation of the whole population) and the last one is the building's own value. We then ask the model to compare all these values and generate a text mentioning only those that are relevant for that building. This all runs in the cloud and can be executed entirely from Workflows, even though it is not readily available as an analysis because we have not found a reliable recipe that fits all use cases, and each use case requires a little tuning on how to present the information. We have done similar label generation for several use cases: on the screen you can see the example of a real estate index, but we have also applied it to clustering, to create a summary of each cluster's own characteristics. These methods are not 100% reliable, but they may be useful, and they are getting more reliable every day.
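A minimal sketch of how such a prompt can be assembled for one row (the actual call to the model, e.g. Gemini, happens inside the warehouse and is not shown; names and values here are hypothetical):

```python
# Minimal sketch of assembling the per-row prompt: the mean and standard
# deviation are shared by all rows, the third value is the building's own.
def build_prompt(score_description: str, columns: dict[str, str],
                 stats: dict[str, tuple[float, float]], row: dict[str, float]) -> str:
    lines = [
        f"The score measures: {score_description}.",
        "For each variable you get the population mean, the population standard",
        "deviation and this building's value. Write a short plain-English label",
        "mentioning only the variables that are clearly relevant for this building.",
    ]
    for col, meaning in columns.items():
        mean, std = stats[col]
        lines.append(f"- {col} ({meaning}): mean={mean:.2f}, std={std:.2f}, value={row[col]:.2f}")
    return "\n".join(lines)

prompt = build_prompt(
    "how much a building would benefit from nearby trees",
    {"temperature": "summer surface temperature", "air_quality_index": "yearly average AQI"},
    {"temperature": (29.5, 1.2), "air_quality_index": (65.0, 20.0)},
    {"temperature": 31.0, "air_quality_index": 90.0},
)
print(prompt)
```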