This is post #9 in our Under the Data Tree blog series, where members of Dimagi’s data science team share insights from analyzing CommCare data.

In this blog, we look at what CommCare submission data can tell us about the connectivity levels of our users, and interesting ways in which we might be able to use that information. While CommCare is designed to allow data collection to be conducted while a device is offline, connectivity is still important for allowing users to ultimately submit collected data to CommCare HQ’s servers. Connectivity is also something that we know can vary substantially between different projects, depending on their location, and even within a project, if the network they are using is not consistent. Lastly, connectivity is one of the factors in a CommCare project over which our team has the least control.

For these reasons, it seemed valuable to us to find a means of measuring connectivity levels for our users. But how to do it without access to very detailed network provider data or device-level data about connectivity? Fortunately, we were able to devise an indirect means of measuring connectivity using metadata from CommCare form submissions:

  • Each form that is submitted to CommCare HQ’s servers has 2 useful pieces of information: the time that it was completed on the device (“completion time”), and the time it was received by the server (“submission time”).
  • Since submission time can’t occur until the device has acquired strong enough connectivity to send the form, we reasoned that taking the difference between submission time and completion time for an individual form would serve as a good approximation for the level of connectivity that the user who submitted the form was experiencing at the time.
  • From here forward, we will refer to the time between completion and submission for a form as its “submission delay”.
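As a minimal sketch of the definition above, submission delay is just the difference between the two timestamps. The function name `submission_delay` and the example timestamps are hypothetical, and clamping negative values to zero (to guard against a wrong device clock) is our own assumption, not something described in this post.

```python
from datetime import datetime, timedelta


def submission_delay(completed_at: datetime, received_at: datetime) -> timedelta:
    """Time between completion on the device and receipt by the server."""
    delay = received_at - completed_at
    # Assumption: a device clock can be wrong, which would yield a negative
    # delay. This sketch clamps such values to zero.
    return max(delay, timedelta(0))


# Hypothetical form: completed at 09:30, received by the server at 15:00.
completed = datetime(2015, 3, 2, 9, 30)
received = datetime(2015, 3, 2, 15, 0)

delay = submission_delay(completed, received)
delay_hours = delay.total_seconds() / 3600
print(f"Submission delay: {delay_hours} hours")
```

Both timestamps would need to be in the same time zone (for example UTC) for the difference to be meaningful.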

The first question we undertook to answer was whether this metric of submission delay was indeed a good proxy for connectivity levels. To do this, we did an analysis of different geographic regions that are using CommCare:

  • We started with the 20 countries for which there have been the most form submissions to CommCare HQ from a project located in that country, and computed the median overall submission delay for each of those countries
  • We then took the 5 countries with the highest median submission delays (so theoretically the worst connectivity) and the 5 countries with the lowest (so theoretically the best connectivity), and cross-compared that data with known connectivity metrics for each country (taken from http://www.internetsociety.org/map/global-internet-report/)
  • While some of these countries may have substantial internal variability on both submission delay and the comparison metrics, we reasoned that if submission delay was in fact reflective of connectivity levels, the overall values for both would still show some alignment
  • The results of the comparison are summarized in the table below:

By and large, the countries with the largest submission delays had poor values on the comparison metrics, while the countries with the smallest submission delays had good values on the comparison metrics. This gave us confidence in our hypothesis that submission delay is a good proxy for connectivity levels! With that information, we proceeded to investigate our second question: What can measuring connectivity tell us about our users’ data?

A significant consequence of variation in connectivity levels is the effect it can have on the short-term accuracy of a CommCare project’s data. Many CommCare projects use the data that is submitted to HQ’s servers in real time or near real time to inform supervisors and help them make important decisions. Still others rely on a system of “case sharing” where multiple CommCare users work with the same beneficiaries, making frequent data syncing even more critical to proper project functioning. When form submission is delayed by lack of connectivity, there is a discrepancy between the most recently collected data on individual devices (which is the most up-to-date information) and the data that’s available on the server (and thus also to anyone who syncs data down from the server to a device). By analyzing typical submission delay values for CommCare users, we can help projects understand to what extent they may be looking at incomplete data at any given time.

With that in mind, we took our dataset of submissions to CommCare HQ and organized it so that, for each user who submitted forms in a given month, we had their median submission delay value for that month. In other words, for every 1 month of CommCare usage by 1 user, we computed a single data point that represented the median submission delay value for all of the form submissions they made in that month. (We chose to organize the data using this “user-month” perspective so that users with higher form submission rates wouldn’t dominate the dataset.) Using that dataset, we were able to plot the overall distribution of per-user, by-month median submission delay values for all users of CommCare over the last year:

This graph showed us that for 85% of user-months, the typical submission delay was a week or less (168 hours). Since it’s a little hard to extract further meaning from this picture, the same graph is reproduced below, but including only that 85% of data points with a maximum submission delay of 1 week:

This picture helps us see more clearly that over half of all user-months have a typical submission delay of 1 day or less. (The overall median submission delay for all user-months was also within a day, at 19.3 hours.) It’s encouraging to see a majority of users with these low submission delay values, since that translates to only a small window of time in which a project may be dealing with incomplete data.
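The “user-month” aggregation described above could be sketched roughly as follows. This is an illustration under stated assumptions, not our actual pipeline: the tuple layout, the function names `user_month_medians` and `share_within`, and the sample values are all hypothetical, and each form is assumed to have already been assigned to a month.

```python
from collections import defaultdict
from statistics import median


def user_month_medians(forms):
    """Collapse forms into one median delay per (user, month).

    `forms` is an iterable of (user_id, month, delay_hours) tuples,
    a hypothetical layout chosen for this sketch.
    """
    grouped = defaultdict(list)
    for user_id, month, delay_hours in forms:
        grouped[(user_id, month)].append(delay_hours)
    return {key: median(delays) for key, delays in grouped.items()}


def share_within(medians, max_hours):
    """Fraction of user-months whose median delay is at most max_hours."""
    values = list(medians.values())
    return sum(1 for v in values if v <= max_hours) / len(values)


# Made-up sample data, for illustration only.
forms = [
    ("alice", "2015-01", 0.1), ("alice", "2015-01", 0.2), ("alice", "2015-01", 30.0),
    ("bob", "2015-01", 200.0),
    ("bob", "2015-02", 12.0), ("bob", "2015-02", 20.0),
]

medians = user_month_medians(forms)
print(medians)
print(share_within(medians, 24))   # share of user-months within 1 day
print(share_within(medians, 168))  # share of user-months within 1 week
```

Because every user-month contributes exactly one value, a user who submits hundreds of forms in a month counts the same as one who submits a handful.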

After seeing this information, we also wanted to drill down a little further to investigate what portion of users seem to have connectivity while they are using CommCare (which would be those with very small submission delays). We experimented with defining different thresholds for what this “usage-while-connected” means:

Threshold for “usage-while-connected”                      Percentage of submissions
Maximum monthly median submission delay of 1 minute        13.8%
Maximum monthly median submission delay of 10 minutes      16.3%
Maximum monthly median submission delay of 1 hour          19.4%
Maximum monthly median submission delay of 2 hours         21.7%

This tells us that roughly 15-20% of CommCare users have a connection while using CommCare or, viewed the other way, that at least 75% of users do not have connectivity when filling out their forms. This is valuable information because it shows us that the ability to do offline data collection is critical to most of our users!
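A threshold comparison of this kind could be sketched as below. The function name `connected_shares` and the sample delays are made up for illustration, and the sketch assumes the unit being counted is one monthly median per entry; it does not reproduce the figures reported above.

```python
# Thresholds from the comparison above, expressed in minutes.
THRESHOLDS_MINUTES = {
    "1 minute": 1,
    "10 minutes": 10,
    "1 hour": 60,
    "2 hours": 120,
}


def connected_shares(median_delays_minutes, thresholds=THRESHOLDS_MINUTES):
    """Percentage of entries at or below each threshold."""
    total = len(median_delays_minutes)
    return {
        label: 100.0 * sum(1 for d in median_delays_minutes if d <= limit) / total
        for label, limit in thresholds.items()
    }


# Hypothetical monthly median delays, in minutes.
sample = [0.5, 5, 45, 90, 300, 1500, 4000, 10000, 20, 700]

shares = connected_shares(sample)
for label, pct in shares.items():
    print(f"Maximum monthly median submission delay of {label}: {pct:.1f}%")
```

Since each threshold includes everything below it, the percentages can only grow as the threshold is relaxed, which is the pattern visible in the table above.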

In summary, here are 3 important things we’ve learned thus far about the connectivity levels of our users:

  1. We can get approximate measurements of our users’ connectivity levels by using a proxy indicator that is available through metadata from CommCare form submissions.
  2. We can understand how frequently, and to what extent, projects may be looking at incomplete data on CommCare HQ’s servers, due to a delay in form submissions caused by lack of connectivity.
  3. Over the last year, over half of our users have had connectivity that was consistent enough to allow them to submit their forms within a day, but fewer than a quarter were able to submit them in real time.

In the future, our high-level goal is to be able to use information like this to provide useful feedback to CommCare projects on the connectivity levels their users are experiencing, and how it may be affecting their projects!

There are a lot of other interesting analyses that we can do regarding connectivity data; stay tuned for a future blog post that will include more of them!