What we can learn from merging datasets

 

This article was written by Natacha Umutoni. The original article was published by Cenfri. You can find the article here.  

Data in isolation often tells only one part of the story. Sometimes, to gain a full picture, multiple datasets must be combined to incorporate different variables, uncovering insights that drive better decision-making.

Merging two or more datasets is a powerful technique that enables organisations to analyse information stored in different locations or databases. Businesses use it to enhance strategic planning, and governments rely on it to optimise public services. By integrating different datasets, analysts can uncover relationships between variables that might not be apparent when examining individual data sources separately.

It is something we rely on to derive policy insights as part of the Rwanda Economy Digitalisation (RED) programme.

By way of example, to analyse the impact of new regulations on closing hours for non-essential businesses, we merged mobile money data, electronic billing machine (EBM) data and tax records. To support an improved public transport system and optimise bus route planning in Kigali, we combined e-ticketing data from card service providers, containing ridership information, ticket purchase times and prices, with GPS coordinates from motos and selected buses.

Individually, these datasets would not have provided the information required to deliver useful policy insights. We are always careful to anonymise the records or mask any personal details.
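To make the mechanics concrete, the sketch below joins a small, invented e-ticketing extract to GPS pings on a shared vehicle ID and reports boardings per vehicle per hour. The column names, values and ten-minute matching window are assumptions for illustration, not the RED programme's actual schema or method.

```python
import pandas as pd

# Invented extracts -- column names and values are illustrative only,
# not the RED programme's actual schema.
tickets = pd.DataFrame({
    "vehicle_id": ["bus_01", "bus_01", "bus_02"],
    "purchase_time": pd.to_datetime(
        ["2022-04-01 07:02", "2022-04-01 07:48", "2022-04-01 07:10"]),
    "price": [310, 310, 250],
})
gps = pd.DataFrame({
    "vehicle_id": ["bus_01", "bus_01", "bus_02"],
    "ping_time": pd.to_datetime(
        ["2022-04-01 07:00", "2022-04-01 07:45", "2022-04-01 07:09"]),
    "lat": [-1.9441, -1.9500, -1.9536],
    "lon": [30.0619, 30.0588, 30.0606],
})

# Attach each ticket purchase to the most recent GPS ping from the same
# vehicle, so boardings can be placed along the route.
linked = pd.merge_asof(
    tickets.sort_values("purchase_time"),
    gps.sort_values("ping_time"),
    left_on="purchase_time", right_on="ping_time",
    by="vehicle_id", tolerance=pd.Timedelta("10min"))

# Report boardings per vehicle per hour -- an aggregated view rather than
# individual passenger records.
hourly = (linked
          .groupby(["vehicle_id", linked["purchase_time"].dt.floor("h")])
          .size().rename("boardings").reset_index())
print(hourly)
```

Neither source alone says who boarded where: only after the time-based join can ridership be placed along a route and summarised at a level that exposes no individual passenger.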

Data learning sessions

To ensure that our lessons on data-driven decision-making are shared more broadly, the RED programme has been hosting a series of learning sessions with partners in Rwanda. In January 2025, the learning session focused on merging datasets. During the session, the Cenfri data team and key partners, including 71point4, RISA, BKTechouse (BKTH) and RURA, shared case studies demonstrating how linked datasets can reveal trends that would otherwise remain hidden.

Insights from merging datasets in the RED programme

One case study focused on understanding the demographic and agricultural profile of farmers using mobile money. To get a complete picture, the team merged data from multiple sources:

  • Data from Smart Nkunganire System, Smart Kuhangara System, and One Acre Fund provided a count of registered farmers.
  • Mobile money transaction records showed which farmers were actively using digital financial services.

By integrating these datasets, we discovered that among the 590,000 identified farmers who transacted in April 2022, the majority were male (64%), and 21% were aged 55 and over. This kind of insight not only informs financial inclusion strategies but also helps policymakers understand who is benefiting from digital financial services.
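As a rough illustration of this kind of linkage, the sketch below joins an invented farmer registry to mobile money transaction records on a pseudonymous ID and computes the share of active farmers who are male or aged 55 and over. Field names and figures are placeholders, not the actual registry or mobile money schemas.

```python
import pandas as pd

# Invented stand-ins for the two sources -- field names and values are
# placeholders, not the actual registry or mobile money schemas.
farmers = pd.DataFrame({
    "farmer_token": ["a1", "b2", "c3", "d4"],   # pseudonymous ID shared across sources
    "sex": ["M", "F", "M", "M"],
    "age": [61, 34, 47, 58],
})
momo = pd.DataFrame({
    "farmer_token": ["a1", "c3", "a1"],         # one row per mobile money transaction
    "amount": [12000, 5500, 3000],
})

# Keep only registered farmers with at least one transaction in the period.
active = farmers.merge(momo[["farmer_token"]].drop_duplicates(), on="farmer_token")

share_male = (active["sex"] == "M").mean()
share_55_plus = (active["age"] >= 55).mean()
print(f"active farmers: {len(active)}, male: {share_male:.0%}, aged 55+: {share_55_plus:.0%}")
```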

During a panel discussion, the CEO of BKTH, Deo Massawe, shared that one key insight from linking datasets was discovering that farmers were using only 50% of the available subsidies (particularly in fertiliser use), which had a direct impact on agricultural production. This pointed to a financial barrier preventing full utilisation of subsidies.

As a result, BKTH introduced a digitised loan system, allowing farmers to access loans that covered the remaining cost of inputs. This approach ensured they could obtain full agricultural inputs at the beginning of the planting season, thereby increasing their yield potential.

Challenges in merging datasets

While the benefits are clear, combining datasets across different institutions or departments presents several challenges. The session highlighted some of the hurdles institutions face:

  1. Data inconsistency. Structural and formatting differences between datasets lead to inaccuracies. Some datasets are organised differently, while others store the same information in varying formats. Date formats are a common example: some systems use DD-MM-YY while others use MM-DD-YY, which causes integration issues (a small cleaning sketch follows this list). Merging therefore requires extensive data cleaning, which consumes time and resources.
  2. Data sharing barriers. Accessing data from institutions can be a time-consuming process; in the RED programme, it often takes between six months and a year. One of the main reasons for this delay is the lack of clear guidelines on which data can be shared publicly and which must remain confidential. Without proper data-sharing policies, institutions struggle to navigate privacy concerns and security protocols, slowing down access to crucial information.
  3. Limited resources and capacity. Many institutions lack dedicated data teams to manage, clean, and merge data efficiently. Without adequate personnel, institutions face difficulties in allocating responsibilities and maintaining data quality. To address this, organisations must focus on internal capacity-building and providing specialised training to improve data management practices.
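As an illustration of the cleaning step described in point 1, the sketch below normalises two invented sources that record the same dates as DD-MM-YY and MM-DD-YY respectively before merging them. The column names and formats are assumptions for the example.

```python
import pandas as pd

# Two invented sources recording the same transactions with different date formats.
momo = pd.DataFrame({"txn_id": [1, 2],
                     "date": ["05-04-22", "28-03-22"],     # DD-MM-YY
                     "momo_amount": [12000, 4500]})
ebm = pd.DataFrame({"txn_id": [1, 2],
                    "date": ["04-05-22", "03-28-22"],      # MM-DD-YY
                    "ebm_amount": [12000, 4500]})

# Normalise both to proper dates before merging, stating each format explicitly
# rather than letting the parser guess.
momo["date"] = pd.to_datetime(momo["date"], format="%d-%m-%y")
ebm["date"] = pd.to_datetime(ebm["date"], format="%m-%d-%y")

merged = momo.merge(ebm, on=["txn_id", "date"])
print(merged)
```

Declaring each format explicitly avoids the silent day/month swaps that occur when a parser guesses, which is exactly the kind of inaccuracy that surfaces later as mismatched records.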

Handling sensitive data

Data privacy is always a concern. During the session, several methods were discussed for keeping sensitive data protected while still enabling access and analysis.

  • Ethical data handling is crucial. Data scientists should treat personal data with confidentiality, just as doctors safeguard patient records. By creating a culture of responsibility, organisations can build trust in data-sharing processes.
  • Data masking and pseudonymisation are effective techniques for protecting personal information while still allowing analysis. One approach is hashing, which converts sensitive identifiers into tokens that cannot be read or reversed. Another is pseudonymisation, which replaces private information with artificial identifiers so that individual identities remain hidden. Institutions can also use link-out tables to enable data reference while maintaining privacy (see the sketch after this list).
  • Aggregating unique IDs instead of using direct personal identifiers. By grouping data at a higher level, organisations can extract meaningful insights without exposing individual details.
  • Appropriate data governance measures. Institutions should also implement regular monitoring and compliance controls to ensure that data protection policies are being followed. This includes setting clear data governance frameworks that define who has access to what data and conducting frequent audits to prevent misuse.
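A minimal sketch of the pseudonymisation, link-out and aggregation ideas above is shown below, using a keyed hash (HMAC-SHA256) as the hashing step. The key, identifiers and field names are invented for illustration and do not reflect any system used by the RED programme or its partners.

```python
import hashlib
import hmac
import pandas as pd

# Secret key held only by the data controller -- an invented placeholder value.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymise(identifier: str) -> str:
    """Keyed hash (HMAC-SHA256): the same identifier always maps to the same
    token, but the token cannot be reversed or recomputed without the key."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

records = pd.DataFrame({
    "national_id": ["1199870012345678", "1199555098765432"],  # invented IDs
    "district": ["Gasabo", "Nyarugenge"],
    "amount": [12000, 4500],
})

# Link-out table: the only place the real identifier and its token sit side by
# side; it stays with the data owner under restricted access.
link_out = records[["national_id"]].assign(
    token=records["national_id"].map(pseudonymise))

# What analysts receive: tokens instead of identifiers, with results reported
# at an aggregated level (per district) rather than per person.
shared = records.assign(token=link_out["token"]).drop(columns="national_id")
per_district = shared.groupby("district", as_index=False)["amount"].sum()
print(per_district)
```

Because every source applies the same keyed hash, datasets can still be joined on the token, while the link-out table lets the data owner resolve a token back to a person only when there is a legitimate need.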

When done right, merging datasets unlocks hidden patterns and helps to solve multiple issues. Addressing the challenges mentioned above requires a shift toward stronger data governance, cross-sector collaboration, and capacity development to ensure that institutions can fully leverage their data assets.

If you are experiencing similar challenges or are unsure where to start, this guidance on public sector data frameworks may be useful. It covers practical advice on data cataloguing, classification and sharing. The RED Programme and its partners have also developed a data sharing policy for the Government of Rwanda (GoR).

At the time of writing, the policy was yet to be approved by the cabinet, but keep an eye out for news of its approval and publication. It details some of the measures to be implemented by the GoR to ensure the smooth sharing of data between different public sector entities.

 

 
