EWADA Summer 2023 Internship Report

Summer 2023 marked the third year of our highly successful internship program. We were delighted to host four interns, all outstanding candidates, along with a master’s student who conducted their graduate project with us. Each student made significant contributions to EWADA, and this report summarises the key outcomes of their projects.

Overview of the projects

The four projects addressed various challenges aligned with EWADA’s core vision, including:

A Solid-based application designed to assist families in managing children’s health data
Extending our previous research on privacy-preserving computation with an ability to generate privacy-preserving synthetic data
Extending our earlier work on decentralised recommendation algorithms with an ability to generate privacy-preserving movie recommendations
Extending our prior research on supporting gig workers with a Solid-based approach to help workers manage their data

A Solid-based application to assist families in managing children’s health data

The project aimed to ensure that children, specifically those with ADHD, can exercise better control over the sharing of their data within an ecosystem involving parents/guardians, teachers, the broader school community, and clinicians or hospital staff. This is a crucial challenge because, at present, parents/guardians are the sole stakeholders with access to children’s information and determine how the data is accessed and shared. The project therefore explores a new model in which children are equipped with smartwatches and parents/guardians can examine the data through smartphones.

The project focused on building an architecture on top of Solid to collect, store and synchronise data generated by children’s smartwatches. It provides a web interface that allows a child with ADHD to share data and to control how much information is shared with requesting stakeholders. Different types of data can be collected, including emotional dysregulation, medication usage, food intake, sleep, heart rate, step count, and location. A primary objective is to build a more empowered ecosystem of communication within schools regarding how health data may be shared with clinicians.
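The idea of a child controlling what each stakeholder can see can be sketched as a small per-stakeholder sharing policy. This is only an illustration under stated assumptions: the class SharingPolicy, its methods, the stakeholder labels and the data-type identifiers are hypothetical names chosen here, not part of the project's code. In a real Solid deployment, such choices would presumably be enforced through the pod's own access control rules rather than in application code.

```python
from dataclasses import dataclass, field

# Data types listed in the report; the identifiers themselves are assumptions.
DATA_TYPES = {
    "emotional_dysregulation", "medication", "food_intake",
    "sleep", "heart_rate", "step_count", "location",
}


@dataclass
class SharingPolicy:
    """Hypothetical record of which data types each stakeholder may see."""
    allowed: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, stakeholder: str, data_type: str) -> None:
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        self.allowed.setdefault(stakeholder, set()).add(data_type)

    def revoke(self, stakeholder: str, data_type: str) -> None:
        # Revoking something that was never granted is a no-op.
        self.allowed.get(stakeholder, set()).discard(data_type)

    def visible_to(self, stakeholder: str, record: dict) -> dict:
        """Return only the fields of a record this stakeholder may see."""
        permitted = self.allowed.get(stakeholder, set())
        return {key: value for key, value in record.items() if key in permitted}


if __name__ == "__main__":
    policy = SharingPolicy()
    policy.grant("teacher", "step_count")
    policy.grant("clinician", "heart_rate")
    reading = {"step_count": 4200, "heart_rate": 88, "location": "school"}
    print(policy.visible_to("teacher", reading))
    print(policy.visible_to("clinician", reading))
```

Nothing is shared by default in this sketch: a stakeholder with no grants sees an empty record, which matches the report's emphasis on the child deciding what to share.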

The approach is grounded in extending the experience sampling method (ESM), a research technique used in psychology and other fields to study individuals’ experiences, behaviours, and thoughts in real-time, as they occur in their natural environment.

For this project, location serves as the primary use case because of its personal and sensitive nature. We want to explore whether visualising data sharing could help children decide the extent to which they want to share their location data or any other data.

Privacy-preserving Decentralised Information Filtering

This work is based on our SolidFlix project, which is a Solid-based application allowing friends to share movie interests by storing this information in their individual pods. The movie recommendation algorithm used by SolidFlix is content-based, whereas a collaborative filter could provide more personalised recommendations by suggesting movies based on what a user’s friends are interested in watching.

Conventionally, however, this kind of recommendation algorithm requires centralised access to all users’ data. The challenge lies in supporting collaborative recommendations without compromising the decentralised architecture or our commitment to preserving users’ data privacy.

The approach taken by the project team was to first compute similarities between users’ movie lists and then generate recommendations. In the first step, a hash is created for each user’s movie list and stored locally in their Solid pod. Using these hashes, users can then be grouped into buckets; individuals within the same bucket are considered similar and so receive identical recommendations.

For example, when a user, Bob, seeks a recommendation, he fetches the min hashes from all his friends’ pods, and these are used to produce personalised recommendations. Bob can then request access to the recommended movies from his friends.
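The hashing and bucketing steps described above can be illustrated with a minimal MinHash sketch. This is not the project's implementation: the signature length, the number of bands, the SHA-256-based hash family, and all function names are assumptions made for the example. Each user would compute a signature locally from their own movie list, and a friend such as Bob would only ever see the signatures, not the lists themselves.

```python
import hashlib

NUM_HASHES = 32  # signature length (assumed value)
BANDS = 8        # number of bands used for bucketing (assumed value)


def _hash(seed: int, item: str) -> int:
    """One member of a seeded hash family, built from SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(movies: set[str]) -> list[int]:
    """Signature computed locally from a non-empty movie list."""
    return [min(_hash(seed, m) for m in movies) for seed in range(NUM_HASHES)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def bucket_keys(sig: list[int]) -> list[tuple]:
    """Split a signature into bands; each band identifies one bucket."""
    rows = len(sig) // BANDS
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]


def similar_friends(my_sig: list[int], friend_sigs: dict[str, list[int]]) -> set[str]:
    """Friends who share at least one bucket with the requesting user."""
    mine = set(bucket_keys(my_sig))
    return {name for name, sig in friend_sigs.items() if mine & set(bucket_keys(sig))}


if __name__ == "__main__":
    bob = minhash_signature({"Alien", "Arrival", "Dune", "Heat"})
    friends = {
        "alice": minhash_signature({"Alien", "Arrival", "Dune", "Heat"}),
        "carol": minhash_signature({"Amelie", "Paddington", "Up"}),
    }
    print(similar_friends(bob, friends))
    print(estimated_similarity(bob, friends["carol"]))
```

Because signatures are pre-computed and fixed in length, comparing two users costs the same regardless of how many movies each has listed, which is the scalability benefit the report points to.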

There are several advantages to this approach. To begin with, collaborative filtering may be more feasible because it does not rely on movie metadata, which is not always provided. The approach is also more scalable because it is built on pre-computed hashes, although it depends on users sharing their min hashes.

A more detailed technical description and a recorded presentation can be found in Dr Goel’s blog post.

Decentralised Scalable and Privacy Preserving Synthetic Data Generation

AI model development requires more diverse datasets. However, sharing real data can be problematic because of privacy concerns. Synthetic data offers one way to address this.

The objective of this project is to take a holistic approach to working with synthetic data. However, there is a need to organise the curation of this data. Various models for curating synthetic data exist, including a central differential privacy approach and a local differential privacy approach. The central approach assumes that a trusted curator collects individual data and then generates the synthetic dataset. The local approach assumes that each individual adds noise to their data locally before sending it to the central curator. The disadvantage of the central approach is that the curator might be compromised: someone who gains control of the collected datasets can breach their privacy. The disadvantage of the local approach is that it can introduce a significant amount of noise and requires substantial local computational capability.
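The trade-off between the two models can be made concrete with a small counting example. It is a sketch only, assuming the standard Laplace mechanism applied to a single yes/no attribute; the function names and parameter values are illustrative and are not taken from the project. In the central model the curator sees the raw values and adds noise once, while in the local model every user adds noise before reporting, so the noise accumulates across users.

```python
import random


def laplace(rng: random.Random, scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def central_count(bits: list[int], epsilon: float, rng: random.Random) -> float:
    """Trusted curator sums the raw bits, then adds noise once (sensitivity 1)."""
    return sum(bits) + laplace(rng, 1.0 / epsilon)


def local_count(bits: list[int], epsilon: float, rng: random.Random) -> float:
    """Each user perturbs their own bit; the curator only sums noisy reports."""
    return sum(bit + laplace(rng, 1.0 / epsilon) for bit in bits)


if __name__ == "__main__":
    rng = random.Random(2023)
    # 10,000 hypothetical users, roughly 30% of whom hold the attribute.
    bits = [1 if rng.random() < 0.3 else 0 for _ in range(10_000)]
    print("true count:   ", sum(bits))
    print("central model:", round(central_count(bits, 1.0, rng), 1))
    print("local model:  ", round(local_count(bits, 1.0, rng), 1))
```

With n users and the same epsilon, the central estimate carries noise of constant scale, whereas the error of the local estimate grows roughly with the square root of n. That gap is the "significant amount of noise" referred to above.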

The approach explored in this project involves curating data from Solid users, with users able to decide whether to participate in the synthetic data generation process. Importantly, the architecture is based on Solid pods enhanced with a multi-party computation protocol to preserve the security and privacy of this process. Initial results show promising performance, and further details about the approach can be found in the arXiv paper.
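To give a flavour of how multi-party computation removes the need for a trusted curator, the sketch below uses additive secret sharing to compute a sum without any party seeing an individual input. The report does not say which protocol the project uses, so this is a generic textbook construction chosen for illustration; the modulus, the function names, and the simulation of all parties inside one process are assumptions.

```python
import secrets

PRIME = 2**61 - 1  # modulus for share arithmetic (assumed; any large prime works)


def make_shares(value: int, n_parties: int) -> list[int]:
    """Split a value into n random-looking shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values: list[int]) -> int:
    """Simulate a secure sum in which each participant is also a computing party."""
    n = len(private_values)
    # all_shares[i][j] is the share that participant i sends to party j.
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that total.
    partial_totals = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Combining the published totals reveals the sum and nothing else.
    return sum(partial_totals) % PRIME


if __name__ == "__main__":
    print(secure_sum([3, 5, 10]))
```

The result is correct as long as the inputs are non-negative and their true sum is smaller than the modulus. Each individual share is uniformly random on its own, so a single party learns nothing about another participant's value.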

A Solid-based approach to help workers manage their data

This project continues last year’s efforts, and its key objective is to determine how we can better manage incompatible datasets across different gig worker contexts and platforms. This is a crucial challenge because gig workers regularly face the task of managing data from different, incompatible platforms. To address this issue, we propose a solution called “Frankenstein drivers”. The goal is to experiment with different methods of managing gig worker data across diverse content using the Solid protocols. Central to this solution is the use of an embedded model that matches semantic information, and an LLM as a data-wrangling tool to extract information from different sources and create meaningful visualisations. This transformation has significantly increased the productivity of a previously manual process, and the team is exploring a direct integration between LLMs and Solid pods.

This wide range of summer projects produced rich results, and we hope that the work will continue with the aim of building a community around these topic areas and integrating this work into the core EWADA pipeline. We thank Sydney C., Yushi Y., Vishal R. and Vid V. for their contributions, and Jake Stein, Rui Zhao, Naman Goel and Jun Zhao for their supervision.