 
In the early stages of your research, you'll be:
Even early on, you can practice data justice in your research by setting intentional goals for your research that work toward justice. Some actions you can take:
Responding to a rise in studies of public data, such as social media data from the platform formerly known as Twitter, that are often not subject to ethics board approval, this 2022 study looked at how researchers could ethically conduct research using public data, particularly data from potentially vulnerable or marginalized communities. The authors considered the history of research in Black communities and interviewed self-identified #BlackTwitter users about how they preferred to engage with researchers and how they wanted their data to be used in research.
The participants in the study had a range of concerns and opinions about research on #BlackTwitter. Some hadn't considered that there would be academic papers written about Black Twitter, and others expressed that they wouldn't want some researcher just analyzing their tweets without consent. They understood that their content was public, and expressed that the use of tweets with consent wouldn't be unwelcome. However, it would matter to them who the researcher was, what their positionality was, and whether they were transparent with the community about their intentions.
Although research on the platform has effectively ended due to the end of its free API, the article presents some guidelines and best practices for conducting research in marginalized communities online.

Online communities are vast and varied in their needs, desires, and cultural context. Engage ethically! Source: Randall Munroe || xkcd.
When you are using existing datasets compiled by another researcher, NGO, or governmental body to find the answer to your research question (as in secondary data analysis), don't take the data at face value. Ask why the data has taken the shape that it has. Examine how people are represented through the data or who might be missing from it.
This is important because the data you're using is only as good, and as just, as the methodology used to collect and create it. Even Statistics Canada is constantly revising its methodology for each new census. For example, it was not until the 2021 census that Statistics Canada included questions on both sex at birth and gender to address a gap in data on transgender individuals in Canada.
In original research projects, concerns about sampling might enter into the question of who is being represented, and who might be missing. A sample that is too small might not be generalizable, a consideration to keep in mind when analyzing others' scholarly research as well as when beginning your own collection of data.
Some questions you can ask yourself when looking at datasets to analyze:
In the study Impact of missing data strategies in studies of parental employment and health: Missing items, missing waves, and missing mothers, Nguyen et al. tackled the problem of missing data in longitudinal studies that reveal information about population health via social determinants like employment.
They looked at 5 waves of data from the Longitudinal Study of Australian Children, finding that parents, especially mothers, participated in the study intermittently or quit the study completely over time. This led to a lot of missing and incomplete data, which in turn skewed the analysis when researchers tried to examine the adverse effects of work-family conflict on mental health. Mothers, because of gender and social circumstances, were misrepresented in the data.
Nguyen et al. state "that the extent and nature of resulting biases are unknown," and suggest that "considerable caution should be exercised in interpreting analyses that fail to explore and account for biases arising from missing data."
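One way to start exploring this kind of bias in a dataset you are reusing is to check whether missing values are spread evenly across groups. The following is a minimal Python sketch, not the method used by Nguyen et al.; the records, the field names (`group`, `wave`, `score`), and the function `missing_rate_by_group` are all hypothetical and made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: one row per participant per wave.
# A score of None means the participant did not respond in that wave.
records = [
    {"id": 1, "group": "mother", "wave": 1, "score": 3.0},
    {"id": 1, "group": "mother", "wave": 2, "score": None},
    {"id": 2, "group": "mother", "wave": 1, "score": 2.5},
    {"id": 2, "group": "mother", "wave": 2, "score": None},
    {"id": 3, "group": "father", "wave": 1, "score": 3.5},
    {"id": 3, "group": "father", "wave": 2, "score": 3.0},
    {"id": 4, "group": "father", "wave": 1, "score": 2.0},
    {"id": 4, "group": "father", "wave": 2, "score": None},
]


def missing_rate_by_group(rows, group_key="group", value_key="score"):
    """Return the share of missing values for each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row.get(value_key) is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of responses missing")
```

If one group is missing far more often than another, as the mothers are in this toy data, any analysis that simply drops incomplete rows will under-represent that group, which is a signal to investigate further before drawing conclusions.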
As a graduate student, Dr. Joy Buolamwini was working in an MIT lab when she discovered that a facial recognition software couldn't detect her face. She noted that her MIT peers didn't experience this issue, so she drew a face on her palm, which the machine recognized. Next, she put a white mask over her own face, which the machine also recognized.
This encounter uncovered large gender and skin color biases in commercially sold products, including many facial recognition and analysis tools. It led Buolamwini, now a self-described "poet of code", to found the Algorithmic Justice League, which works toward "accountable AI" and unmasks human biases baked into technologies, such as the one Buolamwini uncovered.
The Algorithmic Justice League's website proclaims, "The deeper we dig, the more remnants of prejudice we will find in our technology. We cannot afford to look away this time because the stakes are simply too high. We risk losing the gains made with the civil rights movement and other movements for equality under the false assumption of machine neutrality."
Watch Dr. Buolamwini's TED Talk below for more information on her mission to uncover AI bias.
Since 2016, visual artist Mimi Onuoha has been creating iterations of what she calls the "Library of Missing Datasets". This project is "a physical repository of those things that have been excluded in a society where so much is collected", prompting questions about the nature of what is collected and how we can be known, or invisibilized, through our data.
As Onuoha's project points out, what we collect, and where we put our attention, reveals what we count as "important". What we do not collect "reveal[s] our hidden social biases and indifferences".
The last iteration of the project focused on private data, adding another layer, Onuoha says, wherein "access is honor rather than right".

Throughout the course of your research project, and especially in deciding where to store and preserve your data, consider who will have access to the data you are producing, and who benefits from that access. Who can mobilize the data? Who do you want it to be findable and usable by? Where are you going to store the data, and who will have ownership of the data?
There are many stories of communities participating in research and finding out after the project is complete that the research is being used in ways that remove their authority as creators of this knowledge and deny them a voice in managing the data or having their name on the scholarly outputs created with their participation. This phenomenon, known as "extractive research", has spurred initiatives like Research 101: A Manifesto for Ethical Research in the Downtown Eastside and the DTES Research Access Portal.
Making the research itself available to the communities that created it is only one piece of the puzzle. They may also want sovereignty over their data for cultural reasons such as self-determination, or want to use it for practical applications like community organizing and advocacy. This requires that data be stored so that it can be found, accessed, and used. This is also helpful for future researchers, who might want to reuse a dataset for secondary data analysis. Some questions to ask yourself when preparing data for storage and preservation:
The FAIR principles (which stand for findability, accessibility, interoperability, and reusability) are a good starting point for assessing your data. A good rule of thumb is that data should be "as open as possible, as closed as necessary". This means that while we should strive for open data for maximum reusability, the sensitivity of the data and the possibility of re-identification of participants can make privacy and anonymity necessary.
If you have questions about the preservation or storage of data, visit our Research Data Management guide to learn more. You can also contact Amber Gallant, the Data Services Librarian, for information on cleaning data; storing your data in RRU's institutional repository, Borealis; choosing formats to store your data in; and making your data FAIR.
The COVID Measures Archive at ICPSR, one of the world's largest repositories for social sciences data, is a great example of storing data in a way that balances open data sharing with protections for confidential and sensitive data.
The collection and archiving of data about COVID faces many challenges, including the fact that the data needs to be de-identified to meet the HIPAA Safe Harbor standard in the United States, which aims to prevent re-identification of participants. And yet, making COVID data open (and data open in general) benefits future researchers by offering transparency into data collection, allowing comparability between studies, and fostering trust and confidence in public health guidance. ICPSR's solution is the COVID Measures Archive, which, instead of storing only datasets, also stores metadata. Depositors can choose whether it is appropriate to share the data or not. If not, sharing the metadata still enhances the findability and accessibility of the data. As ICPSR says, this approach balances being "as open as possible" with being "as closed as necessary".
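The idea of sharing metadata even when the data itself stays closed can be sketched in a few lines of Python. This is an illustration only: the field names, the `public_record` function, and the example deposits are hypothetical and do not reflect ICPSR's actual schema or software.

```python
def public_record(deposit):
    """Build the publicly visible catalogue entry for a deposit.

    Descriptive metadata is always published so the study stays findable.
    The data file is listed only if the depositor has marked it shareable.
    """
    record = {
        "title": deposit["title"],
        "description": deposit["description"],
        "variables": list(deposit["variables"]),
        "data_available": deposit["share_data"],
    }
    if deposit["share_data"]:
        record["data_file"] = deposit["data_file"]
    return record


# Hypothetical deposits: one open, one closed.
open_deposit = {
    "title": "Community survey on mask use",
    "description": "Aggregated weekly responses.",
    "variables": ["week", "region", "mask_use_rate"],
    "share_data": True,
    "data_file": "mask_use.csv",
}
closed_deposit = {
    "title": "Interviews with long-COVID patients",
    "description": "Transcripts containing sensitive health details.",
    "variables": ["interview_date", "symptom_history"],
    "share_data": False,
    "data_file": "interviews.zip",
}

open_entry = public_record(open_deposit)
closed_entry = public_record(closed_deposit)
print(open_entry)
print(closed_entry)
```

In this sketch the closed deposit still appears in the catalogue with its title, description, and variable list, so other researchers can discover it and request access, while the sensitive file itself is never exposed.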

Promotional poster for ICPSR's COVID Measures Archive. Source: ICPSR.
Data sovereignty, the idea that data are subject to the laws and governance of the nation where they are collected, is seen by Indigenous nations as a key piece of self-governance structures, involving the decolonization of data. Through data sovereignty, nations can choose which data is disseminated to the public at large and which is kept private, reflecting Indigenous ways of knowing and systems of knowledge that reserve that knowledge for certain times of year or individuals. It also allows the nation to decide what reflects their own interests, values, and priorities.
There are myriad Indigenous data sovereignty initiatives all over the world, including:
These are only a small subset, but they represent a range of implementations of data sovereignty. They also reflect the growing importance, to historically marginalized communities, of data and of authority over the ways it reflects them and tells their stories to the broader public.
In their paper Datafication, development and marginalized urban communities: An applied data justice framework, Heeks and Shekhar explored how the data generated about formerly "dataless" marginalized communities was supposedly used to help them. They looked at the examples of data initiatives that mapped these communities in Chennai, Pune, Solo, and Nairobi. Each promised to represent communities better.
Heeks and Shekhar, however, found that these data initiatives largely included data activities and uses of data that provided value to individuals outside the community rather than those inside it. Skills and contacts were built by those within the initiating organisation involved in using the data for advocacy purposes. The communities' new legibility, while it did change government officials' view of the individuals within the community, also disrupted local control over knowledge of the community by making the communities externally visible. Full utilisation of the data depended on its value to powerful local actors, and direct action by the community was not made visible.
Heeks and Shekhar suggested that pro-equity data initiatives may actually increase data inequities in practice, because external actors use the data for their own agendas. New flows of data do have the opportunity to counteract the injustice of invisibility, but must be directed with an eye toward what Heeks and Shekhar call "distributive data justice", an understanding of the benefits of data systems and who gets them.