Fighting Spam in Dataverse Installations

October 28, 2022

When a Dataverse installation is configured to let anyone sign up and to let those users create and publish datasets and/or collections, which is useful for self-service data publishing, it also opens the door to abuse, for example by spammers who publish content in the installation with links back to their own websites in order to promote them.

This risk is inherent in running a Dataverse installation in such an open configuration. To date, the burden from this sort of abuse at most Dataverse installations has been minimal, or at least mitigated by the anti-spam features added in version 5.9. However, larger, open Dataverse installations have recently seen more aggressive abuse.

The key factors that make a target attractive are:

- the ease of signing up with free email or external accounts (Gmail, Yahoo, etc.)
- the ability to make collections and/or datasets public (by self-publishing them)
- the potential for spammers to rely on the reputation of the site and/or its analytics reports to boost their spam and their own websites. That is, Dataverse installations that run Google Analytics might be targeted.

If you want to mitigate or prevent this type of abuse, there are a variety of mechanisms in the Dataverse software that can help with each of these factors.

Quick Response: Ways to quickly respond that do not require code changes (varies by Dataverse release)

- Block the IP addresses of offenders. This is done outside of the Dataverse software, e.g. in your firewall.
- Temporarily turn off the creation of built-in accounts by setting :AllowSignUp to false.
- Deactivate existing spammer accounts via the API. This prevents those accounts from creating more collections/datasets.
- Temporarily disable "social" identity providers such as Google or GitHub. There is an API call to disable authentication providers.
- Turn off Google Analytics temporarily.
- Adjust the rules in any validator script to catch the new type of spam (version 5.9+).
- Adjust the rules in any validator script to deny all publication requests (version 5.9+).
- Remove content, permissions, and/or accounts. The Dataverse software provides user interface mechanisms and API calls to deaccession or delete (destroy) datasets, to remove user permissions, etc. Direct changes in the database or rolling back to a clean backup could also help in severe attacks.
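
Three of the quick responses above (turning off signup, deactivating an account, and disabling an authentication provider) are admin API calls. The following is a minimal Python sketch of those calls, not an official client. It assumes the admin API is reachable on localhost:8080 and that the endpoint paths match the admin section of the API Guide for your release, so please check them against your version before use. The username "spammer1" and the provider id "github" are hypothetical placeholders.

```python
import urllib.parse
import urllib.request

# Assumption: the admin API is only reachable from the server itself.
BASE_URL = "http://localhost:8080"


def set_allow_signup(enabled: bool) -> urllib.request.Request:
    """Build the request that sets :AllowSignUp to 'true' or 'false'."""
    return urllib.request.Request(
        f"{BASE_URL}/api/admin/settings/:AllowSignUp",
        data=str(enabled).lower().encode(),
        method="PUT",
    )


def deactivate_user(username: str) -> urllib.request.Request:
    """Build the request that deactivates one account by username."""
    quoted = urllib.parse.quote(username, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/api/admin/authenticatedUsers/{quoted}/deactivate",
        data=b"",
        method="POST",
    )


def set_auth_provider_enabled(provider_id: str, enabled: bool) -> urllib.request.Request:
    """Build the request that enables or disables one authentication provider."""
    quoted = urllib.parse.quote(provider_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/api/admin/authenticationProviders/{quoted}/enabled",
        data=str(enabled).lower().encode(),
        method="PUT",
    )


def send(request: urllib.request.Request):
    """Send a prepared request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode()


if __name__ == "__main__":
    planned = [
        set_allow_signup(False),
        deactivate_user("spammer1"),               # hypothetical username
        set_auth_provider_enabled("github", False),  # hypothetical provider id
    ]
    for request in planned:
        # Dry run: print what would be sent. Call send(request) to apply it.
        print(request.get_method(), request.full_url)
```

Building the requests separately from sending them makes it easy to review exactly what will change before touching a production installation.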

General Policy Changes (useful during or prior to any abuse)

- Reduce the permissions of new users so that they either cannot create content or cannot publish it. To support legitimate new users, options include:
  - Having superusers manually approve accounts before they get permission to create collections or datasets or, if all users are allowed to create datasets, having superusers decide when to grant publish permission. (Note that allowing all users to create collections currently also lets them create and publish datasets within those collections, so granting permission to create collections could allow spam.) To avoid superusers having to approve each dataset from a new user, an "approved users" group, rather than "all authenticated users", could hold create/publish permission on the root collection, with new users added to that group once they have deposited a legitimate dataset.
  - Configuring email or Shibboleth groups so that users with, for example, ".edu" addresses or users from specific Shibboleth providers get additional privileges.
- Turn off local accounts or OAuth providers (Google, GitHub) that are not used by your user community. There is an API call to disable authentication providers. (If these are used by your community, consider limiting permissions and using email groups as discussed above.)
- Consider installing a spam detector via the :DataverseMetadataValidatorScript and :DatasetMetadataValidatorScript settings. With these settings (version 5.9+), the Dataverse software can be made to call an external validation script, which can in turn call a third-party tool such as SpamAssassin or be as simple as a shell command performing a few regex checks, and to block publication of collections/datasets if the script reports a positive test result.
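
To make the validator idea concrete, here is a minimal sketch of such a script in Python. It assumes the contract described in the Installation Guide for these settings: the script receives one argument, the path to a temporary file holding the metadata exported as JSON, and a non-zero exit code blocks the action. The patterns, the DENY_ALL switch, and the function names are hypothetical examples rather than anything shipped with the Dataverse software, so tune them to the spam you actually see.

```python
#!/usr/bin/env python3
import json
import re
import sys

# Emergency switch: set to True to reject every publication request.
DENY_ALL = False

# Hypothetical example rules; replace with patterns matching your own spam.
SPAM_PATTERNS = [
    re.compile(r"\b(casino|viagra|payday loans?)\b", re.IGNORECASE),
    re.compile(r"\bbuy\s+\w+\s+online\b", re.IGNORECASE),
]


def iter_strings(node):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def looks_like_spam(metadata) -> bool:
    """Return True if any string in the metadata matches a spam pattern."""
    return any(
        pattern.search(text)
        for text in iter_strings(metadata)
        for pattern in SPAM_PATTERNS
    )


def validate_file(path: str) -> int:
    """Return the exit code for one metadata file: 0 to allow, 1 to block."""
    if DENY_ALL:
        return 1
    with open(path, encoding="utf-8") as handle:
        metadata = json.load(handle)
    return 1 if looks_like_spam(metadata) else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumption: Dataverse passes exactly one argument, the metadata file path.
    sys.exit(validate_file(sys.argv[1]))
```

Scanning every string in the JSON, rather than specific fields, keeps the sketch independent of the exact export layout. Unreadable or malformed input raises an exception, which also produces a non-zero exit code, so the script fails closed. Flipping DENY_ALL to True is one way to deny all publication requests during an active attack.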

Code Changes Being Considered

- Adjusting permissions and adding a "submit for review" step for collections, similar to the one for datasets. It is currently possible to limit users to creating (draft) datasets without permission to publish them. However, a user who can create a sub-collection is currently given permission to create and publish datasets within that sub-collection.
- Requiring email addresses to be verified before accounts can create collections or datasets. The Dataverse software currently includes a mechanism to validate user emails, but permissions are not restricted prior to validation. See #3300.
- Better ways to find/remove content created by a given account. See #7728.
- API throttling. See #1339.
- Empowering less technical archivists by adding deactivate/delete user functionality to the admin dashboard, and allowing archivists to see, via a dashboard button(?), the collections/datasets created by a given user. See #7239.
- Empowering less technical superusers by offering a graphical "destroy" option for published spam datasets.

This blog post has also been sent to the dataverse-community mailing list, and you are welcome to continue the discussion in that thread or the usual channels. We've also added it to the agenda of next week's community call.

The source of the "fighting spam" image is https://www.4kcc.com/blog/2019/04/15/fighting-spam/?doing_wp_cron=16669…