The project goal is style transfer based on style identification and, further down the line, using this as the basis for stylized rendering in computer graphics.
For the dataset, I am trying to extract artists with an established style whose work includes landscapes. I estimate this from the number of images carrying the desired tags relative to the artist's total image count.
The most lax rules I am considering result in around 130 thousand images. With a rudimentary artist filter, that drops to only 30 thousand. I'd like to raise that last number as much as possible while still ensuring there are at least 10-20 images per artist in the same established style (very important for quality results).
Currently I only use the API to get the URLs for images once I have filtered them (and I cache those URLs to keep traffic down). I thought about using Danbooru2018, but since I am not targeting image classification, I'd much prefer access to the high-resolution sources, and since I only need a small subset, buying a few TB of drives is overkill for me right now. So I intend to download and cache in stages as the project advances - I've made sure the images are exactly what I need (and they passed hand selection via preview images).
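The URL-fetch-and-cache step could look roughly like this. This is a minimal sketch, assuming the public Danbooru JSON API at `/posts.json`; the function names, the cache file layout, and the `limit` value are my own choices, not anything official:

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

# Assumed endpoint of the public Danbooru JSON API.
API = "https://danbooru.donmai.us/posts.json"

def cache_key(tags):
    """Normalize a tag list into a stable, order-independent cache key."""
    return " ".join(sorted(t.strip().lower() for t in tags))

def cached_post_urls(tags, cache_path="url_cache.json", limit=200):
    """Return file URLs for a tag query, hitting the API only on a cache miss."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    key = cache_key(tags)
    if key not in cache:
        query = urllib.parse.urlencode({"tags": " ".join(tags), "limit": limit})
        with urllib.request.urlopen(f"{API}?{query}") as resp:
            posts = json.load(resp)
        # Not every post exposes file_url (e.g. restricted posts), so guard for it.
        cache[key] = [p["file_url"] for p in posts if "file_url" in p]
        cache_file.write_text(json.dumps(cache))
    return cache[key]
```

Keying the cache on a normalized tag string means reordered or differently-cased queries still hit the same cached entry instead of re-downloading.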
Danbooru2018 is also a year old by now, so many artists who fell below the minimum image count back then would pass the test today. So I thought I'd use the API to check how many images each artist has uploaded to date.
Ideally, I'd still query "artist_name rating:s" to exclude artists who have tons of NSFW images but only a handful of matching safe ones - too few to stand on their own (plus it can be assumed that NSFW elements/influences bleed into those as well).
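The up-to-date count check with that rating filter could be sketched as below. I'm assuming the Danbooru counts endpoint (`/counts/posts.json`) and its `{"counts": {"posts": N}}` response shape; the helper names and the default minimum of 10 (the low end of the 10-20 range above) are mine:

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for fast post counts without paging through results.
COUNT_API = "https://danbooru.donmai.us/counts/posts.json"

def parse_count(payload):
    """Extract the post count from the counts endpoint's JSON payload."""
    return payload["counts"]["posts"]

def safe_post_count(artist):
    """Up-to-date count of an artist's posts, restricted to rating:s."""
    query = urllib.parse.urlencode({"tags": f"{artist} rating:s"})
    with urllib.request.urlopen(f"{COUNT_API}?{query}") as resp:
        return parse_count(json.load(resp))

def keeps_artist(artist, minimum=10):
    """Keep only artists whose safe-rated post count clears the minimum."""
    return safe_post_count(artist) >= minimum
```

Using the counts endpoint rather than fetching posts keeps the per-artist check to a single lightweight request, which matters when re-testing tens of thousands of artists.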