Improvements to content curation
Edit: Now that we're over the $500 mark, this is our current goal.
So, let's face it. Right now, OGA's art search feature isn't very good. The advanced search interface is cumbersome to work with, and sometimes search results that ought to be there just don't show up. For instance, one might think that if you typed the words "lpc base" in the search box, you'd find the LPC Base Assets, but in fact you actually get no results at all (or you might get a huge flood, YMMV). (FYI, the work-around in this case is to search for it again in the advanced search box on the left side of the search results, but even in that case, you get too many results, which still isn't particularly useful.) So, the other option is to browse by category, but the categories are so broad that that isn't particularly helpful either. This needs to be fixed.
Unfortunately, fixing this will require a large time investment, for several reasons:
- Right now, we use a Drupal module called Views to do art searches. I actually like Views a lot; it's great for quickly searching through and displaying lots of data, but frankly it's better suited to simpler types of searches. What I can do with the Advanced Search form, for instance, is currently limited by the capabilities of Views.
- To make narrower categories, we have to start collecting a lot more metadata. The problem with this is that people don't like entering a lot of data when they submit art. One frequent complaint I get from artists is that the form is already kind of a pain to fill out. On the other hand, a complaint we get from people looking for art is that there isn't enough metadata to help them find things. So at once, we're collecting too much information, and not enough. In order to rectify this situation, someone (myself, specifically) will need to go through new art when it's added and add the appropriate metadata to it. (As a side note, ages ago, when OGA was very young, we had narrower categories and let people classify things themselves, but art was constantly being miscategorized, so we dumped categories in favor of tags.)
I'd love to be able to automate some generation of metadata, but unfortunately, metadata is an inherently complicated thing. For instance, the metadata required for vector art is different from the metadata you'd need for pixel art. For pixel art map tiles, for instance, you probably want to know the per-tile resolution, but in the case of vector map tiles, resolution is irrelevant. For music, you might want to know the length and genre of the song. For 3D models, you probably want to know the polycount, texture resolution, whether it's rigged or static, etc. If we're smart, we can programmatically guess some of these things, but certain things, like musical genre, would require a much more sophisticated algorithm than we have processing power to run.
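To make the "different metadata per art type" point concrete, here is a minimal sketch in Python. The type names and field names are assumptions made up for illustration, not OGA's actual schema; the idea is only that each art type declares the fields it expects, so a form or a curator can see what is still missing.

```python
# Hypothetical mapping from art type to the metadata fields it expects.
# These names are illustrative assumptions, not OGA's real schema.
FIELDS_BY_TYPE = {
    "pixel_tiles": ["tile_width", "tile_height"],
    "vector_tiles": [],  # resolution is irrelevant for vector art
    "music": ["length_seconds", "genre"],
    "model_3d": ["polycount", "texture_resolution", "rigged"],
}


def missing_fields(art_type, metadata):
    """Return the expected fields for art_type that are absent from metadata."""
    return [field for field in FIELDS_BY_TYPE[art_type] if field not in metadata]


# A song with a genre but no length still needs its length filled in.
todo = missing_fields("music", {"genre": "chiptune"})
```

A list like `todo` could drive a curation queue: anything with a non-empty list still needs attention.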
So as I said, the solution to all this is that I'm going to have to go through and enter metadata as new art is submitted (not to mention going through the archives and adding it there too -- something that will likely take many, many hours). But before I even get to actually entering metadata, I need to figure out what is going to be the best way to store it, and then build a reasonably usable web form so that I can enter it without inducing any more headaches than absolutely necessary. In the process of doing all this, I'd also like to rebuild the search interface into something that a) actually works, and b) is more appropriate for searching through art.
I recently spoke with my friend Clint Bellanger (developer of FLARE), who has a lot of experience with metadata and content curation, and he gave me some really good suggestions. I'd like to switch our searching and indexing over to Apache Solr, which should be a big performance win, and will also allow some major improvements to the search form itself (not to mention vastly better results). Ultimately, what I'd like to arrive at is a search system that works a lot more like, for example, this one at the Auburn University Library. Note how quickly and easily you can add and remove search filters. Now, imagine that you're searching for art on OGA, and you can do that with all sorts of data that's specific to individual types of art, as well as universal things like license, favorite count, download count (which we'll be re-adding), and submission date. Here's a mockup image (click to enlarge):
Since this isn't implemented yet, you'll have to imagine that the results returned in the image are accurate and relevant, but that should give you a general idea of what we were thinking. And just to reiterate, this is a mockup, so it's subject to change.
How long will this take? It's hard to say, but since it'll be such a huge change, I can say for certain that it's likely to take weeks of actual programming time (which could translate into several months out here in The Real World). Beyond that, new art isn't going to curate itself, which means that even after it's done, there will be a constant (and probably growing) workload of making sure that new art is properly curated.
People have suggested gamification (that is, reputation points) to encourage people to help out, and I think that's an excellent idea, but when we eventually go that route, I'll have to put a lot of thought into ways to make sure that items aren't miscategorized or categorized inconsistently (a common problem if multiple people are sorting things into categories). Even if we enlist the help of users through reputation points, ultimately I'll still need to review their metadata for consistency.
So, for those of you who have been wondering why the content curation goal (which everyone understandably wants) has been set so high, it's because it's going to require a huge initial time investment and then a fairly constant investment of time later on (on top of the few hours per week of basic site management and maintenance).
If you've been curious if there's a good reason for you to donate to the OGA Patreon fund drive, this might be it. We're just about half way to this point at the time of this post, so if you want to help us out, go to our Patreon page, or help spread the word. :)
Peace,
Bart
Comments
I love the mock-up!
With Solr it's okay to have a large data schema that supports various content types. A song could leave the "Tile Size" field empty and it's not wasting any space or speed.
Maybe some of the Submit Art form can be collapsed in an Advanced Metadata section, so that users can opt to fill in the extra info if they want. The fields with enumerated acceptable values can be put in dropdowns.
Solr queries are often sorted by Relevancy. That algorithm works well with full text documents, so it's easily able to index the descriptions of OpenGameArt content. Even if art items aren't properly tagged and classified, Solr can find relevant keywords anywhere in the item data. It'll also easily index on various word forms, so someone searching for "swords" will still find items that contain "sword" in the description/title/tags.
The key is curation. Libraries like ours have a good search interface because we have a team of skilled metadata experts (catalogers) who work full-time on tagging and classifying data. This takes a massive amount of effort, but it's necessary. A collection of data is obviously way more useful if people can search it in a detailed way to find what they're looking for.
This would be really awesome. I'd be happy to help.
One of the issues I've found is due to the scoring/ordering of search results being poor. Completely irrelevant results can be shown before ones that might be more relevant. In some cases, something is included in the results because someone happens to say the word in the comments - e.g., see how the top result for sword is http://opengameart.org/content/flare-portrait-pack-female-edition because someone says it in the comments. Okay, there is a sword in the picture, but this probably shouldn't be anywhere near the top result if someone is searching for sword :)
The problem with it being hard to find the LPC base assets is that it doesn't get the top hit (or anywhere near it). Extra categorisation wouldn't help here. (Compare say with Google, which has no trouble finding the LPC assets on that search term - Google also does better with the sword example I think.)
If Solr's sorting is better, then I think that would be a big improvement in itself (though the proposals for better categorisation do sound good, and would be very useful in their own right.)
Clint could answer this better than me, but there may be a way to adjust relevance based on whether the keyword is in the submission or the comments.
Yep, it's easy to do in Solr. Directly in the query you can boost the relevancy for each specific field. Usually title weighs a lot more than the description, etc.
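As a rough sketch of what per-field boosting can look like, the Python below builds request parameters for Solr's Extended DisMax query parser, where `qf` lists the fields to search and `^` attaches a weight to each. The field names and the weights are assumptions for illustration only; real values would come from OGA's schema and from experimenting with results.

```python
from urllib.parse import urlencode


def build_search_params(user_query, boosts):
    """Build Solr edismax parameters with a per-field boost for each field."""
    # qf takes space-separated "field^weight" entries.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {"q": user_query, "defType": "edismax", "qf": qf}


# Hypothetical fields and weights: a hit in the title counts far more
# than a hit in the comments.
BOOSTS = {"title": 10, "tags": 5, "description": 2, "comments": 0.2}

params = build_search_params("sword", BOOSTS)
query_string = urlencode(params)  # ready to append to a select URL
```

With weights like these, an item that only mentions "sword" in its comments would still match, but should rank well below items with "sword" in the title or tags.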
I'm guessing there's a fair amount of playing around with weights to get the best results.
That sounds good! There's also this post on user-tagging: http://opengameart.org/forumtopic/user-tagging . I don't know if allowing user tags would make the problem of miscategorisation worse (because of more people doing it inconsistently) or better (since we already have the problem of poor choice of tags, which other people can't fix) though :)
For the record, I've been curating a few collections sort of in preparation for this. I have archive-complete collections of 2D platformer and isometric assets. That may be useful for more specific categorization or tagging.
I'm more than happy to take care of my own art, and for my part I welcome a submission form that provides standardized input such as polycount and texture sizes. However, I'm not always certain what tags to add. I generally feel that my choice of tags can be myopic, and that my tags are influenced by my own preconceived notions of what the item is designed for.
Re: search weighting - perhaps a feature that lets users indicate relevance on search results, a check box next to each result item.
Just a quick question. When you re-add the download count, is it going to start from 0, from where they stopped or you are still keeping track of them?
The data is still there, so theoretically we should be able to keep it. I don't want to make any promises, but at the moment I don't see why we wouldn't.
This might be more work than it's worth but allowing people that aren't the content owners to add tags might be useful. If someone uploads something and it's missing an obvious tag an admin or even long-term user might be allowed to add tags so content can be made more search-friendly.
I can't say that I've got enough time to be a full admin or anything but I am frequently on the site watching the new art come in so I can collect it into the LPC art collection. Maybe you could have a new tier of user privileges for people like me entitled "curator" where all they do is add the relevant metadata for new art submissions. I'd be happy to step up to a role like that for the 2d art category anyway.
+1 on pennomi's suggestion. I could do that too pretty easily.
Sounds good to me. I had considered making it something that anyone can do if they earn enough reputation, but it's probably best to keep it to a small group so that we can be consistent.
I would also like to volunteer as a "curator". I look at every piece of new 2D art anyway (usually within a day of its submission), and I've already been through the entire archive of 2D several times now. Reviewing each new submission would not take much more of my time than I already commit.
I also have no problem sticking to community standards for consistent categorization, though it may take me a few attempts to understand the eccentricities of categorizing things consistently with the rest of the community. :)
...This may not be within the scope of this thread, but how do you feel about a more normalized list of tags? Instead of allowing people to enter any-dang-thing as a tag, have a predefined (though always expanding and improving) list of tags. Or perhaps let the submitter type in tags, but then suggest normalized tags that seem to relate to the art or the tag text they submitted? user types "32 pixel squares", then submit form says "we suggest '32x32 pixel tiles'" or something. Could be too much work for not enough return.
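A minimal sketch of the "suggest a normalized tag" idea, using plain string similarity from Python's standard library. The list of known tags and the cutoff are assumptions for illustration. Note that this only catches near-spellings (plurals, typos); mapping something like "32 pixel squares" to "32x32 pixel tiles" would more likely need a hand-curated synonym table.

```python
import difflib

# Hypothetical normalized tag list; a real one would be curated over time.
KNOWN_TAGS = ["sword", "shield", "32x32 pixel tiles", "platformer"]


def suggest_tags(raw_tag, known_tags, cutoff=0.6):
    """Return up to three known tags that closely resemble the typed tag."""
    cleaned = raw_tag.strip().lower()
    return difflib.get_close_matches(cleaned, known_tags, n=3, cutoff=cutoff)


# The submit form could show these as "did you mean" suggestions.
suggestions = suggest_tags("Swords", KNOWN_TAGS)
```

An empty result would simply mean the form has nothing to suggest and the typed tag is kept as-is (or queued for a curator to look at).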
Would it be easier, as an interim solution, to embed the google site search box (not sure if this is costly)?
For example, this is what the "lpc base" example looks like: https://encrypted.google.com/search?hl=en&q=lpc%20base%20site%3Aopengame...
(lpc base assets shows up as #1)
@congusbongus
That's an excellent idea in theory but it doesn't work on a predictable schedule. Google search bot randomly goes through a site and indexes content on its own schedule that could be anywhere from 5 minutes to 2 months between sweeps. The other problem with that is it encourages spam bots if they see site search is provided by a search engine (more links to a site from external sources will bump a site up on search results).
It also searches things like blog entries, static content, forums, and user pages even if you don't really care about any of those things (when you want game content you likely don't want a blog post). Using an external search function is only a good idea if most of your content is the same type (forum posts, art submissions, user pages, etc.). On a site the size and age of OGA that would make search results less significant and less useful than they are now (which people complain about).
Also, controlling how search results are displayed would be next to impossible since searching would be handled outside of OGA. You couldn't have a music preview button next to an image like it is now (rather nice if you ask me). Having a good internal search function is important for finding useful and relevant content.
A few ideas, not sure if practical or not:
Let's have the tagging work done in hierarchies. (Re-)tagging stuff could be done by the majority of people (some points earned, some medals, member since X, etc.). These changes are immediately published, to avoid (most of the) version conflicts. Then there are curators, who are able to see a list of recently retagged items: a preview image that serves as a link as usual, with a kind of "diff" display (added tags in green, removed tags in red). It should not take much time to go through that list and check for mistakes (at least significantly less time than actually doing the work!). Then, the change can be marked as "verified" and won't show up in that list again. Or, there is an option to view verified ones too, if one wants to double-check.
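The "diff" display for curators could be computed with simple set arithmetic. This is only a sketch in Python; the function name and the dictionary shape are made up for illustration, and rendering the result in green and red would be left to the page template.

```python
def tag_diff(old_tags, new_tags):
    """Compare two tag lists and report which tags were added or removed."""
    old, new = set(old_tags), set(new_tags)
    return {
        "added": sorted(new - old),    # would be shown in green
        "removed": sorted(old - new),  # would be shown in red
    }


# Example: a user replaced "weapon" with "melee" on a sword sprite.
change = tag_diff(["sword", "weapon"], ["sword", "melee"])
```

A retag where both lists come out empty changed nothing and would not need to appear in the curator's list at all.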
To help keep the tagging consistent and structured: Tag the tags! In other words, categorize them. Instead of a single tag input field, there would be one per category. This adds more managing work in the first place, but the advantages are:
Examples of what I mean: somebody decides to tag things by their dominant colour, or: somebody decides to tag beings by the impression they leave (sad, frightening, shady and such). For sure it might be useful here and there, but only if the majority of content is tagged with this in regard.
By using categories, it can be decided if there should be a "dominant colour" category or not. Of course, there can be a "misc" category where you put in tags that don't fit in categories, and of course taggers can add tags to a certain category if it isn't there yet.
Also, there is the thing with hierarchies. We have a sword, and add the tags "weapon, melee, sword" - it would be nice if one adds just "sword" and the other two get added automatically. But then, hierarchies are kind of inflexible, and there might be occasions where you don't want it. Because you invented the throwing sword, that is not a melee weapon :P Here there should probably be a way to override the auto-adding. Maybe they appear in a separate field (so many fields! I know I know) below the input field with a [X] prefixed, and can be deleted if inappropriate.
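Here is a small Python sketch of the auto-adding idea with an override. The parent table is a made-up example, and it assumes each tag has at most one parent and that the table contains no cycles. Tags the submitter deleted (the [X] button) go into `excluded` and are skipped, while tags further up the chain are still added.

```python
# Hypothetical tag hierarchy: each tag maps to its single parent tag.
PARENTS = {"sword": "melee", "melee": "weapon"}


def expand_tags(tags, excluded=()):
    """Add every ancestor of each tag, except ancestors the user removed."""
    result = set(tags)
    for tag in tags:
        parent = PARENTS.get(tag)
        while parent is not None:  # walk up until a tag has no parent
            if parent not in excluded:
                result.add(parent)
            parent = PARENTS.get(parent)
    return sorted(result)


plain_sword = expand_tags(["sword"])
throwing_sword = expand_tags(["sword"], excluded={"melee"})
```

In this sketch the throwing sword loses "melee" but keeps "weapon", which seems like the behaviour one would want; whether excluding a tag should also drop its ancestors is a design choice that would need deciding.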
(This is all written with complete ignorance to technical doability.)
I want to answer you, but Mollom blocks my post without offering any way to get it through (like a captcha). If you need help with the spam filter, maybe I can suggest something, for example flagging authentic members or using an invisible box. Here, the word b.o.t alone triggers the spam filter.
I've posted in the general forum now.
opengameart org forumtopic improve-the-search-engine
Thanks to Mollom, everyone needs to guess the . and /