Update: Okay, so I finally got around to running the analysis on this after the semester ended. As mentioned above, I focused on Rasch Measurement techniques, and this was only intended as an initial trial. This is both a summary of how that went and an opening for new discussion on how to improve the instrument for a second trial. If you're interested in what the instrument I used looked like, please check the post immediately above this one in the discussion thread. The TL;DR: a lot of things went right, and one major thing went wrong. This isn't gonna be short.
Methods
Sample: I went for a convenience sample. Ottawa Park is the closest course to the University. Hole 9 let me catch players as they finished the round while still in the presence of the hole, which allowed them to place themselves easily in the mindset of a player at that hole. A number of studies encourage reminding survey respondents of their mindset through subtle means: responding immediately after experiencing the phenomenon, being in its presence, being shown photos, etc. Hole 9's basket is also conveniently located near parking. I also chose the hole because it is generally fun, but straightforward and conventional in nature. Conventional is good for a first pilot, if not for all successive runs. Since the hole has no water, only 19 of the 20 questions were applicable.
Hole Description: Hole #9 from the short tees at Toledo's Ottawa Park DGC was chosen. It's a fairly straightforward hole designed by Jim Kenner back in 1996 with casual competitive golfers in mind: straight to fade, 250' for a RHBH, with a low-ceiling tree halfway up and the basket sitting on a significantly elevated spine that runs up the left side and then across the backside of the fading fairway in a lowercase "r" shape. The spine forces you to play a bit higher toward the ceiling, or throw a lower, faster disc. With an OB road about 50-60' deep, the lower and faster option brings that into play if you overshoot the crest of the ridge. The basket is positioned at the letter join of the "r," with a large bush short and deep of it creating low-ceiling areas of the green. The left of the basket is also the steepest drop-off, putting a 6' tall player looking up at the base of the pole.
Participants: In the end I wound up with 45 participants. I did not take demographics with this trial; best practice with Rasch seems to be running trials to arrive at a solid construct before beginning to look at how demographics impact output. Participants varied in terms of whether I knew them and in their general skill level/set. Initially I only informed them that this was a trial on hole enjoyment. After a few participants I had to modify the conversation to include assurances that this was in no way intended to be a referendum on hole 9, nor was there any plan to change or remove hole 9. Even after that, some participants who enjoy hole 9 verbally expressed some trepidation about not answering positively to items such as: "This hole is a reason to return to this course." Participants were told not to respond to the question about appropriateness of use of water hazards, as the hole has no water. Any responses to that item were thrown out.
Analysis
I looked at 5 different elements of the instrument...
1. Reliability and separation of items and participants
2. Rating scale functionality
3. Item fit
4. Participant fit
5. Instrument Dimensionality
I again want to note that this is a trial and that the purpose of this is revision of the tool itself. In its final form the input data would not be tweaked in the way we work with it here.
1. Reliability and separation of items and participants
Separation looks at how well participants and items spread out into a 'ranking' relative to the measurement error range. It tells us whether participants are being ranked in a way that makes it unlikely they are "jumbled" or "out of order," and it helps define the number of strata of participants and items you have in terms of ability and difficulty (respectively). Reliability is related to variance: the closer our reliability is to 1.0, the more of our variance comes from the instrument rather than from the expected error.
For a standardized test you want reliability of .95. For most testing purposes .8 is sufficient, for example when you're working with item banks and selecting different tests for different sections of students. For survey data .7 is considered the gold standard, given how many fewer things tend to be under the interviewer's control compared to standardized or classroom circumstances. The instrument generated initial scores, without any modifications, of: item .87, participant .83. After a few modifications were made to the data, which will be explained in the fit sections, the reliability increased to: item .93, participant .89.
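For anyone who wants to poke at these numbers themselves, here's a minimal sketch of the standard Rasch relationship between reliability and separation (reliability = separation² / (1 + separation²)). The reliabilities plugged in below are the ones reported in this post; the round-trip to separation is only approximate, since the software works from the full error distribution rather than a single summary value.

```python
# Sketch: standard Rasch relationship between reliability (R) and separation (G).
# G is the ratio of the "true" spread of measures to their average measurement
# error, so R = G**2 / (1 + G**2) and G = sqrt(R / (1 - R)).
from math import sqrt

def separation_from_reliability(r):
    return sqrt(r / (1 - r))

def reliability_from_separation(g):
    return g**2 / (1 + g**2)

for label, r in [("item, initial", 0.87), ("participant, initial", 0.83),
                 ("item, revised", 0.93), ("participant, revised", 0.89)]:
    print(f"{label}: R = {r:.2f} -> G ~ {separation_from_reliability(r):.2f}")
```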
What were the modifications made to the data? In the end 2 participants were removed, and 4 items were removed.
The separation index is the ratio of the true spread of the measures (in logits) to their measurement error, and it can be converted to tell us approximately how many strata of participants and items we can expect, i.e., groupings in terms of ability or difficulty. In the end our participant separation moved from 2.20 to 2.91, and item separation from 2.60 to 3.24. Depending on the intent of the instrument certain ranges can be acceptable, but all we want here is to see whether it lines up with the "eye test"; if not, we need to investigate further problems. As it is, we see approximately 4 strata, though considering how close we are to 3.0 on both numbers there is some jumbling of participants (confirmed with the eye test using a Guttman scalogram, sorting all items and participants by descending difficulty/ability). Participant grouping will be helpful for later classification after further trials. Item grouping demonstrates that, in terms of how "easy" it is to respond "strongly agree," we currently have 4 groups. Removing the items/participants increased separation by quite a bit.
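The strata figure comes from the usual conversion; I'm assuming Wright's strata formula, (4G + 1) / 3, which is the standard one. A tiny sketch applying it to the separations above:

```python
# Sketch: separation-to-strata conversion (Wright's strata formula, H = (4G + 1) / 3).
# Assumed to be the conversion behind the "approximately 4 strata" figure above.
def strata(g):
    return (4 * g + 1) / 3

for label, g in [("participant, initial", 2.20), ("participant, revised", 2.91),
                 ("item, initial", 2.60), ("item, revised", 3.24)]:
    print(f"{label}: G = {g:.2f} -> strata ~ {strata(g):.1f}")
```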
2. Rating scale functionality
This can be quick: I used a rating scale that has been validated across a number of different studies and that is clear to participants. Just a 4-category setup: "strongly disagree," "disagree," "agree," "strongly agree," along with an "I prefer not to respond" option. I avoided a middle category because middle categories are, more often than not, interpreted differently by different participants, which harms rating scale functionality. The rating scale is validated using structure calibration statistics that can be represented graphically as a waveform-looking response probability graph. For a valid rating scale, each response category should peak as the most probable response for some range of participant measure minus item difficulty. An appropriately monotonic, increasing scale will show the thresholds between categories advancing in order, from easiest to hardest to endorse. One of the modifications made, to be discussed further in the fit section, involved an item in which respondents had stark inverse response trends. Removing that item brought acceptable separation to the peaks while preserving the monotonicity that already existed.
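If the "waveform" graph is hard to picture, here's a minimal sketch of what those category probability curves look like under the Andrich rating scale model. The thresholds are invented for illustration, not estimated from the pilot data.

```python
# Sketch: category probability curves for a 4-category rating scale under the
# Andrich rating scale model.  Thresholds below are hypothetical.  A healthy
# scale shows each category as the most probable response somewhere along the
# (person measure - item difficulty) axis, with ordered thresholds.
import numpy as np

thresholds = [-1.5, 0.0, 1.5]            # hypothetical ordered step calibrations
x = np.linspace(-4, 4, 161)              # person measure minus item difficulty

# Unnormalized log-probability of category k is the cumulative sum of
# (x - threshold_j) for j <= k, with category 0 fixed at 0.
logits = np.zeros((len(x), 4))
for k in range(1, 4):
    logits[:, k] = logits[:, k - 1] + (x - thresholds[k - 1])
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

for k, label in enumerate(["strongly disagree", "disagree", "agree", "strongly agree"]):
    peak = x[np.argmax(probs[:, k])]
    print(f"{label}: most probable near {peak:+.1f} logits")
```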
3. Item Fit
I'm gonna start with a bit on what we're looking at with our fit scores.
Fit scores work the same way for participants and items. They are reported as mean square statistics with an expected value of 1, and what we're looking for here is an output between .6-1.4 (.8-1.2 for standardized testing). If your fit scores are too low, your item or your participant is not contributing new information. For example, I mentioned a Guttman scalogram earlier. Picture a sample of 10 participants who score 10%, 20%, 30%, etc. on a test where the easiest item is answered correctly by 100% of participants, the next easiest by 90%, then 80%, and so on. That output would give each item and each participant a relatively low fit score, because the overall estimation of item difficulty and participant skill could be done without any one item. High fit scores are like maxing out the radio: you're being fed misinformation, responses that for some reason do not match up with the general response pattern and seem to be distorting the results. We look at two forms of fit for each: infit and outfit. Infit is sensitive to places where groups might shift outcome results (a high infit might indicate something like a group of forehand throwers disliking a backhand hole). Outfit is sensitive to outlier items or rogue individuals. At this trial stage we are more focused on the outfit. Infit will be looked at more in future pilots, especially when we get to the question of demographic analysis.
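For the curious, a minimal sketch of how those mean square fit statistics are computed from model residuals. The observed/expected/variance numbers below are invented for illustration, not pulled from the pilot data.

```python
# Sketch: infit and outfit mean squares for one item, given model-expected
# scores and variances for each participant.  All numbers are invented.
import numpy as np

observed = np.array([3, 2, 3, 1, 0, 3, 2, 2])    # ratings 0-3 on one item
expected = np.array([2.6, 2.1, 2.8, 1.4, 0.7, 2.5, 1.9, 2.2])
variance = np.array([0.5, 0.8, 0.4, 0.9, 0.6, 0.6, 0.8, 0.7])

resid = observed - expected
z2 = resid**2 / variance                      # squared standardized residuals

outfit = z2.mean()                            # unweighted: sensitive to outliers
infit = (resid**2).sum() / variance.sum()     # information-weighted: sensitive to
                                              # patterned misfit near the item's
                                              # difficulty
print(f"outfit MNSQ = {outfit:.2f}, infit MNSQ = {infit:.2f}")
# Rule of thumb from above: flag values outside roughly 0.6-1.4 for survey data.
```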
We had some small problems here with 4 items. 2 of those items were recommended by the DGCR community, and I feel they need to be replaced with something that works in a similar fashion but does not have the problems my wording of them did.
The 2 items I do not have a problem with removing both had to do with participant perception of the safety of the hole. While I think the questions have value, they were clearly not measuring the same construct as the rest of the survey. A question of how GOOD a hole is could include them, but I think I was of a wrong mind in including them here. On the sample survey in the post above these are the 16th and 17th items.
The 2 items I want to keep in some form were the 5th and 12th items, included on the recommendation of DGCR participants and, I think, simply worded poorly (a problem we will get to more when discussing dimensionality):
This hole's fairway is visually appealing.
The surrounding area, when viewed from this hole, is visually appealing.
In the case of the visual-appeal items, the general structure of the output aligned with neither the rest of the survey nor with each other. Given each item's difficulty relative to the rest of the items, the scale responses were not in proportion with the rest. What this indicates is that each is measuring a separate construct from enjoyment of the disc golf hole. What we want is a wording in which the visual appeal applies to the enjoyment of the hole; that would tie it to the construct.
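One way I could check this more formally in the next pass is to correlate the Rasch residuals of the two visual-appeal items: if they share variance that the main "hole enjoyment" dimension doesn't explain, their residuals should correlate more strongly with each other than with the other items. A sketch of the mechanics, with an invented residual matrix (the item indices are just my guess at where those items would sit in the scored data):

```python
# Sketch: checking whether two items share a secondary construct by correlating
# their Rasch residuals (observed minus model-expected responses) across
# participants.  The residual matrix below is invented; in practice it would
# come from the fitted model.
import numpy as np

rng = np.random.default_rng(0)
residuals = rng.normal(size=(45, 19))     # 45 participants x 19 scored items

# Hypothetical 0-based column indices for the 5th and 12th survey items.
visual_items = [4, 11]

corr = np.corrcoef(residuals, rowvar=False)
r = corr[visual_items[0], visual_items[1]]
print(f"residual correlation between the visual-appeal items: {r:+.2f}")
# A noticeably positive residual correlation, relative to other item pairs,
# would suggest the two items share variance the main dimension doesn't explain.
```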
TO BE CONTINUED! (character limit)