Update: Okay, so I finally got around to running the analysis on this after the semester ended. As mentioned above, I focused on Rasch Measurement techniques, and this was only intended as an initial trial. This is both a summary of how that went and an opening for new discussion on how to improve the instrument for a second trial. If you're interested in what the instrument I used looked like, please check the post immediately above this one in the discussion thread. The TL;DR: a lot of things went right, and one major thing went wrong. This isn't gonna be short.
Methods
Sample: I went for a convenience sample. Ottawa Park is the closest course to the University. Hole 9 let me catch players as they finished the round while still in the presence of the hole, which allowed them to place themselves easily in the mindset of a player at that hole. A number of studies encourage reminding survey respondents of their mindset through subtle means: responding immediately after experiencing the phenomenon, being in its presence, being shown photos, etc. Hole 9's basket is also conveniently located near parking. I also chose the hole because it is generally fun, but straightforward and conventional in nature. Conventional is good for a first pilot, if not for all successive runs. Since the hole has no water, only 19 of the 20 questions were applicable.
Hole Description: Hole #9 from the short tees at Toledo's Ottawa Park DGC was chosen. It's a fairly straightforward hole designed by Jim Kenner back in 1996 with casual competitive golfers in mind: straight to fade, 250' for a RHBH, with a low-ceiling tree halfway up and the basket sitting on a significantly elevated spine that runs up the left side and then across the backside of the fading fairway in a lowercase "r" shape. The spine forces you to play a bit higher toward the ceiling, or throw a lower, faster disc. With an OB road about 50-60' deep, the lower and faster option brings that into play if you overshoot the crest of the ridge. The basket is positioned at the letter join of the "r," with a large bush short and deep of it creating low-ceiling areas of the green. The left of the basket is also the steepest drop-off, putting a 6' tall player looking up at the base of the pole.
Participants: In the end I wound up with 45 participants. I did not take demographics with this trial; best practice with Rasch seems to be running trials to arrive at a solid construct before beginning to look at how demographics impact output. Participants varied in terms of whether I knew them and in their general skill level/set. Initially I only informed them that this was a trial on hole enjoyment. After a few participants I had to modify the conversation to include assurances that this was in no way intended to be a referendum on hole 9, nor was there any plan to change or remove hole 9. Even after that, some participants who enjoy hole 9 verbally expressed some trepidation about not answering positively to items such as: "This hole is a reason to return to this course." Participants were told not to respond to the question about appropriateness of use of water hazards, as the hole has no water. Any responses to that item were thrown out.
Analysis
I looked at 5 different elements of the instrument...
1. Reliability and separation of items and participants
2. Rating scale functionality
3. Item fit
4. Participant fit
5. Instrument Dimensionality
I again want to note that this is a trial and that the purpose of this is revision of the tool itself. In its final form the input data would not be tweaked in the way we work with it here.
1. Reliability and separation of items and participants
Separation looks at how well participants and items spread out into a 'ranking' relative to the measurement error range. It tells us whether participants are being ranked in a way that makes it unlikely they are "jumbled" or "out of order," and it helps define the number of strata of participants and items you have in terms of ability and difficulty (respectively). Reliability is related to variance: the closer our reliability is to 1.0, the more of our variance comes from the instrument rather than from the expected error.
For a standardized test you want reliability of .95. For most testing purposes .8 is sufficient, for example when you're working with item banks and selecting different tests for different sections of students. For survey data .7 is considered the gold standard, given how many fewer things tend to be under the interviewer's control compared to standardized or classroom circumstances. The instrument generated initial scores, without any modifications, of: item .87, participant .83. After a few modifications were made to the data, which will be explained in the fit sections, the reliability increased to: item .93, participant .89.
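For anyone who wants to poke at these numbers themselves, here's a minimal sketch of the standard Rasch relationship between reliability and separation (reliability = separation² / (1 + separation²)). The reliabilities plugged in below are the ones reported in this post; the round-trip to separation is only approximate, since the software works from the full error distribution rather than a single summary value.

```python
# Sketch: standard Rasch relationship between reliability (R) and separation (G).
# G is the ratio of the "true" spread of measures to their average measurement
# error, so R = G**2 / (1 + G**2) and G = sqrt(R / (1 - R)).
from math import sqrt

def separation_from_reliability(r):
    return sqrt(r / (1 - r))

def reliability_from_separation(g):
    return g**2 / (1 + g**2)

for label, r in [("item, initial", 0.87), ("participant, initial", 0.83),
                 ("item, revised", 0.93), ("participant, revised", 0.89)]:
    print(f"{label}: R = {r:.2f} -> G ~ {separation_from_reliability(r):.2f}")
```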
What were the modifications made to the data? In the end 2 participants were removed, and 4 items were removed.
The separation index is the ratio of the true spread of the measures (in logits) to their measurement error, and it can be converted to tell us approximately how many strata of participants and items we can expect, i.e., groupings in terms of ability or difficulty. In the end our participant separation moved from 2.20 to 2.91, and item separation from 2.60 to 3.24. Depending on the intent of the instrument certain ranges can be acceptable, but all we want here is to see whether it lines up with the "eye test"; if not, we need to investigate further problems. As it is, we see approximately 4 strata, though considering how close we are to 3.0 on both numbers there is some jumbling of participants (confirmed with the eye test using a Guttman scalogram, sorting all items and participants by descending difficulty/ability). Participant grouping will be helpful for later classification after further trials. Item grouping demonstrates that, in terms of how "easy" it is to respond "strongly agree," we currently have 4 groups. Removing the items/participants increased separation by quite a bit.
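The strata figure comes from the usual conversion; I'm assuming Wright's strata formula, (4G + 1) / 3, which is the standard one. A tiny sketch applying it to the separations above:

```python
# Sketch: separation-to-strata conversion (Wright's strata formula, H = (4G + 1) / 3).
# Assumed to be the conversion behind the "approximately 4 strata" figure above.
def strata(g):
    return (4 * g + 1) / 3

for label, g in [("participant, initial", 2.20), ("participant, revised", 2.91),
                 ("item, initial", 2.60), ("item, revised", 3.24)]:
    print(f"{label}: G = {g:.2f} -> strata ~ {strata(g):.1f}")
```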
2. Rating scale functionality
This can be quick: I used a rating scale that has been validated across a number of different studies and that is clear to participants. Just a 4-category setup: "strongly disagree," "disagree," "agree," "strongly agree," along with an "I prefer not to respond" option. I avoided a middle category because middle categories are, more often than not, interpreted differently by different participants, which harms rating scale functionality. The rating scale is validated using structure calibration statistics that can be represented graphically as a waveform-looking response probability graph. For a valid rating scale, each response category should peak as the most probable response for some range of participant measure minus item difficulty. An appropriately monotonic, increasing scale will show the thresholds between categories advancing in order, from easiest to hardest to endorse. One of the modifications made, to be discussed further in the fit section, involved an item in which respondents had stark inverse response trends. Removing that item brought acceptable separation to the peaks while preserving the monotonicity that already existed.
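If the "waveform" graph is hard to picture, here's a minimal sketch of what those category probability curves look like under the Andrich rating scale model. The thresholds are invented for illustration, not estimated from the pilot data.

```python
# Sketch: category probability curves for a 4-category rating scale under the
# Andrich rating scale model.  Thresholds below are hypothetical.  A healthy
# scale shows each category as the most probable response somewhere along the
# (person measure - item difficulty) axis, with ordered thresholds.
import numpy as np

thresholds = [-1.5, 0.0, 1.5]            # hypothetical ordered step calibrations
x = np.linspace(-4, 4, 161)              # person measure minus item difficulty

# Unnormalized log-probability of category k is the cumulative sum of
# (x - threshold_j) for j <= k, with category 0 fixed at 0.
logits = np.zeros((len(x), 4))
for k in range(1, 4):
    logits[:, k] = logits[:, k - 1] + (x - thresholds[k - 1])
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

for k, label in enumerate(["strongly disagree", "disagree", "agree", "strongly agree"]):
    peak = x[np.argmax(probs[:, k])]
    print(f"{label}: most probable near {peak:+.1f} logits")
```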
3. Item Fit
I'm gonna start with a bit on what we're looking at with our fit scores.
Fit scores work the same way for participants and items. They are reported as mean square statistics with an expected value of 1, and what we're looking for here is an output between .6-1.4 (.8-1.2 for standardized testing). If your fit scores are too low, your item or your participant is not contributing new information. For example, I mentioned a Guttman scalogram earlier. Picture a sample of 10 participants who score 10%, 20%, 30%, etc. on a test where the easiest item is answered correctly by 100% of participants, the next easiest by 90%, then 80%, and so on. That output would give each item and each participant a relatively low fit score, because the overall estimation of item difficulty and participant skill could be done without any one item. High fit scores are like maxing out the radio: you're being fed misinformation, responses that for some reason do not match up with the general response pattern and seem to be distorting the results. We look at two forms of fit for each: infit and outfit. Infit is sensitive to places where groups might shift outcome results (a high infit might indicate something like a group of forehand throwers disliking a backhand hole). Outfit is sensitive to outlier items or rogue individuals. At this trial stage we are more focused on the outfit. Infit will be looked at more in future pilots, especially when we get to the question of demographic analysis.
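For the curious, a minimal sketch of how those mean square fit statistics are computed from model residuals. The observed/expected/variance numbers below are invented for illustration, not pulled from the pilot data.

```python
# Sketch: infit and outfit mean squares for one item, given model-expected
# scores and variances for each participant.  All numbers are invented.
import numpy as np

observed = np.array([3, 2, 3, 1, 0, 3, 2, 2])    # ratings 0-3 on one item
expected = np.array([2.6, 2.1, 2.8, 1.4, 0.7, 2.5, 1.9, 2.2])
variance = np.array([0.5, 0.8, 0.4, 0.9, 0.6, 0.6, 0.8, 0.7])

resid = observed - expected
z2 = resid**2 / variance                      # squared standardized residuals

outfit = z2.mean()                            # unweighted: sensitive to outliers
infit = (resid**2).sum() / variance.sum()     # information-weighted: sensitive to
                                              # patterned misfit near the item's
                                              # difficulty
print(f"outfit MNSQ = {outfit:.2f}, infit MNSQ = {infit:.2f}")
# Rule of thumb from above: flag values outside roughly 0.6-1.4 for survey data.
```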
We had some small problems here with 4 items. 2 of those items were recommended by the DGCR community, and I feel they need to be replaced with something that works in a similar fashion but does not have the problems my wording of them did.
The 2 items I do not have a problem with removing both had to do with participant perception of the safety of the hole. While I think the questions have value, they were clearly not measuring the same construct as the rest of the survey. A question of how GOOD a hole is could include them, but I think I was of a wrong mind in including them here. On the sample survey in the post above these are the 16th and 17th items.
The 2 items I want to keep in some form were the 5th and 12th items, included on the recommendation of DGCR participants and, I think, simply worded poorly (a problem we will get to more when discussing dimensionality):
This hole's fairway is visually appealing.
The surrounding area, when viewed from this hole, is visually appealing.
In the case of the visual-appeal items, the general structure of the output aligned with neither the rest of the survey nor with each other. Given each item's difficulty relative to the rest of the items, the scale responses were not in proportion with the rest. What this indicates is that each is measuring a separate construct from enjoyment of the disc golf hole. What we want is a wording in which the visual appeal applies to the enjoyment of the hole; that would tie it to the construct.
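One way I could check this more formally in the next pass is to correlate the Rasch residuals of the two visual-appeal items: if they share variance that the main "hole enjoyment" dimension doesn't explain, their residuals should correlate more strongly with each other than with the other items. A sketch of the mechanics, with an invented residual matrix (the item indices are just my guess at where those items would sit in the scored data):

```python
# Sketch: checking whether two items share a secondary construct by correlating
# their Rasch residuals (observed minus model-expected responses) across
# participants.  The residual matrix below is invented; in practice it would
# come from the fitted model.
import numpy as np

rng = np.random.default_rng(0)
residuals = rng.normal(size=(45, 19))     # 45 participants x 19 scored items

# Hypothetical 0-based column indices for the 5th and 12th survey items.
visual_items = [4, 11]

corr = np.corrcoef(residuals, rowvar=False)
r = corr[visual_items[0], visual_items[1]]
print(f"residual correlation between the visual-appeal items: {r:+.2f}")
# A noticeably positive residual correlation, relative to other item pairs,
# would suggest the two items share variance the main dimension doesn't explain.
```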
TO BE CONTINUED! (character limit)