For Wimbledon 2017 we’ve been working on a project, #WhatMakesGreat, to combine statistics, cognitive analytics and fan opinion to help understand what makes a great Wimbledon champion.
With machine learning having a growing impact on people’s lives (in much more significant ways than working out what makes for a great tennis player) the social challenges it presents have evolved into controversial issues. Perhaps the most prominent of these is bias.
Bias in machine learning is a complex problem, but in simple terms, the algorithm is only as good as the training it receives. If there is bias in the training or source data then the answers it provides will be biased too. It’s argued that any training set derived from human decisions will have inherent bias, even though it may not be immediately visible. Developers may also inadvertently add their own biases into a system, based on experience, demographic, location, heritage and other factors impacting on objective decisions.
A machine in that sense can demonstrate more bias than a human, with bigger ramifications. Machine learning systems don’t always, or can’t, give explanations for their decisions. People treat them as a black box and take the answers they give without question, leaving inbuilt bias unchecked.
@daleinnis made a valid point about bias in #WhatMakesGreat. Working as one of the IBM developers, I wanted to explain our thinking around bias, where we thought it might be present and what we did to mitigate, or at least acknowledge it.
Although we wanted to make the most of IBM’s cognitive capability, the ultimate goal of the project was not to give a definitive or objective answer on who is the greatest Wimbledon player of all time or what attributes they need. There can’t be a definitive answer, as greatness means different things to different people, and that is the point. The goal was to provide insight that prompts debate, challenges people’s assumptions and makes us all question long-held beliefs.
Humans at the end of the Loop
Throughout the project we used technology, whether it be statistical functions or cognitive analytics, as tools. We formed an editorial team from sports journalists, tennis coaches, statisticians, creatives from the campaign team and software developers.
The project was structured around a weekly editorial cycle. We took one area of greatness each week and the different specialities within the team used their own tools (statistics, cognitive analytics or just years of tennis experience) to tackle the subject in their own way. We then regrouped for an afternoon and discussed what we’d found. Sometimes the statistics and the cognitive analysis supported each other, but it was more revealing when they did not. That prompted both sides to look back at their analysis and ask questions they’d not previously thought of. The surprise to me with using cognitive tools was not the results it gave us, but the questions it made us ask.
The best example was when Marion Bartoli featured prominently in the cognitive results for a positive perception of her serve. The tennis coaches and statisticians questioned this: Bartoli is not known for having a good serve, and her average stats are weak for a top player. That made us look deeper at what the cognitive system was telling us. The positive descriptions were people noticing she was working hard on her serve and Bartoli herself talking about working to improve it. With this information the statisticians produced a new report, looking at her serve statistics over time. There was a clear upward trend, quite different from the other great players, who were more consistent over their careers. The upward trend peaked in 2013, when she won Wimbledon. Even then, her serve wasn’t a strength, but she’d raised it to the point where it was no longer a weakness. It’s not that we couldn’t have worked this out before; it’s that without the cognitive analysis we wouldn’t have known to look.
The key here in terms of bias is that at no point did we rely on either statistics or cognitive analytics for an answer. The discussion around the data produced by the tools was where the conclusions came from. Much of the debate would focus on the influences or biases that led to the results. This was particularly true when we were comparing male and female players, or singles and doubles players. This is not a problem that is unique to machine learning. We spent time discussing how best to compare male and female statistics, particularly in areas such as stamina, where the rules of the game have an effect (five sets compared to three).
We had a reasonable amount of diversity within the editorial team. Some people had been close to the game for decades, some (like me) knew very little about tennis. The developer and tennis team were both split 50-50 between men and women and we had people in their 20s through to those in their 60s. There is always more that can be done, but compared to most project teams I’ve been a part of, this was a reasonably diverse group.
Throughout the project this editorial team provided the final check and balance to the analysis we produced.
The biggest source of bias we had to contend with was in our source material. For #WhatMakesGreat we used archives from The Telegraph, Wimbledon and wimbledon.com. The most significant problem we faced was that the higher profile players just had more content written about them.
Female players were written about less than male players; doubles matches less than singles. Contemporary greats have more written about them than the past greats. I suspect Agassi and Sampras were just as significant stars as Federer or Nadal are today, but this is not reflected in the volume of content. This was true of female players too - there’s much more content about Serena Williams than Martina Navratilova. It would be interesting to look at the reason for this. Is there more overall volume today? Or are the top few written about at the expense of everyone else?
The length of a player’s career also affects the amount of content we could find about them. Someone like Federer who has been at the top for a long period had a greater volume of content written about him than most other players.
Our data sources were all British and this had an influence on the level of coverage different players received. Unsurprisingly, there was more content about Andy Murray than anyone else, which reflects the mainly British readership of The Telegraph. It would be interesting to look at whether the coverage of a British player was biased compared to other nationalities. It’s not something we did, and I am not sure if it would be a positive or negative bias.
To account for the different volumes of data on different players we normalised the results against the total volume written about each of them. So if, for example, we were looking for positive descriptions of Murray’s serve, we normalised that against the total amount of content about him. This still left us with the problem that some players (especially the doubles players) had very little source material about them. Any time we had too few data points we looked more carefully at the actual content generating that data. If ever we were in doubt about what the analysis was telling us we fell back to reading the original sources. In some cases we chose to ignore the cognitive results when the metrics were strong, but there was little volume behind them. We were careful to do this consciously and thoughtfully, but it was a subjective judgement made by the team.
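As a rough illustration of the normalisation described above, here is a minimal Python sketch. It is not the code we ran on the project: the names `PlayerCounts`, `normalised_rate`, `summarise` and the `MIN_DOCUMENTS` threshold are all hypothetical, and the counts in the usage note are made up. It divides the number of positive mentions by the total volume of content about each player, and flags any player with too little content so a human can go back and read the sources.

```python
from dataclasses import dataclass

# Hypothetical cut-off: below this many documents we treat the metric as
# too thin to trust without reading the original sources.
MIN_DOCUMENTS = 50


@dataclass
class PlayerCounts:
    name: str
    positive_mentions: int  # e.g. positive descriptions of the serve
    total_documents: int    # all content found about the player


def normalised_rate(counts: PlayerCounts) -> float:
    """Positive mentions as a share of everything written about the player."""
    if counts.total_documents == 0:
        return 0.0
    return counts.positive_mentions / counts.total_documents


def summarise(players, min_documents=MIN_DOCUMENTS):
    """Rank players by normalised rate and flag low-volume results."""
    rows = [
        {
            "name": p.name,
            "rate": normalised_rate(p),
            "needs_manual_review": p.total_documents < min_documents,
        }
        for p in players
    ]
    # Highest normalised rate first.
    return sorted(rows, key=lambda row: row["rate"], reverse=True)
```

With made-up counts such as `PlayerCounts("Player A", 120, 2000)` and `PlayerCounts("Player B", 6, 20)`, Player B ranks first on rate but is flagged for manual review. That is exactly the "strong metric, little volume" situation where the team went back to the original articles.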
It was a surprise to me just how much the style of sports articles has changed since the 90s. We didn’t look at this in depth, but subjectively the language used in older articles was more flowery and descriptive. It didn’t get to the point quite as efficiently. There were certainly phrases in the older articles that seem jarringly dated or sometimes sexist when read in 2017 and the questions asked in interviews were likely different between men and women and have changed over time.
Apart from normalising the data and being aware of the potential bias in the source material, the huge advantage we had in making the analysis fair was the point-by-point match data. For each cognitive insight we found a way to compare it to the statistics. This would not have been possible without the tennis coaches on the editorial team. They could look at a topic like pressure and figure out which statistics could be combined to measure it using hard data. In one meeting we were discussing the length of matches different players play (Federer’s tend to be short) when our tennis team pointed out one player who took longer between points. How did they know? They had been counting how many times the player bounced the ball between points. Amazing.
The statistics are much fairer. The more matches you play the more statistics will have been collected, but there is no bias towards capturing more data for the popular players. When cognitive and statistical analysis did not agree we spent time trying to understand why, and that led us to look at the statistics differently (Bartoli’s serve) or to look at some of the preconceptions that were embedded in the articles (Nadal doesn’t play as well in the cold). At its best, rather than being subject to bias, the combination of cognitive and statistical analysis can show up the biases we hold ourselves.
Sentiment and Emotion Models
Watson Discovery Service (WDS) was the main tool we used to query and analyse unstructured data. WDS has what it terms enrichments, extra metadata that the service infers from the text. Developers can add their own, but for #WhatMakesGreat we used the built-in enrichments. These standard enrichments use natural language understanding to infer sentiment, emotion, topics and entities within the source text. This is done with a standard machine learning model that has been developed from AlchemyAPI (acquired by IBM in 2015) algorithms.
Using the standard model means we didn’t need to train it ourselves, limiting the opportunity for us to introduce bias. We were relying on the standard model being bias free though and the algorithms and data used for the standard model are not published. This is a potential weak spot in our analysis.
The enrichment we were most reliant on was entity extraction, specifically identifying people and the things said about them and by them. It’s possible (though we didn’t see any evidence) that this performs better in identifying one group of people over another, whether that be defined by race, sex or something else. That would have given us less evidence to work with for the analysis, but the normalisation (described previously) would mitigate it. The danger would come if the algorithm did a better or worse job of identifying people when there was a particularly negative or positive context for the mention. If the algorithm performed better at identifying Steffi Graf when the description was positive rather than negative, that would skew our results. It is right to note this, but I am confident that the human editorial process we had, combined with the hard match statistics, provided enough mitigation.
The document sentiment and emotion inferences likely had a greater potential for bias or inaccuracy, but this would have had less impact on our project. We tended to use sentiment and emotion as a secondary level filter, more to help the editorial team than provide an absolute answer. If there was an unexpected result, we would use sentiment to organise the data and then read a sample of the raw content ourselves. This is what we did to help understand the positive references to Bartoli’s serve.
If we had been using the sentiment or emotion scores more directly we would have needed to develop a custom tennis model. One thing we have seen is that the standard language used in tennis has a significant effect on the sentiment analysis. Even at a simple level, the words used in sport and specifically tennis (love, strike, power, smash, ace, destroy) have a very different meaning to how they would normally be used. The standard sentiment and emotion models we used don’t take account of this.
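To make the vocabulary problem concrete, here is a deliberately naive Python sketch. It is not how the Watson models work, which are not published. It assumes a toy word-list scorer, and the names `GENERAL_LEXICON`, `TENNIS_NEUTRAL_TERMS` and `score` and all the weights are invented for illustration. It shows how a sentence that is neutral to a tennis reader can score as strongly negative, and how a domain-specific list of terms changes that.

```python
import re

# Invented weights for a toy general-purpose sentiment lexicon.
GENERAL_LEXICON = {
    "love": 1.0,
    "ace": 1.0,
    "power": 0.5,
    "strike": -0.5,
    "smash": -1.0,
    "destroy": -1.0,
}

# Assumption: in match reports these words describe play, not feeling,
# so a tennis-aware scorer ignores them rather than scoring them.
TENNIS_NEUTRAL_TERMS = frozenset(GENERAL_LEXICON)


def score(text, lexicon, neutral_terms=frozenset()):
    """Sum the lexicon weights of each word, skipping domain-neutral terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(t, 0.0) for t in tokens if t not in neutral_terms)


sentence = "He went on to destroy the second serve with a smash"
general_score = score(sentence, GENERAL_LEXICON)
tennis_score = score(sentence, GENERAL_LEXICON, TENNIS_NEUTRAL_TERMS)
```

Here `general_score` comes out at -2.0 and `tennis_score` at 0.0. A real custom model would need to do far more than neutralise a word list, for instance recognising that an ace is good news for the server and bad news for the receiver, but the gap between the two scores is the effect we saw.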
We used Personality Insights to assess players’ personalities based on things they had said in interviews. This is perhaps the part of the project where we have the least visibility of the inner workings. Personality Insights is used largely as a black box: you give it sample content of things someone has said and it returns a personality assessment. This is exactly the sort of system that has been highlighted as dangerous for introducing unseen bias.
Personality Insights is not a service that you can add your own training data to. You can’t introduce bias into the model, but you need to trust there is none already built in. For this service a lot of published research has been done to prove its effectiveness. Accepting the research that has been done, there were still areas of concern for us. The Personality Insights model was trained on social media content, whereas we were using content from player interviews. Research suggests that using non-social-media sources introduces an error of between 2% and 16%. 16% would be a significant error, but as all the content we had came from the same source it would at least be consistent across all players.
The questions asked of different players are not all the same and there is a possibility that this changes the type of reply they give sufficiently to skew the results. We worried about this between men and women. We can’t be sure if this played a role or not, but what we did find was a remarkable consistency across the personality profiles of all the top players *. None were quite the same, but they all had a very similar shape (this was true across men and women). When we compared to great sports men and women from other sports we could see a difference. The difference between the personality profile of different sports seemed much greater than the differences between the best players in each of them, giving us some level of confidence in the results.
* One notable exception was Natasha Zvereva, who had a completely different personality profile to the rest of the Wimbledon greats. When we sorted the list by any of the individual personality attributes she tended to always be at the top or the bottom. Looking more into Zvereva, this does make sense: she doesn’t seem to fit the mould of a typical Wimbledon great. She has been in trouble for criticising the Soviet Union, flashing her bra and making an offensive gesture to the Wimbledon crowd. She also talks frequently about wanting to own a farm.
Personality is such a difficult thing for humans to assess, never mind machines, even after millions of years trying to judge it in others. I think this is why people have been so interested in the results. It’s clearly subjective: one person’s passion is another person’s stubbornness. As with the previous analysis, none of the results we’re sharing as part of #WhatMakesGreat came directly from the service without passing the editorial team’s scrutiny. More so, this is where we want fans’ input and perspective. The personality insight analysis is meant to prompt debate and challenge assumptions.
What Would We Have Done Differently?
Of all the potential causes of bias in our results, the source data is our weakest link. Given more time I would have liked to include many more data sources. It would add significantly to the challenge, but including non-British media sources would have given more coverage of all the greats. Using non-English material would go even further and allow us to analyse what makes great by location.
With more source data I would have also liked to build some custom models to analyse sentiment and emotion for the tennis content. It might have been sensible to first classify the text to decide if the content was describing a match, or the player outside of a match and use a different analysis model for each.
Join The Debate
We didn’t do a perfect analysis, but I think we did enough to meet the goal of sparking debate and challenging preconceptions. Machine learning and bias is something we didn’t underestimate or ignore, but I am sure it is something we could do even more to address. As well as debate on what makes a great Wimbledon champion, thoughts and feedback on the technical and machine learning side of the project would be welcome.