Duolingo stated to The Record: “These records were obtained by data scraping public profile information.” They followed this up with “No data breach or hack has occurred.” The problem with this claim is that the data presented included email addresses and other internal information that is not part of the public profile. The attacker claimed they scraped an exposed API, which gave them access to the non-public information. Now, I am not sure about you, but scraping an exposed API to gain access to data that is not publicly available sounds a lot like a “hack” to me. The theft of protected internal (non-public) data through a flaw in an exposed API sounds a lot like a breach. Regardless of your definition of “hack” or “breach,” 2.6 million users’ information was taken from the site and put up for sale on a hacking forum. Originally it was up for sale for $1,500, but after Breached shut down not much more was heard about it. Now the Duolingo data has resurfaced. This time it was spotted by the VX-Underground group when it hit the new and possibly improved Breached forum. The data popped up for sale for the equivalent of about $2.
As for the API, well, it is still exposed. It allows someone not only to scrape public information but also, by feeding the API certain information, to confirm whether that data is present in the non-public part of the profile. The issue with this API was reported back in January and has not been addressed by Duolingo yet. It is likely that Duolingo does not feel this is an issue, as the API is intended to allow scraping of publicly available information only. However, when that API can be abused to pull private information in combination with public information, fixing it should be of critical importance. In this case it is clear from conversations around the API that attackers are aware of exactly how to abuse it and how to get even more non-public information out of it, including access levels inside Duolingo, for more targeted attacks.
Again, to me, this sounds more like a breach and hack than a simple scrape, but who am I to judge?
In looking at the actual data exfiltrated, Surfshark has noted that the US seems to have been hit the hardest, with 976,000 unique email addresses identified. If the reported description of how the API flaw works is accurate, that could mean the attackers have a larger list of US emails and names that they can feed into the API and get a return. Considering how loose and free many companies in the US are with user data protection and privacy, this is not a shocker at all.
The rest of the Duolingo data breaks down as follows:
• South Sudan comes in second, with five times fewer accounts leaked (175k) than the US. Spain follows in third place with 123k exposed accounts, followed by France with 105k, and the United Kingdom with 98k.
• In total, 16.3M data points of Duolingo users were exposed. On average, each email account was leaked with five data points, such as language (5.3M), profile picture (2.7M), username (2.7M), name (2.2M), country (0.7M) or bio (6k).
Duolingo can call this a ninja danger star for all I care, but they should be doing something to fix the still-exposed API in their application. Their previous comments after the original scrape and data theft are seriously concerning considering their popularity and large user base. Through their lack of attention to this flawed API they are exposing a considerable number of people to increased phishing attacks, including spear-phishing attacks aimed at gaining greater access. To call this an embarrassment is a massive understatement. I hope that Duolingo answers the many questions about this API by indicating that it is either fixed to prevent additional scraping, blocked from public access, or both.
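For anyone wondering what "fixed to prevent additional scraping" could look like in practice, here is a minimal sketch. It is not Duolingo's code, and every name in it (`PUBLIC_FIELDS`, `Caller`, `public_view`, `lookup_by_email`, the record layout) is a hypothetical stand-in, since the real schema has not been published. It shows two ideas: only allowlisted fields ever leave the server, and an email lookup gives an unauthenticated caller the same answer whether or not the account exists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist of fields considered public; not Duolingo's real schema.
PUBLIC_FIELDS = frozenset({"username", "name", "bio", "picture", "learning_language"})


@dataclass
class Caller:
    """Who is making the request; user_id is None for anonymous callers."""
    user_id: Optional[int]


def public_view(record: dict) -> dict:
    """Return only allowlisted fields, so email and internal flags never leak."""
    return {key: value for key, value in record.items() if key in PUBLIC_FIELDS}


def lookup_by_email(db: dict, email: str, caller: Caller) -> dict:
    """Look up a profile by email without confirming that an account exists.

    Anyone other than the account owner gets the same empty result for a
    registered address as for an unregistered one.
    """
    empty = {"status": "ok", "users": []}
    record = db.get(email.strip().lower())
    if caller.user_id is None:
        return empty  # anonymous callers learn nothing either way
    if record is None or record["id"] != caller.user_id:
        return empty  # authenticated, but not the owner of this address
    return {"status": "ok", "users": [public_view(record)]}


# Toy in-memory "database" keyed by lowercase email, for illustration only.
db = {
    "a@example.com": {
        "id": 1,
        "username": "ana",
        "email": "a@example.com",
        "access_level": "admin",
    }
}
```

With this shape, feeding a list of addresses into the lookup tells an anonymous caller nothing about which ones are registered, and even the account owner only gets back the public fields. Rate limiting and authentication on the endpoint itself would still be needed on top of this.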