logo

$30
How do I get the difference of 2 DISTINCT queries?

I feel stupid about this, but I'm drawing a total blank. These 2 queries have a difference of 3. How do I get those 3?

select count(distinct xmf_userid) from user_profile where xmf_userid > 0;
4524


select count(distinct email) from user_profile where xmf_userid > 0;
4521

I just want to see which distinct xmf_userid do not have a distinct email (which should be impossible, according to the logic of this software, so I am shocked to see this result).

Here is the table in yaml format:

user_profile:
user_id: { type: varchar(100) }
password: { type: varchar(60) }
email: { type: varchar(100) }
address1: { type: varchar(100) }
address2: { type: varchar(100) }
city: { type: varchar(50) }
state: { type: varchar(2) }
country: { type: varchar(50) }
zip: { type: varchar(10) }
phone: { type: varchar(20) }
first_name: { type: varchar(20) }
last_name: { type: varchar(20) }
user_type: { type: varchar(10) }
admin: { type: varchar(1) }
receive_email: { type: varchar(1) }
created_by_user: { type: varchar(30) }
modify_by_user: { type: varchar(30) }
last_login: { type: date }
dummy: { type: varchar(1) }
company_id: { type: integer, foreignTable: company, foreignReference: company_id }
sub_company_id: { type: integer, foreignTable: subcompany, foreignReference: sub_company_id }
gift_recipient: { type: varchar(1) }
user_cheese_industry: { type: varchar(30) }
user_mag_subscriber: { type: varchar(1) }
user_wso_subscriber: { type: varchar(1) }
xmf_userid: { type: varchar(100) }
old_userid_from_before_the_switch_to_cds: { type: varchar(100) }
xmf_demographics: { type: varchar(30) }
xmf_initial_export062210: { type: varchar(1) }
xmf_password: { type: varchar(50) }
xmf_company_name: { type: varchar(100) }



UPDATE:

I'm running MySql 5.1.37. Not sure if that is relevant to the very slow (non-existent) performance speed I'm getting from sub-queries.


UPDATE:

Thanks everyone. I've managed to get a wealth of data out of the database, with your help. The indexes helped with the sub-queries.

Lawrence Krubner | 07/16/10 at 12:10pm | Edit


(6) Possible Answers Submitted...

  • avatar
    Last edited:
    07/19/10
    1:54pm
    Oleg Butuzov says:

    SELECT email, concat('\'', group_concat( xmf_userid SEPARATOR  '\',\''), '\'') as emails FROM user_profile where xmf_userid > 0 GROUP BY email ORDER BY emails DESC


    something like that. how to fiter that 3 records that is the question!

    Previous versions of this answer: 07/16/10 at 12:24pm

    • 07/16/10 4:31pm

      Lawrence Krubner says:

      Right. I need those 3 records.

    • 07/16/10 4:39pm

      Oleg Butuzov says:


      SELECT xmf_userid, email
      FROM user_profile
      WHERE xmf_userid NOT IT
      (SELECT xmf_userid FROM user_profile WHERE count(email) > 1 GROUP BY email )

    • 07/16/10 4:40pm

      Lawrence Krubner says:

      That is extremely clever, Oleg. I did not realize I could run queries like that. But mostly I'd like to get just the difference between the 2 queries.

    • 07/16/10 4:48pm

      Oleg Butuzov says:

      Lawrence we can add to the query

      SELECT xmf_userid, email 
      FROM user_profile
      WHERE xmf_userid NOT IT
      (SELECT xmf_userid FROM user_profile WHERE count(email) > 1 GROUP BY email )
      GROUP BY email


      so we will get just three records...

    • 07/19/10 9:51am

      Lawrence Krubner says:

      hmm, I get:

      "ERROR 1111 (HY000): Invalid use of group function"

      That is clever, though, trying to double use the group by. I can see where you are going

    • 07/19/10 10:08am

      Oleg Butuzov says:

      Lawrence can you remove password email all privaci settings and give me the dump? just for testings?

    • 07/19/10 10:09am

      Oleg Butuzov says:

      i mean privacy data.

      not email eamil should be there. sorry.

    • 07/19/10 1:42pm

      Lawrence Krubner says:

      Oleg, sorry, the data is not mine. I'm limited in how I can share it.

    • 07/19/10 1:44pm

      Oleg Butuzov says:

      at least create table query?

  • avatar
    Last edited:
    07/16/10
    12:56pm
    Jarret Minkler says:

    use

    some query ...

    MINUS

    other query ...

    • 07/16/10 4:30pm

      Lawrence Krubner says:

      Jarret, that sounds like a good idea, but what is the syntax? I tried this:

      mysql> select count(distinct xmf_userid) from user_profile where xmf_userid > 0
      minus
      select count(distinct email) from user_profile where xmf_userid > 0;


      and got this error:


      ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'minus
      select count(distinct email) from user_profile where xmf_userid > 0' at line 2

    • 07/16/10 4:48pm

      Jarret Minkler says:

      Ahh crap ... I guess I was thinking Oracle :) you could always do like ..


      select * from x where NOT EXISTS (select id from x where .. )

  • avatar
    Last edited:
    07/19/10
    1:54pm
    k v says:

    Below will give you userids having duplicate emails.
    select xmf_userid from user_profile where xmf_userid>0 group by email having count(*)>1;


    RE: >>
    Your xmf_userid is not unique as well! In that case this is quite obvious that you're getting count difference!
    Distinct values of xmf_userid and distinct values of email will differ when there is m:m kinda relation between them.

    I'm not sure why're curious to find which entries caused that difference of 4!
    Of-course duplicate entries can cause this.
    My first query that I gave you will return IDs that have duplicate entries and if you're getting 255 rows, that means 255 rows have duplicate entries which is causing the difference.

    I will give you an example:


    CREATE TABLE `x` (
    `a` int(10) NOT NULL,
    `b` varchar(10) default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1

    insert into x values(1,'x'), (1,'y'),(2,'z'),(3,'t'),(4,'y'),(5,'t'),(6,'t'),(2,'t'),(7,'t'),(8,'t');

    For above 10 records we have distinct counts as follows:
    SELECT count(distinct a) FROM x ; # Results 8
    SELECT count(distinct b) FROM x ; # Results 4


    1. #Values of 'b' related to non-unique 'a':
    select b,group_concat(a),count(*) from x group by b having count(*)>1;
    Above Query will reutrn two rows:
    't', '8,7,2,6,5,3', 6
    'y', '4,1', 2

    Means for value 't' duplicated for 6 entries of Field-a; while value 'y' is repeated twice in Field A for values 4 and 2.

    2. #Values of 'a' related to non-unique 'b':
    select a,group_concat(b),count(*) from x group by a having count(*)>1;
    This will return:
    1, 'x,y', 2
    2, 'z,t', 2

    i.e. for field A, values 1 and 2 are repeated twice in the field B for values of x & y and z & t respectively.

    >>>>
    Now if you add both count of results of query 1 and 2, you will get the difference between your distinct counts.
    i.e here 2+2 = 4 (diff between distinct counts).

    3.
    #Maximum values of 'a' related to 1 value of 'b'
    select b,group_concat(a),count(*) from x group by b having count(*)=(select count(*) from x group by b order by count(*) desc limit 1);
    This query gives the value of Field-B having maximum number of repeats.

    I hope this helps or please elaborate :)

    Previous versions of this answer: 07/17/10 at 2:04am

    • 07/16/10 4:54pm

      Lawrence Krubner says:

      Interesting. This gives me 255 rows. I wonder if these 255 rows could be related to the 3 row difference between my original 2 queries?

    • 07/17/10 6:04am

      k v says:

      updated.

    • 07/19/10 1:45pm

      Lawrence Krubner says:

      kv, we have been using a 3rd party to manage some of our user authorization. They have a bizarre system where the same email, but with a different password, will be treated as a new user account. I am as surprised as you are that the email is not unique.

    • 07/19/10 1:47pm

      Lawrence Krubner says:

      " that means 255 rows have duplicate entries which is causing the difference."


      Good point.

  • avatar
    Last edited:
    07/16/10
    3:08pm
    Bill Hunt says:

    If you want every user id that has a duplicate email you need to do a subselect that grabs just those email addresses having count > 1, and then matches against it:

    SELECT user_id, email FROM user_profile where email IN( (select email from group by email having count(email) > 1) );

    • 07/16/10 4:57pm

      Lawrence Krubner says:

      Bill, did you mean to repeat the table name when you wrote "select email from group by " in the sub-query?

    • 07/16/10 5:02pm

      Bill Hunt says:

      Yes, I did. :)

    • 07/16/10 6:28pm

      Lawrence Krubner says:

      I get 2311 rows, which seems like too many. Doesn't the outer clause need a group by statement? In fact, is the sub-select really necessary?



  • avatar
    Last edited:
    07/19/10
    1:54pm
    William Clark says:

    Try this:

    SELECT DISTINCT xmf_userid
    FROM user_profile
    WHERE email NOT IN (SELECT DISTINCT email
    FROM user_profile
    WHERE xmf_userid > 0)
    AND xmf_userid > 0

    • 07/19/10 11:45am

      Lawrence Krubner says:

      I am not sure what is up with this query, but it just runs and runs and runs and never completes. I gave it 2 minutes and went checked my email. When I came back, it still had not completed. Sort of like a poorly done JOIN with an out of control Cartesian product.

    • 07/19/10 11:47am

      Lawrence Krubner says:

      I like the idea of this query. I'm not sure why it is not working. Maybe the 2 DISTINCTs or something about the sub-query and the NOT IN? Hard to say.



    • 07/19/10 12:58pm

      William Clark says:

      Hi Lawrence,

      Do you have a unique index on xmf_userid, and a unique or non-unique index on email? if not, you should. Add the indexes and this query will speed up considerably.

    • 07/19/10 1:39pm

      Lawrence Krubner says:

      William, thanks. I'm doing that now. The queries were poorly indexed. Last week we had a similar query that ran for 15 minutes, then we added an index and it completed in 9.5 seconds. I'd forgotten about it but now it seems very important.

  • avatar
    Last edited:
    07/19/10
    1:54pm
    Steven Siebert says:

    This will work:

    select xmf_userid, email from user_profile where email in (select email from user_profile group by email having count(email) > 1);

    • 07/19/10 11:52am

      Lawrence Krubner says:

      Your suggestion has some things in common with William Clark's, and like his your query seems to last forever on my server.

      I am curious if having a particular table type, Mysami or innodb or what, effects performance of these queries?

    • 07/19/10 12:06pm

      Lawrence Krubner says:

      Yeah, so I waited more than 10 minutes and the query never completed. It was still running. Can you guess why?

    • 07/19/10 12:58pm

      Steven Siebert says:

      The query I provided shouldn't be too intensive...I can only assume that you don't have an index on your email field. Adding an index to this field will speed things up immensely.

      William Clarks suggestion is similar, but will not give you the correct answer. His inner query will return all distinct email addresses and his outer query uses this to give you any email address that exists outside of this list. Obviously, there will be no resultant emails returned, since he asked for all the emails from the same table. His query takes a long time as well because you don't have an index on email.

      Add that index, use the query I provided, and you should have your answer =)

      If you would like to provide your table schema (show create table user_profile), I can see what is going on exactly and make some suggestions.

      Regards,

      Steve

This question has expired.





Current status of this question: Completed